Abstract
In this paper, we study abstractive summarization for open-domain videos. Unlike traditional text news summarization, the goal is not so much to “compress” text information as
to provide a fluent textual summary of information that has been collected and fused from
different source modalities, in our case video
and audio transcripts (or text). We show how
a multi-source sequence-to-sequence model
with hierarchical attention can integrate information from different modalities into a coherent output. We compare various models trained with different modalities and present pilot experiments on the How2 corpus of instructional
videos. We also propose a new evaluation metric (Content F1) for the abstractive summarization task that measures the semantic adequacy of summaries rather than their fluency, the latter already being covered by metrics like ROUGE and BLEU.