Abstract
Transcripts of natural, multi-person meetings differ significantly from documents such as news articles, which can cause Natural Language Generation models to produce unfocused summaries. We develop an abstractive meeting summarizer that uses both the video and audio of meeting recordings. Specifically, we propose a multi-modal hierarchical attention mechanism that operates across three levels: topic segment, utterance, and word. To narrow the focus to topically relevant segments, we jointly model topic segmentation and summarization. In addition to traditional textual features, we introduce new multi-modal features derived from visual focus of attention, based on the assumption that an utterance is more important if its speaker receives more attention. Experiments show that our model significantly outperforms the state of the art on both BLEU and ROUGE measures.
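
The three-level attention described above can be sketched as nested soft attention: word vectors are pooled into utterance vectors, utterances into segment vectors, and segments into a single context vector for the decoder. This is a minimal illustrative sketch, not the paper's actual architecture: the embedding size, the single shared query vector, and dot-product scoring are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(vectors, query):
    """Score each vector against the query (dot product) and return
    the attention-weighted average plus the weights."""
    scores = np.array([v @ query for v in vectors])
    weights = softmax(scores)
    context = np.sum(weights[:, None] * np.stack(vectors), axis=0)
    return context, weights

# Toy meeting: 2 topic segments; each utterance is a matrix of word vectors.
rng = np.random.default_rng(0)
d = 8
meeting = [  # segment -> utterances -> word embeddings (hypothetical data)
    [rng.normal(size=(5, d)), rng.normal(size=(3, d))],
    [rng.normal(size=(4, d))],
]
query = rng.normal(size=d)  # stand-in for a decoder state

segment_vecs = []
for segment in meeting:
    # Level 1: word-level attention pools each utterance into one vector.
    utt_vecs = [attend(list(utt), query)[0] for utt in segment]
    # Level 2: utterance-level attention pools the segment.
    seg_vec, _ = attend(utt_vecs, query)
    segment_vecs.append(seg_vec)

# Level 3: segment-level attention yields the meeting context vector.
context, seg_weights = attend(segment_vecs, query)
print(context.shape, seg_weights.sum())
```

In the paper's setting the utterance-level scores would additionally fold in the multi-modal visual-focus-of-attention features, so that utterances whose speakers are looked at more receive higher weight; here the scoring is purely textual for brevity.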