Abstract. Paragraph generation from images, with applications in video summarization, editing, and assistive technology for the disabled, has recently gained popularity. Traditional image captioning methods fall short on this front, since they are not designed to generate long, informative descriptions. Moreover, the naive approach of simply concatenating multiple short sentences, possibly synthesized by traditional image captioning systems, fails to capture the intricacies of paragraphs: coherent sentences, a globally consistent structure, and diversity. To address these challenges, we propose to augment paragraph generation techniques with "coherence vectors" and "global topic vectors," and to model the inherent ambiguity of associating paragraphs with images via a variational auto-encoder formulation. We demonstrate the effectiveness of the developed approach on two datasets, outperforming existing state-of-the-art techniques on both.
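The abstract does not detail the model, but the variational component it mentions can be illustrated in isolation. Below is a minimal NumPy sketch, with all names and dimensions hypothetical, of the standard conditional-VAE ingredients such a formulation relies on: an encoder mapping image features to a Gaussian over latent codes, the reparameterization trick for sampling (whose stochasticity is one route to diverse paragraphs), and the KL term of the VAE objective. This is a generic sketch of the technique, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image_features, W_mu, W_logvar):
    """Hypothetical linear encoder: map image features to the mean and
    log-variance of a Gaussian over latent codes."""
    mu = image_features @ W_mu
    logvar = image_features @ W_logvar
    return mu, logvar

def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    Different draws of eps yield different latent codes, so a decoder
    conditioned on z can produce diverse paragraphs for one image."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    """Per-example KL(q(z|x) || N(0, I)) term of the VAE objective."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

# Toy sizes: 4 images with 8-dim features, 3-dim latent space.
feats = rng.standard_normal((4, 8))
W_mu = rng.standard_normal((8, 3)) * 0.1
W_logvar = rng.standard_normal((8, 3)) * 0.1

mu, logvar = encode(feats, W_mu, W_logvar)
z1 = sample_latent(mu, logvar, rng)
z2 = sample_latent(mu, logvar, rng)  # a second draw differs from the first
kl = kl_divergence(mu, logvar)
```

In a full model, z would condition a sentence-level decoder; training would balance a reconstruction loss on the generated paragraph against the KL term above.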