Densely Supervised Hierarchical Policy-Value Network
for Image Paragraph Generation
Abstract
Image paragraph generation aims to describe an image with a paragraph in natural language. Compared to image captioning with a single sentence,
paragraph generation provides a more expressive and
fine-grained description for storytelling. Existing
approaches mainly optimize the paragraph generator
by minimizing a word-wise cross-entropy loss,
which neglects the linguistic hierarchy of a paragraph
and results in “sparse” supervision for generator
learning. In this paper, we propose a novel Densely
Supervised Hierarchical Policy-Value (DHPV) network for effective paragraph generation. We design new hierarchical supervision consisting of hierarchical rewards and values at both the sentence and
word levels. The joint exploration of hierarchical rewards and values provides dense supervision
cues for learning an effective paragraph generator. We
propose a new hierarchical policy-value architecture which exploits compositionality at token-to-token and sentence-to-sentence levels simultaneously and can preserve the semantic and syntactic
constituent integrity. Extensive experiments on the
Stanford image-paragraph benchmark have demonstrated the effectiveness of the proposed DHPV approach, with performance improvements over multiple state-of-the-art methods.