Abstract
A Semantic Compositional Network (SCN) is developed
for image captioning, in which semantic concepts (i.e., tags)
are detected from the image, and the probability of each tag
is used to compose the parameters in a long short-term memory (LSTM) network. The SCN extends each weight matrix of
the LSTM to an ensemble of tag-dependent weight matrices.
The degree to which each member of the ensemble is used
to generate an image caption is tied to the image-dependent
probability of the corresponding tag. In addition to captioning images, we also extend the SCN to generate captions for
video clips. We qualitatively analyze semantic composition
in SCNs, and quantitatively evaluate the algorithm on three
benchmark datasets: COCO, Flickr30k, and Youtube2Text.
Experimental results show that the proposed method significantly outperforms prior state-of-the-art approaches across multiple evaluation metrics.
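The weight composition described above can be illustrated with a minimal NumPy sketch. All names and sizes here (hidden dimension `d`, input dimension `m`, tag count `K`, and the example tag probabilities `s`) are illustrative assumptions, not values from the paper; the sketch shows only the core idea of mixing an ensemble of tag-dependent weight matrices by the image-dependent tag probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden dim d, input dim m, number of tags K.
d, m, K = 8, 8, 5

# One weight matrix per semantic tag: the "ensemble" of tag-dependent
# matrices that extends a single LSTM weight matrix.
W_tags = rng.normal(size=(K, d, m))

# Detected tag probabilities for one image (from a separate tagger).
s = np.array([0.9, 0.1, 0.7, 0.0, 0.3])

# Compose an image-specific weight matrix: each ensemble member's
# contribution is tied to the probability of its tag.
W_image = np.tensordot(s, W_tags, axes=1)  # shape (d, m)
```

In practice each LSTM gate would use such a composed matrix in place of its fixed weight matrix, so the recurrent dynamics adapt to the semantic concepts detected in the image.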