Abstract
Image captioning often requires a large set of training
image-sentence pairs. In practice, however, acquiring sufficient training pairs is expensive, limiting the ability of recent captioning models to describe
objects outside of the training corpora (i.e., novel objects). In
this paper, we present Long Short-Term Memory with Copying Mechanism (LSTM-C), a new architecture that incorporates copying into the Convolutional Neural Network
(CNN) plus Recurrent Neural Network (RNN) image captioning framework to describe novel objects in captions. Specifically, freely available object recognition datasets
are leveraged to develop classifiers for novel objects. Our
LSTM-C then integrates standard word-by-word
sentence generation by a decoder RNN with a copying mechanism that may instead select words for detected novel objects at
appropriate positions in the output sentence. Extensive experiments
are conducted on both the MSCOCO image captioning dataset and ImageNet, demonstrating the ability of our proposed
LSTM-C architecture to describe novel objects. Furthermore, superior results are reported when compared to state-of-the-art deep models.