Abstract
This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learned directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our held-out test set, the system captions have equal or better quality 34% of the time.
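To make the multiple instance learning step concrete, the sketch below illustrates one common noisy-OR formulation on toy data: each image is treated as a bag of candidate-region features, the bag label records whether a word appears in any of the image's captions, and a logistic per-region scorer is trained so that the noisy-OR of the region probabilities matches the image-level label. The data, dimensions, and scorer here are hypothetical placeholders for illustration, not the paper's implementation.

# Illustrative sketch (not the paper's code): noisy-OR multiple instance
# learning for a single word detector. Each image is a "bag" of region
# feature vectors; the bag label says whether the word occurs in any of
# the image's captions. All data, sizes, and names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_probability(regions, w, b):
    # P(word | image) = 1 - prod_j (1 - p_j), with p_j the probability
    # that region j depicts the word under a logistic scorer.
    p_region = sigmoid(regions @ w + b)            # shape: (num_regions,)
    return 1.0 - np.prod(1.0 - p_region), p_region

# Toy data: 200 images, 12 candidate regions each, 64-d region features.
# The word is "present" when any region's first feature exceeds 1.0.
images = rng.normal(size=(200, 12, 64))
labels = (images[:, :, 0].max(axis=1) > 1.0).astype(float)

w, b, lr = np.zeros(64), 0.0, 0.05
for _ in range(50):                                # a few epochs of SGD
    for x, y in zip(images, labels):
        p_bag, p_region = bag_probability(x, w, b)
        # Cross-entropy loss on the bag label; its gradient w.r.t. each
        # region logit works out to p_region * (p_bag - y) / p_bag.
        grad_z = p_region * (p_bag - y) / max(p_bag, 1e-8)
        w -= lr * (x.T @ grad_z)
        b -= lr * grad_z.sum()

preds = np.array([bag_probability(x, w, b)[0] for x in images]) > 0.5
print("training accuracy:", (preds == labels.astype(bool)).mean())

In the full system described above, one such detector would be trained for each commonly occurring caption word, and the detected words then serve as conditional inputs to the maximum-entropy language model.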