Abstract
One property that remains lacking in image captions generated by contemporary methods is discriminability: the ability to tell two images apart given the caption for one of them. We propose a way to improve this aspect of caption generation. By incorporating into the captioning training objective a loss component directly related to the ability of a machine to disambiguate image/caption matches, we obtain systems that produce much more discriminative captions, according to human evaluation. Remarkably, our approach also leads to improvements in other aspects of the generated captions, as reflected by a battery of standard scores such as BLEU and SPICE. Our approach is modular and can be applied to a variety of model/loss combinations commonly proposed for image captioning.
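
To make the combined objective concrete, the sketch below shows one way such a discriminability term could be added to a captioning loss. This is a minimal illustration, not the paper's exact formulation: the contrastive matcher, the function names, and the weighting coefficient `lam` are all assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def discriminability_loss(img_emb, cap_emb, margin=0.2):
    """A generic contrastive image/caption matching loss, standing in for
    the discriminability term (hard-negative triplet style; an assumption)."""
    # Cosine similarity between every image and every caption in the batch.
    scores = F.normalize(img_emb, dim=1) @ F.normalize(cap_emb, dim=1).t()
    pos = scores.diag().view(-1, 1)
    # Hinge costs for mismatched captions (rows) and mismatched images (cols).
    cost_cap = (margin + scores - pos).clamp(min=0)
    cost_img = (margin + scores - pos.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    # Penalize the hardest negative in each direction.
    return cost_cap.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()

def total_loss(caption_loss, img_emb, cap_emb, lam=1.0):
    # Combined objective: the standard captioning loss plus the
    # discriminability term, weighted by a hypothetical coefficient `lam`.
    return caption_loss + lam * discriminability_loss(img_emb, cap_emb)
```

Because the discriminability term only consumes image and caption embeddings, it can be attached to any captioning model trained with a differentiable loss, which is the modularity the abstract refers to.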