Abstract
For some images, descriptions written by multiple people are consistent with each other. But for other images, descriptions across people vary considerably. In other words, some images are specific – they elicit consistent descriptions from different people – while other images are ambiguous. Applications involving images and text can benefit from an understanding of which images are specific and which ones are ambiguous. For instance, consider text-based image retrieval. If a query description is moderately similar to the caption (or reference description) of an ambiguous image, that query may be considered a decent match to the image. But if the image is very specific, a moderate similarity between the query and the reference description may not be sufficient to retrieve the image. In this paper, we introduce the notion of image specificity. We present two mechanisms to measure specificity given multiple descriptions of an image: an automated measure and a measure that relies on human judgement. We analyze image specificity with respect to image content and properties to better understand what makes an image specific. We then train models to automatically predict the specificity of an image from image features alone without requiring textual descriptions of the image. Finally, we show that modeling image specificity leads to improvements in a text-based image retrieval application.