Abstract
We introduce a new dataset for joint reasoning about natural language and images, with a
focus on semantic diversity, compositionality,
and visual reasoning challenges. The data contains 107,292 examples of English sentences
paired with web photographs. The task is
to determine whether a natural language caption is true about a pair of photographs. We
crowdsource the data using sets of visually
rich images and a compare-and-contrast task
to elicit linguistically diverse language. Qualitative analysis shows that the data requires
compositional joint reasoning, including reasoning about quantities, comparisons, and relations. Evaluation
using state-of-the-art visual reasoning methods shows the data presents a strong challenge.