Abstract
Collecting fully annotated image datasets is challenging and expensive. Many types of weak supervision have
been explored: weak manual annotations, web search results, temporal continuity, ambient sound, and others. We
focus on one particular unexplored mode: visual questions
that are asked about images. The key observation that inspires our work is that the question itself provides useful information about the image (even without the answer being
available). For instance, the question “what is the breed
of the dog?” informs the AI that the animal in the scene
is a dog and that there is only one dog present. We make
three contributions: (1) providing an extensive qualitative
and quantitative analysis of the information contained in
human visual questions, (2) proposing two simple but surprisingly effective modifications to standard visual question answering models that allow them to make use of weak supervision in the form of unanswered questions associated with images, and (3) demonstrating that a simple data augmentation strategy inspired by our insights results in a 7.1%
improvement on the standard VQA benchmark.