Abstract
Recently, algorithms for object recognition and related tasks have become sufficiently proficient that new vision tasks can now be pursued. In this paper, we build a system capable of answering open-ended text-based questions about images, a task known as Visual Question Answering (VQA). Our approach's key insight is that we can predict the form of the answer from the question. We formulate our solution in a Bayesian framework. When our approach is combined with a discriminative model, the combined model achieves state-of-the-art results on four benchmark datasets for open-ended VQA: DAQUAR, COCO-QA, The VQA Dataset, and Visual7W.