Abstract
We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. Our method maps textual queries and visual features from various regions into a shared space where they are compared for relevance with an inner product. Our method exhibits significant improvements in answering questions such as "what color," where it is necessary to evaluate a specific location, and "what room," where it selectively identifies informative image regions. Our model is tested on the recently released VQA [1] dataset, which features free-form human-annotated questions and answers.
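As a rough illustration of the relevance scoring sketched above, the following minimal example projects a question embedding and per-region visual features into a shared space and scores regions with an inner product. The function name, dimensions, and projection matrices (`W_q`, `W_v`) are illustrative placeholders, not the paper's actual implementation.

```python
import numpy as np

def region_relevance(question_vec, region_feats, W_q, W_v):
    """Score each image region's relevance to a question.

    question_vec: (d_q,) text embedding of the question (assumed given).
    region_feats: (num_regions, d_v) visual features, one row per region.
    W_q, W_v:     projections into a shared d-dimensional space.
    Returns softmax-normalized relevance weights over regions.
    """
    q = W_q @ question_vec            # project question into shared space, shape (d,)
    V = region_feats @ W_v.T          # project regions into shared space, shape (num_regions, d)
    scores = V @ q                    # inner-product relevance per region
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

# Toy usage with random placeholder values.
rng = np.random.default_rng(0)
d_q, d_v, d, n_regions = 300, 512, 256, 9
weights = region_relevance(
    rng.normal(size=d_q),
    rng.normal(size=(n_regions, d_v)),
    rng.normal(size=(d, d_q)),
    rng.normal(size=(d, d_v)),
)
```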