Abstract. Most existing works in visual question answering (VQA) are
dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is as important as, or even more important than, the answer itself, since it makes the question answering process more understandable and
traceable. To this end, we propose a new task of VQA-E (VQA with
Explanation), where models are required to generate an explanation
along with the predicted answer. We first construct a new dataset, and then
frame the VQA-E problem in a multi-task learning architecture. Our
VQA-E dataset is automatically derived from the VQA v2 dataset by
intelligently exploiting the available captions. We also conduct a user
study to validate the quality of the synthesized explanations. We quantitatively show that the additional supervision from explanations not
only produces insightful textual sentences that justify the answers, but also
improves the performance of answer prediction. Our model outperforms
the state-of-the-art methods by a clear margin on the VQA v2 dataset.