Abstract
Studies have shown that a dominant class of questions
asked by visually impaired users on images of their surroundings involves reading text in the image. But today’s
VQA models cannot read! Our paper takes a first step toward addressing this problem. First, we introduce a new
“TextVQA” dataset to facilitate progress on this important
problem. Existing datasets either have a small proportion
of questions about text (e.g., the VQA dataset) or are too
small (e.g., the VizWiz dataset). TextVQA contains 45,336
questions on 28,408 images that require reasoning about
text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the
context of the image and the question, and predicts an answer that may either be deduced from the text and the
image or be composed of strings found in the image. Accordingly, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing
state-of-the-art VQA models on our TextVQA dataset. We
find that the gap between human performance and machine
performance is significantly larger on TextVQA than on
VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.
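The answer-prediction step described above, in which the answer can come either from a fixed answer vocabulary or be copied from text detected in the image, can be sketched as follows. This is a minimal illustration of the general copy-style idea, not the paper's actual architecture; the function name and the score inputs are hypothetical:

```python
import numpy as np

def pick_answer(vocab, vocab_scores, ocr_tokens, copy_scores):
    """Choose an answer over the union of a fixed answer vocabulary
    and the OCR tokens read from the image (copy mechanism).

    vocab_scores: one score per fixed-vocabulary answer.
    copy_scores:  one score per OCR token detected in the image.
    """
    # Concatenate both score lists into a single candidate space,
    # then take the highest-scoring candidate overall.
    scores = np.concatenate([np.asarray(vocab_scores),
                             np.asarray(copy_scores)])
    candidates = list(vocab) + list(ocr_tokens)
    return candidates[int(np.argmax(scores))]
```

For example, if the model is more confident in the OCR token "stop" than in any fixed-vocabulary answer, the copied token is returned; otherwise a vocabulary answer such as "yes" wins.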