Multi-grained Attention with Object-level Grounding
for Visual Question Answering
Abstract
Attention mechanisms are widely used in Visual Question Answering (VQA) to search for visual clues related to the question. Most approaches train attention models from a coarse-grained association between sentences and images, which tends to fail on small objects or uncommon concepts. To address this problem, this paper proposes a multi-grained attention method. It learns explicit word-object correspondence through two types of word-level attention that complement the sentence-image association. Evaluated on the VQA benchmark, the multi-grained attention model achieves competitive performance with state-of-the-art models. Moreover, the visualized attention maps demonstrate that the addition of object-level grounding leads to a better understanding of the images and locates the attended objects more precisely.
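To make the idea of word-object attention concrete, the following is a minimal sketch (not the paper's actual model): each question word attends over detected object features via scaled dot-product scores followed by a softmax. The function names, the use of plain dot-product scoring, and the absence of any learned projection matrices are all simplifying assumptions for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_object_attention(word_vecs, object_vecs):
    """Illustrative word-level attention over object features.

    word_vecs:   list of d-dim word embeddings (one per question word)
    object_vecs: list of d-dim object region features (one per detected object)
    Returns, for each word, an attention distribution over the objects.
    """
    d = len(object_vecs[0])
    attn_maps = []
    for w in word_vecs:
        # Scaled dot-product score between this word and every object.
        scores = [sum(wi * oi for wi, oi in zip(w, o)) / math.sqrt(d)
                  for o in object_vecs]
        attn_maps.append(softmax(scores))
    return attn_maps
```

In this toy setup, a word embedding that is close to one object's feature vector receives most of that word's attention mass, which is the explicit word-object correspondence the abstract refers to; a real model would additionally learn projections and combine these maps with the sentence-image attention.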