Abstract
Visual Grounding (VG) aims to locate the most relevant
object or region in an image, based on a natural language
query. The query can be a phrase, a sentence or even a
multi-round dialogue. There are three main challenges in
VG: 1) what is the main focus of a query; 2) how to understand an image; 3) how to locate an object. Most existing
methods simply combine all the information, and may therefore suffer from information redundancy (i.e., an ambiguous query, a complicated image, and a large number of
objects). In this paper, we formulate these challenges as
three attention problems and propose an accumulated attention (A-ATT) mechanism to reason among them jointly. Our
A-ATT mechanism can cyclically accumulate attention
on useful information in the image, query, and objects, while
noise is gradually ignored. We evaluate the performance of A-ATT on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and Guesswhat?!),
and the experimental results show the superiority of the proposed method in terms of accuracy.