Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Abstract
Top-down visual attention mechanisms have been used
extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through
fine-grained analysis and even multiple steps of reasoning.
In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism
(based on Faster R-CNN) proposes image regions, each
with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO
test server establish a new state-of-the-art for the task,
achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5
and 36.9, respectively. Demonstrating the broad applicability of the method, we apply the same approach to VQA and obtain first place in the 2017 VQA Challenge.
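To make the two mechanisms concrete, the following is a minimal sketch of soft top-down attention over bottom-up region features: given k region feature vectors V (as would be proposed by Faster R-CNN) and a query vector h (e.g. an LSTM hidden state), additive attention scores are normalized into weights and used to form a weighted sum of the regions. The function and parameter names (top_down_attention, W_v, W_h, w_a) and the dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def top_down_attention(V, h, W_v, W_h, w_a):
    """Soft top-down attention over bottom-up region features (sketch).

    V   : (k, d_v) feature vectors for k image regions (e.g. Faster R-CNN output)
    h   : (d_h,)   top-down query, e.g. a caption-LSTM hidden state
    W_v : (d_a, d_v), W_h : (d_a, d_h), w_a : (d_a,)  -- learned parameters (assumed shapes)
    Returns a (d_v,) attended feature: a convex combination of the rows of V.
    """
    # Additive attention scores, one per region.
    scores = np.tanh(V @ W_v.T + W_h @ h) @ w_a      # (k,)
    # Numerically stable softmax over regions gives the feature weightings.
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # (k,) weights summing to 1
    return alpha @ V                                 # (d_v,) weighted sum of regions

# Toy usage with random features and weights (illustration only).
rng = np.random.default_rng(0)
k, d_v, d_h, d_a = 36, 2048, 512, 512                # 36 regions, as in common setups
V = rng.standard_normal((k, d_v))
h = rng.standard_normal(d_h)
W_v = 0.01 * rng.standard_normal((d_a, d_v))
W_h = 0.01 * rng.standard_normal((d_a, d_h))
w_a = 0.01 * rng.standard_normal(d_a)
v_hat = top_down_attention(V, h, W_v, W_h, w_a)      # attended image feature
```

The key design point the sketch captures is that attention weights are computed over a small set of object-level region features rather than over a uniform grid of CNN activations.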