Abstract. Visual Question Answering (VQA) requires integration of feature maps
with drastically different structures. Image descriptors have structures at multiple spatial scales, while lexical inputs inherently follow a temporal sequence
and naturally cluster into semantically different question types. Many previous
works use complex models to extract feature representations but neglect
high-level summary information, such as the question type, during learning. In this work,
we propose Question Type-guided Attention (QTA). It uses question-type information
to dynamically balance between bottom-up and top-down visual features, extracted from ResNet and Faster R-CNN networks, respectively.
We experiment with multiple VQA architectures and extensive input ablation
studies over the TDIUC dataset and show that QTA systematically improves
performance by more than 5%, compared to the state of the art, across multiple question-type categories such as
“Activity Recognition”, “Utility” and “Counting”. By adding QTA to the
state-of-the-art model MCB, we achieve a 3% improvement in overall accuracy.
Finally, we propose a multi-task extension to
predict question types, which generalizes QTA to applications that lack question-type
labels, with minimal performance loss.
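As a rough illustration of the core idea only (not the paper's exact formulation), the following PyTorch-style sketch gates a concatenation of ResNet and Faster R-CNN features with weights derived from the question type; the module name, layer sizes, and sigmoid gating are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class QuestionTypeGuidedAttention(nn.Module):
    """Conceptual sketch: re-weight concatenated visual features by question type.

    Hypothetical layer sizes and gating; not the authors' exact architecture.
    """
    def __init__(self, num_question_types, resnet_dim=2048, frcnn_dim=2048):
        super().__init__()
        feat_dim = resnet_dim + frcnn_dim
        # Map a one-hot question-type vector to per-channel gates in [0, 1].
        self.gate = nn.Sequential(
            nn.Linear(num_question_types, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, resnet_feat, frcnn_feat, question_type_onehot):
        # Concatenate the two visual descriptors and re-weight each channel
        # according to the question type, letting the model emphasize one
        # feature source over the other per question category.
        visual = torch.cat([resnet_feat, frcnn_feat], dim=-1)
        return visual * self.gate(question_type_onehot)

# Toy usage with random tensors (batch of 4, 12 question types).
qta = QuestionTypeGuidedAttention(num_question_types=12)
v = qta(torch.randn(4, 2048), torch.randn(4, 2048),
        nn.functional.one_hot(torch.tensor([0, 3, 7, 11]), 12).float())
print(v.shape)  # torch.Size([4, 4096])
```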