Abstract
We study how to leverage off-the-shelf visual and linguistic data to cope with out-of-vocabulary answers in the visual question answering task. Existing large-scale visual datasets with annotations such as image class labels,
bounding boxes and region descriptions are good sources
for learning rich and diverse visual concepts. However, it is not straightforward to capture these visual concepts and transfer them to visual question answering models, due to the missing link between question-dependent answering models and visual data without associated questions. We tackle this
problem in two steps: 1) learning a task-conditional visual classifier, which is capable of solving diverse question-specific visual recognition tasks, based on unsupervised task discovery, and 2) transferring the task-conditional visual classifier to visual question answering models. Specifically, we employ linguistic knowledge sources such as a
structured lexical database (e.g., WordNet) and visual descriptions for unsupervised task discovery, and transfer the learned task-conditional visual classifier as an answering
unit in a visual question answering model. We empirically
show that the proposed algorithm generalizes successfully to out-of-vocabulary answers using the knowledge transferred from the visual dataset.
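To make the central component concrete, the sketch below illustrates one plausible form of a task-conditional visual classifier: visual features are classified under a task embedding derived from the question, so a single classifier can serve many recognition tasks. The module names, dimensions, and elementwise-product fusion are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch (assumed architecture, not the authors' code) of a
# task-conditional visual classifier for visual question answering.
import torch
import torch.nn as nn

class TaskConditionalClassifier(nn.Module):
    def __init__(self, visual_dim=2048, task_dim=300,
                 hidden_dim=512, num_answers=3000):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.task_proj = nn.Linear(task_dim, hidden_dim)
        self.answer_head = nn.Linear(hidden_dim, num_answers)

    def forward(self, visual_feat, task_emb):
        # Fuse visual evidence with the task condition; an elementwise
        # product is one common fusion choice (an assumption here).
        fused = torch.relu(self.visual_proj(visual_feat)) * \
                torch.relu(self.task_proj(task_emb))
        return self.answer_head(fused)  # logits over the answer vocabulary

# Usage: condition the same classifier on different tasks (e.g., color or
# object category) discovered from WordNet or region descriptions.
clf = TaskConditionalClassifier()
logits = clf(torch.randn(1, 2048), torch.randn(1, 300))
```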