Abstract
We investigate the problem of cross-dataset adaptation for visual question answering (Visual QA). Our goal is to train a Visual QA model on a source dataset and apply it to a different target dataset. Analogous to domain adaptation for visual recognition, this setting is appealing when the target dataset does not have a sufficient amount of labeled data to learn an "in-domain" model. The key challenge is that the two datasets are constructed differently, resulting in cross-dataset mismatches in images, questions, or answers.
We overcome this difficulty by proposing a novel domain adaptation algorithm. Our method reduces the difference in statistical distributions by transforming the feature representation of the data in the target dataset. Moreover, it maximizes the likelihood of correctly answering questions in the target dataset using the Visual QA model trained on the source dataset. We empirically study the effectiveness of the proposed approach on adapting among several popular Visual QA datasets. We show that the proposed method improves over the baseline without adaptation as well as several other adaptation methods. We analyze, both quantitatively and qualitatively, when the adaptation is most effective.
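To make the two ingredients of the objective concrete, the following is a minimal PyTorch sketch of one way such an adaptation step could be instantiated: a learned linear transform aligns target features to the source distribution via a linear-kernel maximum mean discrepancy (MMD), while a frozen source-trained Visual QA head supplies a cross-entropy (negative log-likelihood) term on labeled target examples. Everything here (FeatureTransform, the stand-in source_model, the trade-off weight lam, and the toy batches) is an illustrative assumption, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

def mmd_linear(x, y):
    """Linear-kernel MMD estimate: squared distance between batch feature means."""
    return (x.mean(dim=0) - y.mean(dim=0)).pow(2).sum()

class FeatureTransform(nn.Module):
    """Learned linear map that re-aligns target features with the source domain."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return self.linear(x)

# Hypothetical setup: a frozen answer classifier trained on the source dataset,
# operating on joint image-question embeddings of dimension `dim`.
dim, num_answers = 512, 1000
source_model = nn.Linear(dim, num_answers)  # stand-in for the trained VQA head
for p in source_model.parameters():
    p.requires_grad = False

transform = FeatureTransform(dim)
opt = torch.optim.Adam(transform.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()
lam = 1.0  # assumed trade-off between alignment and likelihood terms

# Toy batches standing in for real source/target mini-batches.
src_feats = torch.randn(32, dim)
tgt_feats = torch.randn(32, dim)
tgt_answers = torch.randint(0, num_answers, (32,))

# One adaptation step: match distributions and keep the frozen source model
# likely to answer the (labeled) target questions correctly.
aligned = transform(tgt_feats)
loss = mmd_linear(aligned, src_feats) + lam * ce(source_model(aligned), tgt_answers)
opt.zero_grad()
loss.backward()
opt.step()
```

Only the feature transform is updated; the source-trained model stays fixed, which matches the setting of reusing a source model on a new target dataset rather than retraining it.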