MULTIQA: An Empirical Investigation of Generalization and Transfer in
Reading Comprehension
Abstract
A large number of reading comprehension
(RC) datasets have been created recently, but
little analysis has been done on whether they
generalize to one another, and the extent to
which existing datasets can be leveraged for
improving performance on new ones. In this
paper, we conduct such an investigation over
ten RC datasets, training on one or more
source RC datasets, and evaluating generalization, as well as transfer to a target RC dataset.
We analyze the factors that contribute to generalization, and show that training on a source
RC dataset and transferring to a target dataset
substantially improves performance, even in
the presence of powerful contextual representations from BERT (Devlin et al., 2019).
We also find that training on multiple source
RC datasets leads to robust generalization and
transfer, and can reduce the cost of example
collection for a new RC dataset. Following
our analysis, we propose MULTIQA, a BERT-based model trained on multiple RC datasets,
which leads to state-of-the-art performance on
five RC datasets. We share our infrastructure
for the benefit of the research community.