Abstract
We introduce a novel method for generating
synthetic question answering corpora by combining models of question generation and answer extraction, and by filtering the results to
ensure roundtrip consistency. By pretraining
on the resulting corpora we obtain significant
improvements on SQuAD2 (Rajpurkar et al.,
2018) and NQ (Kwiatkowski et al., 2019), establishing a new state-of-the-art on the latter.
Our synthetic data generation models, for both
question generation and answer extraction, can
be fully reproduced by finetuning a publicly
available BERT model (Devlin et al., 2018)
on the extractive subsets of SQuAD2 and NQ.
We also describe a more powerful variant that
does full sequence-to-sequence pretraining for
question generation, obtaining exact match
and F1 scores within 0.1% and 0.4%, respectively, of human performance on SQuAD2.
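As a concrete illustration of the roundtrip consistency filtering described above, the following is a minimal sketch of how the generate-and-filter loop could be wired together. The three callables (extract_answer_spans, generate_question, and answer_question) are hypothetical stand-ins for the trained answer extraction, question generation, and extractive QA models; they are assumptions of this sketch, not interfaces from the paper.

```python
from typing import Callable, Iterable, List, Tuple

# A synthetic training example: (context, question, answer).
QAExample = Tuple[str, str, str]

def generate_with_roundtrip_consistency(
    contexts: Iterable[str],
    extract_answer_spans: Callable[[str], List[str]],  # hypothetical answer extraction model
    generate_question: Callable[[str, str], str],      # hypothetical question generation model
    answer_question: Callable[[str, str], str],        # hypothetical extractive QA model
) -> List[QAExample]:
    """Build a synthetic QA corpus, keeping only roundtrip-consistent triples.

    For each context C: propose answer spans A, generate a question Q from
    (C, A), then re-answer Q against C. A triple (C, Q, A) is kept only if
    the QA model's predicted answer matches the original span A.
    """
    corpus: List[QAExample] = []
    for context in contexts:
        for answer in extract_answer_spans(context):
            question = generate_question(context, answer)
            predicted = answer_question(context, question)
            # Exact string match is the simplest consistency check; a softer
            # criterion (e.g. token-level F1 above a threshold) is another option.
            if predicted == answer:
                corpus.append((context, question, answer))
    return corpus
```

In practice each callable would wrap a finetuned model (the abstract notes that BERT finetuned on the extractive subsets of SQuAD2 and NQ suffices), but any compatible generator and extractor fit this interface.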