Despite advances in question answering (QA) systems and rapid improvements on
held-out test sets, their generalizability remains a
topic of concern. We explore the robustness of QA models to question paraphrasing
by creating two test sets consisting of paraphrased SQuAD questions. Paraphrased questions
in the first test set closely resemble the original questions and are designed to test QA
models’ over-sensitivity, while questions from
the second test set are paraphrased using context words near an incorrect answer candidate
in an attempt to confuse QA models. We
show that both paraphrased test sets lead to
a significant decrease in performance on multiple state-of-the-art QA models. Using a neural
paraphrasing model trained to generate multiple paraphrased questions for a given source
question and a set of paraphrase suggestions,
we propose a data augmentation approach that
requires no human intervention to re-train the
models for improved robustness to question