Abstract
Multi-hop reading comprehension (RC) questions are challenging because they require
reading and reasoning over multiple paragraphs. We argue that it can be difficult to construct large multi-hop RC datasets. For example, even highly compositional questions can
be answered with a single hop if they target
specific entity types or if the facts needed to
answer them are redundant. Our analysis is
centered on HOTPOTQA, where we show that
single-hop reasoning can solve much more of
the dataset than previously thought. We introduce a single-hop BERT-based RC model that
achieves 67 F1, comparable to state-of-the-art multi-hop models. We also design an evaluation setting where humans are not shown all
of the necessary paragraphs for the intended
multi-hop reasoning but can still answer over
80% of questions. Together with detailed error
analysis, these results suggest there should be
an increasing focus on the role of evidence in
multi-hop reasoning and possibly even a shift
towards information-retrieval-style evaluations
with large and diverse evidence collections.