Abstract
We are surprised to find that BERT's peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work.