Human vs. Muppet: A Conservative Estimate of Human
Performance on the GLUE Benchmark
Abstract
The GLUE benchmark (Wang et al., 2019b) is a suite of language understanding tasks which has seen dramatic progress in the past year, with average performance moving from 70.0 at launch to 83.9, the state of the art at the time of writing (May 24, 2019). Here, we measure human performance on the benchmark in order to learn whether significant headroom remains for further progress. We provide a conservative estimate of human performance on the benchmark through crowdsourcing: our annotators are non-experts who must learn each task from a brief set of instructions and 20 examples. In spite of limited training, these annotators robustly outperform the state of the art on six of the nine GLUE tasks and achieve an average score of 87.1. Given the fast pace of progress, however, the headroom we observe is quite limited. To reproduce the data-poor setting that our annotators must learn in, we also train the BERT model (Devlin et al., 2019) in limited-data regimes, and conclude that low-resource sentence classification remains a challenge for modern neural network approaches to text understanding.
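To make the limited-data comparison concrete, the sketch below fine-tunes a pretrained BERT classifier on a small, fixed set of labeled examples. It is a minimal illustration only: it assumes the Hugging Face transformers library and PyTorch, uses hypothetical toy sentence pairs and labels, and does not reproduce the exact training configuration used in the experiments described above.

# Minimal sketch: fine-tuning BERT on a tiny labeled sample (hypothetical data).
# Assumes the Hugging Face `transformers` library and PyTorch; hyperparameters
# are illustrative, not the paper's actual configuration.
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

# A toy low-resource training set: a handful of labeled sentence pairs.
texts = [
    ("The cat sat on the mat.", "A cat is sitting on a mat."),
    ("The cat sat on the mat.", "A dog is running in a park."),
]
labels = torch.tensor([1, 0])  # 1 = entailment-like, 0 = not (illustrative labels)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize the sentence pairs into a single padded batch of tensors.
batch = tokenizer(
    [a for a, _ in texts],
    [b for _, b in texts],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few passes over the tiny training set
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")

In an actual low-resource experiment, the toy pairs above would be replaced by a subsample of a GLUE task's training set, with performance then evaluated on the task's full development set.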