Abstract
It is standard practice in speech & language technology to rank systems according to their performance on a test set held out for evaluation. However, few researchers apply statistical tests to determine whether differences in performance are likely to arise by chance, and few examine the stability of system ranking across multiple training-testing splits. We conduct replication and reproduction experiments with nine part-of-speech taggers published between 2000 and 2018, each of which reports state-of-the-art performance on a widely-used "standard split". We fail to reliably reproduce some rankings using randomly generated splits. We suggest that randomly generated splits should be used in system comparison.
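The comparison procedure the abstract alludes to can be illustrated with a minimal sketch: draw random training-testing splits of a corpus and apply a paired significance test (here, an exact McNemar test on per-token disagreements) to decide whether two taggers genuinely differ. This is not the authors' code; the splitting granularity, test statistic, and tagger interfaces are assumptions for illustration only.

```python
"""Hypothetical sketch: random train/test splits plus an exact
McNemar test for comparing two taggers' per-token accuracy."""
import math
import random


def random_split(sentences, test_fraction=0.1, seed=0):
    """Shuffle sentences and split them into training and test portions."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts:
    b = tokens tagger A tags correctly and tagger B tags incorrectly,
    c = tokens tagger B tags correctly and tagger A tags incorrectly."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, min(b, c) ~ Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Usage with hypothetical per-token correctness vectors for two taggers:
# b = sum(x and not y for x, y in zip(correct_a, correct_b))
# c = sum(y and not x for x, y in zip(correct_a, correct_b))
# p = mcnemar_exact(b, c)
```

Repeating this over many random splits, rather than a single standard split, is what lets one examine whether a reported ranking of systems is stable or an artifact of one particular partition.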