Abstract
How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus.
Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data.
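To make this concrete, a minimal sketch of the general form such a multiplicative model can take (our illustrative notation, not necessarily the paper's exact specification):
$$ y_{n\ell} = \exp\bigl(\mu + s_n + d_\ell + \varepsilon_{n\ell}\bigr), $$
where $y_{n\ell}$ is the cost (e.g., total surprisal) a language model incurs on sentence $n$ in language $\ell$, $s_n$ is a per-sentence random effect capturing inter-sentence variation, $d_\ell$ is the language difficulty coefficient to be estimated, and $\varepsilon_{n\ell}$ is residual noise; sentence-language pairs absent from the at-least-pairwise parallel corpus are simply dropped from the likelihood.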
Exploiting this model, we show that “translationese” is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try, and fail, to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity, and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.