Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation
Abstract
Accurate, automatic evaluation of machine translation is critical for system tuning and for measuring progress in the field. We propose a simple unsupervised metric, as well as additional supervised metrics, that rely on contextual word embeddings to encode the translation and reference sentences. We find that these models rival or surpass all existing metrics in the WMT 2017 sentence-level and system-level tracks, and that our trained model has a substantially higher correlation with human judgements than all existing metrics on the WMT 2017 to-English sentence-level dataset.