Abstract
We argue for an alternative paradigm in evaluating machine translation quality that is strongly empirical but more accurately reflects the utility of translations, by returning to a representational foundation based on AI-oriented lexical semantics, rather than the superficial flat n-gram and string representations recently dominating the field. Driven by such metrics as BLEU and WER, current SMT frequently produces unusable translations where the semantic event structure is mistranslated: who did what to whom, when, where, why, and how? We argue that it is time for a new generation of more “intelligent” automatic and semi-automatic metrics, based clearly on getting the structure right at the lexical semantics level. We show empirically that it is possible to use simple PropBank-style semantic frame representations to surpass all currently widespread metrics’ correlation with human adequacy judgments, including even HTER. We also show that replacing human annotators with automatic semantic role labeling still yields much of the advantage of the approach. We combine the best of both worlds: from an SMT perspective, we provide superior yet low-cost quantitative objective functions for translation quality; and yet from an AI perspective, we regain the representational transparency and clear reflection of semantic utility of structural frame-based knowledge representations.