How to (Properly) Evaluate Cross-Lingual Word Embeddings:
On Strong Baselines, Comparative Analyses, and Some Misconceptions
Abstract
Cross-lingual word embeddings (CLEs) facilitate cross-lingual transfer of NLP models. Despite their ubiquitous downstream usage, increasingly popular projection-based CLE models are almost exclusively evaluated on bilingual lexicon induction (BLI). Even the BLI evaluations vary greatly, hindering our ability to correctly interpret performance and properties of different CLE models. In this work, we take the first step towards a comprehensive evaluation of CLE models: we thoroughly evaluate both supervised and unsupervised CLE models, for a large number of language pairs, on BLI and three downstream tasks, providing new insights concerning the ability of cutting-edge CLE models to support cross-lingual NLP. We empirically demonstrate that the performance of CLE models largely depends on the task at hand and that optimizing CLE models for BLI may hurt downstream performance. We indicate the most robust supervised and unsupervised CLE models and emphasize the need to reassess simple baselines, which still display competitive performance across the board. We hope our work catalyzes further research on CLE evaluation and model analysis.
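To make the abstract's terminology concrete, below is a minimal, hypothetical sketch (not the paper's code or data) of the setup it describes: a supervised projection-based CLE model, learned as an orthogonal Procrustes mapping between two monolingual embedding spaces, evaluated on BLI via nearest-neighbour retrieval under cosine similarity. All embeddings and the seed translation dictionary are synthetic toys constructed so that a linear projection between the spaces actually exists.

```python
# Illustrative sketch only: supervised projection-based CLE (orthogonal
# Procrustes) + BLI evaluation. All data below is synthetic.
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 1000

# Toy monolingual embeddings: the target space is a noisy rotation of
# the source space, so a linear projection between them exists.
X = rng.normal(size=(n, d))                   # source-language embeddings
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # ground-truth rotation
Y = X @ Q + 0.1 * rng.normal(size=(n, d))     # target-language embeddings

train = np.arange(500)      # seed dictionary: word i <-> word i
test = np.arange(500, 600)  # held-out translation pairs

# Supervised projection: solve W = argmin ||X_train W - Y_train||_F
# subject to W orthogonal (closed form via SVD).
U, _, Vt = np.linalg.svd(X[train].T @ Y[train])
W = U @ Vt

# BLI evaluation: project source words, retrieve nearest target words
# by cosine similarity, and report precision@1 on held-out pairs.
def unit(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

Xp, Yn = unit(X @ W), unit(Y)
p_at_1 = np.mean(np.argmax(Xp[test] @ Yn.T, axis=1) == test)
print(f"BLI P@1 on held-out pairs: {p_at_1:.2f}")
```

With this synthetic rotated data the held-out P@1 is near 1.0; real monolingual spaces are only approximately isomorphic, which is why BLI scores alone, as the abstract argues, can be a misleading proxy for downstream cross-lingual performance.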