A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings
Based on Graph Modularity
Abstract
Cross-lingual word embeddings encode the
meaning of words from different languages
into a shared low-dimensional space. An
important requirement for many downstream
tasks is that word similarity should be independent of language: that is, word vectors within
one language should not be more similar to
each other than to word vectors in another language.
We measure this characteristic using modularity, a network measure that quantifies the
strength of clusters in a graph. Modularity
has a moderate to strong correlation with performance on three
downstream tasks, even though modularity is
based only on the structure of embeddings and
does not require any external resources. We
show through experiments that modularity can
serve as an intrinsic validation metric to improve unsupervised cross-lingual word embeddings, particularly on distant language pairs in
low-resource settings.
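
To make the idea concrete, the following is a minimal sketch of how modularity could be computed from embeddings alone. It is not the authors' implementation. It assumes an unweighted, symmetric k-nearest-neighbor graph under cosine similarity and the standard Newman modularity with each word's language as its community label. The function names knn_graph and modularity and the toy vectors are hypothetical, chosen only for illustration.

```python
import numpy as np

def knn_graph(vectors, k):
    """Symmetric, unweighted k-nearest-neighbor adjacency matrix (cosine similarity).

    Assumption: this graph construction is illustrative, not taken from the paper.
    """
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)               # a word is not its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar words
    adj = np.zeros(sim.shape)
    rows = np.repeat(np.arange(len(vectors)), k)
    adj[rows, neighbors.ravel()] = 1.0
    return np.maximum(adj, adj.T)                # keep an edge if either endpoint chose it

def modularity(adj, labels):
    """Newman modularity Q = sum_c (e_cc - a_c^2) with languages as communities."""
    labels = np.asarray(labels)
    total = adj.sum()                            # equals 2 * (number of edges)
    q = 0.0
    for c in np.unique(labels):
        mask = labels == c
        e_cc = adj[np.ix_(mask, mask)].sum() / total  # edge ends inside language c
        a_c = adj[mask].sum() / total                 # edge ends attached to language c
        q += e_cc - a_c ** 2
    return q

# Toy embeddings: two tight groups of three vectors each.
toy_vectors = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2],
               [0.0, 1.0], [0.1, 1.0], [0.2, 1.0]]
toy_graph = knn_graph(toy_vectors, k=2)

# Languages clustered apart: every edge stays within one language.
q_separated = modularity(toy_graph, [0, 0, 0, 1, 1, 1])
# Languages interleaved: most edges cross the language boundary.
q_mixed = modularity(toy_graph, [0, 1, 0, 1, 0, 1])
print(q_separated, q_mixed)
```

In this toy setting, the labeling in which each language forms its own cluster yields higher modularity than the interleaved labeling, which matches the intuition that lower modularity indicates embeddings whose similarity structure is less dependent on language.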