Unsupervised Multilingual Word Embedding with Limited Resources using Neural Language Models
Abstract
Recently, a variety of unsupervised methods
have been proposed that map pre-trained word
embeddings of different languages into the
same space without any parallel data. These
methods aim to find a linear transformation
based on the assumption that monolingual
word embeddings are approximately isomorphic across languages. However, it has been demonstrated that this assumption holds only under specific conditions, and that with limited resources the performance of these methods degrades drastically. To overcome this problem, we propose a new unsupervised multilingual embedding method that does not rely on this assumption and performs well in resource-poor scenarios, namely when only a small amount of monolingual data (i.e., 50k sentences) is available, or when the domains of the monolingual data differ across
languages. Our proposed model, which we call ‘Multilingual Neural Language Models’, shares some of its network parameters across multiple languages and encodes sentences of all languages into the same space. The model jointly learns the word embeddings of the different languages in this shared space and generates multilingual embeddings without any
parallel data or pre-training. Our experiments on word alignment tasks demonstrate that, under the low-resource condition, our model substantially outperforms existing unsupervised methods, and even supervised methods trained with 500 bilingual word pairs. Our model also outperforms unsupervised methods when the training corpora come from different domains across languages. Our code is publicly available.
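To make the parameter-sharing idea concrete, the following is a minimal, hypothetical sketch in PyTorch: a single recurrent encoder is shared across all languages, while each language keeps its own word embedding table and output layer, so sentences of every language are encoded, and word embeddings learned, in one common space. The class and parameter names are illustrative assumptions, not the paper's actual implementation.

```python
import torch.nn as nn

class MultilingualLM(nn.Module):
    """Hypothetical shared-parameter language model over several languages.

    The recurrent encoder is shared by all languages, which ties their
    representations to a common space; only the word embeddings and the
    output projections are language-specific.
    """

    def __init__(self, vocab_sizes, emb_dim=300, hidden_dim=300):
        super().__init__()
        # One embedding table and one softmax projection per language.
        self.embeddings = nn.ModuleList(
            nn.Embedding(v, emb_dim) for v in vocab_sizes)
        self.outputs = nn.ModuleList(
            nn.Linear(hidden_dim, v) for v in vocab_sizes)
        # A single LSTM shared across all languages.
        self.shared_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids, lang):
        # token_ids: (batch, seq_len) word indices of language `lang`.
        embedded = self.embeddings[lang](token_ids)
        hidden, _ = self.shared_lstm(embedded)
        return self.outputs[lang](hidden)  # next-word logits
```

Under this sketch, training would minimize the usual language-modeling cross-entropy on each monolingual corpus in turn; after training, the rows of each `embeddings[lang].weight` would serve as the multilingual word embeddings, with no parallel data or pre-trained vectors involved.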