Abstract
We present a new method for sentiment lexicon induction that is designed to be applicable to the entire range of typological diversity of the world’s languages. We evaluate our method on Parallel Bible Corpus+
(PBC+), a parallel corpus of 1593 languages.
The key idea is to use Byte Pair Encodings
(BPEs) as basic units for multilingual embeddings. Through zero-shot transfer from
English sentiment, we learn a seed lexicon
for each language in the domain of PBC+.
Through domain adaptation, we then generalize the domain-specific lexicon to a general
one. Across typologically diverse languages in PBC+,
we show that the seed and general-domain sentiment
lexicons are of good quality, using intrinsic and
extrinsic as well as automatic and human
evaluation. We make our code, seed sentiment lexicons
for all 1593 languages, and induced general-domain
sentiment lexicons for 200 languages freely available.