Improved Sentiment Detection via Label Transfer
from Monolingual to Synthetic Code-Switched Text
Abstract
Multilingual writers and speakers often alternate between two languages in a single discourse, a practice called “code-switching”.
Existing sentiment detection methods are usually trained on sentiment-labeled monolingual
text. Manually labeled code-switched text, especially involving minority languages, is extremely rare. Consequently, the best monolingual methods perform relatively poorly
on code-switched text. We present an effective technique for synthesizing labeled
code-switched text from labeled monolingual
text, which is more readily available. The
idea is to replace carefully selected subtrees
of constituency parses of sentences in the
resource-rich language with suitable token
spans selected from automatic translations to
the resource-poor language. By augmenting scarce human-labeled code-switched text
with plentiful synthetic code-switched text, we
achieve significant improvements in sentiment
labeling accuracy (1.5%, 5.11%, 7.20%) for
three different language pairs (English-Hindi,
English-Spanish and English-Bengali). We
also get significant gains for hate speech detection: 4% improvement using only synthetic
text and 6% if augmented with real text