Abstract
Code-switching, the interleaving of two or more
languages within a sentence or discourse, is pervasive in multilingual societies. Accurate language
models for code-switched text are critical for NLP
tasks. State-of-the-art data-intensive neural language models are difficult to train well from scarce
language-labeled code-switched text. A potential
solution is to use deep generative models to synthesize large volumes of realistic code-switched
text. Although generative adversarial networks
and variational autoencoders can synthesize plausible monolingual text from a continuous latent space,
they cannot adequately model code-switched text,
owing to its informal style and the complex interplay
between the constituent languages. We introduce
VACS, a novel variational autoencoder architecture
specifically tailored to code-switching phenomena.
VACS encodes to and decodes from a two-level
hierarchical representation, which models syntactic contextual signals at the lower level and language-switching signals at the upper level. Decoding representations sampled from the prior produces
well-formed, diverse code-switched sentences. Extensive experiments show that augmenting natural monolingual data with synthetic code-switched text yields
a significant (33.06%) reduction in perplexity.
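To make the two-level generative process concrete, below is a minimal sketch, assuming a PyTorch implementation, of how decoding from such a hierarchical prior could look: an upper-level language-switching latent parameterizes the conditional prior of a lower-level syntactic latent, and both condition a recurrent decoder that emits words together with per-token language tags. All module names, dimensions, and the greedy GRU decoding loop are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a two-level hierarchical VAE prior/decoder
# for code-switched text; not the authors' implementation.
import torch
import torch.nn as nn

class HierarchicalCSDecoder(nn.Module):
    def __init__(self, vocab_size=5000, n_langs=2,
                 z_switch_dim=16, z_syntax_dim=32, hidden_dim=64):
        super().__init__()
        # Upper level: language-switching latent -> parameters of the
        # lower-level (syntactic/contextual) latent's conditional prior.
        self.syntax_prior = nn.Linear(z_switch_dim, 2 * z_syntax_dim)
        # Both latents initialize a GRU sentence decoder.
        self.init_state = nn.Linear(z_switch_dim + z_syntax_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)  # next word
        self.lang_head = nn.Linear(hidden_dim, n_langs)     # its language tag

    @torch.no_grad()
    def sample(self, max_len=20, bos_id=1):
        # 1) Sample the language-switching latent from a standard prior.
        z_switch = torch.randn(1, self.syntax_prior.in_features)
        # 2) Sample the syntactic latent from its conditional prior.
        mu, log_var = self.syntax_prior(z_switch).chunk(2, dim=-1)
        z_syntax = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
        # 3) Decode words and per-token language tags autoregressively
        #    (greedy argmax kept for brevity).
        h = torch.tanh(self.init_state(torch.cat([z_switch, z_syntax], -1)))
        h = h.unsqueeze(0)                     # (layers, batch, hidden)
        tok = torch.tensor([[bos_id]])
        words, langs = [], []
        for _ in range(max_len):
            out, h = self.gru(self.embed(tok), h)
            tok = self.word_head(out[:, -1]).argmax(-1, keepdim=True)
            words.append(tok.item())
            langs.append(self.lang_head(out[:, -1]).argmax(-1).item())
        return words, langs

model = HierarchicalCSDecoder()
word_ids, lang_tags = model.sample()
print(word_ids, lang_tags)  # token ids and their language labels
```

Under this reading, sentences synthesized by sampling both latents from the prior could then be mixed with natural monolingual data to train a downstream language model.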