Abstract
Automatic question generation (QG) is a challenging problem in natural language understanding. QG systems are typically built assuming access to a large number of training instances, where each instance pairs a question with
its corresponding answer. For a new language,
such training instances are hard to obtain, making the QG problem even more challenging.
Using this as our motivation, we study the
reuse of an available large QG dataset in a secondary language (e.g. English) to learn a QG
model for a primary language (e.g. Hindi) of
interest. For the primary language, we assume
access to a large amount of monolingual text
but only a small QG dataset. We propose a
cross-lingual QG model which uses the following training regime: (i) unsupervised pretraining of language models in both the primary
and secondary languages, and (ii) joint supervised training for QG in both languages. We
demonstrate the efficacy of our proposed approach using two different primary languages,
Hindi and Chinese. We also create and release
a new question answering dataset for Hindi
consisting of 6555 sentences.
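The two-stage training regime summarized above can be sketched as a minimal schedule builder. This is an illustrative sketch only: the function name, the alternating-language heuristic in phase (ii), and the language codes are assumptions for exposition, not the paper's implementation.

```python
from typing import List, Tuple

def make_training_schedule(
    pretrain_steps: int,
    joint_steps: int,
    primary: str = "hi",    # primary language, e.g. Hindi (assumed code)
    secondary: str = "en",  # secondary language, e.g. English (assumed code)
) -> List[Tuple[str, str]]:
    """Return an ordered list of (phase, language) training steps.

    Phase (i): unsupervised language-model pretraining on monolingual
    text in both the primary and secondary languages.
    Phase (ii): joint supervised QG training; here we simply alternate
    languages each step (an assumed mixing strategy) so the small
    primary-language QG data is interleaved with the large secondary one.
    """
    schedule: List[Tuple[str, str]] = []
    # (i) unsupervised pretraining in each language
    for lang in (primary, secondary):
        schedule += [("lm_pretrain", lang)] * pretrain_steps
    # (ii) joint supervised QG training, alternating between languages
    for step in range(joint_steps):
        lang = primary if step % 2 == 0 else secondary
        schedule.append(("qg_supervised", lang))
    return schedule

schedule = make_training_schedule(pretrain_steps=2, joint_steps=4)
```

In an actual system each `(phase, language)` step would dispatch a batch with the corresponding objective (LM loss for pretraining, QG sequence-to-sequence loss for the supervised phase).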