Learning Morphosyntactic Analyzers from the Bible via Iterative
Annotation Projection across 26 Languages
Abstract
A large percentage of computational tools are
concentrated in a very small subset of the
planet’s languages. Compounding the issue,
many languages lack the high-quality linguistic annotation necessary for the construction
of such tools with current machine learning
methods. In this paper, we address both issues simultaneously: leveraging the high accuracy of English taggers and parsers, we
project morphological information onto translations of the Bible in 26 varied test languages.
Using an iterative discovery, constraint, and
training process, we build inflectional lexica
in the target languages. Through a combination of iteration, ensembling, and reranking,
we see double-digit relative error reductions
in lemmatization and morphological analysis
over a strong initial system.