Unsupervised Parallel Sentence Extraction with Parallel SegmentDetection Helps Machine Translation
Abstract
Mining parallel sentences from comparable
corpora is important. Most previous work relies on supervised systems, which are trained
on parallel data, thus their applicability is
problematic in low-resource scenarios. Recent developments in building unsupervised
bilingual word embeddings made it possible to
mine parallel sentences based on cosine similarities of source and target language words.
We show that relying only on this information is not enough, since sentences often have
similar words but different meanings. We detect continuous parallel segments in sentence
pair candidates and rely on them when mining parallel sentences. We show better mining
accuracy on three language pairs in a standard
shared task on artificial data. We also provide
the first experiments showing that parallel sentences mined from real life sources improve
unsupervised MT. Our code is available, we
hope it will be used to support low-resource
MT research.