Abstract
Unseen words, also called out-of-vocabulary
words (OOVs), are difficult for machine translation. In neural machine translation (NMT), byte-pair
encoding can be used to represent OOVs, but
they are still often incorrectly translated. We
improve the translation of OOVs in NMT using easy-to-obtain monolingual data. We look
for OOVs in the text to be translated and translate them using simple-to-construct bilingual
word embeddings (BWEs). In our MT experiments we take the five best candidates,
a choice motivated by intrinsic mining experiments.
Using all five of the proposed target-language
words as queries, we mine target-language sentences. We then back-translate, forcing the
back-translation of each of the five proposed
target-language OOV translation candidates to
be the original source-language OOV. We
show that by using this synthetic data to fine-tune our system, the translation of OOVs can be
dramatically improved. In our experiments we
use a system trained on Europarl and mine sentences containing medical terms from monolingual data.
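
The pipeline summarized above (candidate selection with BWEs, sentence mining, forced back-translation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system used in the experiments: the function names, the toy embeddings, the `<OOV>` placeholder trick for forcing the back-translation, and the `translate` callable are all hypothetical. The abstract does not specify how the forcing is implemented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_candidates(oov_vec, target_emb, k=5):
    """Rank target-language words by similarity to the source OOV's vector
    in a shared bilingual embedding space and keep the k best."""
    ranked = sorted(target_emb,
                    key=lambda w: cosine(oov_vec, target_emb[w]),
                    reverse=True)
    return ranked[:k]

def mine_sentences(candidates, target_corpus):
    """Return (sentence, candidate) pairs for every target-language
    sentence that contains one of the candidate words as a token."""
    hits = []
    for sent in target_corpus:
        tokens = sent.split()
        for cand in candidates:
            if cand in tokens:
                hits.append((sent, cand))
    return hits

def back_translate_forced(sentence, candidate, source_oov, translate):
    """Back-translate a mined sentence so that the candidate is rendered
    as the original source OOV.

    Assumption: forcing is emulated by masking the candidate with a
    placeholder that the hypothetical `translate` callable copies through
    unchanged, then substituting the source OOV for the placeholder.
    """
    placeholder = "<OOV>"
    masked = " ".join(placeholder if t == candidate else t
                      for t in sentence.split())
    return translate(masked).replace(placeholder, source_oov)

# Toy data, invented for illustration only.
target_emb = {"kidney": [0.9, 0.1], "liver": [0.7, 0.3], "parliament": [0.0, 1.0]}
oov_vec = [1.0, 0.0]  # vector of a hypothetical source OOV, "Niere"
corpus = ["the kidney was removed", "the parliament voted"]

candidates = top_k_candidates(oov_vec, target_emb, k=2)
mined = mine_sentences(candidates, corpus)
# An identity function stands in for a real target-to-source MT system.
synthetic = [(back_translate_forced(s, c, "Niere", lambda x: x), s)
             for s, c in mined]
```

Each resulting pair (synthetic source sentence containing the OOV, mined target sentence) would then serve as a fine-tuning example. In a real setting `translate` would be a target-to-source MT system rather than the identity function used here.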