The data preprocessing script `get_enfr_data.sh` (copied from UnsupervisedMT-Pytorch, but with the torch dataset-binarization command removed and the special tokens added to the vocabulary files) will take care of installing everything (except Python and TensorFlow).
## Download / preprocess data

The first thing to do is to download and preprocess the data. To do so, just run:

```shell
cd UnsupervisedMT-TensorFlow
./get_enfr_data.sh
```
Note that there are several ways to train cross-lingual embeddings:

- Train monolingual embeddings separately for each language, and align them with MUSE (please refer to the original paper for more details).
- Concatenate the source and target monolingual corpora into a single file, and train embeddings with fastText on that generated file (this is what is implemented in the `get_enfr_data.sh` script).
The second method works better when the source and target languages
are similar and share a lot of common words (such as French and English
in get_enfr_data.sh). However, when the overlap between the
source and target vocabulary is too small, the alignment will be very
poor and you should opt for the first method using MUSE to generate your
cross-lingual embeddings.
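The second method boils down to training one embedding space on the merged corpora. Here is a minimal sketch of that concatenation step; the file names and the fastText command shown in the comment are illustrative assumptions, not the exact ones used in `get_enfr_data.sh`:

```python
# Sketch of the shared-corpus embedding approach (method 2 above).
# File names and the fastText invocation in the comment below are
# illustrative assumptions, not the exact commands of get_enfr_data.sh.

# Toy monolingual corpora (one tokenized sentence per line).
with open("mono.en", "w") as f:
    f.write("the cat sat\nthe dog ran\n")
with open("mono.fr", "w") as f:
    f.write("le chat dort\nle chien court\n")

# Concatenate both languages into a single training file, so that
# (sub)words shared across languages get a single, shared embedding.
with open("all.en-fr", "w") as out:
    for path in ("mono.en", "mono.fr"):
        with open(path) as f:
            out.write(f.read())

# Embeddings would then be trained on the concatenated file, e.g.:
#   fasttext skipgram -input all.en-fr -output embeddings
```

This only works well when the two languages actually share surface forms (hence the caveat about vocabulary overlap above).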
You can skip this preprocessing script, since it takes a long time (downloading, learning/applying BPE, and fastText training). Instead, just download the prepared datasets (the output of running `get_enfr_data.sh`):

```shell
cd UnsupervisedMT-TensorFlow
./download_enfr_data.sh
```
## Train the NMT model

```shell
./run.sh
```
The hyperparameters in `run.sh` are almost identical to those in UnsupervisedMT-Pytorch, except for `batch_size=2048`, which is a token-level batch size.
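As a rough illustration of what a token-level batch size means (this is a sketch of the general idea, not the repository's actual batching code): sentences are grouped so that the number of tokens per padded batch, rather than the number of sentences, stays under the limit.

```python
def make_token_batches(sentences, max_tokens=2048):
    """Group sentences so each batch holds at most max_tokens tokens.

    A sketch of token-level batching (not this repo's actual code):
    with padding, a batch of B sentences whose longest sentence has
    L tokens costs B * L tokens.
    """
    batches, batch, max_len = [], [], 0
    for sent in sentences:
        new_max = max(max_len, len(sent))
        # Adding this sentence would exceed the budget: flush the batch.
        if batch and (len(batch) + 1) * new_max > max_tokens:
            batches.append(batch)
            batch, max_len = [], 0
            new_max = len(sent)
        batch.append(sent)
        max_len = new_max
    if batch:
        batches.append(batch)
    return batches

# Toy usage: 1000 "sentences" of varying length.
sents = [["tok"] * (i % 50 + 1) for i in range(1000)]
batches = make_token_batches(sents, max_tokens=2048)
assert all(len(b) * max(len(s) for s in b) <= 2048 for b in batches)
```

Compared with a fixed sentence count, this keeps the memory cost per step roughly constant regardless of sentence length.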
On newstest2014 en-fr, the above command should give more than 22 BLEU after 100K training steps on a P100 (similar to the Pytorch code).
## Main Implementation Difference
In our code, the gradient of each update is computed from the summed loss over both directions (lang1 <-> lang2), while the Pytorch code performs two updates, one with the loss of each direction.
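The distinction can be seen on a toy scalar parameter with one loss per translation direction (a pure-Python sketch, not the actual TensorFlow or PyTorch training code):

```python
# Toy illustration of the update-rule difference (not the real training
# code): two "direction" losses on a single scalar parameter w.
#   loss1(w) = (w - 2)^2  ->  grad1 = 2 * (w - 2)   (lang1 -> lang2)
#   loss2(w) = (w + 1)^2  ->  grad2 = 2 * (w + 1)   (lang2 -> lang1)
lr = 0.1

def grad1(w):
    return 2 * (w - 2)

def grad2(w):
    return 2 * (w + 1)

# This repo: one update using the summed loss of both directions.
w = 1.0
w_summed = w - lr * (grad1(w) + grad2(w))   # -> 0.8

# Pytorch code: two updates, one per direction, so the second
# gradient is evaluated at the already-updated parameter.
w = 1.0
w = w - lr * grad1(w)                       # -> 1.2
w_sequential = w - lr * grad2(w)            # -> 0.76
```

Both schemes use the same gradients at the starting point, but the sequential variant evaluates the second gradient after the first step, so the resulting parameters differ slightly (0.8 vs 0.76 here).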
## TODO

- Mixed data loader (for training on monolingual and parallel datasets together)