The data preprocessing script `get_enfr_data.sh` (copied from UnsupervisedMT-Pytorch, but with the torch dataset-binarization command removed and the special tokens added to the vocabulary files) will take care of installing everything (except Python and TensorFlow).
## Download / preprocess data

The first thing to do is to download and preprocess the data. To do so, just run:

```shell
cd UnsupervisedMT-TensorFlow
./get_enfr_data.sh
```
Note that there are several ways to train cross-lingual embeddings:

- Train monolingual embeddings separately for each language, and align them with MUSE (please refer to the original paper for more details).
- Concatenate the source and target monolingual corpora into a single file, and train embeddings with fastText on that generated file (this is what is implemented in the `get_enfr_data.sh` script).
The second method works better when the source and target languages
are similar and share a lot of common words (such as French and English
in get_enfr_data.sh). However, when the overlap between the
source and target vocabulary is too small, the alignment will be very
poor and you should opt for the first method using MUSE to generate your
cross-lingual embeddings.
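The second method boils down to training one embedding space on the merged corpora. Here is a minimal sketch of that concatenation step; the file names and the fastText command shown in the comment are illustrative assumptions, not the exact ones used in `get_enfr_data.sh`:

```python
# Sketch of the shared-corpus embedding approach (method 2 above).
# File names and the fastText invocation in the comment below are
# illustrative assumptions, not the exact commands of get_enfr_data.sh.

# Toy monolingual corpora (one tokenized sentence per line).
with open("mono.en", "w") as f:
    f.write("the cat sat\nthe dog ran\n")
with open("mono.fr", "w") as f:
    f.write("le chat dort\nle chien court\n")

# Concatenate both languages into a single training file, so that
# (sub)words shared across languages get a single, shared embedding.
with open("all.en-fr", "w") as out:
    for path in ("mono.en", "mono.fr"):
        with open(path) as f:
            out.write(f.read())

# Embeddings would then be trained on the concatenated file, e.g.:
#   fasttext skipgram -input all.en-fr -output embeddings
```

This only works well when the two languages actually share surface forms (hence the caveat about vocabulary overlap above).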
You can skip this preprocessing script, since it takes a long time (downloading, learning/applying BPE, and fastText training). Instead, just download the prepared datasets (the output of running `get_enfr_data.sh`):

```shell
cd UnsupervisedMT-TensorFlow
./download_enfr_data.sh
```
## Train the NMT model

```shell
./run.sh
```
The hyperparameters in `run.sh` are almost identical to those in UnsupervisedMT-Pytorch, except for `batch_size=2048`, which is a token-level batch size.
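As a rough illustration of what a token-level batch size means (this is a sketch of the general idea, not the repository's actual batching code): sentences are grouped so that the number of tokens per padded batch, rather than the number of sentences, stays under the limit.

```python
def make_token_batches(sentences, max_tokens=2048):
    """Group sentences so each batch holds at most max_tokens tokens.

    A sketch of token-level batching (not this repo's actual code):
    with padding, a batch of B sentences whose longest sentence has
    L tokens costs B * L tokens.
    """
    batches, batch, max_len = [], [], 0
    for sent in sentences:
        new_max = max(max_len, len(sent))
        # Adding this sentence would exceed the budget: flush the batch.
        if batch and (len(batch) + 1) * new_max > max_tokens:
            batches.append(batch)
            batch, max_len = [], 0
            new_max = len(sent)
        batch.append(sent)
        max_len = new_max
    if batch:
        batches.append(batch)
    return batches

# Toy usage: 1000 "sentences" of varying length.
sents = [["tok"] * (i % 50 + 1) for i in range(1000)]
batches = make_token_batches(sents, max_tokens=2048)
assert all(len(b) * max(len(s) for s in b) <= 2048 for b in batches)
```

Compared with a fixed sentence count, this keeps the memory cost per step roughly constant regardless of sentence length.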
On newstest2014 en-fr, the above command should give more than 22 BLEU after 100K training steps on a P100 (similar to the Pytorch code).
## Main Implementation Difference
In our code, the gradient of each update is computed from the summed loss over both directions (lang1 <-> lang2), while the Pytorch code performs two updates, one with the loss of each direction.
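The distinction can be seen on a toy scalar parameter with one loss per translation direction (a pure-Python sketch, not the actual TensorFlow or PyTorch training code):

```python
# Toy illustration of the update-rule difference (not the real training
# code): two "direction" losses on a single scalar parameter w.
#   loss1(w) = (w - 2)^2  ->  grad1 = 2 * (w - 2)   (lang1 -> lang2)
#   loss2(w) = (w + 1)^2  ->  grad2 = 2 * (w + 1)   (lang2 -> lang1)
lr = 0.1

def grad1(w):
    return 2 * (w - 2)

def grad2(w):
    return 2 * (w + 1)

# This repo: one update using the summed loss of both directions.
w = 1.0
w_summed = w - lr * (grad1(w) + grad2(w))   # -> 0.8

# Pytorch code: two updates, one per direction, so the second
# gradient is evaluated at the already-updated parameter.
w = 1.0
w = w - lr * grad1(w)                       # -> 1.2
w_sequential = w - lr * grad2(w)            # -> 0.76
```

Both schemes use the same gradients at the starting point, but the sequential variant evaluates the second gradient after the first step, so the resulting parameters differ slightly (0.8 vs 0.76 here).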
## TODO

- Mixed data loader (for training on monolingual and parallel datasets together)