UnsupervisedMT-TensorFlow

Unsupervised Machine Translation (Transformer-Based UNMT)

This repository provides a TensorFlow implementation of the transformer-based unsupervised NMT model presented in
Phrase-Based & Neural Unsupervised Machine Translation (EMNLP 2018).

Requirements

  • Python 3

  • TensorFlow 1.12

  • Moses (clean and tokenize text)

  • fastBPE (generate and apply BPE codes)

  • fastText (generate embeddings)

  • (optional) MUSE (generate cross-lingual embeddings)

The data preprocessing script get_enfr_data.sh (copied from UnsupervisedMT-Pytorch, with the torch dataset-binarization command removed and the special tokens added to the vocabulary files) will take care of installing everything (except Python and TensorFlow).
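
For illustration, adding special tokens to a vocabulary file simply means prepending a few reserved entries before the regular ones. The sketch below is a hypothetical Python equivalent of that step; the token names, the file path, and the "token count" line format are assumptions, not taken from the script itself.

# Hypothetical sketch: prepend special tokens to a fastBPE-style vocabulary file.
# Token names, the path, and the "token count" line format are assumptions.
SPECIAL_TOKENS = ["<s>", "</s>", "<pad>", "<unk>"]

def add_special_tokens(vocab_path):
    with open(vocab_path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    with open(vocab_path, "w", encoding="utf-8") as f:
        for tok in SPECIAL_TOKENS:
            f.write("%s 0\n" % tok)  # give special tokens a dummy count
        f.writelines(lines)

add_special_tokens("vocab.en-fr.60000")  # illustrative path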

Download / preprocess data

The first thing to do is to download and preprocess the data. To do so, just run:

cd UnsupervisedMT-TensorFlow
./get_enfr_data.sh

Note that there are several ways to train cross-lingual embeddings:

  • Train monolingual embeddings separately for each language, and align them with MUSE (please refer to the original paper for more details).

  • Concatenate the source and target monolingual corpora in a single file, and train embeddings with fastText on that generated file (this is what is implemented in the get_enfr_data.sh script).

The second method works better when the source and target languages are similar and share a lot of common words (such as French and English in get_enfr_data.sh). However, when the overlap between the source and target vocabulary is too small, the alignment will be very poor and you should opt for the first method using MUSE to generate your cross-lingual embeddings.
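
As a concrete illustration of the second method, the sketch below concatenates two monolingual corpora into one file and trains skip-gram embeddings with the fasttext Python bindings. Note that get_enfr_data.sh uses the fastText command-line tool rather than these bindings, and the file names and embedding dimension here are illustrative only.

# Sketch of the "concatenate and train" method using the fasttext Python bindings.
# File names and the embedding dimension are illustrative.
import fasttext

# Concatenate the (BPE-processed) source and target monolingual corpora.
with open("all.en-fr", "w", encoding="utf-8") as out:
    for path in ("train.en", "train.fr"):
        with open(path, encoding="utf-8") as f:
            out.write(f.read())

# Train skip-gram embeddings on the concatenated file and save them.
model = fasttext.train_unsupervised("all.en-fr", model="skipgram", dim=512)
model.save_model("all.en-fr.bin")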

Since the preprocessing script takes a long time (downloading, learning/applying BPE, and fastText training), you can skip it and instead download the datasets already prepared by get_enfr_data.sh:

cd UnsupervisedMT-TensorFlow
./download_enfr_data.sh

Train the NMT model

./run.sh

The hyperparameters in run.sh are almost identical to those of UnsupervisedMT-Pytorch, except for batch_size=2048, which is a token-level batch size.
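
A token-level batch size means that sentences are packed into a batch until the padded token count (rather than the sentence count) reaches batch_size. The following is a minimal sketch of this idea, not the repository's actual data loader.

# Minimal sketch of token-level batching: batch_size counts (padded) tokens,
# not sentences. This is an illustration, not the repository's data loader.
def token_level_batches(sentences, batch_size=2048):
    batch, max_len = [], 0
    for sent in sentences:
        max_len = max(max_len, len(sent))
        # With padding, the batch would hold (len(batch) + 1) * max_len tokens.
        if batch and (len(batch) + 1) * max_len > batch_size:
            yield batch
            batch, max_len = [], len(sent)
        batch.append(sent)
    if batch:
        yield batch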

On newstest2014 en-fr, the above command should give more than 22 BLEU after 100K training steps on a P100 (similar to the PyTorch code).

Main Implementation Difference

In our code, the gradient for each update is computed from the summed loss of both directions (lang1 <-> lang2), while the PyTorch code performs two updates, one with the loss of each direction.
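
Schematically, the difference looks like the toy TF 1.x snippet below; the variable and losses are stand-ins for illustration, not the repository's actual tensors.

# Toy illustration of the update rule; the model and losses are stand-ins.
import tensorflow as tf

w = tf.Variable(1.0)              # stand-in for the shared model parameters
loss_l1_l2 = tf.square(w - 2.0)   # toy loss for lang1 -> lang2
loss_l2_l1 = tf.square(w + 2.0)   # toy loss for lang2 -> lang1

optimizer = tf.train.AdamOptimizer(1e-3)

# This code: a single update computed from the summed loss of both directions.
train_op = optimizer.minimize(loss_l1_l2 + loss_l2_l1)

# The PyTorch code (schematically) instead performs two updates per step,
# one with loss_l1_l2 and one with loss_l2_l1.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)  # one joint update over both directions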

TODO

  • Mixed Data Loader (for training on monolingual and parallel datasets together)

  • Multi-GPUs Training

  • Beam Search

References

GitHub

