PyTorch_GBW_LM

PyTorch Large-Scale Language Model

A Large-Scale PyTorch Language Model trained on the Google 1-Billion Word (GBW) dataset

Results

  • 46.47 Perplexity after 5 training epochs on a 1-layer, 2048-unit, 256-projection LSTM Language Model [3]

  • Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)

  • Implemented Sampled Softmax and Log-Uniform Sampler functions
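
The log-uniform (Zipfian) sampler proposes negative classes with probability P(k) = (log(k + 2) − log(k + 1)) / log(range_max + 1), so small word IDs (the most frequent words, when the vocabulary is sorted by descending frequency) are drawn most often. The repository builds its sampler as a Cython extension; the NumPy sketch below only illustrates the distribution, and the class name and vocabulary size are illustrative assumptions rather than the repo's API.

```python
import numpy as np

class LogUniformSampler:
    """Illustrative sketch of a log-uniform (Zipfian) negative sampler."""

    def __init__(self, range_max):
        # range_max: vocabulary size; word IDs are assumed to be sorted by
        # descending frequency, so small IDs correspond to frequent words.
        self.range_max = range_max
        self.log_range = np.log(range_max + 1)

    def probability(self, k):
        # P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1)
        return (np.log(k + 2) - np.log(k + 1)) / self.log_range

    def sample(self, n_samples, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # Inverse-transform sampling: u ~ U(0, 1) maps to
        # k = floor(exp(u * log(range_max + 1))) - 1.
        u = rng.random(n_samples)
        k = np.exp(u * self.log_range).astype(np.int64) - 1
        return np.clip(k, 0, self.range_max - 1)

# Example: draw 8192 negative word IDs from the ~800k-word GBW vocabulary.
sampler = LogUniformSampler(range_max=793471)
negatives = sampler.sample(8192)
```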

GPU Hardware Requirement

| Type                | LM Memory Size | GPU                            |
| ------------------- | -------------- | ------------------------------ |
| w/o tied weights    | ~9 GB          | Nvidia 1080 TI, Nvidia Titan X |
| w/ tied weights [6] | ~7 GB          | Nvidia 1070 or higher          |

  • There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.
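
In PyTorch, tying amounts to pointing the output (softmax) layer's weight at the embedding weight, which requires the embedding/projection size to match the softmax input size. A minimal sketch follows; the module and argument names are illustrative, not the repo's exact code.

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal sketch: the embedding and softmax layers share one matrix."""

    def __init__(self, vocab_size, embed_size, tied=True):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, embed_size)
        self.decoder = nn.Linear(embed_size, vocab_size)
        if tied:
            # Reuse the (vocab_size x embed_size) embedding weights as the
            # output projection, saving one large matrix of GPU memory.
            self.decoder.weight = self.encoder.weight
```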

Hyper-Parameters [3]

| Parameter                 | Value   |
| ------------------------- | ------- |
| # Epochs                  | 5       |
| Training Batch Size       | 128     |
| Evaluation Batch Size     | 1       |
| BPTT                      | 20      |
| Embedding Size            | 256     |
| Hidden Size               | 2048    |
| Projection Size           | 256     |
| Tied Embedding + Softmax  | False   |
| # Layers                  | 1       |
| Optimizer                 | AdaGrad |
| Learning Rate             | 0.10    |
| Gradient Clipping         | 1.00    |
| Dropout                   | 0.01    |
| Weight-Decay (L2 Penalty) | 1e-6    |
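
These settings translate into standard PyTorch training code roughly as follows. This is only a sketch with a placeholder model, not the repository's training loop; nn.LSTM's proj_size argument (available in recent PyTorch releases) merely stands in for the repo's own projection LSTM.

```python
import torch
import torch.nn as nn

# Placeholder for the 1-layer, 2048-unit, 256-projection LSTM language model.
model = nn.LSTM(input_size=256, hidden_size=2048, proj_size=256, num_layers=1)

# AdaGrad with the learning rate and L2 penalty listed above.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.10, weight_decay=1e-6)

def train_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm to 1.00 before the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.00)
    optimizer.step()
```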

Setup - Torch Data Format

  1. Download Google Billion Word Dataset for Torch - Link

  2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file

  3. Install the Cython framework and build the Log_Uniform Sampler

I leverage the GBW data preprocessed for the Torch framework (see Torch GBW [5]). Each data tensor contains all the words in a data partition. The "train_data.sid" file marks the start position and length of each independent sentence. This preprocessing step and the "train_data.sid" file speed up loading the massive training data.

  • Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)

  • Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)
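
To illustrate that layout, the sketch below slices individual sentences out of a loaded data tensor using the sentence-ID tensor; the function name and the tiny tensors are made up for the example, not taken from the repo's loader.

```python
import torch

def iter_sentences(data, sids):
    # data: (#words x 2) tensor of (sentence id, word id) rows.
    # sids: (#sentences x 2) tensor of (start position, sentence length) rows.
    for start, length in sids.tolist():
        # Word IDs of one independent sentence, in order.
        yield data[start:start + length, 1]

# Tiny synthetic example: two sentences of 3 and 2 words.
data = torch.tensor([[0, 11], [0, 12], [0, 13], [1, 21], [1, 22]])
sids = torch.tensor([[0, 3], [3, 2]])
for sentence in iter_sentences(data, sids):
    print(sentence)
```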

Setup - Original Data Format

  1. Download 1-Billion Word Dataset - Link

The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.
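
A rough sketch of the chunked approach, assuming the benchmark's usual tokenized shard layout (the glob pattern is an assumption about the downloaded files, and vocabulary lookup is omitted):

```python
import glob

def iter_shard_sentences(pattern="training-monolingual.tokenized.shuffled/news.en-*-of-00100"):
    # Stream one shard file at a time instead of holding the whole
    # training set in memory; each line is one tokenized sentence.
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.split()
```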

References

  1. Exploring the Limits of Language Modeling (GitHub)

  2. Factorization Tricks for LSTM networks (GitHub)

  3. Efficient softmax approximation for GPUs (GitHub)

  4. Candidate Sampling

  5. Torch GBW

  6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

