# PyTorch Large-Scale Language Model
A Large-Scale PyTorch Language Model trained on the Google 1-Billion Word (GBW) dataset
## Results
- 46.47 perplexity after 5 training epochs with a 1-layer, 2048-unit, 256-projection LSTM language model [3]
- Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)
- Implemented Sampled Softmax and Log-Uniform Sampler functions (a minimal sketch follows this list)
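The repo's Log_Uniform Sampler is built as a Cython extension; the NumPy sketch below only illustrates the underlying log-uniform (Zipfian) candidate distribution used by sampled softmax [4]. The class name and interface are illustrative, not the repo's actual API, and it assumes word ids are assigned by descending frequency.

```python
import numpy as np

class LogUniformSampler:
    """Illustrative log-uniform (Zipfian) negative sampler.

    Assumes id 0 is the most frequent word, so frequent words
    are drawn as negatives most often.
    """

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.log_range = np.log(vocab_size + 1)

    def probability(self, class_ids):
        # P(k) = (log(k + 2) - log(k + 1)) / log(vocab_size + 1),
        # which sums to 1 over k = 0 .. vocab_size - 1.
        k = np.asarray(class_ids, dtype=np.float64)
        return (np.log(k + 2) - np.log(k + 1)) / self.log_range

    def sample(self, num_samples):
        # Inverse-CDF sampling: floor(exp(u * log(V + 1))) - 1 follows
        # the log-uniform distribution over [0, vocab_size).
        u = np.random.uniform(size=num_samples)
        samples = (np.exp(u * self.log_range) - 1).astype(np.int64)
        return np.clip(samples, 0, self.vocab_size - 1)
```

Sampled softmax then scores only the true word plus these sampled negatives, subtracting log P(k) from each logit so the truncated softmax remains a consistent estimator of the full one.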
## GPU Hardware Requirement
| Type                | LM Memory Size | GPU                            |
| ------------------- | -------------- | ------------------------------ |
| w/o tied weights    | ~9 GB          | Nvidia 1080 Ti, Nvidia Titan X |
| w/ tied weights [6] | ~7 GB          | Nvidia 1070 or higher          |
## Hyper-Parameters [3]
| Parameter                 | Value   |
| ------------------------- | ------- |
| # Epochs                  | 5       |
| Training Batch Size       | 128     |
| Evaluation Batch Size     | 1       |
| BPTT                      | 20      |
| Embedding Size            | 256     |
| Hidden Size               | 2048    |
| Projection Size           | 256     |
| Tied Embedding + Softmax  | False   |
| # Layers                  | 1       |
| Optimizer                 | AdaGrad |
| Learning Rate             | 0.10    |
| Gradient Clipping         | 1.00    |
| Dropout                   | 0.01    |
| Weight-Decay (L2 Penalty) | 1e-6    |
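As a rough illustration of how this table maps onto PyTorch, the sketch below wires the sizes into `nn.LSTM` (whose `proj_size` argument, available since PyTorch 1.8, implements the 256-unit projection) and AdaGrad. It is a minimal sketch, not the repo's actual model class; the vocabulary size and the plain `Linear` decoder (which the sampled softmax replaces during training) are assumptions.

```python
import torch
import torch.nn as nn

class GBWModel(nn.Module):
    """Minimal sketch mirroring the hyper-parameter table above."""

    def __init__(self, vocab_size, embed_size=256, hidden_size=2048,
                 proj_size=256, dropout=0.01):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # proj_size projects the 2048-unit hidden state down to 256 units.
        self.rnn = nn.LSTM(embed_size, hidden_size, num_layers=1,
                           proj_size=proj_size, batch_first=True)
        self.drop = nn.Dropout(dropout)
        # Untied decoder (Tied Embedding + Softmax = False); in training,
        # a sampled softmax would replace this full vocabulary projection.
        self.decoder = nn.Linear(proj_size, vocab_size)

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(self.drop(self.embed(x)), hidden)
        return self.decoder(self.drop(out)), hidden

model = GBWModel(vocab_size=793471)  # standard GBW vocabulary size (assumed)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.10, weight_decay=1e-6)

# Per training step, clip gradients to the table's threshold before stepping:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.00)
```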
## Setup - Torch Data Format
1. Download the Google Billion Word Dataset for Torch - Link
2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file (a sketch of the index this step produces follows this list)
3. Install the Cython framework and build the Log_Uniform Sampler
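The following is a guess at the index that step 2 computes, assuming the (sentence id, word id) tensor layout described below; the actual code in "process_gbw.py" may differ.

```python
import numpy as np

def build_sentence_index(data):
    # data: (#words x 2) array of (sentence id, word id) rows.
    sentence_ids = np.asarray(data)[:, 0]
    # A new sentence starts wherever the sentence id changes.
    starts = np.concatenate(([0], np.flatnonzero(np.diff(sentence_ids)) + 1))
    lengths = np.diff(np.concatenate((starts, [len(sentence_ids)])))
    # (#sentences x 2) matrix of (start position, sentence length).
    return np.stack((starts, lengths), axis=1)
```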
I leverage the GBW data preprocessed for the Torch framework (see Torch GBW [5]). Each data tensor contains all the words in a data partition. The "train_data.sid" file marks the start position and length of each independent sentence. Together, the preprocessing step and the "train_data.sid" file speed up loading the massive training data.
- Data tensors (test_data, valid_data, train_data, train_small, train_tiny): a (#words x 2) matrix, where each row is a (sentence id, word id) pair
- Sentence ID tensor: a (#sentences x 2) matrix, where each row is a (start position, sentence length) pair
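A minimal sketch of how the two tensors can be used together to fetch individual sentences; it assumes the Torch .th7 files have already been loaded as tensors (e.g. via the `torchfile` package) and that start positions are 0-based.

```python
class SentenceCorpus:
    """Illustrative wrapper over the data and sentence-id tensors."""

    def __init__(self, data, sid):
        self.words = data[:, 1]  # column 1 holds the word ids
        self.sid = sid           # rows of (start position, sentence length)

    def __len__(self):
        return len(self.sid)

    def sentence(self, i):
        # Slice sentence i out of the flat word tensor.
        start, length = int(self.sid[i][0]), int(self.sid[i][1])
        return self.words[start:start + length]
```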
## Setup - Original Data Format
1. Download the 1-Billion Word Dataset - Link
The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.
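A sketch of the chunked alternative: stream the original shard files sentence by sentence instead of materializing everything in memory. The shard naming, directory layout, and `<UNK>` handling here are assumptions about the original format.

```python
import os

def stream_sentences(shard_dir, vocab):
    # Iterate over the news.en-xxxxx-of-00100 shard files one line at a
    # time; each line of the original dataset is one sentence.
    for name in sorted(os.listdir(shard_dir)):
        with open(os.path.join(shard_dir, name), encoding='utf-8') as f:
            for line in f:
                # Map out-of-vocabulary tokens to <UNK> (assumed token name).
                yield [vocab.get(w, vocab['<UNK>']) for w in line.split()]
```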
## References
1. Exploring the Limits of Language Modeling - Github
2. Factorization Tricks for LSTM networks - Github
3. Efficient softmax approximation for GPUs - Github
4. Candidate Sampling
5. Torch GBW
6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling