# PyTorch Large-Scale Language Model
A Large-Scale PyTorch Language Model trained on the Google 1-Billion Word (GBW) dataset
## Results
- 46.47 perplexity after 5 training epochs with a 1-layer, 2048-unit, 256-projection LSTM language model [3]
- Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)
- Implemented Sampled Softmax and Log-Uniform Sampler functions (a minimal sketch follows this list)
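The repo's Log_Uniform Sampler is built as a Cython extension; the NumPy sketch below only illustrates the underlying log-uniform (Zipfian) candidate distribution used by sampled softmax [4]. The class name and interface are illustrative, not the repo's actual API, and it assumes word ids are assigned by descending frequency.

```python
import numpy as np

class LogUniformSampler:
    """Illustrative log-uniform (Zipfian) negative sampler.

    Assumes id 0 is the most frequent word, so frequent words
    are drawn as negatives most often.
    """

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.log_range = np.log(vocab_size + 1)

    def probability(self, class_ids):
        # P(k) = (log(k + 2) - log(k + 1)) / log(vocab_size + 1),
        # which sums to 1 over k = 0 .. vocab_size - 1.
        k = np.asarray(class_ids, dtype=np.float64)
        return (np.log(k + 2) - np.log(k + 1)) / self.log_range

    def sample(self, num_samples):
        # Inverse-CDF sampling: floor(exp(u * log(V + 1))) - 1 follows
        # the log-uniform distribution over [0, vocab_size).
        u = np.random.uniform(size=num_samples)
        samples = (np.exp(u * self.log_range) - 1).astype(np.int64)
        return np.clip(samples, 0, self.vocab_size - 1)
```

Sampled softmax then scores only the true word plus these sampled negatives, subtracting log P(k) from each logit so the truncated softmax remains a consistent estimator of the full one.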
## GPU Hardware Requirement
| Type                | LM Memory Size | GPU                            |
| ------------------- | -------------- | ------------------------------ |
| w/o tied weights    | ~9 GB          | Nvidia 1080 Ti, Nvidia Titan X |
| w/ tied weights [6] | ~7 GB          | Nvidia 1070 or higher          |
## Hyper-Parameters [3]
| Parameter                 | Value   |
| ------------------------- | ------- |
| # Epochs                  | 5       |
| Training Batch Size       | 128     |
| Evaluation Batch Size     | 1       |
| BPTT                      | 20      |
| Embedding Size            | 256     |
| Hidden Size               | 2048    |
| Projection Size           | 256     |
| Tied Embedding + Softmax  | False   |
| # Layers                  | 1       |
| Optimizer                 | AdaGrad |
| Learning Rate             | 0.10    |
| Gradient Clipping         | 1.00    |
| Dropout                   | 0.01    |
| Weight-Decay (L2 Penalty) | 1e-6    |
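As a rough illustration of how this table maps onto PyTorch, the sketch below wires the sizes into `nn.LSTM` (whose `proj_size` argument, available since PyTorch 1.8, implements the 256-unit projection) and AdaGrad. It is a minimal sketch, not the repo's actual model class; the vocabulary size and the plain `Linear` decoder (which the sampled softmax replaces during training) are assumptions.

```python
import torch
import torch.nn as nn

class GBWModel(nn.Module):
    """Minimal sketch mirroring the hyper-parameter table above."""

    def __init__(self, vocab_size, embed_size=256, hidden_size=2048,
                 proj_size=256, dropout=0.01):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # proj_size projects the 2048-unit hidden state down to 256 units.
        self.rnn = nn.LSTM(embed_size, hidden_size, num_layers=1,
                           proj_size=proj_size, batch_first=True)
        self.drop = nn.Dropout(dropout)
        # Untied decoder (Tied Embedding + Softmax = False); in training,
        # a sampled softmax would replace this full vocabulary projection.
        self.decoder = nn.Linear(proj_size, vocab_size)

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(self.drop(self.embed(x)), hidden)
        return self.decoder(self.drop(out)), hidden

model = GBWModel(vocab_size=793471)  # standard GBW vocabulary size (assumed)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.10, weight_decay=1e-6)

# Per training step, clip gradients to the table's threshold before stepping:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.00)
```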
## Setup - Torch Data Format
1. Download the Google Billion Word Dataset for Torch - Link
2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file (a sketch of the index this step produces follows this list)
3. Install the Cython framework and build the Log_Uniform Sampler
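The following is a guess at the index that step 2 computes, assuming the (sentence id, word id) tensor layout described below; the actual code in "process_gbw.py" may differ.

```python
import numpy as np

def build_sentence_index(data):
    # data: (#words x 2) array of (sentence id, word id) rows.
    sentence_ids = np.asarray(data)[:, 0]
    # A new sentence starts wherever the sentence id changes.
    starts = np.concatenate(([0], np.flatnonzero(np.diff(sentence_ids)) + 1))
    lengths = np.diff(np.concatenate((starts, [len(sentence_ids)])))
    # (#sentences x 2) matrix of (start position, sentence length).
    return np.stack((starts, lengths), axis=1)
```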
I leverage the GBW data preprocessed for the Torch framework (see Torch GBW [5]). Each data tensor contains all the words in a data partition. The "train_data.sid" file marks the start position and length of each independent sentence. Together, the preprocessing step and the "train_data.sid" file speed up loading the massive training data.
- Data tensors (test_data, valid_data, train_data, train_small, train_tiny): a (#words x 2) matrix, where each row is a (sentence id, word id) pair
- Sentence ID tensor: a (#sentences x 2) matrix, where each row is a (start position, sentence length) pair
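A minimal sketch of how the two tensors can be used together to fetch individual sentences; it assumes the Torch .th7 files have already been loaded as tensors (e.g. via the `torchfile` package) and that start positions are 0-based.

```python
class SentenceCorpus:
    """Illustrative wrapper over the data and sentence-id tensors."""

    def __init__(self, data, sid):
        self.words = data[:, 1]  # column 1 holds the word ids
        self.sid = sid           # rows of (start position, sentence length)

    def __len__(self):
        return len(self.sid)

    def sentence(self, i):
        # Slice sentence i out of the flat word tensor.
        start, length = int(self.sid[i][0]), int(self.sid[i][1])
        return self.words[start:start + length]
```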
## Setup - Original Data Format
1. Download the 1-Billion Word Dataset - Link
The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.
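A sketch of the chunked alternative: stream the original shard files sentence by sentence instead of materializing everything in memory. The shard naming, directory layout, and `<UNK>` handling here are assumptions about the original format.

```python
import os

def stream_sentences(shard_dir, vocab):
    # Iterate over the news.en-xxxxx-of-00100 shard files one line at a
    # time; each line of the original dataset is one sentence.
    for name in sorted(os.listdir(shard_dir)):
        with open(os.path.join(shard_dir, name), encoding='utf-8') as f:
            for line in f:
                # Map out-of-vocabulary tokens to <UNK> (assumed token name).
                yield [vocab.get(w, vocab['<UNK>']) for w in line.split()]
```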
## References
1. Exploring the Limits of Language Modeling - Github
2. Factorization Tricks for LSTM networks - Github
3. Efficient softmax approximation for GPUs - Github
4. Candidate Sampling
5. Torch GBW
6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling