Abstract
Highly regularized LSTMs achieve impressive
results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last
token in the context using the predicted distribution of the next token. This biases the model towards retaining more contextual information, which in turn improves its ability to predict the next token (a sketch of the added loss term follows this abstract). With negligible overhead
in the number of parameters and training time,
our Past Decode Regularization (PDR) method
improves perplexity by up to 1.8 points on the Penn Treebank dataset and by up to 2.3 points on the WikiText-2 dataset over strong regularized baselines that use a single softmax.
With a mixture-of-softmaxes model, we show
gains of up to 1.0 perplexity points on these
datasets. In addition, our method achieves
1.169 bits-per-character on the Penn Treebank
Character dataset for character-level language modeling. Each of these results constitutes an improvement over models without PDR in its respective setting.
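To make the idea concrete, the following is a minimal sketch of a training objective consistent with the description above; the notation ($h$ is implicit, and the past decoder $q_\phi$ and weight $\lambda$ are our illustrative choices, not fixed by the abstract). Given a context $x_{1:t}$, the language model predicts a distribution over the next token $x_{t+1}$; PDR additionally decodes the last context token $x_t$ from that predicted distribution and penalizes its negative log-likelihood:

$$\mathcal{L} \;=\; \sum_t \Big[ -\log p_\theta\big(x_{t+1} \mid x_{1:t}\big) \Big] \;+\; \lambda \sum_t \Big[ -\log q_\phi\big(x_t \mid p_\theta(\cdot \mid x_{1:t})\big) \Big],$$

where $q_\phi$ denotes a small past-decoding head and $\lambda$ controls the strength of the regularizer. A larger $\lambda$ pushes the predicted next-token distribution to retain information about the token just seen, which is the bias towards contextual information described above.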