Abstract
We examine the benefits of visual context in
training neural language models to perform
next-word prediction. A multi-modal neural
architecture is introduced that outperforms its
equivalent trained on language alone with a
2% decrease in perplexity, even when no visual context is available at test time. Fine-tuning
the embeddings of a pre-trained state-of-the-art bidirectional language model (BERT) in
the language modeling framework yields a
3.5% improvement. The advantage for training with visual context when testing without
is robust across different languages (English,
German and Spanish) and different models
(GRU, LSTM, Δ-RNN, as well as those that
use BERT embeddings). Thus, language models perform better when they learn like a baby,
i.e., in a multi-modal environment. This finding is compatible with the theory of situated
cognition: language is inseparable from its
physical context.