Abstract
Automatically generating natural language descriptions
of videos is a fundamental challenge for the computer vision community. Most recent progress on this problem has
been achieved through employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) to encode video content
and Recurrent Neural Networks (RNNs) to decode a sentence. In this paper, we present Long Short-Term Memory
with Transferred Semantic Attributes (LSTM-TSA), a novel
deep architecture that incorporates the transferred semantic
attributes learnt from images and videos into the CNN plus
RNN framework, by training them in an end-to-end manner. The design of LSTM-TSA is highly inspired by the facts
that 1) semantic attributes make a significant contribution to
captioning, and 2) images and videos carry complementary
semantics and thus can reinforce each other for captioning.
To boost video captioning, we propose a novel transfer unit to model the mutually correlated attributes learnt from
images and videos. Extensive experiments are conducted
on three public datasets, i.e., MSVD, M-VAD and MPII-MD. Our proposed LSTM-TSA achieves the best
published performance to date in sentence generation on MSVD:
52.8% and 74.0% in terms of BLEU@4 and CIDEr-D, respectively. Superior results are also reported on M-VAD and MPII-MD
when compared to state-of-the-art methods.