Abstract
In this paper, we address the problem of describing the visual content
of a video sequence with natural language.
Unlike previous video captioning work, which mainly exploits
cues from the video content to generate a language description, we propose a reconstruction network (RecNet) with
a novel encoder-decoder-reconstructor architecture, which
leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder uses the forward flow
to produce the sentence description based on the encoded
video semantic features. Two types of reconstructors are
customized to employ the backward flow and reproduce the
video features based on the hidden state sequence generated by the decoder. The generation loss yielded by the
encoder-decoder and the reconstruction loss introduced by
the reconstructor are jointly drawn into training the proposed RecNet in an end-to-end fashion. Experimental results on benchmark datasets demonstrate that the proposed
reconstructor boosts the encoder-decoder models and
leads to significant gains in video captioning accuracy.
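
As a rough sketch of the joint training described above (the symbols and the trade-off weight $\lambda$ are illustrative assumptions, not notation fixed by this abstract), the objective can be written as
$$
\mathcal{L}(\theta, \theta_{\mathrm{rec}}) \;=\; \mathcal{L}_{\mathrm{gen}}(\theta) \;+\; \lambda\,\mathcal{L}_{\mathrm{rec}}(\theta, \theta_{\mathrm{rec}}),
$$
where $\mathcal{L}_{\mathrm{gen}}$ is the generation loss of the encoder-decoder with parameters $\theta$, $\mathcal{L}_{\mathrm{rec}}$ measures how well the reconstructor with parameters $\theta_{\mathrm{rec}}$ reproduces the video features from the decoder hidden states, and $\lambda$ balances the two terms; minimizing $\mathcal{L}$ trains all components end-to-end.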