Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning
Abstract
Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of the visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying the Short Fourier Transform to CNN features of the whole video.
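To illustrate the temporal encoding step, the following is a minimal sketch (not the paper's exact configuration) of applying a Fourier transform hierarchically over per-frame CNN features: the clip is recursively split into segments, each segment is transformed along the time axis, and a few low-frequency coefficients are kept per segment. The number of levels and retained coefficients are assumed hyper-parameters.

```python
import numpy as np

def hierarchical_fourier_encoding(feats, levels=3, k=4):
    """Sketch: summarize temporal dynamics of per-frame CNN features
    (T x D) with an FFT over the time axis at several hierarchical
    levels (whole clip, halves, quarters, ...), keeping the first k
    frequency coefficients per segment. `levels` and `k` are
    illustrative assumptions, not the paper's exact settings."""
    T, D = feats.shape
    encoding = []
    for level in range(levels):
        for seg in np.array_split(feats, 2 ** level, axis=0):
            # FFT along time; low frequencies capture the segment's
            # coarse temporal dynamics for every feature dimension.
            spec = np.fft.rfft(seg, axis=0)
            coeffs = spec[:k]
            # Pad short segments so every level has a fixed size.
            if coeffs.shape[0] < k:
                pad = np.zeros((k - coeffs.shape[0], D), dtype=spec.dtype)
                coeffs = np.vstack([coeffs, pad])
            encoding.append(np.abs(coeffs).reshape(-1))
    return np.concatenate(encoding)  # fixed-length clip descriptor

# Example: 120 frames of 2048-d CNN features.
clip = np.random.randn(120, 2048).astype(np.float32)
print(hierarchical_fourier_encoding(clip).shape)  # ((1+2+4)*4*2048,)
```

Because the transform is applied over the whole clip and over progressively finer segments, the resulting descriptor has a fixed length regardless of the number of frames.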
It additionally derives high-level semantics from an object detector to enrich the representation with the spatial dynamics of the detected objects.
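A hedged sketch of how detector outputs could be fused into such a representation follows; the input format (per-frame lists of class id, confidence, and box center) and the displacement-based motion proxy are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

NUM_CLASSES = 80  # e.g. COCO classes; an illustrative choice

def semantic_attributes(detections_per_frame, num_classes=NUM_CLASSES):
    """Sketch: per-class presence (max confidence over the clip) plus
    a spatial-dynamics proxy (accumulated displacement of each class's
    box center across frames), concatenated into one vector."""
    presence = np.zeros(num_classes)
    motion = np.zeros(num_classes)
    last_center = {}
    for frame in detections_per_frame:
        for cls, conf, center in frame:
            presence[cls] = max(presence[cls], conf)
            if cls in last_center:
                # Accumulate how far this object class has moved.
                motion[cls] += np.linalg.norm(
                    np.asarray(center) - np.asarray(last_center[cls]))
            last_center[cls] = center
    return np.concatenate([presence, motion])  # 2 * num_classes dims

# Example: two frames with a "person" (class 0) moving to the right.
frames = [[(0, 0.9, (0.30, 0.5))],
          [(0, 0.8, (0.45, 0.5))]]
print(semantic_attributes(frames).shape)  # (160,)
```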
The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state-of-the-art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE-L metrics.
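For concreteness, a minimal PyTorch sketch of such a decoder is given below: two stacked GRU layers initialized from the projected visual encoding. Layer sizes, vocabulary size, and the conditioning scheme are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Sketch of a two-GRU-layer language model. The visual encoding
    is projected to a compact space and used as the initial hidden
    state of both GRU layers; all sizes are assumptions."""
    def __init__(self, vis_dim, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.project = nn.Linear(vis_dim, hidden_dim)  # compact visual space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=2,
                          batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual, tokens):
        # Initialize both GRU layers from the projected visual encoding.
        h0 = torch.tanh(self.project(visual))   # (B, H)
        h0 = h0.unsqueeze(0).repeat(2, 1, 1)    # (layers, B, H)
        emb = self.embed(tokens)                # (B, L, E)
        out, _ = self.gru(emb, h0)
        return self.out(out)                    # (B, L, vocab) logits

# Example: batch of 4 clips, 1024-d visual encodings, 12 caption tokens.
model = CaptionDecoder(vis_dim=1024, vocab_size=9000)
logits = model(torch.randn(4, 1024), torch.randint(0, 9000, (4, 12)))
print(logits.shape)  # torch.Size([4, 12, 9000])
```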