Abstract
Automatically describing open-domain videos with natural language is attracting increasing interest in the field of artificial intelligence. Most existing methods simply borrow ideas from image captioning and obtain a compact video representation from an ensemble of global image features before feeding it to an RNN decoder that outputs a sentence of variable length. However, given only a global video representation, it is not only arduous for the generator to focus on specific salient objects at different times, but also formidable to capture fine-grained motion information and the relations between moving instances required for more subtle linguistic descriptions. In this paper, we propose a Trajectory Structured Attentional Encoder-Decoder (TSA-ED) neural network framework for more elaborate video captioning, which integrates local spatial-temporal representations at the trajectory level through a structured attention mechanism. Our method builds on an LSTM-based encoder-decoder framework and incorporates an attention modeling scheme to adaptively learn the correlation between sentence structure and the moving objects in videos, consequently generating more accurate and detailed descriptions in the decoding stage. Experimental results demonstrate that the trajectory-cluster-based feature representation and structured attention mechanism efficiently capture local motion information in the video, helping to generate finer-grained video descriptions and achieving state-of-the-art performance on the well-known Charades and MSVD datasets.
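To make the attended decoding idea concrete, the following is a minimal sketch, not the authors' released implementation, of how an LSTM decoder could attend over trajectory-level features at each word-generation step. The module name, the per-cluster feature tensor `traj_feats`, and the single-layer additive attention scorer are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryAttentionDecoder(nn.Module):
    """Sketch of an LSTM decoder attending over trajectory-cluster features."""

    def __init__(self, traj_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The decoder consumes the previous word embedding plus an attended
        # trajectory context vector at every step.
        self.lstm = nn.LSTMCell(embed_dim + traj_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim + traj_dim, 1)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, traj_feats, captions):
        # traj_feats: (batch, n_traj, traj_dim); captions: (batch, T) word indices
        B, N, _ = traj_feats.shape
        h = traj_feats.new_zeros(B, self.lstm.hidden_size)
        c = traj_feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(captions.size(1)):
            # Score each trajectory cluster against the current decoder state.
            h_exp = h.unsqueeze(1).expand(B, N, -1)
            scores = self.att_score(torch.cat([h_exp, traj_feats], dim=-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=1)                     # (B, N) attention weights
            context = (alpha.unsqueeze(-1) * traj_feats).sum(1)  # (B, traj_dim)
            x = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, T, vocab_size)
```

In this sketch the attention weights are recomputed at every decoding step, so different words can be grounded in different moving instances, which is the intuition behind trajectory-level attention described above.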