Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

2019-09-16
Abstract: Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state-of-the-art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE-L metrics.
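The hierarchical Short Fourier Transform encoding mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the power-of-two segmentation scheme, and the parameter choices (`levels`, number of retained coefficients `k`) are all assumptions made for the example.

```python
import numpy as np

def hierarchical_fourier_encoding(features, levels=3, k=4):
    """Sketch: hierarchical short Fourier encoding of per-frame CNN features.

    features : (T, D) array, one D-dimensional CNN feature per frame.
    At level L the sequence is split into 2**L equal temporal segments;
    each feature dimension's trajectory within a segment is Fourier
    transformed and the magnitudes of its first k coefficients are kept,
    yielding a fixed-length descriptor regardless of video length T.
    """
    T, D = features.shape
    encoded = []
    for level in range(levels):
        n_seg = 2 ** level
        bounds = np.linspace(0, T, n_seg + 1, dtype=int)
        for s, e in zip(bounds[:-1], bounds[1:]):
            seg = features[s:e]                 # (t, D) temporal segment
            spec = np.fft.rfft(seg, axis=0)     # FFT along time, per dimension
            mag = np.abs(spec[:k])              # keep first k magnitudes
            if mag.shape[0] < k:                # pad segments shorter than k
                mag = np.pad(mag, ((0, k - mag.shape[0]), (0, 0)))
            encoded.append(mag.reshape(-1))
    return np.concatenate(encoded)              # fixed-length video descriptor

# Usage: 30 frames of 2048-d features -> one fixed-length vector
video = np.random.randn(30, 2048)
code = hierarchical_fourier_encoding(video)
```

With `levels=3` the video is encoded at 1 + 2 + 4 = 7 segments, each contributing `k * D` magnitudes, so any video maps to the same descriptor size; this is what makes Fourier-based temporal pooling convenient as input to a fixed-size language model.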

