资源论文Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning

Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning

2019-10-16 | |  96 |   103 |   0
Abstract It is often laborious and costly to manually annotate videos for training high-quality video recognition models, so there has been some work and interest in exploring alternative, cheap, and yet often noisy and indirect training signals for learning the video representations. However, these signals are still coarse, supplying supervision at the whole video frame level, and subtle, sometimes enforcing the learning agent to solve problems that are even hard for humans. In this paper, we instead explore geometry, a grand new type of auxiliary supervision for the self-supervised learning of video representations. In particular, we extract pixel-wise geometry information as flow fields and disparity maps from synthetic imagery and real 3D movies, respectively. Although the geometry and highlevel semantics are seemingly distant topics, surprisingly, we find that the convolutional neural networks pre-trained by the geometry cues can be effectively adapted to semantic video understanding tasks. In addition, we also find that a progressive training strategy can foster a better neural network for the video recognition task than blindly pooling the distinct sources of geometry cues together. Extensive results on video dynamic scene recognition and action recognition tasks show that our geometry guided networks significantly outperform the competing methods that are trained with other types of labeling-free supervision signals

上一篇:Gaze Prediction in Dynamic 360? Immersive Videos

下一篇:Going from Image to Video Saliency: Augmenting Image Salience with Dynamic Attentional Push

用户评价
全部评价

热门资源

  • Regularizing RNNs...

    Recently, caption generation with an encoder-de...

  • The Variational S...

    Unlike traditional images which do not offer in...

  • Deep Cross-media ...

    Cross-media retrieval is a research hotspot in ...

  • Visual Reinforcem...

    For an autonomous agent to fulfill a wide range...

  • Joint Pose and Ex...

    Facial expression recognition (FER) is a challe...