Convolutional Two-Stream Network Fusion for Video Action Recognition


2019-12-26

Abstract

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
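The conv-layer fusion described in finding (i) can be illustrated with a minimal sketch: the spatial and temporal feature maps are stacked along the channel dimension, and a 1x1 convolution (a learned per-pixel linear mix of the stacked channels) produces the fused map. The function name, toy shapes, and weights below are illustrative assumptions, not the paper's implementation.

```python
def conv1x1_fuse(spatial, temporal, weights):
    """Concatenate two towers along channels, then mix with a 1x1 conv.

    spatial, temporal: lists of channels, each a 2D grid (list of rows).
    weights: one row per output channel, of length 2 * num_input_channels;
    each output channel is a linear combination, at every spatial position,
    of the stacked spatial+temporal channels -- i.e. a 1x1 convolution.
    """
    stacked = spatial + temporal                 # channel concatenation
    height, width = len(stacked[0]), len(stacked[0][0])
    fused = []
    for w_row in weights:                        # one output channel per row
        channel = [[sum(w * stacked[c][i][j] for c, w in enumerate(w_row))
                    for j in range(width)]
                   for i in range(height)]
        fused.append(channel)
    return fused

# Toy 1-channel 2x2 towers; averaging weights [0.5, 0.5] make the 1x1
# conv a simple mean of the two streams at every position.
spatial_feat = [[[1.0, 2.0], [3.0, 4.0]]]
temporal_feat = [[[5.0, 6.0], [7.0, 8.0]]]
fused = conv1x1_fuse(spatial_feat, temporal_feat, [[0.5, 0.5]])
print(fused)  # [[[3.0, 4.0], [5.0, 6.0]]]
```

In practice the 1x1 filter weights are learned jointly with the rest of the network, so the fusion layer can discover correspondences between appearance and motion channels rather than merely averaging them.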

