Convolutional Two-Stream Network Fusion for Video Action Recognition


2019-12-26


Abstract

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
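To make the idea of fusing two towers at a convolution layer more concrete, the following is a minimal pure-Python sketch and not the paper's implementation. It assumes both towers produce feature maps of the same shape, stored as nested lists indexed [channel][row][col]. The function names (sum_fusion, concat_fusion, conv_fusion, temporal_max_pool), the choice of a 1x1 convolution as the learned mixing step, and the toy values are all illustrative assumptions. The pooling helper is a simplification that pools over time only, whereas the abstract describes pooling over spatiotemporal neighbourhoods.

```python
# Hypothetical sketch: feature maps are nested lists indexed [channel][row][col].

def sum_fusion(spatial, temporal):
    """Add the two towers' maps element-wise at matching channels and positions."""
    return [
        [[a + b for a, b in zip(row_s, row_t)] for row_s, row_t in zip(ch_s, ch_t)]
        for ch_s, ch_t in zip(spatial, temporal)
    ]


def concat_fusion(spatial, temporal):
    """Stack the two towers along the channel axis (C + C channels)."""
    return spatial + temporal


def conv_fusion(spatial, temporal, weights, bias):
    """Concatenate channels, then mix them with a 1x1 convolution.

    weights[o][i] maps stacked input channel i to output channel o.
    """
    stacked = concat_fusion(spatial, temporal)
    height, width = len(stacked[0]), len(stacked[0][0])
    return [
        [
            [
                sum(w * stacked[i][y][x] for i, w in enumerate(w_row)) + b
                for x in range(width)
            ]
            for y in range(height)
        ]
        for w_row, b in zip(weights, bias)
    ]


def temporal_max_pool(frames):
    """Max-pool fused maps indexed [time][channel][row][col] over the time axis only."""
    channels = len(frames[0])
    height, width = len(frames[0][0]), len(frames[0][0][0])
    return [
        [[max(f[c][y][x] for f in frames) for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# Toy inputs (assumed values): one channel, 2x2 spatial grid per tower.
spatial = [[[1.0, 2.0], [3.0, 4.0]]]
temporal = [[[0.5, 0.5], [0.5, 0.5]]]
fused = conv_fusion(spatial, temporal, weights=[[1.0, 2.0]], bias=[0.0])
```

After a fusion step of this kind, a single stack of later layers can process the fused map, which is one way to read the parameter saving reported in finding (i).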



