Abstract
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of
the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate
different strategies for pooling across space and time and
combining signals from the different streams. We find that:
(i) it is important to pool jointly across space and time,
but (ii) appearance and motion streams are best aggregated
into their own separate representations. Finally, we show
that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
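To make the idea of learnable spatio-temporal aggregation concrete, the following is a minimal sketch (not the authors' implementation) of a NetVLAD-style pooling layer in PyTorch that soft-assigns local convolutional features to learnable cluster centers and sums the residuals jointly over space and time; the class name, its parameters, and the tensor layout are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of VLAD-style spatio-temporal aggregation (assumed layout,
# not the authors' code): conv features from all frames of one video are
# pooled jointly over space and time into a single fixed-size descriptor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalVLAD(nn.Module):
    """Soft-assigns local features to K learnable centers and aggregates
    residuals over the whole spatio-temporal extent of the video."""
    def __init__(self, feat_dim: int, num_clusters: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))
        # 1x1 convolution produces per-location soft-assignment logits
        self.assign = nn.Conv2d(feat_dim, num_clusters, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, D, H, W) convolutional features for T sampled frames
        T, D, H, W = x.shape
        soft = F.softmax(self.assign(x), dim=1)        # (T, K, H, W)
        feats = x.view(T, D, -1)                       # (T, D, H*W)
        soft = soft.view(T, -1, H * W)                 # (T, K, H*W)
        # Weighted sum of features minus weighted sum of centers gives the
        # residual aggregate for each cluster, summed over space and time.
        vlad = torch.einsum('tkn,tdn->kd', soft, feats) \
               - soft.sum(dim=(0, 2)).unsqueeze(1) * self.centers  # (K, D)
        vlad = F.normalize(vlad, dim=1)                # intra-normalization
        return F.normalize(vlad.flatten(), dim=0)      # final video descriptor
```

In line with finding (ii) above, one would instantiate such a layer separately for the appearance (RGB) and motion (optical-flow) streams and concatenate the two resulting descriptors before the classifier, rather than pooling both streams into a single shared representation.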