Abstract
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of
the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate
different strategies for pooling across space and time and
combining signals from the different streams. We find that:
(i) it is important to pool jointly across space and time,
but (ii) appearance and motion streams are best aggregated
into their own separate representations. Finally, we show
that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
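To make the idea of learnable spatio-temporal aggregation concrete, the following is a minimal sketch (not the authors' implementation) of a NetVLAD-style pooling layer in PyTorch that soft-assigns local convolutional features to learnable cluster centers and sums the residuals jointly over space and time; the class name, its parameters, and the tensor layout are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of VLAD-style spatio-temporal aggregation (assumed layout,
# not the authors' code): conv features from all frames of one video are
# pooled jointly over space and time into a single fixed-size descriptor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalVLAD(nn.Module):
    """Soft-assigns local features to K learnable centers and aggregates
    residuals over the whole spatio-temporal extent of the video."""
    def __init__(self, feat_dim: int, num_clusters: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))
        # 1x1 convolution produces per-location soft-assignment logits
        self.assign = nn.Conv2d(feat_dim, num_clusters, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, D, H, W) convolutional features for T sampled frames
        T, D, H, W = x.shape
        soft = F.softmax(self.assign(x), dim=1)        # (T, K, H, W)
        feats = x.view(T, D, -1)                       # (T, D, H*W)
        soft = soft.view(T, -1, H * W)                 # (T, K, H*W)
        # Weighted sum of features minus weighted sum of centers gives the
        # residual aggregate for each cluster, summed over space and time.
        vlad = torch.einsum('tkn,tdn->kd', soft, feats) \
               - soft.sum(dim=(0, 2)).unsqueeze(1) * self.centers  # (K, D)
        vlad = F.normalize(vlad, dim=1)                # intra-normalization
        return F.normalize(vlad.flatten(), dim=0)      # final video descriptor
```

In line with finding (ii) above, one would instantiate such a layer separately for the appearance (RGB) and motion (optical-flow) streams and concatenate the two resulting descriptors before the classifier, rather than pooling both streams into a single shared representation.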