Abstract
This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features. Our model combines the appearance and motion pathways of a two-stream architecture through motion gating and is trained end-to-end. We theoretically motivate multiplicative gating functions for residual networks and empirically study their effect on classification accuracy. To capture long-term dependencies, we inject identity mapping kernels for learning temporal relationships. Our architecture is fully convolutional in spacetime and can evaluate a video in a single forward pass.
Empirical investigation reveals that our model produces state-of-the-art results on two standard action recognition datasets.