Abstract
Spatiotemporal feature learning in videos is a fundamental problem in computer vision. This paper presents a
new architecture, termed the Appearance-and-Relation Network (ARTNet), to learn video representations in an end-to-end manner. ARTNets are constructed by stacking multiple generic building blocks, called SMART, whose goal
is to simultaneously model appearance and relation from
RGB input in a separate and explicit manner. Specifically,
SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a
relation branch for temporal modeling. The appearance
branch is implemented based on the linear combination of
pixels or filter responses in each frame, while the relation
branch is designed based on the multiplicative interactions
between pixels or filter responses across multiple frames.
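To make the two branches concrete, the following is a minimal PyTorch sketch of one plausible SMART block. The class name SMARTBlock, the even channel split between branches, and the 1x1 convolution standing in for cross-channel pooling are illustrative assumptions, not the authors' released implementation; the relation branch approximates multiplicative interactions by squaring spatiotemporal filter responses.

```python
import torch
import torch.nn as nn

class SMARTBlock(nn.Module):
    """Illustrative sketch of a SMART block (assumed interface, not official code).

    Appearance branch: linear combination of filter responses within each
    frame (spatial-only convolution). Relation branch: spatiotemporal
    convolution whose responses are squared, yielding multiplicative
    (quadratic) terms between pixels across the frames in the kernel.
    """
    def __init__(self, in_ch, out_ch, t_kernel=3):  # out_ch assumed even
        super().__init__()
        # Appearance branch: temporal kernel size 1, i.e. per-frame 2D conv.
        self.appearance = nn.Conv3d(in_ch, out_ch // 2,
                                    kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Relation branch: full 3D conv spanning t_kernel frames.
        self.relation = nn.Conv3d(in_ch, out_ch,
                                  kernel_size=(t_kernel, 3, 3),
                                  padding=(t_kernel // 2, 1, 1))
        # 1x1 conv as a learned stand-in for cross-channel pooling,
        # reducing the squared relation responses to out_ch // 2 channels.
        self.cross_channel = nn.Conv3d(out_ch, out_ch // 2, kernel_size=1)
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                # x: (N, C, T, H, W)
        app = self.appearance(x)         # linear, per-frame responses
        rel = self.cross_channel(self.relation(x) ** 2)  # multiplicative terms
        return self.relu(self.bn(torch.cat([app, rel], dim=1)))

# Usage: an 8-frame RGB clip at 56x56 resolution.
y = SMARTBlock(3, 64)(torch.randn(2, 3, 8, 56, 56))  # -> (2, 64, 8, 56, 56)
```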
We perform experiments on three action recognition benchmarks: Kinetics, UCF101, and HMDB51, demonstrating
that SMART blocks yield a clear improvement over 3D
convolutions for spatiotemporal feature learning. Under
the same training setting, ARTNets achieve performance superior to existing state-of-the-art methods on all three datasets.