Abstract
This paper makes three contributions that together establish a
new state of the art in dynamic scene recognition. First,
we present a novel ConvNet architecture based on temporal
residual units that is fully convolutional in spacetime. Our
model augments spatial ResNets with convolutions across
time to hierarchically add temporal residuals as the depth
of the network increases. Second, existing approaches to
video-based recognition are categorized and a baseline of
seven previously top-performing algorithms is selected for
comparative evaluation on dynamic scenes. Third, we introduce a new and challenging video database of dynamic
scenes that more than doubles the size of those previously
available. This dataset is explicitly split into two subsets
of equal size that contain videos with and without camera
motion to allow for systematic study of how this variable interacts with the defining dynamics of the scene per se. Our
evaluations verify the particular strengths and weaknesses
of the baseline algorithms with respect to various scene
classes and camera motion parameters. Finally, our temporal ResNet boosts recognition performance and establishes
a new state of the art for dynamic scene recognition, as well
as for the complementary task of action recognition.
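The core idea behind the temporal residual units described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: it assumes a single shared temporal filter per unit and omits channel mixing, nonlinearities, and learned weights. It only shows the structural point that a convolution across the time axis is added back onto the identity (residual) path.

```python
import numpy as np

def temporal_residual_unit(x, w):
    """Hypothetical sketch of a temporal residual unit.

    x : feature maps of shape (T, C, H, W), T = number of frames.
    w : 1-D temporal filter taps of shape (k,), k odd,
        shared across all channels and spatial positions (a
        simplifying assumption, not the paper's design).

    Returns x plus a convolution of x across the time axis,
    i.e. an identity shortcut with a temporal residual term.
    """
    T = x.shape[0]
    k = len(w)
    pad = k // 2
    # pad only the temporal axis, replicating the edge frames
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)), mode="edge")
    # temporal convolution: each output frame is a weighted
    # sum of its k temporal neighbours
    out = np.zeros_like(x)
    for t in range(T):
        for i in range(k):
            out[t] += w[i] * xp[t + i]
    # residual connection: identity path + temporal term
    return x + out
```

Stacking such units deepens the temporal receptive field hierarchically: each additional unit lets a frame's features aggregate information from progressively more distant frames, which matches the abstract's claim that temporal residuals are added as network depth increases.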