What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets
Abstract
The ability to capture temporal information has been
critical to the development of video understanding models.
While there have been numerous attempts at modeling motion
in videos, an explicit analysis of the effect of temporal
information on video understanding is still missing. In
this work, we aim to bridge this gap and ask the following
question: How important is motion in a video for
recognizing the action? To this end, we propose two novel
frameworks: (i) a class-agnostic temporal generator and (ii)
a motion-invariant frame selector, which reduce or remove motion
for an ablation analysis without introducing other artifacts.
Doing so isolates the analysis of motion from other aspects of the
video. The proposed frameworks provide a much tighter estimate
of the effect of motion (from 25% to 6% on UCF101
and from 15% to 5% on Kinetics) than the baselines in our
analysis. Our analysis provides critical insights about existing
models such as C3D, including how C3D could be made to achieve
comparable results with a sparser set of frames.