Abstract
This paper introduces a video dataset of spatiotemporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels, with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations, with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over the 15-minute video clips; (4) the temporal linking of people across consecutive segments; and (5) the use of movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips.
AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon current state-of-the-art methods and demonstrates better performance on the JHMDB and UCF101-24 categories. While our approach sets a new state of the art on existing datasets, its overall performance on AVA is low, at 15.8% mAP, underscoring the need for new approaches to video understanding.