Abstract
With the advent of drones, aerial video analysis is becoming increasingly important; yet, it has received scant attention in the literature. This paper addresses a new problem of parsing low-resolution aerial videos of large spatial areas, in terms of 1) grouping, 2) recognizing events, and 3) assigning roles to people engaged in events. We propose a novel framework aimed at conducting joint inference of the above tasks, as reasoning about each in isolation typically fails in our setting. Given noisy tracklets of people and detections of large objects and scene surfaces (e.g., building, grass), we use a spatiotemporal AND-OR graph to drive our joint inference, using Markov Chain Monte Carlo and dynamic programming. We also introduce a new formalism of spatiotemporal templates characterizing latent sub-events. For evaluation, we have collected and released a new aerial video dataset using a hex-rotor flying over picnic areas rich with group events. Our results demonstrate that we successfully address the above inference tasks under challenging conditions.