Abstract
This paper addresses the problem of recognizing and localizing coherent activities of a group of people, called collective activities, in video. Related work has argued the benefits of capturing long-range and higher-order dependencies among video features for robust recognition. To this end, we formulate a new deep model, called Hierarchical Random Field (HiRF). HiRF models only hierarchical dependencies between model variables. This effectively amounts to modeling higher-order temporal dependencies of video features. We specify an efficient inference of HiRF that in each step iterates linear programming for estimating latent variables. Learning of HiRF parameters is specified within the max-margin framework. Our evaluation on the benchmark New Collective Activity and Collective Activity datasets demonstrates that HiRF yields superior recognition and localization as compared to the state of the art.