Abstract
Multi-person event recognition is a challenging task, often with many people active in the scene but only a small subset contributing to an actual event. In this paper, we propose a model which learns to detect events in such videos while automatically “attending” to the people responsible for the event. Our model does not use explicit annotations regarding who or where those people are during training and testing. In particular, we track people in videos and use a recurrent neural network (RNN) to represent the track features. We learn time-varying attention weights to combine these features at each time instant. The attended features are then processed using another RNN for event detection/classification. Since most video datasets with multiple people are restricted to a small number of videos, we also collected a new basketball dataset comprising 257 basketball games with 14K event annotations corresponding to 11 event classes. Our model outperforms state-of-the-art methods for both event classification and detection on this new dataset. Additionally, we show that the attention mechanism is able to consistently localize the relevant players.
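To make the attention step concrete, the following is a minimal sketch of combining per-person track features with time-varying attention weights. It is an illustration under stated assumptions, not the model itself: the function name `attend`, the linear scoring vector `w`, and the array shapes are all hypothetical, and the RNNs that would produce the track features and consume the attended features are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attend(track_features, w):
    """Combine per-person features with attention weights at each time step.

    track_features: array of shape (T, N, D), i.e. T time steps, N tracked
                    people, D-dimensional feature per person (assumed layout).
    w:              array of shape (D,), a hypothetical linear scoring vector;
                    the scoring function is an assumption made for this sketch.
    Returns (weights, pooled) with shapes (T, N) and (T, D).
    """
    scores = track_features @ w          # (T, N): one score per person per step
    weights = softmax(scores, axis=1)    # normalize over people at each step
    pooled = np.einsum("tn,tnd->td", weights, track_features)
    return weights, pooled

# Toy usage with random features standing in for RNN track representations.
rng = np.random.default_rng(0)
T, N, D = 4, 5, 8
feats = rng.standard_normal((T, N, D))
w = rng.standard_normal(D)
weights, pooled = attend(feats, w)
```

In this sketch the weights at each time step sum to one across people, so `pooled` is a convex combination of the track features; a sequence model for event classification would then take `pooled` as its input.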