Abstract
This paper focuses on multi-person action forecasting
in videos. More precisely, given a history of H previous
frames, the goal is to detect actors and to predict their future actions for the next T frames. Our approach jointly
models temporal and spatial interactions among different
actors by constructing a recurrent graph, using actor proposals obtained with Faster R-CNN as nodes. Our method
learns to select a subset of discriminative relations without requiring explicit supervision, thus enabling us to tackle
challenging visual data. We refer to our model as Discriminative Relational Recurrent Network (DR2N). Evaluation of
action prediction on AVA demonstrates the effectiveness of
our proposed method compared to simpler baselines. Furthermore, we significantly improve performance on the task
of early action classification on J-HMDB, raising the previous
state of the art from 48% to 60%.