Abstract
Deep Recurrent Neural Network architectures, though remarkably capable at modeling sequences, lack an intuitive high-level spatio-temporal structure. Yet many problems in computer vision inherently have an underlying high-level structure and can benefit from it. Spatio-temporal graphs are a popular tool for imposing such high-level intuitions in the formulation of real-world problems. In this paper, we propose an approach for combining the power of high-level spatio-temporal graphs with the sequence-learning success of Recurrent Neural Networks (RNNs). We develop a scalable method for casting an arbitrary spatio-temporal graph as a rich RNN mixture that is feedforward, fully differentiable, and jointly trainable. The proposed method is generic and principled, as it can be used to transform any spatio-temporal graph by following a certain set of well-defined steps. Evaluations of the proposed approach on a diverse set of problems, ranging from modeling human motion to object interactions, show improvement over the state of the art by a large margin. We expect this method to empower new approaches to problem formulation through high-level spatio-temporal graphs and Recurrent Neural Networks.