Abstract. How do humans recognize the action “opening a book”? We
argue that there are two important cues: modeling temporal shape dynamics and modeling functional relationships between humans and objects.
In this paper, we propose to represent videos as space-time region graphs
which capture these two important cues. Our graph nodes are defined by
the object region proposals from different frames in a long-range video.
These nodes are connected by two types of relations: (i) similarity relations capturing the long-range dependencies between correlated objects
and (ii) spatial-temporal relations capturing the interactions between
nearby objects. We perform reasoning on this graph representation via
Graph Convolutional Networks. We achieve state-of-the-art results on the
Charades and Something-Something datasets. On Charades in particular,
we obtain a 4.4% gain when our model is applied in complex environments.
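
As a rough illustration of the graph reasoning described above, the following minimal sketch builds a similarity graph over region features and applies one graph convolution. It is not the implementation behind the reported results: the function names, the softmax normalization of the affinities, the random features and weights, the tensor sizes, and the final mean pooling are all assumptions made for illustration. Only the similarity relation is shown; the spatial-temporal relation would contribute a second adjacency matrix built from region overlap between nearby frames.

```python
import numpy as np

def similarity_adjacency(X):
    """Row-normalized affinities between all region features.

    Assumption: affinities are dot products passed through a softmax;
    the abstract does not specify the exact form.
    """
    scores = X @ X.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def gcn_layer(A, X, W):
    """One graph convolution: aggregate neighbors with A, project with W, apply ReLU."""
    return np.maximum(A @ X @ W, 0.0)

# Hypothetical sizes: 6 region proposals pooled over all frames, 8-dim features.
rng = np.random.default_rng(0)
num_nodes, dim = 6, 8
X = rng.standard_normal((num_nodes, dim))   # stand-in for region features
W = rng.standard_normal((dim, dim)) * 0.1   # stand-in for learned weights

A = similarity_adjacency(X)      # (num_nodes, num_nodes), each row sums to 1
Z = gcn_layer(A, X, W)           # updated node features, (num_nodes, dim)
video_feature = Z.mean(axis=0)   # assumed pooling into one video-level vector
```

Because every row of the adjacency matrix sums to one, each updated node feature is a weighted average of all region features, so correlated objects in distant frames can influence one another in a single layer.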