Abstract. Modeling structured relationships between people in a scene is an important step toward visual understanding. We present a Hierarchical Relational
Network that computes relational representations of people, given graph structures describing potential interactions. Each relational layer is fed individual person representations and a potential relationship graph. Relational representations
of each person are created based on their connections in this particular graph. We
demonstrate the efficacy of this model by applying it in both supervised and unsupervised learning paradigms. First, given a video sequence of people performing a
collective activity, the relational scene representation is utilized for multi-person
activity recognition. Second, we propose a Relational Autoencoder model for unsupervised learning of features for action and scene retrieval. Finally, a Denoising
Autoencoder variant is presented to infer missing people in the scene from their
context. Empirical results demonstrate that this approach learns relational feature representations that effectively discriminate person- and group-activity
classes.
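The abstract describes a relational layer that takes per-person features and a potential-relationship graph and produces, for each person, a representation aggregated from their graph neighbors. A minimal sketch of that idea is below; the function name, the shared-weight parameterization (a single ReLU-mapped linear layer on concatenated pair features, averaged over neighbors), and the toy data are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def relational_layer(person_feats, adjacency, W, b):
    """Illustrative relational layer (sketch, not the paper's exact model).

    Each person's output representation is built from their connections:
    a shared linear map is applied to every (self, neighbor) feature pair,
    passed through a ReLU, and averaged over that person's neighbors.

    person_feats: (N, D) array, one feature vector per person
    adjacency:    (N, N) 0/1 matrix, the potential-relationship graph
    W:            (2D, D_out) shared weights; b: (D_out,) bias
    """
    n = person_feats.shape[0]
    out = np.zeros((n, W.shape[1]))
    for i in range(n):
        neighbors = np.nonzero(adjacency[i])[0]
        if len(neighbors) == 0:
            continue  # isolated person: keep a zero relational vector
        pair_msgs = [
            np.maximum(0.0, np.concatenate([person_feats[i], person_feats[j]]) @ W + b)
            for j in neighbors
        ]
        out[i] = np.mean(pair_msgs, axis=0)
    return out

# Toy example: 3 people connected in a chain 0-1-2.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4))
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
W = rng.standard_normal((8, 5))
b = np.zeros(5)
rel = relational_layer(feats, adj, W, b)
print(rel.shape)  # (3, 5)
```

Stacking several such layers, each with its own graph, would give the hierarchical structure the abstract alludes to: earlier layers can use denser local graphs, later layers sparser scene-level ones.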