Abstract
This paper presents a method to predict the future movements (location and gaze direction) of basketball players as
a whole from their first person videos. The predicted behaviors reflect each individual's physical space, which affords
the next actions, while conforming to social behaviors by engaging in joint attention. Our key innovation is to use the
3D reconstruction of multiple first person cameras to automatically annotate each other’s visual semantics of social
configurations.
We leverage two learning signals uniquely embedded in
first person videos. Individually, a first person video records
the visual semantics of the spatial and social layout around a
person, which can be associated with similar past situations.
Collectively, first person videos follow joint attention that
can link the individuals to a group. We learn the egocentric visual semantics of group movements using a Siamese
neural network to retrieve future trajectories. We consolidate the retrieved trajectories from all players by maximizing a measure of social compatibility—the gaze alignment
towards joint attention predicted by their social formation,
where the dynamics of joint attention are learned by a long-term recurrent convolutional network. This allows us to
characterize which social configuration is more plausible
and predict future group trajectories.
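
To make the consolidation step concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes each player already has a small set of retrieved candidates, each reduced to a 2D future location and gaze direction, and it enumerates all combinations to find the one with the highest mean gaze alignment towards a predicted joint attention point. All function names are hypothetical, and `centroid_attention` is a placeholder assumption standing in for the learned joint attention predictor.

```python
import itertools
import numpy as np

def gaze_alignment(locations, gazes, attention):
    """Mean cosine between each player's gaze and the direction to the attention point."""
    to_att = attention - locations
    to_att = to_att / (np.linalg.norm(to_att, axis=1, keepdims=True) + 1e-9)
    unit_gaze = gazes / (np.linalg.norm(gazes, axis=1, keepdims=True) + 1e-9)
    return float(np.mean(np.sum(to_att * unit_gaze, axis=1)))

def centroid_attention(locations):
    # ASSUMPTION: placeholder for the learned joint attention model;
    # it simply returns the centroid of the player locations.
    return locations.mean(axis=0)

def select_group_configuration(candidates, predict_attention=centroid_attention):
    """Pick one candidate per player so that the group gaze alignment is maximal.

    candidates: list over players; each entry is a list of (location, gaze)
    pairs, where location and gaze are length-2 sequences.
    Returns (tuple of chosen candidate indices, alignment score).
    """
    best_score, best_choice = -np.inf, None
    index_ranges = [range(len(c)) for c in candidates]
    # Exhaustive search: only feasible for a few players and candidates.
    for choice in itertools.product(*index_ranges):
        locs = np.array([candidates[p][i][0] for p, i in enumerate(choice)], dtype=float)
        gazes = np.array([candidates[p][i][1] for p, i in enumerate(choice)], dtype=float)
        score = gaze_alignment(locs, gazes, predict_attention(locs))
        if score > best_score:
            best_score, best_choice = score, choice
    return best_choice, best_score

# Two players facing each other; each has one aligned and one misaligned candidate.
candidates = [
    [((0.0, 0.0), (1.0, 0.0)), ((0.0, 0.0), (-1.0, 0.0))],
    [((2.0, 0.0), (-1.0, 0.0)), ((2.0, 0.0), (1.0, 0.0))],
]
choice, score = select_group_configuration(candidates)
```

Exhaustive enumeration grows exponentially with the number of players, so a practical system would need a more efficient search; the sketch only illustrates the objective being maximized.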