Multiple-gaze geometry: Inferring novel 3D locations from gazes observed in monocular video
Abstract. We develop the use of person gaze direction for scene understanding. In particular, we use intersecting gazes to learn 3D locations that people tend to look at, which is analogous to having multiple camera views. The 3D locations that we discover need not be visible to the camera. Conversely, knowing the 3D locations of scene elements that draw visual attention, such as other people in the scene, can help infer gaze direction.
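To make the multiple-view analogy concrete, the following sketch (our illustration, not code from the paper) treats each person's gaze as a 3D ray from the head position and recovers the attended point as the least-squares intersection of those rays, much as a point is triangulated from multiple camera views; the function name and toy data are hypothetical.

```python
import numpy as np

def intersect_gaze_rays(origins, directions):
    """Least-squares intersection of rays (o_i, d_i) with unit directions d_i.

    Minimizing the summed squared perpendicular distance from a point p to
    every ray gives the linear system
    (sum_i (I - d_i d_i^T)) p = sum_i (I - d_i d_i^T) o_i.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        M = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += M
        b += M @ o
    return np.linalg.solve(A, b)

# Two people whose gazes converge on the same (possibly off-camera) point.
heads = [np.array([0.0, 1.6, 0.0]), np.array([2.0, 1.7, 0.0])]
target = np.array([1.0, 1.2, 3.0])
gazes = [target - h for h in heads]
print(intersect_gaze_rays(heads, gazes))  # ~ [1.0, 1.2, 3.0]
```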
We provide a Bayesian generative model for the temporal scene that captures the joint probability of camera parameters, locations of people, their gazes, what they are looking at, and the locations of visual attention. Both the number of people in the scene and the number of additional objects that draw attention are unknown and must be inferred.
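The sketch below gives a schematic of such a joint model under strong simplifying assumptions of ours: a discrete assignment z maps each person to the attention point they look at, and a Gaussian penalty on angular deviation stands in for the paper's gaze likelihood. All distributions and parameters here are illustrative placeholders, not the paper's actual model.

```python
import numpy as np

def log_joint(heads, attention_points, z, observed_gaze, sigma=0.1):
    """log p(observed gazes, z | heads, attention points), up to a constant.

    heads:            (N, 3) head positions
    attention_points: (K, 3) candidate 3D targets of attention
    z:                (N,)  index of the target each person looks at
    observed_gaze:    (N, 3) unit gaze directions estimated from video
    """
    lp = 0.0
    for head, g, k in zip(heads, observed_gaze, z):
        d = attention_points[k] - head        # ideal gaze direction
        d = d / np.linalg.norm(d)
        ang = np.arccos(np.clip(d @ g, -1.0, 1.0))
        lp += -ang**2 / (2.0 * sigma**2)      # placeholder gaze likelihood
    lp += -len(heads) * np.log(len(attention_points))  # uniform prior on z
    return lp
```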
To perform this joint inference we use a probabilistic data association approach that enables principled comparison of model hypotheses. We use MCMC for inference over the discrete correspondence variables, and we approximate the marginalization over the continuous parameters with the Metropolis-Laplace approximation, using Hamiltonian (Hybrid) Monte Carlo for the maximization.
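As a rough sketch of the Metropolis-Laplace idea (our reconstruction from the abstract, with a generic optimizer standing in for the HMC-based maximization), the marginal over the continuous parameters for a fixed correspondence hypothesis can be approximated by a Laplace expansion around the posterior mode:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_marginal(log_post, theta0, eps=1e-4):
    """Approximate the log of the integral of exp(log_post) over theta."""
    theta = minimize(lambda t: -log_post(t), theta0).x   # posterior mode
    d = len(theta)
    H = np.zeros((d, d))                                 # Hessian of -log_post
    E = np.eye(d) * eps
    for i in range(d):
        for j in range(d):
            # central finite difference for the (i, j) second derivative
            H[i, j] = (-log_post(theta + E[i] + E[j]) + log_post(theta + E[i] - E[j])
                       + log_post(theta - E[i] + E[j]) - log_post(theta - E[i] - E[j])) / (4 * eps**2)
    _, logdet = np.linalg.slogdet(H)
    return log_post(theta) + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet

# Sanity check: for a normalized Gaussian the marginal is exactly 1 (log = 0).
lp = lambda t: -0.5 * t @ t - 0.5 * len(t) * np.log(2 * np.pi)
print(laplace_log_marginal(lp, np.zeros(2)))  # ~ 0.0
```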
As existing data sets do not provide the 3D locations of what people are looking at, we contribute a small data set that does. On this data set, we infer what people are looking at with 59% precision, compared with 13% for a baseline approach, and we localize those objects to within about 0.58 m.