Abstract
We present a probabilistic generative model for simultaneously recognizing daily actions and predicting gaze locations in videos recorded from an egocentric camera. We focus on activities requiring eye-hand coordination and model the spatio-temporal relationship between the gaze point, the scene objects, and the action label. Our model captures the fact that the distribution of both visual features and object occurrences in the vicinity of the gaze point is correlated with the verb-object pair describing the action. It explicitly incorporates known properties of gaze behavior from the psychology literature, such as the temporal delay between fixation and manipulation events. We present an inference method that can predict the best sequence of gaze locations and the associated action label from an input sequence of images. We demonstrate improvements in action recognition rates and gaze prediction accuracy relative to state-of-the-art methods, on two new datasets that contain egocentric videos of daily activities and gaze.