Abstract. We address the task of jointly determining what a person is
doing and where they are looking based on the analysis of video captured
by a head-worn camera. We propose a novel deep model for joint gaze
estimation and action recognition in First Person Vision. Our method
describes the participant’s gaze as a probabilistic variable and models its
distribution using stochastic units in a deep network. We sample from
these stochastic units to generate an attention map. This attention map
guides the aggregation of visual features in action recognition, thereby
providing coupling between gaze and action. We evaluate our method on
the standard EGTEA dataset and demonstrate performance that exceeds
the state of the art by a significant margin of 3.5%.
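To make the described mechanism concrete, below is a minimal PyTorch sketch of the core idea: treat gaze as a latent spatial variable, sample a stochastic attention map from its predicted distribution, and use that map to pool visual features for action classification. All module names, tensor shapes, and the choice of the Gumbel-Softmax relaxation for differentiable sampling are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch of gaze-guided action recognition with stochastic attention.
# Shapes, names, and the Gumbel-Softmax sampler are assumptions for
# illustration; they do not reproduce the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeGuidedRecognizer(nn.Module):
    def __init__(self, feat_channels=256, num_actions=106):
        super().__init__()
        # Predicts per-location gaze logits from a backbone feature map.
        self.gaze_head = nn.Conv2d(feat_channels, 1, kernel_size=1)
        # Classifies actions from attention-pooled features.
        self.classifier = nn.Linear(feat_channels, num_actions)

    def forward(self, feats, tau=1.0):
        # feats: (B, C, H, W) visual features from a backbone network.
        B, C, H, W = feats.shape
        logits = self.gaze_head(feats).view(B, H * W)
        # Draw a soft sample from the gaze distribution via the
        # Gumbel-Softmax reparameterization (keeps sampling differentiable).
        attn = F.gumbel_softmax(logits, tau=tau, hard=False)  # (B, H*W)
        attn = attn.view(B, 1, H, W)
        # Attention-weighted aggregation of features, then classification:
        # the sampled gaze map couples gaze prediction to action recognition.
        pooled = (feats * attn).sum(dim=(2, 3))  # (B, C)
        return self.classifier(pooled), attn

# Example usage with random features standing in for a video backbone.
feats = torch.randn(2, 256, 7, 7)
model = GazeGuidedRecognizer()
action_logits, attention_map = model(feats)
```

Because the attention map is sampled rather than deterministic, the same clip can yield different plausible gaze hypotheses; the soft Gumbel-Softmax sample keeps the whole pipeline end-to-end trainable, which is one standard way to realize the stochastic units the abstract refers to.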