Abstract. In this work, we consider the problem of robust gaze estimation in natural environments. Large camera-to-subject distances and high
variations in head pose and eye gaze angles are common in such environments. These conditions lead to two main shortfalls in state-of-the-art methods for
gaze estimation: hindered ground truth gaze annotation and diminished
gaze estimation accuracy as image resolution decreases with distance.
We first record a novel dataset of varied gaze and head pose images in a
natural environment, addressing the issue of ground truth annotation by
measuring head pose using a motion capture system and eye gaze using
mobile eyetracking glasses. We apply semantic image inpainting to the
area covered by the glasses, removing their obtrusive appearance and thereby bridging the gap between the training images and glasses-free testing images. We also present a
new real-time algorithm involving appearance-based deep convolutional
neural networks with increased capacity to cope with the diverse images
in the new dataset. Experiments with this network architecture are conducted on a number of diverse eye-gaze datasets, including our own, and in cross-dataset evaluations. We demonstrate state-of-the-art performance
in terms of estimation accuracy in all experiments, and the architecture
performs well even on lower-resolution images.