Deep Future Gaze: Gaze Anticipation on Egocentric Videos
Using Adversarial Networks
Abstract
We introduce a new problem of gaze anticipation on egocentric videos. This substantially extends the conventional gaze prediction problem to future frames by no longer confining it to the current frame. To solve this problem, we
propose a new generative adversarial neural network based
model, Deep Future Gaze (DFG). DFG generates multiple
future frames conditioned on a single current frame and anticipates the corresponding future gaze locations in the next few seconds. It consists of two networks: a generator and a discriminator.
The generator uses a two-stream spatiotemporal convolutional architecture (3D-CNN) that explicitly untangles the foreground and the background to generate future frames. DFG then attaches another 3D-CNN that anticipates gaze on these synthetic frames. The discriminator plays against the generator by distinguishing the generator's synthetic frames from real frames. Through this competition, the generator progressively improves the quality of the future frames and thus anticipates future gaze more accurately.
Experimental results on publicly available egocentric datasets show that DFG significantly outperforms all well-established baselines. Moreover, we demonstrate that DFG achieves better gaze prediction on current frames than state-of-the-art methods, a benefit of the motion-discriminative representations learned through frame generation. We further contribute a new egocentric dataset (OST) for the object search task, on which DFG also achieves the best performance.
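For concreteness, the following is a minimal PyTorch sketch of the architecture the abstract describes: a two-stream generator that composites a 3D-CNN foreground (gated by a soft mask) over a static background decoded from the single current frame, a gaze-anticipation 3D-CNN applied to the synthetic clip, and a 3D-CNN discriminator. All layer widths, the 16-frame 64x64 clip size, and the class names (Generator, GazeHead, Discriminator) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Two-stream generator: a 3D-CNN foreground stream and a 2D-CNN
    static-background stream, composited through a soft mask (T=16 here)."""
    def __init__(self):
        super().__init__()
        # Encode the single current frame: (B, 3, 64, 64) -> (B, 512, 4, 4).
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(256, 512, 4, 2, 1), nn.ReLU(),
        )
        # Foreground stream: 3D deconvolutions grow 1 frame into T frames.
        self.fg_trunk = nn.Sequential(
            nn.ConvTranspose3d(512, 256, (2, 4, 4), (2, 2, 2), (0, 1, 1)), nn.ReLU(),
            nn.ConvTranspose3d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose3d(128, 64, 4, 2, 1), nn.ReLU(),
        )
        self.fg_frames = nn.Sequential(nn.ConvTranspose3d(64, 3, 4, 2, 1), nn.Tanh())
        self.fg_mask = nn.Sequential(nn.ConvTranspose3d(64, 1, 4, 2, 1), nn.Sigmoid())
        # Background stream: 2D deconvolutions produce one static frame.
        self.bg = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, frame):                       # (B, 3, 64, 64)
        z = self.encode(frame)                      # (B, 512, 4, 4)
        h = self.fg_trunk(z.unsqueeze(2))           # (B, 64, T/2, 32, 32)
        fg, m = self.fg_frames(h), self.fg_mask(h)  # frames and soft mask
        bg = self.bg(z).unsqueeze(2)                # static, broadcast over time
        return m * fg + (1 - m) * bg                # (B, 3, T, 64, 64)

class GazeHead(nn.Module):
    """3D-CNN mapping the synthetic frames to per-frame gaze heatmaps."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, 3, 1, 1), nn.ReLU(),
            nn.Conv3d(32, 1, 3, 1, 1),
        )

    def forward(self, clip):                        # (B, 3, T, 64, 64)
        heat = self.net(clip)                       # per-frame gaze logits
        b, _, t, hh, ww = heat.shape
        # Softmax over the spatial grid of each frame -> gaze distribution.
        return heat.view(b, t, hh * ww).softmax(-1).view(b, 1, t, hh, ww)

class Discriminator(nn.Module):
    """3D-CNN scoring a clip (B, 3, T, 64, 64) as real or synthetic."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv3d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv3d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv3d(256, 1, (2, 8, 8)),           # -> (B, 1, 1, 1, 1)
        )

    def forward(self, clip):
        return self.net(clip).flatten(1)            # (B, 1) real/fake logit

# Smoke test: one current frame in, T anticipated frames and gaze maps out.
G, D, gaze = Generator(), Discriminator(), GazeHead()
clip = G(torch.randn(2, 3, 64, 64))                 # (2, 3, 16, 64, 64)
maps = gaze(clip)                                   # (2, 1, 16, 64, 64)
logit = D(clip)                                     # (2, 1)
```

In the adversarial game sketched here, the discriminator would be trained to separate real clips from the generator's output while the generator is trained to fool it, and the gaze head would presumably be supervised with recorded gaze on the generated frames.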