Abstract
We investigate the problem of object referring (OR), i.e., localizing a target object in a visual scene given a language description. Humans perceive the world more as continuous video snippets than as static images, and describe objects not only by their appearance, but also by
their spatio-temporal context and motion features. Humans
also gaze at the object when they issue a referring expression. Existing works on OR mostly focus on static images, which fall short of providing many such cues. This
paper addresses OR in videos with language and human
gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated with descriptions and gaze. We further propose a novel network model for OR in videos that integrates appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our
method effectively utilizes motion cues, human gaze, and
spatio-temporal context. Our method outperforms previous
OR methods. For the dataset and code, please refer to https://people.ee.ethz.ch/~arunv/ORGaze.html.