Where and Why Are They Looking? Jointly Inferring Human Attention and
Intentions in Complex Tasks
Abstract
This paper addresses a new problem: jointly inferring human attention, intentions, and tasks from videos. Given an RGB-D video in which a human performs a task, we answer three questions simultaneously: 1) where the human is looking (attention prediction); 2) why the human is looking there (intention prediction); and 3) what task the human is performing (task recognition). We propose a hierarchical human-attention-object (HAO) model that represents tasks, intentions, and attention in a unified framework. A task is represented as a sequence of intentions that transition into one another. An intention is composed of human pose, attention, and objects. A beam search algorithm performs inference on the HAO graph, jointly outputting the attention, intention, and task results. We built a new video dataset of tasks, intentions, and attention, containing 14 task classes, 70 intention categories, 28 object classes, 809 videos, and approximately 330,000 frames. Experiments show that our approach outperforms existing methods.
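As an illustration of the inference step, the sketch below shows a generic beam search over per-frame label hypotheses. The candidate, scoring, and transition interfaces are hypothetical placeholders standing in for the HAO graph potentials, not the paper's actual implementation.

```python
import heapq

def beam_search(frames, candidates, score, transition, beam_width=5):
    """Generic beam search over per-frame label hypotheses.

    frames:      iterable of per-frame observations (e.g., pose, objects)
    candidates:  frame -> list of candidate labels for that frame
    score:       (frame, label) -> float, per-frame compatibility score
    transition:  (prev_label, label) -> float, temporal transition score
    Returns the highest-scoring label sequence.
    """
    beam = [(0.0, [])]  # (cumulative score, partial label sequence)
    for frame in frames:
        expanded = []
        for total, seq in beam:
            prev = seq[-1] if seq else None
            for label in candidates(frame):
                s = total + score(frame, label)
                if prev is not None:
                    s += transition(prev, label)
                expanded.append((s, seq + [label]))
        # Prune: keep only the top-scoring partial sequences.
        beam = heapq.nlargest(beam_width, expanded, key=lambda x: x[0])
    return max(beam, key=lambda x: x[0])[1]
```

In this style of inference, each frame's candidate set would be drawn from the model's intention and attention hypotheses, with the transition term encoding how intentions follow one another within a task.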