Attentional Push: A Deep Convolutional Network for Augmenting Image
Salience with Shared Attention Modeling in Social Scenes
Abstract
We present a novel visual attention tracking technique
based on Shared Attention modeling. By considering the
viewer as a participant in the activity occurring in the
scene, our model learns the loci of attention of the scene actors and uses them to augment image salience. We go beyond image salience: instead of computing only the power of image regions to pull attention, we also consider the strength
with which the scene actors push attention to the region in
question, thus the term Attentional Push. We present a convolutional neural network (CNN) which augments standard
saliency models with Attentional Push. Our model contains
two pathways: an Attentional Push pathway which learns
the gaze location of the scene actors and a saliency pathway. These are followed by a shallow augmented saliency
CNN which combines them and generates the augmented
saliency. For training, we use transfer learning to initialize the Attentional Push CNN, then train it by minimizing the
classification error of following the actors’ gaze location on
a 2-D grid, using a large-scale gaze-following dataset. The
Attentional Push CNN is then fine-tuned along with the augmented saliency CNN to minimize the Euclidean distance
between the augmented saliency and ground truth fixations
using an eye-tracking dataset, annotated with the head and
the gaze location of the scene actors. We evaluate our model
on three challenging eye fixation datasets, SALICON, iSUN
and CAT2000, and illustrate significant improvements in
predicting viewers’ fixations in social scenes.
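
The two training objectives described above can be illustrated with a minimal sketch in plain Python. It is not the network itself: the grid size of 5, the function names, and the fixed-weight `fuse` step (a stand-in for the learned shallow augmented saliency CNN) are all assumptions made for illustration and are not taken from the paper.

```python
import math


def gaze_to_grid_label(x, y, grid_size=5):
    """Map a normalized gaze point (x, y) in [0, 1] to a class index.

    The grid is grid_size x grid_size and cells are numbered row by row.
    grid_size=5 is a hypothetical choice, not a value from the paper.
    """
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("gaze coordinates must be normalized to [0, 1]")
    # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
    col = min(int(x * grid_size), grid_size - 1)
    row = min(int(y * grid_size), grid_size - 1)
    return row * grid_size + col


def fuse(saliency, push, w_saliency=0.5, w_push=0.5):
    """Combine a saliency map and an Attentional Push map cell by cell.

    Hypothetical stand-in: the paper uses a learned shallow CNN here,
    whereas this sketch uses a fixed weighted sum.
    """
    return [
        [w_saliency * s + w_push * p for s, p in zip(s_row, p_row)]
        for s_row, p_row in zip(saliency, push)
    ]


def euclidean_distance(pred, target):
    """Euclidean distance between two equally sized 2-D maps."""
    return math.sqrt(
        sum(
            (p - t) ** 2
            for p_row, t_row in zip(pred, target)
            for p, t in zip(p_row, t_row)
        )
    )


# Stage 1 target: the grid cell containing the annotated gaze location.
label = gaze_to_grid_label(0.5, 0.5)

# Stage 2 objective: distance between the augmented map and a fixation map.
augmented = fuse([[1.0, 0.0]], [[0.0, 1.0]])
loss = euclidean_distance(augmented, [[0.5, 0.5]])
```

In this sketch the first stage reduces gaze following to a classification problem over grid cells, and the second stage scores the fused map against ground-truth fixations, mirroring the two-step training order given in the abstract.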