Going from Image to Video Saliency: Augmenting Image Salience with Dynamic
Attentional Push
Abstract
We present a novel method for leveraging recent advances in static saliency models to predict saliency in videos. Our model augments static saliency models with the Attentional Push effect of the photographer and the scene actors in a shared attention setting. We demonstrate that not only is it imperative to use static Attentional Push cues, but that a noticeable performance improvement is achievable by learning the time-varying nature of Attentional Push. We propose a multi-stream Convolutional Long Short-Term Memory (ConvLSTM) network structure that augments state-of-the-art static saliency models with dynamic Attentional Push. Our network contains four pathways: a saliency pathway and three Attentional Push pathways.
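As an illustration only, here is a minimal sketch of such a four-pathway ConvLSTM front end, assuming a PyTorch implementation; the class names (ConvLSTMCell, FourPathways), channel counts, and kernel sizes are our hypothetical choices, not taken from the paper:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Single convolutional LSTM cell; all four gates from one convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class FourPathways(nn.Module):
    """One ConvLSTM per input stream: a static saliency map plus three
    Attentional Push cue maps, each processed independently over time."""
    def __init__(self, in_ch=1, hid_ch=16, n_paths=4):
        super().__init__()
        self.cells = nn.ModuleList(ConvLSTMCell(in_ch, hid_ch)
                                   for _ in range(n_paths))

    def forward(self, streams):
        # streams: list of n_paths tensors, each (B, T, C, H, W)
        B, T, _, H, W = streams[0].shape
        states = [(stream[:, 0].new_zeros(B, cell.hid_ch, H, W),) * 2
                  for stream, cell in zip(streams, self.cells)]
        feats = []
        for t in range(T):
            hs = []
            for p, cell in enumerate(self.cells):
                h, c = cell(streams[p][:, t], states[p])
                states[p] = (h, c)
                hs.append(h)
            feats.append(torch.cat(hs, dim=1))  # (B, n_paths*hid_ch, H, W)
        return torch.stack(feats, dim=1)        # (B, T, n_paths*hid_ch, H, W)
```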
The multi-pathway structure is followed by an augmenting convnet that learns to combine the complementary and time-varying outputs of the ConvLSTMs by minimizing the relative entropy between the augmented saliency and viewers' fixation patterns on videos.
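Continuing the sketch under the same assumptions, the augmenting convnet and the relative-entropy objective $D_{\mathrm{KL}}(P\,\|\,S)=\sum_x P(x)\log\frac{P(x)}{S(x)}$, where $P$ is the viewers' fixation density and $S$ the predicted saliency distribution, might look as follows (the names `augment` and `kl_loss` and the layer sizes are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fuses the concatenated ConvLSTM outputs (e.g. 4 pathways x 16 channels)
# into a single saliency map per frame.
augment = nn.Sequential(
    nn.Conv2d(4 * 16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=1))

def kl_loss(saliency_logits, fixations, eps=1e-8):
    """Relative entropy D_KL(P || S) between the fixation density P and the
    predicted saliency S, per frame (inputs are (B, 1, H, W)), batch-averaged."""
    s = F.softmax(saliency_logits.flatten(1), dim=1)   # normalize prediction
    p = fixations.flatten(1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)         # normalize fixations
    return (p * (torch.log(p + eps) - torch.log(s + eps))).sum(dim=1).mean()
```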
We evaluate our model by comparing the performance of several augmented static saliency models against state-of-the-art spatiotemporal saliency models on the three largest dynamic eye-tracking datasets: HOLLYWOOD2, UCF-Sport, and DIEM. Experimental results show that a solid performance gain is achievable using the proposed methodology.