Abstract. In this paper, we propose a novel deep-learning-based video saliency prediction method, named DeepVS. Specifically, we establish a large-scale
eye-tracking database of videos (LEDOV), which includes 32 subjects’ fixations on 538 videos. We find from LEDOV that human attention is more likely to
be attracted by objects, particularly the moving objects or the moving parts of
objects. Hence, an object-to-motion convolutional neural network (OM-CNN) is
developed to predict the intra-frame saliency for DeepVS, which is composed
of the objectness and motion subnets. In OM-CNN, cross-net mask and hierarchical feature normalization are proposed to combine the spatial features of the
objectness subnet and the temporal features of the motion subnet. We further
find from our database that there exists a temporal correlation of human attention
with a smooth saliency transition across video frames. We thus propose a saliency-structured convolutional long short-term memory (SS-ConvLSTM) network, using the extracted features from OM-CNN as the input. Consequently, the inter-frame saliency maps of a video can be generated, which consider both structured
output with center-bias and cross-frame transitions of human attention maps. Finally, the experimental results show that DeepVS advances the state-of-the-art in
video saliency prediction.