Abstract
Deep Neural Networks (DNNs) have substantially improved the state of the art in salient object detection. However, training DNNs requires costly pixel-level annotations.
In this paper, we leverage the observation that image-level tags provide important cues about salient foreground objects, and develop a weakly supervised learning method for
saliency detection using image-level tags only. The Foreground Inference Network (FIN) is introduced for this challenging task. In the first stage of our training method, FIN is
jointly trained with a fully convolutional network (FCN) for
image-level tag prediction. A global smooth pooling layer
is proposed, enabling FCN to assign object category tags to
corresponding object regions, while FIN is capable of capturing all potential foreground regions with the predicted
saliency maps. In the second stage, FIN is fine-tuned with
its predicted saliency maps as ground truth. For refinement
of ground truth, an iterative Conditional Random Field is
developed to enforce spatial label consistency and further
boost performance.
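
The abstract does not define the global smooth pooling layer, so the sketch below is not the proposed layer. As an illustrative stand-in it uses log-sum-exp pooling, a common smooth interpolation between average and max pooling, to show how a per-category score map can be reduced to one image-level tag score while every spatial location still contributes. The names `lse_pool` and `r` are hypothetical.

```python
import math


def lse_pool(score_map, r=5.0):
    """Log-sum-exp pooling over a 2-D score map given as a list of rows.

    Assumption: this is a generic smooth pooling, not the paper's layer.
    Small r behaves like average pooling, large r like max pooling.
    """
    values = [v for row in score_map for v in row]
    m = max(values)
    # Subtract the maximum before exponentiating for numerical stability.
    mean_exp = sum(math.exp(r * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / r


# Toy score map for one category: a single strongly activated location.
score_map = [[0.0, 0.0],
             [0.0, 4.0]]

near_mean = lse_pool(score_map, r=0.01)   # close to the average (1.0)
near_max = lse_pool(score_map, r=100.0)   # close to the maximum (4.0)
between = lse_pool(score_map, r=5.0)      # lies between the two
```

Because the pooled score depends on all locations rather than only the strongest one, the gradient of an image-level tag loss reaches the whole object region, which is the general motivation for smooth pooling in weakly supervised localization.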
Our method reduces annotation effort and allows the
use of existing large-scale training sets with image-level
tags. Our model runs at 60 FPS, outperforms unsupervised
methods by a large margin, and achieves performance comparable or even
superior to that of fully supervised counterparts.