Abstract
Weakly supervised learning with only coarse labels can
obtain visual explanations of deep neural networks, such as
attention maps, by back-propagating gradients. These attention maps are then available as priors for tasks such
as object localization and semantic segmentation. In one
common framework, we address three shortcomings of previous approaches in modeling such attention maps: we (1)
make attention maps an explicit and natural component of
the end-to-end training for the first time, (2) provide self-guidance directly on these maps by exploring supervision
from the network itself to improve them, and (3) seamlessly
bridge the gap between using weak and extra supervision if
available. Despite its simplicity, experiments on the semantic segmentation task demonstrate the effectiveness of our
method. We clearly surpass the state of the art on the PASCAL VOC 2012 test and val sets. Besides, the proposed
framework provides a way not only to explain the focus of
the learner but also to feed back direct guidance towards specific tasks. Under mild assumptions, our method
can also be understood as a plug-in to existing weakly supervised learners to improve their generalization performance.