Abstract
We tackle the problem of weakly labeled semantic segmentation, where the only source of annotation are image tags encoding which classes are present in the scene. Thisis an extremely difficult problem as no pixel-wise labelings are available, not even at training time. In this paper, weshow that this problem can be formalized as an instance of learning in a latent structured prediction framework, where the graphical model encodes the presence and absence of aclass as well as the assignments of semantic labels to super-pixels. As a consequence, we are able to leverage standardalgorithms with good theoretical properties. We demon-strate the effectiveness of our approach using the challenging SIFT-flow dataset and show average per-class accuracy improvements of 7% over the state-of-the-art.