Abstract
Representing objects using elements from a visual dictionary is widely used in object detection and categorization. Prior work on dictionary learning has shown improvements in the accuracy of object detection and categorization by learning discriminative dictionaries. However, none of these dictionaries are learnt for joint object categorization and segmentation. Moreover, dictionary learning is often done separately from classifier training, which reduces the discriminative power of the model. In this paper, we formulate the semantic segmentation problem as a joint categorization, segmentation and dictionary learning problem. To that end, we propose a latent conditional random field (CRF) model in which the observed variables are pixel category labels and the latent variables are visual word assignments. The CRF energy consists of a bottom-up segmentation cost, a top-down bag-of-(latent)-words categorization cost, and a dictionary learning cost. Together, these costs capture relationships between image features and visual words, relationships between visual words and object categories, and spatial relationships among visual words. The segmentation, categorization, and dictionary learning parameters are learnt jointly using latent structural SVMs, and the segmentation and visual word assignments are inferred jointly using energy minimization techniques. Experiments on the Graz02 and CamVid datasets demonstrate the performance of our approach.
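The three-cost energy described above can be sketched schematically as follows (the notation here is illustrative, not necessarily the paper's own: x denotes image features, y the pixel category labels, h the latent visual word assignments, and w the jointly learnt parameters):

```latex
% Illustrative sketch of the latent CRF energy: a bottom-up segmentation
% cost, a top-down bag-of-(latent)-words categorization cost, and a
% dictionary learning cost, all sharing parameters w.
E(\mathbf{y}, \mathbf{h}; \mathbf{x}, \mathbf{w})
  = E_{\mathrm{seg}}(\mathbf{y}; \mathbf{x}, \mathbf{w})
  + E_{\mathrm{cat}}(\mathbf{y}, \mathbf{h}; \mathbf{w})
  + E_{\mathrm{dict}}(\mathbf{h}; \mathbf{x}, \mathbf{w})
```

At test time, (y, h) would be inferred jointly by minimizing E via energy minimization; at training time, w would be learnt with a latent structural SVM, treating h as the latent variables.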