Abstract
In this paper we present a framework for semantic scene parsing and object recognition based on dense depth maps. Five view-independent 3D features that vary with object class are extracted from dense depth maps at the superpixel level to train a classifier using the randomized decision forest technique. Our formulation integrates multiple features in a Markov Random Field (MRF) framework to segment and recognize different object classes in query street scene images. We evaluate our method both quantitatively and qualitatively on the challenging Cambridge-driving Labeled Video Database (CamVid). The results show that, using dense depth information alone, we achieve more accurate overall segmentation and recognition than sparse 3D features or appearance alone, or even the combination of sparse 3D features and appearance, advancing the state of the art. Furthermore, by aligning dense-depth-based 3D features into a unified coordinate frame, our algorithm can handle the special case of view changes between training and testing scenarios. Preliminary evaluation with cross training and testing shows promising results.
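The pipeline described above — per-superpixel classifier scores fused with a pairwise smoothness term over the superpixel adjacency graph — can be illustrated with a minimal sketch. This is not the paper's implementation: the unary costs here are placeholders standing in for the randomized-decision-forest outputs on the five depth features, the pairwise term is a simple Potts penalty, and inference uses iterated conditional modes (ICM) rather than the MRF inference the paper may employ; the function name and parameters are hypothetical.

```python
import numpy as np

def mrf_icm(unary, edges, pairwise_weight=0.5, n_iters=10):
    """Label superpixels by minimizing unary + Potts pairwise cost.

    unary: (n_superpixels, n_classes) array of negative log-probabilities,
           standing in for per-superpixel classifier scores.
    edges: list of (i, j) pairs giving superpixel adjacencies.
    Returns one class label per superpixel.
    """
    n, k = unary.shape
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # Initialize from the classifier alone (no smoothing).
    labels = unary.argmin(axis=1)
    for _ in range(n_iters):
        changed = False
        for i in range(n):
            # Cost of each label = unary cost + a fixed penalty for every
            # neighboring superpixel that currently disagrees (Potts model).
            cost = unary[i].copy()
            for j in neighbors[i]:
                cost += pairwise_weight * (np.arange(k) != labels[j])
            best = cost.argmin()
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:  # ICM converged to a local minimum
            break
    return labels

# Toy example: three superpixels in a chain; the middle one weakly prefers
# class 1 on its own, but its neighbors pull it to class 0 via the MRF term.
unary = np.array([[0.1, 2.0],
                  [1.0, 0.9],
                  [0.1, 2.0]])
edges = [(0, 1), (1, 2)]
print(mrf_icm(unary, edges))  # -> [0 0 0]
```

The design point the sketch captures is that the pairwise term lets spatial context override a weak per-superpixel decision, which is why the MRF fusion improves on classifying each superpixel independently.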