Abstract
In this paper we present Stixmantics, a novel medium-level scene representation for real-time visual semantic scene understanding. Relevant scene structure, motion and object class information is encoded using so-called Stixels as primitive elements. Sparse feature-point trajectories are used to estimate the 3D motion field and to enforce temporal consistency of semantic labels. Spatial label coherency is obtained by using a CRF framework. The proposed model abstracts and aggregates low-level pixel information to gain robustness and efficiency, yet retains enough flexibility to adequately model complex scenes such as urban traffic. Our experimental evaluation focuses on semantic scene segmentation using a recently introduced dataset for urban traffic scenes. Compared with our best baseline approach, we demonstrate state-of-the-art performance while reducing inference time by a factor of more than 2,000, requiring only 50 ms per image.