Abstract
Current systems for scene understanding typically represent objects as 2D or 3D bounding boxes. While these representations have proven robust in a variety of applications, they provide only coarse approximations to the true 2D and 3D extent of objects. As a result, object-object interactions, such as occlusions or ground-plane contact, can be represented only superficially. In this paper, we approach the problem of scene understanding from the perspective of 3D shape modeling, and design a 3D scene representation that reasons jointly about the 3D shape of multiple objects. This representation allows us to express 3D geometry and occlusion at the fine level of detail of individual vertices of 3D wireframe models, and makes it possible to treat dependencies between objects, such as occlusion reasoning, in a deterministic way. In our experiments, we demonstrate the benefit of jointly estimating the 3D shape of multiple objects in a scene over working with coarse boxes, on the recently proposed KITTI dataset of realistic street scenes.