Abstract. Given a single RGB image of a complex outdoor road scene
in the perspective view, we address the novel problem of estimating an
occlusion-reasoned semantic scene layout in the top-view. This challenging problem requires an accurate understanding not only of the 3D geometry and the semantics of the visible scene, but also of occluded
areas. We propose a convolutional neural network that learns to predict
occluded portions of the scene layout by looking around foreground objects such as cars or pedestrians. However, instead of hallucinating RGB values, we
show that directly predicting the semantics and depths in the occluded
areas enables a better transformation into the top-view. We further show
that this initial top-view representation can be significantly enhanced by
learning priors and rules about typical road layouts from simulated data or, if
available, map data. Crucially, training our model does not require costly
or subjective human annotations for occluded areas or the top-view, but
rather uses readily available annotations for standard semantic segmentation in the perspective view. We extensively evaluate and analyze our
approach on the KITTI and Cityscapes datasets.