Abstract
The goal of this paper is to take a single 2D image of
a scene and recover the 3D structure in terms of a small
set of factors: a layout representing the enclosing surfaces
as well as a set of objects represented in terms of shape and
pose. We propose a convolutional neural network-based approach to predict this representation and benchmark it on a
large dataset of indoor scenes. Our experiments evaluate a
number of practical design questions, demonstrate that we
can infer this representation, and quantitatively and qualitatively demonstrate its merits compared to alternate representations.