Abstract. We present an approach to infer a layer-structured 3D representation
of a scene from a single input image. This allows us to infer not only the depth
of the visible pixels, but also the texture and depth of scene content that is
not directly visible. We overcome the challenge posed by the lack
of direct supervision by instead leveraging a more naturally available multi-view
supervisory signal. Our insight is to use view synthesis as a proxy task: we enforce
that our representation (inferred from a single image), when rendered from a novel
perspective, matches the true observed image. We present a learning framework
that operationalizes this insight using a new, differentiable novel view renderer.
We provide qualitative and quantitative validation of our approach in two different
settings, and demonstrate that we can learn to capture the hidden aspects of a
scene. The project website can be found at https://shubhtuls.github.io/lsi/.
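
To make the proxy task concrete, the sketch below illustrates a view-synthesis training objective of the kind the abstract describes: a representation inferred from a single source image is rendered from a novel viewpoint and penalized against the true image observed from that viewpoint. This is an illustrative sketch only, not the authors' implementation; `encoder`, `renderer`, the camera argument, and the choice of an L1 photometric loss are all assumptions.

```python
# Minimal sketch (not the authors' code) of view synthesis as a proxy task.
# `encoder` and `renderer` are hypothetical stand-ins for the paper's
# single-image layered-scene predictor and differentiable novel-view renderer.
import torch
import torch.nn.functional as F

def view_synthesis_loss(encoder, renderer, src_image, tgt_image, tgt_camera):
    # Infer a layer-structured 3D representation from a single source image.
    layers = encoder(src_image)
    # Differentiably render that representation from the novel (target) viewpoint.
    predicted = renderer(layers, tgt_camera)
    # Penalize mismatch against the true image observed from that viewpoint
    # (L1 photometric loss chosen here for illustration).
    return F.l1_loss(predicted, tgt_image)
```

Because the rendering step is differentiable, this loss can be backpropagated through the renderer into the encoder, so the multi-view supervisory signal trains the single-image representation without any direct 3D supervision.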