Abstract
This paper tackles the problem of novel view synthesis
from a single image. In particular, we target real-world
scenes with rich geometric structure, a challenging task
due to the large appearance variations of such scenes and
the lack of simple 3D models to represent them. Modern,
learning-based approaches mostly focus on appearance to
synthesize novel views and thus tend to generate predictions
that are inconsistent with the underlying scene structure.
By contrast, in this paper, we propose to exploit the 3D
geometry of the scene to synthesize a novel view. Specifically,
we approximate a real-world scene by a fixed number of planes, and learn to predict a set of homographies
and their corresponding region masks to transform the input image into a novel view. To this end, we develop a new
region-aware geometric transform network that performs
these multiple tasks in a common framework. Our results on
the outdoor KITTI and the indoor ScanNet datasets demonstrate the effectiveness of our network in generating high-quality
synthetic views that respect the scene geometry, thus
outperforming state-of-the-art methods.
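
To make the piecewise-planar idea concrete, the following minimal sketch shows how a set of homographies and region masks could be combined into a novel view. It is an illustration under stated assumptions, not the network described in this paper: the homographies and masks are supplied by hand rather than predicted, the masks are assumed to be defined in the target view and to sum to one at each pixel, sampling is nearest-neighbor, and the function names `warp_with_homography` and `compose_novel_view` are hypothetical.

```python
import numpy as np


def warp_with_homography(image, H):
    """Inverse-warp an (H, W, C) image with a 3x3 homography.

    H is assumed to map source pixel coordinates (x, y, 1) to target
    pixel coordinates. Target pixels whose source falls outside the
    image are left at zero.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous coordinates of every target pixel, shape (3, N).
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=0)
    # Map target pixels back to the source image.
    src = np.linalg.inv(H) @ target
    src_x = np.rint(src[0] / src[2]).astype(int)
    src_y = np.rint(src[1] / src[2]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)

    out = np.zeros_like(image).reshape(h * w, -1)
    out[valid] = image[src_y[valid], src_x[valid]]
    return out.reshape(image.shape)


def compose_novel_view(image, homographies, masks):
    """Blend per-plane warps of `image` using per-plane region masks.

    `homographies` is a sequence of 3x3 arrays and `masks` a sequence of
    (H, W) arrays, one pair per plane (assumption: masks live in the
    target view and sum to one at each pixel).
    """
    novel = np.zeros(image.shape, dtype=float)
    for H, mask in zip(homographies, masks):
        warped = warp_with_homography(image, H).astype(float)
        novel += mask[..., None] * warped
    return novel


# Toy example with two planes: the left region stays fixed, the right
# region shifts one pixel to the right.
image = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)
identity = np.eye(3)
shift_right = np.array([[1.0, 0.0, 1.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
left_mask = np.zeros((4, 5))
left_mask[:, :2] = 1.0
right_mask = 1.0 - left_mask
novel_view = compose_novel_view(image, [identity, shift_right],
                                [left_mask, right_mask])
```

In this sketch each mask selects the region of the target view explained by one plane, so the blended result is a per-pixel choice among the warped copies of the input image. A learned model would additionally need differentiable (for example, bilinear) sampling so that gradients can reach the predicted homographies.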