Abstract
Learning to generate natural scenes has always been
a challenging task in computer vision. It is even more
painstaking when the generation is conditioned on images
with drastically different views. This is mainly because understanding, matching, and transforming appearance
and semantic information across the views is not trivial.
In this paper, we attempt to solve the novel problem of
cross-view image synthesis, aerial to street-view and vice
versa, using conditional generative adversarial networks
(cGAN). Two new architectures called Crossview Fork (X-Fork) and Crossview Sequential (X-Seq) are proposed to
generate scenes with resolutions of 64×64 and 256×256
pixels. The X-Fork architecture has a single discriminator and
a single generator. The generator hallucinates both the image and its semantic segmentation in the target view. The X-Seq
architecture uses two cGANs. The first one generates
the target image which is subsequently fed to the second
cGAN for generating its corresponding semantic segmentation map. The feedback from the second cGAN helps the
first cGAN generate sharper images. Both of our proposed
architectures learn to generate natural images as well as
their semantic segmentation maps. The proposed methods
capture and maintain the true semantics of objects in source and target views better than the
traditional image-to-image translation method, which considers only the visual appearance of the scene. Extensive
qualitative and quantitative evaluations support the effectiveness of our frameworks, compared to two state-of-the-art
methods, for natural scene generation across drastically
different views.
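
The data flow of the two architectures described above can be sketched roughly as follows. This is only an illustrative wiring diagram in Python, not the actual networks: the names x_fork, x_seq, toy_fork_generator, toy_g1, and toy_g2 are hypothetical, the "generators" are trivial stand-in functions, and the discriminators and adversarial training (including the feedback from the second cGAN to the first in X-Seq) are omitted entirely.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # toy stand-in for an H x W image


def x_fork(generator: Callable[[Image], Tuple[Image, Image]],
           source: Image) -> Tuple[Image, Image]:
    """X-Fork wiring: a single generator emits both the target-view
    image and its semantic segmentation map."""
    return generator(source)


def x_seq(g1: Callable[[Image], Image],
          g2: Callable[[Image], Image],
          source: Image) -> Tuple[Image, Image]:
    """X-Seq wiring: the first generator produces the target-view image,
    which is then fed to the second generator to produce its segmentation."""
    target_image = g1(source)
    target_seg = g2(target_image)
    return target_image, target_seg


# Hypothetical toy generators, used only to exercise the wiring above.
def toy_g1(img: Image) -> Image:
    # Pretend "view change": mirror each row.
    return [row[::-1] for row in img]


def toy_g2(img: Image) -> Image:
    # Pretend "segmentation": threshold each pixel at 0.5.
    return [[1.0 if v > 0.5 else 0.0 for v in row] for row in img]


def toy_fork_generator(img: Image) -> Tuple[Image, Image]:
    # One function with two outputs, mimicking the forked generator.
    target = toy_g1(img)
    return target, toy_g2(target)


source = [[0.1, 0.9], [0.7, 0.2]]
fork_image, fork_seg = x_fork(toy_fork_generator, source)
seq_image, seq_seg = x_seq(toy_g1, toy_g2, source)
```

In this sketch the only structural difference is where the segmentation map comes from: X-Fork obtains it from the same generator call as the image, while X-Seq obtains it from a second stage that consumes the generated image.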