Abstract. Training a deep network to perform semantic segmentation
requires large amounts of labeled data. To alleviate the manual effort
of annotating real images, researchers have investigated the use of synthetic data, which can be labeled automatically. Unfortunately, a network trained on synthetic data performs relatively poorly on real images.
While this can be addressed by domain adaptation, existing methods all
require access to real images during training. In this paper, we
introduce a drastically different way to handle synthetic images that
does not require seeing any real images at training time. Our approach
builds on the observation that foreground and background classes are
not affected in the same manner by the domain shift, and thus should
be treated differently. In particular, the former should be handled in
a detection-based manner to better account for the fact that, while
their textures in synthetic images are not photo-realistic, their shapes look
natural. Our experiments demonstrate the effectiveness of our approach on
Cityscapes and CamVid with models trained on synthetic data only.