Abstract
Recent advances in deep neural networks have convincingly demonstrated their strong capability for learning vision models on large datasets. Nevertheless, collecting expert-labeled datasets, especially with pixel-level annotations, is an extremely expensive process. An appealing alternative is to render synthetic data (e.g., from computer games) and generate ground truth automatically. However, simply applying models learnt on synthetic images may lead to high generalization error on real images due to domain shift. In this paper, we address this issue from the perspectives of
both visual appearance-level and representation-level domain adaptation. The former adapts source-domain images to appear as if drawn from the "style" of the target domain, and the latter attempts to learn domain-invariant representations. Specifically, we present Fully Convolutional Adaptation Networks (FCAN), a novel deep architecture for semantic segmentation that combines Appearance Adaptation Networks (AAN) and Representation Adaptation Networks (RAN). AAN learns a transformation from one domain to the other in pixel space, and RAN is optimized
in an adversarial learning manner to maximally fool the
domain discriminator with the learnt source and target representations. Extensive experiments are conducted on the
transfer from GTA5 (game videos) to Cityscapes (urban
street scenes) for semantic segmentation, and our proposal achieves superior results compared to state-of-the-art unsupervised adaptation techniques. More remarkably, we obtain a new record: an mIoU of 47.5% on BDDS (drive-cam videos) in an unsupervised setting.
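To make the adversarial objective described above concrete, the following is a minimal sketch of a standard domain-confusion loss consistent with the abstract; the symbols (feature extractor F, domain discriminator D, source distribution S, target distribution T) are assumptions for illustration, as the paper's exact formulation of RAN is given later in the text:

\[
\min_{F}\,\max_{D}\;\; \mathbb{E}_{x_s \sim \mathcal{S}}\big[\log D(F(x_s))\big] \;+\; \mathbb{E}_{x_t \sim \mathcal{T}}\big[\log\big(1 - D(F(x_t))\big)\big]
\]

Here D is trained to distinguish source representations F(x_s) from target representations F(x_t), while F is trained to maximally fool D, encouraging domain-invariant representations.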