Abstract. Unsupervised image-to-image translation techniques are able
to map local texture between two domains, but they typically fail when the domains require larger shape changes. Inspired by
semantic segmentation, we introduce a discriminator with dilated convolutions that is able to use information from across the entire image to
train a more context-aware generator. This is coupled with a multi-scale
perceptual loss that is better able to represent error in the underlying
shape of objects. We demonstrate that this design is more capable of representing shape deformation on a challenging toy dataset, as well as in complex mappings with significant dataset variation between humans, dolls, and anime faces, and between cats and dogs.
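To make the discriminator design concrete, the following is a minimal sketch of a fully convolutional discriminator built from dilated convolutions. The layer widths, dilation rates, and class name are illustrative assumptions, not the paper's exact architecture; the point is that growing dilation rates enlarge the receptive field so each output score sees context from across the image.

```python
# Minimal sketch of a dilated-convolution discriminator. All layer sizes,
# dilation rates, and names are illustrative assumptions, not the paper's
# exact architecture.
import torch
import torch.nn as nn

class DilatedDiscriminator(nn.Module):
    def __init__(self, in_channels: int = 3, base_channels: int = 64):
        super().__init__()
        layers = []
        channels = in_channels
        # Increasing dilation rates grow the receptive field so each output
        # unit aggregates context from across the image, without striding
        # away spatial resolution.
        for i, dilation in enumerate([1, 2, 4, 8]):
            out_channels = base_channels * (2 ** i)
            layers += [
                nn.Conv2d(channels, out_channels, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.LeakyReLU(0.2, inplace=True),
            ]
            channels = out_channels
        # A 1x1 convolution produces a per-location real/fake score map,
        # giving the generator spatially localized, context-aware feedback.
        layers.append(nn.Conv2d(channels, 1, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Example: a 256x256 RGB image yields a 256x256 score map.
if __name__ == "__main__":
    d = DilatedDiscriminator()
    scores = d(torch.randn(1, 3, 256, 256))
    print(scores.shape)  # torch.Size([1, 1, 256, 256])
```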
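The multi-scale perceptual loss can be sketched similarly: a feature-space distance evaluated on an image pyramid, so that coarser scales emphasize global shape over local texture. The VGG feature extractor, scale set, and equal weighting below are assumptions for illustration, not the paper's exact formulation.

```python
# Rough sketch of a multi-scale perceptual loss: the same feature distance
# evaluated at several image scales. Feature network, scales, and weights
# are illustrative assumptions, not the paper's exact loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class MultiScalePerceptualLoss(nn.Module):
    def __init__(self, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        # Fixed feature extractor: the first few VGG-16 conv blocks.
        self.features = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.scales = scales

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        loss = pred.new_zeros(())
        for s in self.scales:
            # Downsampling before feature extraction makes coarse scales
            # sensitive to shape error rather than texture detail.
            p, t = pred, target
            if s != 1.0:
                p = F.interpolate(p, scale_factor=s, mode="bilinear",
                                  align_corners=False)
                t = F.interpolate(t, scale_factor=s, mode="bilinear",
                                  align_corners=False)
            loss = loss + F.l1_loss(self.features(p), self.features(t))
        return loss / len(self.scales)
```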