Abstract. We introduce an unsupervised feature learning approach that
embeds 3D shape information into a single-view image representation.
The main idea is a self-supervised training objective that, given only
a single 2D image, requires all unseen views of the object to be
predictable from learned features. We implement this idea as an
encoder-decoder convolutional neural network. The network maps an input image
of an unknown category and unknown viewpoint to a latent space, from
which a deconvolutional decoder can best “lift” the image to its complete
viewgrid showing the object from all viewing angles. Our class-agnostic
training procedure encourages the representation to capture fundamental
shape primitives and semantic regularities in a data-driven manner,
without manual semantic labels. Our results on two widely used shape
datasets show that 1) our approach successfully learns to perform “mental
rotation” even for objects unseen during training, and 2) the learned latent
space is a powerful representation for object recognition, outperforming
several existing unsupervised feature learning methods.
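
The training objective described above can be illustrated with a toy sketch. It is not the network from this work. Plain linear maps stand in for the convolutional encoder and deconvolutional decoder, the reconstruction error is assumed to be a per-pixel mean squared error, and every name and size (NUM_VIEWS, PIXELS, LATENT_DIM, predict_viewgrid, viewgrid_loss) is hypothetical. It shows only the shape of the problem: one view goes in, a full viewgrid comes out, and the loss compares the prediction with all views of the object.

```python
import random

# All sizes below are illustrative assumptions, not values from the paper.
NUM_VIEWS = 8    # number of viewpoints in the viewgrid
PIXELS = 16      # pixels per flattened view
LATENT_DIM = 4   # size of the latent code


def random_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vector):
    """Matrix-vector product; a stand-in for a learned network layer."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def predict_viewgrid(image, enc_w, dec_w):
    """Encode one view into a latent code, then decode it into all views."""
    latent = linear(enc_w, image)   # encoder: image -> latent code
    flat = linear(dec_w, latent)    # decoder: latent code -> every view
    return [flat[i * PIXELS:(i + 1) * PIXELS] for i in range(NUM_VIEWS)]


def viewgrid_loss(predicted, target):
    """Mean squared per-pixel error over the whole viewgrid (assumed loss)."""
    total, count = 0.0, 0
    for pred_view, true_view in zip(predicted, target):
        for p, t in zip(pred_view, true_view):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
enc_w = random_matrix(LATENT_DIM, PIXELS, rng)
dec_w = random_matrix(NUM_VIEWS * PIXELS, LATENT_DIM, rng)

# A made-up target viewgrid; the input is one of its views, and the model
# is not told which viewpoint that view came from.
target = [[rng.random() for _ in range(PIXELS)] for _ in range(NUM_VIEWS)]
image = target[rng.randrange(NUM_VIEWS)]

predicted = predict_viewgrid(image, enc_w, dec_w)
loss = viewgrid_loss(predicted, target)
```

In this sketch, training would adjust enc_w and dec_w to reduce the loss over many objects. The latent code produced by the encoder would then be the feature reused for recognition.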