Abstract
We propose an approach for learning image representations that consist of disentangled factors of variation, without exploiting any manual labeling or knowledge of the data domain. A
factor of variation corresponds to an image attribute that
can be discerned consistently across a set of images, such
as the pose or color of objects. Our disentangled representation consists of a concatenation of feature chunks, each
chunk representing a factor of variation. The representation supports applications such as transferring attributes from one image to another, by simply mixing and unmixing feature chunks, and classifying or retrieving images based on one or several attributes, by considering a user-specified subset of the chunks. We learn our representation without any labeling
or knowledge of the data domain, using an autoencoder architecture with two novel training objectives: first, we propose an invariance objective that encourages the encoding of each attribute to be invariant to changes in the other attributes, and the decoding of each chunk to be invariant to changes in the other chunks; second, we include a classification objective that ensures each chunk corresponds to a consistently discernible attribute of the represented image, thereby avoiding degenerate feature mappings in which some chunks are ignored entirely. We demonstrate the effectiveness of our approach
on the MNIST, Sprites, and CelebA datasets.
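
To make the chunk-mixing operation concrete, the following is a minimal sketch, not the authors' implementation: the encoder and decoder definitions, the number of chunks, the chunk dimensionality, and the image size are all illustrative assumptions (PyTorch is used for concreteness).

import torch
import torch.nn as nn

NUM_CHUNKS, CHUNK_DIM = 4, 16        # assumed: 4 factors, 16 dims per chunk
FEATURE_DIM = NUM_CHUNKS * CHUNK_DIM

# Placeholder encoder/decoder; the paper's actual networks are deeper
# convolutional models. 28x28 single-channel images are assumed.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, FEATURE_DIM))
decoder = nn.Linear(FEATURE_DIM, 28 * 28)

def mix_chunks(f_a, f_b, take_from_b):
    # Take the chunks indexed by take_from_b from feature f_b and the
    # remaining chunks from f_a, then concatenate them into one feature.
    chunks_a = f_a.split(CHUNK_DIM, dim=1)
    chunks_b = f_b.split(CHUNK_DIM, dim=1)
    mixed = [chunks_b[i] if i in take_from_b else chunks_a[i]
             for i in range(NUM_CHUNKS)]
    return torch.cat(mixed, dim=1)

# Transfer the attribute represented by chunk 2 of image_b onto image_a.
image_a, image_b = torch.rand(1, 1, 28, 28), torch.rand(1, 1, 28, 28)
f_a, f_b = encoder(image_a), encoder(image_b)
transferred = decoder(mix_chunks(f_a, f_b, take_from_b={2}))

Attribute-based retrieval would analogously compare only the user-specified chunks of two encoded features rather than the full vectors.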