Abstract
Hand image synthesis and pose estimation from RGB images are both highly challenging tasks due to the wide range of factors of variation, from image background content to camera viewpoint. To better analyze these factors of variation, we propose the use of disentangled representations and a disentangled variational autoencoder (dVAE) that allows specific sampling and inference of these factors. The objective derived from the variational lower bound, together with the proposed training strategy, is highly flexible, allowing us to handle cross-modal encoders and decoders as well as semi-supervised learning scenarios. Experiments show that our dVAE can synthesize highly realistic hand images, specifiable by both pose and image background content, and can also estimate 3D hand poses from RGB images with accuracy competitive with the state of the art on two public benchmarks.
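For orientation, the standard single-latent variational lower bound that disentangled variants build on can be written as follows. This is a generic sketch, not the paper's exact dVAE objective; the factor names (pose, background) are illustrative.

```latex
% Standard VAE evidence lower bound (ELBO) for an image x and latent z:
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)

% A disentangled variant typically partitions the latent into factors,
% e.g. z = (z_{\text{pose}}, z_{\text{background}}), with a factorized prior
p(z) = \prod_{k} p(z_k)
% so that each factor can be sampled or inferred independently.
```

Partitioning the latent this way is what permits sampling one factor (e.g. pose) while holding another (e.g. background) fixed.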