Abstract
The human hand moves in complex and high-dimensional ways, making the estimation of 3D hand pose configurations from images alone a challenging task. In this work we propose a method to learn a statistical hand model represented by a cross-modally trained latent space via a generative deep neural network. We derive an objective function from the variational lower bound of the VAE framework and jointly optimize the resulting cross-modal KL divergence and the posterior reconstruction objective, naturally admitting a training regime that leads to a coherent latent space across multiple modalities such as RGB images, 2D keypoint detections, or 3D hand configurations. Additionally, it grants a straightforward way of using semi-supervision. This latent space can be directly used to estimate 3D hand poses from RGB images, outperforming the state of the art in different settings. Furthermore, we show that our proposed method can be used without changes on depth images and performs comparably to specialized methods. Finally, the model is fully generative and can synthesize consistent pairs of hand configurations across modalities. We evaluate our method on both RGB and depth datasets and analyze the latent space qualitatively.