Abstract
Crucial to the success of training a depth-based 3D hand pose
estimator (HPE) is the availability of comprehensive datasets
covering diverse camera perspectives, shapes, and pose variations. However, collecting such annotated datasets is challenging.
We propose to complete existing databases by generating new
database entries. The key idea is to synthesize data in the skeleton
space (instead of the depth-map space), which enables
an easy and intuitive way of manipulating data entries. Since the
skeleton entries generated in this way have no corresponding depth map entries, we exploit them by training a separate hand
pose generator (HPG), which synthesizes the depth map from the
skeleton entries. By training the HPG and HPE in a single unified
optimization framework enforcing that 1) the HPE agrees with the
paired depth and skeleton entries; and 2) the HPG-HPE combination satisfies cyclic consistency (both the input and the output
of HPG-HPE are skeletons) observed on the newly generated unpaired skeletons, our algorithm constructs an HPE that is robust
to variations that go beyond the coverage of the existing database.
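
The two constraints above can be illustrated with a minimal sketch. It is not the authors' implementation: the squared-error distance, the `cycle_weight` parameter, and all function names are assumptions made for illustration, and the adversarial (GAN) term is omitted. Here `hpe` stands for any callable mapping a depth map to a skeleton, and `hpg` for any callable mapping a skeleton to a depth map, both represented as flat lists of floats.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def combined_loss(
    hpe: Callable[[Vector], Vector],      # depth map -> skeleton (hypothetical stand-in)
    hpg: Callable[[Vector], Vector],      # skeleton -> depth map (hypothetical stand-in)
    paired: Sequence[Tuple[Vector, Vector]],  # (depth map, skeleton) pairs from the database
    unpaired_skeletons: Sequence[Vector],     # newly synthesized skeletons without depth maps
    cycle_weight: float = 1.0,                # assumed trade-off weight, not from the source
) -> float:
    # Constraint 1: the HPE output should agree with the paired skeleton.
    paired_term = sum(squared_error(hpe(d), s) for d, s in paired)
    paired_term /= max(len(paired), 1)

    # Constraint 2: skeleton -> HPG -> HPE should return the same skeleton.
    cycle_term = sum(squared_error(hpe(hpg(s)), s) for s in unpaired_skeletons)
    cycle_term /= max(len(unpaired_skeletons), 1)

    return paired_term + cycle_weight * cycle_term


# Toy stand-ins: the "depth map" is just the skeleton scaled by 2.
toy_hpg = lambda s: [2.0 * v for v in s]
good_hpe = lambda d: [0.5 * v for v in d]   # exact inverse of toy_hpg
bad_hpe = lambda d: list(d)                 # ignores the scaling

paired_data = [([2.0, 4.0], [1.0, 2.0])]
new_skeletons = [[3.0, 5.0]]

loss_good = combined_loss(good_hpe, toy_hpg, paired_data, new_skeletons)
loss_bad = combined_loss(bad_hpe, toy_hpg, paired_data, new_skeletons)
```

In this toy setting the estimator that inverts the generator incurs zero loss on both terms, while the mismatched estimator is penalized by the paired term and by the cycle term, even though the new skeletons have no depth maps of their own.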
Our training algorithm adopts the generative adversarial
network (GAN) training process. As a by-product, we obtain
a hand pose discriminator (HPD) that is capable of picking out
realistic hand poses. Our algorithm exploits this capability to
refine the initial skeleton estimates at test time, further improving
the accuracy. We test our algorithm on four challenging
benchmark datasets (ICVL, MSRA, NYU, and BigHand2.2M)
and demonstrate that our approach outperforms or is on
par with state-of-the-art methods quantitatively and qualitatively.
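
The abstract does not spell out how the HPD is used to refine skeleton estimates at test time. The sketch below shows one plausible reading, offered purely as an assumption: gradient ascent on a realism score, with a penalty that keeps the result close to the initial estimate. The function names, the finite-difference gradient, and the `stay_close` weight are all hypothetical, and `realism_score` is a stand-in for a trained discriminator.

```python
from typing import Callable, List, Sequence


def numerical_gradient(
    f: Callable[[List[float]], float], x: Sequence[float], eps: float = 1e-5
) -> List[float]:
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def refine_skeleton(
    initial: Sequence[float],
    realism_score: Callable[[List[float]], float],  # stand-in for the HPD output
    steps: int = 50,
    step_size: float = 0.1,
    stay_close: float = 0.1,  # assumed penalty weight, not from the source
) -> List[float]:
    """Move the skeleton toward higher realism while staying near the initial estimate."""

    def objective(s: List[float]) -> float:
        drift = sum((a - b) ** 2 for a, b in zip(s, initial))
        return realism_score(s) - stay_close * drift

    current = list(initial)
    for _ in range(steps):
        grad = numerical_gradient(objective, current)
        # Gradient ascent step on the penalized realism objective.
        current = [c + step_size * g for c, g in zip(current, grad)]
    return current


# Toy discriminator: poses are most "realistic" when every coordinate equals 1.
toy_score = lambda s: -sum((v - 1.0) ** 2 for v in s)

initial_estimate = [0.0, 0.0]
refined = refine_skeleton(initial_estimate, toy_score)
```

With this toy score the refined skeleton moves most of the way toward the high-realism pose, and the proximity penalty stops it slightly short, which reflects the intended trade-off between trusting the initial estimate and trusting the discriminator.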