Abstract
We present a self-supervision method for 3D hand pose estimation from depth maps. We begin with a neural network initialized with synthesized data and fine-tune it on real but unlabelled depth maps by minimizing a set of data-fitting terms. By approximating the hand surface with a set of spheres, we design a differentiable hand renderer that aligns estimates by comparing the rendered and input depth maps. In addition, we impose a set of priors, including a data-driven term, to further regularize the kinematic feasibility of the estimates. Our method produces highly accurate estimates comparable to those of current supervised methods, which require large amounts of labelled training samples, thereby advancing the state of the art in unsupervised learning for hand pose estimation.