Abstract. Estimating the 3D pose of a hand is an essential part of
human-computer interaction. Estimating 3D pose using depth or multiview sensors has become easier with recent advances in computer vision.
However, regressing pose from a single RGB image is much less straightforward. The main difficulty is that 3D pose requires
some form of depth estimate, which is ambiguous given only an RGB
image. In this paper we propose a new method for 3D hand pose estimation from a monocular image through a novel 2.5D pose representation.
Our new representation determines pose up to a scaling factor, which can
additionally be estimated if a prior on the hand size is given. We implicitly learn depth maps and heatmap distributions with a novel CNN
architecture. Our system achieves state-of-the-art accuracy for 2D and
3D hand pose estimation on several challenging datasets in the presence of
severe occlusions.
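
To make the idea of a pose recovered "up to a scaling factor" concrete, the following is a minimal sketch of one common reading of a 2.5D representation: 2D pixel coordinates plus root-relative, scale-normalized depths. It is an illustration under stated assumptions, not the architecture or the exact formulation proposed here. The pinhole intrinsics `K`, the known normalized root depth `z_root`, and the function names `backproject_2p5d` and `apply_hand_size` are all hypothetical.

```python
import numpy as np

def backproject_2p5d(uv, z_rel, K, z_root):
    """Lift 2.5D keypoints to scale-normalized 3D (hypothetical helper).

    uv     : (N, 2) pixel coordinates of the keypoints
    z_rel  : (N,) depths relative to the root joint, scale-normalized
    K      : (3, 3) pinhole camera intrinsics (assumed known)
    z_root : scalar, scale-normalized depth of the root joint (assumed given)
    """
    uv = np.asarray(uv, dtype=float)
    # Absolute (still scale-normalized) depth of every keypoint.
    z = z_root + np.asarray(z_rel, dtype=float)
    # Homogeneous pixel coordinates, mapped to viewing rays K^{-1} [u, v, 1]^T.
    ones = np.ones((uv.shape[0], 1))
    rays = np.hstack([uv, ones]) @ np.linalg.inv(K).T
    # Scale each ray by its depth to obtain 3D points.
    return rays * z[:, None]

def apply_hand_size(xyz_norm, hand_size):
    """Resolve the remaining scale ambiguity with a hand-size prior."""
    return xyz_norm * hand_size
```

In this sketch the normalized pose is invariant to the true hand size; only the final multiplication by `hand_size` (for example, a reference bone length in metres) yields metric coordinates.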