Abstract. Learning and predicting the pose parameters of a 3D hand model, such as the locations of hand joints, from an input image is challenging due to large viewpoint changes and articulations, and to the severe self-occlusions exhibited particularly in egocentric views. Both feature learning and prediction modeling have been investigated to tackle the problem. Though effective, most existing discriminative methods yield a single deterministic estimate of the target pose. Because this mapping is intrinsically single-valued, they fail to adequately handle self-occlusion, where occluded joints exhibit multiple modes. In this paper, we tackle the self-occlusion issue and provide a complete description of the observed pose given an input depth image via a novel method called hierarchical mixture density networks (HMDN). The proposed method leverages state-of-the-art hand pose estimators based on Convolutional Neural Networks to facilitate feature learning, while it models the multiple modes in a two-level hierarchy to reconcile single-valued and multi-valued mapping in its output. The whole framework, built on a mixture of two differentiable density functions, is naturally end-to-end trainable. In the experiments, HMDN produces interpretable and diverse candidate samples, significantly outperforms the state-of-the-art methods on two benchmarks that exhibit occlusions, and performs comparably on a third benchmark free of occlusions.
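The abstract does not fix an implementation, but the core idea it describes, replacing a single deterministic regression output with a learned differentiable density, can be illustrated with a minimal mixture-density-network head in the spirit of Bishop's MDNs. The sketch below is a generic, hypothetical PyTorch example (the names MDNHead and mdn_nll are ours, not the authors'): it predicts mixture weights, means, and scales from CNN features and is trained end to end by minimizing the negative log-likelihood. HMDN's two-level hierarchy additionally reconciles single-valued and multi-valued mapping per joint, which is omitted here.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    """Maps CNN features to a K-component isotropic Gaussian mixture
    over a D-dimensional target (e.g. a 3D joint location)."""
    def __init__(self, feat_dim, n_components, out_dim=3):
        super().__init__()
        self.K, self.D = n_components, out_dim
        self.pi = nn.Linear(feat_dim, n_components)            # mixture weights
        self.mu = nn.Linear(feat_dim, n_components * out_dim)  # component means
        self.log_sigma = nn.Linear(feat_dim, n_components)     # per-component scales

    def forward(self, feat):
        log_pi = F.log_softmax(self.pi(feat), dim=-1)          # (B, K)
        mu = self.mu(feat).view(-1, self.K, self.D)            # (B, K, D)
        log_sigma = self.log_sigma(feat)                       # (B, K)
        return log_pi, mu, log_sigma

def mdn_nll(log_pi, mu, log_sigma, target):
    """Negative log-likelihood of target (B, D) under the predicted
    mixture; differentiable, so the whole network trains end to end."""
    d2 = ((target.unsqueeze(1) - mu) ** 2).sum(-1)             # (B, K)
    D = mu.size(-1)
    log_comp = (-0.5 * d2 * torch.exp(-2.0 * log_sigma)
                - D * log_sigma
                - 0.5 * D * math.log(2.0 * math.pi))
    return -torch.logsumexp(log_pi + log_comp, dim=-1).mean()
```

At test time, diverse candidate locations for an occluded joint could be drawn by first sampling a component index from log_pi and then sampling from the corresponding Gaussian, which is one way the "interpretable and diverse candidate samples" mentioned above can be surfaced from a density output.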