Abstract. Convolutional Neural Network (CNN)-based methods for
3D hand pose estimation with depth cameras usually take 2D depth images as input and directly regress the holistic 3D hand pose. Different from
these methods, our proposed Point-to-Point Regression PointNet directly takes the 3D point cloud as input and outputs point-wise estimations,
i.e., heat-maps and unit vector fields on the point cloud, representing
the closeness and direction from every point in the point cloud to the
hand joints. The point-wise estimations are used to infer 3D joint locations with weighted fusion. To better capture 3D spatial information in
the point cloud, we apply a stacked network architecture for PointNet
with intermediate supervision, which is trained end-to-end. Experiments
show that our method achieves superior results compared
with state-of-the-art methods on three challenging hand pose datasets.
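The weighted-fusion step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes each point's heat-map value encodes closeness to a joint as h = 1 - d / radius (so distance can be recovered as d = radius * (1 - h)), and that each point votes for the joint location along its predicted unit vector; the function name, `radius`, and `top_k` are illustrative assumptions.

```python
import numpy as np

def infer_joint_locations(points, heatmaps, unit_vecs, radius=0.08, top_k=64):
    """Weighted fusion of point-wise estimations into 3D joint locations.

    points:    (N, 3) input point cloud
    heatmaps:  (J, N) per-joint closeness values in [0, 1]
    unit_vecs: (J, N, 3) per-joint unit vectors from each point toward the joint
    radius, top_k: illustrative hyper-parameters (assumptions, not from the paper)
    """
    num_joints = heatmaps.shape[0]
    joints = np.zeros((num_joints, 3))
    for j in range(num_joints):
        # Keep only the top_k points deemed closest to joint j.
        idx = np.argsort(heatmaps[j])[-top_k:]
        w = heatmaps[j, idx]
        # Assumed encoding h = 1 - d / radius, hence d = radius * (1 - h).
        d = radius * (1.0 - w)
        # Each point votes for the joint at: point + distance * unit_vector.
        votes = points[idx] + d[:, None] * unit_vecs[j, idx]
        # Heat-map values act as fusion weights.
        joints[j] = (w[:, None] * votes).sum(axis=0) / (w.sum() + 1e-8)
    return joints
```

With exact point-wise estimations, each selected point's vote lands on the true joint, so the weighted fusion recovers it; with noisy network outputs, the heat-map weighting down-weights unreliable votes from distant points.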