Pushing the Envelope for RGB-based Dense 3D Hand Pose Estimation via Neural Rendering
Abstract
Estimating 3D hand meshes from single RGB images is challenging due to intrinsic 2D-3D mapping ambiguities and limited training data.
and limited training data. We adopt a compact parametric
3D hand model that represents deformable and articulated
hand meshes. To achieve the model fitting to RGB images,
we investigate and contribute in three ways: 1) Neural
rendering: inspired by recent work on human body, our
hand mesh estimator (HME) is implemented by a neural
network and a differentiable renderer, supervised by 2D
segmentation masks and 3D skeletons. HME demonstrates
good performance in estimating diverse hand shapes and improves pose estimation accuracy. 2) Iterative test-time refinement: our fitting function is differentiable, so we iteratively refine the initial estimate using its gradients, in the spirit of iterative model fitting methods such as ICP (sketched below). The idea is supported by recent research on the human body.
3) Self-data augmentation: collecting sizable sets of RGB-mesh (or segmentation mask)-skeleton triplets for training is a major hurdle. Once the model has been successfully fitted to input RGB images, its meshes, i.e. their shapes and articulations, are realistic, and we augment viewpoints on top of the estimated dense hand poses (also sketched below). Experiments using three RGB-based benchmarks
show that our framework achieves accuracy beyond the state of the art in 3D pose estimation while also recovering dense 3D hand shapes. In the ablation study, each of the technical components above meaningfully improves accuracy.
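To make contributions 1) and 2) concrete, the following is a minimal PyTorch-style sketch of gradient-based test-time refinement under stated assumptions: hand_model, render_silhouette, and project are hypothetical stand-ins for the parametric hand model, the differentiable renderer, and the camera projection, and the step count and loss weights are illustrative placeholders, not the paper's settings.

import torch
import torch.nn.functional as F

def refine(params_init, target_mask, target_joints_2d,
           hand_model, render_silhouette, project,
           steps=50, lr=1e-2, w_mask=1.0, w_joint=10.0):
    # Refine only the hand-model parameters; network weights stay frozen.
    params = params_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        verts, joints_3d = hand_model(params)      # mesh vertices + 3D skeleton
        mask = render_silhouette(verts)            # differentiable rendering
        joints_2d = project(joints_3d)             # camera projection
        # 2D segmentation-mask loss + skeleton reprojection loss,
        # the two supervision signals named in the abstract.
        loss = (w_mask * F.binary_cross_entropy(mask.clamp(1e-6, 1 - 1e-6),
                                                target_mask)
                + w_joint * F.mse_loss(joints_2d, target_joints_2d))
        loss.backward()                            # gradients of the fitting losses
        opt.step()                                 # one refinement step
    return params.detach()

As with ICP, the feed-forward network output serves only as an initialization; refinement descends the same differentiable fitting losses, with no weight updates.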
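Contribution 3) can likewise be illustrated with a short sketch. The rotation axis, angle range, and centroid-centering below are assumptions for illustration only: rotating a fitted mesh and its skeleton, then re-rendering with the renderer above, yields new mask-skeleton training pairs from unseen viewpoints.

import math
import random
import torch

def augment_viewpoint(verts, joints_3d, max_angle=0.5):
    # Random rotation about the y-axis (an assumed choice) applied to a
    # fitted mesh and skeleton; re-rendering the rotated mesh produces an
    # extra segmentation-mask/skeleton pair from a new viewpoint.
    theta = random.uniform(-max_angle, max_angle)
    c, s = math.cos(theta), math.sin(theta)
    R = torch.tensor([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]], dtype=verts.dtype)
    center = verts.mean(dim=0, keepdim=True)       # rotate about the hand centroid
    verts_aug = (verts - center) @ R.T + center
    joints_aug = (joints_3d - center) @ R.T + center
    return verts_aug, joints_aug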