Abstract
We present a simple and effective method for 3D hand
pose estimation from a single depth frame. As opposed to
previous state-of-the-art methods based on holistic 3D regression, our method works on dense pixel-wise estimation.
This is achieved by careful design choices in pose parameterization, which leverages both 2D and 3D properties of
the depth map. Specifically, we decompose the pose parameters
into a set of per-pixel estimations, i.e., 2D heat maps, 3D
heat maps and unit 3D directional vector fields. The 2D/3D
joint heat maps and 3D joint offsets are estimated via multi-task network cascades trained end-to-end. The
pixel-wise estimations can be directly translated into a vote
casting scheme. A variant of mean shift is then used to aggregate local votes while, by design, enforcing consensus between
the estimated 3D pose and the pixel-wise 2D and 3D estimations. Our method is efficient and highly accurate.
On the MSRA and NYU hand datasets, our method outperforms
all previous state-of-the-art approaches by a large margin.
On the ICVL hand dataset, our method achieves accuracy comparable to the nearly saturated result obtained
by [5] and outperforms various other proposed methods.
Code is available online.
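
To make the vote-casting and mean-shift aggregation step concrete, the following is a minimal, hypothetical sketch (not the authors' released code): each foreground pixel casts a 3D vote for a joint by adding its predicted offset to its own 3D point, votes are weighted by heat-map confidence, and a Gaussian-kernel mean shift refines the weighted mean. The function name, bandwidth value, and toy data below are assumptions made purely for illustration.

# Hypothetical illustration of per-pixel vote aggregation via mean shift.
import numpy as np

def aggregate_votes(points_3d, offsets, weights, bandwidth=0.05, iters=10):
    """Estimate a single joint location from per-pixel 3D votes.

    points_3d : (N, 3) camera-space 3D points of the foreground depth pixels
    offsets   : (N, 3) predicted offset of each pixel toward the joint
                (e.g., unit direction scaled by a predicted distance)
    weights   : (N,)  per-pixel confidence, e.g., from the 2D/3D heat maps
    """
    votes = points_3d + offsets                             # each pixel's 3D vote
    estimate = np.average(votes, axis=0, weights=weights)   # weighted initialization
    for _ in range(iters):
        d2 = np.sum((votes - estimate) ** 2, axis=1)
        k = weights * np.exp(-d2 / (2.0 * bandwidth ** 2))  # Gaussian kernel weights
        if k.sum() < 1e-12:
            break
        estimate = (k[:, None] * votes).sum(axis=0) / k.sum()  # mean-shift update
    return estimate

# Toy usage with random data (units in meters).
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3)) * 0.02
offs = rng.normal(size=(500, 3)) * 0.01
w = rng.random(500)
print(aggregate_votes(pts, offs, w))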