In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations
Abstract
Convolutional Neural Network based approaches for
monocular 3D human pose estimation usually require a
large amount of training images with 3D pose annotations.
While it is feasible to provide 2D joint annotations for large
corpora of in-the-wild images of humans, providing accurate 3D annotations for such corpora is hardly
feasible in practice. Most existing 3D labelled data sets are
either synthetically created or feature in-studio images. 3D
pose estimation algorithms trained on such data often have
limited ability to generalize to real world scene diversity.
We therefore propose a new deep learning based method for
monocular 3D human pose estimation that achieves high accuracy and generalizes better to in-the-wild scenes. Its
network architecture comprises a new disentangled
hidden-space encoding of explicit 2D and 3D features and
is supervised through a new learned projection model applied to
the predicted 3D pose. Our algorithm can be jointly trained
on image data with 3D labels and image data with only 2D
labels. It achieves state-of-the-art accuracy on challenging