Abstract
This work addresses the problem of estimating the full-body
3D human pose and shape from a single color image. This is a task where iterative optimization-based solutions have typically prevailed, while Convolutional Networks (ConvNets) have suffered because of the lack of training data and their low-resolution 3D predictions. Our work
aims to bridge this gap and proposes an efficient and effective direct prediction method based on ConvNets. A central
part of our approach is the incorporation of a parametric
statistical body shape model (SMPL) within our end-to-end
framework. This allows us to get very detailed 3D mesh
results while requiring estimation of only a small number
of parameters, which makes the problem well suited to direct network prediction. Interestingly, we demonstrate that these parameters can be predicted reliably from 2D keypoints and
masks alone. These are typical outputs of generic 2D human analysis ConvNets, allowing us to relax the demanding requirement
that images with 3D shape ground truth are available for
training. Simultaneously, by maintaining differentiability,
at training time we generate the 3D mesh from the estimated
parameters and optimize explicitly for the surface using a
3D per-vertex loss. Finally, a differentiable renderer is employed to project the 3D mesh to the image, which enables
further refinement of the network by optimizing for the consistency of the projection with 2D annotations (i.e., 2D keypoints or masks). The proposed approach outperforms previous baselines on this task and offers an attractive solution
for direct prediction of 3D shape from a single color image.
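
The two training signals named above, a 3D per-vertex loss on the generated mesh and a consistency term between the projected mesh and 2D annotations, can be illustrated with a minimal NumPy sketch. This is not the implementation used in this work: the function names, the squared-error form of both losses, and the scaled orthographic camera standing in for the differentiable renderer are all assumptions made for illustration, and only the keypoint case of the 2D term is shown.

```python
import numpy as np


def per_vertex_loss(pred_vertices, gt_vertices):
    """Mean squared Euclidean distance between corresponding mesh vertices.

    Both inputs have shape (N, 3). The squared-error form is an assumption.
    """
    diff = pred_vertices - gt_vertices
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def project_orthographic(points_3d, scale, translation):
    """Scaled orthographic projection: drop depth, then scale and shift.

    Hypothetical stand-in for a differentiable renderer's projection step.
    """
    return scale * points_3d[:, :2] + translation


def keypoint_reprojection_loss(joints_3d, keypoints_2d, visibility,
                               scale, translation):
    """Mean squared 2D error over the annotated (visible) keypoints only."""
    projected = project_orthographic(joints_3d, scale, translation)
    sq_err = np.sum((projected - keypoints_2d) ** 2, axis=1)
    n_visible = max(float(np.sum(visibility)), 1.0)  # avoid division by zero
    return float(np.sum(visibility * sq_err) / n_visible)


# Toy inputs: a 4-vertex "mesh" and 2 joints, purely illustrative.
pred = np.zeros((4, 3))
gt = np.ones((4, 3))
loss_3d = per_vertex_loss(pred, gt)

joints = np.array([[1.0, 2.0, 5.0], [0.0, 0.0, 1.0]])
keypoints = np.array([[3.0, 5.0], [0.0, 0.0]])
visible = np.array([1.0, 1.0])
loss_2d = keypoint_reprojection_loss(
    joints, keypoints, visible, scale=2.0, translation=np.array([1.0, 1.0])
)
```

In an actual training setup both terms would be written in an automatic-differentiation framework so that gradients flow back through the mesh generation to the predicted parameters; the NumPy version only shows what is being measured.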