Abstract. Human shape estimation is an important task for video editing, animation, and the fashion industry. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing, and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with priors on pose and shape. In this work we argue for an
alternative representation and propose BodyNet, a neural network for
direct inference of volumetric body shape from a single image. BodyNet
is an end-to-end trainable network that benefits from (i) a volumetric
3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body-part segmentation, and 3D pose. Our experiments demonstrate that each of these components improves performance. To evaluate the method, we fit the SMPL model to our network
output and show state-of-the-art results on the SURREAL and Unite
the People datasets, outperforming recent approaches. Besides state-of-the-art performance, our method also enables volumetric body-part segmentation.
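The multi-view re-projection loss mentioned above can be illustrated with a minimal sketch: an occupancy voxel grid is collapsed onto 2D views (here via a max along one axis, an assumption about the projection) and compared against binary silhouettes with a cross-entropy loss. The function name, grid shapes, and the choice of front/side views are illustrative, not the paper's exact implementation.

```python
import numpy as np

def reprojection_loss(voxels, sil_front, sil_side, eps=1e-7):
    """Sketch of a two-view re-projection loss on an occupancy grid.

    voxels: (W, H, D) array of predicted occupancy probabilities in [0, 1].
    sil_front, sil_side: binary target silhouettes for the two views.
    """
    # Orthographic-style projection: max over the collapsed axis gives
    # the probability that any voxel along that ray is occupied.
    proj_front = voxels.max(axis=2)  # collapse depth -> (W, H) view
    proj_side = voxels.max(axis=0)   # collapse width -> (H, D) view

    def bce(pred, target):
        # Binary cross-entropy, clipped for numerical stability.
        pred = np.clip(pred, eps, 1.0 - eps)
        return -(target * np.log(pred)
                 + (1.0 - target) * np.log(1.0 - pred)).mean()

    return bce(proj_front, sil_front) + bce(proj_side, sil_side)
```

A perfect prediction yields a near-zero loss, while occupancy leaking outside the target silhouettes is penalized, which is the intuition behind supervising a 3D volume with cheaper 2D annotations.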