Abstract. We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views.
We train a symmetric convolutional autoencoder with a dual loss that
enforces learning of a latent representation that encodes skeletal joint
positions, and at the same time learns a deep representation of volumetric
body shape. We harness the latter to up-scale input volumetric data by a
factor of 4×, whilst recovering a 3D estimate of joint positions with equal
or greater accuracy than the state of the art. Inference runs in real-time
(25 fps) and has the potential for passive human behaviour monitoring
where there is a requirement for high fidelity estimation of human body
shape and pose.