Abstract
Human sensing has greatly benefited from recent advances in deep learning, parametric human modeling, and large-scale 2d and 3d datasets. However, existing 3d models make strong assumptions about the scene, considering either a single person per image, a full view of the person, a simple background, or multiple cameras. In this paper, we
leverage state-of-the-art deep multi-task neural networks and parametric human and scene modeling to build a fully automatic monocular visual sensing system for multiple interacting people, which (i) infers the 2d and 3d pose and
shape of multiple people from a single image, relying on detailed semantic representations at both the model and image levels to guide a combined optimization with feedforward and feedback components, (ii) automatically integrates scene
constraints, including ground-plane support and the avoidance of simultaneous volume occupancy by multiple people, and (iii) extends the single-image model to video by optimally solving the temporal person assignment problem and imposing temporally coherent pose and motion reconstructions while preserving image alignment fidelity. We perform experiments on
both single- and multi-person datasets, and systematically evaluate each component of the model, showing improved performance and extensive multiple-human sensing capability. We also apply our method to images with multiple
people, severe occlusions, and diverse backgrounds captured in challenging natural scenes, and obtain results of good perceptual quality.
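
To make the temporal person assignment step concrete, the following is a minimal sketch that assumes the assignment reduces to minimum-cost bipartite matching between per-frame detections, solved optimally with the Hungarian algorithm (here via scipy.optimize.linear_sum_assignment). The function name assign_people, the mean joint-distance cost, and the max_cost threshold are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch: link people across consecutive frames by
# minimum-cost bipartite matching. The cost below is a simple mean
# Euclidean distance between 3d joint sets; the paper's actual cost
# and solver may differ.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_people(prev_poses, curr_poses, max_cost=1.0):
    """Match people between frames t-1 and t.

    prev_poses: array of shape (P, J, 3) with 3d joints of P tracked people.
    curr_poses: array of shape (Q, J, 3) with 3d joints of Q detections.
    Returns a list of (prev_idx, curr_idx) pairs; unmatched detections
    would start new tracks.
    """
    P, Q = len(prev_poses), len(curr_poses)
    cost = np.zeros((P, Q))
    for i in range(P):
        for j in range(Q):
            # Mean per-joint 3d distance as an illustrative matching cost.
            cost[i, j] = np.linalg.norm(
                prev_poses[i] - curr_poses[j], axis=-1).mean()
    # Optimal assignment via the Hungarian algorithm.
    rows, cols = linear_sum_assignment(cost)
    # Reject matches whose cost exceeds the threshold (likely a new person).
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]
```

Under these assumptions, the returned index pairs link each prior track to its most plausible current detection, which is what allows coherent pose and motion reconstructions to be imposed per person over time.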