Abstract
We propose a deep multitask architecture for fully automatic 2d and 3d human sensing (DMHS), including recognition and reconstruction, in monocular images. The system computes the figure-ground segmentation, semantically
identifies the human body parts at pixel level, and estimates
the 2d and 3d pose of the person. The model supports
the joint training of all components by means of multitask
losses where early processing stages recursively feed into
advanced ones for increasingly complex computations, improving accuracy and robustness. The design allows us to assemble a complete
training protocol by taking advantage of multiple datasets
that would otherwise restrictively cover only some of the
model components: complex 2d image data with no body
part labeling and without associated 3d ground truth, or
complex 3d data with limited 2d background variability. In
detailed experiments based on several challenging 2d and
3d datasets (LSP, HumanEva, Human3.6M), we evaluate
the sub-structures of the model, the effect of various types
of training data in the multitask loss, and demonstrate that
state-of-the-art results can be achieved at all processing
levels. We also show that in the wild our monocular RGB
architecture is perceptually competitive with a state-of-the-art
(commercial) Kinect system based on RGB-D data.
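To make the multitask training objective concrete, the sketch below shows one plausible way to combine per-task losses over the four outputs (figure-ground segmentation, body-part labeling, 2d pose, 3d pose) while skipping tasks a given dataset does not annotate. The task names, loss choices, and PyTorch framing are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(preds, targets, weights=None):
    """Weighted sum of per-task losses; tasks absent from a dataset's
    annotations are skipped, so partially labeled datasets can be mixed."""
    # Hypothetical task names and standard per-task losses -- illustrative only.
    task_losses = {
        "segmentation": F.binary_cross_entropy_with_logits,  # figure-ground mask
        "body_parts":   F.cross_entropy,                      # per-pixel part labels
        "pose_2d":      F.mse_loss,                           # 2d joint estimates
        "pose_3d":      F.mse_loss,                           # 3d joint positions
    }
    weights = weights or {task: 1.0 for task in task_losses}
    total = torch.zeros(())
    for task, loss_fn in task_losses.items():
        if task in targets:  # this dataset may lack the annotation for this task
            total = total + weights[task] * loss_fn(preds[task], targets[task])
    return total
```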