Abstract
Learning general image representations has proven key
to the success of many computer vision tasks. For example,
many approaches to image understanding problems rely
on deep networks that were initially trained on ImageNet,
mostly because the learned features are a valuable starting
point to learn from limited labeled data. However, when it
comes to 3D motion capture of multiple people, these features are only of limited use.
In this paper, we therefore propose an approach to learning features that are useful for this purpose. To this end, we
introduce a self-supervised approach to learning what we
call a neural scene decomposition (NSD) that can be exploited for 3D pose estimation. NSD comprises three layers
of abstraction to represent human subjects: spatial layout
in terms of bounding boxes and relative depth; a 2D shape
representation in terms of an instance segmentation mask;
and subject-specific appearance and 3D pose information.
By exploiting self-supervision coming from multiview data,
our NSD model can be trained end-to-end without any 2D
or 3D supervision. In contrast to previous approaches,
it works for multiple persons and full-frame images. Because it encodes 3D geometry, NSD can then be effectively
leveraged to train a 3D pose estimation network from small
amounts of annotated data. Our code and newly introduced
boxing dataset are available at github.com and cvlab.epfl.ch.