Abstract. We present a deep learning-based volumetric approach for
performance capture using a passive and highly sparse multi-view capture
system. State-of-the-art performance capture systems require either pre-scanned actors, a large number of cameras, or active sensors. In this work,
we focus on the task of template-free, per-frame 3D surface reconstruction
from as few as three RGB sensors, for which conventional visual hull or
multi-view stereo methods fail to generate plausible results. We introduce
a novel multi-view Convolutional Neural Network (CNN) that maps 2D
images to a 3D volumetric field and use this field to encode the
probability distribution of surface points of the captured subject. By
querying the resulting field, we can instantiate the clothed human body
at arbitrary resolutions. Our approach scales to different numbers of
input images, yielding higher reconstruction quality as more
views are used. Although trained only on synthetic data, our network can
generalize to handle real footage from body performance capture. Our
method is suitable for high-quality, low-cost, full-body volumetric capture
solutions, which are gaining popularity for VR and AR content creation.
Experimental results demonstrate that our method is significantly more
robust and accurate than existing techniques when only very sparse views
are available.
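To make the idea concrete, the following is a minimal, hypothetical sketch in PyTorch, not the paper's actual architecture: it illustrates the general pattern of a shared per-view 2D encoder, order-invariant pooling over an arbitrary number of views, and a 3D decoder that emits a per-voxel surface probability field that can be queried at a chosen threshold. All layer sizes and the mean-pooling choice are illustrative assumptions.

```python
# Illustrative sketch only -- assumed architecture, not the authors' network.
import torch
import torch.nn as nn

class SparseViewVolumeNet(nn.Module):
    def __init__(self, feat=64):
        super().__init__()
        # Shared 2D encoder applied to every input view independently.
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, feat, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                    # one feature vector per view
        )
        self.fc = nn.Linear(feat, feat * 4 * 4 * 4)     # lift pooled code to a coarse 3D grid
        # 3D decoder: coarse 4^3 grid -> 32^3 surface-probability volume.
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(feat, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose3d(32, 16, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, views):                           # views: (B, N, 3, H, W)
        B, N = views.shape[:2]
        f = self.enc(views.flatten(0, 1)).view(B, N, -1)
        f = f.mean(dim=1)                               # mean pooling handles any number of views
        vol = self.fc(f).view(B, -1, 4, 4, 4)
        return self.dec(vol)                            # (B, 1, 32, 32, 32) surface probabilities

# Querying the field: e.g. three 128x128 RGB views -> voxels above a probability threshold.
net = SparseViewVolumeNet()
probs = net(torch.rand(1, 3, 3, 128, 128))
surface_voxels = (probs > 0.5).nonzero()
```

Pooling across views before decoding is what lets such a sketch accept a variable number of input images; the actual method's way of fusing views and sampling the field at arbitrary resolution is described in the body of the paper.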