Abstract
We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos. Unlike previous algorithms that operate on
single frames, we show that reconstructing a person over an
entire sequence gives extra constraints that can resolve ambiguities. This is because videos often give multiple views
of a person, yet the overall body shape does not change and
3D positions vary slowly. Our method improves not only on
standard mocap-based datasets like Human3.6M – where
we show quantitative improvements – but also on challenging in-the-wild datasets such as Kinetics. Building upon our
algorithm, we present a new dataset of more than 3 million
frames of YouTube videos from Kinetics with automatically
generated 3D poses and meshes. By evaluating on the 3DPW and HumanEva
datasets, we show that retraining a single-frame 3D pose estimator on this data
improves accuracy on both real-world and mocap data.
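
To illustrate why sequence-level reconstruction can resolve single-frame ambiguities, the following is a minimal sketch, not the objective used in this work. It assumes a simple pinhole camera and a cost made of a 2D reprojection term plus a penalty on frame-to-frame 3D joint motion; the names `project`, `sequence_cost`, and `smooth_weight` are hypothetical. The constant-body-shape constraint is omitted for brevity.

```python
import numpy as np

def project(joints_3d, focal=1.0):
    """Pinhole projection of (T, J, 3) camera-space joints to (T, J, 2)."""
    return focal * joints_3d[..., :2] / joints_3d[..., 2:3]

def sequence_cost(joints_3d, keypoints_2d, smooth_weight=1.0):
    """Illustrative cost: reprojection error plus temporal smoothness."""
    # How well the 3D joints explain the observed 2D keypoints in every frame.
    reproj = np.sum((project(joints_3d) - keypoints_2d) ** 2)
    # Penalize large 3D displacement between consecutive frames.
    smooth = np.sum((joints_3d[1:] - joints_3d[:-1]) ** 2)
    return reproj + smooth_weight * smooth

# Synthetic example (assumed data): 5 frames, 4 joints, about 5 units from the camera.
rng = np.random.default_rng(0)
T, J = 5, 4
base = rng.uniform(-1, 1, size=(J, 3)) + np.array([0.0, 0.0, 5.0])
smooth_traj = np.stack([base + 0.01 * t for t in range(T)])  # slow drift
keypoints = project(smooth_traj)

# Scaling one frame along the viewing rays leaves its 2D projection unchanged,
# so a single-frame reprojection term cannot tell the two trajectories apart.
jitter_traj = smooth_traj.copy()
jitter_traj[2] *= 1.5

cost_smooth = sequence_cost(smooth_traj, keypoints)
cost_jitter = sequence_cost(jitter_traj, keypoints)
```

Both trajectories reproduce the 2D keypoints equally well, but the temporal term assigns a much higher cost to the one whose depth jumps in a single frame, which is the kind of extra constraint a sequence provides over an isolated frame.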