Abstract. In this work, we propose a method that combines a single
hand-held camera and a set of Inertial Measurement Units (IMUs) attached to the body limbs to estimate accurate 3D poses in the wild.
This poses many new challenges: the moving camera, heading drift, cluttered background, occlusions and many people visible in the video. We
associate 2D pose detections in each image to the corresponding IMU-equipped persons by solving a novel graph-based optimization problem
that enforces 3D-to-2D coherency within a frame and across long-range
frames. Given the associations, we jointly optimize the pose of a statistical body model, the camera pose and heading drift using a continuous optimization framework. We validated our method on the TotalCapture dataset, which provides video and IMU data synchronized with ground
truth. We obtain an accuracy of 26 mm, which makes the method accurate enough
to serve as a benchmark for image-based 3D pose estimation in the
wild. Using our method, we recorded 3D Poses in the Wild (3DPW),
a new dataset consisting of more than 51,000 frames with accurate
3D pose in challenging sequences, including walking in the city, going
upstairs, having coffee or taking the bus. We make the reconstructed
3D poses, video, IMU and 3D models available for research purposes
at http://virtualhumans.mpi-inf.mpg.de/3DPW.
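
To make the association step concrete, the following is a minimal per-frame sketch, not the graph-based optimization described above. It assumes the IMU-driven 3D joints of each person have already been projected into the image, and it replaces the graph formulation with a brute-force search for the one-to-one assignment that minimizes the mean 2D joint distance. It ignores the long-range temporal coherency term entirely. The function names `mean_joint_distance` and `associate`, and all coordinates, are hypothetical and chosen for illustration only.

```python
import itertools
import math


def mean_joint_distance(projected, detected):
    """Mean Euclidean distance in pixels between two equal-length joint lists."""
    return sum(math.dist(p, d) for p, d in zip(projected, detected)) / len(projected)


def associate(projected_people, detections):
    """Assign each IMU-equipped person to a distinct 2D pose detection.

    Simplified stand-in for the paper's graph-based optimization:
    an exhaustive search over assignments within a single frame.
    Returns (assignment, cost), where assignment[i] is the index of the
    detection matched to person i.
    """
    if len(detections) < len(projected_people):
        raise ValueError("need at least one detection per tracked person")
    best, best_cost = None, math.inf
    candidates = itertools.permutations(range(len(detections)), len(projected_people))
    for perm in candidates:
        cost = sum(
            mean_joint_distance(projected_people[i], detections[j])
            for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


# Two tracked persons with two projected joints each (made-up pixel coordinates).
people = [
    [(100, 100), (100, 150)],
    [(300, 100), (300, 150)],
]
# Three detections: two belong to the tracked persons, one is a bystander.
detections = [
    [(305, 98), (298, 152)],
    [(500, 500), (500, 550)],
    [(102, 101), (99, 149)],
]
assignment, cost = associate(people, detections)
print(assignment)  # (2, 0): person 0 matches detection 2, person 1 matches detection 0
```

The exhaustive search is only practical for a handful of people per frame. With more candidates, a standard assignment solver would be the usual replacement, and enforcing consistency across frames would require the kind of global formulation the abstract refers to.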