Abstract
We propose an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people. Previous approaches typically compute candidate poses in individual frames and then link them in a post-processing step to resolve ambiguities. By contrast, we directly regress from a spatio-temporal volume of bounding boxes to a 3D pose in the central frame. We further show that, for this approach to achieve its full potential, it is essential to compensate for the motion in consecutive frames so that the subject remains centered. This then allows us to effectively overcome ambiguities and improve upon the state of the art by a large margin on the Human3.6m, HumanEva, and KTH Multiview Football 3D human pose estimation benchmarks.
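
To make the notion of a motion-compensated spatio-temporal volume concrete, the following minimal sketch stacks subject-centered crops from consecutive frames. It assumes grayscale frames and that a per-frame subject center is already known. The names `crop_centered` and `build_volume`, the window size, and the synthetic data are illustrative assumptions and do not describe the actual motion-compensation or regression stages, which are omitted here.

```python
import numpy as np

def crop_centered(frame, center, size):
    """Crop a size x size window around center=(row, col), zero-padding at borders."""
    half = size // 2
    padded = np.pad(frame, ((half, half), (half, half)), mode="constant")
    r, c = int(round(center[0])), int(round(center[1]))
    # After padding, original pixel (r, c) sits at (r + half, c + half),
    # so the window that starts at (r, c) in padded coordinates is centered on it.
    return padded[r:r + size, c:c + size]

def build_volume(frames, centers, size):
    """Stack subject-centered crops into a (T, size, size) spatio-temporal volume."""
    return np.stack([crop_centered(f, c, size) for f, c in zip(frames, centers)])

# Synthetic example (assumed data): one bright pixel drifting across 5 frames.
T, H, W, size = 5, 32, 32, 9
frames = np.zeros((T, H, W))
centers = [(10 + t, 5 + 2 * t) for t in range(T)]
for t, (r, c) in enumerate(centers):
    frames[t, r, c] = 1.0

volume = build_volume(frames, centers, size)
print(volume.shape)
```

In this toy setting the moving pixel lands at the same position in every slice of `volume`, which is the property that keeps the subject centered across the temporal window. A regressor for the central frame's 3D pose would take such a volume as input.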