Abstract. Ego-pose estimation, i.e., estimating a person’s 3D pose with
a single wearable camera, has many potential applications in activity
monitoring. For these applications, estimates should be both accurate and physically plausible, yet the latter is often overlooked by existing work. Traditional computer vision-based approaches that rely on temporal
smoothing account only for the kinematics of motion, without
considering the physics underlying its dynamics, which
leads to physically invalid pose estimates. Motivated by this,
we propose a novel control-based approach to model human motion
with physics simulation and use imitation learning to learn a video-conditioned control policy for ego-pose estimation. Our imitation learning framework allows us to perform domain adaptation to transfer our
policy trained on simulation data to real-world data. Our experiments
with real egocentric videos show that our method can estimate both accurate and physically plausible 3D ego-pose sequences without observing
the camera wearer's body.