Abstract
Understanding the camera wearer’s activity is central to
egocentric vision, yet one key facet of that activity is inherently invisible to the camera—the wearer’s body pose.
Prior work focuses on estimating the pose of hands and
arms when they come into view, but this 1) gives an incomplete view of the full body posture, and 2) prevents any pose
estimate at all in many frames, since the hands are only
visible in a fraction of daily life activities. We propose to
infer the “invisible pose” of a person behind the egocentric camera. Given a single video, our efficient learning-based approach returns the full body 3D joint positions for
each frame. Our method exploits cues from the dynamic
motion signatures of the surrounding scene—which change
predictably as a function of body pose—as well as static
scene structures that reveal the viewpoint (e.g., sitting vs.
standing). We further introduce a novel energy minimization scheme to infer the pose sequence. It uses soft predictions of the poses per time instant together with a non-parametric model of human pose dynamics over longer windows. Our method outperforms an array of possible alternatives, including typical deep learning approaches for direct pose regression from images.
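As a rough illustrative sketch (our own notation, not the paper's), the energy minimization described above can be read as balancing a per-frame data term against a dynamics prior over temporal windows:

\[
\hat{P} \;=\; \arg\min_{P = (p_1, \dots, p_T)} \; \sum_{t=1}^{T} E_{\mathrm{data}}(p_t) \;+\; \lambda \sum_{t=w}^{T} E_{\mathrm{dyn}}(p_{t-w+1}, \dots, p_t),
\]

where \(E_{\mathrm{data}}(p_t)\) scores a candidate pose against the soft per-frame predictions, \(E_{\mathrm{dyn}}\) measures how well each length-\(w\) pose window agrees with a non-parametric set of motion exemplars, and the weight \(\lambda\) and window length \(w\) are placeholder parameters.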