Abstract
In this paper, we present a method for estimating articulated human poses in videos. We cast this as an optimization problem defined on body parts with spatio-temporal links between them. The resulting formulation is unfortunately intractable and previous approaches only provide approximate solutions. Although such methods perform well on certain body parts, e.g., head, their performance on lower arms, i.e., elbows and wrists, remains poor. We present a new approximate scheme with two steps dedicated to pose estimation. First, our approach takes into account temporal links with subsequent frames for the less-certain parts, namely elbows and wrists. Second, our method decomposes poses into limbs, generates limb sequences across time, and recomposes poses by mixing these body part sequences. We introduce a new dataset “Poses in the Wild”, which is more challenging than the existing ones, with sequences containing background clutter, occlusions, and severe camera motion. We experimentally compare our method with recent approaches on this new dataset as well as on two other benchmark datasets, and show significant improvement.