Abstract
We present a novel approach to modeling human pose, together with interacting objects, based on compositional models of local visual interactions and their relations. Skeleton models, while flexible enough to capture large articulations, fail to accurately model self-occlusions and interactions. Poselets and Visual Phrases address this limitation, but do so at the expense of requiring a large set of templates. We combine all three approaches with a compositional model that is flexible enough to model detailed articulations but still captures occlusions and object interactions. Unlike much previous work on action classification, we do not assume test images are labeled with a person; instead, we present results for “action detection” in an unlabeled image. Notably, for each detection, our model reports a detailed description including an action label, articulated human pose, object poses, and occlusion flags. We demonstrate that modeling occlusion is crucial for recognizing human-object interactions. We present results on the PASCAL Action Classification challenge showing that our unified model advances the state of the art in detection, action classification, and articulated pose estimation.