Abstract
Human action recognition in videos is a challenging problem with wide applications. State-of-the-art approaches often adopt the popular bag-of-features representation based on isolated local patches or temporal patch trajectories, where motion patterns like object relationships are mostly discarded. This paper proposes a simple representation specifically aimed at the modeling of such motion relationships. We adopt global and local reference points to characterize motion information, so that the final representation can be robust to camera movement. Our approach operates on top of visual codewords derived from local patch trajectories, and therefore does not require accurate foreground-background separation, which is typically a necessary step to model object relationships. Through an extensive experimental evaluation, we show that the proposed representation offers very competitive performance on challenging benchmark datasets, and combining it with the bag-of-features representation leads to substantial improvement. On the Hollywood2, Olympic Sports, and HMDB51 datasets, we obtain 59.5%, 80.6% and 40.7% respectively, which are the best reported results to date.