Abstract
Real-world videos often contain dynamic backgrounds and evolving human activities, especially web videos generated by users in unconstrained scenarios. This paper proposes a new visual representation, namely scene aligned pooling, for the task of event recognition in complex videos. Based on the observation that a video clip is often composed of shots of different scenes, the key idea of scene aligned pooling is to decompose any video features into concurrent scene components, and to construct classification models adaptive to different scenes. Experiments on two large-scale real-world datasets, the TRECVID Multimedia Event Detection 2011 dataset and the Human Motion Recognition Database (HMDB), show that our new visual representation consistently improves various kinds of visual features by a significant margin, including low-level color and texture features, mid-level histograms of local descriptors such as SIFT and space-time interest points, and high-level semantic model features. For example, we improve the state-of-the-art accuracy on the HMDB dataset by 20%.