Abstract
In fine-grained action (object manipulation) recognition, it is important to encode object semantic (contextual) information, i.e., which object is being manipulated and how it is being operated. However, previous methods for action recognition often represent the semantic information in a global and coarse way and therefore cannot cope with fine-grained actions. In this work, we propose a representation and classification pipeline which seamlessly incorporates localized semantic information into every processing step for fine-grained action recognition. In the feature extraction stage, we explore the geometric information between local motion features and the surrounding objects. In the feature encoding stage, we develop a semantic-grouped locality-constrained linear coding (SG-LLC) method that captures the joint distributions between motion and object-in-use information. Finally, we propose a semantic-aware multiple kernel learning framework (SA-MKL) by utilizing the empirical joint distribution between action and object type for more discriminative action classification. Extensive experiments are performed on the large-scale and difficult fine-grained MPII cooking action dataset. The results show that by effectively accumulating localized semantic information into the action representation and classification pipeline, we significantly improve the fine-grained action classification performance over the existing methods.