Abstract
Most video-based action recognition approaches
extract features from the whole video to recognize actions. Cluttered backgrounds and non-action motions limit
the performance of these methods, since they lack explicit modeling of human body movements. Building on recent advances in human pose estimation, this work presents a novel
method to recognize human action as the evolution of pose
estimation maps. Instead of relying on the inaccurate human poses estimated from videos, we observe that pose estimation maps, a byproduct of pose estimation, preserve
richer cues about the human body that benefit action recognition.
Specifically, the evolution of pose estimation maps can be
decomposed into an evolution of heatmaps, i.e., probabilistic maps, and an evolution of estimated 2D human poses,
which denote the changes of body shape and body pose, respectively. Given the sparsity of heatmaps, we
develop spatial rank pooling to aggregate the evolution of
heatmaps as a body shape evolution image. Because the body shape
evolution image does not differentiate body parts, we design
body guided sampling to aggregate the evolution of poses
as a body pose evolution image. The complementary properties of the two types of images are exploited by deep
convolutional neural networks to predict the action label. Experiments on the NTU RGB+D, UTD-MHAD and PennAction
datasets verify the effectiveness of our method, which outperforms most state-of-the-art methods.
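
To make the idea of aggregating a heatmap sequence into a single image concrete, the following is a minimal sketch. It is not the spatial rank pooling proposed in this work, whose details are not given in the abstract. It uses instead the generic closed-form approximation to temporal rank pooling, in which frame t of T frames is weighted by 2t - T - 1. The function name `approximate_rank_pool` and the toy input are hypothetical and serve only as illustration.

```python
import numpy as np


def approximate_rank_pool(heatmaps: np.ndarray) -> np.ndarray:
    """Collapse a (T, H, W) heatmap sequence into one (H, W) image.

    Assumption: this is the generic approximate rank pooling weighting,
    not the paper's spatial rank pooling. Frame t (1-based) receives the
    weight 2t - T - 1, so later frames contribute positively and earlier
    frames negatively.
    """
    if heatmaps.ndim != 3:
        raise ValueError("expected an array of shape (T, H, W)")
    num_frames = heatmaps.shape[0]
    t = np.arange(1, num_frames + 1)
    weights = 2 * t - num_frames - 1
    # Weighted sum over the time axis.
    return np.tensordot(weights, heatmaps, axes=(0, 0))


# Toy sequence: a single activation moving left to right over three frames.
frames = np.zeros((3, 1, 3))
frames[0, 0, 0] = 1.0
frames[1, 0, 1] = 1.0
frames[2, 0, 2] = 1.0

evolution_image = approximate_rank_pool(frames)
```

In the toy sequence the pooled image is negative where the activation started and positive where it ended, so the temporal order of the motion is encoded in a single image that a convolutional network can consume. A static sequence pools to zero because the weights sum to zero.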