Abstract
Most state-of-the-art methods for action recognition rely
on a two-stream architecture that processes appearance and
motion independently. In this paper, we claim that considering them jointly offers rich information for action recognition. We introduce a novel representation that gracefully encodes the movement of a set of semantic keypoints. We use the
human joints as these keypoints and term our Pose moTion
representation PoTion. Specifically, we first run a state-of-the-art human pose estimator [4] and extract heatmaps
for the human joints in each frame. We obtain our PoTion
representation by temporally aggregating these probability
maps. This is achieved by 'colorizing' each of them according to the relative time of its frame within the video clip and summing the results. The resulting fixed-size representation for an entire video clip is suitable for classifying actions with a shallow
convolutional neural network.
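The temporal colorize-and-sum aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-frame joint heatmaps stored as a `(T, J, H, W)` array and piecewise-linear color weights over the relative time of each frame; the function name and input layout are hypothetical.

```python
import numpy as np

def potion_aggregate(heatmaps, n_channels=2):
    """Aggregate per-frame joint heatmaps into a PoTion-style representation.

    heatmaps: array of shape (T, J, H, W) with per-frame probability maps
              for J joints (hypothetical input layout).
    Returns an array of shape (J, n_channels, H, W).
    """
    T = heatmaps.shape[0]
    # Relative time of each frame in [0, 1].
    t = np.linspace(0.0, 1.0, T)
    # 'Colorize': channel k has a triangular weight peaking at time k/(C-1),
    # so for 2 channels early frames weigh (1 - t) and late frames weigh t.
    peaks = np.linspace(0.0, 1.0, n_channels)
    colors = np.stack(
        [np.interp(t, peaks, np.eye(n_channels)[k]) for k in range(n_channels)],
        axis=1,
    )  # shape (T, n_channels); weights sum to 1 for every frame
    # Sum the colorized heatmaps over time: (T,C) x (T,J,H,W) -> (J,C,H,W).
    return np.einsum('tc,tjhw->jchw', colors, heatmaps)
```

Because the color weights form a partition of unity over time, the representation keeps a fixed size regardless of clip length, which is what makes it a convenient input for a shallow classification network.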
Our experimental evaluation shows that PoTion outperforms other state-of-the-art pose representations [6, 48].
Furthermore, it is complementary to standard appearance
and motion streams. When combining PoTion with the
recent two-stream I3D approach [5], we obtain state-of-the-art performance on the JHMDB, HMDB and UCF101
datasets.