Abstract
This paper presents a new method for 3D action recognition with skeleton sequences (i.e., 3D trajectories of human
skeleton joints). The proposed method first transforms each
skeleton sequence into three clips, each consisting of several frames, for spatio-temporal feature learning using deep
neural networks. Each clip is generated from one channel of the cylindrical coordinates of the skeleton sequence.
Each frame of the generated clips represents the temporal information of the entire skeleton sequence and incorporates one particular spatial relationship between the joints. Together, the clips include multiple frames with different spatial relationships, which provide useful spatial structural information about the human skeleton.
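For illustration, the clip generation could be sketched as follows. This is a minimal NumPy sketch that assumes the skeleton sequence is given as an array of shape (T, J, 3) in Cartesian coordinates and uses four hypothetical reference joints to define the spatial relationships; the actual reference joints and frame layout used in the method may differ.

```python
import numpy as np

def skeleton_to_clips(skeleton, reference_joints=(0, 1, 2, 3)):
    """Sketch of the clip generation described above.

    skeleton:         array of shape (T, J, 3) with Cartesian joint positions
                      over T frames (an assumed input layout).
    reference_joints: joint indices defining the spatial relationships; the
                      choice here is hypothetical.

    Returns three clips (one per cylindrical channel r, theta, z), each of
    shape (len(reference_joints), J, T): every frame of a clip spans the
    whole sequence in time and encodes one joint-to-reference relationship.
    """
    T, J, _ = skeleton.shape
    clips = {"r": [], "theta": [], "z": []}

    for ref in reference_joints:
        # Relative positions of all joints with respect to one reference joint.
        rel = skeleton - skeleton[:, ref:ref + 1, :]          # (T, J, 3)

        # Cartesian -> cylindrical coordinates.
        r = np.sqrt(rel[..., 0] ** 2 + rel[..., 1] ** 2)      # radial distance
        theta = np.arctan2(rel[..., 1], rel[..., 0])          # azimuth angle
        z = rel[..., 2]                                        # height

        # Arrange each channel as a (J, T) gray image: one frame of a clip.
        clips["r"].append(r.T)
        clips["theta"].append(theta.T)
        clips["z"].append(z.T)

    # Three clips, each a stack of frames (one frame per reference joint).
    return {k: np.stack(v) for k, v in clips.items()}
```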
We propose to use deep convolutional neural networks to learn long-term temporal information of the skeleton sequence from the frames of the generated clips, and then a Multi-Task Learning Network (MTLN) to jointly process all frames of the clips in parallel and incorporate spatial structural information for action recognition.
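As a rough illustration of this joint processing, the following PyTorch sketch treats the classification of each clip frame's CNN feature as one task and fuses the per-task scores. The backbone architecture, layer sizes, channel packing, and score averaging are assumptions made for illustration rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn

class MTLNSketch(nn.Module):
    """Illustrative multi-task head: each clip frame is one task.

    The backbone below stands in for the deep CNN that turns every frame of
    the generated clips into a feature vector; its architecture here is an
    assumption, not the network used in the paper.
    """

    def __init__(self, num_frames=4, feat_dim=256, num_classes=60):
        super().__init__()
        self.backbone = nn.Sequential(          # placeholder CNN feature extractor
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim), nn.ReLU(),
        )
        # One classifier per frame (per task), processed jointly in parallel.
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_frames)]
        )

    def forward(self, frames):
        # frames: (batch, num_frames, 3, H, W); the three cylindrical channels
        # of one frame are stacked as image channels (an assumed packing).
        logits = [head(self.backbone(frames[:, i]))
                  for i, head in enumerate(self.heads)]
        per_task = torch.stack(logits, dim=1)    # (batch, num_frames, classes)
        return per_task, per_task.mean(dim=1)    # per-task and fused scores
```

In training, one classification loss per task (frame) would be summed so that all frames are optimized jointly, while the fused score serves as the final prediction.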
Experimental results clearly show the effectiveness of the proposed new representation and feature learning method for 3D action recognition.