Modeling Temporal Dynamics and Spatial Configurations of Actions Using
Two-Stream Recurrent Neural Networks
Abstract
Recently, skeleton based action recognition has gained popularity due to cost-effective depth sensors coupled with real-time skeleton estimation algorithms. Traditional approaches based on handcrafted features are limited in their ability to represent the complexity of motion patterns. Recent methods
that use Recurrent Neural Networks (RNN) to handle raw
skeletons only focus on the contextual dependency in the
temporal domain and neglect the spatial configurations of
articulated skeletons. In this paper, we propose a novel
two-stream RNN architecture to model both temporal dynamics and spatial configurations for skeleton based action
recognition. We explore two different structures for the temporal stream: stacked RNN and hierarchical RNN. The hierarchical RNN is designed according to human body kinematics. We also propose two effective methods to model the
spatial structure by converting the spatial graph into a sequence of joints. To improve the generalization of our model, we further exploit 3D transformation based data augmentation techniques, including rotation and scaling, to transform the 3D coordinates of skeletons during training. Experiments on 3D action recognition benchmark datasets show that our method brings considerable improvements for a variety of actions, i.e., generic actions, interaction activities and gestures.
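As a rough illustration of the idea of converting the skeleton's spatial graph into a sequence of joints for a spatial RNN, the sketch below performs a depth-first traversal of a simplified skeleton tree. The joint indices, parent links, and function name are hypothetical placeholders and are not the paper's exact conversion methods, which are described later in the paper.

```python
import numpy as np

# Hypothetical parent links for a simplified 16-joint skeleton tree; the real
# joint set depends on the sensor (e.g., Kinect skeletons have 20 or 25 joints).
PARENTS = {0: None, 1: 0, 2: 1, 3: 2,    # spine base -> spine -> neck -> head
           4: 1, 5: 4, 6: 5,             # left shoulder -> elbow -> hand
           7: 1, 8: 7, 9: 8,             # right shoulder -> elbow -> hand
           10: 0, 11: 10, 12: 11,        # left hip -> knee -> foot
           13: 0, 14: 13, 15: 14}        # right hip -> knee -> foot

def graph_to_sequence(joints_3d, parents=PARENTS, root=0):
    """Flatten the skeleton graph into an ordered list of joints via a
    depth-first traversal from the root, so that a spatial RNN can read
    the joints one by one.  joints_3d has shape (num_joints, 3)."""
    children = {j: [c for c, p in parents.items() if p == j] for j in parents}
    order, stack = [], [root]
    while stack:
        j = stack.pop()
        order.append(j)
        stack.extend(reversed(children[j]))
    return joints_3d[order]  # (num_joints, 3), joints in traversal order
```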
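The rotation and scaling based data augmentation can be sketched in a similar spirit. The rotation axis, angle range, and scale range below are assumed values chosen for illustration, not parameters taken from the paper.

```python
import numpy as np

def augment_skeleton(joints_3d, max_angle=np.pi / 4, scale_range=(0.8, 1.2)):
    """Randomly rotate the skeleton about the vertical (y) axis and apply a
    random uniform scale to its 3D joint coordinates during training.
    joints_3d has shape (num_joints, 3); the ranges are illustrative."""
    theta = np.random.uniform(-max_angle, max_angle)
    rotation_y = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                           [ 0.0,           1.0, 0.0          ],
                           [-np.sin(theta), 0.0, np.cos(theta)]])
    scale = np.random.uniform(*scale_range)
    return scale * (joints_3d @ rotation_y.T)
```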