Abstract. The representation of 3D pose plays a critical role in 3D
action and gesture recognition. Rather than representing a 3D pose directly by its joint locations, in this paper, we propose a Deformable Pose
Traversal Convolution Network that applies one-dimensional convolution to traverse the 3D pose for its representation. Instead of fixing the
receptive field when performing traversal convolution, the network optimizes the
convolution kernel for each joint by considering contextual joints with
various weights. This deformable convolution better utilizes the contextual joints for action and gesture recognition and is more robust to noisy
joints. Moreover, by feeding the learned pose feature to an LSTM, we perform end-to-end training that jointly optimizes 3D pose representation
and temporal sequence recognition. Experiments on three benchmark
datasets validate the competitive performance of our proposed method,
as well as its efficiency and robustness in handling noisy joints of pose