Abstract
This paper argues that large-scale action recognition in video can be greatly improved by providing an additional modality in training data – namely, 3D human-skeleton sequences – aimed at complementing poorly represented or missing features of human actions in the training videos. For recognition, we use Long Short Term Memory (LSTM) grounded via a deep Convolutional Neural Network (CNN) onto the video. Training of LSTM is regularized using the output of another encoder LSTM (eLSTM) grounded on 3D human-skeleton training data. For such regularized training of LSTM, we modify the standard backpropagation through time (BPTT) in order to address the wellknown issues with gradient descent in constraint optimization. Our evaluation on three benchmark datasets – Sports- 1M, HMDB-51, and UCF101 – shows accuracy improvements from 1.7% up to 14.8% relative to the state of the art.