Abstract
Generative models of 3D human motion are often restricted to a small number of activities and therefore cannot generalize well to novel movements or applications. In this work we propose a deep learning framework that learns a generic representation from a large corpus of human motion capture data and generalizes well to new, unseen motions. Using an encoding-decoding network that learns to predict future 3D poses from the most recent past, we extract a feature representation of human motion. Most work on deep learning for sequence prediction focuses on video and speech. Since skeletal data has a different structure, we present and evaluate several network architectures that make different assumptions about temporal dependencies and limb correlations. To quantify the learned features, we use the output of different layers for action classification and visualize the receptive fields of the network units. Our method outperforms the recent state of the art in skeletal motion prediction even though the compared methods are trained on action-specific data. Our results show that deep feedforward networks, trained on a generic mocap database, can successfully be used for feature extraction from human motion data and that this representation can serve as a foundation for classification and prediction.