Abstract. In this work, we address the problem of 3D human pose estimation from a sequence of 2D human poses. Although the recent success
of deep networks has led many state-of-the-art methods for 3D pose estimation to train deep networks end-to-end to predict 3D pose directly from images,
the top-performing approaches have shown the effectiveness of dividing
the task of 3D pose estimation into two steps: using a state-of-the-art 2D
pose estimator to estimate the 2D poses from images and then mapping
them into 3D space. They also showed that a low-dimensional representation, such as the 2D locations of a set of joints, can be discriminative enough
to estimate 3D pose with high accuracy. However, estimating 3D pose
for individual frames leads to temporally incoherent estimates, since the independent errors in each frame cause jitter. Therefore, in this work we
exploit the temporal information across a sequence of 2D joint locations
to estimate a sequence of 3D poses. We design a sequence-to-sequence
network composed of layer-normalized LSTM units, with shortcut connections from the input to the output on the decoder side, and
impose a temporal smoothness constraint during training. We find that
incorporating temporal consistency improves the best previously reported result
on the Human3.6M dataset by approximately 12.2% and helps our network
recover temporally consistent 3D poses over a sequence of images, even
when the 2D pose detector fails.
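
The temporal smoothness constraint can be sketched as a penalty on the first-order differences between consecutive predicted 3D poses, so that large frame-to-frame jumps are discouraged. The function below is a minimal, plain-Python illustration under that assumption; the name and formulation are ours, not the paper's implementation.

```python
def temporal_smoothness_loss(poses):
    """Mean squared difference between consecutive predicted 3D poses.

    poses: a list of frames, each a flat list of 3D joint coordinates
    (illustrative layout; the actual joint parameterization is assumed).
    A constant sequence yields zero loss; jittery sequences are penalized.
    """
    if len(poses) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(poses, poses[1:]):
        # Squared L2 distance between consecutive frames.
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    # Average over the number of consecutive-frame pairs.
    return total / (len(poses) - 1)
```

During training, a term like this would be added (with a weighting factor) to the per-frame 3D pose regression loss.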