Abstract
Facial analysis in videos, including head pose estimation and facial landmark localization, is key for many applications such as facial animation capture, human activity
recognition, and human-computer interaction. In this paper, we propose to use a recurrent neural network (RNN) for
joint estimation and tracking of facial features in videos. We
are inspired by the observation that the computation performed in
an RNN resembles that of Bayesian filters, which have
been used for tracking in many previous methods for facial
analysis from videos. Bayesian filters used in these methods, however, require complicated, problem-specific design
and tuning. In contrast, our proposed RNN-based method
avoids such tracker-engineering by learning from training
data, similar to how a convolutional neural network (CNN)
avoids feature-engineering for image classification. As an
end-to-end network, the proposed RNN-based method provides a generic and holistic solution for joint estimation
and tracking of various types of facial features from consecutive video frames. Extensive experimental results on
head pose estimation and facial landmark localization from
videos demonstrate that the proposed RNN-based method
outperforms frame-wise models and Bayesian filtering. In
addition, we create a large-scale synthetic dataset for head
pose estimation, with which we achieve state-of-the-art performance on a benchmark dataset.
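
The resemblance between an RNN and a Bayesian filter noted above can be illustrated with a minimal sketch. It is not the network proposed in this paper: the constant-velocity motion model, the hand-picked gain GAIN, and the helper names kalman_step, linear_rnn_step, and compare are all assumptions made for illustration. With a fixed gain, one Kalman predict/update step reduces to a linear recurrence of the form h_t = W h_{t-1} + U x_t + b, which is the update of a vanilla RNN with its nonlinearity removed.

```python
import random

def kalman_step(state, z, gain):
    """One predict/update step of a fixed-gain Kalman filter.

    Assumed toy model: state = (position, velocity), constant-velocity
    motion, and a noisy scalar measurement z of the position.
    """
    p, v = state
    p_pred, v_pred = p + v, v          # predict from the previous estimate
    residual = z - p_pred              # innovation: measurement minus prediction
    return (p_pred + gain[0] * residual, v_pred + gain[1] * residual)

def linear_rnn_step(h, x, W, U, b):
    """One step of a linear recurrence: h_t = W h_{t-1} + U x_t + b."""
    n = len(h)
    return tuple(
        sum(W[i][j] * h[j] for j in range(n)) + U[i] * x + b[i]
        for i in range(n)
    )

# Hypothetical hand-picked gain; a real Kalman filter derives it from
# noise covariances, and a trained RNN would learn its weights from data.
GAIN = (0.5, 0.1)

# Recurrent weights equivalent to the filter above: W = (I - K H) A, U = K.
W = [[1 - GAIN[0], 1 - GAIN[0]],
     [-GAIN[1], 1 - GAIN[1]]]
U = list(GAIN)
b = [0.0, 0.0]

def compare(steps=20, seed=0):
    """Run both recurrences on the same noisy track; return the largest gap."""
    rng = random.Random(seed)
    kf_state = (0.0, 0.0)
    rnn_state = (0.0, 0.0)
    max_diff = 0.0
    for t in range(steps):
        z = 0.3 * t + rng.gauss(0.0, 0.1)   # synthetic noisy position
        kf_state = kalman_step(kf_state, z, GAIN)
        rnn_state = linear_rnn_step(rnn_state, z, W, U, b)
        gap = max(abs(a - c) for a, c in zip(kf_state, rnn_state))
        max_diff = max(max_diff, gap)
    return max_diff

if __name__ == "__main__":
    print("largest difference between filter and recurrence:", compare())
```

The two recurrences agree up to floating-point rounding. In this sketch the weights are derived by hand from a motion model; the approach described in the abstract instead learns the recurrent weights from training data, which is what removes the need for problem-specific tracker design.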