Abstract
We consider the task of automated estimation of facial expression intensity. This involves estimating multiple structurally dependent
output variables: the intensity levels of facial action units (AUs). Their structure arises from statistically induced co-occurrence patterns of AU intensity levels.
Modeling this structure is critical for improving estimation performance; however, that performance is bounded
by the quality of the input features extracted from face images. The goal of this paper is to model these structures and
estimate complex feature representations simultaneously by
combining AU dependencies encoded in a conditional random field (CRF) with deep learning. To this end, we propose
a novel Copula CNN deep learning approach for modeling multivariate ordinal variables. Our model accounts for
ordinal structure in output variables and their non-linear
dependencies via copula functions modeled as cliques of a
CRF. These are jointly optimized with deep CNN feature
encoding layers using a newly introduced balanced-batch
iterative training algorithm. We demonstrate the effectiveness of our approach on the task of AU intensity estimation
on two benchmark datasets. We show that joint learning
of the deep features and the target output structure results
in significant performance gains compared to existing deep
structured models for analysis of facial expressions.
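
The abstract does not spell out how the balanced batches are built. As a rough illustration only, the sketch below shows one plausible reading: each training batch draws roughly the same number of examples from every intensity level, so that rare high-intensity levels are not swamped by the dominant low-intensity ones. The function name `balanced_batch`, the single-AU label list, and sampling with replacement are all assumptions made for this example and are not taken from the paper.

```python
import random
from collections import defaultdict


def balanced_batch(labels, batch_size, rng=None):
    """Return example indices for one batch with near-equal counts per level.

    Hypothetical helper: `labels` holds one ordinal intensity level per
    training example for a single AU (an assumption for this sketch).
    """
    rng = rng or random.Random()

    # Group example indices by their intensity level.
    by_level = defaultdict(list)
    for idx, level in enumerate(labels):
        by_level[level].append(idx)

    levels = sorted(by_level)
    per_level, remainder = divmod(batch_size, len(levels))

    batch = []
    for i, level in enumerate(levels):
        # Spread any leftover slots over the first few levels.
        k = per_level + (1 if i < remainder else 0)
        # Sample with replacement so rare levels can still fill their quota.
        batch.extend(rng.choices(by_level[level], k=k))

    rng.shuffle(batch)
    return batch


# Usage: a heavily skewed label set still yields an even batch.
labels = [0] * 90 + [1] * 8 + [2] * 2
batch = balanced_batch(labels, batch_size=30, rng=random.Random(0))
```

In a multi-AU setting the balancing would have to account for several label vectors at once, which this single-AU sketch does not attempt.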