Abstract
This paper is about estimating intensity levels of Facial Action Units (FAUs) in videos as an important step toward interpreting facial expressions. As input features, we use locations of facial landmark points detected in video frames. To address uncertainty of input, we formulate a generative latent tree (LT) model, its inference, and novel algorithms for effificient learning of both LT parameters and structure. Our structure learning iteratively builds LT by adding either a new edge or a new hidden node to LT, starting from initially independent nodes of observable features. A graph-edit operation that increases maximally the likelihood and minimally the model complexity is selected as optimal in each iteration. For FAU intensity estimation, we derive closed-form expressions of posterior marginals of all variables in LT, and specify an effificient bottom-up/topdown inference. Our evaluation on the benchmark DISFA and ShoulderPain datasets, in subject-independent setting, demonstrate that we outperform the state of the art, even under signifificant noise in facial landmarks. Effectiveness of our structure learning is demonstrated by probabilistically sampling meaningful facial expressions from the LT