Abstract
We study the problem of facial analysis in videos. We propose a novel weakly supervised learning method that models the video event (expression, pain, etc.) as a sequence of automatically mined, discriminative sub-events (e.g. onset and offset phases for smile, brow lower and cheek raise for pain). The proposed model is inspired by recent work on Multiple Instance Learning and latent SVM/HCRF; it extends such frameworks to approximately model the ordinal or temporal aspect of the videos. We obtain consistent improvements over relevant competitive baselines on four challenging and publicly available video-based facial analysis datasets for prediction of expression, clinical pain and intent in dyadic conversations. In combination with complementary features, we report state-of-the-art results on these datasets.