Abstract. Effectively measuring the similarity between two human motions is
necessary for several computer vision tasks such as gait analysis, person identification,
and action retrieval. However, we argue that traditional approaches
such as the L2 distance or Dynamic Time Warping based on hand-crafted local pose
metrics fail to capture the semantic relationship across motions and
are therefore unsuitable as metrics for these tasks. This
work addresses this limitation by means of a triplet-based deep metric learning
approach specifically tailored to human motion data, in particular to the problems of
varying input size and of computationally expensive hard negative mining
due to motion pair alignment. Concretely, we propose (1) a novel metric learning
objective based on a triplet architecture and Maximum Mean Discrepancy, and
(2) a novel deep architecture based on attentive recurrent neural networks.
Our objective function enforces a better separation of the different motion
categories within the learned embedding space by means of the
associated distribution moments. At the same time, our attentive recurrent neural
network maps inputs of varying size to a fixed-size embedding while
learning to focus on those motion parts that are semantically distinctive. Our
experiments on two different datasets demonstrate significant improvements over
conventional human motion metrics.
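
For reference, the two standard ingredients named above can be written as follows. This is only a sketch of the textbook definitions, not the combined objective proposed in this work, whose exact form the abstract does not state. The symbols are assumptions introduced here for illustration: f is the embedding network, (a, p, n) an anchor, positive and negative motion, d a distance in the embedding space, \alpha a margin, and k a positive-definite kernel comparing samples drawn from two distributions P and Q.

```latex
% Standard triplet loss with margin \alpha (assumed notation)
\mathcal{L}_{\mathrm{triplet}}(a, p, n)
  = \max\bigl(0,\; d(f(a), f(p)) - d(f(a), f(n)) + \alpha\bigr)

% Squared Maximum Mean Discrepancy between distributions P and Q,
% with x, x' drawn independently from P and y, y' from Q
\mathrm{MMD}^2(P, Q)
  = \mathbb{E}_{x, x' \sim P}\bigl[k(x, x')\bigr]
  - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\bigl[k(x, y)\bigr]
  + \mathbb{E}_{y, y' \sim Q}\bigl[k(y, y')\bigr]
```

With a characteristic kernel such as the Gaussian kernel, the MMD is zero only when P and Q coincide, which is one way to read the phrase "by means of the associated distribution moments".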