Abstract
We investigate the spatio-temporal alignment of videos or features/signals extracted from them. Specifically, we formally define an alignment manifold and formulate the alignment problem as an opti- mization procedure on this non-linear space by exploiting its intrinsic geometry. We focus our attention on semantically meaningful videos or signals, e.g., those describing or capturing human motion or activities, and propose a new formalism for temporal alignment accounting for ex- ecuting rate variations among realizations of the same video event. By construction, we address this static and deterministic alignment task in a dynamic and stochastic manner: we regard the search for optimal align- ment parameters as a recursive state estimation problem for a particular dynamic system evolving on the alignment manifold. Consequently, a Sequential Importance Sampling iteration on the alignment manifold is designed for effective and efficient alignment. We demonstrate the per- formance on several types of input data that arise in vision problems.