Abstract
We address the problem of learning good features for understanding video data. We introduce a model that learns latent representations of image sequences from pairs of successive images. The convolutional architecture of our model allows it to scale to realistic image sizes whilst using a compact parametrization. In experiments on the NORB dataset, we show our model extracts latent “flow fields” which correspond to the transformation between the pair of input frames. We also use our model to extract low-level motion features in a multi-stage architecture for action recognition, demonstrating competitive performance on both the KTH and Hollywood2 datasets.