Abstract
Convolutional neural networks (CNNs) have been exten-sively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentationand retrieval. In this work we propose and evaluate severaldeep neural network architectures to combine image infor-mation across a video over longer time periods than previ-ously attempted. We propose two methods capable of han-dling full length videos. The first method explores variousconvolutional temporal feature pooling architectures, ex-amining the various design choices which need to be madewhen adapting a CNN for this task. The second proposedmethod explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neuralnetwork that uses Long Short-Term Memory (LSTM) cellswhich are connected to the output of the underlying CNN.Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 73.0%).