Abstract. The state of the art in video understanding suffers from two
problems: (1) The major part of reasoning is performed locally in the
video; therefore, it misses important relationships within actions that
span several seconds. (2) While there are local methods with fast per-frame processing, the processing of the whole video is not efficient and
hampers fast video retrieval or online classification of long-term activities. In this paper, we introduce a network architecture
that takes long-term content into account and enables fast per-video processing at the
same time. The architecture is based on merging long-term content already in the network rather than in a post-hoc fusion. Together with
a sampling strategy, which exploits the fact that neighboring frames are largely
redundant, this yields high-quality action classification and video captioning at up to 230 videos per second, where each video can consist of
a few hundred frames. The approach achieves competitive performance
across all datasets while being 10x to 80x faster than state-of-the-art
methods.
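The abstract rests on two ideas: sample only a handful of frames spread over the whole video, since neighboring frames are largely redundant, and fuse their features inside the network instead of averaging per-frame predictions afterwards. The sketch below is a minimal, hypothetical PyTorch illustration of these two ideas; the module names, layer sizes, and the small 3D-convolution fusion head are assumptions for illustration, not the authors' architecture.

```python
# Minimal sketch (assumed PyTorch, not the authors' code): sparse frame sampling
# plus in-network temporal fusion, in contrast to post-hoc fusion of per-frame scores.
import torch
import torch.nn as nn


def sample_frames(video: torch.Tensor, num_segments: int = 16) -> torch.Tensor:
    """Split the video into equal segments and take one frame per segment.

    video: tensor of shape (T, C, H, W); returns (num_segments, C, H, W).
    Uniform sampling is used here for simplicity.
    """
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, steps=num_segments).round().long()
    return video[idx]


class LongTermFusionNet(nn.Module):
    """Per-frame 2D features followed by in-network temporal fusion (assumed layout)."""

    def __init__(self, num_classes: int = 400):
        super().__init__()
        # Lightweight per-frame 2D feature extractor (stand-in for a real backbone).
        self.frame_net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        # A 3D convolution mixes information across all sampled frames,
        # i.e. long-term fusion happens inside the network.
        self.fusion = nn.Sequential(
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, S, C, H, W) with S sampled frames per video.
        b, s, c, h, w = frames.shape
        feats = self.frame_net(frames.reshape(b * s, c, h, w))        # (B*S, 32, 8, 8)
        feats = feats.reshape(b, s, 32, 8, 8).permute(0, 2, 1, 3, 4)  # (B, 32, S, 8, 8)
        fused = self.fusion(feats).flatten(1)                         # (B, 64)
        return self.classifier(fused)


if __name__ == "__main__":
    video = torch.rand(300, 3, 224, 224)             # a few hundred frames
    frames = sample_frames(video, num_segments=16)   # only 16 frames are processed
    logits = LongTermFusionNet()(frames.unsqueeze(0))
    print(logits.shape)                              # torch.Size([1, 400])
```

Because only a fixed, small number of frames is processed per video regardless of its length, the per-video cost stays roughly constant, which is what makes the reported throughput of whole videos (rather than clips) plausible.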