Abstract
Deep convolutional neural networks (CNNs) have made
impressive progress in many video recognition tasks such
as video pose estimation and video object detection. However, CNN inference on video is computationally expensive
because dense frames are processed individually. In this work,
we propose a framework called Recurrent Residual Module
(RRM) to accelerate the CNN inference for video recognition
tasks. This framework exploits the similarity between the intermediate feature maps of two consecutive
frames to greatly reduce redundant computation. One
unique property of the proposed method compared to previous work is that feature maps of each frame are precisely
computed. Experiments show that, while maintaining
similar recognition performance, our RRM yields an average 2× acceleration on commonly used CNNs such
as AlexNet, ResNet, and the deep compression model (thus 8 to 12×
faster than the original dense models using the efficient inference engine), and an impressive 9× acceleration on some
binary networks such as XNOR-Nets (thus 500× faster than
the original model). We further verify the effectiveness of the
RRM on speeding up CNNs for video pose estimation and
video object detection.
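
To illustrate how reusing the previous frame's feature maps can still yield precisely computed results, the following minimal sketch assumes that the reuse rests on the linearity of a layer, a mechanism the abstract itself does not spell out. Under that assumption, the output for the current frame equals the output for the previous frame plus the layer applied to the (sparse) frame difference. The names `dense_layer` and `recurrent_residual_step` are hypothetical, and a small dense layer stands in for a convolution.

```python
import numpy as np


def dense_layer(weights, x):
    """Linear part of a layer; a convolution is linear in the same sense."""
    return weights @ x


def recurrent_residual_step(weights, prev_input, prev_output, cur_input):
    """Update the previous output using only the inputs that changed.

    Assumption for illustration: the layer is linear, so
    W @ x_t == W @ x_{t-1} + W @ (x_t - x_{t-1}).
    """
    diff = cur_input - prev_input
    changed = np.flatnonzero(diff)  # indices where the two frames differ
    # Only the weight columns of changed inputs contribute to the update.
    update = weights[:, changed] @ diff[changed]
    return prev_output + update, changed.size


rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 100))

frame_prev = rng.standard_normal(100)
frame_cur = frame_prev.copy()
frame_cur[[3, 42, 77]] += 1.0  # consecutive frames differ in a few places

out_prev = dense_layer(weights, frame_prev)
out_fast, n_changed = recurrent_residual_step(
    weights, frame_prev, out_prev, frame_cur
)
out_full = dense_layer(weights, frame_cur)  # reference: full recomputation
```

In this toy setting the incremental result agrees with the full recomputation up to floating-point rounding, while touching only the weight columns of the inputs that changed. Any saving in a real network would depend on how sparse the frame difference is and on whether the implementation can exploit that sparsity.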