Efficient Video Classification Using Fewer Frames

2019-09-16

Abstract: Recently, there has been a lot of interest in building compact models for video classification which have a small memory footprint (< 1 GB) [16]. While these models are compact, they typically operate by repeatedly applying a small weight matrix to all the frames in a video. For example, recurrent neural network based methods compute a hidden state for every frame of the video using a recurrent weight matrix. Similarly, cluster-and-aggregate based methods such as NetVLAD have a learnable clustering matrix which is used to assign soft clusters to every frame in the video. Since these models look at every frame in the video, the number of floating point operations (FLOPs) is still large even though the memory footprint is small. In this work, we focus on building compute-efficient video classification models which process fewer frames and hence require fewer FLOPs. Similar to memory-efficient models, we use the idea of distillation, albeit in a different setting. Specifically, in our case, a compute-heavy teacher which looks at all the frames in the video is used to train a compute-efficient student which looks at only a small fraction of the frames. This is in contrast to a typical memory-efficient teacher-student setting, wherein both the teacher and the student look at all the frames in the video but the student has fewer parameters. Our work thus complements the research on memory-efficient video classification. We do an extensive evaluation with three types of models for video classification, viz., (i) recurrent models, (ii) cluster-and-aggregate models, and (iii) memory-efficient cluster-and-aggregate models, and show that in each of these cases a see-it-all teacher can be used to train a compute-efficient see-very-little student. Overall, we show that the proposed student network can reduce the inference time by 30% and the number of FLOPs by approximately 90% with a negligible drop in performance.
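The teacher-student setup described in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the uniform every-j-th-frame sampling, the mean-pooling stand-in for the encoder (the paper uses recurrent and cluster-and-aggregate models), the squared-L2 loss between representations, and all function names. It also assumes that compute scales roughly linearly with the number of frames processed, in which case keeping 1 frame in 10 would correspond to about a 90% reduction in per-frame FLOPs.

```python
# Hypothetical sketch: a "see-it-all" teacher vs. a "see-very-little" student.
# The encoder, loss, and sampling scheme are illustrative assumptions.

def sample_every_jth(frames, j):
    """Keep every j-th frame (uniform subsampling; one possible selection scheme)."""
    return frames[::j]

def mean_pool(frames):
    """Toy stand-in for a video encoder: average the per-frame feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def representation_loss(teacher_repr, student_repr):
    """Squared L2 distance between teacher and student video representations."""
    return sum((t - s) ** 2 for t, s in zip(teacher_repr, student_repr))

def flop_fraction_saved(num_frames, num_kept):
    """Fraction of per-frame compute saved, assuming cost is linear in frame count."""
    return 1.0 - num_kept / num_frames

# Synthetic video: 300 frames, each a 2-dimensional feature vector.
frames = [[float(i), float(i % 3)] for i in range(300)]

student_frames = sample_every_jth(frames, 10)   # student sees 1 frame in 10
teacher_repr = mean_pool(frames)                # teacher sees every frame
student_repr = mean_pool(student_frames)

# In training, a loss like this would be minimized with respect to the
# student's parameters so that its representation matches the teacher's.
loss = representation_loss(teacher_repr, student_repr)
saved = flop_fraction_saved(len(frames), len(student_frames))
```

In an actual training loop the student would be a trainable network and this matching term would typically be combined with the usual classification loss; the sketch only shows how the two models differ in the frames they consume.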
