Abstract
Algorithms using "bag of features"-style video representations currently achieve state-of-the-art performance on action recognition tasks, such as the challenging Hollywood2 benchmark [1,2,3]. These algorithms are based on local spatiotemporal descriptors that can be extracted either sparsely (at interest points) or densely (on regular grids), with dense sampling typically leading to the best performance [1]. Here, we investigate the benefit of space-variant processing of inputs, inspired by attentional mechanisms in the human visual system. We employ saliency-mapping algorithms to find informative regions, and descriptors corresponding to these regions are either used exclusively or are given greater representational weight (additional codebook vectors). This approach is evaluated with three state-of-the-art action recognition algorithms [1,2,3], and using several saliency algorithms. We also use saliency maps derived from human eye movements to probe the limits of the approach. Saliency-based pruning allows up to 70% of descriptors to be discarded, while maintaining high performance on Hollywood2. Meanwhile, pruning of 20-50% (depending on model) can even improve recognition. Further improvements can be obtained by combining representations learned separately on salience-pruned and unpruned descriptor sets. Not surprisingly, using the human eye movement data gives the best mean Average Precision (mAP; 61.9%), providing an upper bound on what is possible with a high-quality saliency map. Even without such external data, the Dense Trajectories model [2] enhanced by automated saliency-based descriptor sampling achieves the best mAP (60.0%) reported on Hollywood2 to date.
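The core idea of saliency-based pruning can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the quantile-threshold rule, and the array shapes are assumptions made for the example; the paper's actual pipeline operates on spatiotemporal descriptors from the cited models.

```python
import numpy as np

def prune_by_saliency(descriptors, positions, saliency_map, keep_fraction=0.5):
    """Retain only descriptors located in the most salient regions.

    descriptors  : (N, D) array of local descriptors
    positions    : (N, 2) integer (row, col) location of each descriptor
    saliency_map : (H, W) array, higher values = more salient
    keep_fraction: fraction of descriptors to keep (e.g. 0.3 discards 70%)
    """
    # Look up the saliency score at each descriptor's location.
    scores = saliency_map[positions[:, 0], positions[:, 1]]
    # Keep the top keep_fraction of descriptors by saliency.
    threshold = np.quantile(scores, 1.0 - keep_fraction)
    mask = scores >= threshold
    return descriptors[mask], mask

# Toy usage: 100 descriptors of dimension 64 on a 64x64 saliency map.
rng = np.random.default_rng(0)
descs = rng.standard_normal((100, 64))
pos = rng.integers(0, 64, size=(100, 2))
sal = rng.random((64, 64))
kept, mask = prune_by_saliency(descs, pos, sal, keep_fraction=0.5)
```

The retained subset would then be quantized against the codebook as usual; the weighting variant described above would instead assign salient descriptors additional codebook vectors rather than discarding the rest.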