Abstract
Video event recognition still faces great challenges due to large intra-class variation and low image resolution, in particular for surveillance videos. To mitigate these challenges and to improve event recognition performance, various context information from the feature level, the semantic level, and the prior level is utilized. Unlike most existing context approaches, which utilize context at one of the three levels through shallow models such as support vector machines or probabilistic models such as Bayesian networks (BN) and Markov random fields (MRF), we propose a deep hierarchical context model that simultaneously learns and integrates context at all three levels and holistically utilizes the integrated contexts for event recognition. We first introduce two types of context features describing the event neighborhood, and then utilize the proposed deep model to learn the middle-level representations and combine the bottom feature-level, middle semantic-level, and top prior-level contexts for event recognition. Experiments on state-of-the-art surveillance video event benchmarks, including the VIRAT 1.0 Ground Dataset, the VIRAT 2.0 Ground Dataset, and the UT-Interaction Dataset, demonstrate that the proposed model is effective in utilizing context information for event recognition. It outperforms existing context approaches that also utilize multiple-level contexts on these event benchmarks.