Abstract. This paper presents a new deep learning approach for video-based scene classification. We design a Heterogeneous Deep Discriminative Model (HDDM) whose parameters are initialized by performing
an unsupervised pre-training in a layer-wise fashion using Gaussian Restricted Boltzmann Machines (GRBM). In order to avoid the redundancy
of adjacent frames, we extract spatiotemporal variation patterns within
frames and represent them sparsely using the Sparse Cubic Symmetrical Pattern (SCSP). Then, the pre-initialized HDDM is trained separately on
the videos of each class to learn class-specific models. According to the
minimum reconstruction error from the learnt class-specific models, a
weighted voting strategy is employed for classification. The performance of the proposed method is extensively evaluated on two action
recognition datasets (UCF101 and Hollywood II) and three dynamic texture and dynamic scene datasets (DynTex, YUPENN, and Maryland).
The experimental results and comparisons against state-of-the-art methods demonstrate that the proposed method consistently achieves superior
performance on all datasets.
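
The classification rule described above (minimum reconstruction error over class-specific models, combined by weighted voting) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `classify_video`, the use of plain callables in place of trained HDDMs, the squared-error measure, and the exponential vote weight are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def classify_video(features, reconstructors):
    """Pick a class label for one video by weighted voting.

    features:       (n_samples, dim) array of per-video descriptors
                    (standing in for SCSP features).
    reconstructors: list of callables, one per class, each mapping the
                    feature array to its reconstruction (standing in for
                    the class-specific models).
    """
    # errors[i, c] = squared reconstruction error of sample i under class c
    errors = np.stack(
        [np.sum((features - r(features)) ** 2, axis=1) for r in reconstructors],
        axis=1,
    )
    # Each sample votes for the class that reconstructs it best
    winners = np.argmin(errors, axis=1)
    # Assumed weighting: a vote counts more when its winning error is small
    weights = np.exp(-errors[np.arange(len(features)), winners])
    votes = np.bincount(winners, weights=weights, minlength=len(reconstructors))
    return int(np.argmax(votes)), votes
```

In this sketch each sample contributes one vote, so a video is labeled by the class whose model reconstructs most of its samples well; any other monotonically decreasing function of the error could replace the exponential weight.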