Abstract
While convolutional neural networks (CNN) have been excellent for object recognition, the greater spatial variability in scene images typically means that standard full-image CNN features are suboptimal for scene classification. In this paper, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV) encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation comprising multiple modalities of RGB, HHA and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity, i.e., only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal non-sparsity, i.e., within these discriminative components, all modalities make important contributions. In our framework, these are implemented through regularization terms applying group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUN RGB-D Dataset and NYU Depth Dataset V2.
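As an illustrative aside (not part of the paper's own code), the two regularizers named in the abstract can be sketched numerically: group lasso sums the L2 norms of coefficient groups, which drives whole groups (here, GMM components) to zero, while exclusive group lasso sums squared L1 norms within groups, which spreads weight across groups (here, modalities) while sparsifying inside each one. The toy weight vector and group layout below are hypothetical, chosen only to mirror the "components x modalities" structure described above.

```python
import numpy as np

def group_lasso(w, groups):
    """Group lasso penalty: sum of L2 norms over groups.
    Encourages entire groups (GMM components) to be zeroed out."""
    return sum(np.linalg.norm(w[g]) for g in groups)

def exclusive_group_lasso(w, groups):
    """Exclusive group lasso penalty: sum of squared L1 norms per group.
    Encourages sparsity within each group while keeping every group
    (modality) active."""
    return sum(np.abs(w[g]).sum() ** 2 for g in groups)

# Hypothetical layout: 2 GMM components x 3 modalities (RGB, HHA, normals),
# flattened into a single weight vector of length 6.
w = np.array([0.5, 0.4, 0.3,   # component 1: all modalities contribute
              0.0, 0.0, 0.0])  # component 2: pruned entirely

component_groups = [np.arange(0, 3), np.arange(3, 6)]   # group by component
modality_groups = [np.array([0, 3]),                    # group by modality
                   np.array([1, 4]),
                   np.array([2, 5])]

print(group_lasso(w, component_groups))            # only component 1 pays
print(exclusive_group_lasso(w, modality_groups))   # every modality pays
```

In this toy setting the group-lasso term ignores the fully pruned second component, matching the component-sparsity postulate, while the exclusive term penalizes each modality group separately, matching the modal non-sparsity postulate.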