Grassmann Pooling as Compact Homogeneous
Bilinear Pooling for Fine-Grained Visual
Classification
Abstract. Designing discriminative and invariant features is the key to
visual recognition. Recently, the bilinear pooled feature matrix of a Convolutional Neural Network (CNN) has been shown to achieve state-of-the-art
performance on a range of fine-grained visual recognition tasks. The bilinear feature matrix collects second-order statistics and is closely related
to the covariance matrix descriptor. However, the bilinear feature can
suffer from the visual burstiness phenomenon, as do other visual representations such as VLAD and Fisher Vector. The reason is that the
bilinear feature matrix is sensitive to the magnitudes and correlations of
local CNN feature elements, which can be measured by its singular values.
The singular vectors, on the other hand, are more invariant and are therefore a more reasonable choice of feature representation. Motivated by this point,
we advocate an alternative pooling method which transforms the CNN
feature matrix into an orthonormal matrix consisting of its principal singular
vectors. Geometrically, such an orthonormal matrix lies on the Grassmann
manifold, a Riemannian manifold whose points represent subspaces of
the Euclidean space. Measuring the similarity of images then reduces to comparing the principal angles between these “homogeneous” subspaces and
is thus independent of the magnitudes and correlations of local CNN
activations. In particular, we demonstrate that the projection distance
on the Grassmann manifold induces a bilinear feature mapping without
explicitly computing the bilinear feature matrix, which enables a very
compact feature and classifier representation. Experimental results show
that our method achieves an excellent balance of model complexity and
accuracy on a variety of fine-grained image classification datasets.
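
The pooling step and the projection distance described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `grassmann_pool` and `projection_distance_sq`, the random stand-in features, and the sizes `c`, `n`, and `k` are all assumptions chosen for illustration. The sketch uses the standard identity that the squared projection distance, (1/2)||U1 U1^T - U2 U2^T||_F^2, equals k - ||U1^T U2||_F^2, so the c x c bilinear matrices never need to be formed.

```python
import numpy as np

def grassmann_pool(X, k):
    """Return the k principal left singular vectors of X (shape c x n).

    The columns of the result are an orthonormal basis of a k-dimensional
    subspace, i.e. a representative of a point on the Grassmann manifold.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]

def projection_distance_sq(U1, U2):
    """Squared projection distance computed from the small k x k product."""
    k = U1.shape[1]
    return k - np.linalg.norm(U1.T @ U2, "fro") ** 2

# Hypothetical sizes: c channels, n spatial locations, k retained vectors.
rng = np.random.default_rng(0)
c, n, k = 16, 50, 4
X = rng.standard_normal((c, n))  # stand-in for local CNN features of image 1
Y = rng.standard_normal((c, n))  # stand-in for local CNN features of image 2

U_x = grassmann_pool(X, k)
U_y = grassmann_pool(Y, k)

# Compact form: only a k x k matrix is involved.
d_compact = projection_distance_sq(U_x, U_y)

# Explicit form: builds the c x c bilinear (projection) matrices.
d_explicit = 0.5 * np.linalg.norm(U_x @ U_x.T - U_y @ U_y.T, "fro") ** 2

# Rescaling the activations changes the singular values but not the subspace.
U_scaled = grassmann_pool(10.0 * X, k)
d_scaled = projection_distance_sq(U_x, U_scaled)
```

In this sketch `d_compact` and `d_explicit` agree up to floating-point error, and `d_scaled` is numerically zero, which mirrors the claim that the representation does not depend on the magnitudes of the activations.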