Abstract
Representing local image patches in an invariant and discriminative manner is an active research topic in computer vision. It has recently been demonstrated that local feature learning based on deep Convolutional Neural Networks (CNNs) can significantly improve matching performance. Previous works on learning such descriptors have focused on developing various loss functions, regularizers, and data-mining strategies to learn discriminative CNN representations. Such methods, however, offer little analysis of how to increase the geometric invariance of the generated descriptors. In this paper, we propose a descriptor that is both highly invariant and discriminative. These abilities come from a novel pooling method, dubbed Subspace Pooling (SP), which is invariant to a range of geometric deformations. To further increase the discriminative power of our descriptor, we propose a simple distance kernel, integrated into the margin-based triplet loss, that helps CNN training focus on hard examples. Finally, we show that by combining SP with the projection distance metric [13], the generated feature descriptor is equivalent to that of the Bilinear CNN model [22], but outperforms the latter while requiring far less memory and computation. The proposed method is simple, easy to understand, and effective. Experimental results on several patch matching benchmarks show that our method significantly outperforms the state of the art.
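For intuition, the sketch below shows one plausible reading of the abstract's two ingredients: SP as pooling a feature map into the subspace spanned by its top singular vectors, compared with a projection distance between subspaces. The function names and the choice of `k` are illustrative assumptions, not taken from the paper.

```python
import torch

def subspace_pooling(feat, k=4):
    """Sketch of Subspace Pooling (SP), assuming it pools a CNN feature map
    by the subspace of its dominant singular vectors.

    feat: (h, w, d) feature map from the last conv layer.
    Returns a (d, k) orthonormal basis; the descriptor is the subspace itself.
    """
    h, w, d = feat.shape
    X = feat.reshape(h * w, d)  # rows = spatial positions
    # Spatially rearranging the patch permutes the rows of X, which leaves
    # the span of the top right singular vectors unchanged -- the source of
    # the claimed invariance to a range of geometric deformations.
    _, _, Vh = torch.linalg.svd(X, full_matrices=False)
    return Vh[:k].T  # (d, k) orthonormal basis

def projection_distance(U1, U2):
    """Squared projection-metric distance between subspaces with
    orthonormal bases U1, U2 of shape (d, k):
    ||U1 U1^T - U2 U2^T||_F^2 / 2 = k - ||U1^T U2||_F^2."""
    k = U1.shape[1]
    return k - (U1.T @ U2).pow(2).sum()

# Toy check: two feature maps that are spatial shuffles of each other
# span the same subspace, so their projection distance is near zero.
f = torch.randn(8, 8, 64)
g = f.reshape(64, 64)[torch.randperm(64)].reshape(8, 8, 64)
print(projection_distance(subspace_pooling(f), subspace_pooling(g)))
```

Note the memory argument the abstract makes: a Bilinear CNN descriptor is the d x d outer-product matrix, whereas the subspace basis above is only d x k with k much smaller than d, which is consistent with the claimed savings in memory and computation.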