Abstract
This paper addresses three issues in integrating part-based representations into convolutional neural networks
(CNNs) for object recognition. First, most part-based models rely on a few pre-specified object parts. However, the
optimal object parts for recognition often vary from category to category. Second, acquiring training data with
part-level annotation is labor-intensive. Third, modeling
spatial relationships between parts in CNNs often involves
an exhaustive search of part templates over multiple network streams. We tackle the three issues by introducing a
new network layer, called the co-occurrence layer. It extends a convolutional layer to encode the co-occurrence between the visual parts detected by its numerous neurons,
instead of a few pre-specified parts. To this end, the feature
maps serve as both filters and images, and mutual correlation filtering is conducted between them. The co-occurrence
layer is end-to-end trainable. The resultant co-occurrence
features are rotation- and translation-invariant, and are robust to object deformation. By applying this new layer
to VGG-16 and ResNet-152, we achieve recognition rates of 83.6% and 85.8%, respectively, on the Caltech-UCSD bird
benchmark. The source code is available at
https://github.com/yafangshih/Deep-COOC
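
The mutual correlation filtering described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name `cooccurrence`, the zero-padded FFT correlation, and the choice of the maximum response as the co-occurrence score are assumptions made for illustration, and the sketch covers only the forward computation on a fixed set of feature maps. Each feature map is treated as a filter and correlated with every other map over all spatial offsets, and the strongest response of each pair is kept.

```python
import numpy as np


def cooccurrence(feature_maps):
    """Pairwise co-occurrence scores for feature maps of shape (N, H, W).

    Hypothetical helper: entry (i, j) is the maximum, over all spatial
    offsets, of the cross-correlation between map i and map j.
    """
    n, h, w = feature_maps.shape
    # Zero-pad to (2H-1, 2W-1) so the circular correlation computed by the
    # FFT equals the linear correlation over every possible offset.
    size = (2 * h - 1, 2 * w - 1)
    spectra = np.fft.rfft2(feature_maps, s=size)
    # Correlation theorem: F_i * conj(F_j) is the spectrum of corr(i, j).
    cross = spectra[:, None] * np.conj(spectra[None, :])
    corr = np.fft.irfft2(cross, s=size)
    # Keep the strongest response of each pair, discarding its location.
    return corr.max(axis=(-2, -1))


# Toy input: three 8x8 maps whose activations sit in a small patch.
rng = np.random.default_rng(0)
maps = np.zeros((3, 8, 8))
maps[:, 1:4, 1:4] = rng.random((3, 3, 3))

# The same activations translated by (2, 3) pixels, still inside the map.
shifted = np.roll(maps, shift=(2, 3), axis=(1, 2))

scores = cooccurrence(maps)
scores_shifted = cooccurrence(shifted)
```

Because the maximum is taken over all offsets, `scores` and `scores_shifted` agree up to floating-point error in this sketch, which mirrors the translation invariance claimed for the co-occurrence features. Rotation invariance and end-to-end training are outside the scope of this example.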