Abstract
Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.