Abstract
Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variations in pose, scale, or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding the foreground object or object parts (where) to extract discriminative features (what). In this paper, we propose to apply visual attention to the fine-grained classification task using a deep neural network. Our pipeline integrates three types of attention: the bottom-up attention that proposes candidate patches, the object-level top-down attention that selects patches relevant to a certain object, and the part-level top-down attention that localizes discriminative parts. We combine these attentions to train domain-specific deep nets, then use them to improve both the what and where aspects. Importantly, we avoid using expensive annotations such as bounding boxes or part information from end to end. The weak supervision constraint makes our work easier to generalize. We have verified the effectiveness of the method on subsets of the ILSVRC2012 dataset and on the CUB-200-2011 dataset. Our pipeline delivered significant improvements and achieved the best accuracy under the weakest supervision condition. The performance is competitive against other methods that rely on additional annotations.