Abstract
This work addresses fine-grained image classification.
Our approach rests on the hypothesis that, when dealing with
subtle differences among object classes, it is critical to identify and account for only a few informative image parts, as
the remaining image context may not only be uninformative
but may also hurt recognition. This motivates us to formulate our problem as a sequential search for informative
parts over a deep feature map produced by a deep Convolutional Neural Network (CNN). A state of this search is a
set of proposal bounding boxes in the image, whose “informativeness” is evaluated by a heuristic function (H)
and used for generating new candidate states by a successor function (S). The two functions are unified via a Long
Short-Term Memory network (LSTM) into a new deep recurrent architecture, called HSnet. Thus, HSnet (i) generates proposals of informative image parts and (ii) fuses all
proposals toward final fine-grained recognition. We specify
both supervised and weakly supervised training of HSnet
depending on the availability of object part annotations.
Evaluation on the benchmark Caltech-UCSD Birds 200-2011
and Cars-196 datasets demonstrates that HSnet performs
competitively with the state of the art.
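
To make the search formulation concrete, the following is a minimal sketch under strong simplifying assumptions, not the HSnet architecture itself. The learned functions H and S and the LSTM that unifies them are replaced by hand-written stand-ins: `heuristic` scores a box by its mean activation, and `successor` performs greedy hill climbing over one-cell shifts. The names `heuristic`, `successor`, `search`, and the toy feature map `fmap` are all hypothetical and chosen only for illustration.

```python
def heuristic(box, fmap):
    """Toy stand-in for H (assumption): mean activation inside the box."""
    x, y, w, h = box
    cells = [fmap[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    return sum(cells) / len(cells)


def successor(box, fmap):
    """Toy stand-in for S (assumption): pick the best-scoring box among
    the current box and its one-cell shifts that stay inside the map."""
    x, y, w, h = box
    rows, cols = len(fmap), len(fmap[0])
    candidates = [box]
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx <= cols - w and 0 <= ny <= rows - h:
            candidates.append((nx, ny, w, h))
    return max(candidates, key=lambda b: heuristic(b, fmap))


def search(fmap, state, steps):
    """Sequential search: a state is a list of proposal boxes, and each
    step maps every box in the state to its successor."""
    history = [state]
    for _ in range(steps):
        state = [successor(b, fmap) for b in state]
        history.append(state)
    return history


# Toy 6x6 "feature map" whose activation grows toward the bottom-right corner.
fmap = [[r + c for c in range(6)] for r in range(6)]

# Start from a single 2x2 proposal in the top-left corner.
final = search(fmap, [(0, 0, 2, 2)], steps=10)[-1]
print(final, heuristic(final[0], fmap))
```

In this toy setting the single proposal drifts toward the most strongly activated region and then stays there. In the approach described above, both the scoring and the proposal generation would instead be learned, and the proposals would additionally be fused for classification, which this sketch does not attempt.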