Abstract
Recent algorithms in convolutional neural networks (CNNs) have considerably advanced fine-grained image classification, which aims to differentiate subtle differences among subordinate classes. However, previous studies have rarely focused on learning a fine-grained and structured feature representation that is able to locate similar images at different levels of relevance, e.g., discovering cars from the same make or the same model, both of which require high precision. In this paper, we propose two main contributions to tackle this problem. 1) A multi-task learning framework is designed to effectively learn fine-grained feature representations by jointly optimizing both classification and similarity constraints. 2) To model the multi-level relevance, label structures such as hierarchy or shared attributes are seamlessly embedded into the framework by generalizing the triplet loss. Extensive and thorough experiments have been conducted on three fine-grained datasets, i.e., the Stanford car, the Car-333, and the food datasets, which contain either hierarchical labels or shared attributes. Our proposed method achieves very competitive performance, i.e., classification accuracy among the state of the art when not using parts. More importantly, it significantly outperforms previous fine-grained feature representations for image retrieval at different levels of relevance.
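
As a rough illustration of the two contributions, the sketch below combines a softmax classification loss with a triplet similarity loss, and then shows one possible way a triplet could be extended to a two-level label hierarchy. It is a minimal sketch under stated assumptions, not the formulation used in this paper: the function names, the weighting scheme (`weight`), the margin values, and the specific two-hinge form of `hierarchical_loss` are all hypothetical choices made for illustration.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against the integer `label`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss: the positive should be closer to the
    anchor than the negative by at least `margin`."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def joint_loss(logits, label, anchor, positive, negative, weight=0.5, margin=1.0):
    """Weighted sum of the classification and similarity terms.
    ASSUMPTION: a convex combination controlled by `weight`."""
    cls = softmax_cross_entropy(logits, label)
    sim = triplet_loss(anchor, positive, negative, margin)
    return weight * cls + (1.0 - weight) * sim

def hierarchical_loss(anchor, same_fine, same_coarse, different,
                      margin_fine=1.0, margin_coarse=1.0):
    """ASSUMPTION: one illustrative two-level ranking. Samples sharing the
    fine label (e.g. same model) should be closer than samples sharing only
    the coarse label (e.g. same make), which in turn should be closer than
    unrelated samples."""
    fine_term = max(0.0, sq_dist(anchor, same_fine)
                    - sq_dist(anchor, same_coarse) + margin_fine)
    coarse_term = max(0.0, sq_dist(anchor, same_coarse)
                      - sq_dist(anchor, different) + margin_coarse)
    return fine_term + coarse_term
```

In this sketch both hinge terms vanish when the distances already respect the ordering fine label, then coarse label, then unrelated, so the embedding is penalized only for violations at a given level of relevance.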