Abstract
Finetuning from a pretrained deep model is found to yield state-of-the-art performance for many vision tasks. This paper investigates many factors that influence the performance of finetuning for object detection. There is a long-tailed distribution of sample numbers across classes in object detection. Our analysis and empirical results show that classes with more samples have a higher impact on feature learning, and that it is better to make the sample number more uniform across classes. Generic object detection can be considered as multiple equally important tasks, where the detection of each class is one task. These classes/tasks have their individuality in discriminative visual appearance representation. Taking this individuality into account, we cluster objects into visually similar class groups and learn deep representations for these groups separately. A hierarchical feature learning scheme is proposed. In this scheme, the knowledge from a group with a large number of classes is transferred for learning features in its subgroups. Finetuned on the GoogLeNet model, our approach achieves a 4.7% absolute mAP improvement on the ImageNet object detection dataset without much increase in computational cost at the testing stage. Code is available at www.ee.cuhk.edu.hk/~wlouyang/projects/ImageNetFactors/CVPR16.html
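As a rough illustration of what making the sample number more uniform across classes could look like, the sketch below caps the number of training samples kept per class. The abstract does not say how uniformity is obtained, so the capping strategy, the function name `cap_samples_per_class`, the class names, and the cap of 100 are all assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def cap_samples_per_class(labels, max_per_class, seed=0):
    """Return sorted sample indices, keeping at most max_per_class per class.

    Hypothetical helper: one simple way to flatten a long-tailed class
    distribution, not necessarily the procedure used in the paper.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for indices in by_class.values():
        # Randomly subsample only the classes that exceed the cap.
        if len(indices) > max_per_class:
            indices = rng.sample(indices, max_per_class)
        kept.extend(indices)
    return sorted(kept)


# Toy long-tailed label list (class names and counts are made up).
labels = ["person"] * 1000 + ["dog"] * 200 + ["hammer"] * 20
kept = cap_samples_per_class(labels, max_per_class=100)
counts = Counter(labels[i] for i in kept)
```

In this toy setting the two frequent classes are cut down to the cap while the rare class is left untouched, so frequent classes no longer dominate the sample count to the same degree.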