Abstract. Recent CNN-based object detectors, whether one-stage methods like YOLO, SSD, and RetinaNet or two-stage detectors like Faster
R-CNN, R-FCN, and FPN, are usually fine-tuned directly from
ImageNet pre-trained models designed for the task of image classification. However, there has been little work discussing the backbone feature
extractor specifically designed for the task of object detection. More importantly, there are several differences between the tasks of image classification and object detection. (i) Recent object detectors like FPN and
RetinaNet usually involve extra stages compared with image classification networks in order to handle objects at various scales. (ii) Object detection
must not only recognize the category of each object instance but also
localize it spatially. Large downsampling factors bring a large valid receptive field, which is good for image classification but compromises
object localization ability. Due to the gap between image classification
and object detection, we propose DetNet, a novel
backbone network specifically designed for object detection.
DetNet includes extra stages compared with traditional backbone networks
for image classification, while maintaining high spatial resolution in deeper
layers. Without bells and whistles, state-of-the-art results have been
obtained for both object detection and instance segmentation on the
MSCOCO benchmark based on our DetNet (4.8G FLOPs) backbone.
Code will be released.
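
The trade-off between downsampling and localization described in the abstract can be illustrated with simple arithmetic. The following is a minimal sketch, not the authors' code: the input size, the stride values, and the helper names `feature_map_size` and `cells_per_object` are all assumptions chosen for illustration.

```python
def feature_map_size(input_size: int, stride: int) -> int:
    """Side length of the feature map for a square input (ceiling division)."""
    return -(-input_size // stride)


def cells_per_object(object_size: int, stride: int) -> float:
    """How many feature-map cells an object spans along one side."""
    return object_size / stride


if __name__ == "__main__":
    input_size = 224   # assumed input resolution, for illustration only
    object_size = 32   # assumed side length of a small object, in pixels
    # Assumed cumulative strides; each extra downsampling stage doubles the stride.
    for stride in (4, 8, 16, 32, 64):
        side = feature_map_size(input_size, stride)
        span = cells_per_object(object_size, stride)
        print(f"stride {stride:>2}: {side}x{side} map, object spans {span} cells per side")
```

Under these assumed numbers, a 32-pixel object covers less than one feature-map cell per side once the stride exceeds 32, which is one way to see why adding stages by further downsampling can hurt localization of small objects.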