Abstract. State-of-the-art object detectors usually employ feature pyramids to learn multi-scale representations and obtain better results. However, current feature pyramid designs still integrate semantic information across different scales inefficiently. In this paper, we
begin by investigating current feature pyramid solutions, and then reformulate feature pyramid construction as a feature reconfiguration process. Finally, we propose a novel reconfiguration architecture to
combine low-level representations with high-level semantic features in a
highly nonlinear yet efficient way. In particular, our architecture, which
consists of global attention and local reconfiguration, is able to gather
task-oriented features across different spatial locations and scales, globally and locally. Both the global attention and local reconfiguration are
lightweight, in-place, and end-to-end trainable. Applied to the
basic SSD system, our models achieve consistent and significant improvements
over the original model and its other variants, without losing
real-time processing speed.