Abstract
Visual recognition requires rich representations that span levels from low to high, scales from small to large, and resolutions from fine to coarse. Even with the depth of features in a convolutional network, a layer in isolation is not enough: compounding and aggregating these representations improves inference of what and where. Architectural efforts are exploring many dimensions for network backbones, designing deeper or wider architectures, but how to best aggregate layers and blocks across a network deserves further attention. Although skip connections have been incorporated to combine layers, these connections have been "shallow" themselves, and only fuse by simple, one-step operations. We augment standard architectures with deeper aggregation to better fuse information across layers. Our deep layer aggregation structures iteratively and hierarchically merge the feature hierarchy to make networks with better accuracy and fewer parameters. Experiments across architectures and tasks show that deep layer aggregation improves recognition and resolution compared to existing branching and merging schemes.
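To make the aggregation idea concrete, the following is a minimal sketch, not the paper's reference implementation: it assumes an aggregation node that fuses same-resolution feature maps by channel concatenation followed by a 1x1 convolution, batch normalization, and ReLU, and an iterative merge that folds in stage outputs one at a time. The names AggregationNode and iterative_aggregate are illustrative; real deep layer aggregation also handles resolution changes between stages, which this sketch omits.

    import torch
    import torch.nn as nn

    class AggregationNode(nn.Module):
        """Fuse several same-resolution feature maps into one (illustrative sketch)."""
        def __init__(self, in_channels_list, out_channels):
            super().__init__()
            # Concatenate inputs along the channel axis, then project with a
            # 1x1 convolution followed by batch norm and ReLU.
            self.conv = nn.Conv2d(sum(in_channels_list), out_channels,
                                  kernel_size=1, bias=False)
            self.bn = nn.BatchNorm2d(out_channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, *features):
            x = torch.cat(features, dim=1)
            return self.relu(self.bn(self.conv(x)))

    def iterative_aggregate(features, nodes):
        # Merge stage outputs shallowest-first, refining the running
        # aggregate at each step rather than fusing once at the end.
        out = features[0]
        for feat, node in zip(features[1:], nodes):
            out = node(out, feat)
        return out

    # Example: fuse three stage outputs with two binary aggregation nodes
    # (channel widths here are arbitrary, chosen only for the demo).
    f1 = torch.randn(1, 64, 32, 32)
    f2 = torch.randn(1, 128, 32, 32)
    f3 = torch.randn(1, 256, 32, 32)
    nodes = [AggregationNode([64, 128], 128), AggregationNode([128, 256], 256)]
    out = iterative_aggregate([f1, f2, f3], nodes)  # shape: (1, 256, 32, 32)

In this sketch the repeated application of aggregation nodes is what makes the aggregation "deep": each stage's features pass through a chain of fusions rather than a single skip connection.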