Abstract
Semantic image segmentation is a basic street scene understanding task in autonomous driving, in which each pixel of
a high-resolution image is assigned one of a set of semantic labels. Unlike in other scenarios, objects in autonomous
driving scenes exhibit very large scale changes, which poses
great challenges for high-level feature representation in the
sense that multi-scale information must be correctly encoded. To remedy this problem, atrous convolution[14] was
introduced to generate features with larger receptive fields
without sacrificing spatial resolution. Built upon atrous
convolution, Atrous Spatial Pyramid Pooling (ASPP)[2]
was proposed to concatenate multiple atrous-convolved features computed with different dilation rates into a final feature representation. Although ASPP is able to generate multi-scale
features, we argue that the feature resolution along the scale axis
is not dense enough for the autonomous driving scenario.
To this end, we propose Densely connected Atrous Spatial Pyramid Pooling (DenseASPP), which densely connects a set
of atrous convolutional layers, such that it
generates multi-scale features that not only cover a larger
scale range, but also cover that scale range densely, without significantly increasing the model size. We evaluate
DenseASPP on the street scene benchmark Cityscapes[4]
and achieve state-of-the-art performance.
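
To make the claim about scale-axis density concrete, the following minimal sketch compares the receptive field sizes reachable by parallel atrous branches (ASPP-style) with those reachable when the same layers are densely connected. It relies on assumptions that are not stated in the abstract: 3x3 kernels, an illustrative set of dilation rates (3, 6, 12, 18, 24), and DenseNet-style connectivity in which each layer receives the input and the outputs of all earlier layers, so that any in-order subset of layers forms a path. The function names are hypothetical, and the receptive field formula is the standard one for a dilated convolution, R = (k - 1) * d + 1.

```python
from itertools import combinations

def receptive_field(kernel_size, dilation):
    """Receptive field (in pixels, per axis) of one atrous convolution."""
    return (kernel_size - 1) * dilation + 1

def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of several atrous convolutions applied in sequence.

    Stacking two layers with fields R1 and R2 gives R1 + R2 - 1,
    so each extra layer adds (its own field - 1).
    """
    rf = 1
    for d in dilations:
        rf += receptive_field(kernel_size, d) - 1
    return rf

def parallel_scales(kernel_size, rates):
    """Scales seen when each rate is a separate parallel branch."""
    return sorted({receptive_field(kernel_size, r) for r in rates})

def dense_scales(kernel_size, rates):
    """Scales seen under the assumed dense connectivity.

    Every non-empty in-order subset of layers is a valid path
    from the input to the final concatenated feature.
    """
    scales = set()
    for n in range(1, len(rates) + 1):
        for subset in combinations(rates, n):
            scales.add(stacked_receptive_field(kernel_size, subset))
    return sorted(scales)

# Illustrative dilation rates (an assumption, not taken from the abstract).
RATES = (3, 6, 12, 18, 24)

parallel = parallel_scales(3, RATES)
dense = dense_scales(3, RATES)

print("parallel branches:", len(parallel), "scales, largest", max(parallel))
print("dense connections:", len(dense), "scales, largest", max(dense))
```

Under these assumptions the densely connected layers reach every scale the parallel branches do, plus many intermediate and larger ones, while using the same number of atrous layers. This is only an arithmetic illustration of receptive field coverage, not an implementation of the network.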