Encoder-Decoder with Atrous Separable
Convolution for Semantic Image Segmentation
Abstract. Spatial pyramid pooling modules and encoder-decoder structures
are used in deep neural networks for semantic segmentation tasks. The
former networks are able to encode multi-scale contextual information by
probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks
can capture sharper object boundaries by gradually recovering the spatial
information. In this work, we propose to combine the advantages from
both methods. Specifically, our proposed model, DeepLabv3+, extends
DeepLabv3 by adding a simple yet effective decoder module to refine the
segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution
to both Atrous Spatial Pyramid Pooling and decoder modules, resulting
in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes
datasets, achieving test set performances of 89.0% and 82.1%, respectively, without
any post-processing. Our paper is accompanied by a publicly available
reference implementation of the proposed models in TensorFlow at
https://github.com/tensorflow/models/tree/master/research/deeplab.
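To make the core operation concrete, the following is a minimal NumPy sketch (not the paper's TensorFlow implementation; function names are illustrative) of an atrous separable convolution: a depthwise convolution with a dilation rate, which enlarges the effective field-of-view without extra parameters, followed by a 1x1 pointwise convolution that mixes channels.

```python
import numpy as np

def atrous_depthwise_conv(x, dw, rate):
    """Depthwise (per-channel) convolution with atrous (dilation) rate.

    x:    (H, W, C) input feature map
    dw:   (k, k, C) one k x k filter per channel
    rate: dilation rate; effective kernel size is (k - 1) * rate + 1
    """
    H, W, C = x.shape
    k = dw.shape[0]
    pad = (k - 1) * rate // 2  # "same" padding for odd k
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    # Each filter tap reads the input at a stride of `rate` pixels,
    # so channels are filtered independently at an enlarged field-of-view.
    for i in range(k):
        for j in range(k):
            out += xp[i * rate:i * rate + H, j * rate:j * rate + W, :] * dw[i, j, :]
    return out

def pointwise_conv(x, pw):
    """1x1 convolution mixing channels; pw has shape (C_in, C_out)."""
    return x @ pw

def atrous_separable_conv(x, dw, pw, rate):
    """Atrous separable conv = atrous depthwise conv + 1x1 pointwise conv."""
    return pointwise_conv(atrous_depthwise_conv(x, dw, rate), pw)
```

An ASPP-style module would apply `atrous_separable_conv` with several rates (e.g. 6, 12, 18) in parallel and concatenate the results; factoring the convolution this way is what makes the multi-rate probing cheap relative to full convolutions.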