Abstract
Recent advances in semantic image segmentation have mostly been achieved by training deep convolutional neural networks (CNNs). We show how to improve semantic segmentation through the use of contextual information; specifically, we explore ‘patch-patch’ context between image regions, and ‘patch-background’ context. For learning from the patch-patch context, we formulate Conditional Random Fields (CRFs) with CNN-based pairwise potential functions to capture semantic correlations between neighboring patches. Efficient piecewise training of the proposed deep structured model is then applied to avoid repeated expensive CRF inference for back propagation. For capturing the patch-background context, we show that a network design with traditional multi-scale image input and sliding pyramid pooling is effective for improving performance. Our experimental results set new state-of-the-art performance on a number of popular semantic segmentation datasets, including NYUDv2, PASCAL VOC 2012, PASCAL-Context, and SIFT-flow. In particular, we achieve an intersection-over-union score of 78.0 on the challenging PASCAL VOC 2012 dataset.
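The intersection-over-union score quoted above is not defined in the abstract. As a minimal sketch, assuming it is the per-class IoU averaged over classes and reported on a 0 to 100 scale (a common convention for segmentation benchmarks), it could be computed as follows. The function names `per_class_iou` and `mean_iou` are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence


def per_class_iou(pred: Sequence[int], truth: Sequence[int],
                  num_classes: int) -> List[Optional[float]]:
    """IoU per class for flattened label maps; None if a class occurs in neither map."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must contain the same number of pixels")
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    truth_count = [0] * num_classes
    for p, t in zip(pred, truth):
        pred_count[p] += 1
        truth_count[t] += 1
        if p == t:
            intersection[p] += 1
    ious: List[Optional[float]] = []
    for c in range(num_classes):
        # |A or B| = |A| + |B| - |A and B|
        union = pred_count[c] + truth_count[c] - intersection[c]
        ious.append(intersection[c] / union if union > 0 else None)
    return ious


def mean_iou(pred: Sequence[int], truth: Sequence[int], num_classes: int) -> float:
    """Mean IoU over the classes that occur, scaled to 0-100 (assumed convention)."""
    valid = [v for v in per_class_iou(pred, truth, num_classes) if v is not None]
    if not valid:
        raise ValueError("no class occurs in either label map")
    return 100.0 * sum(valid) / len(valid)


# Toy example: six pixels, three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, truth, num_classes=3)
```

In this toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 50 on the assumed 0 to 100 scale. Benchmark evaluation protocols may differ in detail, for instance by accumulating counts over the whole dataset before dividing or by ignoring unlabeled pixels.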