Abstract
We introduce a purely feedforward architecture for semantic segmentation. We map small image elements (superpixels) to rich feature representations extracted from a sequence of nested regions of increasing extent. These regions are obtained by "zooming out" from the superpixel all the way to scene-level resolution. This approach exploits statistical structure in the image and in the label space without setting up explicit structured prediction mechanisms, and thus avoids complex and expensive inference. Instead, superpixels are classified by a feedforward multilayer network. Our architecture achieves 69.6% average accuracy on the PASCAL VOC 2012 test set.
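The zoom-out idea sketched above can be illustrated in miniature: for each superpixel, pool features over a sequence of nested regions (the superpixel itself, an enlarged neighborhood, and the full image) and concatenate the results into one descriptor. This is only a toy sketch, not the paper's feature extractor; the feature map, superpixel layout, dilation radius, and choice of three zoom levels are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "convolutional" feature map for one image: H x W pixels, C channels.
H, W, C = 16, 16, 8
feat = rng.standard_normal((H, W, C))

# Toy superpixel map: four quadrant superpixels, labels 0..3 (illustrative
# stand-in for an actual oversegmentation such as SLIC).
sp = (np.arange(H)[:, None] // 8) * 2 + (np.arange(W)[None, :] // 8)

def zoomout_features(feat, sp, k, radius=3):
    """Concatenate average-pooled features over nested regions around
    superpixel k: (1) the superpixel itself, (2) its bounding box dilated
    by `radius` (a stand-in for intermediate zoom levels), and (3) the
    whole image (scene level)."""
    mask = sp == k
    ys, xs = np.nonzero(mask)
    local = feat[mask].mean(axis=0)
    y0, y1 = max(ys.min() - radius, 0), min(ys.max() + radius + 1, feat.shape[0])
    x0, x1 = max(xs.min() - radius, 0), min(xs.max() + radius + 1, feat.shape[1])
    proximal = feat[y0:y1, x0:x1].reshape(-1, feat.shape[2]).mean(axis=0)
    scene = feat.reshape(-1, feat.shape[2]).mean(axis=0)
    return np.concatenate([local, proximal, scene])  # shape (3*C,)

# One zoom-out descriptor per superpixel; these would feed the classifier.
X = np.stack([zoomout_features(feat, sp, k) for k in range(4)])
print(X.shape)  # (4, 24)
```

In the full system each superpixel's descriptor would be classified independently by a multilayer network, so no structured inference (e.g. a CRF) is run at test time.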