Abstract
We aim to detect all instances of a category in an image and, for each instance, mark the pixels that belong to it. We call this task Simultaneous Detection and Segmentation (SDS). Unlike classical bounding box detection, SDS requires a segmentation and not just a box. Unlike classical semantic segmentation, we require individual object instances. We build on recent work that uses convolutional neural networks to classify category-independent region proposals (R-CNN [16]), introducing a novel architecture tailored for SDS. We then use category-specific, top-down figure-ground predictions to refine our bottom-up proposals. We show a 7 point boost (16% relative) over our baselines on SDS, a 5 point boost (10% relative) over state-of-the-art on semantic segmentation, and state-of-the-art performance in object detection. Finally, we provide diagnostic tools that unpack performance and provide directions for future work.