Abstract
Common visual recognition tasks such as classification, object detection, and semantic segmentation are rapidly reaching maturity, and given the recent rate of progress, it is not unreasonable to conjecture that techniques for many of these problems will approach human levels of performance in the next few years. In this paper we look to the future: what is the next frontier in visual recognition?
We offer one possible answer to this question. We propose a detailed image annotation that captures information beyond the visible pixels and requires complex reasoning about full scene structure. Specifically, we create an amodal segmentation of each image: the full extent of each region is marked, not just the visible pixels. Annotators outline and name all salient regions in the image and specify a partial depth order. The result is a rich scene structure, including visible and occluded portions of each region, figure-ground edge information, semantic labels, and object overlap.
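To make the annotation concrete, the sketch below models one region as a record holding a semantic label, an amodal (full-extent) mask, a visible mask, and a rank in the partial depth order; the occluded portion falls out as the amodal extent minus the visible pixels. This schema and all names in it (`AmodalRegion`, `occluded_mask`, `depth_rank`) are hypothetical illustrations, not the paper's actual annotation format.

```python
from dataclasses import dataclass

@dataclass
class AmodalRegion:
    """Hypothetical record for one annotated region (not the paper's format).

    amodal_mask marks the full region extent, including occluded pixels;
    visible_mask marks only the pixels actually seen; depth_rank is the
    region's place in the partial depth order (lower = closer to camera).
    """
    name: str
    amodal_mask: list[list[int]]   # binary mask over the image grid
    visible_mask: list[list[int]]  # binary mask, subset of amodal_mask
    depth_rank: int

def occluded_mask(region: AmodalRegion) -> list[list[int]]:
    """Occluded portion = amodal extent minus the visible pixels."""
    return [[a & (1 - v) for a, v in zip(a_row, v_row)]
            for a_row, v_row in zip(region.amodal_mask, region.visible_mask)]

# A cup whose top-right pixel is hidden behind a closer object:
cup = AmodalRegion(
    name="cup",
    amodal_mask=[[1, 1], [1, 1]],
    visible_mask=[[1, 0], [1, 1]],
    depth_rank=1,
)
print(occluded_mask(cup))  # -> [[0, 1], [0, 0]]
```

A real implementation would store masks as run-length encodings or polygon outlines rather than dense binary grids, but the invariant is the same: the visible mask is contained in the amodal mask, and their difference is the occluded region.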
We create two datasets for semantic amodal segmentation. First, we label 500 images in the BSDS dataset with multiple annotators per image, allowing us to study the statistics of human annotations. We show that the proposed full scene annotation is surprisingly consistent between annotators, including for regions and edges. Second, we annotate 5000 images from COCO. This larger dataset allows us to explore a number of algorithmic ideas for amodal segmentation and depth ordering. We introduce novel metrics for these tasks, and along with our strong baselines, define concrete new challenges for the community.