Abstract
We present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting, and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.