Abstract
A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets.
Unfortunately, in the context of RGB-D scene understanding, very little data is available – current datasets cover a
small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an
RGB-D video dataset containing 2.5M views in 1513 scenes
annotated with 3D camera poses, surface reconstructions,
and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system
that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data
helps achieve state-of-the-art performance on several 3D
scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval.