Abstract
Modern computer vision algorithms typically require expensive data acquisition and accurate manual labeling. In this work, we instead leverage the recent progress in computer graphics to generate fully labeled, dynamic, and photo-realistic proxy virtual worlds. We propose an efficient real-to-virtual world cloning method, and validate our approach by building and publicly releasing a new video dataset, called "Virtual KITTI", automatically labeled with accurate ground truth for object detection, tracking, scene and instance segmentation, depth, and optical flow. We provide quantitative experimental evidence suggesting that (i) modern deep learning algorithms pre-trained on real data behave similarly in real and virtual worlds, and (ii) pre-training on virtual data improves performance. As the gap between real and virtual worlds is small, virtual worlds enable measuring the impact of various weather and imaging conditions on recognition performance, all other things being equal. We show that these factors may drastically affect otherwise high-performing deep models for tracking.