Abstract
Deep networks are extremely hungry for data. They devour hundreds of thousands of labeled images to learn robust and semantically meaningful feature representations.
Current networks are so data-hungry that collecting labeled
data has become as important as designing the networks
themselves. Unfortunately, manual data collection is both
expensive and time-consuming. We present an alternative,
and show how ground truth labels for many vision tasks
are easily extracted from video games in real time as we
play them. We interface with the popular Microsoft DirectX
rendering API and inject specialized rendering code into
the game as it is running. This code produces ground truth
labels for instance segmentation, semantic labeling, depth
estimation, optical flow, intrinsic image decomposition, and
instance tracking. Instead of labeling images, a researcher
now simply plays video games all day long. Our method
is general and works on a wide range of video games. We
collect a dataset of 220k training images and 60k test
images across three video games, and evaluate state-of-the-art
optical flow, depth estimation, and intrinsic image decomposition algorithms. Our video game data is visually closer to
real-world images than other synthetic datasets.
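
To make the injection step above concrete, the following is a minimal sketch of one common interception technique on Windows: patching the COM vtable of a Direct3D swap chain so that every call to IDXGISwapChain::Present first passes through our code before the game's frame is displayed. The names HookedPresent, CaptureGroundTruth, and InstallHook are hypothetical illustrations, not the paper's actual instrumentation.

```cpp
// Hypothetical sketch: intercepting IDXGISwapChain::Present via vtable patching.
#include <windows.h>
#include <dxgi.h>

using PresentFn = HRESULT (STDMETHODCALLTYPE*)(IDXGISwapChain*, UINT, UINT);
static PresentFn g_realPresent = nullptr;

// Placeholder: in a real system this would read back render targets
// (depth, object IDs, motion vectors, ...) before the frame is shown.
static void CaptureGroundTruth(IDXGISwapChain* /*swapChain*/) {}

// Replacement Present: record ground truth for this frame,
// then forward to the game's original Present so rendering proceeds normally.
static HRESULT STDMETHODCALLTYPE HookedPresent(IDXGISwapChain* swapChain,
                                               UINT syncInterval, UINT flags) {
    CaptureGroundTruth(swapChain);
    return g_realPresent(swapChain, syncInterval, flags);
}

// Patch slot 8 of the swap chain's COM vtable, which holds
// IDXGISwapChain::Present (slots 0-7 are IUnknown/IDXGIObject methods).
static void InstallHook(IDXGISwapChain* swapChain) {
    void** vtable = *reinterpret_cast<void***>(swapChain);
    DWORD oldProtect;
    VirtualProtect(&vtable[8], sizeof(void*), PAGE_EXECUTE_READWRITE, &oldProtect);
    g_realPresent = reinterpret_cast<PresentFn>(vtable[8]);
    vtable[8] = reinterpret_cast<void*>(&HookedPresent);
    VirtualProtect(&vtable[8], sizeof(void*), oldProtect, &oldProtect);
}
```

In practice a hook like this would be installed from a DLL loaded into the running game process, so that ground truth can be captured on every frame while the game is played normally.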