Abstract
Popular research areas like autonomous driving and
augmented reality have renewed the interest in image-based
camera localization. In this work, we address the task
of predicting the 6D camera pose from a single RGB image in a given 3D environment. With the advent of neural networks, previous works have either learned the entire camera localization process or multiple components
of a camera localization pipeline. Our key contribution
is to demonstrate and explain that learning a single component of this pipeline is sufficient. This component is a
fully convolutional neural network for densely regressing
so-called scene coordinates, defining the correspondence
between the input image and the 3D scene space. The
neural network is prepended to a new end-to-end trainable
pipeline. Our system is efficient, highly accurate, robust in
training, and exhibits outstanding generalization capabilities. It consistently exceeds the state of the art on indoor and
outdoor datasets. Interestingly, our approach surpasses existing techniques even without utilizing a 3D model of the
scene during training, since the network is able to discover
3D scene geometry automatically, solely from single-view
constraints.
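To make the notion of scene coordinates concrete, the following minimal sketch (not the paper's code; intrinsics, pose, and pixel values are made-up examples) illustrates the correspondence they define: each pixel is mapped to the 3D world point it observes, so that reprojecting that point with the camera pose recovers the pixel. Such 2D-3D matches are exactly the input from which a PnP solver can estimate the 6D pose.

```python
import numpy as np

# Hypothetical pinhole camera intrinsics (focal length, principal point).
K = np.array([[525.0,   0.0, 320.0],
              [  0.0, 525.0, 240.0],
              [  0.0,   0.0,   1.0]])

# A hypothetical ground-truth pose: world-to-camera rotation R and translation t.
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.2, -0.1, 0.5])

def scene_coordinate(u, v, depth):
    """Back-project pixel (u, v) with known depth into the world frame."""
    p_cam = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])  # camera frame
    return R.T @ (p_cam - t)                                   # world frame

def project(y):
    """Project world point y into the image using pose (R, t)."""
    p = K @ (R @ y + t)
    return p[:2] / p[2]

# The scene coordinate of a pixel reprojects exactly onto that pixel,
# forming a 2D-3D correspondence usable for PnP pose estimation.
y = scene_coordinate(400.0, 300.0, 2.0)
u, v = project(y)
```

In the paper's setting, a fully convolutional network predicts such a scene coordinate densely for every pixel, replacing the depth-based back-projection above with a learned regression.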