Abstract
Maps are a key component in image-based camera localization and visual SLAM systems: they are used to establish geometric constraints between images, correct drift
in relative pose estimation, and relocalize cameras after tracking is lost. The exact definitions of maps, however, are often
application-specific and hand-crafted for different scenarios (e.g. 3D landmarks, lines, planes, bags of visual words).
We propose to represent maps as a deep neural network called
MapNet, which enables learning a data-driven map representation. Unlike prior work on learning maps, MapNet exploits cheap and ubiquitous sensory inputs like visual
odometry and GPS in addition to images and fuses them
together for camera localization. Geometric constraints expressed by these inputs, which have traditionally been used
in bundle adjustment or pose-graph optimization, are formulated as loss terms in MapNet training and also used
during inference. In addition to directly improving localization accuracy, this allows us to update the MapNet (i.e.,
maps) in a self-supervised manner using additional unlabeled video sequences from the scene. We also propose a
novel parameterization for camera rotation that is better suited for deep-learning-based camera pose regression.
Experimental results on both the indoor 7-Scenes dataset
and the outdoor Oxford RobotCar dataset show significant
performance improvement over prior work. The MapNet
project webpage is at https://goo.gl/mRB3Au.
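To make "geometric constraints as loss terms" concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it combines an absolute-pose term with a relative-pose term derived from visual odometry measurements and keeps rotations as logarithms of unit quaternions. The learnable weighting scalars, the 6-D pose differencing, and the names qlog, PoseLoss, and localization_loss are illustrative assumptions rather than details taken from the paper.

    import torch
    import torch.nn as nn

    def qlog(q, eps=1e-8):
        # Map unit quaternions (w, x, y, z) of shape (..., 4) to their
        # 3-vector logarithms of shape (..., 3); the identity maps to zeros.
        w, v = q[..., :1], q[..., 1:]
        norm_v = v.norm(dim=-1, keepdim=True).clamp(min=eps)
        return v / norm_v * torch.acos(w.clamp(-1.0, 1.0))

    class PoseLoss(nn.Module):
        # L1 error on translation and log-quaternion rotation, balanced by
        # learnable scalars (an assumption; the paper defines its own weighting).
        def __init__(self):
            super().__init__()
            self.beta = nn.Parameter(torch.zeros(()))   # translation weight
            self.gamma = nn.Parameter(torch.zeros(()))  # rotation weight

        def forward(self, pred, target):
            # pred, target: (..., 6) = [t (3) | log q (3)]
            t_err = (pred[..., :3] - target[..., :3]).abs().mean()
            r_err = (pred[..., 3:] - target[..., 3:]).abs().mean()
            return (t_err * torch.exp(-self.beta) + self.beta
                    + r_err * torch.exp(-self.gamma) + self.gamma)

    def localization_loss(pred_poses, gt_poses, vo_rel_poses, criterion):
        # pred_poses, gt_poses: (T, 6) per-frame absolute poses.
        # vo_rel_poses: (T-1, 6) relative poses measured by visual odometry.
        abs_term = criterion(pred_poses, gt_poses)
        # Relative pose between consecutive predictions; a real implementation
        # would compose SE(3) transforms instead of differencing 6-D vectors.
        pred_rel = pred_poses[1:] - pred_poses[:-1]
        rel_term = criterion(pred_rel, vo_rel_poses)
        return abs_term + rel_term

    if __name__ == "__main__":
        T = 5
        pred = torch.randn(T, 6, requires_grad=True)   # stand-in for network output
        gt = torch.randn(T, 6)                         # labeled absolute poses
        vo = torch.randn(T - 1, 6)                     # odometry-derived relative poses
        criterion = PoseLoss()
        loss = localization_loss(pred, gt, vo, criterion)
        loss.backward()
        print(float(loss), qlog(torch.tensor([[1.0, 0.0, 0.0, 0.0]])))

Because the relative-pose term depends only on odometry measurements and not on ground-truth labels, the same term can also be minimized on unlabeled sequences, which is the sense in which the map can be updated in a self-supervised manner.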