Abstract
Deep learning has been shown to be effective for robust and
real-time monocular image relocalisation. In particular,
PoseNet [22] is a deep convolutional neural network which
learns to regress the 6-DOF camera pose from a single image. It learns to localize using high-level features and is
robust to difficult lighting, motion blur and unknown camera intrinsics, where point-based SIFT registration fails.
However, it was trained using a naive loss function, with
hyper-parameters which require expensive tuning. In this
paper, we give the problem a more fundamental theoretical treatment. We explore a number of novel loss functions
for learning camera pose which are based on geometry and
scene reprojection error. Additionally, we show how to automatically learn an optimal weighting to simultaneously
regress position and orientation. By leveraging geometry,
we demonstrate that our technique significantly improves
PoseNet’s performance across datasets ranging from indoor
rooms to a small city.