Abstract
We present an efficient method for geolocalization in urban environments, starting from a coarse location estimate provided by GPS and using a simple untextured
2.5D model of the surrounding buildings. Our key contribution is a novel, efficient, and robust method for optimizing the
pose: we train a deep network to predict the best direction
to improve a pose estimate, given a semantic segmentation
of the input image and a rendering of the buildings from this
estimate. We then iteratively apply this CNN until it converges to a good pose. This approach avoids the need for reference images of the surroundings, which are difficult to acquire and match, whereas 2.5D models are broadly available.
We can therefore apply it to places unseen during training.
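
The iterative refinement described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the 3-DoF pose (x, y, heading), the set of discrete update directions, the unit step size, and the names `refine_pose` and `make_toy_predictor` are all assumptions made for illustration. In particular, the toy predictor stands in for the trained network and reads the direction off a known target pose, whereas the real network would infer it from the semantic segmentation and the rendering of the 2.5D model.

```python
from typing import Callable, Tuple

# Simplified 3-DoF pose (x, y, heading). This parameterization is an assumption.
Pose = Tuple[float, float, float]

# Hypothetical discrete update directions: label -> (dx, dy, dheading).
STEPS = {
    "x+": (1.0, 0.0, 0.0), "x-": (-1.0, 0.0, 0.0),
    "y+": (0.0, 1.0, 0.0), "y-": (0.0, -1.0, 0.0),
    "t+": (0.0, 0.0, 1.0), "t-": (0.0, 0.0, -1.0),
}


def refine_pose(initial: Pose,
                predict_direction: Callable[[Pose], str],
                max_iters: int = 100) -> Pose:
    """Repeatedly move the pose one step in the predicted direction.

    `predict_direction` plays the role of the trained network: given the
    current estimate it returns a key of STEPS, or "stop" once no move
    is expected to improve the pose.
    """
    pose = initial
    for _ in range(max_iters):
        direction = predict_direction(pose)
        if direction == "stop":
            break
        dx, dy, dt = STEPS[direction]
        pose = (pose[0] + dx, pose[1] + dy, pose[2] + dt)
    return pose


def make_toy_predictor(target: Pose, tol: float = 0.5) -> Callable[[Pose], str]:
    """Stand-in for the network: picks the axis with the largest error.

    A real predictor would not know `target`; it would compare the image
    segmentation with a rendering of the buildings from the current pose.
    """
    def predict(pose: Pose) -> str:
        errors = [t - p for t, p in zip(target, pose)]
        axis = max(range(3), key=lambda i: abs(errors[i]))
        if abs(errors[axis]) <= tol:
            return "stop"
        return "xyt"[axis] + ("+" if errors[axis] > 0 else "-")
    return predict


if __name__ == "__main__":
    coarse = (0.0, 0.0, 0.0)   # e.g. a coarse GPS-like initial estimate
    target = (3.0, -2.0, 1.0)  # pose the toy predictor steers toward
    print(refine_pose(coarse, make_toy_predictor(target)))
```

The loop depends only on the predictor's interface, so a learned model could replace the toy predictor without changing `refine_pose`; the `max_iters` cap guards against a predictor that never returns "stop".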