Abstract
Machine learning techniques, namely convolutional neural networks (CNNs) and regression forests, have recently
shown great promise in performing 6-DoF localization
of monocular images. However, in most cases image sequences, rather than only single images, are readily available. Yet none of the proposed learning-based
approaches exploit the valuable constraint of temporal
smoothness, often leading to situations where the per-frame
error is larger than the camera motion. In this paper we
propose a recurrent model for performing 6-DoF localization of video clips. We find that, even by considering
only short sequences (20 frames), the pose estimates are
smoothed and the localization error can be drastically reduced. Finally, we consider means of obtaining probabilistic pose estimates from our model. We evaluate our method
on openly available real-world autonomous driving and indoor localization datasets.