Abstract
Autonomous agents need to reason about the world beyond their instantaneous sensory input. Integrating information over time, however, requires switching from an egocentric representation of a scene to an allocentric one, expressed in the world reference frame. It must also be possible to update the representation dynamically, which requires localizing and registering the sensor with respect to
the world reference. In this paper, we develop a differentiable module that satisfies such requirements, while being
robust, efficient, and suitable for integration in end-to-end
deep networks. The module contains an allocentric spatial
memory that can be accessed associatively by feeding to it
the current sensory input, resulting in localization, and then
updated using an LSTM or similar mechanism. We formulate efficient localization and registration of sensory information as a dual pair of convolution/deconvolution operators in memory space. The map itself is a 2.5D representation of an environment storing information that a deep neural network module learns to distill from RGBD input. The
result is a map that contains multi-task information, different from classical approaches to mapping such as structure-from-motion. We present results using synthetic mazes, a
dataset of hours of recorded gameplay of the classic game
Doom, and the very recent Active Vision Dataset of real images captured from a robot.
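
The dual convolution/deconvolution formulation lends itself to a short sketch. The following Python/PyTorch fragment is an illustrative assumption rather than the paper's exact architecture: the feature dimensions, the softmax read-out of the correlation scores, and the convex-combination write (standing in for the LSTM-style update) are placeholders chosen for clarity.

import torch
import torch.nn.functional as F

# Hypothetical shapes: C feature channels, an H x W allocentric map
# and a smaller h x w egocentric observation embedding.
C, H, W, h, w = 16, 64, 64, 15, 15

memory = torch.randn(1, C, H, W)   # allocentric spatial memory
obs    = torch.randn(1, C, h, w)   # features distilled from the current RGBD input

# Localization: correlate the observation against the memory by using it
# as a convolution kernel; the normalized response map acts as a
# distribution over sensor positions in the world frame.
scores = F.conv2d(memory, obs, padding=h // 2)                   # 1 x 1 x H x W
pose_prob = F.softmax(scores.view(1, -1), dim=1).view(1, 1, H, W)

# Registration: the dual (transposed) convolution writes the observation
# back into map space at the localized position(s).
registered = F.conv_transpose2d(pose_prob, obs, padding=h // 2)  # 1 x C x H x W

# The paper updates the memory with an LSTM-like mechanism; a simple
# convex combination stands in for that recurrent update here.
alpha = 0.5
memory = (1 - alpha) * memory + alpha * registered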