Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Abstract
A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions, we present the Matterport3D Simulator – a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings – the Room-to-Room (R2R) dataset.