Embodied Question Answering in Photorealistic Environments with Point Cloud Perception
Abstract
To help bridge the gap between internet vision-style
problems and the goal of vision for embodied perception, we
instantiate a large-scale navigation task, Embodied Question Answering [1], in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that
utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings. We find that two seemingly naive navigation baselines, forward-only and random, are strong navigators and
challenging to outperform, due to the specific choice of the
evaluation setting presented by [1]. We find a novel loss-weighting scheme, which we call Inflection Weighting, to be important when training recurrent models for navigation with behavior cloning, and we are able to outperform the baselines
with this technique. We find that point clouds provide a
richer signal than RGB images for learning obstacle avoidance, motivating the use (and continued study) of 3D deep
learning models for embodied navigation.