Abstract
We present a new AI task – Embodied Question Answering
(EmbodiedQA) – where an agent is spawned at a random
location in a 3D environment and asked a question (‘What
color is the car?’). In order to answer, the agent must first intelligently navigate to explore the environment, gather the necessary visual information through first-person (egocentric)
vision, and then answer the question (‘orange’).
EmbodiedQA requires a range of AI skills – language understanding, visual recognition, active perception, goal-driven navigation, commonsense reasoning, long-term
memory, and grounding language into actions. In this work,
we develop a dataset of questions and answers in House3D
environments [1], evaluation metrics, and a hierarchical
model trained with imitation and reinforcement learning.