Abstract
We introduce Interactive Question Answering (IQA),
the task of answering questions that require an autonomous
agent to interact with a dynamic visual environment. IQA
presents the agent with a scene and a question, such as “Are
there any apples in the fridge?” The agent must navigate
around the scene, acquire visual understanding of scene elements, interact with objects (e.g., open refrigerators), and
plan a series of actions conditioned on the question.
Popular reinforcement learning approaches with a single
controller perform poorly on IQA owing to the large and
diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized
set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN,
we introduce IQUAD V1, a new dataset built upon AI2-THOR [35],
a simulated photo-realistic environment of configurable
indoor scenes with interactive objects. IQUAD V1
has 75,000 questions, each paired with a unique scene configuration.
Our experiments show that our proposed model
outperforms popular single-controller methods on
IQUAD V1. For sample questions and results, please view
our video: https://youtu.be/pXd3C-1jr98.