Abstract. Despite significant progress in a variety of vision-and-language
problems, developing a method capable of asking intelligent, goal-oriented
questions about images has proven to be a formidable challenge. To
this end, we propose a Deep Reinforcement Learning framework based
on three new intermediate rewards, namely goal-achieved, progressive,
and informativeness, which encourage the generation of succinct questions
that in turn uncover valuable information towards the overall goal. By
directly optimizing for questions that work quickly towards fulfilling the
overall goal, we avoid the tendency of existing methods to generate long
series of inane queries that add little value. We evaluate our model on
the GuessWhat?! dataset and show that the resulting questions can help
a standard ‘Guesser’ identify a specific object in an image at a much
higher success rate.