Abstract
This paper introduces the PhotoBook dataset,
a large-scale collection of visually grounded,
task-oriented dialogues in English designed to
investigate shared dialogue history accumulating during conversation. Taking inspiration from seminal work on dialogue analysis,
we propose a data-collection task formulated
as a collaborative game prompting two online
participants to refer to images using both
their visual context and previously established referring expressions. We provide a detailed description of the task setup and a thorough analysis of the 2,500 dialogues collected.
To further illustrate the novel features of the
dataset, we propose a baseline model for reference resolution which uses a simple method
to take into account shared information accumulated in a reference chain. Our results show
that this information is particularly important
for resolving later descriptions and underline the
need to develop more sophisticated models of
common ground in dialogue interaction.