Abstract
We introduce the task of Visual Dialog, which requires an
AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from a specific
downstream task so as to serve as a general test of machine
intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial contains 1 dialog (10 question-answer pairs) on ∼140k images from the COCO dataset, with a total of ∼1.4M dialog question-answer pairs.
We introduce a family of neural encoder-decoder models
for Visual Dialog with 3 encoders (Late Fusion, Hierarchical Recurrent Encoder and Memory Network) and 2 decoders (generative and discriminative), which outperform a
number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as the mean reciprocal rank of the human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies.
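For concreteness, the mean reciprocal rank used in such a retrieval evaluation follows the standard definition (stated here for clarity, not quoted from the paper):

\[
\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i},
\]

where $N$ is the number of test question-answer instances and $\mathrm{rank}_i$ is the 1-based position of the human response within the agent's sorted list of candidate answers for instance $i$.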
Our dataset, code, and trained models will be released publicly at visualdialog.org. Putting it all together, we
demonstrate the first ‘visual chatbot’!