Abstract
Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on the visual and audio aspects of a given video, is significantly more challenging than developing traditional image- or text-grounded dialogue systems because (1) the feature space of videos spans multiple picture frames, making it difficult to obtain semantic information; and
(2) a dialogue agent must perceive and process
information from different modalities (audio,
video, caption, etc.) to obtain a comprehensive
understanding. Most existing work is based
on RNNs and sequence-to-sequence architectures, which struggle to capture the complex long-term dependencies present in videos. To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text
modalities. We develop a training procedure
to simulate token-level decoding to improve
the quality of generated responses during inference. We achieve state-of-the-art performance on the Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task and obtains promising performance.
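To make the query-aware attention idea concrete, the sketch below shows one possible way to attend from encoded query tokens to per-frame video features and add an auto-encoder-style reconstruction objective. This is an illustrative assumption for exposition, not the exact MTN architecture; all names, dimensions, and the loss form are hypothetical.

```python
# Minimal sketch of query-aware attention over video features (illustrative only,
# NOT the paper's exact model): names, dimensions, and the auxiliary
# reconstruction loss are assumptions made for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAwareAttention(nn.Module):
    """Attend from query-token states to per-frame video features."""
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Projects the attended summary back toward the query states
        # (an auto-encoder-like reconstruction head; assumed form).
        self.out = nn.Linear(d_model, d_model)

    def forward(self, query_states, video_feats):
        # query_states: (batch, query_len, d_model); video_feats: (batch, frames, d_model)
        q = self.q_proj(query_states)
        k = self.k_proj(video_feats)
        v = self.v_proj(video_feats)
        # Scaled dot-product attention of query tokens over video frames.
        scores = torch.matmul(q, k.transpose(-2, -1)) / q.size(-1) ** 0.5
        attended = torch.matmul(F.softmax(scores, dim=-1), v)  # query-aware video summary
        recon = self.out(attended)                              # reconstruct query states
        recon_loss = F.mse_loss(recon, query_states)            # auxiliary auto-encoding loss (assumed)
        return attended, recon_loss

# Example usage with random tensors (sizes are illustrative):
attn = QueryAwareAttention(d_model=512)
query = torch.randn(2, 10, 512)   # e.g. encoded question tokens
video = torch.randn(2, 40, 512)   # e.g. 40 sampled frame features
summary, aux_loss = attn(query, video)
```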