Abstract
Multimodal dialogue systems have opened new frontiers beyond traditional goal-oriented dialogue systems. State-of-the-art dialogue systems are primarily based on unimodal sources, predominantly text, and hence cannot capture the information present in other sources such as videos, audio, and images. With the availability of the large-scale Multimodal Dialogue dataset (MMD) (Saha et al., 2018) in the fashion domain, where the visual appearance of the products is essential for understanding the intention of the user, a system that does not capture information from both text and images is incapable of generating correct and desirable responses.
In this paper, we propose a novel position- and attribute-aware attention mechanism to learn an enhanced image representation conditioned on the user utterance. Our evaluation shows that the proposed model can generate appropriate responses while preserving the position and attribute information. Experimental results also show that our proposed approach attains superior performance compared to the baseline models and outperforms the state-of-the-art approaches on text-similarity-based evaluation metrics.
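To make the idea concrete, below is a minimal sketch, assuming a PyTorch implementation, of an additive attention layer that enriches per-image features with position and attribute embeddings and then attends over them conditioned on the utterance encoding. The class name, argument names, and dimensions are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAttributeAttention(nn.Module):
    """Sketch of a position- and attribute-aware attention over product images."""
    def __init__(self, img_dim, utt_dim, num_positions, num_attributes, hidden_dim):
        super().__init__()
        self.pos_emb = nn.Embedding(num_positions, img_dim)   # position of each product image in the turn
        self.attr_emb = nn.Embedding(num_attributes, img_dim)  # product attribute (e.g. colour, style)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.utt_proj = nn.Linear(utt_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, img_feats, pos_ids, attr_ids, utt_enc):
        # img_feats: (batch, num_images, img_dim) visual features of candidate product images
        # pos_ids, attr_ids: (batch, num_images) ids of each image's position and attribute
        # utt_enc: (batch, utt_dim) encoding of the user utterance
        enriched = img_feats + self.pos_emb(pos_ids) + self.attr_emb(attr_ids)
        # Additive attention conditioned on the utterance encoding
        scores = self.score(torch.tanh(
            self.img_proj(enriched) + self.utt_proj(utt_enc).unsqueeze(1)
        )).squeeze(-1)                                   # (batch, num_images)
        alpha = F.softmax(scores, dim=-1)                # attention weights over images
        # Weighted sum gives the enhanced image representation for the decoder
        return torch.bmm(alpha.unsqueeze(1), enriched).squeeze(1)  # (batch, img_dim)
```

The enhanced image representation produced this way can be fed, together with the utterance encoding, to a response decoder so that generated replies remain consistent with the position and attribute information referenced by the user.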