Abstract
Emotion recognition in conversations (ERC) is a challenging task that has recently gained popularity due to its potential applications. Until now, however, there has been no large-scale multimodal, multi-party emotional conversational database containing more than two speakers per dialogue. To address this gap, we propose the Multimodal EmotionLines Dataset (MELD), an extension and enhancement of EmotionLines. MELD contains about 13,000 utterances from 1,433 dialogues of the TV series Friends. Each utterance is annotated with emotion and sentiment labels, and encompasses audio, visual, and textual modalities. We propose several strong multimodal baselines and show the importance of contextual and multimodal information for emotion recognition in conversations. The full dataset is available for use at http://affective-meld.github.io