Abstract
Sarcasm is a subtle form of language in which
people say the opposite of what they mean.
Previous work on sarcasm detection has focused
on text. However, social media platforms such as Twitter increasingly allow users to create multi-modal messages that include text, images, and videos, and detecting sarcasm in such messages based on text alone
is insufficient. In this paper, we focus on multi-modal sarcasm detection for tweets consisting
of text and images on Twitter. We treat text
features, image features and image attributes
as three modalities and propose a multi-modal
hierarchical fusion model to address this task.
Our model first extracts image features and attribute features, and then leverages the attribute
features and a bidirectional LSTM network to
extract text features. The features of the three modalities are then reconstructed and fused into
one feature vector for prediction. We create a multi-modal sarcasm detection dataset
based on Twitter. Evaluation results on the
dataset demonstrate the efficacy of our proposed model and the usefulness of the three
modalities.
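The hierarchical fusion described above can be sketched in miniature as follows. This is only an illustration under assumptions, not the paper's implementation: the feature sizes, the shared fusion dimension `d`, the random weights, and the tanh projection are all placeholders, whereas the real model learns its parameters and uses a CNN image encoder and a bidirectional LSTM.

```python
import math
import random

random.seed(0)

def linear(x, w, b):
    # Dense layer: one dot product per output unit.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def reconstruct(x, d):
    # "Representation fusion": project a raw modality vector into a
    # shared d-dimensional space (random weights and tanh here are
    # stand-ins for learned parameters).
    w = [[random.gauss(0, 0.1) for _ in x] for _ in range(d)]
    b = [0.0] * d
    return [math.tanh(v) for v in linear(x, w, b)]

# Hypothetical raw features for the three modalities (sizes assumed).
text_feat  = [random.gauss(0, 1) for _ in range(100)]  # e.g., BiLSTM output
image_feat = [random.gauss(0, 1) for _ in range(200)]  # e.g., image encoder
attr_feat  = [random.gauss(0, 1) for _ in range(50)]   # e.g., attribute embeddings

d = 64  # shared fusion dimension (an assumption)
fused = []  # "modality fusion": concatenate the reconstructed vectors
for feat in (text_feat, image_feat, attr_feat):
    fused.extend(reconstruct(feat, d))

# A logistic unit on the fused vector yields the sarcasm probability.
w_out = [random.gauss(0, 0.1) for _ in fused]
score = 1.0 / (1.0 + math.exp(-sum(w * f for w, f in zip(w_out, fused))))
print(len(fused), round(score, 3))
```

The key design point the sketch mirrors is that each modality is first mapped into a common space before fusion, so the final classifier sees one vector of fixed size (here 3 × 64 = 192) regardless of the raw feature dimensions.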