Abstract
As an important task in Sentiment Analysis, Target-oriented Sentiment Classification (TSC) aims to identify the sentiment polarity expressed toward each opinion target in a sentence. However, existing approaches to this task rely primarily on textual content, ignoring increasingly popular multimodal data sources (e.g., images) that can enhance the robustness of these text-based models. Motivated by this observation and inspired by the recently proposed BERT architecture, we study Target-oriented Multimodal Sentiment Classification (TMSC) and propose a multimodal BERT architecture. To model intra-modality dynamics, we first apply BERT to obtain target-sensitive textual representations. We then borrow the idea of self-attention and design a target attention mechanism that performs target-image matching to derive target-sensitive visual representations. To model inter-modality dynamics, we further stack a set of self-attention layers on top to capture multimodal interactions. Experimental results show that our model outperforms several highly competitive approaches to TSC and TMSC.
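The target attention mechanism mentioned above can be sketched as scaled dot-product attention in which a target representation serves as the query over image-region features. This is a minimal illustration, not the paper's exact implementation: the class name `TargetAttention`, the projection layers, and the dimension choices (a BERT-sized target vector attending over CNN grid regions) are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetAttention(nn.Module):
    """Match a target representation (query) against image region
    features (keys/values) to get a target-sensitive visual vector."""

    def __init__(self, target_dim: int, region_dim: int, hidden_dim: int):
        super().__init__()
        self.w_q = nn.Linear(target_dim, hidden_dim)  # target -> query
        self.w_k = nn.Linear(region_dim, hidden_dim)  # regions -> keys
        self.w_v = nn.Linear(region_dim, hidden_dim)  # regions -> values
        self.scale = hidden_dim ** 0.5

    def forward(self, target: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # target:  (batch, target_dim), e.g. a pooled BERT target encoding
        # regions: (batch, n_regions, region_dim), e.g. a 7x7 CNN grid -> 49 regions
        q = self.w_q(target).unsqueeze(1)                         # (batch, 1, hidden)
        k = self.w_k(regions)                                     # (batch, n_regions, hidden)
        v = self.w_v(regions)                                     # (batch, n_regions, hidden)
        scores = torch.matmul(q, k.transpose(1, 2)) / self.scale  # (batch, 1, n_regions)
        weights = F.softmax(scores, dim=-1)    # attention weights over image regions
        return torch.matmul(weights, v).squeeze(1)                # (batch, hidden)

# Example with hypothetical dimensions (BERT-sized target, ResNet-sized regions):
attn = TargetAttention(target_dim=768, region_dim=2048, hidden_dim=768)
visual_rep = attn(torch.randn(4, 768), torch.randn(4, 49, 2048))  # -> (4, 768)
```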
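The inter-modality stage can likewise be sketched as standard Transformer encoder (self-attention) layers applied over the concatenated visual and textual token sequences, so each layer attends across both modalities. Again, this is an assumed sketch: the layer count, pooling choice, and use of `nn.TransformerEncoder` are illustrative, not the paper's specified configuration.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Stack self-attention layers over concatenated visual and textual
    tokens to capture multimodal interactions, then classify sentiment."""

    def __init__(self, hidden_dim: int = 768, n_layers: int = 2,
                 n_heads: int = 12, n_classes: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(hidden_dim, n_classes)  # e.g. pos/neg/neutral

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_regions, hidden)  target-sensitive visual reps
        # text_tokens:   (batch, seq_len, hidden)    target-sensitive BERT outputs
        fused = torch.cat([visual_tokens, text_tokens], dim=1)
        fused = self.encoder(fused)       # self-attention spans both modalities
        return self.classifier(fused[:, 0])  # pool via first token (one simple choice)
```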