Abstract
In intelligent speech interaction, automatic speech
emotion recognition (SER) plays an important role
in understanding user intention. Because sentimental speech differs in speaker characteristics but
shares similar acoustic attributes, one vital challenge in
SER is how to learn robust and discriminative representations for emotion inference. In this paper,
inspired by human emotion perception, we propose
a novel representation learning component (RLC)
for SER systems, which is constructed with Multi-head Self-attention and a Global Context-aware Attention Long Short-Term Memory Recurrent Neural Network (GCA-LSTM). With the ability of
the Multi-head Self-attention mechanism to model
element-wise correlative dependencies, RLC
can exploit the common patterns of sentimental
speech features to enhance the emotion-salient information captured in representation learning. By
employing GCA-LSTM, RLC can selectively focus on emotion-salient factors while considering the context of the entire utterance, and gradually produce a discriminative representation for emotion inference. Experiments on the public emotional benchmark database IEMOCAP and a large-scale realistic interaction database demonstrate that the proposed SER framework outperforms state-of-the-art techniques, with 6.6%
to 26.7% relative improvement in unweighted accuracy.
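
For readers unfamiliar with the reported metric, the following is a minimal sketch of how unweighted accuracy and relative improvement are commonly computed. It assumes the usual SER convention that unweighted accuracy is the mean of per-class recall; the abstract itself does not define it. The function names, labels, and numbers below are hypothetical illustrations and are not taken from the paper's experiments.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition of unweighted accuracy)."""
    correct = defaultdict(int)  # correctly classified utterances per class
    total = defaultdict(int)    # reference utterances per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Toy labels for illustration only (not data from the paper).
y_true = ["angry", "angry", "happy", "neutral", "neutral", "neutral", "sad"]
y_pred = ["angry", "happy", "happy", "neutral", "neutral", "sad", "sad"]

ua = unweighted_accuracy(y_true, y_pred)
print(f"unweighted accuracy: {ua:.4f}")

# Hypothetical scores: a baseline at 0.60 and a new system at 0.66.
print(f"relative improvement: {relative_improvement(0.66, 0.60):.1%}")
```

Because every class contributes equally to the average, this metric is not dominated by the most frequent emotion, which matters when emotion classes are imbalanced.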