Constructing Interpretive Spatio-Temporal Features for Multi-Turn Responses Selection
Abstract
Response selection plays an important role
in fully automated dialogue systems. Given
the dialogue context, the goal of response selection is to identify the best-matched next utterance (i.e., the response) from multiple candidates. Despite the efforts of many previous models, this task remains challenging because of the large semantic gap and
the large size of the candidate set. To address
these issues, we propose a Spatio-Temporal
Matching network (STM) for response selection. In detail, soft alignment is first used to
obtain the local relevance between the context and the response. We then construct spatio-temporal features by aggregating
attention images along the time dimension and
use 3D convolution and pooling operations
to extract matching information. Evaluation
on two large-scale multi-turn response selection tasks has demonstrated that our proposed
model significantly outperforms the state-of-the-art model. In particular, visualization analysis shows that the spatio-temporal features
capture matching information in segment pairs
and time sequences and offer good interpretability for multi-turn text matching.
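
To make the pipeline described above concrete, the following is a minimal sketch of how attention images might be stacked along the time dimension and then passed through a 3D convolution and pooling step. It is an illustration under stated assumptions, not the STM implementation: the toy embeddings, the plain dot-product similarity, the all-ones kernel, and the helper names (attention_image, build_volume, conv3d_valid, max_pool_3d) are all hypothetical, and every utterance is assumed to be padded to the same length.

```python
# Illustrative sketch only: toy embeddings, dot-product similarity, and a fixed
# kernel stand in for the learned components of the actual model.

def attention_image(utterance, response):
    """Word-by-word similarity matrix between one utterance and the response."""
    return [[sum(u * r for u, r in zip(u_vec, r_vec)) for r_vec in response]
            for u_vec in utterance]

def build_volume(context, response):
    """Stack one attention image per turn along a leading time axis."""
    return [attention_image(utt, response) for utt in context]

def conv3d_valid(volume, kernel):
    """Naive single-channel 3D convolution (cross-correlation), no padding, stride 1."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    return [[[sum(kernel[a][b][c] * volume[t + a][h + b][w + c]
                  for a in range(kt) for b in range(kh) for c in range(kw))
              for w in range(W - kw + 1)]
             for h in range(H - kh + 1)]
            for t in range(T - kt + 1)]

def max_pool_3d(volume, size):
    """Non-overlapping 3D max pooling; assumes each axis divides evenly by the window."""
    st, sh, sw = size
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    return [[[max(volume[t + a][h + b][w + c]
                  for a in range(st) for b in range(sh) for c in range(sw))
              for w in range(0, W, sw)]
             for h in range(0, H, sh)]
            for t in range(0, T, st)]

# Toy dialogue: two context turns of two words each, two-dimensional embeddings.
context = [
    [[1.0, 0.0], [0.0, 1.0]],   # turn 1
    [[1.0, 1.0], [0.5, 0.0]],   # turn 2
]
response = [[1.0, 0.0], [0.0, 2.0]]

# volume has shape (turns, utterance length, response length).
volume = build_volume(context, response)

# A hypothetical kernel spanning two turns, so each output mixes adjacent turns.
kernel = [[[1.0]], [[1.0]]]
features = conv3d_valid(volume, kernel)

# Pool over the spatial axes to obtain a compact matching feature.
pooled = max_pool_3d(features, (1, 2, 2))
```

In this sketch the time axis is simply the turn index, so a kernel that spans more than one turn can respond to matching patterns that persist across consecutive utterances, which is the intuition behind treating the stacked attention images as a spatio-temporal volume.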