Abstract
Self-attention networks (SAN) have attracted considerable interest due to their high degree of parallelization and strong performance on a variety of NLP tasks, e.g., machine translation. Due to the lack of a recurrence structure such as that in recurrent neural networks (RNN), SAN is assumed to be weak at learning the positional information of words for sequence modeling. However, this speculation has neither been empirically confirmed, nor has an explanation been explored for SAN's strong performance on machine translation despite this supposed lack of positional information. To this end, we propose a novel word reordering detection task to quantify how well SAN and RNN learn word order information. Specifically, we randomly move
one word to another position, and examine
whether a trained model can detect both the
original and inserted positions. Experimental
results reveal that: 1) SAN trained on word reordering detection indeed has difficulty learning positional information, even with position embeddings; and 2) SAN trained on machine translation learns better positional information than its RNN counterpart, in which position embeddings play a critical role. Although the recurrence structure makes a model more universally effective at learning word order, learning objectives matter more for downstream tasks such as machine translation.
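
To make the reordering detection setup described above concrete, the following is a minimal sketch of how one perturbed training instance could be constructed. The function name make_wrd_example and the exact position-labeling convention are illustrative assumptions, not details taken from the paper.

```python
import random

def make_wrd_example(tokens, rng=random):
    """Build one word reordering detection (WRD) example: randomly pick a
    word, move it to a different position, and return the perturbed sentence
    together with the two positions a model should detect (where the word was
    inserted and where it originally came from)."""
    assert len(tokens) >= 2, "need at least two tokens to move a word"
    src = rng.randrange(len(tokens))           # original index of the moved word
    word = tokens[src]
    rest = tokens[:src] + tokens[src + 1:]     # sentence with the word removed
    # pick an insertion slot that does not put the word back where it was
    tgt = rng.randrange(len(rest) + 1)
    while tgt == src:
        tgt = rng.randrange(len(rest) + 1)
    perturbed = rest[:tgt] + [word] + rest[tgt:]
    # (perturbed tokens, inserted position, original position);
    # the labeling convention here is an assumption for illustration only
    return perturbed, tgt, src

if __name__ == "__main__":
    sent = "the quick brown fox jumps over the lazy dog".split()
    shuffled, inserted_pos, original_pos = make_wrd_example(sent, random.Random(0))
    print(" ".join(shuffled))
    print("inserted at:", inserted_pos, "moved from:", original_pos)
```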