Abstract
Self-attention networks have attracted increasing research attention. By default, the hidden states of each word are calculated hierarchically by attending to all words in the sentence, which assembles global information.
However, several studies have pointed out that taking all signals into account may lead to overlooking neighboring information (e.g., phrase patterns). To address this issue, we propose a hybrid attention mechanism that dynamically leverages both local and global information. Specifically, our approach uses a gating scalar to integrate the two sources of information, which also makes it convenient to quantify their respective contributions. Experiments
on various neural machine translation tasks
demonstrate the effectiveness of the proposed
method. Extensive analyses verify that the two types of context are complementary to each other, and that our method integrates them effectively.
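As a rough illustration of the scalar-gated integration described above, the following minimal PyTorch sketch fuses a local and a global context representation with a learned sigmoid gate. All names, shapes, and the specific gating formulation are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GatedContextFusion(nn.Module):
    """Hypothetical layer: fuse local and global context via a gating scalar."""
    def __init__(self, d_model):
        super().__init__()
        # Projects the concatenated contexts to one gating scalar per position.
        self.gate_proj = nn.Linear(2 * d_model, 1)

    def forward(self, local_ctx, global_ctx):
        # local_ctx, global_ctx: [batch, seq_len, d_model]
        gate = torch.sigmoid(self.gate_proj(torch.cat([local_ctx, global_ctx], dim=-1)))
        # Convex combination; the gate value quantifies each context's contribution.
        return gate * local_ctx + (1.0 - gate) * global_ctx

# Usage: fuse the outputs of a local (e.g., windowed) and a global attention branch.
fusion = GatedContextFusion(d_model=512)
local_out = torch.randn(2, 10, 512)
global_out = torch.randn(2, 10, 512)
fused = fusion(local_out, global_out)  # [2, 10, 512]
```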