Abstract
Recently, the Transformer machine translation system has shown strong results by stacking attention layers on both the source- and target-language sides. However, inference with this model is slow due to the heavy use of dot-product attention in auto-regressive decoding. In this paper we speed up the Transformer via a fast and lightweight attention model. More specifically, we share attention weights across adjacent layers and enable the efficient vertical re-use of hidden states. Moreover, the sharing policy can be jointly learned with the MT model. We test our approach on ten WMT and NIST OpenMT tasks. Experimental results show that it yields an average speed-up of 1.3X (with almost no decrease in BLEU) on top of a state-of-the-art implementation that already uses a cache for fast inference. Our approach also obtains a 1.8X speed-up when combined with the AAN model, which is 16 times faster than the baseline that does not use an attention cache.
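To make the layer-wise sharing concrete, the sketch below shows one way attention weights computed in a lower layer could be re-used by the layer above it, skipping the QK^T product and the softmax that dominate auto-regressive decoding cost. The function name, tensor shapes, and calling convention are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, shared_weights=None):
    """Scaled dot-product attention (hypothetical sketch).

    If `shared_weights` (the attention distribution computed by an
    adjacent layer) is given, the QK^T product and the softmax are
    skipped and the cached distribution is applied to `v` directly.
    """
    if shared_weights is None:
        d = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5
        shared_weights = F.softmax(scores, dim=-1)
    return torch.matmul(shared_weights, v), shared_weights

# Toy usage: layer i computes the weights; layer i+1 re-uses them.
q1, k1, v1 = (torch.randn(2, 5, 64) for _ in range(3))  # (batch, len, dim)
out1, w = attention(q1, k1, v1)          # full attention in layer i
v2 = torch.randn(2, 5, 64)
out2, _ = attention(None, None, v2, shared_weights=w)  # layer i+1 re-uses w
```

In this sketch the re-using layer only pays for one matrix product with its value projection, which is consistent with the abstract's claim that sharing weights across adjacent layers reduces the dot-product attention cost at inference time.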