Abstract
The Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires an enormous amount of computation to achieve high performance, which makes it unsuitable for real-world mobile applications, since mobile phones are tightly constrained by hardware resources and battery capacity. In this paper, we investigate the mobile setting (under 500M Mult-Adds) for NLP tasks to facilitate deployment on edge devices. We present Long-Short Range Attention (LSRA), in which some heads specialize in local context modeling (by convolution) while the others capture long-distance relationships (by attention). Based on this primitive, we design Mobile Transformer (MBT), tailored for mobile NLP applications. MBT demonstrates consistent improvement over the Transformer on three well-established language tasks: IWSLT 2014 German-English, WMT 2014 English-German, and WMT 2014 English-French. It outperforms the Transformer by 0.9 BLEU under 500M Mult-Adds and by 1.2 BLEU under 100M Mult-Adds on WMT'14 English-German. On WMT'14 English-French, MBT reduces the computation of the Transformer by 2.5× with negligible BLEU degradation. Without the costly architecture search that requires more than 250 GPU years, MBT achieves 0.5 higher BLEU than the AutoML-based Evolved Transformer under the mobile setting.
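The LSRA idea of dedicating part of the representation to a global attention branch and part to a local convolution branch can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: the channel split, single-head attention, kernel shapes, and all function and parameter names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lsra_block(x, w_q, w_k, w_v, conv_kernel):
    """Sketch of Long-Short Range Attention (hypothetical shapes).

    x           : (n, d) sequence of n tokens with d channels.
    w_q/w_k/w_v : (d//2, d//2) projections for the attention branch.
    conv_kernel : (ksz, d//2) depthwise kernel for the convolution branch.
    """
    n, d = x.shape
    h = d // 2
    # Split channels: one half for global attention, one half for local conv.
    x_attn, x_conv = x[:, :h], x[:, h:]

    # Global branch: single-head scaled dot-product attention
    # captures long-distance relationships across the whole sequence.
    q, k, v = x_attn @ w_q, x_attn @ w_k, x_attn @ w_v
    attn_out = softmax(q @ k.T / np.sqrt(h)) @ v

    # Local branch: depthwise 1-D convolution over the sequence
    # models the local context within a small window.
    ksz = conv_kernel.shape[0]
    pad = ksz // 2
    xp = np.pad(x_conv, ((pad, pad), (0, 0)))
    conv_out = np.stack(
        [np.sum(xp[i:i + ksz] * conv_kernel, axis=0) for i in range(n)]
    )

    # Concatenate the two specialized branches back into d channels.
    return np.concatenate([attn_out, conv_out], axis=1)
```

The design intuition is that attention need not spend capacity on nearby tokens, which a cheap convolution handles well; the attention branch is then free to focus on long-range dependencies.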