Abstract
While very deep neural networks have proven effective for computer vision and text classification applications, how to increase the network depth of neural machine translation (NMT) models for better translation quality remains a challenging problem. Directly stacking more blocks onto an NMT model yields no improvement and can even degrade performance. In this work, we propose an effective two-stage approach with three specially designed components to construct deeper NMT models, which result in significant improvements over the strong Transformer baselines on the WMT14 English→German and English→French translation tasks.