Abstract
Previous work on end-to-end translation from
speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We
show that a naïve method to create compressed
phoneme-like speech representations is far
more effective and efficient for translation than
traditional frame-level speech features. Specifically, we generate phoneme labels for speech
frames and average consecutive frames with
the same label to create shorter, higher-level
source sequences for translation. We see improvements of up to 5 BLEU on both our high-resource
and low-resource language pairs, with a 60% reduction in training time. Our improvements hold across multiple data sizes and two
language pairs.
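
As an illustration of the reduction step described above, the following is a minimal sketch, not the authors' implementation. It assumes that each frame is a fixed-length feature vector and that a phoneme label is already available for every frame. The function name average_by_phoneme and the toy features and labels are hypothetical.

```python
from itertools import groupby


def average_by_phoneme(frames, labels):
    """Average each run of consecutive frames that share a phoneme label.

    frames: list of equal-length feature vectors (lists of floats).
    labels: one phoneme label per frame (assumed to come from an
            external frame-level labeller, which is not shown here).
    Returns a list of (label, mean_vector) pairs, one per run.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    reduced = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segment = frames[start:start + length]
        # Element-wise mean over the frames in this run.
        mean = [sum(dim) / length for dim in zip(*segment)]
        reduced.append((label, mean))
        start += length
    return reduced


# Toy input: five 2-dimensional frames with hypothetical labels.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
labels = ["AH", "AH", "T", "T", "AH"]
reduced = average_by_phoneme(frames, labels)
```

Only adjacent frames with the same label are merged, so the two separate "AH" runs in the toy input stay separate and the five frames reduce to three vectors.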