Abstract
A widespread approach to processing spoken language is to first automatically transcribe it into text. An alternative is an end-to-end approach: recent work has proposed learning semantic embeddings of spoken language from images with spoken captions, without an intermediate transcription step. We propose to use multitask learning to exploit existing transcribed speech within this end-to-end setting. We describe a three-task architecture that combines the objectives of matching spoken captions with corresponding images, speech with text, and text with images. We show that adding the SPEECH/TEXT task leads to substantial performance improvements on image retrieval compared to training the SPEECH/IMAGE task in isolation. We conjecture that this is due to the strong inductive bias that transcribed speech provides to the model, and we offer supporting evidence for this conjecture.
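As a rough illustration of how the three matching objectives might be combined, consider the following sketch, which sums a margin-based contrastive loss over each modality pair. This is an assumption-laden sketch, not the paper's exact formulation: the encoder outputs, margin value, equal task weighting, and embedding dimensionality below are all hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a, b, margin=0.2):
    """Margin-based ranking loss with in-batch negatives.

    a, b: (batch, dim) embeddings of matching pairs (row i of a
    matches row i of b); all other rows serve as negatives.
    The margin value 0.2 is an assumption, not the paper's setting.
    """
    a = F.normalize(a, dim=1)
    b = F.normalize(b, dim=1)
    scores = a @ b.t()                 # pairwise cosine similarities
    pos = scores.diag().unsqueeze(1)   # matching-pair similarities
    # Hinge loss in both retrieval directions (a->b and b->a).
    cost_a = (margin + scores - pos).clamp(min=0)
    cost_b = (margin + scores - pos.t()).clamp(min=0)
    # Mask out the diagonal (a pair is not its own negative).
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_a = cost_a.masked_fill(mask, 0)
    cost_b = cost_b.masked_fill(mask, 0)
    return cost_a.sum() + cost_b.sum()

# Stand-ins for the outputs of hypothetical speech, text, and
# image encoders mapping each modality into a shared space.
speech_emb = torch.randn(32, 512)
text_emb = torch.randn(32, 512)
image_emb = torch.randn(32, 512)

# Three-task objective: SPEECH/IMAGE + SPEECH/TEXT + TEXT/IMAGE,
# here weighted equally (the actual weighting is an assumption).
loss = (contrastive_loss(speech_emb, image_emb)
        + contrastive_loss(speech_emb, text_emb)
        + contrastive_loss(text_emb, image_emb))
```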