Rafael Valle*, Jason Li*, Ryan Prenger and Bryan Catanzaro
In our recent paper, we propose Mellotron: a multispeaker voice synthesis model
based on Tacotron 2 GST that can make a voice emote and sing without any emotive
or singing training data.
By explicitly conditioning on rhythm and continuous pitch
contours extracted from an audio signal or a music score, Mellotron is able to
generate speech in a variety of styles, ranging from read speech to expressive
speech, from slow drawls to rap, and from a monotonous voice to a singing voice.
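To give a concrete sense of the continuous pitch contour used as a conditioning signal, here is a minimal, illustrative sketch that extracts a frame-level f0 contour from a waveform with a toy autocorrelation estimator. This is not Mellotron's actual pitch extractor (the model relies on a more robust estimator such as Yin); the function name, frame sizes, and pitch range below are illustrative assumptions.

```python
import numpy as np

def estimate_f0_contour(signal, sr, frame_len=1024, hop=256,
                        fmin=80.0, fmax=400.0):
    """Coarse per-frame f0 estimate via autocorrelation (toy sketch).

    Returns one fundamental-frequency value (Hz) per hop-spaced frame,
    i.e. the kind of continuous pitch contour a model can condition on.
    """
    # Search only lags corresponding to plausible pitch periods.
    lo, hi = int(sr / fmax), int(sr / fmin)
    f0 = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        frame = frame - frame.mean()
        # Autocorrelation of the frame; keep non-negative lags only.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        # The lag of the strongest peak in range approximates the period.
        lag = lo + np.argmax(ac[lo:hi])
        f0.append(sr / lag)
    return np.array(f0)

# Synthetic 220 Hz tone: the estimated contour should hover near 220 Hz.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
contour = estimate_f0_contour(tone, sr)
```

In the actual model, a contour like this (taken from a reference recording or derived from a music score) is fed alongside rhythm information, which is what lets a voice trained only on read speech follow a sung melody.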