The LJ Speech Dataset has recently become a popular benchmark for TTS because
it is publicly available and provides 24 hours of reasonable-quality samples.
Nick's and Kate's audiobooks are additionally used to check whether the model
can learn even from smaller, more variable speech data. They are 18 hours and
5 hours long, respectively. Finally, the KSS Dataset is a Korean
single-speaker speech dataset of more than 12 hours.
The paper doesn't mention normalization, but without it I couldn't get the model to train, so I added layer normalization.
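For reference, this is what layer normalization computes: each sample is normalized over its feature axis, with optional learned scale and shift. This is a minimal NumPy sketch of the standard operation, not the code used in this repo.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Normalize each sample over its last (feature) axis,
    then apply a scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# After normalization, each row has mean ~0 and std ~1,
# regardless of the input's scale.
x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
y = layer_norm(x)
```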
The paper fixes the learning rate at 0.001, but that didn't work for me, so I decayed it over training.
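One common way to decay a learning rate is exponential step decay; the sketch below is only an illustration of that idea, and the `decay_rate` and `decay_steps` values are arbitrary, not the ones I used.

```python
def decayed_lr(init_lr, step, decay_rate=0.5, decay_steps=100000):
    """Exponential decay: multiply the LR by decay_rate
    every decay_steps training steps (smoothly, per step)."""
    return init_lr * decay_rate ** (step / decay_steps)

# Starts at 0.001, halves every 100k steps.
lr0 = decayed_lr(0.001, 0)        # 0.001
lr1 = decayed_lr(0.001, 100000)   # 0.0005
```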
I tried to train Text2Mel and SSRN simultaneously, but it didn't work. I
suspect training the two networks separately eases the burden on each.
The authors claim the model can be trained within a day, but unfortunately
that luck was not mine. Still, it is clearly much faster than Tacotron, as it
uses only convolutional layers.
Thanks to the guided attention, the attention plot looks monotonic almost
from the beginning. It seems to keep the alignment tight so it doesn't lose
track.
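Guided attention works by penalizing attention weights that lie far from the diagonal: the paper's penalty matrix is W[n, t] = 1 - exp(-(n/N - t/T)^2 / (2g^2)), and the extra loss is the mean of the elementwise product of the attention matrix with W. A small sketch of that penalty matrix:

```python
import numpy as np

def guided_attention_weights(N, T, g=0.2):
    """Penalty matrix for guided attention: near zero along the
    diagonal n/N ~ t/T, approaching one far away from it."""
    n = np.arange(N)[:, None] / N   # text positions, normalized
    t = np.arange(T)[None, :] / T   # mel frames, normalized
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

# The guided attention loss is then mean(A * W) for an
# attention matrix A of shape (N, T); off-diagonal attention
# is penalized, diagonal (monotonic) attention is nearly free.
W = guided_attention_weights(50, 60)
```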
The paper doesn't mention dropout. I applied it, as I believe it helps with regularization.