Rafael Valle*, Jason Li*, Ryan Prenger and Bryan Catanzaro
In our recent paper, we propose Mellotron: a multispeaker voice synthesis model
based on Tacotron 2 GST that can make a voice emote and sing without any emotive
or singing training data.
By explicitly conditioning on rhythm and continuous pitch
contours extracted from an audio signal or a music score, Mellotron is able to
generate speech in a variety of styles, ranging from read speech to expressive
speech, from slow drawls to rap, and from a monotonous voice to a singing voice.
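To give a concrete sense of the continuous pitch contour used as a conditioning signal, here is a minimal, illustrative sketch that extracts a frame-level f0 contour from a waveform with a toy autocorrelation estimator. This is not Mellotron's actual pitch extractor (the model relies on a more robust estimator such as Yin); the function name, frame sizes, and pitch range below are illustrative assumptions.

```python
import numpy as np

def estimate_f0_contour(signal, sr, frame_len=1024, hop=256,
                        fmin=80.0, fmax=400.0):
    """Coarse per-frame f0 estimate via autocorrelation (toy sketch).

    Returns one fundamental-frequency value (Hz) per hop-spaced frame,
    i.e. the kind of continuous pitch contour a model can condition on.
    """
    # Search only lags corresponding to plausible pitch periods.
    lo, hi = int(sr / fmax), int(sr / fmin)
    f0 = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        frame = frame - frame.mean()
        # Autocorrelation of the frame; keep non-negative lags only.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        # The lag of the strongest peak in range approximates the period.
        lag = lo + np.argmax(ac[lo:hi])
        f0.append(sr / lag)
    return np.array(f0)

# Synthetic 220 Hz tone: the estimated contour should hover near 220 Hz.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
contour = estimate_f0_contour(tone, sr)
```

In the actual model, a contour like this (taken from a reference recording or derived from a music score) is fed alongside rhythm information, which is what lets a voice trained only on read speech follow a sung melody.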