waveglow-vqvae
Tensorflow implementation of WaveGlow: A Flow-based Generative Network for Speech Synthesis and Neural Discrete Representation Learning.
This implementation includes multi-GPU and mixed-precision (still unstable) support. It is heavily based on existing GitHub repositories, notably waveglow. The data used here are the LJSpeech dataset and the VCTK Corpus.
You can choose the local condition to be either a mel spectrogram or vector-quantized representations, and also choose whether to use speaker identity as a global condition. As additional options, Polyak averaging, FiLM, and weight normalization are implemented.
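To illustrate the FiLM option, here is a minimal sketch of a FiLM (feature-wise linear modulation) layer in TF 1.x: a conditioning vector (e.g., a speaker embedding) predicts per-channel scales and shifts for a feature map. The function and variable names are illustrative, not this repository's actual API.

```python
import tensorflow as tf

def film(x, cond, scope="film"):
    # x: feature map [batch, time, channels]; cond: condition [batch, cond_dim].
    with tf.variable_scope(scope):
        channels = x.get_shape().as_list()[-1]
        # Predict a per-channel scale (gamma) and shift (beta) from the condition.
        gamma = tf.layers.dense(cond, channels, name="gamma")
        beta = tf.layers.dense(cond, channels, name="beta")
        # Broadcast over the time axis and modulate the features.
        return gamma[:, tf.newaxis, :] * x + beta[:, tf.newaxis, :]
```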
Mel spectrogram condition (original WaveGlow): https://drive.google.com/open?id=1HuV51fnhEZG_6vGubXVrer6lAtZK7py9
VQVAE condition: https://drive.google.com/open?id=1xcGSelMycn2g-72noZH4vPiPpG0d7pZq
It does not work well for now :(
Source (360): https://drive.google.com/open?id=1CfEvnQS_dVYRhsvj8NDqogOJlzK7npTd
Target (303): https://drive.google.com/open?id=1-kcSglimKgJrRjLDfPbD7s5KxZuFRY-i
I slightly modify the original VQVAE optimization technique to increase robustness to hyperparameter choices and the diversity of latent code usage, avoiding index collapse. That is,
the original technique consists of 1) finding the nearest latent codes given encoded vectors and 2) updating the latent codes according to the matching encoded vectors.
I modify these to 1) finding a distribution over latent codes given encoded vectors and 2) updating the latent codes to increase the likelihood under the distribution of matching encoded vectors.
By replacing EMA with gradient descent, the method can give additional gradient signals to the latent codes that reduce the reconstruction loss (which is impossible in the EMA setting).
It closely resembles the soft-EM method; the difference is that the closed-form maximization step is replaced with gradient descent. For more information, please see em_toy.ipynb or contact me (jaywalnut310@gmail.com).
As I haven't investigated this method thoroughly, I cannot say it is better than previous methods in every case. But I found that this method works pretty well in all of my experimental settings (no index collapse).
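To give a flavor of the method, here is a minimal sketch in TF 1.x, treating the codebook as the means of a Gaussian mixture with a uniform prior and unit variance. The names and details (e.g., soft quantization of the output) are illustrative assumptions rather than the exact implementation; see em_toy.ipynb for the precise formulation.

```python
import tensorflow as tf

def soft_em_vq(z_e, codebook):
    # z_e: encoder outputs [batch, dim]; codebook: [num_codes, dim].
    # Squared distance from each encoded vector to each code.
    dist = tf.reduce_sum(
        tf.square(z_e[:, tf.newaxis, :] - codebook[tf.newaxis, :, :]), axis=-1)
    # 1) E-step: a distribution over codes for each encoded vector
    #    (posterior under a uniform prior and unit-variance Gaussians).
    posterior = tf.nn.softmax(-0.5 * dist, axis=-1)
    # 2) M-step replacement: instead of a closed-form update or EMA, descend
    #    on the negative log-likelihood so the codebook receives gradients.
    nll = -tf.reduce_mean(tf.reduce_logsumexp(-0.5 * dist, axis=-1))
    # Soft quantization: the expected code lets reconstruction-loss gradients
    # also reach the codebook, unlike the EMA update.
    z_q = tf.matmul(posterior, codebook)  # [batch, dim]
    return z_q, nll
```

Adding nll (with some weight) to the total loss then plays the role of the codebook update; because every code receives some posterior mass, no code is starved of updates, which is what discourages index collapse.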
Tensorflow 1.12 (1.13 would work with some deprecation warnings)
(If fp16 training is needed) Volta GPUs
```
# 1. Create dataset folder
mkdir datasets
cd datasets

# 2. Download and extract datasets
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -jxvf LJSpeech-1.1.tar.bz2

# Additionally, download VCTK Corpus
wget http://homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz
tar -zxvf VCTK-Corpus.tar.gz
cd ../filelists
python resample_vctk.py  # Change sample rate

# 3. Create TFRecords
python generate_data.py

# Additionally, create VCTK TFRecords
python generate_data.py -c tfr_dir=datasets/vctk tfr_prefix=vctk train_files=filelists/vctk_sid_audio_text_train_filelist.txt eval_files=filelists/vctk_sid_audio_text_eval_filelist.txt
```
```
# 1. Create log directory
mkdir ~/your-log-dir

# 2. (Optional) Copy configs
cp ./config.yml ~/your-log-dir

# 3. Run training
python train.py -m ~/your-log-dir
```
If you want to change hparams, you can do it in one of two ways:
modify config.yml
add arguments as below:
python train.py -m ~/your-log-dir --c hidden_size=512 num_heads=8
Example configs:
fp32 training: python train.py -m ~/your-log-dir --c ftype=float32 loss_scale=1
mel condition: python train.py -m ~/your-log-dir --c local_condition=mel use_vq=false
remove FiLM layers: python train.py -m ~/your-log-dir --c use_film=false
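The loss_scale hparam above corresponds to standard static loss scaling for fp16 training. Here is a minimal sketch of the idea in TF 1.x, assuming dense gradients; this is not the repository's actual training loop.

```python
import tensorflow as tf

def scaled_minimize(optimizer, loss, loss_scale=128.0):
    # Scale the loss up so small fp16 gradients do not underflow to zero.
    grads_and_vars = optimizer.compute_gradients(loss * loss_scale)
    # Scale the gradients back down before applying the update.
    unscaled = [(grad / loss_scale, var)
                for grad, var in grads_and_vars if grad is not None]
    return optimizer.apply_gradients(unscaled)
```

With loss_scale=1 (as in the fp32 example above), this reduces to a plain optimizer step.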
Compressed model directories with pretrained weights are available: WILL BE UPLOADED SOON!
You can generate samples with those models in inference.ipynb.
You may have to change tfr_dir and model_dir to work on your settings.
With fp16 settings, it takes about one week to train 1M steps on 4 V100 GPUs.
I haven't tried fp32 training, so there might be some issues in training high-quality models that way.
As fp16 training is not robust enough (for now), I usually train a FiLM-enabled model and a FiLM-disabled model and choose the one that survives (i.e., does not diverge).
For a single-speaker dataset (the LJ Speech dataset), the trained model's vocoding quality is good enough compared to the mel-spectrogram-conditioned one.
For a multi-speaker dataset (the VCTK Corpus), disentangling speaker identity from the local condition does not work well (for now). I am investigating the reasons.
The next step would be to train a text-to-latent-codes model (e.g., a Transformer) so that full TTS is possible.
If you're interested in this project, please improve models with me!