waveglow-vqvae
Tensorflow implementation of WaveGlow: A Flow-based Generative Network for Speech Synthesis and Neural Discrete Representation Learning.
This implementation includes multi-GPU and mixed-precision (still unstable) support. It is heavily based on existing GitHub repositories, notably waveglow. The data used here are the LJSpeech dataset and the VCTK Corpus.
You can choose the local condition to be either a mel spectrogram or vector-quantized representations, and also choose whether to use speaker identity as a global condition. As additional options, Polyak averaging, FiLM, and weight normalization are implemented.
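To illustrate the FiLM option, here is a minimal sketch of a FiLM (feature-wise linear modulation) layer in TF 1.x: a conditioning vector (e.g., a speaker embedding) predicts per-channel scales and shifts for a feature map. The function and variable names are illustrative, not this repository's actual API.

```python
import tensorflow as tf

def film(x, cond, scope="film"):
    # x: feature map [batch, time, channels]; cond: condition [batch, cond_dim].
    with tf.variable_scope(scope):
        channels = x.get_shape().as_list()[-1]
        # Predict a per-channel scale (gamma) and shift (beta) from the condition.
        gamma = tf.layers.dense(cond, channels, name="gamma")
        beta = tf.layers.dense(cond, channels, name="beta")
        # Broadcast over the time axis and modulate the features.
        return gamma[:, tf.newaxis, :] * x + beta[:, tf.newaxis, :]
```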
Mel spectrogram condition (original WaveGlow): https://drive.google.com/open?id=1HuV51fnhEZG_6vGubXVrer6lAtZK7py9
VQVAE condition: https://drive.google.com/open?id=1xcGSelMycn2g-72noZH4vPiPpG0d7pZq
It does not work well for now :(
Source (360): https://drive.google.com/open?id=1CfEvnQS_dVYRhsvj8NDqogOJlzK7npTd
Target (303): https://drive.google.com/open?id=1-kcSglimKgJrRjLDfPbD7s5KxZuFRY-i
I slightly modify the original VQVAE optimization technique to increase robustness to hyperparameter choices and the diversity of latent code usage, avoiding index collapse. That is,
the original technique consists of 1) finding the nearest latent codes given encoded vectors and 2) updating the latent codes according to the matching encoded vectors.
I modify these to 1) finding a distribution over latent codes given encoded vectors and 2) updating the latent codes to increase the likelihood under the distribution of matching encoded vectors.
By replacing EMA with gradient descent, the method can give additional gradient signals to the latent codes that reduce the reconstruction loss (which is impossible in the EMA setting).
It closely resembles the soft-EM method; the difference is that the closed-form maximization step is replaced with gradient descent. For more information, please see em_toy.ipynb or contact me (jaywalnut310@gmail.com).
As I haven't investigated this method thoroughly, I cannot say it is better than previous methods in every case. But I found that this method works pretty well in all of my experimental settings (no index collapse).
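To give a flavor of the method, here is a minimal sketch in TF 1.x, treating the codebook as the means of a Gaussian mixture with a uniform prior and unit variance. The names and details (e.g., soft quantization of the output) are illustrative assumptions rather than the exact implementation; see em_toy.ipynb for the precise formulation.

```python
import tensorflow as tf

def soft_em_vq(z_e, codebook):
    # z_e: encoder outputs [batch, dim]; codebook: [num_codes, dim].
    # Squared distance from each encoded vector to each code.
    dist = tf.reduce_sum(
        tf.square(z_e[:, tf.newaxis, :] - codebook[tf.newaxis, :, :]), axis=-1)
    # 1) E-step: a distribution over codes for each encoded vector
    #    (posterior under a uniform prior and unit-variance Gaussians).
    posterior = tf.nn.softmax(-0.5 * dist, axis=-1)
    # 2) M-step replacement: instead of a closed-form update or EMA, descend
    #    on the negative log-likelihood so the codebook receives gradients.
    nll = -tf.reduce_mean(tf.reduce_logsumexp(-0.5 * dist, axis=-1))
    # Soft quantization: the expected code lets reconstruction-loss gradients
    # also reach the codebook, unlike the EMA update.
    z_q = tf.matmul(posterior, codebook)  # [batch, dim]
    return z_q, nll
```

Adding nll (with some weight) to the total loss then plays the role of the codebook update; because every code receives some posterior mass, no code is starved of updates, which is what discourages index collapse.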
Tensorflow 1.12 (1.13 would work with some deprecation warnings)
(If fp16 training is needed) Volta GPUs
```
# 1. Create dataset folder
mkdir datasets
cd datasets

# 2. Download and extract datasets
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -jxvf LJSpeech-1.1.tar.bz2

# Additionally, download VCTK Corpus
wget http://homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz
tar -zxvf VCTK-Corpus.tar.gz
cd ../filelists
python resample_vctk.py  # Change sample rate

# 3. Create TFRecords
python generate_data.py

# Additionally, create VCTK TFRecords
python generate_data.py -c tfr_dir=datasets/vctk tfr_prefix=vctk train_files=filelists/vctk_sid_audio_text_train_filelist.txt eval_files=filelists/vctk_sid_audio_text_eval_filelist.txt
```
```
# 1. Create log directory
mkdir ~/your-log-dir

# 2. (Optional) Copy configs
cp ./config.yml ~/your-log-dir

# 3. Run training
python train.py -m ~/your-log-dir
```
If you want to change hparams, you can do it in one of two ways:
modify config.yml
add arguments as below:
python train.py -m ~/your-log-dir --c hidden_size=512 num_heads=8
Example configs:
fp32 training: python train.py -m ~/your-log-dir --c ftype=float32 loss_scale=1
mel condition: python train.py -m ~/your-log-dir --c local_condition=mel use_vq=false
remove FiLM layers: python train.py -m ~/your-log-dir --c use_film=false
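The loss_scale hparam above corresponds to standard static loss scaling for fp16 training. Here is a minimal sketch of the idea in TF 1.x, assuming dense gradients; this is not the repository's actual training loop.

```python
import tensorflow as tf

def scaled_minimize(optimizer, loss, loss_scale=128.0):
    # Scale the loss up so small fp16 gradients do not underflow to zero.
    grads_and_vars = optimizer.compute_gradients(loss * loss_scale)
    # Scale the gradients back down before applying the update.
    unscaled = [(grad / loss_scale, var)
                for grad, var in grads_and_vars if grad is not None]
    return optimizer.apply_gradients(unscaled)
```

With loss_scale=1 (as in the fp32 example above), this reduces to a plain optimizer step.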
Compressed model directories with pretrained weights are available: WILL BE UPLOADED SOON!
You can generate samples with those models in inference.ipynb.
You may have to change tfr_dir and model_dir to work on your settings.
With fp16 settings, it takes about one week to train 1M steps on 4 V100 GPUs.
I haven't tried fp32 training, so there might be some issues in training high-quality models that way.
As fp16 training is not robust enough (for now), I usually train a FiLM-enabled model and a FiLM-disabled model and choose the one that survives (i.e., does not diverge).
For a single-speaker dataset (the LJ Speech dataset), the trained model's vocoding quality is good enough compared to the mel-spectrogram-conditioned one.
For a multi-speaker dataset (the VCTK Corpus), disentangling speaker identity from the local condition does not work well (for now). I am investigating the reasons.
The next step would be to train a text-to-latent-codes model (e.g., a Transformer) so that full TTS is possible.
If you're interested in this project, please improve models with me!