
VoiceFilter

Unofficial PyTorch implementation of Google AI's VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.

Result

  • Training took about 20 hours on an AWS p3.2xlarge (NVIDIA V100).

Audio Sample

  • Listen to audio sample at webpage: http://swpark.me/voicefilter/

Metric

| Median SDR         | Paper | Ours |
| ------------------ | ----- | ---- |
| before VoiceFilter | 2.5   | 1.9  |
| after VoiceFilter  | 12.6  | 10.2 |

  • SDR converged around 10, which is slightly lower than the paper's result.
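    To reproduce the median-SDR numbers above, one option is mir_eval. Below is a minimal sketch, assuming mir_eval is installed and paired (clean, estimated) waveforms of equal length are available; it is not necessarily how this repository computes the metric:

    import numpy as np
    from mir_eval.separation import bss_eval_sources

    def median_sdr(pairs):
        # pairs: iterable of (clean_wav, estimated_wav) 1-D numpy arrays
        scores = []
        for ref, est in pairs:
            # bss_eval_sources expects arrays shaped (n_sources, n_samples)
            sdr, _, _, _ = bss_eval_sources(ref[np.newaxis, :], est[np.newaxis, :])
            scores.append(sdr[0])
        return np.median(scores)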

Dependencies

  1. Python and packages

    This code was tested on Python 3.6 with PyTorch 1.0.1. Other packages can be installed with:

    pip install -r requirements.txt
  2. Miscellaneous

    ffmpeg-normalize is used for resampling and normalizing wav files. See the README.md of ffmpeg-normalize for installation instructions.

Prepare Dataset

  1. Download LibriSpeech dataset

    To replicate the VoiceFilter paper, get the LibriSpeech dataset at http://www.openslr.org/12/. train-clean-100.tar.gz (6.3 GB) contains speech from 252 speakers, and train-clean-360.tar.gz (23 GB) contains 922 speakers. You may use either, but the more speakers the dataset has, the better VoiceFilter will perform.

  2. Resample & Normalize wav files

    First, unzip the tar.gz file to the desired folder:

    tar -xvzf train-clean-360.tar.gz

    Next, copy utils/normalize-resample.sh to the root directory of the unzipped data folder. Then:

    vim normalize-resample.sh # set "N" as your CPU core number
    chmod a+x normalize-resample.sh
    ./normalize-resample.sh # this may take long
  3. Edit config.yaml

    cd config
    cp default.yaml config.yaml
    vim config.yaml
  4. Preprocess wav files

    To boost training speed, perform an STFT for each file before training:

    python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]

    This will create 100,000 (train) + 1,000 (test) examples (about 160 GB).
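    As a rough illustration of what this caching achieves, here is a minimal sketch assuming librosa; the STFT parameters shown are illustrative placeholders, and the actual values come from config.yaml:

    import numpy as np
    import librosa

    def cache_stft(wav_path, out_path, sr=16000, n_fft=1200, hop=160, win=400):
        # Compute the STFT once and store the magnitude on disk, so the
        # training loop can read spectrograms instead of recomputing them.
        y, _ = librosa.load(wav_path, sr=sr)
        spec = librosa.stft(y, n_fft=n_fft, hop_length=hop, win_length=win)
        np.save(out_path, np.abs(spec).astype(np.float32))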

Train VoiceFilter

  1. Get a pretrained model for the speaker recognition system

    VoiceFilter utilizes a speaker recognition system (d-vector embeddings). Here, we provide a pretrained model for obtaining d-vector embeddings.

    This model was trained on the VoxCeleb2 dataset, with utterances randomly cropped to lengths of [70, 90] frames. Tests were done with window 80 / hop 40 and showed an equal error rate of about 1%. Test data were selected from the first 8 speakers of the VoxCeleb1 test set, with 10 utterances randomly chosen per speaker.

    The model can be downloaded at this GDrive link.
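    For reference, here is a sketch of loading the embedder and computing a d-vector. The module paths, class name, and constructor are assumptions about this repository's layout; check model/embedder.py and utils/hparams.py before relying on them:

    import torch
    from utils.hparams import HParam           # assumption: repo's config loader
    from model.embedder import SpeechEmbedder  # assumption: repo's embedder class

    hp = HParam("config/config.yaml")
    embedder = SpeechEmbedder(hp)
    embedder.load_state_dict(torch.load("embedder.pt", map_location="cpu"))
    embedder.eval()

    ref_mel = torch.randn(40, 80)  # dummy stand-in for a reference mel spectrogram
    with torch.no_grad():
        dvec = embedder(ref_mel)   # fixed-size speaker embedding (d-vector)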

  2. Run

    After specifying train_dir and test_dir in config.yaml, run:

    python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]

    This will create chkpt/name and logs/name under the base directory (-b option, . by default).

  3. View tensorboardX

    tensorboard --logdir ./logs

  4. Resuming from checkpoint

    python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m name

Evaluate

python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -o [output directory]
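Conceptually, inference is speaker-conditioned spectrogram masking: predict a soft magnitude mask from the mixture and the target speaker's d-vector, apply the mask, and invert the STFT using the mixture's phase. A hedged sketch follows; tensor layouts and names are illustrative and may differ from inference.py:

import numpy as np
import torch
import librosa

def separate(model, embedder, mixed_path, ref_mel,
             sr=16000, n_fft=1200, hop=160, win=400):
    # The network only estimates magnitudes, so the mixture's phase
    # is reused for waveform reconstruction.
    y, _ = librosa.load(mixed_path, sr=sr)
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop, win_length=win)
    mag = torch.from_numpy(np.abs(spec)).float().unsqueeze(0)  # [1, F, T]

    with torch.no_grad():
        dvec = embedder(ref_mel)              # target-speaker d-vector
        mask = model(mag, dvec.unsqueeze(0))  # soft mask, same shape as mag
        est_mag = (mag * mask).squeeze(0).numpy()

    return librosa.istft(est_mag * np.exp(1j * np.angle(spec)),
                         hop_length=hop, win_length=win)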

Possible improvements

These are some of my personal opinions on possible improvements. If you have other ideas, don't hesitate to open an issue.

  • Masks performed poorly on high-frequency channels.

    • Training embedder system with linear-scale spectrogram instead of mel might improve this.

  • Replace zero-padding with partial convolution.

  • Try power-law compressed reconstruction error as the loss function, instead of MSE (see the sketch after this list).

    • Tried power=0.3, but failed.
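For concreteness, here is a minimal sketch of such a loss, assuming it is applied to magnitude spectrograms; power=0.3 matches the attempt above, and this is not code from the repository:

import torch

def power_law_mse(est_mag, ref_mag, power=0.3, eps=1e-8):
    # Compress magnitudes as |S|**power before the MSE, weighting
    # low-energy time-frequency bins more heavily than plain MSE does.
    est_c = (est_mag.clamp(min=0) + eps) ** power
    ref_c = (ref_mag.clamp(min=0) + eps) ** power
    return torch.mean((est_c - ref_c) ** 2)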

Author

Seungwon Park at MINDsLab (yyyyy@snu.ac.kr, swpark@mindslab.ai)

License

Apache License 2.0

This repository contains code adapted/copied from the following:

  • utils/adabound.py from https://github.com/Luolc/AdaBound (Apache License 2.0)
  • utils/audio.py from https://github.com/keithito/tacotron (MIT License)
  • utils/hparams.py from https://github.com/HarryVolek/PyTorch_Speaker_Verification (no license specified)
  • utils/normalize-resample.sh from https://unix.stackexchange.com/a/216475
