Speech in NIPS 2019/2020 - Tsinghua University
![Page 1: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/1.jpg)
Speech in NIPS 2019/2020
Lantian Li
2020-12-21
![Page 2: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/2.jpg)
Untangling in Invariant Speech Recognition
• How information is untangled within DNNs trained to recognize speech.
• Defines several metrics (such as manifold capacity) that connect the geometric properties of network representations to the separability of classes.
• A theory-driven geometric analysis of representation untangling in tasks.
• Models: CNN, Deep Speech 2
• Datasets: WSJ, LibriSpeech
![Page 3: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/3.jpg)
![Page 4: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/4.jpg)
![Page 5: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/5.jpg)
Anchor points (support vectors)
![Page 6: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/6.jpg)
Manifold capacity measures
• Mean-Field Theoretic (MFT) Manifold Capacity
![Page 7: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/7.jpg)
![Page 8: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/8.jpg)
![Page 9: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/9.jpg)
![Page 10: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/10.jpg)
FastSpeech: Fast, Robust and Controllable Text to Speech
• Neural TTS suffers from slow inference speed, lack of robustness (word skipping or repeating) and lack of controllability (speed or prosody).
• Uses a feed-forward Transformer (instead of the conventional encoder-attention-decoder framework) to generate Mel-spectrograms in parallel.
• Using a length regulator to expand the phoneme sequence to match the length of the target Mel-spectrogram sequence.
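The length regulator described above can be illustrated with a minimal NumPy sketch. It assumes per-phoneme durations (in Mel frames) are already available, for example from a duration predictor; the function name `length_regulate`, the `alpha` speed factor and the toy values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def length_regulate(hidden, durations, alpha=1.0):
    """Repeat each phoneme hidden state so the output matches the Mel length.

    hidden:    (num_phonemes, dim) phoneme-level hidden states
    durations: (num_phonemes,) number of Mel frames assigned to each phoneme
    alpha:     hypothetical speed factor; > 1 lengthens (slower speech),
               < 1 shortens (faster speech)
    """
    hidden = np.asarray(hidden)
    # Scale the durations, round to whole frames, and forbid negative counts.
    scaled = np.round(np.asarray(durations, dtype=float) * alpha).astype(int)
    scaled = np.maximum(scaled, 0)
    # Row i of `hidden` is repeated scaled[i] times along the time axis.
    return np.repeat(hidden, scaled, axis=0)

# Toy example: 4 phonemes with 2-dimensional hidden states.
hidden = np.arange(8, dtype=float).reshape(4, 2)
durations = [2, 2, 3, 1]

normal = length_regulate(hidden, durations)           # 2 + 2 + 3 + 1 = 8 frames
slow = length_regulate(hidden, durations, alpha=1.3)  # durations become 3, 3, 4, 1
print(normal.shape, slow.shape)
```

Scaling the durations before expansion is also one simple way to obtain speed control: the same hidden states are stretched or compressed in time without changing the rest of the model.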
![Page 11: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/11.jpg)
FastSpeech
![Page 12: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/12.jpg)
FastSpeech 2
![Page 13: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/13.jpg)
Length regulator
![Page 14: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/14.jpg)
Controllability
![Page 15: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/15.jpg)
Robustness
https://speechresearch.github.io/fastspeech/
![Page 16: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/16.jpg)
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
![Page 17: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/17.jpg)
![Page 18: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/18.jpg)
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
![Page 19: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/19.jpg)
Listening to Sounds of Silence for Speech Denoising
• A silent interval reveals noise characteristics.
• Several silent intervals assemble a time-varying noise distribution.
• Silent Interval Detection, Noise Estimation, Noise Removal
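The three stages listed above can be illustrated with a deliberately simple sketch. The slides do not spell out the models used for each stage, so this example swaps in classical stand-ins: an energy threshold for silent interval detection, an averaged magnitude spectrum for noise estimation, and spectral subtraction for noise removal. The function name, `frame_len` and `energy_quantile` are assumptions chosen for illustration only.

```python
import numpy as np

def denoise_with_silent_intervals(noisy, frame_len=256, energy_quantile=0.2):
    """Toy three-stage pipeline: detect silence, estimate noise, remove noise."""
    noisy = np.asarray(noisy, dtype=float)
    n_frames = len(noisy) // frame_len
    # Non-overlapping frames; trailing samples that do not fill a frame are dropped.
    frames = noisy[: n_frames * frame_len].reshape(n_frames, frame_len)

    # 1) Silent interval detection (stand-in): the lowest-energy frames
    #    are assumed to contain noise only.
    energy = np.mean(frames ** 2, axis=1)
    silent = energy <= np.quantile(energy, energy_quantile)

    # 2) Noise estimation (stand-in): average magnitude spectrum of silent frames.
    spectra = np.fft.rfft(frames, axis=1)
    noise_mag = np.abs(spectra[silent]).mean(axis=0)

    # 3) Noise removal (stand-in): spectral subtraction, keeping the noisy phase.
    clean_mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)
    clean_spec = clean_mag * np.exp(1j * np.angle(spectra))
    clean = np.fft.irfft(clean_spec, n=frame_len, axis=1)
    return clean.reshape(-1), silent

# Synthetic input: noise everywhere, a tone ("speech") only in the second half.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t) * (t >= 0.5)
noisy = tone + 0.05 * rng.standard_normal(t.shape)

clean, silent = denoise_with_silent_intervals(noisy)
print(silent.sum(), "frames flagged as silent out of", silent.size)
```

A single averaged spectrum treats the noise as stationary. To follow a time-varying noise distribution, as the second bullet suggests, the estimate would have to be updated from whichever silent intervals lie nearest in time.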
![Page 20: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/20.jpg)
![Page 21: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/21.jpg)
Loss functions and training
![Page 22: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/22.jpg)
Silent interval supervision
![Page 23: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/23.jpg)
Data construction
• AVSPEECH: audio-video speech
• 2214 videos for training and 234 videos for testing
• DEMAND and Google’s AudioSet
• SNRs range over [-10 dB, 10 dB]
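Constructing noisy training inputs at a chosen SNR can be sketched as follows. This is a generic mixing recipe, not the authors' data pipeline; `mix_at_snr` and the synthetic signals are hypothetical stand-ins for real speech and noise clips.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Loop or crop the noise so it covers the whole speech clip.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the gain so that 10 * log10(p_speech / (gain**2 * p_noise)) == snr_db.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

# Synthetic stand-ins for a speech clip and a shorter noise clip.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220.0 * np.arange(16000) / 16000.0)
noise = rng.standard_normal(5000)

for snr_db in (-10, 0, 10):
    mixture, scaled_noise = mix_at_snr(speech, noise, snr_db)
    measured = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
    print(snr_db, round(float(measured), 3))
```

At -10 dB the noise carries ten times the power of the speech, which is why the low end of this range is the hard case for denoising.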
![Page 24: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/24.jpg)
![Page 25: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/25.jpg)
Performance of silent interval detection (SID)
![Page 26: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/26.jpg)
Ablation studies
![Page 27: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/27.jpg)
Comparison with SOTA
![Page 28: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/28.jpg)
Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
• Exploring wav2vec 2.0 on speaker verification and language identification
• https://arxiv.org/abs/2012.06185
![Page 29: Speech in NIPS 2019/2020 - Tsinghua University](https://reader031.fdocuments.us/reader031/viewer/2022022501/6216f3947cc1b751d3597014/html5/thumbnails/29.jpg)
The Cone of Silence: Speech Separation by Localization
https://grail.cs.washington.edu/projects/cone-of-silence/