Adapting Wavenet for Speech Enhancement
Dario Rethage | July 12, 2017
I am
- Master student
- 6 months @ Music Technology Group, Universitat Pompeu Fabra
- Deep learning for acoustic source separation
- With Jordi Pons, Audio Signal Processing Lab
Learning from raw audio
- High dimensionality
- Many levels of structure
- No handcrafted feature extraction
- No discarding of information (phase)
- Until recently computationally intractable
(Figure labels: timbre, phoneme, phonetic transition)
Wavenet: A Generative Model for Raw Audio
- Speech synthesis on waveform level using auto-regressive, generative model
- Generates 8-bit (256 values) probability distribution
- Sample output distribution (probabilistic task)
- Considerable parameter savings
  - Small filters
  - Large dilations
- 16 kHz sampling rate (wide-band)
- Very slow
- Not strictly end-to-end
Wavenet: Key Features
- Causality
- Gated Units
- Softmax Output
- μ-law Quantization
- Dilation
- Stacks
Causality
- Only previous and current sample inform prediction of sample t+1
- Asymmetric padding
- 2x1 filters
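The causality bullets above can be illustrated with a minimal NumPy sketch of a single dilated 2x1 filter with left-only (asymmetric) zero padding. The function name `causal_dilated_conv` and the filter weights are hypothetical and chosen only for illustration; this is not code from the talk.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """2x1 causal filter: y[t] = w[0] * x[t - dilation] + w[1] * x[t].

    Zeros are padded on the left only, so no future sample is ever read.
    """
    x = np.asarray(x, dtype=float)
    padded = np.concatenate([np.zeros(dilation), x])  # asymmetric padding
    past = padded[:len(x)]        # x[t - dilation], zero where unavailable
    current = padded[dilation:]   # x[t]
    return w[0] * past + w[1] * current

x = np.array([1.0, 2.0, 3.0, 4.0])
y = causal_dilated_conv(x, w=(0.5, 1.0), dilation=2)
print(y)  # [1.  2.  3.5 5. ]
```

Changing the last input sample leaves all earlier outputs untouched, which is the property the asymmetric padding is there to guarantee.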
Gated Units
- Control contribution of each layer
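For reference, the gated activation unit of the original Wavenet paper can be written as below. The symbols are not on the slide: $W_f$ and $W_g$ denote the filter and gate convolution kernels of a layer, $*$ is convolution, $\odot$ is the element-wise product and $\sigma$ is the sigmoid.

```latex
z = \tanh\left(W_{f} * x\right) \odot \sigma\left(W_{g} * x\right)
```

The sigmoid branch takes values in (0, 1) and scales the tanh branch, which is how a layer can attenuate or pass its own contribution.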
μ-law quantization
- Non-linear companding
- Better use of 8-bit quantization space
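As a sketch of what μ-law companding to 8 bits looks like, the standard formula with μ = 255 can be written in NumPy as follows. The function names are hypothetical, and the input is assumed to be a waveform scaled to [-1, 1].

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress a waveform in [-1, 1] and quantize it to integer codes 0..mu."""
    x = np.clip(np.asarray(x, dtype=float), -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((compressed + 1.0) / 2.0 * mu).astype(np.int64)

def mu_law_decode(codes, mu=255):
    """Map integer codes 0..mu back to an approximate waveform in [-1, 1]."""
    compressed = 2.0 * np.asarray(codes, dtype=float) / mu - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

quiet = np.array([0.01])
print(mu_law_encode(quiet), mu_law_decode(mu_law_encode(quiet)))
```

Because the logarithm spends more codes near zero, quiet samples are reconstructed with a much smaller error than uniform 8-bit quantization would give, at the cost of coarser steps for loud samples.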
Softmax
- No assumptions about output distribution
- Well suited for multi-modal distributions
- Requires discretization of output
Stacks
- Repeat dilation pattern
- More depth, less width
Dilation
- Larger receptive field, same parameters
- By powers of 2
Wavenet: Reimplementation
- Many open questions
  - Filter depths
  - Number of layers
- Trained on VCTK, 109 native speakers of English, good phonetic coverage
- Proof of concept
- ~600k parameters
Speech Enhancement
- Within acoustic source separation
- Deterministic
- Goal: Improve intelligibility and/or overall perceptual quality of speech signal
- Until recently, greatest successes in the frequency domain
  - e.g. estimating spectral mask
$m_t = s_t + b_t$, where $m$: mixture, $s$: speech, $b$: background
Either estimate $\hat{\mathbf{s}}$ given $\mathbf{m}$ directly, or $\hat{\mathbf{b}}$ given $\mathbf{m}$, since $\mathbf{s} = \mathbf{m} - \mathbf{b}$
A Wavenet For Source Separation
- Generic architecture, suitable for any acoustic source separation
- Blind two-source separation
- Discriminative
- End-to-end
  - Time-domain input/output
  - No pre/post-filtering
  - No quantization
- 16 kHz sampling rate (wide-band)
- Flexible
Key Contributions
- Non-causality
- Real-valued predictions
- Non-autoregressive
- Target fields
- Enforces time continuity
- Energy-conserving loss
Non-causality
- Equal context in the past and future
- Symmetric padding
- 3x1 filters
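A minimal NumPy sketch of one non-causal dilated 3x1 filter with symmetric zero padding is given below. The function name `noncausal_dilated_conv` and the weights are hypothetical; the point is only that each output reads one sample `dilation` steps in the past and one `dilation` steps in the future.

```python
import numpy as np

def noncausal_dilated_conv(x, w, dilation):
    """3x1 filter: y[t] = w[0]*x[t - d] + w[1]*x[t] + w[2]*x[t + d]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    padded = np.pad(x, dilation)                 # same zero padding on both sides
    past = padded[:n]                            # x[t - d]
    current = padded[dilation:dilation + n]      # x[t]
    future = padded[2 * dilation:]               # x[t + d]
    return w[0] * past + w[1] * current + w[2] * future

x = np.array([1.0, 2.0, 3.0, 4.0])
y = noncausal_dilated_conv(x, w=(1.0, 1.0, 1.0), dilation=1)
print(y)  # [3. 6. 9. 7.]
```

Here a change to the last input sample also changes the output one step earlier, which a causal filter would not allow.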
Real-valued Predictions
- Assumes Gaussian output distribution
- No quantization error
- One output unit per output sample
- μ-law companding disadvantageous
(Figure panels: Wavenet, Proposed Model)
Target Fields
(Figure sequence: from a single target sample to a target field)
- Autoregression requires sequential, sample-by-sample inference → slow
- Parallel prediction of target field benefits inference AND training
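To make the cost difference concrete, the sketch below counts forward passes for sample-by-sample prediction versus target-field prediction. The numbers are assumptions for illustration: a 100 ms target field is taken as 1600 samples at 16 kHz, and the receptive field as 6144 samples (384 ms at 16 kHz, rounded). The function names are hypothetical.

```python
import math

def forward_passes(num_samples, target_field):
    """Number of forward passes needed to cover num_samples output samples."""
    return math.ceil(num_samples / target_field)

def input_length(receptive_field, target_field):
    """Input samples one pass needs so every target sample sees full context."""
    return receptive_field + target_field - 1

one_second = 16000  # samples at 16 kHz
print(forward_passes(one_second, target_field=1))     # 16000
print(forward_passes(one_second, target_field=1600))  # 10
print(input_length(6144, 1600))                       # 7743
```

Neighbouring target samples share almost all of their context, so predicting them in one pass avoids recomputing the overlapping part for each sample.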
Enforcing Time Continuity
- Without autoregression, original Wavenet produces point discontinuities
- Very unpleasant sound
- 3x1 filters in final (non-dilated) layers allow time continuity to be reflected in the loss
(Figure panels: point discontinuity, 3x1 filters)
Energy-Conserving Loss
- Goal: $E_{m_t} \equiv E_{\hat{m}_t}$
- Inspired by dissimilarity losses
- Empirically, reduces algorithmic artifacts
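One way to sketch such a loss, under the assumption of a two-term L1 form in which the background estimate is obtained by subtracting the speech estimate from the mixture, is shown below. The slide does not give the formula, so the function names and the exact form are assumptions, not the author's implementation.

```python
import numpy as np

def estimate_background(mixture, speech_estimate):
    """Assumed rule: whatever is not estimated as speech is background."""
    return mixture - speech_estimate

def energy_conserving_loss(mixture, speech, speech_estimate):
    """Assumed two-term L1 loss over both the speech and background estimates."""
    background = mixture - speech
    background_estimate = estimate_background(mixture, speech_estimate)
    speech_term = np.abs(speech - speech_estimate)
    background_term = np.abs(background - background_estimate)
    return float(np.mean(speech_term + background_term))

rng = np.random.default_rng(0)
speech = rng.standard_normal(8)
background = rng.standard_normal(8)
mixture = speech + background
speech_estimate = speech + 0.1  # a deliberately imperfect estimate

# The two estimates always sum back to the mixture, so no energy is lost or added.
assert np.allclose(speech_estimate + estimate_background(mixture, speech_estimate), mixture)
print(energy_conserving_loss(mixture, speech, speech_estimate))
```

Because the background estimate is derived by subtraction, the estimated sources reconstruct the mixture exactly, which is the sense in which the estimated mixture matches the observed one.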
Flexibility in Temporal Dimension
- Same model can be deployed on reduced computational resources
- Audio input of arbitrary length → one-shot denoising
- Reduces redundant computations
- 25 s of audio in single forward pass (Titan X Pascal)
- ~0.56 s per 1 second of noisy audio
- Fully convolutional
Experiments
Setup
- 33 layers
  - Dilations: 1, 2, ..., 256, 512
  - Stacks: 3
- 384 ms receptive field
- 6.3M parameters
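The 384 ms figure can be checked with a short calculation. The sketch assumes that only the dilated layers are counted, that each uses a 3x1 non-causal filter (so a layer with dilation d widens the receptive field by 2d samples), and that the sampling rate is 16 kHz. The function name is hypothetical.

```python
def receptive_field(dilations, num_stacks, filter_width=3):
    """Receptive field in samples of stacked dilated convolutions."""
    return (filter_width - 1) * sum(dilations) * num_stacks + 1

dilations = [2 ** i for i in range(10)]  # 1, 2, ..., 256, 512
samples = receptive_field(dilations, num_stacks=3)
milliseconds = 1000 * samples / 16000

print(samples, milliseconds)  # 6139 383.6875
```

Ten dilations per stack times three stacks accounts for 30 of the 33 layers; the remaining layers are not modelled here.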
Data
- VCTK for voice
- DEMAND for environmental sounds
Unseen speakers in unseen noise conditions
Training SNR: 0 dB – 18 dB
Test SNR: 2.5 dB – 17.5 dB
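The SNR values above are ratios of speech power to background power in dB. As an illustration of what a given SNR means under the model $m_t = s_t + b_t$, the sketch below scales a noise signal so that the mixture has a requested SNR. The slides do not say how the mixtures were created, so this is a generic recipe with hypothetical names, and random signals stand in for real recordings.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / background power equals snr_db."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    background = gain * noise
    return speech + background, background

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for 1 s of speech at 16 kHz
noise = rng.standard_normal(16000)   # stand-in for 1 s of environmental noise
mixture, background = mix_at_snr(speech, noise, snr_db=2.5)

measured = 10 * np.log10(np.mean(speech ** 2) / np.mean(background ** 2))
print(round(measured, 3))  # 2.5
```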
Evaluation Metrics
- Should be perceptually meaningful
- MOS = mean opinion score (predicted) in range [1, 5]
- Weighted combination of objective speech quality measures
- SIG: MOS rating of the signal distortion attending only to the speech signal
- BAK: MOS rating of the intrusiveness of background noise
- OVL: MOS rating of the overall effect
Results
Best Configuration
- Energy-conserving loss
- 10% noise-only augmentation
- 100 ms target field
- Conditioning
(Audio examples at 12.5 dB, 7.5 dB and 2.5 dB SNR; columns: Mixed, Speech, Background, Wiener)
Perceptual Evaluation
- 33 participants
- 20 samples, 5 at each SNR
- 1-5 quality rating
“give an overall quality score, taking into consideration both: speech quality and background-noise suppression”
Wiener Filtering: 2.92 | Proposed Model: 3.60
Takeaway
- A discriminative adaptation of Wavenet for speech enhancement
- Reduction in time complexity, without sacrificing expressive capability
- Noise-only augmentation necessary for generating silence
- No speech-specific constraints
- Energy-conservation
- Perceptual trials: Preferred over Wiener Filtering
- Possible to learn multi-scale hierarchical representations from raw audio
- Audio samples online, source on GitHub
Future Work
- Continue exploring the idea of energy-conserving losses in neural audio processing models
- Better handling of short-time high energy events, e.g. honk in city traffic
- Apply to other audio domains
  - Music, multi-track separation
Thank you