Speaker Magazine Article | Keynote Speaker | Business Speaker
INTERSPEECH 2016 Tutorial: Machine Learning for Speaker...
Transcript of INTERSPEECH 2016 Tutorial: Machine Learning for Speaker...
![Page 1: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/1.jpg)
INTERSPEECH 2016 Tutorial:Machine Learning for Speaker Recognition
Man-Wai Mak† and Jen-Tzung Chien‡
†The Hong Kong Polytechnic University, Hong Kong‡National Chiao Tung University, Taiwan
September 8, 2016
1 / 274
![Page 2: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/2.jpg)
Table of Contents
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction
2 / 274
![Page 3: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/3.jpg)
Outline
1 Introduction1.1. Fundamentals of speaker recognition1.2. Feature extraction and scoring1.3. Modern speaker recognition approaches
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction
3 / 274
![Page 4: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/4.jpg)
Fundamentals of speaker recognition
Speaker recognition is a technique to recognize the identity of aspeaker from a speech utterance.
Speaker recognition
Speaker identification
Speaker verification
Speaker diarization
Text dependent
Text independent
Open set
Close set
4 / 274
![Page 5: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/5.jpg)
Speaker identification
Determine whether unknown speaker matches one of a set knownspeakersOne-to-many mappingOften assumed that unknown voice must come from a set of knownspeakers – referred to as close-set identificationAdding “none of the above” option to closed-set identification givesopen-set identification
5 / 274
![Page 6: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/6.jpg)
Speaker verification
Determine whether unknown speaker matches a specific speakerOne-to-one mappingClose-set verification: The population of clients is fixedOpen-set verification: New clients can be added without having toredesign the system.
6 / 274
![Page 7: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/7.jpg)
Speaker diarization
Determine when a speaker change has occurred in speech signal(segmentation)Group together speech segments corresponding to the same speaker(clustering)Prior speaker information may or may not be available
7 / 274
![Page 8: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/8.jpg)
Input mode
Text-dependentRecognition system knows text spoken by personsFixed phrases or prompted phrasesUsed for applications with strong control over user input, e.g.,biometric authenticationSpeech recognition can be used for checking spoken text to improvesystem performanceSentences typically very short
Text-independentNo restriction on the text, typically conversational speechUsed for applications with less control over user input, e.g., forensicspeaker IDMore flexible but recognition is more difficultSpeech recognition can be used for extracting high-level features toboost performanceSentences typically very long
8 / 274
![Page 9: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/9.jpg)
Outline
1 Introduction1.1. Fundamentals of speaker recognition1.2. Feature extraction and scoring1.3. Modern speaker recognition approaches
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction
9 / 274
![Page 10: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/10.jpg)
Feature extraction
Speech is a time-varying signal conveying multiple layers ofinformation
WordsSpeakerLanguageEmotion
Information in speech is observed in the time and frequency domains
Acoustic Features • Speech is a continuous evolution of the vocal tract • Need to extract a sequence of spectra or sequence of spectral
coefficients • Use a sliding window - 25 ms window, 10 ms shift
DCT log|X(ω)| MFCC
10 / 274
![Page 11: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/11.jpg)
Feature extraction from speech
Feature extraction consists in transforming the speech signal to a setof feature vectors. Most of the feature extraction used in speakerrecognition systems relies on a cepstral representation of speech.
Figure: Modular representation of MFCC feature extractor
11 / 274
![Page 12: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/12.jpg)
Computing MFCC
Parameterization 34
Mel-Frequency Cepstrum Coefficients (MFCCs)
nc
⎥⎦
⎤⎢⎣
⎡= ∑
−
=
1
0
2 )()(ln)(N
km kHkSmX
)(ns
MmkHm ≤≤1 )(
∑−
=
π−=1
0
2)()(N
n
NnkjenskS
MFCCn = cn = cos n m− 12
"
#$
%
&'πM
(
)*
+
,-
m=1
M
∑ X(m) = dnTX, 0 ≤ n ≤ P
⇒ MFCC vector = c0 c1 cP()
+,T=DX
D : DCT Transformation matrix [P × M] M : No. of triangular filters in the filter bank, typically 20 ~ 30 P : No. of cepstral coefficients, typically 12 c0 : Logarithm of energy of the current frame
Filter Bank
H15(k) : the 15th triangular filter
N
Figure: Computing MFCC from one frame of speech12 / 274
![Page 13: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/13.jpg)
Modeling sequence of features
For most recognition tasks, we need to model the distribution offeature vector sequences
In practice, we often use the Gaussian mixture models (GMMs)
13 / 274
![Page 14: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/14.jpg)
GMM–UBM speaker verification
A Gaussian mixture model, namely the universal background model(UBM), is trained to represent the speech of the general population.
p(x|UBM) = p(x|Λubm) =C∑
c=1πubm
c N (x|µubmc ,Σubm
c )
The UBM parameters Λubm =πubm
c ,µubmc ,Σubm
c
C
c=1are estimated
by the expectation-maximization algorithm using the speech of manyspeakers.
14 / 274
![Page 15: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/15.jpg)
Expectation maximization (EM)
Denote the acoustic vectors from a large population asX = xt ; t = 1, . . . ,TExpectation step:
Conditional distribution of mixture component c:
γt(c) = p(c|xt) = πcN (xt |µubmc ,Σubm
c )∑Cc=1 π
ubmc N (xt |µubm
c ,Σubmc )
Maximization step:Mixture weights: πubm
c = 1T∑T
t=1 γt(c)
Mean vectors: µubmc =
∑Tt=1
γt (c)xt∑Tt=1
γt (c)
Covariance matrices: Σubmc =
∑Tt=1
γt (c)xt xTt∑T
t=1γt (c)
− µubmc (µubm
c )T
15 / 274
![Page 16: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/16.jpg)
Target-speakers’ GMMs
Each target speaker is represented by a Gaussian mixture model:
p(x|Spk s) = p(x|Λ(s)) =C∑
c=1π(s)
c N (x|µ(s)c ,Σ(s)
c )
where Λ(s) =π
(s)c ,µ
(s)c ,Σ(s)
c
C
c=1are learned by using maximum a
posteriori (MAP) adaptation [Reynolds et al., 2000].
16 / 274
![Page 17: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/17.jpg)
Maximum a posteriori (MAP)
The MAP algorithm finds the parameters of target-speaker’s GMMgiven UBM parameters Λubm =
πubm
c ,µubmc ,Σubm
c
C
c=1First step is the same as EM. Given Ts acoustic vectorsX (s) = x1, . . . , xTs from speaker s, we compute the statistics:
nc =Ts∑
t=1γt(c) and Ec(X (s)) = 1
nc
Ts∑t=1
γt(c)xt
Adapt UBM parameters by
µ(s)c = αcEc(X (s)) + (1− αc)µubm
c
whereαc = nc
nc + rand r is called the relevance factor. Typically, r = 16.
17 / 274
![Page 18: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/18.jpg)
MAP adaptation
Adapt the UBM model to each speaker using the MAP algorithm:1
µ(s)c = αcEc(X (s)) + (1− αc)µubm
c
αc → 1 when X (s) comprises lots of data and αc → 0 otherwise.
1Source: J. H. L. Hansen and T. Hasan, IEEE Signal Processing Magazine, 2015.18 / 274
![Page 19: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/19.jpg)
MAP adaptation
In practice, only the mean vectors will be adapted:
19 / 274
![Page 20: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/20.jpg)
GMM-UBM scoring
Given the acoustic vectors X (t) from a test speaker and a claimedidentity s, speaker verification can be formulated as a 2-classhypothesis problem:
H0: X (t) comes from the true speaker sH1: X (t) comes from an impostor
Verification score is a log-likelihood ratio:
SLR(X (t)|Λ(s),Λubm) = log p(X (t)|Λ(s))− log p(X (t)|Λubm)
20 / 274
![Page 21: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/21.jpg)
Sources of variability
21 / 274
![Page 22: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/22.jpg)
How to account for variabilityGMM-SVM [Campbell et al., 2006]:
Create supervectors from target-speaker GMMs.Then, project the supervectors to a subspace in which inter-speakervariability is maximized and nuisance variability is minimized.Perform SVM classification on the projected subspace.
Joint Factor Analysis:Speaker and session variabilities are represented by latent variables(speaker factors and channel factors) in a factor analysis model.During scoring, session variabilities are accounted for by integratingover the latent variables, e.g., the channel factors as in[Kenny et al., 2007a].
I-Vector + PLDA:2Utterances are represented by the posterior means of latent factors,called the i-vectors [Dehak et al., 2011].I-vectors capture both speaker and channel information.During scoring, the unwanted channel variability is removed by LDAprojection or by integrating out the latent factors in the PLDA model.
2For the relationship between JFA and I-vectors and their derivations, seehttp://www.eie.polyu.edu.hk/∼mwmak/papers/FA-Ivector.pdf
22 / 274
![Page 23: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/23.jpg)
Performance measures
For speaker identification
Recogntion Rate = No. of correct recognitionsTotal no. of trials
For speaker verification
False Rejection Rate (FRR) = Miss probability
= No. of true-speakers rejectedTotal no. of true-speaker trials
False Acceptance Rate (FAR) = False alarm probability
= No. of impostors acceptedTotal no. of impostor attempts
Equal error rate (EER) corresponds to the operating point at whichFAR = FRR
23 / 274
![Page 24: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/24.jpg)
Performance measures for speaker verification
Detection error tradeoff (DET) curves are similar to receiveroperating characteristic curves but with nonlinear x- and y-axis.
24 / 274
![Page 25: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/25.jpg)
Detection cost functions
Detection cost function (DCF) is a weighted sum of the FRR(PMiss|Target) and FAR (PFalseAlarm|Nontarget):
CDet(θ) = CMiss × PMiss|Target(θ)× PTarget +CFalseAlarm × PFalseAlarm|Nontarget(θ)× (1− PTarget)
where θ is a decision threshold.Normalized cost:
CNorm = CDet(θ)/CDefault
whereCDefault = min
CMiss × PTargetCFalseAlarm × (1− PTarget)
NIST 2008 SRE and earlier:CMiss = 10; CFalseAlarm = 1; PTarget = 0.01
NIST 2010 SRE:CMiss = 1; CFalseAlarm = 1; PTarget = 0.001
25 / 274
![Page 26: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/26.jpg)
Detection cost functions
Dectection cost function for NIST 2012 SRE:CDet(θ) = CMiss × PMiss|Target(θ)× PTarget +
CFalseAlarm × (1− PTarget) ×[PFalseAlarm|KnownNontarget(θ)× PKnown +
PFalseAlarm|UnKnownNontarget × (1− PKnown])CNorm(θ) = CDet(θ)/(CMiss × PTarget)
Parameters for core test conditionsCMiss = 1; CFalseAlarm = 1; PTarget1 = 0.01; PTarget2 = 0.001;PKnown = 0.5
PTarget1 → CNorm1(θ1) and PTarget2 → CNorm2(θ2)Primary cost:
CPrimary = CNorm1(θ1) + CNorm2(θ2)2
26 / 274
![Page 27: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/27.jpg)
Detection cost functions
Detection cost function for NIST 2016 SRE:
CDet(θ) = CMiss × PMiss|Target(θ)× PTarget +CFalseAlarm × PFalseAlarm|Nontarget(θ)× (1− PTarget)
CNorm(θ) = CDet(θ)/(CMiss × PTarget)
Parameters for core test conditions
CMiss = 1; CFalseAlarm = 1; PTarget1 = 0.01; PTarget2 = 0.005
PTarget1 → CNorm1(θ1) and PTarget2 → CNorm2(θ2)Primary cost:
CPrimary = CNorm1(θ1) + CNorm2(θ2)2
27 / 274
![Page 28: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/28.jpg)
Outline
1 Introduction1.1. Fundamentals of speaker recognition1.2. Feature extraction and scoring1.3. Modern speaker recognition approaches
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction
28 / 274
![Page 29: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/29.jpg)
Main approachParametric model
Artificial neural network
Gaussian mixture model
Joint factor analysis
i-vector
Auto-encoder
Deep neural network
Support vector machine
Probabilistic linear discriminant analysis
Restricted Boltzmann machine
Nonparametric model
29 / 274
![Page 30: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/30.jpg)
Outline
1 Introduction
2 Learning Algorithms2.1. Machine learning2.2. EM algorithm2.3. Approximate inference2.4. Bayesian learning
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction30 / 274
![Page 31: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/31.jpg)
Outline
1 Introduction
2 Learning Algorithms2.1. Machine learning2.2. EM algorithm2.3. Approximate inference2.4. Bayesian learning
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction31 / 274
![Page 32: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/32.jpg)
Model-based method
Machine learning provides a wide range of model-based approachesfor speaker recognitionModel-based approach aims to incorporate the physical phenomena,measurements, uncertainties and noises in the form of mathematicalmodelsThis approach is developed in a unified manner through differentalgorithms, examples, applications, and case studiesMain-stream methods are based on the statistical modelsLatent variable models in speaker recognition include− joint factor analysis (JFA)− probabilistic linear discriminant analysis (PLDA)− Gaussian mixture model (GMM)− mixture of PLDA
32 / 274
![Page 33: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/33.jpg)
Neural network
Deep structured/hierarchical learningRapidly developed and widely applied for many applicationsMultiple layers of nonlinear processing unitsHigh-level abstraction
Run
Jump
33 / 274
![Page 34: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/34.jpg)
Model-based method vs. neural network
Model-based method Neural network
Structure Top-down Bottom-upRepresentation Intuitive DistributedInterpretation Easy Harder
Semi/unsupervised Easier HarderIncorp. domain knowl. Easy HardIncorp. constraint Easy HardIncorp. uncertainty Easy HardLearning Many algorithms Back-propagationInference/decode Harder EasierEvaluation on ELBO End performance
34 / 274
![Page 35: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/35.jpg)
Model-based method vs. neural network
Model-based method Neural network
Structure Top-down Bottom-upRepresentation Intuitive DistributedInterpretation Easy Harder
Semi/unsupervised Easier HarderIncorp. domain knowl. Easy HardIncorp. constraint Easy HardIncorp. uncertainty Easy Hard
Learning Many algorithms Back-propagationInference/decode Harder EasierEvaluation on ELBO End performance
35 / 274
![Page 36: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/36.jpg)
Model-based method vs. neural network
Model-based method Neural network
Structure Top-down Bottom-upRepresentation Intuitive DistributedInterpretation Easy HarderSemi/unsupervised Easier HarderIncorp. domain knowl. Easy HardIncorp. constraint Easy HardIncorp. uncertainty Easy Hard
Learning Many algorithms Back-propagationInference/decode Harder EasierEvaluation on ELBO End performance
36 / 274
![Page 37: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/37.jpg)
Modern machine learning
Model-based method Neural network
Structure Top-down Bottom-upRepresentation Intuitive DistributedInterpretation Easy HarderSemi/unsupervised Easier HarderIncorp. domain knowl. Easy HardIncorp. constraint Easy HardIncorp. uncertainty Easy HardLearning Many algorithms Back-propagationInference/decode Harder EasierEvaluation on ELBO End performance
37 / 274
![Page 38: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/38.jpg)
Outline
1 Introduction
2 Learning Algorithms2.1. Machine learning2.2. EM algorithm2.3. Approximate inference2.4. Bayesian learning
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction38 / 274
![Page 39: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/39.jpg)
Parameter estimation
Assume we have a collection of acoustic frames X = xtTt=1 for
estimation of model parameters θMaximum likelihood (ML) estimation
θML = arg maxθ
p(X |θ)
Maximum a posteriori (MAP) estimation
θMAP = arg maxθ
p(θ|X ) = arg maxθ
p(X |θ)p(θ)
where p(θ) denotes the prior distribution of θ
39 / 274
![Page 40: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/40.jpg)
Expectation-maximization algorithm
Likelihood function for observations x in latent variable model withlatent variable z
p(x|θ) =∑
zp(x, z|θ)
Expectation (E) step: calculate an auxiliary function
Q(θ,θold) = Ez[log p(x, z|θ)|x,θold]
Maximization (M) step: find a new estimate θnew via
θnew = arg maxλ
Q(θ,θold)
EM algorithm [Dempster et al., 1977] for ML can be extended forMAP
40 / 274
![Page 41: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/41.jpg)
Lower bound & KL divergence
Introduce an approximate or variational distribution q(z) and adoptthe Jensen’s inequality for convex function − log(·) to obtain
log p(x|θ) = log∑
z
p(x, z|θ)q(z) q(z) = logEq
[p(x, z|θ)q(z)
]≥ Eq
[log p(x, z|θ)
q(z)
], L(q,θ)
∑z
q(z) log p(x|θ)− L(q,θ) = −∑
zq(z) log
p(z|x,θ)q(z)
, KL(q‖p)
Evidence Decomposition
log p(x|θ) = KL(q‖p) + L(q,θ)
41 / 274
![Page 42: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/42.jpg)
Maximum Likelihood
KL(q‖p) = −Eq[log p(z|x,θ)]−Hq[z]
L(q,θ) = Eq[log p(x, z|θ)] + Hq[z]
Maximizing p(x|θ) is equivalent to first setting KL(q‖p) = 0 orapproximating (E-step)
q(z) = p(z|x,θold )
then maximizing the resulting lower bound (M-step)
L(q,θ) , Q(θ,θold) + const
where Q(θ,θold) , Eq[log p(x, z|θ)|x,θold] is a concave function
42 / 274
![Page 43: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/43.jpg)
EM algorithm
00
KL(qkp)KL(qkp)
L(q;µ)L(q;µ)
log p(xjµ)log p(xjµ)
43 / 274
![Page 44: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/44.jpg)
EM algorithm: E-step
00
KL(qkp)KL(qkp)
L(q;µold)L(q;µold)
= 0= 0
log p(xjµold)log p(xjµold)
44 / 274
![Page 45: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/45.jpg)
EM algorithm: M-step
00
KL(qkp)KL(qkp)log p(xjµnew)log p(xjµnew)
L(q;µnew)L(q;µnew)
45 / 274
![Page 46: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/46.jpg)
EM algorithm: lower bound
µoldµold µnewµnew
L(q;µold)L(q;µold)
L(q;µnew)L(q;µnew)
log p(xjµ)log p(xjµ)
46 / 274
![Page 47: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/47.jpg)
Outline
1 Introduction
2 Learning Algorithms2.1. Machine learning2.2. EM algorithm2.3. Approximate inference2.4. Bayesian learning
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction47 / 274
![Page 48: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/48.jpg)
Why approximate inference?
There are a number of latent variables in model-based speakerrecognition− i-vectors− common factors− variability matrix− mixture labels− channel, speaker and noise information
Posterior distribution of latent variables should be analytical andfactorizableEvolution of inference algorithms− maximum likelihood− maximum a posteriori− variational Bayesian− Gibbs sampling
48 / 274
![Page 49: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/49.jpg)
Posterior distribution
PosteriorLikelihood Prior
Marginal Likelihood aaa(model evidence)
p(zjx) =p(xjz) p(z)
Rp(xjz)p(z)dz
p(zjx) =p(xjz) p(z)
Rp(xjz)p(z)dz
p(x)p(x)
Latent variables and parameters z = z1, . . . , zm are coupled
49 / 274
![Page 50: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/50.jpg)
Approximate posterior
q(z)q(z)Divergence
p(zjx)p(zjx)
KL(qkp)KL(qkp) true
posteriorproxy
Find an approximate distribution q(z) that is factorizable andmaximally similar to the true posterior p(z|x)
50 / 274
![Page 51: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/51.jpg)
Variational Bayesian inference
q(z1:m|ν1:m) =m∏
j=1q(zj |νj)
Variationalcalculus
Optimization problem
functional
L(q) : q 7! L(q)L(q) : q 7! L(q)
maxq L(q)
s.t.R
zq(dz) = 1
maxq L(q)
s.t.R
zq(dz) = 1
p(x) = KL(qkp) + L(q)
where KL(qkp) = ¡Eq[ln p(zjx)] ¡Hq[z]
L(q) = Eq[ln p(x;z)] + Hq[z]
p(x) = KL(qkp) + L(q)
where KL(qkp) = ¡Eq[ln p(zjx)] ¡Hq[z]
L(q) = Eq[ln p(x;z)] + Hq[z]
(Evidence Lower BOund, ELBO)
51 / 274
![Page 52: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/52.jpg)
ln p(x)ln p(x)ln p(x)ln p(x)
00 00
KL(qkp)KL(qkp)
L(q)L(q)
L(q)L(q)
KL(qkp)KL(qkp)
Estimation for variational distributionmaxq(z)
Eq[log p(x, z)] + Hq[z]
s.t.∫
zq(dz) = 1
q(zj |νj) = exp(Ei 6=j [log p(x, z|ν)])∫exp(Ei 6=j [log p(x, z|ν)])dzj
52 / 274
![Page 53: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/53.jpg)
Variational Bayesian (VB) inference is implemented via adoubly-looped algorithm
VB-EM algorithmVB-E step: calculate the variational distribution q(z) in inner loop
q(z) = arg maxq(z)L(q,θ)
VB-M step: calculate the model parameter θ in outer loop
θ = arg maxθL(q,θ)
Convex optimization is performedVB-EM steps converge by a number of iterations
53 / 274
![Page 54: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/54.jpg)
Gibbs sampling algorithm
Initialize z(1), where z = z1:m
for τ ← 1 to T − 1 do
for j ← 1 to m do
Sample z(τ+1)j ∼ p(zj |z(τ+1)
1:(j−1), z(τ)j+1:m)
end for
end for
54 / 274
![Page 55: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/55.jpg)
Gibbs sampling
Two dimensional Gaussian mixture model with two mixture components
55 / 274
![Page 56: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/56.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Randomly assign mixture component for each sample j
56 / 274
![Page 57: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/57.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Extract one sample and compute the conditional distribution
57 / 274
![Page 58: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/58.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Sample a mixture component from the conditional distribution
58 / 274
![Page 59: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/59.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Extract one sample and compute the conditional distribution
59 / 274
![Page 60: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/60.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Sample a mixture component from the conditional distribution
60 / 274
![Page 61: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/61.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Extract one sample and compute the conditional distribution
61 / 274
![Page 62: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/62.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Sample a mixture component from the conditional distribution
62 / 274
![Page 63: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/63.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Extract one sample and compute the conditional distribution
63 / 274
![Page 64: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/64.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Sample a mixture component from the conditional distribution
64 / 274
![Page 65: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/65.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Extract one sample and compute the conditional distribution
65 / 274
![Page 66: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/66.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Sample a mixture component from the conditional distribution
66 / 274
![Page 67: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/67.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Extract one sample and compute the conditional distribution
67 / 274
![Page 68: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/68.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Sample a mixture component from the conditional distribution
68 / 274
![Page 69: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/69.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Extract one sample and compute the conditional distribution
69 / 274
![Page 70: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/70.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Sample a mixture component from the conditional distribution
70 / 274
![Page 71: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/71.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Extract one sample and compute the conditional distribution
71 / 274
![Page 72: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/72.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Sample a mixture component from the conditional distribution
72 / 274
![Page 73: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/73.jpg)
Gibbs sampling
zj » p( ¢ j z¡j;x)zj » p( ¢ j z¡j;x)
Finally obtain an appropriate clustering result
73 / 274
![Page 74: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/74.jpg)
Variational Bayes
deterministic approximationfind an analytical proxy q(z)that is maximally similar top(z|x)inspect distribution statisticsnever generate exact resultsfastoften hard work to deriveconvergence guaranteesneed a specific parametricform
Gibbs sampling
stochastic approximationdesign an algorithm thatdraws samples z(1), . . . , z(τ)
from p(z|x)inspect sample statisticsasymptotically exactcomputationally expensivetricky engineering concernsno convergence guaranteesno need parametric form
74 / 274
![Page 75: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/75.jpg)
Outline
1 Introduction
2 Learning Algorithms2.1. Machine learning2.2. EM algorithm2.3. Approximate inference2.4. Bayesian learning
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction75 / 274
![Page 76: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/76.jpg)
Challenges in model-based approach
Thomas Bayes (1701-1761)
We are facing the challenges of big dataAn enormous amount of multimedia data is available in internet whichcontains speech, text, image, music, video, social networks and anyspecialized technical dataThe collected data are usually noisy, non-labeled, non-aligned, mismatched,and ill-posedProbabilistic models may be improperly-assumed, over-estimated, orunder-estimated
76 / 274
![Page 77: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/77.jpg)
Uncertainty modeling
We need tools for modeling, analyzing, searching, recognizing andunderstanding real-world dataOur modeling tools should− faithfully represent uncertainty in model structure and its parameters− reflect noise condition in observed data− be automated and adaptive− assure robustness− scalable for large data sets
Uncertainty can be properly expressed by prior distribution or process
77 / 274
![Page 78: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/78.jpg)
Model regularization
Regularization refers to a process of introducing additional informationin order to solve the ill-posed problem or to prevent overfittingOccam’s razor is imposed to deal with the issue of model selectionScalable modeling
78 / 274
![Page 79: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/79.jpg)
Bayesian speaker recognition
Real-world speaker recognition− unsupervised learning− number of factors is unknown− very short enrollment utterance− high inter/intra speaker variabilities− variabilities from channel and noise
Why Bayesian? [Watanabe and Chien, 2015]− exploration for latent variables− model regularization− uncertainty modeling− approximate Bayesian inference− better prediction
79 / 274
![Page 80: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/80.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models3.1. GMM-UBM system3.2. Joint factor analysis3.3. Probabilistic linear discriminant analysis3.4. Support vector machine3.5. Restricted Boltzmann machine
4 Deep Learning
5 Case Studies
6 Future Direction 80 / 274
![Page 81: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/81.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models3.1. GMM-UBM system3.2. Joint factor analysis3.3. Probabilistic linear discriminant analysis3.4. Support vector machine3.5. Restricted Boltzmann machine
4 Deep Learning
5 Case Studies
6 Future Direction 81 / 274
![Page 82: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/82.jpg)
GMM distribution from three Gaussians
-20 -10 0 10 20 30 40 50 600
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
x
p(x)
82 / 274
![Page 83: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/83.jpg)
Gaussian mixture model
Gaussian mixture model (GMM) is a weighted sum of Gaussians
p(x|θ) =M∑
i=1πibi (x)
θ = πi ,ui ,Σi
πi : mixture weightui : mixture mean vectorΣi : mixture covariance matrix
bi (x) = 1(2π)D/2|Σi |1/2 exp
(−1
2(x− ui )>Σ−1i (x− ui )
)Mixture component zi is a latent variable which is either zero or one
83 / 274
![Page 84: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/84.jpg)
Maximum likelihood
In ML estimation, we need to− compute the likelihood of a sequence of features given a GMM− estimate the parameters of GMM given a set of feature vectors
Assuming independence between features in a sequence, we have
p(X |θ) = p(x1, . . . , xT |θ) =T∏
t=1p(xt |θ)
ML estimation is performed by
θML = arg maxθ
p(X |θ) = arg maxθ
T∑t=1
log[ M∑
i=1πibi (xt)
]
84 / 274
![Page 85: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/85.jpg)
Parameter estimation
E-step is to calculate the auxiliary function
Q(θ,θold) = Ez[log p(X , z|θ)|X ,θold]
=T∑
t=1
M∑i=1
p(zti = 1|X ,θold) log p(xt , zti = 1|θ)
ML estimates are obtained via M-step as
πnewi = Ti
T
µnewi = 1
Ti
T∑t=1
γ(zti )xt = 1Ti
Ei [x]
Σnewi = 1
Ti
T∑t=1
γ(zti )(xt − µi )(xt − µi )> = 1Ti
Ei [xx>]− µiµ>i
where Ti =∑
t γ(zti ) and γ(zti ) = p(zti = 1|X ,θold)85 / 274
![Page 86: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/86.jpg)
E-step
x
p(z1 =1jX)
p(z2 =1jX)
p(z3 =1jX)
p(zi = 1jX) =¼ibi(x)
M
j=1
¼jbj(x)
Ti =TP
t=1
p(zti = 1jX)
Ei(x) =TP
t=1
p(zti = 1jX)xt
Ei(xxT ) =TP
t=1
p(zti = 1jX)xtxTt
Accumulate sufficient statistics
Probabilistically align samples to each mixture
86 / 274
![Page 87: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/87.jpg)
M-step
x ¼newi = Ti
T
¹newi = 1
TiEi[x]
§newi = 1
TiEi[xx
T ]¡¹i¹Ti
GMM parameters
Update model parameters
xxx
xxx
x
87 / 274
![Page 88: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/88.jpg)
Speaker verification
Realization of log likelihood ratio test from signal detection theory
SLR(X |θtarget,θubm) = log(X |θtarget)− log(X |θubm)
Feature
extractor
Target model
Background
model
x1; ::::;xT+
¡
GMMs are used for both target and background models− target model trained using enrollment speech− universal background model trained using speech from many speakers
88 / 274
![Page 89: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/89.jpg)
Target model & UBM
Target model is adapted from universal background model (UBM)− good with limited target training data
Maximum a posteriori (MAP) adaptation− align target training vectors to UBM− accumulate sufficient statistics− update target model parameters with smoothing to UBM parameters
Adaptation for those parameters of seen acoustic events− sparse regions of feature space filled in by UBM parameters
Side benefits− keep correspondence between target and UBM mixtures− allow for fast scoring when using many target models (top-M scoring)
89 / 274
![Page 90: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/90.jpg)
Maximum a posteriori adaptation
Prior density for GMM mean vector µ = µi is introduced
p(µi ) = N (µi |µubmi , σ2I)
MAP estimation [Gauvain and Lee, 1994] is performed by using theenrollment data Xs = xTs
t=1 from a target speaker s− E-step is to calculate
Q(µi ,µoldi ) =
Ts∑t=1
γ(zti ) log p(xt |zti = 1,µi ) + log p(µi |µubmi , σ2I)
− M-step is to maximize Q(µi ,µoldi ) to find
µnewi =
∑t γ(zti )xt∑
t γ(zti ) + r + rµubmi∑
t γ(zti ) + r = αiEi [x] + (1− αi )µubmi
where αi =∑
tγ(zti )∑
tγ(zti )+r
90 / 274
![Page 91: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/91.jpg)
MAP adaptation for GMM-UBM
UBM is based on GMM and trained by using EM algorithmSpeaker GMM is established by adjusting UBM by using MAPadaptation
EM
MAP
UBM
Speaker
GMM
Training data
Enrollment data
91 / 274
![Page 92: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/92.jpg)
Speaker recognition procedure
Feature
extractorModeling GMM-UBM
Feature
extractorScoring Results
Enrollment
Test utterance
92 / 274
![Page 93: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/93.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models3.1. GMM-UBM system3.2. Joint factor analysis3.3. Probabilistic linear discriminant analysis3.4. Support vector machine3.5. Restricted Boltzmann machine
4 Deep Learning
5 Case Studies
6 Future Direction 93 / 274
![Page 94: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/94.jpg)
Joint factor analysis
Factor analysis is a statistical method which is used to describe thevariability among the observed variables in terms of potentially lowernumber of unobserved variables called factorsFactor analysis is a latent variable model for feature extractionJoint factor analysis (JFA) was the initial paradigm for speakerrecognition
u = m + Vy + Ux + Dz
Speaker
SupervectorUBM
Speaker
factors
Channel
factors
Residual
factors
94 / 274
![Page 95: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/95.jpg)
Intuition & interpretation
A supervector for a speaker should be decomposable into speakerindependent, speaker dependent, channel dependent, and residualcomponentsEach component is represented by low-dimensional factors, whichoperate along the principal dimensions of the correspondingcomponentSpeaker dependent component, known as the eigenvoice, and thecorresponding factors
V ¢ y =
2
4j j j j
v1 v2 : : : vN
j j j j
3
5
2
664
y1
y2...
yN
3
775
Eigenvoice matrix
Low dimensional eigenvoice factors
Each speaker factor controls
an eigendimension of the
eigenvoice matrix
95 / 274
![Page 96: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/96.jpg)
Factor decomposition
GMM supervector u for a speaker can be decomposed as
u = m + Vy + Ux + Dz
Speaker
supervectorSpeaker-independent
component
Speaker-dependent
component
Channel-dependent
component
Speaker-dependent
resuidual component
wherem is a speaker-independent supervector from UBMV is the eigenvoice matrixy ∼ N (0, I) is the speaker factor vectorU is the eigenchannel matrixx ∼ N (0, I) is the channel factor vectorD is the residual matrix, and is diagonalz ∼ N (0, I) is the speaker-specific residual factor vector
96 / 274
![Page 97: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/97.jpg)
Dimensionality
For a 512-mixture GMM-UBM system, the dimensions of each JFAcomponent are typically as follows− V 20,000 by 300 (300 eigenvoices)− y 300 by 1 (300 speaker factors)− U 20,000 by 100 (100 eigenchannels)− x 100 by 1 (100 channel factors)− D 20,000 by 20,000 (20,000 residuals)− z 20,000 by 1 (20,000 speaker-specific residuals)
These dimensions have been empirically determined to produce thebest resultsBayesian model selection can helpJudge by the marginal likelihood over latent component underdifferent dimensions
97 / 274
![Page 98: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/98.jpg)
Training procedure
We train the JFA matricies in the following order[Kenny et al., 2007a]
1. Train the eigenvoice matrix V, assuming that U and D are zero2. Train the eigenchannel matrix U given the estimate of V, assuming
that D is zero3. Train the residual matrix D given the estimates of V and U
Using these matrices, we compute y for speaker, x for channel, and zfor residual factorsWe compute the final score by using these matrices and factors
98 / 274
![Page 99: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/99.jpg)
Total variability
Subspaces U and V are not completely independentA combined total variability space was used [Dehak et al., 2011]
u = m + Vy + Ux + Dz
Speaker
supervectorUBM
Speaker
factors
Channel
factors
Residual
factors
u = m + Tw
Speaker
supervectorUBM
Total variability matrix
Intermediate/identity
vector (i-vector)
99 / 274
![Page 100: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/100.jpg)
Training total variability space
Rank of T is set prior to trainingT and w are latent variablesEM algorithm is usedTraining total variability matrix T is similar to training V except thattraining T is performed by using all utterances from a given speakerbut as produced by different speakersRandom initialization for TEach ot has dimension D. Number of Gaussian components is M.Dimension of supervector is M · DUBM diagonal covariance matrix Σ (MD ×MD) is introduced tomodel the residual variability not captured by T
100 / 274
![Page 101: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/101.jpg)
Sufficient statistics
0th order statistics Nc(u) =∑
t γc(ot) of an utterance u1th order statistics Fc(u) =
∑t γc(ot)ot
2nd order statistics Sc(u) = diag(∑
t γc(ot)oto>t)
where
γc(ot) = p(c|ot ,θubm) = πcp(ot |mc ,Σc)∑Mj=1 πip(ot |mj ,Σj)
Centralized 1th and 2nd order statistics
Fc(u) =T∑
t=1γc(ot)(ot −mc)
Sc(u) = diag( T∑
t=1γc(ot)(ot −mc)(ot −mc)>
)
where mc is the subvector corresponding to mixture component c101 / 274
![Page 102: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/102.jpg)
EM algorithm
Sufficient statistics
N(u) =
N1(u) · ID×D 0 · · · 0
0 N2(u) · ID×D 0...
... 0. . . 0
0 · · · 0 NM (u) · ID×D
F (u) =
F1(u)F2(u)
...FM (u)
EM algorithm [Kenny et al., 2005]– Initialize m, Σ and T– E-step: for each utterance u, calculate the parameters of the posterior
distribution of w(u) using the current estimates of m,Σ,T– M-step: update T and Σ by solving a set of linear equations in which
w(u)’s play the role of explanatory variables– Iterate until data likelihood given the estimated parameters converges
102 / 274
![Page 103: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/103.jpg)
E-step: posterior distribution of w(u)
For each utterance u, we calculate the matrix
L(u) = I + T>Σ−1N(u)T
Posterior distribution of w(u) conditioned on the acousticobservations of an utterance u is Gaussian with mean
E[w(u)] = L−1(u)T>Σ−1F (u)
and covariance matrix
Cov(w(u),w(u)) = L−1(u)
Variational Bayesian JFA was developed for speaker verification[Zhao and Dong, 2012]
103 / 274
![Page 104: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/104.jpg)
Linear discriminant analysis
I-vectors from JFA model are used in linear discriminant analysis(LDA)
u = m + Tw
W = Aw
Factor analysis
Linear discriminant analysis
Both methods used to reduce the dimensionality of speaker modelA is chosen such that within-speaker variability Sw is minimized andbetween-speaker variability Sb is maximized within the spaceA is found by eigenvalue method via maximizing
J (A) = TrS−1w Sb
104 / 274
![Page 105: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/105.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models3.1. GMM-UBM system3.2. Joint factor analysis3.3. Probabilistic linear discriminant analysis3.4. Support vector machine3.5. Restricted Boltzmann machine
4 Deep Learning
5 Case Studies
6 Future Direction 105 / 274
![Page 106: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/106.jpg)
Factor analysis
Assuming a factor analysis model of the i-vectors of the form
w = u + Fh + ε
w is the i-vector, u is the mean of i-vectors, and h ∼ N (0, I) is thelatent factorsFirst compute the maximum likelihood estimate of the factor loadingmatrix F, also known as the eigenvoice subspaceFull covariance of residual noise ε explains the variability not capturedthrough the latent variables
PLDAUnder Gaussian assumption, this model is known in face recognition asPLDA [Prince and Elder, 2007]
106 / 274
![Page 107: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/107.jpg)
Gaussian PLDA
Assume that there are low dimensional, normally distributed hiddenvariables x1 and x2r such that
Dr = m + U1x1︸ ︷︷ ︸S
+ U2x2r + εr︸ ︷︷ ︸Cr
Residual εr is normally distributed with mean 0 and precision matrix Λm is the center of acoustic space and x1 is the speaker factorsColumns of U1 are the eigenvoices
Cov(S, S) = U1U>1
x2r varies from one recording to another (channel factors)Columns of U2 are the eigenchannels
Cov(Cr ,Cr ) = Λ−1 + U2U>2
107 / 274
![Page 108: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/108.jpg)
Graphical representation
x1
x2r Dr "r
m
r = 1; 2; :::; R
¤
Including x2r enables the decomposition of speaker and channelfactorsx2r can always be eliminated at recognition timeBetween-speaker covariance matrix Cov(S,S) & within-speakercovariance matrix Cov(Cr ,Cr )These matrices cannot be treated as full rank
108 / 274
![Page 109: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/109.jpg)
PLDA speaker recognition
Given two i-vectors D1 and D2, we would like to perform thehypothesis test
H1: the speakers are the sameH0: the speakers are different
Likelihood ratio is calculated by
p(D1,D2|H1)p(D1|H0)p(D2|H0)
Likelihood ratio for any type of speaker recognition or speakerclustering problemThe evidence integral should be calculated∫
p(D, z)dz
109 / 274
![Page 110: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/110.jpg)
Model assumption
Assume that− we have succeeded in estimating the model parametersθ = m,U1,U2,Λ
− given a collection D = (D1, . . . ,DR ) of i-vectors associated with aspeaker, we have figured out how to evaluate the marginal likelihood orthe evidence
p(D) =∫
p(D, z)dz =∫
p(D|z)p(z)dz
− z = x1, x2r is the hidden variables associated with the speaker
We show how to do speaker recognition in this situation and howboth problems are tackled by using variational Bayes to approximatethe posterior distribution p(z|D)
110 / 274
![Page 111: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/111.jpg)
Variational approximation
Evidence p(D) can be evaluated exactly in the Gaussian case but thisinvolves inverting the large sparse block matricesIf q(z) is any distribution on z, variational lower bound is yielded as
L , Eq
[log p(D, z)
q(z)
]where log p(D) ≥ L with equality iff q(z) = p(z|D)Variational Bayes provides a principled way to find a goodapproximation q(z) to p(z|D)Model parameters θ = m,U1,U2,Λ are estimated by maximizingthe evidence lower bound (ELBO) L which is calculated over all ofthe speakers in a training set
111 / 274
![Page 112: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/112.jpg)
Bayesian speaker recognition
Full Bayesian avoids the point estimates of model parametersCoupling of multiple latent variables is tackled in VB[Villalba and Lleida, 2014]Uncertainties are compensated for model regularization in speakerrecognitionPrior densities p(U1) and p(U2) can be flexibly incorporatedSelection for the number of speaker factors or channel factorsManually tuning for unknown variables is avoidedAnalogous to the the treatment of the number of mixturecomponents in Bayesian estimation of GMMBayesian mixture of PLDA [Mak et al., 2016] was recently developed
112 / 274
![Page 113: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/113.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models3.1. GMM-UBM system3.2. Joint factor analysis3.3. Probabilistic linear discriminant analysis3.4. Support vector machine3.5. Restricted Boltzmann machine
4 Deep Learning
5 Case Studies
6 Future Direction 113 / 274
![Page 114: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/114.jpg)
What is a good decision boundary?
Consider a two-class, linearlyseparable classification problemMany decision boundaries!− perceptron algorithm can be
used to find such a boundary− different algorithms have been
proposedAre all decision boundariesequally good? Class 1
Class 2
114 / 274
![Page 115: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/115.jpg)
Large-margin decision boundary
Decision boundary should be as far away from the data of bothclasses as possible [Vapnik, 2013]− we should maximize the margin m− distance between the origin and the line w>x = k is k
‖w‖
Class 2
Class 1
w
wTx + b = ¡1
wTx + b = 1
wTx + b = 0
m
m = 2kwk
115 / 274
![Page 116: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/116.jpg)
Finding the decision boundary
x1, . . . , xn is the data set and yi ∈ 1,−1 is the class label of xi
Decision boundary should classify all points correctly
yi (w>xi + b) ≥ 1, for i = 1, . . . , n
Decision boundary can be found by solving the constrainedoptimization problem
Minimize 12‖w‖
2
subject to yi (w>xi + b) ≥ 1, for i = 1, . . . , n
116 / 274
![Page 117: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/117.jpg)
Constrained optimization
Original problem
Minimize 12‖w‖
2
subject to yi (w>xi + b) ≥ 1, for i = 1, . . . , n
We introduce the Lagrange multipliers αi ≥ 0 to form the Lagrangianfunction
L = 12‖w‖
2 −n∑
i=1αi(yi (w>xi + b)− 1
)Setting the gradient of L w.r.t w and b to zero, we have
w =n∑
i=1αiyi xi
n∑i=1
αiyi = 0
117 / 274
![Page 118: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/118.jpg)
Dual problem
New objective function is expressed in terms of αi onlyIf we know w, we know all αi . If we know all αi , we know wOriginal problem is known as the primal problemA quadratic programming objective is formed by
maximize L(α) =n∑iαi −
12
n∑i=1
n∑j=1
αiαjyiyjx>i xj
subject to αi ≥ 0,n∑
i=1αiyi = 0
Global maximum of αi can be found
118 / 274
![Page 119: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/119.jpg)
Characteristics of the solution
Many of αi are zero− w is a linear combination of a small number of data points− This sparse representation can be viewed as data compression as in the
construction of KNN classifierxi with non-zero αi are called support vectors− decision boundary is determined only by support vectors− tj , j = 1, . . . , s are the indices of s support vectors
w =s∑
j=1αtj ytj xtj
For testing with a new data z− compute f = w>z + b =
∑sj=1 αtj ytj x>tj
z + b and classify z as class 1 ifthe sum is positive, and class 2 otherwise
119 / 274
![Page 120: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/120.jpg)
Non-separable problem
We allow error ξi in classification. It is based on the output of thediscriminant function w>x + bξi approximates the number of misclassified samples
Class 2
Class 1
w
wTx + b = ¡1
wTx + b = 1
wTx + b = 0
»i
»j
xj
xi
120 / 274
![Page 121: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/121.jpg)
Soft margin hyperplane
If we minimize∑
i ξi , ξi can be computed byw>xi + b ≥ 1− ξi yi = 1w>xi + b ≤ −1 + ξi yi = −1ξi ≥ 0 ∀i
− ξi are slack variables in optimization− ξi = 0 if there is no error for xi− ξi is an upper bound of the number of errors
We want to minimize 12‖w‖
2 + C∑n
i=1 ξi− C is a tradeoff parameter between error and margin
Optimization problem becomes
Minimize 12‖w‖
2 + Cn∑
i=1ξi
subject to yi (w>xi + b) ≥ 1− ξi , ξi ≥ 0
121 / 274
![Page 122: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/122.jpg)
Dual problem
Dual of this new constrained optimization problem is
maximize L(α) =n∑iαi −
12
n∑i=1
n∑j=1
αiαjyiyjx>i xj
subject to C ≥ αi ≥ 0,n∑
i=1αiyi = 0
w is recovered as w =∑s
j=1 αtj ytj xtj
This is very similar to the optimization problem in the linear separablecase, except that there is an upper bound C on αi nowOnce again, a quadratic programming solver can be used to find αi
122 / 274
![Page 123: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/123.jpg)
Non-linear decision boundary
So far, we have only considered large-margin classifier with a lineardecision boundaryHow to generalize it to become non-linear?Key idea: transform xi to a higher dimensional space to make lifeeasier− input space: the points xi are located− feature space: the space of φ(xi ) after transformation
Why transform?− linear operation in the feature space is equivalent to non-linear
operation in input space− classification can become easier with a proper transformation.− In the XOR problem, for example, adding a new feature make the
problem linearly separable
123 / 274
![Page 124: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/124.jpg)
Dimensionality in feature space
Á(²)Á( )
Á( )
Á( )
Á( )
Á( )Á( )
Á( )
Á( )
Á( )Á( )
Á( )
Á( )
Á( )
Á( )
Input space Feature space
Computation in the feature space can be costly because it is highlydimensional− feature space is typically infinite-dimensional!
Kernel trick comes to rescue
124 / 274
![Page 125: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/125.jpg)
Kernel trick
maximize L(α) =n∑iαi −
12
n∑i=1
n∑j=1
αiαjyiyjx>i xj
subject to C ≥ αi ≥ 0,n∑
i=1αiyi = 0
x>i xj is inner productAs long as we can calculate the inner product in the feature space, wedo not need the mapping explicitlyMany common geometric operations, e.g. angle, distance, can beexpressed by inner productsDefine the kernel function K by
K (xi , xj) = φ(xi )>φ(xj)
125 / 274
![Page 126: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/126.jpg)
Kernel function in SVM
Change all inner products to kernel functionsFor training− original
maximum L(α) =n∑iαi −
12
n∑i=1
n∑j=1
αiαjyiyjx>i xj
subject to C ≥ αi ≥ 0,n∑
i=1αiyi = 0
− with kernel function
maximum L(α) =n∑iαi −
12
n∑i=1
n∑j=1
αiαjyiyjK (xi , xj )
subject to C ≥ αi ≥ 0,n∑
i=1αiyi = 0
126 / 274
![Page 127: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/127.jpg)
Kernel function in SVM
For testing, the new data z is classified as class 1 if f ≥ 0 and as class2 if f < 0− original
w =s∑
j=1αtj ytj xtj
f = w>z + b =s∑
j=1αtj ytj x>tj
z + b
− with kernel function
w =s∑
j=1αtj ytjφ(xtj )
f = 〈w, φ(z)〉+ b =s∑
j=1αtj ytj K (xtj , z) + b
127 / 274
![Page 128: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/128.jpg)
Similarity measure
Since the training of SVM only requires the value of K (xi , xj), thereis no restriction of the form of xi and xj− xi can be a sequence or a tree instead of a feature vector
K (xi , xj) is just a similarity measure comparing xi and xj
Kernel function needs to satisfy the Mercer function, i.e., the functionis positive-definiteFor a test object z, the discriminant function essentially is a weightedsum of the similarity between z and a pre-selected set of objects, alsocalled the support vectors
f (z) =∑xi∈S
αiyiK (z, xi ) + b
where S denotes the set of support vectors
128 / 274
![Page 129: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/129.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models3.1. GMM-UBM system3.2. Joint factor analysis3.3. Probabilistic linear discriminant analysis3.4. Support vector machine3.5. Restricted Boltzmann machine
4 Deep Learning
5 Case Studies
6 Future Direction 129 / 274
![Page 130: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/130.jpg)
Restricted Boltzmann machine
Bipartite the undirected graphical model with visible variable v andhidden variable hBuilding-block for deep belief networks and deep Boltzmann machinesRBMs are generative models of v based on the marginal distributionJoint distribution of (v, h) is an exponential familyDiscriminative fine-tuning can be appliedVariables are typically binary, however no such restriction existsBidirectional graphical model
130 / 274
![Page 131: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/131.jpg)
Graphical model
h
v
W
No connection between nodes of the same layer (i.e. sparsity)Allow fast training (blocked-Gibbs sampling)Correlations between nodes in v are still present in the marginalp(v|W)Hidden variable h captures the higher level information
131 / 274
![Page 132: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/132.jpg)
Joint distribution
Energy-based distribution is defined by
p(v,h|θ) = Z (θ)−1p∗(v,h|θ)
where
p∗(v,h|θ) = exp(∑
ivibi +
∑j
hjaj +∑i ,j
vihjWij︸ ︷︷ ︸−E(v,h)
)
and θ = W ,b, ab, a denote the biases and are usually assumed to be zero forcompact notationZ (θ) =
∑v,h p∗(v,h|θ) is the partition function
132 / 274
![Page 133: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/133.jpg)
Individual distribution
Consider binary (v,h) with zero biases b, ap(h|v) =
∏j p(hj |v) where p(hj |v) = 1
1+exp(−∑
i vi Wij )
p(v|h) =∏
i p(vi |h) where p(vi |h) = 11+exp(−
∑j hj Wij )
Product form is due to the restricted structure
h
v
W
133 / 274
![Page 134: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/134.jpg)
Learning with RBM
Maximize the likelihood of θ given vn
p(vn) =∏
nZ−1∑
hexp
∑ij
vni hn
j Wij
We obtain ∂logp(vn)
∂Wij= Epdata [vihj ]− Epmodel [vihj ]
where pdata(vn,hn) = p(hn|vn)p(vn)p(hn|vn) is an easy and exact calculation for RBMp(vn) is an empirical distribution∂logZ(θ)∂Wij
= Epmodel [vihj ] is hard to computeLearning using contrastive divergence with mini-batches is performed
134 / 274
![Page 135: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/135.jpg)
135 / 274
![Page 136: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/136.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning4.1. Deep neural network4.2. Deep belief network4.3. Stacking auto-encoder4.4. Variational auto-encoder4.5. Deep transfer learning
5 Case Studies
6 Future Direction 136 / 274
![Page 137: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/137.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning4.1. Deep neural network4.2. Deep belief network4.3. Stacking auto-encoder4.4. Variational auto-encoder4.5. Deep transfer learning
5 Case Studies
6 Future Direction 137 / 274
![Page 138: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/138.jpg)
Neural network
Inputs
Hidden units
xtD
xt1
xt0
zt0
zt1
yt1
ytKztM
w(1)
DM
w(2)
MK
Outputs
Multilayer perceptron
138 / 274
![Page 139: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/139.jpg)
Nonlinear activation
ytk = yk(xt ,w) = f
M∑j=0
w (2)jk f
( D∑i=0
w (1)ij xti
)
−6 −4 −2 0 2 4 6
−1
−0.5
0
0.5
1
f(x)
x
ReLUSigmoidTanh
activation function f (·)
139 / 274
![Page 140: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/140.jpg)
Deep learning
Deep belief networks (DBN) obtained great results due to goodinitialization and deep model structure− pre-train each layer from bottom up− each pair of layers is a restricted Boltzmann machine− jointly fine tune all layers using back-propagation
Deep neural network (DNN)− discriminative model works for classification tasks− empirically works well for image recognition, speech recognition,
information retrieval and many others− no theoretical guarantee
140 / 274
![Page 141: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/141.jpg)
DBN-DNN training
o
z1
z3
GRBM
RBM
RBM
z1
z2
o
y
z1
w1
w2
w3
w4
z3
z2
141 / 274
![Page 142: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/142.jpg)
Why go deep?
Deep architecture can be representationally efficient− fewer computational units for the same function
Deep representation might allow for a hierarchical representation− allows non-local generalization− comprehensibility
Multiple levels of latent variables allow combinatorial sharing ofstatistical strengthDeep architecture works well for representation of vision, audio, NLP,music and many other technical data
142 / 274
![Page 143: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/143.jpg)
Different level of abstraction
Hierarchical learning− natural progression from low
level to high level structure asseen in natural complexity
− easier to monitor what isbeing learnt and to guide themachine to better subspaces
− a good lower levelrepresentation can be used indifferent tasks
143 / 274
![Page 144: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/144.jpg)
Trainable feature hierarchy
Hierarchy of representations with increasing level of abstractionEach stage is a kind of trainable feature transformImage− Pixel → edge → texton → motif → part → object
Text− Character → word → word group → clause → sentence → story
Speech− Sample → spectral band → sound → . . .→ phone → phoneme →
word
144 / 274
![Page 145: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/145.jpg)
Deep architecture
Feed-forward: multilayer neural nets, convolutional nets
Feed-back: stacked sparse coding, deconvolutional nets
Bi-directional: deep Boltzmann machines, stacked auto-encoders
145 / 274
![Page 146: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/146.jpg)
Training strategy
Purely supervised− initialize parameters randomly− train in supervised mode
− typically with SGD, using backprop to compute gradients− used in most practical systems for speech and image recognition
Unsupervised, layerwise + supervised classifier on top− train each layer unsupervised, one after the other− train a supervised classifier on top, keeping the other layers fixed− good when very few labeled samples are available
Unsupervised, layerwise + global supervised fine-tuning− train each layer unsupervised, one after the other− add a classifier layer, and retrain the whole thing supervised− good when label set is poor
Unsupervised pre-training often uses the regularized auto-encoders
146 / 274
![Page 147: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/147.jpg)
Generalizable learning
Shared representation− multi-task learning− unsupervised training
Raw input
Shared
intermediate
representation
Task 1
output
Task 2
outputTask 3
output
Partial feature sharing− mixed mode learning− composition of functions
…
…
…
…
Low-level features
…High-level features
Task 1
Ouptut
Task N
Ouptut y1 yN
147 / 274
![Page 148: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/148.jpg)
Forward & backward passes
Forward propagation− sum inputs, produce activation, feed-forward
Inputs
x2
x1
z1
z2
z3
y2
y1
Outputs
Training: back propagation of error− calculate total error at the top− calculate contributions to error at each step going backwards
Inputs
x2
x1
z1
z2
z3
y2
y1
Outputs
148 / 274
![Page 149: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/149.jpg)
Deep neural network
Simple to construct− sigmoid nonlinearity for hidden layers− softmax for the output layer
Backpropagation does not work well ifrandomly initialized[Bengio et al., 2007]− deep networks trained without
unsupervised pretraining performworse than shallow networks
( Bengio et al., NIPS 2007)
149 / 274
![Page 150: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/150.jpg)
Problems and solvers with back propagation
Gradient is progressively getting more dilute− below top few layers, correction signal is minimal
Gets stuck in local minima− random initialization: may start out far from good regions
In usual settings, we can use only labeled data− almost all data are unlabeled− the brain can learn from unlabeled data
Use unsupervised learning via greedy layer-wise training− allow abstraction to develop naturally from one layer to another− help the network initialize with good parameters
Perform supervised top-down training as final step− refine the features in intermediate layers more relevant for the task
150 / 274
![Page 151: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/151.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning4.1. Deep neural network4.2. Deep belief network4.3. Stacking auto-encoder4.4. Variational auto-encoder4.5. Deep transfer learning
5 Case Studies
6 Future Direction 151 / 274
![Page 152: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/152.jpg)
Deep belief network
Deep belief network (DBN) is a probabilistic generative modelDeep architecture with multiple hidden layersUnsupervised pre-learning provides a good initialization− maximizing the lower-bound of the log-likelihood of data
Supervised fine-tuning− generative: up-down algorithm− discriminative: back propagation
152 / 274
![Page 153: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/153.jpg)
Model structure
Hidden Layers
Visible Layers
h3
h1
h2
v
RBM
Directed
belief nets
p(v,h1,h2, . . . ,hl ) = p(v|h1)p(h1|h2) . . . p(hl−2|hl−1)p(hl−1|hl )
153 / 274
![Page 154: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/154.jpg)
Greedy training
First step:− construct an RBM with an
input layer v and a hiddenlayer h
− train the RBM
h
v
W1
154 / 274
![Page 155: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/155.jpg)
Greedy training
Second step:− Stack another hidden layer on
top of the RBM to form anew RBM
− Fix W1, sample h1 fromq(h1|v) as input. Train W2 asRBM
h1
h2
v
W1
W2
q(h1jv)
155 / 274
![Page 156: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/156.jpg)
Greedy training
Third step:− continue to stack layers on
top of the network, train it asprevious step, with samplesampled from q(h2|h1)
And so on...
h3
h1
h2
v
W1
W2
W3
q(h2jh1)
q(h1jv)
156 / 274
![Page 157: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/157.jpg)
Deep Boltzmann machine
p(v) =∑
h1,h2,h3
1Z exp[vT W1h + (h1)>W2h2 + (h2)>W3h3]
W3
W2
W1
v
h1
h2
h3
[Salakhutdinov and Hinton, 2009]
Undirected connections between alllayers. No connections between thenodes in the same layerHigh-level representations are built fromunlabeled inputs. Labeled data is usedto only slightly fine-tune the model
157 / 274
![Page 158: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/158.jpg)
Training procedure
Pre-training− initialize from stacked RBMs
Generative fine-tuning− positive phase: variational or
mean-field approximation− negative phase: persistent
chain & stochasticapproximation
Discriminative fine-tuning− back-propagation
W3
W2
W1
v
h1
h2
h3
158 / 274
![Page 159: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/159.jpg)
Why greedy layer wise training works
Regularization hypothesis− pre-training is constraining the parameters in a region relevant to
unsupervised dataset− better generalization - representations that better describe unlabeled
data are more discriminative for labeled dataOptimization hypothesis− unsupervised training initializes lower level parameters near localities of
better minima than random initialization can
159 / 274
![Page 160: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/160.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning4.1. Deep neural network4.2. Deep belief network4.3. Stacking auto-encoder4.4. Variational auto-encoder4.5. Deep transfer learning
5 Case Studies
6 Future Direction 160 / 274
![Page 161: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/161.jpg)
Denoising auto-encoder
Corrupted input
Hidden code
(representation)
Raw input Reconstruction
KL (reconstruction | raw input)
X X
[Vincent et al., 2008]
Corrupt the input, e.g. set 25% of inputs to 0Reconstruct the uncorrupted inputUse the uncorrupted encoding as input to next level
161 / 274
![Page 162: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/162.jpg)
Manifold learning perspective
Learn a vector field towards higher probability regionsMinimize the variational lower bound on a generative modelCorrespond to the regularized score matching on an RBM
162 / 274
![Page 163: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/163.jpg)
Stacked denoising auto-encoders
h1
x
W1 W>1
h1
x
W1 W>1
x
h2
h1
W2W>
2
h1 h1
W1 W>1
W>2
W2
x x
h2
U
y
163 / 274
![Page 164: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/164.jpg)
Greedy layer-wise learning
Start with the lowest level and stack upwardsTrain each layer of auto-encoder using the intermediate codes orfeatures from the layer belowTop layer can have a different output, e.g. softmax non-linearity, toprovide an output for classification
h1
x
W1 U1
h1
x
W1 U1
y
h2
y
W2 U2
h1
W1 U1
W2
x y
h2
U2
y
y
164 / 274
![Page 165: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/165.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning4.1. Deep neural network4.2. Deep belief network4.3. Stacking auto-encoder4.4. Variational auto-encoder4.5. Deep transfer learning
5 Case Studies
6 Future Direction 165 / 274
![Page 166: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/166.jpg)
Auto-encoder
::::::
::::::
::::::
xx xx
zz
Encoder Decoder
ÁÁ μμ
166 / 274
![Page 167: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/167.jpg)
Variational auto-structure
::::::
::::::
::::::
Encoder Decoder::::::
qÁ(zjx)qÁ(zjx)
pμ(xjz)pμ(xjz)xx
zz
xx
Sampling
ÁÁμμ
167 / 274
![Page 168: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/168.jpg)
Graphical model
zz
x
μμÁÁRecognition
modelGenerative
model
qÁ(zjx)qÁ(zjx) pμ(xjz)pμ(xjz)
[Kingma and Welling, 2014]Mean-field approach requires analytical solutions Eq, which areintractable in the case of neural network
Use neural network and sample the latent variables z from variationalposterior
168 / 274
![Page 169: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/169.jpg)
Variational inference
Variational Bayesian inference aims to find a variational distributionq(z|x) that is maximally close to the original true posteriordistribution p(z|x)
According to the evidence decomposition, we have
p(x) = L(q) + KL(q‖p)
L(q) = Eq[log p(x, z)] + Hq[z]
KL(q‖p) = −Eq[log p(z|x)]−Hq[z]
169 / 274
![Page 170: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/170.jpg)
Mean field variational inference
Assume that q(z|x) can be factorized into the product of individualprobability distributions
q(z|x) =N∏
n=1q(zn|xn)
We can perform the coordinate ascent for each factorized variationaldistributions by
q(zj |xj) ∝ exp(Eq(zi 6=j )[log p(x, z)])
170 / 274
![Page 171: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/171.jpg)
Variational lower bound
Model parameters are learned by maximizing the variational lowerbound
log p(x) ≥ Eqφ(z|x)[log pθ(x|z)]− KL(qφ(z|x)||pω(z))= Eqφ(z|x)[log pθ(x, z)− log qφ(z|x)], Eqφ(z|x)[fΘ(x, z)], LΘ
where Θ = θ, φ, ω
171 / 274
![Page 172: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/172.jpg)
Stochastic backpropagation
L£ = EqÁ(zjx)[f£(x; z)]L£ = EqÁ(zjx)[f£(x; z)]
sample z(l) from qÁ(zjx)sample z(l) from qÁ(zjx)
L£ ' f£(xjz(l))L£ ' f£(xjz(l))
r£L£ ' r£f£(x; z(l))r£L£ ' r£f£(x; z(l))
Objective:
Gradient:
Step1
Step2
Step3
Problem: high variance by directly sampling z [Rezende et al., 2014]
172 / 274
![Page 173: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/173.jpg)
Stochastic gradient variational Bayes
L£ = EqÁ(zjx)[f£(x; z)]L£ = EqÁ(zjx)[f£(x; z)]
L£ ' f£(xjz(l))L£ ' f£(xjz(l))
r£L£ ' r£f£(x; z(l))r£L£ ' r£f£(x; z(l))
Objective:Gradient:
Step1
Step2
Step3
z(l) = ¹z + ¾z¯ ²(l)z(l) = ¹z + ¾z¯ ²(l)
sample ²(l) from N (0; I)sample ²(l) from N (0; I)
Step4
Reduce the variance caused by directly sampling z
173 / 274
![Page 174: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/174.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning4.1. Deep neural network4.2. Deep belief network4.3. Stacking auto-encoder4.4. Variational auto-encoder4.5. Deep transfer learning
5 Case Studies
6 Future Direction 174 / 274
![Page 175: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/175.jpg)
Why transfer learning?
Mismatch between training and test data in speaker recognitionalways existsTraditional machine learning works well under an assumption thattraining and test data follow the same distribution− real-world data may not follow this assumption
Feature-based domain adaptation is a common approach− allow knowledge to be transferred across domains through learning a
good feature representationCo-train for feature representation and speaker recognition withoutlabeling in target domain
175 / 274
![Page 176: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/176.jpg)
Transfer learning
Let D = X , p(X ) denote a domain− feature space X− marginal probability distribution p(X )− X = x1, · · · , xT ⊂ X
Let T = Y, f (·) denote a task− label space Y− objective predictive function f (·)
can be written as p(Y |X )
Assumptions in transfer learning− source and target domains are different DS 6= DT− source and target tasks are different TS 6= TT
176 / 274
![Page 177: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/177.jpg)
Multi-task learning
minθ`(D, θ) + λΩ(θ)
Joint
Learning
Task
apple, not apple pear, not pear
Input
Target
Auxiliary TaskMain Task
177 / 274
![Page 178: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/178.jpg)
Multi-task neural network learning
minθ`(D, θ) + λΩ(θ)
z1 zJz2 z3
Ym Y1 YK
Main task Auxiliary tasks
Output
Input
Shared feature
representation
178 / 274
![Page 179: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/179.jpg)
Learning strategy and task
Classifier
Feature
Extractor
Unlabeled Data
in Target Domain
Unlabeled Data
in Target Domain
Auxiliary TaskMain Task Main Task
Training Test
Labeled Data
in Source Domain
z0 zM
bx1 bxD
x0 xD
by1 byV
fewkjg
fwjig
fwkjg
Distribution
Matching
Semi-supervised learning is conducted under multiple objectives
179 / 274
![Page 180: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/180.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies5.1. Heavy-Tailed PLDA5.2. SNR-Invariant PLDA5.3. Mixture of PLDA5.4. DNN I-vectors5.5. PLDA with RBM
6 Future Direction 180 / 274
![Page 181: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/181.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies5.1. Heavy-Tailed PLDA5.2. SNR-Invariant PLDA5.3. Mixture of PLDA5.4. DNN I-vectors5.5. PLDA with RBM
6 Future Direction 181 / 274
![Page 182: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/182.jpg)
Motivation
Motivation of i-vectors:Insufficiency of joint factor analysis (JFA) in distinguishing betweenspeaker and channel information, as channel factors were shown tocontain speaker information.Better to use a two-step approach: (1) use low-dimensional vectors(called i-vectors) that comprise both speaker and channel informationto represent utterances; and (2) model the channel and variabilities ofthe i-vectors during scoring.
Motivation of Heavy-tailed PLDA:JFA assumes that the speaker and channel components follow Gaussiandistributions.The Gaussian assumption prohibits large deviations from the mean.But speaker effects (e.g., non-native speakers) and channel effect(gross channel distortion) could cause large deviations.Use heavy-tailed distributions instead of Gaussians for modeling thespeaker and channel components in i-vectors [Kenny, 2010].
182 / 274
![Page 183: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/183.jpg)
Generative model with heavy-tailed priors
Assuming that we have Hi i-vectors Xi = xij , j = 1, . . . ,Hi fromspeaker i , the generative model is
xij = m + Vhi + Grij + εij
where V and G represent the the speaker and channel subspaces,respectively.In heavy-tailed PLDA,
hi ∼ N (0, u−11 I) u1 ∼ G(n1/2, n1/2)
rij ∼ N (0, u−12j I) u2j ∼ G(n2/2, n2/2)
εij ∼ N (0, (vjΛ)−1) vj ∼ G(νj/2, νj/2)
By integrating out the hyperparameters (u1, u2j , and vj), one canshow [Eq. 2.161 of Bishop (2006)] that the priors of hi , rij , and εijfollow Student’s t. So, xij also follows Student’s t.
183 / 274
![Page 184: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/184.jpg)
Performance on NIST 2008 SRE
Telephone speech, without score normalization
Microphone speech, with score normalization
[Kenny, 2010]
184 / 274
![Page 185: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/185.jpg)
From HT-PLDA to Gaussian PLDA
In 2011, [Garcia-Romero and Espy-Wilson, 2011] discovered thatGaussian PLDA performs as good as HT-PLDA provided thati-vectors have been subjected to the following pre-processing steps:
Whitening + Length normalization
These steps have the effect of making the i-vectors more Gaussian.As Gaussian PLDA is computationally much simpler than HT-PLDA,the former has been extensively used in speaker verification.
185 / 274
![Page 186: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/186.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies5.1. Heavy-Tailed PLDA5.2. SNR-Invariant PLDA5.3. Mixture of PLDA5.4. DNN I-vectors5.5. PLDA with RBM
6 Future Direction 186 / 274
![Page 187: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/187.jpg)
Motivation
While i-vector extraction followed by PLDA is very effective inaddressing channel variabilityPerformance degrades rapidly in the presence of background noisewith a wide range of signal-to-noise ratios (SNR)Classical approach: Multi-condition training where i-vectors fromvarious background noise level are pooled to train a PLDA model.
187 / 274
![Page 188: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/188.jpg)
Motivation
We argue that the variation caused by SNR variability can bemodeled by an SNR subspace and utterances falling within a narrowSNR range should share the same set of SNR factors.SNR-specific information were separated from speaker-specificinformation through marginalizing out the SNR factors during scoring
188 / 274
![Page 189: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/189.jpg)
Motivation
I-vectors derived from utterances of similar SNR will be mapped to asmall region in the SNR subspace.
189 / 274
![Page 190: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/190.jpg)
SNR-Invariant PLDA
Classical PLDA: xij = m + Vhi + εij
By adding an SNR factor to the conventional PLDA, we haveSNR-invariant PLDA [Li and Mak, 2015]:
xkij = m + Vhi + Uwk + εk
ij , k = 1, . . . ,K
where U denotes the SNR subspace, wk is an SNR factor, and hi isthe speaker (identity) factor for speaker i .Note that it is not the same as PLDA with channel subspace:
xkij = m + Vhi + Grij + εij ,
where G defines the channel subspace and rij represents the channelfactors.
190 / 274
![Page 191: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/191.jpg)
SNR-invariant PLDA
Generative model:
xkij = m + Vhi + Uwk + εk
ij , k=1,. . . ,K
hi is speaker factors with prior distribution N (0, I)xk
ij is the j-th i-vector from speaker i in the k-th SNR subgroupV is the eigenvoice matrixU defines the SNR subspacewk is SNR factor with prior distribution N (0, I)εk
ij is a residual term with prior distribution N (0,Σ); Σ is a fullcovariance matrix aiming to model the channel variability
191 / 274
![Page 192: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/192.jpg)
SNR-invariant PLDA
Training utterances are divided into K groups, accroding to their SNR
192 / 274
![Page 193: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/193.jpg)
PLDA vs. SNR-invariant PLDA
Comparing the use of training i-vectors with conventional PLDA
193 / 274
![Page 194: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/194.jpg)
PLDA vs. SNR-invariant PLDA
Comparing generative models:
PLDA SNR-Invariant PLDA
xij = m + Vhi + εij xkij = m + Vhi + Uwk + εk
ij
x ∼ N (x|m,VVT + Σ) x ∼ N (x|m,VVT + UUT + Σ)
θ = m,V,Σ θ = m,V,U,Σ
194 / 274
![Page 195: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/195.jpg)
Auxiliary function for SNR-invariant PLDA
The parameters θ = m,V,U,Σ can be learned from a training setX using maximum likelihood estimation.X = xk
ij ; i = 1, . . . ,S; j = 1, . . . ,Hi (k); k = 1, . . . ,KS: No. of training speakersK : No. of SNR groupsHi (k): No. of utterances from speaker i in the k-th SNR group.
Given an initial value θ, we aim to find a new estimate θ thatmaximizes the auxiliary function:
Q(θ|θ) = Eh,w[∑
ikjln(p(xk
ij |hi ,wk , θ)p(hi ,wk))∣∣∣X ,θ]
= Eh,w[∑
ikj
(lnN (xk
ij |m + Vhi + Uwk ,Σ)
+ ln p(hi ,wk))∣∣∣X ,θ]
195 / 274
![Page 196: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/196.jpg)
Posterior distributions of latent variables
We show 3 ways to compute the posteriors:1 Computing the posterior of hi and wk separately.2 Computing the posterior hi while fixing wk using the Gauss-Seidel
method.3 Computing the joint posterior of hi and wk using variational Bayes.
196 / 274
![Page 197: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/197.jpg)
Method 1: Computing posteriors separately
Given i-vectors xkij , the posterior density of hi has the form:
p(hi |xkij ,θ) ∝ p(xk
ij |hi ,θ)p(hi )
=∫
p(xkij ,wk |hi ,θ)p(hi )dwk
=∫
p(xkij |hi ,wk ,θ)p(wk)p(hi )dwk
=∫N (xk
ij |m + Vhi + Uwk ,Σ)N (wk |0, I)N (hi |0, I)dwk
= N (xkij |m + Vhi ,Φ)N (hi |0, I)
∝ exp
hTi VTΦ−1(xk
ij −m)− 12hT
i (I + VTΦ−1V)hi
where Φ = UUT + Σ.
197 / 274
![Page 198: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/198.jpg)
Method 1: Computing posteriors separatelyIf all of the i-vectors of speaker i , say Xi , are given,
p(hi |xkij ∀j and k,θ) ∝
K∏k=1
Hi (k)∏j=1
p(xkij |hi ,θ)p(hi )
∝ exp
hTi VTΦ−1
K∑k=1
Hi (k)∑j=1
(xkij −m)− 1
2hTi
(I +
K∑k=1
Hi (k)VTΦ−1V)
hi
This is a Gaussian with mean and 2nd-order (uncentralized) moment
〈hi |Xi〉 =(
I +∑K
k=1Hi (k)VTΦ−1V
)−1VTΦ−1
∑K
k=1
∑Hi (k)
j=1(xk
ij −m)
〈hi hTi |Xi〉 =
(I +∑K
k=1Hi (k)VTΦ−1V
)−1+ 〈hi |Xi〉〈hi |Xi〉T,
(1)
N (h|µh,Ch) ∝ exp−1
2 (h− µh)TC−1h (h− µh)
∝ exp
hTC−1
h µh −12hTC−1
h h
198 / 274
![Page 199: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/199.jpg)
Method 1: Computing posteriors separately
Similarly, to compute the posterior expectations of wk , we marginalizeover hi ’s. Thus, the posterior density of wk is
p(wk |xkij ,θ) ∝
∫p(xk
ij |hi ,wk ,θ)p(hi )p(wk)dhi
=∫N (xk
ij |m + Vhi + Uwk ,Σ)N (hi |0, I)N (wk |0, I)dhi
= N (xkij |m + Uwk ,Ψ)N (wk |0, I)
∝ exp
wTk UTΨ−1(xk
ij −m)− 12wT
k (I + UTΨ−1U)wk
199 / 274
![Page 200: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/200.jpg)
Method 1: Computing posteriors separately
Given all of the i-vectors (X k) from the k-th SNR group, we cancompute the posterior expectations as follows:
〈wk |X k〉 =(
I +S∑
i=1Hi (k)UTΨ−1U
)−1
UTΨ−1S∑
i=1
Hi (k)∑j=1
(xkij −m)
〈wkwTk |X k〉 =
(I +
S∑i=1
Hi (k)UTΨ−1U)−1
+ 〈wk |X k〉〈wk |X k〉T
(2)
where Ψ = VVT + Σ
200 / 274
![Page 201: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/201.jpg)
Method 2: Computing posteriors by Gauss-Seidel method
Another approach to computing p(hi |Xi ) is to assume that wk ’s arefixed for all k = 1, . . . ,K .This is called the Gauss-Seidel method [Barrett et al., 1994]We fix wk to its posterior mean: w∗k ≡ 〈wk |X k〉The posterior density of hi becomes:
p(hi |Xi ,w∗k ,θ) ∝K∏
k=1
Hi (k)∏j=1
p(xkij |hi ,w∗k ,θ)p(hi )
=K∏
k=1
Hi (k)∏j=1N (xk
ij |m + Vhi + Uw∗k ,Σ)N (hi |0, I)
∝ exp
hTi VTΣ−1
K∑k=1
Hi (k)∑j=1
(xkij −m−Uw∗k ) −
12hT
i
(I +
K∑k=1
Hi (k)VTΣ−1V)
hi
201 / 274
![Page 202: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/202.jpg)
Method 2: Computing posteriors by Gauss-Seidel method
Comparing this posterior density with a standard Gaussian, we have
〈hi |Xi〉 =(
L(1)i
)−1VTΣ−1
K∑k=1
Hi (k)∑j=1
(xkij −m−Uw∗k)
〈hi hTi |Xi〉 =
(L(1)
i
)−1+ 〈hi |Xi〉〈hi |Xi〉T,
(3)
where L(1)i ≡ I +
∑Kk=1 Hi (k)VTΣ−1V
Note that these formulations is similar to the JFA model estimation in[Vogt and Sridharan, 2008].
202 / 274
![Page 203: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/203.jpg)
Method 2: Computing posteriors by Gauss-Seidel method
Apply the same approach to computing the posterior density of wk ,we have
〈wk |X k〉 =(
L(2)k
)−1UTΣ−1
S∑i=1
Hi (k)∑j=1
(xkij −m− Vh∗i )
〈wkwTk |X k〉 =
(L(2)
k
)−1+ 〈wk |X k〉〈wk |X k〉T
(4)
where L(2)k = I +
∑Si=1 Hi (k)UTΣ−1U and h∗i ≡ 〈hi |Xi〉
203 / 274
![Page 204: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/204.jpg)
Method 3: Computing posteriors by variational Bayes
Denote w = [w1, . . . ,wK ] and h = [h1, . . . ,hS ]In variational Bayes [Bishop, 2006, Kenny, 2010], we factorize thejoint posterior as follows:
ln p(h,w|X ) ≈ ln q(h) + ln q(w) =S∑
i=1ln q(hi ) +
K∑k=1
ln q(wk)
where
ln q(h) = Ewln p(h,w,X )+ constln q(w) = Ehln p(h,w,X )+ const
where Ew means taking expectation with respect to w.
204 / 274
![Page 205: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/205.jpg)
Method 3: Computing posteriors by variational BayesConsider ln q(h):ln q(h) = Ewln p(h, w,X )+ const= 〈ln p(X|h, w)〉w + 〈ln p(h, w)〉w + const
=∑
ijr
⟨lnN (xr
ij |m + Vhi + Uwr , Σ)⟩
wr
+∑
i〈lnN (hi |0, I)〉w +
∑r〈lnN (wr |0, I)〉w + const
= −12∑
ijr(xr
ij −m− Vhi −Uw∗r )TΣ−1(xrij −m− Vhi −Uw∗r )
− 12∑
ihT
i hi + const (5)
=∑
i
[hT
i VTΣ−1∑
jr(xr
ij −m−Uw∗r )− 12 hT
i
(I +∑
jrVTΣ−1V
)hT
i
]+ const
q(hi ) a Gaussian with mean and precision identical to Eq. 3:
〈hi |Xi〉 =(
L(1)i
)−1VTΣ−1
∑jr
(xrij −m−Uw∗r )
L(1)i = I +
∑jr
VTΣ−1V(6)
205 / 274
![Page 206: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/206.jpg)
Method 3: Computing posteriors by variational Bayes
lnq(w) = 〈ln p(X|h, w)〉h + 〈ln p(h, w)〉h + const
=∑
ijk
⟨lnN (xk
ij |m + Vhi + Uwk , Σ)⟩
hi
+∑
i〈lnN (hi |0, I)〉hi +
∑k〈lnN (wk |0, I)〉h + const
= −12∑
ijk(xk
ij −m− Vh∗i −Uwk )TΣ−1(xkij −m− Vh∗i −Uwk )
− 12∑
kwT
k wk + const
=∑
k
[wT
k UTΣ−1∑
ij(xk
ij −m− Vh∗i )− 12 wT
k
(I +∑
ijUTΣ−1U
)wT
k
]+ const
q(wk) is a Gaussian with mean and precision identical to Eq. 4:
〈wk |X k〉 =(
L(2)k
)−1UTΣ−1
∑ij
(xkij −m− Vh∗i )
L(2)k = I +
∑ij
UTΣ−1U(7)
Note : 〈lnN (hi |0, I)〉hi is the differential entropy of normal distribution and isindependent of wk , see [Norwich, 1993](Ch 8).
206 / 274
![Page 207: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/207.jpg)
Computing posterior moment
The exact posterior moment 〈wkhTi |X 〉 will be complicated because
hi and wk are correlated in the posterior.If Gauss-Seidel’s method is used, we may approximate the posteriormoments by (Kenny 2010, p.6)
〈wkhTi |X 〉 ≈ 〈wk |X k〉(h∗i )T
〈hi wTk |X 〉 ≈ 〈hi |Xi〉(w∗k)T
where h∗i and w∗k are the most up-to-date posterior means in the EMiterations.Alternatively, we may compute the exact joint posterior.3 But it willbe computationally intensive.
3http://www.eie.polyu.edu.hk/∼mwmak/papers/si-plda.pdf207 / 274
![Page 208: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/208.jpg)
Computing posterior moment
A better approach is to use variational Bayes:
p(hi ,wk |X ) ≈ q(hi )q(wk) (8)
Note that as both q(hi ) and q(wk) are Gaussians. Based on the lawof total expectation,4 the factorization in Eq. 8 gives
〈wkhTi |X 〉 ≈ 〈wk |X k〉〈hi |Xi〉T
〈hi wTk |X 〉 ≈ 〈hi |Xi〉〈wk |X k〉T
4https://en.wikipedia.org/wiki/Product distribution208 / 274
![Page 209: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/209.jpg)
Maximization Step
In the M-step, we maximize the auxiliary function:
Q(θ) = Eh,w
∑ijk
lnN(
xkij∣∣m + Vhi + Uwk ,Σ
)p(hi ,wk)
∣∣∣∣X ,θ=∑ijk
Eh,w
−1
2 log |Σ| − 12(
xkij −m− Vhi −Uwk
)TΣ−1
×(
xkij −m− Vhi −Uwk
)+ ln p(hi ,wk)
∣∣∣∣X ,θAs p(hi ,wk) is independent of the model parameters V, U, and Σ,they could be taken out of Q(θ) in the M-step[Prince and Elder, 2007].
209 / 274
![Page 210: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/210.jpg)
Maximization Step
Differentiating Q(θ) with respect to V, U, and Σ and set the resultsto 0, we obtain
V =
∑ijk
[(xk
ij −m)〈hi |Xi〉 −U〈wkhTi |X 〉
]∑
ijk〈hi hT
i |X 〉
−1
U =
∑ijk
[(xk
ij −m)〈wk |X k〉 − V〈hi wTk |X 〉
]∑
ijk〈wkwT
k |X 〉
−1
Σ = 1N∑ijk
[(xk
ij −m)(xkij −m)T
− V〈hi |Xi〉(xkij −m)T −U〈wk |X k〉(xk
ij −m)T]
210 / 274
![Page 211: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/211.jpg)
Likelihood Ratio Scores
Given target-speaker’s i-vector xs and test-speaker’s i-vector xtIf xs and xt are from the same speaker, they should share the samespeaker factor h:[
xsxt
]=[
mm
]+[
V U 0V 0 U
] hwswt
+[εsεt
]=⇒ xst = m + Azst + εst .
Same-speaker likelihood:
p(xst |same-speaker) =∫
p(xst |zst)p(zst)dzst
=∫N (xst |m + Azst , Σ)N (zst |0, I)dzst
= N (xst |m, AAT + Σ)
= N([
xsxt
] ∣∣∣∣ [mm
],
[Σtot ΣacΣac Σtot
])where Σ = diagΣ,Σ, Σtot = VVT + UUT + Σ and Σac = VVT
211 / 274
![Page 212: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/212.jpg)
Likelihood Ratio Scores
If xs and xt are from different speakers, they should have their ownspeaker factor (hs ,ht):
[xsxt
]=[
mm
]+[
V 0 U 00 V 0 U
]hshtwswt
+[εsεt
]
=⇒ xst = m + Azst + εst
Different-speaker likelihood:
p(xst |diff-speaker) =∫
p(xst |zst)p(zst)dzst
=∫N (xst |m + Azst , Σ)N (zst |0, I)dzst
= N (xst |m, AAT + Σ)
= N([
xsxt
] ∣∣∣∣[
mm
],
[Σtot 0
0 Σtot
])212 / 274
![Page 213: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/213.jpg)
Likelihood Ratio Scores
Log-likelihood ratio score (assuming i-vectors have been meansubtracted, x← x−m)
SLR(xs , xt) = log p(xs , xt |Same-speaker)p(xs , xt |Diff-speaker)
= logN([
xsxt
] ∣∣∣∣[
00
],
[Σtot ΣacΣac Σtot
])
N([
xsxt
] ∣∣∣∣[
00
],
[Σtot 0
0 Σtot
])
= 12[xT
s Qxs + 2xTs Pxt + xT
t Qxt ] + const
(9)
where
Q = Σ−1tot − (Σtot −ΣacΣ−1
totΣac)−1
P = Σ−1totΣac(Σtot −ΣacΣ−1
totΣac)−1
213 / 274
![Page 214: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/214.jpg)
Likelihood Ratio Scores
The LLR in Eq. 9 assumes that the SNR of both target-speaker’sutterance and test utterance are unknown.If both SNRs (`s , `t) are known, we may compute the score as follows:
SLR(xs , xt |`s , `t) = log p(xs , xt |Same-speaker, `s , `t)p(xs , xt |Diff-speaker, `s , `t)
214 / 274
![Page 215: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/215.jpg)
Likelihood Ratio Scores
Given i-vector x and utterance SNR `, the likelihood of x is
p(x|`) =∫
h
∫w
p(x|h,w, `)p(h,w|`)dhdw
=∫
h
∫w
p(x|h,w, `)p(h|w, `)p(w|`)dhdw
=∫
h
∫w
p(x|h,w, `)p(h)dhp(w|`)dw
where we have assumed that h is a priori independent of w and `.Note that if ` ∈ k-th SNR group, we have w = w∗k ≡ 〈wk |X k〉
p(x|` ∈ k-th SNR group) =∫
hp(x|h,w∗k)p(h)dh
=∫
hN (x|m + Vh + Uw∗k ,Σ)N (h|0, I)dh
= N (x|m + Uw∗k ,VVT + Σ)
215 / 274
![Page 216: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/216.jpg)
Likelihood Ratio Scores
SLR(xs , xt |`s , `t) = log p(xs , xt |Same-speaker, `s , `t)p(xs , xt |Diff-speaker, `s , `t)
= logN([
xsxt
] ∣∣∣∣[
m + Uw∗ksm + Uw∗kt
],
[Ψ Σac
Σac Ψ
])
N([
xsxt
] ∣∣∣∣[
m + Uw∗ksm + Uw∗kt
],
[Ψ 00 Ψ
])
= 12[xT
s Qxs + 2xTs Pxt + xT
t Qxt ] + const
(10)
wherexs = xs −m−Uw∗ks
xt = xt −m−Uw∗kt
Q = Ψ−1 − (Ψ−ΣacΨ−1Σac)−1
P = Ψ−1Σac(Ψ−ΣacΨ−1Σac)−1
Ψ = VVT + Σ; Σac = VVT216 / 274
![Page 217: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/217.jpg)
Compare with scoring in JFA
Scoring in JFA is based on the sequential mode[Kenny et al., 2007b]:
SJFA-LR(Os ,Ot) =PΛ(s)(Ot)PΛ(Ot)
where Λ(s) denotes the adapted speaker model based on enrollmentspeech Os from speaker s.Computing PΛ(s)(Ot) requires the posterior density of speaker factors[y(s) and z(s) in Kenny 2007], which are posteriorly correlated.The scoring function in Eq. 9 is based on the batch mode.Batch mode is similar to speaker comparison in which no modeladaptation is performed. So, the posterior correlation betweenspeaker factors and SNR factors does not occur in Eq. 9.
217 / 274
![Page 218: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/218.jpg)
Scoring based on sequential mode
The batch-mode scoring (Eq. 10) requires inverting a big matrix if thetarget speaker has a large number of enrollment utterances.The sequential-mode scoring can mitigate this problem.For notational simplicity, we assume that the target speaker only haveone enrollment utterance with i-vector xs :
SLR(xs , xt |`s , `t) = p(xs , xt |`s , `t)p(xs |`s)p(xt |`t)
= p(xt |xs , `s , `t)p(xs |`s)p(xs |`s)p(xt |`t)
= p(xt |xs , `s , `t)p(xt |`t)
218 / 274
![Page 219: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/219.jpg)
Scoring based on sequential mode
For simplicity, we omit `s and `t from now on.
p(xt |xs) =∫ ∫
p(xt |h,w)p(h,w|xs)dhdw
As h and w are posteriorly dependent, we use variational Bayes toapproximate the joint posterior:
p(xt |xs) ≈∫ ∫
p(xt |h,w)q(h)q(w)dhdw
=∫
h
∫wN (xt |m + Vh + Uw,Σ)N (h|µhs ,Σhs )N (w|µws ,Σws )dhdw
(11)
where µhs , Σhs , µws , and Σws are posterior means and posteriorcovariances.
219 / 274
![Page 220: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/220.jpg)
Scoring based on sequential modeEq. 11 is a convolution of Gaussians
p(xt |xs) ≈∫
w
∫hN (xt |m + Vh + Uw,Σ)N (h|µhs ,Σhs )dhN (w|µws ,Σws )dw
=∫
wN (xt |m + Vµhs + Uw,VΣhs V
T + Σ)N (w|µws ,Σws )dw
= N (xt |m + Vµhs + Uµws ,VΣhs VT + UΣws U
T + Σ)If `s falls on the k-th SNR group, we may replace µws byw∗k ≡ 〈wk |X k〉 and assume that Σw∗k → 0:
p(xt |xs) = N (xt |m + Vµhs + Uw∗k ,VΣhs VT + Σ)p(xt) is a marginal density
p(xt) =∫
p(xt |h,w)p(h,w)dhdw
=∫N (xt |m + Vh + Uw,Σ)N (h|0, I)N (w|0, I)dhdw
= N (xt |m,VVT + UUT + Σ)220 / 274
![Page 221: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/221.jpg)
Scoring based on sequential mode
The posteriors means and covariances can be obtained from Eq. 6and Eq. 7 by considering a single utterance from target-speaker s:
µhs = 〈hs |xs〉 = Σhs VTΣ−1(xs −m−Uµws )µws = 〈ws |xs〉 = Σws UTΣ−1(xs −m− Vµhs )Σhs = (I + VTΣ−1V)−1
Σws = (I + UTΣ−1U)−1
Note that µhs and µws depend on each other, meaning that theyshould be found iteratively.
221 / 274
![Page 222: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/222.jpg)
Experiments on SRE12
Evaluation dataset: Common evaluation condition 1 and 4 of NISTSRE 2012 core set.Parameterization: 19 MFCCs together with energy plus their 1stand 2nd derivatives =⇒ 60-Dim acoustic vectorsUBM: Gender-dependent, mic+tel, 1024 mixturesTotal Variability Matrix: Gender-dependent, mic+tel, 500 totalfactorsI-Vector Preprocessing: Whitening by WCCN then lengthnormalization followed by non-parametric feature analysis (NFA)5 andWCCN (500-dim → 200-dim)
5Z. Li, D. Lin, and X. Tang, “Nonparametric discriminant analysis for facerecognition,” IEEE Trans. on PAMI, 2009.
222 / 274
![Page 223: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/223.jpg)
Prepare training speech files
223 / 274
![Page 224: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/224.jpg)
SNR distributions
SNR Distribution of training and test utterances in CC4
224 / 274
![Page 225: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/225.jpg)
Performance on SRE12
225 / 274
![Page 226: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/226.jpg)
Performance on SRE12
226 / 274
![Page 227: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/227.jpg)
Performance on SRE12
227 / 274
![Page 228: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/228.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies5.1. Heavy-Tailed PLDA5.2. SNR-Invariant PLDA5.3. Mixture of PLDA5.4. DNN I-vectors5.5. PLDA with RBM
6 Future Direction 228 / 274
![Page 229: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/229.jpg)
Mixture of PLDA: Motivation
Conventional i-vector/PLDA systems use a single PLDA model tohandle all SNR conditions
229 / 274
![Page 230: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/230.jpg)
Mixture of PLDA: Motivation
We argue that a PLDA model should focus on a small range of SNR.
230 / 274
![Page 231: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/231.jpg)
Distribution of SNR
231 / 274
![Page 232: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/232.jpg)
Proposed solution
The full spectrum of SNRs is handled by a mixture of PLDA in whichthe posteriors of the indicator variables depend on the utterance’sSNR.Verification scores depend not only on the same-speaker anddifferent-speaker likelihoods but also on the posterior probabilities ofSNR.
232 / 274
![Page 233: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/233.jpg)
Mixture of PLDA [Mak et al., 2016]
Model parameters:
θ = π, µ, σ,m,V,Σ= πk , µk , σk︸ ︷︷ ︸
Modeling SNR
, mk ,Vk ,Σk︸ ︷︷ ︸Speaker subspaces
Kk=1
Generative model:
xij ∼K∑
k=1P(yijk = 1|`ij)N (xij |mk ,VkVT
k + Σk)
whereP(yijk = 1|`ij) = πkN (`ij |µk , σ
2k)∑K
k′=1 πk′N (`ij |µk′ , σ2k′))
and `ij is the SNR of the utterance j from speaker i .
233 / 274
![Page 234: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/234.jpg)
Mixture of PLDA
Graphical model:
234 / 274
![Page 235: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/235.jpg)
PLDA vs. Mixture of PLDA
Graphical models:
𝒙𝑖𝑗 𝒛𝑖
𝑁
𝒎
𝑽
𝚺
𝐻𝑖 ℓ ij
xij zi
yijk
NHi
Kπ
µ , σ
V
m Σ
PLDA Mixture of PLDA
235 / 274
![Page 236: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/236.jpg)
Experiments on SRE12
Evaluation dataset: Common evaluation condition 1 and 4 of NISTSRE 2012 core set.Parameterization: 19 MFCCs together with energy plus their 1stand 2nd derivatives =⇒ 60-Dim acoustic vectorsUBM: Gender-dependent, mic+tel, 1024 mixturesTotal Variability Matrix: Gender-dependent, mic+tel, 500 totalfactorsI-Vector Preprocessing: Whitening by WCCN then lengthnormalization followed by LDA and WCCN (500-dim → 200-dim)
236 / 274
![Page 237: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/237.jpg)
Experiments on SRE12
In NIST 2012 SRE, training utterances from telephone channels areclean, but some of the test utterances are noisy.We used the FaNT tool to add babble noise to the clean trainingutterances
237 / 274
![Page 238: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/238.jpg)
DET Performance
Train on tel+mic speech and test on noisy tel speech (CC4).
238 / 274
![Page 239: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/239.jpg)
I-vector Cluster Alignment
0 10 20 30 40 50 60
1
2
3
SNR (dB)
Clu
ster
ID
Aligned to Mixture 1
Aligned to Mixture 3
Without using SNR
0 10 20 30 40 50 60
1
2
3
SNR (dB)
Clu
ster
ID
Aligned to Mixture 1Aligned to Mixture 2Aligned to Mixture 3
With SNR as guidance239 / 274
![Page 240: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/240.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies5.1. Heavy-Tailed PLDA5.2. SNR-Invariant PLDA5.3. Mixture of PLDA5.4. DNN I-vectors5.5. PLDA with RBM
6 Future Direction 240 / 274
![Page 241: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/241.jpg)
DNN for speaker recognition
Main idea: replacing the universal background model (UBM) with aphonetically-aware DNN for computing the frame posteriorprobabilities.The most successful application of DNN to speaker recognition[Lei et al., 2014, Ferrer et al., 2016, Richardson et al., 2015]
Source: Richardson, F., et al. IEEE Signal Processing Letters, 2015 241 / 274
![Page 242: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/242.jpg)
DNN I-vector extraction
Source: Ferrer, L. et al. IEEE/ACM Transactions on Audio, Speech, and LanguageProcessing, 2016.
242 / 274
![Page 243: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/243.jpg)
UBM I-vectors
Factor analysis model for UBM i-vectors:
µc = µ(b)c + Tcw c = 1, . . . ,C
Given the MFCC vectors of an utterance O = o1, . . . , oT, itsi-vector is the posterior mean of w
x ≡ 〈w|O〉 = L−1C∑
c=1TT
c
(Σ(b)
c
)−1 T∑t=1
γc(ot)(ot − µ(b)c )
where
L = I +C∑
c=1
T∑t=1
γc(ot)TTc (Σ(b)
c )−1Tc
γc(ot) ≡ Pr(Mixture = c|ot) = λ(b)c N (ot |µ(b)
c ,Σ(b)c )∑C
j=1 λ(b)j N (ot |µ(b)
j ,Σ(b)j )
243 / 274
![Page 244: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/244.jpg)
DNN I-vectors
Replace γc(ot) by DNN output, γDNNc (at)
The DNN is trained to produce posterior probabilities of senones,given multiple frames of acoustic features, at , as input.
Acoustic features for speech recognition in the DNN are not necessarythe same as the features for the i-vector extractor.
244 / 274
![Page 245: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/245.jpg)
DNN I-vectors
Given MFCC or bottleneck (BN) feature vectors O = o1, . . . , oT,the DNN i-vector is6
x ≡ 〈w|O〉 = L−1C∑
c=1TT
c
(ΣDNN
c
)−1 T∑t=1
γDNNc (at)(ot − µDNN
c )
where
µDNNc =
∑Ni=1
∑Tit=1 γ
DNNc (ait)oit∑N
i=1∑Ti
t=1 γDNNc (ait)
ΣDNNc =
∑Ni=1
∑Tit=1 γ
DNNc (ait)oitoT
it∑Ni=1
∑Tit=1 γ
DNNc (ait)
− µDNNc (µDNN
c )T
6For a full set of formulae, seehttp://www.eie.polyu.edu.hk/∼mwmak/papers/FA-Ivector.pdf
245 / 274
![Page 246: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/246.jpg)
Denoising deep classifier [Tan et al., 2016]
246 / 274
![Page 247: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/247.jpg)
DNN I-vectors from denoising deep classifier
247 / 274
![Page 248: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/248.jpg)
Performance on NIST 2012 SRE
Performance on CC4 with test utterances contaminated with babblenoise.
248 / 274
![Page 249: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/249.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies5.1. Heavy-Tailed PLDA5.2. SNR-Invariant PLDA5.3. Mixture of PLDA5.4. DNN I-vectors5.5. PLDA with RBM
6 Future Direction 249 / 274
![Page 250: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/250.jpg)
PLDA–RBM [Stafylakis et al., 2012]Main idea:
Use i-vectors as input to the Gaussian visible layer of an RBMDivide RBM weights into two parts: speaker and channelConsider RBM weights as analogue to PLDA’s loading matricesDivide the Gaussian hidden layer into two parts: speaker and channelHidden nodes are considered as latent factors
RBM RBM–PLDA250 / 274
![Page 251: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/251.jpg)
PLDA vs. PLDA–RBM
PLDA (omitting global mean):v = Vs + Uc + ε
where v is an i-vector, s and c are speaker and channel factors.RBM-PLDA:
vn = σv
[Ws
sstσs
+ Wccstσc
]where vn is the expected value of visible layer in the negative phase ofCD-1 sampling, sst and cst are the states of Gaussian hidden nodes ofthe RBM.
251 / 274
![Page 252: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/252.jpg)
Scoring in PLDA–RBM
Given two i-vectors vi , i = 1, 2, compute si = WTs vi .
The log-likelihood ratio is
LLR = −12(s1 − s2)T(s1 − s2) + const
If ‖si‖ = 1, the model is similar to cosine distance scoring.
252 / 274
![Page 253: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/253.jpg)
Results on NIST 2010 SRE
NIST’10, female, core-extended:
253 / 274
![Page 254: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/254.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction6.1. Domain adaptation6.2. Short-utterance speaker verification6.3. Text-dependent speaker recognition
254 / 274
![Page 255: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/255.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction6.1. Domain adaptation6.2. Short-utterance speaker verification6.3. Text-dependent speaker recognition
255 / 274
![Page 256: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/256.jpg)
Domain adaptation
Domain adaptation aims to adapt a system from a resource-richdomain to a resource-limited domain.Domain mismatch: systems perform very well in the domain(environment) for which they are trained; however, their performancesuffers when the users use the systems in other domain.Domain Adaptation Challenge (DAC13) was introduced in 2013.The effect of domain mismatch on PLDA parameters is morepronouncedPopular approaches:
Bayesian adaptation of PLDA models [Villalba and Lleida, 2014]Unsupervised clustering of i-vectors for adapting covariance matrices ofPLDA models [Shum et al., 2014, Garcia-Romero et al., 2014]Inter-dataset variability compensation[Aronowitz, 2014, Kanagasundaram et al., 2015]
256 / 274
![Page 257: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/257.jpg)
Bayesian Adaptation of PLDA (Villalba and Lleida, 2014)
Use a generative model to generate out-of-domain labelled (θdj) data
and in-domain (target) unlabelled (θj) data.Unknown labels (θj) are modelled as latent variables
257 / 274
![Page 258: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/258.jpg)
Unsupervised clustering of i-vectors (Shum et al., 2014)
Step 1 Use the within-class and between-class covariance matrices ofout-of-domain data to compute a pairwise affinity matrix onunlabelled in-domain data.
Step 2 Use the affinity matrix to obtain hypothesized speaker clusters ofin-domain data.
Step 3 Linearly interpolate the in-domain and out-of-domain covariancematrices to obtained the adapted matrices:
Σadapt = αwcΣin + (1− αwcΣout)Φadapt = αacΦin + (1− αacΦout)
258 / 274
![Page 259: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/259.jpg)
Inter-dataset variability compensation (IDVC)
IDVC aims at explicitly modeling dataset shift variability in thei-vector space and compensating it as a pre-processing cleanup step.No need to use in-domain data to adapt the PLDA model.
Step 1 Divide the development data into different subsets according totheir sources, e.g., different LDC distributions of Switchboard.
Step 2 Estimate an inter-dataset variability subspace with the largestvariability across the mean i-vectors of these subsets.
Step 3 Remove the variability of i-vectors in this subspace via nuisanceattribute projection (NAP).
259 / 274
![Page 260: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/260.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction6.1. Domain adaptation6.2. Short-utterance speaker verification6.3. Text-dependent speaker recognition
260 / 274
![Page 261: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/261.jpg)
Short-utterance speaker verification
Performance of i-vector/PLDA systems degrades rapidly when thesystems are presented with short utterances or utterances withvarying durations.These systems consider long and short utterances as equally reliable.However, the i-vectors of short utterances have much bigger posteriorcovariances.Popular approaches:
Uncertainty Propagation: Propagate the posterior covariances ofi-vectors to PLDA [Kenny et al., 2013]Full posterior distribution PLDA [Cumani et al., 2014]
261 / 274
![Page 262: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/262.jpg)
Uncertainty propagation (Kenny et al., 2013)
In i-vector extraction, besides the posterior mean of the latentvariable (i-vector), we also have the posterior covariance matrix,which reflects the uncertainty of the i-vector estimate.
262 / 274
![Page 263: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/263.jpg)
Uncertainty propagation (Kenny et al., 2013)
wij = m + Vhi + Uijzij + εij
Uij is the Cholesky decomposition of the posterior covariance matrix(L−1
ij ) of the j-th i-vector by the i-th speaker.The intra-speaker covariance matrix becomes:
cov(wij ,wij |hi ) = UijUTij + Σ
where UijUTij changes from utterance to utterance, thus reflecting the
reliability of the i-vector wij .Scoring is computationally expensive. See [Lin and Mak, 2016] for afast scoring algorithm.
263 / 274
![Page 264: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/264.jpg)
Outline
1 Introduction
2 Learning Algorithms
3 Learning Models
4 Deep Learning
5 Case Studies
6 Future Direction6.1. Domain adaptation6.2. Short-utterance speaker verification6.3. Text-dependent speaker recognition
264 / 274
![Page 265: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/265.jpg)
Text-dependent SV using short utterances
Very difficult for short-utterance text-dependent SVText-dependent tasks with the uncertainty propagation version ofi-vector/PLDA were unsatisfactory [Stafylakis et al., 2013].It is more natural to use HMMs rather than GMMs fortext-dependent tasks. But HMMs require local hidden variables,which are difficult to handle because of data fragmentation.Emergent approaches:
Content matching [Scheffer and Lei, 2014]Exploiting out-of-domain data via domain adaptation[Aronowitz and Rendel, 2014, Kenny et al., 2014]y-vector and JFA backend [Kenny et al., 2015b]I-vector backend [Kenny et al., 2015a]Hidden suppervector backend [Kenny et al., 2016]Use DNN/RNN to extract utterance-level features[Heigold et al., 2016, Bhattacharya et al., 2016, Zeinali et al., 2016]
265 / 274
![Page 266: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/266.jpg)
References I
Aronowitz, H. (2014). Compensating inter-dataset variability in PLDA hyper-parametersfor robust speaker recognition. In Odyssey: The Speaker and Language RecognitionWorkshop.
Aronowitz, H. and Rendel, A. (2014). Domain adaptation for text dependent speakerverification. In Interspeech, pages 1337–1341.
Barrett, R., Berry, M. W., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout,V., Pozo, R., Romine, C., and Van der Vorst, H. (1994). Templates for the solution oflinear systems: building blocks for iterative methods, volume 43. Siam.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wisetraining of deep networks. In Scholkopf, B., Platt, J. C., and Hoffman, T., editors,Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press.
Bhattacharya, G., Kenny, P., Alam, J., and Stafylakis, T. (2016). Deep neural networkbased text-dependent speaker verification : Preliminary results. In Odyssey 2016: TheSpeaker and Language Recognition Workshop, pages 9–15, Bilbao, Spain.
Bishop, C. (2006). Pattern recognition and machine learning. springer, New York.
Campbell, W. M., Sturim, D. E., and Reynolds, D. A. (2006). Support vector machinesusing GMM supervectors for speaker verification. IEEE Signal Processing Letters,13(5):308–311.
266 / 274
![Page 267: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/267.jpg)
References II
Cumani, S., Plchot, O., and Laface, P. (2014). On the use of i–vector posteriordistributions in probabilistic linear discriminant analysis. IEEE/ACM Transactions onAudio, Speech, and Language Processing, 22(4):846–857.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and Ouellet, P. (2011). Front-endfactor analysis for speaker verification. IEEE Transactions on Audio, Speech, and LanguageProcessing, 19(4):788–798.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood fromincomplete data via the EM algorithm. Journal of the royal statistical society. Series B(methodological), pages 1–38.
Ferrer, L., Lei, Y., McLaren, M., and Scheffer, N. (2016). Study of senone-based deepneural network approaches for spoken language recognition. IEEE/ACM Transactions onAudio, Speech, and Language Processing, 24(1):105–116.
Garcia-Romero, D. and Espy-Wilson, C. (2011). Analysis of i-vector length normalizationin speaker recognition systems. In Proc. of Interspeech, pages 249–252.
Garcia-Romero, D., Zhang, X., McCree, A., and Povey, D. (2014). Improving speakerrecognition performance in the domain adaptation challenge using deep neural networks. InProc. of IEEE Spoken Language Technology Workshop (SLT), pages 378–383.
267 / 274
![Page 268: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/268.jpg)
References III
Gauvain, J. L. and Lee, C.-H. (1994). Maximum a posteriori estimation for multivariateGaussian mixture observations of Markov chains. IEEE Transactions on Speech and AudioProcessing, 2(2):291–298.
Heigold, G., Moreno, I., Bengio, S., and Shazeer, N. (2016). End-to-end text-dependentspeaker verification. In Proc. of IEEE International Conference on Acoustics, Speech andSignal Processing (ICASSP), pages 5115–5119.
Kanagasundaram, A., Dean, D., and Sridharan, S. (2015). Improving out-domain PLDAspeaker verification using unsupervised inter-dataset variability compensation approach. InProc. of ICASSP, pages 4654–4658.
Kenny, P. (2010). Bayesian speaker verification with heavy-tailed priors. In Proc. ofOdyssey: Speaker and Language Recognition Workshop, Brno, Czech Republic.
Kenny, P., Boulianne, G., and Dumouchel, P. (2005). Eigenvoice modeling with sparsetraining data. IEEE Transactions on speech and audio processing, 13(3):345–354.
Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2007a). Joint factor analysisversus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech andLanguage Processing, 15(4):1435–1447.
268 / 274
![Page 269: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/269.jpg)
References IV
Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2007b). Speaker and sessionvariability in GMM-based speaker verification. IEEE Transactions on Audio, Speech, andLanguage Processing, 15(4):1448–1460.
Kenny, P., Stafylakis, T., Alam, J., Gupta, V., and Kockmann, M. (2016). Uncertaintymodeling without subspace methods for text-dependent speaker recognition. In Odyssey:The Speaker and Language Recognition Workshop, pages 16–23, Bilbao, Spain.
Kenny, P., Stafylakis, T., Alam, J., and Kockmann, M. (2015a). An i-vector backend forspeaker verification. In Proc. of INTERSPEECH, pages 2307–2310.
Kenny, P., Stafylakis, T., Alam, J., and Kockmann, M. (2015b). JFA modeling withleft-to-right structure and a new backend for text-dependent speaker recognition. In Proc.of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),pages 4689–4693.
Kenny, P., Stafylakis, T., Alam, M. J., Ouellet, P., and Kockmann, M. (2014). In-domainversus out-of-domain training for text-dependent JFA. In INTERSPEECH, pages1332–1336.Kenny, P., Stafylakis, T., Ouellet, P., Alam, M. J., and Dumouchel, P. (2013). PLDA forspeaker verification with utterances of arbitrary duration. In Proc. ICASSP, pages7649–7653.
269 / 274
![Page 270: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/270.jpg)
References V
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proc. ofInternational Conference on Learning Representation.
Lei, Y., Scheffer, N., Ferrer, L., and McLaren, M. (2014). A novel scheme for speakerrecognition using a phonetically-aware deep neural network. In Proc. of ICASSP.
Li, N. and Mak, M. W. (2015). SNR-invariant PLDA modeling in nonparametric subspacefor robust speaker verification. IEEE/ACM Transactions on Audio, Speech, and LanguageProcessing, 23(10):1648–1659.
Lin, W. W. and Mak, M. W. (2016). Fast scoring for plda with uncertainty propagation. InOdyssey 2016: The Speaker and Language Recognition Workshop, pages 31–38, Bilbao,Spain.
Mak, M.-W., Pang, X., and Chien, J.-T. (2016). Mixture of PLDA for noise robust i-vectorspeaker verification. IEEE/ACM Transactions on Audio, Speech, and Language Processing,24(1):130–142.
Norwich, K. H. (1993). Information, sensation, and perception. Academic Press San Diego.
Prince, S. and Elder, J. (2007). Probabilistic linear discriminant analysis for inferencesabout identity. In Proc. of IEEE International Conference on Computer Vision, pages 1–8.
270 / 274
![Page 271: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/271.jpg)
References VI
Reynolds, D. A., Quatieri, T. F., and Dunn, R. B. (2000). Speaker verification usingadapted Gaussian mixture models. Digital Signal Processing, 10(1–3):19–41.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation andapproximate inference in deep generative models. In Proc. of International Conference onMachine Learning, pages 1278–1286.
Richardson, F., Reynolds, D., and Dehak, N. (2015). Deep neural network approaches tospeaker and language recognition. IEEE Signal Processing Letters, 22(10):1671–1675.
Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In Proc. ofInternational Conference on Artificial Intelligence and Statistics, page 3.
Scheffer, N. and Lei, Y. (2014). Content matching for short duration speaker recognition.In Proc. of INTERSPEECH, pages 1317–1321.
Shum, S., Reynolds, D. A., Garcia-Romero, D., and McCree, A. (2014). Unsupervisedclustering approaches for domain adaptation in speaker recognition systems. In Odyssey:The Speaker and Language Recognition Workshop, pages 266–272.
Stafylakis, T., Kenny, P., Ouellet, P., Perez, J., Kockmann, M., and Dumouchel, P.(2013). Text-dependent speaker recognition using PLDA with uncertainty propagation.matrix, 500:1.
271 / 274
![Page 272: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/272.jpg)
References VII
Stafylakis, T., Kenny, P., Senoussaoui, M., and Dumouchel, P. (2012). PLDA usingGaussian restricted Boltzmann machines with application to speaker verification. In Proc.of INTERSPEECH, pages 1692–1695.
Tan, Z. L., Zhu, Y. K., Mak, M. W., and Mak, B. (2016). Senone i-vectors for robustspeaker verification. In ISCSLP’16.
Vapnik, V. (2013). The nature of statistical learning theory. Springer Science & BusinessMedia.Villalba, J. and Lleida, E. (2014). Unsupervised adaptation of PLDA by using variationalBayes methods. In Proc. ICASSP, pages 744–748.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting andcomposing robust features with denoising autoencoders. In Proc. of InternationalConference on Machine learning, pages 1096–1103.
Vogt, R. and Sridharan, S. (2008). Explicit modelling of session variability for speakerverification. Computer Speech & Language, 22(1):17–38.
Watanabe, S. and Chien, J.-T. (2015). Bayesian Speech and Language Processing.Cambridge University Press.
272 / 274
![Page 273: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/273.jpg)
References VIII
Zeinali, H., Burget, L., Sameti, H., Glembek, O., and Plchot, O. (2016). Deep neuralnetworks and hidden Markov models in i-vector-based text-dependent speaker verification.In Odyssey: The Speaker and Language Recognition Workshop, pages 24–30, Bilbao,Spain.
Zhao, X. and Dong, Y. (2012). Variational Bayesian joint factor analysis models forspeaker verification. IEEE Transactions on Audio, Speech, and Language Processing,20(3):1032–1042.
273 / 274
![Page 274: INTERSPEECH 2016 Tutorial: Machine Learning for Speaker ...mwmak/papers/IS2016-tutorial.pdfINTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition Man-Wai Mak†and Jen-Tzung](https://reader035.fdocuments.us/reader035/viewer/2022071213/60291dab9852f226216e78e3/html5/thumbnails/274.jpg)
Contact information
Man-Wai MakThe Hong Kong Polytechnic [email protected]://www.eie.polyu.edu.hk/∼mwmak/Jen-Tzung ChienNational Chiao Tung [email protected]://chien.cm.nctu.edu.tw
274 / 274