Gaussian Mixture Model (GMM) and
Hidden Markov Model (HMM)
Samudravijaya K
Majority of the slides are taken from S.Umesh’s tutorial on ASR (WiSSAP 2006).
Pattern Recognition
[Block diagram: input signal → signal processing → model generation (training) and pattern matching (testing) → output]
GMM: static patterns
HMM: sequential patterns
Basic Probability
Joint and Conditional probability
p(A,B) = p(A|B) p(B) = p(B|A) p(A)
Bayes’ rule
p(A|B) = p(B|A) p(A) / p(B)

If the events Ai are mutually exclusive and exhaustive,

p(B) = Σ_i p(B|Ai) p(Ai)

p(A|B) = p(B|A) p(A) / Σ_i p(B|Ai) p(Ai)
Normal Distribution
Many phenomena are described by a Gaussian pdf:

p(x|θ) = (1/√(2πσ²)) exp( −(x − µ)² / (2σ²) )    (1)

The pdf is parameterised by θ = [µ, σ²], where mean = µ and variance = σ².

A convenient pdf: second-order statistics are sufficient.
Example: Heights of Pygmies ⇒ Gaussian pdf with µ = 4ft & std-dev(σ) = 1ft
OR: Heights of bushmen ⇒ Gaussian pdf with µ = 6ft & std-dev(σ) = 1ft
Question: If we arbitrarily pick a person from a population ⇒ what is the probability of the height being a particular value?
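As a quick numeric check, Eq. (1) can be evaluated directly. A minimal Python/NumPy sketch (the function name is mine; the parameter values follow the slide's Pygmy example):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density of Eq. (1)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Slide's example: Pygmy heights ~ N(mu = 4 ft, sigma = 1 ft)
print(gaussian_pdf(4 + 1 / 12, mu=4.0, sigma2=1.0))  # height 4'1": near the mode
print(gaussian_pdf(5.0, mu=4.0, sigma2=1.0))         # height 5' : much less likely
```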
If I arbitrarily pick a Pygmy, say x, then

Pr(Height of x = 4′1″) = (1/√(2π·1)) exp( −(4′1″ − 4′)² / (2·1) )    (2)

Note: Here the mean and variance are fixed; only the observation, x, changes.
Also see: Pr(x = 4′1′′) ≫ Pr(x = 5′) AND Pr(x = 4′1′′) ≫ Pr(x = 3′)
Conversely: Given a person's height is 4′1″ ⇒ the person is more likely to be a Pygmy than a Bushman.

If we observe the heights of many persons – say 3′6″, 4′1″, 3′8″, 4′5″, 4′7″, 4′, 6′5″ –
and all are from the same population (i.e. either Pygmies or Bushmen),
⇒ then the more certain we are that the population is Pygmy.

More observations ⇒ better our decision
Likelihood Function
x[0], x[1], . . . , x[N − 1]
⇒ set of independent observations from pdf parameterised by θ.
Previous Example: x[0], x[1], . . . , x[N − 1] are the heights observed and
θ is the mean of the density, which is unknown (σ² assumed known).
L(X; θ) = p(x_0 . . . x_{N−1}; θ) = ∏_{i=0}^{N−1} p(x_i; θ)
        = (1/(2πσ²)^{N/2}) exp( −(1/(2σ²)) Σ_{i=0}^{N−1} (x_i − θ)² )    (3)

L(X; θ) is a function of θ and is called the Likelihood Function.

Given x_0 . . . x_{N−1} ⇒ what can we say about the value of θ, i.e. the best estimate of θ?
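A small sketch (Python/NumPy; the height values below are invented for illustration) of evaluating the log of (3) over candidate values of θ; working in the log domain avoids numerical underflow:

```python
import numpy as np

def log_likelihood(x, theta, sigma2=1.0):
    """Log of Eq. (3): log-likelihood of i.i.d. Gaussian observations x
    as a function of the unknown mean theta (sigma^2 assumed known)."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - theta) ** 2) / (2 * sigma2)

heights = np.array([3.5, 4.08, 3.67, 4.42, 4.58, 4.0])   # illustrative data (feet)
thetas = np.linspace(3.0, 6.0, 301)                       # candidate means
best = thetas[np.argmax([log_likelihood(heights, t) for t in thetas])]
print(best, heights.mean())  # the maximiser sits at the sample mean (next slides)
```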
Maximum Likelihood Estimation
Example: We know the height of a person, x[0] = 4′4″.
Most likely to have come from which pdf ⇒ θ = 3′, 4′6″ or 6′ ?
Maximum of L(x[0]; θ = 3′), L(x[0]; θ = 4′6″) and L(x[0]; θ = 6′) ⇒ choose θ = 4′6″.

If θ is just a parameter, we will choose arg max_θ L(x[0]; θ).
Maximum Likelihood Estimator
Given x[0], x[1], . . . , x[N − 1] and a pdf parameterised by θ = [θ_1, θ_2, . . . , θ_{m−1}]ᵀ.

We form the Likelihood function L(X; θ) = ∏_{i=0}^{N−1} p(x_i; θ)

θ_MLE = arg max_θ L(X; θ)

For the height problem:
⇒ can show θ_MLE = (1/N) Σ x_i
⇒ Estimate of the mean of the Gaussian = sample mean of the measured heights.
Bayesian Estimation
• MLE ⇒ θ is assumed unknown but deterministic
• Bayesian Approach: θ is assumed random with pdf p(θ) ⇒ Prior Knowledge.
p(θ|x)  (a posteriori)  =  p(x|θ) p(θ) / p(x)  ∝  p(x|θ) · p(θ)  (prior)

• Height problem: the unknown mean is random ⇒ pdf Gaussian N(γ, ν²)

p(µ) = (1/√(2πν²)) exp( −(µ − γ)² / (2ν²) )

Then: µ_Bayesian = (σ²γ + nν² x̄) / (σ² + nν²)

⇒ Weighted average of the sample mean and the prior mean
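A minimal numeric check of the weighted-average formula (Python sketch; the prior hyperparameters γ and ν² below are invented for illustration):

```python
import numpy as np

def posterior_mean(x, sigma2, gamma, nu2):
    """Bayesian estimate of a Gaussian mean: known data variance sigma2,
    Gaussian prior N(gamma, nu2) on the mean. Returns
    (sigma2*gamma + n*nu2*xbar) / (sigma2 + n*nu2)."""
    n, xbar = len(x), np.mean(x)
    return (sigma2 * gamma + n * nu2 * xbar) / (sigma2 + n * nu2)

heights = np.array([3.5, 4.08, 3.67, 4.42])   # illustrative data (feet)
print(posterior_mean(heights, sigma2=1.0, gamma=4.0, nu2=0.25))
# Few samples -> the estimate stays near the prior mean gamma;
# as n grows, the weight shifts to the sample mean.
```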
Gaussian Mixture Model
p(x) = α p(x|N(µ1; σ1)) + (1 − α) p(x|N(µ2; σ2))
p(x) = Σ_{m=1}^{M} w_m p(x|N(µ_m; σ_m)),   Σ_i w_i = 1

Characteristics of GMM:
Just like ANNs are universal approximators of functions, GMMs are universal approximators of densities (provided a sufficient number of mixture components is used); this is true for diagonal-covariance GMMs as well.
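A short sketch (Python/NumPy; the weights, means and std-devs are illustrative) of evaluating the mixture density above:

```python
import numpy as np

def gmm_pdf(x, w, mu, sigma):
    """Mixture density p(x) = sum_m w_m N(x; mu_m, sigma_m)."""
    x = np.atleast_1d(x)[:, None]                    # shape (n, 1)
    comp = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return comp @ w                                  # weighted sum over components

w = np.array([0.3, 0.5, 0.2])       # mixture weights, sum to 1
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component std-devs
print(gmm_pdf([0.0, 1.0], w, mu, sigma))
```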
General Assumption in GMM
• Assume that there are M components.
• Each component generates data from a Gaussian with mean µm and covariance
matrix Σm.
GMM
Consider the following probability density function shown in solid blue
It is useful to parameterise or “model” this seemingly arbitrary “blue” pdf
Gaussian Mixture Model (Contd.)
Actually – the pdf is a mixture of 3 Gaussians, i.e.

p(x) = c1 N(x; µ1, σ1) + c2 N(x; µ2, σ2) + c3 N(x; µ3, σ3),   Σ c_i = 1    (4)
pdf parameters: c1, c2, c3, µ1, µ2, µ3, σ1, σ2, σ3
Observation from GMM
Experiment: An urn contains balls of 3 different colours: red, blue or green. Behind a curtain, a person picks a ball from the urn.

If red ball ⇒ generate x[i] from N(x; µ1, σ1)
If blue ball ⇒ generate x[i] from N(x; µ2, σ2)
If green ball ⇒ generate x[i] from N(x; µ3, σ3)

We have access only to the observations x[0], x[1], . . . , x[N − 1].

Therefore: p(x[i]; θ) = c1 N(x; µ1, σ1) + c2 N(x; µ2, σ2) + c3 N(x; µ3, σ3)

but we do not know which component x[i] comes from!

Can we estimate θ = [c1 c2 c3 µ1 µ2 µ3 σ1 σ2 σ3]ᵀ from the observations?

arg max_θ p(X; θ) = arg max_θ ∏_{i=1}^{N} p(x_i; θ)    (5)
Estimation of Parameters of GMM
Easier Problem: We know the component for each observation
Obs: x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] x[10] x[11] x[12]
Comp. 1 2 2 1 3 1 3 3 2 2 3 3 3
X1 = {x[0], x[3], x[5]} belongs to p1(x; µ1, σ1)
X2 = {x[1], x[2], x[8], x[9]} belongs to p2(x; µ2, σ2)
X3 = {x[4], x[6], x[7], x[10], x[11], x[12]} belongs to p3(x; µ3, σ3)

From X1 = {x[0], x[3], x[5]}:   c1 = 3/13   and   µ1 = (1/3){x[0] + x[3] + x[5]}

σ1² = (1/3){(x[0] − µ1)² + (x[3] − µ1)² + (x[5] − µ1)²}

In practice we do not know which observation comes from which pdf.

⇒ How do we solve for arg max_θ p(X; θ) ?
Incomplete & Complete Data
x[0], x[1], . . . , x[N − 1] ⇒ incomplete data,
Introduce another set of variables y[0], y[1], . . . , y[N − 1]
such that y[i] = 1 if x[i] ∈ p1, y[i] = 2 if x[i] ∈ p2 and y[i] = 3 if x[i] ∈ p3
Obs: x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] x[10] x[11] x[12]
Comp. 1 2 2 1 3 1 3 3 2 2 3 3 3
miss: y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9] y[10] y[11] y[12]
y[i] = missing (unobserved) data ⇒ information about the component

z = (x; y) is the complete data ⇒ observations and which density they come from

p(z; θ) = p(x, y; θ)   ⇒   θ_MLE = arg max_θ p(z; θ)
Question: But how do we find which observation belongs to which density ?
Given observation x[0] and θg, what is the probability of x[0] coming from the first distribution?

p(y[0] = 1 | x[0]; θg)
= p(y[0] = 1, x[0]; θg) / p(x[0]; θg)
= p(x[0] | y[0] = 1; µ1^g, σ1^g) · p(y[0] = 1) / Σ_{j=1}^{3} p(x[0] | y[0] = j; θg) p(y[0] = j)
= p(x[0] | y[0] = 1; µ1^g, σ1^g) · c1^g / [ p(x[0] | y[0] = 1; µ1^g, σ1^g) c1^g + p(x[0] | y[0] = 2; µ2^g, σ2^g) c2^g + . . . ]

All parameters are known ⇒ we can calculate p(y[0] = 1 | x[0]; θg).
(Similarly calculate p(y[0] = 2 | x[0]; θg) and p(y[0] = 3 | x[0]; θg).)

Which density? ⇒ y[0] = arg max_i p(y[0] = i | x[0]; θg) – Hard allocation
Parameter Estimation for Hard Allocation
x[0] x[1] x[2] x[3] x[4] x[5] x[6]
p(y[j] = 1|x[j];θg) 0.5 0.6 0.2 0.1 0.2 0.4 0.2
p(y[j] = 2|x[j];θg) 0.25 0.3 0.75 0.3 0.7 0.5 0.6
p(y[j] = 3|x[j];θg) 0.25 0.1 0.05 0.6 0.1 0.1 0.2
Hard Assign. y[0]=1 y[1]=1 y[2]=2 y[3]=3 y[4]=2 y[5]=2 y[6]=2
Updated Parameters: c1 = 2/7, c2 = 4/7, c3 = 1/7 (different from initial guess!)

Similarly (for Gaussian) find µ_i, σ_i² for the ith pdf.
Parameter Estimation for Soft Assignment
x[0] x[1] x[2] x[3] x[4] x[5] x[6]
p(y[j] = 1|x[j];θg) 0.5 0.6 0.2 0.1 0.2 0.4 0.2
p(y[j] = 2|x[j];θg) 0.25 0.3 0.75 0.3 0.7 0.5 0.6
p(y[j] = 3|x[j];θg) 0.25 0.1 0.05 0.6 0.1 0.1 0.2
Example: Prob. of each sample belonging to component 1
p(y[0] = 1|x[0]; θg), p(y[1] = 1|x[1]; θg) p(y[2] = 1|x[2]; θg), · · · · · ·
Average probability that a sample belongs to Comp.#1 is
c1^new = (1/N) Σ_{i=1}^{N} p(y[i] = 1 | x[i]; θg)
       = (0.5 + 0.6 + 0.2 + 0.1 + 0.2 + 0.4 + 0.2) / 7 = 2.2/7
Soft Assignment – Estimation of Means & Variances
Recall: Prob. of sample j belonging to component i
p(y[j] = i|x[j];θg)
Soft Assignment: Parameters estimated by taking weighted averages!

µ1^new = Σ_{i=1}^{N} x_i · p(y[i] = 1 | x[i]; θg) / Σ_{i=1}^{N} p(y[i] = 1 | x[i]; θg)

(σ1²)^new = Σ_{i=1}^{N} (x_i − µ1^new)² · p(y[i] = 1 | x[i]; θg) / Σ_{i=1}^{N} p(y[i] = 1 | x[i]; θg)

These are the updated parameters, starting from the initial guess θg.
Maximum Likelihood Estimation of Parameters of GMM
1. Make an initial guess of the parameters: θg = (c1^g, c2^g, c3^g, µ1^g, µ2^g, µ3^g, σ1^g, σ2^g, σ3^g)

2. Knowing the parameters θg, find the prob. of sample x_i belonging to the jth component:
   p(y[i] = j | x[i]; θg)   for i = 1, 2, . . . N ⇒ no. of observations
                            for j = 1, 2, . . . M ⇒ no. of components

3. c_j^new = (1/N) Σ_{i=1}^{N} p(y[i] = j | x[i]; θg)

4. µ_j^new = Σ_{i=1}^{N} x_i · p(y[i] = j | x[i]; θg) / Σ_{i=1}^{N} p(y[i] = j | x[i]; θg)

5. (σ_j²)^new = Σ_{i=1}^{N} (x_i − µ_j^new)² · p(y[i] = j | x[i]; θg) / Σ_{i=1}^{N} p(y[i] = j | x[i]; θg)

6. Go back to (2) and repeat until convergence.
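The loop above, as a compact 1-D sketch in Python/NumPy (random initialisation from the data and a fixed iteration count are illustrative choices here, not prescribed by the slides):

```python
import numpy as np

def em_gmm_1d(x, M, n_iter=100, seed=0):
    """EM for a 1-D GMM, following steps 1-6 above."""
    rng = np.random.default_rng(seed)
    c = np.full(M, 1.0 / M)                  # step 1: initial guess
    mu = rng.choice(x, M, replace=False)
    var = np.full(M, np.var(x))
    for _ in range(n_iter):                  # step 6: iterate
        # step 2: p(y[i] = j | x[i]; theta_g), shape (N, M)
        comp = c * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        post = comp / comp.sum(axis=1, keepdims=True)
        Nj = post.sum(axis=0)
        c = Nj / len(x)                                           # step 3
        mu = (post * x[:, None]).sum(axis=0) / Nj                 # step 4
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / Nj    # step 5
    return c, mu, var

# toy data drawn from two Gaussians
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 0.5, 200)])
print(em_gmm_1d(data, M=2))
```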
Live demonstration: http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html
Practical Issues
• EM can get stuck in local optima of the likelihood.
• EM is very sensitive to initial conditions; a good initial guess helps. The k-means algorithm is often used prior to the EM algorithm to initialise it.
Size of a GMM
The Bayesian Information Criterion (BIC) value of a GMM can be defined as follows:

BIC(G | X) = log p(X | G) − (d/2) log N

where
G represents the GMM with the ML parameter configuration,
d represents the number of parameters in G,
N is the size of the dataset;
the first term is the log-likelihood term;
the second term is the model complexity penalty term.
BIC selects the best GMM corresponding to the largest BIC value by trading off
these two terms.
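A small helper (Python sketch; the log-likelihood argument is assumed to come from an already-trained GMM, e.g. the EM sketch above):

```python
import numpy as np

def bic(log_lik, d, N):
    """BIC(G | X) = log p(X | G) - (d/2) log N; pick the G with the largest value."""
    return log_lik - 0.5 * d * np.log(N)

def n_params_1d(M):
    """Free parameters of a 1-D GMM with M components:
    (M - 1) weights + M means + M variances."""
    return 3 * M - 1

# Usage idea: fit GMMs for several M, compute each model's log-likelihood
# on the data, and keep the M whose bic(...) value is largest.
```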
source: Boosting GMM and Its Two Applications, F.Wang, C.Zhang and N.Lu in N.C.Oza et al. (Eds.) LNCS 3541, pp. 12-21, 2005
The BIC criterion can discover the true GMM size effectively as shown in the figure.
Maximum A Posteriori (MAP)
• Sometimes, it is difficult to get a sufficient number of examples for robust estimation of parameters.
• However, one may have access to a large number of similar examples which can be utilized.
• Adapt the target distribution from such a distribution. For example, adapt a speaker-independent model to a new speaker using a small amount of adaptation data.
MAP Adaptation
ML Estimation
θ_MLE = arg max_θ p(X|θ)

MAP Estimation

θ_MAP = arg max_θ p(θ|X)
      = arg max_θ p(X|θ) p(θ) / p(X)
      = arg max_θ p(X|θ) p(θ)
p(θ) is the a priori distribution of parameters θ.
A conjugate prior is chosen such that the corresponding posterior belongs to the
same functional family as the prior.
Simple Implementation of MAP-GMMs
Source: Statistical Machine Learning from Data: GMM; Samy Bengio
Simple Implementation
Train a prior model p with a large amount of available data (say, from multiple speakers). Adapt the parameters to a new speaker using some adaptation data (X).

Let α ∈ [0, 1] be a parameter that describes the faith in the prior model.

Adapted weight of the jth mixture:

w_j = [ α w_j^p + (1 − α) Σ_i p(j|x_i) ] γ

Here γ is a normalization factor such that Σ_j w_j = 1.
Simple Implementation (contd.)
means:

µ_j = α µ_j^p + (1 − α) · Σ_i p(j|x_i) x_i / Σ_i p(j|x_i)   ⇒ weighted average of the sample mean and the prior mean

variances:

σ_j = α ( σ_j^p + µ_j^p µ_j^pᵀ ) + (1 − α) · Σ_i p(j|x_i) x_i x_iᵀ / Σ_i p(j|x_i) − µ_j µ_jᵀ
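A sketch of these three updates for a 1-D GMM (Python/NumPy; the posterior p(j|x_i) is computed from the prior model, α is user-chosen, and the variance update is the 1-D analogue of the matrix form above):

```python
import numpy as np

def map_adapt_1d(x, c_p, mu_p, var_p, alpha):
    """MAP adaptation of a 1-D GMM; (c_p, mu_p, var_p) is the prior model."""
    comp = c_p * np.exp(-(x[:, None] - mu_p) ** 2 / (2 * var_p)) \
           / np.sqrt(2 * np.pi * var_p)
    post = comp / comp.sum(axis=1, keepdims=True)    # p(j | x_i)
    Nj = post.sum(axis=0)                            # sum_i p(j | x_i)
    # weights: mix prior weights with data counts, then renormalise (gamma)
    w = alpha * c_p + (1 - alpha) * Nj
    w /= w.sum()
    # means: weighted average of prior mean and posterior-weighted sample mean
    mu = alpha * mu_p + (1 - alpha) * (post * x[:, None]).sum(axis=0) / Nj
    # variances: second-moment form, then subtract the new squared mean
    second = alpha * (var_p + mu_p ** 2) \
             + (1 - alpha) * (post * x[:, None] ** 2).sum(axis=0) / Nj
    return w, mu, second - mu ** 2
```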
HMM
• Primary role of the speech signal is to carry a message; a sequence of sounds (phonemes) encodes a sequence of words.
• The acoustic manifestation of a phoneme is mostly determined by:
– Configuration of articulators (jaw, tongue, lip)
– physiology and emotional state of speaker
– Phonetic context
• HMM models sequential patterns; speech is a sequential pattern
• Most text dependent speaker recognition systems use HMMs
• Text verification involves verification/recognition of phonemes
Phoneme recognition
Consider two phoneme classes, /aa/ and /iy/.
Problem: Determine to which class a given sound belongs.
Processing of speech signal results in a sequence of feature (observation) vectors:
o1, . . . , oT (say MFCC vectors)
We say the speech is /aa/ if: p(aa|O) > p(iy|O)
Using Bayes' rule, compare

p(O|aa) · p(aa) / p(O)   vs.   p(O|iy) · p(iy) / p(O)

where p(O|·) is the acoustic model and p(·) is the prior probability.
Given p(O|aa), p(aa), p(O|iy) and p(iy) ⇒ which is more probable ?
Parameter Estimation of Acoustic Model
How do we find the density functions p_aa(·) and p_iy(·)?

We assume a parametric model: ⇒ p_aa(·) parameterised by θ_aa
                              ⇒ p_iy(·) parameterised by θ_iy

Training Phase: Collect many examples of /aa/ being said
⇒ compute the corresponding observations o_1, . . . , o_{T_aa}

Use the Maximum Likelihood Principle:

θ_aa = arg max_{θ_aa} p(O; θ_aa)

Recall: if the pdf is modelled as a Gaussian Mixture Model
⇒ then we use the EM Algorithm.
Modelling of Phoneme
To enunciate /aa/ in a word ⇒ our articulators move from the configuration
for the previous phoneme to that of /aa/, and then proceed
to move to the configuration of the next phoneme.

Can think of 3 distinct time periods:
⇒ Transition from the previous phoneme
⇒ Steady state
⇒ Transition to the next phoneme

Features for the 3 “time intervals” are quite different
⇒ Use different density functions to model the three time intervals
⇒ model as p_aa1(·; θ_aa1), p_aa2(·; θ_aa2), p_aa3(·; θ_aa3)
Also need to model the time durations of these time-intervals – transition probs.
Stochastic Model (HMM)
[Figure: 3-state HMM; each state has an output density p(f) over frequency f (Hz), with transition probabilities a11, a12, . . .]
HMM Model of Phoneme
• Use the term “state” for each of the three time periods.
• Prob. of o_t from the jth state, i.e. p_aaj(o_t; θ_aaj) ⇒ denoted as b_j(o_t)

[Figure: 3-state left-to-right HMM; states 1, 2, 3 with densities p(·; θ_aa1), p(·; θ_aa2), p(·; θ_aa3) emitting the observation sequence o_1, o_2, o_3, . . . , o_10]
• Observation, ot, is generated by which state density?
– Only observations are seen, the state-sequence is “hidden”
– Recall: In GMM, the “mixture component” is “hidden”
Probability of Observation
Recall: To classify, we evaluate Pr(O|Λ) – where Λ are parameters of models
In the /aa/ vs. /iy/ calculation: ⇒ Pr(O|Λ_aa) vs. Pr(O|Λ_iy)
Example: 2-state HMM model and 3 observations o1 o2 o3
Model parameters are assumed known:
Transition Prob. ⇒ a11, a12, a21, and a22 – model time durations
State density ⇒ b1(ot) and b2(ot).
b_j(o_t) are usually modelled as a single Gaussian with parameters µ_j, σ_j², or by GMMs.
Probability of Observation through one Path
T = 3 observations and N = 2 nodes ⇒ 2³ = 8 paths through the 2 nodes for 3 observations
Example: Path P1 through states 1, 1, 1.
Pr{O|P1,Λ} = b1(o1) · b1(o2) · b1(o3)
Prob. of Path P1 = Pr{P1|Λ} = a01 · a11 · a11
Pr{O, P1|Λ} = Pr{O|P1,Λ} · Pr{P1|Λ} = a01b1(o1).a11b1(o2).a11b1(o3)
Probability of Observation
Path   states for o1 o2 o3   p(O, P_i|Λ)
P1     1 1 1                 a01 b1(o1) · a11 b1(o2) · a11 b1(o3)
P2     1 1 2                 a01 b1(o1) · a11 b1(o2) · a12 b2(o3)
P3     1 2 1                 a01 b1(o1) · a12 b2(o2) · a21 b1(o3)
P4     1 2 2                 a01 b1(o1) · a12 b2(o2) · a22 b2(o3)
P5     2 1 1                 a02 b2(o1) · a21 b1(o2) · a11 b1(o3)
P6     2 1 2                 a02 b2(o1) · a21 b1(o2) · a12 b2(o3)
P7     2 2 1                 a02 b2(o1) · a22 b2(o2) · a21 b1(o3)
P8     2 2 2                 a02 b2(o1) · a22 b2(o2) · a22 b2(o3)

p(O|Λ) = Σ_{P_i} P{O, P_i|Λ} = Σ_{P_i} P{O|P_i, Λ} · P{P_i|Λ}

Forward Algorithm ⇒ avoid repeated calculations:

a01 b1(o1) · a11 b1(o2) · a11 b1(o3) + a02 b2(o1) · a21 b1(o2) · a11 b1(o3)   (two multiplications by a11 b1(o3))
= [ a01 b1(o1) · a11 b1(o2) + a02 b2(o1) · a21 b1(o2) ] · a11 b1(o3)   (one multiplication)
Forward Algorithm – Recursion
Let α1(t = 1) = a01 b1(o1)
Let α2(t = 1) = a02 b2(o1)

Recursion: α1(t = 2) = [ a01 b1(o1) · a11 + a02 b2(o1) · a21 ] · b1(o2)
                     = [ α1(t = 1) · a11 + α2(t = 1) · a21 ] · b1(o2)
General Recursion in Forward Algorithm
α_j(t) = [ Σ_i α_i(t − 1) a_ij ] · b_j(o_t)
       = P{o_1, o_2, . . . , o_t, s_t = j | Λ}

Note: α_j(t) = sum of the probabilities of all paths ending at node j at time t with partial observation sequence o_1, o_2, . . . , o_t.

The probability of the entire observation (o_1, o_2, . . . , o_T), therefore, is

p(O|Λ) = Σ_{j=1}^{N} P{o_1, o_2, . . . , o_T, s_T = j | Λ} = Σ_{j=1}^{N} α_j(T)

where N = no. of nodes
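In code the recursion is a few lines; a sketch (Python/NumPy) where `a0` holds the entry probabilities a_0j and `b` is a T×N matrix of precomputed state likelihoods b_j(o_t), both names being my own conventions:

```python
import numpy as np

def forward(a0, A, b):
    """Forward algorithm. a0: (N,) entry probs, A: (N, N) transitions,
    b: (T, N) with b[t, j] = b_j(o_t). Returns alpha (T, N) and p(O | Lambda)."""
    T, N = b.shape
    alpha = np.zeros((T, N))
    alpha[0] = a0 * b[0]                      # alpha_j(1) = a_0j b_j(o_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]  # [sum_i alpha_i(t-1) a_ij] * b_j(o_t)
    return alpha, alpha[-1].sum()             # p(O|Lambda) = sum_j alpha_j(T)
```

For long sequences one would work with log probabilities or per-frame scaling to avoid underflow.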
Backward Algorithm
• analogous to Forward, but coming from the last time instant T
Example: a01 b1(o1) · a11 b1(o2) · a11 b1(o3) + a01 b1(o1) · a11 b1(o2) · a12 b2(o3) + . . .
= [ a01 b1(o1) · a11 b1(o2) ] · ( a11 b1(o3) + a12 b2(o3) )

β1(t = 2) = p{o3 | s_{t=2} = 1, Λ}
          = p{o3, s_{t=3} = 1 | s_{t=2} = 1, Λ} + p{o3, s_{t=3} = 2 | s_{t=2} = 1, Λ}
          = p{o3 | s_{t=3} = 1, s_{t=2} = 1, Λ} · p{s_{t=3} = 1 | s_{t=2} = 1, Λ}
            + p{o3 | s_{t=3} = 2, s_{t=2} = 1, Λ} · p{s_{t=3} = 2 | s_{t=2} = 1, Λ}
          = b1(o3) · a11 + b2(o3) · a12
General Recursion in Backward Algorithm
β_j(t): given that we are at node j at time t ⇒ sum of the probabilities of all paths such that the partial sequence o_{t+1}, . . . , o_T is observed.

β_i(t) = Σ_{j=1}^{N} [ a_ij b_j(o_{t+1}) ] · β_j(t + 1)
       = p{o_{t+1}, . . . , o_T | s_t = i, Λ}

(The factor a_ij b_j(o_{t+1}) accounts for going to each node j from the ith node; β_j(t + 1) is the prob. of the observations o_{t+2}, . . . , o_T given that we are in node j at t + 1.)
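The mirror image of the forward sketch, with the same array conventions:

```python
import numpy as np

def backward(A, b):
    """Backward algorithm. A: (N, N) transitions, b: (T, N) state likelihoods.
    Returns beta (T, N) with beta[t, i] = p(o_{t+1}..o_T | s_t = i)."""
    T, N = b.shape
    beta = np.zeros((T, N))
    beta[-1] = 1.0                              # beta_i(T) = 1 by convention
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])  # sum_j a_ij b_j(o_{t+1}) beta_j(t+1)
    return beta
```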
Estimation of Parameters of HMM Model
• Given known model parameters Λ:
  – Evaluate p(O|Λ) ⇒ useful for classification
  – Efficient implementation: use the Forward or Backward Algo.
• Given a set of observation vectors o_t, how do we estimate the parameters of the HMM?
  – Do not know which states the o_t come from
    ∗ Analogous to GMM – do not know which component
  – Use a special case of EM – the Baum-Welch Algorithm
  – Use the following relation from Forward/Backward:

p(O|Λ) = Σ_{j=1}^{N} α_j(T) = Σ_{j=1}^{N} α_j(t) β_j(t)   (for any t)
Parameter Estimation for Known State Sequence
Assume each state is modelled as a single Gaussian:
µj = Sample mean of observations assigned to state j.
σ_j² = Variance of the observations assigned to state j.
and
Trans. Prob. from state i to j = (No. of times a transition was made from i to j) / (Total no. of times we made a transition from i)

In practice, since we do not know which state generated each observation
⇒ we will do probabilistic assignment.
Review of GMM Parameter Estimation
Do not know which component of the GMM generated output observation.
Given initial model parameters Λg, and observation sequence x1, . . . , xT .
Find probability xi comes from component j ⇒ Soft Assignment
p(component = 1 | x_i; Λg) = p(component = 1, x_i | Λg) / p(x_i | Λg)

So, the re-estimation equations are:

c_j^new = (1/T) Σ_{i=1}^{T} p(comp = j | x_i; Λg)

µ_j^new = Σ_{i=1}^{T} x_i p(comp = j | x_i; Λg) / Σ_{i=1}^{T} p(comp = j | x_i; Λg)

(σ_j²)^new = Σ_{i=1}^{T} (x_i − µ_j^new)² p(comp = j | x_i; Λg) / Σ_{i=1}^{T} p(comp = j | x_i; Λg)
A similar analogy holds for hidden Markov models
Baum-Welch Algorithm
Here: We do not know which observation ot comes from which state si
Again like GMM we will assume initial guess parameter Λg
Then prob. of being in “state=i at time=t” and “state=j at time=t+1” is
τ_t(i, j) = p{q_t = i, q_{t+1} = j | O, Λg} = p{q_t = i, q_{t+1} = j, O | Λg} / p{O | Λg}

where p{O|Λg} = Σ_i α_i(T)
Baum-Welch Algorithm (contd.)
From the ideas of the Forward-Backward Algorithm, the numerator is

p{q_t = i, q_{t+1} = j, O | Λg} = α_i(t) · a_ij b_j(o_{t+1}) · β_j(t + 1)

So τ_t(i, j) = α_i(t) · a_ij b_j(o_{t+1}) · β_j(t + 1) / p{O|Λg}
Estimating Transition Probability
Trans. Prob. from state i to j = (No. of times a transition was made from i to j) / (Total no. of times we made a transition from i)

τ_t(i, j) ⇒ prob. of being in state i at time t and state j at time t+1

If we sum τ_t(i, j) over all time instants, we get the expected number of times the system was in the ith state and made a transition to the jth state. So, a revised estimate of the transition probability is

a_ij^new = Σ_{t=1}^{T−1} τ_t(i, j) / Σ_{t=1}^{T−1} ( Σ_{j=1}^{N} τ_t(i, j) )

(the inner sum in the denominator counts all transitions out of i at time t)
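A sketch of computing τ and the transition update from the forward/backward passes above (same array conventions; a single observation sequence):

```python
import numpy as np

def baum_welch_transitions(alpha, beta, A, b):
    """Re-estimate A using alpha/beta from the forward/backward sketches.
    tau[t, i, j] = p(q_t = i, q_{t+1} = j | O, Lambda_g)."""
    pO = alpha[-1].sum()                       # p(O | Lambda_g)
    T = alpha.shape[0]
    tau = np.array([alpha[t][:, None] * A * (b[t + 1] * beta[t + 1])[None, :]
                    for t in range(T - 1)]) / pO          # shape (T-1, N, N)
    # numerator: sum over t; denominator: all transitions out of i
    return tau.sum(axis=0) / tau.sum(axis=(0, 2))[:, None]
```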
Estimating State-Density Parameters
Analogous to GMM (which observation belonged to which component), the new estimates for the state pdf parameters are (assuming a single Gaussian):

µ_i = Σ_{t=1}^{T} γ_i(t) o_t / Σ_{t=1}^{T} γ_i(t)

Σ_i = Σ_{t=1}^{T} γ_i(t) (o_t − µ_i)(o_t − µ_i)ᵀ / Σ_{t=1}^{T} γ_i(t)

where γ_i(t) = Σ_j τ_t(i, j).

These are weighted averages ⇒ weighted by the prob. γ_i(t) of being in state i at time t
– Given observation ⇒ HMM model parameters estimated iteratively
– p(O|Λ) ⇒ evaluated efficiently by Forward/Backward algorithm
Viterbi Algorithm
Given the observation sequence,
• the goal is to find corresponding state-sequence that generated it
• there are many possible combinations (N^T) of state sequences ⇒ many paths.

One possible criterion: choose the state sequence corresponding to the path with maximum probability:

max_i P{O, P_i|Λ}

Word : represented as a sequence of phones
Phone : represented as a sequence of states

Optimal state sequence ⇒ optimal phone sequence ⇒ word sequence
Viterbi Algorithm and Forward Algorithm
Recall the Forward Algorithm: we found the probability over each path and summed over all possible paths:

Σ_{i=1}^{N^T} p{O, P_i|Λ}

Viterbi is just a special case of the Forward algo.:
at each node, instead of the sum of the probs. of all paths, choose the path with max prob.

In Practice: p(O|Λ) is often approximated by Viterbi (instead of the Forward Algo.)
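A sketch mirroring the forward pass, with max in place of sum plus a traceback (same conventions as the forward sketch):

```python
import numpy as np

def viterbi(a0, A, b):
    """Most-likely state sequence. a0: (N,) entry probs, A: (N, N) transitions,
    b: (T, N) state likelihoods. Returns the best path and its probability."""
    T, N = b.shape
    delta = np.zeros((T, N))           # best path prob ending in state j at time t
    psi = np.zeros((T, N), dtype=int)  # argmax backpointers
    delta[0] = a0 * b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A      # (N, N): from state i to state j
        psi[t] = scores.argmax(axis=0)          # best predecessor for each j
        delta[t] = scores.max(axis=0) * b[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):               # trace the backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()
```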
Decoding
• Recall: the desired transcription W is obtained by maximising

W = arg max_W p(W|O)
• Search over all possible W – astronomically large!
• Viterbi Search – find most likely path through a HMM
– Sequence of phones (states) which is most probable
– Mostly: most probable sequence of phones correspond to most probable
sequence of words
Training of a Speech Recognition System
– HMM parameters estimated using large databases – 100 hours
∗ Parameters estimated using Maximum Likelihood Criterion
Recognition
Speaker Recognition
Spectra (formants) of a given sound are different for different speakers.
Spectra of 2 speakers for one “frame” of /iy/
Derive a speaker-dependent model of a new speaker by MAP adaptation of the Speaker-Independent (SI) model using a small amount of adaptation data; use it for speaker recognition.
References
• Pattern Classification, R.O.Duda, P.E.Hart and D.G.Stork, John Wiley, 2001.
• Introduction to Statistical Pattern Recognition, K.Fukunaga, Academic Press,
1990.
• The EM Algorithm and Extensions, Geoffrey J. McLachlan and Thriyambakam
Krishnan, Wiley-Interscience; 2nd edition, 2008. ISBN-10: 0471201707
• Fundamentals of Speech Recognition, Lawrence Rabiner & Biing-Hwang Juang,
Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing Series), c1993, ISBN
0-13-015157-2
• Hidden Markov models for speech recognition, X.D. Huang, Y. Ariki, M.A. Jack.
Edinburgh: Edinburgh University Press, c1990.
• Statistical methods for speech recognition, F.Jelinek, The MIT Press, Cambridge,
MA., 1998.
• Maximum Likelihood from Incomplete Data via the EM Algorithm, A.P. Dempster, N.M. Laird and D.B. Rubin, J. Royal Statistical Soc. B, 39(1), pp. 1-38, 1977.
• Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, J.-L. Gauvain and C.-H. Lee, IEEE Trans. SAP, 2(2), pp. 291-298, 1994.
• A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, J.A. Bilmes, ICSI, TR-97-021.
• Boosting GMM and Its Two Applications, F.Wang, C.Zhang and N.Lu in
N.C.Oza et al. (Eds.) LNCS 3541, pp. 12-21, 2005.