Chapter III
Hidden Markov Model [2]
This section describes a method to train on and recognize speech utterances from
given observations, Ot ∈ ℝ^Q, where t is a time index and Q is the vector
dimension. A complete sequence of observations used to describe the utterance will
be denoted O=(O1, O2, …, OT). The utterance may be a word, a phoneme, a complete
sentence or paragraph. The method described here is the Hidden Markov Model
(HMM). The HMM is a stochastic approach which models the given problem as a
“doubly stochastic process” in which the observed data are thought to be the result
of having passed the “true” (hidden) process through a second process. Both
processes are to be characterized using only the one that could be observed. The
problem with this approach is that one does not know anything about the Markov
chains that generate the speech. The number of states in the model is unknown, the
probabilistic functions are unknown, and one cannot tell from which state an
observation was produced. These properties are hidden, hence the name
Hidden Markov Model.
III.1. Discrete Markov Process
Consider a system which may be described at any time as being one of a set of N
distinct states, S1, S2, …, SN, as illustrated in figure III.1. At regularly spaced
discrete times, the system undergoes a change of state (possibly back to the same
state) according to a set of probabilities associated with that state. Denote the time
instants associated with state changes as t=1, 2, …, and denote the actual state at
time t as qt. A full probabilistic description of the above system would, in general,
require specification of the current state (at time t), as well as all the predecessor
states. For the special case of a discrete, first order, Markov chain, this probabilistic
description is truncated to just the current and the predecessor state, i.e.,
Figure III.1. A Markov chain with 5 states with selected state transitions.
P[qt=Sj|qt-1=Si, qt-2=Sk, …] = P[qt=Sj|qt-1=Si]. (III.1)
Furthermore, we only consider those processes in which the right-hand side of (III.1)
is independent of time, thereby leading to the set of state transition probabilities aij
of the form

aij = P[qt=Sj|qt-1=Si], 1 ≤ i, j ≤ N (III.2)

with the state transition coefficients having the properties

aij ≥ 0 (III.3a)

Σ_{j=1}^{N} aij = 1 (III.3b)

since they obey standard stochastic constraints.
The above stochastic process could be called an observable Markov model since the
output of the process is the set of states at each instant of time, where each state
corresponds to a physical (observable) event. To fix ideas, consider a simple 3-state Markov model of the weather. We assume that once a day (e.g., at noon), the
weather is observed as being one of the following:
State 1: rain or snow
State 2: cloudy
State 3: sunny.
We postulate that the weather on day t is characterized by a single one of the three
states above, and that the matrix A of state transition probabilities is
A = {aij} = | 0.4  0.3  0.3 |
            | 0.2  0.6  0.2 |
            | 0.1  0.1  0.8 |
Given that the weather on day 1 (t=1) is sunny (state3), we can ask the question:
What is the probability (according to the model) that the weather for the next 7 days
will be “sun-sun-rain-rain-sun-cloudy-sun”? Stated more formally, we define the
observation sequence O as O={S3, S3, S3, S1, S1, S3, S2, S3} corresponding to t=1, 2, …,
8, and we wish to determine the probability of O, given the model. This probability
can be expressed (and evaluated) as
P(O|Model) = P[S3,S3,S3,S1,S1,S3,S2,S3|Model]
= P[S3]·P[S3|S3]·P[S3|S3]·P[S1|S3]·P[S1|S1]·P[S3|S1]·P[S2|S3]·P[S3|S2]
= π3·a33·a33·a31·a11·a13·a32·a23
= 1·(0.8)·(0.8)·(0.1)·(0.4)·(0.3)·(0.1)·(0.2)
= 1.536×10⁻⁴

where we use the notation
πi = P[q1=Si], 1 ≤ i ≤ N (III.4)
to denote the initial state probabilities.
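As a sketch, the weather-chain calculation above can be reproduced in a few lines of plain Python. The matrix A and the state coding come from the example; the function name is just illustrative:

```python
# States: 0 = rain/snow, 1 = cloudy, 2 = sunny; A is the transition matrix from the text.
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_probability(states, A, pi):
    """P(O|Model) for an observed state sequence of an observable Markov chain."""
    p = pi[states[0]]                        # initial state probability
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]                    # chain rule with first-order Markov property
    return p

# O = {S3,S3,S3,S1,S1,S3,S2,S3}; day 1 is known to be sunny, so pi = (0, 0, 1).
O = [2, 2, 2, 0, 0, 2, 1, 2]
pi = [0.0, 0.0, 1.0]
print(sequence_probability(O, A, pi))  # matches the text's 1.536 × 10⁻⁴ up to float rounding
```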
Another interesting question we can ask (and answer using the model) is: Given
that the model is in a known state, what is the probability it stays in that state for
exactly d days? This probability can be evaluated as the probability of the
observation sequence
O = {Si (t=1), Si (t=2), Si (t=3), …, Si (t=d), Sj≠Si (t=d+1)},

given the model, which is

P(O|Model, q1=Si) = (aii)^(d-1) (1-aii) = pi(d). (III.5)
The quantity pi(d) is the (discrete) probability density function of duration d in state
i. The exponential duration density is characteristic of the state duration in a
Markov chain. Based on pi(d), we can readily calculate the expected number of
observations in a state, conditioned on starting in that state as
d̄i = Σ_{d=1}^{∞} d pi(d) (III.6a)

   = Σ_{d=1}^{∞} d (aii)^(d-1) (1-aii) = 1/(1-aii). (III.6b)
Thus the expected number of consecutive days of sunny weather, according to the
model, is 1/(0.2)=5; for cloudy it is 2.5; for rain it is 1.67.
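The result in (III.6b) can be checked numerically with a short sketch, truncating the infinite sum at a large upper limit (accurate here because the geometric tail vanishes quickly):

```python
# Numerical check of (III.6): the mean duration under p_i(d) = a_ii^(d-1)(1 - a_ii)
# equals 1/(1 - a_ii).
def expected_duration(a_ii, max_d=10_000):
    return sum(d * a_ii ** (d - 1) * (1 - a_ii) for d in range(1, max_d + 1))

for a_ii, label in [(0.8, "sunny"), (0.6, "cloudy"), (0.4, "rainy")]:
    print(label, expected_duration(a_ii), 1 / (1 - a_ii))
# matching the text's 5, 2.5, and 1.67 days
```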
III.2. Hidden Markov Models [1]
So far we have considered Markov models in which each state corresponded to an
observable (physical) event. This model is too restrictive to be applicable to many
problems. In this section we extend the concept of Markov models to include the
case where the observation is a probabilistic function of the state –i.e., the resulting
model (which is called a hidden Markov model) is a doubly embedded stochastic
process with an underlying stochastic process that is not observable (it is hidden),
but can only be observed through another set of stochastic processes that produce
the sequence of observations. To fix ideas, consider the following model of some
simple coin tossing experiments.
Coin Toss Models: Assume the following scenario. You are in a room with a barrier
(e.g., a curtain) through which you cannot see what is happening. On the other side
of the barrier is another person who is performing a coin (or multiple coin) tossing
experiment. The other person will not tell you anything about what he is doing
exactly; he will only tell you the result of each coin flip. Thus a sequence of hidden
coin tossing experiments is performed, with the observation sequence consisting of
a series of heads and tails; e.g., a typical observation sequence would be
O = O1 O2 O3 . . . OT
  = H H T T T H H T . . . T

where H stands for heads and T stands for tails.
Given the above scenario, the problem of interest is how do we build an HMM to
explain (model) the observed sequence of heads and tails. The first problem one
faces is deciding what the states in the model correspond to, and then deciding how
many states should be in the model. One possible choice would be to assume that
only a single biased coin was being tossed. In this case we could model the situation
with a 2-state model where each state corresponds to a side of the coin (i.e., head or
tail). This model is depicted in figure III.2a. In this case the Markov model is
observable, and the only issue for complete specification of the model would be to
decide on the best value for the bias (i.e., the probability of, say, heads).
Interestingly, an equivalent HMM to that of figure III.2a would be a degenerate 1-
state model, where the state corresponds to the single biased coin, and the unknown
parameter is the bias of the coin.
A second form of HMM for explaining the observed sequence of coin toss outcomes
is given in Figure III.2(b). In this case there are 2 states in the model and each state
corresponds to a different, biased, coin being tossed. Each state is characterized by a
probability distribution of heads and tails, and transitions between states are
characterized by a state transition matrix. The physical mechanism which accounts
for how state transitions are selected could itself be a set of independent coin tosses,
or some other probabilistic event.
A third form of HMM for explaining the observed sequence of coin toss outcomes is
given in figure III.2 (c). This model corresponds to using 3 biased coins, and
choosing from among the three, based on some probabilistic event.
Figure III.2. Three possible Markov models which can account for the results of hidden coin
tossing experiments. (a) 1-coin model. (b) 2-coins model. (c) 3-coins model.
Given the choice among the three models shown in figure III.2 for explaining the
observed sequence of heads and tails, a natural question would be which model best
matches the actual observations. It should be clear that the simple 1-coin model of
figure III.2a has only 1 unknown parameter; the 2-coin model of figure III.2b has 4
unknown parameters; and the 3-coin model of figure III.2c has 9 unknown
parameters. Thus, with the greater degrees of freedom, the larger HMMs would
seem to inherently be more capable of modeling a series of coin tossing experiments
than would equivalently smaller models. Although this is theoretically true, we will
see later that practical considerations impose some strong limitations on the size of
models that we can consider. Furthermore, it might just be the case that only a
single coin is being tossed. Then using the 3-coin model of figure III.2c would be
inappropriate, since the actual physical event would not correspond to the model
being used –i.e., we would be using an underspecified system.
The Urn and Ball Model: To extend the ideas of the HMM to a somewhat more
complicated situation, consider the urn and ball system of figure III.3. We assume
that there are N (large) glass urns in a room. Within each urn there are a large
number of colored balls. We assume there are K distinct colors of the balls. The
physical process for obtaining observations is as follows. A genie is in the room,
and according to some random process, he (or she) chooses an initial urn. From this
urn, a ball is chosen at random, and its color is recorded as the observation. The ball
is then replaced in the urn from which it was selected. A new urn is then selected
according to the random selection process associated with the current urn, and the
ball selection process is repeated. This entire process generates a finite observation
sequence of colors, which we would like to model as the observable output of an
HMM.
It should be obvious that the simplest HMM that corresponds to the urn and ball
process is one in which each state corresponds to a specific urn, and for which a
(ball) color probability is defined for each state. The choice of urns is
dictated by the state transition matrix of the HMM.
III.2.1. Discrete Observation Densities
The urn and ball example described in previous section is an example of a discrete
observation density HMM. This is because there are K distinct colors. In general the
discrete observation density HMMs are based on partitioning the probability density
function (pdf) of observations into a discrete set of small cells and symbols v1 , v2,
…, vK, one symbol representing each cell. This partitioning subject is usually called
vector quantization. After a vector quantization is performed, a codebook is created
of the mean vectors for every cluster.
The corresponding symbol for the observation is determined by the nearest neighbor
rule, i.e., select the symbol of the cell with the nearest codebook vector. To make a
parallel to the urn and ball model, this means that if a dark gray ball is observed,
it will probably be closest to the black color. In this case the symbols v1, v2, …, vK
are represented by one color each (e.g. v1= RED). The observation symbol
Figure III.3. An N-state urn and ball model which illustrates the
general case of a discrete symbol HMM.
probability distribution, B = {bj(k)}, j = 1, …, N, will now have the symbol distribution at
state j, bj(ot), defined as:
bj(ot)=bj(k)=P(ot=vk|qt=j), 1 ≤ k ≤ K (III.7)
The estimation of the probabilities bj(k) is normally accomplished in two steps, first
the determination of the codebook and then the estimation of the sets of observation
probabilities for each codebook vector in each state.
In this project, the codebook will be determined by the K-means algorithm.
The K-Means Algorithm
1. Initialization
Choose K vectors at random from the training vectors, here denoted x. These
vectors will serve as the initial centroids µk.
2. Recursion
For each vector in the training set, assign the vector to a cluster k. This is
done by choosing the cluster whose centroid is closest to the vector:

k* = argmin_k d(x, µk) (III.8)

where d(x, µk) is a distance measure; here the Euclidean distance measure is used:

d(x, µk) = (x − µk)^T (x − µk) (III.9)
3. Test
Recompute the centroids µk by taking the mean of the vectors that belong to each
centroid. This is done for every µk. If no vector belongs to some µk, create a new
µk by choosing a random vector from x. If there has been no change of the
centroids from the previous step, terminate; otherwise go back to step 2.
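A minimal pure-Python sketch of the three steps above; the data and function names are illustrative, and step 3's re-seeding of empty clusters follows the rule in the text:

```python
import random

def euclidean_sq(x, mu):
    """Squared Euclidean distance d(x, mu) = (x - mu)^T (x - mu), as in (III.9)."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def k_means(vectors, K, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(vectors, K)          # step 1: K random training vectors
    while True:
        # step 2: assign each vector to the nearest centroid (III.8)
        clusters = [[] for _ in range(K)]
        for x in vectors:
            k_star = min(range(K), key=lambda k: euclidean_sq(x, centroids[k]))
            clusters[k_star].append(x)
        # step 3: recompute centroids; re-seed any empty cluster with a random vector
        new_centroids = []
        dim = len(vectors[0])
        for k in range(K):
            if clusters[k]:
                new_centroids.append(tuple(
                    sum(x[d] for x in clusters[k]) / len(clusters[k])
                    for d in range(dim)))
            else:
                new_centroids.append(rng.choice(vectors))
        if new_centroids == centroids:          # termination: centroids unchanged
            return centroids
        centroids = new_centroids

data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
print(sorted(k_means(data, K=2)))
```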
III.2.2. Continuous Observation Densities
To create continuous observation density HMMs, bj(ot) are created as some
parametric probability density functions (pdf) or mixtures of them. The most
general representation of the pdf, for which a reestimation procedure has been
formulated, is a finite mixture of the form:
bj(ot) = Σ_{k=1}^{K} cjk bjk(ot), j = 1, 2, …, N (III.10)

where K is the number of mixtures and the following stochastic constraints for the
mixture weights, cjk, hold:

Σ_{k=1}^{K} cjk = 1, j = 1, 2, …, N
cjk ≥ 0, j = 1, 2, …, N, k = 1, 2, …, K (III.11)

and bjk(ot) is a D-dimensional log-concave or elliptically symmetric density with
mean vector µjk and covariance matrix Σjk:

bjk(ot) = N(ot, µjk, Σjk) (III.12)

The most used D-dimensional log-concave or elliptically symmetric density is the
Gaussian density, which can be written as:

bjk(ot) = N(ot, µjk, Σjk) = (1 / ((2π)^(D/2) |Σjk|^(1/2))) exp(−(1/2) (ot − µjk)^T Σjk^(-1) (ot − µjk)) (III.13)
For simple observation sources, Gaussian mixtures provide an easy way to gain
considerable accuracy, owing to the flexibility and convenient estimation of the
pdfs. If the observation source generates a complicated high-dimensional pdf,
Gaussian mixtures become computationally difficult to treat, due to the excessive
number of parameters and large covariance matrices.
24
As the length of the feature vectors increases, the size of the covariance
matrices grows with the square of the vector dimension. If feature vectors are
designed to avoid redundant components, the off-diagonal elements of the
covariance matrices are usually small. This suggests approximating the covariance
matrices by diagonal matrices. Diagonality also provides a simpler and faster
implementation:
bjk(ot) = N(ot, µjk, Σjk) = (1 / ((2π)^(D/2) (Π_{l=1}^{D} σ²jkl)^(1/2))) exp(−(1/2) Σ_{l=1}^{D} (otl − µjkl)² / σ²jkl) (III.14)

where σ²jk1, σ²jk2, …, σ²jkD are the diagonal elements of the covariance matrix Σjk.
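A sketch of evaluating a mixture density per (III.10) with diagonal-covariance Gaussian components per (III.14); all parameter values below are made up for illustration:

```python
import math

def diag_gauss(o, mean, var):
    """Diagonal-covariance Gaussian density, as in (III.14)."""
    D = len(o)
    norm = (2 * math.pi) ** (D / 2) * math.sqrt(math.prod(var))
    expo = -0.5 * sum((o[l] - mean[l]) ** 2 / var[l] for l in range(D))
    return math.exp(expo) / norm

def mixture_density(o, weights, means, vars_):
    """b_j(o_t) = sum_k c_jk N(o_t; mu_jk, sigma_jk^2), as in (III.10)."""
    return sum(c * diag_gauss(o, m, v) for c, m, v in zip(weights, means, vars_))

# Illustrative 2-mixture, 2-dimensional state (made-up parameters):
weights = [0.6, 0.4]                      # c_j1, c_j2: sum to 1
means = [(0.0, 0.0), (3.0, 3.0)]
vars_ = [(1.0, 1.0), (0.5, 2.0)]
print(mixture_density((0.1, -0.2), weights, means, vars_))
```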
III.2.3. Elements of an HMM
The above examples give us a pretty good idea of what an HMM is and how it can
be applied to some simple scenarios. We now formally define the elements of an
HMM, and explain how the model generates observation sequences.
An HMM is characterized by the following:
1) N, the number of states in the model. Although the states are hidden, for
many practical applications there is often some physical significance attached
to the states or to sets of states of the model. Hence, in the coin tossing
experiments, each state corresponded to a distinct biased coin. In the urn and
ball model, the states corresponded to the urns. Generally the states are
interconnected in such a way that any state can be reached from any other
state (e.g., an ergodic model); however, we will see later in this paper that
other possible interconnections of states are often of interest. We denote the
individual states as S = {S1, S2, …, SN}, and the state at time t as qt.
2) K, the number of distinct observation symbols per state, i.e., the discrete
alphabet size. The observation symbols correspond to the physical output of
the system being modeled. For the coin toss experiments the observation
symbols were simply heads or tails; for the ball and urn model they were the
colors of balls selected from the urns. We denote the individual symbols as
V={v1, v2, …, vK}.
3) The state transition probability distribution A={aij} where
aij=P[qt+1=Sj|qt=Si], 1≤ i, j ≤ N (III.15)
For the special case where any state can reach any other state in a single step,
we have aij>0 for all i,j. For other types of HMMs, we would have aij=0 for
one or more (i, j) pairs.
4) The observation symbol probability distribution in state j, B={bj(k)}, where
bj(k) = P[vk at t|qt = Sj], 1≤ j ≤ N ; 1 ≤ k ≤ K (III.16)
5) The initial state distribution π = {π i} where
πi=P[q1=Si], 1≤ i ≤ N (III.17)
Given appropriate values of N, K, A, B, and π, the HMM can be used as a generator
to give an observation sequence
O= O1 O2 … OT
(where each observation Ot is one of the symbols from V, and T is the number of
observations in the sequence) as follows:
a. Choose an initial state q1=Si according to the initial state distribution π.
b. Set t=1.
c. Choose Ot = vk according to the symbol probability distribution in state
Si, i.e., bi(k).
d. Transit to a new state qt+1 =Sj according to the state transition probability
distribution for state Si, i.e., aij.
e. Set t=t+1; return to step c if t&lt;T; otherwise terminate the procedure.
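Steps a-e can be sketched as follows for a discrete HMM; the 2-state, 2-symbol parameters are hypothetical, loosely in the spirit of the two-biased-coins scenario:

```python
import random

def generate(A, B, pi, T, seed=0):
    """Generate O = O_1 ... O_T from a discrete HMM, following steps a-e above."""
    rng = random.Random(seed)
    states = list(range(len(A)))
    symbols = list(range(len(B[0])))
    q = rng.choices(states, weights=pi)[0]          # a. initial state from pi
    observations = []
    for _ in range(T):                              # b./e. time loop
        o = rng.choices(symbols, weights=B[q])[0]   # c. emit symbol via b_q(k)
        observations.append(o)
        q = rng.choices(states, weights=A[q])[0]    # d. transit via a_ij
    return observations

# Hypothetical 2-state, 2-symbol model (made-up parameters):
A  = [[0.9, 0.1], [0.2, 0.8]]
B  = [[0.7, 0.3], [0.1, 0.9]]   # rows: states; columns: P(symbol k | state)
pi = [0.5, 0.5]
print(generate(A, B, pi, T=10))
```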
The above procedure can be used as both a generator of observations and as a model
for how a given observation sequence was generated by an appropriate HMM.
It can be seen from the above discussion that a complete specification of an HMM
requires specification of two model parameters (N and K), specification of
observation symbols, and the specification of the three probability measures A, B,
and π. For convenience, we use the compact notation
λ=(A, B, π)
to indicate the complete parameter set of the model.
III.3. The Three Basic Problems for HMMs [2]
Given the form of HMM of the previous section, there are three basic problems of
interest that must be solved for the model to be useful in real-world applications.
These problems are the following:
Problem 1: Given the observation sequence O=O1O2…OT, and a model λ=(A, B,
π), how do we efficiently compute P(O|λ), the probability of the observation
sequence, given the model?
Problem 2: Given the observation sequence O=O1O2…OT, and the model λ, how
do we choose a corresponding state sequence Q=q1 q2 … qT which is optimal in some
meaningful sense (i.e., best “explains” the observations)?
Problem 3: How do we adjust the model parameters λ=(A, B, π) to maximize
P(O|λ)?
III.3.1. Solution to problem 1
We wish to calculate the probability of the observation sequence, O=O1O2…OT ,
given the model λ, i.e., P(O|λ). The most straightforward way of doing this is
through enumerating every possible state sequence of length T (the number of
observations). Consider one such fixed state sequence
Q=q1 q2 ... qT
where q1 is the initial state. The probability of the observation sequence O for the
state sequence is
P(O|Q,λ) = Π_{t=1}^{T} P(Ot|qt,λ) (III.18)

where we have assumed statistical independence of observations. Thus we get

P(O|Q,λ) = bq1(O1)·bq2(O2) … bqT(OT). (III.19)

The probability of such a state sequence Q can be written as

P(Q|λ) = πq1 aq1q2 aq2q3 … aq(T-1)qT. (III.20)
The joint probability of O and Q, i.e., the probability that O and Q occur
simultaneously, is simply the product of the above two terms, i.e.,
P(O,Q|λ)=P(O|Q,λ)P(Q|λ). (III.21)
The probability of O (given the model) is obtained by summing this joint
probability over all possible state sequences Q, giving
P(O|λ) = Σ_{all Q} P(O|Q,λ) P(Q|λ)
       = Σ_{q1,q2,…,qT} πq1 bq1(O1) aq1q2 bq2(O2) … aq(T-1)qT bqT(OT) (III.22)
The interpretation of the computation in the above equation is the following.
Initially (at time t=1) we are in state q1 with probability πq1, and we generate the
symbol O1 (in this state) with probability bq1(O1). The clock changes from time t to
t+1 (t=2) and we make a transition to state q2 from state q1 with probability aq1q2,
and generate symbol O2 with probability bq2(O2). This process continues in this
manner until we make the last transition (at time T) from state qT-1 to state qT with
probability aq(T-1)qT and generate symbol OT with probability bqT(OT).
A little thought should convince the reader that the calculation of P(O|λ), according
to its direct definition (III.22), involves on the order of 2T·N^T calculations, since at
every t=1, 2, …, T, there are N possible states which can be reached (i.e., there are
N^T possible state sequences), and for each such state sequence about 2T calculations
are required for each term in the sum of (III.22). (To be precise, we need (2T−1)·N^T
multiplications, and N^T−1 additions.) This calculation is computationally infeasible,
even for small values of N and T; e.g., for N=5, T=100, there are on the order of
2·100·5^100 ≈ 10^72 computations! Clearly a more efficient procedure is required to
solve Problem 1. Fortunately such a procedure exists and is called the forward-
backward procedure.
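For very small N and T, the direct enumeration of (III.22) can actually be carried out, which makes the N^T growth concrete. A sketch with a made-up 2-state model:

```python
from itertools import product

def direct_probability(O, A, B, pi):
    """P(O|lambda) by enumerating all N^T state sequences, as in (III.22)."""
    N, T = len(A), len(O)
    total = 0.0
    for Q in product(range(N), repeat=T):           # all N^T state sequences
        p = pi[Q[0]] * B[Q[0]][O[0]]
        for t in range(1, T):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
        total += p
    return total

# Tiny made-up model: N=2 states, 2 symbols, T=4 observations (2^4 = 16 sequences).
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(direct_probability([0, 1, 1, 0], A, B, pi))
```

Summing this probability over every possible observation sequence of length T gives 1, which is a useful sanity check on the enumeration.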
The Forward-Backward Procedure: Consider the forward variable αt(i) defined as

αt(i) = P(O1 O2 … Ot, qt = Si | λ) (III.23)

i.e., the probability of the partial observation sequence O1 O2 … Ot (until time t) and
state Si at time t, given the model λ. We can solve for αt(i) inductively, as follows:
1) Initialization:

α1(i) = πi bi(O1), 1 ≤ i ≤ N. (III.24)

2) Induction:

αt+1(j) = [Σ_{i=1}^{N} αt(i) aij] bj(Ot+1), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N. (III.25)

3) Termination:

P(O|λ) = Σ_{i=1}^{N} αT(i). (III.26)
Step 1) initializes the forward probabilities as the joint probability of state Si and
initial observation O1. The induction step, which is the heart of the forward
calculation, is illustrated in Figure III.4 (a). This figure shows how state Sj can be
reached at time t+1 from the N possible states, Si, 1 ≤ i ≤ N, at time t. Since αt(i) is
the probability of the joint event that O1O2…Ot are observed and the state at time t
is Si, the product αt(i)aij is then the probability of the joint event that O1O2…Ot are
observed and state Sj is reached at time t+1 via state Si at time t. Summing this
product over all the N possible states Si, 1 ≤ i ≤ N at time t results in the probability
of Sj at time t+1 with all the accompanying previous partial observations. Once this
is done and Sj is known, it is easy to see that αt+1(j) is obtained by accounting for
observation Ot+1 in state j, i.e., by multiplying the summed quantity by the
probability bj(Ot+1). The computation of (III.25) is performed for all states j, 1 ≤ j ≤ N,
for a given t; the computation is then iterated for t=1, 2, …, T−1. Finally, step 3)
gives the desired calculation of P(O|λ) as the sum of the terminal forward variables
αT(i). This is the case since, by definition,

αT(i)=P(O1O2…OT, qT=Si|λ) (III.27)

and hence P(O|λ) is just the sum of the αT(i)’s.
Figure III.4 (a) Illustration of the sequence of operations required for the computation of the
forward variable αt+1(j). (b) Implementation of the computation of αt(i) in terms of a lattice of
observation times t and states i.
If we examine the computation involved in the calculation of αt(j), 1 ≤ t ≤ T, 1 ≤ j ≤
N, we see that it requires on the order of N²T calculations, rather than 2T·N^T as
required by the direct calculation. (Again, to be precise, we need N(N+1)(T−1)+N
multiplications and N(N−1)(T−1) additions.) For N=5, T=100, we need about 3000
computations for the forward method, versus 10^72 computations for the direct
calculation, a savings of about 69 orders of magnitude.
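The forward recursion (III.24)-(III.26) can be sketched as follows, with hypothetical 2-state, 2-symbol parameters:

```python
def forward(O, A, B, pi):
    """P(O|lambda) via the forward recursion (III.24)-(III.26); O(N^2 T) operations."""
    N, T = len(A), len(O)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]            # initialization (III.24)
    for t in range(1, T):                                     # induction (III.25)
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)                                         # termination (III.26)

# Hypothetical 2-state, 2-symbol model:
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(forward([0, 1, 1, 0], A, B, pi))
```

In practice the αt(i) underflow for long T, so implementations scale each time slot or work in log space.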
The forward probability calculation is, in effect, based upon the lattice (or trellis)
structure shown in Figure III.4 (b). The key is that since there are only N states
(nodes at each time slot in that lattice), all the possible state sequences will remerge
into these N nodes, no matter how long the observation sequence. At time t=1, we
need to calculate values of α1(i), 1 ≤ i ≤ N. At times t=2, 3, …, T, we only need to
calculate values of αt(j), 1 ≤ j ≤ N, where each calculation involves only N previous
values of αt-1(i) because each of N grid points is reached from the same N grid
points at the previous time slot.
In a similar way, we can consider a backward variable βt(i) defined as
βt(i) = P(Ot+1 Ot+2 . . . OT |qt=Si,λ) (III.28)
i.e., the probability of the partial observation sequence from t+1 to the end, given
state Si at time t and the model λ. Again we can solve for βt(i) inductively as
follows:
1) Initialization:

βT(i) = 1, 1 ≤ i ≤ N. (III.29)

2) Induction:

βt(i) = Σ_{j=1}^{N} aij bj(Ot+1) βt+1(j), t = T−1, T−2, …, 1, 1 ≤ i ≤ N. (III.30)
The initialization step 1) arbitrarily defines βT(i) to be 1 for all i. Step 2), which is
illustrated in Figure III.5, shows that in order to have been in state Si at time t, and
to account for the observation sequence from time t+1 on, you have to consider all
possible states Sj at time t+1, accounting for the transition from Si to Sj (the aij term),
as well as the observation Ot+1 in state j (the bj(Ot+1) term), and then account for the
remaining partial observation sequence from state j (the βt+1(j) term). We will see
later how the backward, as well as the forward, calculations are used extensively to
help solve fundamental Problems 2 and 3 of HMMs.
Again, the computation of βt(i), 1 ≤ t ≤ T, 1 ≤ i ≤ N, requires on the order of N²T
calculations, and can be computed in a lattice structure similar to that of figure III.4
(b).
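A sketch of the backward recursion (III.29)-(III.30); as a consistency check, P(O|λ) can also be recovered as Σi πi bi(O1) β1(i). The parameters below are made up:

```python
def backward(O, A, B):
    """Backward variables beta_t(i) via (III.29)-(III.30). Returns the full table."""
    N, T = len(A), len(O)
    beta = [[0.0] * N for _ in range(T)]
    beta[T - 1] = [1.0] * N                       # initialization (III.29)
    for t in range(T - 2, -1, -1):                # induction (III.30)
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

# Hypothetical 2-state, 2-symbol model:
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
O  = [0, 1, 1, 0]
beta = backward(O, A, B)
print(sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(len(A))))
```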
III.3.2. Solution to Problem 2
Unlike Problem 1, for which an exact solution can be given, there are several
possible ways of solving Problem 2, namely finding the “optimal” state sequence
associated with the given observation sequence. The difficulty lies with the
definition of the optimal state sequence; i.e., there are several possible optimality
criteria. For example, one possible optimality criterion is to choose the states qt
which are individually most likely. This optimality criterion maximizes the
expected number of correct individual states. To implement this solution to Problem
2, we define the variable
γt(i)=P(qt=Si|O,λ) (III.31)

i.e., the probability of being in state Si at time t, given the observation sequence O
and the model λ. Equation (III.31) can be expressed simply in terms of the forward-
backward variables, i.e.,
γt(i) = αt(i) βt(i) / P(O|λ) = αt(i) βt(i) / Σ_{i=1}^{N} αt(i) βt(i) (III.32)
Figure III.5. Illustration of the sequence of operations required for the computation of the backward variable βt(i)
Here αt(i) accounts for the partial observation sequence O1O2…Ot and state Si at t,
while βt(i) accounts for the remainder of the observation sequence Ot+1Ot+2…OT,
given state Si at t. The normalization factor P(O|λ) = Σ_{i=1}^{N} αt(i) βt(i) makes
γt(i) a probability measure, so that

Σ_{i=1}^{N} γt(i) = 1. (III.33)
Using γt(i), we can solve for the individually most likely state qt at time t, as

qt = argmax_{1 ≤ i ≤ N} [γt(i)], 1 ≤ t ≤ T. (III.34)
Although (III.34) maximizes the expected number of correct states (by choosing the
most likely state for each t), there could be some problem with the resulting state
sequence. For example, when the HMM has state transitions which have zero
probability (aij=0 for some i and j), the “optimal” state sequence may, in fact, not
even be a valid state sequence. This is due to the fact that the solution of (III.34)
simply determines the most likely state at every instant, without regard to the
probability of occurrence of sequences of states.
One possible solution to the above problem is to modify the optimality criterion.
For example, one could solve for the state sequence that maximizes the expected
number of correct pairs of states (qt, qt+1), or triples of states (qt, qt+1, qt+2), etc.
Although these criteria might be reasonable for some applications, the most widely
used criterion is to find the single best state sequence (path), i.e., to maximize
P(Q|O,λ) which is equivalent to maximizing P(Q,O|λ). A formal technique for
finding this single best state sequence exists, based on dynamic programming
methods, and is called the Viterbi algorithm.
Viterbi Algorithm: To find the single best state sequence, Q={q1q2…qT}, for the
given observation sequence O={O1O2…OT}, we need to define the quantity
δt(i) = max_{q1,q2,…,qt-1} P[q1 q2 … qt = i, O1 O2 … Ot | λ] (III.35)
i.e., δt(i) is the best score (highest probability) along a single path, at time t, which
accounts for the first t observations and ends in state Si . By induction we have
δt+1(j) = [max_{i} δt(i) aij] · bj(Ot+1). (III.36)
To actually retrieve the state sequence, we need to keep track of the argument which
maximized (III.36), for each t and j. We do this via the array ψt(j). The complete
procedure for finding the best state sequence can now be stated as follows:
1) Initialization:

δ1(i) = πi bi(O1), 1 ≤ i ≤ N (III.37)

ψ1(i) = 0. (III.38)

2) Recursion:

δt(j) = max_{1 ≤ i ≤ N} [δt−1(i) aij] bj(Ot), 2 ≤ t ≤ T, 1 ≤ j ≤ N (III.39)

ψt(j) = argmax_{1 ≤ i ≤ N} [δt−1(i) aij], 2 ≤ t ≤ T, 1 ≤ j ≤ N. (III.40)

3) Termination:

P* = max_{1 ≤ i ≤ N} [δT(i)] (III.41)

qT* = argmax_{1 ≤ i ≤ N} [δT(i)]. (III.42)

4) Path (state sequence) backtracking:

qt* = ψt+1(qt+1*), t = T−1, T−2, …, 1. (III.43)
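The four Viterbi steps can be sketched as follows, with hypothetical parameters; δ and ψ are kept as plain lists:

```python
def viterbi(O, A, B, pi):
    """Single best state sequence via (III.37)-(III.43)."""
    N, T = len(A), len(O)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]        # initialization (III.37)
    psi = [[0] * N]                                       # (III.38)
    for t in range(1, T):                                 # recursion (III.39)-(III.40)
        new_delta, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][O[t]])
            back.append(best_i)
        delta, psi = new_delta, psi + [back]
    p_star = max(delta)                                   # termination (III.41)
    q = [max(range(N), key=lambda i: delta[i])]           # (III.42)
    for t in range(T - 1, 0, -1):                         # backtracking (III.43)
        q.append(psi[t][q[-1]])
    return p_star, q[::-1]

# Hypothetical 2-state, 2-symbol model:
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(viterbi([0, 0, 1, 1], A, B, pi))
```

As with the forward variables, implementations usually run this recursion with log probabilities so that the products become sums and do not underflow.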
III.3.3. Solution to Problem 3
The third, and by far the most difficult, problem of HMMs is to determine a method
to adjust the model parameters (A, B, π) to maximize the probability of the
observation sequence given the model. There is no known way to analytically solve
for the model which maximizes the probability of the observation sequence. In fact,
given any finite observation sequence as training data, there is no optimal way of
estimating the model parameters. We can, however, choose λ=(A,B,π) such that
P(O|λ) is locally maximized using an iterative procedure such as the Baum-Welch
method.
In order to describe the procedure for reestimation (iterative update and
improvement) of HMM parameters, we first define ξt(i,j), the probability of being
in state Si at time t and state Sj at time t+1, given the model and the observation
sequence, i.e.,

ξt(i,j) = P(qt=Si, qt+1=Sj | O, λ) (III.44)
The sequence of events leading to the conditions required by (III.44) is illustrated in
figure III.6. It should be clear, from the definitions of the forward and backward
variables, that we can write ξt(i,j) in the form
Figure III.6. Illustration of the sequence of operations required for the computation of
the joint event that the system is in state Si at time t and state Sj at time t+1.
ξt(i,j) = αt(i) aij bj(Ot+1) βt+1(j) / P(O|λ)
        = αt(i) aij bj(Ot+1) βt+1(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} αt(i) aij bj(Ot+1) βt+1(j) (III.45)
where the numerator term is just P(qt=Si, qt+1=Sj, O|λ) and the division by P(O|λ)
gives the desired probability measure.
We have previously defined γt(i) as the probability of being in state Si at time t,
given the observation sequence and the model; hence we can relate γt(i) to ξt(i,j) by
summing over j, giving

γt(i) = Σ_{j=1}^{N} ξt(i,j). (III.46)
If we sum γt(i) over the time index t, we get a quantity which can be interpreted as
the expected (over time) number of times that state Si is visited, or equivalently, the
expected number of transitions made from state Si (if we exclude the time slot t=T
from the summation). Similarly, summation of ξt(i,j) over t (from t=1 to t=T−1) can
be interpreted as the expected number of transitions from state Si to state Sj. That is

Σ_{t=1}^{T−1} γt(i) = expected number of transitions from Si (III.47)

Σ_{t=1}^{T−1} ξt(i,j) = expected number of transitions from Si to Sj (III.48)
Using the above formulas (and the concept of counting event occurrences) we can
give a method for reestimation of the parameters of an HMM. A set of reasonable
reestimation formulas for π, A, and B is
π̄i = expected frequency (number of times) in state Si at time t=1

   = γ1(i) = α1(i) β1(i) / [Σ_{i=1}^{N} α1(i) β1(i)]    (III.49a)
āij = expected number of transitions from state Si to state Sj / expected number of transitions from state Si

    = [Σ_{t=1}^{T-1} ξt(i,j)] / [Σ_{t=1}^{T-1} γt(i)]

    = [Σ_{t=1}^{T-1} αt(i) aij bj(Ot+1) βt+1(j)] / [Σ_{t=1}^{T-1} αt(i) βt(i)]    (III.49b)
b̄j(k) = expected number of times in state Sj and observing symbol vk / expected number of times in state Sj

     = [Σ_{t=1, Ot=vk}^{T} γt(j)] / [Σ_{t=1}^{T} γt(j)]    (III.49c)
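The counting interpretation behind (III.49a)-(III.49c) can be made concrete in code. The following is a minimal, unscaled Python sketch of one Baum-Welch iteration (adequate only for short sequences; long utterances need the scaled variables introduced later in this chapter); the toy model dimensions and parameter values are illustrative assumptions, not taken from the text.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Unscaled forward and backward variables alpha_t(i), beta_t(i)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0                                # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def reestimate(pi, A, B, obs):
    """One Baum-Welch iteration implementing (III.49a)-(III.49c)."""
    obs = np.asarray(obs)
    N, K, T = len(pi), B.shape[1], len(obs)
    alpha, beta = forward_backward(pi, A, B, obs)
    P = alpha[-1].sum()                              # P(O | lambda)
    gamma = alpha * beta / P                         # gamma_t(i)
    xi = np.zeros((T - 1, N, N))                     # xi_t(i,j), eq. (III.45)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / P
    new_pi = gamma[0]                                              # (III.49a)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]       # (III.49b)
    new_B = np.array([gamma[obs == k].sum(axis=0) for k in range(K)]).T \
            / gamma.sum(axis=0)[:, None]                           # (III.49c)
    return new_pi, new_A, new_B, P
```

Because Σj ξt(i,j) = γt(i) by (III.46), each row of the reestimated A sums to one automatically, and repeating the iteration never decreases P(O|λ).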
If we define the current model as λ=(A, B, π), and use it to compute the right-hand
sides of (III.49a)-(III.49c), and we define the reestimated model as λ̄=(Ā, B̄, π̄),
as determined from the left-hand sides of (III.49a)-(III.49c), then it has been
proven by Baum and his colleagues that either 1) the initial model λ defines a
critical point of the likelihood function, in which case λ̄=λ; or 2) model λ̄ is more
likely than model λ in the sense that P(O|λ̄) > P(O|λ), i.e., we have found a new
model λ̄ from which the observation sequence is more likely to have been
produced.
Based on the above procedure, if we iteratively use λ̄ in place of λ and repeat the
reestimation calculation, we can then improve the probability of O being observed
from the model until some limiting point is reached. The final result of this
reestimation procedure is called a maximum likelihood estimate of the HMM. It
should be pointed out that the forward-backward algorithm leads to local maxima
only, and that in most problems of interest, the optimization surface is very complex
and has many local maxima.
The reestimation formulas of (III.49a)-(III.49c) can be derived directly by
maximizing (using standard constrained optimization techniques) Baum's auxiliary
function

Q(λ, λ̄) = Σ_Q P(Q | O, λ) log P(O, Q | λ̄)    (III.50)

over λ̄, where the sum runs over all state sequences Q. It has been proven by Baum
that maximization of Q(λ, λ̄) leads to increased likelihood, i.e.

max_{λ̄} Q(λ, λ̄)  ⇒  P(O|λ̄) ≥ P(O|λ).    (III.51)
Eventually the likelihood function converges to a critical point.
Notes on the Reestimation Procedure: The reestimation formulas can readily be
interpreted as an implementation of the EM algorithm of statistics in which the E
(expectation) step is the calculation of the auxiliary function Q(λ, λ̄), and the M
(modification) step is the maximization over λ̄. Thus the Baum-Welch reestimation
equations are essentially identical to the EM steps for this particular problem.
An important aspect of the reestimation procedure is that the stochastic constraints
of the HMM parameters, namely
Σ_{i=1}^{N} πi = 1    (III.52)

Σ_{j=1}^{N} aij = 1,  1 ≤ i ≤ N    (III.53)

Σ_{k=1}^{K} bj(k) = 1,  1 ≤ j ≤ N    (III.54)
are automatically satisfied at each iteration.
III.3.4. Reestimation For Multiple Observation Sequences
If only one observation sequence is used to train the model, the model would
perform good recognition on this particular sample but might give a low recognition
rate when testing other utterances of the same word. Good training therefore needs
multiple observation sequences, from different speakers, for the same word.
Let O(r) denote the rth observation sequence, of length Tr, let the superscript r
indicate results for this sequence, and let R be the number of sequences. The
forward-backward reestimation formulas must then be modified as:
π̄i = (1/R) Σ_{r=1}^{R} [ α̂1^(r)(i) β̂1^(r)(i) / Σ_{i=1}^{N} α̂1^(r)(i) β̂1^(r)(i) ]    (III.55)
āij = [Σ_{r=1}^{R} Σ_{t=1}^{Tr-1} α̂t^(r)(i) aij bj(Ot+1^(r)) β̂t+1^(r)(j)] / [Σ_{r=1}^{R} Σ_{t=1}^{Tr-1} α̂t^(r)(i) β̂t^(r)(i) / ct^(r)]    (III.56)
b̄j(k) = [Σ_{r=1}^{R} Σ_{t=1, Ot^(r)=vk}^{Tr} α̂t^(r)(j) β̂t^(r)(j) / ct^(r)] / [Σ_{r=1}^{R} Σ_{t=1}^{Tr} α̂t^(r)(j) β̂t^(r)(j) / ct^(r)]    (III.57)
where the scaled forward and backward variables are

α̂t(i) = (∏_{τ=1}^{t} cτ) αt(i)    (III.58)

β̂t(i) = (∏_{τ=t}^{T} cτ) βt(i)    (III.59)
ct = 1 / [Σ_{j=1}^{N} αt(j)] : scale factor    (III.60)
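A minimal Python sketch may clarify how the scale factors ct are used in practice: at each time step the forward row is renormalized to sum to one, and log P(O|λ) is recovered as -Σt log ct, so no numerical underflow occurs for long utterances. The model dimensions in the sketch are illustrative assumptions.

```python
import numpy as np

def scaled_forward(pi, A, B, obs):
    """Scaled forward pass: each row of alpha-hat sums to one, and the
    scale factors c_t give log P(O | lambda) = -sum_t log c_t."""
    T, N = len(obs), len(pi)
    alpha_hat = np.zeros((T, N))
    c = np.zeros(T)
    for t in range(T):
        # unscaled induction step, propagated from the scaled previous row
        a = pi * B[:, obs[0]] if t == 0 else (alpha_hat[t - 1] @ A) * B[:, obs[t]]
        c[t] = 1.0 / a.sum()            # scale factor, eq. (III.60)
        alpha_hat[t] = c[t] * a         # alpha-hat_t(i), eq. (III.58)
    return alpha_hat, c, -np.log(c).sum()
```

On a short sequence the returned log-likelihood agrees exactly with the logarithm of the unscaled forward probability, which is a convenient sanity check for an implementation.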
III.4. Type of HMM
Different kinds of structures for HMMs can be used. The structure is defined by the
transition matrix A. The most general structure is the ergodic, or fully connected,
HMM. In this model every state can be reached from every other state of the model.
As shown in figure III.7(a), for an N=4 state model, this model has the property
0 < aij < 1 (zero and one have to be excluded, otherwise the ergodic property is not
fulfilled). The state transition matrix A for an ergodic model can be described by:
    | a11 a12 a13 a14 |
A = | a21 a22 a23 a24 |
    | a31 a32 a33 a34 |
    | a41 a42 a43 a44 |
Figure III.7. Illustration of 3 distinct types of HMMs. (a) A 4-state ergodic model.
(b) A 4-state left-right model. (c) A 6-state parallel path left-right model
In speech recognition, it is desirable to use a model which models the observations
in a successive manner, since this is the property of speech. The models that fulfill
this modeling technique are the left-right model and the parallel path left-right
model; see figure III.7(b),(c). The property of a left-right model is:
aij=0, j < i (III.61)
That is, no jumps can be made to previous states. The lengths of the transitions are
usually restricted to some maximum length, typically two or three:
aij=0, j > i + ∆ (III.62)
Note that, for a left-right model, the state transition coefficients for the last state
have the following property:
aNN=1 (III.63)
aNj=0, j < N (III.64)
In figure III.7(b) and (c), two left-right models are presented. In figure III.7(b),
∆=2 and the state transition matrix A will be:

    | a11 a12 a13  0  |
A = |  0  a22 a23 a24 |    (III.65)
    |  0   0  a33 a34 |
    |  0   0   0  a44 |
It should be clear that the imposition of the constraints of the left-right model, or
those of the constrained jump model, has essentially no effect on the reestimation
procedure. This is the case because any HMM parameter set to zero initially will
remain at zero throughout the reestimation procedure.
III.5. Choice of Model Parameters
Size of codebook
For the case in which we wish to use an HMM with a discrete observation symbol
density, rather than a continuous one, a vector quantizer (VQ) is required to map
each continuous observation vector into a discrete codebook index. Once the
codebook of vectors has been obtained, the mapping between continuous vectors
and codebook indices becomes a simple nearest-neighbor computation: the
continuous vector is assigned the index of the nearest codebook vector. Thus the
major issue in VQ is the design of an appropriate codebook for quantization.
A great deal of work has gone into devising an excellent iterative procedure for
designing codebooks based on having a representative training sequence of vectors.
The procedure basically partitions the training vectors into K disjoint sets (where K
is the size of the codebook), represents each such set by a single vector (vk, 1 ≤ k ≤
K), which is generally the centroid of the vectors in the training set assigned to the
kth region, and then iteratively optimizes the partition and the codebook. Associated
with VQ is a distortion penalty, since we are representing an entire region of the
vector space by a single vector. Clearly it is advantageous to keep the distortion
penalty as small as possible. However, this implies a large codebook, and that
leads to problems in implementing HMMs with a large number of parameters.
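The iterative codebook design described above is essentially the generalized Lloyd (k-means) algorithm. The sketch below is a simplified Python version written under that assumption; a production VQ design (for example the LBG splitting procedure) adds refinements not shown here.

```python
import numpy as np

def design_codebook(train, K, iters=20, seed=0):
    """Iterative codebook design: alternate a nearest-neighbor partition of
    the training vectors with centroid recomputation (Lloyd iteration)."""
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), size=K, replace=False)].copy()
    for _ in range(iters):
        # partition: index of the nearest codebook vector for every training vector
        dist = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        idx = dist.argmin(axis=1)
        for k in range(K):
            cell = train[idx == k]
            if len(cell):               # keep the old centroid if a cell empties
                codebook[k] = cell.mean(axis=0)
    return codebook

def quantize(codebook, x):
    """Nearest-neighbor mapping of one continuous vector to a codebook index."""
    return int(((codebook - x) ** 2).sum(axis=1).argmin())
```

Each iteration can only reduce (or leave unchanged) the average distortion, which is why the procedure converges to a locally optimal codebook.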
Figure III.8 illustrates the tradeoff of quantization distortion versus K (on a log
scale). Although the distortion steadily decreases as K increases, it can be seen from
figure III.8 that only small decreases in distortion accrue beyond a value of K=32.
Hence HMMs with codebook sizes from K=32 to K=256 vectors have been used in
speech recognition experiments using HMMs.
Figure III.8. Curve showing the tradeoff of VQ average distortion as a function of
the codebook size K, on a log scale.
Type of model
How do we select the type of model, and how do we choose the parameters of the
selected model? For isolated word recognition with a distinct HMM designed for
each word in the vocabulary, it should be clear that a left-right model is more
appropriate than an ergodic model, since we can then associate time with model
states in a fairly straightforward manner. Furthermore, we can envision the physical
meaning of the model states as distinct sounds of the word being modeled.
Number of states.
The issue of the number of states to use in each word model leads to two schools of
thought. One idea is to let the number of states correspond roughly to the number of
sounds (phonemes) within the word; hence models with 2 to 10 states would be
appropriate. The other idea is to let the number of states correspond roughly to the
average number of observations in a spoken version of the word, so that each state
corresponds to an observation interval. Each word model then has the same number
of states, which implies that the models will work best when they represent words
with the same number of sounds. In this project, I therefore chose the second
approach.
Figure III.9. Average word error rate versus the number of states N in the HMM
To illustrate the effect of varying the number of states in a word model, figure
III.9 shows a plot of average word error rate versus N, for the case of recognition of
isolated digits. It can be seen that the error is somewhat insensitive to N, achieving a
local minimum at N=6; however, differences in error rate for values of N close to 6
(e.g., N=5) are small.
III.6. Initial HMM Parameters
Before the reestimation formulas can be applied for training, it is important to get
good initial parameters, so that the reestimation leads to the global maximum or as
close to it as possible. An adequate choice for π and A is the uniform distribution.
But since left-right models are used, π will have probability one for the first state
and zero for the others. For example, the left-right model in figure III.7(b) will have
the following initial π and A:
π = (1, 0, 0, 0)^T    (III.66)

    | 0.5 0.5  0   0  |
A = |  0  0.5 0.5  0  |    (III.67)
    |  0   0  0.5 0.5 |
    |  0   0   0   1  |
The parameters of the emission distribution need good initial estimates in order to
obtain rapid and proper convergence.
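As a sketch, initial parameters of the form (III.66)-(III.67) can be generated for any number of states N. The rule used here, spreading probability uniformly over each row's allowed transitions with the last state absorbing, is an assumption for illustration; note that the model of figure III.7(b) allows jumps up to ∆=2, while the initial A of (III.67) uses only self- and next-state transitions, which corresponds to jump=1 below.

```python
import numpy as np

def left_right_init(N, jump=1):
    """Initial parameters for an N-state left-right model: pi puts all
    probability on state 1, and each row of A is uniform over the allowed
    transitions j in [i, i+jump]; the last state is absorbing (a_NN = 1)."""
    pi = np.zeros(N)
    pi[0] = 1.0
    A = np.zeros((N, N))
    for i in range(N):
        hi = min(i + jump, N - 1)          # clamp jumps at the last state
        A[i, i:hi + 1] = 1.0 / (hi - i + 1)
    return pi, A
```

With N=4 and jump=1 this reproduces exactly the initial π and A of (III.66) and (III.67).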