Statistical sequence recognition
Deterministic sequence recognition
• Last time: temporal integration of local distances via DP
– Integrates local matches over time
– Normalizes time variations
– For continuous speech, segments as well as classifies
• Limitations
– End-point detection required
– Choosing a local distance (effect on global)
– Doesn't model context effects between words
Statistical vs. deterministic sequence recognition
• Statistical models can also be used (rather than examples) with DP
– Still integrate local matches over time
– Normalize time variations
– For continuous speech, segment as well as classify
• Helping with DTW limitations
– End-point detection is not as critical
– The local "distance" comes from the model
– Cross-word modeling is straightforward (though it does require enough data to train good models)
Statistical sequence recognition
• Powerful tools exist
– Density estimation
– Training data alignment
– Recognition given the models
• Increases generality over deterministic matching
– Any distance choice is equivalent to an implicit model
– Sufficiently detailed statistics can model any distribution
– In recognition, find the MAP choice for the sequence
– In practice, approximations are used
Probabilistic problem statement
• Bayes relation for models:

$$P(M_j \mid X) = \frac{P(X \mid M_j)\,P(M_j)}{P(X)}$$

where $M_j$ is the $j$th statistical model for a sequence of speech units and $X$ is the sequence of feature vectors.
– Minimum probability of error if $j$ is chosen to maximize $P(M_j \mid X)$
Decision rule
• Bayes relation for models:

$$j_{\text{best}} = \arg\max_j P(M_j \mid X) = \arg\max_j P(X \mid M_j)\,P(M_j)$$

since $P(X)$ is fixed for all choices of $j$.
Bayes decision rule for models
[Figure: the input $X$ is scored by each of the models $M_1, M_2, \ldots, M_J$, producing $P(X \mid M_1)P(M_1)$, $P(X \mid M_2)P(M_2)$, $\ldots$, $P(X \mid M_J)P(M_J)$.]
Acoustic and language models
• So far, no assumptions
• But how do we get the probabilities?
• Estimate from training data
• Then, first assumption: acoustic parameters independent of language parameters

$$j_{\text{best}} = \arg\max_j P(M_j \mid X, \Theta) = \arg\max_j P(X \mid M_j, \Theta_A)\,P(M_j \mid \Theta_L)$$

where $\Theta$ are parameters estimated from training data.
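As a concrete sketch of this decision rule (my illustration, not from the slides), the code below scores each candidate model in the log domain and takes the argmax; the functions `acoustic_loglik` and `lm_logprob` are hypothetical stand-ins for $\log P(X \mid M_j, \Theta_A)$ and $\log P(M_j \mid \Theta_L)$.

```python
import math

def best_model(X, models, acoustic_loglik, lm_logprob):
    """Pick j maximizing P(X | M_j, theta_A) * P(M_j | theta_L).

    Working in the log domain turns the product into a sum and
    avoids numerical underflow on long observation sequences.
    """
    best_j, best_score = None, -math.inf
    for j, M in enumerate(models):
        score = acoustic_loglik(X, M) + lm_logprob(M)
        if score > best_score:
            best_j, best_score = j, score
    return best_j
```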
The three problems
(1) How should $P(X \mid M_j, \Theta_A)$ be computed?
(2) How should the parameters $\Theta_A$ be determined?
(3) Given the model and parameters, how can we find the best sequence to classify the input sequence?
• Today's focus is on problem 1, with some on problem 3; problem 2 will be next time.
Generative model for speech

[Figure: a model $M_j$ with parameters $\Theta_A$ generates the observation sequence $X$ with likelihood $P(X \mid M_j, \Theta_A)$.]
Composition of a model
• Could collect statistics for whole words
• More typically, densities for subword units
• Models consist of states
• States have possible transitions
• States have observed outputs (feature vectors)
• States have density functions
• General statistical formulations are hard to estimate
• To simplify, use Markov assumptions
Markov assumptions
• Assume a finite state automaton
• Assume stochastic transitions
• Each random variable depends only on the previous n variables (typically n = 1)
• HMMs have another layer of indeterminacy
• Let's start with Markov models per se
Example: Markov Model
[Figure: three-state Markov model with states a, b, c; the arcs are labeled with transition probabilities (values of 1/2, 1/3, and 1/4).]

| state | output |
|-------|--------|
| a | sunny |
| b | cloudy |
| c | rainy |

Numbers on arcs are probabilities of transitions.
Markov Model
• By definition of joint and conditional probability, if

$$Q = (q_1, q_2, q_3, \ldots, q_N)$$

then

$$P(Q) = P(q_1) \prod_{i=2}^{N} P(q_i \mid q_{i-1}, q_{i-2}, \ldots, q_1)$$

• And with the 1st order Markov assumption,

$$P(Q) = P(q_1) \prod_{i=2}^{N} P(q_i \mid q_{i-1})$$
Example: Markov Model
[Figure: the same three-state Markov model (states a, b, c; outputs sunny, cloudy, rainy).]

If we assume that we start with "a" so that P(a) = 1, then

$$P(abc) = P(c \mid b)\,P(b \mid a)\,P(a) = \left(\tfrac{1}{4}\right)\left(\tfrac{1}{3}\right)(1) = \tfrac{1}{12}$$
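A minimal sketch of this chain-rule computation; only the two transition probabilities used in the worked example ($P(b \mid a) = 1/3$, $P(c \mid b) = 1/4$) are filled in, and the rest of the figure's transition matrix is omitted.

```python
# Transition probabilities P(next | prev), read off the figure's arcs.
# Only the two needed for the worked example are included here.
trans = {("a", "b"): 1/3, ("b", "c"): 1/4}

def chain_prob(states, initial_prob=1.0):
    """P(Q) = P(q1) * prod_i P(q_i | q_{i-1}), 1st order Markov."""
    p = initial_prob  # P(a) = 1 in the example
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]
    return p

print(chain_prob(["a", "b", "c"]))  # 0.0833... = 1/12
```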
Hidden Markov Model (HMM)
• Outputs of a Markov model are deterministic
• For an HMM, outputs are stochastic
– Instead of a fixed value, a pdf
– The generating state sequence is "hidden"
• "Doubly stochastic"
– Transitions
– Emissions
Line-up: ASR researchers and Swedish basketball players

[Figure: photo line-up, with each person annotated with a second-formant value: f2 = 1400, f2 = 1450, f2 = 1650 (Hz).]
MM vs HMM
• MM: bbplayer → 1400 Hz; researcher → 1600 Hz. HMM: bbplayer → 1400 Hz mean, 200 Hz std dev, etc. (Gaussian)
• Outputs are observed in both cases; but only one value is possible for the MM, more than one for the HMM
• For the MM, we then directly know the state; for the HMM, probabilistic inference
• In both cases, two states, with two possible transitions from each
Two-state HMM

[Figure: two-state HMM with states q1 and q2.]

Associated functions: $P(x_n \mid q_i)$ and $P(q_j \mid q_i)$
Emissions and transitions
• $P(x_n \mid q_1)$ could be the density for F2 of a bbplayer (emission probability)
• $P(x_n \mid q_2)$ could be the density for F2 of a researcher
• $P(q_2 \mid q_1)$ could be the probability for the transition bbplayer → researcher in the lineup
• $P(q_1 \mid q_2)$ could be the probability for the transition researcher → bbplayer in the lineup
• $P(q_1 \mid q_1)$ could be the probability for the transition bbplayer → bbplayer in the lineup
• $P(q_2 \mid q_2)$ could be the probability for the transition researcher → researcher in the lineup
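A sketch of this two-state model as code. The bbplayer Gaussian (mean 1400 Hz, std dev 200 Hz) is from the earlier slide; the researcher's std dev and all four transition probabilities are placeholder assumptions of mine.

```python
import math

def gauss_pdf(x, mean, std):
    """Univariate Gaussian density N(x; mean, std^2)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Emission densities P(x_n | q_i) for F2 in Hz.
emission = {
    "q1": lambda x: gauss_pdf(x, 1400.0, 200.0),  # bbplayer (from the slide)
    "q2": lambda x: gauss_pdf(x, 1600.0, 200.0),  # researcher (std dev assumed)
}

# Transition probabilities P(q_j | q_i); uniform placeholders.
transition = {("q1", "q1"): 0.5, ("q1", "q2"): 0.5,
              ("q2", "q1"): 0.5, ("q2", "q2"): 0.5}

# An observed F2 of 1450 Hz is more likely under the bbplayer state:
print(emission["q1"](1450.0), emission["q2"](1450.0))
```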
States vs speech classes
• A state is just a model component; it could be "tied" (the same class as a state in a different model)
• States are often parts of a speech sound (e.g., three to a phone)
• States can also correspond to specific contexts (e.g., "uh" with a left context of "b")
• States can just be repeated versions of the same speech class – enforces a minimum duration
• In practice, state identities are learned
Temporal constraints
• Minimum duration from the shortest path (see the sketch below)
• Self-loop probability vs. forward transition probability
• Transitions not in the model are not permitted
• Sometimes explicit duration models are used
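A sketch (my construction, not from the slides) of how topology alone enforces a minimum duration: a three-state left-to-right transition matrix whose zero entries forbid skips, so the shortest path visits each state once and any match must span at least three frames.

```python
import numpy as np

# Left-to-right topology: each state allows only a self-loop or a step
# to the next state. The zeros are transitions "not in the model",
# i.e., not permitted, and the shortest path takes 3 frames.
A = np.array([
    [0.6, 0.4, 0.0],   # state 0: self-loop or advance to state 1
    [0.0, 0.6, 0.4],   # state 1: self-loop or advance to state 2
    [0.0, 0.0, 1.0],   # state 2: final state (self-loop only)
])
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a valid distribution
```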
Estimation of P(X|M)
• Given states, topology, probabilities
• Assume 1st order Markov
• For now, assume the transition probabilities are known and the emission probabilities are estimated
• How do we estimate the likelihood function?
• Hint: at the end, one version will look like DTW
“Total” likelihood estimation
• The ith path through model M is $Q_i$
• N is the number of frames in the input
• X is the data sequence
• L(M) is the number of states in the model
• Then expand to

$$P(X \mid M) = \sum_{\text{all } Q_i \text{ in } M \text{ of length } N} P(Q_i, X \mid M)$$

• But this requires $O(N \cdot L(M)^N)$ steps (see the brute-force sketch below)
• Fortunately, we can reuse intermediate results
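A brute-force sketch of this sum: it enumerates every length-N state path with `itertools.product`, which makes the $L(M)^N$ cost explicit. The interface (`init`, `A`, `emit`) is an assumed toy representation of the model, not anything from the slides.

```python
import itertools

def total_likelihood_bruteforce(X, init, A, emit):
    """P(X|M) = sum over all length-N state paths Q of P(Q, X | M).

    X: list of N observations; init[l] = P(q_l at n=1);
    A[k][l] = P(q_l | q_k); emit(l, x) = P(x | q_l).
    Cost is O(N * L**N) -- feasible only for toy problems.
    """
    L, N = len(init), len(X)
    total = 0.0
    for path in itertools.product(range(L), repeat=N):  # L**N paths
        p = init[path[0]] * emit(path[0], X[0])
        for n in range(1, N):
            p *= A[path[n - 1]][path[n]] * emit(path[n], X[n])
        total += p
    return total
```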
“Forward” recurrence (1)
• Expand to the joint probability at the last frame only:

$$P(X \mid M) = \sum_{l=1}^{L(M)} P(q_l^N, X_1^N \mid M)$$

• Decompose into local and cumulative terms, where $X$ can be expressed as $X_1^N$
• Using $P(a,b \mid c) = P(a \mid b,c)\,P(b \mid c)$, get

$$P(q_l^n, X_1^n \mid M) = \sum_{k=1}^{L(M)} P(q_k^{n-1}, X_1^{n-1} \mid M)\,P(q_l^n, x_n \mid q_k^{n-1}, X_1^{n-1}, M)$$
“Forward” recurrence (2)
• Now define a joint probability for the state at time $n$ being $q_l$, together with the observation sequence:

$$\alpha_n(l \mid M) = P(q_l^n, X_1^n \mid M)$$

• Then, restating the forward recurrence,

$$\alpha_n(l \mid M) = \sum_{k=1}^{L(M)} \alpha_{n-1}(k)\,P(q_l^n, x_n \mid q_k^{n-1}, X_1^{n-1}, M)$$
“Forward” recurrence (3)
• The “local” term can be decomposed further:

$$P(q_l^n, x_n \mid q_k^{n-1}, X_1^{n-1}, M) = P(q_l^n \mid q_k^{n-1}, X_1^{n-1}, M)\,P(x_n \mid q_l^n, q_k^{n-1}, X_1^{n-1}, M)$$

• But these are very hard to estimate, so we need to make two assumptions of conditional independence.
Assumption 1: 1st order Markov
• State chain: the state of the Markov chain at time $n$ depends only on the state at time $n-1$, conditionally independent of the earlier past:

$$P(q_l^n \mid q_k^{n-1}, X_1^{n-1}, M) = P(q_l^n \mid q_k^{n-1}, M)$$
Assumption 2: conditional independence of the data
• Given the current state, observations are independent of past states and observations:

$$P(x_n \mid q_l^n, q_k^{n-1}, X_1^{n-1}, M) = P(x_n \mid q_l^n, M)$$
“Forward” recurrence (4)
• Given those assumptions, the local term is

$$P(q_l^n \mid q_k^{n-1}, M)\,P(x_n \mid q_l^n, M)$$

• And the forward recurrence is

$$\alpha_n(l \mid M) = \sum_{k=1}^{L(M)} \alpha_{n-1}(k \mid M)\,P(q_l^n \mid q_k^{n-1}, M)\,P(x_n \mid q_l^n, M)$$

• Or, suppressing the dependence on $M$,

$$\alpha_n(l) = \sum_{k=1}^{L} \alpha_{n-1}(k)\,P(q_l^n \mid q_k^{n-1})\,P(x_n \mid q_l^n)$$
How do we start it?
• Recall the definition

$$\alpha_n(l \mid M) = P(q_l^n, X_1^n \mid M)$$

• Set $n = 1$:

$$\alpha_1(l) = P(q_l^1, X_1^1) = P(q_l^1)\,P(x_1 \mid q_l^1)$$
How do we finish it?
• Recall the definition

$$\alpha_n(l \mid M) = P(q_l^n, X_1^n \mid M)$$

• Sum over all states in the model for $n = N$:

$$\sum_{l=1}^{L} \alpha_N(l \mid M) = \sum_{l=1}^{L} P(q_l^N, X_1^N \mid M) = P(X \mid M)$$
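Putting the initialization, recurrence, and termination together: a sketch of the full forward algorithm under the stated assumptions, using the same assumed toy interface as the brute-force version above. On short inputs the two should agree up to floating-point error, but this one costs $O(N \cdot L^2)$ rather than $O(N \cdot L^N)$.

```python
def forward_likelihood(X, init, A, emit):
    """P(X|M) via the forward recurrence.

    X: list of N observations; init[l] = P(q_l at n=1);
    A[k][l] = P(q_l | q_k); emit(l, x) = P(x | q_l).
    """
    L, N = len(init), len(X)
    # Initialization: alpha_1(l) = P(q_l) * P(x_1 | q_l)
    alpha = [init[l] * emit(l, X[0]) for l in range(L)]
    # Recurrence: alpha_n(l) = sum_k alpha_{n-1}(k) P(q_l|q_k) P(x_n|q_l)
    for n in range(1, N):
        alpha = [emit(l, X[n]) * sum(alpha[k] * A[k][l] for k in range(L))
                 for l in range(L)]
    # Termination: P(X|M) = sum_l alpha_N(l)
    return sum(alpha)
```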
Forward recurrence summary
• Decompose the data likelihood into a sum (over predecessor states) of products of local and global probabilities
• Conditional independence assumptions
• The local probability is a product of emission and transition probabilities
• The global probability is a cumulative value
• A lot like DTW!
Forward recurrence vs DTW
• Terms are probabilistic
• Predecessors are model states, not observation frames
• Predecessors are always in the previous frame
• Local and global factors are combined by product, not sum (but you could take the log)
• Combination of terms over predecessors is done by summation rather than by finding the maximum (or the minimum, for distance/distortion)
Viterbi Approximation
• Summing very small products is numerically tricky
• Instead of the total likelihood for the model, we can find the best path through the states
• Summation is replaced by max
• Probabilities can be replaced by log probabilities; then summations can be replaced by a min over sums of negative log probabilities:

$$-\log P(q_l^n, X_1^n) = \min_k\left[-\log P(q_k^{n-1}, X_1^{n-1}) - \log P(q_l^n \mid q_k^{n-1}) - \log P(x_n \mid q_l^n)\right]$$
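A sketch of this Viterbi recursion in the negative-log domain, again on the assumed toy interface used above, with a backtrace so the best state sequence itself can be recovered:

```python
import math

def viterbi(X, init, A, emit):
    """Best path via min over sums of negative log probabilities.

    Returns (-log P of the best path, best state sequence).
    """
    L, N = len(init), len(X)
    # D[l] = -log P of the best path ending in state l at the current frame
    D = [-math.log(init[l]) - math.log(emit(l, X[0])) for l in range(L)]
    back = []  # back[n][l] = best predecessor of state l at frame n
    for n in range(1, N):
        ptr = [min(range(L), key=lambda k: D[k] - math.log(A[k][l]))
               for l in range(L)]
        D = [D[ptr[l]] - math.log(A[ptr[l]][l]) - math.log(emit(l, X[n]))
             for l in range(L)]
        back.append(ptr)
    # Backtrace from the best final state
    path = [min(range(L), key=lambda l: D[l])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return min(D), path[::-1]
```

Zero-probability transitions would need guarding (math.log(0) raises an error); a real implementation would use infinite costs for arcs not in the model.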
Similarity to DTW
• Negative log probabilities are the distance! But instead of a frame in the reference, we compare to a state in a model, i.e.,

$$-\log P(q_l^n, X_1^n) = \min_k\left[-\log P(q_k^{n-1}, X_1^{n-1}) - \log P(q_l^n \mid q_k^{n-1}) - \log P(x_n \mid q_l^n)\right]$$

$$D(n, q_l^n) = \min_k\left[D(n-1, q_k^{n-1}) + d(n, q_l^n) + T(q_l^n, q_k^{n-1})\right]$$

• Note that we also now have explicit transition costs
Viterbi vs DTW
• Models, not examples
• The distance measure now depends on estimating probabilities – good tools exist
• We now have an explicit way to specify priors
– State sequences: transition probabilities
– Word sequences: P(M) priors come from a language model
Assumptions required
• Language and acoustic model parameters are separable
• The state chain is first-order Markov
• Observations are independent of the past
• Recognition via the best path (best state sequence) is good enough – no need to sum over all paths to get the best model
• If all were true, then the resulting inference would be optimal