b224s.09.lec10
7/28/2019 b224s.09.lec10
1/59
CS 224S / LINGUIST 281
Speech Recognition, Synthesis, and Dialogue
Dan Jurafsky
Lecture 10: Acoustic Modeling
IP Notice:
-
Outline for Today
Speech Recognition Architectural Overview
Hidden Markov Models in general and for speech
  Forward
  Viterbi Decoding
How this fits into the ASR component of course
  Jan 27: HMMs, Forward, Viterbi
  Jan 29: Baum-Welch (Forward-Backward)
  Feb 3: Feature Extraction, MFCCs, start of AM (VQ)
  Feb 5: Acoustic Modeling: GMMs
  Feb 10: N-grams and Language Modeling
  Feb 24: Search and Advanced Decoding
  Feb 26: Dealing with Variation
  Mar 3: Dealing with Disfluencies
-
Outline for Today
Acoustic Model
  Increasingly sophisticated models
  Acoustic Likelihood for each state:
    Gaussians
    Multivariate Gaussians
    Mixtures of Multivariate Gaussians
  Where a state is progressively:
    CI Subphone (3ish per phone)
    CD phone (=triphones)
    State-tying of CD phone
If Time: Evaluation
  Word Error Rate
-
Reminder: VQ
To compute p(o_t|q_j):
  Compute distance between feature vector o_t and each codeword (prototype vector) in a preclustered codebook, where distance is either
    Euclidean
    Mahalanobis
  Choose the vector that is the closest to o_t and take its codeword v_k
  And then look up the likelihood of v_k given HMM state j in the B matrix
b_j(o_t) = b_j(v_k) s.t. v_k is codeword of closest vector to o_t
The B matrix is trained using Baum-Welch as above
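The lookup above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the codebook, the B matrix values, and the helper names `euclidean` and `vq_likelihood` are all made up for the example, and Euclidean distance is used rather than Mahalanobis.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def vq_likelihood(o_t, codebook, B, j):
    """Quantize o_t to its nearest codeword v_k, then look up b_j(v_k).

    codebook[k] is the prototype vector for codeword k (hypothetical values);
    B[j][k] is the trained probability of codeword k in state j.
    """
    k = min(range(len(codebook)), key=lambda idx: euclidean(o_t, codebook[idx]))
    return k, B[j][k]

# Toy 2-D codebook with 3 codewords and a 2-state B matrix (illustrative only).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
B = [[0.7, 0.2, 0.1],
     [0.1, 0.25, 0.65]]

k, p = vq_likelihood((0.9, 1.2), codebook, B, j=1)
print(k, p)
```

Every observation that falls nearest to the same codeword receives the same likelihood, which is the loss of resolution that motivates the continuous models later in the lecture.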
-
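Computing b_j(v_k)
[Figure: training vectors for state j, plotted with feature value 1 on the horizontal axis and feature value 2 on the vertical axis]

$$b_j(v_k) = \frac{\text{number of vectors with codebook index } k \text{ in state } j}{\text{number of vectors in state } j} = \frac{14}{56} = \frac{1}{4}$$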
Slide from John-Paul Hosum, OHSU/OGI
-
Summary: VQ
Training: Do VQ and then use Baum-Welch to assign probabilities to each symbol
Decoding: Do VQ and then use the symbol probabilities in decoding
-
Directly Modeling Continuous Observations
Gaussians
  Univariate Gaussians
    Baum-Welch for univariate Gaussians
  Multivariate Gaussians
    Baum-Welch for multivariate Gaussians
  Gaussian Mixture Models (GMMs)
    Baum-Welch for GMMs
-
Better than VQ
VQ is insufficient for real ASR
Instead: Assume the possible values of the observation feature vector o_t are normally distributed.
Represent the observation likelihood function b_j(o_t) as a Gaussian with mean μ_j and variance σ_j²

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
-
Gaussians are parameterized by mean and variance
-
Reminder: means and variances
For a discrete random variable X
Mean is the expected value of X
  Weighted sum over the values of X
Variance is the average squared deviation from the mean
-
Gaussian as Probability Density Function
-
Gaussian PDFs
A Gaussian is a probability density function; probability is area under curve.
To make it a probability, we constrain area under curve = 1.
BUT
  We will be using "point estimates"; value of Gaussian at a point.
  Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx
  As we will see later, this is ok since the same factor dx is omitted from all Gaussians, so argmax is still correct.
-
Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a variance
[Figure: Gaussian curves of P(o|q) plotted against o, with different means. P(o|q) is highest at the mean and low very far from the mean.]
-
Using a (univariate) Gaussian as an acoustic likelihood estimator
Let's suppose our observation was a single real-valued feature (instead of a 39D vector)
Then if we had learned a Gaussian over the distribution of values of this feature
We could compute the likelihood of any given observation o_t as follows:

$$b_j(o_t) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(o_t-\mu_j)^2}{2\sigma_j^2}\right)$$
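As a minimal sketch of this computation, the univariate Gaussian density can be evaluated directly. The function name `gaussian_pdf` and the example values are assumptions for illustration, not part of the lecture.

```python
import math

def gaussian_pdf(x, mean, var):
    """Value of the univariate Gaussian density with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

peak = gaussian_pdf(0.0, mean=0.0, var=1.0)          # density at the mean
far = gaussian_pdf(4.0, mean=0.0, var=1.0)           # density far from the mean
narrow_peak = gaussian_pdf(0.0, mean=0.0, var=0.01)  # very small variance

print(peak, far, narrow_peak)
```

With a small enough variance the density at the mean exceeds 1, which is a concrete reminder that these point values are densities rather than probabilities.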
-
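Training a Univariate Gaussian
A (single) Gaussian is characterized by a mean and a variance
Imagine that we had some training data in which each state was labeled
We could just compute the mean and variance from the data (here T counts the observations labeled with state i):

$$\mu_i = \frac{1}{T}\sum_{t=1}^{T} o_t \quad \text{s.t. } q_t \text{ is state } i$$

$$\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t \text{ is state } i$$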
-
Training Univariate Gaussians
But we don't know which observation was produced by which state!
What we want: to assign each observation vector o_t to every possible state i, prorated by the probability that the HMM was in state i at time t.
The probability of being in state i at time t is ξ_t(i)!!

$$\mu_i = \frac{\sum_{t=1}^{T} \xi_t(i)\, o_t}{\sum_{t=1}^{T} \xi_t(i)}$$

$$\sigma_i^2 = \frac{\sum_{t=1}^{T} \xi_t(i)\,(o_t - \mu_i)^2}{\sum_{t=1}^{T} \xi_t(i)}$$
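The prorated update can be sketched as a weighted mean and weighted variance. This assumes the per-frame state occupancy probabilities for one state have already been computed by forward-backward; here they are simply given as a list, and the name `reestimate_gaussian` is hypothetical.

```python
def reestimate_gaussian(observations, occupancy):
    """Occupancy-weighted mean and variance for one state.

    observations[t] is a scalar feature o_t; occupancy[t] is the probability
    of being in this state at time t (assumed precomputed).
    """
    total = sum(occupancy)
    mean = sum(w * o for w, o in zip(occupancy, observations)) / total
    var = sum(w * (o - mean) ** 2 for w, o in zip(occupancy, observations)) / total
    return mean, var

obs = [1.0, 2.0, 3.0, 10.0]

# With 0/1 occupancies this reduces to the labeled-data estimate.
hard_mean, hard_var = reestimate_gaussian(obs, [1.0, 1.0, 1.0, 0.0])

# With fractional occupancies every frame contributes a little.
soft_mean, soft_var = reestimate_gaussian(obs, [0.9, 0.8, 0.7, 0.1])
print(hard_mean, hard_var, soft_mean, soft_var)
```

The hard case shows that the labeled-data formulas are the special case where each occupancy is exactly 0 or 1.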
-
Multivariate Gaussians
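Instead of a single mean μ and variance σ²:

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Vector of observations x modeled by vector of means μ and covariance matrix Σ:

$$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$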
-
Multivariate Gaussians
Defining μ and Σ:

$$\mu = E(x)$$

$$\Sigma = E\left[(x-\mu)(x-\mu)^T\right]$$

So the i-jth element of Σ is:

$$\sigma_{ij}^2 = E\left[(x_i-\mu_i)(x_j-\mu_j)\right]$$
-
Gaussian Intuitions: Size of Σ
[Figure, three panels: μ = [0 0], Σ = I; μ = [0 0], Σ = 0.6I; μ = [0 0], Σ = 2I]
As Σ becomes larger, Gaussian becomes more spread out; as Σ becomes smaller, Gaussian more compressed
Text and figures from Andrew Ng's lecture notes for CS229
-
From Chen, Picheny et al lecture slides
-
[Figure, two panels: Σ = [1 0; 0 1] and Σ = [.6 0; 0 2]]
Different variances in different dimensions
-
Gaussian Intuitions: Off-diagonal
As we increase the off-diagonal entries, more correlation between value of x and value of y
Text and figures from Andrew Ng's lecture notes for CS229
-
Gaussian Intuitions: off-diagonal
As we increase the off-diagonal entries, more correlation between value of x and value of y
Text and figures from Andrew Ng's lecture notes for CS229
-
Gaussian Intuitions: off-diagonal and diagonal
Decreasing non-diagonal entries (#1-2)
Increasing variance of one dimension in diagonal (#3)
Text and figures from Andrew Ng's lecture notes for CS229
-
In two dimensions
From Chen, Picheny et al lecture slides
-
But: assume diagonal covariance
I.e., assume that the features in the feature vector are uncorrelated
This isn't true for FFT features, but is approximately true for MFCC features, as we saw last time
Computation and storage much cheaper if diagonal covariance.
I.e. only diagonal entries are non-zero
Diagonal contains the variance of each dimension σ_ii²
So this means we consider the variance of each acoustic feature (dimension) separately
-
Diagonal covariance
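Diagonal contains the variance of each dimension σ_ii²
So this means we consider the variance of each acoustic feature (dimension) separately

$$b_j(o_t) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{jd}^2}} \exp\left(-\frac{1}{2}\left(\frac{o_{td}-\mu_{jd}}{\sigma_{jd}}\right)^2\right)$$

$$b_j(o_t) = \frac{1}{(2\pi)^{D/2}\sqrt{\prod_{d=1}^{D}\sigma_{jd}^2}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right)$$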
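The two forms of the diagonal-covariance likelihood, a product of per-dimension Gaussians and a single closed-form expression, should give the same value. A small sketch checking this numerically follows; the function names and the 3-dimensional example are assumptions for illustration.

```python
import math

def diag_gaussian_product(o, means, variances):
    """b_j(o_t) as a product of D independent univariate Gaussians."""
    likelihood = 1.0
    for x, mu, var in zip(o, means, variances):
        likelihood *= math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return likelihood

def diag_gaussian_closed_form(o, means, variances):
    """Same density with one normalizer and one summed exponent."""
    D = len(o)
    norm = (2.0 * math.pi) ** (D / 2.0) * math.sqrt(math.prod(variances))
    exponent = -0.5 * sum((x - mu) ** 2 / var for x, mu, var in zip(o, means, variances))
    return math.exp(exponent) / norm

o = [0.5, -1.0, 2.0]
means = [0.0, 0.0, 1.0]
variances = [1.0, 4.0, 0.25]

p_product = diag_gaussian_product(o, means, variances)
p_closed = diag_gaussian_closed_form(o, means, variances)
print(p_product, p_closed)
```

Only D means and D variances are needed per Gaussian, which is where the storage and computation savings over a full D x D covariance matrix come from.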
-
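Baum-Welch reestimation equations for multivariate Gaussians
Natural extension of univariate case, where now μ_i is mean vector for state i:

$$\mu_i = \frac{\sum_{t=1}^{T} \xi_t(i)\, o_t}{\sum_{t=1}^{T} \xi_t(i)}$$

$$\Sigma_i = \frac{\sum_{t=1}^{T} \xi_t(i)\,(o_t - \mu_i)(o_t - \mu_i)^T}{\sum_{t=1}^{T} \xi_t(i)}$$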
-
But we're not there yet
Single Gaussian may do a bad job of modeling distribution in any dimension:
Solution: Mixtures of Gaussians
Figure from Chen, Picheny et al slides
-
Mixture of Gaussians to model a function
[Figure: a mixture of Gaussians fit to a function; horizontal axis from -4 to 4, vertical axis from 0 to 0.8]
-
Mixtures of Gaussians
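M mixtures of Gaussians:

$$f(x \mid \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2\pi)^{D/2}\,|\Sigma_{jk}|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_{jk})^T \Sigma_{jk}^{-1} (x-\mu_{jk})\right)$$

For diagonal covariance:

$$b_j(o_t) = \sum_{k=1}^{M} c_{jk} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{jkd}^2}} \exp\left(-\frac{1}{2}\left(\frac{o_{td}-\mu_{jkd}}{\sigma_{jkd}}\right)^2\right)$$

$$b_j(o_t) = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{D/2}\sqrt{\prod_{d=1}^{D}\sigma_{jkd}^2}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jkd})^2}{\sigma_{jkd}^2}\right)$$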
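A minimal sketch of the diagonal-covariance mixture likelihood for one state: a weighted sum of diagonal Gaussians. The function names and the two-component, one-dimensional example are assumptions chosen to show a bimodal density that a single Gaussian could not represent.

```python
import math

def diag_gaussian_pdf(o, means, variances):
    """Diagonal-covariance Gaussian density at vector o."""
    D = len(o)
    norm = (2.0 * math.pi) ** (D / 2.0) * math.sqrt(math.prod(variances))
    exponent = -0.5 * sum((x - mu) ** 2 / var for x, mu, var in zip(o, means, variances))
    return math.exp(exponent) / norm

def gmm_likelihood(o, weights, means, variances):
    """b_j(o_t) = sum over components k of c_jk times the k-th Gaussian density."""
    return sum(c * diag_gaussian_pdf(o, mu, var)
               for c, mu, var in zip(weights, means, variances))

# Two-component mixture over a single feature (illustrative values).
weights = [0.5, 0.5]
means = [[-2.0], [2.0]]
variances = [[1.0], [1.0]]

at_mode = gmm_likelihood([2.0], weights, means, variances)
at_valley = gmm_likelihood([0.0], weights, means, variances)
print(at_mode, at_valley)
```

The density is higher at each component mean than at the midpoint between them, so the mixture has two modes.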
-
GMMs
Summary: each state has a likelihood function parameterized by:
  M Mixture weights
  M Mean Vectors of dimensionality D
  Either
    M Covariance Matrices of DxD
  Or more likely
    M Diagonal Covariance Matrices of DxD
    which is equivalent to
    M Variance Vectors of dimensionality D
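To make the storage difference concrete, here is a rough per-state parameter count using values mentioned elsewhere in the lecture (M = 16 mixtures, D = 39 features). The function name is hypothetical, and the full-covariance count stores all D x D entries without exploiting symmetry.

```python
def gmm_parameter_count(M, D, diagonal=True):
    """Number of stored values per state: weights + means + (co)variances."""
    weights = M
    means = M * D
    covariances = M * D if diagonal else M * D * D
    return weights + means + covariances

diag_count = gmm_parameter_count(16, 39, diagonal=True)
full_count = gmm_parameter_count(16, 39, diagonal=False)
print(diag_count, full_count)
```

Under these assumptions the diagonal model needs roughly one twentieth of the parameters of the full-covariance model.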
-
Training a GMM
Problem: how do we train a GMM if we don't know what component is accounting for aspects of any particular observation?
Intuition: we use Baum-Welch to find it for us, just as we did for finding hidden states that accounted for the observation
-
2/4/09 34
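Baum-Welch for Mixture Models
By analogy with earlier, let's define the probability of being in state j at time t with the mth mixture component accounting for o_t:

$$\xi_{tm}(j) = \frac{\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, c_{jm}\, b_{jm}(o_t)\, \beta_t(j)}{\alpha_F(T)}$$

Now,

$$\mu_{jm} = \frac{\sum_{t=1}^{T} \xi_{tm}(j)\, o_t}{\sum_{t=1}^{T} \xi_{tm}(j)}$$

$$c_{jm} = \frac{\sum_{t=1}^{T} \xi_{tm}(j)}{\sum_{t=1}^{T}\sum_{k=1}^{M} \xi_{tk}(j)}$$

$$\Sigma_{jm} = \frac{\sum_{t=1}^{T} \xi_{tm}(j)\,(o_t - \mu_{jm})(o_t - \mu_{jm})^T}{\sum_{t=1}^{T} \xi_{tm}(j)}$$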
-
How to train mixtures?
Choose M (often 16; or can tune M dependent on amount of training observations)
Then can do various splitting or clustering algorithms
One simple method for "splitting":
  1) Compute global mean μ and global variance
  2) Split into two Gaussians, with means μ ± ε (sometimes ε is 0.2σ)
  3) Run Forward-Backward to retrain
  4) Go to 2 until we have 16 mixtures
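The splitting loop can be sketched as below. Several details are assumptions rather than anything the slide specifies: the perturbation is taken to be 0.2 standard deviations, each child inherits the parent's variance and half its weight, and the Forward-Backward retraining step is only marked with a comment, not implemented.

```python
import math

def split_gaussian(weight, mean, var, eps_scale=0.2):
    """Split one scalar Gaussian into two with means mean -/+ eps_scale * sigma."""
    offset = eps_scale * math.sqrt(var)
    return [(weight / 2.0, mean - offset, var),
            (weight / 2.0, mean + offset, var)]

def split_all(components):
    """Double the number of mixture components by splitting each one."""
    return [child for (w, mu, var) in components
            for child in split_gaussian(w, mu, var)]

# Step 1: start from a single Gaussian with the global mean and variance (toy values).
mixture = [(1.0, 0.0, 4.0)]

# Steps 2-4: split, retrain, repeat until there are 16 components.
while len(mixture) < 16:
    mixture = split_all(mixture)
    # Retraining with Forward-Backward would update weights, means and
    # variances here; it is omitted in this sketch.

print(len(mixture))
```

Without the retraining step the components only fan out symmetrically around the global mean; it is the retraining between splits that moves them toward the actual modes of the data.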
-
2/4/09 CS 224S Winter 2007 36
Embedded Training
Components of a speech recognizer:
  Feature extraction: not statistical
  Language model: word transition probabilities, trained on some other corpus
  Acoustic model:
    Pronunciation lexicon: the HMM structure for each word, built by hand
    Observation likelihoods b_j(o_t)
    Transition probabilities a_ij
-
Embedded training of acoustic model
If we had hand-segmented and hand-labeled training data
  With word and phone boundaries
We could just compute the
  B: means and variances of all our triphone Gaussians
  A: transition probabilities
And we'd be done!
But we don't have word and phone boundaries, nor phone labeling
-
Embedded training
Instead:
  We'll train each phone HMM embedded in an entire sentence
  We'll do word/phone segmentation and alignment automatically as part of training process
-
Embedded Training
-
Initialization: Flat start
Transition probabilities:
  set to zero any that you want to be "structurally zero"
    The probability computation includes previous value of a_ij, so if it's zero it will never change
  Set the rest to identical values
Likelihoods:
  initialize μ and σ² of each state to global mean and variance of all training data
-
Embedded Training
Given: phoneset, pron lexicon, transcribed wavefiles
  Build a whole sentence HMM for each sentence
  Initialize A probs to 0.5, or to zero
  Initialize B probs to global mean and variance
  Run multiple iterations of Baum-Welch
    During each iteration, we compute forward and backward probabilities
    Use them to re-estimate A and B
  Run Baum-Welch until convergence
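The initialization step can be sketched as follows, before any Baum-Welch iteration runs. This is a simplified illustration: observations are scalars rather than 39-dimensional vectors, the set of allowed transitions is supplied by hand, and the name `flat_start` is hypothetical.

```python
def flat_start(num_states, allowed, frames):
    """Flat-start initialization for one HMM.

    allowed is the set of (i, j) transitions that are structurally permitted;
    every other a_ij stays at zero. frames is all training data (scalars here).
    """
    A = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        targets = [j for j in range(num_states) if (i, j) in allowed]
        for j in targets:
            A[i][j] = 1.0 / len(targets)  # identical values for allowed arcs

    n = len(frames)
    global_mean = sum(frames) / n
    global_var = sum((x - global_mean) ** 2 for x in frames) / n
    B = [(global_mean, global_var) for _ in range(num_states)]
    return A, B

# Left-to-right 3-state phone model: self-loops plus a step to the next state.
allowed = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)}
A, B = flat_start(3, allowed, [1.0, 2.0, 3.0, 6.0])
print(A, B)
```

With two allowed arcs out of a state, each gets 0.5, matching the "0.5, or to zero" initialization on the slide; every state starts with the same Gaussian, and Baum-Welch is what pulls them apart.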
-
Viterbi training
Baum-Welch training says:
  We need to know what state we were in, to accumulate counts of a given output symbol o_t
  We'll compute ξ_t(i), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output o_t.
Viterbi training says:
  Instead of summing over all possible paths, just take the single most likely path
  Use the Viterbi algorithm to compute this "Viterbi" path
  Via "forced alignment"
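The difference between summing over all paths and keeping only the best one can be seen on a toy discrete HMM. The model parameters and the function name below are invented for illustration; the recursion replaces the sum of the forward algorithm with a max and records backpointers.

```python
def forward_and_viterbi(pi, A, B, obs):
    """Return (total likelihood, best-path likelihood, best state path)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]  # forward: sums over paths
    delta = list(alpha)                               # Viterbi: max over paths
    backpointers = []
    for o in obs[1:]:
        new_alpha, new_delta, ptr = [], [], []
        for j in range(N):
            new_alpha.append(sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o])
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
            ptr.append(best_i)
        alpha, delta = new_alpha, new_delta
        backpointers.append(ptr)

    # Trace the best path backwards from the best final state.
    last = max(range(N), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return sum(alpha), delta[last], path

pi = [1.0, 0.0]                   # always start in state 0
A = [[0.6, 0.4], [0.0, 1.0]]      # left-to-right transitions
B = [[0.9, 0.1], [0.2, 0.8]]      # B[state][symbol]
total, best, path = forward_and_viterbi(pi, A, B, obs=[0, 0, 1])
print(total, best, path)
```

The best-path likelihood is always at most the total likelihood, since the best path is one term of the sum that the forward algorithm computes.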
-
Forced Alignment
Computing the Viterbi path over the training data is called "forced alignment"
Because we know which word string to assign to each observation sequence.
We just don't know the state sequence.
So we use a_ij to constrain the path to go through the correct words
And otherwise do normal Viterbi
Result: state sequence!
-
Viterbi training equations
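Viterbi (counts along the single best path, in place of the expected counts used by Baum-Welch):

$$a_{ij} = \frac{n_{ij}}{n_i}$$

$$b_j(v_k) = \frac{n_j(\text{s.t. } o_t = v_k)}{n_j}$$

For all pairs of emitting states, 1 ≤ i, j ≤ N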
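Given a state sequence from forced alignment, these estimates are just normalized counts. The sketch below makes one assumption explicit: n_i for the transition estimate is taken to be the number of transitions leaving state i on the path, so the final frame contributes no transition. The function name and toy data are hypothetical.

```python
from collections import Counter

def viterbi_count_estimates(path, symbols, num_states, num_symbols):
    """Estimate a_ij and b_j(v_k) by counting along a single state path.

    path[t] is the aligned state at time t; symbols[t] is the VQ codeword
    index observed at time t.
    """
    transitions = Counter(zip(path, path[1:]))  # n_ij
    departures = Counter(path[:-1])             # transitions leaving each state
    emissions = Counter(zip(path, symbols))     # n_j (s.t. o_t = v_k)
    visits = Counter(path)                      # n_j

    A = [[transitions[(i, j)] / departures[i] if departures[i] else 0.0
          for j in range(num_states)] for i in range(num_states)]
    B = [[emissions[(j, k)] / visits[j] if visits[j] else 0.0
          for k in range(num_symbols)] for j in range(num_states)]
    return A, B

path = [0, 0, 1, 1, 1]
symbols = [2, 2, 0, 1, 0]
A, B = viterbi_count_estimates(path, symbols, num_states=2, num_symbols=3)
print(A, B)
```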
-
Viterbi training (II)
Equations for non-mixture Gaussians

$$\mu_i = \frac{1}{N_i}\sum_{t=1}^{T} o_t \quad \text{s.t. } q_t = i$$

$$\sigma_i^2 = \frac{1}{N_i}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t = i$$

Viterbi training for mixture Gaussians is more complex; generally just assign each observation to 1 mixture
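With a hard alignment, each state's Gaussian is estimated only from the frames assigned to it. A minimal sketch, assuming scalar observations and a hypothetical function name:

```python
def viterbi_gaussian_estimates(observations, path, num_states):
    """Per-state mean and variance from frames aligned to each state.

    Returns {state: (mean, variance)}; states with no aligned frames are skipped.
    """
    params = {}
    for i in range(num_states):
        frames = [o for o, q in zip(observations, path) if q == i]
        if not frames:
            continue
        n_i = len(frames)  # N_i: number of frames aligned to state i
        mean = sum(frames) / n_i
        var = sum((o - mean) ** 2 for o in frames) / n_i
        params[i] = (mean, var)
    return params

observations = [1.0, 3.0, 10.0, 12.0, 14.0]
path = [0, 0, 1, 1, 1]
params = viterbi_gaussian_estimates(observations, path, num_states=3)
print(params)
```

A state that the alignment never visits gets no estimate in this sketch; a real trainer would have to decide how to handle that case.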
-
Log domain
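In practice, do all computation in log domain
  Avoids underflow
  Instead of multiplying lots of very small probabilities, we add numbers that are not so small.
Single multivariate Gaussian (diagonal Σ) compute:

$$b_j(o_t) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{jd}^2}} \exp\left(-\frac{1}{2}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right)$$

In log space:

$$\log b_j(o_t) = -\frac{1}{2}\sum_{d=1}^{D}\left[\log(2\pi) + \log\sigma_{jd}^2 + \frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right]$$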
-
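Log domain
Repeating:

$$\log b_j(o_t) = -\frac{1}{2}\sum_{d=1}^{D}\left[\log(2\pi) + \log\sigma_{jd}^2 + \frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right]$$

With some rearrangement of terms

$$\log b_j(o_t) = C_j - \frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}$$

Where:

$$C_j = -\frac{1}{2}\sum_{d=1}^{D}\left[\log(2\pi) + \log\sigma_{jd}^2\right]$$

Note that this looks like a weighted Mahalanobis distance!!!
Also may justify why these aren't really probabilities (point estimates); these are really just distances.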
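A short sketch of the log-domain computation, with the state-dependent constant precomputed once and reused for every frame. The function names, the unit-variance 39-dimensional state, and the 100 identical frames are illustrative assumptions chosen to make the underflow visible.

```python
import math

def log_gaussian_constant(variances):
    """Precomputable term: -1/2 * sum_d [log(2*pi) + log(sigma_d^2)]."""
    return -0.5 * sum(math.log(2.0 * math.pi) + math.log(v) for v in variances)

def log_diag_gaussian(o, means, variances, const):
    """log b_j(o_t) = constant minus half the variance-weighted squared distance."""
    return const - 0.5 * sum((x - mu) ** 2 / v
                             for x, mu, v in zip(o, means, variances))

D = 39
means = [0.0] * D
variances = [1.0] * D
frame = [3.0] * D  # a frame 3 standard deviations from the mean in every dimension

C = log_gaussian_constant(variances)
frame_logp = log_diag_gaussian(frame, means, variances, C)

# Scoring 100 such frames: adding logs stays finite...
num_frames = 100
total_logp = num_frames * frame_logp

# ...while multiplying the raw likelihoods underflows to exactly 0.0.
naive_product = 1.0
for _ in range(num_frames):
    naive_product *= math.exp(frame_logp)

print(frame_logp, total_logp, naive_product)
```

Per frame, only the weighted distance has to be computed, since the constant depends on the state's variances and not on the observation.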
-
Evaluation
How to evaluate the word string output by a speech recognizer?
-
Word Error Rate
$$\text{Word Error Rate} = 100 \times \frac{\text{Insertions} + \text{Substitutions} + \text{Deletions}}{\text{Total Words in Correct Transcript}}$$

Alignment example:
REF:  portable ****  PHONE UPSTAIRS last night so
HYP:  portable FORM  OF    STORES   last night so
Eval:          I     S     S
WER = 100 × (1+2+0)/6 = 50%
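The alignment behind WER is a minimum edit distance over words. The sketch below is a plain dynamic-programming version with unit costs for insertions, deletions and substitutions; it is not the sclite implementation, and the function name is hypothetical.

```python
def word_error_rate(ref, hyp):
    """WER in percent: minimum word-level edits divided by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dp[len(r)][len(h)] / len(r)

wer = word_error_rate("portable phone upstairs last night so",
                      "portable form of stores last night so")
print(wer)  # 50.0, matching the slide's example
```

Because insertions count as errors but the denominator is the reference length, WER can exceed 100%.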
-
NIST sctk-1.3 scoring software: Computing WER with sclite
http://www.nist.gov/speech/tools/
Sclite aligns a hypothesized text (HYP) (from the recognizer) with a correct or reference text (REF) (human transcribed)
id: (2347-b-013)
Scores: (#C #S #D #I) 9 3 1 2
REF:  was an engineer SO I   i was always with **** **** MEN UM   and they
HYP:  was an engineer ** AND i was always with THEM THEY ALL THAT and they
Eval:                 D  S                     I    I    S   S
-
Sclite output for error analysis
CONFUSION PAIRS Total (972)
With >= 1 occurances (972)
 1: 6 -> (%hesitation) ==> on
 2: 6 -> the ==> that
 3: 5 -> but ==> that
 4: 4 -> a ==> the
 5: 4 -> four ==> for
 6: 4 -> in ==> and
 7: 4 -> there ==> that
 8: 3 -> (%hesitation) ==> and
 9: 3 -> (%hesitation) ==> the
10: 3 -> (a-) ==> i
11: 3 -> and ==> i
12: 3 -> and ==> in
13: 3 -> are ==> there
14: 3 -> as ==> is
15: 3 -> have ==> that
16: 3 -> is ==> this
-
Sclite output for error analysis
17: 3 -> it ==> that
18: 3 -> mouse ==> most
19: 3 -> was ==> is
20: 3 -> was ==> this
21: 3 -> you ==> we
22: 2 -> (%hesitation) ==> it
23: 2 -> (%hesitation) ==> that
24: 2 -> (%hesitation) ==> to
25: 2 -> (%hesitation) ==> yeah
26: 2 -> a ==> all
27: 2 -> a ==> know
28: 2 -> a ==> you
29: 2 -> along ==> well
30: 2 -> and ==> it
31: 2 -> and ==> we
32: 2 -> and ==> you
33: 2 -> are ==> i
34: 2 -> are ==> were
-
Better metrics than WER?
WER has been useful
But should we be more concerned with meaning ("semantic error rate")?
  Good idea, but hard to agree on
  Has been applied in dialogue systems, where desired semantic output is more clear
-
Summary: ASR Architecture
Five easy pieces: ASR Noisy Channel architecture
1) Feature Extraction:
   39 "MFCC" features
2) Acoustic Model:
   Gaussians for computing p(o|q)
3) Lexicon/Pronunciation Model
   HMM: what phones can follow each other
4) Language Model
   N-grams for computing p(w_i|w_{i-1})
5) Decoder
   Viterbi algorithm: dynamic programming for combining all these to get word sequence from speech!
-
ASR Lexicon: Markov Models for pronunciation
-
Pronunciation Modeling
Generating surface forms:
Eric Fosler-Lussier slide
-
Dynamic Pronunciation Modeling
Slide from Eric Fosler-Lussier
-
Summary
Speech Recognition Architectural Overview
Hidden Markov Models in general
  Forward
  Viterbi Decoding
Hidden Markov models for Speech
Evaluation