Transcript of b224s.09.lec10

• Slide 1

    CS 224S / LINGUIST 281

Speech Recognition, Synthesis, and Dialogue

    Dan Jurafsky

    Lecture 10: Acoustic Modeling

    IP Notice:

• Slide 2

    Outline for Today

Speech Recognition Architectural Overview
Hidden Markov Models in general and for speech
  Forward
  Viterbi Decoding
How this fits into the ASR component of the course:
  Jan 27: HMMs, Forward, Viterbi
  Jan 29: Baum-Welch (Forward-Backward)
  Feb 3: Feature Extraction, MFCCs, start of AM (VQ)
  Feb 5: Acoustic Modeling: GMMs
  Feb 10: N-grams and Language Modeling
  Feb 24: Search and Advanced Decoding
  Feb 26: Dealing with Variation
  Mar 3: Dealing with Disfluencies

• Slide 3

    Outline for Today

Acoustic Model
  Increasingly sophisticated models of the acoustic likelihood for each state:
    Gaussians
    Multivariate Gaussians
    Mixtures of Multivariate Gaussians
  Where a state is progressively:
    CI subphone (3ish per phone)
    CD phone (= triphone)
    State-tying of CD phones
If time: Evaluation
  Word Error Rate

• Slide 4

    Reminder: VQ

To compute p(ot|qj):
  Compute the distance between feature vector ot and each codeword (prototype vector) in a preclustered codebook, where the distance is either:
    Euclidean
    Mahalanobis
  Choose the prototype vector that is closest to ot and take its codeword vk
  Then look up the likelihood of vk given HMM state j in the B matrix:
    bj(ot) = bj(vk) s.t. vk is the codeword of the prototype vector closest to ot
  where bj(vk) is estimated using Baum-Welch, as above
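As a concrete illustration of this lookup, here is a minimal numpy sketch; the codebook, B matrix, and feature dimensionality below are invented placeholders, not values from the lecture.

```python
import numpy as np

def vq_likelihood(o_t, codebook, B, state_j):
    """VQ emission likelihood: find the codeword nearest to o_t
    (Euclidean distance), then look up its probability for state j."""
    dists = np.linalg.norm(codebook - o_t, axis=1)  # distance to every prototype
    k = np.argmin(dists)                            # index of the closest codeword v_k
    return B[state_j, k]                            # b_j(o_t) = b_j(v_k)

# Toy example: a hypothetical 4-codeword codebook over 2-D features
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.5], [0.5, 3.0]])
B = np.array([[0.4, 0.3, 0.2, 0.1],   # state 0: P(v_k | q_0)
              [0.1, 0.2, 0.3, 0.4]])  # state 1: P(v_k | q_1)
print(vq_likelihood(np.array([0.9, 1.2]), codebook, B, state_j=0))  # -> 0.3
```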

• Slide 5

    Computing bj(vk)

[Figure: feature value 1 vs. feature value 2 for the vectors in state j]

bj(vk) = (number of vectors with codebook index k in state j) / (number of vectors in state j) = 14/56 = 1/4

    Slide from John-Paul Hosum, OHSU/OGI

• Slide 6

    Summary: VQ

Training: Do VQ and then use Baum-Welch to assign probabilities to each symbol
Decoding: Do VQ and then use the symbol probabilities in decoding

• Slide 7

Directly Modeling Continuous Observations

Gaussians
  Univariate Gaussians
    Baum-Welch for univariate Gaussians
  Multivariate Gaussians
    Baum-Welch for multivariate Gaussians
  Gaussian Mixture Models (GMMs)
    Baum-Welch for GMMs

• Slide 8

    Better than VQ

VQ is insufficient for real ASR
Instead: Assume the possible values of the observation feature vector ot are normally distributed.
Represent the observation likelihood function bj(ot) as a Gaussian with mean μj and variance σj²:

f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
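For concreteness, a minimal numpy sketch of this univariate Gaussian likelihood (the mean and variance values are made up for illustration):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density f(x | mu, sigma^2)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Likelihood of an observed feature value 1.8 under a state whose
# (hypothetical) Gaussian has mean 2.0 and variance 0.25
print(gaussian_pdf(1.8, mean=2.0, var=0.25))
```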

• Slide 9

Gaussians are parameterized by mean and variance

• Slide 10

Reminder: Means and Variances

For a discrete random variable X:
  The mean is the expected value of X: a weighted sum over the values of X, μ = E[X] = Σx x·P(X=x)
  The variance is the average squared deviation from the mean, σ² = E[(X − μ)²]

• Slide 11

Gaussian as Probability Density Function

• Slide 12

    Gaussian PDFs

A Gaussian is a probability density function; probability is the area under the curve.
To make it a probability, we constrain the area under the curve to be 1.
BUT we will be using point estimates: the value of the Gaussian at a point.
Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx.
As we will see later, this is OK, since the same dx factor is omitted from all the Gaussians, so the argmax is still correct.

• Slide 13

Gaussians for Acoustic Modeling

A Gaussian is parameterized by a mean and a variance.

[Figure: P(o|q) plotted against o for Gaussians with different means; P(o|q) is highest at the mean and low far from the mean]

• Slide 14

Using a (univariate) Gaussian as an acoustic likelihood estimator

Let's suppose our observation was a single real-valued feature (instead of a 39-dimensional vector).
Then, if we had learned a Gaussian over the distribution of values of this feature,
we could compute the likelihood of any given observation ot as follows:

b_j(o_t) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(o_t-\mu_j)^2}{2\sigma_j^2}\right)

• Slide 15

Training a Univariate Gaussian

A (single) Gaussian is characterized by a mean and a variance.
Imagine that we had some training data in which each state was labeled.
We could just compute the mean and variance from the data:

\mu_i = \frac{1}{T}\sum_{t=1}^{T} o_t \quad \text{s.t. } q_t \text{ is state } i

\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t \text{ is state } i
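A minimal numpy sketch of these supervised estimates, assuming we already had a frame-level state labeling (the toy arrays below are hypothetical):

```python
import numpy as np

def train_state_gaussian(observations, state_labels, state_i):
    """ML estimates of mean and variance for one state from labeled frames."""
    frames = observations[state_labels == state_i]  # frames assigned to state i
    return frames.mean(), frames.var()              # mean and average squared deviation

# Toy data: one 1-D feature per frame, with a known state label per frame
obs = np.array([1.9, 2.1, 2.0, 0.4, 0.6, 0.5])
labels = np.array([0, 0, 0, 1, 1, 1])
print(train_state_gaussian(obs, labels, state_i=0))  # mean ~2.0, small variance
```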

• Slide 16

    Training Univariate Gaussians

But we don't know which observation was produced by which state!
What we want: to assign each observation vector ot to every possible state i, prorated by the probability that the HMM was in state i at time t.
The probability of being in state i at time t is γt(i)!

\bar{\mu}_i = \frac{\sum_{t=1}^{T}\gamma_t(i)\, o_t}{\sum_{t=1}^{T}\gamma_t(i)} \qquad
\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T}\gamma_t(i)\,(o_t-\mu_i)^2}{\sum_{t=1}^{T}\gamma_t(i)}
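A sketch of these soft-count (Baum-Welch) updates, assuming the state posteriors gamma[t, i] have already been computed by forward-backward (the arrays below are made up):

```python
import numpy as np

def reestimate_gaussian(obs, gamma, state_i):
    """Gamma-weighted mean and variance updates for one state."""
    w = gamma[:, state_i]                            # gamma_t(i) for every frame t
    mu = np.sum(w * obs) / np.sum(w)
    var = np.sum(w * (obs - mu) ** 2) / np.sum(w)
    return mu, var

# Toy 1-D observations and (hypothetical) state posteriors for a 2-state HMM
obs = np.array([1.9, 2.1, 0.4, 0.6])
gamma = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
print(reestimate_gaussian(obs, gamma, state_i=0))
```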

• Slide 17

    Multivariate Gaussians

Instead of a single mean μ and variance σ²:
A vector of observations x is modeled by a vector of means μ and a covariance matrix Σ.

f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)
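A minimal numpy sketch of the full-covariance multivariate Gaussian density (the mean vector and covariance matrix below are invented for illustration):

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Multivariate Gaussian density f(x | mu, Sigma) for a D-dimensional x."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
    quad = diff @ np.linalg.solve(sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(mvn_pdf(np.array([0.2, 1.5]), mu, sigma))
```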

• Slide 18

    Multivariate Gaussians

Defining μ and Σ:

\mu = E(x)

\Sigma = E\left[(x-\mu)(x-\mu)^T\right]

So the (i, j)-th element of Σ is:

\sigma_{ij}^2 = E\left[(x_i-\mu_i)(x_j-\mu_j)\right]

• Slide 19

Gaussian Intuitions: Size of Σ

μ = [0 0], Σ = I;  μ = [0 0], Σ = 0.6I;  μ = [0 0], Σ = 2I
As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed.

Text and figures from Andrew Ng's lecture notes for CS229

• Slide 20

    From Chen, Picheny et al lecture slides

• Slide 21

\Sigma = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} \qquad \Sigma = \begin{bmatrix}0.6 & 0\\ 0 & 2\end{bmatrix}

Different variances in different dimensions

• Slide 22

Gaussian Intuitions: Off-diagonal

As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y.

    Text and figures from Andrew Ngs lecture notes for CS229

• Slide 23

Gaussian Intuitions: Off-diagonal

As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y.

    Text and figures from Andrew Ngs lecture notes for CS229

• Slide 24

Gaussian Intuitions: Off-diagonal and Diagonal

Decreasing the off-diagonal entries (#1-2); increasing the variance of one dimension on the diagonal (#3).

    Text and figures from Andrew Ngs lecture notes for CS229

• Slide 25

    In two dimensions

    From Chen, Picheny et al lecture slides

• Slide 26

But: Assume Diagonal Covariance

I.e., assume that the features in the feature vector are uncorrelated.
This isn't true for FFT features, but it is true for MFCC features, as we saw last time.
Computation and storage are much cheaper with a diagonal covariance:
  Only the diagonal entries are non-zero
  The diagonal contains the variance of each dimension, σii²
  So this means we consider the variance of each acoustic feature (dimension) separately

• Slide 27

    Diagonal covariance

The diagonal contains the variance of each dimension, σjd².
So this means we consider the variance of each acoustic feature (dimension) separately:

b_j(o_t) = \prod_{d=1}^{D}\frac{1}{\sqrt{2\pi\sigma_{jd}^2}} \exp\left(-\frac{1}{2}\left(\frac{o_{td}-\mu_{jd}}{\sigma_{jd}}\right)^2\right)

b_j(o_t) = \frac{1}{\sqrt{(2\pi)^{D}\prod_{d=1}^{D}\sigma_{jd}^2}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right)
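A numpy sketch of this diagonal-covariance likelihood (the means and variances are placeholders):

```python
import numpy as np

def diag_gaussian_likelihood(o_t, mu_j, var_j):
    """b_j(o_t) with diagonal covariance: a product of per-dimension Gaussians."""
    norm = np.sqrt(2 * np.pi * var_j)
    return np.prod(np.exp(-0.5 * (o_t - mu_j) ** 2 / var_j) / norm)

# Hypothetical 3-dimensional state model
mu_j = np.array([0.0, 1.0, -0.5])
var_j = np.array([1.0, 0.5, 2.0])
print(diag_gaussian_likelihood(np.array([0.1, 0.8, -0.3]), mu_j, var_j))
```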

• Slide 28

Baum-Welch reestimation equations for multivariate Gaussians

A natural extension of the univariate case, where now μi is the mean vector for state i:

\bar{\mu}_i = \frac{\sum_{t=1}^{T}\gamma_t(i)\, o_t}{\sum_{t=1}^{T}\gamma_t(i)} \qquad
\bar{\Sigma}_i = \frac{\sum_{t=1}^{T}\gamma_t(i)\,(o_t-\mu_i)(o_t-\mu_i)^T}{\sum_{t=1}^{T}\gamma_t(i)}

• Slide 29

But we're not there yet

A single Gaussian may do a bad job of modeling the distribution in any dimension:
Solution: Mixtures of Gaussians

Figure from Chen, Picheny et al. slides

• Slide 30

Mixture of Gaussians to model a function

[Figure: weighted Gaussian components and their sum modeling a multimodal function; x-axis from -4 to 4, y-axis from 0 to 0.8]

• Slide 31

    Mixtures of Gaussians

M mixtures of Gaussians:

f(x \mid \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk}\,\frac{1}{(2\pi)^{D/2}\,|\Sigma_{jk}|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_{jk})^T \Sigma_{jk}^{-1}(x-\mu_{jk})\right)

For diagonal covariance:

b_j(o_t) = \sum_{k=1}^{M} c_{jk} \prod_{d=1}^{D}\frac{1}{\sqrt{2\pi\sigma_{jkd}^2}} \exp\left(-\frac{1}{2}\left(\frac{o_{td}-\mu_{jkd}}{\sigma_{jkd}}\right)^2\right)

b_j(o_t) = \sum_{k=1}^{M} \frac{c_{jk}}{\sqrt{(2\pi)^{D}\prod_{d=1}^{D}\sigma_{jkd}^2}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jkd})^2}{\sigma_{jkd}^2}\right)
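A minimal sketch of the diagonal-covariance GMM likelihood, assuming per-state arrays of mixture weights, means, and variances (all values below are invented):

```python
import numpy as np

def gmm_likelihood(o_t, weights, means, variances):
    """b_j(o_t) for a diagonal-covariance GMM: weighted sum over M components."""
    norm = np.sqrt(2 * np.pi * variances)                                  # (M, D)
    comps = np.prod(np.exp(-0.5 * (o_t - means) ** 2 / variances) / norm, axis=1)
    return np.dot(weights, comps)                                          # sum_k c_k * comp_k

# Hypothetical 2-component mixture over 2-D features
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [2.0, 1.0]])
variances = np.array([[1.0, 1.0], [0.5, 0.5]])
print(gmm_likelihood(np.array([0.3, 0.2]), weights, means, variances))
```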

• Slide 32

    GMMs

Summary: each state has a likelihood function parameterized by:
  M mixture weights
  M mean vectors of dimensionality D
  Either: M covariance matrices of size D x D
  Or, more likely: M diagonal covariance matrices of size D x D, which is equivalent to M variance vectors of dimensionality D

• Slide 33

    Training a GMM

Problem: how do we train a GMM if we don't know what component is accounting for aspects of any particular observation?
Intuition: we use Baum-Welch to find it for us, just as we did for finding the hidden states that accounted for the observation.

• Slide 34


Baum-Welch for Mixture Models

By analogy with γt(i) earlier, let's define the probability of being in state j at time t with the m-th mixture component accounting for ot:

\xi_{tm}(j) = \frac{\sum_{i=1}^{N}\alpha_{t-1}(i)\, a_{ij}\, c_{jm}\, b_{jm}(o_t)\,\beta_t(j)}{\alpha_F(T)}

Now:

\bar{c}_{jm} = \frac{\sum_{t=1}^{T}\xi_{tm}(j)}{\sum_{t=1}^{T}\sum_{k=1}^{M}\xi_{tk}(j)} \qquad
\bar{\mu}_{jm} = \frac{\sum_{t=1}^{T}\xi_{tm}(j)\, o_t}{\sum_{t=1}^{T}\sum_{k=1}^{M}\xi_{tk}(j)} \qquad
\bar{\Sigma}_{jm} = \frac{\sum_{t=1}^{T}\xi_{tm}(j)\,(o_t-\mu_{jm})(o_t-\mu_{jm})^T}{\sum_{t=1}^{T}\sum_{k=1}^{M}\xi_{tk}(j)}

• Slide 35


    How to train mixtures?

Choose M (often 16; or M can be tuned depending on the amount of training observations)
Then we can use various splitting or clustering algorithms
One simple method for splitting (a toy sketch follows below):
  1) Compute the global mean μ and global variance
  2) Split into two Gaussians, with means μ ± ε (sometimes ε is 0.2σ)
  3) Run Forward-Backward to retrain
  4) Go to 2 until we have 16 mixtures
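A toy numpy sketch of the splitting step, doubling the number of components by perturbing each mean by ±0.2σ (the data and exact constants here are illustrative assumptions):

```python
import numpy as np

def split_gaussians(means, variances, weights, epsilon=0.2):
    """Double the number of mixture components: perturb each mean by
    +/- epsilon * sigma, halve its weight, and copy its variance."""
    sigma = np.sqrt(variances)
    new_means = np.concatenate([means + epsilon * sigma, means - epsilon * sigma])
    new_vars = np.concatenate([variances, variances])
    new_weights = np.concatenate([weights / 2, weights / 2])
    return new_means, new_vars, new_weights

# Start from the global mean/variance of some (hypothetical) training data,
# then split repeatedly: 1 -> 2 -> 4 -> 8 -> 16 components
data = np.random.randn(1000) * 1.5 + 0.3
means, variances, weights = np.array([data.mean()]), np.array([data.var()]), np.array([1.0])
for _ in range(4):
    means, variances, weights = split_gaussians(means, variances, weights)
    # ... in a real system, re-run Forward-Backward here before the next split ...
print(len(means))  # 16
```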

• Slide 36


    Embedded Training

Components of a speech recognizer:
  Feature extraction: not statistical
  Language model: word transition probabilities, trained on some other corpus
  Acoustic model:
    Pronunciation lexicon: the HMM structure for each word, built by hand
    Observation likelihoods bj(ot)
    Transition probabilities aij

• Slide 37


Embedded training of acoustic model

If we had hand-segmented and hand-labeled training data, with word and phone boundaries, we could just compute:
  B: the means and variances of all our triphone Gaussians
  A: the transition probabilities
And we'd be done!
But we don't have word and phone boundaries, nor phone labeling.

• Slide 38


    Embedded training

Instead:
  We'll train each phone HMM embedded in an entire sentence
  We'll do word/phone segmentation and alignment automatically as part of the training process

• Slide 39


    Embedded Training

• Slide 40


    Initialization: Flat start

Transition probabilities:
  Set to zero any that you want to be structurally zero
    The probability computation includes the previous value of aij, so if it's zero it will never change
  Set the rest to identical values
Likelihoods:
  Initialize the μ and σ² of each state to the global mean and variance of all the training data

• Slide 41


    Embedded Training

Given: a phoneset, a pronunciation lexicon, and transcribed wavefiles
Build a whole-sentence HMM for each sentence (a toy sketch follows below)
Initialize the A probabilities to 0.5, or to zero
Initialize the B probabilities to the global mean and variance
Run multiple iterations of Baum-Welch:
  During each iteration, compute the forward and backward probabilities
  Use them to re-estimate A and B
Run Baum-Welch until convergence
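As an illustration of the whole-sentence HMM step, here is a toy sketch that strings together per-phone subphone states using a hypothetical mini pronunciation lexicon; real systems also add silence models and context-dependent states:

```python
# Hypothetical mini pronunciation lexicon: word -> phone sequence
LEXICON = {
    "the": ["dh", "ah"],
    "cat": ["k", "ae", "t"],
}

def sentence_hmm_states(transcript, states_per_phone=3):
    """Build the linear state sequence of a whole-sentence HMM by
    concatenating subphone states for every phone of every word."""
    states = []
    for word in transcript.split():
        for phone in LEXICON[word]:
            for sub in range(states_per_phone):
                states.append(f"{phone}_{sub}")  # e.g. 'dh_0', 'dh_1', 'dh_2'
    return states

print(sentence_hmm_states("the cat"))
# ['dh_0', 'dh_1', 'dh_2', 'ah_0', ..., 't_2']
```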

• Slide 42


    Viterbi training

Baum-Welch training says:
  We need to know what state we were in, to accumulate counts of a given output symbol ot
  We'll compute γi(t), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output ot
Viterbi training says:
  Instead of summing over all possible paths, just take the single most likely path
  Use the Viterbi algorithm to compute this Viterbi path
  Via forced alignment

• Slide 43


    Forced Alignment

Computing the Viterbi path over the training data is called forced alignment
Because we know which word string to assign to each observation sequence
We just don't know the state sequence
So we use aij to constrain the path to go through the correct words
And otherwise do normal Viterbi
Result: a state sequence!

• Slide 44


    Viterbi training equations

Viterbi (count-based) versions of the Baum-Welch (expected-count) estimates:

\hat{b}_j(v_k) = \frac{n_j \text{ s.t. } o_t = v_k}{n_j}

\hat{a}_{ij} = \frac{n_{ij}}{n_i} \quad \text{for all pairs of emitting states}

• Slide 45

• Slide 46


    Viterbi training (II)

Equations for non-mixture Gaussians:

\mu_i = \frac{1}{N_i}\sum_{t=1}^{T} o_t \quad \text{s.t. } q_t = i

\sigma_i^2 = \frac{1}{N_i}\sum_{t=1}^{T}(o_t-\mu_i)^2 \quad \text{s.t. } q_t = i

Viterbi training for mixture Gaussians is more complex; generally we just assign each observation to one mixture component.
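Alongside the Gaussian updates, the transition probabilities can be re-estimated by counting along the single Viterbi path, aij = nij / ni. A small numpy sketch, assuming a forced alignment has already produced a state label for every frame (the toy path below is made up):

```python
import numpy as np

def viterbi_transition_estimates(state_path, num_states):
    """Estimate a_ij = n_ij / n_i from a single Viterbi (forced-alignment) path."""
    counts = np.zeros((num_states, num_states))
    for i, j in zip(state_path[:-1], state_path[1:]):
        counts[i, j] += 1                       # n_ij: transitions from i to j
    totals = counts.sum(axis=1, keepdims=True)  # n_i: transitions out of i
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Toy forced-alignment path through a 3-state left-to-right HMM
path = [0, 0, 0, 1, 1, 2, 2, 2]
print(viterbi_transition_estimates(path, num_states=3))
```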

• Slide 47


    Log domain

In practice, do all computation in the log domain
  Avoids underflow
  Instead of multiplying lots of very small probabilities, we add numbers that are not so small
For a single multivariate Gaussian (diagonal Σ) we compute:

b_j(o_t) = \frac{1}{\sqrt{(2\pi)^{D}\prod_{d=1}^{D}\sigma_{jd}^2}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right)

In log space:

\log b_j(o_t) = -\frac{1}{2}\sum_{d=1}^{D}\left[\log(2\pi\sigma_{jd}^2) + \frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right]
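A numpy sketch of the log-domain computation (same placeholder parameters as the earlier diagonal-covariance example):

```python
import numpy as np

def diag_gaussian_loglik(o_t, mu_j, var_j):
    """log b_j(o_t) for a diagonal-covariance Gaussian, computed as a sum."""
    return -0.5 * np.sum(np.log(2 * np.pi * var_j) + (o_t - mu_j) ** 2 / var_j)

mu_j = np.array([0.0, 1.0, -0.5])
var_j = np.array([1.0, 0.5, 2.0])
print(diag_gaussian_loglik(np.array([0.1, 0.8, -0.3]), mu_j, var_j))
```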

• Slide 48


Log domain

Repeating:

\log b_j(o_t) = -\frac{1}{2}\sum_{d=1}^{D}\left[\log(2\pi\sigma_{jd}^2) + \frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right]

With some rearrangement of terms:

\log b_j(o_t) = C_j - \frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}

Where:

C_j = -\frac{1}{2}\sum_{d=1}^{D}\log(2\pi\sigma_{jd}^2)

Note that this looks like a weighted Mahalanobis distance!
This also may justify why these aren't really probabilities (point estimates); they are really just distances.

• Slide 49

    Evaluation

How do we evaluate the word string output by a speech recognizer?

• Slide 50

    Word Error Rate

Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)

Alignment example:

    REF: portable **** PHONE UPSTAIRS last night so

    HYP: portable FORM OF STORES last night so

    Eval I S S

    WER = 100 (1+2+0)/6 = 50%
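A compact sketch of WER computed via dynamic-programming edit distance over words (the example strings match the alignment above):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance (S + D + I) over reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # substitution or match
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("portable phone upstairs last night so",
          "portable form of stores last night so"))  # 50.0
```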

• Slide 51

NIST sctk-1.3 scoring software: Computing WER with sclite

http://www.nist.gov/speech/tools/
Sclite aligns a hypothesized text (HYP) (from the recognizer) with a correct or reference text (REF) (human transcribed)

id: (2347-b-013)
Scores: (#C #S #D #I) 9 3 1 2
REF:  was an engineer SO I   i was always with **** **** MEN  UM   and they
HYP:  was an engineer ** AND i was always with THEM THEY ALL  THAT and they
Eval:                 D  S                     I    I    S    S

• Slide 52

Sclite output for error analysis

CONFUSION PAIRS  Total (972)
With >= 1 occurances (972)

 1:  6 -> (%hesitation) ==> on
 2:  6 -> the ==> that
 3:  5 -> but ==> that
 4:  4 -> a ==> the
 5:  4 -> four ==> for
 6:  4 -> in ==> and
 7:  4 -> there ==> that
 8:  3 -> (%hesitation) ==> and
 9:  3 -> (%hesitation) ==> the
10:  3 -> (a-) ==> i
11:  3 -> and ==> i
12:  3 -> and ==> in
13:  3 -> are ==> there
14:  3 -> as ==> is
15:  3 -> have ==> that
16:  3 -> is ==> this

• Slide 53

Sclite output for error analysis

17:  3 -> it ==> that
18:  3 -> mouse ==> most
19:  3 -> was ==> is
20:  3 -> was ==> this
21:  3 -> you ==> we
22:  2 -> (%hesitation) ==> it
23:  2 -> (%hesitation) ==> that
24:  2 -> (%hesitation) ==> to
25:  2 -> (%hesitation) ==> yeah
26:  2 -> a ==> all
27:  2 -> a ==> know
28:  2 -> a ==> you
29:  2 -> along ==> well
30:  2 -> and ==> it
31:  2 -> and ==> we
32:  2 -> and ==> you
33:  2 -> are ==> i
34:  2 -> are ==> were

• Slide 54

    Better metrics than WER?

WER has been useful
But should we be more concerned with meaning (semantic error rate)?
  A good idea, but hard to agree on
  Has been applied in dialogue systems, where the desired semantic output is more clear

• Slide 55

    Summary: ASR Architecture

Five easy pieces: the ASR Noisy Channel architecture
1) Feature Extraction: 39 MFCC features
2) Acoustic Model: Gaussians for computing p(o|q)
3) Lexicon/Pronunciation Model: HMM: what phones can follow each other
4) Language Model: N-grams for computing p(wi|wi-1)
5) Decoder: Viterbi algorithm: dynamic programming for combining all of these to get a word sequence from speech!


• Slide 56

ASR Lexicon: Markov Models for pronunciation

• Slide 57

    Pronunciation Modeling

    Generating surface forms:

    Eric Fosler-Lussier slide


• Slide 58

Dynamic Pronunciation Modeling

    Slide from Eric Fosler-Lussier

• Slide 59

    Summary

Speech Recognition Architectural Overview
Hidden Markov Models in general
  Forward
  Viterbi Decoding
Hidden Markov Models for Speech
Evaluation