Transcript of b224s.09.lec10

• Slide 1

    CS 224S / LINGUIST 281

Speech Recognition, Synthesis, and Dialogue

    Dan Jurafsky

    Lecture 10: Acoustic Modeling

    IP Notice:

• Slide 2

    Outline for Today

Speech Recognition Architectural Overview
Hidden Markov Models in general and for speech
  Forward
  Viterbi Decoding
How this fits into the ASR component of the course:
  Jan 27: HMMs, Forward, Viterbi
  Jan 29: Baum-Welch (Forward-Backward)
  Feb 3: Feature Extraction, MFCCs, start of AM (VQ)
  Feb 5: Acoustic Modeling: GMMs
  Feb 10: N-grams and Language Modeling
  Feb 24: Search and Advanced Decoding
  Feb 26: Dealing with Variation
  Mar 3: Dealing with Disfluencies

• Slide 3

    Outline for Today

Acoustic Model
  Increasingly sophisticated models of the acoustic likelihood for each state:
    Gaussians
    Multivariate Gaussians
    Mixtures of Multivariate Gaussians
  Where a state is progressively:
    CI subphone (3ish per phone)
    CD phone (= triphone)
    State-tying of CD phones
If time: Evaluation
  Word Error Rate

• Slide 4

    Reminder: VQ

To compute p(ot|qj):
  Compute the distance between feature vector ot and each codeword (prototype vector) in a preclustered codebook, where the distance is either:
    Euclidean
    Mahalanobis
  Choose the prototype vector that is closest to ot and take its codeword vk
  Then look up the likelihood of vk given HMM state j in the B matrix:
    bj(ot) = bj(vk) s.t. vk is the codeword of the prototype vector closest to ot
  where bj(vk) is estimated using Baum-Welch, as above
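As a concrete illustration of this lookup, here is a minimal numpy sketch; the codebook, B matrix, and feature dimensionality below are invented placeholders, not values from the lecture.

```python
import numpy as np

def vq_likelihood(o_t, codebook, B, state_j):
    """VQ emission likelihood: find the codeword nearest to o_t
    (Euclidean distance), then look up its probability for state j."""
    dists = np.linalg.norm(codebook - o_t, axis=1)  # distance to every prototype
    k = np.argmin(dists)                            # index of the closest codeword v_k
    return B[state_j, k]                            # b_j(o_t) = b_j(v_k)

# Toy example: a hypothetical 4-codeword codebook over 2-D features
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.5], [0.5, 3.0]])
B = np.array([[0.4, 0.3, 0.2, 0.1],   # state 0: P(v_k | q_0)
              [0.1, 0.2, 0.3, 0.4]])  # state 1: P(v_k | q_1)
print(vq_likelihood(np.array([0.9, 1.2]), codebook, B, state_j=0))  # -> 0.3
```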

• Slide 5

    Computing bj(vk)

[Figure: feature value 1 vs. feature value 2 for the vectors in state j]

bj(vk) = (number of vectors with codebook index k in state j) / (number of vectors in state j) = 14/56 = 1/4

    Slide from John-Paul Hosum, OHSU/OGI

• Slide 6

    Summary: VQ

Training: Do VQ and then use Baum-Welch to assign probabilities to each symbol
Decoding: Do VQ and then use the symbol probabilities in decoding

• Slide 7

Directly Modeling Continuous Observations

Gaussians
  Univariate Gaussians
    Baum-Welch for univariate Gaussians
  Multivariate Gaussians
    Baum-Welch for multivariate Gaussians
  Gaussian Mixture Models (GMMs)
    Baum-Welch for GMMs

• Slide 8

    Better than VQ

VQ is insufficient for real ASR
Instead: Assume the possible values of the observation feature vector ot are normally distributed.
Represent the observation likelihood function bj(ot) as a Gaussian with mean μj and variance σj²:

f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
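For concreteness, a minimal numpy sketch of this univariate Gaussian likelihood (the mean and variance values are made up for illustration):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density f(x | mu, sigma^2)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Likelihood of an observed feature value 1.8 under a state whose
# (hypothetical) Gaussian has mean 2.0 and variance 0.25
print(gaussian_pdf(1.8, mean=2.0, var=0.25))
```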

• Slide 9

Gaussians are parameterized by mean and variance

• Slide 10

Reminder: Means and Variances

For a discrete random variable X:
  The mean is the expected value of X: a weighted sum over the values of X, μ = E[X] = Σx x·P(X=x)
  The variance is the average squared deviation from the mean, σ² = E[(X − μ)²]

• Slide 11

Gaussian as Probability Density Function

• Slide 12

    Gaussian PDFs

A Gaussian is a probability density function; probability is the area under the curve.
To make it a probability, we constrain the area under the curve to be 1.
BUT we will be using point estimates: the value of the Gaussian at a point.
Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx.
As we will see later, this is OK, since the same dx factor is omitted from all the Gaussians, so the argmax is still correct.

• Slide 13

Gaussians for Acoustic Modeling

A Gaussian is parameterized by a mean and a variance.

[Figure: P(o|q) plotted against o for Gaussians with different means; P(o|q) is highest at the mean and low far from the mean]

• Slide 14

Using a (univariate) Gaussian as an acoustic likelihood estimator

Let's suppose our observation was a single real-valued feature (instead of a 39-dimensional vector).
Then, if we had learned a Gaussian over the distribution of values of this feature,
we could compute the likelihood of any given observation ot as follows:

b_j(o_t) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(o_t-\mu_j)^2}{2\sigma_j^2}\right)

• Slide 15

Training a Univariate Gaussian

A (single) Gaussian is characterized by a mean and a variance.
Imagine that we had some training data in which each state was labeled.
We could just compute the mean and variance from the data:

\mu_i = \frac{1}{T}\sum_{t=1}^{T} o_t \quad \text{s.t. } q_t \text{ is state } i

\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t \text{ is state } i
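A minimal numpy sketch of these supervised estimates, assuming we already had a frame-level state labeling (the toy arrays below are hypothetical):

```python
import numpy as np

def train_state_gaussian(observations, state_labels, state_i):
    """ML estimates of mean and variance for one state from labeled frames."""
    frames = observations[state_labels == state_i]  # frames assigned to state i
    return frames.mean(), frames.var()              # mean and average squared deviation

# Toy data: one 1-D feature per frame, with a known state label per frame
obs = np.array([1.9, 2.1, 2.0, 0.4, 0.6, 0.5])
labels = np.array([0, 0, 0, 1, 1, 1])
print(train_state_gaussian(obs, labels, state_i=0))  # mean ~2.0, small variance
```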

• Slide 16

    Training Univariate Gaussians

But we don't know which observation was produced by which state!
What we want: to assign each observation vector ot to every possible state i, prorated by the probability that the HMM was in state i at time t.
The probability of being in state i at time t is γt(i)!

\bar{\mu}_i = \frac{\sum_{t=1}^{T}\gamma_t(i)\, o_t}{\sum_{t=1}^{T}\gamma_t(i)} \qquad
\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T}\gamma_t(i)\,(o_t-\mu_i)^2}{\sum_{t=1}^{T}\gamma_t(i)}
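A sketch of these soft-count (Baum-Welch) updates, assuming the state posteriors gamma[t, i] have already been computed by forward-backward (the arrays below are made up):

```python
import numpy as np

def reestimate_gaussian(obs, gamma, state_i):
    """Gamma-weighted mean and variance updates for one state."""
    w = gamma[:, state_i]                            # gamma_t(i) for every frame t
    mu = np.sum(w * obs) / np.sum(w)
    var = np.sum(w * (obs - mu) ** 2) / np.sum(w)
    return mu, var

# Toy 1-D observations and (hypothetical) state posteriors for a 2-state HMM
obs = np.array([1.9, 2.1, 0.4, 0.6])
gamma = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
print(reestimate_gaussian(obs, gamma, state_i=0))
```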

• Slide 17

    Multivariate Gaussians

Instead of a single mean μ and variance σ²:
A vector of observations x is modeled by a vector of means μ and a covariance matrix Σ.

f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)
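A minimal numpy sketch of the full-covariance multivariate Gaussian density (the mean vector and covariance matrix below are invented for illustration):

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Multivariate Gaussian density f(x | mu, Sigma) for a D-dimensional x."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
    quad = diff @ np.linalg.solve(sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(mvn_pdf(np.array([0.2, 1.5]), mu, sigma))
```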

• Slide 18

    Multivariate Gaussians

Defining μ and Σ:

\mu = E(x)

\Sigma = E\left[(x-\mu)(x-\mu)^T\right]

So the (i, j)-th element of Σ is:

\sigma_{ij}^2 = E\left[(x_i-\mu_i)(x_j-\mu_j)\right]

• Slide 19

Gaussian Intuitions: Size of Σ

μ = [0 0], Σ = I;  μ = [0 0], Σ = 0.6I;  μ = [0 0], Σ = 2I
As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed.

Text and figures from Andrew Ng's lecture notes for CS229

• Slide 20

    From Chen, Picheny et al lecture slides

• Slide 21

\Sigma = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} \qquad \Sigma = \begin{bmatrix}0.6 & 0\\ 0 & 2\end{bmatrix}

Different variances in different dimensions

• Slide 22

Gaussian Intuitions: Off-diagonal

As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y.

    Text and figures from Andrew Ngs lecture notes for CS229

• Slide 23

Gaussian Intuitions: Off-diagonal

As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y.

    Text and figures from Andrew Ngs lecture notes for CS229

• Slide 24

Gaussian Intuitions: Off-diagonal and Diagonal

Decreasing the off-diagonal entries (#1-2); increasing the variance of one dimension on the diagonal (#3).

    Text and figures from Andrew Ngs lecture notes for CS229

• Slide 25

    In two dimensions

    From Chen, Picheny et al lecture slides

• Slide 26

But: Assume Diagonal Covariance

I.e., assume that the features in the feature vector are uncorrelated.
This isn't true for FFT features, but it is true for MFCC features, as we saw last time.
Computation and storage are much cheaper with a diagonal covariance:
  Only the diagonal entries are non-zero
  The diagonal contains the variance of each dimension, σii²
  So this means we consider the variance of each acoustic feature (dimension) separately

• Slide 27

    Diagonal covariance

The diagonal contains the variance of each dimension, σjd².
So this means we consider the variance of each acoustic feature (dimension) separately:

b_j(o_t) = \prod_{d=1}^{D}\frac{1}{\sqrt{2\pi\sigma_{jd}^2}} \exp\left(-\frac{1}{2}\left(\frac{o_{td}-\mu_{jd}}{\sigma_{jd}}\right)^2\right)

b_j(o_t) = \frac{1}{\sqrt{(2\pi)^{D}\prod_{d=1}^{D}\sigma_{jd}^2}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right)
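A numpy sketch of this diagonal-covariance likelihood (the means and variances are placeholders):

```python
import numpy as np

def diag_gaussian_likelihood(o_t, mu_j, var_j):
    """b_j(o_t) with diagonal covariance: a product of per-dimension Gaussians."""
    norm = np.sqrt(2 * np.pi * var_j)
    return np.prod(np.exp(-0.5 * (o_t - mu_j) ** 2 / var_j) / norm)

# Hypothetical 3-dimensional state model
mu_j = np.array([0.0, 1.0, -0.5])
var_j = np.array([1.0, 0.5, 2.0])
print(diag_gaussian_likelihood(np.array([0.1, 0.8, -0.3]), mu_j, var_j))
```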

• Slide 28

Baum-Welch reestimation equations for multivariate Gaussians

A natural extension of the univariate case, where now μi is the mean vector for state i:

\bar{\mu}_i = \frac{\sum_{t=1}^{T}\gamma_t(i)\, o_t}{\sum_{t=1}^{T}\gamma_t(i)} \qquad
\bar{\Sigma}_i = \frac{\sum_{t=1}^{T}\gamma_t(i)\,(o_t-\mu_i)(o_t-\mu_i)^T}{\sum_{t=1}^{T}\gamma_t(i)}

• Slide 29

But we're not there yet

A single Gaussian may do a bad job of modeling the distribution in any dimension:
Solution: Mixtures of Gaussians

Figure from Chen, Picheny et al. slides

• Slide 30

Mixture of Gaussians to model a function

[Figure: weighted Gaussian components and their sum modeling a multimodal function; x-axis from -4 to 4, y-axis from 0 to 0.8]

• Slide 31

    Mixtures of Gaussians

M mixtures of Gaussians:

f(x \mid \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk}\,\frac{1}{(2\pi)^{D/2}\,|\Sigma_{jk}|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_{jk})^T \Sigma_{jk}^{-1}(x-\mu_{jk})\right)

For diagonal covariance:

b_j(o_t) = \sum_{k=1}^{M} c_{jk} \prod_{d=1}^{D}\frac{1}{\sqrt{2\pi\sigma_{jkd}^2}} \exp\left(-\frac{1}{2}\left(\frac{o_{td}-\mu_{jkd}}{\sigma_{jkd}}\right)^2\right)

b_j(o_t) = \sum_{k=1}^{M} \frac{c_{jk}}{\sqrt{(2\pi)^{D}\prod_{d=1}^{D}\sigma_{jkd}^2}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jkd})^2}{\sigma_{jkd}^2}\right)
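A minimal sketch of the diagonal-covariance GMM likelihood, assuming per-state arrays of mixture weights, means, and variances (all values below are invented):

```python
import numpy as np

def gmm_likelihood(o_t, weights, means, variances):
    """b_j(o_t) for a diagonal-covariance GMM: weighted sum over M components."""
    norm = np.sqrt(2 * np.pi * variances)                                  # (M, D)
    comps = np.prod(np.exp(-0.5 * (o_t - means) ** 2 / variances) / norm, axis=1)
    return np.dot(weights, comps)                                          # sum_k c_k * comp_k

# Hypothetical 2-component mixture over 2-D features
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [2.0, 1.0]])
variances = np.array([[1.0, 1.0], [0.5, 0.5]])
print(gmm_likelihood(np.array([0.3, 0.2]), weights, means, variances))
```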

• Slide 32

    GMMs

Summary: each state has a likelihood function parameterized by:
  M mixture weights
  M mean vectors of dimensionality D
  Either: M covariance matrices of size D x D
  Or, more likely: M diagonal covariance matrices of size D x D, which is equivalent to M variance vectors of dimensionality D

• Slide 33

    Training a GMM

Problem: how do we train a GMM if we don't know what component is accounting for aspects of any particular observation?
Intuition: we use Baum-Welch to find it for us, just as we did for finding the hidden states that accounted for the observation.

• Slide 34


Baum-Welch for Mixture Models

By analogy with γt(i) earlier, let's define the probability of being in state j at time t with the m-th mixture component accounting for ot:

\xi_{tm}(j) = \frac{\sum_{i=1}^{N}\alpha_{t-1}(i)\, a_{ij}\, c_{jm}\, b_{jm}(o_t)\,\beta_t(j)}{\alpha_F(T)}

Now:

\bar{c}_{jm} = \frac{\sum_{t=1}^{T}\xi_{tm}(j)}{\sum_{t=1}^{T}\sum_{k=1}^{M}\xi_{tk}(j)} \qquad
\bar{\mu}_{jm} = \frac{\sum_{t=1}^{T}\xi_{tm}(j)\, o_t}{\sum_{t=1}^{T}\sum_{k=1}^{M}\xi_{tk}(j)} \qquad
\bar{\Sigma}_{jm} = \frac{\sum_{t=1}^{T}\xi_{tm}(j)\,(o_t-\mu_{jm})(o_t-\mu_{jm})^T}{\sum_{t=1}^{T}\sum_{k=1}^{M}\xi_{tk}(j)}

• Slide 35


    How to train mixtures?

Choose M (often 16; or M can be tuned depending on the amount of training observations)
Then we can use various splitting or clustering algorithms
One simple method for splitting (a toy sketch follows below):
  1) Compute the global mean μ and global variance
  2) Split into two Gaussians, with means μ ± ε (sometimes ε is 0.2σ)
  3) Run Forward-Backward to retrain
  4) Go to 2 until we have 16 mixtures
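A toy numpy sketch of the splitting step, doubling the number of components by perturbing each mean by ±0.2σ (the data and exact constants here are illustrative assumptions):

```python
import numpy as np

def split_gaussians(means, variances, weights, epsilon=0.2):
    """Double the number of mixture components: perturb each mean by
    +/- epsilon * sigma, halve its weight, and copy its variance."""
    sigma = np.sqrt(variances)
    new_means = np.concatenate([means + epsilon * sigma, means - epsilon * sigma])
    new_vars = np.concatenate([variances, variances])
    new_weights = np.concatenate([weights / 2, weights / 2])
    return new_means, new_vars, new_weights

# Start from the global mean/variance of some (hypothetical) training data,
# then split repeatedly: 1 -> 2 -> 4 -> 8 -> 16 components
data = np.random.randn(1000) * 1.5 + 0.3
means, variances, weights = np.array([data.mean()]), np.array([data.var()]), np.array([1.0])
for _ in range(4):
    means, variances, weights = split_gaussians(means, variances, weights)
    # ... in a real system, re-run Forward-Backward here before the next split ...
print(len(means))  # 16
```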

• Slide 36


    Embedded Training

Components of a speech recognizer:
  Feature extraction: not statistical
  Language model: word transition probabilities, trained on some other corpus
  Acoustic model:
    Pronunciation lexicon: the HMM structure for each word, built by hand
    Observation likelihoods bj(ot)
    Transition probabilities aij

• Slide 37


Embedded training of acoustic model

If we had hand-segmented and hand-labeled training data, with word and phone boundaries, we could just compute:
  B: the means and variances of all our triphone Gaussians
  A: the transition probabilities
And we'd be done!
But we don't have word and phone boundaries, nor phone labeling.

• Slide 38


    Embedded training

Instead:
  We'll train each phone HMM embedded in an entire sentence
  We'll do word/phone segmentation and alignment automatically as part of the training process

• Slide 39


    Embedded Training

• Slide 40


    Initialization: Flat start

Transition probabilities:
  Set to zero any that you want to be structurally zero
    The probability computation includes the previous value of aij, so if it's zero it will never change
  Set the rest to identical values
Likelihoods:
  Initialize the μ and σ² of each state to the global mean and variance of all the training data

• Slide 41


    Embedded Training

Given: a phoneset, a pronunciation lexicon, and transcribed wavefiles
Build a whole-sentence HMM for each sentence (a toy sketch follows below)
Initialize the A probabilities to 0.5, or to zero
Initialize the B probabilities to the global mean and variance
Run multiple iterations of Baum-Welch:
  During each iteration, compute the forward and backward probabilities
  Use them to re-estimate A and B
Run Baum-Welch until convergence
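As an illustration of the whole-sentence HMM step, here is a toy sketch that strings together per-phone subphone states using a hypothetical mini pronunciation lexicon; real systems also add silence models and context-dependent states:

```python
# Hypothetical mini pronunciation lexicon: word -> phone sequence
LEXICON = {
    "the": ["dh", "ah"],
    "cat": ["k", "ae", "t"],
}

def sentence_hmm_states(transcript, states_per_phone=3):
    """Build the linear state sequence of a whole-sentence HMM by
    concatenating subphone states for every phone of every word."""
    states = []
    for word in transcript.split():
        for phone in LEXICON[word]:
            for sub in range(states_per_phone):
                states.append(f"{phone}_{sub}")  # e.g. 'dh_0', 'dh_1', 'dh_2'
    return states

print(sentence_hmm_states("the cat"))
# ['dh_0', 'dh_1', 'dh_2', 'ah_0', ..., 't_2']
```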

• Slide 42


    Viterbi training

Baum-Welch training says:
  We need to know what state we were in, to accumulate counts of a given output symbol ot
  We'll compute γi(t), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output ot
Viterbi training says:
  Instead of summing over all possible paths, just take the single most likely path
  Use the Viterbi algorithm to compute this Viterbi path
  Via forced alignment

• Slide 43


    Forced Alignment

Computing the Viterbi path over the training data is called forced alignment
Because we know which word string to assign to each observation sequence
We just don't know the state sequence
So we use aij to constrain the path to go through the correct words
And otherwise do normal Viterbi
Result: a state sequence!

• Slide 44


    Viterbi training equations

Viterbi (count-based) versions of the Baum-Welch (expected-count) estimates:

\hat{b}_j(v_k) = \frac{n_j \text{ s.t. } o_t = v_k}{n_j}

\hat{a}_{ij} = \frac{n_{ij}}{n_i} \quad \text{for all pairs of emitting states}

• Slide 45

• Slide 46


    Viterbi training (II)

Equations for non-mixture Gaussians:

\mu_i = \frac{1}{N_i}\sum_{t=1}^{T} o_t \quad \text{s.t. } q_t = i

\sigma_i^2 = \frac{1}{N_i}\sum_{t=1}^{T}(o_t-\mu_i)^2 \quad \text{s.t. } q_t = i

Viterbi training for mixture Gaussians is more complex; generally we just assign each observation to one mixture component.
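Alongside the Gaussian updates, the transition probabilities can be re-estimated by counting along the single Viterbi path, aij = nij / ni. A small numpy sketch, assuming a forced alignment has already produced a state label for every frame (the toy path below is made up):

```python
import numpy as np

def viterbi_transition_estimates(state_path, num_states):
    """Estimate a_ij = n_ij / n_i from a single Viterbi (forced-alignment) path."""
    counts = np.zeros((num_states, num_states))
    for i, j in zip(state_path[:-1], state_path[1:]):
        counts[i, j] += 1                       # n_ij: transitions from i to j
    totals = counts.sum(axis=1, keepdims=True)  # n_i: transitions out of i
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Toy forced-alignment path through a 3-state left-to-right HMM
path = [0, 0, 0, 1, 1, 2, 2, 2]
print(viterbi_transition_estimates(path, num_states=3))
```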

• Slide 47


    Log domain

In practice, do all computation in the log domain
  Avoids underflow
  Instead of multiplying lots of very small probabilities, we add numbers that are not so small
For a single multivariate Gaussian (diagonal Σ) we compute:

b_j(o_t) = \frac{1}{\sqrt{(2\pi)^{D}\prod_{d=1}^{D}\sigma_{jd}^2}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right)

In log space:

\log b_j(o_t) = -\frac{1}{2}\sum_{d=1}^{D}\left[\log(2\pi\sigma_{jd}^2) + \frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right]
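A numpy sketch of the log-domain computation (same placeholder parameters as the earlier diagonal-covariance example):

```python
import numpy as np

def diag_gaussian_loglik(o_t, mu_j, var_j):
    """log b_j(o_t) for a diagonal-covariance Gaussian, computed as a sum."""
    return -0.5 * np.sum(np.log(2 * np.pi * var_j) + (o_t - mu_j) ** 2 / var_j)

mu_j = np.array([0.0, 1.0, -0.5])
var_j = np.array([1.0, 0.5, 2.0])
print(diag_gaussian_loglik(np.array([0.1, 0.8, -0.3]), mu_j, var_j))
```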

• Slide 48


Log domain

Repeating:

\log b_j(o_t) = -\frac{1}{2}\sum_{d=1}^{D}\left[\log(2\pi\sigma_{jd}^2) + \frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right]

With some rearrangement of terms:

\log b_j(o_t) = C_j - \frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}

Where:

C_j = -\frac{1}{2}\sum_{d=1}^{D}\log(2\pi\sigma_{jd}^2)

Note that this looks like a weighted Mahalanobis distance!
This also may justify why these aren't really probabilities (point estimates); they are really just distances.

• Slide 49

    Evaluation

How do we evaluate the word string output by a speech recognizer?

• Slide 50

    Word Error Rate

Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)

Alignment example:

    REF: portable **** PHONE UPSTAIRS last night so

    HYP: portable FORM OF STORES last night so

    Eval I S S

    WER = 100 (1+2+0)/6 = 50%
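A compact sketch of WER computed via dynamic-programming edit distance over words (the example strings match the alignment above):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance (S + D + I) over reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # substitution or match
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("portable phone upstairs last night so",
          "portable form of stores last night so"))  # 50.0
```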

• Slide 51

NIST sctk-1.3 scoring software: Computing WER with sclite

http://www.nist.gov/speech/tools/
Sclite aligns a hypothesized text (HYP) (from the recognizer) with a correct or reference text (REF) (human transcribed)

id: (2347-b-013)
Scores: (#C #S #D #I) 9 3 1 2
REF:  was an engineer SO I   i was always with **** **** MEN  UM   and they
HYP:  was an engineer ** AND i was always with THEM THEY ALL  THAT and they
Eval:                 D  S                     I    I    S    S

• Slide 52

Sclite output for error analysis

CONFUSION PAIRS  Total (972)
With >= 1 occurances (972)

 1:  6 -> (%hesitation) ==> on
 2:  6 -> the ==> that
 3:  5 -> but ==> that
 4:  4 -> a ==> the
 5:  4 -> four ==> for
 6:  4 -> in ==> and
 7:  4 -> there ==> that
 8:  3 -> (%hesitation) ==> and
 9:  3 -> (%hesitation) ==> the
10:  3 -> (a-) ==> i
11:  3 -> and ==> i
12:  3 -> and ==> in
13:  3 -> are ==> there
14:  3 -> as ==> is
15:  3 -> have ==> that
16:  3 -> is ==> this

• Slide 53

Sclite output for error analysis

17:  3 -> it ==> that
18:  3 -> mouse ==> most
19:  3 -> was ==> is
20:  3 -> was ==> this
21:  3 -> you ==> we
22:  2 -> (%hesitation) ==> it
23:  2 -> (%hesitation) ==> that
24:  2 -> (%hesitation) ==> to
25:  2 -> (%hesitation) ==> yeah
26:  2 -> a ==> all
27:  2 -> a ==> know
28:  2 -> a ==> you
29:  2 -> along ==> well
30:  2 -> and ==> it
31:  2 -> and ==> we
32:  2 -> and ==> you
33:  2 -> are ==> i
34:  2 -> are ==> were

• Slide 54

    Better metrics than WER?

WER has been useful
But should we be more concerned with meaning (semantic error rate)?
  A good idea, but hard to agree on
  Has been applied in dialogue systems, where the desired semantic output is more clear

• Slide 55

    Summary: ASR Architecture

Five easy pieces: the ASR Noisy Channel architecture
1) Feature Extraction: 39 MFCC features
2) Acoustic Model: Gaussians for computing p(o|q)
3) Lexicon/Pronunciation Model: HMM: what phones can follow each other
4) Language Model: N-grams for computing p(wi|wi-1)
5) Decoder: Viterbi algorithm: dynamic programming for combining all of these to get a word sequence from speech!


• Slide 56

ASR Lexicon: Markov Models for pronunciation

• Slide 57

    Pronunciation Modeling

    Generating surface forms:

    Eric Fosler-Lussier slide


• Slide 58

Dynamic Pronunciation Modeling

    Slide from Eric Fosler-Lussier

• Slide 59

    Summary

Speech Recognition Architectural Overview
Hidden Markov Models in general
  Forward
  Viterbi Decoding
Hidden Markov Models for Speech
Evaluation