Building an ASR using HTK CS4706
description
Transcript of Building an ASR using HTK CS4706
![Page 1: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/1.jpg)
Building an ASR using HTK CS4706
Fadi BiadsyApril 21st, 2008
1
![Page 2: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/2.jpg)
Outline
• Speech Recognition• Feature Extraction• HMM– 3 basic problems
• HTK– Steps to Build a speech recognizer
2
![Page 3: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/3.jpg)
Speech Recognition
• Speech Signal to Linguistic Units
3
There’s something happening when Americans… ASR
![Page 4: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/4.jpg)
It’s hard to recognize speech• Contextual effects
– Speech sounds vary with context• E.g., “How do you do?”
• Within-speaker variability– Speaking Style
• Pitch, intensity, speaking rate – Voice Quality
• Between-speaker variability– Accents, Dialects, native vs. non-native
• Environment variability– Background noise– Microphone
4
![Page 5: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/5.jpg)
Feature Extraction
• Wave form?• Spectrogram?• We need a stable representation for different
examples of the same speech sound
5
![Page 6: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/6.jpg)
Feature Extraction
• Extract features from short frames (frame period 10ms, 25ms frame size) – a sequence of features
6
![Page 7: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/7.jpg)
Feature Extraction - MFCC
• Mel Scale– approximate the unequal sensitivity of human
hearing at different frequencies
7
![Page 8: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/8.jpg)
Feature Extraction - MFCC
• MFCC (Mel frequency cepstral coefficient) – Widely used in speech recognition
1.Take the Fourier transform of the signal2.Map the log amplitudes of the spectrum to the mel
scale3.Discrete cosine transform of the mel log-amplitudes4.The MFCCs are the amplitudes of the resulting
spectrum
8
![Page 9: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/9.jpg)
Feature Extraction - MFCC
• Extract a feature vector from each frame• 12 MFCCs (Mel frequency cepstral coefficient) + 1
normalized energy = 13 features• Delta MFCC = 13• Delta-Delta MCC = 13
– Total: 39 features– Inverted MFCCs: 39 Feature vector
9
![Page 10: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/10.jpg)
Markov Chain
• Weighted Finite State Acceptor: Future is independent of the past given the present
10
![Page 11: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/11.jpg)
Hidden Markov Model (HMM)
• HMM is a Markov chain + emission probability function for each state.
• Markov Chain
• HMM M=(A, B, Pi) • A = Transition Matrix• B = Observation Distributions• Pi = Initial state probabilities
11
![Page 12: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/12.jpg)
HMM Example
/d/ /n//aa/ /aa/
12
![Page 13: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/13.jpg)
HMM – 3 basic problems (1) – Evaluation
• 1. Given the observation sequence O and a model M, how do we efficiently compute: P(O | M) = the probability of the observation sequence, given the model?
argmax i (P(O | Θi)
13
![Page 14: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/14.jpg)
HMM – 3 basic problems (2) - Decoding
2. Given the observation sequence O and the model M, how do we choose a corresponding state sequence Q = q1 q2 . . . qt which best “explains” the observation O?
Q* = argmax Q (P(O | Q, M)) = argmaxQ(P(Q|O,M)P(Q|M))
14
![Page 15: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/15.jpg)
Viterbi algorithm
• Is an efficient algorithm for Decoding O(TN^2)
/d/ /n//aa/ /aa/
/uw/
End
Start
/d/ /aa/ /n/ /aa/ => dana15
![Page 16: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/16.jpg)
HMM – 3 basic problems (3) – Training• How do we adjust the model parameters M=
(A, B, Pi) to maximize P(O | M)?
/d/ /n//aa/ /aa/
Estimate
dana => /d/ /aa/ /n/ /aa/ 1) Transition Matrix: A2) Emission probability distribution:
16
![Page 17: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/17.jpg)
HMM – 3 basic problems (3) – Training•
17
![Page 18: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/18.jpg)
HTK
• HTK is a toolkit for building Hidden Markov Models (HMMs)
• HTK is primarily designed for building HMM-based speech processing tools (e.g., extracting MFCC features)
18
![Page 19: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/19.jpg)
Steps for building ASR voice-operated interface for phone dialing• Examples: – Dial three three two six five four– Phone Woodland– Call Steve Young
• Grammar:– $digit = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT
| NINE | OH | ZERO;– $name = [ JOOP ] JANSEN | [ JULIAN ] ODELL | [ DAVE ] OLLASON |
[ PHIL ] WOODLAND | [ STEVE ] YOUNG;– ( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name) SENT-END )
19
![Page 20: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/20.jpg)
20
![Page 21: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/21.jpg)
S0001 ONE VALIDATED ACTS OF SCHOOL DISTRICTS S0002 TWO OTHER CASES ALSO WERE UNDER ADVISEMENTS0003 BOTH FIGURES WOULD GO HIGHER IN LATER YEARSS0004 THIS IS NOT A PROGRAM OF SOCIALIZED MEDICINEetc
A ah spA ax spA ey spCALL k ao l spDIAL d ay ax l spEIGHT ey t spPHONE f ow n sp…
21
![Page 22: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/22.jpg)
HTK scripting language is used to generate Phonetic transcription for all training data
22
![Page 23: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/23.jpg)
Extracting MFCC
• For each wave file, extract MFCC features.
23
![Page 24: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/24.jpg)
Creating Monophone HMMs
• Create Monophone HMM Topology• 5 states: 3 emitting states
• Flat Start: Mean and Variance are initialized as the global mean and variance of all the data
S1 S3S2 S4 S5
24
![Page 25: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/25.jpg)
Training
• For each training pair of files (mfc+lab): 1. concatenate the corresponding
monophone HMMS: 2. Use the Beam-Welch Algorithm to train the HMMS given the MFC
features.
S1
S3
S2
S4
S5
S1
S3
S2
S4
S5
S1
S3
S2
S4
S5
/ah//w/ /n/
25
ONE VALIDATED ACTS OF SCHOOL DISTRICTS
![Page 26: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/26.jpg)
Training
• So far, we have all monophones models trained
• Train the sp model
26
![Page 27: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/27.jpg)
Forced alignment• The dictionary contains multiple
pronunciations for some words. • Realignment the training data
/d/ /t//ey/ /ax/
/ae/ /dx/
Run Viterbi to get the best pronunciation that matches the acoustics
27
![Page 28: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/28.jpg)
Retrain
• After getting the best pronunciation– => Train again using Beam-Welch algorithm using
the “correct” pronunciation.
28
![Page 29: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/29.jpg)
Creating Triphone models• Context dependent HMMs • Make Tri-phones from monophones
– Generate a list of all the triphones for which there is at least one example in the training data
– jh-oy+s– oy-s– ax+z– f-iy+t– iy-t– s+l– s-l+ow
29
![Page 30: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/30.jpg)
Creating Tied-Triphone models
Data insufficiency => Tie states
S1
S3
S2
S4
S5
S1
S3
S2
S4
S5
S1
S3
S2
S4
S5
/aa//t/ /b/
S1
S3
S2
S4
S5
S1
S3
S2
S4
S5
S1
S3
S2
S4
S5
/b//aa/ /l/
30
![Page 31: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/31.jpg)
Tie Triphone
• Data Driven Clustering: Using similarity metric• Clustering using Decision Tree. All states in the
same leafe will be tied t+ih t+aet+iy t+aeao-r+ax rt+oh t+aeao-r+iyt+uh t+aet+uw t+aesh-n+tsh-n+z sh-n+tch-ih+lay-oh+lay-oh+r ay-oh+l
L = Class-Stop ?
n yR = Nasal ? L = Nasal ?
n ny y
R = Glide?
n y
31
![Page 32: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/32.jpg)
After Tying
• Train the Acoustic models again using Beam-Welch algorithm
32
![Page 33: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/33.jpg)
Decoding
• Using the grammar network for the phones– Generate the triphone HMM grammar network
WNET – Given a new Speech file, extract the mfcc features– Run Viterbi on the WNET given the mfcc features
to get the best word sequence.
33
![Page 34: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/34.jpg)
Summary
• MFCC Features• HMM 3 basic problems• HTK
34
![Page 35: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/35.jpg)
Thanks!
35
![Page 36: Building an ASR using HTK CS4706](https://reader035.fdocuments.us/reader035/viewer/2022062816/56814b39550346895db84213/html5/thumbnails/36.jpg)
HMM - Problem 1
36