Automatic Speech Recognition Introduction. The Human Dialogue System.
Automatic Speech Recognition: Introduction

The Human Dialogue System
Computer Dialogue Systems

Input side (signal → words → logical form):
Audition → Automatic Speech Recognition → Natural Language Understanding → Dialogue Management

Output side (logical form → words → signal):
Dialogue Management → Planning → Natural Language Generation → Text-to-speech
Parameters of ASR Capabilities
• Different types of tasks with different difficulties:
– Speaking mode (isolated words / continuous speech)
– Speaking style (read / spontaneous)
– Enrollment (speaker-independent / speaker-dependent)
– Vocabulary (small < 20 words / large > 20k words)
– Language model (finite state / context sensitive)
– Signal-to-noise ratio (high > 30 dB / low < 10 dB)
– Transducer (high-quality microphone / telephone)
The Noisy Channel Model (Shannon)
Message → [noisy channel] → Signal

Decoding model: find Message* = argmax P(Message | Signal)

But how do we represent each of these things?
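The decoding rule above can be sketched in a few lines. This is a toy illustration, not an ASR decoder: the "signal" is a garbled spelling, the channel likelihood and the unigram prior are both hypothetical stand-ins.

```python
# Toy noisy-channel decoder: pick the message maximizing P(signal|msg) * P(msg).

def likelihood(signal, message):
    """Hypothetical channel model P(signal | message): fraction of matching letters."""
    matches = sum(a == b for a, b in zip(signal, message))
    return matches / max(len(signal), len(message))

def prior(message):
    """Hypothetical unigram prior P(message)."""
    return {"hello": 0.6, "hollow": 0.3, "help": 0.1}[message]

def decode(signal, messages):
    """Noisy-channel decoding: argmax over messages of P(signal|M) * P(M)."""
    return max(messages, key=lambda m: likelihood(signal, m) * prior(m))

print(decode("hellp", ["hello", "hollow", "help"]))  # hello
```

The same argmax structure reappears later in the notes as argmax_W P(X|W) P(W), with an HMM supplying the likelihood term and an n-gram model the prior.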
What are the basic units for acoustic information?
When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable.
Words are good units for small-vocabulary SR – but not a good choice for large-vocabulary & continuous SR:
• Each word is treated individually, which implies a large amount of training data and storage.
• The recognition vocabulary may consist of words which have never been given in the training data.
• Expensive to model interword coarticulation effects.
Why phones are better units than words: an example
"SAY BITE AGAIN" spoken so that the phonemes are separated in time
(Figure: recorded sound and spectrogram)
"SAY BITE AGAIN" spoken normally
And why phones are still not the perfect choice
Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent).
However, each word is not a sequence of independent phonemes!
Our articulators move continuously from one position to another.
The realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects etc.
Different realizations of a phoneme are called allophones.
Example: different spectrograms for “eh”
Triphone model

Each triphone captures facts about the preceding and following phone:
• Monophone: p, t, k
• Triphone: iy-p+aa
• a-b+c means "phone b, preceded by phone a, followed by phone c"

In practice, systems use on the order of 100,000 3phones, and the 3phone model is the one currently used (e.g. in Sphinx).
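The a-b+c expansion above can be sketched directly. A minimal sketch, assuming a hypothetical `sil` symbol as the boundary context at the start and end of the sequence:

```python
def to_triphones(phones):
    """Expand a phone sequence into context-dependent triphones a-b+c."""
    padded = ["sil"] + list(phones) + ["sil"]  # assumed silence context at boundaries
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["iy", "p", "aa"]))
# ['sil-iy+p', 'iy-p+aa', 'p-aa+sil']
```

This also shows why the inventory explodes from ~50 phones to ~100,000 triphones: every phone is multiplied by its possible left and right contexts.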
Parts of an ASR System
Feature Calculation → Acoustic Modeling → Pronunciation Modeling → Language Modeling

• Feature calculation: produces acoustic vectors (x_t)
• Acoustic modeling: maps acoustics to 3phones (e.g. k, @)
• Pronunciation modeling: maps 3phones to words
  (cat: k@t, dog: dog, mail: mAl, the: D&, DE, …)
• Language modeling: strings words together
  (cat dog: 0.00002, cat the: 0.0000005, the cat: 0.029, the dog: 0.031, the mail: 0.054, …)
Feature calculation
(Figure: spectrogram, frequency vs. time)
Find energy at each time step in each frequency channel.

Feature calculation

(Figure: spectrogram, frequency vs. time)
Take the Inverse Discrete Fourier Transform to decorrelate frequencies.
Feature calculation
Input: speech signal
Output: acoustic observation vectors, e.g.
(-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …)
(0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …)
(-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …)
(0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …)
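The two steps above (energy per frequency channel, then an inverse transform to decorrelate) can be sketched for a single frame. This is a bare-bones illustration, not a real MFCC front end: it uses a direct O(n²) DFT, a plain cosine transform instead of a mel filterbank, and a toy sinusoid as input.

```python
import cmath, math

def log_spectrum(frame):
    """Magnitude DFT of one frame, then log: energy per frequency channel."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(math.log(abs(s) + 1e-10))  # floor avoids log(0)
    return spec

def cepstrum(log_spec, n_coeffs=13):
    """Cosine transform of the log spectrum: decorrelates the frequency channels."""
    n = len(log_spec)
    return [sum(log_spec[k] * math.cos(math.pi * q * (k + 0.5) / n)
                for k in range(n)) / n
            for q in range(n_coeffs)]

# toy frame: 32 samples of a sinusoid
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
feats = cepstrum(log_spectrum(frame))
print(len(feats))  # 13
```

Real systems (e.g. the Sphinx front end discussed later) do essentially this per 10 ms frame, with windowing and mel-scale filter weighting added, producing one 13-dimensional vector per frame.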
Robust Speech Recognition
• Different schemes have been developed for dealing with noise and reverberation:
– Additive noise: reduce the effects of particular frequencies
– Convolutional noise: remove the effects of linear filters (cepstral mean subtraction)

Cepstrum: Fourier transform of the LOGARITHM of the spectrum.
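Cepstral mean subtraction is simple enough to sketch in full: a stationary linear channel multiplies the spectrum, so in the log/cepstral domain it becomes a constant additive offset, and subtracting the per-utterance mean cancels it. The frames and offset below are made up for illustration.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension time average from a list of cepstral frames.

    A stationary linear filter adds a constant offset in the cepstral domain,
    so removing the mean over the utterance cancels convolutional noise.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

# hypothetical 2-D cepstral frames, all shifted by a constant channel offset 0.7
frames = [[0.1 + 0.7, 0.2 + 0.7], [0.3 + 0.7, 0.0 + 0.7]]
normalized = cepstral_mean_subtraction(frames)
# each dimension of the output now averages to zero: the offset is gone
```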
How do we map from vectors to word sequences?
(acoustic observation vectors) → "That you" … ???
HMM (again)!
(acoustic observation vectors) → "That you" … pattern recognition with HMMs
ASR using HMMs
• Try to solve P(Message|Signal) by breaking the problem up into separate components
• Most common method: Hidden Markov Models
– Assume that a message is composed of words
– Assume that words are composed of sub-word parts (3phones)
– Assume that 3phones have some sort of acoustic realization
– Use probabilistic models for matching acoustics to phones to words
Creating HMMs for word sequences: context-independent units

"Need" 3phone model

Hierarchical system of HMMs:
• HMMs of triphones
• Higher-level HMM of a word
• Language model

To simplify, let's now ignore the lower-level HMM: each phone node has a "hidden" HMM (H2MM).
HMMs for ASR

"go home" → phones: g o h o m
Acoustic observations: x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

Markov model backbone composed of sequences of 3phones (hidden because we don't know the correspondences). Each line represents a probability estimate (more later).

Example alignment: g o o o o o o h m m
HMMs for ASR

"go home" → phones: g o h o m
Acoustic observations: x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

Markov model backbone composed of phones (hidden because we don't know the correspondences).

Even with the same word hypothesis, we can have different alignments (red arrows). Also, we have to search over all word hypotheses.
For every HMM (in the hierarchy): compute the max-probability sequence

(Figure: word lattice "that" (th a t) followed by "he" (h iy), "you" (y uw), or "should" (sh uh d), with bigram probabilities p(he|that), p(you|that))

X = acoustic observations, (3)phones, phone sequences
W = (3)phones, phone sequences, word sequences

COMPUTE:
argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X) = argmax_W P(X|W) P(W)
Search

• When trying to find W* = argmax_W P(W|X), we need to look at (in theory):
– All possible (3phone, word, etc.) sequences
– All possible segmentations/alignments of W and X
• Generally, this is done by searching the space of W
– Viterbi search: dynamic programming approach that looks for the most likely path
– A* search: alternative method that keeps a stack of hypotheses around
• If |W| is large, pruning becomes important
• Need also to estimate transition probabilities
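The Viterbi search mentioned above can be sketched concretely. This is a minimal textbook implementation over a made-up two-state model (two "phones" emitting three observation symbols); real decoders work in log probabilities and prune aggressively.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Dynamic programming search for the most likely hidden state path."""
    # V[t][s] = (best probability of ending in state s at time t, best path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, path = max(
                (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                 V[t - 1][prev][1] + [s])
                for prev in states)
            V[t][s] = (prob, path)
    return max(V[-1].values())  # (probability, state sequence)

# hypothetical toy model: two phone states and three observation symbols
states = ["g", "o"]
start_p = {"g": 0.9, "o": 0.1}
trans_p = {"g": {"g": 0.6, "o": 0.4}, "o": {"g": 0.1, "o": 0.9}}
emit_p = {"g": {"x1": 0.7, "x2": 0.2, "x3": 0.1},
          "o": {"x1": 0.1, "x2": 0.4, "x3": 0.5}}
prob, path = viterbi(["x1", "x2", "x3"], states, start_p, trans_p, emit_p)
print(path)  # ['g', 'o', 'o']
```

Because the best path to each state at time t only depends on the best paths at time t-1, the search is linear in the number of observations rather than exponential in all possible alignments.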
Training: speech corpora
• Have a speech corpus at hand
– Should have word (and preferably phone) transcriptions
– Divide into training, development, and test sets
• Develop models of prior knowledge
– Pronunciation dictionary
– Grammar, lexical trees
• Train acoustic models
– Possibly realigning the corpus phonetically
Acoustic Model
(Figure: acoustic observation vectors labeled with phones dh a a t)

• Assume that you can label each vector with a phonetic label
• Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g. neural networks)

N_a(·) models P(X | state = a)
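The "collect examples, fit a Gaussian" step can be sketched in one dimension. The feature values and the phone label are made up; real acoustic models fit multivariate Gaussian mixtures over 13+ dimensional cepstral vectors.

```python
import math

def fit_gaussian(samples):
    """Fit a 1-D Gaussian (mean, variance) to all examples of one phone."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

def gaussian_pdf(x, mean, var):
    """Likelihood P(x | state): density of the fitted Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# hypothetical 1-D "acoustic features" all labeled with phone "a"
examples_a = [1.0, 1.2, 0.8, 1.1, 0.9]
mean, var = fit_gaussian(examples_a)
# a typical value scores higher than an outlier under this phone's model
print(gaussian_pdf(1.0, mean, var) > gaussian_pdf(3.0, mean, var))  # True
```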
Pronunciation model
• The pronunciation model gives connections between phones and words
• Multiple pronunciations, e.g. "tomato"

(Figure: pronunciation network with alternative arcs, e.g. ey vs. aa and ah vs. ow, each branch weighted by a probability such as p_dh, p_a, p_t and its complement 1 − p)
Training models for a sound unit
Language Model
• Language model gives connections between words (e.g., bigrams: probability of two word sequences)
(Figure: word lattice "that" (dh a t) followed by "he" (h iy) or "you" (y uw), with bigram probabilities p(he|that), p(you|that))
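Estimating those bigram probabilities from counts is straightforward to sketch. The three-sentence corpus below is made up; `<s>` and `</s>` are the usual sentence boundary markers.

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Estimate bigram probabilities P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])               # every word used as a left context
        bigrams.update(zip(words[:-1], words[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

corpus = ["the cat sat", "the dog sat", "the cat ran"]
lm = train_bigram_lm(corpus)
print(lm[("the", "cat")])  # 2/3: "cat" follows "the" in 2 of 3 cases
```

Real systems smooth these estimates (and back off to unigrams) so that unseen bigrams do not get zero probability; the ARPA-format file shown later stores exactly such smoothed log probabilities with backoff weights.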
Lexical trees

START     S T AA R TD
STARTING  S T AA R DX IX NG
STARTED   S T AA R DX IX DD
STARTUP   S T AA R T AX PD
START-UP  S T AA R T AX PD

(Figure: prefix tree sharing S T AA R, then branching to TD → start, DX IX NG → starting, DX IX DD → started, T AX PD → startup / start-up)
Judging the quality of a system
• Usually, ASR performance is judged by the word error rate:
ErrorRate = 100 * (Subs + Ins + Dels) / Nwords

REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D C    S   C  C    I

100 * (1S + 1I + 1D) / 5 = 60%
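The substitution/insertion/deletion counts come from an edit-distance alignment between reference and recognized word sequences. A minimal sketch of WER via standard dynamic programming, reproducing the 60% example above:

```python
def word_error_rate(ref, hyp):
    """WER = 100 * (subs + ins + dels) / Nwords, via word-level edit distance."""
    ref, hyp = ref.split(), hyp.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return 100.0 * dp[-1][-1] / len(ref)

print(word_error_rate("I WANT TO GO HOME", "WANT TWO GO HOME NOW"))  # 60.0
```

Note that WER can exceed 100%, since insertions are counted against a fixed reference length.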
Judging the quality of a system
• Usually, ASR performance is judged by the word error rate
• This assumes that all errors are equal
– Also, a bit of a mismatch between the optimization criterion and the error measurement
• Other (task-specific) measures are sometimes used
– Task completion
– Concept error rate
Sphinx4
http://cmusphinx.sourceforge.net

• Feature extractor: Mel-Frequency Cepstral Coefficients (MFCCs) → feature vectors
• Acoustic observations
• Hidden states
• Acoustic observation likelihoods

"Six"
• Constructs the search graph of HMMs from:
– Acoustic model
– Statistical language model ~or~ grammar
– Dictionary

• Constructs the HMMs for units of speech
• Produces observation likelihoods
• Sampling rate is critical! WSJ vs. WSJ_8k
• TIDIGITS, RM1, AN4, HUB4
• Word likelihoods
• ARPA format example:

\1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174
\2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915
\3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
public <basicCmd> = <startPolite> <command> <endPolite>;
public <startPolite> = (please | kindly | could you ) *;
public <endPolite> = [ please | thanks | thank you ];
<command> = <action> <object>;
<action> = (open | close | delete | move);
<object> = [the | a] (window | file | menu);
• Maps words to phoneme sequences
• Example from cmudict.06d:
POULTICE P OW L T AH S
POULTICES P OW L T AH S IH Z
POULTON P AW L T AH N
POULTRY P OW L T R IY
POUNCE P AW N S
POUNCED P AW N S T
POUNCEY P AW N S IY
POUNCING P AW N S IH NG
POUNCY P UW NG K IY
• Can be statically or dynamically constructed
• Maps feature vectors to search graph
• Searches the graph for the “best fit”
• P(sequence of feature vectors | word/phone)
• a.k.a. P(O|W): "how likely is the input to have been generated by the word?"
Possible alignments of the phones f, ay, v to frames over time (O1 O2 O3 …):
f ay ay ay ay v v v v v
f f ay ay ay ay v v v v
f f f ay ay ay ay v v v
f f f f ay ay ay ay v v
f f f f ay ay ay ay ay v
f f f f f ay ay ay ay v
f f f f f f ay ay ay v
…
• Uses algorithms to weed out low scoring paths during decoding
• Words!
• Most common metric
• Measure the # of modifications to transform the recognized sentence into the reference sentence

• Reference: "This is a reference sentence."
• Result: "This is neuroscience."
• Requires 2 deletions, 1 substitution: D S D
Installation details
• http://cmusphinx.sourceforge.net/wiki/sphinx4:howtobuildand_run_sphinx4
• Student report on NLP course web site