Automatic Speech Recognition Introduction. The Human Dialogue System.


Transcript of Automatic Speech Recognition Introduction. The Human Dialogue System.

Page 1: Automatic Speech Recognition Introduction. The Human Dialogue System.

Automatic Speech Recognition: Introduction

Page 2: Automatic Speech Recognition Introduction. The Human Dialogue System.

The Human Dialogue System

Page 3: Automatic Speech Recognition Introduction. The Human Dialogue System.

The Human Dialogue System

Page 4: Automatic Speech Recognition Introduction. The Human Dialogue System.

Computer Dialogue Systems

[Figure: the dialogue-system pipeline]

signal → Audition / Automatic Speech Recognition → words
→ Natural Language Understanding → logical form
→ Dialogue Management / Planning → logical form
→ Natural Language Generation → words
→ Text-to-speech → signal

Page 5: Automatic Speech Recognition Introduction. The Human Dialogue System.

Computer Dialogue Systems

[Figure: the same pipeline, abbreviated]

signal → Audition / ASR → words → NLU → logical form
→ Dialogue Mgmt. / Planning → logical form
→ NLG → words → Text-to-speech → signal

Page 6: Automatic Speech Recognition Introduction. The Human Dialogue System.

Parameters of ASR Capabilities

• Different types of tasks with different difficulties:

– Speaking mode (isolated words / continuous speech)

– Speaking style (read / spontaneous)

– Enrollment (speaker-independent / speaker-dependent)

– Vocabulary (small < 20 words / large > 20,000 words)

– Language model (finite state / context sensitive)

– Signal-to-noise ratio (high > 30 dB / low < 10 dB)

– Transducer (high-quality microphone / telephone)

Page 7: Automatic Speech Recognition Introduction. The Human Dialogue System.

The Noisy Channel Model (Shannon)

[Figure] message → noisy channel → signal (Message + Channel = Signal)

Decoding model: find Message* = argmax P(Message | Signal). But how do we represent each of these things?
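The decoding rule can be sketched in code: score every candidate message by P(Signal | Message) · P(Message) and keep the best. A minimal sketch; all messages and probabilities below are made-up toy numbers, not from any real system:

```python
def decode(signal, messages, p_signal_given_msg, p_msg):
    """Pick the message maximizing P(signal | msg) * P(msg),
    which is proportional to P(msg | signal) by Bayes' rule."""
    return max(messages,
               key=lambda m: p_signal_given_msg[(signal, m)] * p_msg[m])

# Made-up toy numbers, purely for illustration:
p_msg = {"recognize speech": 0.6, "wreck a nice beach": 0.4}
p_signal_given_msg = {
    ("s1", "recognize speech"): 0.3,
    ("s1", "wreck a nice beach"): 0.2,
}
best = decode("s1", list(p_msg), p_signal_given_msg, p_msg)
# best == "recognize speech" (0.3 * 0.6 beats 0.2 * 0.4)
```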

Page 8: Automatic Speech Recognition Introduction. The Human Dialogue System.

What are the basic units for acoustic information?

When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable.

Words are good units for small-vocabulary SR, but not a good choice for large-vocabulary, continuous SR:

• Each word is treated individually, which implies a large amount of training data and storage.

• The recognition vocabulary may contain words that never occurred in the training data.

• It is expensive to model inter-word coarticulation effects.

Page 9: Automatic Speech Recognition Introduction. The Human Dialogue System.

Why phones are better units than words: an example

Page 10: Automatic Speech Recognition Introduction. The Human Dialogue System.

"SAY BITE AGAIN" spoken so that the phonemes are separated in time

[Figure: recorded sound waveform and its spectrogram]

Page 11: Automatic Speech Recognition Introduction. The Human Dialogue System.

"SAY BITE AGAIN" spoken normally

Page 12: Automatic Speech Recognition Introduction. The Human Dialogue System.

And why phones are still not the perfect choice

Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent).

However, each word is not a sequence of independent phonemes!

Our articulators move continuously from one position to another.

The realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects etc.

Different realizations of a phoneme are called allophones.

Page 13: Automatic Speech Recognition Introduction. The Human Dialogue System.

Example: different spectrograms for “eh”

Page 14: Automatic Speech Recognition Introduction. The Human Dialogue System.

Triphone model. Each triphone captures facts about the preceding and following phone.

• Monophone: p, t, k

• Triphone: iy-p+aa

• a-b+c means "phone b, preceded by phone a, followed by phone c"

In practice, systems use on the order of 100,000 triphones, and the triphone model is the one currently used (e.g., Sphinx).
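The a-b+c naming scheme is easy to mechanize. A sketch; the `sil` boundary marker is an assumption for illustration (real systems handle word boundaries in various ways):

```python
def triphone_labels(phones):
    """Convert a phone sequence into a-b+c triphone labels:
    phone b preceded by phone a, followed by phone c.
    'sil' pads the word boundaries (an assumption made here)."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# "need" = n iy d
labels = triphone_labels(["n", "iy", "d"])
# labels == ['sil-n+iy', 'n-iy+d', 'iy-d+sil']
```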

Page 15: Automatic Speech Recognition Introduction. The Human Dialogue System.

Parts of an ASR System

• Feature Calculation: produces acoustic vectors (x_t)

• Acoustic Modeling: maps acoustics to triphones (e.g., k, @)

• Pronunciation Modeling: maps triphones to words (e.g., cat: k@t, dog: dog, mail: mAl, the: D&, DE, …)

• Language Modeling: strings words together (e.g., cat dog: 0.00002, cat the: 0.0000005, the cat: 0.029, the dog: 0.031, the mail: 0.054, …)

Page 16: Automatic Speech Recognition Introduction. The Human Dialogue System.

Feature calculation


Page 17: Automatic Speech Recognition Introduction. The Human Dialogue System.

Feature calculation

[Figure: spectrogram, frequency vs. time]

Find the energy at each time step in each frequency channel.

Page 18: Automatic Speech Recognition Introduction. The Human Dialogue System.

Feature calculation

[Figure: frequency vs. time]

Take the Inverse Discrete Fourier Transform to decorrelate the frequencies.
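The decorrelating transform can be sketched in code. The slide says IDFT; in practice MFCC pipelines typically apply a type-II DCT to the log filterbank energies, which is what this illustrative sketch computes:

```python
import math

def dct2(log_energies):
    """Type-II DCT: projects correlated log filterbank energies onto
    cosine bases, giving (approximately) decorrelated cepstral coefficients."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * k * (i + 0.5) / n)
                for i, e in enumerate(log_energies))
            for k in range(n)]

# A flat log spectrum puts all its energy in coefficient 0:
coeffs = dct2([1.0, 1.0, 1.0, 1.0])
# coeffs[0] == 4.0, the remaining coefficients are ~0
```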

Page 19: Automatic Speech Recognition Introduction. The Human Dialogue System.

Feature calculation

Input: frames of the recorded signal

Output: acoustic observation vectors, e.g.

(-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …)
(0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …)
(-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …)
(0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …)

Page 20: Automatic Speech Recognition Introduction. The Human Dialogue System.

Robust Speech Recognition

• Different schemes have been developed for dealing with noise and reverberation:

– Additive noise: reduce the effects of particular frequencies

– Convolutional noise: remove the effects of linear filters (cepstral mean subtraction)

Cepstrum: the Fourier transform of the LOGARITHM of the spectrum.
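Cepstral mean subtraction itself is simple: subtract the long-term average of each cepstral coefficient, which cancels any stationary linear-filter effect. A minimal sketch:

```python
def cepstral_mean_subtraction(frames):
    """Remove a stationary convolutional (linear-filter) effect by
    subtracting the per-coefficient mean across all frames.
    `frames` is a list of equal-length cepstral vectors."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

normalized = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 4.0]])
# normalized == [[-1.0, -1.0], [1.0, 1.0]]
```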

Page 21: Automatic Speech Recognition Introduction. The Human Dialogue System.

How do we map from vectors to word sequences?

(-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …)
(0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …)
(-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …)
(0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …)

→ "That you" …???

Page 22: Automatic Speech Recognition Introduction. The Human Dialogue System.

HMM (again)!

(-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …)
(0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …)
(-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …)
(0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …)

→ "That you" … Pattern recognition with HMMs

Page 23: Automatic Speech Recognition Introduction. The Human Dialogue System.

ASR using HMMs

• Try to solve P(Message|Signal) by breaking the problem up into separate components

• Most common method: Hidden Markov Models

– Assume that a message is composed of words

– Assume that words are composed of sub-word parts (triphones)

– Assume that triphones have some sort of acoustic realization

– Use probabilistic models for matching acoustics to phones to words

Page 24: Automatic Speech Recognition Introduction. The Human Dialogue System.

Creating HMMs for word sequences: Context independent units

[Figure: words built from triphones]

Page 25: Automatic Speech Recognition Introduction. The Human Dialogue System.

“Need” triphone model

Page 26: Automatic Speech Recognition Introduction. The Human Dialogue System.

Hierarchical system of HMMs

HMM of a triphone HMM of a triphone HMM of a triphone

Higher level HMM of a word

Language model

Page 27: Automatic Speech Recognition Introduction. The Human Dialogue System.

To simplify, let's now ignore the lower-level HMMs. Each phone node has a "hidden" HMM (H2MM).

Page 28: Automatic Speech Recognition Introduction. The Human Dialogue System.

HMMs for ASR

go home

g  o  h  o  m

x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

Markov model backbone composed of sequences of triphones (hidden because we don't know the correspondences)

Acoustic observations

Each line represents a probability estimate (more later)

One alignment: g o o o o o o h m m

Page 29: Automatic Speech Recognition Introduction. The Human Dialogue System.

HMMs for ASR

go home

g o h o m

x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

Markov model backbone composed of phones (hidden because we don't know the correspondences)

Acoustic observations

Even with the same word hypothesis, we can have different alignments (red arrows). Also, we have to search over all word hypotheses.

Page 30: Automatic Speech Recognition Introduction. The Human Dialogue System.

For every HMM in the hierarchy: compute the maximum-probability sequence

[Figure: word lattice — "that" (th a t) followed by "he" (h iy) with p(he|that), or by "you" (y uw) with p(you|that); "he" followed by, e.g., "should" (sh uh d)]

COMPUTE: argmax_W P(W|X) = argmax_W P(X|W)P(W)/P(X) = argmax_W P(X|W)P(W)

where X = acoustic observations, (tri)phones, or phone sequences, and W = (tri)phones, phone sequences, or word sequences.

Page 31: Automatic Speech Recognition Introduction. The Human Dialogue System.

Search

• When trying to find W* = argmax_W P(W|X), we need to look at (in theory):

– All possible (triphone, word, etc.) sequences

– All possible segmentations/alignments of W and X

• Generally, this is done by searching the space of W:

– Viterbi search: a dynamic-programming approach that looks for the most likely path

– A* search: an alternative method that keeps a stack of hypotheses around

• If |W| is large, pruning becomes important

• We also need to estimate transition probabilities
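The Viterbi search above can be sketched for a toy two-state model. The states, observations, and probabilities below are invented for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Dynamic-programming search for the most likely hidden state path."""
    # V[t][s] = (best probability of reaching state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = (prob, prev)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Invented two-state toy model ("g" then "o") with made-up probabilities:
states = ["g", "o"]
start_p = {"g": 0.9, "o": 0.1}
trans_p = {"g": {"g": 0.4, "o": 0.6}, "o": {"g": 0.1, "o": 0.9}}
emit_p = {"g": {"x1": 0.8, "x2": 0.1}, "o": {"x1": 0.2, "x2": 0.7}}
best_path = viterbi(["x1", "x2", "x2"], states, start_p, trans_p, emit_p)
# best_path == ["g", "o", "o"]
```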

Page 32: Automatic Speech Recognition Introduction. The Human Dialogue System.

Training: speech corpora

• Have a speech corpus at hand

– Should have word (and preferably phone) transcriptions

– Divide into training, development, and test sets

• Develop models of prior knowledge

– Pronunciation dictionary

– Grammar, lexical trees

• Train acoustic models

– Possibly realigning the corpus phonetically

Page 33: Automatic Speech Recognition Introduction. The Human Dialogue System.

Acoustic Model

(-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …)  dh
(0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …)  a
(-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …)  a
(0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …)  t

• Assume that you can label each vector with a phonetic label

• Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g., neural networks)

N_a() estimates P(X | state = a)
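Building a Gaussian per phone label can be sketched in one dimension. Real systems use multivariate Gaussians (usually mixtures) over the whole feature vector; the sample values here are invented:

```python
import math

def fit_gaussian(samples):
    """Fit mean and variance to the labelled examples of one phone."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

def likelihood(x, mean, var):
    """Gaussian density, used as P(x | state = a)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Invented 1-D feature samples labelled "aa" vs. "iy":
mean_aa, var_aa = fit_gaussian([2.0, 2.5, 3.0])
mean_iy, var_iy = fit_gaussian([-1.0, -0.5, 0.0])
# A new frame at 2.4 scores higher under the "aa" model than the "iy" model.
```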

Page 34: Automatic Speech Recognition Introduction. The Human Dialogue System.

Pronunciation model

• The pronunciation model gives the connections between phones and words

• Multiple pronunciations (tomato):

[Figure: pronunciation networks — branching phone arcs for "tomato" (e.g., t – (ah|ow) – m – (ey|ah) – t – ow), and a network dh – a – t whose arcs carry probabilities p_dh, p_a, p_t with alternatives taken with probabilities 1-p_dh, 1-p_a, 1-p_t]

Page 35: Automatic Speech Recognition Introduction. The Human Dialogue System.

Training models for a sound unit

Page 36: Automatic Speech Recognition Introduction. The Human Dialogue System.

Language Model

• The language model gives the connections between words (e.g., bigrams: the probability of two-word sequences)

[Figure: "that" (dh a t) followed by "he" (h iy) with p(he|that), or by "you" (y uw) with p(you|that)]
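Bigram estimates like p(he|that) come from simple counts: how often "he" follows "that", divided by how often "that" starts a bigram. A sketch on an invented toy corpus:

```python
from collections import Counter

def bigram_probs(corpus):
    """Maximum-likelihood bigram estimates P(w2 | w1) from a word list."""
    history_counts = Counter(corpus[:-1])           # how often w1 starts a bigram
    pair_counts = Counter(zip(corpus, corpus[1:]))  # how often (w1, w2) occurs
    return {(w1, w2): c / history_counts[w1]
            for (w1, w2), c in pair_counts.items()}

# Invented toy corpus:
corpus = "that he said that you said".split()
p = bigram_probs(corpus)
# p[("that", "he")] == 0.5 and p[("that", "you")] == 0.5
```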

Page 37: Automatic Speech Recognition Introduction. The Human Dialogue System.

Lexical trees

START     S T AA R TD
STARTING  S T AA R DX IX NG
STARTED   S T AA R DX IX DD
STARTUP   S T AA R T AX PD
START-UP  S T AA R T AX PD

[Figure: prefix tree sharing S → T → AA → R, then branching to TD (start), DX → IX → NG (starting), DX → IX → DD (started), and T → AX → PD (startup, start-up)]
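A lexical tree is just a prefix tree over phone sequences, so shared prefixes like S-T-AA-R are stored once. A sketch; the `#word` sentinel key is an implementation choice made here, not from the slides:

```python
def build_lexical_tree(lexicon):
    """Build a prefix tree over phone sequences; words sharing a phone
    prefix (like S-T-AA-R) share those tree nodes.  The '#word' sentinel
    key (an assumption of this sketch) marks where a complete word ends."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.setdefault(ph, {})
        node["#word"] = word
    return root

lexicon = {
    "start":   ["S", "T", "AA", "R", "TD"],
    "startup": ["S", "T", "AA", "R", "T", "AX", "PD"],
}
tree = build_lexical_tree(lexicon)
# Both words share the S -> T -> AA -> R nodes before branching.
```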

Page 38: Automatic Speech Recognition Introduction. The Human Dialogue System.

Judging the quality of a system

• Usually, ASR performance is judged by the word error rate:

ErrorRate = 100 * (Subs + Ins + Dels) / Nwords

REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D C    S   C  C    I

100 * (1S + 1I + 1D) / 5 = 60%
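The word error rate can be computed with the standard edit-distance recurrence over words; a sketch that reproduces the 60% example above:

```python
def word_error_rate(ref, hyp):
    """WER = 100 * (subs + ins + dels) / len(ref), via edit distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return 100.0 * d[len(r)][len(h)] / len(r)

wer = word_error_rate("I WANT TO GO HOME", "WANT TWO GO HOME NOW")
# wer == 60.0 (1 substitution + 1 insertion + 1 deletion over 5 words)
```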

Page 39: Automatic Speech Recognition Introduction. The Human Dialogue System.

Judging the quality of a system

• Usually, ASR performance is judged by the word error rate

• This assumes that all errors are equal

– There is also a bit of a mismatch between the optimization criterion and the error measurement

• Other (task-specific) measures are sometimes used

– Task completion

– Concept error rate

Page 40: Automatic Speech Recognition Introduction. The Human Dialogue System.

Sphinx4: http://cmusphinx.sourceforge.net

Page 41: Automatic Speech Recognition Introduction. The Human Dialogue System.
Page 42: Automatic Speech Recognition Introduction. The Human Dialogue System.
Page 43: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Feature extractor

Page 44: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Feature extractor

• Mel-Frequency Cepstral Coefficients (MFCCs) → Feature vectors

Page 45: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Acoustic Observations

Page 46: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Acoustic Observations• Hidden States

Page 47: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Acoustic Observations• Hidden States• Acoustic Observation likelihoods

Page 48: Automatic Speech Recognition Introduction. The Human Dialogue System.

“Six”

Page 49: Automatic Speech Recognition Introduction. The Human Dialogue System.
Page 50: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Constructs the search graph of HMMs from:

– Acoustic model

– Statistical Language model ~or~ Grammar

– Dictionary

Page 51: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Constructs the HMMs of phones

• Produces observation likelihoods

Page 52: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Constructs the HMMs for units of speech

• Produces observation likelihoods

• Sampling rate is critical!

• WSJ vs. WSJ_8k

Page 53: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Constructs the HMMs for units of speech

• Produces observation likelihoods

• Sampling rate is critical!

• WSJ vs. WSJ_8k

• TIDIGITS, RM1, AN4, HUB4

Page 54: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Word likelihoods

Page 55: Automatic Speech Recognition Introduction. The Human Dialogue System.

• ARPA format example:

1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174

2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915

3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
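Each ARPA n-gram line holds a base-10 log probability, the n-gram's words, and an optional backoff weight (absent on highest-order n-grams). A sketch of parsing one line; the trailing-float heuristic here is an assumption and would misfire on words that look like numbers:

```python
def parse_arpa_line(line):
    """Split one ARPA n-gram line into (log10 prob, words, backoff weight).
    Heuristic (an assumption of this sketch): a trailing token that parses
    as a float is taken to be the backoff weight."""
    parts = line.split()
    logp = float(parts[0])
    try:
        backoff = float(parts[-1])
        words = tuple(parts[1:-1])
    except ValueError:
        backoff = None
        words = tuple(parts[1:])
    return logp, words, backoff

logp, words, backoff = parse_arpa_line("-0.7782 as the -0.2717")
# words == ("as", "the"); 10**logp is the conditional probability
```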

Page 56: Automatic Speech Recognition Introduction. The Human Dialogue System.

public <basicCmd> = <startPolite> <command> <endPolite>;

public <startPolite> = (please | kindly | could you ) *;

public <endPolite> = [ please | thanks | thank you ];

<command> = <action> <object>;

<action> = (open | close | delete | move);

<object> = [the | a] (window | file | menu);

Page 57: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Maps words to phoneme sequences

Page 58: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Example from cmudict.06d:

POULTICE P OW L T AH S

POULTICES P OW L T AH S IH Z

POULTON P AW L T AH N

POULTRY P OW L T R IY

POUNCE P AW N S

POUNCED P AW N S T

POUNCEY P AW N S IY

POUNCING P AW N S IH NG

POUNCY P UW NG K IY
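Reading entries in this format is straightforward: the first field of each line is the word, the remaining fields are its phones. A minimal sketch:

```python
def load_dictionary(lines):
    """Parse cmudict-style entries 'WORD PH1 PH2 ...' into word -> phone list."""
    pron = {}
    for line in lines:
        word, *phones = line.split()
        pron[word] = phones
    return pron

entries = ["POULTRY P OW L T R IY", "POUNCE P AW N S"]
pron = load_dictionary(entries)
# pron["POULTRY"] == ["P", "OW", "L", "T", "R", "IY"]
```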

Page 59: Automatic Speech Recognition Introduction. The Human Dialogue System.
Page 60: Automatic Speech Recognition Introduction. The Human Dialogue System.
Page 61: Automatic Speech Recognition Introduction. The Human Dialogue System.
Page 62: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Can be statically or dynamically constructed

Page 63: Automatic Speech Recognition Introduction. The Human Dialogue System.
Page 64: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Maps feature vectors to search graph

Page 65: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Searches the graph for the “best fit”

Page 66: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Searches the graph for the “best fit”

• P(sequence of feature vectors | word/phone)

• a.k.a. P(O|W): "how likely is the input to have been generated by the word?"

Page 67: Automatic Speech Recognition Introduction. The Human Dialogue System.

F ay ay ay ay v v v v v
F f ay ay ay ay v v v v
F f f ay ay ay ay v v v
F f f f ay ay ay ay v v
F f f f ay ay ay ay ay v
F f f f f ay ay ay ay v
F f f f f f ay ay ay v
…

Page 68: Automatic Speech Recognition Introduction. The Human Dialogue System.

Time: O1 O2 O3

Page 69: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Uses algorithms to weed out low scoring paths during decoding

Page 70: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Words!

Page 71: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Most common metric

• Measures the # of modifications needed to transform the recognized sentence into the reference sentence

Page 72: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Reference: “This is a reference sentence.”

• Result: “This is neuroscience.”

Page 73: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Reference: “This is a reference sentence.”

• Result: “This is neuroscience.”

• Requires 2 deletions, 1 substitution

Page 74: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Reference: “This is a reference sentence.”

• Result: “This is neuroscience.”

Page 75: Automatic Speech Recognition Introduction. The Human Dialogue System.

• Reference: “This is a reference sentence.”

• Result: “This is neuroscience.”

• D S D

Page 76: Automatic Speech Recognition Introduction. The Human Dialogue System.

Installation details

• http://cmusphinx.sourceforge.net/wiki/sphinx4:howtobuildand_run_sphinx4

• Student report on NLP course web site