DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29.

DT2112 Speech Recognition by Computers

Giampiero Salvi

KTH/CSC/TMH

VT 2015


Motivation

- Natural way of communication (no training needed)
- Leaves hands and eyes free (good for functionally disabled)
- Effective (higher data rate than typing)
- Can be transmitted/received inexpensively (phones)

A dream of Artificial Intelligence

2001: A Space Odyssey (1968)

ASR in a Broader Context

[Figure: block diagram connecting Automatic Speech Recognition, Spoken Language Understanding, Dialogue Manager, and Text to Speech]

The ASR Scope

Convert speech into text

[Figure: speech signal → Automatic Speech Recognition → “My name is ...” [confidence score]]

Not considered here:

- non-verbal signals
- prosody
- multi-modal interaction


A very long endeavour

1952, Bell Laboratories: isolated digit recognition, single speaker, hardware based [2]

[2] K. H. Davis, R. Biddulph, and S. Balashek. “Automatic Recognition of Spoken Digits”. In: JASA 24.6 (1952), pp. 637–642.

An underestimated challenge

For 60 years, many bold announcements

http://www.itl.nist.gov/iad/mig/publications/ASRhistory/


Main variables in ASR

Speaking mode: isolated words vs continuous speech

Speaking style: read speech vs spontaneous speech

Speakers: speaker dependent vs speaker independent

Vocabulary: small (<20 words) vs large (>50 000 words)

Robustness: against background noise

Challenges: Variability

Between speakers

- Age
- Gender
- Anatomy
- Dialect

Within speaker

- Stress
- Emotion
- Health condition
- Read vs spontaneous
- Adaptation to environment (Lombard effect)
- Adaptation to listener

Environment

- Noise
- Room acoustics
- Microphone distance
- Microphone, telephone
- Bandwidth

Listener

- Age
- Mother tongue
- Hearing loss
- Known / unknown
- Human / machine

Applications today

Call centers:

- traffic information
- time-tables
- booking...

Accessibility

- Dictation
- hands-free control (TV, video, telephone)

Smart phones

- Siri, Android...

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach

Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

More details in DT2118: “Speech and Speaker Recognition”

Components of ASR System

[Figure: block diagram of an ASR system. Representation: Speech Signal → Spectral Analysis → Feature Extraction. Decoder: Search and Match → Recognised Words. Constraints (knowledge): Acoustic Models, Lexical Models, Language Models.]


Speech Signal Representations

[Figure: Representation block: Speech Signal → Spectral Analysis → Feature Extraction]

Goals:

- disregard irrelevant information
- optimise relevant information for modelling

Means:

- try to model essential aspects of speech production
- imitate auditory processes
- consider properties of statistical modelling

Examples of Speech Sounds

http://www.speech.kth.se/wavesurfer/


Feature Extraction and Speech Production

[Figure: anatomy of the speech production organs (lungs, diaphragm, trachea, esophagus, larynx, pharyngeal cavity, oral cavity, nasal cavity, soft palate (velum), hard palate, tongue, teeth, lip, jaw, nostril), next to a block diagram: muscle force and relaxation → lungs → trachea → vocal folds (glottis) → pharyngeal cavity → oral cavity (mouth output) and, through the velum, nasal cavity (nose output)]

Source/Filter Model, General Case

Vowels

[Figure: vocal tract diagram with legend: source (periodic), front cavity, back cavity, back cavity (2nd approx.)]

Fricatives (e.g. sh) or plosives (e.g. k)

[Figure: vocal tract diagram with legend: source (noise or impulsive), front cavity, back cavity, back cavity (2nd approx.)]

Fricatives (e.g. s) or plosives (e.g. t)

[Figure: vocal tract diagram with legend: source (noise or impulsive), front cavity, back cavity, back cavity (2nd approx.)]

Nasalised vowels

[Figure: vocal tract diagram with legend: source (periodic), front cavity, back cavity, back cavity (2nd approx.)]

Examples

[Figure: two spectra, dB vs frequency (kHz) from 0 to 8 kHz, one for a vocalic sound and one for a fricative sound, each shown with the corresponding vocal tract configuration]

Relevant vs Irrelevant Information

For the purpose of transcribing words:

Relevant: vocal tract shape → spectral envelope

Irrelevant: vocal fold vibration frequency (f0) → spectral details

[Figure: spectrum of a vocalic sound, dB vs frequency (kHz), 0 to 8 kHz]

Exceptions:

- tonal languages (Chinese)
- pitch and prosody convey meaning

Linear Prediction Analysis

Attempt to model the vocal tract filter

$$x[n] = \sum_{k=1}^{p} a_k \, x[n-k]$$

better match at spectral peaks than valleys
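
The linear prediction equation above can be illustrated with a minimal sketch. The slides do not say how the coefficients a_k are estimated; the autocorrelation method with the Levinson-Durbin recursion used here is one common choice and is an assumption, as are the function names `autocorrelation` and `lpc`.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal x."""
    n = len(x)
    return [sum(x[i] * x[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def lpc(x, p):
    """Estimate a_1..a_p in x[n] ~ sum_k a_k * x[n - k].

    Assumption: autocorrelation method solved with the Levinson-Durbin
    recursion (not stated on the slide). Returns (coefficients, error).
    """
    r = autocorrelation(x, p)
    a = [0.0] * (p + 1)      # a[0] is unused, a[k] holds a_k
    err = r[0]               # prediction error energy of the order-0 model
    for i in range(1, p + 1):
        # Reflection coefficient for model order i.
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= 1.0 - k_i * k_i
    return a[1:], err


# A decaying exponential obeys x[n] = 0.9 * x[n - 1], so a_1 should come
# out close to 0.9 and a_2 close to 0.
signal = [0.9 ** n for n in range(200)]
coefficients, residual_energy = lpc(signal, 2)
```

In practice the analysis would be run on each short windowed frame rather than on a whole recording.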

Mel Frequency Cepstrum Coefficients

- imitate aspects of auditory processing
- de facto standard in ASR
- does not assume all-pole model of the spectrum
- uncorrelated: easier to model statistically

MFCCs Calculation

FFT → Mel Filterbank → Cosine Transform

[Figure: spectrum of /a:/ over the filterbank channels and the resulting cepstrum of /a:/ (coefficients C1 to C4)]
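
The slides do not give a formula for the mel scale. As a rough sketch of how the filterbank channels could be placed, the code below assumes the widely used conversion mel = 2595 log10(1 + f/700); the function names and the choice of 16 channels between 0 and 8000 Hz are illustrative assumptions, not values taken from the lecture.

```python
import math


def hz_to_mel(f_hz):
    """Common mel-scale approximation (an assumption, not from the slides)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, f_min_hz, f_max_hz):
    """Centre frequencies (Hz) equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_filters)]


# Hypothetical configuration: 16 channels covering 0 to 8000 Hz.
centers = mel_center_frequencies(16, 0.0, 8000.0)
```

The centres are dense at low frequencies and sparse at high frequencies, which is the sense in which the filterbank imitates auditory processing.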

Cosine Transform

$$C_j = \sqrt{\frac{2}{N}} \sum_{i=1}^{N} A_i \cos\left(\frac{j\pi(i-0.5)}{N}\right)$$

[Figure: cosine basis functions W1 to W4 over the filterbank channels; filterbank spectra A_i of /a:/ and /s:/; the resulting cepstra C_j (C1 to C4) of /a:/ and /s:/]
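
The cosine transform formula can be written out directly. This is a plain, unoptimised sketch of that formula; the function name `cosine_transform` and the test inputs are illustrative assumptions.

```python
import math


def cosine_transform(A, num_coeffs):
    """C_j = sqrt(2/N) * sum_{i=1..N} A_i * cos(j*pi*(i-0.5)/N), j = 1..num_coeffs."""
    N = len(A)
    scale = math.sqrt(2.0 / N)
    return [
        scale * sum(A[i - 1] * math.cos(j * math.pi * (i - 0.5) / N)
                    for i in range(1, N + 1))
        for j in range(1, num_coeffs + 1)
    ]


# A flat filterbank output has no spectral shape, so C1..C4 are all ~0.
flat = cosine_transform([5.0] * 16, 4)

# An input shaped like the first basis function puts all its energy in C1.
N = 16
tilted = cosine_transform(
    [math.cos(math.pi * (i - 0.5) / N) for i in range(1, N + 1)], 4)
```

Low-order coefficients capture the broad spectral envelope, which is why only the first few are kept.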

MFCCs: typical values

- 12 coefficients C1–C12
- Energy (could be C0)
- Delta coefficients (derivatives in time)
- Delta-delta (second order derivatives)
- total: 39 coefficients per frame (analysis window)
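
The count of 39 comes from (12 + 1) static values times 3 (static, delta, delta-delta). A sketch of how the dynamic features might be appended is shown below. The slides do not specify how the derivatives are computed; the regression formula over ±2 frames with edge frames repeated is a common convention and is an assumption here, as are the function names.

```python
def deltas(frames, n=2):
    """Time derivatives by linear regression over +/- n frames.

    Assumption: d_t = sum_i i*(c[t+i] - c[t-i]) / (2 * sum_i i^2),
    with the first and last frame repeated at the edges.
    """
    T = len(frames)
    denom = 2.0 * sum(i * i for i in range(1, n + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for i in range(1, n + 1):
                ahead = frames[min(t + i, T - 1)][d]
                behind = frames[max(t - i, 0)][d]
                num += i * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Stack static, delta and delta-delta values for every frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# 10 hypothetical frames of 13 static values (12 MFCCs + energy) each.
static = [[float(t)] * 13 for t in range(10)]
features = add_dynamic_features(static)   # each frame now has 39 values
```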

A time varying signal

- speech is time varying
- short segments are quasi-stationary
- use short-time analysis

Short-Time Fourier Analysis

[Figure: a windowed segment of the waveform (amplitude vs samples, around samples 7500 to 10000) and its power spectrum (dB) vs frequency (Hz), 0 to 8000 Hz, shown for successive analysis windows]

Windowing, typical values

- signal sampling frequency: 8–20 kHz
- analysis window: 10–50 ms
- frame interval: 10–25 ms (100–40 Hz)
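
To make the typical values concrete, here is a minimal framing sketch. The defaults of a 25 ms window and a 10 ms frame interval lie within the ranges above but are otherwise an assumption, as is the function name `frame_signal`; window shaping (for example a Hamming window) is left out.

```python
def frame_signal(samples, sample_rate_hz, window_ms=25.0, shift_ms=10.0):
    """Cut a signal into overlapping analysis windows (frames)."""
    window = int(round(sample_rate_hz * window_ms / 1000.0))
    shift = int(round(sample_rate_hz * shift_ms / 1000.0))
    frames = []
    start = 0
    # Only full windows are kept; a trailing partial window is dropped.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += shift
    return frames


# One second of (silent) audio at 16 kHz: 400-sample windows every 160 samples.
one_second = [0.0] * 16000
frames = frame_signal(one_second, 16000)
print(len(frames))  # 98
```

A 10 ms frame interval corresponds to the 100 Hz frame rate quoted above.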

Frame-Based Processing

[Figure: spectrogram of sx352.WAV (0 to 7 kHz, about 0.95 s) with aligned phonetic and word-level transcriptions, including the words “according”, “to”, “my”]

Comparing frames

- city block distance: $d(x, y) = \sum_i |x_i - y_i|$
- Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
- Mahalanobis distance: $d(x, y) = \sum_i (x_i - \mu_y)^2 / \sigma_y$
- probability function: $f(X = x \mid \mu, \Sigma) = N(x; \mu, \Sigma)$
- artificial neural networks: $d = f(\sum_i w_i x_i - \theta)$
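
The first three frame distances can be sketched in a few lines. The Mahalanobis-style function assumes one mean and one variance per feature dimension (a diagonal covariance) and returns the squared form without a square root, as written on the slide; that reading, and all function names, are assumptions.

```python
import math


def city_block(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def diagonal_mahalanobis(x, mean, variance):
    """Squared distance to a reference, scaled per dimension by its variance.

    Assumption: diagonal covariance, one variance per feature dimension.
    """
    return sum((a - m) ** 2 / v for a, m, v in zip(x, mean, variance))


print(city_block([1.0, 2.0], [4.0, 6.0]))   # 7.0
print(euclidean([1.0, 2.0], [4.0, 6.0]))    # 5.0
```

Scaling by the variance keeps a dimension with a large natural spread from dominating the distance.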

Comparing Utterances

In order to recognise speech we have to be able to compare different utterances

[Figure: two realisations of the same Swedish utterance: “Va jobbaru me” (casual pronunciation) and “Vad jobbar du med” (“What do you work with?”)]

Fixed vs Variable Length Representation


Combining frame-wise scores into utterance scores

Template Matching

- oldest technique
- simple comparison of template patterns
- compensate for varying speech rate (Dynamic Programming)

Hidden Markov Models (HMMs)

- most used technique
- models of segmental structure of speech
- recognition by Viterbi search (Dynamic Programming)


Template Matching


Dynamic Programming

- compare any possible alignment
- problem: exponential with H and K!

[Figure: alignment grid with the unknown utterance (frames 1 to H) on the horizontal axis and the reference utterance (frames 1 to K) on the vertical axis]

Dynamic Time Warping (DTW) algorithm

    for h = 1 to H do
      for k = 1 to K do
        AccD[h, k] = LocD[h, k] + min(AccD[h − 1, k], AccD[h − 1, k − 1], AccD[h, k − 1])

[Figure: the accumulated distances are filled in cell by cell over the grid; the distance between the two utterances is AccD[H, K]]
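
The DTW recursion can be turned into runnable code with very little change. The sketch below pads the accumulated-distance table with an extra row and column of infinities so that the h − 1 and k − 1 lookups are always valid; that padding, and the names `dtw_distance` and `letter_distance`, are implementation choices not taken from the slides.

```python
def dtw_distance(unknown, reference, local_distance):
    """Accumulated DTW distance AccD[H, K] between two sequences."""
    H, K = len(unknown), len(reference)
    INF = float("inf")
    # Row 0 and column 0 are padding; only acc[0][0] is a valid start.
    acc = [[INF] * (K + 1) for _ in range(H + 1)]
    acc[0][0] = 0.0
    for h in range(1, H + 1):
        for k in range(1, K + 1):
            loc = local_distance(unknown[h - 1], reference[k - 1])
            acc[h][k] = loc + min(acc[h - 1][k],
                                  acc[h - 1][k - 1],
                                  acc[h][k - 1])
    return acc[H][K]


def letter_distance(a, b):
    """Local distance for letters: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1


print(dtw_distance("ALLDRIG", "ALDRIG", letter_distance))  # 0.0
print(dtw_distance("ALLDRIG", "ALLTID", letter_distance))  # 3.0
```

The table has H × K cells and each is filled once, so the cost is proportional to H · K instead of growing exponentially with the number of possible alignments.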

DP Example: Spelling

- observations are letters
- local distance: 0 (same letter), 1 (different letter)
- Unknown utterance: ALLDRIG
- Reference1: ALDRIG
- Reference2: ALLTID
- Problem: find closest match

Distance char-by-char:

- ALLDRIG–ALDRIG = 5
- ALLDRIG–ALLTID = 4
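
The char-by-char figures can be reproduced with a position-by-position comparison. The sketch assumes that letters left over in the longer word each count as one mismatch, which is the reading that gives the 5 and 4 quoted above; the function name is illustrative.

```python
def char_by_char_distance(a, b):
    """Count mismatches position by position, without any time alignment.

    Assumption: unmatched trailing letters each cost 1.
    """
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


print(char_by_char_distance("ALLDRIG", "ALDRIG"))  # 5
print(char_by_char_distance("ALLDRIG", "ALLTID"))  # 4
```

Without alignment the wrong reference (ALLTID) looks closer, because a single repeated letter shifts every later position out of step.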

DP Example: Solution

Reference ALDRIG (rows, bottom to top) against unknown ALLDRIG (columns):

LocD[h,k] =

    G  1 1 1 1 1 1 0
    I  1 1 1 1 1 0 1
    R  1 1 1 1 0 1 1
    D  1 1 1 0 1 1 1
    L  1 0 0 1 1 1 1
    A  0 1 1 1 1 1 1
       A L L D R I G

AccD[h,k] =

    G  5 4 4 3 2 1 0
    I  4 3 3 2 1 0 1
    R  3 2 2 1 0 1 2
    D  2 1 1 0 1 2 3
    L  1 0 0 1 2 3 4
    A  0 1 2 3 4 5 6
       A L L D R I G

Distance ALLDRIG–ALDRIG: AccD[H,K] = 0

Exercise (5 min): distance ALLDRIG–ALLTID?

Reference ALLTID (rows, bottom to top) against unknown ALLDRIG (columns):

LocD[h,k] =

    D  1 1 1 0 1 1 1
    I  1 1 1 1 1 0 1
    T  1 1 1 1 1 1 1
    L  1 0 0 1 1 1 1
    L  1 0 0 1 1 1 1
    A  0 1 1 1 1 1 1
       A L L D R I G

AccD[h,k] =

    D  5 3 3 2 3 3 3
    I  4 2 2 2 2 2 3
    T  3 1 1 1 2 3 4
    L  2 0 0 1 2 3 4
    L  1 0 0 1 2 3 4
    A  0 1 2 3 4 5 6
       A L L D R I G

Distance ALLDRIG–ALDRIG: AccD[H,K] = 0

Distance ALLDRIG–ALLTID: AccD[H,K] = 3

Best path: Backtracking

Sometimes we want to know the path

1. at each point [h,k] remember the minimum distance predecessor (back pointer)
2. at the end point [H,K] follow the back pointers until the start
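
The two backtracking steps can be sketched as an extension of the DTW recursion. The padded table, the 1-based (h, k) cells in the returned path, and the tie-breaking order (diagonal first) are implementation assumptions; the slides only state the back-pointer idea.

```python
def dtw_with_path(unknown, reference, local_distance):
    """Return (AccD[H, K], best path as a list of 1-based (h, k) cells)."""
    H, K = len(unknown), len(reference)
    INF = float("inf")
    acc = [[INF] * (K + 1) for _ in range(H + 1)]
    back = [[None] * (K + 1) for _ in range(H + 1)]
    acc[0][0] = 0.0
    for h in range(1, H + 1):
        for k in range(1, K + 1):
            # Step 1: remember the minimum-distance predecessor.
            candidates = [
                (acc[h - 1][k - 1], (h - 1, k - 1)),
                (acc[h - 1][k], (h - 1, k)),
                (acc[h][k - 1], (h, k - 1)),
            ]
            best_cost, best_prev = min(candidates, key=lambda c: c[0])
            acc[h][k] = local_distance(unknown[h - 1], reference[k - 1]) + best_cost
            back[h][k] = best_prev
    # Step 2: follow the back pointers from the end point to the start.
    path = []
    cell = (H, K)
    while cell != (0, 0):
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return acc[H][K], path


def letter_distance(a, b):
    return 0 if a == b else 1


distance, path = dtw_with_path("ALLDRIG", "ALDRIG", letter_distance)
# path: [(1, 1), (2, 2), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6)]
```

In the example both L's of the unknown word are aligned to the single L of the reference, which is how the warping absorbs the extra letter.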

Properties of Template Matching

Pros:

+ No need for phonetic transcriptions

+ within-word co-articulation for free

+ high time resolution

Cons:

– cross-word co-articulation not modelled

– requires recordings of every word

– not easy to model variation

– does not scale up with vocabulary size


A probabilistic perspective

1. Compute the probability of a word sequence given the acoustic observation: P(words|sounds)
2. Find the optimal word sequence by maximising the probability:

$$\widehat{\text{words}} = \arg\max_{\text{words}} P(\text{words} \mid \text{sounds})$$

A probabilistic perspective: Bayes’ rule

$$P(\text{words} \mid \text{sounds}) = \frac{P(\text{sounds} \mid \text{words}) \, P(\text{words})}{P(\text{sounds})}$$

- P(sounds|words) can be estimated from training data and transcriptions
- P(words): a priori probability of the words (Language Model)
- P(sounds): a priori probability of the sounds (constant, can be ignored)
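
Because P(sounds) is the same for every candidate, the decision reduces to maximising P(sounds|words) · P(words). The toy sketch below does this in the log domain for two candidate word sequences; the candidate strings and all scores are made-up numbers for illustration only, and `decode` is a hypothetical helper.

```python
import math


def decode(candidates):
    """Pick the word sequence maximising log P(sounds|words) + log P(words).

    candidates maps a word sequence to a pair
    (acoustic log-likelihood, language-model log-probability).
    P(sounds) is omitted because it does not depend on the words.
    """
    return max(candidates, key=lambda w: candidates[w][0] + candidates[w][1])


# Hypothetical scores: the second candidate fits the audio slightly better,
# but the language model finds it far less likely.
candidates = {
    "recognise speech": (-120.0, math.log(0.01)),
    "wreck a nice beach": (-119.0, math.log(0.0001)),
}
print(decode(candidates))  # recognise speech
```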

Components of ASR System

[Figure: the ASR system block diagram again, with the Acoustic Models block highlighted]

Probabilistic ModellingProblem: How do we model P(sounds|words)?

File: sx352.WAV Page: 1 of 1 Printed: Mon Dec 05 09:01:39

1

2

3

4

5

6

7

kHz

0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

time

ttclnixaymaxttclniydxraokkclixh#

mytoaccording

Every feature vector (observation at time t) is acontinuous stochastic variable (e.g. MFCC)

47 / 117

Stationarity

Problem: speech is not stationary

- we need to model short segments independently
- the fundamental unit cannot be the word, but must be shorter
- usually we model three segments for each phoneme

48 / 117

Local probabilities (frame-wise)

If the segment is sufficiently short,

P(sounds|segment)

can be modelled with standard probability distributions, usually a Gaussian or a Gaussian mixture.

49 / 117
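As a sketch of what such a local model computes, the function below evaluates the log density of one feature vector under a Gaussian mixture. Diagonal covariances are assumed here for simplicity, and the two-dimensional, two-component mixture is a made-up example (real front-ends typically use higher-dimensional features such as MFCCs).

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, weights, means, variances):
    """Log density of a Gaussian mixture, using log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical mixture: weights, means and variances are illustrative values.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frame = [0.0, 0.0]
print(log_gmm(frame, weights, means, variances))
```

Working in the log domain avoids numerical underflow when many frame-level values are later combined over an utterance.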

Global Probabilities (utterance)

Problem: How do we combine the different P(sounds|segment) to form P(sounds|words)?

Answer: Hidden Markov Model (HMM)

[Diagram: the word "wash" is expanded into the phonemes w, ao, sh, and each phoneme into three states: w1 w2 w3, ao1 ao2 ao3, sh1 sh2 sh3]

50 / 117

Hidden Markov Models (HMMs)

[Diagram: HMM with the states s1, s2, s3]

Elements:

- set of states: $S = \{s_1, s_2, s_3\}$
- transition probabilities: $T(s_a, s_b) = P(s_b, t \mid s_a, t-1)$
- prior probabilities: $\pi(s_a) = P(s_a, t_0)$
- state to observation probabilities: $B(o, s_a) = P(o \mid s_a)$

51 / 117

Hidden Markov Models (HMMs)

[Figure: unknown utterance]

52 / 117

HMM-questions

1. What is the probability that the model has generated the sequence of observations? (isolated word recognition): forward algorithm

2. What is the most likely state sequence given the observation sequence? (continuous speech recognition): Viterbi algorithm [5]

3. How can the model parameters be estimated from examples? (training): Baum-Welch [1]

[5] A. J. Viterbi. "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm". In: IEEE Trans. Inform. Theory IT-13 (Apr. 1967), pp. 260–269

[1] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains". In: Ann. Math. Statist. 41.1 (1970), pp. 164–171

53 / 117
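The first question can be sketched with the forward recursion. For brevity this example assumes discrete observation symbols (the slides use continuous feature vectors), and the three-state left-to-right model below is hypothetical; `pi`, `T` and `B` follow the element names used for the HMM definition.

```python
def forward(pi, T, B, obs):
    """Return P(obs | model) for a discrete-observation HMM.

    pi[a]   : prior probability of starting in state a
    T[a][b] : transition probability from state a to state b
    B[a][o] : probability of observing symbol o in state a
    """
    n = len(pi)
    # Initialisation: start in state a and emit the first symbol.
    alpha = [pi[a] * B[a][obs[0]] for a in range(n)]
    # Recursion: sum over all predecessor states, then emit the next symbol.
    for o in obs[1:]:
        alpha = [
            sum(alpha[a] * T[a][b] for a in range(n)) * B[b][o]
            for b in range(n)
        ]
    # Termination: sum over all final states.
    return sum(alpha)

# Hypothetical left-to-right model with three states and two symbols.
pi = [1.0, 0.0, 0.0]
T = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
B = [[0.9, 0.1],
     [0.2, 0.8],
     [0.9, 0.1]]
likelihood = forward(pi, T, B, [0, 1, 0])
print(likelihood)
```

For isolated word recognition, the same function would be evaluated once per word model and the likelihoods compared.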

Isolated Words Recognition

[Diagram: the utterance is scored against one model per word: hmm1, hmm2, hmm3, ..., hmmK]

Compare Likelihoods (forward-backward)

54 / 117

Continuous Speech Recognition

[Diagram: network of word models hmm1, hmm2, hmm3, ..., hmmK]

Viterbi algorithm

55 / 117
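A minimal sketch of Viterbi decoding, again assuming discrete observation symbols and a hypothetical three-state model. In a continuous speech recogniser the states would belong to a network of connected word models, so that the best state path also identifies the word sequence; that network is not modelled here.

```python
def viterbi(pi, T, B, obs):
    """Return (best_state_path, path_probability) for a discrete-observation HMM."""
    n = len(pi)
    delta = [pi[a] * B[a][obs[0]] for a in range(n)]
    backpointers = []
    for o in obs[1:]:
        step, new_delta = [], []
        for b in range(n):
            # Best predecessor of state b at this time step.
            best_a = max(range(n), key=lambda a: delta[a] * T[a][b])
            step.append(best_a)
            new_delta.append(delta[best_a] * T[best_a][b] * B[b][o])
        backpointers.append(step)
        delta = new_delta
    # Backtrack from the most probable final state.
    state = max(range(n), key=lambda a: delta[a])
    best_prob = delta[state]
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    path.reverse()
    return path, best_prob

# Hypothetical left-to-right model with three states and two symbols.
pi = [1.0, 0.0, 0.0]
T = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
B = [[0.9, 0.1],
     [0.2, 0.8],
     [0.9, 0.1]]
path, prob = viterbi(pi, T, B, [0, 1, 0])
print(path, prob)
```

Unlike the forward algorithm, each step keeps only the best predecessor (a maximum instead of a sum), which is what makes backtracking to a single state sequence possible.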

Modelling Coarticulation

Example: peat /pi:t/ vs wheel /wi:l/

[Figure: waveform and spectrogram for each of the two words; axes: Time (seconds), Frequency (Hz)]

56 / 117

Modelling Coarticulation

Context dependent models (CD-HMMs)

- Duplicate each phoneme model depending on left and right context:
  - from "a" monophone model
  - to "d−a+f", "d−a+g", "l−a+s" ... triphone models
- If there are N = 50 phonemes in the language, there are N^3 = 125000 potential triphones
  - many of them are not exploited by the language

57 / 117
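The relabelling from monophones to triphones can be sketched as follows. The helper name `to_triphones` and the `sil` boundary symbol used to pad the word edges are assumptions made for this example, and a plain hyphen is used in the labels.

```python
def to_triphones(phones, boundary="sil"):
    """Rewrite a phone sequence as left-centre+right triphone labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

labels = to_triphones(["w", "ao", "sh"])
print(labels)

# Upper bound on the number of distinct triphones for N = 50 phonemes.
n_phonemes = 50
potential_triphones = n_phonemes ** 3
print(potential_triphones)
```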

Number of parameters

Example:

- a large vocabulary recogniser may have 60000 triphone models
- each model has 3 states
- each state may have 32 mixture components with 1 + 39 × 2 parameters each (weight, means, variances): 39 × 32 × 2 + 32 = 2528

In total this is 60000 × 3 × 2528 ≈ 455 million parameters!

58 / 117
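The arithmetic on this slide can be checked directly. The variable names are chosen for this sketch; as on the slide, transition probabilities are not counted.

```python
n_models = 60000          # triphone models
states_per_model = 3
mixtures_per_state = 32
feature_dim = 39          # dimension of each feature vector

# One weight plus a mean and a variance per feature dimension.
params_per_mixture = 1 + 2 * feature_dim
params_per_state = mixtures_per_state * params_per_mixture
total_params = n_models * states_per_model * params_per_state
print(params_per_state, total_params)
```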

Similar Coarticulation

/ri:/ vs /wi:/

[Figure: waveform and spectrogram for each of the two syllables; axes: Time (seconds), Frequency (Hz)]

59 / 117

Tying to reduce complexity

Example: similar triphones d−a+m and t−a+m

- same right context, similar left context
- 3rd state is expected to be very similar
- 2nd state may also be similar

States (and their parameters) can be shared between models

+ reduce complexity
+ more data to estimate each parameter
– fine detail may be lost

done with CART (classification and regression tree) methodology

60 / 117

Components of ASR System

[Block diagram] Speech Signal → Spectral Analysis → Feature Extraction → Search and Match → Recognised Words

Representation: Spectral Analysis, Feature Extraction
Decoder: Search and Match
Constraints - Knowledge: Acoustic Models, Lexical Models, Language Models

Next: Lexical Models

61 / 117

Lexical Models

- in general, specify the sequence of phonemes for each word
- example:

"dictionary"   IPA                       X-SAMPA
UK:            /d ɪ k ʃ ə n (ə) ɹ i/     /d I k S @ n (@) r i/
USA:           /d ɪ k ʃ ə n ɛ ɹ i/       /d I k S @ n E r i/

- expensive resources
- include multiple pronunciations
- phonological rules (assimilation, deletion)

62 / 117

Pronunciation Network

Example: tomato

[Diagram: weighted pronunciation network over the phones /t/, /@/, /oU/, /m/, /A/, /eI/, /R/, /t/, /oU/; arc probabilities shown: 0.5, 0.3, 0.2, 0.7, 0.3, 0.2, 0.8, 0.8, 0.2]

63 / 117

Assimilation

did you: /d ɪ dʒ j ə/
set you: /s ɛ tʃ ɜ/
last year: /l æ s tʃ iː ɹ/
because you've: /b iː k ə ʒ uː v/

64 / 117

Deletion

find him: /f aɪ n ɪ m/
around this: /ə ɹ aʊ n ɪ s/
let me in: /l ɛ m iː n/

65 / 117

Out of Vocabulary Words

- Proper names often not in lexicon
- derive pronunciation automatically
- English has very complex grapheme-to-phoneme rules
- attempts to derive pronunciation from speech recordings

66 / 117

Components of ASR System

[Block diagram] Speech Signal → Spectral Analysis → Feature Extraction → Search and Match → Recognised Words

Representation: Spectral Analysis, Feature Extraction
Decoder: Search and Match
Constraints - Knowledge: Acoustic Models, Lexical Models, Language Models

Next: Language Models

67 / 117

Why do we need language models?

Bayes' rule:

$$P(\text{words} \mid \text{sounds}) = \frac{P(\text{sounds} \mid \text{words})\, P(\text{words})}{P(\text{sounds})}$$

where P(words) is the a priori probability of the words (Language Model)

We could use non-informative priors (P(words) = 1/N), but...

68 / 117

Branching Factor

- if we have N words in the dictionary
- at every word boundary we have to consider N equally likely alternatives
- N can be in the order of millions

[Diagram: a word followed by the branches word1, word2, ..., wordN]

69 / 117

Ambiguity

"ice cream" vs "I scream"

/aɪ s k ɹ iː m/

70 / 117

Language Models

$$P(\text{words} \mid \text{sounds}) = \frac{P(\text{sounds} \mid \text{words})\, P(\text{words})}{P(\text{sounds})}$$

Finite state networks (hand-made, see lab)

- formal language, e.g. traffic control

Statistical Models (N-grams)

- unigrams: $P(w_i)$
- bigrams: $P(w_i \mid w_{i-1})$
- trigrams: $P(w_i \mid w_{i-1}, w_{i-2})$
- ...

71 / 117

Chomsky's formal grammar

Noam Chomsky: linguist, philosopher, ...

G = (V, T, P, S)

where

- V: set of non-terminal constituents
- T: set of terminals (lexical items)
- P: set of production rules
- S: start symbol

72 / 117

Example

S = sentence

V = {NP (noun phrase), NP1, VP (verb phrase), NAME, ADJ, V (verb), N (noun)}

T = {Mary, person, loves, that, ...}

P = {
  S → NP VP
  NP → NAME
  NP → ADJ NP1
  NP1 → N
  VP → V NP
  NAME → Mary
  V → loves
  N → person
  ADJ → that
}

Parse tree: [S [NP [NAME Mary]] [VP [V loves] [NP [ADJ that] [NP1 [N person]]]]]

73 / 117
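As a sketch of how such a grammar defines a language, the code below enumerates every sentence that can be derived from the start symbol. It assumes the verb phrase rule expands to V NP, as in the parse tree; the dictionary layout and the `expand` helper are choices made for this example.

```python
# Production rules: each non-terminal maps to a list of alternative expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["NAME"], ["ADJ", "NP1"]],
    "NP1": [["N"]],
    "VP": [["V", "NP"]],
    "NAME": [["Mary"]],
    "V": [["loves"]],
    "N": [["person"]],
    "ADJ": [["that"]],
}

def expand(symbols):
    """Yield every terminal word sequence derivable from a list of symbols."""
    if not symbols:
        yield []
        return
    first, rest = symbols[0], symbols[1:]
    if first not in GRAMMAR:
        # Terminal: keep the word and expand what follows.
        for tail in expand(rest):
            yield [first] + tail
    else:
        # Non-terminal: try each production rule in turn.
        for rule in GRAMMAR[first]:
            yield from expand(rule + rest)

sentences = sorted(" ".join(words) for words in expand(["S"]))
for sentence in sentences:
    print(sentence)
```

The enumeration terminates only because this toy grammar has no recursive rules; a grammar with recursion would need a depth limit.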

Formal Language Models

- only used for simple tasks
- hard to code by hand
- people do not speak following formal grammars

74 / 117

Statistical Grammar Models (N-grams)

Simply count co-occurrence of words in large text data sets

- unigrams: $P(w_i)$
- bigrams: $P(w_i \mid w_{i-1})$
- trigrams: $P(w_i \mid w_{i-1}, w_{i-2})$
- ...

75 / 117
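Counting can be sketched for the bigram case as below. The three-sentence corpus and the `<s>` and `</s>` sentence boundary markers are assumptions made for this example, and no smoothing is applied, so unseen word pairs simply have no entry.

```python
from collections import Counter

def train_bigrams(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency (no smoothing)."""
    history_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            history_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return {pair: c / history_counts[pair[0]] for pair, c in pair_counts.items()}

# Hypothetical toy corpus.
corpus = ["i like ice cream", "i scream", "i like tea"]
bigram = train_bigrams(corpus)
print(bigram[("i", "like")])
```

Practical language models add some form of smoothing so that word pairs missing from the training text do not get zero probability.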

Language Models: complexity

Increasing N in N-grams leads to:

1. more complex decoders

2. difficulties in training the LM parameters

76 / 117

Knowledge Models in ASR

Acoustic Models: trained on hours of annotated speech recordings (specially developed speech databases)

Lexical Model: usually produced by hand by experts (or generated by rules)

Language Models: trained on millions of words of text (often from newspapers)

77 / 117

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach

Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

78 / 117

Word Accuracy

$$A = 100 \cdot \frac{N - S - D - I}{N}$$

where

- N: total number of reference words
- S: substitutions
- D: deletions
- I: insertions

79 / 117
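The counts S, D and I come from aligning the recognised words with the reference. A minimal sketch of one common way to do this, a minimum edit distance alignment computed with dynamic programming, is given below; the function names are chosen for this example, and ties between equally cheap alignments are broken in favour of fewer substitutions.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    # cost[i][j] = (errors, S, D, I) for ref[:i] aligned with hyp[:j].
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)        # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)        # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, n)              # correct word
            else:
                diag = (e + 1, s + 1, d, n)      # substitution
            e, s, d, n = cost[i - 1][j]
            dele = (e + 1, s, d + 1, n)          # deletion
            e, s, d, n = cost[i][j - 1]
            ins = (e + 1, s, d, n + 1)           # insertion
            cost[i][j] = min(diag, dele, ins)
    _, s, d, n = cost[len(ref)][len(hyp)]
    return s, d, n

def word_accuracy(ref, hyp):
    """A = 100 * (N - S - D - I) / N, with N the number of reference words."""
    s, d, i = align_counts(ref, hyp)
    return 100.0 * (len(ref) - s - d - i) / len(ref)

ref = "I wanted badly to meet you".split()
hyp = "I really wanted to see you".split()
print(align_counts(ref, hyp), word_accuracy(ref, hyp))
```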

Word Accuracy: example

Ref:  I     -       wanted  badly  to    meet  you
Rec:  I     really  wanted  -      to    see   you
      corr  ins     corr    del    corr  sub   corr

6 words, 1 substitution, 1 insertion, 1 deletion

$$A = 100 \cdot \frac{6 - 1 - 1 - 1}{6} = 50\%$$

requires dynamic programming

80 / 117

Measure Difficulty

Language Perplexity

$$B = 2^H, \qquad H = -\sum_{\forall W} P(W) \log_2 P(W)$$

- P(W) is the probability of the word sequence (language model)
- H is called entropy
- B can be seen as a measure of the average number of words that can follow any given word
- Example: equiprobable digit sequences, B = 10

81 / 117
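The digit example can be reproduced with a short sketch. As a simplification, the distribution here is over single words (ten equiprobable digits) and not over whole word sequences; the function names are chosen for this example.

```python
import math

def entropy_bits(probabilities):
    """H = -sum p * log2(p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)

def perplexity(probabilities):
    """B = 2 ** H."""
    return 2.0 ** entropy_bits(probabilities)

digits = [0.1] * 10                 # ten equiprobable digits
skewed = [0.91] + [0.01] * 9        # one digit is far more likely than the rest
print(perplexity(digits), perplexity(skewed))
```

The skewed distribution gives a much lower perplexity than the uniform one, which matches the reading of B as an effective branching factor.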

Effect of adding features

82 / 117

Effect of adding training data

Switchboard data

83 / 117

Effect of adding Gaussians

84 / 117

Effect of adding data for language models

Million sentences

85 / 117

Some dictation systems

- vocabulary over 100 000 words
- many languages
- systems: Nuance NaturallySpeaking, Microsoft, (IBM ViaVoice), (Dragon Dictate)

86 / 117

New applications

- Indexing of TV and radio programs (offline), Google
- real-time subtitling of TV programs (re-speaker that summarises)
- voice search (Google)
- language learning
- smart phones

87 / 117

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach

Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

88 / 117

Main variables in ASR

- Speaking mode: isolated words vs continuous speech
- Speaking style: read speech vs spontaneous speech
- Speakers: speaker dependent vs speaker independent
- Vocabulary: small (<20 words) vs large (>50 000 words)
- Robustness: against background noise

89 / 117

http://www.itl.nist.gov/iad/mig/publications/ASRhistory/

90 / 117

Why is it so hard?

[Diagram with the labels: Language, database, application]

91 / 117

Challenges: Variability

Between speakers

- Age
- Gender
- Anatomy
- Dialect

Within speaker

- Stress
- Emotion
- Health condition
- Read vs Spontaneous
- Adaptation to environment (Lombard effect)
- Adaptation to listener

Environment

- Noise
- Room acoustics
- Microphone distance
- Microphone, telephone
- Bandwidth

Listener

- Age
- Mother tongue
- Hearing loss
- Known / unknown
- Human / Machine

92 / 117

Sheep and Goats [3]

[3] G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds. "SHEEP, GOATS, LAMBS and WOLVES: A Statistical Analysis of Speaker Performance in the NIST 1998 Speaker Recognition Evaluation". In: International Conference on Spoken Language Processing. 1998

93 / 117

Example: spontaneous vs hyper-articulated

Spontaneous: Va jobbaru me
Hyper-articulated: Vad jobbar du med

"What is your occupation?" (literally "What work you with")

94 / 117

Examples of reduced pronunciation

Spoken     Written        In English
Tesempel   Till exempel   for example
ahamba     och han bara   and he just
bafatt     bara för att   just because
javende    jag vet inte   I don't know

95 / 117

Microphone distance

[Figure panels: Headset; 2 m distance]

96 / 117

How do we cope with variability?

Ideally: models that generalise

[Diagram with the labels: Language, database, application]

97 / 117

How do we cope with variability?

Large companies use insane quantities of data

[Diagram with the labels: Language, database, application]

98 / 117

How do we cope with variability?

Adaptation

[Diagram with the labels: Language, database, application]

99 / 117

Adaptation: Example

Enrolment in Dictation Systems

- let the user read a small text before using the system

Beta version of smartphone applications

- the company has all the rights on data generated

100 / 117

Limitations

- lack of context
- require huge amounts of training data

101 / 117

Adapted from Mikael Parkvall's Lingvistiska Samlarbilder, Nr. 96: "Problem med automatisk taligenkänning" ("Problems with automatic speech recognition")

[Figure: collage of Swedish dish names, including SPRÄNGD ANKA, PASSERADE TOMATER, PYTTIPANNA, OXJÄRPE, NASI GORENG, SEMLA, SILLSALLAD, POTATISGRATÄNG, FALUKORV, ROTMOS, KROPPKAKA, KÖTTBULLAR and TACOS]

102 / 117

Lack of Generalisation [4]

[Figure: Word Error Rate (%), from 0 to 30, against hours of training data on a logarithmic axis from 1 to 10,000,000, for broadcast news transcription; curves labelled "Less supervised" and "More supervised", continued by linear extrapolation]

In order to reach a 10-year-old's performance, ASR needs 4 to 70 human lifetimes of exposure to speech!

[4] R. Moore. "A Comparison of the Data Requirements of Automatic Speech Recognition Systems and Human Listeners". In: Proc. of Eurospeech. Geneva, Switzerland, 2003, pp. 2582–2584

103 / 117

New directions

- Production inspired modelling
- Study children's speech acquisition
- Modelling and decision techniques
  - Eigenvoices
  - Deep learning neural networks

104 / 117

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach

Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

105 / 117

Speaker Recognition

Created by Håkan Melin

106 / 117

Person Identification

Methods rely on:

- something you possess: key, magnetic card, ...
- something you know: PIN-code, password, ...
- something you are: physical attributes, behaviour (biometrics)

107 / 117

Recognition, Verification, Identification

Recognition: general term

Speaker verification:

- an identity is claimed and is verified by voice
- binary decision (accept/reject)
- performance independent of number of users

Speaker identification:

- choose one of N speakers
- closed set: voice belongs to one of the N speakers
- open set: any person can access the system
- problem difficulty increases with N

108 / 117

Text Dependence

Either fix the content or recognise it. Examples (ordered from text dependent to text independent):

- Fixed password (text dependent)
- User-specific password
- System prompts the text (prevents impostors from recording and playing back the password)
- Any word is allowed (text independent)

109 / 117

Representations

Speech Recognition:

- represent speech content
- disregard speaker identity

Speaker Recognition:

- represent speaker identity
- disregard speech content

Surprisingly:

- MFCCs used for both
- suggests that feature extraction could be improved

110 / 117

Page 143: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance

Speaker Verification

Registration (training, enrolment):

training utterances from a new client -> spectral analysis -> train model -> trained speaker model

Verification:

access utterance -> spectral analysis -> matching against the trained model of the claimed identity -> accept / reject

Problem: the matching score between the client model and the utterance is sensitive to distortion, utterance duration, etc.
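One common way to reduce this sensitivity, not stated on this slide, is to normalise the client score by the number of frames and by the score of a background model. The sketch below assumes that per-frame log-likelihoods are already available; the function names and the threshold are hypothetical.

```python
def normalised_score(client_frame_loglik, background_frame_loglik):
    """Average per-frame log-likelihood ratio between client and background.

    Dividing by the number of frames removes the direct dependence on
    utterance duration; subtracting the background score compensates for
    effects that lower both models' likelihoods alike.
    """
    n = len(client_frame_loglik)
    if n == 0 or n != len(background_frame_loglik):
        raise ValueError("need two non-empty score lists of equal length")
    return (sum(client_frame_loglik) - sum(background_frame_loglik)) / n


def verify(client_frame_loglik, background_frame_loglik, threshold=0.0):
    """Accept the claimed identity if the normalised score reaches the threshold."""
    return normalised_score(client_frame_loglik, background_frame_loglik) >= threshold
```

With this normalisation, a short and a long utterance with the same per-frame behaviour obtain the same score, whereas the raw summed log-likelihood would grow with duration.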


Modelling Techniques

HMMs

- Text dependent systems

- state sequence represents allowed utterance

GMMs (Gaussian Mixture Models)

- Text independent systems

- large number of Gaussian components

- sequential information not used

SVM (Support Vector Machines)

Combined models
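The remark that GMMs do not use sequential information can be made concrete with a small sketch. It assumes diagonal covariances and a toy two-component model with made-up parameters; the function names are illustrative and a real system would use many more components.

```python
import math


def gmm_frame_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        component_logs.append(math.log(w) + log_gauss)
    # Log-sum-exp over the components, for numerical stability.
    top = max(component_logs)
    return top + math.log(sum(math.exp(c - top) for c in component_logs))


def gmm_utterance_loglik(frames, weights, means, variances):
    """Average frame log-likelihood; each frame is scored independently."""
    total = sum(gmm_frame_loglik(x, weights, means, variances) for x in frames)
    return total / len(frames)


# Toy model and toy frames (assumed values, two-dimensional features).
toy_weights = [0.5, 0.5]
toy_means = [[0.0, 0.0], [3.0, 3.0]]
toy_variances = [[1.0, 1.0], [1.0, 1.0]]
toy_frames = [[0.1, -0.2], [2.9, 3.3], [1.5, 1.4]]
```

Since the utterance score is a plain average over frames, reordering the frames leaves it unchanged up to rounding, which is why this kind of model suits text independent systems.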


Evaluation

Claimed identity   Decision: Accept    Decision: Reject
True               OK                  False Reject (FR)
False              False Accept (FA)   OK
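The four cells of this table can be written as a small function. The trial list below is made up for illustration, and the function name is an assumption.

```python
from collections import Counter


def outcome(claim_is_true, accepted):
    """Map one verification trial to a cell of the evaluation table."""
    if claim_is_true:
        return "OK" if accepted else "FR"  # true client wrongly rejected
    return "FA" if accepted else "OK"      # impostor wrongly accepted


# Hypothetical trials: (claimed identity is true, system accepted).
trials = [(True, True), (True, False), (False, False),
          (False, True), (False, False)]
counts = Counter(outcome(claim, accepted) for claim, accepted in trials)
```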


Score Distribution and Error Balance

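[Figure: score distributions f(s | "true" speaker) and f(s | "false" speaker) along the score axis s, with a decision threshold; the tails on the wrong side of the threshold give P("false accept") and P("false reject").]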
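The two error probabilities can be written as tail areas of the score distributions. The formulation below assumes that a higher score s favours the claimed identity and that a claim is accepted when s reaches the threshold; the symbol \theta for the threshold is introduced here and does not appear on the slide.

```latex
P(\text{false accept}) = \int_{\theta}^{\infty} f(s \mid \text{``false'' speaker}) \, ds
\qquad
P(\text{false reject}) = \int_{-\infty}^{\theta} f(s \mid \text{``true'' speaker}) \, ds
```

Raising \theta shrinks the first integral and enlarges the second, so the threshold trades false accepts against false rejects.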


Performance Measures

- False Rejection Rate (FR)

- False Acceptance Rate (FA)

- Half Total Error Rate (HTER = (FR+FA)/2)

- Equal Error Rate (EER)

- Detection Error Trade-off (DET) Curve
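These measures can be computed from two lists of scores, one from true-speaker trials and one from impostor trials. The sketch below assumes that a claim is accepted when its score reaches the threshold; the function names are illustrative, and the EER is only approximated by the observed threshold where FR and FA are closest.

```python
def error_rates(true_scores, false_scores, threshold):
    """Return (FR, FA) as fractions, accepting scores >= threshold."""
    fr = sum(s < threshold for s in true_scores) / len(true_scores)
    fa = sum(s >= threshold for s in false_scores) / len(false_scores)
    return fr, fa


def hter(true_scores, false_scores, threshold):
    """Half Total Error Rate at a given threshold: (FR + FA) / 2."""
    fr, fa = error_rates(true_scores, false_scores, threshold)
    return (fr + fa) / 2


def det_points(true_scores, false_scores):
    """(threshold, FR, FA) for every observed score used as a threshold."""
    thresholds = sorted(set(true_scores) | set(false_scores))
    return [(t, *error_rates(true_scores, false_scores, t)) for t in thresholds]


def approximate_eer(true_scores, false_scores):
    """Error rate at the observed threshold where FR and FA are closest."""
    _, fr, fa = min(det_points(true_scores, false_scores),
                    key=lambda point: abs(point[1] - point[2]))
    return (fr + fa) / 2


# Made-up scores for illustration.
toy_true = [0.9, 0.8, 0.7, 0.4]
toy_false = [0.6, 0.3, 0.2, 0.1]
```

Plotting the FA and FR values returned by `det_points` against each other gives the points of a DET curve; DET plots conventionally use a normal-deviate scale on both axes, which this sketch leaves out.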


PER vs Commercial System

[Figure: DET curves for PER eval02 (es=E2a_G8, ts=S2b_G8). Axes: False Accept Rate (in %) and False Reject Rate (in %), each with ticks at 2, 5, 10 and 20. Curves: commercial_system_with_adaptation; CTT:combo, original; CTT:combo, retrained.]

Retrained: background model trained on PER speech

Original: background model trained on telephone speech


More information and mathematical formulations in DT2118
