Automatic Speech Recognition: Introduction
Transcript of Automatic Speech Recognition: Introduction
![Page 1: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/1.jpg)
Automatic Speech Recognition: Introduction
Peter Bell
Automatic Speech Recognition— ASR Lecture 111 January 2020
ASR Lecture 1 Automatic Speech Recognition: Introduction 1
![Page 2: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/2.jpg)
Automatic Speech Recognition — ASR
Course details
Lectures: About 18 lectures, delivered live on Teams for now
Labs: Weekly lab sessions – using Python, OpenFst(openfst.org) and later Kaldi (kaldi-asr.org)
Lab sessions will start in Week 3 – exact format TBA.
Assessment:First five lab sessions worth 10%Coursework, building on the lab sessions, worth 40%Open book exam in April or May worth 50%
People:Course organiser: Peter BellGuest lecturers: Hiroshi Shimodaira and Yumnah MohammiedTA: Andrea CarmantiniDemonstrators: Chau Luu and Electra Wallington
http://www.inf.ed.ac.uk/teaching/courses/asr/
ASR Lecture 1 Automatic Speech Recognition: Introduction 2
![Page 3: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/3.jpg)
Your background
If you have taken:
Speech Processing and either of (MLPR or MLP)
Perfect!
either of (MLPR or MLP) but not Speech Processing(probably you are from Informatics)
You’ll require some speech background:
A couple of the lectures will cover material that was in SpeechProcessingSome additional background study (including material fromSpeech Processing)
Speech Processing but neither of (MLPR or MLP)(probably you are from SLP)
You’ll require some machine learning background (especiallyneural networks)
A couple of introductory lectures on neural networks providedfor SLP studentsSome additional background study
ASR Lecture 1 Automatic Speech Recognition: Introduction 3
![Page 4: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/4.jpg)
Labs
Series of weekly labs using Python, OpenFst and Kaldi
They count towards 10% of the course credit
Labs start week 3 – exact arrangements TBA
You will need to work in pairs
Labs 1-5 will give you hands-on experience of using HMMalgorithms to build your own ASR system
These labs are an important pre-requisite for the coursework –take advantage of the demonstrator support!
Later optional labs will introduce you to Kaldi recipes fortraining acoustic models – useful if you will be doing anASR-related research project
ASR Lecture 1 Automatic Speech Recognition: Introduction 4
![Page 5: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/5.jpg)
What is speech recognition?
ASR Lecture 1 Automatic Speech Recognition: Introduction 5
![Page 6: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/6.jpg)
What is speech recognition?
ASR Lecture 1 Automatic Speech Recognition: Introduction 5
![Page 7: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/7.jpg)
What is speech recognition?
Speech-to-text transcription
Transform recorded audio into a sequence of words
Just the words, no meaning.... But do need to deal withacoustic ambiguity: “Recognise speech?” or “Wreck a nicebeach?”
Speaker diarization: Who spoke when?
Speech recognition: what did they say?
Paralinguistic aspects: how did they say it? (timing,intonation, voice quality)
Speech understanding: what does it mean?
ASR Lecture 1 Automatic Speech Recognition: Introduction 6
![Page 8: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/8.jpg)
Why isspeech recognition
difficult?
ASR Lecture 1 Automatic Speech Recognition: Introduction 7
![Page 9: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/9.jpg)
From a linguistic perspective
Many sources of variation
Speaker Tuned for a particular speaker, orspeaker-independent? Adaptation to speakercharacteristics
Environment Noise, competing speakers, channel conditions(microphone, phone line, room acoustics)
Style Continuously spoken or isolated? Planned monologueor spontaneous conversation?
Vocabulary Machine-directed commands, scientific language,colloquial expressions
Accent/dialect Recognise the speech of all speakers who speak aparticular language
Other paralinguistics Emotional state, social class, . . .
Language spoken Estimated 7,000 languages, most with limitedtraining resources; code-switching; language change
ASR Lecture 1 Automatic Speech Recognition: Introduction 8
![Page 10: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/10.jpg)
From a linguistic perspective
Many sources of variation
Speaker Tuned for a particular speaker, orspeaker-independent? Adaptation to speakercharacteristics
Environment Noise, competing speakers, channel conditions(microphone, phone line, room acoustics)
Style Continuously spoken or isolated? Planned monologueor spontaneous conversation?
Vocabulary Machine-directed commands, scientific language,colloquial expressions
Accent/dialect Recognise the speech of all speakers who speak aparticular language
Other paralinguistics Emotional state, social class, . . .
Language spoken Estimated 7,000 languages, most with limitedtraining resources; code-switching; language change
ASR Lecture 1 Automatic Speech Recognition: Introduction 8
![Page 11: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/11.jpg)
From a linguistic perspective
Many sources of variation
Speaker Tuned for a particular speaker, orspeaker-independent? Adaptation to speakercharacteristics
Environment Noise, competing speakers, channel conditions(microphone, phone line, room acoustics)
Style Continuously spoken or isolated? Planned monologueor spontaneous conversation?
Vocabulary Machine-directed commands, scientific language,colloquial expressions
Accent/dialect Recognise the speech of all speakers who speak aparticular language
Other paralinguistics Emotional state, social class, . . .
Language spoken Estimated 7,000 languages, most with limitedtraining resources; code-switching; language change
ASR Lecture 1 Automatic Speech Recognition: Introduction 8
![Page 12: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/12.jpg)
From a linguistic perspective
Many sources of variation
Speaker Tuned for a particular speaker, orspeaker-independent? Adaptation to speakercharacteristics
Environment Noise, competing speakers, channel conditions(microphone, phone line, room acoustics)
Style Continuously spoken or isolated? Planned monologueor spontaneous conversation?
Vocabulary Machine-directed commands, scientific language,colloquial expressions
Accent/dialect Recognise the speech of all speakers who speak aparticular language
Other paralinguistics Emotional state, social class, . . .
Language spoken Estimated 7,000 languages, most with limitedtraining resources; code-switching; language change
ASR Lecture 1 Automatic Speech Recognition: Introduction 8
![Page 13: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/13.jpg)
From a linguistic perspective
Many sources of variation
Speaker Tuned for a particular speaker, orspeaker-independent? Adaptation to speakercharacteristics
Environment Noise, competing speakers, channel conditions(microphone, phone line, room acoustics)
Style Continuously spoken or isolated? Planned monologueor spontaneous conversation?
Vocabulary Machine-directed commands, scientific language,colloquial expressions
Accent/dialect Recognise the speech of all speakers who speak aparticular language
Other paralinguistics Emotional state, social class, . . .
Language spoken Estimated 7,000 languages, most with limitedtraining resources; code-switching; language change
ASR Lecture 1 Automatic Speech Recognition: Introduction 8
![Page 14: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/14.jpg)
From a linguistic perspective
Many sources of variation
Speaker Tuned for a particular speaker, orspeaker-independent? Adaptation to speakercharacteristics
Environment Noise, competing speakers, channel conditions(microphone, phone line, room acoustics)
Style Continuously spoken or isolated? Planned monologueor spontaneous conversation?
Vocabulary Machine-directed commands, scientific language,colloquial expressions
Accent/dialect Recognise the speech of all speakers who speak aparticular language
Other paralinguistics Emotional state, social class, . . .
Language spoken Estimated 7,000 languages, most with limitedtraining resources; code-switching; language change
ASR Lecture 1 Automatic Speech Recognition: Introduction 8
![Page 15: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/15.jpg)
From a linguistic perspective
Many sources of variation
Speaker Tuned for a particular speaker, orspeaker-independent? Adaptation to speakercharacteristics
Environment Noise, competing speakers, channel conditions(microphone, phone line, room acoustics)
Style Continuously spoken or isolated? Planned monologueor spontaneous conversation?
Vocabulary Machine-directed commands, scientific language,colloquial expressions
Accent/dialect Recognise the speech of all speakers who speak aparticular language
Other paralinguistics Emotional state, social class, . . .
Language spoken Estimated 7,000 languages, most with limitedtraining resources; code-switching; language change
ASR Lecture 1 Automatic Speech Recognition: Introduction 8
![Page 16: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/16.jpg)
From a machine learning perspective
As a classification problem: very high dimensional outputspace
As a sequence-to-sequence problem: very long input sequence(although limited re-ordering between acoustic and wordsequences)
Data is often noisy, with many “nuisance” factors of variationin the data
Very limited quantities of training data available (in terms ofwords) compared to text-based NLP
Manual speech transcription is very expensive (10x real time)
Hierachical and compositional nature of speech productionand comprehension makes it difficult to handle with a singlemodel
ASR Lecture 1 Automatic Speech Recognition: Introduction 9
![Page 17: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/17.jpg)
From a machine learning perspective
As a classification problem: very high dimensional outputspace
As a sequence-to-sequence problem: very long input sequence(although limited re-ordering between acoustic and wordsequences)
Data is often noisy, with many “nuisance” factors of variationin the data
Very limited quantities of training data available (in terms ofwords) compared to text-based NLP
Manual speech transcription is very expensive (10x real time)
Hierachical and compositional nature of speech productionand comprehension makes it difficult to handle with a singlemodel
ASR Lecture 1 Automatic Speech Recognition: Introduction 9
![Page 18: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/18.jpg)
From a machine learning perspective
As a classification problem: very high dimensional outputspace
As a sequence-to-sequence problem: very long input sequence(although limited re-ordering between acoustic and wordsequences)
Data is often noisy, with many “nuisance” factors of variationin the data
Very limited quantities of training data available (in terms ofwords) compared to text-based NLP
Manual speech transcription is very expensive (10x real time)
Hierachical and compositional nature of speech productionand comprehension makes it difficult to handle with a singlemodel
ASR Lecture 1 Automatic Speech Recognition: Introduction 9
![Page 19: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/19.jpg)
From a machine learning perspective
As a classification problem: very high dimensional outputspace
As a sequence-to-sequence problem: very long input sequence(although limited re-ordering between acoustic and wordsequences)
Data is often noisy, with many “nuisance” factors of variationin the data
Very limited quantities of training data available (in terms ofwords) compared to text-based NLP
Manual speech transcription is very expensive (10x real time)
Hierachical and compositional nature of speech productionand comprehension makes it difficult to handle with a singlemodel
ASR Lecture 1 Automatic Speech Recognition: Introduction 9
![Page 20: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/20.jpg)
From a machine learning perspective
As a classification problem: very high dimensional outputspace
As a sequence-to-sequence problem: very long input sequence(although limited re-ordering between acoustic and wordsequences)
Data is often noisy, with many “nuisance” factors of variationin the data
Very limited quantities of training data available (in terms ofwords) compared to text-based NLP
Manual speech transcription is very expensive (10x real time)
Hierachical and compositional nature of speech productionand comprehension makes it difficult to handle with a singlemodel
ASR Lecture 1 Automatic Speech Recognition: Introduction 9
![Page 21: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/21.jpg)
The speech recognition problem
We generally represent recorded speech as a sequence ofacoustic feature vectors (observations), X and the outputword sequence as W
At recognition time, our aim is to find the most likely W,given X
To achieve this, statistical models are trained using a corpusof labelled training utterances (Xn,Wn)
ASR Lecture 1 Automatic Speech Recognition: Introduction 10
![Page 22: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/22.jpg)
The speech recognition problem
We generally represent recorded speech as a sequence ofacoustic feature vectors (observations), X and the outputword sequence as W
At recognition time, our aim is to find the most likely W,given X
To achieve this, statistical models are trained using a corpusof labelled training utterances (Xn,Wn)
ASR Lecture 1 Automatic Speech Recognition: Introduction 10
![Page 23: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/23.jpg)
The speech recognition problem
We generally represent recorded speech as a sequence ofacoustic feature vectors (observations), X and the outputword sequence as W
At recognition time, our aim is to find the most likely W,given X
To achieve this, statistical models are trained using a corpusof labelled training utterances (Xn,Wn)
ASR Lecture 1 Automatic Speech Recognition: Introduction 10
![Page 24: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/24.jpg)
Representing recorded speech (X)
Represent a recorded utterance as a sequence of feature vectors
Reading: Jurafsky & Martin section 9.3ASR Lecture 1 Automatic Speech Recognition: Introduction 11
![Page 25: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/25.jpg)
Labelling speech (W)
Labels may be at different levels: words, phones, etc.Labels may be time-aligned – i.e. the start and end times of anacoustic segment corresponding to a label are known
Reading: Jurafsky & Martin chapter 7 (especially sections 7.4, 7.5)
ASR Lecture 1 Automatic Speech Recognition: Introduction 12
![Page 26: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/26.jpg)
Two key challenges
In training the model:Aligning the sequences Xn and Wn for each training utterance
In performing recognition:Searching over all possible output sequences Wto find the most likely one
The hidden Markov model (HMM) provides a good solution toboth problems
ASR Lecture 1 Automatic Speech Recognition: Introduction 13
![Page 27: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/27.jpg)
Two key challenges
In training the model:Aligning the sequences Xn and Wn for each training utterance
x1 x2 x3 x4 ...
RIGHTNO
w1 w2
In performing recognition:Searching over all possible output sequences Wto find the most likely one
The hidden Markov model (HMM) provides a good solution toboth problems
ASR Lecture 1 Automatic Speech Recognition: Introduction 13
![Page 28: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/28.jpg)
Two key challenges
In training the model:Aligning the sequences Xn and Wn for each training utterance
x1 x2 x3 x4 ...
RIGHTNO
w1 w2
In performing recognition:Searching over all possible output sequences Wto find the most likely one
The hidden Markov model (HMM) provides a good solution toboth problems
ASR Lecture 1 Automatic Speech Recognition: Introduction 13
![Page 29: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/29.jpg)
Two key challenges
In training the model:Aligning the sequences Xn and Wn for each training utterance
x1 x2 x3 x4 ...
rn
p1 p2
aioh t
p3 p4 p5
In performing recognition:Searching over all possible output sequences Wto find the most likely one
The hidden Markov model (HMM) provides a good solution toboth problems
ASR Lecture 1 Automatic Speech Recognition: Introduction 13
![Page 30: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/30.jpg)
Two key challenges
In training the model:Aligning the sequences Xn and Wn for each training utterance
x1 x2 x3 x4 ...
rn
g1 g2
io t
g3 g4 g5
g h
g6 g6
In performing recognition:Searching over all possible output sequences Wto find the most likely one
The hidden Markov model (HMM) provides a good solution toboth problems
ASR Lecture 1 Automatic Speech Recognition: Introduction 13
![Page 31: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/31.jpg)
Two key challenges
In training the model:Aligning the sequences Xn and Wn for each training utterance
x1 x2 x3 x4 ...
rn
g1 g2
io t
g3 g4 g5
g h
g6 g6
In performing recognition:Searching over all possible output sequences Wto find the most likely one
The hidden Markov model (HMM) provides a good solution toboth problems
ASR Lecture 1 Automatic Speech Recognition: Introduction 13
![Page 32: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/32.jpg)
Two key challenges
In training the model:Aligning the sequences Xn and Wn for each training utterance
x1 x2 x3 x4 ...
rn
g1 g2
io t
g3 g4 g5
g h
g6 g6
In performing recognition:Searching over all possible output sequences Wto find the most likely one
The hidden Markov model (HMM) provides a good solution toboth problems
ASR Lecture 1 Automatic Speech Recognition: Introduction 13
![Page 33: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/33.jpg)
The Hidden Markov Model
x1 x2 x3 x4 ...
A simple but powerful model for mapping a sequence ofcontinuous observations to a sequence of discrete outputs
It is a generative model for the observation sequence
Algorithms for training (forward-backward) andrecognition-time decoding (Viterbi)
Later in the course we will also look at newer all-neural,fully-differentiable “end-to-end” models
ASR Lecture 1 Automatic Speech Recognition: Introduction 14
![Page 34: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/34.jpg)
The Hidden Markov Model
x1 x2 x3 x4 ...
A simple but powerful model for mapping a sequence ofcontinuous observations to a sequence of discrete outputs
It is a generative model for the observation sequence
Algorithms for training (forward-backward) andrecognition-time decoding (Viterbi)
Later in the course we will also look at newer all-neural,fully-differentiable “end-to-end” models
ASR Lecture 1 Automatic Speech Recognition: Introduction 14
![Page 35: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/35.jpg)
Hierarchical modelling of speech
"No right"
NO RIGHT
ohn r ai t
Utterance
Word
Subword
HMM
Acoustics
GenerativeModel
W
X
ASR Lecture 1 Automatic Speech Recognition: Introduction 15
![Page 36: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/36.jpg)
“Fundamental Equation of Statistical Speech Recognition”
If X is the sequence of acoustic feature vectors (observations) and
W denotes a word sequence, the most likely word sequence W∗ is
given by
W∗ = argmaxW
P(W | X)
Applying Bayes’ Theorem:
P(W | X) =p(X |W)P(W)
p(X)
∝ p(X |W)P(W)
W∗ = arg maxW
p(X |W)︸ ︷︷ ︸Acoustic
model
P(W)︸ ︷︷ ︸Language
model
ASR Lecture 1 Automatic Speech Recognition: Introduction 16
![Page 37: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/37.jpg)
“Fundamental Equation of Statistical Speech Recognition”
If X is the sequence of acoustic feature vectors (observations) and
W denotes a word sequence, the most likely word sequence W∗ is
given by
W∗ = argmaxW
P(W | X)
Applying Bayes’ Theorem:
P(W | X) =p(X |W)P(W)
p(X)
∝ p(X |W)P(W)
W∗ = arg maxW
p(X |W)︸ ︷︷ ︸Acoustic
model
P(W)︸ ︷︷ ︸Language
model
ASR Lecture 1 Automatic Speech Recognition: Introduction 16
![Page 38: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/38.jpg)
Speech Recognition Components
W∗ = arg maxW
p(X |W)P(W)
Use an acoustic model, language model, and lexicon to obtain themost probable word sequence W∗ given the observed acoustics X
AcousticModel
LanguageModel
Recorded Speech Decoded Text (Transcription)
TrainingData
SignalAnalysis
W*X
p(X | W)
P(W)
SearchSpace
W
ASR Lecture 1 Automatic Speech Recognition: Introduction 17
![Page 39: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/39.jpg)
Phones and Phonemes
Phonemesabstract unit defined by linguists based on contrastive role inword meanings (eg “cat” vs “bat”)40–50 phonemes in English
Phonesspeech sounds defined by the acousticsmany allophones of the same phoneme (eg /p/ in “pit” and“spit”)limitless in number
Phones are usually used in speech recognition – but noconclusive evidence that they are the basic units in speechrecognition
Possible alternatives: syllables, automatically derived units, ...
(Slide taken from Martin Cooke from long ago)
ASR Lecture 1 Automatic Speech Recognition: Introduction 18
![Page 40: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/40.jpg)
Evaluation
How accurate is a speech recognizer?
String edit distance
Use dynamic programming to align the ASR output with areference transcriptionThree type of error: insertion, deletion, substitutions
Word error rate (WER) sums the three types of error. If thereare N words in the reference transcript, and the ASR outputhas S substitutions, D deletions and I insertions, then:
WER = 100 · S + D + I
N% Accuracy = 100−WER%
Speech recognition evaluations: common training anddevelopment data, release of new test sets on which differentsystems may be evaluated using word error rate
ASR Lecture 1 Automatic Speech Recognition: Introduction 19
![Page 41: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/41.jpg)
Next Lecture
AcousticModel
LanguageModel
Recorded Speech Decoded Text (Transcription)
TrainingData
SignalAnalysis
SearchSpace
ASR Lecture 1 Automatic Speech Recognition: Introduction 20
![Page 42: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/42.jpg)
Example: recognising TV broadcasts
ASR Lecture 1 Automatic Speech Recognition: Introduction 21
![Page 43: Automatic Speech Recognition: Introduction](https://reader030.fdocuments.us/reader030/viewer/2022012410/616a5e2111a7b741a351bacb/html5/thumbnails/43.jpg)
Reading
Jurafsky and Martin (2008). Speech and Language Processing(2nd ed.): Chapter 7 (esp 7.4, 7.5) and Section 9.3.
General interest:
The Economist Technology Quarterly, “Language: Finding aVoice”, Jan 2017.http://www.economist.com/technology-quarterly/2017-05-
01/language
The State of Automatic Speech Recognition: Q&A withKaldi’s Dan Povey, Jul 2018.https://medium.com/descript/the-state-of-automatic-
speech-recognition-q-a-with-kaldis-dan-povey-
c860aada9b85
ASR Lecture 1 Automatic Speech Recognition: Introduction 22