Pattern Recognition Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Institute of Electrical and Information Engineering Digital Signal Processing and System Theory Part 9: Speaker and Speech Recognition

Pattern Recognition

Gerhard Schmidt

Christian-Albrechts-Universität zu Kiel, Faculty of Engineering, Institute of Electrical and Information Engineering, Digital Signal Processing and System Theory

Part 9: Speaker and Speech Recognition

Digital Signal Processing and System Theory | Pattern Recognition | Speaker and Speech Recognition Slide 2

Speaker and Speech Recognition

Contents

❑ Literature

❑ Speaker recognition

    ❑ Motivation

    ❑ Speaker verification and speaker identification

    ❑ Model adaption

    ❑ Discriminative approaches

❑ Speech recognition

    ❑ Fundamentals

    ❑ Statistical speech recognition

❑ Conclusion and outlook

Literature

Gaussian mixture models:

❑ C. M. Bishop: Pattern Recognition and Machine Learning, Springer, 2006

❑ L. Rabiner, B.H. Juang: Fundamentals of Speech Recognition, Prentice Hall, 1993

Speech recognition:

❑ C. M. Bishop: Pattern Recognition and Machine Learning, Springer, 2006

❑ B. Pfister, T. Kaufmann: Sprachverarbeitung, Springer, 2008 (in German)

Speaker recognition:

❑ G. Kolano: Lernverfahren zur Sprecherverifikation, Shaker, 2000 (in German)

❑ J. Benesty, et al.: Handbook on Speech Processing, Chapters 37 and 38 on "Speaker Recognition", Springer, 2008

Motivation

Applications for speaker recognition

❑ Access control (e.g., as a supplement to immobilizer systems in cars, or for admission to protected areas or rooms).

❑ Personalization of speech services (the system recognizes a returning user or caller and can access a database of stored preferences).

❑ Improvement of speech signal enhancement schemes (e.g., speaker-specific signal reconstruction).

❑ Speaker-dependent post-training (optimization) of a speech recognition system: if a speech dialog system is used alternately by several users, the recognizer can be adapted separately for each recognized speaker.

Variants of Speaker Recognition – Part 1

Differentiation between verification and identification

Speaker verification:

Binary decision: is the speaker really the person he or she claims to be?

Speaker identification:

1-out-of-N decision: which one of N speakers is active?

Variants of Speaker Recognition – Part 2

Differentiation between text-dependent and text-independent speaker verification

Text-dependent verification:

The speaker either speaks a password that he or she knows, or the system provides a new password that has to be spoken for every verification.

Text-independent verification:

The content of the speaker's utterance is not known to the system.

Variants of Speaker Recognition – Part 3

Differentiation between "closed-set" and "open-set" identification

Closed-set identification:

All potential speakers are known in advance; no new speakers are added later.

Open-set identification:

The potential speakers are not known in advance. It is not necessarily known how many speakers exist.

Variants of Speaker Recognition – Part 4

Again, a differentiation between text-dependent and text-independent variants is possible.

Variants of Speaker Recognition – Part 5

Differentiation between non-discriminant and discriminant training methods

Non-discriminant training:

The models are trained for each speaker independently, i.e., each model has to fit the extracted training data of its speaker as well as possible. How well the model discriminates against other speakers is not considered.

Discriminant training:

All speakers are considered during the training of the models. The individual models are fitted not only to their own speaker, but are also trained on the differences between the features of the speakers.

Basics of Speaker Recognition – Part 1

[Block diagram, speaker verification: distortion-reducing preprocessing and segmentation → short-term spectrum of the distortion-reduced signal → feature extraction (with normalization) → feature vector → accumulation of the single logarithmic probabilities or distances over time, using the model for the features of the speaker to be verified and a universal background model for other speakers → binary decision. The decision is fed back for adapting the model.]

Basics of Speaker Recognition – Part 2

[Block diagram, speaker identification: distortion-reducing preprocessing and segmentation → short-term spectrum of the distortion-reduced signal → feature extraction (with normalization) → feature vector → accumulation of the single logarithmic probabilities or distances over time, using speaker model 1 … speaker model N and a universal background model for other speakers → 1-out-of-(N+1) decision. If a new speaker is detected, a new speaker model is generated.]

Difficulties in Speaker Recognition

Some typical problems…

❑ In many practical applications, only a relatively small amount of training data is available for the individual speakers. Additionally, this training data is often not phonetically "balanced". During the recognition itself, a decision should be made as fast as possible.

❑ As a consequence, text-independent systems become strongly text-dependent: if speaker A speaks words that are contained in the small training set of speaker B but not in his own, the probability of wrongly identifying speaker B is rather high.

❑ It is often reported in the literature that preprocessing or normalization has a negative influence on the recognition rate. This is true if the recording conditions during training and test match well. However, such a match between training and test conditions is not always given in practice.

❑ Speech pauses should be removed before the recognition task itself. Otherwise, the background noise will have a strong influence on the decision: speakers with similar background noise during recording will be preferred.

Preprocessing and Segmentation – Part 1

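Subband structure:

[Block diagram: analysis filterbank, followed by input PSD estimation and noise PSD estimation, from which the filter characteristic is computed; the filter characteristic is used for the segmentation.]

PSD = power spectral density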

Preprocessing and Segmentation – Part 2

Noise reduction:

❑ Noise reduction without limitation of the attenuation (needed for the segmentation)

❑ Noise reduction with limitation of the attenuation (needed for the signal enhancement)

Segmentation:

❑ If the noise reduction filter is open in more than a fixed share of all subbands (the threshold is chosen between 10 and 30 percent), the current frame is classified as containing speech.
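The segmentation rule can be sketched in a few lines. This is only an illustration under stated assumptions: the Wiener-like attenuation `1 - noise_psd / input_psd`, the function names, and the parameters `open_gain` and `min_open_fraction` are hypothetical and not taken from the slides.

```python
def attenuation_factors(input_psd, noise_psd, gain_floor=0.0):
    """Wiener-like attenuation per subband (assumed filter characteristic).

    gain_floor = 0.0 means "no limitation of the attenuation" (segmentation);
    a value above 0 limits the attenuation (signal enhancement).
    """
    gains = []
    for s_in, s_noise in zip(input_psd, noise_psd):
        gain = 1.0 - s_noise / max(s_in, 1e-12)
        gains.append(max(gain, gain_floor))
    return gains


def frame_contains_speech(input_psd, noise_psd,
                          open_gain=0.5, min_open_fraction=0.2):
    """Classify a frame as speech if enough subbands have an 'open' filter."""
    gains = attenuation_factors(input_psd, noise_psd, gain_floor=0.0)
    open_subbands = sum(1 for g in gains if g > open_gain)
    return open_subbands / len(gains) >= min_open_fraction
```

With `min_open_fraction=0.2`, a frame in which 3 of 10 subbands clearly exceed the noise estimate is kept, while a frame that matches the noise estimate in all subbands is discarded as a speech pause.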

Preprocessing and Segmentation – Part 3

Example:

❑ Input signal

❑ Signal after noise reduction

❑ Signal after segmentation

[Figure: three time-frequency analyses (frequency in Hz over time in seconds): the noisy input signal, the noise-reduced signal, and the segmented noise-reduced signal.]

Feature Extraction – Part 1

Mel-frequency cepstral coefficients (MFCCs):

[Block diagram: computation of the (squared) magnitude → mel filtering → logarithm → discrete cosine transform]

❑ The first (zeroth) coefficient of the feature vectors is often replaced by the normalized short-term power of the current signal frame.

❑ The normalization is done such that the maximum short-term power of an utterance is mapped to a defined value.
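The processing chain (squared magnitude, mel filtering, logarithm, discrete cosine transform) can be sketched for a single frame as follows. This is a minimal illustration, not the implementation used in the lecture: the mel formula `2595 * log10(1 + f / 700)`, the triangular filter shape, the filter and coefficient counts, and all function names are assumptions. A naive DFT is used only to keep the sketch free of dependencies, and the replacement of the zeroth coefficient by the normalized short-term power is omitted.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Squared magnitude of a naive DFT (bins 0 .. N/2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filter_energies(power, sample_rate, num_filters):
    """Apply triangular filters that are equally spaced on the mel scale."""
    nyquist = sample_rate / 2.0
    step = hz_to_mel(nyquist) / (num_filters + 1)
    edges = [mel_to_hz(step * i) for i in range(num_filters + 2)]
    bin_hz = nyquist / (len(power) - 1)
    energies = []
    for m in range(num_filters):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        total = 0.0
        for k, p in enumerate(power):
            f = k * bin_hz
            weight = min((f - lo) / (mid - lo), (hi - f) / (hi - mid))
            total += max(weight, 0.0) * p
        energies.append(total)
    return energies


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=13):
    """MFCCs of one frame: |DFT|^2 -> mel filtering -> log -> DCT-II."""
    energies = mel_filter_energies(power_spectrum(frame), sample_rate, num_filters)
    log_energies = [math.log(e + 1e-12) for e in energies]
    return [sum(le * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, le in enumerate(log_energies))
            for c in range(num_coeffs)]
```

For example, `mfcc(frame, 8000)` for a 64-sample frame returns a list of 13 coefficients; a production system would use an FFT and a window function instead of the naive DFT.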

Feature Extraction – Part 2

Some remarks:

❑ Many publications deal with the selection of features. The most common conclusion is that a compact representation of the short-term spectral envelope should be used.

❑ MFCCs and cepstral coefficients (with slight modifications) have proven to be useful.

❑ It is surprising that these are the same features that are used for speech recognition, where the aim is to remove differences between speakers and to retain only the information about the words that have been spoken.

❑ However, it should be mentioned that different preprocessing is used for speaker and speech recognition.

❑ Since the features evidently still carry speaker-specific information, a speaker-specific speech recognizer can be expected to yield better results than a speaker-independent one; this is also observed in practice. For this reason, it is often desired to adapt the models of a speech recognition system to the current speaker.

Speaker Recognition With Codebooks – Recognition Phase

Flow chart – speaker verification:

❑ Inputs: the test utterance, the speaker identity under test, the speaker-specific feature codebook, and the speaker-specific threshold codebooks

❑ Distance calculation with the background codebook

❑ Distance calculation with the speaker-specific codebook

❑ Distance comparison with consideration of the speaker-specific threshold

❑ Acceptance or rejection of the speaker identity under test
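The recognition phase can be sketched as follows. The squared Euclidean distance, the averaging over all feature vectors, and the way the threshold enters the comparison are assumptions made for illustration; all function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def codebook_distance(features, codebook):
    """Mean distance of each feature vector to its nearest codebook entry."""
    total = sum(min(squared_distance(x, c) for c in codebook)
                for x in features)
    return total / len(features)


def verify_speaker(features, speaker_codebook, background_codebook, threshold):
    """Accept the claimed identity if the speaker codebook fits the utterance
    better than the background codebook by more than the speaker-specific
    threshold (assumed form of the comparison)."""
    d_speaker = codebook_distance(features, speaker_codebook)
    d_background = codebook_distance(features, background_codebook)
    return d_background - d_speaker > threshold
```

A larger threshold makes the system stricter: fewer impostors are accepted, at the price of rejecting the true speaker more often.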

Speaker Recognition With Codebooks – Training Phase

Flow chart – speaker verification:

❑ Speech data of a speaker → feature extraction → codebook training → save the speaker-specific feature codebook

❑ Speech data of the background speakers → feature extraction → codebook training → save the background feature codebook

❑ Calculate the speaker-specific thresholds → save the speaker-specific threshold codebook

Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 1)

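Approach to speaker verification:

❑ Pose two hypotheses: H0 (the utterance stems from the target speaker) and H1 (the utterance stems from another speaker).

❑ If the same "costs" are assumed for both kinds of errors, the target and the test speaker are decided to be the same person if, given the feature matrix of the utterance, hypothesis H0 is more probable than hypothesis H1.

The feature matrix contains the feature vectors of the utterance (after noise and speech pauses have been removed).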

Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 2)

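Approach to speaker verification:

❑ The conditional probabilities can be rewritten with Bayes' rule: each of them is the product of the probability density of the observed features under the respective hypothesis and the prior probability of that hypothesis, divided by the probability density of the features.

❑ For our condition, the common denominator cancels, so that the two probability densities, each weighted with its prior probability, are compared.

❑ Different speaker probabilities can be modeled by the ratio of the prior probabilities of the two hypotheses.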
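Written out, a standard Bayesian formulation of this test reads as follows. The symbols are assumptions chosen for this sketch: $\boldsymbol{X}$ for the feature matrix, and $H_0$ and $H_1$ for the target and non-target hypotheses.

```latex
\begin{align}
  H_0 &: \text{$\boldsymbol{X}$ stems from the target speaker}, \\
  H_1 &: \text{$\boldsymbol{X}$ stems from another speaker}, \\
  \text{accept } H_0 \text{ if} \quad
      & p(H_0 \,|\, \boldsymbol{X}) > p(H_1 \,|\, \boldsymbol{X}), \\
  p(H_i \,|\, \boldsymbol{X})
      &= \frac{p(\boldsymbol{X} \,|\, H_i)\, p(H_i)}{p(\boldsymbol{X})},
         \quad i \in \{0, 1\}, \\
  \Rightarrow \quad
      & p(\boldsymbol{X} \,|\, H_0)\, p(H_0) > p(\boldsymbol{X} \,|\, H_1)\, p(H_1).
\end{align}
```

Dividing the last line by $p(\boldsymbol{X} \,|\, H_1)\, p(H_0)$ gives a likelihood ratio test whose threshold is the ratio $p(H_1) / p(H_0)$ of the prior probabilities.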

Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 3)

[Figure: the observed data are evaluated with two probability density models, each shown as a joint density over feature 1 and feature 2. The first model is trained on data of hypothesis H0, i.e., on training data of the target speaker; the second model is trained on data of hypothesis H1, i.e., on training data of non-target speaker(s). The first result is multiplied with the speaker probability, the second with the complementary speaker probability, and both products enter the decision.]

Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 4)

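Approach to speaker verification:

❑ If Gaussian mixture models are used, the (logarithmic) probability density functions are obtained by summing, over all feature vectors of the utterance, the logarithm of the weighted sum of the Gaussian densities of the respective model.

The superscripts (s) and (b) denote the individual speaker and background model, respectively.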
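The logarithmic probability density of an utterance under a GMM can be sketched as follows. To keep the example short, diagonal covariance matrices are assumed here; this is a simplification and not necessarily the form used on the slide. The function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, variances):
    """Log density of a Gaussian with diagonal covariance matrix."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, variances))


def gmm_log_likelihood(features, weights, means, variances):
    """Sum over all feature vectors of log(sum_k w_k * N(x; mu_k, Sigma_k))."""
    total = 0.0
    for x in features:
        log_terms = [math.log(w) + log_gaussian_diag(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
        peak = max(log_terms)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return total
```

The same function is evaluated twice per utterance: once with the parameters of the speaker model (s) and once with those of the background model (b).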

Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 5)

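Approach to speaker verification:

❑ The decision rule can be rewritten as follows: the difference between the logarithmic probability density of the speaker model and that of the background model is compared with a threshold.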
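In the logarithmic domain, the decision reduces to comparing a log-likelihood ratio with a threshold. The sketch below assumes that the threshold is derived from the prior probability of the target speaker; the function name and the parameter `prior_target` are hypothetical.

```python
import math


def accept_target_speaker(log_lik_speaker, log_lik_background, prior_target=0.5):
    """Accept if log p(X|H0) - log p(X|H1) > log(p(H1) / p(H0))."""
    threshold = math.log((1.0 - prior_target) / prior_target)
    return log_lik_speaker - log_lik_background > threshold
```

With equal priors the threshold is 0, so the model with the larger log-likelihood wins; a small `prior_target` raises the threshold and makes acceptance harder.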

Results of a Speaker Verification – Part 1

Boundary conditions:

❑ The results are taken from the dissertation of G. Kolano (work done at the Daimler Research Center in Ulm, see literature section for details).

❑ A database with 106 speakers (male speakers only) has been used. The database consists of English double digits (i.e., the vocabulary is limited).

❑ All data has been transmitted over telephone channels. Thus, the bandwidth of the data is approximately 3.8 kHz (8 kHz sample rate). Especially for speaker recognition, these are rather unfavorable boundary conditions.

❑ Out of the 106 speakers, 33 have been used for training the background models; the remaining 73 have been used for the evaluation of the speaker verification.

❑ MFCCs have been used as features. They were only computed if the current signal frame had been classified as voiced speech.

Results of a Speaker Verification – Part 2

Comparison between codebooks and GMMs:

❑ The background model has the same size as the speaker model in all cases.

❑ Results in terms of error rates:

Model order (number of codebook entries or number of Gaussian distributions) | Codebook approach | Gaussian mixture model
4 | 11.5 % | 4.2 %
8 | 9.6 % | 3.0 %
16 | 8.2 % | 2.3 %
32 | 6.8 % | 2.0 %

Conclusion:

GMMs are, at least in this test, clearly superior to codebook approaches, but …

Results of a Speaker Verification – Part 3

Comparison between codebooks and GMMs:

❑ The covariance matrices of the GMM approach were fully populated. Thus, considerably more model parameters have been used in this approach, and the computational complexity is considerably higher.

❑ Number of model parameters:

Model order (number of codebook entries or number of Gaussian distributions) | Codebook approach | Gaussian mixture model
4 | 68 | 683
8 | 136 | 1367
16 | 272 | 2735
32 | 544 | 5471

Conclusion:

… GMMs require considerably more memory and computational power than codebook approaches.
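The parameter counts in the table can be reproduced with a short calculation. The feature dimension of 17 is not stated on the slide; it is inferred here from the codebook column (68 / 4 = 17), and the GMM count assumes full symmetric covariance matrices and mixture weights that sum to one.

```python
def codebook_parameters(order, dim):
    """One centroid vector of length dim per codebook entry."""
    return order * dim


def gmm_parameters(order, dim):
    """Means, full symmetric covariance matrices, and free mixture weights."""
    per_component = dim + dim * (dim + 1) // 2   # mean + covariance
    return order * per_component + (order - 1)   # weights sum to one


for order in (4, 8, 16, 32):
    print(order, codebook_parameters(order, 17), gmm_parameters(order, 17))
```

With diagonal covariance matrices the count would drop to `order * 2 * dim + (order - 1)`, which is one reason why diagonal models are popular when memory is limited.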

Results of a Speaker Verification – Part 4

Comparison between global and individual thresholds:

❑ So far, individual thresholds and a priori probabilities have been trained for each speaker.

❑ Error rates with a global threshold / with individual thresholds:

Model order (number of codebook entries or number of Gaussian distributions) | Codebook approach (global / individual) | Gaussian mixture model (global / individual)
4 | 12.9 % / 11.5 % | 5.3 % / 4.2 %
8 | 11.1 % / 9.6 % | 4.1 % / 3.0 %
16 | 9.6 % / 8.2 % | 3.4 % / 2.3 %
32 | 8.2 % / 6.8 % | 3.0 % / 2.0 %

Conclusion:

By training the thresholds individually, the recognition rate can be improved, or the number of parameters can be decreased at the same recognition rate.

From Speaker Verification to Speaker Identification

Flow chart – speaker identification:

❑ Inputs: the test utterance, the speaker-specific feature models, and the speaker-specific threshold/distance models

❑ "Scoring" with the background model

❑ "Scoring" with the speaker-specific models

❑ Selection of the best speaker model: computation of the best speaker model or detection of a new speaker

❑ Adaptation of the "winning" speaker model or generation of a new speaker model
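The selection step can be sketched as a 1-out-of-(N+1) decision. The sketch assumes that larger scores are better (e.g., log-likelihoods) and that a new speaker is detected whenever the background model scores at least as well as every speaker model; both points, like the function name, are assumptions for illustration.

```python
def identify_speaker(speaker_scores, background_score):
    """Return the name of the best-scoring speaker model, or None if the
    background model wins, which is interpreted as a new speaker."""
    best_speaker = max(speaker_scores, key=speaker_scores.get)
    if speaker_scores[best_speaker] > background_score:
        return best_speaker
    return None
```

A return value of `None` would trigger the generation of a new speaker model; otherwise the "winning" model can be adapted with the new data.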

Results of a Speaker Identification – Part 1

Boundary conditions:

❑ The results are taken from a publication of D. Reynolds (work done at MIT, see literature section for details).

❑ A database with 51 speakers (male speakers only) has been used. The database consists of English conversations (approximately 10 utterances with a duration of 45 seconds each).

❑ All data has been transmitted over telephone channels. Thus, the bandwidth of the data is approximately 3.8 kHz (8 kHz sample rate).

❑ MFCCs have been used as features. Modeling has been done with GMMs, where only diagonal covariance matrices have been used.

Results of a Speaker Identification – Part 2

Results:

❑ Length of test and training data vs. recognition rate:

Length of training data | Model order (number of Gaussian distributions) | Test data: 1 sec | Test data: 5 sec | Test data: 10 sec
30 sec | 8 | 54.6 % | 79.8 % | 86.6 %
30 sec | 16 | 63.7 % | 87.3 % | 90.5 %
30 sec | 32 | 64.6 % | 85.3 % | 88.4 %
60 sec | 8 | 66.1 % | 91.5 % | 97.3 %
60 sec | 16 | 74.9 % | 95.7 % | 98.8 %
60 sec | 32 | 78.6 % | 95.6 % | 98.3 %
90 sec | 8 | 71.5 % | 95.5 % | 98.8 %
90 sec | 16 | 79.0 % | 98.0 % | 99.7 %
90 sec | 32 | 84.7 % | 98.8 % | 99.6 %

Adaption of the Models During Run-Time – Part 1

General:

❑ After a speaker has been recognized successfully (this should be validated, e.g., by the dialog system), the speaker model of the active speaker can be adapted.

❑ Generally, all model parameters can be adapted. However, updating only the mean values of GMMs has proven to provide a good cost-benefit ratio. For codebooks, the codebook entries themselves play the role of the mean values, i.e., all parameters are adapted.

❑ Both the amount of training data and the number of new feature vectors should be considered. In the codebook adaptation, the new codebook entry is therefore formed from the old entry and the new feature vectors that have been assigned to it, weighted by the number of vectors that were used to form the entry during training and by the number of newly assigned feature vectors.
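One plausible form of such an update is a count-weighted mean of the old codebook entry and the newly assigned feature vectors. The formula in the docstring is an assumption that matches the description; the function and parameter names are hypothetical.

```python
def adapt_codebook_entry(old_entry, num_training_vectors, new_vectors):
    """Count-weighted update (assumed form):

    c_new = (N_train * c_old + sum of new vectors) / (N_train + N_new)
    """
    num_new = len(new_vectors)
    if num_new == 0:
        return list(old_entry)
    total = num_training_vectors + num_new
    return [(num_training_vectors * c + sum(x[d] for x in new_vectors)) / total
            for d, c in enumerate(old_entry)]
```

An entry that was formed from many training vectors moves only slightly when a few new vectors arrive, whereas an entry with little training support follows the new data quickly.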

Adaption of the Models During Run-Time – Part 2

General:

❑ The mean values of GMMs can be updated similarly to the codebooks by a modified iteration step of the EM algorithm (see last lecture). First, a "soft" assignment of the new feature vectors to the individual classes is done (E-step).

❑ Next, the mean values are corrected (M-step). In this correction, the old mean value of the kth class is weighted with the sum of the "soft" assignments of that class in the last iteration during the training.
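The two steps can be sketched as follows. Diagonal covariance matrices are assumed for brevity, and the M-step formula (old mean weighted with the soft count from training, new vectors weighted with their responsibilities) is an assumed form that matches the description; all names are hypothetical.

```python
import math


def responsibilities(x, weights, means, variances):
    """E-step: "soft" assignment of one feature vector to all classes."""
    log_terms = []
    for w, m, v in zip(weights, means, variances):
        log_density = -0.5 * sum(
            math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi
            for xi, mi, vi in zip(x, m, v))
        log_terms.append(math.log(w) + log_density)
    peak = max(log_terms)
    unnormalized = [math.exp(t - peak) for t in log_terms]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]


def adapt_means(new_vectors, weights, means, variances, training_counts):
    """M-step (means only): training_counts[k] is the sum of the soft
    assignments of class k in the last training iteration."""
    dim = len(means[0])
    soft_counts = [0.0] * len(means)
    weighted_sums = [[0.0] * dim for _ in means]
    for x in new_vectors:
        gamma = responsibilities(x, weights, means, variances)
        for k, g in enumerate(gamma):
            soft_counts[k] += g
            for d in range(dim):
                weighted_sums[k][d] += g * x[d]
    return [[(training_counts[k] * means[k][d] + weighted_sums[k][d])
             / (training_counts[k] + soft_counts[k]) for d in range(dim)]
            for k in range(len(means))]
```

Classes that receive almost no responsibility from the new data keep their old mean values, so the adaptation changes only the part of the model that the new utterance actually covers.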

Adaption of the Models During Run-Time – Part 3

Example:

[Figure: two plots of input feature 2 over input feature 1, showing the new feature vectors together with the Gaussian distributions before adaptation and after adaptation.]

Discriminative Approaches – Part 1

General:

❑ In discriminative approaches, the aim is not only to optimize the fit of a model to its training data but also to optimize the discrimination against the other models at the same time.

❑ Examples of such approaches are neural networks or learning vector quantization (LVQ).

❑ The advantage of such methods is, in general, an improved recognition rate.

❑ However, it is more difficult to include new speakers in the models when discriminant approaches are used. If this becomes necessary, all model parameters (even those of already known speakers) have to be recalculated, whereas only a new speaker model has to be generated in the approaches discussed so far.
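As an illustration of a discriminative update, one step of the basic LVQ1 rule can be sketched as follows: the nearest codebook entry is moved toward the training vector if it belongs to the correct speaker and away from it otherwise. The learning rate, the data layout, and the function name are assumptions; LVQ1 is used here as an example and is not necessarily the variant meant on the slide.

```python
def lvq1_update(codebook, labels, x, speaker, learning_rate=0.05):
    """Apply one LVQ1 step in place and return the index of the winner.

    codebook: list of codebook vectors shared by all speakers
    labels:   speaker label of each codebook vector
    """
    distances = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook]
    winner = distances.index(min(distances))
    sign = 1.0 if labels[winner] == speaker else -1.0
    codebook[winner] = [ci + sign * learning_rate * (xi - ci)
                        for xi, ci in zip(x, codebook[winner])]
    return winner
```

Because every update depends on the entries of all speakers, adding a speaker later changes the decision boundaries of the existing entries as well, which is why retraining is needed.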

Discriminative Approaches – Part 2

Neural networks (e.g., with radial basis functions):

❑ Input data of the neural network are the feature vectors of the training data of all speakers.

❑ The desired output is a vector which contains a 1 at the index of the current speaker. All other vector elements are set either to 0 or to −1.

❑ Standard training methods for neural networks attempt to minimize the squared distance between the output of the neural network and the desired output. Since the outputs for all other speakers are trained at the same time, this leads to discriminant methods.

Speech Generation and Speech Recognition

Overview:

❑ Speech generation: creation of the message → integration of language (English) → neuromuscular activities → vocal cords and vocal tract

❑ "Acoustic" channel

❑ Speech recognition: movement of the basilar membrane → neuronal activity → conversion based on a language → understanding of the message

"History"

1952 at Bell Labs

❑ First digit recognition

❑ Estimation of the energy at the formant frequencies (resonant frequencies of the vocal tract)

In the 1960s

❑ Improved digit recognition

❑ Breakthroughs in spectral estimation (FFT, cepstrum), dynamic time warping, and hidden Markov models

Hidden Markov models in speech recognition

❑ Mathematics by Baum et al. (1966 – 1972)

❑ Application to speech recognition by Baker (CMU Dragon System, 1974)

❑ Development at IBM (Baker, Jelinek, Bahl, Mercer, and others)

(Deep) neural networks

❑ Helped this technology to become as successful as seen in today's products (usually server-based architectures)

❑ Development started about 15 years ago


Speech Dialog Systems

Overview of a speech dialog system:

Speech signal → Speech recognition → Text → Parser → Semantic representation → Dialog manager → Semantic representation → Prompter → Text → Speech synthesis → Speech signal


Fundamental Principle of Speech Recognition

Speech signal with background noise → Preprocessing → Speech signal with little disturbance → Feature extraction → Features (e.g. MFCCs) → Classification → Evaluation (“N-best” list) → Decision → Recognized text

Preprocessing

❑ Reduces background noise and echoes

❑ Combines several microphone signals

Feature extraction

❑ Compresses the amount of data

❑ Extracts the parameters that are important for speech recognition

Classification

❑ Evaluates each activated class

❑ As a result, often the N best evaluations are determined

Decision

❑ Determines the best entry based on additional prior knowledge (word probabilities)


Variants of Speech Recognition Systems – Part 1

Single word recognizer

❑ Single words or short commands

❑ The (command) words are spoken in isolation (with pauses)

Keyword spotter

❑ Single words or word sequences within arbitrary utterances

❑ Once the keyword has been detected, a new recognizer is started

Recognition of connected words

❑ Sequence of fluently spoken words from a small vocabulary

Continuous speech recognizer

❑ Whole, fluently spoken sentences


Variants of Speech Recognition Systems – Part 2

Speaker-dependent systems

❑ Such systems have to be trained individually for each speaker.

❑ Depending on the size of the vocabulary and the desired quality, the training phase can take some time.

Speaker-independent systems

❑ There is no need for a (speaker specific) training phase.

❑ To obtain an appropriate quality a large training database has to be provided.

Speaker-adaptive systems

❑ The system starts with a universal model, which is then gradually adapted to the speaker.

❑ This can be done during a short training phase or during runtime.


Variants of Speech Recognition Systems – Part 3

Systems with small vocabulary

❑ Up to a few hundred words

❑ Typically used for control tasks

Systems with large vocabulary

❑ Several hundred thousand words

❑ Dictation, address input

❑ Vocabularies of this size often contain many phonetically similar words.

❑ To reduce the number of mistakes, a so-called language model is usually needed (it describes the relationship between words).


Evaluation of Speech Recognition Systems – Part 1

Basics

❑ Speech recognition systems and speech dialog systems are usually evaluated by means of word error rates.

❑ In practice, however, this measure is overrated. Other criteria are also important, for example how much computing power and memory the system needs, or how long it takes until the result is available.

❑ For the evaluation of speech dialog systems, the quality of the speech synthesis, the so-called start-up time, and many other quantities are also of special interest.

Word error rate


Evaluation of Speech Recognition Systems – Part 2

Word error rate

❑ The word error rate can be computed efficiently by means of dynamic programming. Defining the reference word sequence by

and the word sequence determined by the recognizer by

the following distance between the sequences can be defined:


Evaluation of Speech Recognition Systems – Part 3

Word error rate

❑ Initialization:

❑ Derivation of the word error rate

❑ Note that with this definition word error rates greater than 100 % are possible (due to many insertions).
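The dynamic programming idea can be illustrated with a minimal Python sketch. It assumes the common definition of the word error rate as the number of substitutions, deletions, and insertions divided by the number of reference words. The slide's own formulas are not reproduced here, so the function name, the unit edit costs, and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Assumption: every edit operation has cost 1 (Levenshtein distance on words).
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")

    # d[i][j]: minimum number of edits that turn reference[:i] into hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: only deletions or only insertions are possible
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j

    # Recursion over all prefix pairs
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + cost,  # match or substitution
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
            )
    return d[n][m] / n


# One deletion ("the") and two insertions ("please", "now") on four reference words
print(word_error_rate("switch the radio on".split(),
                      "switch radio on please now".split()))  # 0.75

# Three insertions on a single reference word: the rate exceeds 100 %
print(word_error_rate(["yes"], ["oh", "yes", "of", "course"]))  # 3.0
```

The second call shows why values above 100 % can occur: the normalization uses only the length of the reference, while the number of insertions is not bounded by it.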


Maximum A-posteriori Probability Rule (MAP Rule) – Part 1

Recognition criterion

❑ For a given feature sequence $X$, the word sequence $\hat{W}$ with the maximum a-posteriori probability should be selected out of all possible (permitted) word sequences $W$:

$$\hat{W} = \arg\max_{W} \, p(W \mid X)$$

❑ Using Bayes’ theorem, this can be rewritten as:

$$\hat{W} = \arg\max_{W} \, \frac{p(X \mid W)\, p(W)}{p(X)}$$

❑ Since the probability $p(X)$ of the feature sequence does not depend on $W$, it has no influence on the decision and can be neglected:

$$\hat{W} = \arg\max_{W} \, p(X \mid W)\, p(W)$$


Maximum A-posteriori Probability Rule (MAP Rule) – Part 2

Recognition criterion

❑ Optimization function:

$$\hat{W} = \arg\max_{W} \, p(X \mid W)\, p(W)$$

❑ The probability $p(X \mid W)$ is the probability of observing the feature sequence $X$ if the word sequence $W$ was spoken. Hidden Markov models (HMMs) have proved to be suitable for modeling such probabilities. This part of the optimization criterion is called the acoustic model.

❑ The probability $p(W)$ is the a-priori probability of the word sequence $W$. This probability is independent of the observation sequence and describes a-priori knowledge about the word sequence (e.g. that some words occur more often than others). This part of the optimization criterion is called the language model.
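As a rough illustration of how the acoustic model and the language model interact, the following Python sketch selects the best entry of an N-best list by maximizing log p(X|W) + log p(W). The function name, the candidate word sequences, and all scores are made-up assumptions for demonstration; a real recognizer would obtain them from trained HMMs and a trained language model.

```python
import math


def map_decode(acoustic_log_likelihoods, language_model):
    """Return the word sequence W that maximizes log p(X|W) + log p(W).

    acoustic_log_likelihoods: dict mapping word tuples to log p(X|W)
    language_model:           dict mapping word tuples to p(W)
    """
    best, best_score = None, -math.inf
    for words, log_p_x_given_w in acoustic_log_likelihoods.items():
        p_w = language_model.get(words, 0.0)
        if p_w <= 0.0:
            continue  # word sequence is not permitted by the language model
        score = log_p_x_given_w + math.log(p_w)
        if score > best_score:
            best, best_score = words, score
    return best, best_score


# Hypothetical N-best list: the second entry fits the features slightly better ...
n_best = {
    ("recognize", "speech"): -120.0,
    ("wreck", "a", "nice", "beach"): -119.0,
}
# ... but is far less likely a priori (hypothetical language model values)
priors = {
    ("recognize", "speech"): 1e-4,
    ("wreck", "a", "nice", "beach"): 1e-7,
}

best_words, best_score = map_decode(n_best, priors)
print(best_words)
```

Working with logarithms avoids numerical underflow for long feature sequences and turns the product of the two model probabilities into a sum.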


Computationally Efficient Model Restrictions – Part 1

Limitations of HMM model degrees of freedom

❑ If a large number of Gaussian distributions is initially permitted for the individual HMMs or HMM states, one can use the same mean vectors and covariance matrices for all of them. Only the weights of the individual Gaussian distributions are selected individually for each model or model state.

❑ Hidden Markov models of this kind are called semi-continuous HMMs.

❑ This saves a lot of memory and computational load:

❑ The required memory is reduced because the mean vectors and covariance matrices are shared by all models (and do not have to be stored individually for each model or model state).

❑ In addition, the contributions of the individual Gaussian distributions are computed only once per time frame, i.e. once per feature vector (and not individually for each model or model state).
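The sharing described above can be sketched in a few lines of Python. The sketch assumes diagonal covariance matrices and uses a tiny, made-up codebook with three Gaussians and two states; all names and numbers are illustrative assumptions, not values from the lecture.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance at the feature vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


# Shared codebook: (mean vector, variance vector), stored only once for all states
codebook = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([3.0, 3.0], [1.0, 1.0]),
    ([-3.0, 3.0], [2.0, 2.0]),
]

# Only the mixture weights are specific to each model state
state_weights = {
    "state_a": [0.7, 0.2, 0.1],
    "state_b": [0.1, 0.8, 0.1],
}


def emission_probabilities(x):
    """Evaluate every codebook Gaussian once, then reuse the values for all states."""
    densities = [gaussian_pdf(x, mean, var) for mean, var in codebook]
    return {
        state: sum(w * d for w, d in zip(weights, densities))
        for state, weights in state_weights.items()
    }


probs = emission_probabilities([2.8, 3.1])
print(probs)
```

With S states and K codebook entries, the expensive density evaluations scale with K per frame instead of S times K; only the cheap weighted sums remain state-specific.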


Computationally Efficient Model Restrictions – Part 2

Limitations of HMM model degrees of freedom

❑ In addition, only a limited number of Gaussian distributions can be assigned to each HMM or HMM state. This further reduces the computational load (and, to a small extent, the memory).

❑ This is done by storing only the indices of the active Gaussian distributions of each model state (e.g. 8 to 32 Gaussian distributions out of 512 to 2048).

❑ The base Gaussian distributions can, for example, be derived from a large speech database by using the EM algorithm.
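A minimal Python sketch of the index-based storage is given below. The slides do not state how the active Gaussians of a state are chosen; keeping the largest weights and renormalizing them is one possible rule and is an assumption here, as are all names and numbers.

```python
def prune_weights(full_weights, k):
    """Keep the k largest mixture weights of a state and renormalize them.

    Returns (indices into the shared codebook, renormalized weights).
    Assumption: this selection rule is only one of several possibilities.
    """
    top = sorted(range(len(full_weights)),
                 key=lambda i: full_weights[i], reverse=True)[:k]
    top.sort()
    total = sum(full_weights[i] for i in top)
    return top, [full_weights[i] / total for i in top]


def state_emission(densities, active_indices, weights):
    """Weighted sum over the active Gaussians of one state only."""
    return sum(w * densities[k] for k, w in zip(active_indices, weights))


# Hypothetical densities of an 8-entry codebook, evaluated once for the current frame
densities = [0.02, 0.0, 0.31, 0.0005, 0.12, 0.0, 0.07, 0.001]

# Each state stores a short index list plus the matching weights
states = {
    "state_a": ([2, 4], [0.6, 0.4]),
    "state_b": ([0, 6, 7], [0.5, 0.3, 0.2]),
}

for name, (indices, weights) in states.items():
    print(name, state_emission(densities, indices, weights))

print(prune_weights([0.05, 0.5, 0.05, 0.3, 0.1], 2))
```

Per state, the cost of the weighted sum then depends on the number of active indices rather than on the full codebook size.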


Base Units of Hidden Markov Models – Part 1

Figure: speech database of approximately 1000 speakers; extracted base units, e.g. I, Il, lE, Ev, vn

Acoustic modeling

❑ To train the HMMs, acoustic base units are extracted from a large speech database (e.g. 1000 speakers, each of whom talked for approximately one hour).

❑ Such base units can be phonemes, but also phoneme pairs or groups of three phonemes.

❑ In addition, single-word models are often trained for key words (e.g. numbers).

❑ There are about 50 phonemes in each language. For phoneme pairs and groups of three phonemes, the number of groups that actually occur within a language is considerably smaller than 50² = 2500 and 50³ = 125 000, respectively.
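To make the notion of phoneme pairs and groups of three phonemes concrete, the following Python sketch cuts a phoneme string into overlapping groups. The phoneme string I l E v n is an assumption chosen to match the pair labels visible in the figure on this slide; the function name is hypothetical.

```python
def context_units(phonemes, n):
    """Return all groups of n consecutive phonemes of a phoneme string."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]


# Assumed phoneme string (illustrative only)
phonemes = ["I", "l", "E", "v", "n"]

pairs = context_units(phonemes, 2)    # phoneme pairs
triples = context_units(phonemes, 3)  # groups of three phonemes
print(pairs)
print(triples)

# Upper bounds for an inventory of about 50 phonemes
print(50 ** 2, 50 ** 3)  # 2500 125000
```

Collecting such groups over a whole pronunciation lexicon and counting the distinct ones shows how far the set that actually occurs stays below these upper bounds.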


Base Units of Hidden Markov Models – Part 2

Acoustic modeling

❑ Composing the required models (given by the vocabulary) out of base units has the advantage that it is fairly simple (see following slides) and can be done during the runtime of the speech recognizer.

❑ Therefore, it is possible to wait for the user's answer in a speech dialog, generate a corresponding response with a speech synthesis system, and then start the next recognition run. In this run, only those word hypotheses are activated that make sense in the current dialog state.

❑ This allows the vocabulary to be kept small, which leads to fewer errors and a lower computational load.

❑ This is especially important for database queries (e.g. Google search, operating an MP3 player, etc.).


Training versus Processing During Run-Time of a Recognizer

Training:

❑ Training material (speech data) → Training of the acoustic model → Acoustic model p(X|W)

❑ Training material (text) → Text processing → Training of the language model → Language model p(W)

During run-time:

❑ Speech signal → Feature extraction → Decoding, using the vocabulary, the acoustic model p(X|W), and the language model p(W)


Composition of HMMs – Part 1

Parallel connection of HMMs

❑ For the parallel composition of HMMs, only the transition probabilities and the a-priori probabilities have to be combined.

❑ Example for a parallel connection of two simple left-right models, each with two emitting states:


Composition of HMMs – Part 2

Parallel connection of HMMs

❑ Example for the parallel connection of two simple left-right models:
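Since the example figure is not reproduced here, a parallel connection can be sketched in Python as follows. The sketch assumes that the combined transition matrix is block-diagonal (no transitions between the two models) and that the a-priori probabilities are scaled by model weights w1 and w2; the weights and all numerical values are illustrative assumptions.

```python
def parallel_hmm(pi1, A1, pi2, A2, w1=0.5, w2=0.5):
    """Combine two HMMs in parallel.

    The states of model 1 come first, followed by the states of model 2.
    pi1, pi2: a-priori (initial state) probabilities
    A1, A2:   transition matrices as lists of rows
    w1, w2:   assumed a-priori probabilities of the two models (w1 + w2 = 1)
    """
    n1, n2 = len(pi1), len(pi2)
    # A-priori probabilities: concatenate, scaled by the model weights
    pi = [w1 * p for p in pi1] + [w2 * p for p in pi2]
    # Transition matrix: block-diagonal, zeros between the two models
    A = [list(row) + [0.0] * n2 for row in A1] + \
        [[0.0] * n1 + list(row) for row in A2]
    return pi, A


# Two simple left-right models with two emitting states each (made-up values)
pi1, A1 = [1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]]
pi2, A2 = [1.0, 0.0], [[0.7, 0.3], [0.0, 1.0]]

pi, A = parallel_hmm(pi1, A1, pi2, A2)
print(pi)
for row in A:
    print(row)
```

The emission distributions of the individual states remain untouched, which is why only the transition and a-priori probabilities need to be combined.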


Composition of HMMs – Part 3

Series connection of HMMs

❑ Example for the series connection of two simple left-right models:
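Since the example figure is not reproduced here, a series connection can be sketched in Python as follows. The sketch assumes that each state of the first model has an exit probability (the probability mass that leaves the model) and that this mass is redirected to the entry states of the second model according to its a-priori probabilities. This construction and all numbers are assumptions for illustration.

```python
def series_hmm(pi1, A1, exit1, pi2, A2):
    """Connect two HMMs in series (model 1 first, then model 2).

    exit1[i] is the assumed probability of leaving model 1 from its state i,
    so that sum(A1[i]) + exit1[i] == 1 for every state i of model 1.
    """
    n1, n2 = len(pi1), len(pi2)
    # The combined model can only be entered through model 1
    pi = list(pi1) + [0.0] * n2
    A = []
    for i in range(n1):
        # Leaving model 1 means entering model 2 according to pi2
        A.append(list(A1[i]) + [exit1[i] * p for p in pi2])
    for i in range(n2):
        # There is no way back from model 2 to model 1
        A.append([0.0] * n1 + list(A2[i]))
    return pi, A


# Two simple left-right models (made-up values)
pi1, A1, exit1 = [1.0, 0.0], [[0.6, 0.4], [0.0, 0.7]], [0.0, 0.3]
pi2, A2 = [1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]]

pi, A = series_hmm(pi1, A1, exit1, pi2, A2)
print(pi)
for row in A:
    print(row)
```

In this sketch, the last state of the first model keeps its self-transition of 0.7 and passes the remaining 0.3 to the first state of the second model.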


Composition of HMMs – Part 4

Generation of the active vocabulary

❑ The HMMs have to be connected with each other as efficiently as possible (graph theory).

❑ Example (fragment) for German double digits:

(ein | zwei | drei | neun) → und → (dreißig | vierzig | fünfzig | neunzig)
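The benefit of such a word graph can be illustrated with a small Python sketch that uses only the words of the fragment above. The start and end markers "<s>" and "</s>" as well as the function name are hypothetical additions for illustration.

```python
UNITS = ["ein", "zwei", "drei", "neun"]
TENS = ["dreißig", "vierzig", "fünfzig", "neunzig"]

# Word graph as adjacency lists; all unit words share the single node "und"
graph = {"<s>": UNITS, "und": TENS}
for unit in UNITS:
    graph[unit] = ["und"]
for ten in TENS:
    graph[ten] = ["</s>"]


def word_sequences(graph, node="<s>"):
    """Yield every word sequence along a path from node to the end marker."""
    for successor in graph[node]:
        if successor == "</s>":
            yield []
        else:
            for tail in word_sequences(graph, successor):
                yield [successor] + tail


sequences = list(word_sequences(graph))
print(len(sequences))               # number of accepted word sequences
print(len(UNITS) + 1 + len(TENS))   # number of word models in the graph
print(" ".join(sequences[0]))
```

The graph accepts every combination of a unit word and a tens word, although each word model appears only once; sharing nodes in this way keeps the active network small.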


Research Directions

Feature extraction:

❑ Psychoacoustically motivated feature extraction

❑ Use of additional information (speaker direction, etc.)

Acoustic modeling:

❑ Use of base units other than phonemes

❑ Improved modeling (e.g. with sound duration models)

❑ Using neural networks and mixed approaches

Adaption:

❑ Adaption with regard to the current speaker

❑ Feature transformation to reduce the dependence on the recording conditions

Training:

❑ Discriminative approaches


Summary and Outlook

Summary:

❑ Speaker recognition

❑ Motivation

❑ Speaker verification and speaker identification

❑ Model adaption

❑ Discriminative approaches

❑ Speech recognition

❑ Fundamentals

❑ Statistical speech recognition

Next week:

❑ Neural networks