
Pattern Recognition

Gerhard Schmidt

Christian-Albrechts-Universität zu Kiel, Faculty of Engineering, Institute of Electrical and Information Engineering, Digital Signal Processing and System Theory

Part 9: Speaker and Speech Recognition

Digital Signal Processing and System Theory | Pattern Recognition | Speaker and Speech Recognition Slide 2

•

Speaker and Speech Recognition

Contents

❑ Literature

❑ Speaker recognition

❑ Motivation

❑ Speaker verification and speaker identification

❑ Model adaption

❑ Discriminative approaches

❑ Speech recognition

❑ Fundamentals

❑ Statistical speech recognition

❑ Conclusion and outlook



•


Literature

Gaussian mixture models:

❑ C. M. Bishop: Pattern Recognition and Machine Learning, Springer, 2006

❑ L. Rabiner, B.H. Juang: Fundamentals of Speech Recognition, Prentice Hall, 1993

Speech recognition:

❑ C. M. Bishop: Pattern Recognition and Machine Learning, Springer, 2006

❑ B. Pfister, T. Kaufmann: Sprachverarbeitung, Springer, 2008 (in German)

Speaker recognition:

❑ G. Kolano: Lernverfahren zur Sprecherverifikation, Shaker, 2000 (in German)

❑ J. Benesty, et al.: Handbook on Speech Processing, Chapters 37 and 38 on „Speaker Recognition“, Springer, 2008


•




•


Motivation

Applications for speaker recognition

❑ Admission control (for supplementation of immobilizer systems in cars or admission to protected areas or rooms).

❑ Personalization of speech services (systems recognize the user/caller again and can access preference data bases).

❑ Improvement of speech signal enhancement schemes (e.g., speaker specific signal reconstruction).

❑ Speaker-dependent post-training (optimization) of a speech recognition system: if a speech dialog system is used by multiple users in random order, the post-training/adaptation of the recognizer can be carried out separately for each speaker.


•


Variants of Speaker Recognition – Part 1

Differentiation between verification and identification

Speaker verification:

Binary decision – is a speaker really the person he pretends to be?

Speaker identification:

1-out-of-N decision – which one of N speakers is active?


•



Differentiation between text-dependent and text-independent speaker verification

Text-dependent verification:

The speaker knows a password that he has to speak or a new password that has to be spoken is provided for every verification.

Text-independent verification:

The speaker‘s utterance is unknown.


•



Differentiation between „closed-set“ and „open-set“ identification

„closed“ (closed-set) identification:

All potential speakers are known in advance – no new speakers are added later.

„Open“ (open-set) identification:

The potential speakers are not known in advance. It is not necessarily known how many speakers exist.


•



Again, a differentiation between text-dependent and text-independent variants is possible.


•



Differentiation between non-discriminant and discriminant training methods

Non-discriminant training:

The models are trained for each speaker independently, i.e., each model has to fit the extracted training data as well as possible – however, a good discrimination from other speakers is not considered.

Discriminant training:

All speakers are considered during the training of the models to fit the individual models not only to one speaker, but also to learn the differences between the speaker features.


•


Basics of Speaker Recognition – Part 1

[Block diagram – speaker verification: distortion-reducing preprocessing and segmentation → short-term spectrum of the distortion-reduced signal → feature extraction (with normalization) → feature vector → accumulation of the single logarithmic probabilities or distances over time → binary decision. The accumulation uses the model for the features of the speaker to be verified and the universal background model for other speakers; the decision is fed back for adapting the model.]


•


Basics of Speaker Recognition – Part 2

[Block diagram – speaker identification: distortion-reducing preprocessing and segmentation → short-term spectrum of the distortion-reduced signal → feature extraction (with normalization) → feature vector → accumulation of the single logarithmic probabilities or distances over time → 1-out-of-(N+1) decision. The accumulation uses the speaker models 1 … N and the universal background model for other speakers; if a new speaker is detected, a new speaker model is generated.]


•


Difficulties in Speaker Recognition

Some typical problems…

❑ In many practical applications only a relatively small amount of training data for the individual speakers is available. Additionally, this training data is often not phonetically „balanced“. During the recognition itself, a decision should be made as fast as possible.

❑ As a consequence, text-independent systems develop a strong text dependency: if speaker A speaks words that are contained in the small training set of speaker B, but not in his own, the probability of identifying speaker B is rather high when only a small amount of training data is available.

❑ It is often reported in literature that preprocessing or normalization have a negative influence on the recognition rate. This is true if the recording conditions during training and test match well. However, such a match between training and test conditions is not always given in practice.

❑ Speech pauses should be removed before the recognition task itself. Otherwise, the background noise will have a strong influence on the decision: speakers with similar background noise during recording will be preferred.


•


Preprocessing and Segmentation – Part 1

Subband structure:

[Block diagram with the blocks: analysis filterbank, input PSD estimation, noise PSD estimation, filter characteristic, segmentation]

PSD = power spectral density


•



Noise reduction:

❑ Noise reduction without limitation of the attenuation (needed for the segmentation)

❑ Noise reduction with limitation of the attenuation (needed for the signal enhancement)

Segmentation:

If the noise reduction filter is open in 10…30 percent of all subbands, the current frame is classified as containing speech.
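The segmentation rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the gain value above which a subband counts as "open" (`open_gain`), the chosen fraction of 20 percent (one value from the 10…30 percent range), and the reading of the rule as "at least this fraction" are not taken from the slides.

```python
def frame_contains_speech(gains, open_gain=0.5, min_open_fraction=0.2):
    """Classify one frame from the subband gains of the noise reduction
    filter that works WITHOUT attenuation limit.

    gains             -- one filter gain per subband, each in [0, 1]
    open_gain         -- assumed gain above which a subband counts as open
    min_open_fraction -- assumed fraction of open subbands needed for speech
    """
    open_count = sum(1 for g in gains if g >= open_gain)
    return open_count / len(gains) >= min_open_fraction


# Three of ten subbands are open: the frame is kept as speech.
speech_frame = [0.9, 0.8, 0.7, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0]
# Only one of ten subbands is open: the frame is treated as a pause.
pause_frame = [0.6, 0.1, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0]
print(frame_contains_speech(speech_frame), frame_contains_speech(pause_frame))
```

Frames classified as pauses would then be dropped before the feature extraction, so that background noise does not influence the decision.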


•



Example:

❑ Input signal

❑ Signal after noise reduction

❑ Signal after segmentation

[Figure: three spectrograms (frequency in Hz over time in seconds): time-frequency analysis of the noisy input signal, of the noise-reduced signal, and of the segmented noise-reduced signal]


•


Feature Extraction – Part 1

Mel-filtered cepstral coefficients (MFCCs):

[Block diagram: computation of the (squared) magnitude → Mel filtering → logarithm → discrete cosine transform]

❑ The first (zeroth) coefficient of the feature vectors is often replaced by the normalized short-term power of the current signal frame.

❑ The normalization is done such that the maximum short-term power of an utterance is mapped to a defined value.
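The processing chain (squared magnitude, Mel filtering, logarithm, discrete cosine transform) can be sketched with the Python standard library alone. All concrete choices here are assumptions for illustration and are not taken from the slides: the triangular filter shape, the common mel formula 2595·log10(1 + f/700), the number of filters and coefficients, and the omission of windowing and of the power normalization of the zeroth coefficient.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Squared DFT magnitude for the bins 0 .. N/2 (plain O(N^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(s) ** 2)
    return spectrum


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(max_mel * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for i in range(1, num_filters + 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate=8000, num_filters=12, num_coeffs=8):
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(spectrum), sample_rate)
    # Small constant avoids log(0) for filters that cover no DFT bin.
    log_energies = [math.log(sum(w * p for w, p in zip(row, spectrum)) + 1e-10)
                    for row in bank]
    # DCT-II of the log mel energies.
    return [sum(e * math.cos(math.pi * q * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energies))
            for q in range(num_coeffs)]


frame = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(64)]
coefficients = mfcc(frame)
print(len(coefficients))
```

In a complete front end, the zeroth coefficient would additionally be replaced by the normalized short-term power, as described in the bullets above.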


•


Feature Extraction – Part 2

Some remarks:

❑ Many publications deal with the selection of features. The most common conclusion is that a compact representation of the short-term spectral envelope should be used.

❑ MFCCs and cepstral coefficients (with slight modification) have proven to be useful.

❑ It is astonishing that these are the same features that are used for speech recognition. In the application of speech recognition, the interest is to remove differences between speakers to obtain only information about the words that have been spoken.

❑ However, it should be mentioned that different preprocessing is used for speaker and speech recognition.

❑ As a consequence, it can be concluded that a speaker-specific speech recognition yields better results compared to a non-speaker-specific one – this can also be observed in practice. For this reason, it is often desired to adapt the models of a speech recognition system to the current speaker.


•


Speaker Recognition With Codebooks – Recognition Phase

Flow chart – speaker verification:

[Flow chart: the test utterance and the speaker identity under test are the inputs. Using the speaker-specific feature codebook and the background codebook, the distances of the test utterance to the speaker-specific codebook and to the background codebook are calculated. The distances are compared with consideration of the speaker-specific threshold (taken from the speaker-specific threshold codebooks), which leads to acceptance or rejection of the speaker identity under test.]
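The recognition phase of the flow chart can be sketched as follows. The slides do not state which distance measure is used or how exactly the two distances are compared with the threshold; the squared Euclidean distance, the averaging over the utterance, and the difference test below are assumptions, and all names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_min_distance(features, codebook):
    """Mean distance of each feature vector to its nearest codebook entry."""
    total = sum(min(squared_distance(x, c) for c in codebook) for x in features)
    return total / len(features)


def verify_with_codebooks(features, speaker_codebook, background_codebook, threshold):
    """Accept the claimed identity if the utterance fits the speaker-specific
    codebook better than the background codebook by more than the threshold."""
    d_speaker = average_min_distance(features, speaker_codebook)
    d_background = average_min_distance(features, background_codebook)
    return d_background - d_speaker > threshold


speaker_cb = [[0.0, 0.0], [1.0, 1.0]]
background_cb = [[4.0, 4.0], [5.0, 3.0]]
utterance = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.1]]
print(verify_with_codebooks(utterance, speaker_cb, background_cb, threshold=1.0))
```

A speaker-specific threshold simply means that `threshold` is looked up per claimed identity instead of being one global constant.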


•


Speaker Recognition With Codebooks – Training Phase

Flow chart – speaker verification:

[Flow chart: the speech data of a speaker and the speech data of the background speakers each pass through feature extraction and codebook training. The resulting speaker-specific feature codebook and the background feature codebook are saved. In addition, the speaker-specific thresholds are calculated and saved in the speaker-specific threshold codebook.]


•


Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 1)

Approach of the speaker verification:

❑ Pose two hypotheses:

❑ If the same „costs“ are assumed for the different kinds of errors, the target and the test speaker are decided to be the same person if

The matrix contains the feature vectors of the utterance (after noise and speech pauses have been removed).


•



Approach of the speaker verification :

❑ The conditional probabilities can be re-written as follows:

❑ This yields for our condition:

❑ Different speaker probabilities can be modeled by the ratio of the a priori probabilities of the two hypotheses.
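A reconstruction of the decision rule that is consistent with the text of these two slides is sketched below. The symbols are assumptions chosen for illustration: X for the matrix of feature vectors, H_0 for "the utterance stems from the target speaker", H_1 for "the utterance stems from another speaker", and P(H_0), P(H_1) for the a priori probabilities of the hypotheses.

```latex
% Decision for equal error costs: accept the claimed identity if
P(H_0 \mid \boldsymbol{X}) > P(H_1 \mid \boldsymbol{X}).

% Bayes' rule for both hypotheses:
P(H_i \mid \boldsymbol{X})
  = \frac{p(\boldsymbol{X} \mid H_i)\, P(H_i)}{p(\boldsymbol{X})},
  \qquad i \in \{0, 1\}.

% The common denominator p(X) cancels, which yields the condition
p(\boldsymbol{X} \mid H_0)\, P(H_0) > p(\boldsymbol{X} \mid H_1)\, P(H_1)
\quad\Longleftrightarrow\quad
\frac{p(\boldsymbol{X} \mid H_0)}{p(\boldsymbol{X} \mid H_1)}
  > \frac{P(H_1)}{P(H_0)}.
```

In this form the ratio of the two a priori probabilities acts as a threshold on the likelihood ratio.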


•



[Figure: the observed data are evaluated with two probability density models (joint density over feature 1 and feature 2): one trained on data of hypothesis H0, i.e. on training data of the target speaker, and one trained on data of hypothesis H1, i.e. on training data of non-target speaker(s). The first result is multiplied with the speaker probability, the second with the complementary speaker probability; both products enter the decision.]


•



Approach of the speaker verification:

❑ If Gaussian mixture models are used, the (logarithmic) probability density functions are:

The superscripts (s) and (b) denote the individual speaker and background model, respectively.
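A compact sketch of the GMM-based decision: the logarithmic densities of all feature vectors are accumulated for the speaker model (s) and the background model (b), and their difference is compared with a threshold. Diagonal covariance matrices are assumed here only to keep the example short (the results reported later in this part use fully populated matrices); the function names and the tuple layout of a model are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Logarithm of a Gaussian density with diagonal covariance matrix."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def log_gmm(x, weights, means, variances):
    """Logarithm of the mixture density, computed with the log-sum-exp trick."""
    terms = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def accept_speaker(features, speaker_gmm, background_gmm, threshold=0.0):
    """Accumulate the log-likelihood ratio over all frames and compare it
    with a threshold (0.0 corresponds to equal a priori probabilities)."""
    llr = sum(log_gmm(x, *speaker_gmm) - log_gmm(x, *background_gmm)
              for x in features)
    return llr > threshold


# Each model is a tuple (weights, means, variances).
speaker_gmm = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
background_gmm = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])
print(accept_speaker([[0.2, 0.1], [0.9, 1.1]], speaker_gmm, background_gmm))
```

Summing logarithmic densities over the frames treats the feature vectors as statistically independent, which is the usual simplification for text-independent systems.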


•



Approach of the speaker verification:

❑ The decision rule

can be re-written as follows:


•


Results of a Speaker Verification – Part 1

Boundary conditions:

❑ The results are taken from the dissertation of G. Kolano (work done at the Daimler Research Center in Ulm, see literature section for details).

❑ A data base with 106 speakers (only male speakers) has been used. The data base consists of English double-digits (i.e., the vocabulary is limited).

❑ All data has been transmitted over telephone channels. Thus, the bandwidth of the data is approximately 3.8 kHz (8 kHz sample rate). Especially for speaker recognition, these are rather bad boundary conditions.

❑ Out of the 106 speakers, 33 have been used for training the background models, the remaining 73 have been used for the evaluation of the speaker identification.

❑ MFCCs have been used as features. They were only computed if the current signal frame has been classified as voiced speech.


•



Comparison between codebooks and GMMs:

❑ The background model has the same size as the speaker model in all cases.

❑ Results in terms of error rates:

Model order (number of codebook      Codebook      Gaussian
entries or number of Gaussian        approach      mixture model
distributions)

 4                                   11.5 %        4.2 %
 8                                    9.6 %        3.0 %
16                                    8.2 %        2.3 %
32                                    6.8 %        2.0 %

Conclusion:

GMMs are – at least in this test – clearly superior to codebook approaches, but …


•



Comparison between codebooks and GMMs:

❑ The covariance matrices of the GMM approach were fully populated. Thus, a clearly larger number of model parameters has been used in this approach and the computational complexity is clearly higher.

❑ Number of model parameters:

Model order (number of codebook      Codebook      Gaussian
entries or number of Gaussian        approach      mixture model
distributions)

 4                                     68            683
 8                                    136           1367
16                                    272           2735
32                                    544           5471

Conclusion:

… GMMs require clearly more memory and computational power compared to codebook approaches.
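The parameter counts in the table can be reproduced with a short calculation. The feature dimension of 17 is not stated on the slide; it is inferred here from 68 parameters for 4 codebook entries, and the counting convention for the GMM (mean vector, upper triangle of the symmetric covariance matrix, and a weight per component, minus one because the weights sum to one) is likewise an assumption that happens to match the reported numbers.

```python
def codebook_parameters(order, dim):
    """Each codebook entry is one vector of length dim."""
    return order * dim


def gmm_full_parameters(order, dim):
    """Mean, symmetric full covariance matrix, and weight per component;
    one weight is redundant because the weights sum to one."""
    per_component = dim + dim * (dim + 1) // 2 + 1
    return order * per_component - 1


DIM = 17  # assumed feature dimension, inferred from 68 / 4
for order in (4, 8, 16, 32):
    print(order, codebook_parameters(order, DIM), gmm_full_parameters(order, DIM))
```

With diagonal covariance matrices the per-component count would drop from 171 to 35, which is one common way to trade accuracy against memory.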


•



Comparison between global and individual thresholds:

❑ So far, individual thresholds and a priori probabilities have been trained for each speaker.

❑ Error rates with a global threshold / with individual thresholds:

Model order (number of codebook      Codebook approach       Gaussian mixture model
entries or number of Gaussian        (global / individual)   (global / individual)
distributions)

 4                                   12.9 % / 11.5 %         5.3 % / 4.2 %
 8                                   11.1 % /  9.6 %         4.1 % / 3.0 %
16                                    9.6 % /  8.2 %         3.4 % / 2.3 %
32                                    8.2 % /  6.8 %         3.0 % / 2.0 %

Conclusion:

By training the thresholds, the recognition rate can be improved or the number of parameters can be decreased.


•


From Speaker Verification to Speaker Identification

Flow chart – Speaker identification:

[Flow chart: the test utterance is „scored“ with the background model and with the speaker-specific models (speaker-specific feature models and speaker-specific threshold/distance models). From the scores, the best speaker model is computed or a new speaker is detected. The best speaker model is selected, and the „winning“ speaker model is adapted or a new speaker model is generated.]
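The step from verification to identification amounts to a 1-out-of-(N+1) decision: the best of the N speaker models wins unless it fails to beat the background model, in which case a new speaker is assumed. The sketch below assumes that higher scores are better (e.g. accumulated logarithmic probabilities) and uses a hypothetical `margin` parameter in place of the speaker-specific thresholds.

```python
def identify_speaker(speaker_scores, background_score, margin=0.0):
    """Return the name of the best speaker model, or None if no model beats
    the background model by more than the margin (new speaker detected)."""
    best = max(speaker_scores, key=speaker_scores.get)
    if speaker_scores[best] - background_score <= margin:
        return None
    return best


scores = {"speaker_1": -410.2, "speaker_2": -395.7, "speaker_3": -402.9}
print(identify_speaker(scores, background_score=-400.0))
print(identify_speaker(scores, background_score=-390.0))
```

When `None` is returned, a new speaker model would be generated; otherwise the winning model can be adapted with the new data.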


•


Results of a Speaker Identification – Part 1

Boundary conditions:

❑ The results are taken from a publication of D. Reynolds (work done at the MIT, see literature section for details).

❑ A data base with 51 speakers (only male speakers) has been used. The data base consists of English conversations (approximately 10 utterances with a duration of 45 seconds each).

❑ All data has been transmitted over telephone channels. Thus, the bandwidth of the data is approximately 3.8 kHz (8 kHz sample rate).

❑ MFCCs have been used as features. Modeling has been done with GMMs, where only diagonal covariance matrices have been used.


•


Results of a Speaker Identification – Part 2

Results:

❑ Length of test and training data vs. recognition rate:

Length of        Model order (number of       Length of test data
training data    Gaussian distributions)      1 sec      5 sec      10 sec

30 sec            8                           54.6 %     79.8 %     86.6 %
                 16                           63.7 %     87.3 %     90.5 %
                 32                           64.6 %     85.3 %     88.4 %

60 sec            8                           66.1 %     91.5 %     97.3 %
                 16                           74.9 %     95.7 %     98.8 %
                 32                           78.6 %     95.6 %     98.3 %

90 sec            8                           71.5 %     95.5 %     98.8 %
                 16                           79.0 %     98.0 %     99.7 %
                 32                           84.7 %     98.8 %     99.6 %


•


Adaption of the Models During Run-Time – Part 1

General:

❑ After a speaker recognition has been successful (this should be validated, e.g., by using a dialog system), the speaker model of the active speaker can be adapted.

❑ Generally, all model parameters can be adapted. However, updating only the mean values of GMMs proved to provide a good cost-benefit ratio. For codebooks, the mean values can be seen as the individual codebook entries, i.e., all parameters are adapted.

❑ Both the amount of training data and the number of new feature vectors should be considered. The codebook adaption can be done according to

Here, the update involves the new codebook entry and the old one, the number of vectors that have been used to form the entry during training, and the number of new feature vectors that have been assigned to the corresponding codebook vector.
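One plausible form of such an update is a count-weighted average of the old entry and the newly assigned feature vectors. The exact formula of the slide was lost in extraction, so the rule below is an assumption that merely respects the two quantities named in the text (the number of training vectors behind the entry and the number of newly assigned vectors); the function name is hypothetical.

```python
def adapt_codebook_entry(old_entry, n_train, new_vectors):
    """Count-weighted update of one codebook entry (assumed form).

    old_entry   -- codebook vector before the adaptation
    n_train     -- number of vectors that formed the entry during training
    new_vectors -- feature vectors newly assigned to this entry
    """
    n_new = len(new_vectors)
    if n_new == 0:
        return list(old_entry)
    sums = [sum(column) for column in zip(*new_vectors)]
    return [(n_train * c + s) / (n_train + n_new) for c, s in zip(old_entry, sums)]


updated = adapt_codebook_entry([0.0, 0.0], n_train=8, new_vectors=[[1.0, 2.0], [3.0, 2.0]])
print(updated)
```

With this form, an entry that was trained on many vectors moves only slightly, while an entry with little training support follows the new data quickly.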


•



General:

❑ The mean values of GMMs can be updated similarly to the codebooks by a modified iteration step of the EM algorithm (see last lecture). First, a „soft“ assignment to the individual classes is done (E-step):

Next, the mean values are corrected (M-step) by

The correction uses, for the kth class, the sum of the „soft“ assignments from the last iteration of the training.
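The two steps can be sketched as follows. The E-step computes the usual posterior class probabilities; the M-step shown here weights the old mean with the accumulated soft count from training and adds the softly assigned new vectors. This particular M-step is an assumed form chosen by analogy with the codebook update, since the slide's formula was lost in extraction; diagonal covariance matrices and all names are also assumptions.

```python
import math


def soft_assignments(x, weights, means, variances):
    """E-step: posterior probability of each class for feature vector x."""
    logs = []
    for w, m, v in zip(weights, means, variances):
        log_density = -0.5 * sum(math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi
                                 for xi, mi, vi in zip(x, m, v))
        logs.append(math.log(w) + log_density)
    top = max(logs)
    unnormalized = [math.exp(value - top) for value in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def adapt_means(features, weights, means, variances, train_counts):
    """M-step (assumed form): combine old means and new data, weighted by the
    soft counts from training and the soft counts of the new data."""
    dim = len(means[0])
    soft_counts = [0.0] * len(means)
    soft_sums = [[0.0] * dim for _ in means]
    for x in features:
        for k, gamma in enumerate(soft_assignments(x, weights, means, variances)):
            soft_counts[k] += gamma
            for d in range(dim):
                soft_sums[k][d] += gamma * x[d]
    new_means = []
    for k, mean in enumerate(means):
        denominator = train_counts[k] + soft_counts[k]
        new_means.append([(train_counts[k] * mean[d] + soft_sums[k][d]) / denominator
                          for d in range(dim)])
    return new_means


weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 10.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
adapted = adapt_means([[1.0, 1.0]], weights, means, variances, train_counts=[1.0, 1.0])
print(adapted)
```

Only the Gaussian close to the new feature vector moves noticeably; the distant one keeps its mean because its soft assignment is practically zero.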


•



Example:

[Figure: two scatter plots over input feature 1 and input feature 2, showing the new feature vectors together with the Gaussian distributions before adaptation and after adaptation]


•


Discriminative Approaches – Part 1

General:

❑ In discriminative approaches, the aim is not only to optimize the fit of a model to its training data but also to optimize the discrimination against the other models at the same time.

❑ Examples for such approaches are neural networks or learning vector quantization (LVQ).

❑ The advantage of such methods is in general an improved recognition rate.

❑ However, it is more difficult to include new speakers into the models when discriminant approaches are used. If this should be necessary, all model parameters (even those of already known speakers) have to be recalculated – while only a new speaker model had to be generated in the approaches discussed so far.


•


Discriminative Approaches – Part 2

Neural networks (e.g. with radial basis functions):

❑ Input data of the neural network are the feature vectors of the training data of all speakers.

❑ The desired output is a vector which contains a 1 at the index of the current speaker. All other vector elements are set either to 0 or to −1.

❑ Standard training methods for neural networks attempt to minimize the quadratic distance between the output of the neural network and the desired output. This leads to discriminant methods.
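The desired output and the training criterion can be illustrated with a few lines. The function names and the example values are hypothetical; the value for the non-target entries is left as a parameter.

```python
def target_vector(speaker_index, num_speakers, off_value=0.0):
    """Desired network output: 1 at the index of the current speaker,
    off_value at all other positions."""
    return [1.0 if i == speaker_index else off_value for i in range(num_speakers)]


def squared_error(output, target):
    """Quadratic distance between the network output and the desired output."""
    return sum((o - t) ** 2 for o, t in zip(output, target))


target = target_vector(speaker_index=2, num_speakers=4)
network_output = [0.1, 0.0, 0.7, 0.2]
print(target, squared_error(network_output, target))
```

Because the error also penalizes non-zero outputs at the positions of the other speakers, minimizing it pushes the network to separate the speakers, which is what makes the training discriminant.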


•




•

Speech and Speech Recognition

Speech Generation and Speech Recognition

Overview:

[Figure: chain from speech generation to speech recognition with the blocks: creation of the message, integration of language (English), neuromuscular activities, vocal cords, vocal tract, „acoustic“ channel, movement of the basilar membrane, neuronal activity, conversion based on a language, understanding of the message]


•


„History“

1952 at Bell Labs

❑ First digit recognition

❑ Estimation of the energy in the formant frequencies (resonant frequencies of the vocal tract)

In the 1960s

❑ Improved digit recognition

❑ Breakthroughs in spectral estimation (FFT, cepstrum), dynamic time warping, and hidden Markov models

Hidden Markov models in speech recognition

❑ Mathematics by Baum et al. (1966 – 1972)

❑ Application to speech recognition by Baker (CMU Dragon System, 1974)

❑ Development at IBM (Baker, Jelinek, Bahl, Mercer, and others)

(Deep) Neural Networks

❑ Helped this technology to become as successful as seen in today's products (usually server-based architectures)

❑ Development started about 15 years ago


•


Speech Dialog Systems

Overview of a speech dialog system:

[Block diagram: speech signal → speech recognition → text → parser → semantic representation → dialog manager → semantic representation → prompter → text → speech synthesis → speech signal]


•


Fundamental Principle of Speech Recognition

[Block diagram: speech signal with background noise → preprocessing → speech signal with little disturbances → feature extraction → features (e.g. MFCCs) → classification → evaluation („N-best“ list) → decision → recognized text]

Preprocessing:

❑ Reduces background noise and echoes

❑ Combines several microphone signals

Feature extraction:

❑ Compresses the amount of data

❑ Extracts the important parameters for the speech recognition

Classification:

❑ Performs an evaluation for each activated class

❑ As a result, often the N-best evaluations are determined

Decision:

❑ Determines the best entry based on additional prior knowledge (word probabilities)


•


Variants of Speech Recognition Systems – Part 1

Single word recognizer

❑ Single words or short commands

❑ The (command) words are spoken in isolation (with pauses)

Keyword spotter

❑ Single words or word sequences within arbitrary utterances

❑ In case the keyword was detected, a new recognizer is started

Recognition of connected words

❑ Sequence of fluently spoken words from a small vocabulary

Continuous speech recognizer

❑ Whole, fluently spoken sentences


•



Speaker-dependent systems

❑ Such systems have to be trained individually for each speaking person.

❑ The training phase can take some time (depending on the size of the vocabulary and on the desired quality).

Speaker-independent systems

❑ There is no need for a (speaker specific) training phase.

❑ To obtain an appropriate quality, a large training database has to be provided.

Speaker-adaptive systems

❑ Such systems start with a universal model, which is then gradually adapted to the speaker.

❑ This can be done during a short training phase or during runtime.


•



Systems with small vocabulary

❑ Up to a few hundred words

❑ Typically used for control tasks

Systems with large vocabulary

❑ Several hundred thousand words

❑ Dictation, address input

❑ Vocabularies of this size often contain many phonetically similar words.

❑ To reduce the number of mistakes, usually a so-called language model is needed (it describes the relationship between words).


•


Evaluation of Speech Recognition Systems – Part 1

Basics

❑ Usually, speech recognition systems and speech dialogue systems are evaluated by means of word error rates.

❑ In practice, however, this value is overrated. Other criteria are also important, for example how much computing power and memory the system needs or after which time the result is available.

❑ For the evaluation of speech dialogue systems, the quality of the speech synthesis, the so-called start-up time, and many other values are also of special interest.

Word error rate


•



Word error rate

❑ The word error rate can be computed efficiently by means of dynamic programming. Defining the reference word sequence by

and the word sequence determined by the recognizer by

the following distance between the sequences can be defined:


•



Word error rate

❑ Initialization:

❑ Derivation of the word error rate

❑ Note that with this definition, word error rates greater than 100 % can occur (due to lots of insertions).
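The dynamic-programming computation can be sketched as a word-level edit distance. The sketch assumes the common definition in which substitutions, deletions, and insertions all cost 1 and the distance is divided by the number of reference words; the recursion and initialization of the slides were lost in extraction, so they may differ in detail. A non-empty reference is assumed.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Initialization: turning an empty reference into j hypothesis words
    # costs j insertions.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("one two three", "one three"))
print(word_error_rate("yes", "oh yes yes please"))
```

The second call illustrates the remark above: three insertions against a single reference word give a word error rate of 300 %.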


•


Maximum A-posteriori Probability Rule (MAP Rule) – Part 1

Recognition criterion

❑ For a given feature sequence, the word sequence that exhibits the maximum a posteriori probability should be selected out of all possible (permitted) word sequences:

❑ Using Bayes’ theorem, it can be concluded:

❑ Since the probability of the feature sequence is constant for the maximization, it has no influence on the decision and can be neglected:
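The three steps can be written out as follows. The symbols X for the feature sequence and W for the word sequence follow the notation p(X|W) and p(W) that is used for the acoustic model and the language model later in this part; the hat notation for the selected word sequence is an assumption.

```latex
\begin{align}
\hat{W} &= \operatorname*{argmax}_{W} \; p(W \mid X) \\
        &= \operatorname*{argmax}_{W} \; \frac{p(X \mid W)\, p(W)}{p(X)} \\
        &= \operatorname*{argmax}_{W} \; p(X \mid W)\, p(W)
\end{align}
```

The last step holds because p(X) does not depend on W and therefore does not change the position of the maximum.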


•


Maximum A-posteriori Probability Rule (MAP Rule) – Part 2

Recognition criterion

❑ Optimization function:

❑ The probability p(X|W) indicates the probability of observing the feature sequence X if the word sequence W was spoken. Hidden Markov models (HMMs) have proved to be suitable for modeling such probabilities. This part of the optimization criterion is called the acoustic model.

❑ The probability p(W) is the a priori probability of the word sequence W. This probability is independent of the observation sequence and describes a priori knowledge about the word sequence (e.g. that some words occur more often than others). This part of the optimization criterion is called the language model.


•


Computationally Efficient Model Restrictions – Part 1

Limitations of HMM model degrees of freedom

❑ If at first a large number of Gaussian distributions is permitted for the individual HMMs or for the states of the HMMs, one can try to use the same mean vectors and covariance matrices for all of them. The weights of the individual Gaussian distributions can be selected individually for each model or model state.

❑ This kind of hidden Markov model is called a semi-continuous HMM.

❑ Through this, a lot of memory and computational load can be saved:

❑ The needed memory is reduced due to the fact that the mean vectors and covariance matrices can be reused for all models (and do not have to be saved individually for all models or model states).

❑ Also, the contributions of the single Gaussian distributions are computed only once per time frame, i.e. once per feature vector (and not individually for each model or each model state).
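The saving in computation can be seen in a small sketch: the shared Gaussian densities are evaluated once per feature vector, and every model state only contributes its own weight vector. Diagonal covariance matrices, the function names, and the toy numbers are assumptions for illustration.

```python
import math


def gauss_diag(x, mean, var):
    """Gaussian density with diagonal covariance matrix."""
    exponent = -0.5 * sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))
    norm = math.sqrt(math.prod(2.0 * math.pi * v for v in var))
    return math.exp(exponent) / norm


def emission_probabilities(x, shared_means, shared_vars, state_weights):
    """Emission densities of all states of a semi-continuous HMM."""
    # Evaluated once per time frame, independent of the number of states.
    densities = [gauss_diag(x, m, v) for m, v in zip(shared_means, shared_vars)]
    # Per state, only a weighted sum over the shared densities remains.
    return [sum(w * d for w, d in zip(weights, densities))
            for weights in state_weights]


shared_means = [[0.0], [2.0]]
shared_vars = [[1.0], [1.0]]
state_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three states, two shared Gaussians
probabilities = emission_probabilities([0.0], shared_means, shared_vars, state_weights)
print(probabilities)
```

Restricting each state to a few active Gaussians, as described on the next slide, would turn each weight vector into a short list of indices and weights.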


•


Computationally Efficient Model Restrictions – Part 2

Limitations of HMM model degrees of freedom

❑ In addition, only a limited number of Gaussian distributions can be assigned to each HMM or HMM state. In this case, the computational load (and also the memory to a small extent) can be further reduced.

❑ This is done by storing only the indices of the active Gaussian distributions of each model state (e.g. 8 to 32 Gaussian distributions out of 512 to 2048).

❑ The base Gaussian distributions can be derived, for example, from a big speech database by using the EM algorithm.


•


Base Units of Hidden Markov Models – Part 1

[Figure: extraction of acoustic base units from the recordings of approximately 1000 speakers]

Acoustic modeling

❑ Acoustic basic units are extracted from a big speech database (e.g. 1000 speakers, each talking for approximately one hour) to train the HMMs.

❑ Such basic units can be phonemes, but also phoneme pairs or groups of 3 phonemes.

❑ In addition, single word models are often trained for key words (e.g. numbers).

❑ There are about 50 phonemes in each language. For phoneme pairs and groups of 3 phonemes, the number of groups actually occurring within a language is considerably smaller than 50² and 50³, respectively.


•


Base Units of Hidden Markov Models – Part 2

Acoustic modeling

❑ The composition of the required models (determined by the vocabulary) out of base units has the advantage that it is pretty simple (see following slides) and can be done during runtime of the speech recognizer.

❑ Therefore, it is possible to wait for the answer within a speech dialog, generate a corresponding response by means of a speech synthesis system, and then start the next recognition run. With this recognition method, only those words are activated that make sense at this point of the dialog.

❑ This allows the vocabulary to be kept small, which leads to fewer errors and a lower computational load.

❑ This is especially important for database queries (e.g. Google search, operation of an MP3 player, etc.).


•


Training versus Processing During Run-Time of a Recognizer

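[Block diagram. Training: the training material (speech data) is used for the training of the acoustic model p(X|W); the training material (text) passes through text processing and is used for the training of the language model p(W). During run-time: the speech signal passes through feature extraction and decoding; the decoding uses the vocabulary, the acoustic model p(X|W), and the language model p(W).]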


•


Composition of HMMs – Part 1

Parallel connection of HMMs

❑ For the parallel composition of HMMs, only the transition probabilities and the a priori probabilities have to be combined.

❑ Example for a parallel connection of two simple left-right models, each with two emitting states:


•



Parallel connection of HMMs

❑ Example for the parallel connection of two simple left-right models:


•



Series connection of HMMs

❑ Example for the series connection of two simple left-right models:
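Both kinds of connection can be sketched on the level of initial probabilities and transition matrices. The convention used below is an assumption and not necessarily the one of the slides: each model is a pair (initial probabilities, transition matrix), and a row of the transition matrix may sum to less than one, the remainder being the probability of leaving the model. The functions and the parameter `prior_a` are hypothetical.

```python
def connect_parallel(model_a, model_b, prior_a=0.5):
    """Parallel connection: block-diagonal transition matrix and
    a priori probabilities scaled by the priors of the two models."""
    (pi_a, trans_a), (pi_b, trans_b) = model_a, model_b
    n_a, n_b = len(pi_a), len(pi_b)
    pi = [prior_a * p for p in pi_a] + [(1.0 - prior_a) * p for p in pi_b]
    trans = [list(row) + [0.0] * n_b for row in trans_a]
    trans += [[0.0] * n_a + list(row) for row in trans_b]
    return pi, trans


def connect_series(model_a, model_b):
    """Series connection: the probability of leaving the first model is
    redirected to the entry states of the second model."""
    (pi_a, trans_a), (pi_b, trans_b) = model_a, model_b
    n_a, n_b = len(pi_a), len(pi_b)
    pi = list(pi_a) + [0.0] * n_b
    trans = []
    for row in trans_a:
        exit_probability = 1.0 - sum(row)
        trans.append(list(row) + [exit_probability * p for p in pi_b])
    trans += [[0.0] * n_a + list(row) for row in trans_b]
    return pi, trans


# Simple left-right model with two emitting states; the second state is
# left with probability 0.3.
left_right = ([1.0, 0.0], [[0.6, 0.4], [0.0, 0.7]])
parallel_pi, parallel_trans = connect_parallel(left_right, left_right)
series_pi, series_trans = connect_series(left_right, left_right)
print(parallel_pi)
print(series_trans[1])
```

The emission densities of the states are simply carried over in both cases, which is why the composition is cheap enough to be done during run-time.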


•



Generation of the active vocabulary

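❑ The HMMs have to be connected with each other as efficiently as possible (graph theory).

❑ Example (fragment) for German double digits:

[Figure: word graph in which the unit words „ein“, „zwei“, „drei“, …, „neun“ lead to „und“, which in turn leads to the tens words „dreißig“, „vierzig“, „fünfzig“, …, „neunzig“]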


•


Research Directions

Feature extraction:

❑ Psychoacoustically motivated feature extraction

❑ Use of additional information (speaker direction, etc.)

Acoustic modeling:

❑ Use of base units other than phonemes

❑ Improved modeling (e.g. with sound duration models)

❑ Using neural networks and mixed approaches

Adaption:

❑ Adaption with regard to the current speaker

❑ Feature transformation to reduce the dependence on the recording conditions

Training:



•




•

Speaker and Speech and Speech Recognition

Summary and Outlook

Summary:


❑ Motivation


❑ Model adaption



❑ Fundamentals


Next week:

❑ Neural networks
