Informati cs Dpt. 1 05- 06/2001 Digital Days Speaker Verification Alexandros Xafopoulos.

05-06/2001

Digital DaysDigital Days

Informatics Dpt.

1

SpeakerVerification

Alexandros Xafopoulos

05-06/2001


Informatics Dpt.

2

Presentation OutlinePresentation Outline

• Framework• Preprocessing -• Features (Extraction, Noise

Compensation-Channel Equalization, Selection)

• (Pattern) Matching-Modeling• Decision Making -• Performance Evaluation• Experimental Results• References

05-06/2001


Informatics Dpt.

3

IntroductionIntroductionFramework

• Motivation– Speech contains speaker specific

characteristics (physiological-behavioral)• vocal tract• pitch range - vocal cords• articulator movement

– Mouth– Nasal cavity– Lips

– Voiceprint as a biometric (distinguishing trait)

– Natural & economical way

05-06/2001


Informatics Dpt.

4

Framework

• Objective– Discriminate betw. a given speaker & all

others

• Definitions– Verification < Latin verus (true)

• Claim: Speaker identity• Proof: Speech utterance• Binary decision to establish the truth

– Client: speaker registered on the system– Impostor: speaker who claims a false identity– Model: set of parameters that represents a

speaker or a group of speakers

Introduction(2)Introduction(2)

05-06/2001


Informatics Dpt.

5

Digital Signal

Processing

• Signal Processing

Analog Signal

Processing

Speaker Recognition

Storage / Transmissio

n

Other Signals

Speech Processing

Recognition Coding / Synthesis

Enhancement

Analysis

Signal processing

Speaker Identificatio

n

Language Identificatio

n

Speech Recognition

Speaker Detection / Tracking

Speaker Verification

FrameworkRelated Research AreasRelated Research Areas

05-06/2001


Informatics Dpt.

6

• (Statistical) Pattern Recognition

Data FeatureExtractor

Features TrainedClassifier

ClassLabel

FrameworkRelated Research Areas(2)Related Research Areas(2)

05-06/2001


Informatics Dpt.

7

• Biometrics Technology– def: automatic identification of a person

based on his/her physiological or behavioral characteristics (=biometrics)

– desirable properties of biometrics [Jain_bk]• universality (found in every person)• uniqueness (different ”value” for each person)• permanence (invariant with time)• collectability (quantitatively measurable)• performance ( accuracy vs. resources)• high acceptability (person’s willingness)• low circumvention (not easy to ”fool”)


05-06/2001


Informatics Dpt.

8

• Speech Science Communication by speech [Somervuo]


05-06/2001


Informatics Dpt.

9

Framework

Speech Production Physiology

[Picone]

[Picone]

• Speech Science(2)

Related Research Areas(5)Related Research Areas(5)

Block Diagram of Human Speech Production

05-06/2001


Informatics Dpt.

10

Framework

General Discrete-Time Model for Speech Production[Morgan] (corrected)• Speech Science(3)

Related Research Areas(6)Related Research Areas(6)

Gain for voice source

Gain for noise source

G(z)

H(z) R(z)

05-06/2001


Informatics Dpt.

11

Framework

• Enrollment (Training) module

Digital SignalAcquisition

DigitalSpeech Feature

Creation

FeatureVectors

Speech PressureWave of ”A”

Speaker ”A”N utterances

ModelRegistration

N Sets ofFeatureVectors

Known Identity: ”Speaker is ”A””

SpeakerModel of ”A”

Generic Speaker Verification Generic Speaker Verification ProcessProcess

Channel to transfer signal

05-06/2001


Informatics Dpt.

12

Framework

• Enrollment module(2)– Digital signal acquisition

• Sampling frequency:

Microphone

AnalogVoltageSignal

Antialiasinglow-pass

filter

ConditionedAnalogSignal

Speech PressureWave

Sampling &Quantization

(A/D converter)

Digital Speech

Generic Speaker Verification Generic Speaker Verification Process(2)Process(2)

sF

05-06/2001


Informatics Dpt.

13

Framework

• Enrollment module(3): Feature creation

Preprocessing

Preprocessed Digital Speech

FeatureExtraction

PlainFeatureVectors

DigitalSpeech

Noise Compensation

& Channel

Equalization(Clean)FeatureVectors


Feature Selection

05-06/2001


Informatics Dpt.

14

Framework

• Verification module

DigitalSignal

Aquisition

DigitalSpeech Feature

Creation

FeatureVectors

Speech PressureWave of ”A”

PatternMatching

Thresholdof ”B”

Acceptance (A=B)or Rejection (AB)

MatchingResults Decision

Making

|P| SpeakerModels Model

Selection

SpeakerModel of ”B”

ClaimedIdentity B


05-06/2001


Informatics Dpt.

15

Framework

• Threshold setting module

SpeakerModel of ”A”

ThresholdSetting

Thresholdof ”A”

SpeakerModels of


"A"

model) (world or model)(cohort A:A h

• Cohort model: competitive clients only• World model: all the clients

05-06/2001


Informatics Dpt.

16

Framework

• Text-dependency [Nedic]– Text dependent: verification done on a fixed

phrase, predetermined by the recognizer (fixed phrase)

– Text prompted: verification done on system generated sequence of predetermined words (fixed vocabulary)

– User customized: verification done on user requested phrase

– Text independent: verification done on any phrase– Language independent: verification done on any

language

• Vocabulary– Fixed or not– Size (|V|)

Corpus ParametersCorpus Parameters

05-06/2001


Informatics Dpt.

17

Framework

• Population (Speakers)– Size (|P|)– Similarity

• Speech Flow– Discrete Utterance (pauses betw. words)– Continuous– Spontaneous (natural)

• Quantity (#sessions, #phrases, phrase duration)

• Quality of speech (Problems)

Corpus Parameters(2)Corpus Parameters(2)

05-06/2001


Informatics Dpt.

18

Framework

• Microphone / Communication channel / Digitizer quality

• Channel - Environmental mismatch (different channels - environments for enrollment & verification request)

• Mimicry by humans & tape recorders• Bad pronunciation• Extreme emotional states (e.g. anger)• Sickness / Allergies / Tiredness / Thirst• Aging• Environmental noise / Poor room acoustics

Problems under real conditionsProblems under real conditions

05-06/2001


Informatics Dpt.

19

Framework

• False Rejection– A client makes a request to be verified as

himself/herself & the request is rejected– High rate client: goat [Koolwaaij]– Low rate client: sheep

• False Acceptance– An impostor makes a request to be verified as a

client & the request is accepted– High rate client: lamb– Low rate client: ram– High rate impostor: wolf– Low rate impostor: badger

ErrorsErrors

05-06/2001


Informatics Dpt.

20

Framework

• Access control to databases / facilities

• Electronic commerce• Remote access to computer

networks• Forensic• Telephone banking [James]

ApplicationsApplications

05-06/2001


Informatics Dpt.

21

Preprocessing

Preemphasis-Frame BlockingPreemphasis-Frame Blocking

• Preemphasis: Low order digital system to– spectrally flatten the signal (in favour of vocal tract

parameters)– make it less susceptible to later finite precision

effects– usually (order=1):

• Frame blocking (short-term(st) processing)– L successive overlapping (by M samples) frames

– window size - length: N samples = N/ sec– frame rate-shift-period: M samples = M/ sec

]1,90[ ),1()()( .αnsαnsns pepepe

LlN nlMnsnlf pe ,...,1 ,1,...,0)),1(();(

sFsF

05-06/2001


Informatics Dpt.

22

Preprocessing

Frame WindowingFrame Windowing

• Used to minimize the singal discontinuities at the beg. & end of each frame– Time (long window)<->freq. (short) resolution

– Window type:

– Corrections:

[Picone]

1,...,0),();();( N nnwnlfnlfw

1,...,0 , ,1 NnnkNN

05-06/2001


Informatics Dpt.

23

Preprocessing

Speech Activity DetectionSpeech Activity Detection

• Silence-speech detection• Voiced-unvoiced discrimination• Endpoint detection [Deller_bk]• Can be applied afterward

05-06/2001


Informatics Dpt.

24

Preprocessing

Signal Measures & GraphsSignal Measures & Graphs

Zerocrossing rate

[Weingessel]

Speech waveform

Energy plotTime-frequency plot (Spectrogram)

05-06/2001


Informatics Dpt.

25

Feature ExtractionFeatures - GeneralFeatures - General

• Maps each speech interval-frame to a multidimensional feature space

• Order : number of coefficients in each feature vector (dimensionality)

• Several kinds of coefficients have been proposed

coefN

05-06/2001


Informatics Dpt.

26

Feature ExtractionLinear Prediction (LP)Linear Prediction (LP)

• Speech sample as a linear combination of previous samples (autoregressive mdl):

– : LP coefficients (LPC)– : normalized excitation source– G : scale factor– : stLPC of

frame l

)()()()(1

nGumnsmansLPCN

mLPC

LPCLPC Nmmla ,...,1 ),;(

)(nu

LPCN

LPCLPC Nmma ,...,1 ),(

05-06/2001


Informatics Dpt.

27

Feature ExtractionLinear Prediction (LP)(2)Linear Prediction (LP)(2)

• Calculation of stLPC– Mean squared error

minimization– Autocorrelation method

• Levinson-Durbin (L-D) recursion

– Covariance method• Cholesky (LU)

decomposition

[Picone2]

L-D recursion(l is implied,

R: autocorrelationmatrix)

05-06/2001


Informatics Dpt.

28

Feature ExtractionLinear Prediction (LP)(3)Linear Prediction (LP)(3)

• LPC– highly correlated– not orthonormal

• Distance: Itakura-Saito– Computationally expensive

• LPC processor [Rabiner_bk]

05-06/2001


Informatics Dpt.

29

Feature ExtractionCepstrum (Complex-Real)Cepstrum (Complex-Real)

• Special case of homomorphic signal proc.

• Focuses on voiced segments• Short-term complex cepstrum (stCC):

• Short-term real cepstrum (stRC):

• Distance of cepstrum based coefficients– Euclidean: vectors defined in an

orthonormal space

CCwCC Nmnlfmlc ,...,1 })},);((DFT{{logDFT);( 101

RCwRC Nmnlfmlc ,...,1 |},});(DFT{|{logDFT);( 101

05-06/2001


Informatics Dpt.

30

Feature ExtractionMel CepstrumMel Cepstrum

• Mel– unit of measure of

perceived frequency of a tone

– non-linear correspondance to the physical freq. (like the human ear)

– mel freq. cepstral coefficients (MFCC):

– generalized case [Vergin]

Mel-cepstral feature generation (frame l)

[Young]

MFCCMFCC Nmmlc ,...,1 ),;(

7001log2595 10

Hzmel

ff

)()( , melfiltersmelFFT NN

05-06/2001


Informatics Dpt.

31

Feature ExtractionLP derived CepstrumLP derived Cepstrum

• LP Cepstral Coefficients (LPCC):

);();();();(

:,...,11

1

kmlaklcm

kmlamlc

Nm

LPCLPCC

m

kLPCLPCC

LPC

);();();(

:,...,11

kmlaklcm

kmlc

NNm

LPCLPCC

m

NmkLPCC

LPCCLPC

05-06/2001


Informatics Dpt.

32

Feature ExtractionOther cepstral variantsOther cepstral variants

• Linear Freq. Cepstral Coefficients (LFCC)– Like MFCC but:– filters are uniformally spaced on the Hz

scale

• Mel-warped LPCC (MLPCC) [Kuitert]– CC not directly derived from LPC– 1st compute the log magn. spectrum of LPC– then warp the freq. axis to correspond to

the mel axis

05-06/2001


Informatics Dpt.

33

Feature ExtractionVariantsVariants

• Discrete Wavelet Transform (DWT) instead of FFT [Krishnan]

• Application of other type than triangular filters

• Application of the logarithm before the triangular filters

05-06/2001


Informatics Dpt.

34

Feature ExtractionDelta CepstrumDelta Cepstrum

• [Milner]:

• Inclusion of temporal information

GCCK

Kk

GCC

K

KkGCC Nm

k

mklckmlc ,...,1 ,

);();(

2

05-06/2001


Informatics Dpt.

35

Feature ExtractionPLP - Auditory FeaturesPLP - Auditory Features

• Perceptual Linear Prediction (PLP) [Hermansky]– Spectral scale: non-linear Bark scale

– Spectral features smoothed within freq. bands

• Auditory Features [Kumar]– Imitates signal proc. performed by the ear– cochlear modeling

2

2

75003.5atan

1000

76.0atan13 HzHz

Bark

fff

05-06/2001


Informatics Dpt.

36

Noise Compensation-Channel EqualizationIntra-frame cepstral proc.Intra-frame cepstral proc.

• Liftering-weighting– low order coeffs: sensitive to overall spectral

slope– high order: sensitive to noise– =>tapered window (bandpass liftering)

• Adaptive Component Weighting (ACW)– motivation: all frames don't have same

distortion

GCCGCCGCCw

GCCGCC

GCC

Nmmlcmwmlc

NmN

mNmw

,...,1 ),;()();(

,...,1 ,π

sin2

1)(

[Mammone]

05-06/2001


Informatics Dpt.

37

Inter-frame cepstral proc.Inter-frame cepstral proc.

• Cepstral Mean Subtraction (CMS)– mean (over a num of frames) subtraction

(tackles training-testing discrepancy)

– lowpass filtering– eliminates communication channel

spectral shaping

• Pole Filtered CMS (PFCMS): cepstrum poles modification

GCCGCCkGCCGCCCMS Nmmkcmlcmlc ,...,1 )),;((avg);();(

Noise Compensation-Channel Equalization

05-06/2001


Informatics Dpt.

38

RASTA proc.RASTA proc.

• Relative Spectral Filtering (RASTA) [Hermansky]– bandpass filtering in the log-spectral

domain– suppresses spectral components that

change more slowly or quickly than in typical speech

– RASTA-PLP• Microphone (type, position) robustness

Noise Compensation-Channel Equalization

05-06/2001


Informatics Dpt.

39

Feature SelectionFeature Selection IntroductionFeature Selection Introduction

• Goal– find a transformation to a relatively low-

dimensional feature space that preserves the information pertinent to the application while enabling meaningful comparisons to be performed using measures of similarity

• Processing of features– Principal Component Analysis (PCA) (or

Karhunen Loève Expansion-KLE)• seeks a lower dimensional representation that

accounts for variance of the features• not necessarily optimum for class discrimination

– Linear Discriminant Analysis (LDA) [Jin]– Non LDA (NLDA) (using MLP) [Konig]

05-06/2001


Informatics Dpt.

40

Matching-ModelingMatching-Modeling IntroductionMatching-Modeling Introduction

• Modeling: creation of (speaker) models• Model: Can be considered as the output of a

proper proc. of a speaker’s set of feature vectors

• Matching: computation of a match score betw. the input feature vectors & some speaker model

• Methods [Wassner]– Template Matching

• deterministic• score: distance betw. a test speaker (feature vectors of

an) utterance & a reference speaker model• better score: min distance

05-06/2001


Informatics Dpt.

41

Matching-ModelingMatching-Modeling Introduction(2)Matching-Modeling Introduction(2)

• Methods(2)– Stochastic Approach

• probabilistic matching• score: prob. of generation of a speech

utterance by the claimed speaker• better score: max probability• Parametric speaker model: specific pdf is

assumed & its appropriate parameters (e.g. mean vector, covariance matrix) can be estimated using the Maximum Likelihood Estimation (MLE) e.g. multivariate normal model

)|( cSUP

05-06/2001


Informatics Dpt.

42

Matching-ModelingTemplate Matching MethodsTemplate Matching Methods

• Dynamic Time Warping (DTW)– dynamic comparison betw. a test & a

reference (model) matrix (set of feature vectors)

– computes a distance betw. the test & ref. patterns

– allows time alignment at different costs– uses Dynamic Programming (DP)

05-06/2001


Informatics Dpt.

43

Matching-ModelingTemplate Matching Methods(2)Template Matching Methods(2)

• Dynamic Time Warping (DTW)(2)

The DP gridwith test (t)

& reference (r)feature vectorsat respectiveframe indices

[Picone]

05-06/2001


Informatics Dpt.

44


• Dynamic Time Warping (DTW)(3)– distances-costs on the DP grid (i,j frame

indices, k step index)• Node

– e.g.

• Transition e.g.• Both

– e.g.

• Global

– K: number of transitions

)],(|),[( 11 kkkkT jijid

),( kkN jid

)],(|),[( 11 kkkkB jijid

LPCCN

mk

refLPCCk

testLPCC mjcmic

1

2));();((

)],(|),[(),( 11 kkkkTkkN jijidjid

][][ 11 kkkk jjii

K

kkkkkB jijidD

111 )],(|),[(

(Type 4)

05-06/2001


Informatics Dpt.

45


• Dynamic Time Warping (DTW)(4)– DTW search constraints

• Endpoint Constraints (bottom left(S) - top right(E) corners)

– endpoint relaxation: max points allowed in each direction

• Monotonicity (going up & right)• Global Path Constraints (global movement area)

– permissible slope or– permissible window Wij kk

jEiEjSiS Δ,Δ,Δ,Δ

kkkk jjii 11

05-06/2001


Informatics Dpt.

46


• Dynamic Time Warping (DTW)(5)– DTW search

constraints(2)• Local Path Constraints (local movement area)

Sakoe & Shibalocal constraints

on DTWpath search

[Picone]

05-06/2001


Informatics Dpt.

47


• Dynamic Time Warping (DTW)(6)– The minimum cost final endpoint

provides the distance betw. a test & a reference phrase

– Training-Modeling [Deller_bk]• Casual: Unaltered feature strings form models• Averaging feature strings of utterances• The stochastic techniques possess superior

training methods

05-06/2001


Informatics Dpt.

48


• Vector Quantization (VQ)– Uses intra-vector dependencies to break-

up a vector space in cells (unsupervised)– follows Linde-Buzo-Gray (LBG) algorithm– speaker model: codebook– codebook: set of prototype vectors used

to represent vector spaces– goal: data structure "discovery" by

finding how the data is clustered

05-06/2001


Informatics Dpt.

49


• Learning Vector Quantization (LVQ)– Predefined classes, labeled data– defines the class borders according to the

nearest neighbor rule– supervised version of VQ– set of variants (e.g. LVQ1,2,3)– goal: to determine a set of prototypes

that best represent each class.

05-06/2001


Informatics Dpt.

50

Matching-ModelingStatistical MeasuresStatistical Measures

• Second Order Statistical Measures (SOSM) [Bimbot]– E.g. Arithmetic-Harmonic-Sphericity

(AHS)• speaker model: covariance matrix of feature

vectors• Distance=min(=0) iff all eigenvalues of test &

ref covar matrices are equal

05-06/2001


Informatics Dpt.

51

Matching-ModelingGenerative ModelsGenerative Models

• Hidden Markov Models (HMMs)– Statistical - stochastic– Flexible– Types

• Continuous Density (CD)• Discrete• SemiContinuous (SC) [Falavigna]

– Model: prob. distributions e.g. mixtures of Gaussians of the feature vectors of the speaker

05-06/2001


Informatics Dpt.

52

Matching-ModelingGenerative Models(2)Generative Models(2)

• Hidden Markov Models (HMMs)(2)– Topologies

• Left-Right (LR) (self & right connections): attempts to catch the temporal structure of the speech & to link consecutive short-time observations together

#states/unit (e.g. phoneme) #gaussian distributions(mixtures)/state

[Kumar]Example of a left-right HMM

: feature vectorskO

05-06/2001


Informatics Dpt.

53


• Hidden Markov Models (HMMs)(3)– Topologies(2)

• Ergodic (fullyconnected)

-AR HMMs: the prob.distrib. associatedwith each state isestimated via an ARprocess [Bourlard]

[Picone]Example of an ergodic HMM

05-06/2001


Informatics Dpt.

54


• Gaussian Mixture Models (GMMs)– Single multi-Gaussian state ”HMM”– Uses a mixture of Gaussian densities to

model the distribution of the feature vectors of each speaker

– Local covariance info

05-06/2001


Informatics Dpt.

55

Matching-ModelingNeural Networks (NN)Neural Networks (NN)

• Feed-Forward Neural Networks– supervised learning– each speaker has his own NN (each

checked in turn to find the best match)– classifier-matcher: NN output– positive/negative training (rivals)

05-06/2001


Informatics Dpt.

56

Matching-ModelingNeural Networks (NN)(2)Neural Networks (NN)(2)

• Feed-Forward NNs(2)– Types [Haykin_bk]

• Multilayer Perceptron (MLP): trained usually with the Back-Propagation (BP) algorithm

– Error Correction Learning– Global optimization

• Time Delay NNs (TDNN)• Radial Basis Function (RBF) Networks [Lo]

– Memory-Based Learning– Local optimization

• Support Vector Machines (SVM)– Learning by examples– Vapnik-Chervonenkis (VC) dimension: framework

for the development of SVMs

05-06/2001


Informatics Dpt.

57

Matching-ModelingNeural Networks (NN)(3)Neural Networks (NN)(3)

• Self Organizing Maps (SOM)– unsupervised learning– method to form a topologically ordered

codebook– speaker model: codebook– density of prototype vectors approaches

the pdf of the input vectors during the training

– nonlinear projection– competitive (winner neuron) learning

05-06/2001


Informatics Dpt.

58

Matching-ModelingNNs & Combined MethodsNNs & Combined Methods

• DTW-SOM– associate an entire feature vector

sequence, instead of a single feature vector, as a model with each SOM node (also DTW-LVQ) [Somervuo]

• Recurrent NNs (RNN) [Shrimpton]– (self-or not) feedback

• Combined methods [Genoud]

05-06/2001


Informatics Dpt.

59

Matching-ModelingSub-band Proc. IntroductionSub-band Proc. Introduction

• Speech signal split into band-limited channels (freq. ranges)

Block diagram of an LPCC-based sub-band processing system

[Finan]

05-06/2001


Informatics Dpt.

60

Decision MakingDecision ApproachesDecision Approaches

• ”Template” approach– threshold setting: based on inter- & intra-

speaker scores/distances– comparison:

test score<=thresholdacceptance [Fakotakis]

• Statistical approach [Bengio] [Bourlard]– : speaker RV for identity c being claimed– : utterance represented by feat. vectors– : other speakers RV)(

)()|()|(

UP

SPSUPUSP cc

c

cS

U

cS

05-06/2001


Informatics Dpt.

61

Decision MakingDecision Approaches(2)Decision Approaches(2)

• Statistical approach(2)– Claim c is true if:

– : decision threshold usu. found assuming Gaussian distributions for and

normalised likelihood - likelihood ratio– using logs:

Log Likelihood Ratio (LLR)

cccc

c

c ΘSUPSUPSUP

SUP )|(log)|(loglog

)|(

)|(log

cc

c

c

c

c

c

SP

SP

SUP

SUP

USP

USP )(

)(

)|(

)|(1

)|(

)|(

c)|( cSUP )|( cSUP

05-06/2001


Informatics Dpt.

62


• Statistical approach(3)– : speaker dependent model– : normalization factor– cohort model : group of selected

speakers who are more competitive with the model of the claimed id• No well-established selection procedure

– world model : all other speakers• less computation & storage needed

)|( cSUP

cS

chc SS )|( cSUP

05-06/2001


Informatics Dpt.

63


• Statistical approach extensions– If– sign(y) gives the decision– Techniques:

• Bayes Decision Rule (assumes prob.s perfectly estimated)

– Minimizes Half Total Error Rate(HTER)

• Linear Regression• SVM Regression

ccc ΘSUPSUPy )|(log)|(log

2

FR%FA%HTER

05-06/2001


Informatics Dpt.

64

Decision MakingThreshold SettingThreshold Setting

• speaker dependent– |P| thresholds:

• speaker independent– 1 threshold:

• leave one (client o) out– |P|*|P| thresholds:

• a priori: computed on training set (enrollment data) [Lindberg]

• a posteriori: computed on test set (obtained during actual use of the system)

|P|1,...,c , c

|P|o|P|cco 1,..., ,1,..., ,

05-06/2001


Informatics Dpt.

65

Decision MakingHypothesis TestingHypothesis Testing

Valid & impostor densities

[Campbell]

05-06/2001


Informatics Dpt.

66

Decision MakingHypothesis Testing(2)Hypothesis Testing(2)

Probability terms & definitions

[Campbell]

05-06/2001


Informatics Dpt.

67

Performance EvaluationAccuracyAccuracy

• Error %s– FAR (False Acceptance Rate):

• Prob. of false acceptance

– FRR (False Rejection Rate):• Prob. of false rejection

– Values for FAR & FRR are adjusted by changing the threshold values: FAR vs. FRR

claims false#

sacceptance false#

claims true#

rejections false#

05-06/2001


Informatics Dpt.

68

Performance EvaluationAccuracy(2)Accuracy(2)

• Error %s(2)– EER (Equal Error Rate): operating point

where FAR FRR– Choice of 2 subsequent operating points

to approximate the EER value

– MDE (Minimum Decision Error): operating point where FRR 10FAR

11

11

11

11

FRRFARFRRFAR

,FRRFRRFARFAR

,)FRRFRR()FAR(FAR

FARFRRFARFRREER

kkkk

iiii

kkkk

kkkk

i

05-06/2001


Informatics Dpt.

69

Performance EvaluationAccuracy(3)Accuracy(3)

• Graphs

• Quantities– #speakers correctly/wrongly verified

ROC (Receiver OperatingCharacteristics) curve:

Plot of differentoperating points

(FRR vs. FAR values).Called also DET(Detection ErrorTradeoff) plot

[Gauvain]

05-06/2001


Informatics Dpt.

70

Performance EvaluationComputational ComplexityComputational Complexity

• CPU time– Training

• Feature creation• Modeling• Threshold setting

– Testing (verification throughput)• Feature creation• Matching

• Memory-disk storage– Speech database, Features, Models,

Thresholds

05-06/2001


Informatics Dpt.

71

Experimental ResultsParametersParameters

KHz48sF

Text dependent – Fixed vocab.: Digits 0-9 in French or Spanish|V|=10 |P|=37 (M2VTS database)Discrete utterance speech flow#sessions(shots)/speaker=5, the 5th is for testing|S|=4#phrases/session=1 (0-9 utterance)Phrase duration~6sec

95.0peα (30ms) 360N (20ms) 240MWindow type: Hamming

12LPCCN 12LPCNLiftering-weighting: );( mlc LPCCw

Proc. Freq.=12KHz

Coefficients: LPCC

05-06/2001


Informatics Dpt.

72

Experimental ResultsParameters(2)-EERParameters(2)-EER

Matching method: DTW

Nd : Euclidean Td : Type 4 TNB ddd 30W10Δ ,10Δ ,10Δ ,10Δ jEiEjSiS

Local path constraint: Sakoe & Shiba (b)

Decision approach: TemplateThreshold setting: leave one out|P|(client left out).|P-1|(rest clients as claimants).|S|(shot left out for claiming-testing)=5328 client claims|P|(client left out as impostor).|P-1|(claims of the impostor as one of the rest clients).|S|(shot left out for claiming)=5328 impostor claimsEER(avg)[0.6569%,1.5390%] (FAR1=1.5390%

>FRR1=0.6569%)EER(avg)=[EER(1|234)+EER(2|134)+EER(3|124)+EER(4|123)]/4

05-06/2001


Informatics Dpt.

73

Experimental ResultsParameters(3)-EERParameters(3)-EER

Difference:

EER(avg)=4.1817%EER(5|123)=5.4054%

12MFCCN512)( melFFTN

Coefficients: MFCC

40)( melfiltersN

Shot 4 left out, shot 5 used for testing:|P|.|P-1|=1332 client & 1332 impostor claimsEER(5|123)=2.7027%

05-06/2001


Informatics Dpt.

74

ReferencesReferences

• [Bengio] S. Bengio and J. Mariéthoz, Learning the Decision Function for Speaker Verification, IDIAP Research Report, 2001

• [Bimbot] F. Bimbot, I. Magrin-Chagnolleau and L. Mathan, Second-Order Statistical Measures for Text-Independent Speaker Identification, Speech Communication, vol. 17, pp. 177-192, 1995

• [Bourlard] H. Bourlard and N. Morgan, Speaker Verication: A Quick Overview, IDIAP Research Report, 1998

• [Campbell] J.P. Campbell Jr., Speaker Recognition: a Tutorial, Proc. of the IEEE, vol. 85, no. 9, pp. 1437-1462, 1997

• [Deller_bk] J.R. Deller, J.G. Proakis and J.H. Hansen, Discrete-time Processing of Speech Signals, Macmillan, New York, 1993

• [Fakotakis] N. Fakotakis, E. Dermatas, G. Kokkinakis, Optimal Decision Threshold for Speaker Verification, in Signal Processing III: Theories and Applications, editor: I.T. Young et al., pp. 585-587, Elsevier Science Publishers B.V. (North Holland), 1986

05-06/2001


Informatics Dpt.

75

References(2)References(2)

• [Falavigna] D. Falavigna, Comparison Of Different Hmm Based Methods For Speaker Verification (citeseer)

• [Finan] R.A. Finan, R.I. Damper and A.T. Sapeluk, Improved Data Modeling for Text-Dependent Speaker Recognition Using Sub-Band Processing (citeseer)

• [Gauvain] J. Gauvain, L. Lamel and B. Prouts, Experiments with Speaker Verification over the Telephone, Eurospeech’95, pp. 651-654, 1995

• [Genoud] D. Genoud, F. Bimbot, G. Gravier and G. Chollet, Combining Methods to Improve Speaker Verification Decision, Proc. of ICSLP'96, vol. 3, pp. 1756-1759, 1996

• [Haykin_bk] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, New York, 1995

• [Hermansky] H. Hermansky and N. Morgan, Rasta Processing of Speech, IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, 1994

05-06/2001


Informatics Dpt.

76


• [Jain_bk] A. Jain, R. Bolle and S. Pankanti, editors, Biometrics: Personal Identification in Networked Society, Kluwer Academic Publishers, Boston, MA, 1999

• [James] D. James, H. Hutter and F. Bimbot, CAVE -- Speaker Verification in Banking and Telecommunications (citeseer)

• [Jin] Q. Jin and A. Waibel, Application of LDA to Speaker Recognition (citeseer)

• [Konig] Y. Konig, L. Heck, M. Weintraub and K. Sonmez, Nonlinear Discriminant Feature Extraction for Robust Text-Independent Speaker Recognition, Proc. of RLA2C’98 (Speaker Recognition and Its Commercial and Forensic Applications), 1998

• [Koolwaaij] J.W. Koolwaaij and L. Boves, A New Procedure for Classifying Speakers in Speaker Verification Systems, Proc. of Eurospeech'97, pp. 2355-2358, 1997

• [Krishnan] M. Krishnan, C. Neophytou and G. Prescott, Wavelet Transform Speech Recognition using Vector Quantization, Dynamic Time Warping and Artificial Neural Networks, 1994

05-06/2001


Informatics Dpt.

77


• [Kuitert] M. Kuitert and L. Boves, Speaker Verification with GSM Coded Telephone Speech, Proc. of Eurospeech'97, vol. 2, pp. 975-978, 1997

• [Kumar] N. Kumar, Investigation of Silicon Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition, PhD thesis, Johns Hopkins University, 1997

• [Lindberg] J. Lindberg, J.W. Koolwaaij, H.-P. Hutter, D. Genoud, M. Blomberg, F. Bimbot and J.-B. Pierrot, Techniques for a priori Decision Threshold Estimation in Speaker Verification, Proc. of RLA2C’98, 1998

• [Lo] T.F. Lo and M.W. Mak, A New Intra-Frame and Inter-Frame Cepstral Processing Method for Telephone-Based Speaker Verification, Int. Workshop on Multimedia Data Storage, Retrieval, Integration and Applications, pp. 116-122, 2000

• [Mammone] R.J. Mammone, X. Zhang and R.P. Ramachandran, Robust Speaker Recognition, IEEE Signal Proc. Magazine, vol. 13, no. 5, pp. 58-71, Sep. 1996

• [Milner] B. Milner, Inclusion of Temporal Information into Features for Speech Recognition, Proc. of ICSLP’96, pp. 256-259, 1996

05-06/2001


Informatics Dpt.

78


• [Morgan] N. Morgan and B. Gold, Speech Analysis and Synthesis Overview, Lecture, Univ. of California Berkeley, 1999

• [Nedic] B. Nedic and H. Bourlard, Recent Developments in Speaker Verification at IDIAP, IDIAP Research Report, 2000

• [Picone] J. Picone, Fundamentals of Speech Recognition: A Short Course, Mississippi State Univ., 1996

• [Picone2] J. Picone, Signal Modeling Techniques in Speech Recognition, Proc. of the IEEE, vol. 81, no. 9, pp. 1215-1247, 1993

• [Rabiner_bk] L. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993

• [Shrimpton] D. Shrimpton and B.D. Watson, Comparison of Recurrent Neural Network Architectures for Speaker Verification. Proc. of the Fourth Australian International Conference on Speech Science and Technology, pp. 460-464, 1992

05-06/2001


Informatics Dpt.

79


• [Somervuo] P. Somervuo, Speech Recognition using Context Vectors and Multiple Feature Streams, Helsinki University of Technology, Faculty of Electrical Engineering, 1996

• [Vergin] R. Vergin, D. O'Shaughnessy and A. Farhat, Generalized Mel Frequency Cepstral Coefficients for Large-Vocabulary Speaker-Independent Continuous-Speech Recognition, IEEE Trans. on Speech and Audio Processing, vol. 7, no. 5, pp. 525-532, 1999

• [Wassner] H. Wassner, G. Maitre and G. Chollet, Speaker Verification : a Review, Technical Report, IDIAP, 1996

• [Weingessel] A. Weingessel, Speech Recognition (citeseer)• [Young] S. Young, Large vocabulary continuous speech

recognition, IEEE Signal Proc. Magazine, vol. 13, no. 5, pp. 45-57, 1996

Informati cs Dpt. 1 05- 06/2001 Digital Days Speaker Verification Alexandros Xafopoulos.

Documents

Transcript of Informati cs Dpt. 1 05- 06/2001 Digital Days Speaker Verification Alexandros Xafopoulos.