Speaker Recognition

University of Joensuu, Dept. of Computer Science, P.O. Box 111, FIN-80101 Joensuu. Tel. +358 13 251 7959, fax +358 13 251 7955, www.cs.joensuu.fi


PUMS 2003-2004 seminar, 14.10.2004, Turku

Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen


Research Group

• Pasi Fränti, Professor
• Juhani Saastamoinen, Project manager
• Evgeny Karpov, Project researcher
• Ville Hautamäki, Project researcher
• Tomi Kinnunen, Researcher
• Ismo Kärkkäinen, Clustering algorithms

PUMS project


PUMS & JoY

• Speaker Recognition
• PUMS season 2003-2004:
– Identification, no verification
– Port to a mobile phone
– Feature fusion
– Real-time

• http://cs.joensuu.fi/pages/pums


Application Scenarios

Speaker Recognition

• Speaker Verification: "Is this Bob’s voice?" (speech + claim) → accept, or reject as "Imposter!"
• Speaker Identification: "Whose voice is this?" → ?
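The difference between the two scenarios can be sketched in a few lines. This is only an illustration: the distance values, the speaker names and the threshold are made up, and the slides do not specify how the verification decision is made.

```python
def identify(distances):
    """Identification: pick the enrolled speaker whose model is closest."""
    return min(distances, key=distances.get)


def verify(distances, claimed, threshold):
    """Verification: accept the identity claim only if the claimed
    speaker's model is close enough to the input."""
    return distances[claimed] <= threshold


# Hypothetical distances between one utterance and three enrolled models
# (lower means a better match).
distances = {"alice": 0.9, "bob": 0.4, "carol": 1.3}

print(identify(distances))              # closest enrolled speaker
print(verify(distances, "bob", 0.5))    # claim "I am Bob"
print(verify(distances, "carol", 0.5))  # claim "I am Carol"
```

Identification needs no claim and always returns some enrolled speaker. Verification needs a claim and a threshold, and can reject.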


Identification System

[Block diagram] Speech audio → Signal processing → Feature vectors → Speaker modelling → Speaker profile database → Decision

• Training: add trained speaker profiles to the database
• Recognition: use all profiles in recognition; decision = min. MSE within DB over input speech
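The "min. MSE within DB" decision can be sketched as follows, assuming each speaker profile is a codebook of code vectors and the score is the mean squared distance from each input feature vector to its nearest code vector. The profile names, the vectors and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def avg_distortion(features, codebook):
    """Mean squared error between the feature vectors and their
    nearest code vectors in one speaker profile."""
    total = sum(min(sq_dist(f, c) for c in codebook) for f in features)
    return total / len(features)


def identify_speaker(features, profiles):
    """Use all profiles and return the one with minimum MSE."""
    return min(profiles, key=lambda name: avg_distortion(features, profiles[name]))


# Toy 2-D "profiles" and input features (made up for illustration).
profiles = {
    "speaker_a": [(0.0, 0.0), (1.0, 1.0)],
    "speaker_b": [(5.0, 5.0), (6.0, 4.0)],
}
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.1)]

best = identify_speaker(features, profiles)
print(best)
```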


Results 2003-2004

[Software architecture diagram. Labels: sprofiler, SpeakerProfiler, Winsprofiler, Epocsprofiler; srlib, Speechfeatures (HY), Fusion, ProfMatch, Real-time, DB; console UI (twice), Windows, Series60, TCL/TK (HY); common speaker recognition app. interface]


Planned Results

[Software architecture diagram. Labels: sprofiler, SpeakerProfiler, Winsprofiler, Epocsprofiler; srlib, Speechfeatures (HY), Fusion, ProfMatch, Real-time, DB; one part of the diagram is marked "Results 2003-2004"]

Applications:
• Access control
• Teleconference
• Large scale database
• Mobile phone login?


Segmentation

VAD


Verification


System in Mobile Phone

Port to Symbian OS with Series 60 UI platform


Symbian Phones

• Series 60 phone features:
– 16 MB ROM
– 8 MB RAM
– 176 x 208 display
– 32-bit ARM-processor
– No floating-point unit!!!

[Phone images: Series 60, UIQ, Series 80]


FFTGEN

• Multiplication results must fit in 32 bits: truncate multiplication inputs

• FFTGEN: Truncate to 16/16 bits (“16/16 FFT”)

[Diagram] FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer) = 32-bit multiplication result, split into 16 used bits and 16 crop-off bits. The FFT layer output (part of it) is a 16-bit integer. Crop-off for next layer: 16 bits!
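The 16/16 multiplication can be imitated with plain integers. This is a sketch under assumptions, not FFTGEN's actual code: the twiddle factor is assumed to be stored in Q15 format (scaled by 32768), and the function names are hypothetical.

```python
import math


def to_q15(x):
    """Quantize a value in [-1, 1) to a signed 16-bit integer (assumed Q15)."""
    return max(-32768, min(32767, round(x * 32768)))


def mul_16_16(sample, twiddle_q15):
    """16-bit x 16-bit multiply: the product fits in 32 bits, and only the
    upper 16 bits are passed on (the lower 16 bits are cropped off)."""
    assert -32768 <= sample <= 32767 and -32768 <= twiddle_q15 <= 32767
    product = sample * twiddle_q15
    assert -2**31 <= product < 2**31
    return product >> 16


sample = 12000                                # 16-bit signal value
twiddle = to_q15(math.cos(math.pi / 4))       # real part of one twiddle factor
out = mul_16_16(sample, twiddle)
print(out, sample * math.cos(math.pi / 4) / 2)
```

With a Q15 twiddle, cropping 16 bits also halves the result, so the output is compared against half of the exact product. Small signal values lose most of their bits this way.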


Proposed Information Preserving “22/10 FFT”

• Approximate DFT operator F with G
• Increase ||F-G||, preserve more signal information

– minimize maximum relative error in scaled sine values with respect to scale; 980 good for FFT sizes up to 1024

– Truncate multiplication inputs to 22/10 bits (signal/op)

[Diagram] FFT layer input (32-bit integer, 22 bits used) x FFT twiddle factor (16-bit integer, 10 bits used) = 32-bit multiplication result, split into 22 used bits and 10 crop-off bits. The FFT layer output (part of it) is a 32-bit integer. Crop-off for next layer: 10 bits
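A minimal sketch of the 22/10 idea, assuming the twiddle factors are stored as round(980 * cos) and round(-980 * sin), and that the product is simply shifted right by 10 bits. The slides give the scale 980 and the bit split but not these details, so the function names and the exact use of the scale are assumptions.

```python
import math

SCALE = 980  # twiddle scale quoted on the slide; how it is applied is assumed


def scaled_twiddle(k, n, scale=SCALE):
    """Integer (cos, -sin) pair for the twiddle factor exp(-2*pi*i*k/n)."""
    angle = 2.0 * math.pi * k / n
    return round(scale * math.cos(angle)), round(-scale * math.sin(angle))


def max_rel_sine_error(scale, n):
    """Largest relative rounding error over the nonzero scaled sine values."""
    worst = 0.0
    for k in range(1, n // 2):
        exact = scale * math.sin(2.0 * math.pi * k / n)
        worst = max(worst, abs(round(exact) - exact) / exact)
    return worst


def mul_22_10(sample, twiddle):
    """22-bit x 10-bit multiply, then crop off the low 10 bits."""
    assert abs(sample) < 2**21 and abs(twiddle) < 2**10
    product = sample * twiddle
    assert -2**31 <= product < 2**31   # still fits in 32 bits
    return product >> 10


cos_tw, sin_tw = scaled_twiddle(1, 8)
out = mul_22_10(1_000_000, cos_tw)
print(cos_tw, sin_tw, out)
print(max_rel_sine_error(980, 1024), max_rel_sine_error(1000, 1024))
```

The twiddle factor is coarser than in the 16/16 scheme, but the signal keeps 22 bits between layers. `max_rel_sine_error` lets different scales be compared for a given FFT size.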


Scale of Error in Proposed FFT

[Figure panels: 16/16 and 22/10]

Log10 of relative error in FFT elements

                      FFTGEN    22/10 FFT
average               -0.775    -2.118
standard deviation     0.797     0.590
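An average log10 of -0.775 corresponds to a typical relative error of about 10^-0.775 ≈ 0.17, while -2.118 corresponds to about 0.0076. The statistic itself could be computed as sketched below. The spectra are made-up values, and the use of the population standard deviation is an assumption, since the slide does not say which variant was used.

```python
import math
import statistics


def log10_relative_errors(reference, approx):
    """log10(|approx - ref| / |ref|) for each FFT element with a nonzero error."""
    logs = []
    for ref, app in zip(reference, approx):
        err = abs(app - ref)
        if err > 0 and abs(ref) > 0:
            logs.append(math.log10(err / abs(ref)))
    return logs


# Made-up spectra, only to exercise the function.
reference = [1 + 0j, 2 + 0j]
approx = [1.01 + 0j, 2.002 + 0j]

logs = log10_relative_errors(reference, approx)
print(statistics.mean(logs), statistics.pstdev(logs))
```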


Mobile Phone Results

TIMIT, 100 speakers      recog. rate (%)   std. dev. (%)
FLOAT                    100.0             N/A
FFTGEN                     9.7             1.6
FIXED                     95.8             1.2
MIXED                    100.0             N/A
MIXED2                    98.0             0.6

implementation, signal   recog. rate (%)   std. dev. (%)
FLOAT, Symbian audio      83.2             4.38
FLOAT, PC audio          100.0             N/A
FIXED, Symbian audio      76.0             2.83
FIXED, PC audio          100.0             N/A


Improving Accuracy by Information Fusion

[Diagram] Speech waveform → feature vectors (Feature set 1, Feature set 2, Feature set 3, ...) → Classifier 1, Classifier 2, Classifier 3 → score 1, score 2, score 3 → Score combiner → Decision

• Feature set 1: e.g. 5 MFCCs
• Feature set 2: e.g. F0 + ΔF0
• Feature set 3: e.g. formants F1, F2, F3
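The score combiner can be sketched as a weighted sum of the per-classifier scores. This is one common choice and not necessarily the one used in the project. The scores are treated as distances (lower is better), and the speaker names, score values and equal weights are assumptions for illustration.

```python
def combine_scores(score_sets, weights=None):
    """Weighted sum of per-classifier scores for each speaker."""
    if weights is None:
        weights = [1.0 / len(score_sets)] * len(score_sets)
    speakers = score_sets[0].keys()
    return {
        spk: sum(w * scores[spk] for w, scores in zip(weights, score_sets))
        for spk in speakers
    }


# Hypothetical distance scores from three classifiers, one per feature set.
score_1 = {"alice": 0.2, "bob": 0.5}
score_2 = {"alice": 0.6, "bob": 0.4}
score_3 = {"alice": 0.3, "bob": 0.7}

fused = combine_scores([score_1, score_2, score_3])
decision = min(fused, key=fused.get)
print(fused, decision)
```

In practice the scores of different classifiers usually need to be brought to a comparable range before they are summed. That step is left out here.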


Information Fusion Results

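Table columns: Feature set combination | BASELINE: Best individual | Feature-level fusion | Score-level fusion | Decision-level fusion

Cell highlighting: "Fusion successful" / "Fusion sucks"

Values as they appear in the slide text (the assignment of values to columns is not recoverable):
• MFCC + MFCC: 14.6, 15.8, 16.8
• LPCC + LPCC: 19.8, 16.0
• ARCSIN + ARCSIN: 18.2, 17.1
• FMT + FMT: 29.9, 19.4
• All feature sets: 12.6, 21.2, 16.0
• Unattached cells: 15.2, 52.0, 16.8, 14.7, and four N/A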
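The slide names three fusion levels without defining them. As a rough sketch, feature-level fusion is taken here to mean concatenating the feature vectors before classification, and decision-level fusion to mean a majority vote over the classifiers' decisions. Both definitions are assumptions, as are the function names and data. Score-level fusion combines the classifier scores instead.

```python
from collections import Counter


def feature_level_fusion(vec_a, vec_b):
    """Concatenate two feature vectors into one longer vector."""
    return tuple(vec_a) + tuple(vec_b)


def decision_level_fusion(decisions):
    """Majority vote over classifier decisions; None if first place is tied."""
    counts = Counter(decisions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


print(feature_level_fusion((1.0, 2.0), (3.0,)))
print(decision_level_fusion(["bob", "bob", "alice"]))
print(decision_level_fusion(["bob", "alice"]))
```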


Real-Time Speaker Identification

[Flowchart]
1. Speech input stream: fill buffer with new data
2. Frame blocking → all frames
3. Silence detection → non-silent frames
4. Feature extraction → feature vectors
5. Pre-quantization → reduced set of vectors
6. Matching against the speaker database (Speaker 1 model ... Speaker N model)
7. Database pruning: list of candidate speakers, split into active speakers and pruned speakers
8. Decision? Yes: END. No: fill buffer with new data and repeat

Speed-up methods:
• Reducing # vectors (pre-quantization): 1. Averaging, 2. Random sampling, 3. Decimation, 4. Clustering (LBG)
• Speed up NN search: vantage-point tree (VPT) indexing of the code vectors
• Reduce # speakers (database pruning): 1. Static pruning, 2. Hierarchical pruning, 3. Adaptive pruning, 4. Confidence-based pruning
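The buffer-by-buffer loop with pre-quantization and pruning can be sketched as below. Two assumptions are made, since the slides only name the methods: decimation keeps every `step`-th vector, and static pruning drops a fixed number of the worst-scoring speakers after each buffer. All names and data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def decimate(vectors, step):
    """Pre-quantization by decimation: keep every `step`-th vector."""
    return vectors[::step]


def identify_with_pruning(blocks, profiles, step=2, prune_count=1):
    """Accumulate matching scores block by block and prune the
    worst-scoring speakers after each block (assumed static pruning)."""
    active = {name: 0.0 for name in profiles}   # accumulated distortion
    for block in blocks:
        for vec in decimate(block, step):
            for name in active:
                active[name] += min(sq_dist(vec, c) for c in profiles[name])
        # Drop the worst speakers, but always keep at least one.
        worst_first = sorted(active, key=active.get, reverse=True)
        for name in worst_first[:min(prune_count, len(active) - 1)]:
            del active[name]
        if len(active) == 1:   # decision reached, stop early
            break
    return min(active, key=active.get)


profiles = {
    "a": [(0.0, 0.0), (1.0, 1.0)],
    "b": [(4.0, 4.0), (5.0, 5.0)],
    "c": [(9.0, 9.0), (10.0, 10.0)],
}
blocks = [
    [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (1.0, 0.8)],
    [(0.0, 0.2), (1.1, 1.0)],
]
winner = identify_with_pruning(blocks, profiles)
print(winner)
```

Both steps trade a little matching accuracy for fewer distance computations. Decimation reduces the number of input vectors, and pruning reduces the number of speaker models still being matched.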


Results: Baseline System (TIMIT)

(Average length of test utterance = 8.9 s)

Real-time requirement satisfied

4 x realtime
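The "x realtime" figures on the result slides are taken here to mean the utterance length divided by the processing time. This reading is an assumption, but it is consistent with the figures in the deck: with the 8.9 s average utterance, a processing time of 0.27 s gives about 33 x realtime, and 4 x realtime corresponds to roughly 2.2 s of processing.

```python
def realtime_factor(audio_seconds, processing_seconds):
    """How many times faster than real time the audio is processed."""
    return audio_seconds / processing_seconds


avg_utterance = 8.9  # seconds, average TIMIT test utterance from the slide

print(round(realtime_factor(avg_utterance, 0.27)))  # best time quoted for TIMIT
print(avg_utterance / 4)                            # processing time at 4 x realtime
```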


Results: Pre-Quantization (TIMIT) (Codebook size = 64)

• Averaging performs worst, clustering best

• About 2:1 speed-up over full search (no pre-quantization) without degradation in accuracy

9 x realtime


Results: Pruning Variants (TIMIT) (Codebook size = 64)

11 x realtime

• Recommended method: adaptive pruning (AP)


Results: PQ, Pruning and PQP (TIMIT) (Codebook size = 64)

33 x realtime

• Recommended method: combination of pre-quantization and pruning (PQP)


Results: VQ vs. GMM (TIMIT)

13:1 speed-up without degradation

9:1 to 10:1 speed-up without degradation

VQ GMM

Best time : 0.27 s = 33 x realtime

@ error rate 0.32 %

Smallest error : 0.00 %

@ 0.31 s = 28 x realtime


@ error rate 0.16 %





Results: VQ vs. GMM (NIST-1999)

VQ GMM

13:1 to 16:1 speedup with minor degradation

23:1 to 34:1 speedup with minor degradation


@ error rate 19.22 %




@ error rate 19.36 %

Smallest error: 16.90 %

@ 37.9 s = 0.8 x realtime

