Speaker Recognition

University of Joensuu, Dept. of Computer Science, P.O. Box 111, FIN-80101 Joensuu, Tel. +358 13 251 7959, fax +358 13 251 7955, www.cs.joensuu.fi
Speaker Recognition. University of Joensuu, Department of Computer Science. PUMS 2003-2004 seminar, 14.10.2004, Turku. Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen


Page 1: Speaker Recognition

Speaker Recognition

University of Joensuu, Department of Computer Science

PUMS 2003-2004 seminar, 14.10.2004, Turku

Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen

Page 2: Speaker Recognition

Research Group

Pasi Fränti, Professor

Juhani Saastamoinen, Project manager

Evgeny Karpov, Project researcher

Ville Hautamäki, Project researcher

Tomi Kinnunen, Researcher

Ismo Kärkkäinen, Clustering algorithms

PUMS project

Page 3: Speaker Recognition

PUMS & JoY

• Speaker Recognition
• PUMS season 2003-2004:
– Identification, no verification
– Port it to a mobile phone
– Feature fusion
– Real-time

• http://cs.joensuu.fi/pages/pums

Page 4: Speaker Recognition

Application Scenarios

[Diagram: the two tasks of speaker recognition]

• Speaker verification: "Is this Bob's voice?" The input is speech plus an identity claim; a false claim is rejected ("Impostor!").

• Speaker identification: "Whose voice is this?" The input is speech only, with no claim.

Page 5: Speaker Recognition

Identification System

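[Block diagram]

Training: speech audio -> signal processing -> feature vectors -> speaker modelling -> add trained speaker profiles to the speaker profile database

Recognition: speech audio -> signal processing -> feature vectors -> decision, using all profiles in recognition

Decision rule: min. MSE within DB over input speech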
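The decision rule on this slide (minimum MSE within the database over the input speech) can be sketched as follows. This is a minimal illustration, not the project's code: it assumes each speaker profile is a small codebook of code vectors (consistent with the codebooks mentioned on later slides), and the names `mse_to_codebook`, `identify`, and the toy profiles are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mse_to_codebook(features: List[Vector], codebook: List[Vector]) -> float:
    """Mean squared distance from each feature vector to its nearest code vector."""
    total = sum(min(sq_dist(f, c) for c in codebook) for f in features)
    return total / len(features)


def identify(features: List[Vector], profiles: Dict[str, List[Vector]]) -> str:
    """Return the speaker whose profile gives the minimum MSE over the input."""
    return min(profiles, key=lambda name: mse_to_codebook(features, profiles[name]))


# Toy database with two hypothetical speakers (assumed data, for illustration only).
profiles = {
    "alice": [[0.0, 0.0], [1.0, 1.0]],
    "bob": [[5.0, 5.0], [6.0, 4.0]],
}
test_features = [[0.9, 1.1], [0.1, -0.1], [1.2, 0.8]]
best = identify(test_features, profiles)
```

Every profile in the database is scored, which matches the "use all profiles in recognition" note in the diagram; the later real-time slides are about avoiding this exhaustive search.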

Page 6: Speaker Recognition

Results 2003-2004

[Architecture diagram]

Library components: srlib, Real-time, Fusion, Speechfeatures (HY), ProfMatch, DB

Common speaker recognition app. interface

Applications: sprofiler, SpeakerProfiler, Winsprofiler, Epocsprofiler

User interfaces and platforms shown: console UI, Windows, Series 60, TCL/TK (HY)

Page 7: Speaker Recognition

Planned Results

[Architecture diagram: planned additions around the 2003-2004 results]

Results 2003-2004: sprofiler, SpeakerProfiler, Winsprofiler, Epocsprofiler, Fusion, Speechfeatures (HY), ProfMatch, srlib, Real-time, DB, common speaker recognition app. interface

Planned components: Segmentation, VAD, Verification

Applications: Access control, Teleconference, Large scale database, Mobile phone login?

Page 8: Speaker Recognition

System in Mobile Phone

Port to Symbian OS with Series 60 UI platform

Page 9: Speaker Recognition

Symbian Phones

• Series 60 phone features:
– 16 MB ROM
– 8 MB RAM
– 176 x 208 display
– 32-bit ARM processor
– No floating-point unit!

[Pictures: Series 60, Series 80 and UIQ phones]

Page 10: Speaker Recognition

FFTGEN

• Multiplication results must fit in 32 bits: truncate multiplication inputs

• FFTGEN: Truncate to 16/16 bits (“16/16 FFT”)

[Diagram: one 16/16 multiplication]

FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer) = 32-bit multiplication result

32-bit result = 16 used bits + 16 crop-off bits

FFT layer output (part of it): 16-bit integer; crop-off for next layer: 16 bits!

Page 11: Speaker Recognition

Proposed Information Preserving “22/10 FFT”

• Approximate DFT operator F with G
• Increase ||F-G||, preserve more signal information

– minimize maximum relative error in scaled sine values with respect to scale; 980 good for FFT sizes up to 1024

– Truncate multiplication inputs to 22/10 bits (signal/op)

[Diagram: one 22/10 multiplication]

FFT layer input (32-bit integer, 22 bits used) x FFT twiddle factor (16-bit integer, 10 bits used) = 32-bit multiplication result

32-bit result = 22 used bits + 10 crop-off bits

FFT layer output (part of it): 32-bit integer; crop-off for next layer: 10 bits
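The trade-off between the 16/16 and 22/10 splits can be illustrated with a single multiplication. The sketch below is not the FFTGEN or the proposed implementation: it assumes that signal and twiddle values lie in [-1, 1], that one bit of each width is the sign bit, and that the product is cropped with an arithmetic right shift. The function names and the test value are hypothetical, and the scale 980 mentioned on the slide is not modelled.

```python
import math


def fixed_mul(v: float, w: float, sig_bits: int, op_bits: int) -> float:
    """Multiply signal value v by twiddle factor w using truncated integer inputs.

    Assumed convention (not taken from the slides): v and w lie in [-1, 1]
    and one bit of each width is reserved for the sign.
    """
    sig_scale = 1 << (sig_bits - 1)
    op_scale = 1 << (op_bits - 1)
    x = int(v * sig_scale)        # truncate the signal to sig_bits
    t = int(w * op_scale)         # truncate the twiddle factor to op_bits
    product = x * t
    assert abs(product) < (1 << 31), "product must fit in 32 bits"
    y = product >> (op_bits - 1)  # crop off the operator bits for the next layer
    return y / sig_scale          # convert back to a real value for comparison


def rel_error(v: float, w: float, sig_bits: int, op_bits: int) -> float:
    """Relative error of the fixed-point product against the exact product."""
    exact = v * w
    return abs(fixed_mul(v, w, sig_bits, op_bits) - exact) / abs(exact)


# A small-amplitude signal value multiplied by one twiddle component.
v, w = 1e-4, math.cos(0.3)
err_16_16 = rel_error(v, w, 16, 16)
err_22_10 = rel_error(v, w, 22, 10)
```

For this small-amplitude input the 22/10 split gives the lower relative error: the twiddle factor is coarser, but far fewer signal bits are lost before the multiplication.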

Page 12: Speaker Recognition

Scale of Error in Proposed FFT

[Plots: error in FFT elements for 16/16 and 22/10]

Log10 of relative error in FFT elements

                      FFTGEN (16/16)    22/10 FFT
average               -0.775            -2.118
standard deviation     0.797             0.590
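To read the table, it can help to convert the averages back from the log10 scale. The sketch below assumes the metric is log10(|approx - exact| / |exact|) per FFT element; the helper name `log10_rel_errors` is hypothetical. Since the average of logarithms is the logarithm of the geometric mean, 10 raised to the table's average gives the geometric-mean relative error.

```python
import math
from typing import List


def log10_rel_errors(approx: List[float], exact: List[float]) -> List[float]:
    """Log10 of the relative error per element (skips zero references and exact hits)."""
    return [
        math.log10(abs(a - e) / abs(e))
        for a, e in zip(approx, exact)
        if e != 0 and a != e
    ]


# Geometric-mean relative errors implied by the averages in the table.
typical_fftgen = 10 ** -0.775   # FFTGEN (16/16)
typical_22_10 = 10 ** -2.118    # 22/10 FFT

# Small self-check of the helper on made-up values.
sample = log10_rel_errors([1.1, 2.0, 0.99], [1.0, 2.0, 1.0])
```

Under this reading, FFTGEN elements are off by roughly 17 % on the geometric average, against about 0.8 % for the 22/10 FFT.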

Page 13: Speaker Recognition

Mobile Phone Results

TIMIT, 100 speakers        recog. rate (%)    std. dev. (%)
FLOAT                      100.0              N/A
FFTGEN                       9.7              1.6
FIXED                       95.8              1.2
MIXED                      100.0              N/A
MIXED2                      98.0              0.6

implementation, signal     recog. rate (%)    std. dev. (%)
FLOAT, Symbian audio        83.2              4.38
FLOAT, PC audio            100.0              N/A
FIXED, Symbian audio        76.0              2.83
FIXED, PC audio            100.0              N/A

Page 14: Speaker Recognition

Improving Accuracy by Information Fusion

[Diagram: speech waveform (time axis in seconds) feeding three parallel recognition chains]

Speech signal -> feature set 1, feature set 2, feature set 3 (each a stream of feature vectors)

Feature set i -> classifier i -> score i (i = 1, 2, 3)

Scores 1-3 -> score combiner -> decision

Example feature sets: 5 MFCCs; F0 + ΔF0; formants F1, F2, F3
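The score combiner in the diagram can be sketched as a weighted sum of per-classifier scores. The slides do not say which combination rule was used, so the rule below (sum-normalised distances, fixed weights, smallest fused distance wins) is an assumption, and all names and numbers are hypothetical.

```python
from typing import Dict, List


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale one classifier's distance scores so that they sum to 1."""
    total = sum(scores.values())
    return {spk: s / total for spk, s in scores.items()}


def fuse_scores(per_classifier: List[Dict[str, float]],
                weights: List[float]) -> Dict[str, float]:
    """Weighted sum of the normalised scores of all classifiers."""
    fused: Dict[str, float] = {}
    for scores, weight in zip(per_classifier, weights):
        for spk, s in normalize(scores).items():
            fused[spk] = fused.get(spk, 0.0) + weight * s
    return fused


def decide(fused: Dict[str, float]) -> str:
    """Scores are distances here, so the smallest fused score wins."""
    return min(fused, key=fused.get)


# Distances from three classifiers, one per feature set (made-up values).
scores = [
    {"alice": 1.0, "bob": 3.0},
    {"alice": 2.0, "bob": 2.0},
    {"alice": 4.0, "bob": 1.0},
]
fused = fuse_scores(scores, weights=[1 / 3, 1 / 3, 1 / 3])
winner = decide(fused)
```

Normalising before summing keeps one classifier with a large numeric range from dominating the others; the weights leave room to trust some feature sets more.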

Page 15: Speaker Recognition

Information Fusion Results

Decision-level fusion

Score-level fusion

Feature-level fusion

BASELINE:

Best individual

Feature set combination

14.6 15.8 16.8 MFCC + MFCC

15.2

52.0

16.8

14.7

12.6 21.2 16.0 All feature sets

29.9 19.4 FMT + FMT

18.2 17.1 ARCSIN + ARCSIN

19.8 16.0 LPCC + LPCC

Fusion successful

Fusion sucks

N/A

N/A

N/A

N/A

Page 16: Speaker Recognition

Real-Time Speaker Identification

[Flowchart]

Speech input stream -> fill buffer with new data -> frame blocking (all frames) -> silence detection (non-silent frames) -> feature extraction (feature vectors) -> pre-quantization (reduced set of vectors) -> matching against the speaker database (speaker 1 model ... speaker N model) -> database pruning (list of candidate speakers: active speakers, pruned speakers) -> decision?

Decision: Yes -> END; No -> fill buffer with new data

Reducing # vectors (pre-quantization):
1. Averaging
2. Random sampling
3. Decimation
4. Clustering (LBG)

Reduce # speakers (database pruning):
1. Static pruning
2. Hierarchical pruning
3. Adaptive pruning
4. Confidence-based pruning

Speed up NN search: vantage-point tree (VPT) indexing of the code vectors
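The first three pre-quantization options (averaging, random sampling, decimation) can be sketched in a few lines. The slides name the methods but give no details, so the block size `m`, the sample size `k`, and the choice of averaging over consecutive blocks are assumptions; the LBG clustering variant is left out for brevity.

```python
import random
from typing import List

Vector = List[float]


def average_blocks(vectors: List[Vector], m: int) -> List[Vector]:
    """Replace every block of m consecutive vectors by its mean vector."""
    reduced = []
    for start in range(0, len(vectors), m):
        block = vectors[start:start + m]
        dim = len(block[0])
        reduced.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return reduced


def random_sample(vectors: List[Vector], k: int, seed: int = 0) -> List[Vector]:
    """Pick k vectors uniformly at random, without replacement."""
    return random.Random(seed).sample(vectors, k)


def decimate(vectors: List[Vector], m: int) -> List[Vector]:
    """Keep every m-th vector and drop the rest."""
    return vectors[::m]


# Eight made-up two-dimensional feature vectors.
vectors = [[float(i), 2.0 * i] for i in range(8)]
averaged = average_blocks(vectors, 4)
sampled = random_sample(vectors, 3)
decimated = decimate(vectors, 4)
```

All three reduce the number of vectors passed to the matching step, so the matching cost drops roughly in proportion to the reduction.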

Page 17: Speaker Recognition

Results: Baseline System (TIMIT)

(Average length of test utterance = 8.9 s)

Real-time requirement satisfied

4 x realtime

Page 18: Speaker Recognition

Results: Pre-Quantization (TIMIT) (Codebook size = 64)

• Averaging performs worst, clustering best

• About 2:1 speed-up over full search (no pre-quantization) without degradation in accuracy

9 x realtime

Page 19: Speaker Recognition

Results: Pruning Variants (TIMIT) (Codebook size = 64)

11 x realtime

• Recommended method: adaptive pruning (AP)
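One pruning step can be sketched as follows. The slides do not define the adaptive rule, so this sketch assumes the threshold is derived from the current score distribution as mean + eta * standard deviation; the parameter `eta`, the function name, and the distances are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict


def adaptive_prune(distances: Dict[str, float], eta: float = 1.0) -> Dict[str, float]:
    """Keep only speakers whose accumulated distance is within the threshold.

    Assumed rule: threshold = mean + eta * standard deviation of the
    distances of the currently active speakers.
    """
    values = list(distances.values())
    threshold = mean(values) + eta * pstdev(values)
    return {spk: d for spk, d in distances.items() if d <= threshold}


# Accumulated distances of four active speakers after some input frames.
active = {"a": 1.0, "b": 1.2, "c": 1.1, "d": 5.0}
active = adaptive_prune(active, eta=1.0)
```

With a non-negative `eta` the best-matching speaker always survives, because the minimum distance cannot exceed the mean. Repeating the step as more frames arrive shrinks the active set until a decision can be made.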

Page 20: Speaker Recognition

Results: PQ, Pruning and PQP (TIMIT) (Codebook size = 64)

33 x realtime

• Recommended method: combination of pre-quantization and pruning (PQP)

Page 21: Speaker Recognition

Results: VQ vs. GMM (TIMIT)

                    VQ                         GMM
Speed-up            13:1                       9:1 to 10:1            (without degradation)
Best time           0.27 s = 33 x realtime     0.18 s = 49 x realtime
  @ error rate      0.32 %                     0.16 %
Smallest error      0.00 %                     0.16 %
  @ time            0.31 s = 28 x realtime     0.18 s = 49 x realtime

(Average length of test utterance = 8.9 s)
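The "x realtime" figures on the result slides relate the average utterance length to the processing time. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many times faster than real time the audio is processed."""
    return audio_seconds / processing_seconds


# TIMIT: average test utterance of 8.9 s.
timit_first = realtime_factor(8.9, 0.27)    # best time in the first block
timit_second = realtime_factor(8.9, 0.18)   # best time in the second block

# NIST-1999: average test utterance of 30.4 s, slowest reported configuration.
nist_slowest = realtime_factor(30.4, 37.9)
```

A factor above 1 means the real-time requirement is met; a factor below 1, such as the 37.9 s configuration on NIST-1999, means processing takes longer than the utterance itself.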

Page 22: Speaker Recognition

Results: VQ vs. GMM (NIST-1999)

                    VQ                         GMM
Speed-up            13:1 to 16:1               23:1 to 34:1           (with minor degradation)
Best time           0.48 s = 63 x realtime     0.82 s = 37 x realtime
  @ error rate      19.22 %                    19.36 %
Smallest error      17.34 %                    16.90 %
  @ time            11.4 s = 3 x realtime      37.9 s = 0.8 x realtime

(Average length of test utterance = 30.4 s)