© 2009 IBM Corporation
Bottleneck Features for Speaker Recognition
Sibel Yaman1, Jason Pelecanos1, and Ruhi Sarikaya2
1 IBM T. J. Watson Research Labs, Yorktown Heights, NY
2 Microsoft Corporation, Redmond, WA
Odyssey 2012: The Speaker and Language Recognition Workshop
Roadmap
Introduction
Bottleneck feature extraction
1) A conversation level training criterion
2) Incorporating a separate system in training
Experiments
Summary
The Big Picture
In the speech recognition literature:
Deep networks are shown to outperform HMMs (Seide 2012, etc.).
In the speaker recognition literature:
Many sites report ever-improving performance figures (Konig 1998, Garimella 2012).
Bottleneck Network Architecture
[Diagram: stacked raw features from 0.5 seconds of speech are fed forward through the network; speaker labels at the output layer drive the back-propagation of speaker information.]
An information bottleneck acts as a feature compressor (Konig 1998).
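The bottleneck idea can be sketched in numpy using the layer sizes given later in the deck (294 → 1000 → 42 → 500 → 173). The logistic nonlinearity and the weight initialization here are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BottleneckNet:
    """Sketch of a 294 -> 1000 -> 42 -> 500 -> 173 bottleneck MLP.

    Layer sizes follow the deck; the logistic activation and random
    initialization are assumptions for illustration.
    """
    def __init__(self, sizes=(294, 1000, 42, 500, 173), seed=0):
        rng = np.random.default_rng(seed)
        self.W = [rng.normal(0, 0.01, (n_out, n_in))
                  for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(n_out) for n_out in sizes[1:]]

    def forward(self, x):
        """Return (speaker-layer activations, bottleneck features).

        The bottleneck features are the pre-nonlinearity activations of
        the 42-unit layer, as described later in the deck.
        """
        bottleneck = None
        a = x
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            z = W @ a + b
            if i == 1:          # the 42-unit bottleneck layer
                bottleneck = z  # take features before the nonlinearity
            a = sigmoid(z)
        return a, bottleneck

net = BottleneckNet()
frame = np.zeros(294)           # one stacked-feature frame (~0.5 s of speech)
scores, bn = net.forward(frame)
```

The compressed 42-dimensional layer forces the network to summarize speaker-discriminative information, which is what makes its activations usable as features downstream.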
Using Neural Networks for Speaker Recognition
Feature extraction with neural networks has traditionally performed relatively poorly.
We investigate approaches that turn the performance comparison the other way around. :-)
[Bar charts: EER (%) and 1000x min DCF (2008) on the same-mic and different-mic tasks, MFCC vs. traditional NN; relative differences of -40.4%, -45.1%, -57.5%, and -46.1%.]
An Overview
We demonstrate two ways of exploiting the expressive power of deep networks:
1) The training is adjusted to the targeted performance evaluation metric.
2) Information from a separate system is incorporated in training.
[Pipeline: Speech Signal → Spectral Feature Extraction (raw MFCCs) → Linear Transformation via Differentiation (MFCCs, ∆, ∆∆, ...) → Nonlinear Transformation via Bottleneck Networks → BN Features]
Roadmap
Introduction
Bottleneck feature extraction
1) A conversation level training criterion
2) Incorporating a separate system in training
Experiments
Summary
Frame vs. Conversation Level Training
Frame level training has limitations:
– Learning the speaker is constrained to the context around the current frame.
– A long context would explode the number of free parameters.
Conversation level training offers solutions:
– The frames coming from one conversation are tied together so that a single decision is made.
– The network size can be kept relatively small.
(1) A Speaker Recognition Training Criterion
A log-likelihood ratio-based training criterion (Brummer 2005) is optimized.
There is one target and (S-1) nontarget scores at the output layer.

$$ J_{\mathrm{LLR}}(\Theta) = \frac{\alpha}{T} \sum_{u:\,\mathrm{target}} \log\!\big(1 + e^{-c_u}\big) + \frac{\beta}{N} \sum_{u:\,\mathrm{nontarget}} \log\!\big(1 + e^{+c_u}\big) $$

The first sum is the cost associated with target trials; the second, the cost associated with nontarget trials.
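This criterion — softplus penalties on target and nontarget log-likelihood-ratio scores, normalized by the trial counts T and N and weighted by α and β — can be sketched in numpy as follows (the function and argument names are assumptions for illustration):

```python
import numpy as np

def j_llr(scores, is_target, alpha=1.0, beta=1.0):
    """Sketch of the LLR-based training cost (Brummer 2005 style).

    scores    : conversation-level LLR scores c_u
    is_target : boolean mask, True for target trials
    alpha/beta: relative weights on the two error types (assumed defaults)
    """
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    tgt, non = scores[is_target], scores[~is_target]
    # log(1 + e^{-c}) penalizes low target scores;
    # log(1 + e^{+c}) penalizes high nontarget scores.
    cost_t = np.log1p(np.exp(-tgt)).mean() if tgt.size else 0.0
    cost_n = np.log1p(np.exp(non)).mean() if non.size else 0.0
    return alpha * cost_t + beta * cost_n
```

A perfectly separated trial set (large positive target scores, large negative nontarget scores) drives the cost toward zero, which is what makes the criterion match the detection task rather than frame classification accuracy.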
Conversation Level Training
We need a global constraint on the decision for the entire recording.
The scores are averaged at the output layer before the nonlinearity.
[Figure: output-layer activations u_1^(L), ..., u_H^(L) are computed for each frame t = 0, ..., T and averaged into a single set of conversation-level activations (t = 1:T).]
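The pre-nonlinearity averaging that ties all frames of a conversation into one decision can be sketched as (function name is an assumption):

```python
import numpy as np

def conversation_score(frame_activations):
    """Average per-frame output activations *before* the nonlinearity,
    yielding one decision per conversation, as described in the deck.

    frame_activations : (T, S) pre-nonlinearity outputs u^(L) for T frames
    returns           : (S,) conversation-level scores
    """
    return np.asarray(frame_activations, dtype=float).mean(axis=0)

# Two frames, two output units: the conversation score is the frame mean.
conv = conversation_score([[1.0, 3.0],
                           [3.0, 5.0]])
```

Because the average is taken before the nonlinearity, gradients from the single conversation-level loss distribute over all frames, which is what keeps the network small while still using long contexts.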
(2) Using a Separate System in Training
Scores from a separate system are incorporated in training.
[Diagram: a standard MFCC system produces additional scores u_n^M, which are calibrated and fed into BN score generation alongside the network outputs.]

The term
$$ u_n^{(l)}(\Theta) = \sigma\big(\mathbf{W}^{(l-1)}\,\mathbf{u}^{(l-1)}\big)_n $$
in the training objective is replaced with
$$ u_n'(\Theta) = \omega_1\,\sigma\big(\mathbf{W}^{(l-1)}\,\mathbf{u}^{(l-1)}\big)_n + \omega_2\,u_n^{M} + \kappa, $$
where u_n^M is the calibrated score from the separate system.
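The modified output unit — a weighted combination of the network's own sigmoid output and the separate system's calibrated score — can be sketched as (argument names follow the slide's ω_1, ω_2, κ notation; the logistic σ is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combined_output(z_n, m_n, w1, w2, kappa):
    """Sketch: replace sigma(z_n) with w1*sigma(z_n) + w2*m_n + kappa,
    where m_n is the calibrated score from the separate MFCC system."""
    return w1 * sigmoid(z_n) + w2 * m_n + kappa

# With w1=1, w2=0, kappa=0 this reduces to the plain sigmoid output.
y = combined_output(0.0, 1.0, w1=1.0, w2=0.0, kappa=0.0)
```

Setting w2 = 0 recovers the original network, so the combination can only help if the calibration stage (next slide) gives the extra scores a consistent scale.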
Score Calibration
The additional scores should have a log-likelihood ratio interpretation.
The score calibration is achieved by solving
$$ \{\omega_1^*, \omega_2^*, \kappa^*\} = \arg\min_{\omega_1, \omega_2, \kappa} J_{\mathrm{LLR}}(\omega_1, \omega_2, \kappa \mid \Theta\ \text{fixed}) $$
The network is trained by solving
$$ \Theta^* = \arg\min_{\Theta} J_{\mathrm{LLR}}(\Theta \mid \omega_1^*, \omega_2^*, \kappa^*\ \text{fixed}) $$
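The two-stage structure — fit the calibration parameters with the network fixed, then fit the network with the calibration fixed — can be sketched with a generic objective (the optimizer here is plain gradient descent with numerical gradients, purely for illustration; the paper does not specify this):

```python
import numpy as np

def fit_two_stage(J, theta0, cal0, steps=50, lr=0.1):
    """Sketch of the two-stage optimization: first minimize J over the
    calibration parameters (w1, w2, kappa) with theta fixed, then over
    the network parameters theta with the calibration fixed.

    J(theta, cal) is any differentiable stand-in for J_LLR.
    """
    theta = np.array(theta0, dtype=float)
    cal = np.array(cal0, dtype=float)

    def grad(f, x, eps=1e-5):            # numerical gradient, for the sketch
        g = np.zeros_like(x)
        for i in range(x.size):
            d = np.zeros_like(x)
            d[i] = eps
            g[i] = (f(x + d) - f(x - d)) / (2 * eps)
        return g

    for _ in range(steps):               # stage 1: calibration, theta fixed
        cal = cal - lr * grad(lambda c: J(theta, c), cal)
    for _ in range(steps):               # stage 2: network, calibration fixed
        theta = theta - lr * grad(lambda t: J(t, cal), theta)
    return theta, cal

# Toy quadratic objective: minimized at theta = 0, cal = 1.
J_toy = lambda t, c: float(np.sum(t**2) + np.sum((c - 1.0)**2))
theta, cal = fit_two_stage(J_toy, [1.0], [0.0])
```

Freezing one block of parameters at a time keeps each subproblem small and, for the calibration stage, typically convex in (ω_1, ω_2, κ).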
The Back-End System
[Pipeline: Bottleneck Feature Extraction → Universal Background Model Training → MAP-Adapted Speaker Modeling (UBM supervectors) → Dimension Reduction (i-vectors) → Probabilistic Linear Discriminant Analysis (PLDA) → Recognition Scores; a state-of-the-art speaker recognition system.]
Roadmap
Introduction
Bottleneck feature extraction
1) A conversation level training criterion
2) Incorporating a separate system in training
Experiments
Summary
Experiments
We ran experiments on the same and different microphone tasks of NIST SRE 2010.
Microphone recordings were used in bottleneck network training:
– 173 speakers in the training and validation sets
– 4341 recordings in training and 865 recordings in validation
Network architecture:
294-dimensional input → 1000 x 42 x 500 → 173 speakers
Processing of the Input and Output Features of the Network
● Input features are mean and variance normalized to better condition the network.
● The bottleneck features are decorrelated for modeling with diagonal covariance GMMs.
[Diagram: Bottleneck Network → Bottleneck Feature Extraction (before the nonlinearity) → Decorrelation with PCA → Decorrelated Bottleneck Features]
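The PCA decorrelation step can be sketched in numpy (the function name is an assumption; whether the authors also scale by the eigenvalues is not stated, so this sketch only rotates):

```python
import numpy as np

def pca_decorrelate(feats):
    """Rotate bottleneck features with PCA so their covariance is diagonal,
    suiting diagonal-covariance GMM modeling (a sketch).

    feats : (N, D) array of bottleneck features
    """
    mu = feats.mean(axis=0)
    centered = feats - mu
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # orthonormal eigenvectors
    return centered @ eigvecs                # rotated: diagonal covariance

# Correlated toy data: dimension 1 depends on dimension 0.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
x[:, 1] += x[:, 0]
y = pca_decorrelate(x)
```

After the rotation the sample covariance is diagonal, so a diagonal-covariance GMM loses much less modeling power on these features.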
Effect of the Training Criterion
[Bar charts: EER (%) and 1000x min DCF (2008) on the same-mic and different-mic tasks for MFCC, frame-level, and conversation-level training. Conversation-level training gives relative reductions of 30.0% and 34.2% in EER and 36.4% and 30.0% in min DCF.]
Dependence on Feature Size
[Bar charts: EER (%) and 1000x min DCF (2008) on the same-mic and different-mic tasks for bottleneck features of 42, 60, and 100 dimensions.]
Performance when Trained with Information from a Separate System
[Bar charts: EER (%) and 1000x min DCF (2008) on the same-mic and different-mic tasks for MFCC, linear score combination, and incorporating a separate system in training. Incorporating the separate system yields relative gains of +14.0% and +18.0% in EER and +12.0% and +11.0% in min DCF.]
Summary
1) We showed how to train a neural network for use in the front-end of a speaker recognition system.
– We developed a conversation-level training criterion that minimizes a log-likelihood ratio score-based cost function.
2) We also showed how to use neural networks to exploit information from a separate system.
Thank you!