
All rights reserved © 2005 CRIM

The CRIM Systems for the NIST 2008 SRE

Patrick Kenny, Najim Dehak and Pierre Ouellet

Centre de recherche informatique de Montreal (CRIM)


Systems

• CRIM_2 was the primary system for all but the core condition

– Large stand-alone joint factor analysis (JFA) system trained on pre-2006 data

• CRIM_1 was the primary system for the core condition

– CRIM_1 = CRIM_2 + 3 other JFA systems with different feature sets

• CRIM_3 = CRIM_2 + 2006 SRE data


Overview

• Tasks involving multiple enrollment recordings:

– 8conv-short3, 3conv-short3

• Tasks involving 10 sec test recordings:

– 10sec-10sec, short2-10sec, 8conv-10sec

• Najim Dehak will talk about

– JFA with unconventional features

– Post-eval experiments on the interview data (following LPT and I4U)


Factor Analysis Configuration

• 2K Gaussians, 60 dimensional features

– 20 Gaussianized MFCCs + first and second derivatives

• 300 speaker factors

• 100 channel factors for telephone speech

• Additional 100 channel factors for microphone speech
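The sizes implied by this configuration can be worked out with a short sketch. It assumes that "2K Gaussians" means 2048 mixture components and that the telephone and microphone channel factors are stacked into a single matrix; neither is stated on the slide, and the constant names are hypothetical.

```python
# Sizes implied by the factor analysis configuration.
# Assumption: "2K Gaussians" means 2048 mixture components.
NUM_GAUSSIANS = 2048
FEATURE_DIM = 60             # 20 MFCCs + first and second derivatives
SPEAKER_FACTORS = 300
TEL_CHANNEL_FACTORS = 100
MIC_CHANNEL_FACTORS = 100    # additional factors for microphone speech

# A supervector concatenates one mean vector per Gaussian.
supervector_dim = NUM_GAUSSIANS * FEATURE_DIM

# Assumption: the speaker-factor and channel-factor loading matrices
# have one row per supervector entry and one column per factor.
speaker_matrix_shape = (supervector_dim, SPEAKER_FACTORS)
channel_matrix_shape = (supervector_dim,
                        TEL_CHANNEL_FACTORS + MIC_CHANNEL_FACTORS)

print("supervector dimension:", supervector_dim)
print("speaker loading matrix shape:", speaker_matrix_shape)
print("channel loading matrix shape:", channel_matrix_shape)
```

Under these assumptions the supervector has 122,880 entries, which is why low-rank factor loading matrices are used rather than full covariance matrices.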


Speaker Variability

Prior distribution on speaker supervectors

s = m + vy + dz

– m is the speaker-independent supervector

– v is rectangular, low rank (eigenvectors)

– d is diagonal

– y, z standard Normal random vectors (speaker factors)


Channel Variability

Each supervector M is assumed to be a sum of a speaker supervector and a channel supervector:

M = s + c

Prior distribution on channel supervectors

c = ux

– u is rectangular, low rank (eigenchannels)

– x is a standard Normal random vector (channel factors)
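The generative model M = s + c, with s = m + vy + dz and c = ux, can be sketched by drawing one utterance supervector from the priors. This is a toy illustration only: the dimensions, the random matrices and the helper names are hypothetical, and d is stored as a vector holding its diagonal.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def std_normal(n, rng):
    """Draw n independent standard Normal values."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def sample_utterance_supervector(m, v, d, u, rng):
    """Draw M = s + c with s = m + vy + dz and c = ux."""
    dim = len(m)
    y = std_normal(len(v[0]), rng)   # speaker factors
    z = std_normal(dim, rng)         # one entry per supervector dimension
    x = std_normal(len(u[0]), rng)   # channel factors
    vy = matvec(v, y)
    c = matvec(u, x)                 # channel supervector
    s = [m[i] + vy[i] + d[i] * z[i] for i in range(dim)]  # speaker supervector
    M = [s[i] + c[i] for i in range(dim)]
    return s, c, M


rng = random.Random(0)
DIM, NUM_SPEAKER_FACTORS, NUM_CHANNEL_FACTORS = 6, 2, 1   # toy sizes
m = [0.0] * DIM
v = [[rng.gauss(0.0, 1.0) for _ in range(NUM_SPEAKER_FACTORS)] for _ in range(DIM)]
u = [[rng.gauss(0.0, 1.0) for _ in range(NUM_CHANNEL_FACTORS)] for _ in range(DIM)]
d = [0.1] * DIM
s, c, M = sample_utterance_supervector(m, v, d, u, rng)
print("speaker supervector:", [round(value, 3) for value in s])
print("utterance supervector:", [round(value, 3) for value in M])
```

Drawing a new x while keeping y and z fixed gives a second utterance by the same speaker over a different channel.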


Enrollment: single utterance

The supervector for the utterance is

m + dz + vy + ux

Calculate the MAP estimates of x, y and z

The speaker supervector is

s = m + dz + vy

The full posterior distribution of s can be calculated in closed form (but this is messy unless d is 0)


Enrollment: 8conv case

Again the joint posterior distribution of the hidden variables can be calculated in closed form.

Unless d is 0, this is very messy

Trick: pool the utterances together and ignore the fact that the x’s are different

m + dz + vy + ux_1

m + dz + vy + ux_2

...

m + dz + vy + ux_8
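One plausible reading of the pooling trick is sketched below. It assumes that "pooling" means summing the per-utterance Baum-Welch statistics (zero-order counts and first-order sums per mixture component) so that the eight recordings are then enrolled as if they were a single utterance with a single x. The slide does not spell this out, and the function name and toy numbers are hypothetical.

```python
def pool_statistics(per_utterance_stats):
    """Sum zero-order counts and first-order sums over utterances.

    Each element of per_utterance_stats is a pair (counts, first_order):
    counts[c] is the soft frame count for mixture component c and
    first_order[c] is the corresponding weighted sum of feature vectors.
    """
    num_components = len(per_utterance_stats[0][0])
    dim = len(per_utterance_stats[0][1][0])
    pooled_counts = [0.0] * num_components
    pooled_first_order = [[0.0] * dim for _ in range(num_components)]
    for counts, first_order in per_utterance_stats:
        for c in range(num_components):
            pooled_counts[c] += counts[c]
            for k in range(dim):
                pooled_first_order[c][k] += first_order[c][k]
    return pooled_counts, pooled_first_order


# Two toy utterances, two mixture components, two-dimensional features.
utterance_1 = ([3.0, 1.0], [[1.0, 2.0], [0.5, 0.5]])
utterance_2 = ([2.0, 4.0], [[0.0, 1.0], [1.5, 2.5]])
pooled_counts, pooled_first_order = pool_statistics([utterance_1, utterance_2])
print(pooled_counts)
print(pooled_first_order)
```

The pooled statistics can then be passed to the single-utterance enrollment procedure, which avoids the joint posterior over eight separate channel vectors.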


10 second test conditions

Many labs have reported difficulty in getting channel factors or NAP to work under these conditions

The problem may be that it is unrealistic to attempt to produce point estimates (ML or MAP) of channel factors using 10 second test utterances

Probability rules say you should integrate over channel factors instead


Why is this not an issue for long test utterances?

If the test utterance is long, the posterior distribution of the channel factors will be sharply peaked in the neighbourhood of the point estimate (MAP or ML).

If the posterior distribution of x, P(x|data), is concentrated at x_MAP, then the equation

P(data|x) P(x) = P(x|data) P(data)

implies that

P(data|x) ≈ 0 unless x ≈ x_MAP
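The sharpening of the posterior with utterance length can be illustrated with a scalar analogue of the channel model. The sketch assumes observations o_t = u·x + noise with x ~ N(0, 1) and Gaussian noise of variance sigma2, for which the posterior of x is Gaussian with precision 1 + N·u²/sigma2. All values and names are hypothetical and the model is far simpler than the factor analysis model itself.

```python
import random


def channel_posterior(observations, u, sigma2):
    """Posterior mean and variance of x for o_t = u * x + noise,
    with prior x ~ N(0, 1) and noise ~ N(0, sigma2)."""
    n = len(observations)
    precision = 1.0 + n * u * u / sigma2
    mean = (u / sigma2) * sum(observations) / precision
    return mean, 1.0 / precision


rng = random.Random(0)
U, SIGMA2, TRUE_X = 0.1, 1.0, 1.5   # hypothetical toy values
results = {}
for num_frames in (10, 1000):        # "short" versus "long" utterance
    observations = [U * TRUE_X + rng.gauss(0.0, SIGMA2 ** 0.5)
                    for _ in range(num_frames)]
    results[num_frames] = channel_posterior(observations, U, SIGMA2)
    mean, variance = results[num_frames]
    print(f"{num_frames:5d} frames: posterior mean {mean:+.3f}, "
          f"posterior variance {variance:.4f}")
```

With these toy values the posterior variance falls from about 0.91 with 10 frames to about 0.09 with 1000 frames, so a point estimate is a much better summary of the posterior in the long case.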


Research Problem

How should factor analysis likelihoods and posteriors be evaluated so as to take account of all of the relevant uncertainties?

- Uncertainty in the speaker factors

- Uncertainty in the channel factors

- Uncertainty in the assignment of observations to mixture components


Current Solution

• Use point estimate of speaker factors

– Bayesian approach (using full posterior) doesn’t seem to help

• Integrate over the channel factors

• Use the UBM to align frames with mixture components

– Tractable posterior + Jensen’s inequality gives lower bound on likelihood (Niko Brummer)

– Very fast if combined with LPT assumption

• Paradoxical results if speaker/channel dependent GMMs used in place of UBM


Ideal Solution: Integrate over all hidden variables

• Robbie Vogt (Odyssey 2004) did this for a diagonal factor analysis model

– No speaker or channel factors

– Exact dynamic programming solution

• Variational Bayes offers an approximate solution in the general case

– Assume that the posterior distribution factorizes into 3 terms (speaker factors, channel factors, assignments of frames to mixture components)

– Cycle through the factors to update them (like EM)

– Jensen’s inequality gives lower bound on the likelihood which increases on successive iterations
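The variational bound referred to above can be written out in its generic form. Here y stands for the speaker factors, x for the channel factors and a for the frame-to-component assignments; the symbol a and the notation q are introduced only for this sketch and do not come from the slides.

```latex
% Factorized approximation to the posterior
q(y, x, a) = q(y)\, q(x)\, q(a)

% Jensen's inequality gives a lower bound on the log likelihood
\log P(\mathrm{data})
  \;\ge\; \mathbb{E}_{q}\!\left[ \log P(\mathrm{data}, y, x, a) \right]
  \;-\; \mathbb{E}_{q}\!\left[ \log q(y, x, a) \right]

% Update for one term with the other two held fixed
% (the updates for q(x) and q(a) are analogous)
\log q^{*}(y)
  = \mathbb{E}_{q(x)\,q(a)}\!\left[ \log P(\mathrm{data}, y, x, a) \right]
  + \mathrm{const}
```

Each such update cannot decrease the bound, which is the sense in which the procedure resembles EM.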


Fusion

• Fusing long term and short term features

• Unsupervised pseudo-syllable segmentation of prosodic and MFCC contours

• Six Legendre polynomial coefficients for each contour

• JFA without common factor (d=0)

• Logistic regression function (FoCal)


Pseudo-syllable segmentation


Long term features

• Three long term systems:

– 512 Gaussians, features: pitch + energy + duration (13 dimensions)

– 1024 Gaussians, features: 12 MFCC contours + energy + duration (79 dimensions)

– 1024 Gaussians, features: 12 MFCC contours + pitch + energy + duration (85 dimensions)
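The three feature dimensions are consistent with six Legendre coefficients per contour plus a single duration value. The slides do not give this breakdown explicitly, so the decomposition below is an assumption that happens to reproduce the stated totals; the function and constant names are hypothetical.

```python
LEGENDRE_COEFFS_PER_CONTOUR = 6   # six Legendre polynomial coefficients per contour
DURATION_DIMS = 1                 # assumption: duration is a single scalar


def long_term_feature_dim(num_contours):
    """Dimension of a long-term feature vector built from contour fits."""
    return num_contours * LEGENDRE_COEFFS_PER_CONTOUR + DURATION_DIMS


systems = {
    "pitch + energy + duration": long_term_feature_dim(2),
    "12 MFCC contours + energy + duration": long_term_feature_dim(12 + 1),
    "12 MFCC contours + pitch + energy + duration": long_term_feature_dim(12 + 2),
}

for name, dim in systems.items():
    print(f"{name}: {dim} dimensions")
```

Under this assumption the totals are 13, 79 and 85, matching the three systems listed.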


Short2-short3 : Tel-Tel det7


Short2-short3 : Tel-Tel det8


How to deal with interview data?

• Interview eigenchannels trained on interview development data (as LPT and I4U did).

• Small configuration of the factor analysis

– Features: 20 Gaussianized MFCCs + first derivatives

– 300 speaker factors, d=0 (no common factor), 100 telephone channel factors.

• We carried out two experiments:

– 50 Tel-Mic channel factors.

– 50 Tel-Mic channel factors + 50 interview channel factors.


NIST 2008 : Interview data –det1


NIST 2008 : Interview data –det1

System                                   EER (%)   MinDCF
Without interview eigenchannels          8.9       0.0477389
Interview speaker utterances means       5.5       0.0342164
Interview channel_2 utterance as means   5.7       0.0360786
Interview & microphone eigenchannels     5.7       0.033472


References

• A Study of Inter-Speaker Variability in Speaker Verification.

• Modeling prosodic features with joint factor analysis for speaker verification.

www.crim.ca/perso/patrick.kenny