A Baseline System for Speaker Recognition
C. Mokbel, H. Greige, R. Zantout, H. Abi Akl
A. Ghaoui, J. Chalhoub, R. Bayeh
University Of Balamand - ELISA
C. Mokbel - UOB - NIST2002 2
Outline
• Introduction
• Baseline speaker recognition system
• NIST 2002 evaluation
• Conclusion and perspective
Introduction
• A baseline system has been built and was used in the NIST 2002 speaker recognition evaluation
– GMM-based system
– Normalization using z-norm
– Adaptation technique used to estimate the speaker model starting from the world model
Baseline Speaker Recognition System
• Feature extraction:
– Speech recognition based feature vectors
• 13 MFCC coefficients, including the energy on a logarithmic scale
• + first and second order derivatives
– Leading to 39 feature parameters
• Preprocessing using cepstral mean normalization
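A minimal sketch of how the 39-dimensional vectors could be assembled from 13 static coefficients. The slides do not say how the derivatives are computed; the regression formula with a window of 2 frames, the edge padding, and the function names are assumptions made for illustration.

```python
import numpy as np

def deltas(feats, N=2):
    """Regression-based time derivatives over a +/-N frame window (assumed scheme)."""
    feats = np.asarray(feats, dtype=float)
    T = feats.shape[0]
    # Repeat the first and last frames so that edge frames have neighbours.
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom

def build_features(static):
    """static: (T, 13) array of MFCCs (incl. log energy) -> (T, 39) array."""
    static = np.asarray(static, dtype=float)
    # Cepstral mean normalization: subtract the per-utterance mean.
    static = static - static.mean(axis=0, keepdims=True)
    d1 = deltas(static)   # first-order derivatives
    d2 = deltas(d1)       # second-order derivatives
    return np.hstack([static, d1, d2])
```

Subtracting the mean before taking derivatives leaves the derivatives unchanged, since a constant offset cancels in the differences.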
• GMM modeling for both hypotheses: speaker and non-speaker (world)
– EM algorithm to train the world model (Baum-Welch)
• Initialization using LBG VQ
– Speaker model: adapted mean vectors from the world model
• Approximation of the “unified adaptation approach” (“Online Adaptation of HMMs to Real-Life Conditions: A Unified Framework”, IEEE Trans. on SAP, Vol. 9, No. 4, May 2001)
• Speaker adaptation:
– World model Gaussian distributions grouped in a binary tree
– Speaker-data-driven determination of the Gaussian classes
– MLLR applied based on these classes: only the means of the Gaussian distributions are adapted
– MAP applied to the leaf Gaussian distributions
• Building the Gaussian tree bottom-up:
– Grouping the closest Gaussian distributions two by two
– The distance between two Gaussian distributions is equal to the loss in likelihood of the associated data if the two Gaussians are merged into a single Gaussian
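The merge distance can be sketched as follows. This is an illustration, not the authors' code: it assumes diagonal covariances, that each Gaussian is the maximum-likelihood fit to its own data, and that the amount of associated data is summarized by occupation counts `n1` and `n2`. All names are hypothetical.

```python
import numpy as np

def merge_gaussians(n1, mu1, var1, n2, mu2, var2):
    """Moment-matched merge of two diagonal Gaussians with counts n1, n2."""
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    # Merged variance = within-Gaussian variance + spread of the two means.
    var = (n1 * (var1 + (mu1 - mu) ** 2) + n2 * (var2 + (mu2 - mu) ** 2)) / n
    return n, mu, var

def merge_loss(n1, mu1, var1, n2, mu2, var2):
    """Log-likelihood lost by the associated data when the two are merged.

    For an ML-fitted Gaussian, the data log-likelihood is
    -n/2 * (D*log(2*pi) + sum(log var) + D), so only the log-variance
    terms differ between the separate and the merged models.
    """
    n, _, var = merge_gaussians(n1, mu1, var1, n2, mu2, var2)
    return 0.5 * (n * np.sum(np.log(var))
                  - n1 * np.sum(np.log(var1))
                  - n2 * np.sum(np.log(var2)))
```

Under these assumptions the loss is zero for identical Gaussians and grows as the means move apart, so the pair with the smallest loss is the natural candidate to group first.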
• After the E-step of the EM algorithm, the weights associated with the leaves of the tree are propagated through the tree up to the root
• Going from the root to the leaves, nodes are selected whenever one of their two children has a weight less than a threshold
– This defines a partition that will be used in an MLLR algorithm
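One possible reading of this selection rule, sketched on a small binary tree. The `Node` class and function names are hypothetical, and treating a leaf as always selected once reached is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    weight: float = 0.0            # leaf: occupation weight from the E-step
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    name: str = ""

def propagate_weights(node: Node) -> float:
    """Bottom-up pass: each internal node receives the sum of its children."""
    if node.left is None and node.right is None:
        return node.weight
    node.weight = propagate_weights(node.left) + propagate_weights(node.right)
    return node.weight

def select_classes(node: Node, threshold: float) -> List[Node]:
    """Top-down pass: stop at a node as soon as one child is too light."""
    is_leaf = node.left is None and node.right is None
    if is_leaf or node.left.weight < threshold or node.right.weight < threshold:
        return [node]              # this node defines one regression class
    return select_classes(node.left, threshold) + select_classes(node.right, threshold)
```

The selected nodes cover every leaf exactly once, so the Gaussians under each selected node form one class of the partition.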
• MAP algorithm:
– The estimated Gaussian mean parameters at the leaves are smoothed, using a fixed weight, with the parameters of the world Gaussian
• Given a target speaker model S, the world model W and a test utterance X, the score for this utterance is computed as the log-likelihood ratio: s = log [p(X|S) / p(X|W)]
• This score should be normalized because the world model is not precise
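A minimal sketch of the log-likelihood ratio for two GMMs. It assumes diagonal covariances and independent frames, neither of which is stated in the slides; each model is passed as a hypothetical `(weights, means, variances)` tuple.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Total log-likelihood of frames X (T, D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                   # (T, M, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_comp = log_comp + np.log(weights)                      # (T, M)
    # Log-sum-exp over components, stabilized by the per-frame maximum.
    m = log_comp.max(axis=1, keepdims=True)
    per_frame = m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))
    return float(per_frame.sum())

def llr_score(X, speaker_gmm, world_gmm):
    """Log-likelihood ratio between the speaker model and the world model."""
    return gmm_log_likelihood(X, *speaker_gmm) - gmm_log_likelihood(X, *world_gmm)
```

A positive score favours the target speaker hypothesis, a negative one the world model.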
• Normalization using the z-norm:
– A few impostor utterances are used
– A score is computed for every utterance
– The different scores define a distribution per target speaker
– Target speaker distributions should be similar for a decision using a single threshold
• Center and scale (standardize) the distribution:
ns = a * s + b
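As a sketch of how a and b in ns = a * s + b can be obtained: with the usual z-norm, the impostor scores of a given target model have mean mu and standard deviation sigma, which gives a = 1/sigma and b = -mu/sigma. The function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import numpy as np

def znorm_params(impostor_scores):
    """Return (a, b) such that a * s + b standardizes the impostor scores."""
    scores = np.asarray(impostor_scores, dtype=float)
    mu = scores.mean()
    sigma = scores.std()          # population standard deviation (assumption)
    return 1.0 / sigma, -mu / sigma

def znorm(score, a, b):
    """Normalized score ns = a * s + b."""
    return a * score + b
```

After normalization, the impostor scores of every target model have zero mean and unit variance, which is what makes a single global threshold meaningful.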
• Based on the data from the 2001 evaluation, a DET curve can be plotted
– Find the optimal decision threshold that minimizes the cost defined by NIST 2002, i.e.:
Cdet = CMiss * Pr(Miss|Target) * Pr(Target) + CFalseAlarm * Pr(FalseAlarm|NonTarget) * (1 - Pr(Target))
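The threshold search can be sketched as a sweep over candidate thresholds on development scores. The slides do not give the cost parameters; the values in the usage note are those commonly quoted for NIST evaluations of that period and should be treated as an assumption, as should the function names.

```python
import numpy as np

def detection_cost(target_scores, nontarget_scores, threshold, c_miss, c_fa, p_target):
    """Cdet for a decision rule that accepts when score >= threshold."""
    p_miss = np.mean(np.asarray(target_scores) < threshold)      # rejected targets
    p_fa = np.mean(np.asarray(nontarget_scores) >= threshold)    # accepted impostors
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def best_threshold(target_scores, nontarget_scores, c_miss, c_fa, p_target):
    """Try every observed score (plus +inf, i.e. reject all) as a threshold."""
    candidates = np.sort(np.concatenate([target_scores, nontarget_scores, [np.inf]]))
    costs = [detection_cost(target_scores, nontarget_scores, t, c_miss, c_fa, p_target)
             for t in candidates]
    i = int(np.argmin(costs))
    return float(candidates[i]), float(costs[i])
```

For example, `best_threshold(tgt, non, c_miss=10.0, c_fa=1.0, p_target=0.01)` returns the threshold with the lowest cost on the given development scores, together with that cost.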
NIST 2002 evaluation
• Feature vector: 13 MFCCs + 13 Δ + 13 ΔΔ
• Cepstral Mean Normalization
• Gender-dependent GMMs with 256 Gaussian components for the world model
– Trained on a subset of the cellular data of the NIST 2001 evaluation
• Target speaker model adapted from the world model
– For every iteration and after the E-step:
• Threshold (cumulative probability = 3.0) to select tree nodes
• MLLR used to update the Gaussian means
• Approximated MAP to smooth the MLLR-estimated parameters: linear combination of the MLLR-estimated mean (0.8) and the world (a priori) mean (0.2)
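The last two steps can be sketched as below. Estimating the MLLR transform itself is not shown: the sketch assumes an affine transform (`A`, `b`) has already been estimated for a class of Gaussians, and the function names are hypothetical. The 0.8 / 0.2 weights are those quoted above.

```python
import numpy as np

def apply_mllr(world_means, A, b):
    """Apply an already-estimated affine mean transform: mu' = A @ mu + b.

    world_means: (M, D) means of the Gaussians in one regression class.
    """
    return world_means @ A.T + b

def smooth_means(mllr_means, world_means, mllr_weight=0.8):
    """Approximated MAP: fixed-weight interpolation with the prior (world) means."""
    return mllr_weight * mllr_means + (1.0 - mllr_weight) * world_means

def adapt_class(world_means, A, b, mllr_weight=0.8):
    """MLLR update followed by smoothing, for one class of Gaussians."""
    return smooth_means(apply_mllr(world_means, A, b), world_means, mllr_weight)
```

Only the means change; weights and variances of the speaker model stay those of the world model, in line with the mean-only adaptation described in the slides.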
• 16 male and 21 female speakers (NIST 2001) used as impostors (~8 test files from each)
– The pseudo-impostor scores define a distribution used to z-normalize the score for a given target speaker
• Global threshold estimated on NIST 2001 data in order to minimize the cost
• System characteristics:
– CPU time on a Pentium III 800 MHz:
• 2.1 ms per frame and per speaker for speaker model adaptation
• 0.92 ms per frame for the test
– Memory usage:
• ~360 Kbytes per test
• Results:
– Cdet = 0.100292
– Min Cdet = 0.097833
• DET curve: [figures not included in this transcript]
Conclusions and perspectives
• A new baseline system has been developed and evaluated
• A lot of work remains to be done, mainly:
– Optimize the feature extraction module
– Implement the complete Unified Adaptation approach
– Investigate new normalization strategies
– Integrate automatic labeling of speech segments