Speaker recognition system · 2020-02-24 · Speaker recognition system adaptation to unseen and...
Speaker recognition system adaptation to unseen and
mismatched recording devices in the NFI-FRIDA database
Finnian Kelly1, Anil Alexander1, Oscar Forth1, and David van der Vloed2
1 Oxford Wave Research Ltd, UK
2 Speech and Audio Research, Netherlands Forensic Institute, The Hague, Netherlands
Motivation
Current automatic speaker recognition systems are trained on large
quantities of diverse speaker recordings:
• performance is good for forensic casework material involving
typical microphone or telephone recordings
• for unseen recording types, such as a new covert surveillance
recorder or a new transmission condition, performance may be
degraded
How can we adapt a well-trained automatic system to the unseen and mismatched conditions of a new case?
Levels of mismatch
In order of increasing difficulty:
1. Matched seen conditions
2. Mismatched seen conditions
3. Matched unseen conditions
4. Mismatched unseen conditions
The challenge of mismatch
[Figure: H0 and H1 score distributions under matched, seen conditions and under mismatched, unseen conditions]
H0: same speaker scores
H1: different speaker scores
Existing solutions in VOCALISE
• Train the system from scratch with relevant data
  • Data hungry: 1000s of speakers required
• Re-train the system's LDA/PLDA stages with relevant data
  • Data hungry: 100s of speakers required
• Apply score normalisation
  • Will help, but is limited
Here we introduce a new method of adapting a well-trained
system to unseen conditions on the fly using small* quantities
of data => forensically realistic
*10s of speakers
VOCALISE i-vector framework
[Diagram: speech → feature extraction → UBM (high-dimensional, universal speaker space) → i-vector extraction → i-vector (low-dimensional, speaker-specific space)]
Comparing i-vectors
[Diagram: i-vector A and i-vector B → LDA / PLDA → comparison score for i-vectors A and B]
Post-processing i-vectors
We could compare ‘raw’ i-vectors directly, but
it is beneficial to first post-process i-vectors to
increase their discriminatory power
LDA (linear discriminant analysis) is an
important post-processing step that:
1. Increases inter-speaker separability
2. Reduces dimensionality
Linear Discriminant Analysis (LDA)
• LDA projects i-vectors into a new space in which:
  • within-speaker variability is minimised
  • between-speaker separation is maximised
• Requires a set of training i-vectors and their speaker labels
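The idea behind the LDA projection can be sketched on toy data. This is a minimal illustration, not the VOCALISE implementation: it assumes only two speakers and made-up two-dimensional "i-vectors", for which the Fisher direction has the closed form w = Sw^-1 (mu_a - mu_b). Real systems work with hundreds of dimensions and many speakers, and solve a generalised eigenproblem instead. All names and values below are hypothetical.

```python
def fisher_direction(class_a, class_b):
    """Unit-length Fisher (two-class LDA) direction for 2-D points."""
    mu_a = [sum(p[i] for p in class_a) / len(class_a) for i in range(2)]
    mu_b = [sum(p[i] for p in class_b) / len(class_b) for i in range(2)]

    # Within-speaker scatter matrix Sw (2x2), pooled over both speakers.
    sw = [[0.0, 0.0], [0.0, 0.0]]
    for points, mu in ((class_a, mu_a), (class_b, mu_b)):
        for p in points:
            d = [p[0] - mu[0], p[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    sw[i][j] += d[i] * d[j]

    # w = Sw^-1 (mu_a - mu_b), using the closed-form inverse of a 2x2 matrix.
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    w = [
        (sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
        (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det,
    ]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / norm, w[1] / norm]


def project(points, w):
    """Project 2-D points onto the direction w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]


# Toy vectors with speaker labels: the speakers differ mainly along
# dimension 0, while dimension 1 carries large within-speaker variation.
speaker_a = [(0.0, 1.0), (0.4, 3.0), (-0.3, -2.0), (0.1, -1.0)]
speaker_b = [(2.0, 2.0), (2.3, -3.0), (1.8, 1.0), (2.1, -1.0)]

w = fisher_direction(speaker_a, speaker_b)
proj_a = project(speaker_a, w)
proj_b = project(speaker_b, w)
print(w, proj_a, proj_b)
```

After projection the two speakers occupy non-overlapping ranges on a single axis, which is both the separability gain and the dimensionality reduction described above.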
LDA training
• The LDA transformation is generally learned using the same training data as the other models in the i-vector framework (the UBM and the total variability (TV) matrix).
• Can we leverage LDA for adapting a system to new conditions?
Condition adaptation via LDA
[Diagram: system development i-vectors (N ≈ 50,000) → well-trained LDA transformation; adaptation i-vectors (N ≈ 100) → adapted LDA transformation]
Probabilistic LDA (PLDA)
• PLDA compares two post-LDA i-vectors and returns a comparison score
• The score is calculated based on the most discriminative parts of an i-vector:
  • Achieved by learning a subspace that describes the dominant directions of change in the i-vectors of different speakers
• PLDA therefore requires a set of post-LDA training i-vectors and their speaker labels
We supplement our LDA condition adaptation by re-training
PLDA with all adapted i-vectors
Reference normalisation
Reference (or score) normalisation is an established technique for
adapting the output of a system to new conditions
[Diagram: i-vector A and i-vector B are compared to give a raw comparison score. Each is also compared against a set of reference i-vectors, giving reference scores A and reference scores B, which are used to turn the raw score into a normalised comparison score.]
Can only shift scores up or down; less powerful than LDA/PLDA adaptation…
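The slides do not state which normalisation formula is used. As a sketch under that caveat, one widely used form is symmetric normalisation (S-norm), which standardises the raw score against the reference scores of each side and averages the two results. The function name and the toy numbers below are illustrative assumptions.

```python
from statistics import mean, stdev


def s_norm(raw_score, ref_scores_a, ref_scores_b):
    """Symmetric normalisation of a raw comparison score.

    ref_scores_a: scores of i-vector A against the reference i-vectors
    ref_scores_b: scores of i-vector B against the reference i-vectors
    """
    z_a = (raw_score - mean(ref_scores_a)) / stdev(ref_scores_a)
    z_b = (raw_score - mean(ref_scores_b)) / stdev(ref_scores_b)
    return 0.5 * (z_a + z_b)


# Toy values (hypothetical): both reference sets have mean 2.0;
# the sample standard deviations are 1.0 and 2.0 respectively.
reference_scores_a = [1.0, 2.0, 3.0]
reference_scores_b = [0.0, 2.0, 4.0]

normalised = s_norm(4.0, reference_scores_a, reference_scores_b)
print(normalised)  # 1.5
```

The reference i-vectors should come from recordings in the new condition, so that the reference scores reflect how the system behaves on that kind of material.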
Mismatched condition experiments
System:
iVOCALISE 2017B
• TEL only and TEL-MIC session
• Condition adaptation
• Reference normalisation
Alexander, A., Forth, O., Atreya, A. A. and Kelly, F. (2016). VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features. Speaker Odyssey 2016, Bilbao, Spain.
outline
NFI-FRIDA recap
Experiments with NFI-FRIDA
• 40 test speakers
• 3 recordings per speaker from each of the following devices
• d1: Headset microphone
• d2: Close microphone A
• d3: Close microphone B
• d4: Far microphone
• d5: Telephone intercept
• 1 additional recording per speaker from device d1
• Cross-device (mismatched) performance, relative to d1, was evaluated
for all devices d1 to d5
• #H0 (same speaker) comparisons = 120
• #H1 (different speaker) comparisons = 4680
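The two counts can be reproduced under one reading of the setup, which is an assumption since the slides do not spell out the pairing: each speaker's additional d1 recording acts as the reference and is compared with the 3 recordings of every speaker on the device under test.

```python
# Assumed pairing (not stated on the slide): one reference recording per
# speaker from d1, compared with 3 recordings per speaker on the test device.
n_speakers = 40
recs_per_device = 3

n_h0 = 0  # same-speaker comparisons
n_h1 = 0  # different-speaker comparisons
for ref_speaker in range(n_speakers):
    for test_speaker in range(n_speakers):
        for _ in range(recs_per_device):
            if ref_speaker == test_speaker:
                n_h0 += 1
            else:
                n_h1 += 1

print(n_h0, n_h1)  # 120 4680
```

That is, 40 x 3 = 120 same-speaker comparisons and 40 x 39 x 3 = 4680 different-speaker comparisons per device pair.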
Condition adaptation & reference normalisation experiments
• 15 training speakers (no overlap with the 40 test speakers)
• 2 recordings per speaker from each of the following devices
• d1: Headset microphone
• d2: Close microphone A
• d3: Close microphone B
• d4: Far microphone
• d5: Telephone intercept
• For condition adaptation, 2 recordings from each of the devices
under comparison were used => 2 recordings x 2 devices x 15
speakers = 60 recordings*
• For reference normalisation, 2 recordings from the other (not d1)
device were used => 2 recordings x 1 device x 15 speakers = 30
recordings
* With the exception of d1-d1, where only 30 recordings were used
[Figure: Cross-condition performance (EER%), Telephone-only session data, for devices d1: Headset mic, d2: Close mic A, d3: Close mic B, d4: Far mic, d5: Tel intercept. 15 speakers for adaptation/normalisation]
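For readers unfamiliar with the metric, the equal error rate (EER) is the error rate at the score threshold where the miss rate on H0 (same speaker) comparisons equals the false-alarm rate on H1 (different speaker) comparisons. The sketch below is a simple approximation that sweeps thresholds over the observed scores. It is not the evaluation code behind the reported results, and the scores are made up.

```python
def equal_error_rate(h0_scores, h1_scores):
    """Approximate EER from same-speaker (H0) and different-speaker (H1) scores.

    Sweeps a threshold over every observed score and returns the error rate
    at the point where the miss and false-alarm rates are closest.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(h0_scores) | set(h1_scores)):
        # Miss: a same-speaker comparison scoring below the threshold.
        miss = sum(s < threshold for s in h0_scores) / len(h0_scores)
        # False alarm: a different-speaker comparison at or above it.
        false_alarm = sum(s >= threshold for s in h1_scores) / len(h1_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Hypothetical scores: one H1 score (2.5) exceeds one H0 score (2.0).
eer_overlap = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 1.5, 2.5])
eer_separated = equal_error_rate([3.0, 4.0], [1.0, 2.0])
print(eer_overlap, eer_separated)  # 0.25 0.0
```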
[Figure: Cross-condition performance (EER%), Telephone+Microphone session data, devices d1 to d5. 15 speakers for adaptation/normalisation]
[Figure: Revisiting matched comparison EER%, devices d1 to d5. 15 speakers for adaptation/normalisation]
Cross-condition performance (Cllr-min): Telephone+Microphone session data
[Figure: Cllr-min for devices d1: Headset mic, d2: Close mic A, d3: Close mic B, d4: Far mic, d5: Tel intercept]
• Cllr-min, like the EER, measures the ability of the system to discriminate
between speakers
• Unlike the EER, it considers the discriminatory power of the system across
all possible score thresholds
• Cllr-min, or minimum log-likelihood ratio cost, is the optimal Cllr value
achievable by a system
• Like the EER, lower is better:
• Cllr-min = 0 for a perfect system
• Cllr-min = 1 for a useless system
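The two end points above can be checked with a small sketch. It assumes the usual definition, Cllr = 0.5 x (mean of log2(1 + 1/LR) over H0 comparisons + mean of log2(1 + LR) over H1 comparisons), with Cllr-min obtained by first replacing the scores with optimally calibrated LRs via the pool-adjacent-violators (PAV) algorithm. Some toolkits add smoothing at this step, so their values may differ slightly. The function name and scores are hypothetical.

```python
import math


def cllr_min(h0_scores, h1_scores):
    """Minimum Cllr of a score set (H0 = same speaker, H1 = different speaker)."""
    # Count H0 and H1 comparisons at each distinct score so ties share one LR.
    counts = {}
    for s in h0_scores:
        counts.setdefault(s, [0, 0])[0] += 1
    for s in h1_scores:
        counts.setdefault(s, [0, 0])[1] += 1

    def h0_fraction(block):
        return block[0] / (block[0] + block[1])

    # Pool adjacent violators: merge neighbouring blocks until the H0
    # fraction is strictly increasing with score.
    blocks = []
    for s in sorted(counts):
        blocks.append(list(counts[s]))
        while len(blocks) > 1 and h0_fraction(blocks[-2]) >= h0_fraction(blocks[-1]):
            n0, n1 = blocks.pop()
            blocks[-1][0] += n0
            blocks[-1][1] += n1

    n_h0, n_h1 = len(h0_scores), len(h1_scores)
    cost_h0 = cost_h1 = 0.0
    for b0, b1 in blocks:
        if b0 and b1:
            # Optimally calibrated LR: block odds divided by overall odds.
            lr = (b0 / b1) / (n_h0 / n_h1)
            cost_h0 += b0 * math.log2(1 + 1 / lr)
            cost_h1 += b1 * math.log2(1 + lr)
        # Pure-H0 blocks (LR = inf) and pure-H1 blocks (LR = 0) add no cost.
    return 0.5 * (cost_h0 / n_h0 + cost_h1 / n_h1)


perfect = cllr_min([3.0, 4.0], [1.0, 2.0])            # fully separated scores
useless = cllr_min([1.0] * 3, [1.0] * 5)              # identical scores
overlapping = cllr_min([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 1.5, 2.5])
print(perfect, useless, overlapping)  # 0.0 1.0 0.25
```

Because PAV only needs the ordering of the scores, Cllr-min reflects discrimination alone and is unaffected by how well the raw scores are calibrated.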
[Figure: Cross-condition performance (Cllr-min), Telephone+Microphone session data, devices d1 to d5. 15 speakers for adaptation/normalisation]
[Figure: Revisiting matched comparison Cllr-min, devices d1 to d5. 15 speakers for adaptation/normalisation]
Varying the number of adaptation & normalisation speakers
• The original set of 15 training speakers was increased to 38 speakers
(again, no overlap with the 40 test speakers)
• 2 recordings from each of the devices d1 to d5 were used
• Condition adaptation and reference normalisation proceeded as
before, increasing the number of training speakers in increments of 5,
from 5 to 38
• Results presented for the d1-d4 and d1-d5 comparisons only
[Figure: d1-d4: Close mic - Far mic, Telephone+Microphone session data. Cllr-min against #adapt/reference speakers]
[Figure: d1-d5: Close mic - Telephone intercept, Telephone+Microphone session data. Cllr-min against #adapt/reference speakers]
[Figure: d1-d5: Close mic - Telephone intercept, condition adaptation variance with 5 speakers. Cllr-min against #adapt/reference speakers]
[Figure: d1-d5: Close mic - Telephone intercept, reference normalisation variance with 5 speakers. Cllr-min against #adapt/reference speakers]
[Figure: d1-d5: Close mic - Telephone intercept, condition adaptation variance with 20 speakers. Cllr-min against #adapt/reference speakers]
[Figure: d1-d5: Close mic - Telephone intercept, reference normalisation variance with 20 speakers. Cllr-min against #adapt/reference speakers]
Conclusions
The baseline performance of a well-trained automatic system under unseen and mismatched conditions is good:
<4% EER
Condition adaptation can provide consistent and stable performance improvement with a very small number of speakers (≈30) => applicable to forensic casework
Condition adaptation has the scope to exploit additional speakers and recordings if they are available
Here we have used condition adaptation and reference normalisation in isolation; they can be used in combination
Thanks!