AUTOMATICALLY IDENTIFYING PERCEPTUALLY SIMILAR VOICES FOR VOICE PARADES

Finnian Kelly¹², Anil Alexander¹, Oscar Forth¹, Samuel Kent¹, Jonas Lindh³ and Joel Åkesson³
¹Oxford Wave Research Ltd, Oxford, UK  ²The University of Texas at Dallas, USA  ³Voxalys AB, Gothenburg, Sweden
{finnian|anil|oscar|sam}@oxfordwaveresearch.com, {jonas|joel}@voxalys.se
VOICE PARADES

A voice parade is a set of voices, or 'foils', judged to lie within a suitable range of similarity to a suspect's voice.

• Selecting foils requires manual screening of candidate voices by a phonetician
• This is a time-consuming, costly and subjective process
• Automating the selection of foils, under supervision of the forensic expert, could:
  • Allow a much larger pool of candidate voices to be considered
  • Reduce subjectivity
  • Increase speed while reducing costs

• UK Home Office, "Advice on the use of voice identification parades", circular 57/2003, December 2003
• K. McDougall, "Assessing perceived voice similarity using Multidimensional Scaling for the construction of voice parades", IJSLL, vol. 20, no. 2, pp. 163-172, 2013
VOICE CASTING

Voice casting is the task of identifying the voice in a candidate database most similar to a target voice.

• Voice casting is typically used for cross-language dubbing of film and games
• Voices are manually compared by casting experts
• Automating voice casting carries similar benefits to automating voice parades

• N. Obin and A. Roebel, "Similarity Search of Acted Voices for Automatic Voice Casting", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1642-1651, Sept. 2016
PERCEIVED VOICE SIMILARITY

• Speaker traits: sex, age
• Acoustic characteristics: articulation, timbre, prosody, vocal effort
• Voice quality: breathy, hoarse, creaky

• F. Nolan, P. French, K. McDougall, L. Stevens and T. Hudson, "The role of voice quality 'settings' in perceived voice similarity", IAFPA 2011 conference, Vienna, Austria, 2011
FEATURES FOR SPEAKER RECOGNITION

Mel Frequency Cepstral Coefficients (MFCCs)

• MFCCs are the standard features in automatic speaker recognition
• They effectively capture short-term characteristics of the vocal tract

• J. H. L. Hansen and T. Hasan, "Speaker Recognition by Machines and Humans: A tutorial review", IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74-99, Nov. 2015
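To make this concrete, below is a minimal sketch of MFCC extraction using the open-source librosa library. This is an illustration only, not the iVOCALISE front-end used in this work; the file name and frame settings are hypothetical.

# Minimal MFCC extraction sketch (librosa); illustrative only,
# not the iVOCALISE front-end. "voice.wav" is a hypothetical file.
import librosa
import numpy as np

y, sr = librosa.load("voice.wav", sr=16000)             # mono, 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,      # 20 coefficients
                            n_fft=400, hop_length=160)  # 25 ms frames, 10 ms hop
delta = librosa.feature.delta(mfcc)                     # first-order dynamics
features = np.vstack([mfcc, delta])                     # shape: (40, n_frames)
print(features.shape)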
FEATURES FOR SPEAKER RECOGNITION

Phonetic features, e.g. long-term formants (LTFs)

• Less discriminative than MFCCs for automatic speaker recognition
• However, they capture acoustic characteristics of the voice important for perceived similarity…

• J. H. L. Hansen and T. Hasan, "Speaker Recognition by Machines and Humans: A tutorial review", IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74-99, Nov. 2015

[LTF illustration from the Catalina manual]
AUTO-PHONETIC FEATURES: iVOCALISE

Automatic extraction of phonetic features with iVOCALISE:

• F1-F4 + ∆
• F0 + ∆
• F0 (semitones) + ∆

These capture pitch and formant ranges along with temporal information (intonation patterns).

• A. Alexander, O. Forth, A. A. Atreya and F. Kelly, "VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features", Odyssey 2016, Bilbao, Spain
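As a rough stand-in for the F0-based auto-phonetic features, the sketch below tracks F0 with librosa's pYIN implementation, converts it to semitones and takes frame-to-frame deltas. This is an assumption-laden illustration, not iVOCALISE's actual extraction; the file name and the 100 Hz semitone reference are hypothetical choices.

# Rough stand-in for the F0-based auto-phonetic features; not the
# iVOCALISE algorithm. "voice.wav" is a hypothetical file.
import librosa
import numpy as np

y, sr = librosa.load("voice.wav", sr=16000)
f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)  # Hz; NaN when unvoiced
f0 = f0[voiced & ~np.isnan(f0)]                 # keep voiced frames only
f0_st = 12 * np.log2(f0 / 100.0)                # semitones re. a 100 Hz reference
d_st = np.diff(f0_st)                           # deltas, capturing intonation movement
print(f"F0 range: {f0_st.min():.1f} to {f0_st.max():.1f} st, "
      f"median |delta|: {np.median(np.abs(d_st)):.2f} st")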
VOICE CORPUS

• 175 public figures (actors, musicians, etc.)
• ~2 recordings each, ~30 sec average length
• Sourced from online archives (primarily YouTube)
• Male-to-female speaker ratio is approximately 2:1
• All speech is in English, with wide variation in accent
• Recordings are exclusively from lapel microphones
• Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT

[Illustration: candidate voices ordered by automatic similarity rank 1, 2, …, N]

• A. Alexander, O. Forth, A. A. Atreya and F. Kelly, "VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features", Odyssey 2016, Bilbao, Spain
SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1. Similar: the two highest-ranked speakers
2. Different: two randomly-ranked speakers (constrained to be same-gender and outside the top ten)
3. Same speaker: one different recording of the target speaker

A minimal sketch of this selection logic follows below.
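The sketch assumes a hypothetical dictionary of same-gender candidate scores for one target, standing in for the output of the automatic comparison.

# Sketch of cohort selection for one target; candidate names and
# scores are hypothetical stand-ins for automatic system output.
import numpy as np

rng = np.random.default_rng(0)
scores = {f"cand_{i:03d}": s for i, s in enumerate(rng.normal(size=100))}

ranked = sorted(scores, key=scores.get, reverse=True)  # rank 1 = most similar
similar = ranked[:2]                                   # 1) two highest-ranked
outside_top_ten = ranked[10:]                          # 2) random, outside top ten
different = list(rng.choice(outside_top_ten, size=2, replace=False))
print("Similar:", similar, "Different:", different)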
CREATING LISTENER COMPARISONS

Each recording was divided into 7 sec chunks. For each target speaker, chunks of the target recording were paired with chunks of the cohort recordings to create the listener comparisons: Similar Comparisons 1 & 2, Different Comparisons 1 & 2, and Same Speaker Comparisons 1 & 2.

[Illustration: chunk pairings for Target Speaker 1, by comparison type]
THE LISTENER TEST

6 target speakers (3 male, 3 female), giving Similar comparisons ×12, Different comparisons ×12 and Same Speaker comparisons ×6.

Listeners were instructed to:
1. Judge the similarity of the two voices on a scale of 1-9
2. Ignore the speakers' accents, any non-speech noises, and the spoken content

The test was administered over the web; there was no supervision of the test or control over the listening environment.
THE LISTENERS

• 43 listeners: 25 female, 18 male
• 20 spoke English as a first language; 23 did not
• Age range, hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS

[Histograms: P(similarity rating) vs similarity rating, by comparison type]

• Same Speaker comparisons: median = 8
• Similar comparisons: median = 5
• Different comparisons: median = 3
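Medians like those above can be computed directly from a table of listener responses; a minimal sketch with pandas, using a hypothetical stand-in for the collected data:

# Median similarity rating per comparison type; the DataFrame is a
# hypothetical stand-in for the collected listener responses.
import pandas as pd

responses = pd.DataFrame({
    "comparison_type": ["same", "same", "similar", "similar", "different", "different"],
    "rating":          [8,      9,      5,         4,         3,           2],
})
print(responses.groupby("comparison_type")["rating"].median())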
RESPONSES TO FEMALE VOICE COMPARISONS

[Histograms: P(similarity rating) vs similarity rating, by comparison type]

• Same Speaker comparisons: median = 8
• Similar comparisons: median = 3
• Different comparisons: median = 4

Note that Similar comparisons were rated lower than Different comparisons; cross-accent comparisons were a contributing factor (audio Examples 1 & 2).
PERCEIVED VOICE SIMILARITY & SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and perceived similarity?

• A correlation between perceived similarity and speaker recognition scores has been observed with MFCC features
• However, greater correlation has been observed by combining MFCCs and voice quality labels
• What about phonetic features?

• N. Obin and A. Roebel, "Similarity Search of Acted Voices for Automatic Voice Casting", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1642-1651, Sept. 2016

SCORE VS SIMILARITY RATING

Responses to Similar and Different male comparisons: mean similarity rating vs automatic score.

• Auto-Phonetic: Pearson correlation = 0.72
• MFCC: Pearson correlation = 0.72
• Auto-Phonetic + MFCC: Pearson correlation = 0.76
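The sketch below shows how such correlations, and a simple equal-weight fusion of z-normalised Auto-Phonetic and MFCC scores, might be computed. The fusion rule and all arrays are assumptions for illustration; the deck does not specify its combination method.

# Pearson correlation of system scores with mean listener ratings,
# plus a simple equal-weight fusion of z-normalised scores. The
# fusion rule is an assumption; scores and ratings are synthetic.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
ratings = rng.uniform(1, 9, size=24)             # mean rating per comparison
ap = ratings + rng.normal(scale=2.0, size=24)    # hypothetical Auto-Phonetic scores
mfcc = ratings + rng.normal(scale=2.0, size=24)  # hypothetical MFCC scores

z = lambda x: (x - x.mean()) / x.std()           # z-normalise each score set
fused = (z(ap) + z(mfcc)) / 2                    # equal-weight fusion

for name, s in [("Auto-Phonetic", ap), ("MFCC", mfcc), ("Fused", fused)]:
    r, _ = pearsonr(s, ratings)
    print(f"{name}: r = {r:.2f}")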
CONCLUSIONS

• Promising results for male comparisons
  • Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
• Variable results across female comparisons
  • Data was a contributing factor: smaller female candidate pool and cross-accent comparisons
• Room to improve:
  • Larger subjective evaluation required
  • Combine Auto-Phonetic and MFCC features
  • Scope to expand the Auto-Phonetic feature set
AUTOMATIC VOICE CASTING: RECENT RESULTS FROM SDI MEDIA

SDI Media is a major provider of dubbing services worldwide. SDI's Italian branch has been using iVOCALISE with Auto-Phonetic and MFCC features for automatic voice casting…

[Audio examples: English → Italian 1, English → Italian 2]
CONCLUSIONS

• The definition of voice similarity is application-dependent: voice parades vs voice casting
• Allow for an application-dependent search space
  • Use meta-data such as gender, age and accent to constrain the set of candidate voices
• Allow for an application-dependent 'degree of similarity'
  • Given well-calibrated output scores from the automatic system, a score range of interest can be defined (see the sketch below)
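For example, with calibrated scores one might keep only candidates falling inside an application-chosen band, e.g. "similar, but not too similar" foils for a voice parade. A minimal sketch, with hypothetical scores and thresholds:

# Selecting candidates whose calibrated score falls in an
# application-dependent range; scores and thresholds are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
llr = {f"cand_{i:03d}": s for i, s in enumerate(rng.normal(0, 2, size=100))}

LOW, HIGH = 0.5, 2.5   # example band for voice-parade foil selection
foil_pool = [name for name, s in llr.items() if LOW <= s <= HIGH]
print(f"{len(foil_pool)} candidates in score range [{LOW}, {HIGH}]")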
VOICE PARADES
bull UK Home Office ldquoAdvice on the use of voice identification paradesrdquo circular
572003 December 2003
bull K McDougall ldquoAssessing perceived voice similarity using Multidimensional Scaling
for the construction of voice paradesrdquo IJSLL vol 20 no 2 pp 163-172 2013
A voice parade is a set of voices or foils judged to lie within a suitable range of
similarity to a suspectrsquos voice
bull Selecting foils requires manual screening of candidate voices by a phonetician
bull This is a time-consuming costly and subjective process
bull Automating the selection of foils under supervision of the forensic expert could
bull Allow a much larger pool of candidate voices to be considered
bull Reduce subjectivity
bull Increase speed while reducing costs
VOICE CASTING
VOICE CASTING
Voice Casting is the task of identifying the voice in a candidate database
most similar to a target voice
bull Voice casting is typically cross-language dubbing of film and games
bull Voices are manually compared by casting experts
bull Automating voice casting carries similar benefits to automating voice
parades
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
VOICE CASTING
Voice Casting is the task of identifying the voice in a candidate database
most similar to a target voice
bull Voice casting is typically cross-language dubbing of film and games
bull Voices are manually compared by casting experts
bull Automating voice casting carries similar benefits to automating voice
parades
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
PERCEIVED VOICE SIMILARITY
PERCEIVED VOICE SIMILARITY
Speaker traits sex age
Acoustic Characteristics articulation
timbre prosody vocal effort
Voice Quality breathy hoarse creaky
bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of
voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference
Vienna Austria 2011
FEATURES FOR SPEAKER RECOGNITION
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
Mel Frequency Cepstral Coefficients
bull MFCCs are the standard in
automatic speaker recognition
bull They effectively capture short-term
characteristics of the vocal tract
FEATURES FOR SPEAKER RECOGNITION
Phonetic features eg long-term
formants
bull Less discriminative than MFCCs for
automatic speaker recognition
bull However they capture acoustic
characteristics of the voice
important for perceived similarityhellip
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
LTF illustration from Catalina Manual
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull F0 + ∆
bull F0 (semitones) + ∆
Capture pitch and format ranges along
with temporal information (intonation
patterns)
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
VOICE CASTING
VOICE CASTING
Voice Casting is the task of identifying the voice in a candidate database
most similar to a target voice
bull Voice casting is typically cross-language dubbing of film and games
bull Voices are manually compared by casting experts
bull Automating voice casting carries similar benefits to automating voice
parades
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
VOICE CASTING
Voice Casting is the task of identifying the voice in a candidate database
most similar to a target voice
bull Voice casting is typically cross-language dubbing of film and games
bull Voices are manually compared by casting experts
bull Automating voice casting carries similar benefits to automating voice
parades
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
PERCEIVED VOICE SIMILARITY
PERCEIVED VOICE SIMILARITY
Speaker traits sex age
Acoustic Characteristics articulation
timbre prosody vocal effort
Voice Quality breathy hoarse creaky
bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of
voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference
Vienna Austria 2011
FEATURES FOR SPEAKER RECOGNITION
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
Mel Frequency Cepstral Coefficients
bull MFCCs are the standard in
automatic speaker recognition
bull They effectively capture short-term
characteristics of the vocal tract
FEATURES FOR SPEAKER RECOGNITION
Phonetic features eg long-term
formants
bull Less discriminative than MFCCs for
automatic speaker recognition
bull However they capture acoustic
characteristics of the voice
important for perceived similarityhellip
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
LTF illustration from Catalina Manual
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull F0 + ∆
bull F0 (semitones) + ∆
Capture pitch and format ranges along
with temporal information (intonation
patterns)
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
VOICE CASTING
Voice Casting is the task of identifying the voice in a candidate database
most similar to a target voice
bull Voice casting is typically cross-language dubbing of film and games
bull Voices are manually compared by casting experts
bull Automating voice casting carries similar benefits to automating voice
parades
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
VOICE CASTING
Voice Casting is the task of identifying the voice in a candidate database
most similar to a target voice
bull Voice casting is typically cross-language dubbing of film and games
bull Voices are manually compared by casting experts
bull Automating voice casting carries similar benefits to automating voice
parades
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
PERCEIVED VOICE SIMILARITY
PERCEIVED VOICE SIMILARITY
Speaker traits sex age
Acoustic Characteristics articulation
timbre prosody vocal effort
Voice Quality breathy hoarse creaky
bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of
voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference
Vienna Austria 2011
FEATURES FOR SPEAKER RECOGNITION
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
Mel Frequency Cepstral Coefficients
bull MFCCs are the standard in
automatic speaker recognition
bull They effectively capture short-term
characteristics of the vocal tract
FEATURES FOR SPEAKER RECOGNITION
Phonetic features eg long-term
formants
bull Less discriminative than MFCCs for
automatic speaker recognition
bull However they capture acoustic
characteristics of the voice
important for perceived similarityhellip
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
LTF illustration from Catalina Manual
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull F0 + ∆
bull F0 (semitones) + ∆
Capture pitch and format ranges along
with temporal information (intonation
patterns)
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
VOICE CASTING
Voice Casting is the task of identifying the voice in a candidate database
most similar to a target voice
bull Voice casting is typically cross-language dubbing of film and games
bull Voices are manually compared by casting experts
bull Automating voice casting carries similar benefits to automating voice
parades
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
PERCEIVED VOICE SIMILARITY
PERCEIVED VOICE SIMILARITY
Speaker traits sex age
Acoustic Characteristics articulation
timbre prosody vocal effort
Voice Quality breathy hoarse creaky
bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of
voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference
Vienna Austria 2011
FEATURES FOR SPEAKER RECOGNITION
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
Mel Frequency Cepstral Coefficients
bull MFCCs are the standard in
automatic speaker recognition
bull They effectively capture short-term
characteristics of the vocal tract
FEATURES FOR SPEAKER RECOGNITION
Phonetic features eg long-term
formants
bull Less discriminative than MFCCs for
automatic speaker recognition
bull However they capture acoustic
characteristics of the voice
important for perceived similarityhellip
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
LTF illustration from Catalina Manual
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull F0 + ∆
bull F0 (semitones) + ∆
Capture pitch and format ranges along
with temporal information (intonation
patterns)
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
PERCEIVED VOICE SIMILARITY
PERCEIVED VOICE SIMILARITY
Speaker traits sex age
Acoustic Characteristics articulation
timbre prosody vocal effort
Voice Quality breathy hoarse creaky
bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of
voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference
Vienna Austria 2011
FEATURES FOR SPEAKER RECOGNITION
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
Mel Frequency Cepstral Coefficients
bull MFCCs are the standard in
automatic speaker recognition
bull They effectively capture short-term
characteristics of the vocal tract
FEATURES FOR SPEAKER RECOGNITION
Phonetic features eg long-term
formants
bull Less discriminative than MFCCs for
automatic speaker recognition
bull However they capture acoustic
characteristics of the voice
important for perceived similarityhellip
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
LTF illustration from Catalina Manual
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull F0 + ∆
bull F0 (semitones) + ∆
Capture pitch and format ranges along
with temporal information (intonation
patterns)
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
PERCEIVED VOICE SIMILARITY
Speaker traits sex age
Acoustic Characteristics articulation
timbre prosody vocal effort
Voice Quality breathy hoarse creaky
bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of
voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference
Vienna Austria 2011
FEATURES FOR SPEAKER RECOGNITION
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
Mel Frequency Cepstral Coefficients
bull MFCCs are the standard in
automatic speaker recognition
bull They effectively capture short-term
characteristics of the vocal tract
FEATURES FOR SPEAKER RECOGNITION
Phonetic features eg long-term
formants
bull Less discriminative than MFCCs for
automatic speaker recognition
bull However they capture acoustic
characteristics of the voice
important for perceived similarityhellip
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
LTF illustration from Catalina Manual
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull F0 + ∆
bull F0 (semitones) + ∆
Capture pitch and format ranges along
with temporal information (intonation
patterns)
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
FEATURES FOR SPEAKER RECOGNITION
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
Mel Frequency Cepstral Coefficients
bull MFCCs are the standard in
automatic speaker recognition
bull They effectively capture short-term
characteristics of the vocal tract
FEATURES FOR SPEAKER RECOGNITION
Phonetic features eg long-term
formants
bull Less discriminative than MFCCs for
automatic speaker recognition
bull However they capture acoustic
characteristics of the voice
important for perceived similarityhellip
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
LTF illustration from Catalina Manual
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull F0 + ∆
bull F0 (semitones) + ∆
Capture pitch and format ranges along
with temporal information (intonation
patterns)
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
• A. Alexander, O. Forth, A. A. Atreya and F. Kelly, "VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic and user-provided features," Odyssey 2016, Bilbao, Spain.
[Diagram: each target recording is scored against all N candidate recordings, and the candidates are ordered from 1 to N by similarity rank. A ranking sketch follows.]
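A minimal sketch of this ranking step is below. The scoring backend here (cosine similarity over per-recording feature vectors) is a stand-in assumption; it is not VOCALISE's scoring method, only an illustration of turning pairwise scores into a rank.

```python
# Sketch: order candidates by similarity to a target, producing the
# rank 1..N shown in the diagram above. Cosine scoring is an assumption.
import numpy as np

def rank_candidates(target_vec, candidate_vecs):
    """Return candidate indices ordered from most to least similar."""
    t = target_vec / np.linalg.norm(target_vec)
    C = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = C @ t                       # one similarity score per candidate
    order = np.argsort(-scores)          # rank 1 = highest score
    return order, scores[order]
```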
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1. Similar: the two highest-ranked speakers
2. Different: two randomly-ranked speakers (constrained to be same-gender and outside the top ten)
3. Same speaker: one different recording of the target speaker
(see the selection sketch below)
[Diagram: the selected cohorts marked on the ranked candidate list 1…N]
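The selection rules above can be written down directly. In this sketch the ranked list comes from rank_candidates(); the gender labels, the ±10 top-ten exclusion via slicing, and the fixed random seed are illustrative assumptions.

```python
# Sketch of the cohort selection rules listed above. Field names and the
# seed are assumptions added for illustration.
import random

def select_cohorts(order, genders, target_gender, target_alt_recording, seed=0):
    rng = random.Random(seed)
    similar = [order[0], order[1]]                 # two highest-ranked speakers
    pool = [i for i in order[10:]                  # outside the top ten...
            if genders[i] == target_gender]       # ...and same gender as target
    different = rng.sample(pool, 2)                # two random picks
    same_speaker = target_alt_recording            # another recording of the target
    return similar, different, same_speaker
```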
CREATING LISTENER COMPARISONS
Each recording of Target Speaker 1, and each selected Similar, Different and Same Speaker recording, is divided into 7 sec chunks. From these chunks the comparison stimuli are assembled:
• Similar Comparisons 1 & 2 for Target Speaker 1
• Different Comparisons 1 & 2 for Target Speaker 1
• Same Speaker Comparisons 1 & 2 for Target Speaker 1
(a chunking sketch follows)
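A minimal sketch of the 7-second chunking step, assuming soundfile for audio I/O (the I/O library and the decision to drop a trailing partial chunk are assumptions):

```python
# Sketch: split a recording into 7-second chunks for comparison assembly.
import soundfile as sf

def seven_sec_chunks(wav_path, chunk_sec=7.0):
    audio, sr = sf.read(wav_path)
    n = int(chunk_sec * sr)
    # Drop the final partial chunk so every chunk is exactly 7 s long.
    return [audio[i:i + n] for i in range(0, len(audio) - n + 1, n)]
```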
THE LISTENER TEST
6 target speakers (3 male, 3 female), giving 12 Similar comparisons, 12 Different comparisons and 12 Same Speaker comparisons.
Listeners were asked to:
1. Judge the similarity of the two voices on a scale of 1-9
2. Ignore the speakers' accents, any non-speech noises, and the spoken content
The test was administered over the web; there was no supervision of the test or control over the listening environment.
THE LISTENERS
• 43 listeners: 25 female, 18 male
• 20 spoke English as a first language; 23 did not
• Age range, hearing status and playback method were also noted
RESPONSES TO MALE VOICE COMPARISONS
[Histograms: P(Similarity rating) vs. similarity rating for each comparison type]
• Same Speaker comparisons: Median = 8
• Similar comparisons: Median = 5
• Different comparisons: Median = 3
(a summary sketch follows)
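The per-condition summaries above, a normalised rating histogram P(Similarity rating) and a median, can be computed as in this small sketch; the `ratings` mapping and its toy values are illustrative assumptions, not the study data.

```python
# Sketch: summarise pooled 1-9 listener ratings per comparison type.
import numpy as np

def summarise(ratings):
    for kind, r in ratings.items():
        r = np.asarray(r)
        # P(Similarity rating): normalised histogram over the 1-9 scale
        p = np.bincount(r, minlength=10)[1:] / len(r)
        print(f"{kind}: median={np.median(r):.0f}, P(rating)={np.round(p, 2)}")

summarise({"Same Speaker": [8, 9, 8, 7], "Similar": [5, 6, 4, 5],
           "Different": [3, 2, 3, 4]})   # toy ratings, for illustration only
```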
RESPONSES TO FEMALE VOICE COMPARISONS
[Histograms: P(Similarity rating) vs. similarity rating for each comparison type]
• Same Speaker comparisons: Median = 8
• Similar comparisons: Median = 3
• Different comparisons: Median = 4
[Second panel: the Same Speaker distribution shown alongside two cross-accent comparisons, Example 1 and Example 2]
SCORE VS SIMILARITY RATING
Responses to 'Similar' and 'Different' male comparisons
[Scatter plot: automatic score vs. mean similarity rating]
• Auto-Phonetic: Corr (Pearson) = 0.72
PERCEIVED VOICE SIMILARITY & SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and perceived similarity? (a correlation sketch follows)
• A correlation between perceived similarity and speaker recognition scores has been observed with MFCC features
• However, greater correlation has been observed by combining MFCCs and Voice Quality labels
• Phonetic features?
• N. Obin and A. Roebel, "Similarity Search of Acted Voices for Automatic Voice Casting," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1642-1651, Sept. 2016.
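The link is quantified here as a Pearson correlation between automatic scores and mean listener ratings, as reported on the surrounding slides. A minimal sketch, where `scores` and `mean_ratings` are assumed per-comparison arrays:

```python
# Sketch: Pearson correlation between automatic comparison scores and
# the mean perceived-similarity rating for the same comparisons.
from scipy.stats import pearsonr

def score_rating_correlation(scores, mean_ratings):
    r, p_value = pearsonr(scores, mean_ratings)
    return r, p_value

# e.g. r, p = score_rating_correlation(ap_scores, mean_ratings)  # names assumed
```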
SCORE VS SIMILARITY RATING
Responses to 'Similar' and 'Different' male comparisons
[Scatter plots: automatic score vs. mean similarity rating]
• Auto-Phonetic: Corr (Pearson) = 0.72
• MFCC: Corr (Pearson) = 0.72
SCORE VS SIMILARITY RATING
Responses to 'Similar' and 'Different' male comparisons
[Scatter plot: automatic score vs. mean similarity rating]
• Auto-Phonetic + MFCC: Corr (Pearson) = 0.76
(a fusion sketch follows)
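The slides do not state how the Auto-Phonetic and MFCC scores were combined to reach 0.76. One simple possibility, shown purely as an assumption, is to z-normalise each score stream and average:

```python
# Sketch: one plausible fusion of Auto-Phonetic and MFCC scores before
# correlating with ratings. The rule (z-normalise, then average) is an
# assumption; the actual combination method is not specified in the slides.
import numpy as np

def fuse(ap_scores, mfcc_scores):
    z = lambda s: (s - np.mean(s)) / np.std(s)
    return 0.5 * (z(np.asarray(ap_scores, float)) +
                  z(np.asarray(mfcc_scores, float)))
```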
CONCLUSIONS
• Promising results for male comparisons
  • Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
• Variable results across female comparisons
  • Data was a contributing factor: smaller female candidate pool and cross-accent comparisons
• Room to improve:
  • Larger subjective evaluation required
  • Combine Auto-Phonetic and MFCC features
  • Scope to expand the Auto-Phonetic feature set
AUTOMATIC VOICE CASTING: RECENT RESULTS FROM SDI MEDIA
SDI Media is a major provider of dubbing services worldwide. SDI's Italian branch has been using iVOCALISE with Auto-Phonetic and MFCC features for automatic voice casting…
[Audio examples: English → Italian 1; English → Italian 2]
CONCLUSIONS
• The definition of voice similarity is application-dependent: voice parades vs. voice casting
• Allow for an application-dependent search space
  • Use meta-data such as gender, age and accent to constrain the set of candidate voices
• Allow for an application-dependent 'degree of similarity'
  • Given well-calibrated output scores from the automatic system, a score range of interest can be defined (see the sketch below)
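A minimal sketch of this application-dependent search: constrain the candidate pool by meta-data, then keep candidates whose calibrated score falls inside an application-chosen range. The field names, age tolerance and example range are illustrative assumptions.

```python
# Sketch: meta-data constraints plus a calibrated score range of interest,
# as described in the conclusions above. All specifics are assumptions.
def shortlist(candidates, target_meta, score_range=(2.0, 6.0)):
    lo, hi = score_range               # e.g. "similar but not identical" band
    return [c for c in candidates
            if c["gender"] == target_meta["gender"]
            and abs(c["age"] - target_meta["age"]) <= 10
            and c["accent"] == target_meta["accent"]
            and lo <= c["score"] <= hi]
```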
FEATURES FOR SPEAKER RECOGNITION
Phonetic features eg long-term
formants
bull Less discriminative than MFCCs for
automatic speaker recognition
bull However they capture acoustic
characteristics of the voice
important for perceived similarityhellip
bull J H L Hansen and T Hasan Speaker Recognition by Machines and
Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no
6 pp 74-99 Nov 2015
LTF illustration from Catalina Manual
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull F0 + ∆
bull F0 (semitones) + ∆
Capture pitch and format ranges along
with temporal information (intonation
patterns)
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull F0 + ∆
bull F0 (semitones) + ∆
Capture pitch and format ranges along
with temporal information (intonation
patterns)
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
AUTO-PHONETIC FEATURES IVOCALISE
Automatic extraction of phonetic
features with iVOCALISE
bull F1-F4 + ∆
bull F0 + ∆
bull F0 (semitones) + ∆
Capture pitch and format ranges along
with temporal information (intonation
patterns)
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
VOICE CORPUS
bull 175 public figures (actors musicians etc)
bull ~2 recordings each ~30 sec average length
bull Sourced from online archives (primarily YouTube)
bull Male-Female speaker ratio is approximately 21
bull All speech is in English with wide variation in accent
bull Recordings are exclusively from lapel microphones
bull Recording environment is unconstrained
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
SPEAKER RECOGNITION EXPERIMENT
bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A
forensic automatic speaker recognition system supporting spectral
phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
SPEAKER RECOGNITION EXPERIMENT
1
2
N
similarity rank
hellip
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
SELECTING COHORTS FOR SUBJECTIVE COMPARISON
1 Similar two highest-ranked speakers
2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)
3 Same speaker one different recording of the target speaker
1
2
N
hellip
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Similar Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Different Comparisons 1 amp 2
for Target Speaker 1
+ + + + +
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
CREATING LISTENER COMPARISONS
Similar Different Same Speaker
Recording of Target speaker 1 7 sec chunks
Same Speaker Comparisons
1 amp 2 for Target Speaker 1
+ + + + +
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and
perceived similarity
bull A correlation between perceived similarity and speaker
recognition scores has been observed with MFCC features
bull However greater correlation has been observed by combining
MFCCs and Voice Quality labels
bull Phonetic Features
bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic
Voice Casting in IEEEACM Transactions on Audio Speech and Language
Processing vol 24 no 9 pp 1642-1651 Sept 2016
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score Auto-Phonetic
Corr (Pearson) = 072
MFCC
Corr (Pearson) = 072
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic + MFCC
Corr (Pearson) = 076
CONCLUSIONS
bull Promising results for male comparisons
bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
bull Variable results across female comparisons
bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons
bull Room to improve
bull Larger subjective evaluation required
bull Combine Auto-Phonetic and MFCC features
bull Scope to expand Auto-Phonetic feature set
AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA
SDI media are a major provider of dubbing services worldwide
SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip
English Italian 1
English Italian 2
CONCLUSIONS
bull The definition of voice similarity is application-dependent voice parades vs voice casting
bull Allow for an application-dependent search space
bull Use meta-data such as gender age accent to constrain the set of candidate voices
bull Allow for an application-dependent lsquodegree of similarityrsquo
bullGiven well-calibrated output scores from the automatic system can define a score range of interest
THE LISTENER TEST
x 12 x 6
6 target speakers
3 male 3 female
x 12
Similar
comparisons
Different
comparisons
Same Speaker
comparisons
THE LISTENER TEST
1 Judge the similarity of the two voices on scale of 1-9
2 Ignore the speaker accents any non-speech noises or any of the spoken content
The test was administered over the web there was no supervision of the test or control over the listening environment
THE LISTENERS
bull 43 listeners 25 female 18 male
bull 20 spoke English as a first language 23 did not
bull Age range hearing status and playback method were noted
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Similar
comparisons
Different
comparisons
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO MALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 5
Different
comparisons
Median = 3
Similarity rating
P(S
imila
rity
rating
)
RESPONSES TO FEMALE VOICE COMPARISONS
Same Speaker
comparisons
Median = 8
Similar
comparisons
Median = 3
Different
comparisons
Median = 4
Similarity rating
P(S
imila
rity
rating
)
Same Speaker
comparisons
Median = 8
Similarity rating
P(S
imila
rity
rating
)
Cross-accent comparisons
Example 1
Example 2
RESPONSES TO FEMALE VOICE COMPARISONS
SCORE VS SIMILARITY RATING
Responses to similar and different male comparisons
mean similarity rating
score
Auto-Phonetic
Corr (Pearson) = 072
PERCEIVED VOICE SIMILARITY & SPEAKER RECOGNITION
Is there a link between the scores from an automatic system and perceived similarity?
• A correlation between perceived similarity and speaker recognition scores has been observed with MFCC features
• However, greater correlation has been observed by combining MFCCs and Voice Quality labels
• Phonetic features?
• N. Obin and A. Roebel, "Similarity Search of Acted Voices for Automatic Voice Casting," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1642–1651, Sept. 2016
SCORE VS SIMILARITY RATING
[Scatter plots: score against mean similarity rating, for responses to similar and different male comparisons] Auto-Phonetic: Pearson correlation = 0.72; MFCC: Pearson correlation = 0.72.
SCORE VS SIMILARITY RATING
[Scatter plot: score against mean similarity rating, for responses to similar and different male comparisons] Auto-Phonetic + MFCC: Pearson correlation = 0.76.
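A minimal sketch of the computation behind these three slides: correlating system scores with mean listener ratings, and fusing Auto-Phonetic and MFCC scores. This is an assumed workflow, not the authors' code; the score values are hypothetical, and the equal-weight fusion is an illustrative choice since the slides do not specify the fusion rule.

# Pearson correlation between per-pair system scores and mean ratings,
# plus a simple equal-weight fusion of the two feature streams.
import numpy as np

# One entry per voice pair (hypothetical values).
ap_scores = np.array([1.2, -0.5, 0.8, -1.1, 0.3])    # Auto-Phonetic system
mfcc_scores = np.array([0.9, -0.2, 1.1, -0.8, 0.1])  # MFCC system
mean_ratings = np.array([6.1, 3.0, 5.5, 2.2, 4.0])   # mean 1-9 listener rating

def pearson(x, y):
    # Pearson correlation coefficient, as reported on the slides.
    return float(np.corrcoef(x, y)[0, 1])

def znorm(s):
    # Normalise each score stream so the fusion weights act on equal scales.
    return (s - s.mean()) / s.std()

fused = 0.5 * znorm(ap_scores) + 0.5 * znorm(mfcc_scores)

print("Auto-Phonetic r =", pearson(ap_scores, mean_ratings))
print("MFCC          r =", pearson(mfcc_scores, mean_ratings))
print("Fused         r =", pearson(fused, mean_ratings))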
CONCLUSIONS
• Promising results for male comparisons
• Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
• Variable results across female comparisons
• Data was a contributing factor: smaller female candidate pool and cross-accent comparisons
• Room to improve:
• Larger subjective evaluation required
• Combine Auto-Phonetic and MFCC features
• Scope to expand the Auto-Phonetic feature set
AUTOMATIC VOICE CASTING: RECENT RESULTS FROM SDI MEDIA
SDI Media is a major provider of dubbing services worldwide.
SDI's Italian branch has been using iVOCALISE with Auto-Phonetic and MFCC features for automatic voice casting…
[Audio examples: English → Italian 1; English → Italian 2]
CONCLUSIONS
• The definition of voice similarity is application-dependent: voice parades vs. voice casting
• Allow for an application-dependent search space:
• Use meta-data such as gender, age and accent to constrain the set of candidate voices
• Allow for an application-dependent 'degree of similarity':
• Given well-calibrated output scores from the automatic system, a score range of interest can be defined (see the sketch after this list)
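The sketch below illustrates the two application-dependent controls from the list above: constraining the candidate pool by metadata, then keeping only candidates whose calibrated score against the target falls in a chosen range. This is an illustration under stated assumptions, not iVOCALISE's API: the Voice fields, select_candidates(), and score_against_target() are hypothetical names.

# Metadata-constrained search space plus a calibrated score band.
from dataclasses import dataclass

@dataclass
class Voice:
    voice_id: str
    gender: str
    age: int
    accent: str

def select_candidates(target, pool, score_against_target, lo, hi, max_age_gap=10):
    # 1. Application-dependent search space: constrain by metadata.
    constrained = [v for v in pool
                   if v.gender == target.gender
                   and abs(v.age - target.age) <= max_age_gap
                   and v.accent == target.accent]
    # 2. Application-dependent degree of similarity: with well-calibrated
    #    scores, keep candidates inside a score range of interest. A voice
    #    parade wants "similar but not too similar" foils (a band below the
    #    top scores); voice casting would simply take the top scorers.
    return [v for v in constrained if lo <= score_against_target(v) <= hi]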