Individual-basis pronunciation clustering of World Englishes using speech structure analysis
Nobuaki Minematsu
Graduate School of Engineering, The University of Tokyo
Japanese learning and English learning
Learning of a local language
To communicate with native speakers of that language.
Goal of pronunciation training
Pronunciation intelligible enough to native speakers
= Native-sounding pronunciation
Learning of the international language
To communicate among speakers of different native languages.
The total number of English users is 1.5 billion.
Goal of pronunciation training
Pronunciation intelligible enough to UK and US speakers?
= Native-sounding pronunciation?
Many countries adopt English as an official language.
A quarter of the 1.5 billion speakers use English as an official language.
Their accents differ from the UK and US accents.
World Englishes
English as the only common language for everybody
Diversity of localized pronunciations of English [Crystal’95]
US pron. and UK pron. are just two major examples of World Englishes.
What is the unit of diversity of World English pronunciation?
Country? Region? City? Town?
The minimal unit should be the individual!! (#speakers = 1.5 billion)
No English is correct and no English is incorrect.
One will be interested in how one's pronunciation compares with that of others.
English as a foreign language (750 million)
English as a second language or official language (400 million)
English as a native language (350 million)
[Kachru 1992]
It is something like “face”.
Europe, Africa, Asia, Oceania, America, ....
White, yellow, black, ....
No face is correct and no face is incorrect.
The face functions as identifier of that person.
??? Accent = Face ???
Yet another approach to CAPT
Computer Aided Pronunciation Training
Application of ASR (Automatic Speech Recognition) technology
To detect pronunciation errors in a given utterance.
To calculate a pronunciation proficiency score for a speaker.
Comparison bet. teachers’ model and a learner’s utterance
Teachers’ model = native speakers’ model
Which English should be the target model?
US or UK?
Meaningful question for learning English?
Yet another approach to CAPT system development
Location of a learner’s pronunciation in the diversity of World Englishes pronunciations
Comparison of a learner’s pronunciation to all the others.
1 speaker vs. (1.5 billion - 1) speakers
Outline of presentation
Actual condition of English (World Englishes)
Learning of a local lang. and that of the international lang.
A new approach to develop CAPT systems for learning English
A technical challenge and its solution
To which is Minematsu’s pronunciation closer?
Speech (pronunciation) structure analysis
Individual-basis pronunciation clustering
Speech Accent Archive
IPA-based reference pronunciation distance between two speakers
Prediction of the reference distance by using structure analysis
Possible applications
Close and distant speakers finder / World Englishes browser
Conclusions
Clustering English pronunciations
[Figure: N x N inter-speaker distance matrix, rows and columns indexed 1, 2, ..., N]
Number of speaker pairs for N = 500: (500 x 500 - 500) / 2 = 124,750
How to calculate pron. distance automatically?
A big and technical challenge
Pronunciation distance and acoustic distance
To which is Minematsu’s natural pronunciation closer?
Removal of speaker identity related to age, gender, etc.
Pronunciation “skeleton” is needed for comparison.
Only the “accent” aspect of pronunciation is needed for comparison.
Speech structure analysis [Minematsu’06]
Speaker-invariant representation of speech or pronunciation
[Figure: three utterances, A, X, and B, of the sentence below]
“Those answers will be straightforward if you think them through carefully first.”
Invariance in variability
Speaker variability = acoustic space deformation
Speaker-invariance = deformation (transform)-invariance
Are there any features that are invariant against any deformation?
Complete transform-invariance measure: f-divergence [Qiao’10]
Every event has to be represented as a distribution, not as a point.
$$f_{div}(p_1, p_2) = \oint p_2(x)\, g\!\left(\frac{p_1(x)}{p_2(x)}\right) dx, \qquad g : (0, \infty) \to \mathbb{R} \text{ and } g(1) = 0$$
Sufficiency: $f_{div}(p_1, p_2) \equiv f_{div}(P_1, P_2)$
Necessity: if $\oint M(p_1(x), p_2(x))\, dx$ is invariant, then $M = p_2(x)\, g\!\left(\frac{p_1(x)}{p_2(x)}\right)$
Speech contrasts (edges) based on f-div. are invariant.
[Figure: space A with axes (x, y) holds $p_1(x, y)$ and $p_2(x, y)$; space B with axes (u, v) holds $P_1(u, v)$ and $P_2(u, v)$; the f-div between the two events is the same in A and in B]
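The invariance claim can be checked numerically in a very small case. The sketch below is only an illustration under stated assumptions, not the authors' implementation: events are one-dimensional Gaussians, the deformation is an affine map u = a*x + b (one simple invertible transform), and the function names are made up for this example. It uses the Bhattacharyya distance, which is derived from an f-divergence.

```python
import math


def bhattacharyya_1d(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two 1-D Gaussians N(mu, var)."""
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term


def affine(mu, var, a, b):
    """Push a Gaussian through u = a*x + b (invertible when a != 0)."""
    return a * mu + b, a * a * var


# Two events in space A (illustrative values only).
p1 = (0.0, 1.0)
p2 = (2.0, 0.5)

# The same two events after deformation into space B.
P1 = affine(*p1, a=-3.0, b=7.0)
P2 = affine(*p2, a=-3.0, b=7.0)

dist_in_A = bhattacharyya_1d(*p1, *p2)
dist_in_B = bhattacharyya_1d(*P1, *P2)
print(dist_in_A, dist_in_B)
```

Each distribution moves and changes its variance, yet the contrast between the two is unchanged up to floating-point rounding. A point-to-point Euclidean distance between the means would not survive the same transform.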
Several examples of f-div.
Definition of f-divergence
Examples of f-divergence
[Figure: the same two events shown in space A as $p_1(x, y)$, $p_2(x, y)$ and in space B as $P_1(u, v)$, $P_2(u, v)$, each pair linked by f-div]
[Figure: an utterance as a sequence of distributions c1, c2, c3, c4, ..., cD]
Bhattacharyya distance
BD-based distance matrix
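A BD-based distance matrix can be sketched as follows. This is a minimal illustration assuming each distribution c1, ..., cD is a one-dimensional Gaussian given as a (mean, variance) pair; real systems use multi-dimensional distributions, and the names `bhattacharyya_1d`, `structure_matrix`, and `edges` are hypothetical.

```python
import math


def bhattacharyya_1d(g1, g2):
    """Bhattacharyya distance between two 1-D Gaussians (mean, variance)."""
    (mu1, var1), (mu2, var2) = g1, g2
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term


def structure_matrix(dists):
    """Full D x D matrix of pairwise Bhattacharyya distances."""
    n = len(dists)
    return [[bhattacharyya_1d(dists[i], dists[j]) for j in range(n)]
            for i in range(n)]


def edges(matrix):
    """Upper-triangle entries: the D*(D-1)/2 contrasts of the structure."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


# Four illustrative distributions standing in for c1..c4.
c = [(0.0, 1.0), (1.5, 0.5), (-2.0, 2.0), (3.0, 1.0)]
matrix = structure_matrix(c)
contrast_vector = edges(matrix)
print(len(contrast_vector))
```

Only the upper triangle carries information, because the matrix is symmetric with a zero diagonal.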
Invariance against deformation
Topological invariance
Each event changes but the system or organization does not change.
Classical theory in phonology
Theory of relational (topological) invariance
Proposed by R. Jakobson
Distinctive features are used to explain the invariance.
We have to put aside the accidental properties of individual sounds and substitute a general expression that is the common denominator of these variables.
Physiologically identical sounds may possess different values in conformity with the whole sound system, i.e. with their relations to the other sounds.
Fig. 4. Jakobson’s structure of the French vowels and semi-vowels [20]
Fig. 5. Consonant and vowel triangles and related features [21]
Similar discussions can be found in classical phonological literature. R. Jakobson proposed a theory of relational invariance, called distinctive feature theory. In [19], he repeatedly emphasized the importance of relational and systemic invariance among speech sounds. “Physiologically identical sounds may possess different values in conformity with the whole sound system, i.e. with their relations to the other sounds.” “We have to put aside the accidental properties of individual sounds and substitute a general expression that is the common denominator of these variables.” In [20], he drew the invariant system of French vowels and semi-vowels, shown in Fig. 4. The distinctive features were proposed to describe the relation or difference between sounds. Two simple sound systems and their related features are shown in Fig. 5.
What is the simplest definition of a (sound) system? Geometrically speaking, the shape of a three-point structure (a triangle) can be defined by the length of the three edges of the triangle. What about an n-point structure? In this case, the length of all the edges including the diagonal edges can define the shape of that structure, shown in Fig. 6. The distance matrix extracted from an n-point structure is the simplest definition of the shape of that structure. If the distance matrix representing the sounds generated by a speaker and the matrix representing the sounds of the same message generated by another speaker are the same, we can say that those matrices are speaker-invariant or speaker-independent and that infants seem to be sensitive to the matrix properties because, geometrically speaking, the matrix is one of the simplest definitions of distributional properties. How can one measure the length of an edge of an n-point structure in a speaker-invariant way? This question is dealt with in the following section.
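The idea that the set of all edge lengths fixes a shape can be illustrated with ordinary Euclidean geometry. The following is a sketch with made-up coordinates and hypothetical function names; it only covers rotation and translation, which is the easy case that plain Euclidean distance can handle.

```python
import math
from itertools import combinations


def edge_lengths(points):
    """Lengths of all n*(n-1)/2 edges, including the diagonals."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def rotate_translate(points, angle, shift):
    """Rigidly move a 2-D point set: rotate by `angle`, then shift."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1])
            for x, y in points]


# A five-point structure A..E (illustrative coordinates).
structure = [(0.0, 0.0), (2.0, 0.5), (3.0, 2.0), (1.0, 3.5), (-1.0, 2.0)]
moved = rotate_translate(structure, angle=1.1, shift=(5.0, -4.0))

before = edge_lengths(structure)
after = edge_lengths(moved)
print(len(before))
```

Every point has new coordinates, but the ten edge lengths agree. Euclidean lengths do not stay fixed under more general deformations, which is why the edges need a different measure when the deformation is speaker variability.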
Recently in ASR studies, some researchers have paid special attention to the distinctive feature theory [22], [23], [24]. Here, features are often referred to as “attributes” [22], [23]; the researchers tried to define the features acoustically, and detectors of the features or the attributes are used in the front-end of ASR systems. R. Jakobson introduced the features to describe the invariant relational properties qualitatively. In our study, instead of searching for a good acoustic definition of the features, the invariant shape of sounds which underlies an input utterance is extracted and used for speech processing. We understand that the features were introduced by R. Jakobson to describe this invariant shape.
[Figure: a five-point structure A, B, C, D, E; its ten edges { AB, BC, CD, DE, EA, AC, AD, BD, BE, CE } are equivalent to a 5 x 5 distance matrix over A to E with a zero diagonal]
Fig. 6. The distance matrix of a structure defines its shape.
III. INVARIANT SPEECH STRUCTURE
A. Derivation of invariant speech structure
How can the distance between two sounds be calculated in a speaker-invariant way? In this section, a mathematical solution proposed in [6], [7], [8] is explained. Speaker difference is modeled mathematically as space mapping in studies of voice conversion. In [8], we proved that f-divergence between two distributions is invariant under any kind of invertible and differentiable transform (sufficiency) and that any invariant measure with respect to two distributions has to be written in the form of f-divergence (necessity). $f_{div}$ is a distance metric between two distributions, $p_1$ and $p_2$, and it is formulated as
$$f_{div}(p_1, p_2) = \oint p_2(x)\, g\!\left(\frac{p_1(x)}{p_2(x)}\right) dx, \qquad (1)$$
where $g(t)$ is a convex function for $t > 0$ [25]. If we take $t \log(t)$ as $g(t)$, $f_{div}$ becomes KL-divergence. When $\sqrt{t}$ is used for $g(t)$, $-\log(f_{div})$ becomes Bhattacharyya distance. Fig. 8 shows two spaces (shapes) which are deformed into each other through an invertible and differentiable transform. An event is described not as a point but as a distribution. Two events of $p_1$ and $p_2$ in A are transformed into $P_1$ and $P_2$ in B. Generally speaking, the two spaces are closed manifolds and the invariance of f-divergence is always satisfied [8].
$$f_{div}(p_1, p_2) \equiv f_{div}(P_1, P_2). \qquad (2)$$
In [6], [7], [8], we have been using the Bhattacharyya distance (BD) as one of the $f_{div}$ measures. If an input utterance is represented as a BD-based distance matrix by using only the distributions found in that utterance, the matrix is an invariant representation of that utterance. Fig. 7 shows a procedure of representing an input utterance only by BD. The utterance in a feature space, such as cepstrum space, is a sequence of feature vectors and it is converted into a sequence of distributions
A common paragraph to pron. structure
Pron. distance calculation using structure
[Figure: the read paragraph converted into a distance matrix with a zero diagonal]
Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack ..........
Speech Accent Archive
Real WE data with IPA transcription [Weinberger’13]
http://accent.gmu.edu
Everybody’s read speech of the common paragraph
The paragraph is designed to cover AE phonemes.
Everybody can join the data collection project of SAA.
The number of speakers is about 1,800.
IPA narrow transcriptions are provided (#symbols = 153).
Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
Reference distance bet. two speakers
IPA-based reference distances
Automatic alignment bet. two transcriptions using DTW
DTW (Dynamic Time Warping) minimizes the accumulated distortion.
DTW requires a distance matrix of all the 153 IPA symbols used.
20 productions of each symbol
An HMM is built for each symbol (SD-HMM).
Acoustic distance is obtained from each HMM (phone) pair.
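The DTW step can be sketched as below. This is a generic textbook DTW under stated assumptions, not the exact procedure used for the reference distances: the step pattern, the lack of path-length normalization, the function names, and the toy symbol distances are all illustrative. In the slides, the symbol-to-symbol distances come from HMM pairs; here they are made-up numbers.

```python
def dtw_distance(seq_a, seq_b, symbol_distance):
    """Minimum accumulated distortion over all monotonic alignments."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j]: best accumulated distortion aligning the first i symbols
    # of seq_a with the first j symbols of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = symbol_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # seq_b symbol repeated
                                     cost[i][j - 1],      # seq_a symbol repeated
                                     cost[i - 1][j - 1])  # one-to-one step
    return cost[n][m]


# Hypothetical symbol distances (not taken from the Speech Accent Archive).
TOY_DISTANCES = {frozenset(("s", "z")): 0.3}


def toy_symbol_distance(x, y):
    """0 for identical symbols, a table value if listed, otherwise 1."""
    if x == y:
        return 0.0
    return TOY_DISTANCES.get(frozenset((x, y)), 1.0)


speaker_1 = ["p", "l", "i", "z"]
speaker_2 = ["p", "l", "i", "s"]
print(dtw_distance(speaker_1, speaker_2, toy_symbol_distance))
```

With a full 153 x 153 symbol distance table in place of the toy one, the same routine would give one scalar per speaker pair.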
Reference distance bet. two speakers
Selection of speakers from the SAA
No word-level insertion or deletion in the utterances
Only 498 speakers are selected out of about 1,800.
IPA-based clustering of AE and GE speakers
Visualization of the distance matrix by its tree diagram
German speakers are selected (#speakers = 9).
The same number of AE speakers are randomly selected.
GE16 has lived in the USA for some years.
Experimental conditions (1/3)
Sample selection from the SAA corpus
The utterances without word-level insertion/deletion were selected.
The utterances with heavy noise were removed.
381 speakers’ data remained (not 498).
#speaker pairs = 381 x 380 / 2 = 72,390
Sorted based on the IPA-based reference inter-speaker distances.
Divided into 2 speaker-pair groups (even / odd numbered groups)
2-fold cross validation
Unit of building a pronunciation structure
1 paragraph → 9 phrases → 9 structures
Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
[Figure: nine distance matrices, one per phrase, each with a zero diagonal]
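The pair count and the two-group split can be sketched as follows. This assumes that "even / odd numbered groups" means alternating ranks after sorting by reference distance, which the slide does not spell out; `split_pairs_two_fold` and the toy distances are hypothetical.

```python
def count_pairs(n_speakers):
    """Number of unordered speaker pairs."""
    return n_speakers * (n_speakers - 1) // 2


def split_pairs_two_fold(ref_dist):
    """Sort pairs by reference distance, then alternate them into two groups.

    ref_dist maps a (speaker_i, speaker_j) tuple to its reference distance.
    Alternating ranks keeps the distance distribution similar in both groups.
    """
    ordered = sorted(ref_dist, key=ref_dist.get)
    return ordered[0::2], ordered[1::2]


n_pairs = count_pairs(381)

# Toy reference distances for four speakers (made-up values).
toy_ref = {("a", "b"): 0.5, ("a", "c"): 0.1, ("b", "c"): 0.9, ("a", "d"): 0.3}
group_even, group_odd = split_pairs_two_fold(toy_ref)
print(n_pairs, group_even, group_odd)
```

In 2-fold cross validation, a model is trained on one group and tested on the other, and then the roles are swapped.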
Experimental conditions (2/3)
UBM + spk-adaptation for structure extraction
9 phrase HMMs were built using all the data.
24 MFCCs + their Δ, #states/phrase = 3 x #phonemes in that phrase
UBM-HMM (phrase HMM) is adapted to each speaker.
MLLR adaptation (#classes = 32)
Structure extraction from the adapted phrase HMM
BD calculation by a unit of phoneme (3 HMM states)
The total number of phoneme-to-phoneme contrasts (edges): 2,804
[Partially hidden paper excerpt: “Since IPA transcription is based on phones and HMMs are ... even if we could have a perfect ... the generated transcriptions have to be ... phone to phoneme ... we calculate correlation between ... obtained by using perfect phoneme recognition results and DTW.”]
[Figure: distributions c1, c2, c3, c4, ..., cD of each phrase and the nine resulting BD-based distance matrices]
Experimental conditions (3/3)
Difference matrix
Speaker S’s matrices:
Speaker T’s matrices:
Difference matrices:
Support Vector Regression -SVR in LIBSVM
Radial basis kernel function
Total number of variables: 2,804
Support Vector Regression
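The regression input can be sketched as below. The formulas on this slide did not survive extraction, so the element-wise absolute difference used here is an assumption, as are the function names. The sketch builds the feature vector only; the regressor itself (SVR in LIBSVM with a radial basis kernel, per the slide) is not reproduced, and `rbf_kernel` shows just the standard kernel definition exp(-gamma * ||x - y||^2).

```python
import math


def upper_triangle(matrix):
    """Entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def difference_features(mats_s, mats_t):
    """Concatenate |S - T| over the upper triangles of all phrase matrices.

    Assumption: the difference matrix is the element-wise absolute difference.
    """
    feats = []
    for ms, mt in zip(mats_s, mats_t):
        feats.extend(abs(a - b)
                     for a, b in zip(upper_triangle(ms), upper_triangle(mt)))
    return feats


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


# One toy 3 x 3 structure per speaker (real data: 9 phrase structures each).
speaker_s = [[[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]]
speaker_t = [[[0.0, 1.5, 2.0], [1.5, 0.0, 1.0], [2.0, 1.0, 0.0]]]
features = difference_features(speaker_s, speaker_t)
print(features)
```

With the nine phrase structures of the experiment, the concatenated vector would have 2,804 entries, one per phoneme-to-phoneme contrast.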
Comparison to other methods
Naive automation of reference distance preparation
Phoneticians’ transcription of utterances
DTW between a pair of IPA transcriptions to calculate the reference distance.
Phoneme error detection system with phoneme HMMs
SAA’s IPA transcripts are converted to AE phoneme transcripts.
Phoneme acoustic models (#phonemes = 42, cf. 153 in IPA)
Monophone HMMs trained with all the data of SAA.
ASR grammar for phoneme error detection
Word-based network grammar built from the SAA corpus
DTW between a pair of obtained phoneme transcriptions
Use the phoneme distance matrix obtained from the set of HMMs
Automate!!
Fig.2 An example of word-based grammar
More results
Correlation between reference and predicted distances
Method                  Error detection performance   Correlation
ideal error detector    100%                          0.88
actual error detector   73.4%                         0.31
structures + SVR        ---                           0.81
Fig.8 Correlation of the predicted distances and the
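For reference, a correlation figure of this kind can be computed as in the sketch below. It assumes the reported values are Pearson correlation coefficients, which the slides do not state explicitly, and the distances in the example are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up reference and predicted distances for five speaker pairs.
reference = [0.10, 0.35, 0.50, 0.80, 0.95]
predicted = [0.20, 0.30, 0.55, 0.70, 0.90]
print(round(pearson(reference, predicted), 3))
```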
Close & distant speakers finder
Locating your pronunciation in the diversity of WE
Collecting speech samples from high school students
School A (Japan), School B (Korea), School C (China), School D (Poland)...
Easy conversation partners and challenging conversation partners
World Englishes browser
Linking all the English material on the net to the MAP
Asking all the speakers of the material to read the Stella paragraph.