Individual-basis pronunciation clustering of World ...mine/OJAD/lecture/POZNAN/... · Yet another...

Individual-basis pronunciation clusteringof World Englishes usingspeech structure analysis

Nobuaki MinematsuGraduate School of Engineering,

The University of Tokyo

Japanese learning and English learning

Learning of a local languageTo communicate to native speakers of that language.

Goal of pronunciation trainingIntelligible enough pronunciation to native speakers

= Native-sounding pronunciation

Learning of the international languageTo communicate among speakers of different native languages.

The total number of English users is 1.5 billions.

Goal of pronunciation trainingIntelligible enough pronunciation to UK and US speakers?

= Native-sounding pronunciation?

Many countries adopt English as official language.A quarter of the 1.5 billions speakers use English as official language.

They speak differently accented English from UK or US accent.

World Englishes

English as only the common language for everybodyDiversity of localized pronunciations of English [Crystal’95]

US pron. and UK pron. are just two major examples of World Englishes.

What is the unit of diversity of World English pronunciation?Country? region? city? town?

The minimal unit should be individual!! (#speakers = 1.5 billions)

No English is correct and no English is incorrect.One will be interested in how his pronunciation is compared to the others.

English as foreign language (750 millions)

English as second language or official language (400 M)

English as native language (350 M)

[Kachru 1992]

It is something like “face”.Europe, Africa, Asia, Oceania, America, ....

White, yellow, black, ....

No face is correct and no face is incorrect.

The face functions as identifier of that person.

??? Accent = Face ???

Yet another approach to CAPT

Computer Aided Pronunciation TrainingApplication of ASR (Automatic Speech Recognition) technology

To detect pronunciation errors in a given utterance.

To calculate pronunciation proficiency score for a speaker.

Comparison bet. teachers’ model and a learner’s utteranceTeachers’ model = native speakers’ model

Which English should be the target model?US or UK?

Meaningful question for learning English?

Yet another approach to CAPT system developmentLocation of a learner’s pronunciation in the diversity of World Englishes pronunciations

Comparison of a learner’s pronunciation to all the others.1 speaker 1.5 billions - 1 speakers

How is one’s pron. compared to others’?

Outline of presentation

Actual condition of English (World Englishes)Learning of a local lang. and that of the international lang.

A new approach to develop CAPT systems for learning English

A technical challenge and it’s solutionTo which is Minematsu’s pronunciation closer?

Speech (pronunciation) structure analysis

Individual-basis pronunciation clusteringSpeech Accent Archive

IPA-based reference pronunciation distance between two speakers

Prediction of the reference distance by using structure analysis

Possible applicationsClose and distant speakers finder / World Englishes browser

Conclusions

What is needed for clustering?

Dendrogram

Multi-dimensional Scaling

Dendrogram

Multi-dimensional Scaling

Clustering English pronunciations

1 2 ... N

| | = (500x500 - 500) / 2 = 124,750

1 2 ... N

| | = (500x500 - 500) / 2 = 124,750

1 2 ... N

| | = (500x500 - 500) / 2 = 124,750

How to calculate pron. distance automatically?

A big and technical challenge

Pronunciation distance and acoustic distanceTo which Minematsu’s natural pronunciation closer?

Removal of speaker identity related to age, gender, etc.Pronunciation “skeleton” is needed for comparison.

Only the “accent” aspect of pronunciation is needed for comp.

Speech structure analysis [Minematsu’06]Speaker-invariant representation of speech or pronunciation

“Those answers will be straightforward if you think them through carefully first.”

Speaker variability = acoustic space deformationSpeaker-invariance = deformation (transform)-invariance

Are there any features that are invariant against any deformation?

Complete transform-invariance measure : f-divergence [Qiao’10]Every event has to be represented as distribution, not as point.

sufficiency

If is invariant, necessity

Speech contrasts (edges) based on f-div. are invariant.

(x, y)

(u, v)

Invariance in variability

p1(x, y)

p2(x, y)

P1(u, v)

P2(u, v)

fdivfdiv

fdiv(p1, p2) � fdiv(P1, P2)�

M(p1(x), p2(x))dx

fdiv(p1, p2) =⇤

p2(x)g�

p1(x)p2(x)

⇥dx g : (0,⇥)� R and g(1) = 0

M = p2(x)g�

p1(x)p2(x)

⇥ ��

sufficiency

(x, y)

(u, v)

p1(x, y)

p2(x, y)

P1(u, v)

P2(u, v)

fdivfdiv

M(p1(x), p2(x))dx

fdiv(p1, p2) =⇤

p2(x)g�

p1(x)p2(x)

⇥dx g : (0,⇥)� R and g(1) = 0

M = p2(x)g�

p1(x)p2(x)

⇥ ��

sufficiency

(x, y)

(u, v)

p1(x, y)

p2(x, y)

P1(u, v)

P2(u, v)

fdivfdiv

M(p1(x), p2(x))dx

fdiv(p1, p2) =⇤

p2(x)g�

p1(x)p2(x)

⇥dx g : (0,⇥)� R and g(1) = 0

M = p2(x)g�

p1(x)p2(x)

⇥ ��

sufficiency

(x, y)

(u, v)

p1(x, y)

p2(x, y)

P1(u, v)

P2(u, v)

fdivfdiv

M(p1(x), p2(x))dx

fdiv(p1, p2) =⇤

p2(x)g�

p1(x)p2(x)

⇥dx g : (0,⇥)� R and g(1) = 0

M = p2(x)g�

p1(x)p2(x)

⇥ ��

Definition of f-divergence

Examples of f-divergence

Several examples of f-div.

(x, y)

(u, v)

p1(x, y)

p2(x, y)

P1(u, v)

P2(u, v)

fdivfdiv

c3c2c4cD

Bhattacharyya distance

BD-based distance matrix

c3c2c4cD

Invariance against deformation

Topological invarianceEach event changes but the system or organization dose not change.

Classical theory in phonology

Theory of relational (topological) invarianceProposed by R. Jakobson

Distinctive features are used to explain the invariance.

We have to put aside the accidental properties of individual sounds and substitute a general expression that is the common denominator of these variables.

Physiologically identical sounds may possess different values in conformity with the whole sound system, i.e. with their relations to the other sounds.

Fig. 4. Jakobson’s structure of the French vowels and semi-vowels [20]

Fig. 5. Consonant and vowel triangles and related features [21]

Similar discussions can be found in classical phonologicalliterature. R. Jakobson proposed a theory of relational invari-ance, called distinctive feature theory. In [19], he repeatedlyemphasized the importance of relational and systemic invari-ance among speech sounds. “Physiologically identical soundsmay possess different values in conformity with the wholesound system, i.e. with their relations to the other sounds.”“We have to put aside the accidental properties of individualsounds and substitute a general expression that is the commondenominator of these variables.” In [20], he drew the invariantsystem of French vowels and semi-vowels, shown in Fig. 4.The distinctive features were proposed to describe the relationor difference between sounds. Two simple sound systems andtheir related features are shown in Fig. 5.

What is the simplest definition of a (sound) system? Ge-ometrically speaking, the shape of a three-point structure (atriangle) can be defined by the length of the three edges ofthe triangle. What about an n-point structure? In this case, thelength of all the edges including the diagonal edges can definethe shape of that structure, shown in Fig. 6. The distancematrix extracted from an n-point structure is the simplestdefinition of the shape of that structure. If the distance matrixrepresenting the sounds generated by a speaker and the matrixrepresenting the sounds of the same message generated byanother speaker is the same, we can say that those matrices arespeaker-invariant or speaker-independent and that infants seemto be sensitive to the matrix properties because, geometricallyspeaking, the matrix is one of the simplest definitions ofdistributional properties. How can one measure the length ofan edge of an n-point structure in a speaker-invariant way?This question is dealt with in the following section.

Recently in ASR studies, some researchers pay specialattention to the distinctive feature theory [22], [23], [24]. Here,

{ AB, BC, CD, DE, EA, AC, AD,

BD, BE, CE }= =

A B C D E0

Fig. 6. The distance matrix of a structure defines its shape.

features are often referred to as “attributes” [22], [23] andthe researchers tried to define the features acoustically anddetectors of the features or the attributes are used in the front-end of ASR systems. R. Jakobson introduced the features todescribe the invariant relational properties qualitatively. In ourstudy, instead of searching for a good acoustic definition ofthe features, the invariant shape of sounds which underlies aninput utterance is extracted and used for speech processing. Weunderstand that the features were introduced by R. Jakobsonto describe this invariant shape.

III. INVARIANT SPEECH STRUCTURE

A. Derivation of invariant speech structure

How to calculate distance between two sounds in a speaker-invariant way? In this section, a mathematical solution pro-posed in [6], [7], [8] is explained. Speaker difference ismodeled mathematically as space mapping in studies of voiceconversion. In [8], we proved that f -divergence betweentwo distributions is invariant with any kind of invertible anddifferentiable transforms (sufficiency) and that any invariantmeasure with respect to two distributions has to be written inthe form of f -divergence (necessity). fdiv is a distance metricbetween two distributions, p1 and p2, and it is formulated as

fdiv(p1, p2) =!

p2(x)g"

p1(x)p2(x)

#dx, (1)

where g(t) is a convex function for t > 0 [25]. If we taket log(t) as g(t), fdiv becomes KL-divergence. When

used for g(t), " log(fdiv) becomes Bhattacharyya distance.Fig. 8 shows two spaces (shapes) which are deformed intoeach other through an invertible and differentiable transform.An event is described not as point but as distribution. Twoevents of p1 and p2 in A are transformed into P1 and P2 inB. Generally speaking, the two spaces are closed manifoldsand the invariance of f -divergence is always satisfied [8].

fdiv(p1, p2) # fdiv(P1, P2). (2)

In [6], [7], [8], we have been using the Bhattacharyyadistance (BD) as one of the fdiv measures. If an input utteranceis represented as a BD-based distance matrix by using only thedistributions found in that utterance, the matrix is an invariantrepresentation of that utterance. Fig. 7 shows a procedure ofrepresenting an input utterance only by BD. The utterance in afeature space, such as cepstrum space, is a sequence of featurevectors and it is converted into a sequence of distributions

Fig. 4. Jakobson’s structure of the French vowels and semi-vowels [20]

Fig. 5. Consonant and vowel triangles and related features [21]

Similar discussions can be found in classical phonologicalliterature. R. Jakobson proposed a theory of relational invari-ance, called distinctive feature theory. In [19], he repeatedlyemphasized the importance of relational and systemic invari-ance among speech sounds. “Physiologically identical soundsmay possess different values in conformity with the wholesound system, i.e. with their relations to the other sounds.”“We have to put aside the accidental properties of individualsounds and substitute a general expression that is the commondenominator of these variables.” In [20], he drew the invariantsystem of French vowels and semi-vowels, shown in Fig. 4.The distinctive features were proposed to describe the relationor difference between sounds. Two simple sound systems andtheir related features are shown in Fig. 5.

What is the simplest definition of a (sound) system? Ge-ometrically speaking, the shape of a three-point structure (atriangle) can be defined by the length of the three edges ofthe triangle. What about an n-point structure? In this case, thelength of all the edges including the diagonal edges can definethe shape of that structure, shown in Fig. 6. The distancematrix extracted from an n-point structure is the simplestdefinition of the shape of that structure. If the distance matrixrepresenting the sounds generated by a speaker and the matrixrepresenting the sounds of the same message generated byanother speaker is the same, we can say that those matrices arespeaker-invariant or speaker-independent and that infants seemto be sensitive to the matrix properties because, geometricallyspeaking, the matrix is one of the simplest definitions ofdistributional properties. How can one measure the length ofan edge of an n-point structure in a speaker-invariant way?This question is dealt with in the following section.

Recently in ASR studies, some researchers pay specialattention to the distinctive feature theory [22], [23], [24]. Here,

{ AB, BC, CD, DE, EA, AC, AD,

BD, BE, CE }= =

A B C D E0

Fig. 6. The distance matrix of a structure defines its shape.

features are often referred to as “attributes” [22], [23] andthe researchers tried to define the features acoustically anddetectors of the features or the attributes are used in the front-end of ASR systems. R. Jakobson introduced the features todescribe the invariant relational properties qualitatively. In ourstudy, instead of searching for a good acoustic definition ofthe features, the invariant shape of sounds which underlies aninput utterance is extracted and used for speech processing. Weunderstand that the features were introduced by R. Jakobsonto describe this invariant shape.

III. INVARIANT SPEECH STRUCTURE

A. Derivation of invariant speech structure

How to calculate distance between two sounds in a speaker-invariant way? In this section, a mathematical solution pro-posed in [6], [7], [8] is explained. Speaker difference ismodeled mathematically as space mapping in studies of voiceconversion. In [8], we proved that f -divergence betweentwo distributions is invariant with any kind of invertible anddifferentiable transforms (sufficiency) and that any invariantmeasure with respect to two distributions has to be written inthe form of f -divergence (necessity). fdiv is a distance metricbetween two distributions, p1 and p2, and it is formulated as

fdiv(p1, p2) =!

p2(x)g"

p1(x)p2(x)

#dx, (1)

where g(t) is a convex function for t > 0 [25]. If we taket log(t) as g(t), fdiv becomes KL-divergence. When

used for g(t), " log(fdiv) becomes Bhattacharyya distance.Fig. 8 shows two spaces (shapes) which are deformed intoeach other through an invertible and differentiable transform.An event is described not as point but as distribution. Twoevents of p1 and p2 in A are transformed into P1 and P2 inB. Generally speaking, the two spaces are closed manifoldsand the invariance of f -divergence is always satisfied [8].

fdiv(p1, p2) # fdiv(P1, P2). (2)

In [6], [7], [8], we have been using the Bhattacharyyadistance (BD) as one of the fdiv measures. If an input utteranceis represented as a BD-based distance matrix by using only thedistributions found in that utterance, the matrix is an invariantrepresentation of that utterance. Fig. 7 shows a procedure ofrepresenting an input utterance only by BD. The utterance in afeature space, such as cepstrum space, is a sequence of featurevectors and it is converted into a sequence of distributions

c3c2c4cD

A common paragraph to pron. structure

Pron. distance calculation using structure

Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack ..........

1 2 ... N

| | = (500x500 - 500) / 2 = 124,750

How to calculate pron. distance automatically?

Conclusions

Real WE data with IPA transcription [Weinberger’13]http://accent.gmu.edu

Speech Accent Archive

Everybody’s read speech of the common paragraphThe paragraph is designed on AE phoneme coverage.

Everybody can join the data collection project of SAA.#speakers is about 1.8 thousands.

IPA narrow transcriptions are provided (#symbols = 153).

Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

Supervised learning of a distance predictor

1 2 ... NN speakers

N speakers

Reference distance bet. two speakers

IPA-based reference distances Automatic alignment bet. two transcriptions using DTW

DTW(=Dynamic Time Warping) minimized the accumulated distortion.

DTW requires a distance matrix of all the 153 IPA symbols used.20 productions of each symbol

HMM is built for each symbol (SD-HMM)

Acoustic distance is obtained fromeach HMM (phone) pair.

IPA-based reference distances Automatic alignment bet. two transcriptions using DTW

DTW(=Dynamic Time Warping) minimized the accumulated distortion.

DTW requires a distance matrix of all the 153 IPA symbols used.20 productions of each symbol

HMM is built for each symbol (SD-HMM)

Acoustic distance is obtained fromeach HMM (phone) pair.

Selection of speakers from the SAANo word-level insertion or deletion in the utterances

Only 498 speakers are selected out of about 1,800.

IPA-based clustering of AE and GE speakers

Visualization of the distance matrix by its tree diagramGerman speakers are picked up (#speakers = 9)

The same number of AE speakers are randomly selected.

GE16 has lived in USA for some years.

Experimental conditions (1/3)

Sample selection from the SAA corpusThe utterances without word-level insertion/deletion were selected.

The utterances with heavy noises were removed.381 speakers’ data remained (not 498).

#speaker pairs = 381 x 380 / 2 = 72,390Sorted based on the IPA-based reference inter-speaker distances.

Divided into 2 speaker-pair groups (even / odd numbered groups)

2-fold cross validation

Unit of building a pronunciation structure1 paragraph 9 phrases 9 structures

UBM + spk-adaptation for structure extraction9 phrase HMMs were built using all the data.

24MFCC + their , #states/phrase = 3 x #phonemes in that phrase

UBM-HMM (phrase HMM) is adapted to each speaker.MLLR adaptation (#classes=32)

Structure extraction from the adapted phrase HMMBD calculation by a unit of phoneme (3 HMM states)

The total number of phoneme-to-phoneme contrasts (edges): 2,804

Since IPA transcription is based on phones and HMMs are even if we could have a perfect

the generated transcriptions have to be hone to phoneme

we calculate correlation between b-

tained by using perfect phoneme recognition results and DTW.

c3c2c4cD

Difference matrix Speaker S’s matrices:

Speaker T’s matrices:

Difference matrices:

Support Vector Regression -SVR in LIBSVM

Radial basis kernel function

Total number of variables: 2,804

Support Vector Regression

Difference matrix Speaker S’s matrices:

Speaker T’s matrices:

Difference matrices:

Support Vector Regression -SVR in LIBSVM

Radial basis kernel function

Total number of variables: 2,804

Support Vector Regression

Results

Correlation between and

Fig.8 Correlation of the predicted distances and the

Comparison to other methods

Naive automation of reference distance preparationPhoneticians’ transcription of utterances

DTW between a pair of IPA transcriptions to calculate .

Phoneme error detection system with phoneme HMMsSAA’s IPA transcripts are converted to AE phoneme transcripts.

Phoneme acoustic models (#phonemes = 42, cf. 153 in IPA)Monophone HMMs trained with all the data of SAA.

ASR grammar for phoneme error detectionWord-based network grammar built from the SAA corpus

DTW between a pair of obtained phoneme transcriptionsUse the phoneme distance matrix obtained from the set of HMMs

Automate!!

Fig.2 An example of word-based grammar

More results

Correlation between anderror detection performance correlation

ideal error detector 100% 0.88

actual error detector 73.4% 0.31

structures + SVR --- 0.81

Fig.8 Correlation of the predicted distances and the

Conclusions

Close & distant speakers finder

Locating your pronunciation in the diversity of WECollecting speech samples from high school students

School A (Japan), School B (Korea), School C (China), School D (Poland)...

Easy conversation partners and challenging conversation partners

World Englishes browser

Linking all the English material on the net to the MAPAsking all the speakers of the material to read the Stella paragraph.

Conclusions

Thank you. Any questions?

Individual-basis pronunciation clustering of World ...mine/OJAD/lecture/POZNAN/... · Yet another...

Documents

Transcript of Individual-basis pronunciation clustering of World ...mine/OJAD/lecture/POZNAN/... · Yet another...