Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common...

39
Automatic Clustering of World English Pronunciations on an Individual Basis Nobuaki Minematsu Graduate School of Engineering, University of Tokyo

Transcript of Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common...

Page 1: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Automatic Clustering of World English Pronunciations on an Individual Basis

Nobuaki MinematsuGraduate School of Engineering,

University of Tokyo

Page 2: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Outline of the presentation

A research question and the goal of this researchWhat is the minimal unit of pronunciation diversity of WE?

Individual-basis and automatic clustering of WE pronunciations

Two problems and their solutionSpeech corpus of a fixed paragraph read by international speakers

Clustering technique which is insensitive to non-linguistic variation

Individual-basis and automatic pronunciation clusteringSpeech Accent Archive (SAA)

IPA-based reference pronunciation distance between two speakers

Pronunciation Structure Analysis (PSA)Prediction (regression) of the reference distance by using PSA

Experiments and discussion for future direction

Conclusions

Page 3: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

What is the minimal unit and how many units?

Expanding circle

Outer circle

Inner circle

eg. USA, UK

eg. India, Nigeria

eg. China, Russia, Brazil

Diversity of pronunciation in WE

[Kachru 1992]

Country?

Region / State / Prefecture?

City / Town / Village?

Page 4: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

What is the minimal unit and how many units?

Diversity of pronunciation in WE

[Kachru 1992]

Country?

Region / State / Prefecture?

City / Town / Village?

Individual!

Page 5: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

What is the minimal unit and how many units?

Diversity of pronunciation in WE

[Kachru 1992]

Country?

Region / State / Prefecture?

City / Town / Village?

Individual!

1.5 billions!

Page 6: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

What is needed for clustering?

Dendrogram

Multi-dimensional Scaling

Page 7: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Clustering pronunciations of N speakers

1

2

3

:

N

1 2 ... N

| | = (500x500 - 500) / 2 = 124,750

How to calculate pron. distance automatically?

N speakers

Page 8: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Related works

Pron. distance calculation using phonetic(mic) transcriptsMiller & Nicely 1995

Bailey & Hahn 2005

Wieling, Margaretha, & Nerbonne 2012Use of a confusion matrix of language sounds

Pronunciation distance calculation using no transcriptsThey are used in training a calculation machine.

so that the trained machine can cluster new speakers without them.

Page 9: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

A challenging problem of data collection

Speech Accent Archive (SAA)A common paragraph read by about 1.8K international speakers

The paragraph is designed to achieve high phonemic coverage of AE.

Speech samples and their narrow IPA transcripts are provided.

http://accent.gmu.edu

Page 10: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

A challenging problem of data collection

Speech Accent Archive (SAA)A common paragraph read by about 1.8K international speakers

The paragraph is designed to achieve high phonemic coverage of AE.

Speech samples and their narrow IPA transcripts are provided.

Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

Page 11: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

A big and technical challenge

Pronunciation distance and acoustic distanceTo which is Minematsu’s natural pronunciation closer?

Removal of non-linguistic factors of age, gender, etc.Pronunciation “skeleton” has to be extracted for comparison.

Pronunciation Structure Analysis (PSA) [Minematsu’06]Speaker-invariant representation of pronunciation

“Those answers will be straightforward if you think them through carefully first.”

A X B

Page 12: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

A big and technical challenge

Pronunciation distance and acoustic distanceTo which is Minematsu’s natural pronunciation closer?

Removal of non-linguistic factors of age, gender, etc.Pronunciation “skeleton” has to be extracted for comparison.

Pronunciation Structure Analysis (PSA) [Minematsu’06]Speaker-invariant representation of pronunciation

“Those answers will be straightforward if you think them through carefully first.”

A X B

Page 13: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

A big and technical challenge

Pronunciation distance and acoustic distanceTo which is Minematsu’s natural pronunciation closer?

Removal of non-linguistic factors of age, gender, etc.Pronunciation “skeleton” has to be extracted for comparison.

Pronunciation Structure Analysis (PSA) [Minematsu’06]Speaker-invariant representation of pronunciation

“Those answers will be straightforward if you think them through carefully first.”

A X B

Page 14: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Outline of the presentation

A research question and the goal of this researchWhat is the minimal unit of pronunciation diversity of WE?

Individual-basis and automatic clustering of WE pronunciations

Two problems and their solutionSpeech corpus of a fixed paragraph read by international speakers

Clustering technique which is insensitive to non-linguistic variation

Individual-basis and automatic pronunciation clusteringSpeech Accent Archive (SAA)

IPA-based reference pronunciation distance between two speakers

Pronunciation Structure Analysis (PSA)Prediction (regression) of the reference distance by using PSA

Experiments and discussion for future direction

Conclusions

Page 15: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Pron. clustering only based on SAA

1

2

3

:

N

1 2 ... NN speakers

Page 16: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Pron. clustering only based on SAA

1

2

3

:

N

1 2 ... NN speakers

Pron. StructureAnalysis

Page 17: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Pron. clustering only based on SAA

N speakers

Pron. StructureAnalysis

Page 18: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

IPA-based reference pron. distance

Optimal alignment bet. two transcriptsDynamic Time Warping (DTW)

DTW can minimize the accumulated distortion.

Similar to edit-distance approach for alignment [Wieling et. al 2012]

DTW requires a distance matrix of all the 153 IPA symbols used.20 productions for each by a phonetician

HMM is built for each symbol (SD-HMM)

HMM = Hidden Markov Model

Acoustic distance is obtained fromeach HMM (phone) pair.

- = ?

Page 19: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

IPA-based reference pron. distance

Optimal alignment bet. two transcriptsDynamic Time Warping (DTW)

DTW can minimize the accumulated distortion.

Similar to edit-distance approach for alignment [Wieling et. al 2012]

DTW requires a distance matrix of all the 153 IPA symbols used.20 productions for each by a phonetician

HMM is built for each symbol (SD-HMM)

HMM = Hidden Markov Model

Acoustic distance is obtained fromeach HMM (phone) pair.

- = ?

Page 20: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

IPA-based reference pron. distance

Selection of speakers from the SAANo word-level insertion or deletion in the utterances

Only 498 speakers are selected out of about 1,800.

- = ?

Page 21: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

A big and technical challenge

Pronunciation distance and acoustic distanceTo which is Minematsu’s natural pronunciation closer?

Removal of non-linguistic factors of age, gender, etc.Pronunciation “skeleton” has to be extracted for comparison.

Pronunciation Structure Analysis (PSA) [Minematsu’06]Speaker-invariant representation of pronunciation

“Those answers will be straightforward if you think them through carefully first.”

A X B

Page 22: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Insensitivity and sensitivityInsensitivity to age and gender differences (A)

Sensitivity to accent differences (B)

One solution: structural featuresFeature instances are sensitive to (A) butfeature relations are insensitive to (A)

Feature relations are sensitive to (B)

Relations are represented as distance matrix.

What are good accent features?10-year-old

children

femaleadults

male adults

1st Formant frequency [kHz]

2nd

Form

ant f

requ

ency

[kH

z]

AQEI Williamsport, PA Chicago, IL Ann Arbor, MI Rochester, NY

formant frequenciesof adults and children

Distribution of normalized formants among AE dialects [Labov et al.2005]

(A)

(B)

Page 23: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

1 utterances is represented as 1 distance matrix.Waveform to feature instances

Feature instances to feature relations (structure)

“Please call Stella. Ask her ....” 221 x 221 distance matrix

Can remove age and gender differences effectively [Minematsu’06].

Pronunciation Structure Analysis (PSA)

c1

c3c2c4cD

waveform spectrogram (=vector sequence)

trajectory in avector space

c1

c3c2c4cD

vector sequence

c1

c3c2c4cD

distribution seq. distance calculation

Page 24: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

A common paragraph to pron. structure

Pron. distance calculation using structure

Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack ..........

c1

c3c2c4cD

221

221

IPA-based distance

1.5 billions

1.5

billi

ons

Page 25: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Outline of the presentation

A research question and the goal of this researchWhat is the minimal unit of pronunciation diversity of WE?

Individual-basis and automatic clustering of WE pronunciations

Two problems and their solutionSpeech corpus of a fixed paragraph read by international speakers

Clustering technique which is insensitive to non-linguistic variation

Individual-basis and automatic pronunciation clusteringSpeech Accent Archive (SAA)

IPA-based reference pronunciation distance between two speakers

Pronunciation Structure Analysis (PSA)Prediction (regression) of the reference distance by using PSA

Experiments and discussion for future direction

Conclusions

Page 26: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Further selection of speakers from the SAA corpusOnly utterances without word-level insertion/deletion were used.

Utterances with heavy background noise were also removed.

About 1.8K 498 381#speaker pairs = 381 x 380 / 2 = 72,390

Differential features between two speakers

Prediction mechanismSupport Vector Regression is adopted to predict IPA distances.

Experimental conditions

SVR ?:

#elements = 221x210/2 = 24K

Page 27: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Prediction of IPA-based pronunciation distances

Result of prediction

0 0.5 1.0 1.5 2.0 2.5 3.0IPA-based reference distances

0.5

1.0

1.5

2.0

2.5

3.0Pr

edic

ted

dist

ance

s

Corr. = 0.94

Page 28: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Experiments for comparison

Automation of reference distance calculation of Phoneticians’ manual transcription

DTW between any two IPA transcripts

A baseline system by using AE phoneme recognizersAutomatic phonemic transcription by using AE phoneme HMMs.

Conversion of SAA’s phonetic transcripts into phonemic versions.

WSJ-HMMs are used as initial models and re-trained by using SAA (381)

Automatic phoneme recognition with sentence-specific network grammar

DTW bet. any two phonemic transcriptsWSJ-HMM-based phoneme to phoneme DM

Correlation bet. &A 100 % but imaginary phoneme recognizer and a real recognizer

imaginary recognizer 100% corr.=0.91real recognizer 72.5% corr.=0.30

Automate this process!

Info. loss exists.#phones=153, #phonemes=40

Page 29: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Three kinds of correlation

Phones with diacritic marksMinimum units of speech perceived by expert phoneticians

AE phonemesMinimal units of speech perceived by ordinary AE speakers

Perfect phoneme recognizers = simulators of AE speakers

Our proposed systemcan simulate phoneticians’ perception better than AE speakers.

0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6

0 0.5 1.0 1.5 2.0 2.5 3.0

Pred

icte

d di

stan

ces

IPA-based reference distances

A) Perfect recognizer

Corr. = 0.91 0 0.5 1.0 1.5 2.0 2.5 3.0

IPA-based reference distances

0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6

Pred

icte

d di

stan

ces

B) Real recognizer

Corr. = 0.30 0 0.5 1.0 1.5 2.0 2.5 3.0

IPA-based reference distances

0.5

1.0

1.5

2.0

2.5

3.0

Pred

icte

d di

stan

ces

D) Proposed method(K=220)

Corr. = 0.94

Page 30: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Good enough or not?

If not,Re-design or modification of features

Multiple Stream Structure [Asakawa2006]

Re-design or modification of prediction (regression)Multi-kernel methods

If it is,We may start a more intensive and extensivecollection of data in the near future.

No IPA transcription needed.

Beta-version of an iOS app for easier collection

Support from teachers and/or researchersof WE is important.

1.5 billions might be an achievable goal!?

Page 31: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Yaguchi and Oka 2013

Spherical visualization of a distance matrixYaguchi and Oka

Fig. 9. Results for spherical ASKS with 28,580 images.

Fig. 10. Interface design for the searching system.

6 Journal of Advanced Computational Intelligence Vol.0 No.0, 200xand Intelligent Informatics

Page 32: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Spherical visualization of a distance matrix

Yaguchi and Oka 2013

Yaguchi and Oka

Fig. 9. Results for spherical ASKS with 28,580 images.

Fig. 10. Interface design for the searching system.

6 Journal of Advanced Computational Intelligence Vol.0 No.0, 200xand Intelligent Informatics

Page 33: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Close & distant speakers finder

Locating your pronunciation in the diversity of WECollecting speech samples from high school students

School A (Japan), School B (Korea), School C (China), School D (Thai)...

Easy conversation partners and challenging conversation partners

Page 34: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

History of one’s own pronunciation

Visualization of pronunciation history of a learnerTracing that learner on the map for many years.

1st year 2nd year 3rd year 10 years ago present

Can motivate learners for continuous learning?

Page 35: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

World Englishes browser

Linking all the English material on the net to the MAPAsking all the speakers of the material to read the Stella paragraph.

Page 36: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Communication facilitator

Connecting two speakers through an IT deviceConversion of others’ accents to a listener’s own accent

accent info.of the owneraccent info.

of the owner

Conversion not to natives’ accent but to a listener’s own accentOur system will be able to facilitate one’s hearing through accent morphing as glasses can facilitate one’s seeing through visual morphing.

From educational support to communicative support

Page 37: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

The ultimate goal of this studyAutomatic and individual-basis clustering of WE pronunciations

Two big problems and their solutionA read speech corpus of a fixed paragraph by many speakers

Clustering technique that is insensitive to age and gender difference

ExperimentSAA + PSA + SVR

Corr. = 0.94

Better than ordinary humans.

Possible applicationsEducational support

Communicative support

Conclusions

0 0.5 1.0 1.5 2.0 2.5 3.0IPA-based reference distances

0.5

1.0

1.5

2.0

2.5

3.0

Pred

icte

d di

stan

ces

Corr. = 0.94

Page 38: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Thank you. Any questions?

?Special Thanks to: Hang-Ping Shen (National Cheng Kung University, Taiwan) Shun Kasahara (University of Tokyo, Japan) Takehiko Makino (Chuo University, Japan) Steven H. Weinberger (George Mason University, USA) Teeraphon Pongkittiphan (University of Tokyo, Japan) Chung-Hsien Wu (National Cheng Kung University, Taiwan)

Page 39: Automatic Clustering of World English Pronunciations on an ...mine/VOICE/WEC.pdf · A common paragraph to pron. structure Pron. distance calculation using structure Please call Stella.

Any interest in technical details?

Two technical papersASRU2013 (to appear) and ICASSP2014 (submitted)

AUTOMATIC PRONUNCIATION CLUSTERING USING A WORLD ENGLISH ARCHIVE AND PRONUNCIATION STRUCTURE ANALYSIS

H.-P. Shen1,2, N. Minematsu2, T. Makino3, S. H. Weinberger4, T. Pongkittiphan2, C.-H. Wu1

1 National Cheng Kung University, Tainan, Taiwan, 2 The University of Tokyo, Tokyo, Japan

3 Chuo University, Tokyo, Japan, 4 George Mason University, Virginia, USA 2{happy,mine,bank}@gavo.t.u-tokyo.ac.jp, [email protected],

[email protected], [email protected]

ABSTRACT English is the only language available for global communi-cation. Due to the influence of speakers’ mother tongue, however, those from different regions inevitably have dif-ferent accents in their pronunciation of English. The ulti-mate goal of our project is creating a global pronunciation map of World Englishes on an individual basis, for speakers to use to locate similar English pronunciations. If the speak-er is a learner, he can also know how his pronunciation compares to other varieties. Creating the map mathematical-ly requires a matrix of pronunciation distances among all the speakers considered. This paper investigates invariant pro-nunciation structure analysis and Support Vector Regression (SVR) to predict the inter-speaker pronunciation distances. In experiments, the Speech Accent Archive (SAA), which contains speech data of worldwide accented English, is used as training and testing samples. IPA narrow transcriptions in the archive are used to prepare reference pronunciation dis-tances, which are then predicted based on structural analysis and SVR, not with IPA transcriptions. Correlation between the reference distances and the predicted distances is calcu-lated. Experimental results show very promising results and our proposed method outperforms by far a baseline system developed using an HMM-based phoneme recognizer.

Index Terms — World Englishes, speaker-based pro-nunciation clustering, pronunciation structure analysis, f-divergence, support vector regression

1. INTRODUCTION English is the only language available for global communi-cation. In many schools, native pronunciation of English is presented as a reference, which students try to imitate. It is widely accepted, however, that native-like pronunciation is not always needed for smooth communication. Due to the influence of the students’ mother tongue, those from differ-ent regions inevitably have different accents in their pronun-ciation of English. Recently, more and more teachers accept the concept of World Englishes [1,2,3,4] and they regard US and UK pronunciations just as two major examples of ac-cented English. Diversity of World Englishes is found in

various aspects of speech acts such as dialogue, syntax, pragmatics, lexical choice, pronunciation etc. Among these kinds of diversity, this paper focuses on pronunciation. If one takes the concept of World Englishes as it is, he can claim that every kind of accented English is equally correct and equally incorrect. In this situation, there will be a great interest in how one type of pronunciation is different from another, not in how that type of pronunciation is incorrect compared to US or UK pronunciation. As shown in [5], the intelligibility of spoken English heavily depends on the na-ture of the listeners as well as that of the speaker and the spoken content, and foreign accented English can indeed be more intelligible than native English. Generally speaking, speech intelligibility tends to be enhanced among speakers of similarly accented pronunciation.

The ultimate goal of our project is creating a global map of World Englishes on an individual basis for each of the speakers to know how his pronunciation is located in the diversity of English pronunciations. If the speaker is a learn-er, he can then find easy-to-communicate English conversa-tion partners, who will have a similar kind of pronunciation. If he is too distant from many of other varieties, however, he will have to correct his pronunciation to achieve smoother communication with these others.

To the best of our knowledge, our project is the first trial to cluster World English pronunciations automatically and even on an individual basis. For this project, however, we have two major problems. One is collecting data and label-ing them, and the other is creating a good algorithm of drawing the global map for a huge amount of unlabeled data. In [6], some accented English corpora with good quality were introduced. However, labeling data is needed in this paper. Luckily enough, for the first problem, the fourth au-thor has made a good effort in systematically collecting World Englishes from more than a thousand speakers from all over the world and labeling them. This corpus is called the Speech Accent Archive (SAA) [7], which provides speech samples of a common elicitation paragraph and their narrow IPA transcriptions. To solve the second problem, we propose a method of clustering speakers only in terms of pronunciation differences. Clustering items can be per-formed by calculating a distance matrix among all of them.

IMPROVED AND ROBUST PREDICTION OF PRONUNCIATION DISTANCE FORINDIVIDUAL-BASIS CLUSTERING OF WORLD ENGLISHES PRONUNCIATION

S. Kasahara†, S. Kitahara†, N. Minematsu†, H.-P. Shen‡, T. Makino∗, D. Saito†, K. Hirose†

† The University of Tokyo, Tokyo, Japan‡ National Cheng Kung University, Tainan, Taiwan

∗ Chuo University, Tokyo, Japan

ABSTRACTEnglish is the only language available for global communication andis used by approximately 1.5 billions of speakers. It is also known tohave a large diversity of pronunciation due to the influence of speak-ers’ mother tongue, called accents. Our project aims at creating aglobal and individual-basis map of English pronunciations to be usedin teaching and learning World Englishes (WE) as well as researchstudies of WE [1, 2]. Creating the map mathematically requires adistance matrix in terms of pronunciation differences among all thespeakers considered, and technically requires a method of predictingthe pronunciation distance between any pair of the speakers only byusing their speech samples. In our previous study [3], we combinedinvariant pronunciation structure analysis [4, 5, 6, 7] and SupportVector Regression (SVR) to predict the inter-speaker pronunciationdistances. In this paper, several techniques are introduced and exam-ined whether they can increase accuracy and robustness of predic-tion. Experiments show that the correlation between IPA-based ref-erence distances and the predicted distances is increased from 0.78to 0.94, which is over the correlation of 0.91 that could be achievedby using a perfect but imaginary phoneme recognizer.

Index Terms— World Englishes, pronunciation clustering,SAA, IPA transcription, pronunciation structure analysis, supportvector regression, f -divergence, phoneme recognition

1. INTRODUCTION

In many schools, native pronunciation of English is presented as areference, which students try to imitate. It is widely accepted, how-ever, that native-like pronunciation is not always needed for smoothcommunication. Due to the influence of the students’ mother tongue,those from different regions inevitably have different accents in theirpronunciation of English. Recently, more and more teachers ac-cept the concept of World Englishes (WE) [1, 2] and they regardUS and UK pronunciations just as two major examples of accentedEnglish. Diversity of WE can be found in various aspects such as di-alogue, syntax, pragmatics, lexical choice, spelling, pronunciation,etc. Among these kinds of diversity, this paper focuses on pronunci-ation. If one takes the concept of WE as it is, he can claim that theredoes not exist the standard pronunciation of English. In this situa-tion, there will be a great interest in how one type of pronunciationcompares to other varieties, not in how that type of pronunciation isincorrect compared to the one and standard pronunciation.

The ultimate goal of our project is creating a global map of WEon an individual basis for each of the speakers to know how his pro-nunciation is located in the diversity of English pronunciations. Ifthe speaker is a learner, he can then find easier-to-communicate En-glish conversation partners, who are supposed to have a similar kind

[pliːz ̥kɔl əs̆tɛlːʌ as hɛr tu brɪŋ diz θɪŋs wɪθ hɛr frʌm ðə stɑɹ sɪks spuːnz ʌv̥ fɹɛʃ əs̆no piːz faɪv̥ θɪk əs̆lɛb̥s ʌv bluː ʧiːz æn meɪbiː eɪ snæk˺ foɹ hɛɹ bɹʌðɜ bɑb˺ wĭ ɑlso nid˺ eɪ smɑlˠ plæstɪk˺ əs̆n̬eɪk æn eɪ biɡ̥ tʰɔɪ fɹɔɡ˺ fɔɹ ðə kɪdz ̥ʃi kɛn əs̆kuːb˺ ðiːz θɪŋs ɪntu θriː ɹɛd˺ bæɡs æn ə wɪl ɡoː mitʰ hɛɹ wɛnzdeɪ æd˺ də̪ tɹeɪn əs̆teɪʃən]

Please call Stella. + Ask her to bring these things with her from the store: + Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. + We also need a small plastic snake and a big toy frog for the kids.+ She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

Fig. 1. The SAA paragraph and an example of IPA transcription

of pronunciation. If he is too distant from many of other varieties,however, he may have to correct his pronunciation for the first timeto achieve smoother communication with these others.

In this paper, we use the Speech Accent Archive (SAA), whichprovides speech samples of a common elicitation paragraph read bymore than a thousand speakers from all over the world. The SAAalso provides IPA-based narrow transcripts of all the samples, whichcan be used for training a pronunciation distance predictor. To cal-culate the pronunciation distance between two speakers of the SAA,[9, 10] proposed a method of comparing two IPA transcripts usinga modified version of the Levenshtein distance. Although it wasshown that the calculated distances had reasonable correlation withthe pronunciation distances perceived by humans, it cannot handleunlabeled data, i.e., raw speech. Recently, we proposed a method ofpredicting the pronunciation distance only using spoken paragraphsof the SAA [3]. The technical challenge is how to make the predic-tion independent of irrelevant but inevitably involved factors such asdifferences in age, gender, microphone, channel, background noise,etc. To this end, we used invariant pronunciation structure analysis[4, 5, 6, 7] for feature extraction and SVR for prediction. In trainingthe predictor, reference (correct) distances had to be prepared. In [3],IPA-based phonetic distances calculated through string-based DTWof two IPA transcripts were used. The correlation between the ref-erence distances and the predicted ones was 0.78 and in this paper,several techniques are examined to enhance the performance.

2. THE SPEECH ACCENT ARCHIVE

The corpus is composed of read speech samples of more than 1,800speakers and their corresponding IPA narrow transcripts. The speak-ers are from all over the world and they read the common elicitationparagraph, shown in Figure 1, where an example of IPA transcrip-tion is also presented. The paragraph contains 69 words and can bedivided into 221 phoneme instances using the CMU dictionary asreference [11]. The IPA transcripts will be used to prepare reference