Understanding Spoken Language using Statistical and Computational Methods

Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704

http://www.icsi.berkeley.edu/~steveng (contains electronic versions of papers and links to data)

Patterns of Speech Sounds in Unscripted Communication - Production, Perception, Phonology. Akademie Sankelmark, October 8-11, 2000

OR ….

How I Learned to Stop Worrying and Use

The Canonical Form

Disclaimer

I am a Phonetician - NOT!

(many thanks for the invite)

No Scientist is an Island …

IMPORTANT COLLEAGUES

PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH (SWITCHBOARD)
Candace Cardinal, Rachel Coulston, Dan Ellis, Eric Fosler, Joy Hollenback, John Ohala, Colleen Richey

STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION
Eric Fosler, Leah Hitchcock, Joy Hollenback

ARTICULATORY-ACOUSTIC BASIS OF CONSONANT RECOGNITION
Leah Hitchcock, Rosaria Silipo

AUTOMATIC PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH
Shawn Chang, Lokendra Shastri

Germane Publications

http://www.icsi.berkeley.edu/~steveng

STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING

Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the International Congress of Phonetic Sciences, San Francisco.

Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech, in the Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.

Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation, Speech Communication, 29, 159-176.

Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.

Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus, in Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27.

PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY

Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.

Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.

Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal Constraints on Speech Intelligibility as Deduced from Exceedingly Sparse Spectral Representations, Proceedings of Eurospeech, Budapest.

AUTOMATIC PHONETIC TRANSCRIPTION AND SEGMENTATION

Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proc. Int. Conf. Spoken Lang. Proc., Beijing.

Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable detection and segmentation using temporal flow neural networks. Proceedings of the International Congress of Phonetic Sciences, San Francisco, pp. 1721-1724.


Prologue

Language - The Traditional Perspective

The “classical” view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization

Phonetic orthography

Language - A Syllable-Centric Perspective

A more empirical perspective of spoken language focuses on the syllable as the interface between “sound” and “meaning”

Within this framework the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic


Take Home Messages

• SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES
  – Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus)

• THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL
  – Onsets are pronounced in canonical (i.e., dictionary) fashion 80-90% of the time
  – Nuclei and codas are expressed canonically only 60% of the time
  – Nuclei tend to be realized as vowels different from the canonical form
  – Codas are often deleted entirely
  – Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position
  – Therefore, it is important to model spoken language at the syllabic level

• THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY
  – It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material
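The onset, nucleus and coda percentages quoted in the take-home messages are rates of agreement between a canonical (dictionary) form and what was actually spoken, tallied per syllable position. A minimal sketch of such a tally is given below. The dictionary layout, the function name `canonical_rates` and the toy data are assumptions made purely for illustration; they are not the Switchboard transcription format, nor the analysis code behind these slides.

```python
from collections import defaultdict


def canonical_rates(syllables):
    """Fraction of canonical realizations per syllable position.

    `syllables` is an iterable of (canonical, realized) pairs. Each element
    is a dict with optional keys "onset", "nucleus" and "coda" mapping to a
    phone string (a hypothetical format chosen for this sketch).
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for canonical, realized in syllables:
        for position, phones in canonical.items():
            totals[position] += 1
            # A deleted constituent is simply absent from `realized`,
            # so it counts as a non-canonical realization.
            if realized.get(position) == phones:
                matches[position] += 1
    return {position: matches[position] / totals[position] for position in totals}


# Toy data, invented for illustration only (not Switchboard figures).
toy = [
    ({"onset": "dh", "nucleus": "ae", "coda": "t"}, {"onset": "dh", "nucleus": "ae"}),
    ({"nucleus": "ae", "coda": "n d"}, {"nucleus": "ix", "coda": "n"}),
    ({"onset": "n", "nucleus": "ow"}, {"onset": "n", "nucleus": "ow"}),
]
rates = canonical_rates(toy)
```

In the toy data both onsets are canonical, one of three nuclei is replaced by a different vowel, and both codas are either reduced or deleted, which mirrors the direction (though not the magnitude) of the pattern described on the slide.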

Take Home Messages

• PHONETIC PROPERTIES OF SPONTANEOUS SPEECH REFLECT INFORMATION CONTENT

Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech, in the Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.


Road Map

• PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH
  – Provides the basis for the statistical analyses of spontaneous material

• A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF:
  – Phonetic segments
  – Words
  – Syllables
  – Articulatory-acoustic features

• PERCEPTUAL EVIDENCE
  – The articulatory-acoustic basis of consonant recognition
  – Not all articulatory-acoustic features are created equal - place-of-articulation cues appear to be most important for consonant recognition

• COMPUTATIONAL METHODS
  – Automatic methods for phonetic transcription based on articulatory-acoustic features
  – These are the most likely means through which it will be possible to generate sufficient empirical data with which to rigorously test hypotheses germane to spoken language


Phonetic Transcription of Spontaneous (American) English

Phonetic Transcription of Spontaneous English

• TELEPHONE DIALOGUES OF 5-10 MINUTES DURATION - SWITCHBOARD

• AMOUNT OF MATERIAL MANUALLY TRANSCRIBED
  – 3 hours labeled at the phone level and segmented at the syllabic level (this material was later phonetically segmented by automatic methods)
  – 1 hour labeled and segmented at the phonetic-segment level

• DIVERSITY OF MATERIAL TRANSCRIBED
  – Spans speech of both genders (ca. 50/50%) reflecting a wide range of American dialectal variation (6 regions + “army brat”), speaking rate and voice quality

• TRANSCRIBED BY WHOM?
  – 7 undergraduates and 1 graduate student, all enrolled at UC-Berkeley. Most of the corpus was transcribed by three individuals out of the original eight
  – Supervised by Steven Greenberg and John Ohala

• TRANSCRIPTION SYSTEM
  – A variant of Arpabet, with phonetic diacritics such as: _gl, _cr, _fr, _n, _vl, _vd

• HOW LONG DOES TRANSCRIPTION TAKE? (Don’t Ask!)
  – 388 times real time for labeling and segmentation at the phonetic-segment level
  – 150 times real time for labeling phonetic segments and segmenting syllables

• HOW WAS LABELING AND SEGMENTATION PERFORMED?
  – Using a display of the signal waveform, spectrogram, word transcription and “forced alignments” (estimates of phones and boundaries) + audio (listening at multiple time scales - phone, word, utterance) on Sun workstations

• DATA AVAILABLE AT - http://www.icsi.berkeley.edu/real/stp
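The real-time factors above translate into a rough estimate of the manual effort involved. The sketch below assumes that each factor applies uniformly to the corresponding amount of material (388 times real time to the 1 hour labeled at the phonetic-segment level, 150 times real time to the 3 hours segmented at the syllabic level). That pairing is an inference from the slide rather than a figure stated on it, and the names are illustrative.

```python
# Real-time factors and hours of material, paired as assumed in the lead-in.
effort = {
    "labeling and segmentation at the phonetic-segment level": (388, 1.0),
    "phone labeling with syllable segmentation": (150, 3.0),
}


def person_hours(real_time_factor, hours_of_speech):
    """Transcription time in hours for a given amount of speech."""
    return real_time_factor * hours_of_speech


hours = {task: person_hours(*spec) for task, spec in effort.items()}
total_hours = sum(hours.values())
```

Under these assumptions the two figures come to 388 and 450 hours, or 838 hours of transcriber time for four hours of speech, which is the practical argument for the automatic methods discussed later in the talk.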

A Brief Tour of Pronunciation Variation in Spontaneous American English

Cumulative Word Frequency in English

The 10 most common words account for 27% of the corpus

The 100 most common words account for 67% of the corpus

The 1000 most common words account for 92% of the corpus

Thus, most informal dialogues are composed of a relatively small number of common words.

However, it is the infrequent words that typically provide the precision and detail required for complex information transfer

[Figure: cumulative word frequency curve, marked at 27%, 67% and 92%. Computed from the Switchboard corpus (American English telephone dialogues)]

Focus on 100 most common words
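The coverage figures above are cumulative relative frequencies: the share of all word tokens accounted for by the k most frequent word types. A minimal sketch of that computation follows, assuming the corpus is available as a flat list of word tokens. The function name `coverage` and the toy sentence are illustrative only and have nothing to do with the Switchboard data.

```python
from collections import Counter


def coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent word types."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


# Toy token list, invented for illustration only.
tokens = "i know and you know i think and i mean it".split()
top1 = coverage(tokens, 1)
top3 = coverage(tokens, 3)
top7 = coverage(tokens, 7)
```

Applied to a real corpus, evaluating `coverage` at k = 10, 100 and 1000 gives the three points marked on the cumulative curve.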

How Many Pronunciations of “And”?

N   Pronunciation
82  ae n
63  eh n
45  ix n
35  ax n
34  en
30  n
20  ae n dcl d
17  ih n
17  q ae n
11  ae n d
7   q eh n
7   ae nx
6   ae ae n
6   ah n
5   eh nx
4   uh n
4   ix nx
4   q ae n dcl d
3   eh n d
3   q ae nx
3   eh
2   ae n dcl
2   ae
2   ax m
2   ax n d
2   ae eh n dcl d
2   eh n dcl d
2   ax nx
2   q ae ae n
2   q ix n
2   ix n dcl d
2   ih
2   eh eh n
2   q eh nx
2   ix d n
1   eh m
1   ax n dcl d
1   aw n
1   ae q
1   eh dcl
1   ah nx
1   ae n t
1   eh d
1   ah n dcl d
1   ey ih n dcl
1   ae ix n
1   ae nx ax
1   ax ng
1   ay n
1   ih ah n d
1   ae hh
1   ih ng
1   ix
1   ae n d dcl
1   ix dcl d
1   ae eh n
1   hh n
1   ix n t
1   ae ax n dcl d
1   iy eh n
1   m
1   ae ae n d
1   nx
1   q ae ae n
1   q ae ae n dcl d
1   q ae eh n dcl d
1   q ae ih n
1   aa n
1   q ae n d
1   ? nx
1   q ae n q
1   eh n m
1   q eh en dcl
1   eh ng
1   q eh n q
1   em
1   q eh ow m
1   q ih n
1   q ix en
1   er

Page 39: Understanding Spoken Language using Statistical and Computational Methods Steven Greenberg International Computer Science Institute 1947 Center Street,

1   I 6 4 9   5 3   5 3   a y

2   a n d 5 2 1   8 7   1 6   a e n

3   th e 4 7 5    7 6   2 7   d h a x

4   y o u 4 0 6   6 8   2 0   y ix

5   th a t 3 2 8   1 1 7   1 1   d h a e

6   a 3 1 9   2 8   6 4   a x

7   to 2 8 8   6 6   1 4   tc l t u w

8   k n o w 2 4 9   3 4   5 6   n o w

9   o f 2 4 2   4 4   2 1   a x v

1 0   it 2 4 0   4 9   2 2   ih

1 1   y e a h 2 0 3   4 8   4 3   y a e

1 2   in 1 7 8   2 2   4 5   ih n

1 3   th e y 1 5 2   2 8   6 0   d h e y

1 4   d o 1 3 1   3 0   5 4   d c l d u w

1 5   s o 1 3 0   1 4   7 4   s o w

1 6   b u t 1 2 3   4 5   1 2   b c l b a h tc l t

1 7   is 1 2 0   2 4   5 0   ih z

1 8   lik e 1 1 9   1 9   4 6   l a y k c l k

1 9   h a v e 1 1 6   2 2   5 4   h h a e v

2 0   w a s 1 1 1   2 4   2 3   w a h z

2 1   w e 1 0 8   1 3   8 3   w iy

2 2   it's 1 0 1   1 4   2 0   ih tc l s

2 3   ju s t 1 0 1   3 4   1 7   jh ix s

2 4   o n 9 8   1 8   4 9   a a n

2 5   o r 9 4   2 3   3 6   e r

2 6   n o t 9 2   2 4   2 4   m a a q

2 7   th in k 9 2   2 3   3 2   th ih n g k c l k

2 8   fo r 8 7   1 9   4 6   f e r

2 9   w e ll 8 4   4 9   2 3   w e h l

3 0   w h a t 8 2   4 0   1 4   w a h d x

3 1   a b o u t 7 7   4 6   1 2   a x b c l b a w

3 2   a ll 7 4   2 7   2 4   a o l

3 3   th a t's 7 4   1 9   1 6   d h e h s

3 4   o h 7 4   1 7   6 1   o w

3 5   re a lly 7 1   2 5   4 5   r ih l iy

3 6   o n e 6 9   8   7 8   w a h n

3 7   a re 6 8   1 9   4 2   e r

3 8   I'm 6 7 9   2 6   q a a m

3 9   rig h t 6 1   2 1   2 8   r a y

4 0   u h 6 0   1 6   4 1   a h

4 1   th e m 6 0   1 8   2 3   a x m

4 2   a t 5 9   3 6   8   a e d x

4 3   th e re 5 8   2 8   2 2   d h e h r

4 4   my 5 8   9   6 6   m a y

4 5   me a n 5 6   1 0   5 8   m iy n

4 6   d o n 't 5 6   2 1   1 4   d x o w

4 7   n o 5 5   8   7 7   n o w

4 8   w ith 5 5   2 0   3 5   w ih th

4 9   if 5 5   1 8   4 1   ih f

5 0   w h e n 5 4   1 8   3 1   w e h n

5 1   c a n 5 4   2 8   1 5   k c l k a e n

5 2   th e n 5 1   1 9   3 8   d h e h n

5 3   b e 5 0   1 1   7 6   b c l b iy

5 4   a s 4 9   1 6   1 8   a e z

5 5   o u t 4 7   1 9   2 2   a e d x

5 6   k in d 4 7   1 7   2 1   k c l k a x n x

5 7   b e c a u e 4 6   3 1   1 5   k c l k a x z

5 8   p e o p le 4 5   2 1   4 4  p c l p iy p c l l e l

5 9   g o 4 5   5   8 3   g c l g o w

6 0   g o t 4 5   3 2   1 5   g c l g a a

6 1   th is 4 4   1 1   4 7   d h ih s

6 2   s o me 4 3   4   4 8   s a h m

6 3   w o u ld 4 1   1 6   2 9   w ih d c l

6 4   th in g s 4 1   1 5   5 2   th ih n g z

6 5   n o w 3 9   1 1   6 9   n a w

6 6   lo t 3 9   9   4 7   l a a d x

6 7   h a d 3 9   1 9   2 4   h h a e d c l

6 8   h o w 3 9   1 1   5 3   h h a w

6 9   g o o d 3 8   1 3   2 7   g c l g u h d c l

7 0   g e t 3 8   2 0   1 3   g c l g e h d x

7 1   s e e 3 7   6   8 0   s iy

7 2   fro m 3 6   1 0   2 8   f r a h m

7 3   h e 3 6   7   3 9   iy

7 4   me 3 5   5   8 7   m iy

7 5   d o n 't 3 5   2 1   1 4   d x o w

7 6   th e ir 3 3   1 9   2 5   d h e h r

7 7   mo re 3 2   1 1   5 6   m a o r

7 8   it's 3 1   1 4   2 0   ih tc l s

7 9   th a t's 3 1   2 0   1 6   d h e h s

8 0   to o 3 1   6   6 0   tc l t u w

8 1   o k a y 3 1   1 7   4 5   o w k c l k e y

8 2   v e ry 3 0   1 1   3 6   v e h r iy

8 3   u p 3 0   1 1   3 4   a h p c l p

8 4   b e e n 3 0   1 1   5 1   b c l b ih n

8 5   g u e s s 2 9   8   4 2   g c l g e h s

8 6   time 2 9   8   6 2   tc l t a y m

8 7   g o in g 2 9   2 1   1 3   g c l g o w ih n g

8 8   in to 2 8   2 0   1 4   ih n tc l t u w

8 9   th o s e 2 7   1 2   4 2   d h o w z

9 0   h e re 2 7   1 1   2 5   h h iy e r

9 1   d id 2 7   1 3   2 3   d c l d ih d x

9 2   w o rk 2 5   8   6 6   w e r k c l k

9 3   o th e r 2 5   1 4   2 6   a h d h e r

9 4   a n 2 5   1 2   2 8   a x n

9 5   I'v e 2 5   7   4 6   a y v

9 6   th in g 2 4   9   5 2   th ih n g

9 7   e v e n 2 4   7   4 0   iy v ix n

9 8   o u r 2 3   9   3 3   a a r

9 9   a n y 2 3   1 1   2 3   ix n iy

1 0 0   w e 're 2 3   8   2 5   w e y r

How Many Different Pronunciations?

Rank  Word     N    #Pron  MCP % Total  Most Common Pronunciation
1     I        649  53     53           ay
2     and      521  87     16           ae n
3     the      475  76     27           dh ax
4     you      406  68     20           y ix
5     that     328  117    11           dh ae
6     a        319  28     64           ax
7     to       288  66     14           tcl t uw
8     know     249  34     56           n ow
9     of       242  44     21           ax v
10    it       240  49     22           ih
11    yeah     203  48     43           y ae
12    in       178  22     45           ih n
13    they     152  28     60           dh ey
14    do       131  30     54           dcl d uw
15    so       130  14     74           s ow
16    but      123  45     12           bcl b ah tcl t
17    is       120  24     50           ih z
18    like     119  19     46           l ay kcl k
19    have     116  22     54           hh ae v
20    was      111  24     23           w ah z
21    we       108  13     83           w iy
22    it's     101  14     20           ih tcl s
23    just     101  34     17           jh ix s
24    on       98   18     49           aa n
25    or       94   23     36           er
26    not      92   24     24           m aa q
27    think    92   23     32           th ih ng kcl k
28    for      87   19     46           f er
29    well     84   49     23           w eh l
30    what     82   40     14           w ah dx
31    about    77   46     12           ax bcl b aw
32    all      74   27     24           ao l
33    that's   74   19     16           dh eh s
34    oh       74   17     61           ow
35    really   71   25     45           r ih l iy
36    one      69   8      78           w ah n
37    are      68   19     42           er
38    I'm      67   9      26           q aa m
39    right    61   21     28           r ay
40    uh       60   16     41           ah
41    them     60   18     23           ax m
42    at       59   36     8            ae dx
43    there    58   28     22           dh eh r
44    my       58   9      66           m ay
45    mean     56   10     58           m iy n
46    don't    56   21     14           dx ow
47    no       55   8      77           n ow
48    with     55   20     35           w ih th
49    if       55   18     41           ih f
50    when     54   18     31           w eh n
51    can      54   28     15           kcl k ae n
52    then     51   19     38           dh eh n
53    be       50   11     76           bcl b iy
54    as       49   16     18           ae z
55    out      47   19     22           ae dx
56    kind     47   17     21           kcl k ax nx
57    because  46   31     15           kcl k ax z
58    people   45   21     44           pcl p iy pcl l el
59    go       45   5      83           gcl g ow
60    got      45   32     15           gcl g aa
61    this     44   11     47           dh ih s
62    some     43   4      48           s ah m
63    would    41   16     29           w ih dcl
64    things   41   15     52           th ih ng z
65    now      39   11     69           n aw
66    lot      39   9      47           l aa dx
67    had      39   19     24           hh ae dcl
68    how      39   11     53           hh aw
69    good     38   13     27           gcl g uh dcl
70    get      38   20     13           gcl g eh dx
71    see      37   6      80           s iy
72    from     36   10     28           f r ah m
73    he       36   7      39           iy
74    me       35   5      87           m iy
75    don't    35   21     14           dx ow
76    their    33   19     25           dh eh r
77    more     32   11     56           m ao r
78    it's     31   14     20           ih tcl s
79    that's   31   20     16           dh eh s
80    too      31   6      60           tcl t uw
81    okay     31   17     45           ow kcl k ey
82    very     30   11     36           v eh r iy
83    up       30   11     34           ah pcl p
84    been     30   11     51           bcl b ih n
85    guess    29   8      42           gcl g eh s
86    time     29   8      62           tcl t ay m
87    going    29   21     13           gcl g ow ih ng
88    into     28   20     14           ih n tcl t uw
89    those    27   12     42           dh ow z
90    here     27   11     25           hh iy er
91    did      27   13     23           dcl d ih dx
92    work     25   8      66           w er kcl k
93    other    25   14     26           ah dh er
94    an       25   12     28           ax n
95    I've     25   7      46           ay v
96    thing    24   9      52           th ih ng
97    even     24   7      40           iy v ix n
98    our      23   9      33           aa r
99    any      23   11     23           ix n iy
100   we're    23   8      25           w ey r
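The columns of the pronunciation table (N, #Pron, most common pronunciation and its share of the total) can be derived from a list of transcribed tokens for each word. The following is a minimal sketch of that bookkeeping, not the procedure actually used for the corpus; the function name `pronunciation_stats` and the ten example transcriptions of "and" are hypothetical and do not come from Switchboard.

```python
from collections import Counter

def pronunciation_stats(tokens):
    """Return (N, #Pron, most common pronunciation, MCP % of total) for one word."""
    counts = Counter(tokens)                    # pronunciation variant -> token count
    n = len(tokens)                             # N: number of tokens of the word
    mcp, mcp_count = counts.most_common(1)[0]   # most common pronunciation (MCP)
    return n, len(counts), mcp, round(100 * mcp_count / n)

# Hypothetical transcriptions of ten tokens of "and" (illustrative only)
and_tokens = ["ae n", "ae n", "ae n", "ae n", "ix n", "ix n",
              "ae n dcl d", "en", "n", "ae nx"]

print(pronunciation_stats(and_tokens))
```

A low MCP percentage combined with a high #Pron, as for "and" or "that" in the table, indicates that no single variant dominates.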


English is (sort of) like Chinese ….

81% of the word tokens are monosyllabic

Of the 100 most common words, 90 are one syllable in length

Only 22% of the words in the lexicon are one syllable long

Hence, there is a decided preference for monosyllabic words in informal discourse

95% of the words contain just ONE or TWO syllables ….
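The contrast on this slide is between word tokens (every occurrence counts) and lexicon entries (each distinct word counts once). A small sketch of the two measures follows; the syllable counts and the twelve-word token stream are hypothetical toy data chosen only to show how the two percentages can diverge.

```python
# Hypothetical toy data: syllable count per word and a short token stream
SYLLABLES = {"i": 1, "know": 1, "that": 1, "you": 1,
             "really": 2, "people": 2, "understand": 3}
tokens = "i know that you know that i really know people understand that".split()

def monosyllabic_share(words):
    """Fraction of the given words that are one syllable long."""
    words = list(words)
    return sum(SYLLABLES[w] == 1 for w in words) / len(words)

token_share = monosyllabic_share(tokens)        # every occurrence counts
type_share = monosyllabic_share(set(tokens))    # each distinct word counts once

print(f"tokens: {token_share:.0%}, lexicon: {type_share:.0%}")
```

Because the short words recur far more often than the long ones, the token share exceeds the lexicon share, which is the pattern the slide reports (81% versus 22%).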


Syllable and Word Frequencies are Similar

Words and syllables exhibit similar distributions over the 300 most common elements, accounting for 80% of the corpus

The similarity of their distributions is a consequence of most words consisting of just a single syllable


Word Frequency in Spontaneous English

Word frequency as a function of word rank approximates a 1/f distribution, particularly after rank-order 10

Word frequency is roughly inversely proportional to rank order in the corpus, a straight line on log-log axes (i.e., the 10th most common word occurs ca. 10 times more frequently than the 100th most common word, etc.)

Computed from the Switchboard corpus (American English telephone dialogues)
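The 1/rank relation can be checked against the counts listed earlier in this deck, where the 10th-ranked word ("it") has 240 tokens and the 100th-ranked word ("we're") has 23. The sketch below compares the idealized ratio with the observed one; the helper name `zipf_ratio` is hypothetical.

```python
def zipf_ratio(rank_a, rank_b):
    """Under an idealized 1/rank model, how many times more frequent rank_a is than rank_b."""
    return rank_b / rank_a

# Token counts taken from the 100-word pronunciation table in this deck
observed = {1: 649, 10: 240, 100: 23}

model_ratio = zipf_ratio(10, 100)                 # idealized prediction
observed_ratio = observed[10] / observed[100]     # what the table shows

print(model_ratio, round(observed_ratio, 1))
```

The rank-1 count (649) is well below ten times the rank-10 count, consistent with the remark that the approximation holds mainly after rank-order 10.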


Information Affects Pronunciation

The faster the speaking rate, the more likely that the pronunciation deviates from canonical

However, the effect is much more pronounced for the 100 most common words than for more infrequent words

From Fosler, Greenberg and Morgan (1999); Greenberg and Fosler (2000)


English Syllable Structure is (sort of) Like Japanese

Most syllables are simple in form (no consonant clusters)

87% of the pronunciations are simple syllabic forms

84% of the canonical corpus is composed of simple syllabic forms

[Figure: bar chart by syllable type (CV, CVC, VC, V), vertical axis 0 to 50, comparing Pronunciation and Corpus; n = 103,054]


Complex Syllables are Important, Though

There are many “complex” syllable forms (consonant clusters), but all occur relatively infrequently

Thus, despite English’s reputation for complex syllabic forms, only ca. 15% of the syllable tokens are actually complex

Complex codas are realized less frequently in actual pronunciation than in their canonical representation

Complex onsets tend to preserve their canonical representation in actual pronunciation

n = 17,760


Syllable-Centric Pronunciation

“Cat” [k ae t]: [k] = onset, [ae] = nucleus, [t] = coda

Onsets are pronounced canonically far more often than nuclei or codas

Codas tend to be pronounced canonically more frequently in formal speech than in spontaneous dialogues

[Figure: percent canonically pronounced by syllable position, for spontaneous speech and read sentences; n = 120,814]
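The onset/nucleus/coda comparison can be sketched for a single syllable as follows. This is an illustrative assumption about how such a tally might be made, not the analysis code behind the figure: the vowel inventory, the function names and the coda-deleted realization of "cat" are all hypothetical.

```python
# Assumed vowel inventory (ARPABET-style symbols like those in the deck's tables);
# a real system would need the full phone set.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er", "ey",
          "ih", "ix", "iy", "ow", "uh", "uw"}

def split_syllable(phones):
    """Split one syllable (a list of phone symbols) into onset, nucleus and coda."""
    idx = next(i for i, p in enumerate(phones) if p in VOWELS)  # first vowel = nucleus
    return phones[:idx], [phones[idx]], phones[idx + 1:]

def canonical_by_position(canonical, realized):
    """Report, per syllable position, whether the realized phones match the canonical ones."""
    names = ("onset", "nucleus", "coda")
    return {name: c == r
            for name, c, r in zip(names, split_syllable(canonical), split_syllable(realized))}

canonical_cat = ["k", "ae", "t"]   # from the slide: "Cat" [k ae t]
realized_cat = ["k", "ae"]         # hypothetical realization with the coda deleted
result = canonical_by_position(canonical_cat, realized_cat)
print(result)
```

Summing such per-position matches over all syllables in a corpus would give the percentages plotted by syllable position.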


Complex Onsets are Highly Canonical

Complex onsets are pronounced more canonically than simple onsets despite the greater potential for deviation from the standard pronunciation

[Figure: percent canonically pronounced (axis 70 to 100) by syllable onset type, Simple (C) vs. Complex (CC(C)), for STP (spontaneous speech) and TIMIT (read sentences)]


Speaking Style Affects Codas

Codas are much more likely to be realized canonically in formal than in spontaneous speech

[Figure: percent canonically pronounced by syllable coda type]


Onsets (but not Codas) Affect Nuclei

The presence of a syllable onset has a substantial impact on the realization of the nucleus

[Figure: percent canonically pronounced (axis 50 to 70) for all nuclei and for nuclei with onset, without onset, with coda and without coda, for STP and TIMIT]


Syllable-Centric Feature Analysis

Phonetic deviation along a SINGLE feature

• Place of articulation deviates most in nucleus position
• Manner of articulation deviates most in onset and coda position
• Voicing deviates most in coda position

Place deviates very little from canonical form in the onset and coda. It is a STABLE AF in these positions

Place is VERY unstable in nucleus position


Articulatory PLACE Feature Analysis

Phonetic deviation across SEVERAL features

• Place of articulation is a “dominant” feature in nucleus position only
• Drives the feature deviation in the nucleus for manner and rounding

Place “carries” manner and rounding in the nucleus


Articulatory MANNER Feature Analysis

Phonetic deviation across SEVERAL features

• Manner of articulation is a “dominant” feature in onset and coda position
• Drives the feature deviation in onsets and codas for place and voicing

Manner is less stable in the coda than in the onset

Manner drives place and voicing deviations in the onset and coda


Articulatory VOICING Feature Analysis

Phonetic deviation across SEVERAL features

• Voicing is a subordinate feature in all syllable positions
• Its deviation pattern is controlled by manner in onset and coda positions

Voicing is unstable in coda position and is dominated by manner


LIP-ROUNDING Feature Analysis

Phonetic deviation across SEVERAL features

• Lip-rounding is a subordinate feature
• Its deviation pattern is driven by the place feature in nucleus position

Rounding is stable everywhere except in the nucleus, where its deviation pattern is driven by place


Perceptual Evidence for the Importance of Place (and Manner) of Articulation Features

Spectral Slit Paradigm

Consonant Recognition - Single Slits

Consonant Recognition - 1 Slit

Consonant Recognition - 2 Slits

Consonant Recognition - 3 Slits

Consonant Recognition - 4 Slits

Consonant Recognition - 5 Slits

[Further figure slides: Consonant Recognition - 2 Slits (six slides), 3 Slits (three slides), 4 Slits, 5 Slits]


Correlation - AFs/Consonant Recognition

Consonant recognition is almost perfectly correlated with place of articulation performance

This correlation suggests that the place feature is based on cues distributed across the entire speech bandwidth, in contrast to other features

Manner is also highly correlated with consonant recognition, voicing and rounding less so


Automatic Phonetic Transcription of Spontaneous Speech


Automatic Phonetic Transcription

• MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS
   – These materials are highly inaccurate (35-50% incorrect labeling of phonetic segments and an average of 32 ms (40% of the mean phone duration) off in terms of segment boundaries)
• IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA
   – Manual labeling and segmentation typically require 150-400 times real time to perform
• WE HAVE DEVELOPED AN AUTOMATIC LABELING OF PHONETIC SEGMENTS (ALPS) SYSTEM TO PROVIDE TRAINING MATERIALS FOR ASR AND TO ENABLE RAPID DEPLOYMENT TO NEW CORPORA AND FOREIGN LANGUAGE MATERIAL
   – Such material will be extremely useful for developing pronunciation models and new algorithms for ASR
• THE ALPS SYSTEM CURRENTLY LABELS SPONTANEOUS MATERIALS (OGI Numbers Corpus) WITH ca. 83% ACCURACY
   – The algorithms used are capable of achieving ca. 93% accuracy with only minor changes to the models
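Two of the figures on this slide imply further numbers that are easy to work out. A 32 ms boundary error equal to 40% of the mean phone duration implies a mean phone duration of about 80 ms, and a labeling cost of 150-400 times real time can be applied to the four hours of Switchboard material mentioned in the summary of this talk. The short calculation below is only arithmetic on those stated values; the variable and function names are hypothetical.

```python
# Values stated on the slides
boundary_error_ms = 32          # average segment-boundary error of automatic alignments
error_percent_of_phone = 40     # that error as a percentage of mean phone duration
real_time_factors = (150, 400)  # manual labeling cost, in multiples of real time
speech_hours = 4                # phonetically labeled Switchboard material

# Implied mean phone duration in milliseconds
mean_phone_ms = boundary_error_ms * 100 / error_percent_of_phone

def manual_labeling_hours(hours_of_speech, factor):
    """Person-hours needed to label the speech at a given real-time factor."""
    return hours_of_speech * factor

low, high = (manual_labeling_hours(speech_hours, f) for f in real_time_factors)
print(mean_phone_ms, low, high)
```

On these figures, four hours of speech corresponds to roughly 600 to 1,600 hours of manual work, which is the motivation given for automatic labeling.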


Phonetic Feature Classification System


Spectro-Temporal Profile (STeP)

• STePs provide a simple, accurate means of delineating the acoustic properties associated with phonetic features and segments

[Figure: STeP labeled “Vocalic”]


Spectro-Temporal Profile (STeP)

• STePs incorporate information about the instantaneous modulation spectrum distributed across the (tonotopic) frequency axis and can be used for training neural networks.

[Figure: STeP labeled “Fricative”]


Label Accuracy per Frame

• Frames away from the boundary are labeled very accurately


Sample Transcription Output

• The automatic system performs very similarly to manual transcription in terms of both labels and segmentation
   – 11 ms average concordance in segmentation
   – 83% concordance with respect to phonetic labels
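The slides do not state how the two concordance figures were computed. One plausible reading, sketched below purely as an assumption, is frame-level label agreement and mean absolute boundary offset between the manual and automatic transcriptions. The function names and the example frames and boundaries are hypothetical, and the sketch assumes both transcriptions have the same number of frames and a one-to-one pairing of boundaries.

```python
def label_concordance(manual_frames, auto_frames):
    """Fraction of frames on which the two transcriptions assign the same label."""
    matches = sum(m == a for m, a in zip(manual_frames, auto_frames))
    return matches / len(manual_frames)

def mean_boundary_offset_ms(manual_bounds, auto_bounds):
    """Mean absolute difference (ms) between paired segment boundaries."""
    offsets = [abs(m - a) for m, a in zip(manual_bounds, auto_bounds)]
    return sum(offsets) / len(offsets)

# Hypothetical frame labels for one short stretch of speech (illustrative only)
manual = ["s", "s", "eh", "eh", "eh", "v", "v", "ix", "n", "n"]
auto   = ["s", "s", "s",  "eh", "eh", "v", "v", "ix", "ix", "n"]

# Hypothetical boundary times in milliseconds
manual_bounds = [20, 50, 70, 80]
auto_bounds   = [30, 50, 70, 90]

print(label_concordance(manual, auto), mean_boundary_offset_ms(manual_bounds, auto_bounds))
```

In this toy case the disagreements fall on frames next to segment boundaries, in line with the observation that frames away from the boundary are labeled very accurately.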


In Conclusion ….


Grand Summary

• SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES
   – Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus)
   – Automatic methods will eventually supply badly needed data for more complete analyses and evaluation
• THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL
   – Onsets are pronounced in canonical (i.e., dictionary) fashion 85-90% of the time
   – Nuclei and codas are expressed canonically only 60% of the time
   – Nuclei tend to be realized as vowels different from the canonical form
   – Codas are often deleted entirely
   – Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position
   – Therefore, it is important to model spoken language at the syllabic level
• THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY
   – It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material


That’s All, Folks

Many Thanks for Your Time and Attention


Temporal View of Language


Linguistic Automatic Speech Recognition

• CHARACTERIZE SPOKEN LANGUAGE WITH GREAT PRECISION
   – Currently, manual transcription is the only means by which to collect detailed data pertaining to spoken language. Computational methods are currently being developed to perform transcription automatically in order to provide an abundance of data for statistical characterization of spontaneous discourse.
• USE THIS KNOWLEDGE TO DEVELOP COMPUTATIONAL TECHNIQUES TAILORED TO THE PROPERTIES OF THE SPEECH DOMAIN
   – A detailed knowledge of spoken language is essential for deriving a computational framework for ASR. The phonetic properties of speech are structured in different ways depending on the location within the syllable, word and phrase. Such knowledge is currently under-utilized by mainstream ASR.
• FOCUS ON LOWER TIERS OF SPOKEN LANGUAGE FOR THE PRESENT
   – It is fashionable to emphasize the importance of “language” models (i.e., word co-occurrence properties) in ASR. However, most of the problems lie in the acoustic-phonetic front end and therefore this domain should be attacked first.
• USE KNOWLEDGE OF HOW HUMAN LISTENERS UNDERSTAND SPOKEN LANGUAGE TO GUIDE DEVELOPMENT OF ASR ALGORITHMS
   – Current ASR acoustic models are not based on perceptual capabilities of human listeners, but on a distorted representation of what is important in hearing. It is important to perform intelligibility experiments to ascertain the identity of the truly important components of the speech signal and use this knowledge to develop robust, acoustic-front-end models for ASR.


Linguistic ASR Research @ ICSI

• PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
   – Human listening experiments identifying the specific properties crucial for understanding spoken language
• MODULATION-SPECTRUM-BASED AUTOMATIC SPEECH RECOGNITION
   – Using auditory-based algorithms (linked to the syllable) for reliable ASR in background noise and reverberation
• SYLLABLE-BASED AUTOMATIC SPEECH RECOGNITION
   – Development of a syllable-based decoder for ASR
• STATISTICAL PROPERTIES OF SPONTANEOUS SPEECH
   – Detailed and comprehensive statistical analyses of the Switchboard corpus pertaining to phonetic, prosodic and lexical properties, used for developing pronunciation models (among other things)
• AUTOMATIC PHONETIC LABELING AND SEGMENTATION
   – Development of (the first) automatic phonetic transcription system using articulatory-acoustic features (e.g., voicing, manner, place etc.)
• AUTOMATIC LABELING OF PROSODIC STRESS
   – Development of (the first) automatic system for labeling prosodic stress in English
• AUTOMATIC SPEECH RECOGNITION DIAGNOSTIC EVALUATION
   – Detailed and comprehensive analyses of Switchboard-corpus ASR systems in order to identify factors associated with word error


Linguistic ASR at ICSI

• SENIOR PERSONNEL
   – Steven Greenberg - Linguistic ASR, Spoken Language Statistics, Speech Perception
   – Lokendra Shastri - Neural Network Design, Higher-level Language & Neural Processing
• GRADUATE STUDENTS
   – Shawn Chang - ANN-based ASR, Automatic Phonetic Transcription & Segmentation
   – Michael Shire - Temporal & Multi-Stream Approaches to Automatic Speech Recognition
   – Mirjam Wester - Pronunciation Modeling in Automatic Speech Recognition
• UNDERGRADUATE STUDENTS
   – Micah Farrer - Database Development for ASR Analysis
   – Leah Hitchcock - Statistics of Pronunciation and Prosody of Spoken Language
• TECHNICAL STAFF
   – Joy Hollenback - Statistical Analyses, Data Collection and Maintenance
• ASSOCIATES AT ICSI
   – Hynek Hermansky, Nelson Morgan, Liz Shriberg and Andreas Stolcke
• ASSOCIATES AT LOCATIONS OTHER THAN ICSI
   – Takayuki Arai (Sophia University, Tokyo) - Speech Perception, Signal Processing
   – Les Atlas (University of Washington, Seattle) - Acoustic Signal Processing
   – Ken Grant (Walter Reed Army Medical Center) - Audio-visual Speech Processing
   – David Poeppel (University of Maryland) - Brain Mechanisms of Language
   – Tim Roberts (UC-San Francisco Medical Center) - Brain Imaging of Language Processes
   – Christoph Schreiner (UCSF) - Auditory Cortex and Its Relation to Speech Processing
   – Lloyd Watts (Applied Neurosystems) - Auditory Modeling from Cochlea to Cortex
• CURRENT FUNDING
   – National Security Agency - Automatic Transcription of Phonetic and Prosodic Elements
   – National Science Foundation - Syllable-based ASR, Speech Perception, Statistics of Speech


Linguistic ASR at ICSI (continued)

• FORMER ICSI POST-DOCTORAL FELLOWS
   – Takayuki Arai - Sophia University, Tokyo
   – Dan Ellis - Columbia University (as of September 1, 2000)
   – Eric Fosler - Bell Laboratories, Lucent Technologies
   – Rosaria Silipo - Nuance Communications
• FORMER ICSI GRADUATE STUDENTS
   – Jeff Bilmes - University of Washington, Seattle
   – Eric Fosler - Bell Laboratories, Lucent Technologies
   – Brian Kingsbury - IBM, Yorktown Heights
   – Katrin Kirchhoff - University of Washington, Seattle
   – Nikki Mirghafori - Nuance Communications
   – Su-Lin Wu - Nuance Communications
• FORMER ICSI UNDERGRADUATE STUDENTS
   – Candace Cardinal - Nuance Communications
   – Rachel Coulston - University of California, San Diego
   – Collen Richey - Stanford University


Publications - Linguistic ASR

AUTOMATIC SPEECH RECOGNITION DIAGNOSTIC EVALUATION

Greenberg, S., Chang, S. and Hollenback, J. (2000) An introduction to the diagnostic evaluation of the Switchboard-corpus automatic speech recognition systems. Proceedings of the NIST Speech Transcription Workshop, College Park.

Greenberg, S. and Chang, S. (2000) Linguistic dissection of switchboard-corpus automatic recognition systems. Proceedings of the ICSI Workshop on Automatic Speech Recognition: Challenges for the New Millennium, Paris.

AUTOMATIC PHONETIC TRANSCRIPTION AND SEGMENTATION

Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proc. Int. Conf. Spoken Lang. Proc., Beijing.

Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable detection and segmentation using temporal flow neural networks. Proceedings of the International Congress of Phonetic Sciences, San Francisco, pp. 1721-1724.

STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING

Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the International Congress of Phonetic Sciences, San Francisco.

Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.

Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.

Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.

Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27.

AUTOMATIC LABELING OF PROSODIC STRESS IN SPONTANEOUS SPEECH

Silipo, R. and Greenberg, S. (1999) Automatic transcription of prosodic stress for spontaneous English discourse. Proceedings of the International Congress of Phonetic Sciences, San Francisco.

Silipo, R. and Greenberg, S. (2000) Prosodic stress revisited: Reassessing the role of fundamental frequency. Proceedings of the NIST Speech Transcription Workshop, College Park.


Publications - Linguistic ASR (continued)

MODULATION-SPECTRUM-BASED AUTOMATIC SPEECH RECOGNITION

Greenberg, S. and Kingsbury, B. (1997) The modulation spectrogram: In pursuit of an invariant representation of speech, in ICASSP-97, IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, pp. 1647-1650.

Kingsbury, B., Morgan, N. and Greenberg, S. (1999) The modulation-filtered spectrogram: A noise-robust speech representation, in Proceedings of the Workshop on Robust Methods for Speech Recognition in Adverse Conditions, Tampere, Finland.

Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Robust speech recognition using the modulation spectrogram, Speech Communication, 25, 117-132.

SYLLABLE-BASED AUTOMATIC SPEECH RECOGNITION

Wu, S.-L., Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Incorporating information from syllable-length time scales into automatic speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 721-724.

Wu, S.-L., Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Performance improvements through combining phone- and syllable-length information in automatic speech recognition, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 854-857.

PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY GERMANE TO ASR

Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.

Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.

Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.

Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.

Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal Constraints on Speech Intelligibility as Deduced from Exceedingly Sparse Spectral Representations, Proceedings of Eurospeech, Budapest.


Syllable Frequency - Spontaneous English

The distribution of syllable frequency in spontaneous speech differs markedly from that in dictionaries


Word Frequency in Spontaneous English

Word frequency as a function of word rank approximates a 1/f distribution, particularly beyond rank 10

Word frequency is logarithmically related to rank order in the corpus (i.e., the 10th most common word occurs ca. 10 times more frequently than the 100th most common word, etc.)

Computed from the Switchboard corpus (American English telephone dialogues)
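The rank-frequency relation described on this slide can be checked on any tokenized transcript. The sketch below is illustrative only: it does not use Switchboard data, the function names (rank_frequency, loglog_slope) are hypothetical, and the "ideal" counts are synthetic values that follow an exact 1/rank law. For such a law, the least-squares slope of log frequency against log rank is -1, and the 10th-ranked word is 10 times as frequent as the 100th.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs, min_rank=1):
    """Least-squares slope of log10(frequency) against log10(rank).

    Ranks below min_rank are ignored, mirroring the slide's remark that
    the fit is best beyond roughly rank 10.
    """
    points = [
        (math.log10(rank), math.log10(freq))
        for rank, freq in enumerate(freqs, start=1)
        if rank >= min_rank
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic counts following an exact 1/rank law (an assumption made for
# illustration, not corpus data).
ideal = [1000.0 / rank for rank in range(1, 201)]
slope = loglog_slope(ideal, min_rank=10)

# Ratio of the 10th-ranked to the 100th-ranked frequency.
ratio = ideal[9] / ideal[99]
```

With real transcripts, rank_frequency(text.split()) would supply the counts; a slope near -1 beyond rank 10 would be consistent with the pattern the slide reports.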


The Intricate Web of Research