What are the Essential Cues for Understanding Spoken Language?

Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/~steveng
steveng@icsi.berkeley.edu
No Scientist is an Island … IMPORTANT COLLEAGUES

ACOUSTIC BASIS OF SPEECH INTELLIGIBILITY
Takayuki Arai, Joy Hollenback, Rosaria Silipo

AUDITORY-VISUAL INTEGRATION FOR SPEECH PROCESSING
Ken Grant

AUTOMATIC SPEECH RECOGNITION AND FEATURE CLASSIFICATION
Shawn Chang, Lokendra Shastri, Mirjam Wester

STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION
Eric Fosler, Leah Hitchcock, Joy Hollenback
Germane Publications
STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING
Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation, Speech Communication, 29, 159-176.
Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.
Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proceedings of the International Conference on Spoken Language Processing (ICSLP), Philadelphia, pp. S24-27.
AUTOMATIC PHONETIC TRANSCRIPTION AND ACOUSTIC FEATURE CLASSIFICATION
Chang, S., Greenberg, S. and Wester, M. (2001) An elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proceedings of the International Conference on Spoken Language Processing, Beijing.
Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable segmentation using temporal flow model neural networks. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Wester, M., Greenberg, S. and Chang, S. (2001) A Dutch treatment of an elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Germane Publications
PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
Greenberg, S. and Arai, T. (2001) The relation between speech intelligibility and the complex modulation spectrum. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. Proceedings of Eurospeech, Budapest.

AUDITORY-VISUAL SPEECH PROCESSING
Grant, K. and Greenberg, S. (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. Submitted to the ISCA Workshop on Audio-Visual Speech Processing (AVSP-2001).
PROSODIC STRESS ACCENT – AUTOMATIC CLASSIFICATION AND CHARACTERIZATION
Hitchcock, L. and Greenberg, S. (2001) Vowel height is intimately associated with stress accent in spontaneous American English discourse. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Silipo, R. and Greenberg, S. (1999) Automatic transcription of prosodic stress for spontaneous English discourse. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Silipo, R. and Greenberg, S. (2000) Prosodic stress revisited: Reassessing the role of fundamental frequency. Proceedings of the NIST Speech Transcription Workshop, College Park, MD.
Silipo, R. and Greenberg, S. (2000) Automatic detection of prosodic stress in American English discourse. Technical Report 2000-1, International Computer Science Institute, Berkeley, CA.
Language - The Traditional Perspective

The “classical” view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization
The Serial Frame Perspective on Speech

• Traditional models of speech recognition assume that the identity of a phonetic segment depends on the detailed spectral profile of the acoustic signal for a given (usually 25-ms) frame of speech
Language - A Syllable-Centric Perspective

A more empirical perspective on spoken language focuses on the syllable as the interface between “sound” and “meaning”

Within this framework the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic
• Segmentation is crucial for understanding spoken language
– At the level of the phrase
– the word
– the syllable
– the phonetic segment
• But … this linguistic segmentation is inherently “fuzzy”
• As is the spectral information associated with each linguistic tier
• The low-frequency (3-25 Hz) modulation spectrum is a crucial acoustic (and possibly visual) parameter associated with intelligibility
– It provides segmentation information that unites the phonetic segment with the syllable (and possibly the word and beyond)
• Many properties of spontaneous spoken language differ from those of laboratory and citation speech
– There are systematic patterns in “real” speech that potentially reveal underlying principles of linguistic organization
Take Home Messages
The Central Importance of the Modulation Spectrum and the Syllable for Understanding Spoken Language
Effects of Reverberation on the Speech Signal

Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions
Effects of Reverberation on the Speech Signal

Reflections from walls and other surfaces routinely modify the temporal and modulation-spectral properties of the speech signal
The modulation spectrum’s peak is attenuated and shifted down to ca. 2 Hz

[based on an illustration by Hynek Hermansky]
The Modulation Spectrum Reflects Syllables

The peak in the distribution of syllable duration is close to the mean of ca. 200 ms
The syllable-duration distribution closely matches the shape of the modulation spectrum, suggesting that the modulation spectrum reflects syllables
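A minimal sketch of how such a low-frequency modulation spectrum can be computed (envelope extraction followed by Fourier analysis; the filter settings are illustrative choices, not the exact analysis parameters of these studies):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def modulation_spectrum(x, fs, fmax=25.0):
    """Low-frequency modulation spectrum of a speech waveform.

    1. Extract the amplitude envelope (magnitude of the analytic signal).
    2. Low-pass filter and downsample the envelope.
    3. Fourier-analyze the envelope and keep 0-fmax Hz.
    """
    env = np.abs(hilbert(x))                    # amplitude envelope
    b, a = butter(4, 28.0, btype='low', fs=fs)  # keep only slow fluctuations
    env = filtfilt(b, a, env)
    step = int(fs // 100)                       # downsample to ~100 Hz
    env = env[::step]
    env = env - env.mean()                      # remove DC before the FFT
    spec = np.abs(np.fft.rfft(env * np.hanning(env.size)))
    freqs = np.fft.rfftfreq(env.size, d=step / fs)
    keep = freqs <= fmax
    return freqs[keep], spec[keep]

# For clean speech this spectrum typically peaks near 4-5 Hz -- roughly
# 1 / (200 ms), the modal syllable duration cited above.
```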
Spectral Asynchrony - Method

The output of quarter-octave frequency bands was quasi-randomly time-shifted relative to a common reference. The maximum shift interval ranged between 40 and 240 ms (in 20-ms steps). The mean shift interval is half of the maximum interval. Adjacent channels were separated by a minimum of one-quarter of the maximum shift range.

Stimuli – 40 TIMIT sentences, e.g. “She washed his dark suit in greasy dish water all year”
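A sketch of this desynchronization procedure (the band edges, shift quantization and adjacent-channel constraint below approximate the description above; this is not the original stimulus-generation code):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def desynchronize(x, fs, max_shift_ms=160, rng=None):
    """Quasi-randomly time-shift quarter-octave bands of a signal."""
    rng = np.random.default_rng() if rng is None else rng
    max_shift = int(fs * max_shift_ms / 1000)
    min_gap = max_shift // 4                 # adjacent-channel separation
    step = int(0.02 * fs)                    # shifts quantized to 20 ms
    # Quarter-octave band edges from 300 Hz up to just below Nyquist.
    edges = [300.0]
    while edges[-1] * 2 ** 0.25 < fs / 2 * 0.95:
        edges.append(edges[-1] * 2 ** 0.25)
    out = np.zeros_like(x)
    prev = None
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo, hi], btype='band', fs=fs)
        band = filtfilt(b, a, x)
        # Re-draw the shift until the adjacent-channel constraint is met.
        while True:
            shift = int(rng.integers(0, max_shift // step + 1)) * step
            if prev is None or abs(shift - prev) >= min_gap:
                break
        prev = shift
        out[shift:] += band[:x.size - shift]  # delay this band by `shift`
    return out
```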
Spectral Asynchrony - Paradigm

The magnitude of energy in the 3-6 Hz region of the modulation spectrum is computed for each sub-band (of 4 or 7 channels) as a function of spectral asynchrony

The modulation-spectrum magnitude is relatively unaffected by asynchronies of 80 ms or less (open symbols), but is appreciably diminished for asynchronies of 160 ms or more

Is intelligibility correlated with the reduction in the 3-6 Hz modulation spectrum?
Intelligibility and Spectral Asynchrony

Speech intelligibility does appear to be roughly correlated with the energy in the modulation spectrum between 3 and 6 Hz
The correlation varies depending on the sub-band and the degree of spectral asynchrony
• Speech is capable of withstanding a high degree of temporal asynchrony across frequency channels
• This form of cross-spectral asynchrony is similar to the effects of many common forms of acoustic reverberation
• Speech intelligibility remains high (>75%) until the maximum asynchrony exceeds 140 ms
• The magnitude of the low-frequency (3-6 Hz) modulation spectrum is highly correlated with speech intelligibility
Spectral Asynchrony - Summary
A Flaw in the Spectral Asynchrony Study

Of the 448 possible combinations of four slits across the spectrum (one slit per sub-band), ca. 10% (i.e., 45) exhibit a coefficient of variation of the channel shifts of less than 10%. The apparent temporal tolerance of the auditory system may therefore be illusory if listeners can decode the speech signal using information from only a small number of channels distributed across the spectrum

[Figures: distribution of channel asynchrony; intelligibility of spectrally desynchronized speech]
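A quick way to sanity-check that combinatorial claim (a sketch; the per-sub-band channel counts and the shift assignment are made-up placeholders, chosen only so that the combinations multiply out to 448):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical channel shifts (ms), grouped into four sub-bands.
subbands = [rng.integers(0, 241, size=n) for n in (7, 4, 4, 4)]  # 7*4*4*4 = 448

low_cv = 0
for combo in product(*subbands):
    shifts = np.array(combo, dtype=float)
    cv = shifts.std() / shifts.mean() if shifts.mean() > 0 else np.inf
    low_cv += cv < 0.10                      # nearly synchronous combination

print(f"{low_cv} of {7 * 4 * 4 * 4} combinations have CV < 10%")
```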
Spectral Slit Paradigm

Can listeners decode spoken sentences using just four narrow (1/3-octave) channels (“slits”) distributed across the spectrum?
The edge of each slit was separated from its nearest neighbor by an octave
The modulation pattern for each slit differs from that of the others
The four-slit compound waveform looks very similar to the full-band signal
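A minimal sketch of constructing such a four-slit stimulus (the center frequencies below are illustrative guesses; the published studies specify the exact bands):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def slit_compound(x, fs, centers=(330, 850, 2130, 5400)):
    """Sum of four 1/3-octave 'slits'; everything between them is discarded."""
    out = np.zeros_like(x)
    for fc in centers:
        lo = fc * 2 ** (-1 / 6)              # 1/3 octave = +/- 1/6 octave
        hi = fc * 2 ** (1 / 6)
        b, a = butter(4, [lo, hi], btype='band', fs=fs)
        out += filtfilt(b, a, x)
    return out
```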
Word Intelligibility - Single Slits

The intelligibility associated with any single slit is only 2 to 9%
The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits
Word Intelligibility - Road Map

1. Intelligibility as a function of the number of slits (from one to four)
2. Intelligibility for different combinations of two-slit compounds – the two center slits yield the highest intelligibility
3. Intelligibility for different combinations of three-slit compounds – combinations with one or two center slits yield the highest intelligibility
4. Four slits yield nearly (but not quite) perfect intelligibility of ca. 90% – this maximum level of intelligibility makes it possible to deduce the specific contribution of each slit by itself and in combination with others
• A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language
• An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility
Spectral Slits - Summary
Modulation Spectrum Across Frequency
The modulation spectrum varies in magnitude across frequency
The shape of the modulation spectrum is similar for the three lowest slits, but the highest frequency slit differs from the rest in exhibiting a far greater amount of energy in the mid modulation frequencies
Word Intelligibility - Single Slits

The intelligibility associated with any single slit ranges between 2 and 9%, suggesting that the shape and magnitude of the modulation spectrum per se are NOT the controlling variables for intelligibility
• A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language
• An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility
• The magnitude component of the modulation spectrum does not appear to be the controlling variable for intelligibility
Spectral Slits - Summary
Modulation Spectrum Across Frequency

Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility
• Even small amounts of asynchrony (>25 ms) imposed on spectral slits can result in significant degradation of intelligibility
• Asynchrony greater than 50 ms has a profound impact on intelligibility
Spectral Slits - Summary
• A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language
• An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility
• The magnitude component of the modulation spectrum does not appear to be the controlling variable for intelligibility
Spectral Slits - Summary
• Small amounts of asynchrony (>25 ms) imposed on spectral slits can result in significant degradation of intelligibility
• Asynchrony greater than 50 ms has a profound impact on intelligibility
• Intelligibility progressively declines with greater amounts of asynchrony up to an asymptote of ca. 250 ms
• Beyond asynchronies of 250 ms intelligibility IMPROVES, but the amount of improvement depends on individual factors
• Such results are NOT inconsistent with the high intelligibility of desynchronized full-spectrum speech, but rather imply that the auditory system is capable of extracting phonetically important information from a relatively small proportion of spectral channels
• BOTH the amplitude and phase components of the modulation spectrum are extremely important for speech intelligibility
• The modulation phase is of particular importance for cross-spectral integration of phonetic information
Spectral Slits - Summary
Auditory-Visual Integration of Speech

• Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits)
Auditory-Visual Integration - Mean Intelligibility
9 Subjects
• When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals
• When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms
Variation across subjects
Video lagging often better than synchronous
Auditory-Visual Integration - by Individual Ss
• Sparse audio and speech-reading information, when presented alone, provide minimal intelligibility
• But can, when combined, provide good intelligibility
• When the audio signal leads the video, intelligibility falls off rapidly as a function of onset asynchrony
• When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms
• The dynamics of the video appear to be combined with the dynamics associated with the audio to provide good intelligibility
• The dynamics associated with the video signal are probably most closely associated with place of articulation information
• The implication is that place information has a long time constant of ca. 200 ms and appears linked to the syllable
Audio-Video Integration – Summary
• The consonant recognition results can be scored in terms of articulatory features correct
• When the accuracy of the features is scored relative to the accuracy of consonant recognition, an interesting pattern emerges
• Certain features (place and manner) appear to be highly correlated with consonant recognition performance
• While the voicing and rounding features are less highly correlated

Articulatory Feature Analysis
Correlation - AFs/Consonant Recognition
Consonant recognition is almost perfectly correlated with place of articulation performance
This correlation suggests that the place feature is based on cues distributed across the entire speech spectrum, in contrast to features such as voicing and rounding, which appear to be extracted from a narrower band of the spectrum
Manner is also highly correlated with consonant recognition, implying that this feature is extracted from a fairly broad portion of the spectrum
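One way to compute this kind of feature/consonant correlation (a sketch with made-up accuracy scores; the real analysis uses the per-condition recognition results of the study):

```python
import numpy as np

# Hypothetical accuracy per listening condition: overall consonant
# recognition and each articulatory feature, one value per condition.
consonant = np.array([0.35, 0.48, 0.62, 0.71, 0.83, 0.90])
features = {
    'place':    np.array([0.37, 0.50, 0.63, 0.72, 0.84, 0.91]),
    'manner':   np.array([0.55, 0.63, 0.74, 0.80, 0.88, 0.93]),
    'voicing':  np.array([0.78, 0.80, 0.83, 0.84, 0.86, 0.88]),
    'rounding': np.array([0.81, 0.82, 0.84, 0.83, 0.86, 0.87]),
}

for name, acc in features.items():
    r = np.corrcoef(consonant, acc)[0, 1]    # Pearson correlation
    print(f"{name:8s} r = {r:.3f}")

# Place tracks consonant recognition almost perfectly; voicing and
# rounding stay high even when consonant recognition is poor, so their
# correlation with it is weaker -- the pattern described above.
```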
Phonetic Transcription of Spontaneous English

• Telephone dialogues of 5-10 minutes duration - SWITCHBOARD
• Amount of material manually transcribed
– 4 hours labeled at the phone level and segmented at the syllabic level (this material was later phonetically segmented by automatic methods)
– 1 hour labeled and segmented at the phonetic-segment level
• Diversity of material transcribed
– Spans speech of both genders (ca. 50/50%) reflecting a wide range of American dialectal variation (6 regions + “army brat”), speaking rate and voice quality
• Transcribed by whom?
– 11 undergraduates and 1 graduate student, all enrolled at UC-Berkeley. Most of the corpus was transcribed by four individuals out of the twelve
– Supervised by Steven Greenberg and John Ohala
• Transcription system
– A variant of Arpabet, with phonetic diacritics such as: _gl, _cr, _fr, _n, _vl, _vd
• How long does transcription take? (Don’t ask!)
– 388 times real time for labeling and segmentation at the phonetic-segment level
– 150 times real time for labeling phonetic segments and segmenting syllables
• How was labeling and segmentation performed?
– Using a display of the signal waveform, spectrogram, word transcription and “forced alignments” (estimates of phones and boundaries) + audio (listening at multiple time scales - phone, word, utterance) on Sun workstations
• Data available at http://www.icsi.berkeley.edu/real/stp
Phonetic Transcription

[Screen shot: what a “typical” display of the speech material looks like to a transcriber]
How Many Pronunciations of “and”?

N   Pronunciation
82  ae n
63  eh n
45  ix n
35  ax n
34  en
30  n
20  ae n dcl d
17  ih n
17  q ae n
11  ae n d
7   q eh n
7   ae nx
6   ae ae n
6   ah n
5   eh nx
4   uh n
4   ix nx
4   q ae n dcl d
3   eh n d
3   q ae nx
3   eh
2   ae n dcl
2   ae
2   ax m
2   ax n d
2   ae eh n dcl d
2   eh n dcl d
2   ax nx
2   q ae ae n
2   q ix n
2   ix n dcl d
2   ih
2   eh eh n
2   q eh nx
2   ix d n
1   eh m
1   ax n dcl d
1   aw n
1   ae q
1   eh dcl
How Many Pronunciations of “and”?

N   Pronunciation
1   ah nx
1   ae n t
1   eh d
1   ah n dcl d
1   ey ih n dcl
1   ae ix n
1   ae nx ax
1   ax ng
1   ay n
1   ih ah n d
1   ae hh
1   ih ng
1   ix
1   ae n d dcl
1   ix dcl d
1   ae eh n
1   hh n
1   ix n t
1   ae ax n dcl d
1   iy eh n
1   m
1   ae ae n d
1   nx
1   q ae ae n
1   q ae ae n dcl d
1   q ae eh n dcl d
1   q ae ih n
1   aa n
1   q ae n d
1   ? nx
1   q ae n q
1   eh n m
1   q eh en dcl
1   eh ng
1   q eh n q
1   em
1   q eh ow m
1   q ih n
1   q ix en
1   er
How Many Different Pronunciations?

Rank  Word   N    #Pron  MCP%  Most Common Pronunciation
1     I      649  53     53    ay
2     and    521  87     16    ae n
3     the    475  76     27    dh ax
4     you    406  68     20    y ix
5     that   328  117    11    dh ae
6     a      319  28     64    ax
7     to     288  66     14    tcl t uw
8     know   249  34     56    n ow
9     of     242  44     21    ax v
10    it     240  49     22    ih
11    yeah   203  48     43    y ae
12    in     178  22     45    ih n
13    they   152  28     60    dh ey
14    do     131  30     54    dcl d uw
15    so     130  14     74    s ow
16    but    123  45     12    bcl b ah tcl t
17    is     120  24     50    ih z
18    like   119  19     46    l ay kcl k
19    have   116  22     54    hh ae v
20    was    111  24     23    w ah z

The 20 most frequent words account for 35% of the tokens
How Many Different Pronunciations?

Rank  Word   N    #Pron  MCP%  Most Common Pronunciation
21    we     108  13     83    w iy
22    it's   101  14     20    ih tcl s
23    just   101  34     17    jh ix s
24    on     98   18     49    aa n
25    or     94   23     36    er
26    not    92   24     24    m aa q
27    think  92   23     32    th ih ng kcl k
28    for    87   19     46    f er
29    well   84   49     23    w eh l
30    what   82   40     14    w ah dx
31    about  77   46     12    ax bcl b aw
32    all    74   27     24    ao l
33    that's 74   19     16    dh eh s
34    oh     74   17     61    ow
35    really 71   25     45    r ih l iy
36    one    69   8      78    w ah n
37    are    68   19     42    er
38    I'm    67   9      26    q aa m
39    right  61   21     28    r ay
40    uh     60   16     41    ah

The 40 most frequent words account for 45% of the tokens
How Many Different Pronunciations?

Rank  Word    N    #Pron  MCP%  Most Common Pronunciation
41    them    60   18     23    ax m
42    at      59   36     8     ae dx
43    there   58   28     22    dh eh r
44    my      58   9      66    m ay
45    mean    56   10     58    m iy n
46    don't   56   21     14    dx ow
47    no      55   8      77    n ow
48    with    55   20     35    w ih th
49    if      55   18     41    ih f
50    when    54   18     31    w eh n
51    can     54   28     15    kcl k ae n
52    then    51   19     38    dh eh n
53    be      50   11     76    bcl b iy
54    as      49   16     18    ae z
55    out     47   19     22    ae dx
56    kind    47   17     21    kcl k ax nx
57    because 46   31     15    kcl k ax z
58    people  45   21     44    pcl p iy pcl l el
59    go      45   5      83    gcl g ow
60    got     45   32     15    gcl g aa

The 60 most frequent words account for 55% of the tokens
How Many Different Pronunciations?

Rank  Word   N    #Pron  MCP%  Most Common Pronunciation
61    this   44   11     47    dh ih s
62    some   43   4      48    s ah m
63    would  41   16     29    w ih dcl
64    things 41   15     52    th ih ng z
65    now    39   11     69    n aw
66    lot    39   9      47    l aa dx
67    had    39   19     24    hh ae dcl
68    how    39   11     53    hh aw
69    good   38   13     27    gcl g uh dcl
70    get    38   20     13    gcl g eh dx
71    see    37   6      80    s iy
72    from   36   10     28    f r ah m
73    he     36   7      39    iy
74    me     35   5      87    m iy
75    don't  35   21     14    dx ow
76    their  33   19     25    dh eh r
77    more   32   11     56    m ao r
78    it's   31   14     20    ih tcl s
79    that's 31   20     16    dh eh s
80    too    31   6      60    tcl t uw

The 80 most frequent words account for 62% of the tokens
How Many Different Pronunciations?

Rank  Word   N    #Pron  MCP%  Most Common Pronunciation
81    okay   31   17     45    ow kcl k ey
82    very   30   11     36    v eh r iy
83    up     30   11     34    ah pcl p
84    been   30   11     51    bcl b ih n
85    guess  29   8      42    gcl g eh s
86    time   29   8      62    tcl t ay m
87    going  29   21     13    gcl g ow ih ng
88    into   28   20     14    ih n tcl t uw
89    those  27   12     42    dh ow z
90    here   27   11     25    hh iy er
91    did    27   13     23    dcl d ih dx
92    work   25   8      66    w er kcl k
93    other  25   14     26    ah dh er
94    an     25   12     28    ax n
95    I've   25   7      46    ay v
96    thing  24   9      52    th ih ng
97    even   24   7      40    iy v ix n
98    our    23   9      33    aa r
99    any    23   11     23    ix n iy
100   we're  23   8      25    w ey r

The 100 most frequent words account for 67% of the tokens
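These columns fall out of a simple tally over the transcribed word tokens. A sketch of the computation (the input format is a hypothetical simplification of the STP data files):

```python
from collections import Counter, defaultdict

# tokens: (orthographic word, phonetic transcription) pairs, e.g. from STP.
tokens = [("and", "ae n"), ("and", "eh n"), ("and", "ae n"), ("the", "dh ax")]

prons = defaultdict(Counter)
for word, pron in tokens:
    prons[word][pron] += 1

# Sort by token frequency, then report N, #Pron and the MCP share.
for word, counts in sorted(prons.items(), key=lambda kv: -sum(kv[1].values())):
    n = sum(counts.values())                  # N: token count
    npron = len(counts)                       # #Pron: distinct variants
    mcp, mcp_n = counts.most_common(1)[0]     # most common pronunciation
    print(f"{word:8s} N={n:4d} #Pron={npron:3d} MCP%={100 * mcp_n // n:3d} {mcp}")
```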
English Syllable Structure is (sort of) Like Japanese

Most syllables are simple in form (no consonant clusters)
• 87% of the pronunciations are simple syllabic forms
• 84% of the canonical corpus is composed of simple syllabic forms

C = Consonant, V = Vowel
Examples: CV – “go”, CVC – “cat”, VC – “of”, V – “a”
Corpus = “canonical” representation; Pronunciation = actual pronunciation

Coda consonants tend to “drop”
There are many “complex” syllable forms (consonant clusters), but all occur relatively infrequently

n = 103,054
Complex Syllables ARE Important (Though)

Thus, despite English’s reputation for complex syllabic forms, only ca. 15% of the syllable tokens are actually complex

C = Consonant, V = Vowel
Examples: CVCC – “fifth”, VCC – “ounce”, CCV – “stow”, CCVC – “stoop”, CCVCC – “stops”, CCCVCC – “strength”

Complex syllables tend to be part of noun phrases (nouns or adjectives)
Coda consonants tend to “drop”

n = 17,760
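Classifying a syllable as simple or complex reduces to collapsing its phones into a C/V template. A sketch (the vowel table is abbreviated; a real version needs the full Arpabet inventory, including closure symbols such as tcl):

```python
import re

VOWELS = {"iy", "ih", "eh", "ae", "aa", "ah", "ao", "uh", "uw", "ax", "ix",
          "er", "ay", "aw", "ey", "ow", "oy", "en", "em", "el"}  # abbreviated

def cv_pattern(phones):
    """Collapse a phone string into a C/V template, e.g. 'stoop' -> CCVC."""
    return "".join("V" if p in VOWELS else "C" for p in phones.split())

def is_complex(phones):
    """Complex = any consonant cluster (CC or longer) in onset or coda."""
    return bool(re.search(r"CC", cv_pattern(phones)))

assert cv_pattern("g ow") == "CV" and not is_complex("g ow")
assert cv_pattern("s t uw p") == "CCVC" and is_complex("s t uw p")
```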
Syllable-Centric Pronunciation Patterns

“Cat” [k ae t]: [k] = onset, [ae] = nucleus, [t] = coda

Onsets are pronounced canonically far more often than nuclei or codas
Codas tend to be pronounced canonically more frequently in formal speech than in spontaneous dialogues

[Figure: percent canonically pronounced by syllable position; STP (spontaneous speech) vs. TIMIT (read sentences); n = 120,814]
Complex Onsets are Highly Canonical

COMPLEX onsets contain TWO or MORE consonants
Complex onsets are pronounced more canonically than simple onsets, despite the greater potential for deviation from the standard pronunciation

[Figure: percent canonically pronounced (70-100%) by syllable onset type, simple (C) vs. complex (CC(C)); STP (spontaneous speech) vs. TIMIT (read sentences)]
Speaking Style Affects Syllable Codas

COMPLEX codas contain TWO or MORE consonants
Codas are much more likely to be realized canonically in formal than in spontaneous speech

[Figure: percent canonically pronounced by syllable coda type; STP (spontaneous phone dialogues) vs. TIMIT (read sentences)]
Onsets (but not Codas) Affect Nuclei

The presence of a syllable onset has a substantial impact on the realization of the nucleus

[Figure: percent canonically pronounced (50-70%) for all nuclei, nuclei with/without an onset, and nuclei with/without a coda; STP (spontaneous phone dialogues) vs. TIMIT (read sentences)]
Syllable-Centric Articulatory Feature Analysis

• Place of articulation deviates most in nucleus position
• Manner of articulation deviates most in onset and coda position
• Voicing deviates most in coda position

Phonetic deviation along a SINGLE feature

Place deviates very little from canonical form in the onset and coda – it is a STABLE articulatory feature (AF) in these positions
Place is VERY unstable in nucleus position
Articulatory PLACE Feature Analysis

• Place of articulation is a “dominant” feature in nucleus position only
• It drives the feature deviation in the nucleus for manner and rounding

Phonetic deviation across SEVERAL features

Place “carries” manner and rounding in the nucleus
Articulatory MANNER Feature Analysis

• Manner of articulation is a “dominant” feature in onset and coda position
• It drives the feature deviation in onsets and codas for place and voicing

Phonetic deviation across SEVERAL features

Manner is less stable in the coda than in the onset
Manner drives place and voicing deviations in the onset and coda
Articulatory VOICING Feature Analysis

• Voicing is a subordinate feature in all syllable positions
• Its deviation pattern is controlled by manner in onset and coda positions

Phonetic deviation across SEVERAL features

Voicing is unstable in coda position and is dominated by manner
What is (usually) Meant by Prosodic Stress?

• Prosody is supposed to pertain to extra-phonetic cues in the acoustic signal
• The pattern of variation over a sequence of SYLLABLES pertaining to syllabic DURATION, AMPLITUDE and PITCH (f0) variation over time (but the plot thickens, as we shall see)
OGI Stories - Pitch Doesn’t Cut the Mustard

• Although pitch range is the most important of the f0-related cues, it is not as good a predictor of stress as DURATION

[Figure: predictive power of duration, amplitude, pitch range and average pitch]
Total Energy is the Best Predictor of Stress

• Duration x Amplitude is superior to all other combination pairs of acoustic parameters. Pitch appears redundant with duration.

[Figure: predictive power of duration x amplitude, duration x pitch range, duration, duration x average pitch, pitch range x average pitch, average pitch x amplitude, and pitch range x amplitude]
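A sketch of this kind of predictor comparison (the per-syllable measurements are made-up placeholders constructed so that integrated energy wins; the study used syllables from the OGI Stories corpus with hand-labeled stress):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical per-syllable measurements with binary stress labels that
# mostly follow integrated energy (duration x amplitude), plus label noise.
duration = rng.lognormal(mean=-1.6, sigma=0.4, size=n)   # seconds
amplitude = rng.lognormal(mean=0.0, sigma=0.3, size=n)
energy = duration * amplitude
stress = (energy > np.median(energy)).astype(float)
flip = rng.random(n) < 0.15
stress[flip] = 1 - stress[flip]                          # transcriber noise

predictors = {
    "duration": duration,
    "amplitude": amplitude,
    "duration x amplitude": energy,                      # ~ total energy
}
for name, x in predictors.items():
    r = np.corrcoef(x, stress)[0, 1]                     # point-biserial r
    print(f"{name:22s} r = {r:.2f}")
```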
The Nitty Gritty (a.k.a. the Corpus Material)

• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
– Switchboard contains informal telephone dialogues
– 54 minutes of material that had previously been phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
– 45.5 minutes of “pure” speech (filled pauses and junctures filtered out), consisting of 9,991 words, 13,446 syllables and 33,370 phonetic segments
– All of this material had been hand-segmented at either the phonetic-segment or syllabic level by the transcribers
– The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified
Manual Transcription of Stress Accent

• 2 UC-Berkeley linguistics students each transcribed the full 45 minutes of material (i.e., there is 100% overlap between the two)
• Three levels of stress accent were marked for each syllabic nucleus
– Fully stressed (78% concordance between transcribers)
– Completely unstressed (85% interlabeler agreement)
– An intermediate level of accent, neither fully stressed nor completely unstressed (ca. 60% concordance)
– Hence, 95% concordance in terms of some level of stress
• The labels of the two transcribers were averaged
– In those instances where there was disagreement, the magnitude of disparity was almost always (ca. 90%) one step. Usually, disagreement signaled a genuine ambiguity in stress accent
• The illustrations in this presentation are based solely on those data on which both transcribers concurred (i.e., fully stressed or completely unstressed)
A Brief Primer on Vocalic Acoustics

• Vowel quality is generally thought to be a function primarily of two articulatory properties, both related to the motion of the tongue
– The front-back plane is most closely associated with the second formant frequency (or, more precisely, F2 - F1) and the volume of the front-cavity resonance
– The height parameter is closely linked to the frequency of F1
• In the classic vowel “triangle,” segments are positioned in terms of the tongue positions associated with their production, as follows:
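As a rough numeric illustration of those two correlations (approximate average formant values of the Peterson-and-Barney sort; exact numbers vary by speaker and study, and the thresholds are arbitrary):

```python
# Approximate average formant frequencies (Hz) for American English vowels.
vowels = {
    "iy": (270, 2290),   # high front
    "ae": (660, 1720),   # low front
    "aa": (730, 1090),   # low back
    "uw": (300,  870),   # high back
}

for v, (f1, f2) in vowels.items():
    height = "high" if f1 < 450 else "low"           # height varies inversely with F1
    backness = "front" if f2 - f1 > 900 else "back"  # front-back tracks F2 - F1
    print(f"[{v}] F1={f1:4d} F2={f2:4d} -> {height} {backness}")
```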
Durational Differences - Stressed/Unstressed

• There is a large dynamic range in duration between stressed and unstressed nuclei
• Diphthongs and tense, low monophthongs tend to have a larger range than the lax monophthongs
Spatial Patterning of Duration and Amplitude

• Let’s return to the vowel triangle and see if it can shed light on certain patterns in the vocalic data
• The duration, amplitude (and their product, integrated energy) will be plotted on a 2-D grid, where the x-axis will always be hypothetical front-back tongue position (and hence remains constant throughout the plots to follow)
• The y-axis will serve as the dependent measure, expressed sometimes as duration, sometimes as amplitude, and sometimes as their product
• The vowel system of English (and perhaps of other languages as well) needs to be re-thought in light of the intimate relationship between vocalic identity, nucleus duration and stress accent
• Stressed syllables tend to have significantly longer nuclei than their unstressed counterparts, consistent with the findings reported by Silipo and Greenberg in previous years’ meetings regarding the OGI Stories corpus (telephone monologues)
• Certain vocalic classes exhibit a far greater dynamic range in duration than others
– Diphthongs tend to be longer than monophthongs, BUT …
– The low monophthongs ([ae], [aa], [ay], [aw], [ao]) exhibit patterns of duration and dynamic range under stress (accent) similar to diphthongs
• The statistical patterns are consistent with the hypothesis that duration serves under many conditions as either a primary or secondary cue for vowel height (normally associated with the frequency of the first formant)
Take Home Messages
• Moreover, the stress-accent system in spontaneous (American) English appears to be closely associated with vocalic identity
• Low vowels are far more likely to be fully stressed than high vowels (with the mid vowels exhibiting an intermediate probability of being stressed)
• Thus, the identity of a vowel cannot be considered independently of stress accent
• The two parameters are likely to be flip sides of the same Koine
• Although English is not generally considered to be a vowel-quantity language (as is Finnish), given the close relationship between stress accent and duration, and between duration and vowel quality, there is some sense in which English (and perhaps other stress-accent languages) manifests certain properties of a “quantity” system
Take Home Messages
Manner Feature Classification/Segmentation

• Automatic methods (neural networks) can accurately label MANNER of articulation features for spontaneous material (Switchboard corpus)
• Implication – MANNER information may be relatively co-terminous with phonetic segments and evade “co-articulation” effects

Label Accuracy per Frame

• Central frames are labeled more accurately than those close to the segmental boundaries
• Implication – some frames are created more equal than others

OGI Numbers Corpus; frame step interval = 10 ms
MANNER Classification – Elitist Approach

• “Confident” (usually central) frames are classified more accurately

NTIMIT (telephone) Corpus
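A minimal sketch of this “elitist” selection rule (the confidence threshold and the toy posteriors are placeholders; the paper describes the actual network and criterion):

```python
import numpy as np

def elitist_manner_labels(posteriors, threshold=0.7):
    """Keep only frames where the manner classifier is confident.

    posteriors: (n_frames, n_manner_classes) network outputs per 10-ms frame.
    Returns (frame_indices, labels) for the confident -- typically
    segment-central -- frames; low-confidence boundary frames are dropped.
    """
    best = posteriors.max(axis=1)
    keep = np.where(best >= threshold)[0]
    return keep, posteriors[keep].argmax(axis=1)

# Toy example: 5 frames x 3 classes (e.g. vocalic / fricative / stop).
post = np.array([[0.5, 0.3, 0.2],     # boundary frame: dropped
                 [0.9, 0.05, 0.05],   # confident central frame: kept
                 [0.8, 0.1, 0.1],
                 [0.4, 0.4, 0.2],     # ambiguous frame: dropped
                 [0.1, 0.85, 0.05]])
print(elitist_manner_labels(post))    # (array([1, 2, 4]), array([0, 0, 1]))
```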
Manner-Specific Place Classification

• Knowing the “manner” improves “place” classification for consonants
• Knowing the “manner” improves “place” classification for vowels as well

NTIMIT (telephone) Corpus

Manner-Specific Place Classification – Dutch

• Knowing the “manner” improves “place” classification for consonants and vowels in DUTCH as well as in English
• Knowing the “manner” improves “place” classification for the “approximant” segments in DUTCH
• Approximants are classified as “vocalic” rather than as “consonantal”

VIOS (telephone) Corpus
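One way to realize manner-specific place classification (a sketch using scikit-learn in place of the original neural networks; the feature extraction and data are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_manner_specific_place(X, manner, place):
    """Train one place classifier per manner class.

    X: (n_frames, n_dims) acoustic features; manner, place: integer labels.
    At test time the manner decision routes each frame to the place
    classifier trained only on that manner -- the conditioning that
    improved place accuracy in the experiments above.
    """
    models = {}
    for m in np.unique(manner):
        idx = manner == m
        models[m] = LogisticRegression(max_iter=1000).fit(X[idx], place[idx])
    return models

def classify_place(models, x, m):
    """Classify the place of one frame x given its manner decision m."""
    return models[m].predict(x.reshape(1, -1))[0]
```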
• Automatic recognition systems can be used to test specific hypotheses about the acoustic properties of articulatory features, segments and syllables
• Manner information appears to be well classified and segmented
– Suggests that manner features may be the key articulatory-feature dimension for segmentation within the syllable
• Place information is not as well classified as manner information
– The improvement of place with manner-specific classification suggests that place recognition depends to a certain degree on manner classification
• Voicing information appears to be relatively robust under many conditions and is therefore likely to emanate from a variety of spectral regions
– The time constant for voicing information is also likely to be less than or coterminous with the segment
Take Home Messages
Sample Transcription from the ALPS System

• The ALPS (automatic labeling of phonetic segments) system performs very similarly to manual transcription in terms of both labels and segmentation
– 11 ms average concordance in segmentation
– 83% concordance with respect to phonetic labels

OGI Numbers (telephone) corpus
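Those two concordance figures can be computed in a few lines. A sketch, under the assumption that the automatic and manual segmentations have already been aligned segment-to-segment (boundary times in seconds):

```python
import numpy as np

def transcription_concordance(auto, manual):
    """Compare an automatic transcription against a manual reference.

    auto, manual: aligned lists of (label, boundary_time) pairs.
    Returns (mean absolute boundary offset in ms, % identical labels).
    """
    offsets = [abs(a[1] - m[1]) * 1000 for a, m in zip(auto, manual)]
    agree = [a[0] == m[0] for a, m in zip(auto, manual)]
    return np.mean(offsets), 100 * np.mean(agree)

auto = [("n", 0.112), ("ay", 0.188), ("n", 0.341)]
manual = [("n", 0.105), ("ay", 0.196), ("en", 0.350)]
print(transcription_concordance(auto, manual))  # (8.0 ms, 66.7%)
```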
ALPS Output Can Be Superior to Alignments

[Figure: speech waveform, spectrogram, word transcript, forced-alignment segments, ALPS segmentation and ALPS manner information]

Switchboard (telephone) Corpus
• The controlling parameters for understanding spoken language appear to be based on low-frequency modulation patterns in the acoustic signal associated with the syllable
• Both the magnitude and phase of the modulation patterns are important
• Encoding information in terms of low-frequency modulations provides a certain degree of robustness to the speech signal that enables it to be decoded under a wide range of acoustic and speaking conditions
• Manner information appears to be the key to understanding segmentation internal to the syllable
• Place features appear to be dominant and most stable at syllable onset and coda
• Manner is the stable feature dimension for the syllabic nucleus
• Voicing and rounding appear to be auxiliary features linked to manner and place feature information
• “Real” speech can be useful in delineating underlying patterns of linguistic organization
Grand Summary and Conclusions