Temporal Properties of
Spoken Language
Steven Greenberg
The Speech Institute
http://www.icsi.berkeley.edu/~steveng
[email protected]
Acknowledgements and Thanks
Research Funding: U.S. Department of Defense, U.S. National Science Foundation
Research Collaborators: Hannah Carvey, Shawn Chang, Ken Grant, Leah Hitchcock, Joy Hollenback, Rosaria Silipo
For Further Information
Consult the web site:
www.icsi.berkeley.edu/~steveng
This presentation examines WHY the temporal properties of speech are the way they are
Some General Questions
Specifically, we ask ….
WHY is the average duration of a syllable (in spontaneous speech) ca. 200 ms?
WHY are some syllables significantly longer than others?
WHY are some phonetic segments (usually vowels) longer than others (typically consonants)?
And ….
WHAT can the temporal properties of speech tell us about spoken language?
The temporal properties of spoken language reflect INFORMATION contained in the speech signal
Conclusions
PROSODY is the most sensitive LINGUISTIC reflection of INFORMATION
(PROSODY refers to the RHYTHM and TEMPO of syllables in an utterance)
Much of the temporal variation in spoken language reflects prosodic factors
Hence, prosody is the key to understanding much of the temporal (and phonetic) variation observed in spoken language
Prosody also shields the information contained in the speech signal against the deleterious forces of nature (a.k.a. background noise and reverberation)
Therefore, to understand spoken language, it is also necessary to understand how prosody is encapsulated in the speech signal (acoustic and otherwise)
This is the focus for today’s presentation
But, before considering prosody per se, let’s first examine an important acoustic property of the speech signal ….
SLOW modulation of acoustic energy, reflecting movement of the speech articulators, is crucial for understanding spoken language
The fine spectral detail is FAR less important (80% of the spectrum can be discarded without much impact on intelligibility)
WHY should this be so? WHY? WHY? WHY? WHY?
Importance of Slow Modulations
90% Intelligibility
Quantifying Modulation Patterns in Speech
The modulation spectrum provides a convenient quantitative method for computing the amount of modulation in the speech signal
The technique is illustrated for a paradigmatic, simple signal
The computation is performed for each spectral channel separately
The low-frequency modulation patterns can thus be quantified using the modulation spectrum, which looks like this for spontaneous speech ….
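As a rough illustration of the computation described above, the following Python sketch estimates the modulation spectrum of a simple amplitude-modulated tone. It is a minimal sketch under stated assumptions, not the procedure used in the presentation: the envelope is taken from the analytic signal, only a single channel is analyzed (for speech, the signal would first be split into spectral channels and each processed separately), and the function names, the 4-Hz modulation rate, and the 1-kHz carrier are all hypothetical choices.

```python
import numpy as np

def envelope(signal):
    """Amplitude envelope from the analytic signal (FFT-based Hilbert transform)."""
    n = len(signal)
    spectrum = np.fft.fft(signal)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep DC once
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep Nyquist once
        weights[1:n // 2] = 2.0           # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * weights))

def modulation_spectrum(signal, sample_rate, max_mod_hz=32.0):
    """Magnitude spectrum of the mean-removed envelope, up to max_mod_hz."""
    env = envelope(signal)
    env = env - env.mean()                # remove DC so slow modulations stand out
    mags = np.abs(np.fft.rfft(env)) / len(env)
    freqs = np.fft.rfftfreq(len(env), d=1.0 / sample_rate)
    keep = freqs <= max_mod_hz
    return freqs[keep], mags[keep]

# Assumed test signal: a 1-kHz tone whose amplitude is modulated at 4 Hz.
sample_rate = 8000
t = np.arange(0, 2.0, 1.0 / sample_rate)
carrier_hz, mod_hz = 1000.0, 4.0
tone = (1.0 + 0.8 * np.cos(2 * np.pi * mod_hz * t)) * np.sin(2 * np.pi * carrier_hz * t)

freqs, mags = modulation_spectrum(tone, sample_rate)
peak_hz = freqs[np.argmax(mags)]
print(f"Modulation peak at {peak_hz:.1f} Hz")
```

For this synthetic tone the peak falls at the imposed modulation rate; for speech, the same analysis applied per channel yields the broad low-frequency distribution discussed next.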
Modulation Spectrum of Spoken Language
The modulation spectrum has a broad peak of energy between 3 and 10 Hz
Linguistically, the modulation spectrum reflects SYLLABLES
The distribution of syllable duration is similar to the modulation spectrum
[Figure: distribution of syllable duration alongside the modulation spectrum; 15 minutes of spontaneous material from a single Japanese speaker]
Questions:
Why do syllables vary so much in duration?
And why is the peak of the modulation spectrum so broad?
Variation in Syllable Duration
Why do syllables vary so much in duration?
In large part, it is because syllables carry differential amounts of information
Longer syllables tend to contain more information than shorter syllables
Below, the vowels in “ride” and “bikes” are longer than in other words (as well as more intense)
Variation in Syllable Duration
Duration is one of the most important correlates of syllable accent (prosody)
We know this because of studies SIMULATING syllable prominence (accent) labeling by highly trained linguistic transcribers
In one study, duration was shown to be the single most important acoustic property related to syllable prominence (in American English)
Duration Correlates with Syllable Stress
[Figure: duration, amplitude, and pitch as correlates of syllable prominence; Silipo and Greenberg (1999)]
Word Duration and Syllabic Accent Level
Words that contain an accented syllable tend to be considerably longer than unaccented words
What are the implications of this insight?
[Figure: word duration distributions for heavily accented, lightly accented, unaccented, and all words]
Word Duration and Stress Accent Level
The broad distribution of word duration (and, in turn, syllable duration) largely reflects the coexistence of accented and unaccented words (and syllables), often within the same utterance
This interleaving of long and short syllables reflects the DIFFERENTIAL DISTRIBUTION of ENTROPY across an utterance
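One way to make "differential distribution of entropy" concrete is to measure the information carried by an item as its Shannon surprisal, the negative base-2 logarithm of its probability. The slides do not say which measure was used, so the sketch below is only an assumption, and the word probabilities are invented for illustration and are not data from the presentation.

```python
import math

def surprisal_bits(probability):
    """Shannon surprisal in bits: rarer items carry more information."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

# Hypothetical probabilities, invented purely for illustration.
toy_probabilities = {"the": 0.05, "ride": 0.0005}

for word, p in toy_probabilities.items():
    print(f"{word!r}: {surprisal_bits(p):.1f} bits")
```

Under this measure, a less predictable content word carries several more bits than a frequent function word, which is consistent with the claim that longer, accented syllables tend to be the more informative ones.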
Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum, as it reflects syllable duration, encapsulates the heterogeneity in syllabic and lexical duration associated with variation in syllable prominence
Does this insight have implications for spoken language?
[Figure: modulation spectrum of 40 TIMIT sentences (computed across a 6-kHz bandwidth) for unaccented syllables, heavily accented syllables, and all accents (convergence)]
Modulation Spectrum Breadth & Intelligibility
Long ago, Houtgast and Steeneken demonstrated that the modulation spectrum is highly predictive of speech intelligibility
In highly reverberant environments, the modulation spectrum’s peak is severely attenuated, shifted down to ca. 2 Hz, and the signal becomes largely unintelligible
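The downward shift of the peak under reverberation can be sketched with the textbook modulation transfer function for an idealized, exponentially decaying reverberant field, a formula commonly associated with the STI literature. The slides do not state it, so the formula, the function name `reverberation_mtf`, and the reverberation times below are assumptions made for illustration.

```python
import math

def reverberation_mtf(mod_hz, rt60_s):
    """Fraction of modulation depth preserved at mod_hz for reverberation time rt60_s.

    Assumes an ideal exponential decay: m(F) = 1 / sqrt(1 + (2*pi*F*T / 13.8)**2).
    """
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * mod_hz * rt60_s / 13.8) ** 2)

# Hypothetical reverberation times (seconds) and modulation frequencies (Hz).
for rt60_s in (0.5, 2.0, 4.0):
    row = ", ".join(
        f"{f:.0f} Hz: {reverberation_mtf(f, rt60_s):.2f}" for f in (2.0, 4.0, 8.0)
    )
    print(f"RT60 = {rt60_s:.1f} s -> {row}")
```

Because the function acts as a low-pass filter on the modulations, faster modulations are attenuated more than slower ones as the reverberation time grows, which pushes the remaining peak toward lower modulation frequencies.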
What does this imply with respect to prosody?
[based on an illustration by Hynek Hermansky]
Modulation Spectrum
As the modulation spectrum is progressively low-pass filtered, intelligibility declines
This suggests that intelligibility requires both long and short (i.e., accented and unaccented) syllables
(However, some syllables - the accented ones - are “more equal” than others)
Intelligibility and Modulation Frequency
[Figure: Silipo et al. (1999); curves for unaccented syllables, heavily accented syllables, and all accents (convergence)]
Syllable Duration and Accent
Canonical Syllable Forms
Heavily accented syllables are generally 60-100% longer than their unaccented counterparts
The disparity in duration is most pronounced for syllable forms with one or no consonants (i.e., V, VC, CV)
This pattern implies that accent has its greatest impact on vocalic duration
V = Vowel, C = Consonant
Vowel Duration - Accent Level/Syllable Form
Vowels in accented syllables are at least twice as long as their unaccented counterparts
This pattern implies that the syllabic nucleus absorbs much of accent’s impact (at least as far as duration is concerned)
Syllable Onset/Coda Duration and Accent
ONSETS of accented syllables are generally 50-60% longer than their unaccented counterparts and are somewhat sensitive to stress accent
By contrast, there is little difference in duration between accented and unaccented CODA constituents
CODAS are relatively insensitive to prosody and carry less information than onsets
Onsets Codas
Sensitivity of Syllable Constituents to Accent
Thus, the duration of syllabic nuclei (usually vowels) is most sensitive to syllable accent
Syllable CODAS are LEAST sensitive to prosodic accent
This differential sensitivity to prosodic accent reflects some fundamental principles of information encoding within the syllable, as well as principles of auditory function (e.g., onsets are more important than offsets for evoking neural discharge – hence much of the neural entropy is embedded in the onset)
Syllable Prominence (Accent) Illustrated
[Figure: full-spectrum perspective of the word “seven” ([s] [eh] [vx] [en]) from OGI Numbers95, showing mean duration for the accented and unaccented syllables, with onset, nucleus, ambisyllabic, and “pure” juncture constituents marked]
Robustness Based on Temporal Properties
Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions
Yet, the intelligibility of speech is remarkably stable
This implies that intelligibility is NOT based on the spectro-temporal DETAILS but rather on some more basic, TEMPORAL parameter(s)
Temporal Basis of Intelligibility
90% Intelligibility
Four narrow channels, presented synchronously, yield ca. 90% intelligibility
Intelligibility for two channels ranges between 10 and 60%
60% Intelligibility
Desynchronizing Slits Affects Intelligibility
When the center slits lead or lag the lateral slits by more than 25 ms, intelligibility suffers significantly
Intelligibility plummets to ca. 55% for leads/lags of 50 ms
And declines to 40% for leads/lags of 75 ms
Asynchrony greater than 50 ms results in intelligibility lower than baseline
A trough in performance occurs at ca. 200-250 ms asynchrony, roughly the interval associated with the syllable
What does this mean?
Perhaps, that there is a syllable-length time window of integration
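The desynchronization manipulation described above can be sketched as a time shift applied to the two center channels before the four slits are summed. This is a minimal sketch, not the stimulus-generation code from the study: the function names are hypothetical, the slits are assumed to be already band-pass filtered and stored as rows of an array, and the impulse signals are placeholders for real speech.

```python
import numpy as np

def shift_channel(channel, shift_samples):
    """Delay (positive) or advance (negative) a 1-D signal, zero-padding the gap."""
    shifted = np.zeros_like(channel)
    if shift_samples > 0:
        shifted[shift_samples:] = channel[:-shift_samples]
    elif shift_samples < 0:
        shifted[:shift_samples] = channel[-shift_samples:]
    else:
        shifted[:] = channel
    return shifted

def desynchronize_slits(slits, sample_rate, center_lag_ms, center_indices=(1, 2)):
    """Shift the center slits relative to the lateral ones, then sum all slits.

    slits: array of shape (n_slits, n_samples), one band-limited signal per row.
    center_lag_ms: positive values make the center slits lag; negative, lead.
    """
    lag = int(round(center_lag_ms * sample_rate / 1000.0))
    out = np.array(slits, dtype=float, copy=True)
    for i in center_indices:
        out[i] = shift_channel(out[i], lag)
    return out.sum(axis=0), out

# Placeholder input: four channels, each with a single impulse at sample 10.
sample_rate = 1000
slits = np.zeros((4, 100))
slits[:, 10] = 1.0

mixed, shifted = desynchronize_slits(slits, sample_rate, center_lag_ms=25)
print(np.flatnonzero(mixed))  # sample indices where energy remains after mixing
```

With a 25-ms lag at this sampling rate, the lateral slits keep their original timing while the center slits arrive 25 samples later, which is the kind of cross-channel asynchrony whose effect on intelligibility is reported above.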
Slit Asynchrony Affects Intelligibility
Importance of Visual Cues
[Figure: amplitude fluctuation in different spectral regions (RMS amplitude in dB for WB, F1, F2, and F3) and lip aperture variation (lip area in in²) over time (ms) for the sentence “Watch the log float in the wide river”]
Data courtesy of Ken Grant
Visual cues often supplement the acoustic signal, and are particularly important in adverse acoustic environments (i.e., noise & reverberation)
What is the basis of visual supplementation to understanding speech?
One possibility is the common modulatory properties of the visual and acoustic components of the speech signal
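One hedged way to quantify such shared modulation is to correlate a lip-area track with an acoustic amplitude envelope at a range of time lags. The sketch below uses synthetic 4-Hz signals in place of real measurements; the function name, the 100-Hz frame rate, and the 50-ms offset are assumptions for illustration and do not come from the data shown in the presentation.

```python
import numpy as np

def lagged_correlation(lip_area, audio_envelope, max_lag):
    """Pearson correlation at integer frame lags; a positive lag means the lips lead."""
    results = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag > 0:
            a, b = lip_area[:-lag], audio_envelope[lag:]
        elif lag < 0:
            a, b = lip_area[-lag:], audio_envelope[:lag]
        else:
            a, b = lip_area, audio_envelope
        results[lag] = float(np.corrcoef(a, b)[0, 1])
    return results

# Synthetic stand-ins: both tracks fluctuate at 4 Hz, with the audio 50 ms behind.
frame_rate = 100                                   # frames per second (assumed)
t = np.arange(0, 3.0, 1.0 / frame_rate)
lip_area = 1.0 + np.cos(2 * np.pi * 4.0 * t)
audio_envelope = 1.0 + np.cos(2 * np.pi * 4.0 * (t - 0.05))

correlations = lagged_correlation(lip_area, audio_envelope, max_lag=10)
best_lag = max(correlations, key=correlations.get)
print(f"Best lag: {best_lag} frames ({1000 * best_lag / frame_rate:.0f} ms)")
```

A high correlation at some lag would indicate that the two streams share slow modulation structure; with real recordings the envelope would typically be computed per spectral region before comparison.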
Combining Audio and Visual Cues
[Figure: experimental conditions. Baseline condition: synchronous A/V. Video leads audio by 40–400 ms. Audio leads video by 40–400 ms.]
Place of Articulation
Visual cues (a.k.a. speechreading) also provide important information about consonantal place of articulation and the nature of both prosodic and vocalic properties
One can desynchronize the audio and visual streams and measure the impact of the asynchrony on intelligibility
Focus on Audio-Leading-Video Conditions
When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals
These data are next compared with data from the audio-alone study to illustrate the similarity in the slope of the function
Comparison of A/V and Audio-Alone Data
The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition
Such similarity in the slopes associated with intelligibility for both experiments suggests that the underlying mechanisms may be similar
The intelligibility of the audio-alone signals is higher than the A/V signals due to slits 2+3 being highly intelligible by themselves
When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms
These data are rather strange, implying some form of “immunity” against intelligibility degradation when the video channel leads the audio
Focus on Video-Leading-Audio Conditions
The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from the audio-leading-video conditions
WHY? WHY? WHY?
Auditory-Visual Integration - the Full Monty
Time Constants of Audio-Visual Integration
The temporal limits of combining visual and acoustic information are SYLLABLE length, particularly when the video precedes the audio signal
This suggests that visual speech cues are syllabically organized
WHY are the temporal properties of speech the way they are?
Because ….
The brain requires such intervals to combine information across sensory modalities and to associate the sensory streams with meaning
Some General Answers
WHY is the average duration of a syllable (in spontaneous speech) ca. 200 ms?
The syllable’s duration reflects a basic sensori-motor integration time constant and can be considered to represent the sampling rate of consciousness
WHY are some syllables significantly longer than others?
The heterogeneity in duration reflects the unequal distribution of entropy across an utterance and is a basic requirement for decoding the speech signal
WHY are some phonetic segments (usually vowels) longer than others (typically consonants)?
Vowels reflect the influence of prosodic factors far more than consonants, and therefore convey more information concerning a syllable’s intrinsic entropy than their consonantal counterparts
WHAT can the temporal properties of speech tell us about spoken language in general?
They provide a general theoretical framework for understanding the organization of spoken language and how the brain decodes the speech signal
The temporal properties of spoken language reflect INFORMATION contained in the speech signal
PROSODY is the most sensitive LINGUISTIC reflection of INFORMATION
Much of the temporal variation in spoken language reflects prosodic factors
Hence, prosody is the key to understanding much of the temporal (and phonetic) variation observed in spoken language
Prosody shields the information contained in the speech signal against the deleterious forces of nature (a.k.a. background noise and reverberation)
Therefore, to understand spoken language, it is also necessary to understand how prosody is encapsulated in the speech signal
Conclusions and Summary
That’s All
Many Thanks for Your Time and Attention
Language - A Syllable-Centric Perspective
An empirically grounded perspective of spoken language focuses on the SYLLABLE and Syllabic ACCENT as the interface between “sound” and “meaning” (or at least lexical form)
[Figure: linguistic tiers for the word “seven”: modes of analysis (energy, time–frequency, prosodic accent), phonetic interpretation, manner segmentation (Fric, Voc, V, Nas, J), and word]
Syllable as Interface between Sound & Meaning
The syllable serves as a key organizational unit that binds the lower and higher tiers of linguistic organization
There is a systematic relationship between the syllable and the articulatory-acoustic features comprising phonetic constituents
Moreover, the syllable is the primary carrier of prosodic information and is linked to morphology and the lexicon as well
These slow modulation patterns are DIFFERENTIALLY distributed across the acoustic frequency spectrum
The modulation spectra are similar (in certain respects) across frequency
But vary in certain important ways ….
Modulation Spectra Across Frequency
Modulation Spectra
Modulation Spectrum Varies Across Frequency
In Houtgast and Steeneken’s original formulation of the STI, the modulation spectrum was assumed to be similar across the acoustic frequency axis
An analysis of spoken English (in this instance TIMIT sentences) suggests that their formulation was not quite accurate for the high frequency channels, as shown below
The highest channels have considerable energy between 10 and 30 Hz
Summary of the Presentation
Low-frequency modulation patterns reflect SYLLABLES, as well as their specific content and structure
Summary of the Presentation
Such temporal properties reflect a basic sensory-motor time constant of ca. 200 ms (the SAMPLING RATE of CONSCIOUSNESS)
Modulation Spectrum as Predictor of Intelligibility
In the 1970s, Houtgast and Steeneken demonstrated that the magnitude of the modulation spectrum could be used to predict speech intelligibility over a wide range of acoustic environments
In optimum listening conditions, the modulation spectrum has a peak between 4 and 5 Hz, as shown below
In highly reverberant environments, the modulation spectrum’s peak is attenuated and shifts down to ca. 2 Hz, and the speech becomes increasingly unintelligible
[based on an illustration by Hynek Hermansky]
Modulation Spectrum
In face-to-face interaction the visual component of the speech signal can be extremely important for understanding spoken language (particularly in noisy and/or reverberant conditions)
It is therefore of interest to ascertain the brain’s tolerance for asynchrony between the audio and visual components of the speech signal
This exercise can also provide potentially illuminating insights into the nature of the neural mechanisms underlying speech comprehension
Specifically, the contribution of speechreading cues can provide clues about what REALLY is IMPORTANT in the speech signal for INTELLIGIBILITY that is independent of the sensory modality involved
Audio-Visual Integration of Speech
In Conclusion ….
Language - A Syllable-Centric Perspective
A more empirically grounded perspective of spoken language focuses on the SYLLABLE as the interface between “sound”, “vision”, and “meaning”
Important linguistic information is embedded in the TEMPORAL DYNAMICS
of the speech signal (irrespective of the modality)
Germane Publications
Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
Grant, K. and Greenberg, S. (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. Proceedings of the ISCA Workshop on Audio-Visual Speech Processing (AVSP-2001), pp. 132-137.
Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations, 6th European Conference on Speech Communication and Technology (Eurospeech-99), pp. 2687-2690.
http://www.icsi.berkeley.edu/~steveng
Syllables rise and fall in energy over the course of their duration
Vocalic nuclei are highest in amplitude
Onset consonants gradually rise in energy arching towards the peak
Coda consonants decline in amplitude, usually more abruptly than onsets
The Energy Arc Illustrated
[Figure: spectro-temporal profile (STeP) and spectrogram + waveform of “seven”]
Spectrally sparse audio and speech-reading information provide minimal intelligibility when presented alone in the absence of the other modality
This same information can, when combined across modalities, provide good intelligibility (63% average accuracy)
When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony
When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms
For eight out of nine subjects, the highest intelligibility is associated with conditions in which the video signal leads the audio (often by 80-120 ms)
There are many potential interpretations of the data
The interpretation currently favored (by the speaker) posits a relatively long (200 ms) integration buffer for audio-visual integration when the brain is confronted exclusively (even for short intervals) with speech-reading information (as occurs when the video signal leads the audio)
The data further suggest that place-of-articulation cues evolve over syllabic intervals of ca. 200 ms in length and could therefore potentially apply to models of speech processing in general
Speechreading also appears to provide important prosodic information that is extremely useful for decoding the speech signal
Audio-Video Integration – Summary
The temporal properties of spoken language reflect INFORMATION contained in the speech signal
PROSODY is the most sensitive LINGUISTIC reflection of INFORMATION
Much of the temporal variation in spoken language reflects prosodic factors
Hence, prosody is the key to understanding much of the temporal (and phonetic) variation observed in spoken language
Prosody is what likely shields the information contained in the speech signal against the deleterious forces of nature (a.k.a. background noise and reverberation)
Therefore, to understand spoken language, it is also necessary to understand how prosody is encapsulated in the speech signal
This is the focus for today’s presentation
But, before considering prosody, let’s first examine an important acoustic property of the speech signal ….
Take Home Messages