Temporal Properties of Spoken Language Steven Greenberg The Speech Institute steveng...

Temporal Properties of Spoken Language

Steven Greenberg

The Speech Institute

http://www.icsi.berkeley.edu/~steveng

[email protected]

Acknowledgements and Thanks

Research Funding

U.S. Department of Defense

U.S. National Science Foundation

Research Collaborators

Hannah Carvey, Shawn Chang, Ken Grant, Leah Hitchcock, Joy Hollenback, Rosaria Silipo

For Further Information

Consult the web site:

www.icsi.berkeley.edu/~steveng

This presentation examines WHY the temporal properties of speech are the way they are

Some General Questions

Specifically, we ask ….

WHY is the average duration of a syllable (in spontaneous speech) ca. 200 ms?



WHY are some syllables significantly longer than others?



WHY are some phonetic segments (usually vowels) longer than others (typically consonants)?


And ….

WHAT can the temporal properties of speech tell us about spoken language?


The temporal properties of spoken language reflect INFORMATION contained in the speech signal

Conclusions

PROSODY is the most sensitive LINGUISTIC reflection of INFORMATION

(PROSODY refers to the RHYTHM and TEMPO of syllables in an utterance)

Conclusions

Much of the temporal variation in spoken language reflects prosodic factors

Conclusions

Hence, prosody is the key to understanding much of the temporal (and phonetic) variation observed in spoken language

Conclusions

Prosody also shields the information contained in the speech signal against the deleterious forces of nature (a.k.a. background noise and reverberation)

Conclusions

Therefore, to understand spoken language, it is also necessary to understand how prosody is encapsulated in the speech signal (acoustic and otherwise)

This is the focus for today’s presentation

But, before considering prosody per se, let’s first examine an important acoustic property of the speech signal ….

Conclusions

SLOW modulation of acoustic energy, reflecting movement of the speech articulators, is crucial for understanding spoken language

The fine spectral detail is FAR less important (80% of the spectrum can be discarded without much impact on intelligibility)

WHY should this be so?

Importance of Slow Modulations

90% Intelligibility

Quantifying Modulation Patterns in Speech

The modulation spectrum provides a convenient quantitative method for computing the amount of modulation in the speech signal

The technique is illustrated for a paradigmatic, simple signal

The computation is performed for each spectral channel separately

The low-frequency modulation patterns can thus be quantified using the modulation spectrum, which looks like this for spontaneous speech ….
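The per-channel computation described above can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the talk: the function names (`hilbert_envelope`, `modulation_spectrum`), the Hilbert-envelope extraction, the 32-Hz display limit, and the test signal (a 1-kHz carrier modulated at 4 Hz) are all assumptions. In practice the speech would first be split into spectral channels with a band-pass filter bank, and the function would be applied to each channel in turn.

```python
import numpy as np

def hilbert_envelope(x):
    """Amplitude envelope from the analytic signal (FFT-based Hilbert transform)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep DC once
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist bin once
        gain[1:n // 2] = 2.0           # double the positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * gain))

def modulation_spectrum(channel, fs, max_mod_hz=32.0):
    """Magnitude spectrum of one channel's envelope, limited to slow modulations."""
    env = hilbert_envelope(channel)
    env = env - env.mean()             # remove the DC component of the envelope
    mag = np.abs(np.fft.rfft(env)) / len(env)
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    keep = (freqs > 0) & (freqs <= max_mod_hz)
    return freqs[keep], mag[keep]

# Simple illustrative signal: a 1-kHz tone whose amplitude fluctuates at 4 Hz.
fs = 8000
t = np.arange(int(2.0 * fs)) / fs
signal = (1.0 + 0.8 * np.cos(2 * np.pi * 4.0 * t)) * np.sin(2 * np.pi * 1000.0 * t)

freqs, mag = modulation_spectrum(signal, fs)
peak_hz = freqs[np.argmax(mag)]
print(f"modulation peak at {peak_hz:.1f} Hz")
```

For this test signal the modulation spectrum has a single peak at the 4-Hz modulation rate; real speech yields a much broader distribution.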

Modulation Spectrum of Spoken Language

The modulation spectrum has a broad peak of energy between 3 and 10 Hz

Linguistically, the modulation spectrum reflects SYLLABLES

The distribution of syllable duration is similar to the modulation spectrum
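One way to relate the two distributions is to treat each syllable duration as one modulation period, so that duration and modulation frequency are reciprocals. The sketch below assumes that simple mapping; the helper name `duration_to_modulation_hz` and the example durations are illustrative and are not measurements from the talk.

```python
def duration_to_modulation_hz(duration_ms):
    """Return the modulation frequency (Hz) whose period equals the given duration."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    return 1000.0 / duration_ms

# Illustrative durations only (assumed values, not data from the talk).
for duration_ms in (100.0, 200.0, 333.0):
    hz = duration_to_modulation_hz(duration_ms)
    print(f"{duration_ms:5.0f} ms -> {hz:.1f} Hz")
```

Under this assumption, a 200-ms syllable corresponds to 5 Hz, and durations between roughly 100 and 333 ms span the 3 to 10 Hz range of the broad peak.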

Modulation Spectrum of Spoken Language

[Figure: distribution of syllable duration compared with the modulation spectrum; 15 minutes of spontaneous material from a single Japanese speaker]

Questions:

Why do syllables vary so much in duration?

And why is the peak of the modulation spectrum so broad?

Variation in Syllable Duration


Why do syllables vary so much in duration?

In large part, it is because syllables carry differential amounts of information

Longer syllables tend to contain more information than shorter syllables

Below, the vowels in “ride” and “bikes” are longer than in other words (as well as more intense)

Variation in Syllable Duration

Duration is one of the most important correlates of syllable accent (prosody)

We know this because of studies SIMULATING syllable prominence (accent) labeling by highly trained linguistic transcribers

In one study, it was shown that duration is the single most important acoustic property related to syllable prominence (in Am. English)

Duration Correlates with Syllable Stress

[Figure panels: Duration, Amplitude, Pitch; Silipo and Greenberg (1999)]

Word Duration and Syllabic Accent Level

Words that contain an accented syllable tend to be considerably longer than unaccented words

What are the implications of this insight?

[Figure legend: Heavily Accented, Lightly Accented, Unaccented, All Words]

Word Duration and Stress Accent Level

The broad distribution of word duration (and, in turn, syllable duration) largely reflects the co-existence of accented and unaccented words (and syllables), often within the same utterance

This interleaving of long and short syllables reflects the DIFFERENTIAL DISTRIBUTION of ENTROPY across an utterance

Breadth of the Modulation Spectrum

The broad bandwidth of the modulation spectrum, as it reflects syllable duration, encapsulates the heterogeneity in syllabic and lexical duration associated with variation in syllable prominence

Does this insight have implications for spoken language?

Modulation spectrum of 40 TIMIT sentences (computed across a 6-kHz bandwidth)

[Figure legend: Unaccented, Heavily Accented, All Accents (Convergence)]

Modulation Spectrum Breadth & Intelligibility

Long ago, Houtgast and Steeneken demonstrated that the modulation spectrum is highly predictive of speech intelligibility

In highly reverberant environments, the modulation spectrum’s peak is severely attenuated, shifted down to ca. 2 Hz, and the signal becomes largely unintelligible
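The effect of reverberation on the modulation spectrum can be illustrated with the modulation transfer function commonly given in the Speech Transmission Index literature for an idealized, exponentially decaying reverberant field: m(F) = 1 / sqrt(1 + (2πF·T/13.8)²), where F is the modulation frequency and T the reverberation time. The formula does not appear in the slides and is brought in here as an assumption, as are the function name `reverberation_mtf` and the example reverberation times.

```python
import math

def reverberation_mtf(mod_hz, rt60_s):
    """Modulation reduction factor for an idealized exponential reverberant decay.

    mod_hz: modulation frequency in Hz; rt60_s: reverberation time in seconds.
    """
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * mod_hz * rt60_s / 13.8) ** 2)

# Illustrative reverberation times (assumed values, not taken from the talk).
for rt60_s in (0.5, 2.0, 4.0):
    row = ", ".join(
        f"{f:>2} Hz: {reverberation_mtf(f, rt60_s):.2f}" for f in (2, 4, 8, 16)
    )
    print(f"T60 = {rt60_s} s -> {row}")
```

Because the factor falls with modulation frequency, faster modulations are attenuated more than slower ones, which is consistent with the peak of the modulation spectrum both shrinking and moving toward lower frequencies as reverberation increases.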

What does this imply with respect to prosody?

[based on an illustration by Hynek Hermansky]

Modulation Spectrum

As the modulation spectrum is progressively low-pass filtered, intelligibility declines

Suggesting that intelligibility requires both long and short (i.e., accented and unaccented) syllables

(However, some syllables - the accented ones - are “more equal” than others)
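Low-pass filtering of the modulation spectrum can be sketched as removing the faster fluctuations from an amplitude envelope. The example below is only an illustration under stated assumptions: the envelope is synthetic (a 3-Hz component standing in for longer syllables, a 12-Hz component for shorter ones), the 6-Hz cutoff is arbitrary, and the FFT-based filter `lowpass_modulations` is a hypothetical helper rather than the processing used in the cited study.

```python
import numpy as np

def lowpass_modulations(envelope, rate_hz, cutoff_hz):
    """Zero out envelope modulations above cutoff_hz (ideal FFT-domain low-pass)."""
    spectrum = np.fft.rfft(envelope)
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / rate_hz)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(envelope))

# Synthetic envelope sampled at 100 Hz for 2 seconds (assumed values).
rate_hz = 100
t = np.arange(2 * rate_hz) / rate_hz
slow = 0.5 * np.cos(2 * np.pi * 3.0 * t)    # slow fluctuation (longer units)
fast = 0.3 * np.cos(2 * np.pi * 12.0 * t)   # fast fluctuation (shorter units)
envelope = 1.0 + slow + fast

filtered = lowpass_modulations(envelope, rate_hz, cutoff_hz=6.0)
residual = np.max(np.abs(filtered - (1.0 + slow)))
print(f"largest deviation from the slow-only envelope: {residual:.2e}")
```

After filtering, only the slow component survives; the contrast between long and short units that the fast component carried is lost.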

Intelligibility and Modulation Frequency

[Figure: Silipo et al. (1999). Legend: Unaccented, Heavily Accented, All Accents (Convergence)]

Syllable Duration and Accent

Canonical Syllable Forms

Heavily accented syllables are generally 60-100% longer than their unaccented counterparts

The disparity in duration is most pronounced for syllable forms with one or no consonants (i.e., V, VC, CV)

This pattern implies that accent has its greatest impact on vocalic duration

V = Vowel, C = Consonant


Vowel Duration - Accent Level/Syllable Form

Vowels in accented syllables are at least twice as long as their unaccented counterparts

This pattern implies that the syllabic nucleus absorbs much of accent’s impact (at least as far as duration is concerned)


Syllable Onset/Coda Duration and Accent

ONSETS of accented syllables are generally 50-60% longer than their unaccented counterparts and are somewhat sensitive to stress accent

In contrast, there is little difference in duration between accented and unaccented CODA constituents

CODAS are relatively insensitive to prosody and carry less information than onsets

[Figure panels: Onsets, Codas]

Sensitivity of Syllable Constituents to Accent

Thus, the duration of syllabic nuclei (usually vowels) is most sensitive to syllable accent

Syllable CODAS are LEAST sensitive to prosodic accent

This differential sensitivity to prosodic accent reflects some fundamental principles of information encoding within the syllable, as well as principles of auditory function (e.g., onsets are more important than offsets for evoking neural discharge – hence much of the neural entropy is embedded in the onset)

Syllable Prominence (Accent) Illustrated

[Figure: the word “Seven” ([s] [eh] [vx] [en]) from the OGI Numbers95 corpus, full-spectrum perspective, showing mean duration for the accented syllable and the unaccented syllable. Constituent labels: Onset, Nucleus, Ambi-syllabic “pure” juncture, Nucleus, Juncture]

Robustness Based on Temporal Properties

Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions

Yet, the intelligibility of speech is remarkably stable

This implies that intelligibility is NOT based on the spectro-temporal DETAILS but rather on some more basic, TEMPORAL parameter(s)

Temporal Basis of Intelligibility

90% Intelligibility

Four narrow channels, presented synchronously, yield ca. 90% intelligibility

Intelligibility for two channels ranges between 10 and 60%

60% Intelligibility

Desynchronizing Slits Affects Intelligibility

When the center slits lead or lag the lateral slits by more than 25 ms, intelligibility suffers significantly

Intelligibility plummets to ca. 55% for leads/lags of 50 ms

And declines to 40% for leads/lags of 75 ms

Asynchrony greater than 50 ms results in intelligibility lower than baseline

A trough in performance occurs at ca. 200-250 ms asynchrony, roughly the interval associated with the syllable

What does this mean?

Perhaps, that there is a syllable-length time window of integration
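The desynchronization manipulation can be sketched as shifting the center slits in time relative to the lateral slits before the channels are summed. The code below is a rough illustration under assumptions: the helper names (`shift`, `desynchronize`), the 16-kHz sampling rate, and the use of unit impulses in place of band-limited speech are not from the talk.

```python
import numpy as np

def shift(x, n):
    """Delay x by n samples (n > 0) or advance it (n < 0), zero-filling the gap."""
    y = np.zeros_like(x)
    if n > 0:
        y[n:] = x[:-n]
    elif n < 0:
        y[:n] = x[-n:]
    else:
        y[:] = x
    return y

def desynchronize(slits, fs, center_indices, lag_ms):
    """Sum the slits after shifting the center slits by lag_ms (negative = lead)."""
    n = int(round(lag_ms * fs / 1000.0))
    mixed = np.zeros_like(slits[0])
    for i, slit in enumerate(slits):
        mixed += shift(slit, n) if i in center_indices else slit
    return mixed

# Four placeholder "slits", each a single impulse at sample 100 (assumed data).
fs = 16000
slits = []
for _ in range(4):
    s = np.zeros(1000)
    s[100] = 1.0
    slits.append(s)

# Center slits (indices 1 and 2) lag the lateral slits by 25 ms = 400 samples.
mixed = desynchronize(slits, fs, center_indices={1, 2}, lag_ms=25.0)
print(np.nonzero(mixed)[0])
```

With real stimuli each slit would be a narrow band-pass filtered version of the same utterance, and intelligibility would then be measured as a function of `lag_ms`.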

Slit Asynchrony Affects Intelligibility

Importance of Visual Cues

[Figure: RMS amplitude (dB) in different spectral regions (WB, F1, F2, F3) and lip area (in²) as a function of time (ms) for the sentence “Watch the log float in the wide river”. Panels: Amplitude Fluctuation in Different Spectral Regions; Lip Aperture Variation]

Data courtesy of Ken Grant

Visual cues often supplement the acoustic signal, and are particularly important in adverse acoustic environments (i.e., noise & reverberation)

What is the basis of visual supplementation to understanding speech?

One possibility is the common modulatory properties of the visual and acoustic components of the speech signal
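One simple way to quantify shared modulation between the two streams is to correlate the lip-area trajectory with the acoustic amplitude envelope. The sketch below uses synthetic signals only: the 4-Hz shared component, the 100-Hz frame rate, the scale values, and the helper name `envelope_correlation` are assumptions and do not reproduce the data shown in the talk.

```python
import numpy as np

def envelope_correlation(lip_area, audio_envelope):
    """Pearson correlation between two equally sampled time series."""
    return float(np.corrcoef(lip_area, audio_envelope)[0, 1])

# Synthetic trajectories sampled at 100 frames per second for 3 seconds.
rate = 100
t = np.arange(3 * rate) / rate
shared = np.sin(2 * np.pi * 4.0 * t)         # common slow modulation (assumed)
lip_area = 1.0 + 0.5 * shared                # placeholder lip-area track
audio_envelope = 60.0 + 10.0 * shared        # placeholder envelope in dB
unrelated = np.sin(2 * np.pi * 7.0 * t)      # a signal with no shared modulation

print(envelope_correlation(lip_area, audio_envelope))
print(envelope_correlation(lip_area, unrelated))
```

The correlation is high when the two tracks share the same slow modulation pattern and near zero when they do not, regardless of their units or scale.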

Combining Audio and Visual Cues

[Figure: experimental conditions. Video leads by 40 – 400 ms; audio leads by 40 – 400 ms; baseline condition: SYNCHRONOUS A/V]

Place of Articulation

Visual cues (a.k.a. speechreading) also provide important information about consonantal place of articulation and the nature of both prosodic and vocalic properties

One can desynchronize the audio and visual streams and measure its impact on intelligibility

Focus on Audio-Leading-Video Conditions

When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals

These data are next compared with data from the audio-alone study to illustrate the similarity in the slope of the function

Comparison of A/V and Audio-Alone Data

The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition

Such similarity in the slopes associated with intelligibility for both experiments suggests that the underlying mechanisms may be similar

The intelligibility of the audio-alone signals is higher than that of the A/V signals because slits 2+3 are highly intelligible by themselves

When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms

These data are rather strange, implying some form of “immunity” against intelligibility degradation when the video channel leads the audio

Focus on Video-Leading-Audio Conditions

The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from the audio-leading-video conditions

WHY? WHY? WHY?

Auditory-Visual Integration - the Full Monty

Time Constants of Audio-Visual Integration

The temporal limits of combining visual and acoustic information are SYLLABLE length, particularly when the video precedes the audio signal

This suggests that visual speech cues are syllabically organized

WHY are the temporal properties of speech the way they are?

Because ….

The brain requires such intervals to combine information across sensory modalities and to associate the sensory streams with meaning

Some General Answers

WHY is the average duration of a syllable (in spontaneous speech) ca. 200 ms?

The syllable’s duration reflects a basic sensori-motor integration time constant and can be considered to represent the sampling rate of consciousness


WHY are some syllables significantly longer than others?

The heterogeneity in duration reflects the unequal distribution of entropy across an utterance and is a basic requirement for decoding the speech signal


WHY are some phonetic segments (usually vowels) longer than others (typically consonants)?

Vowels reflect the influence of prosodic factors far more than consonants, and therefore convey more information concerning a syllable’s intrinsic entropy than their consonantal counterparts


WHAT can the temporal properties of speech tell us about spoken language in general?

They provide a general theoretical framework for understanding the organization of spoken language and how the brain decodes the speech signal






Prosody shields the information contained in the speech signal against the deleterious forces of nature (a.k.a. background noise and reverberation)

Therefore, to understand spoken language, it is also necessary to understand how prosody is encapsulated in the speech signal

Conclusions and Summary

That’s All

Many Thanks for Your Time and Attention

Language - A Syllable-Centric Perspective

An empirically grounded perspective of spoken language focuses on the SYLLABLE and Syllabic ACCENT as the interface between “sound” and “meaning” (or at least lexical form)

[Figure: linguistic tiers for the word “Seven”. Modes of Analysis: Energy, Time–Frequency, Prosodic Accent, Phonetic Interpretation, Manner Segmentation (Fric Voc V Nas J), Word]

Syllable as Interface between Sound & Meaning

The syllable serves as a key organizational unit that binds the lower and higher tiers of linguistic organization

There is a systematic relationship between the syllable and the articulatory-acoustic features comprising phonetic constituents

Moreover, the syllable is the primary carrier of prosodic information and is linked to morphology and the lexicon as well

These slow modulation patterns are DIFFERENTIALLY distributed across the acoustic frequency spectrum

The modulation spectra are similar (in certain respects) across frequency

But vary in certain important ways ….

Modulation Spectra Across Frequency

Modulation Spectra

Modulation Spectrum Varies Across Frequency

In Houtgast and Steeneken’s original formulation of the STI, the modulation spectrum was assumed to be similar across the acoustic frequency axis

An analysis of spoken English (in this instance TIMIT sentences) suggests that their formulation was not quite accurate for the high frequency channels, as shown below

The highest channels have considerable energy between 10 and 30 Hz

Summary of the Presentation

Low-frequency modulation patterns reflect SYLLABLES, as well as their specific content and structure


Summary of the Presentation

Such temporal properties reflect a basic sensory-motor time constant of ca. 200 ms – the SAMPLING RATE of CONSCIOUSNESS

Modulation Spectrum as Predictor of Intelligibility

In the 1970s, Houtgast and Steeneken demonstrated that the magnitude of the modulation spectrum could be used to predict speech intelligibility over a wide range of acoustic environments

In optimum listening conditions, the modulation spectrum has a peak between 4 and 5 Hz, as shown below

In highly reverberant environments, the modulation spectrum’s peak is attenuated and shifts down to ca. 2 Hz, and the speech becomes increasingly unintelligible

[based on an illustration by Hynek Hermansky]

Modulation Spectrum

In face-to-face interaction the visual component of the speech signal can be extremely important for understanding spoken language (particularly in noisy and/or reverberant conditions)

It is therefore of interest to ascertain the brain’s tolerance for asynchrony between the audio and visual components of the speech signal

This exercise can also provide potentially illuminating insights into the nature of the neural mechanisms underlying speech comprehension

Specifically, the contribution of speechreading cues can provide clues about what REALLY is IMPORTANT in the speech signal for INTELLIGIBILITY that is independent of the sensory modality involved

Audio-Visual Integration of Speech

In Conclusion ….

Language - A Syllable-Centric Perspective

A more empirically grounded perspective of spoken language focuses on the SYLLABLE as the interface between “sound”, “vision” and “meaning”

Important linguistic information is embedded in the TEMPORAL DYNAMICS of the speech signal (irrespective of the modality)

Germane Publications

Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.

Grant, K. and Greenberg, S. (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. Proceedings of the ISCA Workshop on Audio-Visual Speech Processing (AVSP-2001), pp. 132-137.

Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.

Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.

Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.

Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. 6th European Conference on Speech Communication and Technology (Eurospeech-99), pp. 2687-2690.

http://www.icsi.berkeley.edu/~steveng

Syllables rise and fall in energy over the course of their duration

Vocalic nuclei are highest in amplitude

Onset consonants gradually rise in energy arching towards the peak

Coda consonants decline in amplitude, usually more abruptly than onsets

The Energy Arc Illustrated

[Figure panels: Spectro-temporal profile (STeP); Spectrogram + Waveform]

“seven”

Spectrally sparse audio and speech-reading information provide minimal intelligibility when presented alone in the absence of the other modality

This same information can, when combined across modalities, provide good intelligibility (63% average accuracy)

When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony

When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms

For eight out of nine subjects, the highest intelligibility is associated with conditions in which the video signal leads the audio (often by 80-120 ms)

There are many potential interpretations of the data

The interpretation currently favored (by the speaker) posits a relatively long (200 ms) integration buffer for audio-visual integration when the brain is confronted exclusively (even for short intervals) with speech-reading information (as occurs when the video signal leads the audio)

The data further suggest that place-of-articulation cues evolve over syllabic intervals of ca. 200 ms in length and could therefore potentially apply to models of speech processing in general

Speechreading also appears to provide important prosodic information that is extremely useful for decoding the speech signal

Audio-Video Integration – Summary





Prosody is what likely shields the information contained in the speech signal against the deleterious forces of nature (a.k.a. background noise and reverberation)

Therefore, to understand spoken language, it is also necessary to understand how prosody is encapsulated in the speech signal

This is the focus for today’s presentation

But, before considering prosody, let’s first examine an important acoustic property of the speech signal ….

Take Home Messages
