On the Correlation between Energy and Pitch Accent in Read English Speech

Andrew Rosenberg
Weekly Speech Lab Talk, 6/27/06


Page 1: On the Correlation between Energy and Pitch Accent in Read English Speech

On the Correlation between Energy and Pitch Accent in Read English Speech

Andrew Rosenberg

Weekly Speech Lab Talk

6/27/06

Page 2: On the Correlation between Energy and Pitch Accent in Read English Speech

Talk Outline

Introduction to Pitch Accent
Previous Work
Contribution and Approach
Corpus
Results and Discussion
Conclusion
Future Work

Page 3: On the Correlation between Energy and Pitch Accent in Read English Speech

Introduction

Pitch Accent is the way a word is made to “stand out” from its surrounding utterance.
  As opposed to lexical stress, which refers to the most prominent syllable within a word.
Accurate detection of pitch accent is particularly important to many NLU tasks.
  Identification of “important” words.
  Indication of Discourse Status and Structure.
  Disambiguation of Syntax/Semantics.
Pitch (f0), Duration, and Energy are all known correlates of Pitch Accent.

Page 4: On the Correlation between Energy and Pitch Accent in Read English Speech

Previous Work

Sluijter and van Heuven 96, 97 showed that accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband > 500Hz.
Heldner 99, 01 and Fant et al. 00 found that energy in a particular spectral region indicated accent in Swedish.
A lot of research attention has been given to the automatic identification of prominent or accented words.
  Tamburini 03, 05 used the energy components of the 500Hz-2000Hz band.
  Tepperman 05 used the RMS energy from the 60Hz-400Hz band.
  Far too many others to mention here.

Page 5: On the Correlation between Energy and Pitch Accent in Read English Speech

Contribution and Approach

There is no agreement as to the best (most discriminative) frequency subband from which to extract energy information.
We set up a battery of analysis-by-classification experiments varying:
  The frequency band:
    lower bound frequency ranged from 0 to 19 bark
    bandwidth ranged from 1 to 20 bark
      upper bound was 20 bark, set by the 8kHz Nyquist rate
    Also analyzed the first and/or second formants.
  The region of analysis:
    Full word, only syllable nuclei, longest syllable, longest syllable nuclei
  Speaker:
    Each of 4 speakers separately, and all together.
We performed the classification using J48, a Java implementation of C4.5.
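The band sweep described on this slide can be sketched as a simple enumeration. This is a minimal illustration, not the code used for the talk. It assumes whole-Bark steps for both the lower bound and the bandwidth, and it uses the inverse of Traunmüller's (1990) Bark approximation to show rough Hz edges. The slides do not say which Bark formula was used, and the names `bark_to_hz` and `enumerate_bands` are hypothetical.

```python
from typing import List, Tuple

MAX_BARK = 20  # upper limit stated on the slide


def bark_to_hz(z: float) -> float:
    """Invert Traunmueller's approximation z = 26.81 * f / (1960 + f) - 0.53.

    Assumption: the talk does not name a Bark formula. The low- and
    high-end corrections of the original formula are omitted here.
    """
    return 1960.0 * (z + 0.53) / (26.28 - z)


def enumerate_bands(max_bark: int = MAX_BARK) -> List[Tuple[int, int]]:
    """Return every (lower, upper) band in whole Bark with upper <= max_bark."""
    bands = []
    for lower in range(0, max_bark):           # lower bound: 0..19 bark
        for width in range(1, max_bark + 1):   # bandwidth: 1..20 bark
            upper = lower + width
            if upper <= max_bark:              # respect the 20-bark ceiling
                bands.append((lower, upper))
    return bands


bands = enumerate_bands()
print(len(bands))  # 210 candidate bands under the whole-Bark assumption

# Approximate Hz edges of one band, e.g. 2-20 bark.
low_hz, high_hz = bark_to_hz(2), bark_to_hz(20)
print(round(low_hz), round(high_hz))
```

Under these assumptions each band would be crossed with the four regions of analysis and the five speaker settings (each speaker alone, plus all together) to give the full battery of experiments.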

Page 6: On the Correlation between Energy and Pitch Accent in Read English Speech

Contribution and Approach

Local Features:
  minimum, maximum, mean, standard deviation and RMS of energy
  z score of max energy within the word
  mean slope
  energy contour classification {rising, falling, peak, valley}
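A rough sketch of how the numeric local features could be computed from the per-frame energy values of one word. The function name `local_energy_features`, the use of the population standard deviation, and the reading of "mean slope" as the mean frame-to-frame difference are assumptions made for illustration. The contour classification is left out because the slides do not define how the four classes are assigned.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List


def local_energy_features(frames: List[float]) -> Dict[str, float]:
    """Summarize the per-frame energy values of a single word."""
    if not frames:
        raise ValueError("need at least one energy frame")
    mu = mean(frames)
    sd = pstdev(frames)  # population std. dev. (an assumption)
    peak = max(frames)
    # Mean of the frame-to-frame differences, which telescopes to
    # (last - first) / (n - 1).
    if len(frames) > 1:
        slope = (frames[-1] - frames[0]) / (len(frames) - 1)
    else:
        slope = 0.0
    return {
        "min": min(frames),
        "max": peak,
        "mean": mu,
        "std": sd,
        "rms": math.sqrt(sum(x * x for x in frames) / len(frames)),
        # z score of the maximum relative to the word's own frames;
        # defined as 0.0 when the energy is constant.
        "z_max": (peak - mu) / sd if sd > 0 else 0.0,
        "mean_slope": slope,
    }


example = local_energy_features([1.0, 3.0])
print(example)
```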

Context-based Features:
  Use 6 contexts: (# previous words, # following words)
    (2,2) (1,1) (1,0) (2,0) (0,1) (2,1)
  (max_word - mean_region) / std.dev_region
  (mean_word - mean_region) / std.dev_region
  (max_word - max_region) / std.dev_region
  max_word / (max_region - min_region)
  mean_word / (max_region - min_region)
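The five context-normalized values above could be computed along these lines. This is a sketch under stated assumptions: the region is taken to include the target word itself, it is clipped at utterance boundaries, the population standard deviation is used, and a zero denominator yields 0.0. None of these details is given on the slide, and `context_features` is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# (number of previous words, number of following words), as listed on the slide
CONTEXTS: List[Tuple[int, int]] = [(2, 2), (1, 1), (1, 0), (2, 0), (0, 1), (2, 1)]


def _safe_div(num: float, den: float) -> float:
    """Divide, returning 0.0 for a zero denominator (an assumption)."""
    return num / den if den != 0 else 0.0


def context_features(words: List[List[float]], i: int,
                     n_prev: int, n_next: int) -> Dict[str, float]:
    """Normalize word i's energy against a window of neighbouring words.

    words[k] holds the per-frame energy values of the k-th word.
    """
    start = max(0, i - n_prev)
    stop = min(len(words), i + n_next + 1)
    region = [x for w in words[start:stop] for x in w]
    word = words[i]
    r_mean, r_std = mean(region), pstdev(region)
    r_range = max(region) - min(region)
    return {
        "max_z": _safe_div(max(word) - r_mean, r_std),
        "mean_z": _safe_div(mean(word) - r_mean, r_std),
        "max_vs_max": _safe_div(max(word) - max(region), r_std),
        "max_over_range": _safe_div(max(word), r_range),
        "mean_over_range": _safe_div(mean(word), r_range),
    }


# Toy utterance of three words; features for the middle word in all 6 contexts.
words = [[1.0, 2.0], [4.0, 6.0], [2.0, 3.0]]
features = {ctx: context_features(words, 1, *ctx) for ctx in CONTEXTS}
print(features[(1, 1)])
```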

Page 7: On the Correlation between Energy and Pitch Accent in Read English Speech

Corpus

Boston Directions Corpus (BDC) [Hirschberg&Nakatani96]
  Speech elicited from a direction-giving task.
  Used only the read portion.
  50 minutes
  Fully ToBI labeled
  10825 words
  Manually segmented
4 Speakers: 3 male, 1 female

Page 8: On the Correlation between Energy and Pitch Accent in Read English Speech

Results and Discussion

Energy from different frequency regions predicts pitch accent differently
  mean relative improvement of best region over worst: 14.8%

Page 9: On the Correlation between Energy and Pitch Accent in Read English Speech

Results and Discussion

Our experiments did not confirm previously reported results.
The single most predictive subband for all speakers was 3-18bark over full words
  Classification Accuracy: 76% (42.4% baseline)
    p=71.6, r=73.4
However, it performs significantly worse than the best band when analyzing a single speaker
  not the female speaker
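For reference, the three scores quoted on these slides can be computed from gold and predicted labels as sketched below. The sketch assumes that p and r stand for precision and recall on the accented class, which the slides do not state outright. The function name is hypothetical.

```python
from typing import List, Tuple


def accuracy_precision_recall(gold: List[bool],
                              pred: List[bool]) -> Tuple[float, float, float]:
    """Score binary accent predictions; True means 'accented'."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g and p)        # accented, predicted accented
    fp = sum(1 for g, p in pairs if not g and p)    # unaccented, predicted accented
    fn = sum(1 for g, p in pairs if g and not p)    # accented, predicted unaccented
    correct = sum(1 for g, p in pairs if g == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


gold = [True, True, False, False, True]
pred = [True, False, False, True, True]
acc, prec, rec = accuracy_precision_recall(gold, pred)
print(acc, prec, rec)
```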

Page 10: On the Correlation between Energy and Pitch Accent in Read English Speech

Results and Discussion

The subband from 2-20bark performs significantly worse than the most predictive in only a single experiment (h1nucl)
  Accuracy: 75.5% (p=70.5, r=72.5)
  Due to its robustness we consider this band the “best”
The formant-based energy features tend to perform worse
  6.4% mean accuracy reduction from 2-20bark
  Attributable to:
    Errors in the formant tracking algorithm
    The presence of discriminative information in higher formants

Page 11: On the Correlation between Energy and Pitch Accent in Read English Speech

Results and Discussion

Most predictive features were normalized maximum energy relative to the mean and standard deviation of three contextual regions
  1 previous and 1 following word
  2 previous and 1 following word
  2 previous and 2 following words

Page 12: On the Correlation between Energy and Pitch Accent in Read English Speech

Results and Discussion

There is a relatively small intersection of correct predictions even among similar subbands.
10823 of 10825 words were correctly classified by at least one classifier.
Using a majority voting scheme:
  Accuracy: 81.9% (p=76.7, r=82.5)
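Both figures on this slide, the count of words that at least one classifier gets right and the majority-vote decision, can be sketched as follows. This assumes one binary prediction per word from each band-specific classifier and an unweighted vote. The tie-breaking rule is also an assumption, since the slides do not describe the voting scheme further. The function names are hypothetical.

```python
from typing import List


def majority_vote(predictions: List[List[bool]]) -> List[bool]:
    """Combine classifiers; predictions[c][w] is classifier c's call for word w."""
    n_classifiers = len(predictions)
    n_words = len(predictions[0])
    votes = [sum(1 for p in predictions if p[w]) for w in range(n_words)]
    # Strict majority needed for 'accented'; ties fall to 'unaccented'
    # (an assumption, relevant only with an even number of classifiers).
    return [2 * v > n_classifiers for v in votes]


def oracle_coverage(predictions: List[List[bool]], gold: List[bool]) -> int:
    """Count words that at least one classifier labels correctly."""
    return sum(
        1 for w in range(len(gold))
        if any(p[w] == gold[w] for p in predictions)
    )


# Toy example: three classifiers, three words.
preds = [
    [True, False, True],
    [True, True, False],
    [False, False, False],
]
gold = [True, False, True]
voted = majority_vote(preds)
covered = oracle_coverage(preds, gold)
print(voted, covered)
```

The gap between the oracle count and the voted accuracy is what motivates the question of choosing the right band per word.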

Page 13: On the Correlation between Energy and Pitch Accent in Read English Speech

Results and Discussion

How do the regioning strategies perform?
  Full Word > All Nuclei > Longest Syllable ~ Longest Nuclei
Why does analysis of the full word outperform other regioning strategies?
  Duration is a crude measure of lexical stress
  Syllable/nuclei segmentation algorithms are imperfect
  Pitch accents are not neatly placed
  More data has the ability to highlight distinctions more easily

Page 14: On the Correlation between Energy and Pitch Accent in Read English Speech

Conclusion

Using an analysis-by-classification approach we showed:
  Energy from different frequency bands correlates with pitch accent differently.
  The “best” (highest accuracy, most robust) frequency region to be 2-20bark (>2bark?)
  A voting classifier based exclusively on energy can predict accent reliably.

Page 15: On the Correlation between Energy and Pitch Accent in Read English Speech

Future Work

Can we predict which bands will predict accent best for a given word?
We plan on incorporating these findings into a general pitch accent classifier with pitch and duration features.
We plan on repeating these experiments on spontaneous speech data.