Analysis and Synthesis of Shouted Speech

Tuomo Raitio, Antti Suni, Jouni Pohjalainen, Manu Airaksinen, Martti Vainio, Paavo Alku

• Shout is the loudest mode of vocal communication

• It is used for increasing the signal-to-noise ratio (SNR) when communicating
  • over an interfering noise
  • over a distance

• Shouting is also used for expressing emotions or intentions

• Shout is produced by raising the subglottal pressure and increasing the vocal fold tension

• In effect, shout is characterized by
  • Increased sound pressure level (SPL)
  • Increased fundamental frequency (f0)
  • Increased amplitudes in mid-frequencies (1–4 kHz)
  • Increased duration and energy of vowels
  • Decreased duration and energy of consonants
  • Less accurate articulation

Properties of shout

• Fortunately, shouting is used rarely, but it is an essential part of human vocal communication

• Shout synthesis may be required e.g. for creating speech with emotional content, and it can be used in human-computer interaction or in creating virtual worlds and characters

Why perform shout synthesis?

In this study
• Normal and shouted speech were recorded
• Properties of normal and shouted speech were analyzed
• Methods for producing natural-sounding HMM-based synthetic shout were investigated

In this study…

• Normal and shouted speech were recorded in an anechoic chamber
  • 22 Finnish speakers
  • 24 sentences of speech and shout from each speaker
  • A total of 1056 sentences
  • Subjects were asked to use a very loud voice in shouting

• In addition, a larger shouting corpus of 100 sentences was recorded from one male and one female speaker for TTS purposes

Recording of normal and shouted speech

• The following acoustic properties were analyzed from the recorded shouted and normal speech:
  • sound pressure level (SPL)
  • duration
  • fundamental frequency (f0)
  • spectrum
  • properties of the voice source:
    • shape of the glottal pulse
    • H1-H2 parameter
    • NAQ parameter
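The slide names H1-H2 and NAQ without defining them. As a minimal sketch, assuming a glottal flow estimate is already available (for example from glottal inverse filtering), the two parameters could be computed roughly as follows. H1-H2 is taken here as the level difference in dB between the first and second harmonic of the voice source spectrum, and NAQ as the peak-to-peak flow amplitude divided by the product of the negative peak of the flow derivative and the period length. The function names, the Hann window, and the harmonic search width are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def h1_h2(flow, fs, f0, search=0.2):
    """Level difference (dB) between the 1st and 2nd harmonic of a flow signal.

    `flow` should span several glottal cycles. `search` (an assumed value) is
    the half-width of the peak search band, as a fraction of f0.
    """
    x = np.asarray(flow, dtype=float)
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

    def harmonic_peak(k):
        # Largest spectral magnitude in a band around the k-th harmonic.
        band = np.abs(freqs - k * f0) <= search * f0
        return spectrum[band].max()

    return 20.0 * np.log10(harmonic_peak(1) / harmonic_peak(2))


def naq(flow_cycle, fs, f0):
    """Normalized amplitude quotient of ONE glottal flow cycle."""
    x = np.asarray(flow_cycle, dtype=float)
    f_ac = x.max() - x.min()                # peak-to-peak flow amplitude
    d_peak = -np.min(np.diff(x) * fs)       # magnitude of the negative derivative peak
    period = 1.0 / f0                       # length of the cycle in seconds
    return f_ac / (d_peak * period)


if __name__ == "__main__":
    fs, f0 = 16000, 200.0
    t = np.arange(1600) / fs
    # Toy "source" with a second harmonic at half the amplitude of the first.
    toy = np.cos(2 * np.pi * f0 * t) + 0.5 * np.cos(2 * np.pi * 2 * f0 * t)
    print("H1-H2 (dB):", h1_h2(toy, fs, f0))

    # One cycle of a raised-cosine pulse as a stand-in for a glottal flow pulse.
    cycle = 0.5 * (1.0 - np.cos(2 * np.pi * f0 * np.arange(80) / fs))
    print("NAQ:", naq(cycle, fs, f0))
```

Lower H1-H2 and lower NAQ values generally point to a more pressed, effortful voice source, which is why such parameters are of interest when comparing normal speech with shout.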

Acoustic analysis of shout

• On average (speech → shout)
  • SPL increased 21 dB for females and 22 dB for males
  • Sentence duration increased 20% for females and 24% for males
  • f0 increased 71% for females and 152% for males
  • Spectrum was emphasized in the 1–4 kHz range
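To put these averages on more familiar scales, the short sketch below converts an SPL increase in dB into a sound pressure ratio and a percentage f0 increase into semitones. It uses only the standard definitions (20·log10 for pressure, 12·log2 for semitones); the helper names are illustrative.

```python
import math


def db_to_pressure_ratio(db):
    """Sound pressure ratio corresponding to a level difference in dB."""
    return 10 ** (db / 20)


def percent_to_semitones(percent_increase):
    """Musical interval (in semitones) for a relative frequency increase."""
    return 12 * math.log2(1 + percent_increase / 100)


if __name__ == "__main__":
    # Values taken from the slide: (SPL increase in dB, f0 increase in %).
    for group, spl_db, f0_pct in [("female", 21, 71), ("male", 22, 152)]:
        ratio = db_to_pressure_ratio(spl_db)
        semitones = percent_to_semitones(f0_pct)
        print(f"{group}: pressure x{ratio:.1f}, f0 +{semitones:.1f} semitones")
```

In these terms, an SPL increase of about 21 dB is more than a tenfold increase in sound pressure, and a 152% rise in f0 is more than an octave.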

Acoustic analysis of shout – Results

[Figure panels: Overall, Voiced, Unvoiced; columns: Female, Male]

• Differences between normal speech and shout are large
• This induces problems in many speech processing algorithms:

• Due to high f0, the accurate estimation of speech spectrum is difficult

• This is due to the biasing effect of the sparse harmonic structure of the shouted voice source

• Especially linear prediction (LP) is prone to this type of bias

Problems…

• The biasing effect of the harmonics must be reduced
• For this purpose, e.g. weighted linear prediction (WLP) can be used

• In WLP, the effect of the excitation on the spectrum is reduced
• This is done by weighting the squared residual with a specific function

Spectrum estimation of shout

LP vs. weighted linear prediction (WLP)

Conventional LP:

Weighted LP:
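In the standard textbook formulation, the two criteria differ only in the weight applied to the squared residual. The notation below is an assumption chosen for illustration (s_n is the speech sample, a_k the predictor coefficients, p the prediction order, e_n the residual, W_n the weight) and may differ from the symbols used on the original slide.

```latex
% Conventional LP: minimize the energy of the prediction residual e_n
E_{\mathrm{LP}} = \sum_{n} e_n^2
                = \sum_{n} \Bigl( s_n - \sum_{k=1}^{p} a_k s_{n-k} \Bigr)^2

% Weighted LP: each squared residual sample is scaled by a weight W_n
E_{\mathrm{WLP}} = \sum_{n} W_n e_n^2
                 = \sum_{n} W_n \Bigl( s_n - \sum_{k=1}^{p} a_k s_{n-k} \Bigr)^2
```

With W_n = 1 for all n, the weighted criterion reduces to conventional LP.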

• The following spectrum estimation methods were compared for normal speech and shout:
  1. Conventional linear prediction (LP)
  2. WLP with STE weight (STE-WLP)
  3. WLP with AME weight (AME-WLP)

STE – short-time energy
AME – attenuation of the main excitation
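A minimal sketch of how LP and STE-weighted WLP coefficients could be computed for one frame is given below. It makes several assumptions not stated on the slides: the weighted normal equations are solved directly (a covariance-style formulation), the STE weight at sample n is the energy of the `window` preceding samples, and `window=12` is an arbitrary illustrative value. The AME weight is not shown. The function names are hypothetical.

```python
import numpy as np


def ste_weight(frame, window=12, eps=1e-9):
    """Short-time energy of the `window` samples preceding each sample."""
    s2 = np.asarray(frame, dtype=float) ** 2
    csum = np.concatenate(([0.0], np.cumsum(s2)))   # csum[i] = sum of s2[:i]
    idx = np.arange(len(s2))
    start = np.maximum(idx - window, 0)
    # eps keeps every weight strictly positive, also in silent regions.
    return csum[idx] - csum[start] + eps


def wlp(frame, order, weights=None):
    """Predictor coefficients a_1..a_p minimizing sum_n w_n * e_n**2.

    With weights=None all weights are 1, which gives conventional
    (covariance-method) linear prediction.
    """
    s = np.asarray(frame, dtype=float)
    n = np.arange(order, len(s))
    # Column k-1 holds the delayed signal s[n-k] for k = 1..order.
    past = np.stack([s[n - k] for k in range(1, order + 1)], axis=1)
    w = np.ones(len(n)) if weights is None else np.asarray(weights, dtype=float)[n]
    # Weighted normal equations: (X^T W X) a = X^T W s
    lhs = past.T @ (w[:, None] * past)
    rhs = past.T @ (w * s[n])
    return np.linalg.solve(lhs, rhs)


if __name__ == "__main__":
    # Impulse response of a known 2nd-order all-pole filter as a toy signal.
    true_a = np.array([1.3, -0.6])
    s = np.zeros(64)
    s[0] = 1.0
    s[1] = true_a[0] * s[0]
    for i in range(2, len(s)):
        s[i] = true_a[0] * s[i - 1] + true_a[1] * s[i - 2]

    print("LP     :", wlp(s, order=2))
    print("STE-WLP:", wlp(s, order=2, weights=ste_weight(s)))
```

On this noise-free toy signal both estimates recover the true coefficients. The methods differ on real voiced speech, where the STE weight de-emphasizes samples that follow low-energy stretches of the signal.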

Spectrum estimation of shout

• Subjective listening tests indicate that
  • WLP-AME performs best with normal speech
  • WLP-STE performs best with shout

[Figure panels: WLP-STE, WLP-AME]

LP vs. WLP in resynthesis

• Subjective listening tests indicate that WLP-STE is preferred in the synthesis of shout (by adaptation)

[Figure panels: Female, Male]

LP vs. WLP in HMM-based speech synthesis

• HMM-based synthesis is a very flexible means to produce different speaking styles, such as shout

Synthesis of shout (1)

[Diagram: Speech data → Training → Statistical model → Synthesis → Synthetic speech]

• It is difficult to obtain large amounts of shout data, enough for constructing a TTS voice

[Diagram label: Shout data]

• Statistical adaptation of the normal speech model was used to generate synthetic shouted speech

[Diagram: Speech data → Training → Statistical model; Shout data → Adaptation of the statistical model → Synthesis → Synthetic shout]

• Alternatively, using a simple voice conversion technique, the synthetic speech can be converted into shouted speech

[Diagram: Speech data → Training → Statistical model → Synthesis → Voice conversion (using shout data) → Synthetic shout]

• The following speech types were selected for the test:
  1. Natural normal speech
  2. Natural shout
  3. Synthetic normal speech
  4. Synthetic shout (adapted)
  5. Synthetic shout (voice conversion)

Evaluation (1)

• MOS-style listening test: the following properties were rated:
  1. How would you rate the quality of the speech sample?
  2. How much does the sample resemble shouting?
  3. How much effort did the speaker use for producing the speech?

• Scale from 1 to 5 with verbal anchors
• Loudness of the speech samples was normalized so that the ratings are based on aspects other than SPL
• 11 test subjects evaluated 50 samples each
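The slides do not say how loudness was normalized or how the ratings were summarized. As a sketch under those caveats, the example below scales each sample to a common RMS level (an assumed, simple stand-in for loudness normalization) and reports a mean opinion score with a normal-approximation 95% confidence interval. The function names, the target level, and the example ratings are all hypothetical.

```python
import numpy as np


def normalize_rms(samples, target_rms=0.05):
    """Scale a waveform so that its RMS level equals `target_rms`."""
    x = np.asarray(samples, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    if rms == 0.0:
        return x                      # silent input: nothing to scale
    return x * (target_rms / rms)


def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and half-width of an approximate 95% interval."""
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    half_width = z * r.std(ddof=1) / np.sqrt(len(r))
    return mean, half_width


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    quiet = 0.01 * rng.standard_normal(16000)   # stand-in for normal speech
    loud = 0.30 * rng.standard_normal(16000)    # stand-in for shout
    for name, x in [("quiet", quiet), ("loud", loud)]:
        y = normalize_rms(x)
        print(name, "RMS after normalization:", np.sqrt(np.mean(y ** 2)))

    # Hypothetical ratings on the 1-5 scale, for illustration only.
    mean, half = mos_with_ci([4, 3, 5, 4, 4, 3, 4, 5, 3, 4, 4])
    print(f"MOS = {mean:.2f} +/- {half:.2f}")
```

Equalizing the level this way removes the most obvious cue for shouting, so listeners have to rely on spectral and voice-quality differences when rating the samples.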

Evaluation (2)

Results – Naturalness

• Shout synthesis is rated lower in quality compared to normal speech synthesis (as expected)

[Figure labels: Normal synthesis, Shout synthesis]

Results – Impression of shouting

• The impression of shouting is, however, fairly well preserved

[Figure labels: Natural shout, Synthetic shout]

Results – Vocal effort

• Adaptation produces a better impression of the vocal effort used than the voice conversion method

[Figure labels: Adapted shout, Voice conversion shout]

• Synthesis of shout is challenging for many reasons:
  1. It is difficult to obtain large amounts of shout data with consistent quality
  2. Differences between normal speech and shout are large, which induces problems in many speech processing algorithms
• In this work, the biasing effect of high-pitched shout was reduced by using weighted linear prediction (WLP) methods
• Subjective listening tests show that WLP models work better with shout than conventional LP

Summary (1)

• In this study, synthetic shout was produced with two different techniques:
  1. Adaptation
  2. Voice conversion of the synthetic normal speech

• Methods were rated equal in quality
• Impression of shouting and the use of vocal effort were better preserved in the adapted shout

Summary (2)

Thank you!

Male Female

Samples
