Post on 15-Jan-2016
Analysis and Synthesis of Shouted Speech
Statistical Parametric Speech Synthesis Utilizing Glottal Inverse Filtering Based Vocoding. Analysis and Synthesis of Shouted Speech. Raitio, Suni, Pohjalainen, Airaksinen, Vainio, Alku
Shout
• Shout is the loudest mode of vocal communication
• It is used for increasing the signal-to-noise ratio (SNR) when communicating
  • over an interfering noise
  • over a distance
• Shouting is also used for expressing emotions or intentions
Properties of shout
• Shout is produced by raising the subglottal pressure and increasing the vocal fold tension
• In effect, shout is characterized by
  • Increased sound pressure level (SPL)
  • Increased fundamental frequency (f0)
  • Increased amplitudes in mid-frequencies (1–4 kHz)
  • Increased duration and energy of vowels
  • Decreased duration and energy of consonants
  • Less accurate articulation
Why perform shout synthesis?
• Fortunately, shouting is rarely used, but it is an essential part of human vocal communication
• Shout synthesis may be required, e.g., for creating speech with emotional content, and it can be used in human-computer interaction or in creating virtual worlds and characters
In this study…
• Normal and shouted speech were recorded
• Properties of normal and shouted speech were analyzed
• Methods for producing natural-sounding HMM-based synthetic shout were investigated
Recording of normal and shouted speech
• Normal and shouted speech were recorded in an anechoic chamber
  • 22 Finnish speakers
  • 24 sentences of both normal speech and shout from each speaker
  • A total of 1056 sentences
  • Subjects were asked to use a very loud voice when shouting
• In addition, a larger shouting corpus of 100 sentences was recorded from one male and one female speaker for TTS purposes
Acoustic analysis of shout
• The following acoustic properties were analyzed from the recorded shouted and normal speech:
  • sound pressure level (SPL)
  • duration
  • fundamental frequency (f0)
  • spectrum
  • properties of the voice source:
    • shape of the glottal pulse
    • H1-H2 parameter
    • NAQ parameter
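The slides do not show how the voice source parameters were computed. As a minimal sketch, the normalized amplitude quotient (NAQ) is commonly defined as the peak-to-peak amplitude of the glottal flow divided by the product of the negative peak of the flow derivative and the period length. The function name `naq`, the sampling rate, and the triangular test pulse below are illustrative assumptions, and the glottal flow is assumed to have already been estimated (e.g., by glottal inverse filtering).

```python
import numpy as np


def naq(flow_period, fs):
    """Normalized amplitude quotient of one glottal flow period.

    flow_period: samples of the estimated glottal flow over exactly one period
    fs: sampling rate in Hz
    """
    flow = np.asarray(flow_period, dtype=float)
    f_ac = flow.max() - flow.min()           # peak-to-peak flow amplitude
    d_peak = -np.min(np.diff(flow)) * fs     # magnitude of the negative derivative peak
    if d_peak <= 0:
        raise ValueError("flow has no closing phase")
    period = len(flow) / fs                  # period length in seconds
    return f_ac / (d_peak * period)


# Hypothetical triangular pulse: 60 samples opening, 10 closing, 29 closed.
fs = 16000
pulse = np.concatenate([
    np.linspace(0.0, 1.0, 61),
    np.linspace(1.0, 0.0, 11)[1:],
    np.zeros(29),
])
naq_value = naq(pulse, fs)
```

For this idealized triangular pulse, NAQ equals the fraction of the period taken by the closing phase (10 of 100 samples), so a smaller value corresponds to a more abrupt closure.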
Acoustic analysis of shout – Results
• On average, from normal speech to shout:
  • SPL increased 21 dB for females and 22 dB for males
  • Sentence duration increased 20% for females and 24% for males
  • f0 increased 71% for females and 152% for males
  • Spectrum was emphasized in the 1–4 kHz area
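To put these averages on more familiar scales, the sketch below converts an SPL increase in dB to a sound pressure (amplitude) ratio and a relative f0 increase to semitones. The helper names are illustrative; the formulas are the standard 20·log10 pressure relation and the 12-semitones-per-octave relation.

```python
import math


def db_to_amplitude_ratio(db):
    """Sound pressure ratio corresponding to a level difference in dB."""
    return 10 ** (db / 20)


def percent_increase_to_semitones(percent):
    """Musical interval (in semitones) of a relative frequency increase."""
    return 12 * math.log2(1 + percent / 100)


# Averages reported on the slide (female, male).
spl_ratio = {"female": db_to_amplitude_ratio(21), "male": db_to_amplitude_ratio(22)}
f0_semitones = {
    "female": percent_increase_to_semitones(71),
    "male": percent_increase_to_semitones(152),
}
```

Under these conversions the reported SPL increase corresponds to roughly an 11 to 13 times larger sound pressure, and the male f0 increase is a little over one octave.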
[Figure: panels labeled Overall, Voiced, and Unvoiced, for female and male speakers]
Problems…
• Differences between normal speech and shout are large
• This induces problems in many speech processing algorithms:
  • Due to high f0, accurate estimation of the speech spectrum is difficult
  • This is due to the biasing effect of the sparse harmonic structure of the shouted voice source
  • Linear prediction (LP) is especially prone to this type of bias
Spectrum estimation of shout
• The biasing effect of the harmonics must be reduced
• For this purpose, e.g., weighted linear prediction (WLP) can be used
• In WLP, the effect of the excitation on the spectrum is reduced
• This is done by weighting the squared residual with a specific function
LP vs. weighted linear prediction (WLP)
Conventional LP:
Weighted LP:
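The equations for this slide did not survive extraction. As a hedged reconstruction based on the standard textbook formulation (the symbols x_n for the speech sample, a_k for the predictor coefficients, p for the prediction order, and w_n for the weighting function are assumptions, not taken from the slide), the two criteria differ only in the weight applied to the squared residual:

```latex
% Conventional LP: minimize the sum of squared prediction residuals
E_{\mathrm{LP}} = \sum_{n} e_n^{2}
                = \sum_{n} \Bigl( x_n - \sum_{k=1}^{p} a_k \, x_{n-k} \Bigr)^{2}

% Weighted LP: the same squared residual, weighted sample by sample with w_n
E_{\mathrm{WLP}} = \sum_{n} w_n \, e_n^{2}
                 = \sum_{n} w_n \Bigl( x_n - \sum_{k=1}^{p} a_k \, x_{n-k} \Bigr)^{2}
```

Setting w_n = 1 for all n reduces the weighted criterion to conventional LP.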
Spectrum estimation of shout
• The following spectrum estimation methods were compared for normal speech and shout:
  1. Conventional linear prediction (LP)
  2. WLP with STE weight (WLP-STE)
  3. WLP with AME weight (WLP-AME)
STE – short-time energy
AME – attenuation of the main excitation
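A minimal sketch of weighted linear prediction with a short-time energy weight follows. It is not the authors' implementation: the function names, the window length `M`, the small weight floor, and the covariance-style summation range (n = p … N−1, no windowing) are all assumptions made for illustration. The STE weight here is the energy of the `M` samples preceding each time instant, which is the usual definition in the WLP literature.

```python
import numpy as np


def ste_weight(x, M):
    """Short-time energy weight: w[n] = sum of x[n-i]^2 for i = 1..M."""
    x = np.asarray(x, dtype=float)
    sq = x ** 2
    w = np.array([sq[max(0, n - M):n].sum() for n in range(len(x))])
    return w + 1e-9  # small floor (assumption) keeps every weight positive


def wlp(x, p, w=None):
    """Predictor coefficients a[0..p-1] minimizing
    sum_n w[n] * (x[n] - sum_k a[k-1] * x[n-k])^2 over n = p..N-1.
    With w=None all weights are 1, i.e. conventional (covariance-style) LP.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    if w is None:
        w = np.ones(N)
    # Column k-1 holds the delayed signal x[n-k] for n = p..N-1.
    X = np.column_stack([x[p - k:N - k] for k in range(1, p + 1)])
    y = x[p:]
    sw = np.sqrt(w[p:])
    # Weighted least squares via row scaling by sqrt(w[n]).
    a, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return a


# Hypothetical test signal: impulse response of a known 2nd-order all-pole filter.
true_a = np.array([1.5, -0.7])
x = np.zeros(64)
x[0], x[1] = 1.0, 1.5
for n in range(2, len(x)):
    x[n] = true_a[0] * x[n - 1] + true_a[1] * x[n - 2]

a_lp = wlp(x, p=2)
a_wlp = wlp(x, p=2, w=ste_weight(x, M=2))
```

On this noise-free signal both estimates recover the true coefficients. The methods differ on real voiced speech, where the STE weight de-emphasizes samples around the main glottal excitation, which is where the residual is largest.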
LP vs. WLP in resynthesis
• Subjective listening tests indicate that
  • WLP-AME performs best with normal speech
  • WLP-STE performs best with shout
[Figure labels: LP, WLP-STE, WLP-AME]
LP vs. WLP in HMM-based speech synthesis
• Subjective listening tests indicate that WLP-STE is preferred in the synthesis of shout (by adaptation)
[Figure labels: Female, Male]
Synthesis of shout (1)
• HMM-based synthesis is a very flexible means to produce different speaking styles, such as shout
[Diagram: Speech data → Training → Statistical model; Text → Synthesis → Synthetic speech]
Synthesis of shout (2)
• It is difficult to obtain enough shout data to construct a TTS voice
Synthesis of shout (3)
• Statistical adaptation of the normal speech model was used to generate synthetic shouted speech
[Diagram: Speech data → Training → Statistical model; Shout data → Adaptation of the statistical model; Text → Synthesis → Synthetic shout]
Synthesis of shout (4)
• Alternatively, using a simple voice conversion technique, the synthetic speech can be converted into shouted speech
[Diagram: Speech data → Training → Statistical model; Text → Synthesis → Voice conversion (using shout data) → Synthetic shout]
Evaluation (1)
• The following speech types were selected for the test:
  1. Natural normal speech
  2. Natural shout
  3. Synthetic normal speech
  4. Synthetic shout (adapted)
  5. Synthetic shout (voice conversion)
Evaluation (2)
• MOS-style listening test in which the following properties were rated:
  1. How would you rate the quality of the speech sample?
  2. How much does the sample resemble shouting?
  3. How much effort did the speaker use to produce the speech?
• Scale from 1 to 5 with verbal anchors
• Loudness of the speech samples was normalized so that the ratings are based on aspects other than SPL
• 11 test subjects evaluated 50 samples each
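The slides do not state how the ratings were aggregated. A common way to summarize a MOS-style test is the mean rating per speech type with a 95% confidence interval. The sketch below assumes a normal approximation, and the function name and ratings are hypothetical rather than data from the study.

```python
import math
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and half-width of its confidence interval.

    ratings: iterable of ratings on the 1-5 scale
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    ratings = list(ratings)
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mean(ratings), half_width


# Hypothetical ratings for one speech type (not data from the study).
example_ratings = [1, 2, 3, 4, 5]
mos, ci = mos_with_ci(example_ratings)
```

With 11 listeners rating 50 samples each, every question yields 550 ratings in total, which would be grouped by speech type before applying a summary like this.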
Results – Naturalness
• Shout synthesis is rated lower in quality than normal speech synthesis (as expected)
[Figure labels: Normal synthesis, Shout synthesis]
Results – Impression of shouting
• The impression of shouting is, however, fairly well preserved
[Figure labels: Natural shout, Synthetic shout]
Results – Vocal effort
• Adaptation produces a better impression of the vocal effort used than the voice conversion method
[Figure labels: Adapted shout, Voice conversion shout]
Summary (1)
• Synthesis of shout is challenging for many reasons:
  1. It is difficult to obtain large amounts of shout data with consistent quality
  2. Differences between normal speech and shout are large, which induces problems in many speech processing algorithms
• In this work, the biasing effect of high-pitched shout was reduced by using weighted linear predictive (WLP) methods
• Subjective listening tests show that WLP models work better with shout than conventional LP
Summary (2)
• In this study, synthetic shout was produced with two different techniques:
  1. Adaptation
  2. Voice conversion of the synthetic normal speech
• The methods were rated equal in quality
• Impression of shouting and the use of vocal effort were better preserved in the adapted shout
Thank you!
Samples: Male, Female