
Speech Communication 54 (2012) 175–188

Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis

Sebastian Andersson*, Junichi Yamagishi, Robert A.J. Clark

The Centre for Speech Technology Research, University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh EH8 9AB, UK

*Corresponding author. Tel.: +44 131 651 3284. E-mail addresses: [email protected] (S. Andersson), [email protected] (J. Yamagishi), [email protected] (R.A.J. Clark).

Received 2 March 2011; received in revised form 10 August 2011; accepted 11 August 2011. Available online 26 August 2011. doi:10.1016/j.specom.2011.08.001

Abstract

Spontaneous conversational speech has many characteristics that are currently not modelled well by HMM-based speech synthesis, and in order to build synthetic voices that can give an impression of someone partaking in a conversation, we need to utilise data that exhibits more of the speech phenomena associated with conversations than the more generally used carefully read aloud sentences. In this paper we show that synthetic voices built with HMM-based speech synthesis techniques from conversational speech data preserved segmental and prosodic characteristics of frequent conversational speech phenomena. An analysis of an evaluation investigating the perception of quality and speaking style of HMM-based voices confirms that speech with conversational characteristics is instrumental for listeners to perceive successful integration of conversational speech phenomena in synthetic speech. The achieved synthetic speech quality provides an encouraging start for the continued use of conversational speech in HMM-based speech synthesis.

© 2011 Elsevier B.V. All rights reserved.

Keywords: Speech synthesis; HMM; Conversation; Spontaneous speech; Filled pauses; Discourse marker

1. Introduction

Unit selection and HMM-based synthetic voices can synthesise neutral read aloud speech quite well, both in terms of intelligibility and naturalness (King and Karaiskos, 2009). For many applications, e.g. in vehicle GPS systems, an intelligible read aloud speaking style is sufficient to provide a user with relevant information. But other applications, e.g. believable virtual characters (Traum et al., 2008), embodied conversational agents (Romportl et al., 2010) or speech-to-speech translation (Wahlster, 2000), require synthetic voices with more conversational characteristics that can synthesise turn-taking behaviour, provide backchannels and express agreement, disagreement, hesitation, etc.

The fundamental concept in both unit selection and HMM-based speech synthesis is the ability to utilise recordings of natural speech directly, and build synthetic voices that preserve the segmental and prosodic properties of the speech and the speaker in the recordings. To be able to synthesise the segmental and prosodic properties of various speech phenomena from recordings of speech, the recordings must contain examples of these speech phenomena. This has been demonstrated for HMM-based speech synthesis by Yamagishi et al. (2005), where recordings portraying emotional speaking styles were shown to be instrumental in listeners being able to perceive such styles in synthetic speech. Badino et al. (2009) demonstrated that recordings incorporating emphatic accents similarly improve the quality of emphatic accents in the resulting synthesis.

Whereas it may be theoretically possible to synthesise any style of speech from a corpus of neutrally read sentences, for example by manipulating the speech at the frame level, our current understanding of the relationship between the intention that a speaker wishes to convey and the resulting speech signal is such that doing this well, particularly in a conversational context, is extremely difficult.

As our current speech synthesis techniques are specifically formulated to reproduce the speech characteristics found in the data that they use, it is appropriate to attempt to use richer data and capture more of the natural characteristics found there. More specifically, in order to build synthetic voices that are better suited to generating realistic conversational speech, we should be building these voices from data that exhibits more of the characteristics we associate with conversational speech.

Scientific questions and problems arising for this goal include how to acquire spontaneous conversational speech in the high quality conditions that suit speech synthesis while maintaining the conversational speech phenomena that distinguish this speech from read speech, and how to best model these phenomena within our statistical frameworks. If we can solve these issues, they will provide a stepping-stone to being able to use the wealth of emerging informal speech data resources such as podcasts and speech available from Internet sites such as YouTube.1

As the first step towards this goal, we utilised carefully selected speech from a spontaneous conversation to build synthetic voices with HMM-based speech synthesis techniques (Andersson et al., 2010b). These "conversational" voices were contrasted with HMM-based voices built from a more conventional data source of neutrally read aloud sentences in a perceptual evaluation designed to compare the quality and speaking style of the voices. The evaluation in the previous publication showed an interesting tendency in the listeners' perception: the "conversational" voice was perceived as being more natural than the "read aloud" voice, and was perceived as having a more conversational speaking style when the compared synthetic utterances contained certain lexical items that are frequent in conversations. But when these conversational characteristics were removed from the sentences synthesised with the "read aloud" voice, the "conversational" voice was perceived as less natural and was no longer perceived as having a more conversational speaking style.

Therefore, in this paper we present the details of our speech synthesis systems built from spontaneous conversational speech and make a comparative analysis of segmental and prosodic properties of the natural and synthetic voices, together with an analysis of the perceptual evaluation of naturalness and conversational speaking style, to investigate why the presence or absence of a few words had such an impact on perceived speech quality.

The rest of the paper is organised as follows: Section 2 provides a background of previous research on conversational speech phenomena and synthetic speech with conversational characteristics. Section 3 gives an overview of the HMM-based speech synthesis system used to build our synthetic voices. Sections 4 and 5 describe the conversational and read aloud speech data. Sections 6 and 7 describe the building of synthetic voices and the phonetic and perceptual evaluations of the synthetic speech. Section 8 contains a final discussion and conclusions.

1 http://www.youtube.com.

2. Background

2.1. Related work

Previous research synthesising speech with spontaneous or conversational characteristics has mainly been achieved with techniques other than HMM-based speech synthesis, e.g. unit selection speech synthesis (Cadic and Segalen, 2008; Andersson et al., 2010a; Adell et al., 2010), limited domain unit selection synthesis (Sundaram and Narayanan, 2002; Gustafsson and Sjolander, 2004), phrase level selection from a very large corpus (Campbell, 2007), or articulatory speech synthesis (Lasarcyk and Wollerman, 2010). The approach in (Lee et al., 2010) did use HMM-based speech synthesis to model pronunciation variation of spontaneous speech, but has a different focus to that of dealing with the language differences in content and structure between read aloud and conversational speech.

2.2. Conversational speech phenomena

An inclusive definition of the differences in language structure and content between read aloud and conversational speech as being "wrappers" around propositional content was given in (Campbell, 2006). An example from our data is given below, with the wrappers in italics and the propositional content in bold face:

"yeah exactly and even like uh I'll go see bad movies that I know will be bad um just to see why they're so bad"

Based on previous research regarding the phonetic and discourse properties of the wrapper category, we further divided wrappers into filled pauses, discourse markers and backchannels.

Filled pauses are generally regarded as a hesitation phenomenon. The transliteration of English filled pauses differs slightly within the literature, but we will use um and uh. Filled pauses are word-like (Clark and Fox Tree, 2002), but their specific phonetic properties distinguish them from other words in terms of vowel quality, F0 and, on average, a much longer vowel duration (O'Shaughnessy, 1992; Shriberg, 1999). As a hesitation phenomenon they are often associated with a prolongation of at least the preceding syllable (Adell et al., 2008).

"Discourse markers" is one of many terms that have been used to refer to similar sets of words and expressions that are used to regulate the flow of the conversation, rather than communicate propositional content (Schiffrin, 1987). Although discourse markers consist of lexical forms that also exist in the read aloud sentences, e.g. okay, so, because or you know, the phonetic properties of these words are different when used as discourse markers than when used in other functions (Schiffrin, 1987; Gravano et al., 2007). A frequent discourse marker in conversation is yeah, often used to signal agreement at the beginning of a turn (Jurafsky et al., 1998). It is also worth noting that Jurafsky et al. (1998) treat yeah, oh yeah, and well yeah as separate tokens that can share the same discourse function, e.g. agreement or backchannel.

Backchannels are signals that the listener is involved in the dialogue, but does not want to take the turn from the speaker (Gravano et al., 2007). They often have the same lexical realisation as discourse markers, e.g. yeah or okay. Although phonetic properties, such as pitch slope, differ when these words are used as backchannels rather than in other functions, an important classification cue was that backchannels were often isolated from other speech by the same speaker (Jurafsky et al., 1998; Gravano et al., 2007).

3. HMM-based speech synthesis

All the synthetic voices described in this paper were built with the speaker dependent HMM-based speech synthesis system (HTS) (Zen et al., 2007). The text analysis and generation of context dependent phonemes are not part of the standard HTS system but were added by us, in conjunction with using the CereVoice system (Aylett and Pidcock, 2007). The only difference between the voices described here and standard HTS voices is the speaking style of the data and the additional blending of speaking styles described in Section 6.1. An overview of the acoustic feature extraction, training of HMM-based models, and generation of synthetic speech is given below:

1. Acoustic Feature Extraction: Spectral and excitation parameters are extracted from the acoustic speech signal as STRAIGHT (Kawahara et al., 1999, 2001) mel-cepstra, aperiodicity and logF0 parameters.

2. HMM Training: The acoustic parameters together with the context dependent phoneme descriptions are jointly trained in an integrated HMM-based statistical framework to estimate Gaussian distributions of excitation (logF0 and aperiodicity), spectral (STRAIGHT mel-cepstra) and duration parameters for the context dependent phonemes.

3. HMM Clustering: Due to the large number of context combinations there are generally only a few instances of each combination, and many combinations are not present in the training data. To reliably estimate statistical parameters for context combinations, the data is shared between states in the HMMs through decision tree-based context clustering (Odell, 1995). The resulting clustered trees also enable dealing with unseen context combinations at the synthesis stage. Trees are constructed separately for mel-cepstra, aperiodicity, logF0 and duration.

4. Speech Generation: At the synthesis stage an input text sentence is converted into a context dependent phoneme sequence. Speech (spectral, excitation and duration) parameters are then generated from the corresponding trained HMMs and rendered into a speech signal through the STRAIGHT mel-cepstral vocoder with mixed excitation.
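To make the data flow of these four steps concrete, here is a minimal structural sketch in Python. All function and field names are placeholders invented for illustration; they are not the actual HTS or STRAIGHT APIs, and the stage bodies are stubs rather than real signal processing.

```python
import numpy as np

# Placeholder stages mirroring the four steps above; none of these names
# come from HTS or STRAIGHT, they only illustrate the data flow.

def extract_features(wav: np.ndarray, sr: int) -> dict:
    """Step 1: would return STRAIGHT mel-cepstra, aperiodicity and logF0."""
    n_frames = len(wav) // (sr // 200)  # 5 ms frame shift
    return {"mcep": np.zeros((n_frames, 40)),
            "bap": np.zeros((n_frames, 5)),
            "lf0": np.zeros(n_frames)}

def train_hmms(features: list, labels: list) -> dict:
    """Steps 2-3: estimate Gaussians per context-dependent phone state,
    then share parameters via decision-tree clustering (one tree per
    stream: mel-cepstra, aperiodicity, logF0 and duration)."""
    return {"mcep_tree": None, "bap_tree": None,
            "lf0_tree": None, "dur_tree": None}

def synthesise(model: dict, context_labels: list) -> np.ndarray:
    """Step 4: generate parameter trajectories from the clustered HMMs
    and vocode them (STRAIGHT with mixed excitation in the paper)."""
    return np.zeros(16000)  # 1 s of silence as a stub

if __name__ == "__main__":
    sr = 16000
    feats = extract_features(np.zeros(sr), sr)
    model = train_hmms([feats], [["example context label"]])
    wav = synthesise(model, [["example context label"]])
    print(wav.shape)
```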

4. Spontaneous conversational speech data

4.1. Recording

The spontaneous conversational speech data used to build the synthetic voices in this paper were recorded for the purpose of use in speech synthesis. The speech data consisted of manually transcribed and selected utterances from a total of approximately seven hours of studio recorded conversation between the first author of this paper and an American male voice talent in his late thirties from Texas (Andersson et al., 2010a).

The voice talent was positioned inside a recording booth, with the author positioned outside of it. The speech from the voice talent and the author was recorded on separate channels, communication between them was made via microphones and headphones, and they had eye contact through a window. The conversation was unconstrained, but mainly focused around the voice talent's work as an actor and his interests in films and martial arts.

4.2. Transcription and selection

The manual transcription and selection of speech from the conversation were conducted by the first author to obtain speech data that would provide a sensible starting point for state-of-the-art statistical speech synthesis techniques.

First the speech of the voice talent was transcribed orthographically and aligned at the utterance level. The motivation for an orthographic transcription was that it is the description level generally assumed in text-to-speech synthesis, and it provides a token level annotation suitable for subsequent manual or automatic processing.

Then, only utterances that represented the speaker's "normal" speaking style were selected. For example, utterances where the speaker put on different voices to portray a third person, such as his wife or friends, were excluded. Utterances with word fragments, mispronunciations, heavily reduced pronunciations, mumbling and laughter were also excluded, to allow standard forced alignment. However, we emphasise that the remaining utterances were still rich in conversational speech phenomena, in particular backchannels, discourse markers and filled pauses. An example is shown below:

"yeah it's it's a significant amount of swelling um more than like I'd say a bruise"


Table 1. A comparative analysis of phone alignment accuracy between read aloud and conversational speech. Ten randomly selected utterances with a total of approximately 300 phones were used.

                 # Phones   Total error   Max error/phone
Conversational   347        2085 ms       800 ms
Read aloud       313        710 ms        80 ms


4.3. Forced alignment and analysis

The forced alignment and linguistic "front-end" analysis of the transcribed conversational speech were made with the CereVoice (Aylett and Pidcock, 2007) speech synthesis system. The segmentation procedure followed the forced alignment method outlined in (Young et al., 2006). In addition to detecting and aligning utterance internal pauses, the system made use of a rich system of pronunciation variants, with full and reduced pronunciation variants for many frequent English words,2 e.g. and can be pronounced fully (ænd) or reduced (ən), but can be pronounced fully (bʌt) or reduced (bət), and the can be pronounced fully (ði) or reduced (ðə). Many other words also had pronunciation variants, e.g. in because and 'cause, where the vowel quality was decided automatically based on the specific speaker's usage.

2 All phonemes in this paper are denoted with IPA symbols.

Although these pronunciation variants gave a better match to pronunciations in the conversational speech, forced alignment initiated with just the conversational speech did not provide a sufficiently accurate alignment. By utilising phoneme models trained from additional read aloud speech from the same speaker, recorded for phonetic coverage, the forced alignment of the conversational speech was improved (Andersson et al., 2010a). This alignment and linguistic analysis were also used for the HMM-based voices in this paper.
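As an illustration of the pronunciation-variant mechanism described above, the sketch below shows a toy variant lexicon. The entries follow the full/reduced pairs mentioned in the text (with the IPA reconstructed), but the dictionary format and the variants function are invented for illustration, not the CereVoice lexicon format. A forced aligner would add one pronunciation network branch per variant and let the acoustics decide.

```python
# Hypothetical variant lexicon: each word maps to a list of candidate
# pronunciations (full first, reduced after), space-separated IPA phones.
LEXICON = {
    "and": ["æ n d", "ə n"],
    "but": ["b ʌ t", "b ə t"],
    "the": ["ð i", "ð ə"],
}

def variants(word: str) -> list:
    """Return all candidate pronunciations for the alignment network."""
    return LEXICON.get(word.lower(), [])

print(variants("and"))  # ['æ n d', 'ə n']
```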

To evaluate the alignment of the conversational speech, ten randomly selected utterances, with a total of approximately 300 phones each for the conversational and read aloud speech, were compared with manually annotated results. Results are shown in Table 1. This shows that the total alignment error in the conversational speech is about three times that of the read aloud speech. However, this error is mostly concentrated in a few specific segments: 1500 ms of the total 2085 ms error can be accounted for by only two segments from one particular sentence. In general, excluding gross alignment errors, the alignment of the conversational speech was found to be as accurate as that of the read aloud speech.
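A comparison like the one in Table 1 could be computed as below: the total and maximum absolute boundary differences between automatic and manual phone alignments. This is a sketch with invented boundary times, not the paper's evaluation script.

```python
def alignment_errors(auto_bounds, manual_bounds):
    """Both inputs: phone boundary times in ms for the same phone sequence.
    Returns (total absolute error, largest per-phone error)."""
    diffs = [abs(a - m) for a, m in zip(auto_bounds, manual_bounds)]
    return sum(diffs), max(diffs)

auto = [120, 210, 380, 455]    # hypothetical forced-alignment boundaries
manual = [115, 220, 370, 460]  # hypothetical manual annotation
total, worst = alignment_errors(auto, manual)
print(f"total error: {total} ms, max error/phone: {worst} ms")
```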

4.4. Context dependent phonemes

The context dependent phonemes define the segmental and prosodic categories and dependencies in speech, for both the training and generation parts of HMM-based speech synthesis, and were generated with the CereVoice system from the text analysis and corresponding forced aligned utterances. CereVoice's contexts were based on the contexts in (Tokuda et al., 2002) and its more recent variant in (Zen et al., 2009), and took into account the following features (an illustrative label-building sketch follows the list):

• quinphone (i.e. current phoneme with the two preceding and succeeding phonemes as context, example: s-p-ɔ-r-t)
• preceding, current, and succeeding phoneme types (vowel, plosive, etc.)
• nucleus of current syllable (e.g. æ, ɔ or ʌ)
• position of phoneme in syllable, word and phrase
• position of syllable in word and phrase
• number of phonemes in syllable, word and phrase
• number of syllables in word and phrase
• part-of-speech (content or function word)
• preceding, current, and succeeding syllable stress and accent
• boundary tone of phrase (utterance final or medial)
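The following sketch shows how features from this list could be serialised into an HTS-style full-context label. The field layout is invented for illustration; the actual CereVoice/HTS label format is not given in the paper. The optional style field anticipates the extra speaking-style context introduced in Section 6.1.

```python
def full_context_label(quinphone, ptype, nucleus, pos_syl, style=None):
    """quinphone: 5-tuple (ll, l, c, r, rr) of phones; remaining args are a
    few of the features from the list above. Field names are hypothetical."""
    ll, l, c, r, rr = quinphone
    label = f"{ll}^{l}-{c}+{r}={rr}"
    label += f"/TYPE:{ptype}/NUC:{nucleus}/POSSYL:{pos_syl}"
    if style is not None:  # 'spon' or 'read', the Section 6.1 context
        label += f"/STYLE:{style}"
    return label

# e.g. the 'ɔ' in "sport", third phone of its syllable, spontaneous style
print(full_context_label(("s", "p", "ɔ", "r", "t"),
                         "vowel", "ɔ", 3, style="spon"))
```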

Although the contexts did not include explicit representation of the backchannels, discourse markers or filled pauses, the context specifications implicitly identified many important characteristics. The quinphone context encapsulated many of the discourse markers and filled pauses, e.g. yeah, you know or oh yeah, together with their frequent utterance initial or final positions. The quinphone context was also large enough to cover a filled pause together with a preceding short function word, such as and or but, or a common word ending, such as -ing, thereby potentially preserving any associated hesitation and discourse function. The contexts with counts and phrase positions should also be able to capture segmental and prosodic differences between, e.g., yeah as a stand alone backchannel, the confirmation yeah yeah yeah, or the longer utterance yeah I feel kind of dirty afterwards.

The current contexts did not, by any means, capture all the important characteristics, and better and more accurate text/speech analysis and representation of conversational speech phenomena is an important problem for HMM-based speech synthesis. However, the current contexts allowed us to establish a baseline for the continued research of conversational speech in HMM-based speech synthesis and to make a direct comparison where the only varying factor is the underlying data.

5. Comparing conversational and read aloud data

5.1. Word distribution

In addition to the conversational data, we used neutrally read aloud sentences from the same male American speaker, recorded in the same studio, with the same microphone, around the same time as the conversation (Andersson et al., 2010a). The read aloud sentences came from a wide range of text genres, including news, weather reports, and "conversation". An overview of the conversational and read aloud data is shown in Table 2. In terms of duration of usable utterances there was more read aloud data than conversational data; however, the distribution of useful characteristics in each set was quite different. Table 3 shows the twenty most frequent words in the conversational and read aloud data. It is no surprise that short function words, such as the, a, of or to, were frequent in both the read aloud and conversational data. More interestingly, the most frequent word in the conversational data was yeah, which occurred a mere three times in the read aloud data, and many other words, e.g. know and so, showed similarly large distributional differences between the read aloud and conversational data.

The reason for these distributional differences is that many of the frequent words in the conversational data are frequent because they were used to regulate the conversational flow, through discourse markers and backchannels, or to express non-propositional content such as agreement or hesitation. Approximately thirty percent of the utterances in the conversational speech data were backchannels consisting of a single word (e.g. 339 yeah, 167 right and 54 okay), but the discourse markers and filled pauses are mainly integrated with propositional content in longer utterances and, as Table 4 shows, often occur in the vicinity of phrase or utterance boundaries, hence representing our speaker's means of starting, ending or keeping a turn.

Table 2. Overview of the conversational and read aloud data. The duration shows the amount of phonetic material, including or excluding utterance internal silent pauses. The quinphone types include silences, but not lexical stress.

                                 Conversation   Read aloud
Utterances                       2120           2717
Word tokens                      19,841         22,363
Word types                       2200           5026
Syllable tokens                  24,657         30,902
Phone tokens                     58,332         75,856
Quinphone types                  37,654         58,867
Total duration (incl. silence)   89 min         106 min
Total duration (excl. silence)   75 min         103 min

Table 3. The 20 most frequent words in the conversational and read aloud data. Non-overlapping words between the two columns are bold faced.

        Conversational        Read aloud
Rank    Type     Count        Type     Count
1       yeah     818          a        762
2       I        787          the      709
3       and      690          I        390
4       you      570          to       390
5       the      488          of       340
6       a        448          is       304
7       that     366          and      290
8       know     344          you      251
9       to       336          in       220
10      uh       318          he       204
11      so       302          it       193
12      um       292          one      192
13      it       291          with     167
14      of       278          two      165
15      it's     262          we       155
16      but      248          was      151
17      like     217          three    138
18      right    210          on       134
19      was      207          are      131
20      is       195          they     130

Table 4. The 30 most frequent trigrams in the conversational data, excluding one word backchannels, but including utterance beginning/end as "sil", and utterance internal short pauses as "sp". In total 82 trigrams in the conversational data occurred 10 times or more. In comparison, only seven trigrams in the read aloud data occurred more than 10 times.

Frequency   Trigram          Frequency   Trigram
124         sil_yeah_sp      27          sp_so_sil
118         sp_you_know      23          sp_uh_sp
68          sil_um_sp        23          sp_yeah_sp
68          sil_you_know     20          you_know_and
53          yeah_sp_yeah     20          sil_and_then
46          you_know_sp      19          uh_sp_yeah
43          you_know_what    19          sil_and_I
38          know_what_I      18          I_mean_sil
38          you_know_sil     18          yeah_yeah_sil
37          a_lot_of         17          you_know_I
37          sp_um_sp         17          but_uh_sp
37          sp_yeah_sil      16          sp_and_uh
36          sp_um_sil        16          and_uh_sp
36          what_I_mean      16          sp_and_then
27          sil_yeah_I       16          sp_yeah_yeah
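The frequency analyses behind Tables 3 and 4 amount to unigram and trigram counting over boundary-padded token sequences. A minimal sketch, assuming plain whitespace tokenisation and using only the two example utterances quoted earlier; real "sp" short-pause tokens would come from the forced alignment:

```python
from collections import Counter

# The two example utterances from the paper, lower-cased for counting.
utterances = [
    "yeah it's it's a significant amount of swelling um more than like i'd say a bruise",
    "yeah exactly and even like uh i'll go see bad movies that i know will be bad um just to see why they're so bad",
]

unigrams, trigrams = Counter(), Counter()
for utt in utterances:
    tokens = ["sil"] + utt.split() + ["sil"]  # pad with utterance boundaries
    unigrams.update(tokens[1:-1])             # count words, not boundaries
    trigrams.update(zip(tokens, tokens[1:], tokens[2:]))

print(unigrams.most_common(5))
print(trigrams.most_common(3))
```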

5.2. Prosodic properties

HMM-based speech synthesis generally assumes recordings of consistently spoken material, where the main difference between utterances is the sequences of phonemes. Conversational speech, however, has more segmental and prosodic variation. An example of the consistency in carefully read aloud speech and the variation in conversational speech is the speaking rate, shown in Fig. 1.

Whereas some of the variation in conversational speech can probably be attributed to less careful delivery, much of the variation has other explanations related to the language used (Shriberg and Stolcke, 1996; Bell et al., 2003; Aylett and Turk, 2006). Fig. 2 shows that whereas our speaker had approximately the same pitch range and F0 distribution when reading aloud and when speaking in a conversation, the conversation had more variation in utterance final F0, probably because of more variation in speech acts (questions, confirmations, etc.) and speaker state (enthusiastic, doubtful, polite, etc.).
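The speaking-rate measure used for Fig. 1 is syllables per second over pause-delimited stretches of speech. A minimal sketch, assuming the chunk boundaries and syllable counts come from the forced alignment; the tuples below are invented:

```python
def speaking_rate(chunks):
    """chunks: list of (start_s, end_s, n_syllables) between silent pauses.
    Returns the mean syllables-per-second rate over the chunks."""
    rates = [n / (end - start) for start, end, n in chunks if end > start]
    return sum(rates) / len(rates)

chunks = [(0.0, 1.2, 7), (1.5, 2.9, 8)]  # hypothetical alignment output
print(f"{speaking_rate(chunks):.1f} syllables/s")
```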

5.3. Segmental properties

Fig. 3 shows the average centre frequencies of automatically extracted first and second formants of vowels in the conversational and read aloud data. In this figure, we cannot see a clear tendency towards a reduced vowel space for the conversational speech, contrary to Nakamura et al. (2008). There were two main reasons for this. The first reason is that the data did not contain many heavily reduced pronunciations. The second reason is that the forced alignment process has explicitly dealt with vowel reduction and has assigned a schwa in reduced variants of the, you, but, etc. However, there were still some reduction tendencies observable, and an example of different vowel formant values in fully pronounced and reduced it is shown in Fig. 4.

We can also see some differences related to language differences or lexicon errors. The vowel in the filled pauses was stipulated in the pronunciation lexicon as an /ʌ/, but as Fig. 3 shows, this was not correct, and was the main reason for the difference between read aloud and conversational /ʌ/. After we manually verified the automatically extracted formant values for the vowel /u/, we found that the formants were not well estimated for the word you, and probably not for other short function words, like to or do, which had a higher proportion in the conversational data.

Fig. 1. Speaking rate for utterances with 5–10 words in the conversational and read aloud data. The solid line is the median, box borders show the upper and lower quartiles, and the whiskers are drawn to 1.5 times the inter-quartile range. The speaking rate of the conversational and read aloud data was measured for speech sequences delimited by silent pauses, as syllables per second. The variation in length of utterances was larger in the conversational data, and it is questionable whether speaking rate is a relevant measure for backchannels; therefore the speaking rate was only measured for utterances that were five to ten words long.

Fig. 2. F0 distribution in read aloud and conversational speech. Left figure shows F0 variation across the whole utterance. Right figure shows F0 variation at the end of utterances. Due to uncertainties of F0 at the end of utterances, the utterance final F0 was measured at the tenth last voiced frame; frame length was 5 ms.

Fig. 3. Mean formant values (F1 and F2) for American English monophthongs, denoted with IPA symbols, in read aloud and conversational speech. The mean formant values for the two filled pause types (um and uh) in the conversational speech are also plotted.

Fig. 4. Mean formant values (F1 and F2) for fully pronounced and reduced vowel in the word it, represented as it1 (full) and it0 (reduced) in the figure.
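A formant analysis like the one behind Figs. 3 and 4 could be reproduced along the following lines, assuming the Parselmouth (Praat) Python package; the paper does not state which formant tracker was actually used, so this is an illustrative sketch only.

```python
import parselmouth  # pip install praat-parselmouth

def mean_f1_f2(wav_path, start_s, end_s, step=0.005):
    """Average F1/F2 (Hz) over one aligned vowel segment, skipping
    frames where the tracker returns NaN."""
    formant = parselmouth.Sound(wav_path).to_formant_burg(time_step=step)
    times = [start_s + i * step for i in range(int((end_s - start_s) / step))]
    f1 = [v for v in (formant.get_value_at_time(1, t) for t in times) if v == v]
    f2 = [v for v in (formant.get_value_at_time(2, t) for t in times) if v == v]
    return sum(f1) / len(f1), sum(f2) / len(f2)

# usage: mean_f1_f2("utt001.wav", 0.43, 0.55) for a vowel aligned at 430-550 ms;
# per-vowel means would then be pooled over all segments with the same label.
```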

6. Conversational and read aloud synthetic voices

The context dependent phonemes introduced in Section 4.4 for the read aloud and conversational speech were used to build one "spontaneous" and one "read aloud" synthetic voice with the HTS system described in Section 3. These voices are henceforth referred to as Spontaneous HTS and Read Aloud HTS, respectively.

The size of the clustered decision trees reflects the amount and complexity of the speech data. Table 5 shows that despite there being less data for the Spontaneous HTS than the Read Aloud HTS voice, the clustered duration tree was larger for the Spontaneous HTS due to more variation needing to be accounted for. This contrasts with, for example, the mel-cepstral tree, where the Read Aloud HTS tree was larger due to more data and better quinphone coverage.

Table 5. Number of leaf nodes in the clustered duration, logF0, mel-cepstral and aperiodicity trees, for the Spontaneous HTS (SP) and Read Aloud HTS (RD) voices. The ratio (SP/RD) shows the relative tree sizes.

               Spon. (SP)   Read (RD)   Ratio (SP/RD)
Duration       1699         1602        1.06
logF0          4618         5248        0.88
Mel-cepstral   837          1405        0.60
Aperiodicity   994          1543        0.64

6.1. Blending read aloud and conversational speech

Our first impression of the quality of the Spontaneous HTS voice was that whereas the discourse markers and filled pauses could be synthesised with quite high quality, the quality of the propositional content was often less good. To increase the phonetic coverage, and thereby improve general segmental and prosodic quality, while still preserving important conversational characteristics, the conversational and read aloud data were blended in the training and clustering of HMM-based models with a method previously used to blend and preserve different "emotional" speaking styles (Yamagishi et al., 2005).

All the conversational and read aloud data were pooled in training, and an additional context, speaking style (spontaneous or read), was added to the context dependent phoneme descriptions in Section 4.4. In the training of the context dependent HMM-based models, the speaking style context was then available as a question in the decision tree based clustering, and was automatically selected in the clustering to share mutual and sparse phonetic properties between conversational and read aloud speech, while preserving frequent and distinguishing characteristics. The speaking style context was automatically selected as an important feature throughout the clustering process. For example, in the clustering of the duration tree, a split was made almost immediately based on the difference in duration of the syllable nucleus in conversational and read aloud speech, whereas for the excitation and spectral parts the sharing or splitting seemed to be more complex.

During synthesis with this voice, one of the speaking styles was selected by setting the speaking style context to either spontaneous or read aloud, and then speech parameters were generated. Henceforth, utterances generated in this way are referred to as from the Blend.Spon voice and Blend.Read voice, respectively.
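A sketch of this blending mechanism: each training label gains a speaking-style field, and a single extra yes/no question over that field is exposed to the decision-tree clustering, which is then free to split on style or to share data across styles. The label and question syntax below is HTS-like but invented for illustration.

```python
def add_style(label: str, style: str) -> str:
    """Append the speaking-style context ('spon' or 'read') to a
    full-context label; same idea for every label in the pooled data."""
    return f"{label}/STYLE:{style}"

# One extra clustering question the tree may (or may not) use at a node;
# the QS syntax mimics HTS question files but the name/pattern are invented.
QS_STYLE = 'QS "Is-Spontaneous" {*/STYLE:spon}'

# At synthesis time the same field, set utterance-wide, selects the
# Blend.Spon or Blend.Read behaviour:
print(add_style("s^p-ɔ+r=t/NUC:ɔ", "spon"))
```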

6.2. Test set for the synthetic voices

A test set of synthetic speech was generated from each of the synthetic voices: the Spontaneous HTS, the Read Aloud HTS, the Blend.Spon and the Blend.Read voice. The context dependent phonemes for the synthetic speech in the test set were obtained from unused transcripts of the same conversation and speaker as in Section 4. The benefit of using this material as test sentences was that it was from the same speaker as in the training data, hence representing his way of speaking, and it was rich in conversational speech phenomena: nearly one hundred filled pauses, eighty-one yeah and at least a few instances each of e.g. okay, right and oh.

This gave us a set of 169 utterances for each synthetic voice that were rich in conversational phenomena, and had identical phonemic sequences and linguistic analysis, hence allowing a linguistically balanced acoustic comparison.

6.3. Segmental and prosodic properties

In Sections 5.2 and 5.3 we showed some segmental and prosodic differences between read aloud and conversational speech. In this section we will show segmental and prosodic differences between the synthetic voices built with either conversational or read aloud speech, and the blended voice built with both.

Fig. 5. Mean formant values (F1 and F2) for American English monophthongs, denoted with IPA symbols, in 169 utterances with the same phonemic sequence, synthesised with four different synthetic voices. Some of the natural vowels from Fig. 3 are provided as a reference.

Fig. 6. Speaking rate for 169 synthetic utterances with the same phonemic sequence, synthesised with four different synthetic voices.

Fig. 5 shows a comparison of the first two formants in the synthetic speech. For the Spontaneous HTS and the Read Aloud HTS, the mean formant values were generally similar to each other, and similar to the natural speech. As in the natural speech, there was no strong tendency towards a reduced vowel space in the Spontaneous HTS compared to the Read Aloud HTS, and again the manual removal of heavily reduced pronunciations from the conversational data probably contributed to this. We can also see that the extracted vowel formants were closer to each other for the blended voice than in the Spontaneous HTS and Read Aloud HTS or in the natural read aloud and conversational speech.3 This pattern was also preserved in the synthetic speaking rate, shown in Fig. 6, where the Spontaneous HTS and Read Aloud HTS preserved the speaking rate differences in the natural speech, but the blending resulted in more similar speaking rates.

On the other hand, both the duration and vowel quality of filled pauses in natural conversational speech were to a large extent preserved in the Spontaneous HTS, as well as in the Blend.Spon voice, and different from the vowel quality and duration in the Read Aloud HTS (see Figs. 7 and 8). The duration of uh in the Read Aloud HTS was more similar to the natural filled pauses than that of um, but there were no filled pauses in the read aloud speech data, and the long median duration was due to the long duration of the words ah (mean = 260 ms) and oh (mean = 205 ms) in the "conversational style" text in the read aloud coverage material, e.g. in the sentence "Ah well, maybe more next week.".

3 We also see that the /u/ vowel in non-reduced you, to, do and doing, where due to coarticulation F2 starts off high, was difficult to automatically extract formants from in all the synthetic voices. In the /ɔ/ vowel, as in more, long or talk, the closeness of F1 and F2 made the automatic formant extraction unreliable.

In general, there was more variation in the natural speech than in either of the synthetic voices. But Fig. 10 shows an utterance initial filled pause where the Spontaneous HTS had segmental and prosodic properties similar to a natural reference sample, and hence conveyed a similar degree of hesitation, whereas the segmental and prosodic properties of the um from the Read Aloud HTS were different, did not convey any hesitation and therefore did not sound like a filled pause. Similarly, many discourse markers were also generally well preserved in both the Spontaneous HTS and Blend.Spon. Fig. 9 shows an utterance initial yeah, followed by a short pause, from natural and synthetic speech, where the Spontaneous HTS had segmental and prosodic properties that were similar to the natural reference sample, whereas the yeah from the Read Aloud HTS had a completely different F0 contour shape and a longer duration of the vowel part of the yeah; despite the phonemic sequence being intelligible, it came across as almost meaningless.

Fig. 7. Duration of the vowel in the filled pauses (um and uh), and in the reference word but, for natural and synthetic speech. But was used as reference because it was represented in the lexicon as having the same vowel quality as the filled pauses, and existed in both the natural and synthetic speech.

Fig. 8. Vowel quality of filled pauses in conversational speech, Spontaneous HTS and Blend.Spon. (Vowel qualities of the filled pauses in Read Aloud HTS are not plotted, but were more different; um F1: 661/F2: 1342, uh F1: 589/F2: 1399.)

7. Perceptual evaluation

7.1. Listening test design

People can distinguish perceptually between natural spontaneous and natural read aloud speech (Blaauw, 1994; Laan, 1997), and we designed a listening test to evaluate whether such a distinction could also be observed for HMM-based synthetic voices built with spontaneous conversational or read aloud speech. In particular, we designed the evaluation to investigate the impact on speech quality and speaking style of the integration of conversational characteristics, discourse markers and filled pauses, with propositional content.

"Naturalness" is conventionally used in speech synthesis to evaluate speech quality, but evaluating a spontaneous or conversational speaking style has been less explored. We suspect that when listeners are asked to judge the quality of synthetic speech using a positive versus negative distinction they do so in a quite general way, not paying particular attention to the specific feature they have been asked to judge. To investigate this issue further, the listening test was designed to determine whether naturalness could be evaluated separately from speaking style. To do so, listeners were divided into two groups and the two groups were given one criterion each to evaluate: quality or speaking style.

Fig. 9. A yeah in the "same" utterance: natural (left), Spontaneous HTS (mid), and Read Aloud HTS (right).

Fig. 10. The filled pause um in the "same" utterance: natural (left), Spontaneous HTS (mid), and Read Aloud HTS (right).

One group of listeners was requested to evaluate: "Which utterance sounds more like natural speech?". Another group of listeners was requested to evaluate: "Which utterance has a more conversational speaking style?". The listeners who were asked about the conversational style were also explicitly requested to disregard the speech quality: "Please try and disregard the speech quality, and focus on the speaking style.".

Test sentences for the listening test were randomly selected from the set of 169 synthetic utterances, but with restrictions on the syntactic and semantic content, so that they contained at least two discourse markers or filled pauses and were between 5 and 15 words long in total. All test sentences are shown in Table 6.

To evaluate the integration of discourse markers and filled pauses with propositional content in synthetic speech, the listening test compared pairs of utterances synthesised with the Spontaneous HTS to utterances synthesised with the Read Aloud HTS. To evaluate the contribution of discourse markers and filled pauses to quality and speaking style, utterances with these conversational characteristics synthesised with the Blend.Spon were compared to more conventional text-to-speech utterances where discourse markers, filled pauses and disfluencies were removed from the original sentence and synthesised with the Blend.Read voice.

To avoid a scenario where it was obvious from the text alone that the discourse markers and filled pauses had been removed from one of the utterances, we always compared utterances with differing lexical content. For example: if we had compared (A) so let's see, but um, yeah, nothing exciting, to (B) let's see, but nothing exciting, listeners could quite easily identify that one utterance had the same content as the other plus/minus a few conversational markers yeah, um, oh, etc. Whereas when we compared the utterances (A) right, oh you have to to transcribe all this, to (B) let's see, but nothing exciting, the large lexical differences would make it harder to identify that we had just removed a few words, and hence listeners would evaluate speaking style and not text style.
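Significance for paired preference results of this kind is commonly assessed with a two-sided binomial (sign) test against the 50/50 chance level; the paper does not name its test, so the SciPy sketch below, with invented counts, is illustrative only.

```python
from scipy.stats import binomtest

# Hypothetical counts: judgements preferring one voice out of all judgements.
prefer_spon, n_judgements = 130, 200
result = binomtest(prefer_spon, n_judgements, p=0.5)  # two-sided by default
print(f"preference = {prefer_spon / n_judgements:.0%}, p = {result.pvalue:.4f}")
```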

7.2. Listening test results

The results in this section were reported in (Andersson et al., 2010b) and are summarised here. The result of the first listening test (to evaluate the integration of discourse markers and filled pauses) is shown in Fig. 11. We can see that the perceptual judgements were significantly in favour of the Spontaneous HTS, due to the better realisations of the discourse markers and filled pauses, and thereby also a better utterance prosody. The result of the second listening test (to compare utterances with and without conversational characteristics) is shown in Fig. 12. We see that when these conversational characteristics were removed from the test sentences synthesised with the Blend.Read voice, and compared to sentences with discourse markers, filled pauses and disfluencies synthesised with the Blend.Spon voice, the results were more in favour of the Blend.Read voice.


Table 6. All the utterance pairs in the perceptual evaluation. The pairs for the Blend.Spon and Blend.Read evaluation are shown in the Text column. The utterances for the Spontaneous HTS are the same as for the Blend.Spon. The utterances for the Read Aloud HTS can be derived by replacing the Blend.Read utterance with the next Blend.Spon utterance, e.g. the Read Aloud HTS utterance for pair 3 would be yeah, x-men is cool, yeah instead of x-men is cool. Commas indicate where utterance internal silences were located.

1.  Blend.Spon: Right, yeah that that could make you kind of a freak
    Blend.Read: Boxing for me was more, it was far more challenging
2.  Blend.Spon: You know um boxing for me was more, uh it was far more challenging
    Blend.Read: Well not yet
3.  Blend.Spon: Uh no, no well not, yet, um
    Blend.Read: X-men is cool
4.  Blend.Spon: Yeah, x-men is cool, yeah
    Blend.Read: You have to transcribe all this
5.  Blend.Spon: Right, oh you have to to transcribe all this
    Blend.Read: Let's see, but nothing exciting
6.  Blend.Spon: So let's see, but um, yeah, nothing exciting
    Blend.Read: When you go, shit 'cause they didn't expect that
7.  Blend.Spon: You know like when a, you go oh shit 'cause they didn't expect that
    Blend.Read: A lot of people think I am in my late twenties
8.  Blend.Spon: Um, like a lot of people think I am in my late twenties
    Blend.Read: Mid-life crisis, it's gonna hit eventually, pretty quickly
9.  Blend.Spon: So, it's uh, yeah, mid-life crisis got it's gonna hit eventually, pretty quickly so
    Blend.Read: I could give a shit less, I'm just happy to get a meal
10. Blend.Spon: Yeah, I could give a shit less um I'm just happy to get a meal
    Blend.Read: But even that I can give a shit less
11. Blend.Spon: Um, but even that like, I can give a shit less, you know what I mean
    Blend.Read: You don't want that to happen
12. Blend.Spon: Oh yeah you don't want that to happen
    Blend.Read: We quit, the movie ended
13. Blend.Spon: Well we quit I mean you know the movie ended
    Blend.Read: I just fill in my schedule
14. Blend.Spon: Yeah I just fill in my schedule so it's uh
    Blend.Read: I have, I tried once when I was a kid
15. Blend.Spon: No I have well you know I tried once when I was a kid
    Blend.Read: That could make you kind of a freak


This is an interesting tendency and, therefore, in the following sections we will take a closer look at why removing a few words had such an impact on the perception of naturalness and speaking style.

Fig. 11. The bars show the percentages of the listeners' preferences for naturalness and conversational style when comparing the Spontaneous HTS to the Read Aloud HTS when synthesising utterances with discourse markers and filled pauses.

Fig. 12. The bars show the percentages of the listeners' preferences for naturalness and conversational style when comparing utterances with conversational characteristics (Blend.Spon) to more fluent utterances (Blend.Read).

7.3. Naturalness

Figs. 13 and 14 show the listeners' perceived naturalness for individual utterances behind the results summarised in Figs. 11 and 12. By listening to the utterances we identified a few factors that probably contributed to the perceived difference in naturalness between utterances in Figs. 13 and 14.

Some factors were easily identified in Fig. 14: e.g. the prominent local pitch movement errors in the Blend.Spon utterances 10 and 11, the awkward prosody in utterance 15, or the too prominent word repetition in utterance 5. This was probably to some extent a reflection of our underspecified analysis and representation of segmental and prosodic properties in conversational speech, including segmentation, prosodic phrasing and disfluencies. But the word repetition in utterance 5 was prominent also in the original natural utterance, and the listeners' judgements were perhaps influenced by the presence of an audible disfluency, a factor that we also discuss in the next paragraph.

An important general tendency in the perceived naturalness between Figs. 13 and 14 was that (1) the conversational characteristics sounded bad with the Read Aloud HTS and (2) when the discourse markers, filled pauses and disfluencies were removed from the synthesised sentences, the Blend.Read utterances sounded substantially better. The removal of discourse markers, filled pauses and disfluencies made many utterances more grammatical and more fluent than the conversational utterances, e.g. utterance pairs 2, 3, 4, 9, and 14, which could have contributed to making the perceived differences in naturalness larger than they were and contributed to the differences for these utterance pairs between Figs. 13 and 14.

Fig. 13. Listeners' perception of naturalness for individual utterances when comparing sentences with discourse markers and filled pauses synthesised with the Spontaneous HTS or Read Aloud HTS voices.

Fig. 14. Listeners' perception of naturalness for individual utterances when comparing sentences synthesised with the blended voice. The Blend.Spon bar shows preference for utterances with conversational characteristics, and the Blend.Read bar shows preference for more fluent utterances.

7.4. Conversational speaking style

The perceptual evaluation was designed to see if we could evaluate speaking style separately from naturalness. The questions about naturalness and speaking style were therefore asked of separate groups of listeners.

Fig. 15. Plot of the listeners' preferences of naturalness and conversational speaking style for the Blend.Spon voice. Spearman's rho showed a significant (p < 0.05) correlation of 0.86.

Fig. 16. Individual listeners' perception of conversational speaking style when comparing sentences synthesised with the blended voice. The Blend.Spon bar shows preference for utterances with discourse markers, filled pauses and disfluencies, and the Blend.Read bar shows preference for more fluent utterances.


The speaking style results, shown in Fig. 11, were significantly in favour of the Spontaneous HTS, but so were the results for naturalness, and the correlation between them was significant (ρ = 0.72, p < 0.05). Our interpretation is that the difference between the voices in Fig. 11 was mainly a reflection of the difference in naturalness rather than speaking style.

Fig. 12 shows the charts for naturalness and speaking style, where we see differences, but there was no significant difference between the perceived speaking styles of the two voices. However, the correlation between the two groups' perception of naturalness and speaking style was stronger (ρ = 0.86, p < 0.05), as shown in Fig. 15. This indicates that for an utterance to be perceived as having a conversational speaking style, it also needs to be perceived as fairly natural. Even without discourse markers and filled pauses, the test sentences contained other conversational, or casual, characteristics, e.g. ... I could give a shit less ..., ... cool or ... kind of a freak, which probably contributed to making the evaluation of speaking style more difficult for the listeners.
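The reported correlation can be reproduced in form with SciPy's Spearman rank correlation over the two groups' per-utterance preference rates; the values below are invented placeholders, not the paper's data.

```python
from scipy.stats import spearmanr

# Hypothetical per-utterance preference rates for the Blend.Spon voice,
# one value per test pair, from the naturalness and style listener groups.
naturalness = [0.2, 0.35, 0.4, 0.5, 0.55, 0.6, 0.3, 0.45,
               0.25, 0.15, 0.1, 0.65, 0.5, 0.4, 0.3]
style = [0.25, 0.4, 0.35, 0.55, 0.6, 0.65, 0.35, 0.5,
         0.3, 0.2, 0.15, 0.6, 0.45, 0.45, 0.35]
rho, p = spearmanr(naturalness, style)
print(f"rho = {rho:.2f}, p = {p:.4f}")
```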

Fig. 16 shows individual listeners' perception of conversational speaking style, and we can see that there were at least two different interpretations of speaking style, where listeners a–d interpreted speaking style differently from listeners l–p.

8. Conclusions

We have shown that speech from a spontaneous conversation was instrumental in building an HMM-based synthetic voice that could integrate discourse markers, filled pauses and propositional content into more natural sounding utterances than an HMM-based voice built with carefully read aloud sentences. It was able to capture distinguishing segmental and prosodic properties of conversational speech phenomena, despite worse phoneme alignment and underspecified and erroneous linguistic analysis.

The strong correlation in the perceptual evaluation between naturalness and speaking style showed that it is difficult to evaluate these aspects separately in synthetic speech. A better alternative for future evaluations of conversational speech synthesis than naturalness or speaking style could be to evaluate agreement, disagreement, hesitation, etc. in a discourse context, as in (Lasarcyk and Wollerman, 2010).

The quality and expressiveness of the synthetic voices achieved from a moderate amount of conversational speech data provide an encouraging start for the continued research of conversational speech in HMM-based speech synthesis. More sophisticated analysis and representation of conversational speech phenomena, as well as developments of the training/generation framework of HMM-based speech synthesis, would probably allow further improvements in synthesising the rich variation of speech phenomena in human conversation.

Acknowledgements

The authors are grateful to David Traum and Kallirroi Georgila at the USC Institute for Creative Technologies (http://www.ict.usc.edu) for making the speech data available to us. The first author is supported by Marie Curie Early Stage Training Site EdSST (MEST-CT-2005-020568).

References

Adell, J., Bonafonte, A., Escudero-Mancebo, D., 2008. On the generation of synthetic disfluent speech: local prosodic modifications caused by the insertion of editing terms. In: Proc. Interspeech, Brisbane, Australia, pp. 2278–2281.

Adell, J., Bonafonte, A., Escudero-Mancebo, D., 2010. Modelling filled pauses prosody to synthesise disfluent speech. In: Proc. Speech Prosody, vol. 100624, Chicago, USA, pp. 1–4.

Andersson, S., Georgila, K., Traum, D., Aylett, M.P., Clark, R.A.J., 2010a. Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection. In: Proc. Speech Prosody, vol. 100116, Chicago, USA, pp. 1–4.

Andersson, S., Yamagishi, J., Clark, R., 2010b. Utilising spontaneous conversational speech in HMM-based speech synthesis. In: Proc. SSW7, Kyoto, Japan, pp. 173–178.

Aylett, M.P., Pidcock, C.J., 2007. The CereVoice characterful speech synthesiser SDK. In: Proc. AISB'07, Newcastle Upon Tyne, UK, pp. 174–178.

Aylett, M., Turk, A., 2006. Language redundancy predicts syllabic duration and the spectral characteristics of vocalic syllable nuclei. J. Acoust. Soc. Amer. 119, 3048–3058.

Badino, L., Andersson, S., Yamagishi, J., Clark, R.A.J., 2009. Identification of contrast and its emphatic realisation in HMM based speech synthesis. In: Proc. Interspeech, Brighton, UK, pp. 520–523.

Bell, A., Jurafsky, D., Fossler-Lussier, E., Girand, C., Gregory, M., Gildea, D., 2003. Effects of disfluencies, predictability, and utterance position on word form variation in English conversation. J. Acoust. Soc. Amer. 113 (2), 1001–1024.

Blaauw, E., 1994. The contribution of prosodic boundary markers to the perceptual difference between read and spontaneous speech. Speech Comm. 14, 359–375.

Cadic, D., Segalen, L., 2008. Paralinguistic elements in speech synthesis. In: Proc. Interspeech, Brisbane, Australia, pp. 1861–1864.

Campbell, N., 2006. On the structure of spoken language. In: Proc. Speech Prosody, Dresden, Germany.

Campbell, N., 2007. Towards conversational speech synthesis; lessons learned from the expressive speech processing project. In: Proc. SSW6, Bonn, Germany, pp. 22–27.

Clark, H., Fox Tree, J., 2002. Using uh and um in spontaneous speaking. Cognition 84 (1), 73–111.

Gravano, A., Benus, S., Chavez, H., Hirschberg, J., Wilcox, L., 2007. On the role of context and prosody in the interpretation of 'okay'. In: Proc. ACL, Prague, Czech Republic, pp. 800–807.

Gustafsson, K., Sjolander, K., 2004. Voice creation for conversational fairy-tale characters. In: Proc. SSW5, Pittsburgh, USA, pp. 145–150.

Jurafsky, D., Shriberg, E., Fox, B., Curl, T., 1998. Lexical, prosodic, and syntactic cues for dialog acts. In: COLING-ACL 98 Workshop on Discourse Relations and Discourse Markers.

Kawahara, H., Masuda-Katsuse, I., Cheveigne, A., 1999. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of repetitive structure in sounds. Speech Comm. 27, 187–207.

Kawahara, H., Estill, J., Fujimura, O., 2001. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In: Proc. 2nd MAVEBA, Firenze, Italy.

King, S., Karaiskos, V., 2009. The Blizzard Challenge 2009. In: Proc. The Blizzard Challenge Workshop, Edinburgh, UK.

Laan, G., 1997. The contribution of intonation, segmental durations, and spectral features to the perception of a spontaneous and a read speaking style. Speech Comm. 22, 43–65.

Lasarcyk, E., Wollerman, C., 2010. Do prosodic cues influence uncertainty perception in articulatory speech synthesis? In: Proc. SSW7, Kyoto, Japan, pp. 230–235.

Lee, C.-H., Wu, C.-H., Guo, J.-C., 2010. Pronunciation variation generation for spontaneous speech synthesis using state-based voice transformation. In: Proc. ICASSP, Dallas, USA, pp. 4826–4829.

Nakamura, M., Iwano, K., Furui, S., 2008. Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance. Comput. Speech Lang. 22, 171–184.

Odell, J., 1995. The Use of Context in Large Vocabulary Speech Recognition. PhD thesis, Cambridge University.

O'Shaughnessy, D., 1992. Recognition of hesitations in spontaneous speech. In: Proc. ICASSP, San Francisco, USA, pp. 521–524.

Romportl, J., Zovato, E., Santos, R., Ircing, P., Relano Gil, J., Danieli, M., 2010. Application of expressive TTS synthesis in an advanced ECA system. In: Proc. SSW7, Kyoto, Japan, pp. 120–125.

Schiffrin, D., 1987. Discourse Markers. Cambridge University Press.

Shriberg, E., 1999. Phonetic consequences of speech disfluency. In: Proc. ICPhS, San Francisco, USA, pp. 619–622.

Shriberg, E., Stolcke, A., 1996. Word predictability after hesitations: a corpus-based study. In: Proc. ICSLP, Philadelphia, USA, pp. 1868–1871.

Sundaram, S., Narayanan, S., 2002. Spoken language synthesis: experiments in synthesis of spontaneous monologues. In: Proc. 2002 IEEE Speech Synthesis Workshop, Santa Monica, USA, pp. 203–206.

Tokuda, K., Zen, H., Black, A.W., 2002. An HMM-based speech synthesis system applied to English. In: Proc. 2002 IEEE Speech Synthesis Workshop, Santa Monica, USA, pp. 227–230.

Traum, D.R., Swartout, W., Gratch, J., Marsella, S., 2008. A virtual human dialogue model for non-team interaction. In: Dybkjaer, L., Minker, W. (Eds.), Recent Trends in Discourse and Dialogue. Springer, Antwerp, Belgium, pp. 45–67.

Wahlster, W. (Ed.), 2000. Verbmobil: Foundations of Speech-to-Speech Translation. Springer-Verlag, Berlin, Heidelberg.

Yamagishi, J., Onishi, K., Masuko, T., Kobayashi, T., 2005. Acoustic modelling of speaking styles and emotional expressions in HMM-based speech synthesis. IEICE Trans. Inform. Systems E88-D (3), 502–509.

Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P., 2006. The HTK Book. Cambridge University Engineering Department.

Zen, H., Toda, T., Nakamura, M., Tokuda, K., 2007. Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Trans. Inform. Systems E90-D (1), 325–333.

Zen, H., Tokuda, K., Black, A.W., 2009. Statistical parametric speech synthesis. Speech Comm. 51 (11), 1039–1064.