Int J Speech Technol (2010) 13: 175–188. DOI 10.1007/s10772-010-9077-x

Phone duration modeling: overview of techniques and performance optimization via feature selection in the context of emotional speech

Alexandros Lazaridis · Todor Ganchev · Theodoros Kostoulas · Iosif Mporas · Nikos Fakotakis

Received: 20 May 2010 / Accepted: 11 July 2010 / Published online: 30 July 2010
© Springer Science+Business Media, LLC 2010

Abstract Accurate modeling of prosody is a prerequisite for the production of synthetic speech of high quality. Phone duration, as one of the key prosodic parameters, plays an important role in the generation of natural-sounding emotional synthetic speech. In the present work we offer an overview of various phone duration modeling techniques and consequently evaluate ten models, based on decision trees, linear regression, lazy-learning algorithms and meta-learning algorithms, which over the past decades have been successfully used in various modeling tasks. Furthermore, we study the opportunity for performance optimization by applying two feature selection techniques, RReliefF and Correlation-based Feature Selection, on a large set of numerical and nominal linguistic features extracted from text, such as phonetic, phonological and morphosyntactic ones, which have been reported successful on the phone and syllable duration modeling task. We investigate the practical usefulness of these phone duration modeling techniques on a Modern Greek emotional speech database, which consists of five categories of emotional speech: anger, fear, joy, neutral, sadness.

A. Lazaridis (✉) · T. Ganchev · T. Kostoulas · I. Mporas · N. Fakotakis
Artificial Intelligence Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rion-Patras 26500, Greece
e-mail: [email protected]

T. Ganchev
e-mail: [email protected]

T. Kostoulas
e-mail: [email protected]

I. Mporas
e-mail: [email protected]

N. Fakotakis
e-mail: [email protected]

The experimental results demonstrated that feature selection significantly improves the accuracy of phone duration prediction regardless of the type of machine learning algorithm used for phone duration modeling. Specifically, in four out of the five categories of emotional speech, feature selection contributed to the improvement of phone duration modeling when compared to the case without feature selection. The M5p trees-based phone duration model was observed to achieve the best phone duration prediction accuracy in terms of RMSE and MAE.

Keywords Phone duration modeling · Statistical modeling · Feature selection · Emotional speech · Text-to-speech synthesis

1 Introduction

In the field of speech synthesis, the main objective is the improvement of the quality of synthesized speech. The quality of synthetic speech is measured with respect to two main criteria: its naturalness and its intelligibility. The naturalness of synthetic speech expresses its similarity to human speech (Dutoit 1997). The intelligibility of synthetic speech reflects the level of difficulty for the listener to understand its semantic contents (Dutoit 1997). Over the last years, there has been an urge towards researching and modeling the factors that affect human speech, for the purpose of improving the quality of synthetic speech.

Accurate modeling and control of prosody is a very important aspect of speech synthesis and, more generally, of the field of speech processing. Prosody refers to functions and aspects of speech, such as emphasis, intent, attitude or emotional state, that cannot be encoded by grammar.


Prosody is shaped by the relative level of the fundamental frequency, the intensity and the duration of the pronounced phones (Dutoit 1997; Möbius and Santen 1996). Consequently, each of these aspects is crucial for robust prosody modeling.

In the present article, we focus on the task of phone duration modeling, which is essential in speech synthesis, since segmental duration affects the structure of utterances, altering their naturalness and intelligibility. The production of highly natural synthetic speech is therefore directly correlated with the construction of proper phone duration models. For achieving this objective, the determination of the length of phones is essential, and it can be achieved through accurate modeling of the factors that affect it.

Over the years, many factors affecting phone duration, and various methods for the phone duration modeling task, have been studied in the literature (Barbosa and Bailly 1994; Bell et al. 2003; Crystal and House 1988; Gregory et al. 2001; Riley 1992; van Santen 1992). Several domains of speech representation at the phonetic, phonological and morphosyntactic levels affect the duration of phones. These levels can be incorporated in phone duration modeling by extracting the features related to them.

Emotional speech synthesis depends on knowledge extracted by analyzing natural speech, and particularly on knowledge concerning the influence of various prosodic features on emotional speech perception. Following the steps of speech synthesis of neutral speech, several approaches to emotional speech synthesis have been introduced over the years, such as formant synthesis (Burkhardt and Sendlmeier 2000; Murray and Arnott 1995), diphone concatenation (Heuft et al. 1996; Rank and Pirker 1998) and corpus-based unit selection (Black 2003; Iida et al. 2000). Prosody modeling is already employed in these approaches in order to synthesize certain categories of emotional speech or to generate more expressive speech by implementing emotional prosody in TTS systems (Inanoglu and Young 2009; Jiang et al. 2005; Tesser et al. 2005). However, we deem that an in-depth investigation of segmental duration modeling, particularly in the context of emotional speech, along with the analysis of the prosodic features which play a very important role in emotional synthetic speech, would contribute both to the enhancement of the quality of emotional synthetic speech and to the achievement of more expressive synthetic speech.

In concept-learning problems, some of the features computed from the data are often not really related to the target concept. Feature selection algorithms are employed to overcome this problem by selecting only the most appropriate features with respect to the target concept, limiting the computational cost and improving the concept quality and the accuracy of the representation (Kira and Rendell 1992). Most of the algorithms used in machine learning are designed, in theory, to be able to learn which features are the most appropriate for the target concept each time and to discard the rest (Witten and Frank 2005). Nonetheless, in practice, including irrelevant features in a dataset often deteriorates the effectiveness of the algorithms, since the latter are not always capable of eliminating them (Witten and Frank 2005).

In the present work, we overview various phone duration modeling techniques and consequently evaluate ten models, based on well-known and frequently used machine learning algorithms, serving here as phone duration models. These machine learning algorithms have been reported successful on the phone or syllable duration modeling task and are members of four distinct groups, namely: decision trees (DT) (Kääriäinen and Malinen 2004; Quinlan 1992; Wang and Witten 1997), lazy-learning algorithms (Aha et al. 1991; Atkeson et al. 1996), meta-learning algorithms (Breiman 1996; Friedman 2002) and linear regression (LR) (Witten and Frank 2005). Here, we investigate their performance for the needs of phone duration modeling in the context of emotional speech. For this purpose, we congregate all the features used so far for the needs of phone or syllable duration modeling into a large feature set, which is then used in the phone duration modeling task. Furthermore, we employ two feature ranking and selection algorithms, the RReliefF algorithm (Robnik-Sikonja and Kononenko 1997) and the Correlation-based Feature Selection (CFS) (Hall 1999), which are applied on the large feature set in order to investigate the relevance of the individual attributes with respect to the phone duration modeling problem. Irrelevant and less relevant features were eliminated, aiming at performance improvement of phone duration modeling in the context of emotional speech. In all experiments we made use of a Modern Greek emotional speech database, which consists of five categories of emotional speech: anger, fear, joy, neutral and sadness, and is manually annotated according to the Gr-ToBI system (Arvaniti and Baltazani 2000).

The remainder of this article is organized as follows. In Sect. 2, we overview various phone duration modeling techniques and the features used in related work. In Sect. 3, we offer a description of the emotional speech database that was used in the experiments, then we outline the large feature set used in the present study, and briefly describe the two feature selection algorithms employed in the experiments. Next, the ten machine learning algorithms evaluated on the phone duration modeling task are outlined, and we specify the performance evaluation measures used here. The experimental results concerning the phone duration models, with and without applying the feature selection algorithms, are discussed in Sect. 4. In Sect. 5 we conclude this paper with a brief summary of the work and results.


2 Overview of phone duration modeling techniques

The phone duration modeling methods reported in the literature can be broadly divided into two major categories: rule-based methods (Allen et al. 1987; Bartkova and Sorin 1987; Carlson and Granstrom 1986; Epitropakis et al. 1993; Klatt 1979; Olive and Liberman 1985; Simoes 1990) and data-driven methods (Chien and Huang 2003; Goubanova and King 2008; Krishna et al. 2004; Lazaridis et al. 2007; Möbius and Santen 1996; Riley 1992; Takeda et al. 1989). Although in the present study we consider only duration modeling techniques which belong to the category of data-driven methods, for comprehensiveness of exposition we also briefly review the main rule-based techniques.

2.1 Rule-based techniques

The rule-based techniques rely on manually produced rules, extracted from experimental studies on large sets of utterances or based on previous knowledge concerning the factors which affect segmental durations. Some of the most frequently studied and used factors in duration modeling are the identity of the neighboring segments (previous or next), the position of the segment in the syllable or in the word, and the phrase boundaries, among many others. Expert phoneticians are required to study these factors and extract the rules which are used for determining the duration of the segments. Dennis Klatt (1979) introduced one of the first and most well-known attempts in the field of rule-based segmental duration modeling. The rules, derived by analyzing a phonetically balanced set of sentences, were used to predict segmental durations. These rules were based on linguistic information such as phonological, morphological and syntactic factors. Initially, a set of intrinsic values was assigned to each phone, which was then modified according to the extracted rules, as in the example given in (1) from Klatt (1976):

\[
\mathrm{DUR} = \mathrm{MINDUR} + (\mathrm{INHDUR} - \mathrm{MINDUR}) \times \frac{\mathrm{PRCNT}}{100}
\tag{1}
\]

where INHDUR is the inherent (intrinsic) duration of each segment, MINDUR is the minimum duration assigned to each segment in order to ensure that the duration of the segment will never be too short, and PRCNT is the shortening percentage determined by the applied rules (Klatt 1976).
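To make rule (1) concrete, the following minimal Python sketch applies it to a toy phone inventory; the intrinsic values, minimum durations and shortening percentage used here are illustrative placeholders, not Klatt's published figures.

```python
# A minimal sketch of Klatt's rule (1); all numeric values below are
# illustrative placeholders, not the published ones.

def klatt_duration(inhdur, mindur, prcnt):
    """Apply rule (1): DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100."""
    return mindur + (inhdur - mindur) * prcnt / 100.0

# Hypothetical (INHDUR, MINDUR) pairs in milliseconds for two phones.
inventory = {"a": (230.0, 80.0), "s": (105.0, 60.0)}

prcnt = 60.0  # hypothetical rule: non-phrase-final segments keep 60%
for phone, (inhdur, mindur) in inventory.items():
    print(phone, klatt_duration(inhdur, mindur, prcnt), "ms")
```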

Similar models were developed for other languages, such as French (Bartkova and Sorin 1987), Swedish (Carlson and Granstrom 1986), German (Kohler 1988) and Greek (Epitropakis et al. 1993), and for language varieties, such as American English (Allen et al. 1987; Olive and Liberman 1985) and Brazilian Portuguese (Simoes 1990).

The major drawback of the rule-based approaches is the difficulty of representing and manually tuning all the linguistic factors which influence segmental duration in speech. Thus, in order to collect all the appropriate rules (or even an adequate number of them), long-term devotion to this task becomes crucial and mandatory (Klatt 1987). Therefore, rule-based duration models are restricted to controlled experiments, where only a limited number of contextual factors are involved, so that the interaction among these factors can be deduced and the rules extracted (Campbell 1992).

2.2 Data-driven techniques

Data-driven techniques for the task of phone duration modeling were developed after the construction of large databases (Kominek and Black 2003). These approaches are based either on statistical methods or on artificial neural network (ANN) based techniques. Data-driven approaches automatically produce phonetic rules and construct duration models using large speech corpora, overcoming the problem of manually extracting rules. Their main advantage is that this process does not depend as much on the manual labor of phoneticians as the rule-based techniques do, where the knowledge and expertise of phoneticians is mandatory for producing the rules.

Various statistical methods have been applied to the phone duration modeling task over the last years, such as linear regression (LR) models (Takeda et al. 1989), decision tree-based models (Krishna et al. 2004; Möbius and Santen 1996; Riley 1992) and sums-of-products (SOP) models (Febrer et al. 1998; van Santen 1992). Furthermore, artificial neural networks (Campbell 1992; Cordoba et al. 2001), Bayesian network models (Chien and Huang 2003; Goubanova and King 2008) and instance-based algorithms (Lazaridis et al. 2007) have also been introduced to the phone duration modeling task. In the following, we briefly mention some of the most frequently and successfully used methods for phone duration modeling (cf. Table 1).

The linear regression (LR) models (Iwahashi and Sagisaka 2000; Lazaridis et al. 2007; Takeda et al. 1989; Yamagishi et al. 2008) are based on the assumption that there is linear independence among the features which affect segmental duration. Specifically, the features are weighted in a linear combination, creating a prediction function. In Lazaridis et al. (2007), on a Modern Greek speech database, it was shown that a linear regression model outperformed classification and regression tree (CART) and instance-based (IBK) models in terms of root mean square error (RMSE), achieving a duration prediction accuracy of 25.5 ms, against 26 ms for the CART model and 27.5 ms for the IBK model, respectively. On the other hand, decision tree-based models, and in particular the CART models (Chung 2002; Goubanova and King 2008; Iwahashi and Sagisaka 2000; Krishna and Murthy 2004; Lazaridis et al. 2007; Lee and Oh 1999a; Riley 1992; Yamagishi et al. 2008), can represent the dependencies among the features but cannot impose linear independence constraints for reliable predictions (Iwahashi and Sagisaka 2000).
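As an illustration of the weighted linear combination described above, the following sketch shows the form such a prediction function takes; the weights and the three-feature encoding are invented for illustration and are not taken from any of the cited studies.

```python
# A sketch of an LR duration predictor: a weighted linear combination of
# numerically coded features. Weights and features are illustrative only.
import numpy as np

w = np.array([12.0, -6.0, 4.5])  # hypothetical learned weights (ms per unit)
b = 70.0                         # intercept: hypothetical baseline duration (ms)

def predict_duration(x):
    # x = [stressed (0/1), phrase-medial (0/1), onset size]
    return b + w @ x

print(predict_duration(np.array([1.0, 0.0, 2.0])))  # -> 91.0 ms
```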


Table 1 Phone duration modeling methods studied in the literature. A single entry in the error columns indicates that the study reported one value without a separate vowel/consonant breakdown.

Method | Study | Database | Error metric (ms) | Vowels | Consonants
LR | Yamagishi et al. (2008) | American English male | RMSE | 25.2 | 21.4
LR | Yamagishi et al. (2008) | Japanese male | RMSE | 16.6 | 14.9
LR | Yamagishi et al. (2008) | Japanese female | RMSE | 14.3 | 15.4
LR | Iwahashi and Sagisaka (2000) | RP English male | Standard deviation | 20.3 | –
LR | Lazaridis et al. (2007) | Modern Greek | RMSE | 25.5 | –
CART | Yamagishi et al. (2008) | American English male | RMSE | 26.4 | 21.5
CART | Yamagishi et al. (2008) | Japanese male | RMSE | 17.2 | 14.2
CART | Yamagishi et al. (2008) | Japanese female | RMSE | 14.9 | 13.9
CART | Iwahashi and Sagisaka (2000) | RP English male | Standard deviation | 19.6 | –
CART | Goubanova and King (2008) | RP English male | RMSE | 23.0 | 20.0
CART | Goubanova and King (2008) | RP English female | RMSE | 26.0 | 21.0
CART | Goubanova and King (2008) | American English male | RMSE | 27.0 | 24.0
CART | Krishna and Murthy (2004) | Indian Hindi | RMSE | 27.1 | –
CART | Krishna and Murthy (2004) | Indian Telugu | RMSE | 22.7 | –
CART | Lee and Oh (1999a) | Korean | RMSE | 22.0 | –
CART | Lazaridis et al. (2007) | Modern Greek | RMSE | 26.0 | –
CART | Chung (2002) | Korean | RMSE | 27.5 | 24.2
Model trees | Yamagishi et al. (2008) | American English male | RMSE | 25.4 | 21.1
Model trees | Yamagishi et al. (2008) | Japanese male | RMSE | 16.5 | 13.6
Model trees | Yamagishi et al. (2008) | Japanese female | RMSE | 14.3 | 12.9
Model trees | Iwahashi and Sagisaka (2000) | RP English male | Standard deviation | 19.4 | –
BN | Goubanova and King (2008) | RP English male | RMSE | 1.5 | 4.1
BN | Goubanova and King (2008) | RP English female | RMSE | 1.5 | 3.5
BN | Goubanova and King (2008) | American English male | RMSE | 1.7 | 3.6
FFNN | Teixeira and Freitas (2003) | European Portuguese | Standard deviation | 19.5 | –
IBK | Lazaridis et al. (2007) | Modern Greek | RMSE | 27.5 | –
SoP | Goubanova and King (2008) | RP English male | RMSE | 28.0 | 26.0
SoP | Goubanova and King (2008) | RP English female | RMSE | 25.0 | 25.0
SoP | Goubanova and King (2008) | American English male | RMSE | 32.0 | 33.0
SoP | Chung (2002) | Korean | RMSE | 32.1 | 28.9
Add. Regr. Trees | Yamagishi et al. (2008) | American English male | RMSE | 24.5 | 20.2
Add. Regr. Trees | Yamagishi et al. (2008) | Japanese male | RMSE | 16.1 | 12.8
Add. Regr. Trees | Yamagishi et al. (2008) | Japanese female | RMSE | 13.9 | 12.1
Bagg. Trees | Yamagishi et al. (2008) | American English male | RMSE | 25.8 | 20.9
Bagg. Trees | Yamagishi et al. (2008) | Japanese male | RMSE | 16.7 | 13.9
Bagg. Trees | Yamagishi et al. (2008) | Japanese female | RMSE | 14.5 | 13.5

Model trees are another tree-based technique, incorporating both linear regression and regression trees and overcoming the drawbacks of each of these two methods (Iwahashi and Sagisaka 2000; Yamagishi et al. 2008). Specifically, when the training data related to a specific feature are few, model trees impose the linear independence constraints between this feature and the others, using a small number of free parameters, as the linear regression models do. On the other hand, when the amount of training data related to a specific feature is large, the model trees represent the dependence between this feature and the others using a large number of parameters, as the regression trees do (Iwahashi and Sagisaka 2000).


Even though decision trees manage to build models which fit the training data well, there is no certainty that unseen data will be predicted properly during the operational phase. This is because these models lack the ability to interpolate and cannot handle sparse data, which deteriorates their performance (Rao and Yegnanarayana 2007; Goubanova and King 2008; Riley 1992). In Yamagishi et al. (2008), on two Japanese (male and female) databases and one American English (male) database, model trees outperformed the respective CART models in both the vowel and consonant categories, as can be seen in Table 1. Iwahashi and Sagisaka (2000) also showed that model trees outperformed a CART model on a Received Pronunciation (RP) English database, reinforcing the advantage of model trees over regression trees.
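To make the structural contrast concrete, the following toy sketch (with an invented split and invented coefficients, not taken from any cited model) shows the difference: a CART leaf stores a constant, while a model-tree leaf stores a linear model over the features.

```python
# Toy contrast between a regression-tree leaf (constant) and a model-tree
# leaf (linear model), under a single hypothetical split on syllable stress.

def cart_predict(x):
    # CART: each leaf stores a constant (mean duration of its training data).
    return 95.0 if x["stressed"] else 70.0  # ms, illustrative values

def model_tree_predict(x):
    # Model tree: each leaf stores a linear model over (a subset of) features.
    if x["stressed"]:
        return 60.0 + 8.0 * x["onset_size"] + 5.0 * x["coda_size"]
    return 50.0 + 6.0 * x["onset_size"] + 3.0 * x["coda_size"]

x = {"stressed": True, "onset_size": 2, "coda_size": 1}
print(cart_predict(x), model_tree_predict(x))  # -> 95.0 81.0
```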

In addition, artificial neural networks, such as feed-forward neural networks (Teixeira and Freitas 2003), have been used for the needs of phone duration modeling. It is well known that neural networks have the ability to generalize and to capture the relationship between input and output patterns, which makes them appropriate for the task of phone duration modeling (Rao and Yegnanarayana 2007; Yegnanarayana 1999). Furthermore, an additional advantage of neural networks is their ability to cope with missing data (Yegnanarayana 1999).

Bayesian network models have also been introduced to the phone duration modeling task. These models incorporate a straightforward representation of the problem domain information and, despite their demanding training, were shown to make accurate predictions even when unknown values are encountered in some features (Goubanova and King 2008). Moreover, the sums-of-products (SOP) method has been used in phone duration modeling, where segmental duration modeling is based on a sum of the factors that affect the duration and their product terms (Chung 2002; Goubanova and King 2008; van Santen 1994). SOP models can be trained using small amounts of data and, in contrast to regression trees, have the ability to interpolate when they encounter rare or unseen feature vectors. On the other hand, the main drawback of the SOP models lies in the exponential growth of the number of possible models as the dimension of the feature vector increases, making expert intuition and the use of heuristic search techniques mandatory for finding the best SOP model (Goubanova and Taylor 2000; Goubanova and King 2008). In Chung (2002), on a Korean database, a SOP model was reported to achieve an accuracy of 32.1 ms and 28.9 ms in terms of RMSE on the vowels and consonants respectively, and was outperformed by a CART model which achieved a performance of 27.5 ms and 24.2 ms, respectively. Goubanova and King (2008) performed a comparative evaluation among Bayesian networks, CART models and SOP models on a British (RP) and an American English database, showing that Bayesian networks outperformed both of the other models on both databases, in the vowel and consonant categories separately.

Furthermore, instance-based algorithms (Lazaridis et al. 2007) have been used in the phone duration modeling task. In instance-based models the training data are stored during the training phase, and a distance function is used during the prediction phase in order to determine which members of the training set are closest to the test instance and to predict the phone duration. In a recent study (Yamagishi et al. 2008), the gradient tree boosting (GTB) method (Friedman 2001, 2002) was introduced to this task as an alternative to the conventional method using regression trees. The GTB algorithm is a meta-algorithm incorporating multiple regression trees, built into the model iteratively, and consequently taking advantage of all of them. The advantage of this method lies in the fact that each new regression tree is constructed using the residuals of the prediction function of the current model, regardless of the structure of the previous trees (Yamagishi et al. 2008). As mentioned above, in Yamagishi et al. (2008), model trees outperformed the CART models in both categories, vowels and consonants, on the two Japanese (male and female) and the American English (male) databases, though it should be pointed out that the best performance was achieved by the model based on the GTB meta-algorithm, which outperformed the model trees, the CART model and the linear regression model. As can be concluded from the research reported over the years on the phone duration modeling task, the data-driven methods provide a convenient mechanism for avoiding the time-consuming manual extraction of rules that is needed in the rule-based phone duration modeling methods.
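The residual-fitting idea behind GTB can be sketched in a few lines; the sketch below uses depth-1 stumps on synthetic data as stand-ins for the regression trees of the actual algorithm, so it illustrates only the boosting loop, not Friedman's full method.

```python
# A minimal sketch of the residual-fitting idea behind gradient tree
# boosting: each new regressor is trained on the residuals of the current
# ensemble. Decision stumps stand in for full regression trees here.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = 80 + 40 * X[:, 0] + rng.normal(0, 5, size=200)   # synthetic durations (ms)

def fit_stump(X, r):
    """Fit a depth-1 regression tree (stump) to residuals r."""
    best = None
    for t in np.linspace(0.2, 0.8, 7):
        left, right = r[X[:, 0] <= t], r[X[:, 0] > t]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lm, rm = best
    return lambda X: np.where(X[:, 0] <= t, lm, rm)

nu, trees = 0.5, []              # shrinkage (learning rate) and ensemble
F = np.full_like(y, y.mean())    # initial prediction: the global mean
for _ in range(10):
    tree = fit_stump(X, y - F)   # fit the residuals of the current model
    trees.append(tree)
    F += nu * tree(X)
print("training RMSE:", np.sqrt(((y - F)**2).mean()))
```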

2.3 Features used in the task of phone duration modeling

The features used in the task of phone duration modeling can be extracted from various domains of speech representation, such as the phonetic, phonological, morphological and syntactic levels. Numerous studies have been reported over the years concerning the factors affecting phone durations and the features determining them (Bell et al. 2003; Chung 2002; Crystal and House 1988; Iwahashi and Sagisaka 2000; Goubanova and King 2008; Gregory et al. 2001; Klatt 1987; Yamagishi et al. 2008; Krishna and Murthy 2004; Lazaridis et al. 2007; Lee and Oh 1999b; Möbius and Santen 1996; Riley 1992; Takeda et al. 1989; Teixeira and Freitas 2003; van Santen 1994). Some of the most frequently used features in the phone duration modeling task are: the phone, the number of phones in the syllable, the stress of the syllable, the position of the phone in the syllable, the position of the syllable in the word or in the phrase, and the part of speech of the word.


In other studies (Iwahashi and Sagisaka 2000; Lazaridis et al. 2007; Möbius and Santen 1996), apart from the stress feature, additional prosodic features were used in the phone duration modeling task, such as the accent of the syllable, the type of the syllable, the distance to the next or previous accent, the break after the syllable, or the distance to the next break or pause. Furthermore, in Goubanova and King (2008), Krishna and Murthy (2004), Lazaridis et al. (2007), Möbius and Santen (1996) and Yamagishi et al. (2008), features concerning the phonetic characteristics of the phones have been incorporated in phone duration modeling, such as the vowel height, the vowel frontness, the lip rounding, the manner of production and the place of articulation, or the number of phones before or after the vowel of the syllable. In other studies (Goubanova and King 2008; Krishna and Murthy 2004; Lazaridis et al. 2007; Teixeira and Freitas 2003), information concerning the neighboring instances, such as the next or previous phones, the type of the next or previous syllable, or the stress or accent of the next or previous syllable, was taken into consideration.

In the present work, we take advantage of these previous studies and use a large feature set that includes all the aforementioned features. In order to exclude the irrelevant or less relevant features for the phone duration modeling task and achieve better accuracy and overall performance optimization, we experiment with two feature selection algorithms (cf. Sect. 3.3). The accuracy achieved by ten different phone duration modeling algorithms (cf. Sect. 3.4), with and without feature selection preprocessing of the feature vector, is investigated in Sect. 4.

3 Experimental setup

In order to evaluate the accuracy of various phone duration modeling techniques, in the present work we made use of an experimental setup based on a Modern Greek emotional speech database, which was purposely designed in support of research on speech synthesis. This database is linguistically and prosodically rich, and contains emotional speech from the categories anger, fear, joy and sadness, which are considered the four archetypal emotions (Oatley and Johnson-Laird 1998), as well as neutral speech. In the next subsections the database is introduced, followed by the feature set used for the phone duration modeling task, the feature selection algorithms, and the machine learning techniques employed here as the phone duration models. Finally, the performance assessment measures used in the present work are presented in Sect. 3.5.

3.1 Database description

The positional and contextual factors of a phone (place in syllable, word, etc.) play a very important role in the determination of its duration (Febrer et al. 1998; Krishna et al. 2004; Möbius and Santen 1996; van Santen 1992). The Modern Greek database used in our experiments was designed following this observation, so that each phone has multiple instances in various positions (initial, medial, final) in different words in the database. The contents of the database were extracted from passages and newspapers, or were set up by a professional linguist. The database used for the experiments consisted of 62 utterances, which were pronounced several times with different emotional charge. The length of the utterances ranged from a single word or a phrase to short and long sentences, or even sequences of sentences of fluent speech. The content of all sentences was emotionally neutral, meaning that it did not convey any emotional charge through lexical, syntactic or semantic means. Moreover, all the utterances were uttered separately in the five emotional styles.

The entire database consisted of 4150 words (310 utterances). We used a phone inventory of 34 phones, with a total of 22045 instances (15667 voiced and 6378 unvoiced phones). Furthermore, each vowel class included both stressed and unstressed cases of the corresponding vowel. All utterances were uttered by a professional female actress speaking Modern Greek. To ensure that the speaker would not have to change her emotional state more than five times (expressing anger, fear, joy, sadness and neutral emotion, respectively), all the recordings of each specific emotional category were recorded in series, before proceeding with the other emotional categories. In addition, the actress was instructed to express a 'casual' intensity of the chosen emotional states, avoiding any theatrical exaggeration.

All recording sessions were held in the anechoic chamber of a professional studio. Speech was sampled at 44.1 kHz with a resolution of 16 bits. For the needs of our experiments we down-sampled all the recordings to a sampling rate of 16 kHz.

3.2 Composite feature set

In the present study, we consider a number of features which have been reported successful on the task of phone duration modeling (Bell et al. 2003; Chung 2002; Crystal and House 1988; Iwahashi and Sagisaka 2000; Febrer et al. 1998; Goubanova and King 2008; Gregory et al. 2001; Klatt 1987; Yamagishi et al. 2008; Krishna and Murthy 2004; Krishna et al. 2004; Lazaridis et al. 2007; Lee and Oh 1999b; Möbius and Santen 1996; Riley 1992; Takeda et al. 1989; Teixeira and Freitas 2003; van Santen 1992; van Santen 1994). The feature set is composed of linguistic features extracted only from text, since the input to a speech synthesis system is text. Specifically, from each utterance we computed 33 features. The syntagmatic neighbors of some of these features, defined on the level of the respective feature (i.e. phone-level, syllable-level, word-level), were also used.


The features involved in the feature set are the following:

(i) eight phonetic features:

• the phone class (consonants/non-consonants),
• the phone type (short vowels, long vowels, diphthongs, schwa, consonants),
• the vowel height (high, middle, low),
• the vowel frontness (front, central, back),
• the lip rounding (rounded/unrounded),
• the manner of production (plosive, fricative, affricate, liquid, nasal),
• the place of articulation (labial, labio-dental, dental, alveolar, palatal, velar, glottal),
• the consonant voicing;

along with the aforementioned features for each current instance, the respective features of the two previous and the two next instances were also used,

(ii) three segment-level features:

• the phone name, with the information of the neighboring instances (previous, next),
• the position of the phone in the syllable,
• the onset-rhyme type (onset: if the specific phone is before the vowel in the syllable; rhyme: if the specific phone is the vowel or comes after the vowel in the syllable),

(iii) thirteen syllable-level features:

• the position type of the syllable in the word (single, initial, middle or final), along with the information of the neighboring instances (previous, next),
• the number of all the syllables in the utterance,
• the number of accented syllables and the number of stressed syllables since the last and to the next phrase break (i.e. the break index tier of ToBI, Silverman et al. 1992, with values: 0, 1, 2, 3, 4),
• the syllable's onset-coda size (the number of phones before and after the vowel of the syllable), along with the information of the previous and next instances,
• the onset-coda type (whether the consonant before and after the vowel in the syllable is voiced or unvoiced), along with the information of the previous and next instances,
• the position of the syllable in the word,
• the onset-coda consonant type (the manner of production of the consonant before and after the vowel in the syllable),

(iv) two word-level features:

• the part-of-speech (noun, verb, adjective, etc.),
• the number of syllables in the word,

(v) one phrase-level feature:

• the syllable break (the phrase break after the syllable), along with the information of the neighboring (two previous, two next) instances. The syllable break feature is based on the break index tier (0, 1, 2, 3, 4) of ToBI (Silverman et al. 1992),

(vi) six accentual features:

• the ToBI accents and boundary tones, along with the information of the neighboring (previous, next) instances,
• the last-next accent (the number of syllables since the last and to the next accented syllable),
• the stressed-unstressed syllable (whether the syllable is stressed or not),
• the accented-unaccented syllable (whether the syllable is accented or not), with the information of the neighboring (two previous, two next) instances.

The overall size of the feature set is 93, including the aforementioned features along with their syntagmatic neighbor information as reported above (one or two previous and next instances on the level of the respective feature: phone-level, syllable-level, word-level).
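As an illustration of how such a composite instance might look, the sketch below encodes one phone with its ±2 syntagmatic neighbors; the feature names, the padding convention and the toy phone sequence are illustrative, not the paper's exact 93-dimensional set.

```python
# A toy sketch of encoding one phone instance with syntagmatic neighbor
# context; feature names and the "pad" convention are illustrative only.

def encode(phones, i, context=2):
    """Encode phone i with a few of its own features and +/-2 neighbors."""
    feats = {"phone": phones[i]["name"],
             "phone_class": phones[i]["class"],
             "stressed": phones[i]["stressed"]}
    for d in range(-context, context + 1):
        if d == 0:
            continue
        j = i + d
        feats[f"phone_{d:+d}"] = phones[j]["name"] if 0 <= j < len(phones) else "pad"
    return feats

phones = [{"name": p, "class": c, "stressed": s}
          for p, c, s in [("k", "cons", False), ("a", "vowel", True), ("t", "cons", False)]]
print(encode(phones, 1))  # the vowel /a/ with its neighboring phones
```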

3.3 Feature selection algorithms

In the present section we outline two dissimilar feature selection algorithms, namely Relief (Kira and Rendell 1992) and the Correlation-based Feature Selection (CFS) (Hall 1999), which are employed in the experiments reported in Sect. 4.

Kira and Rendell (1992) introduced a statistical feature selection algorithm called Relief that uses instance-based learning to assign a relevance weight to each feature in order to achieve a maximal margin. The key idea behind this algorithm is to iteratively estimate the quality of all attributes according to how well their values distinguish between instances that are near to each other. In brief, Relief iteratively samples instances randomly across the training data and then, for each selected instance, finds the closest neighboring instances of the same and of different classes (Witten and Frank 2005). The margin of a data instance j used in this method is defined as the distance between the near-hit and the near-miss of instance j, where near-hit and near-miss denote the nearest samples to instance j with the same and with a different label, respectively (Kira and Rendell 1992). The main shortcoming of the Relief algorithm is that it can handle only binary class problems. This drawback was overcome by the introduction of the ReliefF algorithm by Kononenko (1994), which handles multiple classes. Furthermore, the ReliefF algorithm can handle noisy and incomplete data.

In the following we consider the RReliefF algorithm, which was introduced by Robnik-Sikonja and Kononenko (1997) in order to handle continuous classes in regression problems. Since the use of nearest hits and misses is not feasible in regression problems, the probability that the predicted values of two instances are different is used in order to find the closest neighboring instances, as mentioned above. Consequently, for each attribute a quality degree is measured according to how well its values distinguish between instances that are near to each other. This algorithm is based on the rationale that a useful feature should differentiate between instances with a relatively large distance between their predicted values, and should have the same value for instances with a relatively small distance between their predicted values. The advantage of margin-based feature selection methods is their consideration of global information, which leads to optimal feature weights, making these methods efficient even for data with dependent features (Gilad-Bachrach et al. 2004). The number of nearest neighbors for attribute estimation was set to 15 after grid search experiments (k = {5, 10, 15, ..., 100}) on a randomly selected subset of the training set, representing 40% of the size of the full training set. The RReliefF algorithm is used in combination with a ranking algorithm, in our case the Ranker (Witten and Frank 2005). Ranker is a search method for ranking individual features according to their individual evaluations.
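The core weight estimate can be sketched as follows; this is a deliberately simplified version of the algorithm of Robnik-Sikonja and Kononenko (1997), with uniform neighbor influence and L1 distances, intended only to show how attribute differences are credited against differences in the target value.

```python
# A simplified RReliefF-style sketch: an attribute is rewarded when it
# differs on neighbor pairs whose targets also differ, and penalized when
# it differs on pairs with similar targets. Distance weighting is omitted.
import numpy as np

def rrelieff_sketch(X, y, n_samples=100, k=15, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ndC, ndA, ndCdA = 0.0, np.zeros(d), np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                      # exclude the instance itself
        for j in np.argsort(dist)[:k]:
            diff_y = abs(y[i] - y[j]) / (y.max() - y.min())
            diff_a = np.abs(X[i] - X[j]) / (X.max(0) - X.min(0) + 1e-12)
            ndC += diff_y                     # mass of "different prediction"
            ndA += diff_a                     # mass of attribute difference
            ndCdA += diff_y * diff_a          # joint mass
    total = n_samples * k
    return ndCdA / ndC - (ndA - ndCdA) / (total - ndC)

X = np.random.default_rng(1).uniform(size=(300, 5))
y = 3 * X[:, 0] + np.random.default_rng(2).normal(0, 0.1, 300)  # only feature 0 matters
print(rrelieff_sketch(X, y).round(3))  # feature 0 should get the largest weight
```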

The Correlation-based Feature Selection (CFS) technique (Hall 1999), on the other hand, scores and ranks subsets of features, rather than individual features as the Relief algorithm does. It uses the criterion that a good feature subset for a classifier contains features that are highly correlated with the class variable but poorly correlated with each other (Hall 1999). In an iterative scheme, CFS adds features that have a high correlation with the target, provided that they are not more strongly correlated with a feature already in the set (Witten and Frank 2005). Based on this rationale, the CFS algorithm is capable of dealing not only with irrelevant but also with redundant features. The CFS algorithm overcomes the problem of missing values by replacing them with the mean of all the values in the case of continuous attributes, and with the most common value in the case of discrete attributes (Hall 2000). Furthermore, CFS internally discretizes the continuous features, making the algorithm suitable for mixed types of input features. In our experiments the CFS algorithm was used in conjunction with a genetic algorithm (GA) (Goldberg 1989).
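The subset-scoring heuristic at the heart of CFS can be sketched directly from Hall's (1999) merit formula; the search over subsets (a genetic algorithm in our experiments) is a separate component and is omitted here, and the synthetic data are invented for illustration.

```python
# A sketch of the CFS merit heuristic (Hall 1999): a subset scores well
# when its features correlate with the target but not with each other.
import numpy as np

def cfs_merit(X, y, subset):
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 1] = X[:, 0] + rng.normal(0, 0.1, 500)   # feature 1 is redundant with 0
y = X[:, 0] + X[:, 2]
print(cfs_merit(X, y, [0, 2]), cfs_merit(X, y, [0, 1]))  # [0, 2] scores higher
```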

3.4 Phone duration modeling algorithms

In the present work, we evaluate the performance of ten different phone duration models (PDMs), which have been reported successful on the phone duration modeling task. They represent the four categories of machine learning algorithms (decision trees, linear regression, lazy-learning and meta-learning algorithms) mentioned in Sect. 2. A brief outline of these algorithms follows:

• Three decision trees were used, namely the M5p model tree (Wang and Witten 1997), the M5pR regression tree (Quinlan 1992) based on the M5′ algorithm (Wang and Witten 1997), and the Reduced Error Pruning tree (REPTree) regression tree (Kääriäinen and Malinen 2004).

• Two lazy-learning algorithms were used: IBK (Aha et al. 1991), which uses the k-nearest neighbors (k-NN) algorithm, and the Locally Weighted Learning (LWL) algorithm (Atkeson et al. 1996), which assigns weights using an instance-based method. Concerning the IBK algorithm, for locating the instances that are closest to a test instance, it searches among the k nearest neighbors of the instance. Evaluating this method with different numbers of neighbors resulted in the adoption of 12 neighbors (k = 12), since this gave the best results. In the case of LWL, the kernel function used to calculate the weights for the data points was the tricube kernel function, while REPTrees were used as classifiers.

• Two meta-learning algorithms were used: Additive Regression (AR) (Friedman 2002) and the Bagging algorithm (BG) (Breiman 1996), each using two different regression trees (M5pR and REPTrees) as base classifiers (a minimal sketch of the bagging scheme is given after this list). In the two cases of the additive regression meta-learning algorithm, the shrinkage parameter ν, indicating the learning rate, was set to 0.5 and the number of regression trees, rt-num, was set to 10 after grid search experiments (ν = {0.1, 0.3, 0.5, 0.7, 0.9}, rt-num = {5, 10, 15, 20}) on a randomly selected subset of the training set, representing 40% of the size of the full training set. In the two cases of the bagging algorithm, the number of regression trees, rt-num, was set to 10 after grid search experiments (rt-num = {5, 10, 15, 20}) on the randomly selected subset of the training set mentioned earlier.

• Finally, the linear regression (LR) algorithm (Witten and Frank 2005) was used in our experiments, which is a classification and prediction algorithm that expresses the class variable as a linear combination of the features. The error estimation in the LR algorithm is given by the Akaike Information Criterion (AIC) (Akaike 1974).
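As referenced in the meta-learning item above, the following sketch illustrates the bagging scheme on synthetic data. It assumes scikit-learn's DecisionTreeRegressor as a stand-in for the Weka-style REPTree/M5pR base classifiers used in the actual experiments; rt_num = 10 mirrors the grid-searched value.

```python
# A minimal sketch of bagging (Breiman 1996): train rt_num trees on
# bootstrap resamples and average their predictions. Assumes scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
y = 60 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 4, 500)  # synthetic durations (ms)

rt_num, models = 10, []
for _ in range(rt_num):
    idx = rng.integers(0, len(X), len(X))   # bootstrap resample with replacement
    models.append(DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx]))

pred = np.mean([m.predict(X) for m in models], axis=0)  # average the trees
print("training RMSE:", np.sqrt(((y - pred) ** 2).mean()))
```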

3.5 Performance assessment measures

The performance of the phone duration prediction models was measured in terms of the root mean square error (RMSE), the mean absolute error (MAE) and the correlation coefficient (CC). The RMSE is frequently used as a global measure sensitive to gross errors. The MAE, described as the average magnitude of the errors in a set of predictions, does not consider the direction of the deviations from the ground truth and is not as sensitive to gross errors as the RMSE. The RMSE and the MAE are defined, respectively, as:

\[
\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \big(F(x_i) - y_i\big)^2}{N}},
\tag{2}
\]

and

\[
\mathrm{MAE} = \frac{\sum_{i=1}^{N} \big|F(x_i) - y_i\big|}{N},
\tag{3}
\]

where N is the number of test instances, y_i is the actual duration of the i-th instance, and F(x_i) is the predicted value for the i-th instance. Finally, we calculated the correlation coefficient (CC), which measures the statistical correlation between the actual and the predicted values of the phone duration. The CC is defined as

\[
\mathrm{CC} = \frac{\operatorname{cov}\big(F(X), Y\big)}{s_{F(X)}\, s_{Y}} = \frac{E\Big[\big(F(X) - \overline{F(X)}\big)\big(Y - \overline{Y}\big)\Big]}{s_{F(X)}\, s_{Y}},
\tag{4}
\]

where F(X) is the variable of the predicted values and Y is the variable of the actual values of the phone durations, \overline{F(X)} and \overline{Y} are the mean values of the two variables, and s_{F(X)} and s_{Y} are their standard deviations, respectively. Together, these three performance measures offer a good indication of the accuracy of the different models.
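Equations (2)-(4) translate directly into code; the following sketch computes all three measures for a toy pair of predicted and actual duration vectors (the numbers are invented for illustration).

```python
# Direct implementations of equations (2), (3) and (4).
import numpy as np

def rmse(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))        # equation (2)

def mae(pred, actual):
    return np.mean(np.abs(pred - actual))                # equation (3)

def cc(pred, actual):
    # equation (4): covariance normalized by the standard deviations
    return np.cov(pred, actual, ddof=0)[0, 1] / (pred.std() * actual.std())

actual = np.array([80.0, 95.0, 60.0, 120.0])   # toy durations (ms)
pred = np.array([85.0, 90.0, 66.0, 110.0])
print(rmse(pred, actual), mae(pred, actual), cc(pred, actual))
```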

4 Experimental results

An experimental protocol based on 10-fold cross-validation was applied in all the experiments in order to exploit the available data in the best way. The results of the phone duration models implemented with the ten different machine learning techniques are reported in Tables 2, 3 and 4. Specifically, the results for the case without feature selection, denoted in the tables as NoFS, and for the RReliefF and CFS feature selection algorithms, in terms of RMSE, MAE and CC, are presented in Tables 2, 3 and 4, respectively. The values of the RMSE and MAE are in milliseconds. The best NoFS model, trained with the full set of features, i.e. without feature selection, for each category of emotional speech is shown in italics. Bold font indicates the cases where the phone duration models with feature selection (RReliefF or CFS) outperform the respective ones without feature selection (NoFS). Finally, the best result for each category of emotional speech, regardless of whether feature selection was applied, is indicated by the cell with a grey background.

4.1 Phone duration modeling with no feature selection

All the evaluated PDMs for the NoFS case, as shown in Tables 2 and 3, demonstrated reasonable performance, yielding RMSE values between 19.0 and 30.3 milliseconds, and MAE values between 14.0 and 22.2 milliseconds. Although these values are acceptable for the needs of emotional speech synthesis, a more accurate phone duration model (prediction error below 20 ms) would result in synthetic speech of better quality (Wang et al. 2004). Regarding the CC, the PDMs achieved values in the range of 0.54 to 0.83, which is a respectable outcome for a phone duration model. The highest accuracy for phone duration modeling in all categories of emotional speech was observed for the M5p trees and for the meta-learning algorithms, Additive Regression and Bagging, using M5pR regression trees as base classifiers (AR.M5pR and BG.M5pR).

It was also observed that the LR model, although showing higher error rates, has a performance which approaches that of the M5p trees, showing an average performance drop over all the emotional categories in terms of a relative increase of RMSE and MAE by 3.4% and 3.5% with respect to M5p. Regarding the CC, a respective reduction of 2.7% was observed. Moreover, it is interesting to point out that, between the two local learning methods that were employed, LWL performed worse than IB12 in all categories of emotional speech, showing an average performance drop over all the emotional categories in terms of a relative increase of RMSE and MAE by 12.6% and 11.6% with respect to IB12. Regarding the CC, a respective reduction of 12.1% was observed. Finally, REPTrees (R.Tr.) demonstrated the lowest accuracy among all evaluated methods, both as a single model and as a base classifier for the AR and BG algorithms (AR.R.Tr., BG.R.Tr.).

The M5p trees achieved the best performance due to the fact that these models adopt a greedy algorithm. The greedy algorithm constructs a model tree with a non-fixed structure by using a certain stopping criterion, minimizing the error at each interior node, one node at a time, recursively, until the best accuracy is achieved. In this way the computational cost increases, but very robust models are constructed. Moreover, the meta-learning algorithms, processing meta-data and taking advantage of the information produced by other methods (in our experiments the M5pR and REPTrees models), were expected to perform well. However, as we can notice, Additive Regression and Bagging performed better when combined with a robust prediction method such as M5pR, while they did not perform as well when the REPTrees (R.Tr.) were used as base classifiers, showing an average performance drop over all the emotional categories in terms of a relative increase of RMSE and MAE by 6.4% and 6.3% in the case of the AR.R.Tr. models and by 11.5% and 10.9% in the case of the BG.R.Tr. models.


Table 2 RMSE values in milliseconds for the different categories of emotional speech. Each phone duration model was evaluated with the full feature set, i.e. without feature selection (NoFS), and with feature selection based on either the RReliefF or the CFS method

Anger Fear Joy Neutral Sadness

NoFS RReliefF CFS NoFS RReliefF CFS NoFS RReliefF CFS NoFS RReliefF CFS NoFS RReliefF CFS

AR.M5pR 22.1 22.4 22.7 20.1 20.0 22.4 19.0 19.3 22.2 26.3 26.1 28.1 20.6 20.8 22.0

AR.R.Tr. 23.8 23.6 23.2 21.3 21.1 22.9 20.8 20.9 23.4 26.7 26.4 29.2 22.1 21.9 22.3

BG.M5pR 23.3 23.3 23.6 20.9 20.9 22.6 20.4 20.5 22.8 26.7 26.7 28.0 21.4 21.7 22.5

BG.R.Tr. 28.2 28.0 23.4 22.5 22.5 23.5 22.8 22.9 25.5 27.6 27.1 29.3 24.3 24.2 22.7

IB12 24.7 24.3 23.9 21.8 21.6 21.5 22.2 21.4 21.0 27.5 27.3 26.9 20.6 22.5 23.4

LWL 28.6 28.5 25.0 24.4 24.1 25.1 23.4 23.1 26.1 28.9 28.3 30.9 25.7 25.4 24.2

LR 22.8 23.3 23.8 22.0 19.9 22.1 19.8 20.2 22.7 26.4 26.8 28.6 20.8 21.2 22.4

M5p 21.7 22.2 21.3 20.2 19.8 22.0 19.5 19.5 21.7 26.2 26.0 27.7 20.9 20.9 20.7

M5pR 24.1 23.5 24.4 21.6 21.7 23.2 21.6 21.4 23.4 27.2 27.0 28.7 22.1 22.3 23.2

R.Tr. 30.3 30.2 25.3 24.3 24.4 23.5 24.5 24.2 27.4 29.4 29.2 31.8 26.6 26.6 24.3

Table 3 MAE values in milliseconds for the different categories of emotional speech. Each phone duration model was evaluated with the full feature set, i.e. without feature selection (NoFS), and with feature selection based on either the RReliefF or the CFS method

Anger Fear Joy Neutral Sadness

NoFS RReliefF CFS NoFS RReliefF CFS NoFS RReliefF CFS NoFS RReliefF CFS NoFS RReliefF CFS

AR.M5pR 16.3 16.3 16.7 14.9 14.9 16.9 14.0 14.3 16.7 17.5 17.1 19.4 15.6 15.8 16.9

AR.R.Tr. 17.5 17.5 17.0 15.7 15.6 17.3 15.3 15.4 17.6 17.8 17.7 19.8 16.8 16.6 17.0

BG.M5pR 17.1 17.0 17.2 15.4 15.5 17.0 15.1 15.1 17.2 17.7 17.7 19.1 16.2 16.4 17.2

BG.R.Tr. 20.5 20.4 16.9 16.5 16.6 17.7 16.7 16.7 19.0 18.6 18.2 20.0 18.1 18.0 17.1

IB12 18.0 17.5 17.4 15.8 15.6 15.4 16.4 15.6 15.3 18.4 18.0 17.7 15.6 16.8 17.8

LWL 20.5 20.3 18.1 18.0 17.6 18.7 17.0 16.7 19.6 19.3 19.0 20.7 19.0 18.6 18.1

LR 17.1 17.4 17.8 16.0 14.9 16.7 14.9 15.2 17.0 17.7 17.8 19.3 16.1 16.2 17.3

M5p 16.1 16.3 15.9 15.0 14.6 16.5 14.8 14.5 16.3 17.1 17.0 18.5 16.0 15.9 15.5

M5pR 17.6 16.8 17.8 16.0 16.0 17.6 15.9 15.9 17.6 18.2 18.1 19.7 16.8 17.0 17.7

R.Tr. 22.2 22.1 18.4 18.2 18.2 17.8 17.9 17.7 20.5 20.1 19.9 21.9 20.0 20.0 18.3

Table 4 CC values for the different categories of emotional speech. Each phone duration model was evaluated with the full feature set, i.e. without feature selection (NoFS), and with feature selection based on either the RReliefF or the CFS method

Anger Fear Joy Neutral Sadness

NoFS RReliefF CFS NoFS RReliefF CFS NoFS RReliefF CFS NoFS RReliefF CFS NoFS RReliefF CFS

AR.M5pR 0.83 0.82 0.82 0.72 0.72 0.63 0.78 0.77 0.69 0.66 0.68 0.61 0.75 0.75 0.71

AR.R.Tr. 0.79 0.79 0.81 0.67 0.68 0.60 0.73 0.73 0.64 0.65 0.66 0.56 0.70 0.71 0.70

BG.M5pR 0.81 0.81 0.80 0.70 0.70 0.63 0.75 0.74 0.66 0.66 0.66 0.61 0.73 0.72 0.69

BG.R.Tr. 0.70 0.70 0.80 0.62 0.62 0.58 0.66 0.66 0.54 0.62 0.63 0.55 0.63 0.63 0.68

IB12 0.78 0.79 0.80 0.66 0.67 0.68 0.69 0.72 0.73 0.63 0.64 0.66 0.75 0.70 0.66

LWL 0.70 0.70 0.77 0.55 0.57 0.52 0.65 0.68 0.54 0.59 0.61 0.51 0.59 0.61 0.64

LR 0.81 0.80 0.79 0.66 0.72 0.64 0.76 0.75 0.66 0.66 0.65 0.58 0.74 0.73 0.69

M5p 0.83 0.82 0.84 0.72 0.73 0.65 0.77 0.77 0.70 0.67 0.69 0.62 0.74 0.74 0.76

M5pR 0.79 0.81 0.78 0.66 0.66 0.59 0.70 0.71 0.64 0.63 0.64 0.58 0.70 0.70 0.67

R.Tr. 0.65 0.65 0.76 0.55 0.55 0.66 0.60 0.62 0.46 0.57 0.57 0.46 0.54 0.54 0.63

Regarding the CC, respective reductions of 5.3% and 11.4% were observed for the AR.R.Tr. and BG.R.Tr. models. Regarding the other methods, we noticed that local learning methods, or methods that apply a stricter pruning strategy, might perform faster, but they do not yield the highest accuracy.


Table 5 The number of features selected by the two feature selection algorithms, RReliefF and CFS, for the different categories of emotional speech

Emotional Categories Anger Fear Joy Neutral Sadness

Feature Selection Algorithms RReliefF CFS RReliefF CFS RReliefF CFS RReliefF CFS RReliefF CFS

Number of Features 76 39 77 38 70 27 72 31 68 37


4.2 Phone duration modeling with feature selection

It is also interesting to compare the results obtained by the phone duration models with the two feature selection algorithms (RReliefF and CFS) to those obtained by the PDMs using the full feature set, i.e. with no feature selection (NoFS). Table 5 shows the number of features in the feature sets produced by the two feature selection methods, RReliefF and CFS, for all the emotional categories. Applying feature selection led to a significant reduction of the feature vector size: compared to the size of the full feature vector, an average reduction over all the emotional categories of 22% was observed for the RReliefF algorithm and of 63% for the CFS algorithm. As can be seen, the average reduction over all the emotional categories in the case of CFS is much larger than in the case of RReliefF, primarily due to the CFS strategy of minimizing redundancy among the selected features.

The decision to build separate phone duration models for each emotional category, rather than one model incorporating all the emotional categories, is supported by the results of the feature selection procedure. Even though a degree of uniformity existed in the selected features throughout all the emotional categories, in many cases (among which some phonetic features, such as the place of articulation, the manner of production or the lip rounding in the case of the CFS algorithm, and the last-next accent feature in the case of the RReliefF algorithm) different features, or different syntagmatic neighbor information concerning a feature (i.e. the information of a feature concerning the previous or next instance), were selected in each emotional category. The features selected as relevant by both feature selection algorithms in most of the emotional categories were mainly the ToBI accents and boundary tones and the stressed-unstressed syllable features, along with their syntagmatic neighbor information concerning the next instance, in the category of the accentual features, and the syllable break in the category of the phrase-level features. Furthermore, concerning the syllable-level features, the syllable's onset-coda type and size features, along with their syntagmatic neighbor information concerning the next instance, and the position type of the syllable feature were selected by both feature selection algorithms.

Fig. 1 Weighted average values of the standard deviations (in milliseconds) of the phone durations for all the emotional categories

Finally, the onset-rhyme type, belonging to the segment-level feature category, was selected in most of the emotional categories by both feature selection algorithms.

In addition to the above features, which were selected by both feature selection algorithms, the RReliefF algorithm selected further features as relevant to the target concept of phone duration modeling. Some of these features are the last-next accent feature among the accentual features, and both of the word-level features, i.e. the part-of-speech and the number of syllables in the word. Moreover, several syllable-level features, additional to the ones mentioned above as selected by both algorithms, were selected by the RReliefF algorithm, such as the number of syllables in the utterance and the number of accented and stressed syllables since the last and to the next phrase break. In addition, all the phonetic features, most of which were rejected by CFS as irrelevant, were selected by the RReliefF algorithm, along with most of their syntagmatic neighbors, concerning mainly the one previous and one next instance. Finally, the names of the previous and next phones, which are segment-level features, were selected by the RReliefF algorithm.

Apart from the Joy emotional category, where the best performance was achieved by the PDM with NoFS (AR.M5pR), in all the other categories (Anger, Fear, Neutral and Sadness) one of the two feature selection algorithms led to the lowest RMSE and MAE (please refer to Tables 2 and 3). Joy was thus the only emotional category in which the highest performance was achieved without any feature selection algorithm. This is explained by the lowest weighted standard deviation of the phone durations corresponding to this category (please refer to Fig. 1).

186 Int J Speech Technol (2010) 13: 175–188

phone’s duration corresponding to this category (please re-fer to Fig. 1). Given that it is a less challenging task, featureselection cannot contribute to the enhancement of the phoneduration models in Joy emotional category. The PDMs, im-plemented by M5p algorithm, relying on the CFS had thebest performance in the categories Anger and Sadness, pre-senting a relative reduction of RMSE and MAE by 1.8% and1.2% respectively, in Anger category and by 1% and 3.1%respectively, in Sadness category. Regarding the CC (referto Table 4), improvement by 1.2% and 2.7% was shown inAnger and Sadness respectively. The PDMs, implementedby M5p algorithm, employing the RReliefF algorithm pre-sented the best results in the categories Fear and Neutral,presenting a relative reduction of RMSE and MAE by 2%and 2.7% respectively, in Fear category and by 0.8% and0.6% respectively, in Neutral category. Regarding the CC,improvement by 1.4% and 3% was shown in Fear and Neu-tral respectively.
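For clarity, the figures quoted above can be reproduced with the following minimal sketch of the evaluation metrics; the function names and toy values are ours, and "relative reduction" is read as (NoFS error - FS error) / NoFS error.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error of the predicted phone durations."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error of the predicted phone durations."""
    return np.mean(np.abs(y_true - y_pred))

def cc(y_true, y_pred):
    """Pearson correlation coefficient between target and prediction."""
    return np.corrcoef(y_true, y_pred)[0, 1]

def relative_reduction(err_nofs, err_fs):
    """Relative error reduction of a feature-selection model vs. NoFS."""
    return (err_nofs - err_fs) / err_nofs

# toy example: a 2% relative RMSE reduction, as reported for Fear
print(relative_reduction(err_nofs=50.0, err_fs=49.0))  # -> 0.02
```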

These results indicate that the additional effort to perform feature selection is justified, since it contributes to the improvement of the accuracy of phone duration modeling, optimizing the models' performance. This can be seen in Tables 2, 3 and 4, where the PDMs using feature selection outperformed the respective PDMs with NoFS, even when the particular phone duration models did not have the best accuracy among the other PDMs in the particular category of emotional speech.

5 Conclusions

In the present work we presented an overview of various phone duration modeling techniques studied over the years and evaluated ten different phone duration models, based on decision trees, linear regression, lazy-learning algorithms and meta-learning algorithms, in the context of emotional speech. In this evaluation we employed all the features introduced in the literature for the phone and syllable duration prediction tasks. Two feature selection techniques, namely the RReliefF and the Correlation-based Feature Selection (CFS) algorithms, were involved for performance optimization of the phone duration models. A Modern Greek database of emotional speech was used in all experiments. The experimental results demonstrated that in four out of the five categories of emotional speech, i.e. Anger, Fear, Neutral and Sadness, regardless of the machine learning algorithm used in the phone duration models, feature selection contributed to achieving a better phone duration prediction accuracy, when compared to the case without feature selection. In the Joy emotional category, the best model was built with the full feature set, and feature selection did not bring better accuracy. This might be related to the low value of the weighted standard deviation of the Joy emotional category, leading to a very robust phone duration model even without the implementation of feature selection. The M5p trees based phone duration model was observed to achieve the best phone duration prediction accuracy in terms of RMSE and MAE among all evaluated models.

References

Aha, D., Kibler, D., & Albert, M. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Allen, J., Hunnicutt, S., & Klatt, D. H. (1987). From text to speech: the MITalk system. Cambridge: Cambridge University Press.

Arvaniti, A., & Baltazani, M. (2000). Greek ToBI: a system for the annotation of Greek speech corpora. In Proceedings of the 2nd international conference on language resources and evaluation (pp. 555–562). Athens, Greece.

Atkeson, C. G., Moore, A. W., & Schaal, S. (1996). Locally weighted learning. Artificial Intelligence Review, 11, 11–73.

Barbosa, P. A., & Bailly, G. (1994). Characterisation of rhythmic patterns for text-to-speech synthesis. Speech Communication, 15, 127–137.

Bartkova, K., & Sorin, C. (1987). A model of segmental duration for speech synthesis in French. Speech Communication, 6, 245–260.

Bell, A., Jurafsky, D., Fosler-Lussier, E., Girand, C., Gregory, M., & Gildea, D. (2003). Effects of disfluencies, predictability, and utterance position on word form variation in English conversation. Journal of the Acoustical Society of America, 113(2), 1001–1024.

Black, A. (2003). Unit selection and emotional speech. In Proceedings of EUROSPEECH'03 (pp. 1649–1652). Geneva, Switzerland.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.

Burkhardt, F., & Sendlmeier, W. F. (2000). Verification of acoustical correlates of emotional speech using formant-synthesis. In Proceedings of the ISCA workshop on speech & emotion (pp. 151–156). Northern Ireland.

Campbell, W. N. (1992). Syllable based segment duration. In G. Bailly, C. Benoit, & T. R. Sawallis (Eds.), Talking machines: theories, models and designs (pp. 211–224). Amsterdam: Elsevier.

Carlson, R., & Granstrom, B. (1986). A search for durational rules in a real speech database. Phonetica, 43, 140–154.

Chien, J. T., & Huang, C. H. (2003). Bayesian learning of speech duration models. IEEE Transactions on Speech and Audio Processing, 11(6), 558–567.

Chung, H. (2002). Duration models and the perceptual evaluation of spoken Korean. In Proceedings of speech prosody (pp. 219–222). France.

Cordoba, R., Montero, J. M., Gutierrez-Ariola, J., & Pardo, J. M. (2001). Duration modeling in a restricted-domain female-voice synthesis in Spanish using neural networks. In Proceedings of ICASSP'01 (pp. 793–796). Utah, USA.

Crystal, T. H., & House, A. S. (1988). Segmental durations in connected-speech signals: current results. Journal of the Acoustical Society of America, 83(4), 1553–1573.

Dutoit, T. (1997). An introduction to text-to-speech synthesis. Dordrecht: Kluwer Academic.

Epitropakis, G., Tambakas, D., Fakotakis, N., & Kokkinakis, G. (1993). Duration modelling for the Greek language. In Proceedings of EUROSPEECH'93 (pp. 1995–1998). Berlin, Germany.

Febrer, A., Padrell, J., & Bonafonte, A. (1998). Modeling phone duration: application to Catalan TTS. In Workshop on speech synthesis (pp. 43–46). Australia.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.


Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38(4), 367–378.

Gilad-Bachrach, R., Navot, A., & Tishby, N. (2004). Margin based feature selection—theory and algorithms. In P. Tadepalli, R. Givan, & K. Driessens (Eds.), Proceedings of the 21st international conference on machine learning (pp. 43–50). Banff: Morgan Kaufmann.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Boston: Addison–Wesley/Longman.

Goubanova, O., & King, S. (2008). Bayesian network for phone duration prediction. Speech Communication, 50, 301–311.

Goubanova, O., & Taylor, P. (2000). Using Bayesian belief networks for modeling duration in text-to-speech systems. In Proceedings of ICSLP'00 (pp. 427–431). Beijing, China.

Gregory, M., Bell, A., Jurafsky, D., & Raymond, W. (2001). Frequency and predictability effects on the duration of content words in conversation. Journal of the Acoustical Society of America, 110(5), 27–38.

Hall, M. A. (1999). Correlation-based feature subset selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Waikato, New Zealand.

Hall, M. A. (2000). Correlation-based feature selection for discrete and numeric class machine learning. In P. Langley (Ed.), Proceedings of the 17th international conference on machine learning (pp. 359–366). San Francisco: Morgan Kaufmann.

Heuft, B., Portele, T., & Rauth, M. (1996). Emotions in time domain synthesis. In Proceedings of ICSLP'96 (pp. 1974–1977). Philadelphia, USA.

Iida, A., Campbell, N., Iga, S., Higuchi, F., & Yasumura, M. (2000). A speech synthesis system for assisting communication. In Proceedings of the ISCA workshop on speech & emotion (pp. 167–172). Northern Ireland.

Inanoglu, Z., & Young, S. (2009). Data-driven emotion conversion in spoken English. Speech Communication, 51, 268–283.

Iwahashi, N., & Sagisaka, Y. (2000). Statistical modeling of speech segment duration by constrained tree regression. IEICE Transactions on Information and Systems, E83-D(7), 1550–1559.

Jiang, D. N., Zhang, W., Shen, L., & Cai, L. H. (2005). Prosody analysis and modeling for emotional speech synthesis. In Proceedings of ICASSP'05 (pp. 281–284). Philadelphia, USA.

Kääriäinen, M., & Malinen, T. (2004). Selective Rademacher penalization and reduced error pruning of decision trees. Journal of Machine Learning Research, 5, 1107–1126.

Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. In D. Sleeman & P. Edwards (Eds.), Proceedings of the 9th international conference on machine learning (pp. 249–256). Aberdeen, Scotland. San Francisco: Morgan Kaufmann.

Klatt, D. H. (1976). Linguistic uses of segmental duration in English: acoustic and perceptual evidence. Journal of the Acoustical Society of America, 59, 1209–1221.

Klatt, D. H. (1979). Synthesis by rule of segmental durations in English sentences. In B. Lindblom & S. Ohman (Eds.), Frontiers of speech communication research (pp. 287–300). New York: Academic Press.

Klatt, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3), 737–793.

Kohler, K. J. (1988). Zeitstrukturierung in der Sprachsynthese. ITG-Tagung Digitale Sprachverarbeitung, 6, 165–170.

Kominek, J., & Black, A. W. (2003). CMU ARCTIC databases for speech synthesis (Tech. Rep. CMU-LTI-03-177). Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

Kononenko, I. (1994). Estimating attributes: analysis and extensions of RELIEF. In F. Bergadano & L. De Raedt (Eds.), Proceedings of the European conference on machine learning (pp. 171–182). New York: Springer.

Krishna, N. S., & Murthy, H. A. (2004). Duration modeling of Indian languages Hindi and Telugu. In Proceedings of the 5th ISCA speech synthesis workshop (pp. 197–202). Pittsburgh, USA.

Krishna, N. S., Talukdar, P. P., Bali, K., & Ramakrishnan, A. G. (2004). Duration modeling for Hindi text-to-speech synthesis system. In Proceedings of ICSLP'04 (pp. 789–792). Jeju Island, Korea.

Lazaridis, A., Zervas, P., & Kokkinakis, G. (2007). Segmental duration modeling for Greek speech synthesis. In Proceedings of ICTAI'07 (pp. 518–521). Patras, Greece.

Lee, S., & Oh, Y. H. (1999a). Tree-based modeling of prosodic phrasing and segmental duration for Korean TTS systems. Speech Communication, 28, 283–300.

Lee, S., & Oh, Y. H. (1999b). CART-based modelling of Korean segmental duration. In Proceedings of the oriental COCOSDA'99 (pp. 109–112). Taipei, Taiwan.

Möbius, B., & van Santen, J. P. H. (1996). Modeling segmental duration in German text-to-speech synthesis. In Proceedings of ICSLP'96 (pp. 2395–2398). Philadelphia, USA.

Murray, I. R., & Arnott, J. L. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16, 369–390.

Oatley, K., & Johnson-Laird, P. (1998). The communicative theory of emotions. In J. Jenkins, K. Oatley, & N. Stein (Eds.), Human emotions: a reader (pp. 84–87). Oxford: Blackwell.

Olive, J. P., & Liberman, M. Y. (1985). Text to speech—an overview. Journal of the Acoustical Society of America, 78(1), S6.

Quinlan, J. R. (1992). Learning with continuous classes. In Proceedings of the 5th Australian joint conference on artificial intelligence (pp. 343–348). Hobart, Tasmania.

Rank, E., & Pirker, H. (1998). Generating emotional speech with a concatenative synthesizer. In Proceedings of ICSLP'98 (pp. 671–674). Sydney, Australia.

Rao, K. S., & Yegnanarayana, B. (2007). Modeling durations of syllables using neural networks. Computer Speech & Language, 21(2), 282–295.

Riley, M. (1992). Tree-based modelling for speech synthesis. In G. Bailly, C. Benoit, & T. R. Sawallis (Eds.), Talking machines: theories, models and designs (pp. 265–273). Amsterdam: Elsevier.

Robnik-Sikonja, M., & Kononenko, I. (1997). An adaptation of Relief for attribute estimation in regression. In D. H. Fisher (Ed.), Proceedings of the 14th international conference on machine learning (pp. 296–304). San Francisco: Morgan Kaufmann.

Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., & Hirschberg, J. (1992). ToBI: a standard for labeling English prosody. In Proceedings of ICSLP'92 (pp. 867–870). Banff, Alberta, Canada.

Simoes, A. R. M. (1990). Predicting sound segment duration in connected speech: an acoustical study of Brazilian Portuguese. In Proceedings of the workshop on speech synthesis (pp. 173–176). Autrans, France.

Takeda, K., Sagisaka, Y., & Kuwabara, H. (1989). On sentence-level factors governing segmental duration in Japanese. Journal of the Acoustical Society of America, 86(6), 2081–2087.

Teixeira, J. P., & Freitas, D. (2003). Segmental durations predicted with a neural network. In Proceedings of EUROSPEECH'03 (pp. 169–172). Geneva, Switzerland.

Tesser, F., Cosi, P., Drioli, C., & Tisato, G. (2005). Emotional Festival-MBROLA TTS synthesis. In Proceedings of INTERSPEECH'05 (pp. 505–508). Lisboa, Portugal.

van Santen, J. P. H. (1992). Contextual effects on vowel durations. Speech Communication, 11, 513–546.

van Santen, J. P. H. (1994). Assignment of segmental duration in text-to-speech synthesis. Computer Speech & Language, 8(2), 95–128.

Wang, L., Zhao, Y., Chu, M., Zhou, J., & Cao, Z. (2004). Refining segmental boundaries for TTS database using fine contextual-dependent boundary models. In Proceedings of ICASSP'04 (pp. 641–644). Montreal, Canada.

Wang, Y., & Witten, I. H. (1997). Induction of model trees for predicting continuous classes. In Proceedings of the 9th European conference on machine learning (pp. 128–137). Prague: University of Economics, Faculty of Informatics and Statistics.

Witten, I. H., & Frank, E. (2005). Data mining: practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann.

Yamagishi, J., Kawai, H., & Kobayashi, T. (2008). Phone duration modeling using gradient tree boosting. Speech Communication, 50(5), 405–415.

Yegnanarayana, B. (1999). Artificial neural networks. New Delhi: Prentice Hall.