IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 4, NO. 4, JULY 1996, p. 307

Feature Analysis and Neural Network-Based Classification of Speech Under Stress

John H. L. Hansen and Brian D. Womack

Abstract- It is well known that the variability in speech production due to task-induced stress contributes significantly to loss in speech processing algorithm performance. If an algorithm could be formulated that detects the presence of stress in speech, then such knowledge could be used to monitor speaker state, improve the naturalness of speech coding algorithms, or increase the robustness of speech recognizers. The goal in this study is to consider several speech features as potential stress-sensitive relayers using a previously established stressed speech database (SUSAS). The following speech parameters will be considered: mel, delta-mel, delta-delta-mel, autocorrelation-mel, and cross-correlation-mel cepstral parameters. Next, an algorithm for speaker-dependent stress classification is formulated for the 11 stress conditions: Angry, Clear, Cond50, Cond70, Fast, Lombard, Loud, Normal, Question, Slow, and Soft. It is suggested that additional feature variations beyond neutral conditions reflect the perturbation of vocal tract articulator movement under stressed conditions. Given a robust set of features, a neural network-based classifier is formulated based on an extended delta-bar-delta learning rule. Performance is considered for the following three test scenarios: monopartition (nontargeted) and tripartition (both nontargeted and targeted) input feature vectors.

I. INTRODUCTION

The problem of speaker stress classification is to assess the degree of a specific stress condition present in a speech utterance. "Stress," in this study, refers to perceptually induced variations on the production of speech. The variation in speech production due to stress can be substantial and will therefore have an impact on the performance of speech processing applications [4], [10]. A number of studies have focused on stressed speech analysis in an effort to identify meaningful relayers of stress. Unfortunately, many research findings at times disagree, due in part to the variation in the experimental design protocol employed to induce stressed speech and to differences in how speakers impart stress in their speech production. A number of studies have considered the effects of stress on variability of speech production [1], [5], [6], [9]. One stress condition of interest is the Lombard effect [2], [3], [6], which results when speech is produced in the presence of background noise. In order to reveal the underlying nature of speech production under stress, an extensive evaluation of five speech production feature domains, including glottal spectrum, pitch, duration, intensity, and vocal tract spectral structure, was previously conducted [5]. Extensive statistical assessment of over 200 parameters for simulated and actual speech under stress suggests that stress classification based on feature distribution separability characteristics is possible. One approach that has been suggested as a means of modeling stress for recognition is based on a source generator framework [3], [4]. In this approach, stress is modeled as perturbations along a path in a multidimensional articulatory space. Using this framework, improvement in speech recognition was demonstrated for noisy Lombard effect speech [3]. This approach has also been considered as a means for generating artificial stressed training tokens [4]. Finally, a tandem neural network stress classifier and HMM recognizer based on this framework has been shown to be effective for recognition under several stress conditions including the Lombard effect [10].

Although a number of studies have considered analysis of speech under stress, the problem of stressed speech classification has received little attention in the literature. One exception is a study on detection of stressed speech using a parameterized response obtained from the Teager nonlinear energy operator [1]. Previous studies directed specifically at robust speech recognition differ in that they estimate intraspeaker variations via speaker adaptation, front-end stress compensation, or wider domain training sets. While speaker adaptation techniques can address the variation across speaker groups under neutral conditions, they are not, in general, capable of addressing the variations exhibited by a given speaker under stressed conditions. Front-end stress compensation techniques such as MCE-ACC [3], which employ morphologically constrained feature enhancement, have demonstrated improved recognition performance. Next, larger data sets have been considered, such as multistyle training [7], to improve performance in speaker-dependent systems. Additionally, an extension of multistyle training based on stress token generation has also shown improvement in stressed speech recognition [4]. However, for speaker-independent systems, multistyle training results in performance loss over neutral trained systems [10] since it is believed that the HMM's cannot always span both the stress and speaker production domains. Hence, the problem of stressed speech recognition requires the incorporation of stress knowledge. This can be accomplished implicitly through robust features or front-end stress equalization. Alternately, stress knowledge could be incorporated explicitly by using a stress classifier to direct a codebook of stress-dependent recognition systems [10]. Several application areas exist for stress classification such as objective stress assessment or improved speech processing for synthesis or recognition. For example, a stress detector could direct highly emotional telephone calls to a priority operator at a metropolitan emergency service. A stress classification system could also provide meaningful information to speech systems for recognition, speaker verification, and coding.

In this study, the problem of speech feature selection for classification of speech under stress is addressed. The focus here is to develop a classification algorithm using features that have traditionally been employed for recognition. It is suggested that such a stress classifier could be used to improve the robustness of future speech recognition algorithms under stress. An analysis of vocal tract variation under stress using cross-sectional areas, acoustic tubes, and spectral features is considered. Given knowledge of these variations, five cepstral-based feature sets are employed in the formulation of a neural network-based stress classification algorithm. Next, the feature sets are analyzed using an objective separability measure to select the set that is most appropriate for stress classification. Finally, the stress classification algorithm is evaluated using an existing speech under stress database (SUSAS) for i) feature set selection and ii) monopartition stress classification algorithm performance.

Manuscript received July 18, 1994; revised February 2, 1996. The associate editor coordinating the review of this paper and approving it for publication was Dr. Xuedong Huang.

The authors are with the Robust Speech Processing Laboratory, Department of Electrical Engineering, Duke University, Durham, NC 27708-029 USA.

Publisher Item Identifier S 1063-6676(96)05073-0.

1063-6676/96 $05.00 (c) 1996 IEEE

II. SPEECH UNDER STRESS

Speaker stress assessment is useful for applications such as emergency telephone message sorting and aircraft voice communications monitoring. Here, stress can be defined as a condition that causes the speaker to vary the production of speech from neutral conditions. Neutral speech is defined as speech produced assuming that the speaker is in a "quiet room" with no task obligations. With this



Fig. 1. Neutral versus Angry vowel /EH/ in "help" vocal tract variation. (Left panel: Neutral vowel /EH/ in "help"; right panel: Angry vowel /EH/ in "help".)

definition, two stress effect areas emerge: perceptual and physiological. Perceptually induced stress results when a speaker perceives his environment to be different from "normal" such that speech production intention varies from neutral conditions. The causes of perceptually induced stress include emotion, environmental noise (i.e., Lombard effect), and actual task workload (pilot in an aircraft cockpit). Physiologically induced stress is the result of a physical impact on the human body that results in deviations from neutral speech production despite intentions. Causes of physiological stress can include vibration, G-force, drug interactions, sickness, and air density. In this study, the following 11 perceptually induced stress conditions from the SUSAS database are considered: Angry, Clear, Cond50, Cond70,¹ Fast, Lombard, Loud, Neutral, Question, Slow, and Soft.

A. SUSAS Database

The evaluations conducted in this study are based on data previously collected for analysis and algorithm formulation of speech analysis in noise and stress. This database refers to speech under simulated and actual stress (SUSAS) and has been employed extensively in the study of how speech production varies when speaking during stressed conditions.* A vocabulary set of 35 aircraft words makes up over 95% of the database. These words consist of monosyllabic and multisyllabic words that are highly confusable. Examples include /go-oh-no/, /wide-white/, and /six-fix/. A more complete discussion of SUSAS can be found in the literature [3], [5].

III. VOCAL TRACT MODELING

Before a stress classification algorithm is formulated, it would be beneficial to illustrate how stress affects vocal tract structure. This section will demonstrate feature perturbations due to stress via

i) visualization of vocal tract shape;
ii) analysis of acoustic tube cross-sectional area;
iii) speech parameter movements.

This analysis is based on a linear acoustic tube model with speech sampled at 8 kHz. The following sections are intended to show that a relation exists between speech production perturbation, acoustic tube analysis, and recognition feature variations.

¹Cond50 and Cond70 refer to speech spoken while performing a moderate and high workload computer response task.

*Approximately half of the SUSAS database consists of style data donated by Lincoln Laboratories [7].

A. Vocal Tract Shape

One means of illustrating the effects of stress on speech production is to visualize the physical vocal tract shape. The movements throughout the vocal tract can be displayed by superimposing a time sequence of estimated vocal tract shapes for a chosen phoneme. The vocal tract shape analysis algorithm assumes a known normalized area function and acoustic tube length. The articulatory model approach by Wakita [11] was used to consider changes in vocal tract shape under neutral and angry conditions, as illustrated in Fig. 1. Here, a set of vocal tract shapes is superimposed for each frame in the analysis window (10 frames for Normal and 18 frames for Angry, with 24 ms/frame). For the Normal condition, the greatest perturbation is in the pharynx cavity. However, for the Angry condition, the greatest perturbation is in the blade and dorsum of the tongue and the lips. This suggests that when a speaker is under stress, typical vocal tract movement is affected, resulting in quantifiable perturbation in articulator position.

B. Acoustic Tube Area

Next, a second experiment is performed to demonstrate that vocal tract variation due to stress results in vocal tract parameter variation. The experiment assumed fixed tube lengths in order to calculate the area coefficients for a 15-tube vocal tract model. These coefficients are calculated and logarithmically scaled for all frames of the /EH/ sound in the word "help" for Normal and Angry stressed speech. Fig. 2 shows the resulting change in acoustic tube models for cross-sectional areas. Each frame of the /EH/ phoneme is superimposed to show acoustic tube perturbations throughout the utterance. A greater perturbation in the acoustic tube area parameters for the Angry condition is observed. In addition, note the wide range of area perturbations across stress conditions.

C. Speech Parameter Variation Due to Stress

Finally, speech parameter variation due to stress is considered. In Fig. 3, one autocorrelation Mel (AC-Mel) speech analysis parameter is chosen to illustrate the variation due to stress. The key difference is observed by contrasting the gradual transitions across the utterance for the Normal compared with the Angry speech parameter contour. We also note the longer duration and approximately bimodal nature of the Angry contour.

It has been shown that speech under stress causes variation in vocal tract structure, acoustic tube models, and speech parameters across time. In general, assessment of vocal tract shape is useful for the analysis of speech under stress. We note that the vocal tract model



Fig. 4. Separable and nonseparable stressed speech parameters. (a) Separable: vowel /EH/ in "help" (Angry), AC3 (mean = 0.4, variance = 0.1). (b) Nonseparable: vowel /EH/ in "help" (Angry), AC2 (mean = 0.7, variance = 0.3).

cross-correlation is found from one Mel coefficient C_i to another C_j across frames. The XC-Mel parameters XC_{i,j} provide a quantitative measure of the relative change of broad versus fine spectral structure in energy bands. Since the correlation window length (L = 7) and correlation lags (l = 2) are fixed in this study, the correlation terms are a measure of how correlated adjacent frames are over a 72-ms window (24 ms/frame and 8-ms skip rate). It is apparent that both AC-Mel and XC-Mel parameters provide a measure of correlation and relative change in spectral band energies over an extended window frame. From feature analysis, it is suggested that the AC-Mel parameters have similar properties to the XC-Mel parameters. In addition, the AC-Mel parameters can be directly compared with other selected feature sets since they are based on a single coefficient index i. Therefore, AC-Mel parameters will be used for stress classification instead of the XC-Mel parameters.
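The XC-Mel equation itself is lost from this transcript, but the description above can be approximated as a normalized sliding-window cross-correlation between two coefficient tracks; the function names and the correlation normalization are assumptions, with the stated L = 7 frames and lag l = 2 used as defaults:

```python
import numpy as np

def xc_mel(cep, i, j, lag=2, win=7):
    """Hypothetical sketch of XC-Mel features: correlate the track of mel
    cepstral coefficient i against coefficient j across frames, over a
    sliding window of `win` frames at a fixed frame `lag`.
    cep: (n_frames, n_coeffs) matrix of mel cepstra."""
    ci, cj = cep[:, i], cep[:, j]
    out = []
    for t in range(len(ci) - win - lag + 1):
        a = ci[t : t + win]
        b = cj[t + lag : t + lag + win]
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        out.append(float((a * b).sum() / denom) if denom > 0 else 0.0)
    return np.array(out)

def ac_mel(cep, i, lag=2, win=7):
    """AC-Mel as the single-index special case (i correlated with itself)."""
    return xc_mel(cep, i, i, lag=lag, win=win)
```

With 24 ms/frame and an 8-ms skip, 7 frames plus a 2-frame lag span roughly the 72-ms window mentioned in the text, which is why those defaults were chosen.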

B. Cluster Analysis of Parameters as Potential Stress Relayers for Classification

A clear visualization of parameter distributions is a beneficial first step for the determination of an optimal stress classification feature set. This is accomplished by obtaining, for a chosen phoneme, pairwise parameter scatter distributions for each frame and each stress condition to be studied. An evaluation over the five parameter representations (C, DC, D2C, AC, XC) considered each feature set's ability to reflect stress variation. Scatter distributions were used to visualize the degree of separability for a selected pair of parameters versus time (i.e., 10 coefficients per parameter set and 495 scatter plots per parameter domain for a total of 2475 possible scatter plots per phoneme). After considering an extensive number of scatter distributions such as that illustrated for the sample /EH/ phoneme in Fig. 4, a number of clear trends emerged that confirmed which speech parameters are better suited for stress classification. In this example, the figure illustrates two pairs of features: the first pair is well separated, and the second is poorly separated. For example, Lombard and Soft speech is shown to be well separated from all other stress conditions. AC-Mel parameters consistently showed high degrees of separability for those phones considered. Similar analysis was conducted for the DC-Mel and D2C-Mel parameters, indicating a consistently high degree of overlap; hence, they are less appropriate for stress classification. This implies that these parameters are less sensitive to stress effects and, hence, will be more useful for speech recognition [2].
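The plot counts quoted in parentheses follow directly from the combinatorics, assuming 45 coefficient pairs per set are examined under each of the 11 stress conditions:

```python
from math import comb

coeffs_per_set = 10
pairs = comb(coeffs_per_set, 2)         # 45 coefficient pairs per set
stress_conditions = 11
per_domain = pairs * stress_conditions  # 495 scatter plots per parameter domain
domains = 5                             # C, DC, D2C, AC, XC
total = per_domain * domains            # 2475 possible scatter plots per phoneme
```

The per-condition factor of 11 is an inference from the stated totals (45 x 11 = 495), since the text does not spell out how 495 is reached.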

C. Separability Distance Measure

Due to the wide range of features and stress conditions, it is desirable to establish an objective measure to predict stress classification performance. Hence, a measure that assesses a parameter's classification ability is one that increases when the distance between cluster centers increases and variances decrease. The following measure is suggested:

d1(x_i, x_j) = ||x_i - x_j|| / [σ(a,i) + σ(a,j) + σ(b,i) + σ(b,j)], given i ≠ j  (3)

where i = 1,...,n and j = 1,...,n index the possible stress conditions, and x_i and x_j are the cluster centers formed from parameter indexes a and b (for Table I, a = 3 and b = 6). Here, x reflects the mean and σ(a,i) the standard deviation of the ith stress condition for speech feature a. This measure forms a 2-D distance between two speech parameterization classes that is easily visualized. The main underlying assumption of this measure is that the features under test form a Gaussianly distributed convex set. An example of the objective measure of separability is calculated for the two cluster centers given AC3 and AC6 for the /EH/ phoneme in the word "help." The values calculated using (3) are summarized in Table I, providing a pairwise comparison of the separability of two stress conditions. Each mean provides an overall measure of the degree of overlap between a given stress condition and all other stress conditions. Values for d1 that are greater than the mean indicate better separability than values less than the mean.
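Since equation (3) is badly garbled in this transcript, the exact normalization is a reconstruction; under that reading, the measure can be sketched as the Euclidean distance between two 2-D cluster centers divided by the sum of the four standard deviations:

```python
import math

def d1(mu_i, mu_j, sig_i, sig_j):
    """Pairwise separability in the spirit of (3): distance between the
    2-D cluster centers of stress conditions i and j (for a chosen
    parameter pair a, b), normalized by the sum of the per-feature
    standard deviations. Larger values indicate better separation.
    The normalization is an assumption (the equation is garbled in the
    source transcript)."""
    num = math.hypot(mu_i[0] - mu_j[0], mu_i[1] - mu_j[1])
    den = sig_i[0] + sig_j[0] + sig_i[1] + sig_j[1]
    return num / den

# Illustrative values only (not the paper's measured AC3/AC6 statistics):
# well-separated centers with small spreads give a large d1.
example = d1((0.4, 0.7), (1.4, 0.7), (0.1, 0.1), (0.1, 0.1))
```

As required, the value grows when the centers move apart and shrinks when the standard deviations grow.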


TABLE I
SPEECH STYLE DISTANCE MEASURE d1 (AC3 VERSUS AC6)
Columns: Angry, Clear, Cond50, Cond70, Fast, Lombard, Loud, Normal, Question, Slow, Soft (rows for Angry through Loud omitted):

Normal    0.20 0.43 0.20 0.37 0.14 0.53 0.45 0    0.85 0.83 1.27
Question  0.89 0.45 2.03 0.75 1.00 0.47 0.40 0.85 0    0.25 0.71
Slow      0.82 0.40 2.00 0.79 1.02 0.60 0.45 0.83 0.25 0    0.90
Soft      1.44 1.00 2.85 1.35 1.46 0.67 0.89 1.27 0.71 0.90 0
MAXIMUM   1.44 1.00 2.85 1.35 1.46 1.00 1.03 1.27 2.03 2.00 -
MEAN      0.62 0.51 1.21 0.60 0.63 0.55 0.48 0.53 0.78 0.81 -

In Table I, d1 = 1.44 for Angry and Soft, which is much higher than the mean of 0.62, thus indicating that the AC3 and AC6 parameters are well separated for these two stress conditions. Finally, having obtained pairwise intrafeature distance measures d1(x_i, x_j) for parameter indexes a, b as shown in Table I, it is desirable to have an overall measure that provides a summary of the differentiating capability of pairwise features across stress conditions. This measure, which has been denoted d2(x^a, x^b), estimates the overall distance for parameter indexes a and b as follows:

d2(x^a, x^b) = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{n} d1(x_i, x_j), i ≠ j  (4)

This measure assesses the n-dimensional "distance" between all n stress classes under consideration. A stress separability evaluation of Mel-cepstral parameters was performed using the d2 measure for each feature and stress condition across selected phonemes. The global d2(x^a, x^b) scores for (C_i, DC_i, D2C_i, and AC_i) were (6.96, 1.42, 1.69, and 7.24), respectively. Hence, the AC-Mel features are the most separable spectral feature set considered based on this distance measure. It is suggested that the broader detail captured by the AC-Mel parameters is more reliable for stress classification.
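Equation (4) is also a reconstruction; the reading sketched below accumulates the pairwise d1 terms over all ordered pairs and divides by n, which is consistent with the reported magnitudes (it equals the sum of the per-condition mean rows of Table I), but the exact normalization remains an assumption:

```python
import math

def d2(mu, sig):
    """Overall separability in the spirit of (4), as reconstructed here:
    sum the pairwise d1 terms over all ordered pairs i != j of the n
    stress conditions, then divide by n. Dividing by n rather than n**2
    is an assumption made to match the reported score magnitudes.
    mu:  list of n (mean_a, mean_b) cluster centers,
    sig: list of n (sigma_a, sigma_b) standard deviations."""
    n = len(mu)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            num = math.hypot(mu[i][0] - mu[j][0], mu[i][1] - mu[j][1])
            den = sig[i][0] + sig[j][0] + sig[i][1] + sig[j][1]
            total += num / den
    return total / n
```

A larger d2 then indicates a feature pair whose stress-condition clusters are, on average, better separated, matching the way the global scores are used to rank the five feature sets.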

V. STRESS CLASSIFICATION

There are three issues addressed in this study to demonstrate the viability of a perceptually induced stress classification system. First, fine versus broad stress group definitions are considered to determine if improved stress classification can be achieved. Second, an evaluation is conducted using extracted monophones from a i) 35-word versus ii) five-word speech corpus vocabulary training and test set. The 35-word corpus is used to evaluate the performance of the single monophone stress classifier across a larger set of phonemes, whereas the five-word corpus is employed to evaluate a more limited set of phonemes. The first vocabulary is evaluated using a limited five-word versus larger 35-word test set to select the "best" feature set for stress classification. The second vocabulary, which consists of five words different from the first vocabulary, is assessed to establish the level of performance of the proposed stress classification algorithm. Finally, performance of the objective separability measures versus the stress classification rates is compared to select the "best" feature set for stress classification. The goal of the stress classification formulation and evaluations in this study is not to find the "best" classification system for stress but rather to obtain the "best" selection from five feature sets for classification.

A. Neural Network Classifier

The proposed neural network classifier consists of a single neural network that is trained with monopartition features (i.e., a single phone class partition). Each partition of speech features is propagated through two hidden layers of the neural network to an output layer that estimates the stress probability scores. The neural network training method employed in this study is the cascade correlation backpropagation network using the extended delta-bar-delta learning rule [8]. This method was selected due to its flexibility. Its strength is its ability to use only as many hidden units as are needed to perform optimal classification. Additionally, this algorithm is capable of forming the complex contoured hypersurface decision boundaries needed for the stress classification problem.
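The extended delta-bar-delta rule of [8] builds on the basic delta-bar-delta idea of keeping a separate adaptive learning rate per weight; a minimal sketch of that underlying update is below (the extended variant adds further memory/attenuation terms not shown, and the hyperparameter values here are illustrative, not taken from the paper):

```python
import numpy as np

def delta_bar_delta_step(w, grad, lr, bar_delta,
                         kappa=0.01, phi=0.1, theta=0.7):
    """One basic delta-bar-delta update: each weight keeps its own
    learning rate, increased additively by kappa when the current
    gradient agrees in sign with the running average bar_delta, and
    decreased multiplicatively by phi when it disagrees.
    kappa, phi, theta are illustrative hyperparameters (assumptions)."""
    agree = grad * bar_delta > 0
    disagree = grad * bar_delta < 0
    lr = lr + kappa * agree            # additive increase on sign agreement
    lr = lr * (1.0 - phi * disagree)   # multiplicative decrease on disagreement
    w = w - lr * grad                  # gradient step with per-weight rates
    bar_delta = (1 - theta) * grad + theta * bar_delta
    return w, lr, bar_delta
```

Per-weight rate adaptation of this kind speeds convergence on error surfaces whose curvature differs greatly across weights, which is one reason the rule suits the "complex contoured hypersurface" boundaries described above.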

B. Stress Classifier Evaluation

The stress classification algorithm was evaluated using a collection of features derived from frame- to word-level features. Both fine and broad stress classes are evaluated to determine which is more effective for stress classification. The fine (i.e., ungrouped) stress classes are simply the 11 stress conditions in this study. Ungrouped stress class neural network classifier performance is summarized in Table II using the closed 35- and five-word test sets from the first vocabulary under evaluation. Classification rates ranged from 25% to 47% for the 35-word test set, which is greater than chance (i.e., 9.1%). It is clear that for some stress conditions, such as computer response tasks Cond50/70, Fast, and Soft spoken speech, significant classification performance is attained. By decreasing the first vocabulary size from 35 to five words, classification rates increased to 60%-61% as summarized in Table II(b). These increased classification rates support the assertion that phonemes are affected differently by stress since the smaller vocabulary has fewer phonemes, and the neural network classifier can then focus on particular variations due to stress.


312

TABLE II: STRESS CLASSIFICATION PERFORMANCE, SINGLE SPEAKER, STRESS UNGROUPED. (a) 35-word test set; (b) five-word test set ("Brake", "East", "Freeze", "Help", "Steer"). Overall classification rates (%) by feature set (C, ΔC, Δ²C, AC-Mel): (a) 33.14, 24.69, 47.31, 32.57; (b) 59.98, 60.51, 61.16, 61.40.

TABLE III: CLASSIFICATION FOR GROUPED FIVE-WORD (a) IN-VOCABULARY CLOSED AND (b) OUT-OF-VOCABULARY OPEN TESTS. Classification rates (%) by stress group and feature set (C, ΔC, Δ²C, AC-Mel).

Next, broad (i.e., grouped) stress classes are evaluated by combining perceptually similar stress conditions that may cluster in similar domains. Note that this grouping resulted from informal listening tests as to which stressed conditions were perceptually similar (i.e., G1 = (Angry, Loud), G2 = (Cond50, Cond70, Normal, Soft), G3 = (Fast), G4 = (Question), G5 = (Slow), G6 = (Clear), and G7 = (Lombard)). Employing

stress class grouping, classification rates are further improved by +17%-20% to 77%-81% (compare Table II(b) with Table III(a)).

Hence, it is shown that stress class grouping using less confusable subgroups improves classifier performance. It is suggested that further improvement in classification could be accomplished using a two-step decision process, in which grouped stress conditions are more finely discriminated in a second stage, if a larger speech corpus is used or if noise is present. Finally, the performance of the stress classification system is evaluated using the second five-word out-of-vocabulary test set with similar phoneme content. Classification rates ranged from 43% to 48%, as shown in Table III(b), which is greater than chance (i.e., 14.3%). These results agree with the expected stress class differentiability of the AC-Mel feature set based on objective separability measures.
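For concreteness, the seven-group partition described above and the corresponding chance rate can be written down directly. The dictionary below is simply a transcription of the perceptual groups; the helper name is our own.

```python
# The perceptual grouping of the 11 fine stress conditions into
# seven broad groups G1-G7, as described in the text.
STRESS_GROUPS = {
    "Angry": "G1", "Loud": "G1",
    "Cond50": "G2", "Cond70": "G2", "Normal": "G2", "Soft": "G2",
    "Fast": "G3",
    "Question": "G4",
    "Slow": "G5",
    "Clear": "G6",
    "Lombard": "G7",
}

def chance_rate(labels):
    """Chance classification rate when classes are equally likely."""
    return 1.0 / len(set(labels))
```

With seven groups the chance rate is 1/7, i.e., the 14.3% figure quoted for the grouped tests.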

Table III(b): overall classification rates (%) by feature set (C, ΔC, Δ²C, AC-Mel): 47.82, 42.85, 44.90, 48.38.

VI. CONCLUSIONS

In this study, a stress-sensitive feature set has been proposed for use in stress classification. Further, a monopartition stress classification system has been formulated using neural networks. An analysis was performed for five speech feature representations as potential stress relayers. Features were considered with respect to the following:

i) pair-wise stress class separability;
ii) a numerical pair-wise and global objective measure of feature separability across stressed conditions;
iii) analysis of acoustic tube and vocal tract cross-sectional area variation under stress.

Feature analysis suggests that perturbations in speech production under stress are reflected to varying degrees across multiple feature domains, depending on stress condition and phoneme group. The results have demonstrated the effects of speaker stress on both micro (phoneme) and macro (whole word or phrase) levels. Phoneme classes are affected differently by stress. For example, the unvoiced consonant stops (/P/, /T/, /K/) are perturbed little by stress, whereas the vowels (/AE/, /EH/, /IH/, /ER/, /UH/) are significantly affected. In


addition, coarticulation effects are more critical for stressed speech, since stress variation across a phoneme sequence is more pronounced than for an isolated phoneme. Hence, an algorithm that uses a front-end phoneme group classifier could improve overall stress classification performance [10]. It was shown that the autocorrelation of the Mel-cepstral (AC-Mel) parameters are the most useful features considered for separating the selected stress conditions.

Next, a cascade correlation extended delta-bar-delta-based neural network was formed using each feature to determine stress classification performance. Classification rates across the 11 stress conditions were 79% for in-vocabulary and 46% for out-of-vocabulary tests (both greater than chance, 14.3%), further confirming that AC-Mel parameters are the most separable feature set considered. In conclusion, this study has shown that a particular speech feature representation can influence stress classification performance for different stress styles/conditions, and that a neural network-based classifier over word or phoneme partitions can achieve good classification performance. It is suggested that such knowledge would be useful for monitoring speaker state, as well as ultimately contributing to improvements in speech coding and recognition systems [10].

REFERENCES

[1] D. A. Cairns and J. H. L. Hansen, "Nonlinear analysis and detection of speech under stressed conditions," J. Acoust. Soc. Amer., vol. 96, no. 6, pp. 3392-3400, 1994.
[2] B. A. Hanson and T. Applebaum, "Robust speaker-independent word recognition using instantaneous, dynamic and acceleration features: Experiments with Lombard and noisy speech," in Proc. ICASSP, Apr. 1990, pp. 857-860.
[3] J. H. L. Hansen, "Morphological constrained feature enhancement with adaptive cepstral compensation (MCE-ACC) for speech recognition in noise and Lombard effect," IEEE Trans. Speech Audio Processing, vol. 2, pp. 598-614, Oct. 1994.
[4] J. H. L. Hansen and S. E. Bou-Ghazale, "Robust speech recognition training via duration and spectral-based stress token generation," IEEE Trans. Speech Audio Processing, vol. 3, pp. 415-421, Sept. 1995.
[5] J. H. L. Hansen, "A source generator framework for analysis of acoustic correlates of speech under stress. Part I: Pitch, duration, and intensity effects," J. Acoust. Soc. Amer., submitted for publication.
[6] J.-C. Junqua, "The Lombard reflex and its role on human listeners and automatic speech recognizers," J. Acoust. Soc. Amer., vol. 93, pp. 510-524, Jan. 1993.
[7] R. P. Lippmann, E. A. Martin, and D. B. Paul, "Multistyle training for robust isolated-word speech recognition," in Proc. ICASSP, Apr. 1987, pp. 705-708.
[8] A. A. Minai and R. D. Williams, "Back-propagation heuristics: A study of the extended delta-bar-delta algorithm," in Proc. IJCNN, June 17-21, 1990, pp. 595-600.
[9] B. J. Stanton, L. H. Jamieson, and G. D. Allen, "Robust recognition of loud and Lombard speech in the fighter cockpit environment," in Proc. ICASSP, May 1989, pp. 675-678.
[10] B. D. Womack and J. H. L. Hansen, "Stress independent robust HMM speech recognition using neural network stress classification," in Proc. EuroSpeech, 1995, pp. 1999-2002.
[11] H. Wakita, "Direct estimation of the vocal tract shape by inverse filtering of acoustic speech waveforms," IEEE Trans. Audio Electroacoust., vol. AU-21, pp. 417-427, Oct. 1973.

An Extended Clustering Algorithm for Statistical Language Models

Joerg P. Ueberla

Abstract- An existing clustering algorithm is extended to deal with higher order N-grams, and a faster heuristic version is developed. Even though results are not comparable to back-off trigram models, they outperform back-off bigram models when many million words of training data are not available.

I. INTRODUCTION

It is well known that statistical language models often suffer from a lack of training data. This is true for standard tasks and even more so when one tries to build a language model for a new domain, because a large corpus of texts from that domain is usually not available. One frequently used approach to alleviate this problem is to construct a class-based language model. Let W = w_1, ..., w_n be a sequence of words from a vocabulary V, and let G: w -> G(w) = g_w be a function that maps each word w to its class G(w) = g_w. A class-based bigram language model calculates the probability of seeing the next word w_i as

    p(w_i | w_{i-1}) = p_G(G(w_i) | G(w_{i-1})) · p_G(w_i | G(w_i)).    (1)

In order to derive the clustering function G automatically, a clustering algorithm as shown in Fig. 1 can be used (see [2]). In the spirit of decision-directed learning, it uses as optimization criterion a function F that is very closely related or identical to the final performance measure one wishes to maximize. As suggested in [2], F is based in all our experiments on the leaving-one-out likelihood of the model generating the training data.
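The leaving-one-out criterion scores each training token under a model estimated from the remaining tokens. For a plain unigram model this reduces to decrementing the token's own count, as in this sketch; the epsilon floor for singletons is our simplification, and the criterion actually used in the paper is defined over the class model rather than raw unigrams.

```python
import math
from collections import Counter

def loo_log_likelihood(corpus):
    """Leaving-one-out log-likelihood of a unigram model: each token is
    scored with its own count removed from the estimate. A token seen
    only once would get probability zero, so it is floored at a tiny
    epsilon here (real systems smooth such events properly)."""
    counts = Counter(corpus)
    n = len(corpus)
    eps = 1e-12
    return sum(c * math.log(max((c - 1) / (n - 1), eps))
               for c in counts.values())
```

Because every token's probability is estimated without that token, maximizing this quantity penalizes clusterings that merely memorize the training data.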

In Section II, the algorithm is extended so that it can cluster higher order N-grams. When such a clustering algorithm is applied to a large training corpus, e.g., the Wall Street Journal (WSJ) corpus, with tens of millions of words, the computational effort required can easily become prohibitive. Therefore, a simple heuristic to speed up the algorithm is developed in Section III. It can then be applied more easily to the WSJ corpus, and the obtained results will be presented in Section IV.

II. EXTENDING THE CLUSTERING ALGORITHM TO N-GRAMS

As shown in [6], there are several ways of extending the algorithm to higher order N-grams. The method we chose uses two clustering functions G_1 and G_2:

    p(w_i | w_{i-N+1}, ..., w_{i-1}) = p_G(G_2(w_i) | G_1(w_{i-N+1}, ..., w_{i-1})) · p_G(w_i | G_2(w_i)).    (2)

G_1 is a function that maps the current context c = (w_{i-N+1}, ..., w_{i-1}) into one of a set of context equivalence classes (or states). Thus, any two contexts which are mapped by G_1 to the same class will have identical probability distributions. G_2, as the G of (1), maps each word to its class.
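Under Eq. (2), the only change from the bigram case is that the conditioning class comes from G_1 applied to the whole (N-1)-word context. A hypothetical unsmoothed sketch, with names and toy counts of our own choosing:

```python
def ngram_class_prob(context, w, G1, G2,
                     state_bigrams, state_counts, words, class_counts):
    """Eq. (2): p(w | context) =
    p(G2(w) | G1(context)) * p(w | G2(w)).
    G1 maps (N-1)-word context tuples to states; G2 maps words to
    classes. All counts are plain unsmoothed relative frequencies."""
    s = G1[tuple(context)]          # state of the whole context
    g = G2[w]                       # class of the predicted word
    p_state = state_bigrams[(s, g)] / state_counts[s]
    p_word = words[w] / class_counts[g]
    return p_state * p_word
```

Collapsing every context to its state is what makes the approach tractable: the transition table is indexed by (state, class) pairs rather than by raw (N-1)-gram contexts.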

Manuscript received December 16, 1994; revised February 28, 1996. The associate editor coordinating the review of this paper and approving it for publication was Dr. Xuedong Huang.

The author is with Forum Technology, DRA Malvern, Malvern, Worc. WR14 3PS, U.K.

Publisher Item Identifier S 1063-6676(96)05074-2.

1063-6676/96$05.00 © 1996 IEEE