Research Article A Research of Speech Emotion Recognition...

Research ArticleA Research of Speech Emotion Recognition Based onDeep Belief Network and SVM

Chenchen Huang Wei Gong Wenlong Fu and Dongyu Feng

Department of Computer Communication University of China Beijing 100024 China

Correspondence should be addressed to Chenchen Huang hcc1990163com

Received 27 May 2014 Revised 21 July 2014 Accepted 21 July 2014 Published 12 August 2014

Academic Editor Stefan Balint

Copyright copy 2014 Chenchen Huang et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

Feature extraction is a very important part in speech emotion recognition and in allusion to feature extraction in speech emotionrecognition problems this paper proposed a new method of feature extraction using DBNs in DNN to extract emotional featuresin speech signal automatically By training a 5 layers depth DBNs to extract speech emotion feature and incorporate multipleconsecutive frames to form a high dimensional feature The features after training in DBNs were the input of nonlinear SVMclassifier and finally speech emotion recognition multiple classifier system was achieved The speech emotion recognition rate ofthe system reached 865 which was 7 higher than the original method

1 Introduction

Speech emotion recognition was a technology that extractemotional feature from speech signals by computer andcontrasts and analyses the characteristic parameters and theemotional change acquired Finally the law of speech andemotion was concluded and speech emotional states werejudged according to the law At present speech emotionrecognition was an emerging crossing field of artificial intelli-gence and artificial psychology besides it was a hot researchtopic of signal processing and pattern recognition [1] Theresearch was widely applied in human-computer interactioninteractive teaching entertainment security fields and so on

Speech emotion processing and recognition system wasgenerally composed of three parts which were speech signalacquisition feature extraction and emotion recognitionSystem framework is shown in Figure 1

In this system the quality of feature extraction directlyaffected the accuracy of speech emotion recognition In theprocess of feature extraction it usually took the whole emo-tion sentence as units for feature extracting and extractioncontents were four aspects of emotion speech which wereseveral acoustic characteristics of time construction ampli-tude construction fundamental frequency construction andformant constructionThen contrast emotion speech with no

emotion sentence from these four aspects acquiring the lawof emotional signal distribution then classify emotion speechaccording to the law [2]

Deep neural network (DNN) has unprecedented successin the field of speech recognition and image recognition [3]however so far no research on deep neural network has beenapplied to speech emotion processingWe found that the deepbelief network (DBN) of DNN in speech emotion processinghas a huge advantage [4] Therefore this paper proposeda method to realize the emotional features automaticallyextracted from the sentence It used DBNs to train a 5-layer-deep network to extract speech emotion features Itincorporates the speech emotion features ofmore consecutiveframes to build a high latitude characteristic and uses SVMclassifier to classify the emotional speechWe compared othertraditional feature extraction methods with this method andconcluded that the speech emotion recognition rate reached865 which was 7 higher than the original method

2 Feature Extraction of Speech Emotion

Emotion can be expressed by speech because speech con-tains the characteristic parameters that can reflect emotioninformation [5] We can extract and observe the change

Hindawi Publishing CorporationMathematical Problems in EngineeringVolume 2014 Article ID 749604 7 pageshttpdxdoiorg1011552014749604

2 Mathematical Problems in Engineering

Signalacquisition

Featureextraction

Emotionrecognition

CorpusThe model of

emotiondescription

VoiceDigitalsignal Feature Results

Figure 1 Speech emotion recognition system block diagram

of characteristic parameters to measure the correspondingspeech emotional changes The key above is extracting char-acteristic parameters of speech emotion from speech signalsThe quality of feature extraction directly affects the accuracyof speech emotion recognition Meanwhile speech signalscontain not only emotional feature information but alsothe speakerrsquos own important information therefore researchon how to extract and which speech emotion characteristicparameters to extract are of great importance [6]

21 Emotion Speech Database Before the speech emotionfeature extraction at first we need to input emotional speechsignal Emotion speech library is the foundation of speechemotion recognition which provides standard speech forspeech emotion recognition

At present there ismuch literature on this research aspect[7] and throughout the world English German Spanish andChinese single language emotion speech databases have beenbuilt A few speech libraries also contain a variety of lan-guages This paper is aimed at using the language of Chineseto recognize speech emotion In order to establish a perfectspeech data sampling library when selecting experimentalsentence we need to follow the following principles

(i) The sentence selected must not contain a particu-lar aspect of emotional tendency ensuring that therecorded statements will not affect the experimenterrsquosjudgment [8]

(ii) The sentence selected has relatively emotional free-dom That is the sentence can express differentemotions not just a single emotion otherwise wewill be unable to compare the emotional speechparameters in the same emotional sentence underdifferent emotional states [9]

According to the above principles and considering allaspects to ensure the accuracy in this paper we selectedBuaa emotional speech database which passed the effec-tiveness measurement instead of recording speech databaseby ourselves This database was recorded by 7 males and 8females and it consisted of 7 kinds of basic emotions whichare sadness anger surprise fear joy hate and calm Eachemotion had 20 sentences referential scripts That is to saythe database consisted of 2100 emotion sentences and all theemotion sentences above were recorded into WAV formatfiles The sampling rate of speech above was 16000Hz alland the quantitative accuracy was 16 bits This paper selected

1200 sentences which contain sadness anger surprise andhappiness four basic emotions for training and recognition

22 Traditional Emotional Feature Extraction Traditionalemotional feature extraction was based on the analysis andcomparison of all kinds of emotion characteristic parametersselecting emotional characteristics with high emotional reso-lution for feature extraction In general traditional emotionalfeature extraction concentrates on the analysis of the emo-tional features in the speech from time construction ampli-tude construction and fundamental frequency constructionand signal feature [10]

221 Time Construction Speech time construction refers tothe emotion speech pronunciation differences in time Whenpeople express different feelings the time construction ofthe speech is different Mainly in two aspects one is thelength of continuous pronunciation time the other is theaverage rate of pronunciationOne is the length of continuouspronunciation time and the other is the average rate ofpronunciation

Zhao Lirsquos research showed that different emotionalpronunciations are different in pronunciation length andpronunciation speed Compared with the length of calmpronunciation time the pronunciation time of joy angerand surprise significantly shortened But comparing withcalm pronunciation time sad pronunciation length is longerCompared with quiet pronunciation rate sad pronunciationrate is slow while joy anger and surprise pronunciation ratesare quick relatively

In conclusion if we extract the time construction charac-teristics parameters of the speech it is easy to distinguish sademotion from other emotional states Of course we also canset a certain time threshold to distinguish joy anger surpriseand speech However it is obvious that using only speech timeconstruction is not enough to recognize speech emotionalstate

222 Amplitude Construction Speech signal amplitude con-struction and speech emotional state also have a direct linkWhen the speaker is angry or happy the volume of speechis generally high When speaker is sad or depressed thevolume of speech is generally low Therefore the analysis ofspeech emotion features of amplitude construction is moremeaningful

Figure 2 is the comparison of emotional speech and calmspeechwhich is shownwith the average amplitude difference

Mathematical Problems in Engineering 3

Happiness

Angry

Surprise

Sadness

0

500

1000

1500

2000

The a

vera

ge am

plitu

de o

f ene

rgy

Emotional typeHappiness Angry Surprise Sadness

Figure 2 The distribution of emotional speech amplitude energy

Sadness 40

Calm 35

Angry 45

Surprise 50

Happiness 90

0

20

40

60

80

100

Pitc

h va

rianc

e (H

z)

Emotional typeCalm HappinessAngry SurpriseSadness

Figure 3 Curves of different emotional variance

Figure 2 shows that the amplitude of joy anger and surprisethree kinds of emotional speech compared with calm voicesignal is larger while the sad speech amplitude is smaller

223 Fundamental Frequency Construction Banziger andScherer [11] proposed that for the same sentence if theemotions expressed were different fundamental frequencycurves were also different besides the mean and variance offundamental frequencywere also differentWhen the speakeris in a state of happiness the fundamental frequency curveof speech generally is bent upwards And when the speakeris in a state of sadness the fundamental frequency curve ofspeech generally is bent downward Figure 3 shows the curvesof different emotional variance

Compared to calm emotional state happiness surpriseand angry speech signal characteristics variation is largerThus analyzing fundamental frequency curve of the samesentence under different emotional states we can contrastand acquire fundamental frequency construction of differentemotion speech

3 The Depth of the Belief Network

Deep neural network stems from artificial neural network[12] Deep neural network is literally a deep neural networkIn 2006 Professor Hinton in the University of Torontopresented a deep belief network (DBN) structure in [13]

Since then deep neural network and deep learning havebecome the most popular hot spot research in artificialintelligence He clarified the effectiveness of unsupervisedlearning and training learning at each layer and pointed outthat each layer can conduct unsupervised train again onthe basis of the results output from previous layer trainingCompared with the traditional neural network deep neuralnetwork has deep structure with multiple nonlinear mappingwhich can complete complex function approximation [14]

Hinton first put forward DBNs in 2006 Since then DBNshave got unprecedented success in areas such as speechrecognition and image recognition Microsoft researcher DrDeng cooperated with Hilton found deep neural networkcan significantly improve the accuracy of speech recognitionhowever so far no research on deep neural network hasbeen applied to speech emotion recognition In this paperresearch found that deep belief networks (DBNs) have agreat advantage in speech emotion recognition thereforewe choose deep belief networks to extract emotional featureautomatically in the speech

A typical deep belief network is a highly complex directedacyclic graph which is formed by a series of restrictedBoltzmannmachine (RBM) stacks TrainingDBNs is realizedby training RBMs layer by layer from bottom to up BecauseRBM can be trained rapidly via layered contrast divergencealgorithm training RBM can avoid a high degree of complex-ity of training DBNs simplifying the process to training eachRBM Numerous studies have demonstrated that deep beliefnetwork can solve low convergence speed and local optimumproblems in traditional backpropagation algorithm trainingmultilayer neural network

31 Restricted Boltzmann Machine DBNs are stacked reg-ularly by restricted Boltzmann machine (RBM) RBM is akind of typical neural network RBM is composed via theconnection of visible layer and hidden layer but there is noconnection between visible layer visible layer and hiddenlayer and hidden layer In Figure 4 training RBM usedunsupervised greedymethod step by stepThat is in trainingthe characteristic value of the visible layer maps to the hiddenlayer then visible layer can be reconstructed through thehidden layer this new layer visible characteristic value mapsto the hidden layer again then acquiring a new hidden layerIts main purpose is to obtain the generative power valueThus the main characteristics of RBM are the activationfeatures of the layer inputted to next layer as training dataand as a consequence the learning speed is fast [15] This is alayer by layer and efficient learning strategy theory and proofof procedure is in [16]

In Figure 4 DBNs are stacked frombottom to up byRBMUsing Gauss-Bernoulli RBM and Bernoulli- Bernoulli RBMto connect the lower layer output is the input for the upperlayer

Figure 5 is the structure diagram of DBNsThe number oflayer and unit is an example and the number of hidden layeris not necessarily the same in actual experiment

In Bernoulli RBM visible and hidden layer units arebinary 119881 isin 0 1

119863 and ℎ isin 0 1119870 119863 and 119870 present

unit numbers of visible and hidden layers In Gaussian RBM


Hidden layer

Visible layer

middot middot middot


Figure 4 RBMmodule

RBM4

RBM3

RBM2

RBM1

Hiddenlayers

layersVisible

)

)

)

)

)

)

)

)

Figure 5 DBNs structure sketch map

visible layer unit is a real number 119881 isin 119877119863 119881 and ℎ joint

probability is expressed as

119875 (V ℎ) =1

119885exp (minus119864 (V ℎ)) (1)

119885 is a normalized constant and 119864(V ℎ) is an energyequation For Bernoulli RBM energy equation is

119864 (V ℎ) = minus

119863

sum

119894=1

119870

sum

119895=1

119908119894119895V119894ℎ119895minus

119863

sum

119894=1

119887119894V119894minus

119870

sum

119895=1

119886119895ℎ119895 (2)

119908119894119895presents the quality of unconnected visible layer nodes

V119894and hidden layer nodes ℎ

119895and 119886 and 119887 are implicit bias of

visible and hidden units For Gaussian RBM energy equationis

119864 (V ℎ) =119863

sum

119894=1

(V119894minus 119887119894)2

2minus

119863

sum

119894=1

119870

sum

119895=1

119908119894119895V119894ℎ119895minus

119870

sum

119895=1

119886119895ℎ119895 (3)

DBNs combined emotional speech signal features of con-tinuous frames forming a high dimensional feature vectorfully describing the correlation between the emotional speechfeatures Meanwhile DBNs use these high dimensionalfeatures to simulate [17] Besides DBNs extracting speechinformation process is similar to brain processing speechusing RBM to extract emotional information layer by layereventually acquiring the most suitable high dimensionalcharacteristics for pattern recognition DBNs can combine

Table 1Hyperparameters and training statistics of the chosenDBN

Number of hidden layers 5Units per layer 50Unsupervised learning rate 0001Supervised learning rate 001Number of unsupervised epochs 50Number of supervised epochs 475Total training time (hours) 136Classification accuracy 0865

well with traditional speech emotion recognition technologyin practical application (eg SVM) improving the accuracyof speech emotion recognition

32 DBNs Model Training We have used Theano to trainthe deep belief networks (DBNs) in this paper Theano is amathematical symbols compilation kit in Python which isextremely a powerful tool in machine learning because itcombines the power of Python and 119862 making it easier toestablish deep learning model

TheDBNs were first pretrained with the training set in anunsupervisedmanner [18]We split the Buaa dataset and used40 of the voice data for training and 60 of the voice datafor testing Then we proceeded to the supervised fine tuningusing the same training set and used the validation set to doearly stopping [19]

We tried approximately 100 different hyperparameterscombinations and selected the model that has the smallesterror rateThe selectedDBNmodel has been shown inTable 1

We fixed the DBN architecture for all experiments andused 5 hidden layers (the first layer is a Gaussian RBM allother layers are Bernoulli RBMs) and 50 units per layer andthe only variable is size of input vector which depends on thecontext window length The hyperparameters for generativepretraining are shown in Table 1The unsupervised layers ranfor 50 epochs with 0001 learning rate Supervised layers ranfor 475 epochs with 001 learning rate

4 Support Vector Machine Classifier

SVM is a kind of machine learning methods based onstatistical theory and structural risk minimization principleIts principle is to map the low dimensional feature vectorto high-dimensional feature vector space so as to solvenonlinear separable problem SVM has been widely used inpattern classification field [20]

The key to solve nonlinear separable problem is to con-struct the optimal separating hyperplane the constructionof the optimal hyperplane eventually translates into thecalculation of optimal weights and bias We set the trainingsample set for (1198831 1198891)(1198832 1198892) sdot sdot sdot (119883119875 119889119901) sdot sdot sdot (119883119901 119889119901) thecost function of minimization weight119882 and slack variable 120576

119901

is

Φ (119882 120576) =1

2119882119879119882 + 119862

119875

sum

119875=1

120576119875 (4)


Conditions for its limitation 119889119901(119882119879119882119875 + 119887) ge 1 minus 120576119875 119901 =

1 2 119901 using to measure the deviation degree of a samplepoints relative to linear separable ideal conditions Due to thesample set we can calculate the dual to know the optimalweight and bias To solve quadratic optimization problemwhen optimal hyperplane in the feature space is constructedthe optimal hyperplane can be defined as the followingfunction

119892 (119909) =

119876

sum

119894=1

119906119894119886lowast

119894119870(119883119883

lowast

119894) + 119887lowast 119894 = 1 2 3 119876 (5)

In the formula above 119870(119883119883lowast

119894) is kernel function 119883lowast

119894

is support vector of nonzero Lagrange multiplier 119886lowast

119894 119876 is

the number of support vectors and 119887lowast is bias parameter For

nonlinear SVM nonlinearmapping kernel functionmappingdata to higher-order feature space optimal hyperplane existsin this space Several kinds of kernel function are appliedto nonlinear SVM such as Gaussian kernel function andpolynomial kernel function This paper takes the followingGaussian kernel function to do the research

119870(119909 119910) = exp (minus1205741003816100381610038161003816119909 minus 119910

1003816100381610038161003816

2) (6)

In this function 120574 is a Gaussian transmission coefficientWhen we use support vector machine (SVM) to solve

the problem of classification there are two solutions one-to-allone-to-one In previous studies we found that the accu-racy of one-to-one way of classification is higher Thereforethis paper chose the one-to-one way of classification to dealwith four kinds of emotion (surprise joy angry and sadness)

ldquoOne-to-onerdquo mode is to construct hyper-plane for everytwo emotions so the number of child classifiers we need tobe trained is 119896lowast (119896minus1)2 [21] In this experiment and duringthe whole process of training the number of SVM classifierwe need is 119862

2

4 namely 6 Each child classifier was trained

by surprise joy anger sadness or any of these two kinds ofemotion namely joy-anger joy-sadness joy-surprise anger-sadness anger-surprise and sadness-amazement

Training a classifier for any two categories when classi-fying an unknown speech emotion each classifier judges itscategory and votes for the corresponding category Finallytake the category withmost votes as the category of unknownemotional Decision stage uses voting method it is possiblethat votes tied formultiple categories leading to the unknownsamples belonging to different categories at the same timetherefore affecting the classification accuracy

SVM classifier defines a label for each speech emotionsignal before training and recognition to indicate speechemotional signalrsquos emotional categories Label type must beset to double In the process of emotion recognition inputfeature vector into all SVMs then the output of each SVMpasses through logic judgment to choose the most probableemotion Finally the emotion with the highest weight (mostvotes) is the emotional state of speech signal The settingof penalty factor 119862 in classifier training and 120574 in kernelfunction can be determined via cross-validation of trainingset One thing to note here is that 119862 and 120574 are fit forrecognition effect of training set but do not necessarily also

fit for test set After repeated testing in the experiments ofspeech emotion recognition in this paper the magnitude ofGaussian transmission coefficient 120574 SVM was set to 10

minus3 andthe magnitude of parameter 119862 is 106 These parameters willchange according to the experiment and the error rate toimprove the accuracy rate of training data classification

DBNs proposed multidimensional feature vector as theinput of SVM Since the SVM does not scale well with largedatasets we subsampled the training set by randomly picking10000 frames For nonlinear separable problem in speechemotion recognition take kernel function tomap input char-acteristics sample points to high-dimensional feature spacemaking the corresponding sample space linear separable Insimple terms it creates a classification hyperplane as decisionsurface making the edge isolation of positive and negativecase maximum Evaluating decision function calculation isstill in the original space making less computational com-plexity of the high dimensional feature space after mapping

Figure 6 is the block diagram for the emotion recognitionbased on SVM system

5 The Experiment and Analysis

This paper selected 1200 sentences which contain sadnessanger surprise and happiness four basic emotions fortraining and recognition

Before inputting the emotional speech signal to themodel we need to preprocess this signal In this paper thespeech signal to be inputted first goes through pre-emphasisand window treatments We selected median filter that hasthe window of the length of 5 to do smoothing processing fordenoised emotional speech signal

This paper used 40 of the voice data for training and60 of the voice data for testing The experimental groupis speech emotion recognition model established by deepbelief network via extracting phonetic characteristics Thecontrol group is established by extracting traditional speechfeature parameters to the speech emotion recognition modelunder the condition of the same emotional speech input Atlast we contrasted and analyzed the experimental data forthe conclusion The process of the experiment is shown inFigure 7

51 The Experimental Group Research of Speech EmotionRecognition Based on Multiple Classifier Models of the DeepBelief Network and SVM This paper proposed a method torealize the emotional features extracted automatically fromthe sentence It used DBNs to train a 5-layer-deep networkto extract speech emotion features In this experiment weput the output of the DBNs hidden layer as a characteristictraining regression model and the finished training charac-teristics as the input of a nonlinear support vector machine(SVM) Eventually we set up a multiple classifier model ofspeech emotion recognition It would also be possible totrain our DBN directly to do classification However our goalis to compare the DBN learned representation with otherrepresentations By using a single classifier we were able tocarry out direct comparisons The experimental results areshown in Table 2


Logical judgment

Result

Eigenvector

SVM(joy-angry)

SVM(joy-sadness)

SVM(joy-surprise)

SVM(angry-sadness)










Figure 6 Diagram of the emotion recognition system based on SVM

Acqu

isitio

n of

sign

al

Pret

reat

men

t DBNs

Classical

SVM

SVM

Comparisons

Figure 7 The process of the experiment

Table 2 The result of recognition rate based on DBNs and SVM

Test TrainAngry Joy Sadness Surprise

Angry 0913 0021 0021 0028Joy 0031 0845 0041 0122Sadness 0029 0020 0883 0061Surprise 0047 0124 0035 0819

As it was shown in Table 1 anger and sadness have ahigher recognition rate than joy and surprise they reached913 and 883 respectively and the overall recognition ratewas 865

52 The Control Group Research of Speech Emotion Recog-nition Based on Extracting the Traditional Emotional Char-acteristic Parameters and SVM In this paper the controlgroup was established by extraction traditional emotionalcharacteristic parameters time construction amplitude con-struction and fundamental frequency construction Afterextracting them we also need to input the emotional char-acteristic parameters into the SVM classification of speechemotion recognition Finally we compared the differenceof experimental group and control group The experimentalresults were shown in Table 2

Table 3 The result of recognition rate based on extraction oftraditional emotional characteristic parameters and SVM



Table 4 Comparison of results of the two methods

Angry Joy Sadness Surprise AverageDBNs 0913 0845 0883 0819 0865Classical 0861 0742 0848 0729 0795

As it was shown in Table 3 overall recognition rate basedon the traditional emotional characteristic parameters of theSVM system was 795

As it was shown in Table 4 and Figure 8 comparedwith the two methods the method using DBNs to extractspeech emotion features improved these four types of (sad-nessjoyangersurprise) emotion recognition rate The aver-age rate was increasing by 7 and the recognition rate of joyhas increased by 10

6 Conclusion

In this paper we proposed a method that used the deepbelief networks (DBNs) one of the deep neural networksto extract the emotional characteristic parameter from emo-tional speech signal automatically We combined deep beliefnetwork and support vector machine (SVM) and proposeda classifier model which is based on deep belief networks(DBNs) and support vector machine (SVM) In the practicaltraining process the model has small complexity and 7higher final recognition rate than traditional artificial extract


Angry Joy Sadness Surprise

DBNs

0913 0845 0883 0819

Classical

0861 0742 0848 0729

0

02

04

06

08

1

Reco

gniti

on ra

te

Comparisons

Figure 8 Comparison of results of the two methods

and this method can extract emotion characteristic parame-ters accurately improving the recognition rate of emotionalspeech recognition obviously But the time cost for trainingDBNs feature extraction model was 136 hours and it waslonger than other feature extraction methods

In future work we will continue to further study speechemotion recognition based on DBNs and further expand thetraining data set Our ultimate aim is to study how to improvethe recognition rate of speech emotion recognition

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the National Key Scienceand Technology Pillar Program of China (the key technologyresearch and system of stage design and dress rehearsal2012BAH38F05 study on the key technology research ofthe aggregation marketing production and broadcasting ofonline music resources 2013BAH66F02 Research of SpeechEmotion Recognition Based on Deep Belief Network andSVM 3132013XNG1442)

References

[1] Z Yongzhao and C Peng ldquoResearch and implementation ofemotional feature extraction and recognition in speech signalrdquoJoural of Jiangsu University vol 26 no 1 pp 72ndash75 2005

[2] L Zhao C Jiang C Zou and Z Wu ldquoStudy on emotionalfeature analysis and recognition in speechrdquo Acta ElectronicaSinica vol 32 no 4 pp 606ndash609 2004

[3] H Lee C Ekanadham and A Y Ng ldquoSparse deep belief netmodel for visual area V2rdquo in Proceedings of the 21st AnnualConference onNeural Information Processing Systems (NIPS rsquo07)MIT Press December 2007

[4] A Krizhevsky I Sutskever andG EHinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo in Proceedings

of the 26th Annual Conference on Neural Information ProcessingSystems (NIPS rsquo12) pp 1097ndash1105 Lake Tahoe Nev USADecember 2012

[5] Z Li ldquoA study on emotional feature analysis and recognitionin speech signalrdquo Journal of China Institute of Communicationsvol 21 no 10 pp 18ndash24 2000

[6] T L Nwe S W Foo and L C de Silva ldquoSpeech emotion recog-nition using hidden Markov modelsrdquo Speech Communicationvol 41 no 4 pp 603ndash623 2003

[7] X Bo ldquoAnalysis of mandarin emotional speech database andstatistical prosodic featuresrdquo in Proceedings of the InterntionalConference on Affective Computing and Intelligent Interaction(ACLL rsquo03) pp 221ndash225 2003

[8] L Zhao X Qian C Zhou and Z Wu ldquoStudy on emotionalfeature derived from speech signalrdquo Journal of Data Acquistionamp Processing vol 15 no 1 pp 120ndash123 2000

[9] G Pengjuan and J Dongmei ldquoResearch on emotional speechrecognition based on pitchrdquo Application Research of Computersvol 24 no 10 pp 101ndash103 2007

[10] PGuoResearch of theMethod of Speech Emotion Feature Extrac-tion and the Emotion Recognition Northwestern PolytechnicalUniversity 2007

[11] T Banziger and K R Scherer ldquoThe role of intonation inemotional expressionsrdquo Speech Communication vol 46 no 3-4pp 252ndash267 2005

[12] R Rui and M Zhenjiang ldquoEmotional speech synthesis basedon PSOLA algorithmrdquo Journal of System Simulation vol 20 pp423ndash426 2008

[13] S Zhijun X Lei X Yangming and W Zheng ldquoOverview ofdeep learningrdquo Application Research of Computers vol 29 no8 pp 2806ndash2810 2012

[14] G E Hinton S Osindero and Y Teh ldquoA fast learning algorithmfor deep belief netsrdquoNeural Computation vol 18 no 7 pp 1527ndash1554 2006

[15] Z Sun L Xue Y Xu and ZWang ldquoOverview of deep learningrdquoApplication Research of Computers vol 29 no 8 pp 2806ndash28102012

[16] Z Chunxia J Nannan and W Guanwei ldquoIntroduction ofrestricted boltzmann machinerdquo China Science and TechnologyPapers Online httpwwwpapereducnreleasepapercontent201301-528

[17] T Shimmura ldquoAnalyzing prosodic components of normalspeech and emotive speechrdquo The Preprint of the AcousticalSociety of Japan pp 3ndash18 1995

[18] X Kai J Lei C Yuqiang and XWei ldquoDeep learning yesterdaytoday and tomorrowrdquo Journal of Computer Research and Devel-opment vol 50 no 9 pp 1799ndash1804 2013

[19] Y Bengio P Lamblin D Popovici and H Larochelle ldquoGreedylayer-wise training of deep networksrdquo in Advances in NeuralInformation Processing Systems 19 pp 153ndash160 MIT Press 2007

[20] Y Kim H Lee and E M Provost ldquoDeep learning for robustfeature generation in audio-visual emotion recognitionrdquo inProceedings of the IEEE International Conference on AcousticsSpeech and Signal Processing (ICASSP 13) Vancouver Canada2013

[21] J Zhu XWu andZ Lv ldquoSpeech emotion recognition algorithmbased on SVMrdquo Computer Systems amp Applications vol 20 no5 pp 87ndash91 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of


Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of


Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of


Mathematical PhysicsAdvances in

Complex AnalysisJournal of


OptimizationJournal of


CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of


Operations ResearchAdvances in

Journal of


Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences


The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Algebra

Discrete Dynamics in Nature and Society



Decision SciencesAdvances in

Discrete MathematicsJournal of


Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of


Signalacquisition

Featureextraction

Emotionrecognition

CorpusThe model of

emotiondescription

VoiceDigitalsignal Feature Results

Figure 1 Speech emotion recognition system block diagram

of characteristic parameters to measure the correspondingspeech emotional changes The key above is extracting char-acteristic parameters of speech emotion from speech signalsThe quality of feature extraction directly affects the accuracyof speech emotion recognition Meanwhile speech signalscontain not only emotional feature information but alsothe speakerrsquos own important information therefore researchon how to extract and which speech emotion characteristicparameters to extract are of great importance [6]

21 Emotion Speech Database Before the speech emotionfeature extraction at first we need to input emotional speechsignal Emotion speech library is the foundation of speechemotion recognition which provides standard speech forspeech emotion recognition

At present there ismuch literature on this research aspect[7] and throughout the world English German Spanish andChinese single language emotion speech databases have beenbuilt A few speech libraries also contain a variety of lan-guages This paper is aimed at using the language of Chineseto recognize speech emotion In order to establish a perfectspeech data sampling library when selecting experimentalsentence we need to follow the following principles

(i) The sentence selected must not contain a particu-lar aspect of emotional tendency ensuring that therecorded statements will not affect the experimenterrsquosjudgment [8]

(ii) The sentence selected has relatively emotional free-dom That is the sentence can express differentemotions not just a single emotion otherwise wewill be unable to compare the emotional speechparameters in the same emotional sentence underdifferent emotional states [9]

According to the above principles and considering allaspects to ensure the accuracy in this paper we selectedBuaa emotional speech database which passed the effec-tiveness measurement instead of recording speech databaseby ourselves This database was recorded by 7 males and 8females and it consisted of 7 kinds of basic emotions whichare sadness anger surprise fear joy hate and calm Eachemotion had 20 sentences referential scripts That is to saythe database consisted of 2100 emotion sentences and all theemotion sentences above were recorded into WAV formatfiles The sampling rate of speech above was 16000Hz alland the quantitative accuracy was 16 bits This paper selected

1200 sentences which contain sadness anger surprise andhappiness four basic emotions for training and recognition

22 Traditional Emotional Feature Extraction Traditionalemotional feature extraction was based on the analysis andcomparison of all kinds of emotion characteristic parametersselecting emotional characteristics with high emotional reso-lution for feature extraction In general traditional emotionalfeature extraction concentrates on the analysis of the emo-tional features in the speech from time construction ampli-tude construction and fundamental frequency constructionand signal feature [10]

221 Time Construction Speech time construction refers tothe emotion speech pronunciation differences in time Whenpeople express different feelings the time construction ofthe speech is different Mainly in two aspects one is thelength of continuous pronunciation time the other is theaverage rate of pronunciationOne is the length of continuouspronunciation time and the other is the average rate ofpronunciation

Zhao Lirsquos research showed that different emotionalpronunciations are different in pronunciation length andpronunciation speed Compared with the length of calmpronunciation time the pronunciation time of joy angerand surprise significantly shortened But comparing withcalm pronunciation time sad pronunciation length is longerCompared with quiet pronunciation rate sad pronunciationrate is slow while joy anger and surprise pronunciation ratesare quick relatively

In conclusion if we extract the time construction charac-teristics parameters of the speech it is easy to distinguish sademotion from other emotional states Of course we also canset a certain time threshold to distinguish joy anger surpriseand speech However it is obvious that using only speech timeconstruction is not enough to recognize speech emotionalstate

222 Amplitude Construction Speech signal amplitude con-struction and speech emotional state also have a direct linkWhen the speaker is angry or happy the volume of speechis generally high When speaker is sad or depressed thevolume of speech is generally low Therefore the analysis ofspeech emotion features of amplitude construction is moremeaningful

Figure 2 is the comparison of emotional speech and calmspeechwhich is shownwith the average amplitude difference


Happiness

Angry

Surprise

Sadness

0

500

1000

1500

2000

The a

vera

ge am

plitu

de o

f ene

rgy



Sadness 40

Calm 35

Angry 45

Surprise 50

Happiness 90

0

20

40

60

80

100

Pitc

h va

rianc

e (H

z)


















Hidden layer

Visible layer



Figure 4 RBMmodule

RBM4

RBM3

RBM2

RBM1

Hiddenlayers

layersVisible

)

)

)

)

)

)

)

)




119875 (V ℎ) =1

119885exp (minus119864 (V ℎ)) (1)


119864 (V ℎ) = minus

119863

sum

119894=1

119870

sum

119895=1

119908119894119895V119894ℎ119895minus

119863

sum

119894=1

119887119894V119894minus

119870

sum

119895=1

119886119895ℎ119895 (2)





119864 (V ℎ) =119863

sum

119894=1

(V119894minus 119887119894)2

2minus

119863

sum

119894=1

119870

sum

119895=1

119908119894119895V119894ℎ119895minus

119870

sum

119895=1

119886119895ℎ119895 (3)












119901

is

Φ (119882 120576) =1

2119882119879119882 + 119862

119875

sum

119875=1

120576119875 (4)




119892 (119909) =

119876

sum

119894=1

119906119894119886lowast

119894119870(119883119883

lowast

119894) + 119887lowast 119894 = 1 2 3 119876 (5)



119894


119894 119876 is



119870(119909 119910) = exp (minus1205741003816100381610038161003816119909 minus 119910

1003816100381610038161003816

2) (6)




2















Logical judgment

Result

Eigenvector

SVM(joy-angry)

SVM(joy-sadness)

SVM(joy-surprise)

SVM(angry-sadness)











Acqu

isitio

n of

sign

al

Pret

reat

men

t DBNs

Classical

SVM

SVM

Comparisons














6 Conclusion




DBNs

0913 0845 0883 0819

Classical

0861 0742 0848 0729

0

02

04

06

08

1

Reco

gniti

on ra

te

Comparisons






Acknowledgments


References






























Volume 2014




Journal of











Journal of


Function Spaces






Algebra










Happiness

Angry

Surprise

Sadness

0

500

1000

1500

2000

The a

vera

ge am

plitu

de o

f ene

rgy



Sadness 40

Calm 35

Angry 45

Surprise 50

Happiness 90

0

20

40

60

80

100

Pitc

h va

rianc

e (H

z)


















Hidden layer

Visible layer



Figure 4 RBMmodule

RBM4

RBM3

RBM2

RBM1

Hiddenlayers

layersVisible

)

)

)

)

)

)

)

)




119875 (V ℎ) =1

119885exp (minus119864 (V ℎ)) (1)


119864 (V ℎ) = minus

119863

sum

119894=1

119870

sum

119895=1

119908119894119895V119894ℎ119895minus

119863

sum

119894=1

119887119894V119894minus

119870

sum

119895=1

119886119895ℎ119895 (2)





119864 (V ℎ) =119863

sum

119894=1

(V119894minus 119887119894)2

2minus

119863

sum

119894=1

119870

sum

119895=1

119908119894119895V119894ℎ119895minus

119870

sum

119895=1

119886119895ℎ119895 (3)












119901

is

Φ (119882 120576) =1

2119882119879119882 + 119862

119875

sum

119875=1

120576119875 (4)




119892 (119909) =

119876

sum

119894=1

119906119894119886lowast

119894119870(119883119883

lowast

119894) + 119887lowast 119894 = 1 2 3 119876 (5)



119894


119894 119876 is



119870(119909 119910) = exp (minus1205741003816100381610038161003816119909 minus 119910

1003816100381610038161003816

2) (6)




2















Logical judgment

Result

Eigenvector

SVM(joy-angry)

SVM(joy-sadness)

SVM(joy-surprise)

SVM(angry-sadness)











Acqu

isitio

n of

sign

al

Pret

reat

men

t DBNs

Classical

SVM

SVM

Comparisons














6 Conclusion




DBNs

0913 0845 0883 0819

Classical

0861 0742 0848 0729

0

02

04

06

08

1

Reco

gniti

on ra

te

Comparisons






Acknowledgments


References






























Volume 2014




Journal of











Journal of


Function Spaces






Algebra










Hidden layer

Visible layer



Figure 4 RBMmodule

RBM4

RBM3

RBM2

RBM1

Hiddenlayers

layersVisible

)

)

)

)

)

)

)

)




119875 (V ℎ) =1

119885exp (minus119864 (V ℎ)) (1)


119864 (V ℎ) = minus

119863

sum

119894=1

119870

sum

119895=1

119908119894119895V119894ℎ119895minus

119863

sum

119894=1

119887119894V119894minus

119870

sum

119895=1

119886119895ℎ119895 (2)





119864 (V ℎ) =119863

sum

119894=1

(V119894minus 119887119894)2

2minus

119863

sum

119894=1

119870

sum

119895=1

119908119894119895V119894ℎ119895minus

119870

sum

119895=1

119886119895ℎ119895 (3)












119901

is

Φ (119882 120576) =1

2119882119879119882 + 119862

119875

sum

119875=1

120576119875 (4)




119892 (119909) =

119876

sum

119894=1

119906119894119886lowast

119894119870(119883119883

lowast

119894) + 119887lowast 119894 = 1 2 3 119876 (5)



119894


119894 119876 is



119870(119909 119910) = exp (minus1205741003816100381610038161003816119909 minus 119910

1003816100381610038161003816

2) (6)




2















Logical judgment

Result

Eigenvector

SVM(joy-angry)

SVM(joy-sadness)

SVM(joy-surprise)

SVM(angry-sadness)











Acqu

isitio

n of

sign

al

Pret

reat

men

t DBNs

Classical

SVM

SVM

Comparisons














6 Conclusion




DBNs

0913 0845 0883 0819

Classical

0861 0742 0848 0729

0

02

04

06

08

1

Reco

gniti

on ra

te

Comparisons






Acknowledgments


References






























Volume 2014




Journal of











Journal of


Function Spaces






Algebra












119892 (119909) =

119876

sum

119894=1

119906119894119886lowast

119894119870(119883119883

lowast

119894) + 119887lowast 119894 = 1 2 3 119876 (5)



119894


119894 119876 is



119870(119909 119910) = exp (minus1205741003816100381610038161003816119909 minus 119910

1003816100381610038161003816

2) (6)




2















Logical judgment

Result

Eigenvector

SVM(joy-angry)

SVM(joy-sadness)

SVM(joy-surprise)

SVM(angry-sadness)











Acqu

isitio

n of

sign

al

Pret

reat

men

t DBNs

Classical

SVM

SVM

Comparisons














6 Conclusion




DBNs

0913 0845 0883 0819

Classical

0861 0742 0848 0729

0

02

04

06

08

1

Reco

gniti

on ra

te

Comparisons






Acknowledgments


References






























Volume 2014




Journal of











Journal of


Function Spaces






Algebra










Logical judgment

Result

Eigenvector

SVM(joy-angry)

SVM(joy-sadness)

SVM(joy-surprise)

SVM(angry-sadness)











Acqu

isitio

n of

sign

al

Pret

reat

men

t DBNs

Classical

SVM

SVM

Comparisons














6 Conclusion




DBNs

0913 0845 0883 0819

Classical

0861 0742 0848 0729

0

02

04

06

08

1

Reco

gniti

on ra

te

Comparisons






Acknowledgments


References






























Volume 2014




Journal of











Journal of


Function Spaces






Algebra











DBNs

0913 0845 0883 0819

Classical

0861 0742 0848 0729

0

02

04

06

08

1

Reco

gniti

on ra

te

Comparisons






Acknowledgments


References






























Volume 2014




Journal of











Journal of


Function Spaces






Algebra
















Volume 2014




Journal of











Journal of


Function Spaces






Algebra









Research Article A Research of Speech Emotion Recognition...

Documents

Transcript of Research Article A Research of Speech Emotion Recognition...