

Proceedings of the 2nd International Conference on Current Trends in Engineering and Management (ICCTEM-2014), 17-19 July 2014, Mysore, Karnataka, India

EMOTIONAL ANALYSIS AND EVALUATION OF KANNADA SPEECH DATABASE

Pallavi J 1, Geethashree A 2, Dr. D J Ravi 3

1 Student, Master of Technology, Dept. of ECE, VVCE, Mysore, Karnataka, India
2 Asst. Professor, Dept. of ECE, VVCE, Mysore, Karnataka, India
3 Professor and HOD, Dept. of ECE, VVCE, Mysore, Karnataka, India

ABSTRACT

Emotion is an affective state of consciousness that involves feeling and plays a significant role in communication. It is therefore necessary to analyse and evaluate a speech database in order to build an effective emotion-recognition system and an efficient man-machine interface. This paper presents and discusses the development of an emotional Kannada speech database, its analysis, and its evaluation using the Mean Opinion Score (MOS), PNN and k-NN.

Keywords: k-Nearest Neighbors (k-NN), Probabilistic Neural Network (PNN), Speech Corpus.

I. INTRODUCTION

Emotion plays an important role in day-to-day interpersonal human interactions, and recent findings suggest that emotion is integral to our rational and intelligent decisions. A successful solution to the emotion-recognition problem would enable a wide range of important applications: correct assessment of the emotional state of an individual could significantly improve the quality of emerging natural-language-based human-computer interfaces [1, 3, 6]. Emotion helps us relate to each other by expressing our feelings and providing feedback.

There have been many studies of emotional speech [3, 4, 7-10], but most of them address English, Hindi and other languages, so these aspects also need to be studied for Kannada speech. Both prosody-related features [13] and spectral features must be investigated for the evaluation of emotion recognition: here, 50-500 LPC coefficients are studied as spectral features, while the mean pitch (F0), intensity, sound pressure and power spectral density (PSD) are studied as prosody-related features. The human capability to recognize emotion from speech was also studied and compared with machine classifiers.

This important aspect of human interaction needs to be considered in the design of human-machine interfaces. Initially, a listening test of the sample sentences was conducted to identify the speaker's emotion from auditory impressions, and the Mean Opinion Score was collected. Speaker-emotion identification on the sample sentences was then performed with a probabilistic neural network (PNN) and k-nearest neighbors (k-NN) using LPC features, and the PRAAT software package was subsequently used to extract the pattern of acoustic parameters for the sample sentences [2].

II. EMOTIONAL DATABASE

Obtaining an emotional corpus is quite difficult in itself. Various methods have been utilized in the past, such as acted speech, speech obtained from movies or television shows, and speech recorded during event recall [2, 5, 6].

The database is composed of four emotions (happy, sad, anger and fear) plus a neutral style, as uttered by two male Kannada actors, and consists of a total of 60 sentences containing a minimum of 3 to a maximum of 7 words. The first step was to record the voice for each word and sentence; all recordings were made in a recording studio at a sampling rate of 44100 Hz with a mono channel. The sentences used for the statistical analysis are listed in Table 1.

Table 1: Sentences used in analysis (English glosses of the Kannada originals)

S2  Long live like the wind.
S3  I am blessed, as I protected the lives of the elders.
S5  I have fought with and experienced so many people like you.
S5  Aravinda is my disciple.
S6  I study during the night.
S7  He must be a Brahmin; there is no doubt about it.
S8  Father, who is that fellow who troubles us?

III. ANALYSIS

Pitch is strongly correlated with the fundamental frequency of the sound and occupies a central place in the study of prosodic attributes, since it is the perceived fundamental frequency [3, 4, 8]. It differs from the actual fundamental frequency due to overtones inherent in the sound.

Figs. 1 to 5 show the pitch and intensity of the different emotions for Sentence 6. Table 2 shows the mean pitch for each emotion, and Fig. 6 shows the variation of mean pitch across emotions: mean pitch is highest in fear and lowest in sadness compared to the other emotions.
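Mean pitch and intensity values of the kind reported in Tables 2 and 3 can be extracted programmatically. The following sketch uses the praat-parselmouth Python bindings to Praat; this tooling choice and the file name are assumptions made for illustration (the paper used the PRAAT package directly).

```python
# Sketch: mean pitch (Hz) and mean intensity (dB) of one utterance via Praat,
# using the praat-parselmouth bindings. The wav file name is hypothetical.
import parselmouth

snd = parselmouth.Sound("S6_fear.wav")      # one emotional utterance

pitch = snd.to_pitch()                      # default Praat pitch analysis
f0 = pitch.selected_array['frequency']      # F0 track; 0 Hz marks unvoiced frames
mean_pitch = f0[f0 > 0].mean()              # average over voiced frames only

intensity = snd.to_intensity()              # intensity contour in dB
mean_intensity = intensity.values.mean()

print(f"mean pitch: {mean_pitch:.2f} Hz, mean intensity: {mean_intensity:.2f} dB")
```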


Figure 1: Pitch and intensity of neutral sentence

Figure 2: Pitch and intensity of emotion (sad)

Figure 3: Pitch and intensity of emotion (fear)

Figure 4: Pitch and intensity of emotion (anger)

Figure 5: Pitch and intensity of emotion (happy)

Table 2: Mean pitch of sentences in different emotions (Hz)

Sent.  Neutral  Sadness  Fear    Anger  Happy
S1     129.12   119.71   209.53  189    140.4
S2     116.95   137.37   198.84  189    135.2
S3     123.33   131.45   195.83  210    176.3
S4     113.37   116.56   164.74  177    162.7
S5     125.55   156.28   226.61  195    172.5
S6     103.04   160.46   202.5   223    153.2
S7     108.97   124.59   192.17  174    127.7
S8     108.87   107.61   165.21  136    110

Table 3 shows the intensity for the different emotions, and Fig. 7 shows the variation of intensity: intensity is highest in anger and lowest in fear.


Figure.6: Mean pitch of 8 sentences in different emotion

Table 3: Intensity of different emotions (dB)

Sent.  Neutral  Sad    Fear   Anger  Happy
S1     85.64    84.94  88.78  90.88  90.39
S2     85.50    79.29  78.29  83.15  84.17
S3     87.33    84.82  87.70  89.17  90.51
S4     83.29    88.01  88.99  91.98  86.93
S5     86.39    86.98  89.16  91.30  90.61
S6     83.22    85.35  88.98  87.28  85.59
S7     88.92    86.48  88.00  92.16  85.74
S8     88.14    87.70  87.26  87.95  85.51

Figure 7: Intensity of different emotions

For analysis, the speech signal is decomposed into a number of frames, which may be voiced or unvoiced. While voiced frames contain prosodic features, unvoiced frames contain excitation features along with the prosodic features, so it is necessary to analyse the unvoiced frames as well; a simple frame-level voicing decision is sketched below.
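The sketch uses short-time energy and zero-crossing rate as a minimal voicing heuristic; the frame length and thresholds are illustrative assumptions, since the paper does not specify its voicing decision.

```python
# Sketch: frame decomposition with a simple voiced/unvoiced decision based on
# short-time energy and zero-crossing rate. Thresholds are illustrative only.
import numpy as np

def unvoiced_percentage(x, frame_len=512, hop=256,
                        energy_thresh=0.01, zcr_thresh=0.25):
    n_frames = 1 + (len(x) - frame_len) // hop
    unvoiced = 0
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        energy = np.mean(frame ** 2)                         # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # zero-crossing rate
        if energy < energy_thresh or zcr > zcr_thresh:       # low energy / noisy
            unvoiced += 1                                    # count as unvoiced
    return 100.0 * unvoiced / n_frames                       # as in Table 4
```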


Table 4 contains the percentage of unvoiced frames per sentence for all emotions, and Fig. 8 shows that the proportion of unvoiced frames is highest in fear and lowest in happy compared to the other emotions. Sound pressure influences the intensity, which in turn affects the power at each formant. The PSD of the different emotions is plotted in Fig. 9 and the sound pressure in Fig. 10. Irrespective of the emotion, the lip radiation for a given sentence or utterance remains the same, while the rate of vocal-fold vibration changes across emotions, altering the spectral tilt, which greatly influences the perceived emotion. This indicates that not only prosodic features but also excitation-source features influence emotion. Fig. 11 shows the vocal-fold variations in different emotions.

Table 4: Percentage of unvoiced frames in different emotions

Sent.  Neutral  Sadness  Fear    Anger   Happy
S1     17.88%   43.08%   54.37%  28.41%  25.73%
S2     31.14%   33.93%   39.02%  19.41%  24.74%
S3     14.86%   28.37%   29.43%  23.17%  27.32%
S4     30.77%   25.65%   43.56%  19.16%  20.15%
S5     34.04%   43.28%   50.00%  37.69%  38.53%
S6     29.44%   27.38%   53.40%  31.10%  30.09%
S7     23.61%   32.16%   41.76%  22.55%  27.25%
S8     25.94%   27.45%   29.13%  32.28%  40.29%

Figure 8: Percentage of unvoiced frames in different emotions

Figure 9: PSD in different emotions
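A PSD curve of the kind plotted in Fig. 9 can be estimated with Welch's method; the sketch below is illustrative (a random signal stands in for the speech samples, and the 44100 Hz rate matches the recording setup of Section II).

```python
# Sketch: power spectral density of an utterance via Welch's method.
import numpy as np
from scipy.signal import welch

fs = 44100                           # sampling rate used for the recordings
x = np.random.randn(fs)              # stand-in for one second of speech samples
f, psd = welch(x, fs=fs, nperseg=1024)
# f: frequency bins in Hz, psd: power spectral density in units^2/Hz
```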


Figure 10: Pressure of sound in different emotions

By analysing individual parameters such as intensity, pitch, the number of unvoiced frames, sound pressure, PSD and vocal-fold influence, it is very difficult to characterize each emotion, and the statistical variance of these values makes the characterization more difficult still. It is therefore necessary to design an envelope representation that captures all of the above characteristics. This can be done using LPC, LSF, MFCC or LFCC; in this work we make use of LPC.

Figure 11: Vocal fold variance in different emotions


Figure 12: Spectrogram of the neutral sentence

Figure 13: Spectrogram of Emotion (sad)


Figure 14: Spectrogram of Emotion (Fear)

Figure 15: Spectrogram of Emotion (Anger)

Figure 16: Spectrogram of Emotion (Happy)

The effects of excitation, which cannot be seen in prosodic analysis, can be seen in spectrogram analysis, which uses non-parametric methods for non-stationary signals.
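A spectrogram of this kind is a standard short-time (non-parametric) analysis; a minimal sketch, again with a stand-in signal:

```python
# Sketch: spectrogram of an utterance (non-parametric short-time analysis).
import numpy as np
from scipy.signal import spectrogram

fs = 44100
x = np.random.randn(2 * fs)                     # stand-in for a 2 s utterance
f, t, Sxx = spectrogram(x, fs=fs, nperseg=1024, noverlap=512)
log_spec = 10 * np.log10(Sxx + 1e-12)           # dB scale, as usually displayed
```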

IV. FEATURE EXTRACTION

The performance of an emotion classifier relies heavily on the quality of the speech data. LPC is a powerful speech-signal analysis technique: it determines the coefficients of a forward linear predictor by minimizing the prediction error in the least-squares sense. It has applications in filter design and speech coding, since LPC provides a good approximation of the vocal-tract spectral envelope. LPC finds the coefficients of a pth-order linear predictor (an FIR filter) that predicts the current value of the real-valued time series x from past samples, as sketched below.
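The coefficient computation can be sketched with the autocorrelation method and the Levinson-Durbin recursion, one standard way of solving the least-squares problem described above; this is an illustrative implementation, not the authors' code.

```python
# Sketch: p-th order LPC via the autocorrelation method and Levinson-Durbin.
import numpy as np

def lpc(x, p):
    """Return a = [1, a(2), ..., a(p+1)] and the prediction-error variance g."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # biased autocorrelation r[0..p]
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)]) / n
    a = np.zeros(p + 1)
    a[0] = 1.0
    g = r[0] + 1e-12                 # tiny floor keeps the recursion stable
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / g  # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]   # update a[1..i]
        g *= 1.0 - k * k                                 # update error variance
    return a, g
```

Here a and g correspond to the prediction polynomial and prediction-error variance described in the next paragraph.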

Figure 17: Block diagram of LPC


Here p is the order of the prediction-filter polynomial, a = [1, a(2), ..., a(p+1)]. If p is unspecified, lpc uses the default p = length(x) - 1. If x is a matrix containing a separate signal in each column, lpc returns a model estimate for each column in the rows of the output matrix, together with a column vector of prediction-error variances g. The order p must be less than or equal to the length of x.

LPC analyses the speech signal by estimating the formants, removing their effect from the signal, and estimating the intensity and frequency of the remaining buzz. The removal process is called inverse filtering, and the remaining signal is called the residue. The excitation signal obtained from LPC analysis is viewed mostly as an error signal and contains higher-order relations: the strength of excitation, the characteristics of the glottal volume-velocity waveform, the shape of the glottal pulse, and the variance of the vocal folds.
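Given the predictor polynomial, the residue described above is obtained by inverse filtering the frame with A(z). A minimal sketch, reusing the lpc() helper from the earlier sketch (the frame and order are illustrative):

```python
# Sketch: inverse filtering a frame with its LPC polynomial to obtain the
# residue (prediction error / excitation). Uses lpc() from the earlier sketch.
import numpy as np
from scipy.signal import lfilter

frame = np.random.randn(512)          # stand-in for one speech frame
a, g = lpc(frame, 12)                 # prediction polynomial [1, a2, ..., a13]
residue = lfilter(a, [1.0], frame)    # FIR filtering by A(z) gives the residue
```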

V. EVALUATION

Evaluation is carried out using two methods:

Evaluation by listener: A perception test is conducted and the Mean Opinion Score is collected. The main objective of the perception test is to validate the recorded voice for the recognition of emotion. The test involved 25 people from various backgrounds. Sentences were played to the listeners in random order, and the listeners were asked to identify the emotion expressed in each utterance by choosing from a list of the 4 emotions plus neutral. The MOS of the test was then calculated.
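A confusion matrix such as Table 5 can be tallied directly from the listeners' forced-choice responses; the sketch below is illustrative, and the trial data shown are hypothetical.

```python
# Sketch: building a perception-test confusion matrix from (true, chosen) pairs.
# Rows are the true emotions, columns the listeners' choices, in percent per row.
import numpy as np

EMOTIONS = ["neutral", "sadness", "fear", "anger", "happy"]
IDX = {e: i for i, e in enumerate(EMOTIONS)}

def confusion_matrix(trials):
    """trials: iterable of (true_emotion, chosen_emotion) pairs."""
    counts = np.zeros((len(EMOTIONS), len(EMOTIONS)))
    for true, chosen in trials:
        counts[IDX[true], IDX[chosen]] += 1
    rows = np.maximum(counts.sum(axis=1, keepdims=True), 1)  # avoid divide by 0
    return 100.0 * counts / rows

# hypothetical listener responses
trials = [("anger", "anger"), ("fear", "sadness"), ("fear", "fear")]
print(confusion_matrix(trials).round(1))
```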

Evaluation by classifier

Probabilistic neural network (PNN): A PNN is closely related to the Parzen-window probability density function (PDF) estimator. It consists of several sub-networks, each of which is a Parzen-window PDF estimator for one of the classes. The input nodes take the set of measurements; the second layer consists of Gaussian functions centred on the given data points; the third layer averages the outputs of the second layer for each class; and the fourth layer performs a vote, selecting the largest value, whose associated class label is then returned.

Figure 18: PNN classifier


In general, a PNN for M classes is defined as

$$y_j(\mathbf{x}) = \frac{1}{n_j} \sum_{i=1}^{n_j} \exp\!\left( -\frac{\lVert \mathbf{x}_{j,i} - \mathbf{x} \rVert^{2}}{2\sigma^{2}} \right) \qquad (1)$$

where $n_j$ denotes the number of data points in class $j$ and $\sigma$ is the smoothing parameter of the Gaussian kernel. The PNN assigns $\mathbf{x}$ to class $k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \in \{1, \ldots, M\}$; $\lVert \mathbf{x}_{j,i} - \mathbf{x} \rVert^{2}$ is calculated as a sum of squares.
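A direct implementation of eq. (1) is short; the sketch below assumes a single smoothing parameter sigma for all kernels, a value the paper does not state.

```python
# Sketch: PNN classification per eq. (1). Each class j averages Gaussian
# kernels centred on its training points; x goes to the class with largest y_j.
import numpy as np

def pnn_classify(x, train, sigma=1.0):
    """train: dict mapping class label -> array of shape (n_j, d)."""
    scores = {}
    for label, pts in train.items():
        d2 = np.sum((pts - x) ** 2, axis=1)                     # ||x_{j,i} - x||^2
        scores[label] = np.mean(np.exp(-d2 / (2 * sigma ** 2)))  # y_j(x)
    return max(scores, key=scores.get)        # class k with the largest y_k(x)
```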

k-Nearest Neighbors (k-NN): In pattern recognition, the k-nearest neighbors algorithm is a non-parametric method used for classification; the output depends on the value of k. In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.

Figure 19: Block diagram of emotion recognition

In k-NN regression, the output is the property value for the object: the average of the values of its k nearest neighbors. k-NN is a type of instance-based, or lazy, learning, where the function is only approximated locally and all computation is deferred until classification; it is among the simplest of all machine-learning algorithms.

For classification it can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the vote than more distant ones. For example, a common weighting scheme gives each neighbor a weight of 1/d, where d is the distance to the neighbor (see the sketch below).

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the property value (for k-NN regression) is known. This set can be thought of as the training set for the algorithm, though no explicit training step is required.
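A minimal k-NN classifier with the optional 1/d weighting described above; this is illustrative code, not the authors' implementation.

```python
# Sketch: k-NN classification with optional 1/d distance weighting.
import numpy as np
from collections import defaultdict

def knn_classify(x, X_train, y_train, k=5, weighted=False):
    d = np.linalg.norm(X_train - x, axis=1)     # Euclidean distances to x
    nearest = np.argsort(d)[:k]                 # indices of the k nearest points
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)            # majority / weighted vote
```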

VI. RESULTS AND DISCUSSION

EVALUATION OF EMOTION

Evaluation by people: The confusion matrix created after calculating the MOS is shown in Table 5. The most recognized emotion was anger (91%), while the least recognized was fear (70%). The table also shows that fear is the most confusable emotion, being frequently mistaken for sadness. The average emotion-recognition rate was 81%, and the order of recognition is anger > neutral > sadness > happy > fear.


Table 5: Confusion matrix of the perception test

Category  Neutral  Sadness  Fear  Anger  Happy
Neutral   89%      2%       1%    6%     2%
Sadness   4%       78%      11%   4%     3%
Fear      3%       18%      70%   7%     2%
Anger     5%       1%       1%    91%    2%
Happy     10%      1%       1%    11%    77%

Evaluation by classifiers: The LPC coefficients are fed as input to both algorithms for the classification of emotions. The results obtained by the two methods are almost the same: as the number of coefficients and the value of k increase, the accuracy in detecting emotions such as sadness and fear increases, but the ambiguity in detecting the other emotions (neutral, happy, anger) also increases. As the number of coefficients and k decrease, the accuracy in detecting neutral, happy and anger increases, while ambiguity remains between sad and fear.

Table 6: Confusion matrix of evaluation of emotions by k-NN and PNN

LPC = 50, k = 1
Category  Neutral  Sadness  Fear  Anger  Happy
Neutral   70%      2%       5%    3%     20%
Sadness   30%      11%      6%    30%    23%
Fear      35%      10%      5%    25%    25%
Anger     12%      5%       8%    65%    10%
Happy     5%       2%       5%    20%    68%

LPC = 500, k = 5
Category  Neutral  Sadness  Fear  Anger  Happy
Neutral   20%      2%       8%    30%    40%
Sadness   6%       69%      20%   5%     0%
Fear      2%       11%      68%   19%    0%
Anger     30%      5%       8%    22%    35%
Happy     20%      25%      5%    30%    20%

VII. CONCLUSION

In this paper, the prosodic and excitation features of Kannada speech have been analysed from spoken sentences for the important categories of emotion. It has been observed that the prosodic features (F0, A0, D), along with the excitation parameters (PSD, pressure and vocal-fold variance), play a significant role in the expression of emotion. The database created to express the emotions has been evaluated; excitation parameters, along with the prosodic parameters, were used to train the PNN and k-NN classifiers. The results show an ambiguity between detecting emotions such as neutral, anger and happy versus sad and fear as the number of LPC coefficients and the value of k vary. This work can be enhanced using MFCC, LFCC and PFCC, and further studies should be conducted on a database created from natural conversations.


REFERENCES

[1] Takashi & Norman D. Cook, "Identifying Emotion in Speech Prosody Using Acoustical Cues of Harmony", INTERSPEECH, ISCA, 2004.
[2] Paul Boersma and David Weenink (2009, November), "Praat: doing phonetics by computer" [Online]. URL: http://www.fon.hum.uva.nl/praat/.
[3] Sendlmeier, W. F., Kienast, M. and Paeschke, A., "F0 Contours in Emotional Speech", Technische Universität Berlin, Proc. ICPhS, 1999.
[4] Mozziconacci, S. J. L. and Hermes, D. J., "Role of Intonational Patterns in Conveying Emotion in Speech", Proc. ICPhS, 1999.
[5] Kwon, O. W., Chan, K. L., Hao, J., et al., "Emotion Recognition by Speech Signals", Eurospeech, Geneva, Switzerland, 2003.
[6] Rong, J., Li, G. and Chen, Y.-P. P., "Acoustic Feature Selection for Automatic Emotion Recognition from Speech", Journal of Information Processing and Management, 2009.
[7] D. J. Ravi and Sudarshan Patilkulkarni, "Kannada Text to Speech Synthesis Systems: Emotion Analysis", International Conference on Natural Language Processing (ICON-2009).
[8] Sushma Bahuguna and Y. P. Raiwani, "A Study of Acoustic Features Pattern of Emotion Expression for Hindi Speech", International Journal of Computer Engineering & Technology (IJCET), 2010.
[9] J. Přibil and A. Přibilová, "An Experiment with Evaluation of Emotional Speech Conversion by Spectrograms", Institute of Photonics and Electronics, Academy of Sciences CR, Prague, Czech Republic.
[10] Slobodan T. Jovičić, Zorka Kašić, Miodrag Đorđević and Mirjana Rajković, "Serbian Emotional Speech Database: Design, Processing and Evaluation", SPECOM 2004: 9th Conference on Speech and Computer, St. Petersburg, Russia, 20-22 September 2004.
[11] Shashidhar G. Koolagudi and Sreenivasa Rao Krothapalli, "Two Stage Emotion Recognition Based on Speaking Rate", Springer Science+Business Media, 2010.
[12] Shashidhar G. Koolagudi and K. Sreenivasa Rao, "Emotion Recognition from Speech: A Review", Springer Science+Business Media, 2012.
[13] Syed Abbas Ali, Sitwat Zehar, Mohsin Khan and Faisal Wahab, "Development and Analysis of Speech Emotion Corpus Using Prosodic Features for Cross Linguistics", International Journal of Scientific & Engineering Research, vol. 4, issue 1, January 2013, ISSN 2229-5518.