REport

Investigation Of Human Emotions Using Speech Signals

Nurul Aida Amira Bt Johari Dr Hariharan Muthusamy Prof Dr Sazali B Yaacob

Intelligent Signal Processing GroupAcoustic ApllicationsUniversiti Malaysia Perlis

Abstract-Human social communication depends largely on exchange of non-verbal signals including non-lexical expression of emotions in speech Recent studies on emotion detection mainly concentrate on visual modalities including facial expression muscles movement action units and body movement Classification of emotions by speech is one of research fields for emotional human computer interaction or affective computing This study will propose the use of linear and nonlinear parameters to distinguished the Emotion state and its correlation to the human stress level Research will proposed new algorithm using linear and nonlinear The new dynamic methods of Lyapunov and Correlation Dimension which precisely relate to the approximation emotions state criteria value

CHAPTER 1Introduction In this research work it is proposed to develop an intelligent emotion

classification system based on the acoustic analysis of speech and artificial intelligent techniques When appropriately process speech signal can potential to facilitate the understanding of the emotion state information be convey in communication and speech recognitive process

This study objectives are 1) To study and investigate the recorded speech samples from Speech Under Simulated and Actual Stress (SUSAS) database 2)To investigate and propose suitable signal processing algorithms for extracting the salient features of speech signals which are recorded under different emotional states of a human3) To investigate and propose suitable classification algorithm for catogerizing different emotional state of a human 4)To develope user interactive intelligent emotion classification system

Emotions are the backbone of human interactions and are closely related to rational thinking perception cognition and decision making Emotional cues can be analyzed from speech facial expressions and gestures In our case we prefer to focus in the study of speech cues Since speech is the primary medium for interaction speech based emotional studies are more significant Emotion in speech do not alter the linguistic content of the speech but changes its effectivenes[22]

Recently many application of an automatic emotion recognition systems had been explored such applications are HumanndashComputer Interfaces (HCIs) humanoid robotics text-to-speech synthesis sistems forensic lie detection interactive voice response systems and etc [22]

Althou speech analysis is unintrusive this do not mean the anlysis as simple at forehand for the emotion to be distinguished however Emotion recognition is a complex pattern recognition problem which relate cognitive and neural approach[22]

In this modern world we are surround by all kinds of signal in various forms Some signals are necessary (speech) and some are pleasant (music) While many are unnecessay and unwanted in a given situation In engineering signal are carriers of information both useful and unwanted The distinction of useful and unwanted information is always subjective as well as objectiveUsing DSP approach it is more possible to convert an inexpensive personal computer into a powerful signal processor Some important DSP advantages are system using DSP can be developed using software running on general purpose computer that the DSP is convenient to develope and test and software is portable DSP operation based solely on addition and multiplications which cause extremely stable processing capability DSP operation also can be modified in real time with often by little programming changing and DSP Because of it advantages DSP is becoming the first choise in many technologies and application such consumer electronics communications wireless telephones ad medical imaging

Roughly speaking for a given learning task with a given finite amount of training data the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set and the ldquocapacityrdquo of the machine that is the ability of the machine to learn any training set without error[d] The interpretation of emotions from variations of physiological expression seem have not been fully achieved using linear measures Therefore linear measures such pitch energy fundamental frequency formant lpcc speaking rate frequency spectral should should not only be concentrated for the analyse of speech emotions to discriminate the physiological expression Nevertheles linear measures circumstances were the main motivation to investigate the application of nonlinear mathematics to the emotions using speechConsequently nonlinear dynamical analysis has emerged a novel method for the study of complex systems during the past few decades Lyapunov exponent Hurst exponent fractal dimension information dimension box dimension and correlation dimension are some of the methods by which the complexity of a system or a signal could be quantified

A significant step forward was made by Grassberger and Procaccia [f] who showed how the so called correlation dimension could be used to estimate and bound the fractal dimension of the strange attractor associated with the nonlinear time signal at hand Specifically in [g] the application of correlation dimension to determine emotion through speech due to the vocal tract nonlinear characteristic was described Correlation dimension is by the most popular due to its computational efficiency compared to other fractal dimensions such as the information dimension capacity dimension and pulse dimension [h]

The novel technique which incorporates Lyapunovrsquos direct method is general flexible and can be easily adapted to analyze the behavior of many types of nonlinear iterative signal processing algorithms[ On the Use of Lyapunov Criteria to Analyze the Convergence of Blind Deconvolution Algorithms]

Speech Speech is an acoustic waveform that conveys information from a speaker to listener Almost all speech processing applications currently fall in three broad categories speech recognition speech synthesis and speech coding Speech recognition may be concerned with the identification of speaker Isolated word recognition algorithms attemp to identify individual words such as in automated telephone services Automatic speech recognition systems attempt to recognize continuous spoken language possibly to convert into text within a word precessor These systems often corporate grammatical cues to increase their accuracy Speaker identification is mostly used in security applications as a personrsquos voiced is much like a ldquofingerprintrdquo [j] The elementary properties of the signal shortime discretye fourier transorm and use the spectrogram to estimate the properties of speech waveforms

Speech Production

Figure 1 The Human Speech Production System

Speech consist of acoustic pressure waves created by the voluntary movements of anatomical structures in the human speech production system shown in Figure 1 A the diaphragm forces air through the system these wave form can be categorize into voices and unvoiced speech and shape of wide variety of waveform The waveforms can be broadly categorized into voiced and unvoiced speech Voiced sounds vowel or example are produce by forcing air through the larynx with the tension of the vocal cords adjusted so that they vibrate in a relaxed oscillation This produces quasi-periodic pulses or slow pulses of air which are acoustically filtered as they propagate through the vocal tract The shape of the cavities that comprise the vocal tract known as the area function determines the natural frequencies or formants which are emphesized in the speech waveform The period excitation known as the pitch period is generally small with respect to the rate at which the vocal tract changes shape Therefore a segment of voicedspeech covering several pitch periods will appear somewhat periodic Average values for the pitch period are around 8 ms for male speakers and 4 ms for female speakers In contrast unvoiced speech has more of a noise-like quality Unvoiced sounds are usually much smaller in amplitude and oscillate much faster than voiced speech These sounds are generally produced by turbulence as air is forced through a constriction at some point in the vocal tract For example an h sound comes from a constriction at the vocal cords and an f is generated by a constriction at the lips[j]

An illustrative example of voiced and unvoiced sounds contained in the word ldquoeraserdquo are shown in Figure 2 The original utterance is shown in (21) The voiced segment in (22) is a time magnification of the ldquoardquo portion of the word Notice the highly periodic nature of this segment The fundamental period of this waveform which is about 85 ms here is called the pitch period The unvoiced segment in (23) comes from the ldquosrdquo sound at the end of the word This waveform is much more noise-like than the voiced segment and is much smaller in magnitude

EmotionsEmotions were affective states which can be experienced and have arousing and

motivational properties There are many emotions that can be categorized in a specific mental state conditions Several can be seem perform as positive and few other reflects a negatives of the individual conditions

It is difficult to precisely measure the role of social experience on the organization of developing emotion recognition systems (Pollak in press) Within seconds of post-natal life the human infant experiences a wealth of emotional input sounds such as coos laughter tears of joy sensations of touch hugs restraints soothing and jarring movements After a few days let alone after months or years it becomes a significant challenge to quantify an individualrsquos emotional experiences Almost instantly the

maturation of the brain has been confounded by experiences with emotions This explain the role and existences of emotions and mental states and their observables expressions[c]

Most research related to automated of expression is based on small sets of basic emotions such as joy sadness fear anger disgust and surprise Based on Darwinian theory on Darwinrsquos book in 1872 the implication of this theory is that if there is indeed a small number of a basic or fundamental emotions where each correspond to particular evolve adaptive response[a] However there are also several more emotions state can be notify such panic shy disappointment frust and many others

It is well known that the performance of the speech recognition algorithms is greatly influenced by the environmental conditions in which speech is produced The performance of the recognition influenced by several factor include variation communication channel background noise common place interfere with stress task or activities[1] It is still limited research done on optimize the emotion recognition The features that make their environments a typical serve as approximations of how environmental variations in experience may affect development of emotion states[c]

Conceptual mental state and emotionThe relationship between expression may have several uses for automatic recognition and syhthesis In early inspection it can be useful for continuous tracing expressions and assuming gradual changes over time However there are several approaches to conceptualise emotions and distinguishing them[b] The majority of studies in the field of the speaker stress analysis have concentrate on pitch with several considering spectral features derived from a linear models of the speech production which assume that airflow propagates in the vocal tract as a plane wave[15] However according the studies by Teager the true source of sound production is actually the vortex-flow interactions which are non-linear[15]It is believed that that changes in vocal system physiology induced by stressfull conditions such as muscles tension will affect the vortex flow interaction patterns in the vocal tract Therefore nonlinear speech features are necessary to classify stressed speech from neutral[15] There were two type of feature extraction using linear which always taken for the low level descriptive features and non-linear features extraction to extract most useful information content of the speech to distinguish the emotions and stress Using the linear feature extractions the low level descriptives features commonly used are pitch formant fundamental frequency mean maximum standard deviation acceleration and speeds were the type to be as the parameters to be select as a variables to distinguished the different of human emotion state The value extracted will be the boundry between each state Whereas nonlinear extraction in which was severely been used in speech have been eagerly been investigate by recent researcher nowadays The fractal dimension feature is one of the useful method to analysis the emotion state in speech signal[a]

Description of Speech Emotion Database

J Hansen at the University of Colorado Boulder has constructed database SUSAS (Speech Under Simulated and Actual Stress) The database contains voice from 32 speakers with ages ranging from 22 to 76years old Two domains of SUSAS contain the similated stress from ldquotalking stylesrdquo and actual stress from amusement park roller coster and speakers of four military helicopter pilots which were recorded during the flight were used for the evaluation Words from a vocabulary of 35 aircraft communication words make up the database[e] Among the best result from the previous papers of emotion recognition by speech signals the best performance is achieved on the databases containing acted prototypical emotions While the database containing the most spontaneous and naturalistics emotions are in turn the most challengeable to label Because database containing of long pause with a high level and annotation [11]

Segmentation in words level During segmentation of the signal at word level certain decision is made to resolve some ambiguity from the signals

Windowing

Windowing is useful operator in order to eliminate the sparks from signal In our study

the speech signal is separated for speech processing by Hamming window and its

window length is ms

(2-2)

Where speech signal and is windowing operator

Spectrogram

As previously stated the short-time DTFT is a collection of DTFTs that differ by the position of the truncating window This may be visualized as an image called a spectrogram A spectrogram shows how the spectral characteristics of the signal evolve with time A spectrogram is created by placing the DTFTs vertically in an image allocating a different column for each time segment The convention is usually such that frequency increases from bottom to top and time increases from left to right The pixel value at each point in the image is proportional to the magnitude (or squared magnitude) of the spectrum at a certain frequency at some point in time A spectrogram may also use a ldquopseudo-colorrdquo mapping which uses a variety of colors to indicate the magnitude of the frequency content as shown in Figure 4

Emotions and Stress Classification Features

Extracting the speech information for human emotion state shall be done by taking attention of measuring for the linear and nonlinear parameters of the speech signals Using nonlinear Fractal analysis fractal features can be extracted using several methods As an effective tool for emotion state recognition for the nonlinear characterization of nonlinear signals deterministic chaos play an important role Through the dynamic perspective the straightforward way to estimate emotion state

Auditory Feature Extraction

Feature processing frontndashend for extracting the feature set is an important stage in any speech recognition system In the recent year the optimum feature set is still not yet decided inspite of extensive research There were many type of features which are derived differently and have good impact on the recognition rate This work present one more successful technique to extract the feature set from a speech signal which can be used in emotion recognition system

Wavelet Packet Transform

In the literature there have been various reported studies but there is still significant research to be done investigation on wavelet packet transform for speech processing applications A generalization type of the discrete wavelet transform (DWT) called as WP analysis enables subband analysis without the constraint of dyadic decomposition Basically the discrete WP transform performs an adaptive decomposition in frequency axis This particular discrimination may be doned with optimization criterions (L Brechet MF Lucas at all 2007)

The wavelet transform which provides good resolution both in time and frequency is most suitable tool to analyze non-stationary signals such as speech signals Moreover the power of the wavelet transform in analyzing speech strategies of Cochlea is the fact that the cochlea seems to be behaving as parallel with the wavelet transform filter banks

The wavelet theory guarantees a unified framework for various signal processing applications such as signal and image denoising compression analysis of non-stationary signal etc In speech processing applications the wavelet transform has been intended to improve the speech enhancement quality of classical methods The suggested method in this work is tested on recorded noisy speech from real environments

WPs were first investigated by Coifman and Meyer as a orthogonal bases for L2(R) Realization of a desired signal with a best basis selection method involves the introduction of an adequate cost function which provides energy localization to a decrising operation (RRCoifman amp MV Wickerhauser 1992) The cost function selection is directly related to the fixed structure of the application Consequently if signal compression identification or classifications are the interests as an application entropy may reveal desired basis functions Then the statistical analysis of coefficients taking from these basis functions may be used indicating the original signal Therefore the WP analysis is effective to the signal localization in time and frequency

Phase space plots for various mental states

httpwwwncbinlmnihgovpmcarticlesPMC400247figureF1

NormalizationVariance normalization is applied to better cope with channel characteristics Cepstral Mean Substraction (CMS) also used to characterize precisely into each specified channel[11]

Test runFor all databases test-runs are carried out in leave-One-Speaker-Out In(LOSO) or Leave-One-Speakers-Group-Out (LOSGO) manner to face speaker independent In case of 10 or less speakers on one corpus we apply the LOSO strategy

Statistics

The data of linear and nonlinear measurements were previously subjected to repeated measures analysis of variance (ANOVA) with two within-factors emotions type (angry Loud Lombard Neutral) and stress level (medium vs high) Based on this separation two factors were constructed Speed rate Slope gradient spectral power and phase space were calculated In all ANOVAs GreenhousendashGeisser epsilons ( ) were used for non-sphericity correction when necessary To assess the relationship between emotions type and stress level Pearson product correlations between Speed rate Slope gradient spectral measures and phase space were computed at individual recording sites separately for the two different experimental conditions(linear and nonlinear) For statistical testing of correlation differences between the two conditions (emotion state and stress level) the correlation coefficients were Fischerrsquos Z-transformed differences in Z-values were assessed with paired two-sided t-tests across all emotions state

Classifier to classify the emotions

The text independent pairwise method where the 35 words of commonly used in air craft communication are uttered two times The signal been extracted using parameters entropy and energy coefficient through subbands There were twenty subband applied for the Mel Scale based wavelet Packet Decomposition and 19 subband filterbanks of gammatone ERB based wavelet Packet DecompositionThe output of the filterbank were cofficient relate to the relatively parameters Two classifier were choosed for the classification There were Linear Discriminant Analysis and K nearest Neighbour classifiers The two different classifier had given different accuracy to perform the related emotions investigated The four emotions investigated were Natural Angry Lombard and Loud

CHAPTER 2LITERATURE REVIEW

Literature Review1 The study by Guojun et al [1] (1996) Promote the three new features which are the TEO-decomposed FM Variation (TEO-FM-Var) Normalized TEO Autocorrelation Envelope Area (TEO-Auto-Env) and TEO based Pitch (TEO-Pitch) The stress classification feature using TEO-FM-Var suggested features represent the fine excitation variations due to the effect of modulation The raw input speech been filtered through a Gabor bandpass filter (BPF) Second TEO-Auto-Env is passing raw input speech through a filterbank consisting 4 bandpass filters (BPF) Next TEO-Pitch is direct estimate of the pitch itself for representing frame-to-frame excitation variations The research using following subset of SUSAS words rdquofreezerdquo rdquohelprdquo rdquomarkrdquo rdquonavrdquo rdquoohrdquo and ldquozerordquo Angry loud and Lombard styles were used for simulated speech Baseline 5-state HMM-based stress classifier with continuous Gaussian mixture distributions was employed for the evaluations A round robin was employed for training and scoring Result using SUSAS showed the TEO-Pitch perform more consistently with overall stress classification rates of (Pitch m = 5756 =2340) vs (TEO-Pitch m = 863 6 = 783) 2 2003 research by Wook Kwon Kwokleung Chan et al [5] on Emotion Recognition by Speech Signals had using pitch log energy formant mel-band energies and mel-frequency cepstral coefficient (MFCCs) and velocity acceleration of pitch and MFCCs as feature streams Extracted feature were analyze using quadratic discriminant analysis (QDA) and support vector machine (SVM) The cross-validation ROC area then plotted included from the forward selection (which feature to be added) and backward elimination The group feature selection features divided into 13 groups showed the pitch

and energy more essential in distinguishing stress and neutral speech Using speaking style classification by varying treshold with different control detection the detection rate of neutral and stressed utterances 90 and 926 respectively Using GSVM as a classifier showed average accuracy 671 Speaking style was also modeled by 5-state HMM the classifier showed 963 average occuracy HMM detection rate also better than SVM classifier 3 SCasale et Al [6] (1996) made emotion classification using the architecture of a distributed Speech Recognition System (DSR) Using WEKA (Waikato Environment for Knowledge Analysis) software most significant parameter for the classification of the emotional states were selected Two corpora used EMO-DB and SUSAS contain semantic corpora made of sentences and single word respectively Best performance achieved using a Support Vector Machine (SVM) trained with Squential Minimal Optmization (SMO) algorithm after normalizing and discretizing the input statistical parameters Result yield using EMO-DB was 92 n the SUSAS system yield extremely accuracy over 92 and 100 in some cases4 Bogdan Vlasenco Bjorn Schuller et al [7] (2009) investigate benefits of integration of information within turn-level feature space The frame-level analysis used GMM classification and 39 MFCC and energy feature with Cepstral Mean Substraction (CMS) In subsequent step the output scores are fed forward into 14k large-feature space turn level SVM emotion recognition engine Variety of Low-Level-Discriptors (LLD) and functionals used covering prosodic speech quality and articulatory aspects Result emphasized the benefits of feature integration on diverse time scales Provided were result each single approach and the fussion 899 accuracy for leave-one-speaker-out (LOSO) evaluation for EMODB and 838 for the 10-fold stratified cross validation (SCV) decided by SUSAS 5 2009 Allen N [8] new method used to extract characteristics of features from speech magnitude spectograms First approach the spectograms sub-divided into ERB frequency bands and the average energy was calculated and for the second approach the spectrograms passed through an optimal feature selection procedure base on mutual information criteria Proposed method test using three classes of stress single vowels words and sentences from SUSAS and using ORI with angry happy anxious dysphoric and neutral emotional classes Base on Gaussian mixture model results show correct classification of 40-81 for different SUSAS data sets and 40-534 for the ORI data base 6 In 1996 Hansen and Womack [9] their study consider several speech parameters mel delta-mel delta-delta-mel auto correlation mel and cross-relation-mel cepstral parameters Considered several speech feature as potential stress sensitive relayers using SUSAS database An algorithm for speaker dependent stress classification is formulated for 11 stress conditions Angry clear Cond50 Cond70 Fast Lombard Loud Normal Question Slow and Soft Given robust set of features a neural network classifier is formulated based on an extended Delta-bar-delta learning rule By employing stress class group classification rate are further improve by +17-20 to 77-81 using five word closed of vocabulary test set Most useful featured for separating the selected stress condition is by auto correlation of Mel-Cepstral(AC-Mel) parameters7 Resa in 2008 [7] had work on Anchor Model Fusion (AMF) exploit the characteristic behaviour of the scores of a speech utterance among different emotion

models By mapping to a Back-end anchor-model feature space followed by SVM classifier AMF used to combine scores from two prosodic emotion recognition system denoted as GMM-SVM and statistics-SVM Result measured in term of equal error (err) showing relative improvement 15 and 18 Ahumuda III and SUSAS Simulated corpora respectively While SUSAS Actual show neither improvement not degradation8 2010 Namachai M [8] in build emotion detection the pitch energy and speaking rate observed to carry the most significant characteristics of affect in speech Method uses least square support vector machines which computed sixty features from the input utterances from the stressed input utterances The features are fed into a five state Hidden Markov mathemathical Model (HMM) and a Probabilistic Neural Network(PNN) Both speech classify the stressed speech into four basic categories of angry disgust fear and sad New feature selection algorithm called Least Square Bound (LSBOUND) measure and have both advantage of filter and wrapper methods where criterion derived from leave-one-out cross validation (LOOCV) procedure of LSVM Average accuracy for both the classification methods adopted for PNN and HMM are 971 and 907 respectively9 Report by Ruhi Sarikaya and JNGowdy [9] proposed new set feature base on wavelet analysis parameters Scale Energy (SE) Autocorrelation-Scale-Energy (ACSE) Subband based cepstral parameters (SC) and autocorrelation-SC (ACSC) Feed forward neural network of multi-layer-peceptron (MLP) is formulated for speaker- dependent stress classification of 10 stress conditions Angry clear Cond5070 Fast Loud Lombard Neutral Question Slow and Soft Subband base features shown to achieve +73 and 91 increase in classification rates Average scores across the simulations of new features are +86 and +136 higher than MFCC based features for ungroup and grouped stress set scenarious respectively The overall classification rates of MFCC based are around 45 While subband base parameters achieved higher in particular SC parameter received 591 higher than MFCC based10 In 1998 Nelson et al [10] investigation was done to several feature across the style classes of emotion classes of the simulated portion of the SUSAS database The feature considered a recently introduced measure of speaking rate called mrate shimmer jitter and feature from fundamental frequency (F0) contours For each speaker and feature pair a Multivariate Analysis of Variance (MANOVA) used to determine if any statistically significant differences (SSDs) existed between the feature means for the various styles for that speaker The dependent-samples t-test with Bonferroni procedure used to control familywise error rate for the pairwise style comparison The standard deviation range are 111-653 for Shim 014-066 for ShdB 040-431 for Jitt 1983-41849 for jita and F0 The trained using speaker-dependent Maximum likelihood classifiers resulted group S1 S4 and S7 showed good result while S2 S3 S5 and S6 groups do not show consistence result across speakers11 A benchmark comparison of performances by B Schuller et al [11] in using the two-predominant paradigms Modeling a frame-level by means of Hidden Markov Models and modeling suprasegmental by systematic feature brute-forcing Comparability among corpora sets done by clustered each databasersquos emotions into binary and valences In frame-level modeling rersearcher employed 39 dimensional feature vector per each frame consisting 12 MFCC and log frame energy plus speed and acceleration coefficients HTK toolkit used to built this modeled had used forward-backward and

Baumwelch re-estimation algorithm While using openEAR toolkit in suprasegmental modeling features are extracted as 39 functional of 56 acoustic low-level descriptors (LLD) The classifier of choise for suprasegmentel is support vector machine with polynomial kernel and pairwise multi-class discrimination based on Sequential Minimal Optimisation Comparing the frame-level modeling with supra-segmental modeling it seems to be more superior for corpora containing variable content which subject not restrict to predefined script while supra-segmental modeling outperforms the frame-level by large on corpora where the topicscript is fixed12 K Paliwal and et al (2007) since the speech signal is known to be robust to noise it is expected that the high energy regions of the speech spectrum carry the majority of the linguistic information This paper tries derived the frequency warping which refer as the speech signal based frequency cepstral coefficient or SFFCs directly from the speech signal by sampling the frequency axis non-uniformity with the high energy regions sample more densely than the low energy regions The average short time power spectrum is computed from speech corpus The speechndashsignalndashbased frequency warping is obtained by considering equal area portion of the log spectrum The warping used in filterbank design for the automatic speech recognition system Result show that the cepstral features based on the proposed warping achieve performance under clean conditions comparable to that of mel frequency cepstral coffficients (MFCC) while outperforming them under noisy conditions1314 Speech-based affect identification was built by syaheerah LL JM Montero and et al (2009) they employed speech modality for the affective-recognition system The experiment was based on a Hidden Marcov Model Classifier for emotion identification in speech Prior to experiment result that certain speech feature is more precise to identify certain emotional state and happiness the most difficult emotion to detectThe training algorithm used to optimize the parameters of the architecture was Maximum likelihood This software employs the Baum-Welch algorithm for training and the Viterbi algorithm for recognition All utterance processed in frames of 25ms window with 10ms frame shift The two common signal presentation coding techniques employed are the Mel-frequency Cepstral coefficient (MFCC) and linear Prediction Coefficient (LPC) and are improved using common normalization techniques Cepstral Mean Normalization (CMN) Research point out that dimensionality of speech waveform reduce when using cepstral analysis while dimensionality features vector increased by extended of features form features vectors that include derivatives and acceleration information Recognition results according to features with 30 Gaussians per state and 6 training iteration base-PLP feature with derivatives and accelerations declined while it was opposite for the MFCC In contrast the normalized features without derivatives and acceleration are almost equal to that of features with derivatives and accelerations Base-PLP with derivatives is shown as slightly better features with error rate 159 percent While base-mfcc without derivatives and acceleration also 159 percent where mfcc was the precise feature at identifying happiness1516 KR Aida-Zade C Ardil et Al (2006) highlight in the computing algorithm of speech features for being main part of the speech recognition From this point of view they combines use of cepstrals of Mel Frequency Cepstral Coefficient(MFCC) and Linear

predictive coding to improve the reability of speech recognition system To this end the recognition system is divided into MFCC and LPC subsystem The training and recognition processes are realized both subsystem separately by artificial neural network in the automatic speech recognition system The Multilayer Artificial neural network (MANN) was trained by the conjugate gradient method The recognition system get the same result of the subsystem The result is the decrease of the error rate during recognition The subsystem of LPC result with the less error of 417 while MFCC result 449 while the combination of the subsystem result 151 error rate17 Martin W Florian E et al (22 Firoz S R Sukumar and Babu AP (2010) created and analyzed three emotional databases For the feature extraction Daubechies-8 type of mother wavelet of discrete wavelet Transformation (DWT) has been used and Artificial Neural Network of Multi Layer Perceptron (MLP) were used for the classification of the pattern The MLP networks are learn with using Backward Propagation algorithm which is widely using in machine learning applications [8] MLP uses hidden layers to classify succesfully the patterns into different classsesThe speech samples are recorded 8kHz frequency range or 4kHz band limited Then using Daubechies-8 wavelet the successive decomposition of the speech signals to obtain feature vector The database divide by 80 training and remaining for testing the classifier Thus overall accuracy were 72056605 7125 could be obtain for malefemale and combine male and female database respectively43 Ling He M argaret L Namunu Maadage et al (2009) use new method to extract characteristic features of stress and emotion from speech magnitude spectrograms In the first approach the spectrograms are sub-divided into ERB frequency bands and the average energy for each band is calculated The analysis was perform at three alternative sets of frequency bandscritical bands Bark scale bands and equivalent rectangular bandwidth(ERB) scale bands Throgh 2nd approach the spectrograms are passed through a bank of 12-Gabor filters and output are average and passed through optimal feature selection procedure based on mutual information criteria Methods were test using vowels word and sentences from SUSAS database from three class stress and spontaneous speech made by psychologist (ORI) with 5 emotional classes The classification result based on the Gaussian model show correct classification rates of 40-81 for SUSAS and 40-534 for the ORI data base44 R barra JM Montero JM-Guarasa DrsquoHaro RS Segundo Cordoba carried out the 46 Vassilis P and Petros M (2003) had proposed some advances on speech analysis using generalized dimensions The development of nonlinear signal processing system suitable to detect such phenomenon and extract related information as acoustic signals This papers explore modern methods and algorithms from chaotic systems theory for modeling speech signals in multidimensional phase space and extracting characteristic invariant measures such generalized fractal dimension Non-linear systems based on chaos theory can model various aspects of the nonlinear dynamic phenomenon occurring during speech production Such measures can capture valuable information for the characterization of the multidimensional phase space since they are sensitive on the frequency that he attractor visit different region Further Vassilis integrate some of chaotic features with the standard ones (cepstrum) to develop a generalized hybrid set of short-time acoustic feature for speech signals Demonstrate result of it efficacy showed slight improvement in HMM-based phoneme recognition

48 NS Srikanth Supriya NP Arunjitha G NK Narayan (2009) in present a method of extracting Phase Space Point Distribution (PSPD) parameter for improving speech recognition systemThe research utilizing nonlinear or chaotic signal processing techniques to extract time domain based spase features Accuracy of the speech recognition system can be proove by appending the time domain based PSPD The parameter is proved invariant to the speaking style say prosody of speech The PSPD parameter found to be relevant parameter by combining with the conventional frequency based parameter MFCCThe study of a Malayalam vowel here a whole speech signal vowel is considered as single frame to explain the concept of the phase space mapBut when handle with wordsword signal be split by 512 samplesframe and phase space parameter is extractedThe phase space map is generated by plotting X(n) versus X(n+1) of a normalized speech data sequence of a speech segment ie frame The phase space map is devide into grid 20x20 boxesThe boxe define by co-ordinate (19)(-91) is taken as location 1 Box just right to it is taken as location 2 and it extended to X direction50 Micheal A C (2000) proposed generalized sound recognition system used to reducedndashdimension log-spectral features and a minimum hidden Markov model classifier The method generality were test by soughting sound classes which consisting time-localize events sequences textures and mixed scenesto address problem of dimensionality and redundancy while keeping the complete spectral Thus the used of low dimensional subspace via reduced-rank spectral basis The used independent subspace analysis (ISA) for extracting statistically independent reducedndashrank features from spectral informtion The singular value decomposition (SVD) used to estimate new basis for data and the right singular basis functions were cropped to yirld fewer basis functions that are passed to independent analysis (ICA) The decorrelated SVD had reduced-rank features and the ICA imposed the additional constraint of minimum mutual information between the marginal of the output features The representation affect the presentation in two HMM classifier One which used the complete spectral information yield result of 6061 while the other using the reduced ndashrank shown result of 9265

52 Jesus D AFernando D et al (2003) had using nonlinear features for voice disorder detectionThey test features from dynamical system theory namely correlation dimension and Lyapunove ExponentHere they study of the optimal size of time window for this type of analysis in the field of the characterization of the voice quality Also classical characteristics have been divided into five groups depending on the physical phenomenon that each parameter quantifies quantifying the variation in amplitude (shimmer) quantifying the presence of unvoiced frames quantifying the absence of wealth spectral (Hitter) quantifying the presence of noise and quantifying the regularity and periodicity of the waveform of a sustained voiced voice In this work the possibility of using nonlinear features with the purpose of detecting the presence of laryngeal pathologies has been explored The four measures proposed will be used mean value and time variation of the Correlation Dimension and mean value and time variation of the Maximal Lyapunov Exponent values The system is based on the use of a Neural Netwok classifiers where each one discriminates frames of a certain vowel Combinational logic has been added to evaluate the success rate of each classifier In this study normalized data (zero-mean and variance one) have been used A global success rate of 91 77 is obtained using classic parameters whereas a global success rate of 9276 using

classic parametersrdquo and ldquononlinearrdquo parameters These results show the utility of new parameters64 J KrajewskiD SommerT Schnupp doing study on a speech signal processing method to measure fatigue from speech Applying methods of Non Linear Dynamics (NLD) provides additional information regarding the dynamics and structure of fatigue speech The research achieved significant correlations between fatigue and NLD features of 02969 Chanwoo K (2009) present new features extraction algorithm of Power Normalized-Cepstral Coeficient (PNCC) Major of this PNCC processing include the use of a power law nonlinearity that replaces the traditional log nonlinearity used in MFCC coefficient Further he suppress background excitation using medium-duration power estimation based on the ratio of the arithmetic mean to the geometric mean to estimate the degree of speech corruption and substracting the medium duration background power that is assumed to present the unknown level of background stimulation In addition PNCC uses frequency weighting based on gammatone filter shape rather than the triangular frequency weighting or the trapeizodal frequency weigthing associated with the MFCC and PLP respectively Result of PNCC processing provide substantial improvement in recognition accuracy compared to MFCC and PLP To evaluate the robustness of the feature extraction Chanwoo digitally added three different type of noisewhite noise street noise and background music The amount of lateral treshold shift used to characterize the improvement in recognition accuracy For white noise PNCC provides improvement about 12dB to 13dB compare to MFCC For the Street and noise and music noise PNCC provide 8 dB and 35 dB shifts respectively7174 Suma SA KS Gurumurthy (2010) had analyzed speech with a Gammatone filter bank The full band speech signal splitted into 21 frequency band (sub bands) of gammatone filterbank For each subband speech signal pitch is extractedThe signal to Noise Ratio of the each subband is determined Next the average of pitch period of highest SNR subband used to obtain a optimal pitch value75 Wannaya N C Sawigun (2010) Design analog complex gammatone filter in order to extract both envelope and phase information of the incoming speech signals as well to emulate basilar membrane spectral selectivity to enhance perceptive capability of a cochlear implant processor The gammatone impulse response is transform into frequency domain and the resulting 8th order transfer function is subsequently mapped onto a state-space description of an othonormal ladder filter In order to preserve the high frequency domain gammatone impulse response been combine with the Hilbert tansform Using this approach the real and imaginary transfer functions that share the same denominator can be extracted using two different C matric The propose filter is designed using Gm-C integrators and sub-treshold CMOS devices in AMIS 035um technology Simulation results using Cadence RF Spectra Conform the design principle and ultra low power operation

xX Benjamin J S and Kuldip K P (2000) introduce a noise robust spectral estimation technique for speech signals as Higher-lag Autocorrelation Spectral Estimation (HASE) It compute magnitude spectrum of only the one-sided of the higher-lag portion of the autocorrelation sequence The HASE method reduces the contribution of noise components Also study Introduce a high dynamic range window design method called Double Dynamic Range (DDR)

The HASE and DDR techniques is used in a modified Mel Frequency Cepstral Coefficient (MFCC) algorithm to produce noise robust speech recognition features New features then called Autocorrelation Mel Frequency Cepstral Coefficients (AMFCCs) The recognition performance of AMFCCs to MFCCs for a range of stationary and non-stationary noises(They used an emergency vehicle siren frame and artificial chirp noise frame to highlight the noise reduction properties of the HASE algorithm ) on the Aurora II database showed that the AMFCC features perform as well as MFCCs in clean conditions and have higher noise robustness 77 Aditya B KARoutray T K Basu (2008) proposed new features based on normal and Teager energy operated wavelet Packet Cepstral Coefficient (MFCC2 and tfWPCC2) computed by method 2 There are 6 full-blown emotions and also for neutral were been collectedWhich then the database were named Multi lingual Emotional Speech of North East India (ESDNEI) A total of seven GMMs are trained using EstimationndashMaximization (EM) algorithm one for each emotion In the training of the GMM classifier its mean-vector are randomly initializedHence the entires above train-test procedure is repeated 5 time and finally best PARSS (BPARSS) are determined corresponding to best MPARSS (MPARSS) The standard deviation (std) of this PARSS of these 5 repetition are computed The GMMs are considered taking number Gaussian probabbility distribuation function (pdfs) Ie M=8 1624 and 32In case of computation of MFCC tfMFCC LFPC tfLFPC the values in case of LFPC and tfLFPC features indicate steady convergence of EM algorithmThe highest BMPARSS using WPCC and tfWPCC features are achieved as 951 with M=24 and 938 with M=32 respectively The proposed features tfWPCC2 produced 2nd highest BMPARSS 943 at M=32 with GMM classifier The tfWPCC2 feature computational time (1 hour 15 minutes) is substantially less then the computational of WPCC (36 hours) Also observed that TeagerndashEnergy Operator cause increase of BMPARSS79 Jian FT JD Shu-Yan WH Bao and MG Wang (2007) The combined wavelet packet which had introduced the S-curve of Quantum Neural Network (QNN) to wavelet treshold method to realize Speech Enhancement combined to wavelet package to simulate the human auditory characteristicsThis Simulation result showed the methods superior to that of traditional soft and hard treshold methods This made great improvement of objective and subjective auditory effect The algorithm also presented can improve the quality of voice zZZ R Sarikaya B Pellom and H L Hansen proposed set of feature parameter The new features were name Subband Cepstral parameter (SBC) and wavelet packet parameter (WPP) This improve the performance off speech processing by formulate parameters that is less sensitive to background and convolutional channel noise The ability of each parameter to capture the speakers identity conveyed in speech signal is compared to the widely used MFCCThrough this work 24 subband wavelet packet tree which approximates the Mel-Scale frequency division used The wavelet packet transform is computed for the given wavelet treewhich result a sequence of subband signals or equivalently the wavelet packet transform coefficients at the leave of the tree The energy of sub-signals for each subband is computed and then scaled by the number of transform coefficient in that subband SBC parameter are derived from subband energies by applying the Discrete Cosine Transformation transformation The wavelet Packet Parameter (WPP) shown decorelates the filterbank energies better in coding applications Further more the feature energy for each frame is normalized to 10 to eliminated the differences resulting from a different scaling parameters For all the speech evaluated they consistently observed the total correlation term for the wavelet is smaller than the discrete cosine transforms This result confirm the wavelets transform of the log-subband energies better than a DCT The WPPs are derived by taking the wavelet transform of

the log-subband energies The used of the Gaussian Mixture Model is linear cobination of M Gaussian mixture densitiesIt is motivated by the interpretation that the Gaussian components represent some general speaker-dependent spectral shapeThe models then are trained using the Expectation Maximization algorithm (EM) Simulation conduct on TIMIT databases that contain 6300 sentences spoken by 630 speakers and 168 speakers from TIMIT test speaker set had been downsampled from originally16kHz to 8kHz for this study The simulation results with 3 seconds of testing data on 630 speakers for MFCC SBC and WPP parameters are 948 960 and 973 respectively And WPP and SBC have achieved 988 and 985 respectively for 168 speakers WPP outperformed SBC on the full test set

Zz1 Zhang X Jiao Z (2004) proposed a speech recognition front-end processor that used wavelet packet bandpass filter as filter model in which it represent the wavelet frequency band partition based on ERB Scale and Bark Scaled which are used in a practical speech recognition method In designing WP a signal space Aj of multi-rate solution is decomposed by wavelet transform into a lower resolution space A j+1 and a detail space Dj+1 This done by deviding orthogonal basis ( ǿj (t-2jn))nZ of Aj into two new orthogonal bases (ǿ j+1(t-2j+1n))nZ of A j+1 (ψ j+1(t-2j+1n))nZ

of Dj+1 Where Z is a set of integers ǿ(t) and ψ(t) are scaling and wavelet function respecively This decomposition can be achieved by using a pair of conjugate mirror filters which divides the frequency band into two equal halves The WP decomposes the approximate spaces as well as detail spaces Each subspace in the tree is indexed by its depth j and number of subspaces p below itThe two WP orthogonal bases at a parent node (jp) are defined by ψ 2p

j+1(k) = Σinfinn=-infin h[n] p

j ( u-2jn) and ψ 2p+1j+1(k) = Σinfin

n=-infin h[n] p j

( u-2jn) where h(n) is a low pass and g(n) is a high pass filter given by the equations of h[n] = ψ 2p

j+1(u)ψ p j ( u-2jn) and h[n] = ψ 2p+1

j+1(u)ψ p j ( u-2jn) Decomposition by wavelet

packet had partition the frequency axis at higher frequency side into smaller bands that cannot be achieved by using discrete wavelet transform An admissible binary tree structure is achieved by choosing the best basis selection algorithm based on entrophyDue to this problem a fixed partition of frequency axis is performed in manner such it closely matches the Bark scale or ERB rate and results in an admissible binary WP tree structureThis study prepared center bandwidth of sixteen critical bandwidth of according the Bark unit and closely wavelet packet decomposition frequency band The study selected Daubechies wavelet as FIR filters whose order is 20 In training the algorithm 9 speakers utterances were used and 7 speakers used in test Matlab wavelet toolbox and Mallat algorithm used to computes wavelet transform and inverse transform Using C++ ZCPA program to deals with the results obtained from first step to gain feature data file Multi Layer perception (MLP) used as a classifier in speech recognition for train and test From the result of recognition rates gain 8143 and 8381 respectively based on wavelet Bark scale and ERB scalersquos which are used in front end processor of the speech recognition system Result gain explain that the cause of the performance not accurately equal to the originally frequency band of Bark and ERB 82 LLin E Ambikairajah and WH Holmes (2001) proposed critical band auditory filterbank with superior masking properties The output analysis filter are proposed to obtain the series of pulse trains that represent neural firing generated by auditory nerves Motivated by Inexpensive method for generating psychoacoustic tuning curve from well

known auditory masking curve two new approach to obtain the critical band filterbank that model this tuning curves are Log modelling technique which gives very accurate results and second is unified transfer function to represent each filter in the critical band filterbank Then Simultaneous and temporal masking models are applied to reduce the number of pulses and achieved a compact time frequency parameterization Run-length coding algorithm used to in coding the pulse amplitudes and pulse positions 103 RGandhiraj DrPSSathidevi (2007)had Modeling of Auditory Periphery that acting as front-end model of speech signal processing The two quantitave model for signal processing in auditory system promote are Gammatone Filter Bank (GTFB) and Wavelet Packet (WP) as front end robust speech recognition The classification system done by the neural network using backpropogation (BP) algorithm From the identified proposed system problem using of the auditory feature vectors was measured by recognition rate with various signal to ratios over -10 to 10 dB Comparing for the performances of proposed models with gammatone filterbank and wavelet packet with front-ends the proposed method showed that the proposed system with wavelet packet as front-end and back propagation Neural Network(BPNN) as for the training and recognition purpose as having good recognition rate over -10 to 10dB112 Manikandan tried to reduce the power of the noise signal and raise the power level of the informative signal at the receiver and improvement in the signal to noise ratio(SNR) For this aim an adaptive signal processing technic using Grazing Estimation technique usedHe trying to graze through the informative signal and find the approximate noise and information content at every instant Technique is based on having first two samples of the original signal correctly The third sample estimated using the two samples This done by finding the slope between the first two samples for the third sampleThe estimated sample is substracted with the value at that instant in the received signalThe implementation of wavelet denoising is three step procedure involving wavelet decomposition nonlinear treshold and wavelet reconstructing Here the perceptual speech wavelet denoisig system using adaptive time-frequency treshold estimation (PWDAT) is utilized Based on efficient 6-stage tree structure decomposition using 16-tap FIR filters derived from the Daubeches wavelet For 8 kHz speech the decomposition results in 16 critical bands Down sampling operation in multi-level wavelet packet transform result in multirate signal representation (The resulting number of samples corresponding to index m differ for each subband) Wavelet denoising involves tresholding in which coefficient below a specific value are set to zero Here new adaptive time-frequency dependent treshold is used First involve the estimating the standard deviation of the noise for every subband and time frame For this a quintile-based noise tracking approach is adapted Supression filter then applied to decomposed noisy coefficients The last stage simply involves resynthesizing the enhance speech using the inverse perceptual wavelet transform219 Yu Shao C Hong (2003) proposed a versatile speech enhancement system Where a psycoacoustic model is incorporated into the wavelet denoising technique Thus enable to combat different adverse noise conditions A feed forward subsystem connected to the Perceptual Wavelet Transform (PWT) processor soft-treshold based denoising scheme and unvoice speech enhancementThe average normalized time frequency energy guide the feedeforward tresholding to reduce the stationary noise while the non-statioary and correlated noise are reduce by an improved wavelet denoising technique with soft-tresholdingResulting the system which capable reducing noise with little speech degradation

CHAPTER 3METHODOLOGY

USING LINEAR FEATURE

MFCC

PNCC

Auditory based WAVELET PACKET DECOMPOSITION-ENERGYENTROPY coefficientsThrough wavelet packet decomposition we ideally had taken the approach in the process of mimicking human auditory system operation in analysing our emotional speech signal There are many type of auditory based filter had been develope to improve the speech processing

Gammatone ndashENERGYENTROPY coefficients

PLPs

Correlations between linear and nonlinear measures

Using Nonlinear Features

Nonlinear Dynamical System The Embedding TheoremThe Chaos Theory can be used to gain a better understanding and interpretation of observed complex dynamical behaviour Besides It can give some advantages in predicting or controlling such time evolution Deterministic dynamical systems describe the time evolution of a system in some state space Such an evolution can be described case by ordinary differential equations

x1048581(t) = F(x(t)) (1)

or in discrete time t = nΔt by maps of the form

( ) n 1 n x = F x + (2)

Unfortunately the actual state-vector only can be inferred for quite simple systemsand as anyone can imagine the dynamical system underlying the speech productionprocess is very complex Nevertheless as established by the embeddingtheorem 0 it is possible to reconstruct a state space equivalent to the original oneFurthermore a state-space vector formed by time-delayed samples of the observation(in our case the speech samples) could be an appropriate choice

s n =[s(n) s(n minusT )hellip s(n minus (d minus1)T )]t (3)

where s(n) is the speech signal d is the dimension of the state-space vector T is atime delay and t means transpose Finally the reconstructed state-space vector dynamic n 1 ( n ) s = F s + can be learned through either local or global models which in turn will be polynomialmappings neural networks etc

Correlation Dimension

The correlation dimension 2 D gives an idea of the complexity of the dynamics Amore complex system has a higher dimension which means that more state variablesare needed to describe its dynamics The correlation dimension of a random noise isnot bounded while the correlation dimension of a deterministic system yields a finitevalue The correlation dimension can be obtained as follows

XXXXX

The Largest Lyapunov Exponent

Chaotic behaviour arises from the exponential growth of infinitesimal perturbations This exponential instability is characterized by the Lyapunov exponents Lyapunov exponents are invariant under smooth transformations and are thus independent of the measurement function or the embedding procedure The largest Lyapunov exponent can be determined without the explicit construction of a model for the time series It considers the representation of the time series as a trajectory in the embedding space and assume that you observe a very close return n s to a previously visited point n s Then one can consider the distance 0 n n Δ = s minus s as

an small perturbation which should grow exponentially in time Its evolution can befollowed from the time series

l n l n l s s + + Δ = minus

If one finds that l l oΔ asymp Δ eλ λ

is the largest Lyapunov exponent[52]

Classifiers

Computational Approach

Computationally discriminant function analysis is very similar to analysis of variance (ANOVA) Let us consider a simple example Suppose we measure energy in a sample of 630 utterancesOn the average and this difference will be reflected in the difference in means for the energy variable from each type of emotional speech produced Therefore variable energy allows us to discriminate between type of emotion with a better than chance probability if a person is anger then he is likely to have greater energy if a person is neutral then he is likely to have a low energy

We can generalize this reasoning to groups and variables that are less trivial To summarize the computation definition so far the basic idea underlying discriminant function analysis is to determine whether groups differ with regard to the mean of a variable and then to use that variable to predict group membership of the type of emotion uttered by the speakers

Analysis of Variance Stated in this manner the discriminant function problem can be rephrased as a one-way analysis of variance (ANOVA) problem Specifically one can ask whether or not two or more groups are significantly different from each other with respect to the mean of a particular variable To learn more about how one can test for the statistical significance of differences between means in different groups you may want to read the Overview section to ANOVAMANOVA However it should be clear that if the means for a variable are significantly different in different groups then we can say that this variable discriminates between the groups

In the case of a single variable the final significance test of whether or not a variable discriminates between groups is the F test As described in Elementary Concepts and ANOVA MANOVA F is essentially computed as the ratio of the between-groups variance in the data over the pooled (average) within-group variance If the between-

group variance is significantly larger then there must be significant differences between means

Multiple Variables Usually one includes several variables in a study in order to see which one(s) contribute to the discrimination between groups In that case we have a matrix of total variances and covariances likewise we have a matrix of pooled within-group variances and covariances We can compare those two matrices via multivariate F tests in order to determined whether or not there are any significant differences (with regard to all variables) between groups This procedure is identical to multivariate analysis of variance or MANOVA As in MANOVA one could first perform the multivariate test and if statistically significant proceed to see which of the variables have significantly different means across the groups Thus even though the computations with multiple variables are more complex the principal reasoning still applies namely that we are looking for variables that discriminate between groups as evident in observed mean differences

LDA Classifier

Linear discriminant analysis (LDA) and the related Fishers linear discriminant are methods used in statistics pattern recognition and machine learning to find a linear combination of features which characterize or separate two or more classes of objects or events LDA is closely related to ANOVA (analysis of variance) and regression analysis which also attempt to express one dependent variable as a linear combination of other features or measurements In the other two methods however the dependent variable is a numerical quantity while for LDA it is a categorical variable (ie the class label) LDA works when the measurements made on independent variables for each observation are continuous quantities When dealing with categorical independent variables the equivalent technique is discriminant correspondence analysis Traditional linear discriminant analysis requires that the predictor variables be measured on at least an interval scale In linear discriminant analysis the number of linear discriminant functions that can be extracted is the lesser of the number of predictor variables or the number of classes on the dependent variable minus one In the linear discriminant analysis the raw canonical discriminant function coefficients for Longitude and Latitude on the (single) discriminant function are 122073 and -633124 respectively Discriminant Analysis could then be used to determine which variable(s) are the best predictors

LDA for two classes

Consider a set of observations (also called features attributes variables or measurements) for each sample of an object or event with known class y In this research

in analysing the emotions in utterances we pairwise the object of the emotion coeffiticient such Natural vs Angry Natural vs Loud Natural vs Lombard and the Natural Vs Angry vs Lombard vs Loud This set of samples coefficient measured is called the training set The classification problem is then to find a good predictor for the class y of any sample of the same distribution given only an observation

LDA approaches the problem by assuming that the conditional probability density

functions and are both normally distributed with mean and

covariance parameters and respectively Under this assumption the Bayes optimal solution is to predict points as being from the second class if the ratio of the log-likelihoods is below some threshold T so that

Without any further assumptions the resulting classifier is referred to as QDA ( quadratic discriminat analysis) LDA also makes the simplifying homocedasic assumption (ie that the class covariances are identical so Σy = 0 = Σy = 1 = Σ) and that the covariances have full rank In this case several terms cancel and the above decision criterion becomes a threshold on the dot product

for some threshold constant c where

This means that the criterion of an input being in a class y is purely a function of this linear combination of the known observations

It is often useful to see this conclusion in geometrical terms the criterion of an input being in a class y is purely a function of projection of multidimensional-space point onto direction In other words the observation belongs to y if corresponding is located on a certain side of a hyperplane perpendicular to The location of the plane is defined by the threshold c

Multiclass LDA

In the case where there are more than two classes the analysis used in the derivation of the Fisher discriminant can be extended to find a subspace which appears to contain all of the class variability Suppose that each of C classes has a mean μi and the same covariance Σ Then the between class variability may be defined by the sample covariance of the class means

where μ is the mean of the class means The class separation in a direction in this case will be given by

This means that when is an eigenvector of Σ minus 1Σb the separation will be equal to the corresponding eigenvalue Since Σb is of most rank C-1 then these non-zero eigenvectors identify a vector subspace containing the variability between features These vectors are primarily used in feature reduction as in PCA The smaller eigenvalues will tend to be very sensitive to the exact choice of training data and it is often necessary to use regularisation as described in the next section

Other generalizations of LDA for multiple classes have been defined to address the more general problem of heteroscedastic distributions (ie where the data distributions are not homocedastic) One such method is Heteroscedastic LDA (see eg HLDA among others)

If classification is required instead of dimension reduction there are a number of alternative techniques available For instance the classes may be partitioned and a standard Fisher discriminant or LDA used to classify each partition A common example of this is one against the rest where the points from one class are put in one group and everything else in the other and then LDA applied This will result in C classifiers whose results are combined Another common method is pairwise classification where a new classifier is created for each pair of classes (giving C(C-1)2 classifiers in total) with the individual classifiers combined to produce a final classification

Fishers linear discriminant

The terms Fishers linear discriminant and LDA are often used interchangeably although Fisherrsquos original article [1] actually describes a slightly different discriminant which does not make some of the assumptions of LDA such as normally distributed classes or equal class covariances

Suppose two classes of observations have means and covariances Σy = 0Σy = 1

Then the linear combination of features will have means and variances

for i = 01 Fisher defined the separation between these two distributions to be the ratio of the variance between the classes to the variance within the classes

This measure is in some sense a measure of the signal-to-noise for the class labelling It can be shown that the maximum separation occurs when

When the assumptions of LDA are satisfied the above equation is equivalent to LDA

Be sure to note that the vector is the normal to the discriminant hyperplane As an example in a two dimensional problem the line that best divides the two groups is perpendicular to

Generally the data points to be discriminated are projected onto then the threshold that best separates the data is chosen from analysis of the one-dimensional distribution There is no general rule for the threshold However if projections of points from both classes exhibit approximately the same distributions the good choice would be

hyperplane in the middle between projections of the two means and In this case the parameter c in threshold condition can be found explicitly

Tresholding

lsquoWdenrsquo is a one-dimensional de-noising function lsquowdenrsquo performs an automatic de-noising process of a one-dimensional signal using wavelets The underlying model for the noisy signal is basically of the following form

where time n is equally spaced In the simplest model suppose that e(n) is a Gaussian white noise N(01) and the noise level a is supposed to be equal to 1 The de-noising objective is to suppress the noise part of the signal s and to recover f The de-noising procedure proceeds in three steps

1 Decomposition Choose a wavelet and choose a level N Compute the wavelet decomposition of the signal s at level N

2 Detail coefficients thresholding For each level from 1 to N select a threshold and apply soft thresholding to the detail coefficients

3 Reconstruction Compute wavelet reconstruction based on the original approximation coefficients of level N and the modified detail coefficients of levels from 1 to N

When looking at the operation function calling below

[XDCXDLXD] = wden(XTPTRSORHSCALNwname)

returns a de-noised version XD of input signal X obtained by thresholding the wavelet coefficients Additional output arguments [CXDLXD] are the wavelet decomposition structure of the de-noised signal XD

TPTR string contains the threshold selection rule

rigrsure uses the principle of Steins Unbiased Risk

heursure is an heuristic variant of the first option

sqtwolog for universal threshold

minimaxi for minimax thresholding (see thselect for more information)

SORH (s or h) is for soft or hard thresholding (see wthresh for more information)

SCAL defines multiplicative threshold rescaling

one for no rescaling sln for rescaling using a single estimation of level noise based on first-level

coefficients

mln for rescaling done using level-dependent estimation of level noise

Wavelet decomposition is performed at level N and wname is a string containing the name of the desired orthogonal wavelet The detail coefficients vector is the superposition of the coefficients of f and the coefficients of e and that the decomposition of e leads to detail coefficients that are standard Gaussian white noises Minimax and SURE threshold selection rules are more conservative and are more convenient when small details of function f lie in the noise range The two other rules remove the noise more efficiently The option heursure is a compromise

Soft Tresholding

Soft or hard thresholding

Y = wthresh(XSORHT)

Y = wthresh(XSORHT) returns the soft (if SORH = s) or hard (if SORH = h) thresholding of the input vector or matrix X T is the threshold value

Y = wthresh(XsT) returns soft thresholding is wavelet shrinkage ( (x)+ = 0 if x lt 0 (x)+ = x if x 0 )

Hard Tresholding

Y = wthresh(XhT) returns hard thresholding is cruder[I]

References1Guojun Zhou John HL Hansen and James F Kaiser ldquoClassification of speech under Stress Based On Features Derived from the Nonlinear Teager Energy Operatorrdquo Robust Speech Processing Laboratory Duke University Durham 19962 OW Kwon K Chan J Hoa Te-Won Lee rdquoEmotion Recognition by Speechrdquo Institute for Neural computation Sandiego GENEVA EUROSPEECH 20033S Casale ARusso GScebba rdquoSpeech Emotion Classifictaion Using Machine Learning Algorithmsrdquo IEEE International Conference on Semantic Computing 20084Bogdan Vlasenko Bjorn SchullerAndreas W Gerhard Rigolrdquo Combining Frame and Turn Level Information For Robust Recogniyion of Emotions within Speechrdquo Cognitive System Otto-von-Gueric University and Institute for human-Machine Communication Technische Universitat Munchen Germany20075Ling He Margaret Lech Namunu Maddage ldquoStress and Emotion Recognition Using Log-Gabor Filter Analysis of Speech Spectrgramsrdquo School of Electrical and Computer Engineering RMIT University Australia 20096 JHL Hansen and BD WomackrdquoFeature Analysis and Neural Network-Based Classificatin of Speech Under Stressrdquo Robust Speech Processing Laboratory IEEE Transaction On Speech and Audio ProcessingVol4 No 4 19967Resa CO Moreno IL Ramos D Rodriguez JG ldquoAnchor Model Fusion for Emotion Recognition in Speechrdquo ATVS-Biometric Group LNCS 5707 pp 49-56Springer Heidelberg 20098NachamaiMTSanthanamCPSumathirdquoAnew Fangled Insunuation for stress Affect Speech ClassificationrdquoInternational Journal of computer Applications Vol 1No19 20109Ruhi Sarikaya John N Gowdy rdquoSubband Based Classification of Speech Under Stress Digital Speech and Audio Processing Laboratory Clemson University199710RES Slyh WT Nelson EG HansenrdquoAnalysis of Mrate Shimmer Jitter F0

Contour Features Across Stress and Speaking Style in the SUSAS databaserdquo Air Force Research Laboratory Human Effective Directorate Ohio1998

11B Schuller B Vlasenko F Eyben GRigollA Wendemuth Acoustic Emotion Recognition A Benchmark Comparison of Performancesldquo IEEE 200912KPaliwal B Shannon JLyons Kamil W rdquoSpeech Signal Based Frequency Wrappingrdquo IEEE Signal Processing letters14Syaheerah L Lutfi J M Montero R Barra-Chicote J M Lucas-CuestardquoExpressive Speech Identifications based on Hidden Markov ModelrdquoInternational conference of Health informatic HEALTHINF200916KR Aida-Zade C Ardil SS Rustamov rdquoInvestigation of combined use of MFCC and LPC Features in Speech Recognition Systemsrdquo World Academy of Science Engineering and Technology 19200622 Firoz Shah A Raji Sukumar Babu Abto P ldquoDiscrete wavelet transforms and Artificial Neural Network for Speech Emotion recognitionrdquoIACSIT2010 43Ling He Margaret L Namunu MaadageNicholas ArdquoStresss and Emotion Using Log Gabor Filter Analysis of speech SpectrogramsIEEE200946Vasillis Pitsikalis Petros MargosrdquoSome Advances on Speech Analysis using Generalized Dimensions ISCA200348NS Sreekanth Supriya N Pal Arunjath G lsquoPhase Space Point Distribution E Parameter for Speech RecognitionThird Asia International Conference on Mpdelling and Simulation 200950Micheal A Casey lsquoReduced-Rank Spectra and Minimum-Entropy Priors as Consistent and Reliable Cues for Generalized Sound Recognitionrsquo Cambridge Research Laboratory MERL2000

5269Chanwoo KimR MSternrdquoFeature extraction for Robust Speech recognition using PowerndashLaw Nonlinearity and Power Bias-Substractionrdquo Department of Computer Engineering Carnegie Mellon University Pitsburg 2009xX [Benjamin J Shannon Kuldip K Paliwal(2000)rdquo Noise Robust Speech Recognition using Higher-lag Autocorrelation Coefficientsrdquo School of Microelectronic Engineering Griffth University Australia]77 Aditya BK ARoutray TK Basu ldquoEmotion Recognition from Speeches of Some Native Languages of Assam independent of Text and Speakerrdquo National Seminar on Devices Circuits and Communication Department of ECE India Institute of Technology India200879 Jian-Fu Teng Jian Dong Shu-Yan Wang Hu Bao and Ming Guo Wang (2007)Yu Shao Chip- Hong Chang (2003) 82 LLin E Ambikairajah and WH Holmes ldquoWideband Speech and Audio coding in the pPerceptual DomainrdquoSchool of Electrical Engineering The university of New South Wales Sydney Australia103 RGandhiraj DrPSSathidevi Auditory-Based Wavelet Packet Filterbank for Speech Recognition using Neural Network International Conference On Advance d Computing and CommunicationsIEEE2007Zz1 Zhang Xueying Jiao Zhiping ldquoSpeeh Recognition Based on Auditory Wavelet Packet FilterrdquoIEEE Information Engineering CollegeTaiyuan University Technology China2004 zZZ Ruhi Sarikaya Bryan LP J H L Hansen ldquoWavelet Packet Transform Feature with Application to Speaker Identificationrdquo Robust speech Processing Laboratory Duke UniversityDurham112 Manikandan ldquoSpeech Enhancement base on wavelet denoisingrdquo Anna UniversityINDIA

219Yu Shao Chip-Hong Chang rsquoA Versatile Speech Enhancement Syatem Based on Peceptual Wavelet Denoising lsquo Center for Integrated Circuits and Systems Nanyang Technological UniversitySingapore2001

a]b]Tal Sobol-Shikler rsquoAnalysis of affective expression in speechrsquoTechnical Report Computer Laboratory University of Cambridge pp a]11 b]14 2009

c]

[d][e] Dimitrios V Constantine K lsquoA State of the Art Review on Emotional Speech databasesrsquo Artificial Intelligence amp Information Analysis Laboratory Aristotle University of ThessalonikiGreece2003

[f] P Grassberger and I Procaccia Characterization of strange attractors Physical Review Letters No 50 pp 346-349 1983

[g] G Mayer-Kress S P layne S H Koslow A J Mandell and M F shlesinger Perspectives in biomedical dynamics and theoretical medicine Annals of the New York Academy of Sciences New York USA pp 62-87 1987

[h] B J West Fractal physiology and chaos in medicine World Scientific Singapore Studies of Nonlinear Phenomena in Life Sciences Vol 1 1990

[I] Donoho DL (1995) De-noising by soft-thresholding IEEE Trans on Inf Theory 41 3 pp 613-627

[j] JR Deller Jr J G Proakis J H Hansen (1993) Discrete ndashTime Processing Processing of Speech Signals New York Macmillan httpcnxorgcontentm18086latestuid

Windowing


Statistics



LDA for two classes

Multiclass LDA


Althou speech analysis is unintrusive this do not mean the anlysis as simple at forehand for the emotion to be distinguished however Emotion recognition is a complex pattern recognition problem which relate cognitive and neural approach[22]

In this modern world we are surround by all kinds of signal in various forms Some signals are necessary (speech) and some are pleasant (music) While many are unnecessay and unwanted in a given situation In engineering signal are carriers of information both useful and unwanted The distinction of useful and unwanted information is always subjective as well as objectiveUsing DSP approach it is more possible to convert an inexpensive personal computer into a powerful signal processor Some important DSP advantages are system using DSP can be developed using software running on general purpose computer that the DSP is convenient to develope and test and software is portable DSP operation based solely on addition and multiplications which cause extremely stable processing capability DSP operation also can be modified in real time with often by little programming changing and DSP Because of it advantages DSP is becoming the first choise in many technologies and application such consumer electronics communications wireless telephones ad medical imaging

Roughly speaking for a given learning task with a given finite amount of training data the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set and the ldquocapacityrdquo of the machine that is the ability of the machine to learn any training set without error[d] The interpretation of emotions from variations of physiological expression seem have not been fully achieved using linear measures Therefore linear measures such pitch energy fundamental frequency formant lpcc speaking rate frequency spectral should should not only be concentrated for the analyse of speech emotions to discriminate the physiological expression Nevertheles linear measures circumstances were the main motivation to investigate the application of nonlinear mathematics to the emotions using speechConsequently nonlinear dynamical analysis has emerged a novel method for the study of complex systems during the past few decades Lyapunov exponent Hurst exponent fractal dimension information dimension box dimension and correlation dimension are some of the methods by which the complexity of a system or a signal could be quantified

A significant step forward was made by Grassberger and Procaccia [f] who showed how the so called correlation dimension could be used to estimate and bound the fractal dimension of the strange attractor associated with the nonlinear time signal at hand Specifically in [g] the application of correlation dimension to determine emotion through speech due to the vocal tract nonlinear characteristic was described Correlation dimension is by the most popular due to its computational efficiency compared to other fractal dimensions such as the information dimension capacity dimension and pulse dimension [h]

The novel technique which incorporates Lyapunovrsquos direct method is general flexible and can be easily adapted to analyze the behavior of many types of nonlinear iterative signal processing algorithms[ On the Use of Lyapunov Criteria to Analyze the Convergence of Blind Deconvolution Algorithms]


Speech Production














Windowing



window length is ms

(2-2)


Spectrogram















Statistics




















n=-infin h[n] p j








MFCC

PNCC



PLPs




x1048581(t) = F(x(t)) (1)


( ) n 1 n x = F x + (2)






XXXXX







Classifiers








LDA Classifier


LDA for two classes










Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA



Speech Production














Windowing



window length is ms

(2-2)


Spectrogram















Statistics




















n=-infin h[n] p j








MFCC

PNCC



PLPs




x1048581(t) = F(x(t)) (1)


( ) n 1 n x = F x + (2)






XXXXX







Classifiers








LDA Classifier


LDA for two classes










Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA














Windowing



window length is ms

(2-2)


Spectrogram















Statistics




















n=-infin h[n] p j








MFCC

PNCC



PLPs




x1048581(t) = F(x(t)) (1)


( ) n 1 n x = F x + (2)






XXXXX







Classifiers








LDA Classifier


LDA for two classes










Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA















Statistics




















n=-infin h[n] p j








MFCC

PNCC



PLPs




x1048581(t) = F(x(t)) (1)


( ) n 1 n x = F x + (2)






XXXXX







Classifiers








LDA Classifier


LDA for two classes










Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA




















n=-infin h[n] p j








MFCC

PNCC



PLPs




x1048581(t) = F(x(t)) (1)


( ) n 1 n x = F x + (2)






XXXXX







Classifiers








LDA Classifier


LDA for two classes










Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA





MFCC

PNCC



PLPs




x1048581(t) = F(x(t)) (1)


( ) n 1 n x = F x + (2)






XXXXX







Classifiers








LDA Classifier


LDA for two classes










Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA


x1048581(t) = F(x(t)) (1)


( ) n 1 n x = F x + (2)






XXXXX







Classifiers








LDA Classifier


LDA for two classes










Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA






Classifiers








LDA Classifier


LDA for two classes










Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA




LDA Classifier


LDA for two classes










Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA










Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA
















Tresholding

















coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA












coefficients



Soft Tresholding


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA


Y = wthresh(XSORHT)



Hard Tresholding








c]







Windowing


Statistics



LDA for two classes

Multiclass LDA






c]







Windowing


Statistics



LDA for two classes

Multiclass LDA


REport

Documents

Transcript of REport