[IEEE 2011 National Conference on Communications (NCC) - Bangalore, India (2011.01.28-2011.01.30)]...



Significance of the LP-MVDR Spectral Ratio Method in Whisper Detection

Arpit Mathur and Rajesh M. Hegde
Indian Institute of Technology Kanpur
Kanpur 208016, India
{arpitmat, rhegde}@iitk.ac.in

Abstract: A new spectral ratio method is proposed in this paper for detecting whispered segments within a normally phonated speech stream. The method is based on computing the ratio of the linear prediction (LP) spectrum to the minimum variance distortionless response (MVDR) spectrum. Both the linear prediction method and the LP residual method by themselves are found to be inadequate in modelling medium to high frequencies in the speech signal. In contrast, the MVDR method is robust in modelling the spectrum at all frequencies. This difference in spectral estimation between the two is utilized in the proposed spectral ratio method to separate whispered segments, which have fewer harmonics and more noise, from normally phonated segments of speech. A comparative analysis of the proposed method with other methods like the LP residual and the spectral flatness methods is described. Whisper detection experiments are conducted on the CHAINS database. The proposed method indicates reasonable improvements as noted from the ROC curves and the whisper diarization error rate.

I. INTRODUCTION

Whispered speech is a natural speech mode when secrecy or quietness of the conversation is required. Whispered speech differs from voiced speech in many ways. Whisper is the result of rigid vocal cords with little or no periodic vibration, which leads to the characteristic suppression in intensity as well as comprehensibility of the generated speech signal. Whisper is hence characterized by a shift of formants to higher frequencies, concentration of spectral energy in higher frequency bands, and the lack of a specific harmonic structure. Conventional speech recognition engines are reported to give poor results with whispered speech signals. Hence there is a need to recognize the whispered segments and adopt suitable modelling techniques for better speech recognition results. Linear prediction [1] is the most widely used and accepted method for whisper detection [2] and also for whisper island detection [3], [4]. Other methods like vocal effort change point detection have also been used for improved whisper detection within normally phonated audio streams in this context [5].

These methods in general parameterize an AR spectrum using a least squares error method. However, the LP model works well only for speech formants as in voiced speech, and more specifically at low frequencies. The most undesirable effect of these techniques is that the LP model tends to overestimate the power at formant frequencies. Moreover, increasing the model order increases the overestimation rather than correcting it. Hence this method is able to resolve harmonics but is poor at estimating the power at those frequencies, which leads to poor characterization of the vocal tract transfer function. The minimum variance distortionless response (MVDR) spectrum [6], on the other hand, is capable of modelling the power of the spectrum efficiently at all harmonic frequencies due to the nature of the estimation method. The MVDR spectrum also responds to an increase in model order, which improves the model at higher harmonic frequencies [7]. As whisper is characterized by formant shifts and an overall increased concentration of power (due to more noise) in high frequency bands, the models must provide good modelling results in those bands. Thus the use of the MVDR spectrum is desirable for robust whisper detection. Moreover, MVDR coefficients can be computed from the LP coefficients themselves [8], making the process computationally less expensive.

The paper begins with a discussion on the inadequacy of the LP spectrum to model signal power at higher frequencies and hence in detecting whispered segments within normally phonated speech. The proposed spectral ratio method is described next with an algorithm to detect whispered speech segments. Whisper segmentation plots are illustrated for the proposed spectral ratio method and also for two conventional methods, the spectral flatness method [9], [10] and the LP residual method. Experimental results on the CHAINS database [11], [12] are reported next, with concluding remarks on the significance and future scope of the work presented herein.

II. THE LP SPECTRUM AND WHISPER DETECTION

Let the input speech signal be s(n). The LP method models speech as an autoregressive process

$$\hat{s}(n) = \sum_{k=1}^{m} a_k s(n-k) \tag{1}$$

Hence the residual signal or the modelling error e(n) is

$$e(n) = s(n) - \sum_{k=1}^{m} a_k s(n-k) \tag{2}$$

or, in the z-domain,

$$E(z) = S(z)\left(1 - \sum_{k=1}^{m} a_k z^{-k}\right) \tag{3}$$

This is mathematically equivalent to applying an FIR filter $A(z) = 1 - \sum_{k=1}^{m} a_k z^{-k}$ to the input signal $s(n)$. The output of the filter is the error spectrum E(z):

$$E(z) = S(z)A(z) \tag{4}$$
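As a minimal sketch of Equations (1) to (4), the following Python code estimates the predictor coefficients with the autocorrelation method and then obtains the residual by filtering with A(z). The function names, the use of NumPy, and the synthetic AR(2) test signal are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def lp_coefficients(s, m):
    """Autocorrelation-method LP: returns a_1..a_m of Eq. (1) and the minimum error energy."""
    s = np.asarray(s, dtype=float)
    # Autocorrelation lags r(0)..r(m)
    r = np.array([np.dot(s[: len(s) - k], s[k:]) for k in range(m + 1)])
    # Normal equations R a = [r(1)..r(m)], with R[i, j] = r(|i - j|)
    lags = np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
    a = np.linalg.solve(r[lags], r[1:])
    e_min = r[0] - np.dot(a, r[1:])
    return a, e_min

def lp_residual(s, a):
    """Residual e(n) of Eq. (2): filter s(n) with A(z) = 1 - sum_k a_k z^{-k}."""
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(s, A)[: len(s)]

# Hypothetical test signal: AR(2) process s(n) = 1.3 s(n-1) - 0.6 s(n-2) + u(n)
rng = np.random.default_rng(0)
u = rng.standard_normal(5000)
s = np.zeros_like(u)
for n in range(len(u)):
    s[n] = u[n]
    if n >= 1:
        s[n] += 1.3 * s[n - 1]
    if n >= 2:
        s[n] -= 0.6 * s[n - 2]

a, e_min = lp_coefficients(s, m=2)
e = lp_residual(s, a)
print("estimated a_k:", a)
print("residual / signal energy:", np.sum(e ** 2) / np.sum(s ** 2))
```

For a signal that really is autoregressive of order m, the estimated coefficients should land close to the generating ones and the residual should carry much less energy than the signal.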

978-1-61284-091-8/11/$26.00 ©2011 IEEE


It is well known that an infinite order AR model can always model the signal arbitrarily closely [1]. Hence equation (4) is an exact equation. But limiting the order of the AR model to m leads to an approximation. Also, as the signal is finite (0 ≤ n ≤ N-1), the error can at best be minimized with a finite order AR model. Let the minimum possible error for this model be $E_{min}$. Hence the equation becomes

$$S(z) = \frac{E_{min}}{1 - \sum_{k=1}^{m} a_k z^{-k}} \tag{5}$$

with non-zero $E_{min}$ and the transfer function being $1/A(z)$. The estimated power $\hat{P}(\omega)$ of the speech signal $S(z)$ can now be calculated by taking the square of the modulus as

$$\hat{P}(\omega) = \frac{|E(\omega)|^2}{\left|1 - \sum_{k=1}^{m} a_k e^{-j\omega k}\right|^2} \tag{6}$$

However, the error minimization strategy is the key to the inadequacy of the LP spectrum in estimating the power spectrum at medium and high frequencies. To illustrate this, let us define the total error as

$$E_{net} = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} |E(\omega)|^2 \, d\omega \tag{7}$$

The standard Parseval's relation has been used in this context. Further, substituting the value of $|E(\omega)|$ from (4) and $S(\omega)$ from (5), we have the total error as

$$E_{net} = \frac{E_{min}^2 T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{P(\omega)}{\hat{P}(\omega)} \, d\omega \tag{8}$$

Hence the total error can be represented in terms of the integral of the ratio of the actual power to the estimated power. As the model is just an approximation of the actual spectrum, it is important for the model to approximate the powers at the spectral frequencies relevant for whisper detection. To illustrate this further, consider the ratio inside the integral in (8). The numerator $P(\omega)$ is the actual spectral power and the denominator $\hat{P}(\omega)$ is the estimated spectral power.

In this context two cases can be made out with respect to the spectral ratio.

A. Case 1: The estimated power $\hat{P}(\omega)$ is overestimated by $\epsilon$ times the actual power $P(\omega)$

The ratio within the integral in this case turns out to be

$$\frac{P(\omega)}{P(\omega)(1+\epsilon)} = \frac{1}{1+\epsilon} = 1 - \frac{\epsilon}{1+\epsilon} \tag{9}$$

B. Case 2: The estimated power $\hat{P}(\omega)$ is underestimated by $\epsilon$ times the actual power $P(\omega)$

The ratio within the integral in this case turns out to be

$$\frac{P(\omega)}{P(\omega)(1-\epsilon)} = 1 + \frac{\epsilon}{1-\epsilon} \tag{10}$$

Comparing the two cases, the effect on the error is greater when $P(\omega) > \hat{P}(\omega)$ than when the inequality is reversed. Hence the error in Case 2 is much larger than in Case 1. However, error minimization strategies do not compensate for this asymmetry: because underestimation is penalized more heavily, the minimized model tends to overestimate the power at certain frequencies. This overestimation at crucial harmonic frequencies is liable to give poor modelling results for whisper detection. The inaccurate modelling at higher frequencies by LP compounds the problem for whisper detection.
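As a worked illustration of this asymmetry, take an assumed deviation of $\epsilon = 0.5$ (a value chosen only for illustration, not taken from the experiments). The ratios in (9) and (10) then evaluate to

$$\text{Case 1: } \frac{1}{1+0.5} \approx 0.667 \;\; (0.333 \text{ below } 1), \qquad \text{Case 2: } \frac{1}{1-0.5} = 2 \;\; (1 \text{ above } 1).$$

The same relative deviation therefore moves the integrand of (8) three times further from unity when the model underestimates the power than when it overestimates it.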

III. WHISPER DETECTION USING MVDR

As discussed in the preceding section, the LP envelope overestimates the actual speech spectrum at harmonic frequencies. This overestimation is of concern in whisper detection as the harmonic frequencies shift to the higher frequency regions in whispered segments. A whisper spectrum is characterized by low SNR and hence high noise content. Hence there is a pronounced overestimation effect in LP based spectrum estimation at the formants that are shifted to a higher frequency, typically 1.2 to 1.4 times the normal frequency [10]. However, the MVDR spectral estimation method ensures that the signal of interest is not distorted in any frequency range. From a whisper detection standpoint, the smoothness of the MVDR spectrum in all spectral regions assumes great importance. We briefly describe the MVDR method of spectral estimation in the next section to set the stage for the discussion on the proposed spectral ratio for whisper detection.

A. The minimum variance spectrum estimation and whisper detection

The MVDR spectrum estimate [7] is a non-parametric, data adaptive technique that can be used to obtain better resolution than DFT based spectrum estimation methods. The MVDR spectral estimate of order $M$ is given by

$$R^{mvdr}_{M}(e^{j\omega}) = \frac{1}{\mathbf{v}^{H}(\omega)\,\mathbf{R}_x^{-1}\,\mathbf{v}(\omega)}, \tag{11}$$

where $\mathbf{R}_x$ is the $M \times M$ data autocorrelation matrix and

$$\mathbf{v}(\omega) = [1, e^{j\omega}, e^{j2\omega}, e^{j3\omega}, \ldots, e^{j(M-1)\omega}]^{T}. \tag{12}$$

This estimate has some interesting properties which we briefly mention below.

It can be efficiently computed by exploiting the relationship with linear prediction methods as

$$R^{mvdr}_{M}(e^{j\omega}) = \frac{1}{\sum_{k=-M}^{M} \mu(k)\, e^{-j\omega k}} \tag{13}$$

where the parameters $\mu(k)$ are obtained by a simple non-iterative computation involving the linear prediction coefficients $a_i$ and the minimized prediction error variance $P_e$ [8], as

$$\mu(k) = \begin{cases} \dfrac{1}{P_e} \sum_{i=0}^{M-k} (M + 1 - k - 2i)\, a_i a^{*}_{i+k}, & k = 0, \ldots, M \\[2mm] \mu^{*}(-k), & k = -M, \ldots, -1 \end{cases} \tag{14}$$

The filter bank interpretation of MVDR is most insightful for our problem. The MVDR spectrum at a given frequency $\omega_k$ can be viewed as the power at the output of an FIR filter whose coefficients $\beta = [h(0), h(1), \ldots, h(M-1)]^{T}$ are obtained as a solution to the following constrained optimization problem

$$\min_{\beta} \; \beta^{H} \mathbf{R}_x \beta \quad \text{subject to} \quad \beta^{H} \mathbf{v}(\omega_k) = 1.$$


The linear constraint ensures that the signal of interest is not distorted, and the minimization of the output power minimizes leakage from other frequencies.
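The equivalence between the quadratic form in (11) and the LP-based form in (13) and (14) can be checked numerically. The sketch below makes three assumptions that go beyond the text: it uses an (M+1) x (M+1) autocorrelation matrix with v(ω) running up to e^{jMω}, it takes the a_i in (14) to be the coefficients of the error filter A(z) with a_0 = 1, and it uses a synthetic coloured-noise signal. All function names are hypothetical.

```python
import numpy as np

def autocorrelation(x, M):
    """Biased autocorrelation lags r(0)..r(M) of a real signal."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(M + 1)]) / len(x)

def toeplitz_from(r, size):
    """Symmetric Toeplitz matrix with entries r(|i - j|)."""
    return r[np.abs(np.subtract.outer(np.arange(size), np.arange(size)))]

def mvdr_direct(r, omega):
    """Quadratic form of Eq. (11), assuming an (M+1) x (M+1) matrix."""
    M = len(r) - 1
    R_inv = np.linalg.inv(toeplitz_from(r, M + 1))
    V = np.exp(1j * np.outer(omega, np.arange(M + 1)))   # row i holds v(omega_i)^T
    quad = np.einsum("ij,jk,ik->i", V.conj(), R_inv, V)  # v^H R^{-1} v per frequency
    return 1.0 / quad.real

def mvdr_from_lp(r, omega):
    """Eqs. (13)-(14): MVDR spectrum from the order-M LP solution."""
    M = len(r) - 1
    a = np.linalg.solve(toeplitz_from(r, M), r[1:])  # predictor a_1..a_M of Eq. (1)
    c = np.concatenate(([1.0], -a))                  # coefficients of A(z), c_0 = 1
    P_e = r[0] - np.dot(a, r[1:])                    # prediction error variance
    mu = np.empty(M + 1)
    for k in range(M + 1):
        i = np.arange(M - k + 1)
        mu[k] = np.sum((M + 1 - k - 2 * i) * c[i] * c[i + k]) / P_e
    # Real data: mu(-k) = mu(k), so the sum in Eq. (13) is mu(0) + 2 sum_k mu(k) cos(k w)
    denom = mu[0] + 2.0 * np.cos(np.outer(omega, np.arange(1, M + 1))) @ mu[1:]
    return 1.0 / denom

rng = np.random.default_rng(1)
x = np.convolve(rng.standard_normal(2000), [1.0, 0.8, 0.3])  # hypothetical coloured noise
r = autocorrelation(x, M=8)
omega = np.linspace(0.0, np.pi, 40)
p_direct = mvdr_direct(r, omega)
p_fast = mvdr_from_lp(r, omega)
print("max relative difference:", np.max(np.abs(p_fast / p_direct - 1.0)))
```

The LP route avoids the explicit matrix inverse, which is the computational advantage the text refers to.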

B. The LP-MVDR spectral ratio method for whisper detection

It is important to note that the MVDR spectrum is smoother than the LP spectrum. This is on account of the fact that the MVDR spectrum at any frequency can be represented as a harmonic average of the LP spectra of orders $k = 0, \ldots, N$ [8]:

$$\frac{1}{P_{MV}(\omega)} = \sum_{k=0}^{N} \frac{1}{P^{(k)}_{LP}(\omega)} \tag{15}$$

This averaging effect smooths out the spectrum in the regions of sharp rise, i.e. at the harmonics. Thus the MVDR spectrum tends to have a lower amplitude than the corresponding LP spectrum at the harmonics. Figure 1 shows the various spectra (80th order) for a short segment of normally phonated speech.

[Figure 1: magnitude in dB versus frequency in Hz (1000 to 8000 Hz) for the FFT, LP, and MVDR spectra.]
Fig. 1. Diagram illustrating the various spectra for a short segment of normally phonated speech: MVDR (red), LP (blue), and FFT (gray) spectra.

From the illustration in Figure 1, it is clear that an LP to MVDR ratio spectrum can be used to identify the whisper segments in speech, since this ratio is expected to be high where the speech signal has more significant harmonics in the higher frequency region than the normally phonated speech spectrum does. In the context of whispered speech, where the harmonic shifts are prominent, this ratio is expected to be high. Also, the ratio is expected to be robust to wide band noise because an LP spectrum of high order can still model the spectrum and the averaging effect of MVDR will eliminate the effect of wide band noise when a spectral ratio is taken. Formally, the LP-MVDR spectral ratio is defined as

$$X = \frac{\hat{P}_{LP}(\omega)}{P_{MV}(\omega)} = \frac{|E(\omega)|^2 \left(\sum_{l=-M}^{M} \mu(l)\, e^{-j\omega l}\right)}{\left|1 - \sum_{k=1}^{m} a_k e^{-j\omega k}\right|^2} \tag{16}$$

The LP-MVDR ratio spectrum is computed on a short time basis for each speech data window. The ratio spectrum is further smoothed and a threshold is decided depending on the penalties fixed for the False Alarm Rate and the Detection Failure Rate. The whispered segments can then be separated from the normally phonated segments of speech.

The salient steps used for whisper detection using the LP-MVDR ratio spectrum are described below:

• Hamming window the test speech waveform using a frame size of 20 ms and a frame overlap of 50%.
• Compute the MVDR coefficients using Equation (14) for each frame.
• Compute the smooth MVDR power spectrum for a sufficient number of frequency points.
• Compute the linear predictor coefficients for each frame.
• Compute the LP power spectrum for the same number of frequency points as used for computing the MVDR power spectrum.
• Compute the LP to MVDR ratio spectrum.
• Select the threshold according to the penalties required for the False Alarm Rate and the Detection Failure Rate to segment the whisper from within normally phonated speech.
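The steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the authors' implementation: the sampling rate (16 kHz), model order (20), use of the prediction error variance P_e as the numerator of the LP spectrum, reduction of the ratio spectrum to a mean log ratio per frame, the median threshold, and the synthetic input are all hypothetical choices.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames and apply a Hamming window to each."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def lp_mvdr_log_ratio(frame, M, n_freq):
    """Log of the LP to MVDR ratio spectrum of one frame at n_freq frequencies."""
    n = len(frame)
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(M + 1)]) / n
    r[0] = r[0] * (1.0 + 1e-9) + 1e-12           # tiny loading, guards silent frames
    T = r[np.abs(np.subtract.outer(np.arange(M), np.arange(M)))]
    a = np.linalg.solve(T, r[1:])                # predictor coefficients a_1..a_M
    c = np.concatenate(([1.0], -a))              # coefficients of A(z)
    P_e = r[0] - np.dot(a, r[1:])                # prediction error variance
    w = np.linspace(0.0, np.pi, n_freq)
    # LP power spectrum, with P_e as numerator (an assumption of this sketch)
    A_w = np.exp(-1j * np.outer(w, np.arange(M + 1))) @ c
    P_lp = P_e / np.abs(A_w) ** 2
    # MVDR power spectrum from Eqs. (13)-(14)
    mu = np.array([np.sum((M + 1 - k - 2 * np.arange(M - k + 1)) * c[: M - k + 1] * c[k:])
                   for k in range(M + 1)]) / P_e
    P_mv = 1.0 / (mu[0] + 2.0 * np.cos(np.outer(w, np.arange(1, M + 1))) @ mu[1:])
    return np.log(P_lp / P_mv)

def whisper_feature(x, fs, M=20, n_freq=40):
    """One value per frame: mean log ratio over frequency (hypothetical reduction)."""
    frame_len = int(round(0.020 * fs))           # 20 ms frames, 50% overlap
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, frame_len // 2)
    return np.array([np.mean(lp_mvdr_log_ratio(f, M, n_freq)) for f in frames])

# Hypothetical input: 0.5 s of a harmonic tone complex followed by 0.5 s of noise
fs = 16000
rng = np.random.default_rng(0)
t = np.arange(fs // 2) / fs
tone = sum(np.sin(2 * np.pi * 120 * h * t) / h for h in range(1, 11))
x = np.concatenate((tone, rng.standard_normal(fs // 2)))
x = x + 0.05 * rng.standard_normal(len(x))

feature = whisper_feature(x, fs)
threshold = np.median(feature)                   # placeholder for the FA/DF-based threshold
decision = feature > threshold                   # True marks frames flagged as whisper
print(len(feature), feature.min(), feature.max())
```

With P_e as the LP numerator, the harmonic-average relation (15) implies that the MVDR spectrum never exceeds the LP spectrum of the same order, so the log ratio in this sketch is non-negative at every frequency.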

IV. PERFORMANCE EVALUATION

In this section, we evaluate the proposed LP-MVDR ratio spectrum for detection of whispered segments within normally phonated speech streams from the CHAINS corpus [11], [12]. The detection results of the proposed method are also compared with the conventional spectral flatness measure and the widely used LP residual autocorrelation method. The results are presented as ROC curves and in terms of a whisper diarization error rate.

A. Whisper database

The CHAINS speech corpus [11] consists of 36 speakers with recordings made in two different sessions that were two months apart. The two recording sessions provided speech in six different speaking styles. The first recording session (solo) was carried out in a professional recording studio and speakers were recorded in a sound-attenuated booth. The recordings in the released corpus were made using a Neumann U87 condenser microphone. The second recording session (whisper) was carried out in a quiet office environment, using an AKG C420 headset condenser microphone. Both the whispered speech reading and solo reading parts of the corpus are used in our experiments. In the solo mode, the speakers are asked to speak at their natural pace and tone. In the whispered mode, on the other hand, subjects read all texts in a whisper. Any involuntary switch to modal voicing is interpreted as a dysfluency and leads to a restart of the phrase. The texts spoken in the corpus contain sections of four famous fables, twenty TIMIT sentences and nine sentences from the CSLU speaker identification corpus. Segments of solo and whispered speech that form a complete sentence are appended in our experiments to test the changes in the proposed ratio spectrum with the change in the mode of speaking.

B. Results of whisper segmentation

In order to illustrate the segmentation performance of the proposed ratio spectrum, experiments were conducted using a forty point LP-MVDR ratio spectrum. The segmentation results illustrated herein are based on the following paragraph from the CHAINS corpus.

"One fine day it occurred to the Members of the Body that they were doing all the work and the Belly was having all the food......the Hands could hardly move."

Whisper segmentation results with the proposed ratio spectrum are illustrated in Figure 2. The ratio spectrum was smoothed using a moving average filter and the corresponding segmentation result is shown in Figure 3.

[Figure 2: amplitude versus sample number (×10^4) for the speech signal with whisper and the LP-MVDR ratio spectrum.]
Fig. 2. Whisper segmentation using the LP-MVDR ratio spectrum. LP-MVDR ratio spectrum (red) and speech signal with whisper (blue).

[Figure 3: amplitude versus sample number (×10^4) for the speech signal with whisper and the LP-MVDR ratio spectrum.]
Fig. 3. Whisper segmentation using the smoothed LP-MVDR ratio spectrum.
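The smoothing step mentioned above can be sketched with a centred moving average. The filter length is not stated in the text, so the length of 3 and the sample values below are hypothetical.

```python
import numpy as np

def moving_average(x, length):
    """Smooth x with a centred moving average filter of the given odd length."""
    kernel = np.ones(length) / length
    # mode="same" keeps the output aligned with, and as long as, the input
    return np.convolve(x, kernel, mode="same")

ratio_track = np.array([0.0, 0.0, 3.0, 0.0, 0.0])  # hypothetical per-frame ratio values
smoothed = moving_average(ratio_track, 3)
print(smoothed)
```

A longer filter suppresses isolated spikes in the ratio track more strongly, at the cost of blurring the boundaries between whispered and phonated segments.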

C. Comparison with other methods

In this section we compare the results of whisper detection with other methods like the spectral flatness method and the LP residual autocorrelation method.

1) Spectral flatness method of whisper detection: Spectral flatness [9] measures the flatness of the speech spectrum. A perfectly flat spectrum has a spectral flatness measure of zero. On the other hand, a spectrum with peaks and troughs will yield a value greater than zero but less than one. This measure is often used to separate voiced and unvoiced segments of speech [9]. Whispered speech has properties very close to unvoiced speech, i.e., high noise content, lack of clear harmonic structure and concentration of energy in the higher frequency region. The spectral flatness measure is therefore a natural candidate for whisper detection. An analysis of the spectral flatness method for whisper detection can be found in [10]. The results of whisper segmentation using the spectral flatness method are shown in Figure 4.

[Figure 4: amplitude versus sample number (×10^4) for the spectral flatness measure and the speech signal with whisper.]
Fig. 4. Whisper segmentation using the Spectral Flatness method (in red).

2) LP residual autocorrelation method: In this method, the speech is first segmented by a Hamming window of 20 ms with 50% overlap. Then the corresponding LP residual is used to segment the whispered and normally phonated parts [2]. LP overestimates the spectrum at the harmonic frequencies as explained earlier. Hence the difference between the actual spectrum and the LP estimate is expected to show troughs at the harmonic frequencies. The correlation of this residual is computed between the two halves of a window and its maximum value is taken. This correlation measure is further smoothed over short duration windows and data clustering is done using k-means clustering. As whisper contains uncorrelated noise, the correlation thus computed is expected to be lower than that for the voiced segment. The result is shown in Figure 5, where the upper level of the output represents the whispered segment.
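The half-window correlation at the core of the LP residual method can be sketched as below. The normalisation, the mean removal, and the two synthetic residual frames (a pulse train standing in for a voiced residual, white noise standing in for a whispered one) are assumptions of this sketch; the smoothing and k-means stages are omitted.

```python
import numpy as np

def half_window_correlation(residual):
    """Peak normalised cross-correlation between the two halves of a residual frame."""
    h = len(residual) // 2
    first = residual[:h] - np.mean(residual[:h])
    second = residual[h : 2 * h] - np.mean(residual[h : 2 * h])
    xcorr = np.correlate(first, second, mode="full")   # all relative lags
    norm = np.sqrt(np.dot(first, first) * np.dot(second, second)) + 1e-12
    return np.max(xcorr) / norm

rng = np.random.default_rng(0)
pulses = np.zeros(320)
pulses[::40] = 1.0                  # hypothetical voiced residual: periodic pulses
noise = rng.standard_normal(320)    # hypothetical whispered residual: white noise

c_voiced = half_window_correlation(pulses)
c_whisper = half_window_correlation(noise)
print(c_voiced, c_whisper)
```

A periodic residual repeats itself across the two halves and gives a value near one, while an uncorrelated residual gives a clearly smaller peak, which is the contrast the method relies on.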

D. Experimental results on whisper detection

Whisper detection experiments were conducted on the CHAINS database and the results are illustrated as an ROC curve in Figure 6. In Figure 6, the true positive rate (TPR) is defined as

$$TPR = \frac{\text{No. of correctly detected whisper segments}}{\text{Total no. of whisper segments}} \tag{17}$$

while the false positive rate (FPR) is defined as

$$FPR = \frac{\text{No. of wrongly detected whisper segments}}{\text{Total no. of phonated segments}} \tag{18}$$
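Equations (17) and (18) translate directly into a few lines of Python. The label arrays and function names below are hypothetical; `roc_points` shows one way an ROC curve could be traced by sweeping a threshold over a per-segment feature.

```python
import numpy as np

def tpr_fpr(detected, reference):
    """TPR and FPR of Eqs. (17)-(18); True (or 1) marks a whisper segment."""
    detected = np.asarray(detected, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    tpr = np.sum(detected & reference) / np.sum(reference)
    fpr = np.sum(detected & ~reference) / np.sum(~reference)
    return tpr, fpr

def roc_points(feature, reference, thresholds):
    """(TPR, FPR) pairs obtained by thresholding a feature at each given value."""
    return [tpr_fpr(np.asarray(feature) > th, reference) for th in thresholds]

reference = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical ground truth
detected = [1, 1, 0, 1, 0, 0, 0, 1]    # hypothetical detector output
tpr, fpr = tpr_fpr(detected, reference)
print(tpr, fpr)
```

In this toy example three of the four whisper segments are found and one of the four phonated segments is flagged by mistake.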

[Figure 5: amplitude and quantized amplitude versus sample number (×10^4) for the quantized LP residual ACF and the speech signal with whisper.]
Fig. 5. Whisper segmentation using the LPR-ACF method (in red).

In our experiments on whisper detection, short pauses in phonated speech also exhibited high values of the LP-MVDR ratio spectrum. This can be observed in the ROC curve in Figure 6. This is expected, as short pauses usually have only noise content due to inhalation of air through the nostrils. However, the FPR is increased by the presence of short pauses. Hence an increased FPR is observed at higher threshold values, as indicated at the start of the ROC curve. Removal of short pauses is expected to give even better results. Note that the proposed ratio spectrum gives reasonably better performance than the other two conventional methods in terms of the area under the ROC curve.

[Figure 6: true positive rate (TPR) versus false positive rate (FPR) for the LP-MACF, spectral flatness, and LP-MVDR ratio spectrum methods.]
Fig. 6. ROC curve illustrating the whisper detection performance of the LP-MVDR ratio spectrum along with other methods.

For evaluating the performance of the detection methods, a performance index similar to the one in [5] is used. The possible errors in whisper detection are generally the false alarm (FA) and the detection failure (DF). Hence the whisper diarization error rate (WDER) using the above sources of error can be defined as

$$WDER = \frac{C_1 \cdot FA + C_2 \cdot DF}{N_f} \tag{19}$$

where $N_f$ denotes the total number of speech frames, and $C_1$ and $C_2$ are the weights assigned to FA and DF respectively. Note that $C_1$ and $C_2$ are selected based on the penalties fixed for the false alarm rate and the detection failure rate respectively. The whisper detection results in terms of WDER are shown in Table I.

TABLE I
COMPARISON OF WDER FOR THE THREE WHISPER DETECTION METHODS

Method                      WDER
Spectral Flatness           0.4046
Modified Autocorrelation    0.3507
LP-MVDR Ratio               0.2869
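Equation (19) can be evaluated from frame-level labels as in the sketch below. Counting FA and DF as numbers of frames, the default weights of one, and the example labels are assumptions made for illustration; they do not reproduce the values in Table I.

```python
import numpy as np

def wder(detected, reference, c1=1.0, c2=1.0):
    """Whisper diarization error rate of Eq. (19) from per-frame boolean labels."""
    detected = np.asarray(detected, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    fa = np.sum(detected & ~reference)   # false alarms: phonated frames flagged as whisper
    df = np.sum(~detected & reference)   # detection failures: whisper frames missed
    return (c1 * fa + c2 * df) / len(reference)

reference = [1, 1, 1, 0, 0, 0, 0, 1]     # hypothetical ground truth, 8 frames
detected = [1, 1, 0, 1, 0, 0, 0, 1]      # hypothetical detector output
equal_weights = wder(detected, reference)
heavier_misses = wder(detected, reference, c1=1.0, c2=2.0)
print(equal_weights, heavier_misses)
```

Raising C2 relative to C1 makes missed whisper frames cost more than false alarms, which shifts the preferred operating threshold.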

V. CONCLUSIONS

A new method based on the linear prediction to minimum variance spectral ratio is proposed for detection of whispered speech segments in normally phonated speech streams. The difference in spectral estimation between the two techniques is utilized in the proposed spectral ratio method to effectively separate the whispered segments from normally phonated segments of speech. The ratio spectrum gives reasonably better performance in terms of the area under the ROC curve and the WDER. However, the presence of short pauses and breath reduces the efficiency of the method in certain regions of the ROC curve. Along with this issue, we are also currently addressing the issue of model order selection, which can further improve the whisper detection performance.

VI. ACKNOWLEDGMENT

This work was funded by the BITCOE under project numbers 20080252 and 20080253.

REFERENCES

[1] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, vol. 63, no. 4, pp. 561-580, Apr. 1975.

[2] M. A. Carlin, B. Y. Smolenski, and S. J. Wenndt, "Unsupervised Detection of Whispered Speech in the Presence of Normal Phonation," in Proc. INTERSPEECH-2006, paper 1990-Mon3CaP.13, 2006.

[3] C. Zhang and J. H. L. Hansen, "Advancements in Whisper-Island Detection Using the Linear Predictive Residual," in Proc. ICASSP 2010, pp. 5170-5173, 2010.

[4] C. Zhang and J. H. L. Hansen, "Advancements in whisper-island detection within normally phonated audio streams," in Proc. INTERSPEECH-2009, pp. 860-863, 2009.

[5] C. Zhang and J. H. L. Hansen, "Effective Segmentation based on Vocal Effort Change Point Detection," in Proc. ITRW, Aalborg, 2008.

[6] P. J. Sherman and K. N. Lou, "On the family of ML Spectral Estimates for mixed spectrum identification," IEEE Trans. Signal Processing, vol. 39, pp. 644-655, Mar. 1991.

[7] M. N. Murthi and B. D. Rao, "Minimum Variance Distortionless Response (MVDR) Modelling of Voiced Speech," in Proc. ICASSP 1997, Munich, Germany, vol. 3, pp. 1687-1690, 1997.

[8] J. P. Burg, "The relationship between maximum entropy spectra and maximum likelihood spectra," Geophysics, vol. 37, pp. 375-376, 1972.

[9] A. H. Gray and J. D. Markel, "A Spectral Flatness measure for studying the Autocorrelation method of Linear Prediction of Speech Analysis," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-22, no. 3, pp. 207-217, 1974.

[10] T. Ito, K. Takeda, and F. Itakura, "Analysis and recognition of whispered speech," Speech Communication, vol. 45, no. 2, pp. 139-152, 2005.

[11] F. Cummins, M. Grimaldi, T. Leonard, and J. Simko, "The CHAINS corpus: Characterizing Individual Speakers," in Proc. SPECOM, pp. 431-435, 2006.

[12] M. Grimaldi and F. Cummins, "Speaker Identification Using Instantaneous Frequencies," IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 6, pp. 1097-1111, 2008.