
Linear Prediction Analysis of Crosscorrelation Sequence for Voiced Speech

Liqing LIU∗ and Tetsuya SHIMAMURA†

∗Saitama University, Saitama, Japan
E-mail: [email protected] Tel/Fax: +81-048-8583776

†Saitama University, Saitama, Japan
E-mail: [email protected] Tel/Fax: +81-048-8583496

Abstract: The conventional linear prediction (LP) analysis is known to be sensitive to additive noise. In this paper, a new approach is presented for LP analysis of the crosscorrelation sequence between a speech signal and its zero-crossing wave. Simulation results show that the proposed method is capable of performing speech analysis in a white noise environment.

I. INTRODUCTION

Linear prediction (LP) [1] is a commonly used technique for speech analysis and has been applied to estimating spectra, formants, pitch, glottal flow and so on. LP analysis represents the speech signal by a set of predictive coefficients of an autoregressive (AR) system filter. The basic aim of LP analysis is to estimate these predictive coefficients. As a conventional stationary formulation, the autocorrelation method [2] not only estimates the predictive coefficients accurately for clean speech signals but also guarantees the stability of the AR filter. However, the conventional autocorrelation method has been shown to have significant problems in the presence of additive noise.
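The autocorrelation method can be illustrated with a minimal Python/NumPy sketch. It is our own illustration, not code from [1] or [2]: a biased autocorrelation estimate is passed to the Levinson-Durbin recursion, whose reflection coefficients stay inside (−1, 1) whenever the autocorrelation matrix is positive definite, which is what guarantees a stable AR filter. The helper names and the AR(2) test signal standing in for speech are hypothetical.

```python
import numpy as np

def biased_autocorr(x, max_lag):
    """Biased estimate r(k) = (1/L) * sum_n x(n) x(n+k) for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    L = len(x)
    return np.array([np.dot(x[:L - k], x[k:]) / L for k in range(max_lag + 1)])

def levinson_durbin(r, p):
    """Solve the order-p normal equations from r(0..p) by the Levinson-Durbin recursion.

    Returns (a, err, k): predictor coefficients a_1..a_p for s(n) ~ sum_i a_i s(n-i),
    the final prediction-error power, and the reflection coefficients.
    """
    a = np.zeros(p)
    k_all = np.zeros(p)
    err = r[0]
    for m in range(p):
        # reflection coefficient of order m+1
        k = (r[m + 1] - np.dot(a[:m], r[m:0:-1])) / err
        prev = a[:m].copy()
        a[:m] = prev - k * prev[::-1]   # update the lower-order coefficients
        a[m] = k
        k_all[m] = k
        err *= 1.0 - k * k              # prediction-error power shrinks at each order
    return a, err, k_all

def synth_ar(a, n, rng, gain=1.0):
    """Hypothetical test signal: s(n) = sum_i a_i s(n-i) + gain * e(n), e(n) white Gaussian."""
    p = len(a)
    s = np.zeros(n + p)
    e = rng.standard_normal(n + p)
    for t in range(p, n + p):
        s[t] = np.dot(a, s[t - p:t][::-1]) + gain * e[t]
    return s[p:]

rng = np.random.default_rng(0)
a_true = np.array([1.3, -0.8])          # a stable AR(2) resonance
s = synth_ar(a_true, 20000, rng)
a_hat, err, k = levinson_durbin(biased_autocorr(s, 2), 2)
print(a_hat, np.all(np.abs(k) < 1))     # a_hat is close to a_true; |k| < 1 means a stable filter
```

The sign convention matches (4) later in the paper, s(n) ≈ Σ a_i s(n − i).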

Although the conventional autocorrelation method has difficulty in estimating the predictive coefficients accurately in a noisy environment, the autocorrelation sequence itself has attracted extensive attention, and many approaches built around it have been developed. The autocorrelation sequence possesses two major properties. The first is that the autocorrelation sequence is less affected by additive noise than the speech signal: the noise components are considered to occupy only the zeroth lag or the lower lags of the autocorrelation sequence. Hence, by removing or compensating the lower lags of the noisy speech autocorrelation sequence, we can eliminate the influence of noise and obtain an accurate approximation of the clean speech autocorrelation sequence. Noise compensation analysis [3] [4] compensates the lower lags of the noisy autocorrelation using an a priori estimate of the noise, so as to attenuate its influence. The high-order Yule-Walker estimator [5], which discards the lower lags of the autocorrelation without an a priori estimate of the noise, uses the higher lags of the autocorrelation to estimate the AR coefficients. However, this technique suffers from a singularity problem and cannot guarantee the stability of the AR filter. The second property of the autocorrelation sequence is the pole-preserving property [6]: the AR coefficients estimated by linear prediction of the autocorrelation sequence are similar to those estimated by linear prediction of the speech signal. The repeated autocorrelation function (RACF) method [7] and one-sided autocorrelation linear prediction (OSALPC) analysis [8] are based on this pole-preserving property. The RACF approach states that the repeated autocorrelation function can retain the poles of the original AR system. However, since the autocorrelation sequence is a decaying sequence, it is crucial to determine the optimum number of repetitions. The OSALPC technique has been applied to noisy speech recognition [9] [10], pitch determination [11], AR system identification [12] and so on, although it actually performs only a partial deconvolution of the speech signal, which may give rise to spurious peaks in the OSALPC envelope. In addition, linear prediction of the autocorrelation sequence causes a squared spectral distortion, because the amplitude of each frequency component is squared.
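A small numerical sketch (Python/NumPy, hypothetical helper names, our own construction) of the first property: for white noise that is uncorrelated with the signal, essentially only lag 0 of the autocorrelation grows, by roughly the noise variance. Subtracting an a priori estimate of that variance from r(0), in the spirit of noise compensation analysis [3] [4] though not its exact procedure, largely removes the bias of the AR estimate. A Gaussian AR(2) process stands in for speech, and the noise variance is assumed to be known exactly.

```python
import numpy as np

def synth_ar(a, n, rng):
    """Hypothetical AR test signal driven by unit-variance white Gaussian noise."""
    p = len(a)
    s = np.zeros(n + p)
    e = rng.standard_normal(n + p)
    for t in range(p, n + p):
        s[t] = np.dot(a, s[t - p:t][::-1]) + e[t]
    return s[p:]

def autocorr(x, max_lag):
    """Biased autocorrelation estimate for lags 0..max_lag."""
    L = len(x)
    return np.array([np.dot(x[:L - k], x[k:]) / L for k in range(max_lag + 1)])

def solve_normal_eqs(r, p):
    """Normal equations R a = [r(1), ..., r(p)] with the Toeplitz matrix R built from r(0..p-1)."""
    idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return np.linalg.solve(r[idx], r[1:p + 1])

rng = np.random.default_rng(1)
a_true = np.array([1.3, -0.8])
s = synth_ar(a_true, 100000, rng)
noise_var = np.var(s)                          # noise as strong as the signal: 0 dB SNR
x = s + np.sqrt(noise_var) * rng.standard_normal(len(s))

r_s, r_x = autocorr(s, 3), autocorr(x, 3)
print(r_x - r_s)                               # large change at lag 0 only (about noise_var)

a_plain = solve_normal_eqs(r_x, 2)             # noisy lag 0 left as is: strongly biased
r_comp = r_x.copy()
r_comp[0] -= noise_var                         # compensate lag 0 with the assumed-known noise variance
a_comp = solve_normal_eqs(r_comp, 2)
print(a_plain, a_comp)
```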

In order to avoid the squared spectral distortion produced by the autocorrelation sequence, we introduce a linear prediction analysis of the crosscorrelation sequence between the speech signal and its zero-crossing wave. In the next section, we discuss the proposed LP analysis of the crosscorrelation sequence.

II. PROPOSED METHOD

Voiced speech is a periodic or quasi-periodic waveform and can be expressed as

$$ s(n) = \sum_{i=0}^{\infty} a_i \cos(w_0 i n + \theta_i) \tag{1} $$

where w_0 is the fundamental angular frequency. Its autocorrelation sequence is obtained as

$$ r(\tau) = \sum_{i=0}^{\infty} \frac{a_i^2}{2} \cos(w_0 i \tau). \tag{2} $$

Comparing these two equations, note that the amplitude of each component of the autocorrelation sequence is squared. This causes a squared spectral distortion when the autocorrelation sequence is directly applied to linear prediction analysis, as in OSALPC analysis. In order to avoid this phenomenon, a crosscorrelation sequence [13] is introduced as follows:

978-616-361-823-8 © 2014 APSIPA APSIPA 2014


$$ q(m) = \frac{1}{N} \sum_{n=0}^{N-1} \operatorname{sign}(s(n)) \cdot s(n+m), \quad m = 0, 1, \ldots, N-1. \tag{3} $$

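In (3), the length of s(n) is set to 2N, so that every lag of q(m) is averaged over the same N products and the amplitude of q(m) is unbiased. Here sign(s(n)) is the zero-crossing wave of the speech signal: it carries no specific amplitude information but preserves the polarity information of the original speech. Hence the amplitude of q(m) is considered to be comparable to that of the original speech signal s(n). In particular, scaling s(n) by a positive constant leaves sign(s(n)) unchanged and therefore scales q(m) by the same constant, whereas it scales r(τ) by the square of that constant.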
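The amplitude argument of (1) to (3) can be checked with a single harmonic. The following Python/NumPy sketch (our own illustration; the function names are hypothetical) evaluates (3) and the autocorrelation over the same 2N-sample frame for a 150 Hz cosine sampled at an assumed 10 kHz. q(0) comes out close to 2a/π, linear in the amplitude a, while r(0) equals a²/2; doubling the amplitude doubles q(m) but quadruples r(m).

```python
import numpy as np

def crosscorr_q(s, N):
    """Eq. (3): q(m) = (1/N) sum_{n=0}^{N-1} sign(s(n)) s(n+m), m = 0..N-1 (s has 2N samples)."""
    s = np.asarray(s, dtype=float)
    sg = np.sign(s[:N])                       # zero-crossing wave: polarity only
    return np.array([np.dot(sg, s[m:m + N]) / N for m in range(N)])

def autocorr_frame(s, N):
    """Autocorrelation over the same window: r(m) = (1/N) sum_{n=0}^{N-1} s(n) s(n+m)."""
    s = np.asarray(s, dtype=float)
    return np.array([np.dot(s[:N], s[m:m + N]) / N for m in range(N)])

fs, N, amp = 10000.0, 5000, 3.0
n = np.arange(2 * N)
s = amp * np.cos(2 * np.pi * 150.0 / fs * n + 0.3)   # one harmonic, F0 = 150 Hz

q, r = crosscorr_q(s, N), autocorr_frame(s, N)
print(q[0], 2 * amp / np.pi)    # q(0) close to 2a/pi: linear in a
print(r[0], amp ** 2 / 2)       # r(0) equal to a^2/2: squared, as in (2)

q2, r2 = crosscorr_q(2 * s, N), autocorr_frame(2 * s, N)   # double the amplitude
```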

In linear prediction analysis, the speech sample s(n) is approximated by a linear weighted combination of its past samples s(n − i) plus a certain input δ(n):

$$ s(n) = \sum_{i=1}^{p} a_i s(n-i) + G\delta(n) \tag{4} $$

where G is the gain, a_i are the predictive coefficients, p is the LP order and δ(n) is the driving function.

Applying (4) to (3) results in

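$$
\begin{aligned}
q(m) &= \frac{1}{N} \sum_{n=0}^{N-1} \operatorname{sign}(s(n)) \cdot \left( \sum_{i=1}^{p} a_i s(n+m-i) + G\delta(n+m) \right) \\
&= \frac{1}{N} \sum_{n=0}^{N-1} \operatorname{sign}(s(n)) \cdot \sum_{i=1}^{p} a_i s(n+m-i) + \frac{1}{N} \sum_{n=0}^{N-1} \operatorname{sign}(s(n)) \cdot G\delta(n+m) \\
&= \sum_{i=1}^{p} a_i \, \frac{1}{N} \sum_{n=0}^{N-1} \operatorname{sign}(s(n)) \cdot s(n+m-i) + \frac{G}{N} \sum_{n=0}^{N-1} \operatorname{sign}(s(n)) \cdot \delta(n+m) \\
&= \sum_{i=1}^{p} a_i q(m-i) + \frac{G}{N} \sum_{n=0}^{N-1} \operatorname{sign}(s(n)) \cdot \delta(n+m).
\end{aligned}
\tag{5}
$$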
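Relation (5) can be sanity-checked numerically under our own assumptions: the signal is an AR(2) process driven by white Gaussian noise rather than a glottal pulse train, and the helper names are hypothetical. For m ≥ 1 the excitation term averages toward zero, because the future input δ(n + m) is independent of sign(s(n)); a least-squares fit of q(m) = Σ a_i q(m − i) over lags m ≥ p should therefore return approximately the coefficients that generated the signal.

```python
import numpy as np

def synth_ar(a, n, rng):
    """Hypothetical test signal: Eq. (4) with G = 1 and a white Gaussian driving function."""
    p = len(a)
    s = np.zeros(n + p)
    e = rng.standard_normal(n + p)
    for t in range(p, n + p):
        s[t] = np.dot(a, s[t - p:t][::-1]) + e[t]
    return s[p:]

def crosscorr_lags(s, N, max_lag):
    """q(m) of Eq. (3) for m = 0..max_lag; needs len(s) >= N + max_lag."""
    sg = np.sign(s[:N])
    return np.array([np.dot(sg, s[m:m + N]) / N for m in range(max_lag + 1)])

def fit_recursion(q, p, m_start, m_stop):
    """Least-squares fit of q(m) = sum_i a_i q(m-i) over m_start <= m < m_stop (m_start >= p)."""
    A = np.array([q[m - p:m][::-1] for m in range(m_start, m_stop)])   # rows: q(m-1), ..., q(m-p)
    a_hat, *_ = np.linalg.lstsq(A, q[m_start:m_stop], rcond=None)
    return a_hat

rng = np.random.default_rng(1)
a_true = np.array([1.3, -0.8])
N = 50000
s = synth_ar(a_true, N + 40, rng)
q = crosscorr_lags(s, N, 30)
a_hat = fit_recursion(q, p=2, m_start=2, m_stop=30)   # lags m >= 1 avoid the excitation peak at m = 0
print(a_hat)   # close to a_true
```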

Equation (5) shows that, apart from the excitation term, q(m) obeys the same recursion as s(n) in (4); that is, it is pole-preserving. Hence linear prediction of the crosscorrelation sequence is able to estimate the AR coefficients. Furthermore, computing q(m) resembles a statistical mean (time-averaging) process, which is capable of reducing the level of additive random noise. The noise power concentrates at the zeroth lag of the crosscorrelation sequence q(m), as in the case of the autocorrelation sequence. Hence, like the autocorrelation sequence, the crosscorrelation sequence has stronger immunity against noise than the original speech signal.
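The claim that noise power concentrates at the zeroth lag of q(m) can be illustrated in an idealized setting, sketched below in Python/NumPy with hypothetical helper names. Assuming a Gaussian AR(1) stand-in for speech and white Gaussian noise at 0 dB SNR, Bussgang's theorem gives E[sign(x(n)) s(n + m)] proportional to r_s(m)/σ_x. For m ≥ 1 the noisy q(m) is therefore the clean one times a common factor σ_s/σ_x (about 0.71 here), which does not change the LP coefficients, while q(0) is boosted (by about 1.41 here). Real speech is not Gaussian, so this only illustrates the trend.

```python
import numpy as np

def synth_ar(a, n, rng):
    """Hypothetical Gaussian AR test signal with unit-variance white driving noise."""
    p = len(a)
    s = np.zeros(n + p)
    e = rng.standard_normal(n + p)
    for t in range(p, n + p):
        s[t] = np.dot(a, s[t - p:t][::-1]) + e[t]
    return s[p:]

def crosscorr_lags(x, N, max_lag):
    """q(m) of Eq. (3) for m = 0..max_lag."""
    sg = np.sign(x[:N])
    return np.array([np.dot(sg, x[m:m + N]) / N for m in range(max_lag + 1)])

rng = np.random.default_rng(2)
N = 200000
s = synth_ar(np.array([0.9]), N + 10, rng)            # Gaussian AR(1) stand-in for speech
x = s + np.std(s) * rng.standard_normal(len(s))      # white noise at about 0 dB SNR
ratio = crosscorr_lags(x, N, 3) / crosscorr_lags(s, N, 3)
print(np.round(ratio, 2))   # lag 0 boosted; lags 1 to 3 share a common factor near 1/sqrt(2)
```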

Utilizing these two properties, we propose an LP analysis of the crosscorrelation sequence. Its specific procedure is summarized as follows:

(I) Calculate the crosscorrelation sequence q(m), m = 0, 1, ..., N − 1, from one frame of speech signal of length 2N using (3);

(II) Apply a Hamming window of length N to the crosscorrelation sequence obtained in (I);

(III) Compute the autocorrelation sequence of the windowed crosscorrelation sequence with the biased autocorrelation estimator;

(IV) Estimate the predictive coefficients by the Levinson-Durbin algorithm.

A block diagram of the proposed method is depicted in Fig. 1.

Fig. 1. Block diagram for the proposed LP analysis of crosscorrelation sequence
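Steps (I) to (IV) can be strung together as follows. This is a minimal Python/NumPy sketch under our own assumptions, not the authors' implementation: the Hamming window of step (II) is taken to be the standard symmetric window np.hamming(N) applied to q(0), ..., q(N − 1), the input is a synthetic harmonic frame with 2N = 1024 samples (N = 512, i.e. 51.2 ms at 10 kHz), and all function names are hypothetical.

```python
import numpy as np

def crosscorr_q(frame):
    """Step (I): Eq. (3) on one frame of 2N samples, giving q(0), ..., q(N-1)."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame) // 2
    sg = np.sign(frame[:N])
    return np.array([np.dot(sg, frame[m:m + N]) / N for m in range(N)])

def biased_autocorr(x, max_lag):
    """Step (III): biased estimate (1/L) sum_n x(n) x(n+k), k = 0..max_lag."""
    L = len(x)
    return np.array([np.dot(x[:L - k], x[k:]) / L for k in range(max_lag + 1)])

def levinson_durbin(r, p):
    """Step (IV): predictor coefficients a_1..a_p and the final prediction-error power."""
    a = np.zeros(p)
    err = r[0]
    for m in range(p):
        k = (r[m + 1] - np.dot(a[:m], r[m:0:-1])) / err   # reflection coefficient
        prev = a[:m].copy()
        a[:m] = prev - k * prev[::-1]
        a[m] = k
        err *= 1.0 - k * k
    return a, err

def lp_crosscorr(frame, p=12):
    q = crosscorr_q(frame)                  # (I) crosscorrelation with the zero-crossing wave
    qw = q * np.hamming(len(q))             # (II) Hamming window of length N (symmetric, assumed)
    r = biased_autocorr(qw, p)              # (III) biased autocorrelation of the windowed q(m)
    return levinson_durbin(r, p)            # (IV) Levinson-Durbin

fs = 10000.0
n = np.arange(1024)                         # one frame of 2N = 1024 samples (N = 512)
rng = np.random.default_rng(0)
frame = sum(np.cos(2 * np.pi * 150.0 * h * n / fs) / h for h in range(1, 20))
frame = frame + 0.1 * rng.standard_normal(n.size)
a, err = lp_crosscorr(frame, p=12)
```

Because the biased estimator in step (III) yields a positive-definite autocorrelation matrix for any nonzero frame, the Levinson-Durbin step returns a stable all-pole filter; this can be confirmed by checking that the roots of z^p − a_1 z^(p−1) − ... − a_p lie inside the unit circle.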

III. EXPERIMENTAL RESULTS

To verify the effectiveness of the proposed LP analysis of the crosscorrelation sequence, several experiments were conducted on synthetic vowels and real vowels. We compared the performance of the proposed method with that of the conventional autocorrelation method and OSALPC [8]. The experimental specifications for the following simulations are listed below:

• frame length N: 51.2 ms;
• LP order: 12;
• frame shift: 25.6 ms;
• analysis window: Hamming window;
• additive noise: white noise.
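At the 10 kHz sampling rate used for the synthetic vowels, these settings translate into 51.2 ms × 10 kHz = 512 samples per frame and 25.6 ms × 10 kHz = 256 samples of shift. The sketch below (Python/NumPy, hypothetical helper names) shows this arithmetic, a framing helper, and one common way of adding white Gaussian noise at a target SNR. The SNR is assumed here to be the ratio of whole-signal powers; the paper does not state its exact definition.

```python
import numpy as np

def ms_to_samples(ms, fs):
    """Convert a duration in milliseconds to a whole number of samples at rate fs (Hz)."""
    return int(round(ms * 1e-3 * fs))

def split_frames(x, frame_len, hop):
    """Overlapping frames as rows; trailing samples that do not fill a frame are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def add_white_noise(s, snr_db, rng):
    """Add white Gaussian noise so that 10*log10(P_signal / P_noise) equals snr_db exactly."""
    s = np.asarray(s, dtype=float)
    w = rng.standard_normal(len(s))
    target_noise_power = np.mean(s ** 2) / 10.0 ** (snr_db / 10.0)
    w *= np.sqrt(target_noise_power / np.mean(w ** 2))
    return s + w

fs = 10000.0
frame_len = ms_to_samples(51.2, fs)     # 512 samples
hop = ms_to_samples(25.6, fs)           # 256 samples

rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 150.0 * np.arange(2048) / fs)   # placeholder signal
x = add_white_noise(s, 10.0, rng)
frames = split_frames(x, frame_len, hop)                # shape (7, 512)
```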

A. Results on synthetic speech

We utilized the Liljencrants-Fant (LF) model [14], which can be considered an approximation of the natural human glottal source and is capable of generating natural-sounding synthetic speech [15] [16], to generate synthetic vowels. The generated synthetic vowels were sampled at a frequency of 10 kHz. In order to simulate the lip radiation characteristic, we pre-emphasized these synthetic vowels with the filter $1 - z^{-1}$.
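The pre-emphasis filter $1 - z^{-1}$ amounts to the first difference y(n) = s(n) − s(n − 1). A minimal NumPy sketch (the function name is hypothetical, and s(−1) is assumed to be 0):

```python
import numpy as np

def preemphasize(s):
    """Apply 1 - z^{-1}: y(n) = s(n) - s(n-1), taking s(-1) = 0."""
    s = np.asarray(s, dtype=float)
    y = s.copy()
    y[1:] -= s[:-1]
    return y

y = preemphasize([1.0, 2.0, 4.0, 7.0])   # first differences: [1, 1, 2, 3]
```

The filter has a zero at z = 1, so it removes any DC component and emphasizes high frequencies.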

First, the power spectra of five synthetic vowels (/a/, /i/, /u/, /o/ and /e/) estimated by the conventional autocorrelation method, OSALPC and the proposed LP analysis of the crosscorrelation sequence are shown in Fig. 2 for 100 consecutive frames without additive noise. As shown in Fig. 2, spurious spectral peaks appear in the power spectra estimated by OSALPC for vowels /a/ and /o/. This phenomenon is probably due to the fact that the OSALPC technique performs only a partial deconvolution of the speech signal [11]. On the other hand, the proposed method shows almost the same spectral performance as the conventional autocorrelation method in a clean environment.

Fig. 2. Spectra of synthetic vowels /a/ (a), /i/ (i), /u/ (u), /o/ (o) and /e/ (e) at F0 = 150 Hz estimated by OSALPC (red), the proposed LP analysis of crosscorrelation sequence (black) and the conventional autocorrelation method (blue). [Plot panels; horizontal axis: frequency, 0 to 5000 Hz.]

Fig. 3. Spectra of synthetic vowel /a/ estimated at SNR = 20 dB (a), SNR = 10 dB (b) and SNR = 0 dB (c) by OSALPC (red), the proposed LP analysis of crosscorrelation sequence (black) and the conventional autocorrelation method (blue). [Plot panels; horizontal axis: frequency, 0 to 5000 Hz.]
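The spectra in Figs. 2 and 3 are LP power spectra. As a sketch of how such an envelope can be evaluated (our own Python/NumPy illustration, with an assumed FFT size of 512 and the gain passed in as an input), the AR model of (4) gives P(ω) = G²/|1 − Σ a_i e^(−jωi)|², which can be sampled with a zero-padded real FFT of the inverse-filter coefficients. When the coefficients come from the Levinson-Durbin recursion, G² is commonly set to the final prediction-error power.

```python
import numpy as np

def lp_power_spectrum(a, gain2=1.0, nfft=512, fs=10000.0):
    """Sample P(w) = G^2 / |1 - sum_i a_i e^{-jwi}|^2 on nfft//2 + 1 bins from 0 to fs/2."""
    inv_filter = np.concatenate(([1.0], -np.asarray(a, dtype=float)))  # A(z) coefficients
    A = np.fft.rfft(inv_filter, n=nfft)                                   # zero-padded to nfft
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs, gain2 / np.abs(A) ** 2

freqs, P = lp_power_spectrum([0.5])          # AR(1) example with G^2 = 1
P_db = 10 * np.log10(P)                      # decibel scale, as usual for spectral plots
```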

Next, we evaluated the performance of the proposed method in a white noise environment. Fig. 3 shows an example of the power spectra estimated by the conventional autocorrelation method, OSALPC and the proposed LP analysis of the crosscorrelation sequence for 100 consecutive frames of the synthetic vowel /a/ at SNR = 20 dB, 10 dB and 0 dB. As seen in Fig. 3(b), at SNR = 10 dB the proposed method and OSALPC basically resolve five formants, while the fourth and fifth formants disappear from the spectra estimated by the conventional autocorrelation method. At SNR = 0 dB, the power spectra estimated by OSALPC show a sharper third formant than those estimated by the proposed method. The reason is that the autocorrelation sequence is actually less affected by additive noise than the crosscorrelation sequence.

Here, the cepstrum distance is introduced as a measure to compare the three methods. It is computed by

$$ CD = \frac{10}{\ln 10} \sqrt{2 \sum_{i=1}^{M} (c_i - \tilde{c}_i)^2} \tag{6} $$

where c_i are the true cepstrum coefficients and c̃_i are the estimated cepstrum coefficients calculated from the noisy speech signal. Fig. 4 compares the average cepstrum distance over 100 consecutive frames of synthetic vowel /a/ under white noise. The vertical line at the top of each bar indicates the 95% confidence interval. The proposed method performs better than the conventional autocorrelation method except at SNR = 20 dB. Meanwhile, the proposed method also performs better than OSALPC except at the low SNR of 0 dB.

Fig. 4. Comparison of average cepstrum distance for synthetic vowel /a/ under white noise. [Bar chart: cepstrum distance versus SNR [dB] (0 to 20 dB) for the autocorrelation method, OSALPC and the proposed method; error bars indicate 95% confidence intervals.]
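Equation (6) can be computed as sketched below. The paper does not say how the cepstrum coefficients are obtained, so this Python/NumPy sketch assumes they are the LP cepstrum of the all-pole model 1/(1 − Σ a_k z^(−k)), computed with the standard recursion c_n = a_n + Σ_{k=1}^{n−1} (k/n) c_k a_{n−k} (with a_n = 0 for n > p, and the gain term c_0 left out). The function names, the coefficient values and the choice of M are hypothetical.

```python
import numpy as np

def lpc_to_cepstrum(a, M):
    """Cepstrum c_1..c_M of 1 / (1 - sum_k a_k z^{-k}) via the LPC-to-cepstrum recursion."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(M + 1)                     # c[0] (gain term) is not used here
    for n in range(1, M + 1):
        acc = a[n - 1] if n <= p else 0.0   # a_n, taken as 0 beyond the model order
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

def cepstral_distance(c_true, c_est):
    """Eq. (6): CD = (10 / ln 10) * sqrt(2 * sum_{i=1}^{M} (c_i - c~_i)^2)."""
    d = np.asarray(c_true, dtype=float) - np.asarray(c_est, dtype=float)
    return 10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(d ** 2))

M = 12                                      # hypothetical number of cepstral terms
c_ref = lpc_to_cepstrum([1.3, -0.8], M)     # "true" model (illustrative values)
c_hat = lpc_to_cepstrum([1.25, -0.75], M)   # an estimate of it
cd = cepstral_distance(c_ref, c_hat)
```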

B. Results on real vowel

A real vowel /a/ was used to carry out the experiments. Fig. 5 compares the average cepstrum distance over 100 consecutive frames. Both OSALPC and the proposed method have means significantly different from that of the conventional autocorrelation method at every SNR. Although the average cepstrum distance obtained by OSALPC is slightly better than that of the proposed method at low SNR, the proposed method is considered to be competitive with OSALPC.

In addition, it is worth noting that the proposed method is closely related to OSALPC. Their basic concept is to use the autocorrelation sequence or the crosscorrelation sequence in place of the original signal. However, the computation of the proposed method is more efficient than that of OSALPC, because the crosscorrelation sequence is calculated using only additions and polarity detection [13], whereas the autocorrelation sequence requires additions and multiplications.
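The efficiency argument can be made concrete. Since sign(s(n)) only takes the values +1, −1 or 0, each lag of (3) reduces to adding or subtracting the samples s(n + m) according to the detected polarity of s(n). The Python/NumPy sketch below (hypothetical function name, our own illustration) does exactly that and compares it with a direct evaluation of (3).

```python
import numpy as np

def crosscorr_add_only(s, N):
    """q(m) of Eq. (3) built from additions and subtractions selected by the polarity of s(n).

    No sample-by-sample multiplications are needed; the single 1/N scaling per lag
    could even be dropped, because LP coefficients do not change under a common scale.
    """
    s = np.asarray(s, dtype=float)
    pos = s[:N] > 0            # polarity detection
    neg = s[:N] < 0            # samples with s(n) == 0 contribute nothing, as sign(0) = 0
    q = np.empty(N)
    for m in range(N):
        seg = s[m:m + N]
        q[m] = (seg[pos].sum() - seg[neg].sum()) / N
    return q

rng = np.random.default_rng(3)
N = 256
s = rng.standard_normal(2 * N)
q_ref = np.array([np.dot(np.sign(s[:N]), s[m:m + N]) / N for m in range(N)])   # Eq. (3) directly
q_add = crosscorr_add_only(s, N)
print(np.allclose(q_add, q_ref))
```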

Fig. 5. Comparison of average cepstrum distance for real vowel /a/ under white noise. [Bar chart: cepstrum distance versus SNR [dB] (0 to 20 dB) for the autocorrelation method, OSALPC and the proposed method; error bars indicate 95% confidence intervals.]

IV. CONCLUSIONS

In this paper, a new approach to LP analysis has been proposed for use in noisy environments. The approach is based on the crosscorrelation sequence between a speech signal and its zero-crossing wave. The experimental results show that the proposed method is suitable for analyzing speech signals in a noisy environment and is capable of reducing the noise level. We will continue to investigate the effectiveness of the proposed method with more real speech data.

REFERENCES

[1] B. S. Atal and S. Hanauer, “Speech analysis and synthesis by linear prediction of the speech wave,” J. Acoust. Soc. Am., Vol. 50, No. 2, pp. 637-655, August 1971.

[2] J. Markel, “Digital inverse filtering - A new tool for formant trajectory estimation,” IEEE Trans. Audio and Electroacoust., Vol. AU-20, No. 2, pp. 129-137, June 1972.

[3] Q. F. Zhao, T. Shimamura and J. Suzuki, “Improvement of LPC analysis of speech by noise compensation,” IEICE Trans., Vol. J81-A, No. 11, pp. 1583-1591, Nov. 1998.

[4] L. Q. Liu and T. Shimamura, “A noise compensation LPC method based on pitch synchronous analysis for speech,” Journal of Signal Processing, Vol. 17, No. 6, pp. 283-292, 2013.

[5] S. M. Kay, “Noise compensation for autoregressive spectral estimates,” IEEE Trans. Acoust., Speech, and Signal Process., Vol. ASSP-28, No. 3, pp. 292-303, 1980.

[6] D. McGinn and D. Johnson, “Reduction of all-pole parameter estimator bias by successive autocorrelation,” Proc. ICASSP'83, Vol. 8, pp. 1088-1091, April 1983.

[7] S. A. Fattah, W. P. Zhu and M. O. Ahmad, “Noisy autoregressive system identification based on repeated autocorrelation function,” IEEE CCECE, pp. 1572-1575, May 2006.

[8] J. Hernando and C. Nadeu, “A comparative study of parameters and distances for noisy speech recognition,” EUROSPEECH, pp. 91-94, 1991.

[9] J. Hernando and C. Nadeu, “AR modelling of the speech autocorrelation to improve noisy speech recognition,” SPAC, pp. 107-110, November 1992.

[10] J. Hernando and C. Nadeu, “Speech recognition in noisy car environment based on OSALPC representation and robust similarity measuring techniques,” Proc. ICASSP'94, pp. II69-II72, 1994.

[11] C. Nadeu, J. Pascual and J. Hernando, “Pitch determination using the cepstrum of the one-sided autocorrelation sequence,” Proc. ICASSP'91, pp. 3677-3680, 1991.

[12] S. A. Fattah, W. P. Zhu and M. O. Ahmad, “Noisy autoregressive system identification by the ramp cepstrum of one-sided autocorrelation function,” IEEE ISCAS, Vol. 4, pp. 3147-3150, 2005.

[13] J. Suzuki, “Speech processing system by use of short-time crosscorrelation function,” Proc. ICASSP, Vol. 2, pp. 24-27, 1977.

[14] G. Fant, J. Liljencrants and Q. G. Lin, “A four parameter model of glottal flow,” Quart. Progress and Status Rep., Speech Transmission Lab, Royal Inst. Technol., pp. 1-13, 1985.

[15] H. Fujisaki and M. Ljungqvist, “Proposal and evaluation of model for the glottal source waveform,” Proc. IEEE Int. Conf. on Acoust., Speech and Signal Processing, Vol. 4, pp. 1605-1608, 1986.

[16] H. Strik, “Automatic parametrization of differentiated glottal flow: comparing methods by means of synthetic flow pulses,” J. Acoust. Soc. Am., Vol. 103, No. 5, pp. 2659-2669, 1998.