
A Pitch-synchronous Analysis of Hoarseness in Running Speech*

Hiroshi Muta, Thomas Baer, Kikuju Wagatsuma,† Teruo Muraoka,† and Hiroyuki Fukuda††

A method of pitch-synchronous acoustic analysis of hoarseness requiring a voice sample of only four fundamental periods is presented. This method calculates a noise-to-signal (N/S) ratio, defined from the power spectrum, which indicates the depth of valleys between harmonic peaks. A pitch-synchronous spectrum is calculated from a discrete Fourier transform of the signal, windowed through a continuously variable Hanning window spanning exactly four fundamental periods. A two-stage procedure is used to determine the exact duration of the four fundamental periods. An initial estimate is obtained using autocorrelation in the time domain. A more precise estimate is obtained in the frequency domain by minimizing the errors between the preliminarily calculated power spectrum and the predicted spectrum spread of a windowed harmonic signal. Analysis of synthesized voices showed that the N/S ratio is sensitive to additive noise, jitter, and shimmer, and is insensitive to slow (8 Hz) modulation in fundamental frequency and amplitude. An analysis of pre- and postoperative voices of six patients with benign laryngeal disease showed that the N/S ratio for vowel /u/ in running speech consistently improved after surgery for all subjects, in agreement with their successful therapeutic results.

INTRODUCTION

A degradation in voice quality, generally called hoarseness, is one of the major symptoms of such benign laryngeal diseases as vocal cord polyps or nodules, and is often the first symptom of neoplastic diseases such as laryngeal cancer as well. Quantitative measures of the acoustic characteristics associated with laryngeal pathology have focused on two different kinds of parameters, which are compatible with the standard model of voice production (Isshiki, Yanagihara, & Morimoto, 1966): (1) parameters defined by cycle-to-cycle variation of the glottal source signal, and (2) those defined within one glottal cycle of the source signal, such as the signal-to-noise ratio and the relative intensity of higher harmonics.

Description of the glottal source periodicity in a sustained vowel, such as measures of cycle-to-cycle perturbation of pitch period (Lieberman, 1961) and amplitude (Koike, 1969), has objectively indicated the degree of hoarseness either directly from the audio signal or from the glottal source signal calculated by inverse filtering (Davis, 1976). However, while these measures may change in advanced laryngeal cancer, they do not always show significant glottal source perturbation in a hoarse voice associated with a benign disease or an early cancer (Ludlow, Bassich, Connor, Coulter, & Lee, 1987).

Haskins Laboratories Status Report on Speech Research, SR-93/94, 1988

Sound spectrographic analysis of sustained vowels shows less conspicuous harmonic structure in hoarse voices than in normal voices (Yanagihara, 1967). This phenomenon, low intensity of the harmonic component relative to the background, has been explained either as a decrease of higher harmonics in the source spectrum (Isshiki et al., 1966), or as an increase of additive noise in the source signal (Kasuya, Ogawa, Mashima, & Ebihara, 1986). The modulation effect of cycle-to-cycle perturbation of the glottal source may also contribute to the apparent decay of harmonic structure.

Several methods for quantitative documentation of the spectrographic phenomenon have been reported, using calculations either in the frequency domain (Hiraoka, Kitazoe, Ueta, Tanaka, & Tanabe, 1984; Kasuya et al., 1986; Kitajima, 1981; Kojima, Gould, Lambiase, & Isshiki, 1980) or in the time domain (Yumoto, Gould, & Baer, 1982). All of them showed differences between normal and pathological subjects, as well as correlations with subjective ratings of hoarseness severity.

However, such methods require a long sustained vowel for analysis, and thus are sensitive to fluctuations of pitch, intensity, or articulation, as well as intentional vibrato. Any of these factors would contribute to an apparent reduction of the harmonic structure of the voice. Reliability of these methods thus depends on the subjects' ability to produce a long sustained vowel at constant pitch and intensity.

An additional problem with previous methods for quantifying harmonic content and spectral noise is their limited ability to resolve individual glottal cycles for analysis. A fractional error in fundamental period extraction or in pitch synchronization causes additional spectrum leakage of the original harmonics, causing further deterioration of the harmonic structure. As a result of all these problems, previous quantification methods have yet to demonstrate their clinical usefulness in the evaluation of mild to moderate hoarseness, such as evaluation of the therapeutic effects of phono-surgery.

We have developed a method of pitch-synchronous analysis that requires a very short voice sample, consisting of only four fundamental periods. The four-cycle sample can be extracted not only from sustained vowels, but also from vowels in running speech. This method calculates a noise-to-signal (N/S) ratio from the power spectrum, which indicates the depth of valleys between harmonic peaks. A precise pitch-synchronous spectrum is calculated from a discrete Fourier transform of the windowed signal, through a continuously variable Hanning window spanning exactly four fundamental periods. A two-stage procedure is used to determine the exact duration of the four fundamental periods: one stage in the time domain, and one in the frequency domain. This acoustic analysis will be useful in assessing mild or moderate hoarseness, because the examinees do not have the difficult task of producing a constant long sustained vowel for analysis.

I. ANALYSIS PROCESS

A. Pitch Extraction

1. Estimation of the Fundamental Period in the Time Domain

The continuous-time waveform of the speech signal is denoted by $s(t)$. Then, the discrete-time sequence, $s^*(n)$, is given by

$$s^*(n) = s(n\Delta t), \tag{1}$$

where $\Delta t$ is the sampling period.

The size for the four fundamental periods, $M$, is temporarily set according to the preliminary estimate of the fundamental period, $K_0 \Delta t$:

$$M = 4K_0. \tag{2}$$

The Hanning window function for this analysis frame is defined as

$$w(t) = 0.5\,(1 - \cos 2\pi t / T), \quad (0 \le t \le T), \tag{3}$$

where $T = M\Delta t$. The continuous-time waveform of the windowed speech signal, $s_w(t)$, is defined by

$$s_w(t) = w(t)\, s(t). \tag{4}$$

The discrete autocorrelation function, $R(n)$, for this frame is defined as

$$R(n) = \sum_{i=0}^{M-n-1} s_w^*(i)\, s_w^*(i+n), \tag{5}$$

where $s_w^*(n)$ is the discrete-time sequence of $s_w(t)$.

The fundamental period size, $K$, is obtained from the function peak, $R(K)$. If $K$ is not equal to $K_0$, $K_0$ is set to $K$, and steps (2) to (5) are repeated until the frame size, $M$, consists of four fundamental periods. The fundamental frequency, $f_0$, is given by

$$f_0 = 1 / (K \Delta t). \tag{6}$$
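The iteration over steps (2) to (5) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the search band around $K_0$, and the toy six-harmonic test signal are all assumptions made for the example.

```python
import math

def hann(m):
    """Hanning window of eq. (3), sampled at t = n*dt for n = 0..m-1 (T = m*dt)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / m)) for n in range(m)]

def autocorr(x, lag):
    """Discrete autocorrelation of eq. (5) for a single lag."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

def estimate_period(s, k0, max_iter=10):
    """Repeat eqs. (2)-(5): window 4*k0 samples, locate the autocorrelation
    peak K, and stop once K equals the current estimate k0."""
    for _ in range(max_iter):
        m = 4 * k0                                   # eq. (2)
        frame = [w * v for w, v in zip(hann(m), s[:m])]
        # Restricting the peak search to lags near k0 is an assumption of this sketch.
        lags = range(max(2, int(0.7 * k0)), int(1.4 * k0) + 1)
        k = max(lags, key=lambda n: autocorr(frame, n))
        if k == k0:
            break
        k0 = k
    return k0

# Toy signal (hypothetical): six equal-amplitude harmonics of 200 Hz at 10 kHz,
# so the true period is exactly 50 samples.
fs, f0_true = 10_000, 200.0
signal = [sum(math.sin(2 * math.pi * h * f0_true * n / fs) for h in range(1, 7))
          for n in range(400)]

K = estimate_period(signal, k0=44)   # deliberately poor initial guess
f0 = fs / K                          # eq. (6) with dt = 1/fs
```

The estimate is limited to whole samples, which is one reason a finer second stage in the frequency domain is useful.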

2. Calculation of the Precise Fundamental Frequency in the Frequency Domain

The amplitude spectrum, $|X(k)|$, is derived by computing the discrete Fourier transform, $X(k)$, of the windowed signal:

$$X(k) = \sum_{n=0}^{M-1} s_w^*(n)\, e^{-j 2\pi k n / M}. \tag{7}$$

The analysis frame consists of four fundamental periods, so there is one harmonic peak of $|X(k)|$ for every four steps of $k$.
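The spacing of harmonic peaks can be checked numerically. The sketch below evaluates eq. (7) by direct summation for a hypothetical two-harmonic signal whose period is exactly 50 samples; the signal, the frame length, and the small noise floor used to ignore rounding residue are assumptions of the example.

```python
import cmath
import math

def dft_magnitudes(x, kmax):
    """|X(k)| of eq. (7) for k = 0..kmax, by direct summation."""
    m = len(x)
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * n / m) for n, v in enumerate(x)))
            for k in range(kmax + 1)]

period = 50            # samples per fundamental period (assumed)
m = 4 * period         # analysis frame of exactly four periods

# Hanning-windowed frame: first harmonic with amplitude 1, second with amplitude 0.5.
frame = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / m))
         * (math.sin(2.0 * math.pi * n / period) + 0.5 * math.sin(4.0 * math.pi * n / period))
         for n in range(m)]

mags = dft_magnitudes(frame, 13)
floor = 1e-6 * max(mags)   # ignore bins that contain only rounding error
peaks = [k for k in range(1, 13)
         if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] > mags[k + 1]]
```

With an exact four-period frame the peaks fall at k = 4 and k = 8, each main lobe occupies three bins, and the bin midway between harmonics (k = 6) is empty.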

Hanning windowing causes the line spectrum of a harmonic signal to spread. If there is a small error in the estimated fundamental frequency, this spread will not be centered around the harmonic peaks of $X(k)$. We define a function, $F_h(f,x)$, which describes the spectrum spread of the hth harmonic as a function of the error in fundamental frequency, $x$, given the measured amplitude of the hth harmonic, $|X(4h)|$:

$$F_h(f,x) = \frac{|X(4h)|}{|W(-hx)|}\, W\bigl(f - h(f_0 + x)\bigr), \tag{8}$$

where $W(f)$ is the Fourier transform of the window function, $w(t)$:

$$W(f) = \int_0^T w(t)\, e^{-j 2\pi f t}\, dt = 0.5\,T \left[ \frac{\sin \pi f T}{\pi f T} + 0.5 \left\{ \frac{\sin \pi (fT - 1)}{\pi (fT - 1)} + \frac{\sin \pi (fT + 1)}{\pi (fT + 1)} \right\} \right] e^{-j \pi f T}. \tag{9}$$
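As a sanity check on the closed form in eq. (9), the sketch below compares it with a direct numerical integration of the Hanning window. The function names, the window length of 20 ms, and the midpoint-rule integrator are assumptions chosen for illustration.

```python
import cmath
import math

def sinc(u):
    """sin(pi*u) / (pi*u), with the limit value 1 at u = 0."""
    return 1.0 if abs(u) < 1e-12 else math.sin(math.pi * u) / (math.pi * u)

def window_transform(f, T):
    """Closed form of eq. (9) for a Hanning window on 0 <= t <= T."""
    u = f * T
    return (0.5 * T * (sinc(u) + 0.5 * (sinc(u - 1.0) + sinc(u + 1.0)))
            * cmath.exp(-1j * math.pi * u))

def window_transform_numeric(f, T, steps=4000):
    """Midpoint-rule integration of w(t) * exp(-j*2*pi*f*t) over 0..T."""
    dt = T / steps
    total = 0j
    for i in range(steps):
        t = (i + 0.5) * dt
        total += 0.5 * (1.0 - math.cos(2.0 * math.pi * t / T)) * cmath.exp(-2j * math.pi * f * t)
    return total * dt

T = 0.02  # 20 ms window (assumed)
max_diff = max(abs(window_transform(f, T) - window_transform_numeric(f, T))
               for f in (0.0, 30.0, 75.0, 125.0))
```

At f = 1/T the closed form has magnitude T/4, half of its value T/2 at f = 0, and it vanishes at f = 2/T. This is the three-bin main lobe that the N/S calculation later relies on.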

A better estimate of the fundamental frequency is obtained by searching for the value of $x$ for which the difference between $|F_h(f,x)|^2$ and the measured power spectrum, $|X(k)|^2$, on both sides of each harmonic peak is minimized. The estimation errors for the lower and higher spectrum spread of the hth harmonic, $E_{Lh}(x)$ and $E_{Hh}(x)$, are defined as

(10)

(11)

The total square error, $G(x)$, from the first to the Lth harmonic is

$$G(x) = \sum_{h=1}^{L} E_{Lh}^2(x) + \sum_{h=1}^{L} E_{Hh}^2(x). \tag{12}$$

In this study, the square errors are calculated up to the 16th harmonic peak, which is lower than the Nyquist frequency for all subjects.

The minimum of $G(x)$ is found from its derivative, $G'(x)$:

$$G'(x) = 0. \tag{13}$$

This equation is solved using Newton's method, starting with an initial guess of $x = 0$. Thus the precise fundamental frequency, $f_R$, is given by

$$f_R = f_0 + x. \tag{14}$$
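The Newton iteration for eq. (13) can be sketched generically. The code below is an illustration under stated assumptions: derivatives are approximated by central differences, and `toy_error` is a hypothetical stand-in for $G(x)$ with its minimum placed at x = 0.3 Hz; it is not the error function of the paper.

```python
def newton_minimize(g, x0=0.0, h=1e-4, tol=1e-10, max_iter=50):
    """Solve g'(x) = 0 by Newton's method, starting from x0.
    First and second derivatives are central-difference estimates."""
    x = x0
    for _ in range(max_iter):
        g1 = (g(x + h) - g(x - h)) / (2.0 * h)
        g2 = (g(x + h) - 2.0 * g(x) + g(x - h)) / (h * h)
        if g2 <= 0.0:          # not locally convex: stop rather than step uphill
            break
        step = g1 / g2
        x -= step
        if abs(step) < tol:
            break
    return x

def toy_error(x):
    """Hypothetical smooth error curve with a single minimum at x = 0.3."""
    d = x - 0.3
    return d * d + 0.1 * d ** 4

f0 = 200.0                          # time-domain estimate in Hz (assumed)
x = newton_minimize(toy_error)      # initial guess x = 0, as in the text
f_R = f0 + x                        # eq. (14)
```

Starting from x = 0 is reasonable when the time-domain estimate is already close, since Newton's method converges quickly near a minimum where the curve is locally convex.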

B. Pitch-Synchronous Spectrum Analysis

The Hanning window is redefined in order to cover four pitch cycles more precisely according to the new estimate of the fundamental frequency, $f_R$. The window size, $T_R$, is defined as

$$T_R = 4 / f_R. \tag{15}$$

The Hanning window function is defined as

$$w_R(t) = \begin{cases} 0.5\,(1 - \cos 2\pi t / T_R), & (0 \le t \le T_R), \\ 0, & (\text{otherwise}). \end{cases} \tag{16}$$

The continuous-time waveform of the windowed speech signal, $s_R(t)$, is defined by

$$s_R(t) = w_R(t)\, s(t), \tag{17}$$

and the corresponding discrete-time sequence, $s_R^*(n)$, is therefore

$$s_R^*(n) = \begin{cases} w_R(n\Delta t)\, s^*(n), & (n = 0, 1, 2, \ldots, M_R), \\ 0, & (\text{all other } n), \end{cases} \tag{18}$$

where $M_R$ is the largest integer which is smaller than $T_R / \Delta t$.

The continuous spectrum of a continuous-time signal is obtained from the Fourier transform of its discrete-time sequence provided that the signal is bandlimited within the Nyquist frequency. As long as the original signal is sufficiently bandlimited, the windowed signal is bandlimited to a good approximation. Therefore, the Fourier transform, $X_R(f)$, of $s_R^*(n)$ is given by

$$X_R(f) = \sum_{n=-\infty}^{\infty} s_R^*(n)\, e^{-j 2\pi f n \Delta t}. \tag{19}$$

The pitch-synchronous power spectrum of the windowed signal, $P(k)$, which is evaluated at frequency steps of $1/T_R$, is thus calculated as

$$P(k) = \left| X_R(k / T_R) \right|^2 = \left| \sum_{n=0}^{M_R} s_R^*(n)\, e^{-j 2\pi k n \Delta t / T_R} \right|^2. \tag{20}$$
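The point of eqs. (15) to (20) is that the window length need not be a whole number of samples. The sketch below evaluates the spectrum at steps of $1/T_R$ for a hypothetical three-harmonic signal at 217.3 Hz (about 46.02 samples per period at 10 kHz), once with the correct $f_R$ and once with a deliberately wrong value. Function and variable names are assumptions of the example.

```python
import cmath
import math

def pitch_sync_power(s, fs, f_r, kmax):
    """Power spectrum P(k), k = 0..kmax, following eqs. (15)-(20)."""
    dt = 1.0 / fs
    t_r = 4.0 / f_r                          # eq. (15): four periods
    m_r = math.ceil(t_r / dt) - 1            # largest integer smaller than T_R/dt
    frame = [0.5 * (1.0 - math.cos(2.0 * math.pi * n * dt / t_r)) * s[n]
             for n in range(m_r + 1)]        # eqs. (16)-(18)
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * n * dt / t_r)
                    for n, v in enumerate(frame))) ** 2
            for k in range(kmax + 1)]        # eq. (20)

# Toy signal (hypothetical): harmonics 1-3 of 217.3 Hz with decreasing amplitude.
fs, f0 = 10_000, 217.3
s = [sum(a * math.sin(2.0 * math.pi * h * f0 * n / fs)
         for h, a in ((1, 1.0), (2, 0.5), (3, 0.25)))
     for n in range(400)]

P_good = pitch_sync_power(s, fs, f_r=217.3, kmax=14)   # window matches the pitch
P_bad = pitch_sync_power(s, fs, f_r=210.0, kmax=14)    # about 3% error in f_R

valley_good = P_good[6] / P_good[4]   # valley between harmonics 1 and 2
valley_bad = P_bad[6] / P_bad[4]
```

With the matched window the valley at k = 6 is essentially empty, while the mismatched window lets harmonic energy leak into it. This leakage is what an accurate estimate of $f_R$ is meant to prevent.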

C. Calculation of Noise-to-Signal Ratio

Because the Hanning window covers exactly four fundamental periods, harmonic peaks and valleys appear in every four steps of $k$. If the signal consists of pure harmonics, the hth main lobe consists of $P(4h-1)$, $P(4h)$, and $P(4h+1)$, and no side lobes appear in the valley, $P(4h+2)$. The shallower the valley, the higher the level of the nonharmonic components.

The smallest value of the signal power, $P(k)$, over the hth harmonic peak and valley, $4h-1 \le k \le 4h+2$, is taken as the power of the noise component for the hth harmonic peak, $P_{Nh}$. Therefore, the estimated power spectrum of the noise component, $P_N(k)$, is defined as

$$P_N(k) = \min_{i=-1,0,1,2} P(4h+i) = P_{Nh}, \quad (4h-1 \le k \le 4h+2), \tag{21}$$

where $h = 1, 2, 3, \ldots, L$. In this study, these spectra are calculated up to the 16th harmonic.

The noise-to-signal ratio, $R_{NS}$, is defined as

$$R_{NS} = 10 \log_{10} \left\{ \sum_{k=3}^{4L+2} P_N(k) \Bigg/ \sum_{k=3}^{4L+2} P(k) \right\}. \tag{22}$$
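Equations (21) and (22) translate directly into a few lines of code. The sketch below is an illustration only; `ns_ratio_db` and the toy spectrum are hypothetical, and the function assumes the noise floor is nonzero so that the logarithm is defined.

```python
import math

def ns_ratio_db(P, L):
    """Noise-to-signal ratio of eq. (22) in dB.
    P is indexed by k and must contain entries up to k = 4L + 2."""
    noise = 0.0
    signal = 0.0
    for h in range(1, L + 1):
        band = range(4 * h - 1, 4 * h + 3)     # k = 4h-1 .. 4h+2
        p_nh = min(P[k] for k in band)         # eq. (21): noise power for harmonic h
        noise += 4.0 * p_nh                    # P_N(k) = P_Nh for each of the four k
        signal += sum(P[k] for k in band)
    return 10.0 * math.log10(noise / signal)

# Toy spectrum (hypothetical): each harmonic has a main lobe of 25, 100, 25
# and a valley of 1, repeated for 16 harmonics starting at k = 3.
L = 16
P = [0.0, 0.0, 0.0] + [25.0, 100.0, 25.0, 1.0] * L

r_ns = ns_ratio_db(P, L)            # 10*log10(4/151), about -15.8 dB
r_flat = ns_ratio_db([1.0] * (4 * L + 3), L)
```

A spectrum with no harmonic structure gives 0 dB, the upper limit of the measure, and deeper valleys drive the ratio more negative.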

II. METHOD OF THIS STUDY

A. Analysis of Synthesized Voices

In order to study the sensitivity of the N/S ratio, voices synthesized by the SPEAK program (Titze, 1986) were analyzed by the present method. The source model was noninteractive with the vocal tract, and a parameterized model of the glottal flow waveform was used. Voice samples were created with varying amounts of jitter, shimmer, additive noise, amplitude modulation, and frequency modulation; the vowel /u/ was used for synthesis. Samples were synthesized at a rate of 20,000 samples per second with 6 dB/octave pre-emphasis.


TABLE 1. Subjects for analysis.

Subject  Name  Age  Sex  Diagnosis  Perceptual Result  H/N Ratio (dB)
                                    N=34 (%)           Pre     Post
1        N.O.  39   M    Polyp       94.1              14.4    10.6
2        K.I.  46   M    Polyp       79.4              18.2    18.2
3        F.I.  29   M    Polyp      100                 2.3    14.8
4        K.I.  35   F    Cyst       100                19.0    21.4
5        N.K.  30   F    Nodules     67.6              16.0    11.2
6        M.U.  46   F    Polyp      100                19.9    17.5

Figure 1. Waveforms of the sentence /aoi uo no e o kaita/ for the first preoperative reading (top) and the first postoperative reading (bottom) by Subject 1. Horizontal axis: time (×100 ms).

B. Analysis of Pre- and Postoperative Voices

Table 1 describes the subjects used in the present study. Three males and three females, with mild or moderate hoarseness due to benign laryngeal disease, were selected for study. All subjects underwent microscopic laryngeal surgery and had sufficient perceptual voice quality after surgery so that both surgeons and patients were satisfied with the results. Pre- and postoperative samples of the six voices were presented to 34 listeners in paired comparison format. The listeners correctly selected the postoperative sample at the levels indicated in Table 1. The levels are above chance (p < .03) for each speaker. However, the calculated pre- and postoperative values of the H/N ratio for sustained vowel /a/ (Yumoto et al., 1982) fall within the normal range of 7.4 dB or greater in all cases except the preoperative value for Subject 3. These results suggest that most of the preoperative samples may be considered to show mild or moderate hoarseness, though the voice quality was definitely improved after surgery for all subjects.
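The "above chance" statement can be illustrated with a one-sided binomial test against a chance level of 0.5. The paper does not say which test was used, so the choice of test is an assumption here; the listener counts are recovered by rounding the percentages in Table 1 for 34 listeners.

```python
from math import comb

def binom_upper_tail(successes, n, p=0.5):
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(successes, n + 1))

n_listeners = 34
percent_correct = {1: 94.1, 2: 79.4, 3: 100.0, 4: 100.0, 5: 67.6, 6: 100.0}  # Table 1

# Convert each percentage back to a listener count, e.g. 67.6% of 34 -> 23.
counts = {subj: round(pct * n_listeners / 100.0) for subj, pct in percent_correct.items()}
p_values = {subj: binom_upper_tail(c, n_listeners) for subj, c in counts.items()}

weakest = max(p_values, key=p_values.get)   # subject with the least decisive result
```

Under this assumed test the least decisive case is Subject 5 (23 of 34 listeners), with a one-sided probability of about 0.029, which agrees with the reported bound of p < .03.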

The subjects were requested to read the Japanese sentence /aoi uo no e o kaita/ ("I drew a picture of a blue fish"). The sentence was read twice in a session, and recordings were made both pre- and postoperatively, three to eight weeks after the surgery. Recording was made using a high-fidelity electret condenser microphone (Sony ECM-23F) and a cassette tape recorder (Sony TC-2890SE) in a lightly sound-treated booth at Keio University Hospital. Figure 1 shows the waveform for the pre- and postoperative utterances of Subject 1. The sentence was read rather slowly and distinctly, as can be seen in the figure.

The recorded voice was digitized with 12-bit precision at a sampling rate of 10,000 samples per second without preemphasis. The cutoff frequency for the anti-aliasing lowpass filter was 4.8 kHz. Voice samples of 200-ms duration, which covered the vowel /u/ in /uo no/, were extracted for the analysis. We chose this vowel because the phrase /uo no/ has a flat accent pattern and is located in the middle of the sentence. The extracted region is indicated by arrows.

Figure 2. Waveforms and power spectra for an analysis frame of the synthesized voices, vowel /u/, F0 = 220 Hz, with 1%, 4%, and 16% additive noise in the glottal source. Axes: time (ms) for the waveforms; frequency (kHz) and power (dB) for the spectra.

III. RESULTS OF ANALYSIS

A. Results of Synthesized Voice Analysis

Results of the synthesized voice analysis demonstrate the sensitivity of the N/S ratio. Figure 2 shows the waveform and the power spectrum for an analysis frame of a synthesized voice, vowel /u/, F0 = 220 Hz, with 1%, 4%, and 16% additive noise in the glottal source. As expected, the greater the noise, the shallower the valleys in the power spectrum. Figure 3 shows the N/S ratio for synthesized voices with varying amounts of additive noise. Each result consists of 25 frames, shifted 6.4 ms each, whose standard deviations are indicated by error bars. The N/S ratio varies with the amount of additive noise in the glottal source signal. The same result was obtained from voice samples with F0 = 110 Hz.

Figure 3. The N/S ratio for synthesized voices, vowel /u/, F0 = 220 Hz, with varying amounts of additive noise in the glottal source. Error bars show one standard deviation for each sample. Axes: additive noise (%) and N/S ratio (dB).

Figure 4 shows the averaged power spectrum of 25 frames, shifted 6.4 ms each, for synthesized voices, vowel /u/, F0 = 220 Hz, with 1%, 4%, and 16% amplitude perturbation and 1/4%, 1%, and 4% pitch perturbation of the glottal source. Again, the greater the perturbation, the shallower the valleys in the power spectrum. Figure 5 shows the N/S ratio for synthesized voices with varying amounts of amplitude perturbation and pitch perturbation. The N/S ratio varies with the amount of the amplitude or pitch perturbation of the glottal source, and again the same result was obtained from the voice samples with F0 = 110 Hz. It may be noted that the N/S ratios for pitch and amplitude perturbation show greater variance than those for additive noise. This appears to be a statistical artifact. A synthesized voice with source perturbation contains only one random factor for each glottal cycle, while there is a random component in each sample for the additive noise case.


Figure 4. Averaged power spectra for the synthesized voices, vowel /u/, F0 = 220 Hz, with 1%, 4%, and 16% amplitude perturbation and 1/4%, 1%, and 4% pitch perturbation of the glottal source. Axes: frequency (kHz) and power (dB).

Figure 6 shows the N/S ratio and the H/N ratio (Yumoto et al., 1982) for synthesized voices of varying fundamental frequency with 16%, 32%, and 64% additive noise. The fundamental frequency was varied from 98 Hz to 392 Hz at 6 logarithmic steps per octave. Both indexes showed the same pattern of fluctuation, which appeared to be an artifact created by the synthesizing program. While both the N/S ratio and the H/N ratio were fairly insensitive to fundamental frequency over the normal speech range, the N/S ratio was somewhat less sensitive.
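The frequency grid described above can be written out explicitly. The sketch assumes that "6 logarithmic steps per octave" means equal frequency ratios of 2^(1/6) between neighbouring values; the variable names are illustrative.

```python
# Two octaves from 98 Hz to 392 Hz in 6 equal-ratio steps per octave.
f_low = 98.0
steps_per_octave = 6
n_octaves = 2

freqs = [f_low * 2.0 ** (k / steps_per_octave)
         for k in range(n_octaves * steps_per_octave + 1)]
ratio = freqs[1] / freqs[0]   # constant ratio between neighbours, 2**(1/6)
```

Under this reading the sweep contains 13 fundamental frequencies, with 196 Hz at the midpoint and 392 Hz at the top.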

Figure 7 shows time domain results for modulated synthesized voices with 16% additive noise. The glottal source was modulated at 8 Hz with 32% sinusoidal amplitude modulation or with 4% sinusoidal frequency modulation. One hundred frames with 1.6-ms frame shift were analyzed for each of the two conditions. The top panels indicate the voice waveform. Upper markings show the center of each frame. The middle panels show the fundamental frequency for each frame. The bottom panels show the N/S ratio smoothed by a moving average of three successive frames.

Figure 5. The N/S ratio for synthesized voices, vowel /u/, F0 = 220 Hz, with varying amounts of amplitude perturbation (top) and pitch perturbation (bottom) of the glottal source. Error bars show one standard deviation for each sample. Axes: amplitude perturbation (%) or pitch perturbation (%) and N/S ratio (dB).


Figure 6. The N/S ratio (top) and the H/N ratio (bottom) for the synthesized voices with 16%, 32%, and 64% additive noise. Axes: fundamental frequency (Hz) and N/S ratio (dB) or H/N ratio (dB).

Figure 7. Time domain results for the modulated synthesized voices, vowel /u/, F0 = 220 Hz, with 16% additive noise. The glottal source was modulated at 8 Hz with 32% sinusoidal amplitude modulation (top) or with 4% sinusoidal frequency modulation (bottom). Panels: waveform, fundamental frequency (Hz), and N/S ratio (dB) against time (ms), with the minimum of the N/S ratio marked.

Figure 7 shows that the N/S ratio varies as a result of glottal source modulation. In order to extract the most stable parts of the modulated signals, three successive frames, whose averaged N/S ratio showed the minimum value, were taken as the representatives for these samples. These three frames, whose center for each of the two conditions is indicated by the vertical bar in each bottom panel, predict the N/S ratio for this noise level without modulation. Figure 8 shows the waveforms and power spectra for the selected three frames from the modulated samples with 16% additive noise. These spectra show similar harmonic structure to those for the nonmodulated voice with the same amount of additive noise shown in Figure 2.
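The frame selection rule described above (smooth with a three-frame moving average, then take the minimum) can be sketched as follows. The function name and the per-frame values are hypothetical and serve only to illustrate the rule.

```python
def most_stable_frames(ns_db):
    """Return (center_index, smoothed_value) for the three successive frames
    whose averaged N/S ratio is smallest."""
    smoothed = [(ns_db[i - 1] + ns_db[i] + ns_db[i + 1]) / 3.0
                for i in range(1, len(ns_db) - 1)]
    best = min(range(len(smoothed)), key=smoothed.__getitem__)
    return best + 1, smoothed[best]      # +1 maps back to an index into ns_db

# Hypothetical per-frame N/S ratios in dB.
ns_per_frame = [-30.0, -35.0, -41.0, -43.0, -42.0, -36.0, -31.0]
center, ns_min = most_stable_frames(ns_per_frame)
```

For this toy sequence the selected frames are those at indices 2, 3, and 4, centered on index 3, with an averaged value of -42 dB.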


Figure 8. Waveforms and power spectra for the three frames, with minimum N/S ratio, from the modulated synthesized voices, vowel /u/, F0 = 220 Hz, with 16% additive noise. The glottal source was modulated at 8 Hz with 32% sinusoidal amplitude modulation (left) or with 4% sinusoidal frequency modulation (right). These spectra show a similar harmonic structure to that for the nonmodulated voice with the same amount of additive noise in Figure 2.

Figures 9 and 10 show the N/S ratio for modulated synthesized voices with 16%, 32%, and 64% additive noise, with varying amounts of 8 Hz glottal source modulation either in amplitude or in frequency. Each data point is an average of three successive frames whose N/S ratio showed the minimum value. The N/S ratio is insensitive to glottal source modulation (within one standard deviation of the nonmodulated samples) up to 32% amplitude modulation or up to 4% frequency modulation for samples with F0 = 220 Hz, and up to 16% amplitude modulation or up to 2% frequency modulation for samples with F0 = 110 Hz. The relatively small frame size, 18.2 ms for F0 = 220 Hz, compared to the period of source modulation, 125 ms for 8 Hz, is the reason for the insensitivity of the N/S ratio.
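The frame sizes quoted above follow from the four-period window. As a worked check (the symbols $T_{frame}$ and $T_{mod}$ are introduced here for illustration only):

```latex
T_{frame} = \frac{4}{F_0} = \frac{4}{220\ \text{Hz}} \approx 18.2\ \text{ms}, \qquad
T_{mod} = \frac{1}{8\ \text{Hz}} = 125\ \text{ms}, \qquad
\frac{T_{frame}}{T_{mod}} \approx 0.15 .
```

For F0 = 110 Hz the same calculation gives a frame of about 36.4 ms, or roughly 0.29 of a modulation period. This is consistent with the lower modulation limits reported for the 110 Hz samples.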

Figure 9. The N/S ratio for modulated synthesized voices, vowel /u/, F0 = 220 Hz (top) and F0 = 110 Hz (bottom), with 16%, 32%, and 64% additive noise, whose glottal source contained varying amounts of 8 Hz sinusoidal modulation in amplitude. Each data point is an average of the three frames whose N/S ratio showed the minimum value. Axes: amplitude modulation (%) and N/S ratio (dB).


Figure 10. The N/S ratio for modulated synthesized voices, vowel /u/, F0 = 220 Hz (top) and F0 = 110 Hz (bottom), with 16%, 32%, and 64% additive noise, whose glottal source contained varying amounts of 8 Hz sinusoidal modulation in frequency. Each data point is an average of three frames whose N/S ratio showed the minimum value. Axes: frequency modulation (%) and N/S ratio (dB).

B. Results of Patient Voice Analysis

Figure 11 shows the time domain results for the pre- and postoperative voice samples of Subject 1. The N/S ratio varied during the speech sample. Three successive frames, whose averaged N/S ratio showed the minimum value, were taken as the representatives for each sample. Figure 12 shows the waveforms and power spectra for the selected three frames from the pre- and postoperative samples of this subject. The postoperative spectrum shows better harmonic structure than the preoperative spectrum.

Figure 11. Time domain results for the first preoperative reading (top) and the first postoperative reading (bottom) by Subject 1. The top panels indicate the waveforms of the voice for the vowel /u/ in /uo no/. One hundred frames with 1.6-ms frame shift were analyzed for each of the two conditions. Upper markings show the center of each frame. The middle panels show the fundamental frequency for each frame. The bottom panels show the N/S ratio smoothed by the moving average of three successive frames. The vertical bar in each bottom panel, which shows the minimum of the smoothed N/S ratio, indicates the most stable part of the vowel /u/.


Figure 12. Waveforms and power spectra for the selected three frames, which showed the minimum N/S ratio, from the first preoperative reading (left) and the first postoperative reading (right) by Subject 1. Axes: time (ms) for the waveforms; frequency (kHz) and power (dB) for the spectra.

Table 2 shows the analysis results for the N/S ratio and fundamental frequency for the six subjects before and after laryngeal surgery. Each result is an average of three successive frames whose N/S ratio showed the minimum value. Figure 13 shows the averaged N/S ratio of each pair (first and second readings) of pre- and postoperative voice samples. The N/S ratio consistently improved after the surgery in all six subjects. Thus, results of therapy considered to be successful by doctor and patient were indicated by the analysis.
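The pairwise averages plotted in Figure 13 can be recomputed from the N/S values in Table 2. The sketch below uses the values as transcribed from that table; the dictionary layout and variable names are assumptions made for the example.

```python
# N/S ratio in dB from Table 2: subject -> ((pre reading 1, pre reading 2),
#                                           (post reading 1, post reading 2))
table2_ns = {
    1: ((-25.8, -24.8), (-29.0, -33.4)),
    2: ((-20.1, -29.6), (-34.3, -31.4)),
    3: ((-29.8, -25.9), (-32.4, -35.9)),
    4: ((-34.9, -36.0), (-40.4, -42.0)),
    5: ((-29.0, -30.8), (-42.5, -41.4)),
    6: ((-34.5, -26.0), (-36.7, -36.9)),
}

# Average the two readings for each condition.
averages = {subj: (sum(pre) / 2.0, sum(post) / 2.0)
            for subj, (pre, post) in table2_ns.items()}

# A lower (more negative) N/S ratio means deeper valleys between harmonics.
improvement_db = {subj: pre - post for subj, (pre, post) in averages.items()}
all_improved = all(delta > 0.0 for delta in improvement_db.values())
```

With these values every subject shows a lower averaged N/S ratio after surgery, with improvements ranging from about 6 dB to about 12 dB.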

IV. DISCUSSION

Voice quality is difficult to assess objectively. Various laryngeal diseases may cause a pathological change in voice quality, and each abnormal voice may give a different perceptual impression to different listeners. We need better understanding of the perception of voice quality as well as better understanding of pathological production in order to evaluate the acoustic characteristics of a deviant voice properly in relation both to the perceptual impression of listeners and to the pathological state of the larynx.

Classifications of listeners' impressions in multiple dimensions, such as rough, breathy, asthenic, and strained, have been proposed (Hirano, 1981), and acoustic parameters associated with different kinds of voice quality have been studied (Imaizumi, 1986a, 1986b). For example, "roughness" may be associated with modulations over several pitch periods or, at low pitch, with factors that are the same across cycles. "Breathy" voice may be characterized by additive noise or by weakness of harmonics above the fundamental. The relative strength of harmonics also contributes to the perceptual contrast between "asthenic" and "strained" voices.

TABLE 2. Analysis results of the N/S ratio and the fundamental frequency for six subjects before and after laryngeal surgery.

         Pre-operation                          Post-operation
         Reading 1          Reading 2           Reading 1          Reading 2
Subject  F0 (Hz)  N/S (dB)  F0 (Hz)  N/S (dB)   F0 (Hz)  N/S (dB)  F0 (Hz)  N/S (dB)
1         97.5    -25.8      97.1    -24.8      103.8    -29.0      97.2    -33.4
2        136.0    -20.1     135.6    -29.6      140.5    -34.3     136.9    -31.4
3        144.5    -29.8     135.9    -25.9      131.3    -32.4     131.6    -35.9
4        234.5    -34.9     233.3    -36.0      248.6    -40.4     252.5    -42.0
5        202.4    -29.0     212.7    -30.8      237.1    -42.5     228.3    -41.4
6        195.9    -34.5     200.6    -26.0      209.3    -36.7     219.9    -36.9

Figure 13. Averaged N/S ratio of each pair (first and second readings) of pre- and postoperative voice samples. Axes: subject (1 to 6) and averaged N/S ratio (dB); separate bars for pre-operation and post-operation.

The kinds of acoustic parameters mentioned above do not bear a simple relationship to pathological modes of vocal-fold vibration, and, in addition, they interact with each other. For example, glottal source perturbations distort the harmonic structure and thus affect both noise measures and harmonic strength measures. Similarly, additive noise may contribute to acoustic measures of source perturbation. To provide a proper evaluation of each acoustic characteristic separately, it is necessary to extract individual glottal cycles from the acoustic signal accurately and to separate the glottal excitation signal from the nonspecific spectral noise in each cycle.

Inverse filtering has been proposed as a method for extracting source characteristics from the acoustic signal (Davis, 1976). However, it is doubtful whether inverse filtering provides sufficiently accurate results, especially with abnormal voices. For example, in a study applying the LPC method to hoarse voices, measured variations in formant patterns appeared to be caused by cycle-to-cycle variations in source characteristics (Muta et al., 1987).

If we are to understand the acoustic characteristics of hoarse voice fully, we will have to learn much more about the relationship between pathological vibrations of the vocal folds and the resulting acoustic signal. In the meantime, we have adopted a simple assumption for the present analysis based on sound-spectrographic findings (Yanagihara, 1967): for whatever reason, a hoarse voice has a greater nonharmonic component and a less pure harmonic component than a normal voice.

Periodic structure in the voice signal is the prerequisite for pitch-synchronous spectrum analysis. Therefore, the present method can be applied only to a case of mild or moderate hoarseness. In such cases, the fundamental period can be estimated easily by measures of the acoustic waveform without additional instrumental observations of vocal fold vibration, such as laryngeal stroboscopy or electroglottography.

The N/S ratio was calculated over the spectral region between the 1st and 16th harmonics. Generally, the harmonic structure of a voice signal shows greater distortion in higher harmonics than in lower harmonics, because of the modulation effect of source perturbation. The higher the harmonic, the greater the noise-to-signal ratio. However, voice signals were not preemphasized and we analyzed the vowel /u/, whose first and second formant frequencies are among the lowest of the Japanese vowels. The vowel spectra were thus dominated by low harmonics, so the analysis parameters, such as the sampling rate and the number of harmonics chosen, were wide enough to cover most of the acoustic power of the voice. Calculation of the power spectrum up to higher harmonics did not change the N/S ratio for voice samples.

However, it should be noted that spectral differences between source signals, such as an increase or decrease of higher harmonics, may affect the N/S ratio, because of the modulation effects of source perturbation. The pathological characteristics of the source spectrum, such as weakness of higher harmonics, may be evaluated from the present pitch-synchronous spectrum, if we can assume that the effect of the vocal tract resonance was the same for the given voice samples.

In summary, we have developed a pitch-synchronous analysis method for hoarseness, which is sensitive to additive noise, jitter, and shimmer, and is insensitive to slower modulations in amplitude and fundamental frequency. The results of the analysis of pre- and postoperative running speech, which indicate successful therapy of six patients with laryngeal disease, show the clinical usefulness of this method.

ACKNOWLEDGMENT

This work was supported by NINCDS Grant 13870 to Haskins Laboratories.

REFERENCES

Davis, S. B. (1976). Computer evaluation of laryngeal pathology based on inverse filtering of speech. SCRL Monograph 13.

Hirano, M. (1981). Clinical examination of voice (pp. 81-84). Vienna: Springer-Verlag.

Hiraoka, N., Kitazoe, Y., Ueta, H., Tanaka, S., & Tanabe, M. (1984). Harmonic-intensity analysis of normal and hoarse voices. Journal of the Acoustical Society of America, 76, 1648-1651.

Imaizumi, S. (1986a). Acoustic measure of roughness in pathological voice. Journal of Phonetics, 14, 457-462.

Imaizumi, S. (1986b). Clinical application of the acoustic measurement of pathological voice qualities. Annual Bulletin of the Research Institute of Logopedics and Phoniatrics (University of Tokyo), 20, 211-216.

Isshiki, N., Yanagihara, N., & Morimoto, M. (1966). Approach to the objective diagnosis of hoarseness. Folia Phoniatrica, 18, 393-400.

Kasuya, H., Ogawa, S., Mashima, K., & Ebihara, S. (1986). Normalized noise energy as an acoustic measure to evaluate pathologic voice. Journal of the Acoustical Society of America, 80, 1329-1334.

Kitajima, K. (1981). Quantitative evaluation of the noise level in the pathologic voice. Folia Phoniatrica, 33, 115-124.

Koike, Y. (1969). Vowel amplitude modulations in patients with laryngeal diseases. Journal of the Acoustical Society of America, 45, 839-844.

Kojima, H., Gould, W. J., Lambiase, A., & Isshiki, N. (1980). Computer analysis of hoarseness. Acta Otolaryngologica, 89, 547-554.

Lieberman, P. (1961). Perturbations in vocal pitch. Journal of the Acoustical Society of America, 33, 597-603.

Ludlow, C. L., Bassich, C. J., Connor, N. P., Coulter, D. C., & Lee, Y. J. (1987). The validity of using phonatory jitter and shimmer to detect laryngeal pathology. In T. Baer, C. Sasaki, & K. Harris (Eds.), Laryngeal function in phonation and respiration (pp. 492-508). Boston: Little Brown.

Muta, H., Muraoka, T., Wagatsuma, K., Horiuchi, M., Fukuda, F., Takayama, E., Fujioka, T., & Kanou, S. (1987). Analysis of hoarse voices using the LPC method. In T. Baer, C. Sasaki, & K. Harris (Eds.), Laryngeal function in phonation and respiration (pp. 463-474). Boston: Little Brown.

Titze, I. R. (1986). Three models of phonation. Journal of the Acoustical Society of America, Suppl. 1, 79, S81.

Yanagihara, N. (1967). Significance of harmonic changes and noise components in hoarseness. Journal of Speech and Hearing Research, 10, 531-541.

Yumoto, E., Gould, W. J., & Baer, T. (1982). Harmonics-to-noise ratio as an index of the degree of hoarseness. Journal of the Acoustical Society of America, 71, 1544-1550.

FOOTNOTES

*Journal of the Acoustical Society of America, submitted.

†Central R&D Center, Research and Development Division, Victor Company of Japan, Ltd., 58-7 Shinmei-cho, Yokosuka, Kanagawa 239, Japan.

††Department of Otolaryngology, Keio University School of Medicine, 35 Shinanomachi, Shinjuku-ku, Tokyo 160, Japan.

