
ELSEVIER Speech Communication 20 (1996) 23-35

An analysis of general acoustic-phonetic features for Spanish speech produced with the Lombard effect

Antonio Castellanos a,*, José-Miguel Benedi b, Francisco Casacuberta b

a Departamento de Informática, Universidad Jaime I - Campus de Penyeta Roja, 12071 Castellón, Spain

b Departamento de Sistemas Informáticos y Computación, Universidad Politécnica, Valencia, Spain

Received 15 April 1996; revised 15 June 1996

Abstract

A noisy environment usually degrades the intelligibility of a human speaker or the performance of a speech recognizer. Due to this noise, a phenomenon appears which is caused by the articulatory changes made by speakers in order to be more intelligible in the noisy environment: the Lombard effect. Over the last few years, special emphasis has been placed on analyzing and dealing with the Lombard effect within the framework of Automatic Speech Recognition. Thus, the first purpose of the work presented in this paper was to study the possible common tendencies of some acoustic features in different phonetic units for Lombard speech. Another goal was to study the influence of gender in the characterization of the above tendencies. Extensive statistical tests were carried out for each feature and each phonetic unit, using a large Spanish continuous speech corpus. The results reported here confirm the changes produced in Lombard speech with regard to normal speech. Nevertheless, some new tendencies have been observed from the outcome of the statistical tests.

Résumé

A noisy environment generally degrades the intelligibility of a human speaker and the performance of a speech recognition system. Because of this surrounding noise, speakers modify their articulation to make their speech more intelligible. This phenomenon is called the Lombard effect. In recent years, specific efforts have been made to analyze and deal with the Lombard effect within the framework of automatic speech recognition. The main objective of the work presented in this article is the study of tendencies common to certain acoustic parameters in different phonetic units, related to the Lombard effect. Another objective is the study of the influence of the sex of the speaker on the characterization of these tendencies. Extensive statistical tests were carried out for each parameter and each phonetic unit on a large database of continuous speech in Spanish. The reported results confirm the changes produced in natural speech by the Lombard effect. Some new tendencies were observed as a result of the statistical tests carried out.

Keywords: Lombard effect; Continuous speech; Speech production

* Corresponding author. E-mail: [email protected]

0167-6393/96/$15.00 Copyright © 1996 Elsevier Science B.V. All rights reserved. PII S0167-6393(96)00042-8


1. Introduction

A noisy environment usually degrades the intelligibility of a human speaker or the performance of a speech recognizer. On the one hand, the presence of certain types of noise masks some acoustic features of the speech. On the other hand, speakers produce articulatory changes in order to be more intelligible in a noisy environment. This latter phenomenon is known as the Lombard effect (Lombard, 1911). In addition, a study of speaker-dependent speech recognizers, which have a high accuracy rate under normal conditions, was carried out to determine their robustness under noisy conditions (Junqua and Wakita, 1989). In this study, the Lombard effect appeared to be even more degrading than the additive noise. Nevertheless, an increase in the human intelligibility of Lombard speech with regard to normal speech in noisy environments was confirmed by Summers et al. (1988).

Throughout this century, different aspects of the Lombard effect have been addressed and studied (Lane and Tranel, 1971). Some of these studies have focused on the perception by human listeners of the speech produced in noise (Summers et al., 1988; Junqua, 1993). However, over the last few years, special emphasis has been placed on analyzing and dealing with the Lombard effect within the framework of Automatic Speech Recognition (ASR) (Stanton et al., 1989; Applebaum and Hanson, 1990; Hansen and Bria, 1992; Junqua, 1993). Some experiments have been carried out to evaluate the ability to suppress the Lombard response (Pick et al., 1989). However, most of the recent research deals with analyzing some acoustic-phonetic features and/or the perception of human listeners, and also with improving the accuracy rates of ASR systems. In particular, methods for building ASR systems based on features that are robust to the Lombard effect have been sought and analyzed (Junqua and Wakita, 1989; Stanton et al., 1989; Dvorak and Hörmann, 1991; Cairns and Hansen, 1992; Mak et al., 1992; Hanson and Applebaum, 1993).

Several studies have found clear and consistent differences between acoustic-phonetic features of speech produced under quiet conditions versus those produced in noise (Stanton et al., 1988; Summers et al., 1988; Bond et al., 1989; Summers et al., 1989; Junqua, 1993). In these studies, some general common results have been reported for Lombard speech: an increase in phoneme (vowel) duration, a shift in the Energy Distribution from low frequency bands to middle or high bands, an increase in vowel first formant frequency, etc. Along with these common results, some discrepancies appeared which might be due to the different data and procedures used in each experiment.

In a European project ¹ (Alinat, 1994), a study on the behaviour of some specific features used in a specific feature-based ASR system was carried out. One of the tasks developed in the project was to evaluate the influence of the Lombard effect on these features using Spanish as the reference language (Castellanos and Casacuberta, 1992). To the knowledge of the authors, this evaluation constituted the first study of the Lombard effect for Spanish.

The work presented in this paper is the result of this task. The first purpose was to study the possible common tendencies of these features in different phonetic units for Lombard speech. A second goal was to study the influence of gender in the characterization of the above tendencies. For this purpose, extensive statistical tests were carried out for each feature and each phonetic unit, using a large Spanish continuous speech corpus.

2. Experimental framework

In order to present the experimental study of acoustic-phonetic features of Spanish produced with normal versus Lombard speech, it is necessary to consider and discuss several aspects:
1. the choice of the set of phonetic units together with the qualitative and quantitative definition of the continuous speech corpus used;
2. the selection of acoustic-phonetic features;
3. the experimental procedure to carry out the statistical study proposed.

2.1. Spanish phonetic units

The Spanish phonetic units were grouped into several categories from a phonetic and phonological point of view (Benedi et al., 1992). A detailed description of each category is given in Table 1. In this work, the SAMPA notation (Wells, 1989) was selected to represent the Spanish phonetic units studied.

¹ ROARS: "Robust Analytical Speech Recognition System". Project 5516 of the ESPRIT-II Program.

Table 1
A summary of the phonetic units considered and their transcription according to IPA and SAMPA notations (Wells, 1989)

Category    Type              Articulation    IPA  SAMPA  Example
Plosives    Unvoiced          bilabial        p    p      ópera [ópera]
Plosives    Unvoiced          dental          t    t      pato [páto]
Plosives    Unvoiced          velar           k    k      casa [kása]
Plosives    Voiced            bilabial        b    b      ámbar [ámbar]
Plosives    Voiced            dental          d    d      toldo [tóldo]
Plosives    Voiced            velar           g    g      congo [kóngo]
Nasals      Voiced            bilabial        m    m      cama [káma]
Nasals      Voiced            alveolar        n    n      lino [líno]
Nasals      Voiced            palatal         ɲ    J      leña [léJa]
Fricatives  Unvoiced          labiodental     f    f      fatal [fatál]
Fricatives  Unvoiced          interdental     θ    T      cocer [koTér]
Fricatives  Unvoiced          alveolar        s    s      sala [sála]
Fricatives  Unvoiced          velar           x    x      jirafa [xiráfa]
Fricatives  Voiced            palatal         ʝ    jj     hierba [jjérba]
Affricate   Unvoiced          palatal         tʃ   tS     ancho [ántSo]
Liquids     Lateral, voiced   alveolar        l    l      isla [ísla]
Liquids     Lateral, voiced   palatal         ʎ    L      calle [káLe]
Liquids     Vibrant, voiced   alveolar tap    ɾ    r      aro [áro]
Liquids     Vibrant, voiced   alveolar trill  r    rr     roca [rróka]
Vowels      Front             closed          i    i      gentil [gentíl]
Vowels      Front             middle          e    e      pagué [pagé]
Vowels      Central           open            a    a      bajo [báxo]
Vowels      Back              middle          o    o      óvalo [óbalo]
Vowels      Back              closed          u    u      zurdo [Túrdo]

The voiced palatal allophone of the lateral consonant [L] was included in this work as a palatal allophone of the fricative consonant [jj] (Benedi et al., 1992). The reason for this is twofold. On the one hand, the acoustic differences between them are slight and there are also few occurrences of either allophone in the speech corpus. On the other hand, in Spanish there exists a very common pronunciation phenomenon called "yeísmo" (Quillis, 1988). This phenomenon refers to the pronunciation of [L] as [jj], and many Spanish speakers have difficulty in establishing any difference between them.

2.2. Speech material

A continuous speech corpus of quasi-phonetically balanced Spanish sentences was designed and recorded. This corpus was carefully selected in order to include both the more important acoustic-phonetic phenomena and the main articulatory variability.

The phonetic criteria used in the selection of this corpus were similar to those of the ALBAYZIN acoustic-phonetic corpus ² (Casacuberta et al., 1991). The design of the corpus followed a basic criterion: the statistical properties of the corpus had to be close to those of Spanish. The following statistical requirements were considered for the design of the corpus used in this study:
1. the frequency of appearance of the phonetic units within the whole corpus; and
2. the frequency of appearance of each possible context for each phonetic unit.

² ALBAYZIN is a Spanish project whose main objective is to obtain general continuous speech corpora for the Spanish language.

Based on these considerations, the final corpus used in this study was composed of 13 specially selected sentences (Table 2). The average number of phonetic units per sentence was 27.4. This corpus was produced under both normal and Lombard conditions. The final set of utterances was composed of:
- The 13 sentences of the corpus, produced twice with normal speech by 10 speakers.
- The same 13 sentences produced twice with Lombard effect by the same 10 speakers.

Our findings are based on the analysis of approximately 14,240 phonetic units from normal and Lombard speech.

The ten speakers (five males and five females) who participated in the recording of the corpus were selected from among the laboratory staff and graduate students in computer science. They were not informed about the final purpose of the experiment at the time of recording. Also, all of them were native speakers from the same linguistic area, and none of the speakers reported any hearing or speech difficulty. The recording procedure for each speaker went on for approximately one hour distributed over several sessions.

The recording site was a quiet common room in our laboratory. A close-talk microphone placed at an approximate distance of 10 cm from the lips of the speakers was used for the recordings. The speakers were informed that the acquisition system was going to randomly display the 13 sentences (twice) in order for them to read each sentence at the time it appeared on the screen. The speakers were asked to speak clearly and did not wear headphones in this first stage, in which normal utterances were recorded.

In the second stage, in order to obtain the utterances with the Lombard effect, the speakers wore a pair of commercial headphones. They were then informed that white Gaussian masking noise at 85 dB would be presented over the headphones during the recording of the 13 sentences (twice). They were again asked to speak as clearly as possible.

After all the utterances were recorded, an evaluation of the difference of energy between Lombard and normal speech signals was carried out for each speaker. The ratio between the total speech signal energy and the total energy of the background noise was used as the Signal to Noise Ratio (SNR). For each speaker, Lombard and normal SNRs were averaged over his/her Lombard and normal speech signals, respectively. Table 3 shows the result of this evaluation.

Finally, the utterances were digitized (at 20 kHz) and parametrized in order to obtain a short-term analysis of the speech signal. The analyzer employed here was a coarse cochlea model comprising a bank of 45 linear bandpass filters with center frequencies and transfer functions based on human characteristics. The analyzer frequency band was 70 Hz-8 kHz. Each filter was followed by a detection-integration stage, and a "spectrum" was obtained every 7.2 ms (Alinat, 1991). The corpus was segmented and labeled by hand. The manual segmentation was carried out starting from an initial automatic segmentation which was based on spectral changes and obtained through a Dynamic Time-Warping procedure (Benedi et al., 1992).
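The SNR evaluation described above (ratio of total speech energy to total background-noise energy, averaged per speaker) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the ratio is expressed in dB as in Table 3, and the speech and noise segments are assumed to be already separated and of comparable length. The paper does not give these implementation details.

```python
import math

def energy(samples):
    """Total energy of a signal: sum of squared sample values."""
    return sum(s * s for s in samples)

def snr_db(speech, noise):
    """SNR in dB as the ratio of total speech energy to total noise energy.

    Assumption: both segments have comparable length, otherwise the ratio
    of *total* energies is biased by the duration of each segment.
    """
    return 10.0 * math.log10(energy(speech) / energy(noise))

def average_snr_db(pairs):
    """Average SNR over (speech, noise) pairs, e.g. all utterances
    of one speaker in one condition (normal or Lombard)."""
    values = [snr_db(speech, noise) for speech, noise in pairs]
    return sum(values) / len(values)

# Synthetic example: speech amplitude ten times the noise amplitude.
speech = [1.0] * 100
noise = [0.1] * 100
print(round(snr_db(speech, noise), 6))
```

The per-speaker difference reported in Table 3 would then be the average obtained for the Lombard utterances minus the average obtained for the normal ones.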

Table 2
Continuous speech sentences of the corpus used in the statistical study. The corresponding phonetic transcription in SAMPA notation for each sentence (Table 1) is also presented

 1. Hay que urdir un plan convincente.
    aikeurdirunplankonbinTente
 2. Este segmento es excesivamente largo.
    estesegmentoeseksTesibamentelargo
 3. Se puso amarillo de un acceso de ictericia.
    sepusoamariLodeunakTesodeikteriTia
 4. El problema del agua es arduo.
    elproblemadelaguaesarduo
 5. Tengo una ganga de coche.
    tengounagangadekotSe
 6. Déjalo cocer para eliminar el ácido.
    dexalokoTerparaeliminarelaTido
 7. Anoche por fin machucaron al bicho.
    anotSeporfinmatSukaronalbitSo
 8. Era de color añil en mi sueño.
    eradekoloraJilenmisueJo
 9. Hay una torre horrible allá en la sierra.
    aiunatorreorribleaLaenlasierra
10. El circo ambulante no vendrá este año.
    elTirkoambulantenobendraesteaJo
11. Antes del bailarín entró el payaso.
    antesdelbailarinentroelpajjaso
12. No hay atajos para el futuro.
    noaiataxosparaelfuturo
13. Una zarza fue adornada con un lazo azul.
    unaTarTafueadornadakonunlaToaTul


Taking into account the frequency of appearance of the phonetic units in the corpus, and the sub-sample frequency selected, 125 Hz (i.e., a parametric representation, or frame, of the speech signal every 7.2 ms), the maximum, minimum and average number of frames per phonetic unit were respectively 18.7 (for [tS]), 4.0 (for [r]) and 9.2.

2.3. Acoustic features

Over the last few years, a large variety of acoustic features have been studied in the literature to analyze the Lombard effect with respect to normal speech (Stanton et al., 1988; Summers et al., 1988; Bond et al., 1989; Junqua and Anglade, 1990; Hajislam et al., 1992). The acoustic features considered in this work are presented in Table 4. The computational procedures for the features and the implementation details can be found in (Alinat, 1991).
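Two of the features in Table 4, the Low-band and High-band Spectral Tilt, are defined in this section as the slope of a regression line fitted to the log energy values of the filter-bank channels (the first 33 of the 45 channels for the low band, the last 12 for the high band). A minimal sketch of that computation is given here. It is not the implementation of (Alinat, 1991): the function names are hypothetical, the regression is taken against the channel index rather than against frequency in Hz, and the log energies are expressed in dB, all of which are assumptions.

```python
import math

def regression_slope(values):
    """Least-squares slope of `values` against their index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two points are needed for a slope")
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx

def spectral_tilts(channel_energies, n_low=33, n_high=12, floor=1e-12):
    """Return (low-band tilt, high-band tilt) for one frame.

    channel_energies: linear energies of the filter-bank channels,
    ordered from the lowest to the highest center frequency.
    Assumption: tilt is measured in dB per channel.
    """
    if len(channel_energies) < max(n_low, n_high):
        raise ValueError("not enough channels for the requested bands")
    # The floor avoids log10(0) for silent channels.
    log_e = [10.0 * math.log10(max(e, floor)) for e in channel_energies]
    return regression_slope(log_e[:n_low]), regression_slope(log_e[-n_high:])

# Synthetic frame whose energy falls by exactly 1 dB per channel.
frame = [10 ** (-0.1 * i) for i in range(45)]
low_tilt, high_tilt = spectral_tilts(frame)
print(round(low_tilt, 6), round(high_tilt, 6))
```

With these conventions, a flatter spectrum in the low band gives a less negative low-band slope, i.e. an increase of the tilt value.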

Although the majority of the acoustic features in Table 4 are in accordance with previous studies, some differences have been introduced. The computation of the Fricative Formant is quite similar to that of the formants F1 and F2 (Alinat, 1991). However, in this case, FF is evaluated in the high frequency band (1.2-8 kHz). The FF is used to characterize the fricative consonants. The Friction Percentage is proportionally evaluated in terms of medium (0.5-1.2 kHz) and high (1.2-8 kHz) frequency energy ratios and inversely proportional to the degree of sonority in the same frequency bands (Alinat, 1991). The FP takes maximum values when a frication exists. The two spectral tilt measures for each frame were computed from the filter-bank representation of the spectrum. The Low-band Spectral Tilt was computed with the first thirty-three frequency channels corresponding to the range 0-3 kHz. The High-band Spectral Tilt was computed by considering the last twelve channels of the spectrum which correspond to the range 3-8 kHz. In each case, the slope of the regression line of the corresponding log energy values was used as the estimated value of the spectral tilt.

Table 3
Average energy difference between Lombard and normal speech for each speaker

                  Speaker  E1 (dB)  E2 (dB)  E2 - E1 (dB)
Female speakers   MCB           53       66            13
                  IBT           56       72            16
                  IAG           53       69            16
                  ESS           55       69            14
                  ITB           57       66             9
Male speakers     JBR           55       61             6
                  AJC           51       66            15
                  JSP           53       57             4
                  JPG           53       66            13
                  JGA           54       65            11

E1: average energy for normal speech
E2: average energy for Lombard speech
E2 - E1: difference of average energies

Table 4
Acoustic features selected

Phone Duration (PD)
Total Energy (TE)
Low-band Spectral Tilt (LST)
High-band Spectral Tilt (HST)
Pitch (P)
First Formant (F1)
Second Formant (F2)
Fricative Formant (FF)
Friction Percentage (FP)
Energy between 0-250 Hz
Energy between 250-500 Hz
Energy between 500-1,000 Hz
Energy between 1,000-2,000 Hz
Energy between 2,000-3,000 Hz
Energy between 3,000-4,000 Hz
Energy between 4,000-5,000 Hz
Energy between 5,000-6,000 Hz
Energy between 6,000-7,000 Hz
Energy between 7,000-8,000 Hz

2.4. Procedure

A two-way analysis of variance (carried out using the SYSTAT 3.0 statistical software package run on a PC) was applied to study the statistical significance of the differences between Lombard and normal speech. First, SYSTAT was used to run some preliminary tests on the data to check the hypotheses of equal variance and Normal distribution for the populations (Peña, 1982). As a result, both hypotheses could be assumed in general for the available data, but some outlier values were rejected in order to guarantee the reliability of the statistical analyses. Then, two factors were selected for the two-way tests (Rohatgi, 1976): the kind of speech (normal and Lombard) and the gender of the speakers (male and female). A 0.05 level of significance was established to test whether the differences were significant. Specifically, the purpose of the two-way variance analysis was to examine the statistical significance of the differences between Lombard and normal values corresponding to each feature for each phonetic unit in three cases: female speakers, male speakers, and male and female speakers together.
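The two-way test with the factors "kind of speech" and "gender" can be illustrated with the textbook sums-of-squares decomposition for a balanced design. The sketch below is not the SYSTAT procedure used in the study: it assumes the same number of observations in every cell (which real corpus data would not generally satisfy), it uses a hypothetical function name and toy values, and it returns only the F statistics, which would then be compared with the tabulated critical value for the 0.05 level.

```python
from statistics import mean

def two_way_anova(cells):
    """F statistics of a balanced two-way ANOVA with interaction.

    cells maps (speech, gender) -> list of observations of one feature
    for one phonetic unit. Assumption: every cell has the same size n >= 2.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if len(cells) != len(a_levels) * len(b_levels):
        raise ValueError("every (speech, gender) combination is required")
    if n < 2 or any(len(v) != n for v in cells.values()):
        raise ValueError("balanced cells with n >= 2 are required")

    grand = mean(x for v in cells.values() for x in v)
    cell_mean = {k: mean(v) for k, v in cells.items()}
    a_mean = {a: mean(cell_mean[(a, b)] for b in b_levels) for a in a_levels}
    b_mean = {b: mean(cell_mean[(a, b)] for a in a_levels) for b in b_levels}

    # Sums of squares for both main effects, the interaction and the error.
    ss_a = len(b_levels) * n * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ss_b = len(a_levels) * n * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ss_ab = n * sum(
        (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
        for a in a_levels for b in b_levels
    )
    ss_e = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    df_ab = df_a * df_b
    df_e = len(a_levels) * len(b_levels) * (n - 1)
    ms_e = ss_e / df_e  # assumes some within-cell variability (ms_e > 0)
    return {
        "speech": (ss_a / df_a) / ms_e,
        "gender": (ss_b / df_b) / ms_e,
        "interaction": (ss_ab / df_ab) / ms_e,
        "df_error": df_e,
    }

# Toy values (not data from the study).
toy = {
    ("normal", "female"): [1.0, 3.0],
    ("normal", "male"): [2.0, 4.0],
    ("lombard", "female"): [5.0, 7.0],
    ("lombard", "male"): [6.0, 8.0],
}
print(two_way_anova(toy))
```

In this toy case the speech factor gives F = 16 with 1 and 4 degrees of freedom, which exceeds the tabulated 0.05 critical value of about 7.71, while the gender factor (F = 1) does not.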

Afterwards, in order to give a quantitative measure of the differences that were found to be significant, the averages of the normal speech values and the Lombard speech values were computed. For each average, a 95% confidence interval for the mean of the population was obtained (Rohatgi, 1976). The gap between the upper bound of the lower interval and the lower bound of the upper interval provided a (positive or negative) "minimal difference" between the means of the two (normal and Lombard) populations, whenever the hypothesis of different means was supported by the two-way test. This difference between bounds of the confidence intervals has been called "minimal" because the difference between the means of the normal and Lombard populations should be at least as large as this gap. Finally, in order to obtain a relative measure of the changes, this minimal difference was computed as a percentage with regard to the normal average.
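The "minimal difference" defined above can be sketched as follows. The function names are hypothetical, and the confidence interval uses the normal approximation (z quantile) instead of the Student t quantile, which is an assumption that is reasonable only for large samples. The exact interval used in the study follows (Rohatgi, 1976) and is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(values, level=0.95):
    """Confidence interval for the population mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m - half_width, m + half_width

def minimal_difference_percent(normal, lombard, level=0.95):
    """Signed gap between the two confidence intervals, as a percentage
    of the normal-speech average.

    Positive when the Lombard interval lies above the normal one,
    negative when it lies below. Meaningful only when the means were
    already found to differ significantly (the intervals do not overlap).
    """
    n_low, n_high = confidence_interval(normal, level)
    l_low, l_high = confidence_interval(lombard, level)
    if mean(lombard) >= mean(normal):
        gap = l_low - n_high   # lower bound of upper minus upper bound of lower
    else:
        gap = l_high - n_low   # negative gap: Lombard values are smaller
    return 100.0 * gap / mean(normal)

# Toy values (not data from the study).
normal = [9.0, 10.0, 11.0] * 10
lombard = [14.0, 15.0, 16.0] * 10
print(round(minimal_difference_percent(normal, lombard), 2))
```

For these toy values the raw difference of the averages is 50% of the normal average, while the minimal difference is about 44%, which shows how the measure is deliberately conservative.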

3. Experimental results on the differences between normal and Lombard Spanish speech

Tables 5-9 show the percentage variations (minimal differences) of Lombard with regard to normal speech for each phonetic unit in each category and for each feature proposed. In each table, the global results for the ten speakers (G) and the separate results for the five female (F) and the five male (M) speakers are presented. The symbol "X" means that no statistically significant differences were found. In the tables, a blank cell means that such a feature has not been evaluated for the phonetic unit. In Figs. 1-5, the percentage variations of the Spectral Energy Distribution are shown (in a chart representation) for each phonetic unit and are grouped by categories. The chart representation illustrates the shifts of energy along the frequency intervals and allows for a better comparison of the behaviour of the different phonetic units. Two charts in each figure present the evolution of Lombard-normal differences for the female and male speaker groups.

For the plosives, only the increases of the Low-band Spectral Tilt are worth noting (Table 5). In addition, moderate decreases of the High-band Spectral Tilt and moderate increases of the Pitch are observed for the voiced plosives. Changes in the Spectral Distribution of energy were, in general, small for unvoiced and voiced plosives (Fig. 1), except for the voiced [d] which presented moderate increases in the medium and high frequencies.

Table 5
Percentage variations (100 * (Lombard average - Normal average)/Normal average) of the acoustic features (Phone Duration (PD), Total Energy (TE), Low-Band Spectral Tilt (LST), High-Band Spectral Tilt (HST), Pitch (P) and Friction Percentage (FP)) for unvoiced and voiced plosives. Global results for all speakers (G) and separate results for female (F) and male (M) speakers are presented. "X" means that no statistically significant differences were found

               [p]                 [t]                 [k]                 [b]                 [d]                 [g]
           G     F     M       G     F     M       G     F     M       G     F     M       G     F     M       G     F     M
PD         X     X     X       X     X     X       X     X     X       X     X     X       X     X     X       X     X     X
TE         X     X     X     -13   -25   -13       X  -2.4     X       X     X     X      26    26    26       X     X     X
LST       49    49    49     216   216   216     466   193   155      51    78     7     212   287    67      85    85    85
HST        X     X     X    -0.1     X   0.1    -3.9     X   3.9     -42   -42   -42    -132  -205   -11     -52     X   -52
P                                                                     29    22    58      21    21    21      24    24    24
FP         X     X     X     -10   -10   -10       X     X     X       X     X     X      -6    -6     X       X     X    62

Fig. 1. Percentage variations of the spectral energy distribution for unvoiced and voiced plosives. (a) Female speakers. (b) Male speakers. Horizontal axis: Frequency (kHz).

In general, moderate changes were found for nasals, particularly for the Low and High Spectral Tilt, the Pitch and the Friction Percentage (Table 6). The influence of the sex of the speakers is important in the majority of these changes. Nasals showed a general shift in the Spectral Energy Distribution from low to medium and high frequency bands (Fig. 2), although different kinds of shifts appeared for females than for males.

Table 6
Percentage variations (100 * (Lombard average - Normal average)/Normal average) of the acoustic features (Phone Duration (PD), Total Energy (TE), Low-Band Spectral Tilt (LST), High-Band Spectral Tilt (HST), Pitch (P) and Friction Percentage (FP)) for nasals. Global results for all speakers (G) and separate results for female (F) and male (M) speakers are presented. "X" means that no statistically significant differences were found

               [m]                 [n]                 [J]
           G     F     M       G     F     M       G     F     M
PD         X     X     X       X     X     X       X     X     X
TE       2.1     X   2.1       9     9    21      44    19    59
LST      183   183   183     191   164   226     148   148   148
HST      -86  -141   -39     -88  -120   -42    -155  -155  -155
P         38    27    58      39    27    58      43    29    68
FP        50    50    50      63    16    86     181   162   140

Fig. 2. Percentage variations of the spectral energy distribution for nasals. (a) Female speakers. (b) Male speakers. Horizontal axis: Frequency (kHz).

Table 7
Percentage variations (100 * (Lombard average - Normal average)/Normal average) of the acoustic features (Phone Duration (PD), Total Energy (TE), Low-Band Spectral Tilt (LST), High-Band Spectral Tilt (HST), Pitch (P), Fricative Formant (FF) and Friction Percentage (FP)) for fricatives and affricate. Global results for all speakers (G) and separate results for female (F) and male (M) speakers are presented. "X" means that no statistically significant differences were found

               [f]                 [T]                 [s]                 [x]                [jj]                [tS]
           G     F     M       G     F     M       G     F     M       G     F     M       G     F     M       G     F     M
PD         X     X     X    -0.5  -0.5     X       X     X     X       X     X     X       X     X     X       X     X     X
TE       -30   -30   -30     -23   -39  -1.5     -23   -28   -15     -17   -17   -17      67    40    81       X     X     X
LST        X   -26   291       X    29     X     -13   -13   -13       X   -11    25     641   442   906     -11   -11   -11
HST       47    58    13      38    38    38      42    42    42      19    19    19    -147  -147  -147      22    22    22
P                                                                                         38    25    60
FF      -1.0  -4.5  -1.0      -9   0.1   -27     2.0   2.0   4.7       X     X   3.1     2.0    13   2.0      10    13   2.5
FP      -4.3  -4.3     X      -6    -6   -13       X     X     X       X  -0.1     X      25    25    25       X     X     X

Fig. 3. Percentage variations of the spectral energy distribution for fricatives and affricate. (a) Female speakers. (b) Male speakers. Horizontal axis: Frequency (kHz).

Fig. 4. Percentage variations of the spectral energy distribution for liquids. (a) Female speakers. (b) Male speakers. Horizontal axis: Frequency (kHz).

Table 8
Percentage variations (100 * (Lombard average - Normal average)/Normal average) of the acoustic features (Phone Duration (PD), Total Energy (TE), Low-Band Spectral Tilt (LST), High-Band Spectral Tilt (HST), Pitch (P) and Friction Percentage (FP)) for liquids. Global results for all speakers (G) and separate results for female (F) and male (M) speakers are presented. "X" means that no statistically significant differences were found

               [l]                 [r]                [rr]
           G     F     M       G     F     M       G     F     M
PD         X     X     X    -1.9     X  -1.9       X     X     X
TE         8   3.7     8      33    22    36      34    34    55
LST      306   629   192     581   581   581     297   535   215
HST     -102  -149   -49     -83   -95   -57     -89   -27  -137
P         39    28    60      36    28    53      42    31    56
FP        19    19    19     -14   -14   -14     -46   -26   -53

Two different behaviours in fricatives and the affricate under the Lombard effect can be observed in Table 7. For almost all of them, small or moderate changes in all the features are shown. However, the voiced palatal [jj] presented opposite tendencies in the changes, with a very marked increase in the Low-band Spectral Tilt. A common tendency for all fricatives and the affricate of a small decrease in all bands is shown in Fig. 3, with the exception of [jj], which had a different behaviour; that is, relevant increases of the energy in the middle and high frequency bands. These increases follow different patterns depending upon the sex of the speakers.

For the liquids, small or moderate changes of the Total Energy, the Pitch and the Friction Percentage are shown in Table 8. The changes in both the Spectral Tilts are even greater. The Energy Distribution for liquids shifted from low to medium and high frequency bands, although the shifts are clearly different between female and male speakers for [l] and [rr] (Fig. 4).

The results of the tests on vowels showed a certain homogeneity among them (Table 9) for the Phone Duration, the Total Energy, the Pitch and the First and Second Formants. A clear shift in the Spectral Energy Distribution was indicated by the large decrease in the High-band Spectral Tilt and the much greater increase in the Low-band Spectral Tilt. The common shift from low to medium and high frequencies had a different behaviour when the sex of the speakers was considered (Fig. 5).

Table 9
Percentage variations (100 * (Lombard average - Normal average)/Normal average) of the acoustic features (Phone Duration (PD), Total Energy (TE), Low-Band Spectral Tilt (LST), High-Band Spectral Tilt (HST), Pitch (P), First Formant (F1) and Second Formant (F2)) for vowels. Global results for all speakers (G) and separate results for female (F) and male (M) speakers are presented. "X" means that no statistically significant differences were found

               [i]                 [e]                 [a]                 [o]                 [u]
           G     F     M       G     F     M       G     F     M       G     F     M       G     F     M
PD       3.0   3.0   3.0      13    13    13      11    11    11      18    18    18       X     X     X
TE        18    18    18      19    17    20      24    19    29      15    15    15    -1.8  -1.8  -1.8
LST     3761  3761  3761    7711  1616  1869     427   255  3905     170   281   103      81    81    81
HST      -98  -143   -53    -106  -154   -58    -108  -169   -58    -144  -197   -91    -115  -139   -82
P         42    30    65      38    28    58      41    30    62      35    26    53      40    29    60
F1        15     9    21      20    23    16      17    17    17      22    23    21      17    17    17
F2       1.3   2.3   1.3       6     9   2.0     0.7   0.7   0.7     3.3   3.3   3.3       6     6     6

Fig. 5. Percentage variations of the spectral energy distribution for vowels. (a) Female speakers. (b) Male speakers. Horizontal axis: Frequency (kHz).

4. Discussion of the results

In this section, the main results are discussed and the behaviour of the more relevant acoustic features is analyzed. This study was carried out taking into consideration the acoustic-phonetic characteristics of sonority, manner of articulation and spectral shape. In addition, some results of other authors are also set forth. Although the selected works carried out an acoustic-phonetic analysis similar to ours, the differences in the recording conditions, the number and characteristics of the speakers, the type of speech corpus selected, etc. only allow for comparisons of an approximate nature.

The statistical analysis of the Phone Duration showed that small increases could be observed only for vowels, especially for medium and open vowels. A similar behaviour was reported in the literature (Stanton et al., 1988; Anglade and Junqua, 1990; Junqua, 1993).

For those phonetic units which presented statistically significant variations of the High Spectral Tilt, a different behaviour between voiced and unvoiced phonetic units was found. For voiced phonetic units, a remarkable decrease in the High Spectral Tilt was observed, which indicates a clear shift from high to medium frequencies. In some cases, this phenomenon was more pronounced for females than for males. This behaviour fully agrees with the general increase in medium frequencies of the Spectral Energy Distribution for voiced units, especially for those with a stable formant zone (segments where the formants have more or less stable positions), such as nasals, the voiced fricative [jj], liquids and vowels (Figs. 2-5). For unvoiced units, the differences in the behaviour of the High Spectral Tilt were also in accordance with the differences in the Spectral Distributions. A slight shift from medium to high frequencies (corresponding to a small increase of the High Spectral Tilt) was found for the unvoiced fricatives, whereas for unvoiced plosives no statistically significant variations were observed.

In general, a homogeneous behaviour was found for the Low Spectral Tilt: a shift from low to medium frequencies (corresponding to relevant increases of the Low Spectral Tilt) was observed for all categories except for unvoiced fricatives. These units presented heterogeneous results for the Low Spectral Tilt, although the differences were small in all cases where they were statistically significant. The previous studies by Stanton et al. (1988), Anglade and Junqua (1990) and Junqua (1993) reported similar results for both Low and High Spectral Tilt, although only for voiced units.

A general increase of the Pitch was observed in all the voiced phonetic units. This behaviour is similar to that reported in the literature (Stanton et al., 1988; Summers et al., 1988; Bond et al., 1989). However, it is important to notice that the pattern is extremely homogeneous among all of them, mainly for categories with a stable formant zone (nasals, liquids and vowels). On the other hand, in accordance with the results reported by Anglade and Junqua (1990) and Junqua (1993), the increase for male speakers was larger than for female speakers.
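The gender difference in the Pitch increase can be checked directly against the Pitch row of Table 9. The following minimal sketch simply averages the reported percentage variations over the five vowels; the dictionary layout and the function name are assumptions made for the example, while the numbers are copied from the table.

```python
from statistics import mean

# Pitch (P) percentage variations for vowels, copied from Table 9.
# G = all speakers, F = female speakers, M = male speakers.
PITCH_VARIATION = {
    "i": {"G": 42, "F": 30, "M": 65},
    "e": {"G": 38, "F": 28, "M": 58},
    "a": {"G": 41, "F": 30, "M": 62},
    "o": {"G": 35, "F": 26, "M": 53},
    "u": {"G": 40, "F": 29, "M": 60},
}


def mean_variation(table, group):
    """Average the percentage variation of one speaker group over all vowels."""
    return mean(row[group] for row in table.values())


def male_increase_always_larger(table):
    """True if the male increase exceeds the female increase for every vowel."""
    return all(row["M"] > row["F"] for row in table.values())


if __name__ == "__main__":
    for group in ("G", "F", "M"):
        print(group, round(mean_variation(PITCH_VARIATION, group), 1))
    print(male_increase_always_larger(PITCH_VARIATION))
```

For these values the male average is roughly twice the female average, and the male increase is the larger one for every vowel, which is consistent with the homogeneity noted above.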

A small positive shift of the First Formant position from normal to Lombard speech was found. This shift, which was quite homogeneous for all vowels in the general case, appeared to be more variable when the influence of gender was considered. This result was, in principle, in line with previous studies (Stanton et al., 1988; Summers et al., 1988), although Anglade and Junqua (1990) and Junqua (1993) observed more homogeneity for female speakers than for male speakers. Contrary to the variability of the results for the Second Formant reported by Stanton et al. (1988) and Bond et al. (1989), a homogeneous increase in the Second Formant position was shown, although it was very small. Anglade and Junqua (1990) and Junqua (1993) also found a consistent increase in the Second Formant, but only for females.

One of the most interesting results emerged from the study of the differences in Spectral Energy Distribution for categories of phonetic units. For all of the phonetic units with fricative components, such as the unvoiced affricate, unvoiced fricatives and unvoiced plosives (burst), a similar behaviour of the percentage variations of the Spectral Energy Distribution was found, which is in accordance with their similar spectral shape. This behaviour consisted of a decrease throughout the Spectral Distribution, which was less pronounced in the middle frequencies. This pattern of decrease was more important for unvoiced fricatives than for unvoiced plosives. Moreover, no remarkable differences were found when the gender of the speakers was considered. Anglade and Junqua (1990) and Junqua (1993) also reported an energy decrease in all the frequency bands. However, Stanton et al. (1988) observed an increase in the interval of 4-8 kHz for unvoiced fricatives and plosives.

With regard to the voiced units, two patterns could be clearly distinguished in relation to their spectral shape: vowels, liquids, nasals and the voiced fricative [jj], with stable formant zones in their Spectral Energy Distribution; and voiced plosives, without them.

For the voiced plosives, a moderate increase of the Spectral Distribution in the middle frequencies was observed. This behaviour was similar for all of these units except for the dental [d], where this increase was greater. As in the previous case, no remarkable differences were found between male and female speakers.

The behaviour for all of the voiced phonetic units with a pronounced stable formant zone was very similar: a decrease in low frequencies followed by an important increase in the middle and high frequencies. These results were closer to those of Anglade and Junqua (1990) than to those of Stanton et al. (1988). However, a very remarkable difference in the Spectral Distribution between male and female speakers was found. For female speakers, the increase of the Energy Distribution was centred at 5 kHz, whereas for male speakers this increase was centred between 2 and 4 kHz, with a valley in the Spectral Energy Distribution at 5 kHz. This pattern was remarkably homogeneous across all of these categories. Moreover, this behaviour was in accordance with the analysis carried out for the Low and High Spectral Tilts.

5. Conclusions

A characterization of the global tendencies of some acoustic-phonetic features in different phonetic units for Lombard speech was studied using Spanish as the reference language. Moreover, the influence of the gender of the speakers on the characterization of these tendencies was also studied. Two-way analyses of variance allowed for an extensive study of each of the 19 acoustic features on 24 Spanish phonetic units, using a database consisting of 260 Lombard utterances and 260 normal utterances. The utterances were recorded by 10 speakers (5 females and 5 males), and yielded a total of 14,240 instances of the phonetic units to be analyzed.

The statistical results and their subsequent study showed, in general, similar results to those achieved by previous studies reported in the literature. More concretely, the results reported here confirm the changes produced in Lombard speech with regard to normal speech relating to the vowel duration, the Pitch, the Spectral Tilt and the vowel Formants. Nevertheless, some new tendencies have been observed from the outcome of the statistical tests. On the one hand, the pattern of the Pitch is extremely homogeneous among all voiced phonetic units, mainly for categories with a stable formant zone. This pattern includes clear and also homogeneous differences between males and females in all the voiced units. On the other hand, the tendency of the changes in the Spectral Tilts also showed great homogeneity. This was confirmed, and described in more detail, by the differences of the Spectral Energy Distribution. The shifts of energy presented a homogeneous pattern, particularly for voiced phonetic units, pointing out clear differences between female and male speakers.

In general, the application of the results of this work, as well as the results of other studies on Lombard effect, to the development of ASR systems with high accuracy rates within noisy environments could still require more research.
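For readers who wish to reproduce this kind of analysis, the two-way analysis of variance mentioned in the conclusions can be sketched as below. This is a minimal sketch under stated assumptions, not the procedure actually used in this work: it assumes that the two factors are speaking condition and gender, that the design is complete and balanced (the same number of observations in every cell), and that only the F statistics are required. The function name and the sample pitch values are hypothetical.

```python
from statistics import mean


def two_way_anova_f(cells):
    """F statistics for a balanced two-way design with replication.

    `cells` maps (level_a, level_b) to a list of observations. Every
    combination of levels must be present with the same number (>= 2)
    of observations.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    sizes = {len(values) for values in cells.values()}
    if len(cells) != len(a_levels) * len(b_levels) or len(sizes) != 1:
        raise ValueError("a complete, balanced design is assumed")
    n = sizes.pop()
    if n < 2 or len(a_levels) < 2 or len(b_levels) < 2:
        raise ValueError("need >= 2 levels per factor and >= 2 values per cell")

    grand = mean(x for values in cells.values() for x in values)
    a_mean = {a: mean(x for b in b_levels for x in cells[(a, b)]) for a in a_levels}
    b_mean = {b: mean(x for a in a_levels for x in cells[(a, b)]) for b in b_levels}
    cell_mean = {key: mean(values) for key, values in cells.items()}

    # Sums of squares for the two main effects, the interaction and the error.
    ss_a = n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ss_b = n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ss_ab = n * sum(
        (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
        for a in a_levels
        for b in b_levels
    )
    ss_err = sum(
        (x - cell_mean[key]) ** 2 for key, values in cells.items() for x in values
    )

    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    df_err = len(a_levels) * len(b_levels) * (n - 1)
    ms_err = ss_err / df_err  # raises ZeroDivisionError if there is no within-cell spread
    return {
        "factor_a": (ss_a / df_a) / ms_err,
        "factor_b": (ss_b / df_b) / ms_err,
        "interaction": (ss_ab / (df_a * df_b)) / ms_err,
    }


if __name__ == "__main__":
    # Hypothetical pitch values (Hz); factor A = condition, factor B = gender.
    pitch = {
        ("normal", "F"): [200.0, 210.0],
        ("normal", "M"): [100.0, 110.0],
        ("lombard", "F"): [260.0, 270.0],
        ("lombard", "M"): [160.0, 170.0],
    }
    print(two_way_anova_f(pitch))
```

Each F statistic would then be compared with the critical value of the F distribution for the corresponding degrees of freedom in order to decide whether the effect is statistically significant.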

Acknowledgements

This work was partially supported by the ESPRIT-II program of the CEE, 5516-ROARS project. Antonio Castellanos was also supported by a postgraduate grant from the Spanish “Ministerio de Educación y Ciencia”. The authors thank the staff of the ROARS project for numerous contributions in carrying out this project and for their help in collecting and analyzing the speech corpora. The authors also thank the anonymous reviewers who helped to improve the quality and the presentation of this paper.

References

P. Alinat (1991), “Functional specification for analyser and pitch determination”, ESPRIT-II (5516) ROARS Project, D11 Report, January 1991.

P. Alinat, Ed. (1994), “Final public report”, ESPRIT II (5516) ROARS Project, Thomson Sintra Activités Sous-Marines, Sophia-Antipolis, June 1994.

Y. Anglade and J.C. Junqua (1990), “Acoustic-phonetic study of Lombard speech in the case of isolated-words”, in: L. Torres, E. Masgrau and M.A. Laguna, Eds., Signal Processing V: Theories and Applications (Elsevier, Amsterdam), pp. 1195-1198.

T.H. Applebaum and B.A. Hanson (1990), “Robust speaker-independent word recognition using spectral smoothing and temporal derivatives”, in: L. Torres, E. Masgrau and M.A. Laguna, Eds., Signal Processing V: Theories and Applications (Elsevier, Amsterdam), pp. 1183-1186.

J.M. Benedi, I. Benlloch, I. Torres, J.A. Gómez and M.J. Castro (1992), “Acoustic-phonetic knowledge for the Spanish version of the ROARS system”, ESPRIT-II (5516) ROARS Project, D13 Report, December 1992.

Z. Bond, T. Moore and B. Gable (1989), “Acoustic-phonetic characteristics of speech produced in noise and while wearing an oxygen mask”, J. Acoust. Soc. Amer., Vol. 85, pp. 907-912.

D.A. Cairns and J.H.L. Hansen (1992), “ICARUS: An Mwave based real-time speech recognition system in noise and Lombard effect”, Proc. Internat. Conf. on Spoken Language Processing, pp. 703-706.

F. Casacuberta, R. Garcia, J. Llisterri, C. Nadeu, J.M. Pardo and A. Rubio (1991), “Development of a Spanish corpora for the speech research (ALBAYZIN)”, Workshop on International Cooperation and Standardization of Speech Databases and Speech I/O Assessment Methods, CEC DGXIII, ESCA and ESPRIT PROJECT 2589 “SAM”, Chiavari, 26-28 September 1991.

A. Castellanos and F. Casacuberta (1992), “A study of the ROARS acoustic features for Spanish speech produced with Lombard effect”, ESPRIT II (5516) ROARS Project, D26 Report, December 1991. Addendum I, August 1992. Addendum II, December 1992.

S. Dvorak and T. Hörmann (1991), “High-performance speech recognition in noise by continuously updated reference templates”, Proc. European Conf. on Speech, Communication and Technology, Vol. 3, pp. 1375-1378.

R. Hajislam, J.M. Pierrel, D. Fohr, Y. Anglade and D. François (1992), “Technical report on changes in speaker articulation due to ambient noise (Complement Report)”, ESPRIT II ROARS Project, D25 Report, September 1992.

J.H.L. Hansen and O.N. Bria (1992), “Improved automatic recognition of speech in noise and Lombard effect”, in: J. Vandewalle, R. Boite, M. Moonen and A. Oosterlinck, Eds., Signal Processing VI: Theories and Applications (Elsevier, Amsterdam), pp. 403-406.

B.A. Hanson and T.H. Applebaum (1993), “Subband or cepstral domain filtering for recognition of Lombard and channel-distorted speech”, Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Vol. II, pp. 79-82.

J.C. Junqua (1993), “The Lombard reflex and its role on human listeners and automatic speech recognizers”, J. Acoust. Soc. Amer., Vol. 93, pp. 510-524.

J.C. Junqua and Y. Anglade (1990), “Acoustic and perceptual studies of Lombard speech: Application to isolated-word automatic speech recognition”, Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Vol. 2, pp. 841-844.

J.C. Junqua and H. Wakita (1989), “A comparative study of cepstral lifters and distance measures for all-pole models of speech in noise”, Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., pp. 476-479.

H. Lane and B. Tranel (1971), “The Lombard sign and the role of hearing in speech”, J. Speech Hearing Research, Vol. 14, pp. 677-709.

E. Lombard (1911), “Le signe de l’élévation de la voix”, Annales des Maladies de L’Oreille et du Larynx, Vol. XXXVII, No. 2, pp. 101-119.

B. Mak, J.C. Junqua and B. Reaves (1992), “A robust speech/non-speech detection algorithm using time and frequency-based features”, Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Vol. I, pp. 269-272.

D. Peña (1982), Estadística: Modelos y Métodos (Alianza Editorial), Vol. II.

H. Pick, G. Siegel, P. Fox, S. Garber and J. Kearney (1989), “Inhibiting the Lombard effect”, J. Acoust. Soc. Amer., Vol. 85, pp. 894-900.

A. Quillis (1988), Fonética Acústica de la Lengua Española (Editorial Gredos).

B. Stanton, L. Jamieson and G. Allen (1988), “Acoustic-phonetic analysis of loud and Lombard speech in simulated cockpit conditions”, Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Vol. 1, pp. 331-334.

B. Stanton, L. Jamieson and G. Allen (1989), “Robust recognition of loud and Lombard speech in the fighter cockpit environment”, Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Vol. 2, pp. 675-678.

V. Summers, D. Pisoni, R. Bernacki, R. Pedlow and M. Stokes (1988), “Effects of noise on speech production: Acoustic and perceptual analyses”, J. Acoust. Soc. Amer., Vol. 84, pp. 917-928.

V. Summers, K. Johnson, D. Pisoni and R. Bernacki (1989), “An addendum to ‘Effects of noise on speech production: Acoustic and perceptual analyses’”, J. Acoust. Soc. Amer., Vol. 86, pp. 1717-1721.

J.C. Wells (1989), “Computer-code phonemic notation of individual languages of the European Community”, J. Internat. Phonetic Association, Vol. 1, pp. 31-54.

V.K. Rohatgi (1976), An Introduction to Probability Theory and Mathematical Statistics (Wiley, New York).