Proc Fonetik 2005


  • Proceedings

    FONETIK 2005 The XVIIIth Swedish Phonetics Conference

    May 25-27, 2005

    Department of Linguistics, Göteborg University

  • Proceedings FONETIK 2005, The XVIIIth Swedish Phonetics Conference, held at Göteborg University, May 25-27, 2005. Edited by Anders Eriksson and Jonas Lindh. Department of Linguistics, Göteborg University, Box 200, SE 405 30 Göteborg

    ISBN 91-973895-9-5

    © The Authors and the Department of Linguistics

    Cover photo and design: Anders Eriksson

    Printed by Reprocentralen, Humanisten, Göteborg University

  • Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University

    Preface

    This volume contains the contributions to FONETIK 2005, the Eighteenth Swedish Phonetics Conference, organized by the Phonetics group at Göteborg University on May 25-27, 2005. The papers appear in the order they were presented at the conference.

    Only a limited number of copies of this publication has been printed for distribution among the authors and those attending the conference. For access to electronic versions of the contributions, please look under:

    http://www.ling.gu.se/konferenser/fonetik2005/

    We would like to thank all contributors to the Proceedings. We are also indebted to Fonetikstiftelsen for financial support.

    Göteborg in May 2005

    On behalf of the Phonetics group

    Anders Eriksson, Åsa Abelin, Jonas Lindh


    Previous Swedish Phonetics Conferences (from 1986)

    I      1986  Uppsala University
    II     1988  Lund University
    III    1989  KTH Stockholm
    IV     1990  Umeå University (Lövånger)
    V      1991  Stockholm University
    VI     1992  Chalmers and Göteborg University
    VII    1993  Uppsala University
    VIII   1994  Lund University (Höör)
    -      1995  (XIIIth ICPhS in Stockholm)
    IX     1996  KTH Stockholm (Nässlingen)
    X      1997  Umeå University
    XI     1998  Stockholm University
    XII    1999  Göteborg University
    XIII   2000  Skövde University College
    XIV    2001  Lund University
    XV     2002  KTH Stockholm
    XVI    2003  Umeå University (Lövånger)
    XVII   2004  Stockholm University


    Contents

    Dialectal, regional and sociophonetic variation

    Phonological quantity in Swedish dialects: A data-driven categorization
    Felix Schaeffler

    Phonological variation and geographical orientation among students in a West Swedish small town school
    Anna Gunnarsdotter Grönberg

    On the phonetics of unstressed /e/ in Stockholm Swedish and Finland-Swedish
    Yuni Kim

    The interaction of word accent and quantity in Gothenburg Swedish
    My Segerup

    Speaker recognition and synthesis

    Visual acoustic vs. aural perceptual speaker identification in a closed set of disguised voices
    Jonas Lindh

    A model based experiment towards an emotional synthesis
    Jonas Lindh

    Annotating speech data for pronunciation variation modelling
    Per-Anders Jande

    Prosody - duration, quantity and rhythm

    Estonian rhythm and the Pairwise Variability Index
    Eva Liina Asu and Francis Nolan

    Duration of syllable-sized units in casual and elaborated Finnish: a comparison with Swedish and Spanish
    Diana Krull

    Language contact, second language learning and foreign accent

    The sound of 'Swedish on multilingual ground'
    Petra Bodén

    The communicative function of "sì" in Italian and "ja" in Swedish: an acoustic analysis
    Loredana Cerrato

    Presenting in English and Swedish
    Rebecca Hincks


    Phonetic aspects in translation studies
    Dieter Huber

    Scoring children's foreign language pronunciation
    Linda Oppelstrup, Mats Blomberg and Daniel Elenius

    Speech development and acquisition

    On linguistic and interactive aspects of infant-adult communication in a pathological perspective
    Ulla Bjursäter, Francisco Lacerda and Ulla Sundberg

    Durational patterns produced by Swedish and American 18- and 24-month-olds: Implications for the acquisition of the quantity contrast
    Lina Bonsdroff and Olle Engstrand

    /r/ realizations by Swedish two-year-olds: preliminary observations
    Petra Eklund, Olle Engstrand, Kerstin Gustafsson, Ekaterina Ivachova and Åsa Karlsson

    Tonal word accents produced by Swedish 18- and 24-month-olds
    Germund Kadin and Olle Engstrand

    Development of adult-like place and manner of articulation in initial sC clusters
    Fredrik Karlsson

    Poster session

    Phonological interferences in the third language learning of Swedish and German (FIST)
    Robert Bannert

    Word accents over time: comparing present-day data with Meyer's accent contours
    Linnéa Fransson and Eva Strangert

    Multi-sensory information as an improvement for communication systems efficiency
    Francisco Lacerda, Eeva Klintfors and Lisa Gustavsson

    Effects of stimulus duration and type on perception of female and male speaker age
    Susanne Schötz

    Effects of age of learning on VOT in voiceless stops produced by near-native L2 speakers
    Katrin Stölten

    Prosody - F0, intonation and phrasing

    Prosodic phrasing and focus productions in Greek
    Antonis Botinis, Stella Ganetsou and Magda Griva


    Syntactic and tonal correlates of focus in Greek and Russian
    Antonis Botinis, Yannis Kostopoulos, Olga Nikolaenkova and Charalabos Themistocleous

    Prosodic correlates of attitudinally-varied back channels in Japanese
    Yasuko Nagano-Madsen and Takako Ayusawa

    Speech perception

    Prosodic features in the perception of clarification ellipses
    Jens Edlund, David House, and Gabriel Skantze

    Perceived prominence and scale types
    Christian Jensen and John Tøndering

    The postvocalic consonant as a complementary cue to the perception of quantity in Swedish - a revisit
    Bosse Thorén

    Gender differences in the ability to discriminate emotional content from speech
    Juhani Toivanen, Eero Väyrynen and Tapio Seppänen

    Speech production

    Vowel durations of normal and pathological speech
    Antonis Botinis, Marios Fourakis and Ioanna Orfanidou

    Acoustic evidence of the prevalence of the emphatic feature over the word in Arabic
    Zeki Majeed Hassan

    Closing discussion

    Athens 2006 ISCA Workshop on Experimental Linguistics
    Antonis Botinis, Christoforos Charalabakis, Marios Fourakis and Barbara Gawronska

    Additional paper submitted for the poster session

    A positional analysis of quantity complementarity in Swedish with comparison to Arabic
    Zeki Majeed Hassan and Barry Heselwood

    Author index


    Phonological Quantity in Swedish dialects: A data-driven categorisation

    Felix Schaeffler
    Department of Philosophy and Linguistics, Umeå University

    Abstract

    This study presents a data-driven categorisation (cluster analysis) of 86 Swedish dialects, based on durational measurements of long and short vowels and consonants. The study reveals a clear geographic distribution that, for the most part, corresponds with dialectological descriptions. For a minor group of dialects, however, the results suggest mismatches between the quantity system and the observed segment durations. This phenomenon is discussed with reference to a theory of quantity change (Labov 1994).

    Introduction

    Phonological quantity in Standard Swedish is usually described as being complementary: Long vowels in closed syllables are followed by short consonants, while short vowels are always followed by a long consonant or a consonant cluster.

    This modern system has developed from a quantity system with independent vowel and consonant quantity, where all four possible combinations of long and short segments (VC, V:C, VC: and V:C:) existed. The modern system evolved by shortening of V:C: and lengthening of VC structures. Not all dialects of Modern Swedish have completed this change. Some dialects kept the four-way distinction. This applies to a group of dialects in the Finnish-Swedish region and in Dalarna in Western Sweden. Another group of dialects abandoned V:C: successions but kept VC structures. This has mainly been reported for large parts of Northern Sweden, but also for some places in Middle Sweden.

    There are, thus, today at least three different quantity systems in the dialects of Modern Swedish: 4-way systems (VC, V:C, VC: and V:C:), 3-way systems (VC, V:C and VC:) and 2-way systems with complementary quantity (V:C and VC:).

    Aims and Objectives

    In this study, a data-driven categorisation of Swedish dialects was performed, based on measurements of sound duration from 86 places in Sweden and Finland.

    This aimed at a categorisation of the dialects that was independent of traditional descriptions and categories, thus providing the possibility to discover new dialectal groups or typological categories.

    The study is an extension of Strangert & Wretling (2003), Schaeffler & Wretling (2003) and Schaeffler (2005), motivated by an extended data-set and methodological improvements.

    Material and Subjects

    The material used for this study was part of the SweDia corpus (www.swedia.nu). It comprised hand-segmented data from 976 speakers from 86 recording locations (usually 12 speakers per recording location). One word pair was investigated: tak vs. tack (English: 'roof' vs. 'thanks'). For historical reasons, these two words should have a V:C vs. VC: structure, even in those dialects where additional quantity patterns exist. In most dialects, the words consist of a voiceless dental or alveolar plosive, a low vowel and a voiceless velar plosive.

    Segmentation included the vocalic phase and the closure phase of the velar plosive. If present, preaspiration was marked as a separate segment, but was treated as a part of the consonantal closure. Four variables were measured: the durations of the long vowel, the short consonant, the short vowel and the long consonant.

    To arrive at a measure of central tendency for every recording location, the median of each variable was calculated for every speaker. The median of the speaker medians for every recording location then served as the value that was used for the data-driven categorisation. Medians instead of arithmetic means were calculated as, unlike the arithmetic mean, the median is not sensitive to outliers.
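    The two-step aggregation described above (a median per speaker, then a median of those medians per recording location) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually run on the SweDia data: the function name location_median, the speaker labels and all durations are hypothetical.

```python
from statistics import median


def location_median(speaker_measurements):
    """Median of the per-speaker medians for one recording location.

    speaker_measurements maps a speaker label to that speaker's measured
    durations (in ms) for a single variable, e.g. the long vowel in "tak".
    """
    speaker_medians = [median(values) for values in speaker_measurements.values()]
    return median(speaker_medians)


# Hypothetical durations in milliseconds (not taken from the corpus).
long_vowel = {
    "speaker_a": [180, 190, 400],  # 400 ms is a deliberate outlier
    "speaker_b": [170, 175, 185],
    "speaker_c": [200, 210, 205],
}

print(location_median(long_vowel))
```

    In this toy example the 400 ms outlier does not move the location value, which is the property that motivates medians over arithmetic means.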

    Data-driven categorisation

    The method of choice in this study was a hierarchical cluster analysis, with Euclidean distances as dissimilarity measures and the Ward method as the linkage criterion (see e.g. Gordon, 1999).

    Hierarchical clustering treats each object initially as a single cluster. In the next steps, the objects and resulting clusters are combined according to the dissimilarity measure and the linkage criterion, until all objects are joined in a single cluster. The increment of the linkage criterion is recorded during the process. It is usually displayed in a so-called dendrogram, which visualises the increment of the linkage criterion as the length of vertical branches in a tree structure. This information can be used for the selection of an adequate number of clusters.

    In the present study, the recording locations constituted the objects. Each recording location was described by four variables: the median durations of the four segments (the long and short vowels and consonants).

    It is often recommended in the literature to verify a hierarchical cluster analysis with a non-hierarchical method (Bortz, 1999). This has been done in the present study but did not lead to strongly deviating results. The results of the non-hierarchical method are therefore not reported here.
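    The agglomerative procedure described above can be illustrated with a small pure-Python sketch. It assumes the common formulation of Ward's criterion, in which the cost of merging two clusters is the resulting increase in the within-cluster sum of squared Euclidean distances; whether the study used exactly this variant is not stated. The function names and the four data points (V:, C, V, C: durations in ms) are hypothetical, and the naive search is meant for reading, not for efficiency.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * squared_distance(centroid(a), centroid(b))


def ward_clustering(points, n_clusters):
    """Merge the cheapest pair repeatedly until n_clusters remain.

    Returns the clusters and the recorded increments of the criterion,
    i.e. the values a dendrogram would display as branch lengths.
    """
    clusters = [[p] for p in points]  # every object starts as its own cluster
    merge_costs = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merge_costs.append(cost)
    return clusters, merge_costs


# Hypothetical location medians: (V:, C, V, C:) in milliseconds.
points = [
    (180, 150, 100, 190),
    (185, 155, 105, 195),
    (230, 120, 125, 270),
    (235, 125, 130, 275),
]

clusters, merge_costs = ward_clustering(points, n_clusters=2)
print(clusters)
print(merge_costs)
```

    A sudden jump in the recorded merge costs is the kind of signal that a dendrogram makes visible when choosing the number of clusters.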

    Results

    The visual inspection of the cluster dendrogram suggested a four-cluster solution as appropriate for the current analysis. Additionally, the parameter η² was calculated, which is usually used in analyses of variance to describe the amount of explained variance, but has also been suggested as a criterion for the estimation of different cluster solutions (Timm, 2002). The four-cluster solution led to an η² value of 0.68. For comparison, the three-cluster solution showed an η² of 0.32, and a ten-cluster solution showed an η² of 0.86. The large increment of η² from the three- to the four-cluster solution, followed by comparatively low increments, supported the four-cluster solution.
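    As a rough illustration of how an explained-variance measure can be computed for a given cluster solution, the sketch below uses one minus the ratio of the within-cluster sum of squares to the total sum of squares, summed over all variables. This multivariate definition is an assumption, since the exact formula used in the study is not given, and the function names and toy data are hypothetical.

```python
def sum_of_squares(points):
    """Sum of squared Euclidean distances from the points to their centroid."""
    n = len(points)
    center = [sum(column) / n for column in zip(*points)]
    return sum((x - c) ** 2 for p in points for x, c in zip(p, center))


def eta_squared(clusters):
    """Share of the total variation explained by the partition (1 - SSW/SST)."""
    all_points = [p for cluster in clusters for p in cluster]
    ss_total = sum_of_squares(all_points)
    ss_within = sum(sum_of_squares(cluster) for cluster in clusters)
    return 1 - ss_within / ss_total


# Hypothetical one-dimensional data, split into two clusters or left as one.
two_clusters = [[(1.0,), (3.0,)], [(11.0,), (13.0,)]]
one_cluster = [[(1.0,), (3.0,), (11.0,), (13.0,)]]

print(eta_squared(two_clusters))
print(eta_squared(one_cluster))
```

    Under this definition the value can only stay the same or grow as clusters are split further, which is why the size of the increments, rather than the absolute value, is informative.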

    Geographic distribution

    Figure 1 shows a stylised map of Sweden and the Swedish parts of Finland. The colour-coding and the numbers show the geographic distribution of the four clusters.

    The clusters show a clear geographic distribution. Cluster (4), n=7, separates all dialects on the Finnish mainland from the rest of the dialects. Cluster (3), n=23, is mainly restricted to the Northern parts of Sweden, while the clusters (2), n=17, and (1), n=39, are restricted to Southern Sweden, with the exception of an area comprising Jämtland, Ångermanland and Medelpad. The clusters (1) and (2) show less geographic separation, but there is a tendency for cluster (2) to occur mainly in the South-western parts and coastal regions.

    Durational characteristics of the clusters

    Figures 2 and 3 show the distributions of the four segment durations across the four clusters in the form of box-plots. Figure 2 shows the distribution of V: and V durations, figure 3 of C: and C durations.

    The Finnish cluster (4) is separated from the other clusters by longer V: and V durations and shorter C durations. C: durations in the Finnish cluster (4) are close to the ones in the Northern cluster (3), but clearly longer than those in the Southern clusters (1) and (2). Consequently, the Northern cluster (3), as well, is separated from the Southern clusters (1) and (2) by longer C: durations. However, the Northern cluster (3) shows rather long C durations, which constitutes a clear difference to the short C durations in the Finnish cluster (4).

    Figure 1: Geographic distribution of the 4 clusters obtained by cluster analysis (see text).

    The vowel durations show no clear separation between the Northern cluster (3) and the Southern clusters (1) and (2): The V and V: durations of cluster (3) lie in between the durations of (1) and (2). Cluster (1) shows a tendency to have longer V and V: durations than cluster (2). The same relationship exists between the consonants in cluster (1) and (2). Thus, all segments tend to be longer in cluster (1) than in cluster (2).

    Relations between vowel and consonant duration

    The different durational characteristics of the segments result also in different V:/C and V/C: ratios. Because of the very long V: and very short C, the Finnish cluster (4) shows a deviating V:/C ratio (see table 1). The differences in the V/C: ratios are less pronounced. However, cluster (3) and (4) show lower V/C: ratios than cluster (1) and (2), due to their longer C:. V:/C ratios as well as V/C: ratios are very similar in the Southern clusters (1) and (2).

    Table 1: Median V:/C and V/C: ratios in the four clusters.

            Cl. (1)   Cl. (2)   Cl. (3)   Cl. (4)
    V:/C    1.15      1.13      1         1.83
    V/C:    0.53      0.57      0.41      0.44

    Relations between long and short segments

    The ratios between the durations of the long and the short segments are shown as boxplots in figure 4.

    The figure shows that the consonant ratios have a wider range than the vowel ratios, but are generally lower. Some dialects show consonant ratios close to 1.0, hence long and short consonants have approximately the same duration, while the highest values are around 2.0-2.2, showing long consonants that are twice as long as short consonants. The median of the distribution is 1.22, and 50% of the values lie between 1.14 and 1.32. The vowel ratios, on the other hand, are rarely lower than 1.4; the highest values are similar to the highest values for the consonant ratios, around 2.0-2.2. The median is 1.82, and 50% of the values lie between 1.7 and 1.9.

    Figure 2: V and V: durations per cluster. Grey: short vowels, white: long vowels.

    Figure 3: C and C: durations per cluster. Grey: short consonants, white: long consonants.

    Figure 4: V:/V and C:/C ratios per cluster. Grey: vowel ratios, white: consonant ratios.

    Cluster (4) is clearly separated from the rest of the dialects by the consonant ratios. All dialects in cluster (4) show values higher than 1.6, while all other dialects lie below this value. The dialects in cluster (3) show somewhat higher C:/C ratios than those in the clusters (1) and (2), although there is a lot of overlap. The median for cluster (3) is 1.29, for cluster (2) it is 1.14 and for cluster (1) it is 1.19.
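    The long/short ratios and their summary statistics (median and the range holding the middle 50% of the values) could be computed along the following lines. This is a hedged sketch: the helper names, the dictionary layout and the durations are hypothetical, and the quartile method is simply the default of Python's statistics.quantiles, which need not match the one behind the reported figures.

```python
from statistics import median, quantiles


def long_short_ratios(locations, long_key, short_key):
    """Ratio of long to short segment duration for every location."""
    return [loc[long_key] / loc[short_key] for loc in locations]


def summarise(ratios):
    """Median plus the quartiles that bound the middle 50% of the values."""
    q1, _, q3 = quantiles(ratios, n=4)
    return {"median": median(ratios), "q1": q1, "q3": q3}


# Hypothetical location medians in milliseconds.
locations = [
    {"V:": 180, "V": 100, "C:": 200, "C": 160},
    {"V:": 170, "V": 100, "C:": 180, "C": 150},
    {"V:": 240, "V": 120, "C:": 270, "C": 150},
]

consonant_summary = summarise(long_short_ratios(locations, "C:", "C"))
vowel_summary = summarise(long_short_ratios(locations, "V:", "V"))
print(consonant_summary)
print(vowel_summary)
```

    A ratio near 1.0 means that the long and short segment have almost the same duration, so the summaries make the contrast between vowel and consonant ratios directly comparable.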

    Discussion

    The cluster analysis showed results similar to the analysis presented in Schaeffler (2005). There are three main groups with a clear geographic distribution: a Finnish cluster, which includes all dialects on the Finnish mainland, a Northern cluster, mainly concentrated in Northern Sweden from Lappland to Gästrikland, and two Southern clusters.

    The consideration of two Southern clusters instead of one was motivated by a major increase in the value of η². The durational characteristics, however, do not present much support for such a partitioning. All segments in cluster (1) are longer than in cluster (2), which leads to very similar segment ratios (see table 1 and figure 4). This, together with the lack of a clear geographic distribution, suggests that speech rate effects might be responsible for the difference.

    The geographic distribution corresponds with the traditional descriptions of the Swedish dialects as outlined in the introduction. 4-way distinctions are mainly found in the Finnish region, 3-way distinctions frequently in the Northern regions and 2-way distinctions in the Southern Swedish areas (see e.g. Wessén 1969, Riad 1992). In Schaeffler (2005), the observed durational differences were attributed to these functional differences. In 4-way distinctions, where V:C and VC: sequences contrast with VC and V:C:, clear durational distinctions of vowels and consonants are expected. This corresponds with the observed durations for the Finnish region. All other dialects, however, show rather low consonant ratios, while a clear durational difference between the vowels is maintained.

    A further aspect of the results deserves attention: The geographic distribution resulting from the cluster analysis is almost too clear-cut. According to dialectological descriptions, some dialects in the Finnish cluster (4) show 3-way systems (Ivars 1988), comparable to those in the Northern Swedish regions. In spite of these functional congruences between parts of Northern Swedish and Finland-Swedish, the segment durations differ. There is, however, strong evidence that the loss of the V:C: structure in the Finnish areas is a relatively recent phenomenon (Ivars, 1988). If this is correct, then 4-way durational values in 3-way dialects could be attributed to the recent loss of a richer quantity system. While the phonological system is a 3-way system, the phonetic durations still reflect a 4-way system.

    Labov (1994) has argued that quantity changes spread via lexical diffusion, i.e. phonetically abrupt but lexically gradual. Thus, when V:C: recently became VC: in certain Finnish dialects, the change was presumably phonetically abrupt. A speaker adopting the change replaces V: in V:C: by the corresponding V, and thus replaces a V:C: structure with a VC: structure that has the same durational characteristics as original VC: structures, maintaining the durational characteristics of V and C:. The change does not necessarily affect all words with the matching environment at the same time (lexically gradual), but might gradually spread through the lexicon until all V:C: sequences have vanished from the dialect.

    This scenario would provide an explanation for the observed mismatches between phonological system and phonetic duration in some of the Finnish dialects.

    References

    Bortz J. (1999) Statistik für Sozialwissenschaftler. 5th edition. Berlin: Springer.

    Gordon A.D. (1999) Classification. 2nd edition. Boca Raton: Chapman & Hall.

    Ivars A.-M. (1988) Närpesdialekten på 1980-talet. In Std. i nordisk fonologi, volume 70.

    Labov W. (1994) Principles of Linguistic Change, Vol. I. Cambridge: Blackwell P.

    Riad T. (1992) Structures in Germanic Prosody. Stockholm: Univ.

    Schaeffler F. and Wretling P. (2003) Towards a typology of phonological quantity in the dialects of Modern Swedish. Proc. 15th ICPhS Barcelona, p. 2697-2700.

    Schaeffler F. (2005, forthcoming) Drawing the Swedish quantity map: from Bara to Vörå. Proc. Nordic Prosody IX. Lund.

    Strangert E. and Wretling P. (2003) Complementary quantity in Swedish dialects. Proc. Fonetik 2003. Umeå.

    Timm N.H. (2002) Applied Multivariate Analysis. New York: Springer.

    Wessén E. (1969) Våra folkmål. C E Fritzes Bokförlag, Lund.


    Phonological variation and geographical orientation among students in a West Swedish small town school

    Anna Gunnarsdotter Grönberg
    Institute for Dialectology, Onomastics and Folklore Research, Göteborg, Sweden

    Abstract

    This paper presents main results from a Ph.D. thesis on sociolinguistic variation among students in an upper secondary school in Alingsås, a town of 25,000, northeast of Göteborg. Phonological variants are found to be associated with traditional local dialect, regional and supraregional standard, Göteborg vernacular, general and Göteborg youth language.

    Correlations with demogeographical areas generally show a pattern going from southwest to northeast (along the E20 highway and the railway from Göteborg). One area does not fit into the continuum, Sollebrunn (NW of Alingsås), where particularly female informants tend to use standard forms and innovations to a surprisingly high extent. Gender is the second most important social factor, but in different ways. There are major differences from one social group to another when it comes to expressing gendered identity through linguistic means.

    Introduction

    This article presents some of the core results from my Ph.D. thesis (Grönberg 2004), a study of sociolinguistic variation among students from five municipalities, all attending an upper secondary school in Alingsås, a town of 25,000, northeast of Göteborg, Sweden.¹

    The main aim of the thesis was to study co-variation between linguistic variation and social identity, and to relate it to language and dialect change. A number of questions were raised, of which the following will be discussed here:

    To what extent does linguistic variation depend on the informant's orientation towards the place where (s)he lives or other places?

    How do the findings from the upper secondary school in Alingsås differ from results from comparable informants in Göteborg?

    Which social factors are most important for linguistic choices?

    Material

    The material studied consists of tape-recorded interviews with 97 students at the Alströmergymnasium, which at the time of recording in 1998 had a total of 1400 students in 14 different national study programmes. These students come from quite a large area surrounding Alingsås. The informants in the study represent five municipalities and ten different study programmes, and there is an even gender distribution.

    A sample of the results was also compared with the GSM corpus², consisting of recordings of group conversations with 105 upper secondary school students in Göteborg 1997-1998. In this corpus too, the informants are distributed evenly with respect to gender, study programmes and geographical areas in Göteborg and surroundings.

    Method

    The informants were categorized according to social variables representing different aspects of background and identity: gender, type of study programme (vocational, intermediate, preparatory for university), demogeographical areas (based on the extent of urbanization in the five municipalities), Alingsås neighbourhoods (divided on the basis of socio-economic factors), and lifestyle based on two-dimensional mapping (concerning taste, leisure, mobility, plans for the future, etc.). The lifestyle analysis both complements and includes traditional sociolinguistic variables.³

    Eight linguistic variables were analyzed extensively: four phonological, two lexical and two morpho-phonological. Instances of variants in the recorded interviews were counted manually, and frequencies of variants were correlated statistically to social variables on a group level. Examples from analyses of three phonological variables will be used in the following discussions.
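    The counting step described above, turning manually coded tokens into relative frequencies of variants per group, can be sketched as follows. This covers only the tabulation, not the statistical correlation, whose method is not specified here. The function name variant_percentages and the token list are hypothetical and do not reproduce data from the thesis.

```python
from collections import Counter, defaultdict


def variant_percentages(tokens):
    """Percentage of each variant within each group.

    tokens is a list of (group, variant) pairs, one pair per coded
    occurrence of the linguistic variable in the recordings.
    """
    counts = defaultdict(Counter)
    for group, variant in tokens:
        counts[group][variant] += 1
    percentages = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        percentages[group] = {v: 100 * n / total for v, n in counter.items()}
    return percentages


# Hypothetical coded tokens of one variable in two areas.
tokens = [
    ("Lerum", "diphthongized"), ("Lerum", "diphthongized"),
    ("Lerum", "diphthongized"), ("Lerum", "standard"),
    ("Herrljunga", "lowered"), ("Herrljunga", "lowered"),
    ("Herrljunga", "standard"), ("Herrljunga", "lowered"),
]

result = variant_percentages(tokens)
print(result)
```

    The same function works for any grouping variable (area, gender, study programme), since the group label is just the first element of each pair.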

    Geographical orientation and linguistic variation

    One of the main issues was to find out to what extent linguistic variation depends on the informant's orientation towards the place where (s)he lives, towards Göteborg, Stockholm, or other places. Does the influence stem from the Göteborg dialect, ideas about standard Swedish, or from an ideal national youth language?

    The variants studied can be related to several layers of spoken Swedish: traditional local dialect (västgötska), regional and supraregional standard, traditional Göteborg dialect, Göteborg youth language, and general youth language. The question concerning the origin of a variant in a certain level or variety is not always easy to answer, as there are several cases of identical forms in different layers. One such example is the variable Ö (long /ø/, except when preceding /r/), where the variant closed Ö [ø:] (grön [grø:n] 'green') can be found in both local dialect and standard language, while at the same time contrasting with the innovation open Ö [œ:] (grön [grœ:n]), which can be found in both traditional Göteborg dialect and in general youth language.

    The variation found can be interpreted as attributable to differences in geographical orientation in the various groups. Some groups seem quite locally rooted, while others are more oriented towards Göteborg, and some have more far-reaching aspirations, not necessarily towards Stockholm, but towards major urban areas in general. There is hardly any orientation towards other areas than these, except in the case of a few informants who are drawn towards other places in West Sweden.

    The linguistic influence as seen in the use of standard forms and innovations is associated with an orientation towards both youth and standard language, on both a regional and supraregional level. When correlated with geographical areas, the linguistic variables generally show a pattern going from southwest to northeast. The frequency of both standard forms and innovations grows higher the closer the informants live to Göteborg, and the two peripheral areas Lerum (SW) and Herrljunga (NE) are, in almost every case, the extremes, with the highest frequencies of non-local and local forms respectively. Even in central Alingsås, a similar tendency can be found, with informants in the NE parts of town showing a high frequency of local forms, while those living in Centre and SW are more prone to using standard forms and innovations.

    One area does not, however, fit into the dialect continuum discernible along the E20 highway and the railway from Göteborg through Lerum, Alingsås and Herrljunga: in Sollebrunn (NW of Alingsås), some distance away from both the highway and the railway, the nine female informants in particular tend to use standard forms and innovations to a much higher extent than the informants in Herrljunga, situated at the same distance from Göteborg. (Results are statistically significant at the five percent level.)

    One example is the variable U, which represents three variants of the pronunciation of long /u/. The variants are the local lowered U [:] (hur [h:r] 'how'), the standard U [ʉ:] (hur [hʉ:r]), and the diphthongized U [] (hur [hr]), which is considered an innovation from Göteborg in this study. As can be seen in figure 1, informants in Sollebrunn and Lerum have a similar frequency of the Göteborg innovation diphthongized U.

    Figure 1. Frequency of U variants in geographical areas. (Vertical axis: %, 0-100. Variants: lowered U, standard U, diphthongized U. Areas: Herrljunga, Gräfsnäs, Sollebrunn, Alingsås, Lerum.)

    Elements in the Sollebrunn informants' adherence to groups point towards a stronger need for identification with other geographical areas than their own. I hope to be able in future research to go into greater detail with regard to the attitudes and values of these informants, to find out why they differ from the overall pattern.

    Comparing results from Alingsås and Göteborg

    How do the findings from the upper secondary school in Alingsås differ from findings from comparable students in Göteborg?

    The answer to this question is in some ways already given above. The distribution pattern going from SW to NE, as seen between Lerum and Herrljunga, is supplemented and strengthened through a comparison with the GSM corpus. In relation to three of the phonological variables, the results are unambiguous, with the frequency of innovations being substantially higher among the Göteborg informants than among the Alingsås informants.

    The curves, which show a strong slant between NE and SW, show an even steeper slant between the areas of Lerum and Göteborg. One example is the variable I/Y, as illustrated in figure 2.

    The variable I/Y represents three variants of pronunciation of the long /i/ and /y/. The local variant is the lowered I/Y [:]/[:] (fin [f:n] 'nice', typ [t:p] 'like, sort of'), the standard variant is the standard I/Y [i:]/[y:] (fin [fi:n], typ [ty:p]), and the Göteborg innovation is fricativized I/Y [i:z]/[y:z] (fin [fi:zn], typ [ty:zp]). The results of correlation with geographical areas form a steep, increasing curve for the Göteborg fricativized I/Y from Sollebrunn via Lerum to Göteborg, and a steeply decreasing curve for the local lowered I/Y between the same areas.

    Figure 2. Frequency of I/Y variants in geographical areas including Göteborg. (Vertical axis: %, 0-100. Variants: lowered I/Y, standard I/Y, fricativized I/Y. Areas: Herrljunga, Gräfsnäs, Sollebrunn, Alingsås, Lerum, Göteborg.)

    One of the lexical variables displays a somewhat different distribution. When a study of adolescents from Stockholm is added to the comparison, the results point to a spread of the innovation typ ('like') from Stockholm to Göteborg in the first place, and then to Alingsås. They also suggest that the use has stagnated in favour of other discourse markers in the two major cities, while upper secondary school students in Alingsås and the catchment area still use typ to quite a large extent.

    Differing patterns of distribution for innovations (three phonological and one lexical) can be interpreted as two types of regionalization taking place at the same time. The first type consists of a gradual spread from the regional centre of Göteborg towards Alingsås and then further north, and the other type consists of a form of urban jumping, where forms spread from the capital to Göteborg, and then on to the town of Alingsås, and from there to surrounding areas (cf. Sandøy 1993:119).

    Further spreading of Göteborg features?

    One interesting question is whether innovations that are spreading in Göteborg will spread further in the region, or whether the variants close to standard will take over. Further spread would shift the spoken language of the Alingsås area even more towards that of Göteborg, as has already happened in, for instance, Kungälv and Kungsbacka, some 30 km to the north and south of Göteborg (Grönvall 2005).

    Thelander (1979) and Westerberg (2004) describe the rise of a West Bothnia regional standard, where forms which are common to dialects in a larger area survive, while more local forms disappear. The question is whether a similar development might take place in relation to certain West Swedish forms. The variable R, for instance, might suggest such a thing. R represents the pronunciation of the long // before /r/, with two possible variants: the traditional, local closed R [:r] (göra [j:ra] 'do'), and the standard open R [:r] (göra [j:ra]). The closed R is present in a large area including the region of Västergötland (but not in Göteborg or the coastal regions), and this feature shows a high frequency in central Alingsås and all three of the areas to the north (Herrljunga, Gräfsnäs and Sollebrunn), as shown in Figure 3. This points to the possibility of the closed R surviving as part of a Västgöta regional standard, distinguishable from the Göteborg regional standard, in which the open R is standard.

    Figure 3. Frequency of R variants in geographical areas. [Chart not reproduced. Vertical axis: % (0-100). Series: Open R, Closed R. Areas: Herrljunga, Gräfsnäs, Sollebrunn, Alingsås, Lerum.]

    Discussion

    Which social factors are most important for linguistic choices? Is it possible to identify groups with common social identities in order to explain differences in linguistic behaviour?

    The social variables used in the study (gender, study programme, demogeography, and lifestyle) all show co-variation with linguistic variables as well as, to some extent, with each other. The hypotheses formulated were not always verified, but this was not attributable to a lack of variation but to the fact that the variation was not the predicted variation. For the eight linguistic variables analyzed, different social factors are important, but the factor that is most often salient is demogeography. After that, gender can be seen as second most important, but in two different ways. On the one hand there are general differences between girls and boys seen as groups; on the other hand there are differences between different groups when sex is combined with study programme, and also in the lifestyle analysis. This indicates that there are major differences from one social group to another when it comes to expressing gendered identity through linguistic means. The most salient geographical variation can be found in the phonological variables, while the lexical variables co-vary to a higher extent with gender, programme type, and lifestyle.

    As was discussed above, a distribution pattern going from SW to NE is discernible, and this is related not only to distance in kilometres to Göteborg, but also to the dominant lifestyles and values in the adult population in the different areas. The ones who stand out most clearly are the girls in Sollebrunn, with respect to both the demogeographical and the socio-economic categorization of the informants. The lifestyle analysis is an attempt to supply extra information to combine with the traditional social variables, and there is good potential for developing this method further in studies of linguistic change. It provides a better understanding of the informants' social background and aspirations, both in that it takes more aspects into account and in that it makes it possible to move away from the hierarchical way of thinking which characterizes e.g. social indexation, and thus to capture more aspects of how social identities are constructed in contemporary society.

    Notes

    1. The Swedish upper secondary school, gymnasium, gives courses of three years' duration for students who have completed nine years of school and have thus reached the age of 15-16 years. About 98 percent of Swedish 16-year-olds apply to the gymnasium.

    2. Gymnasisters språk- och musikvärldar, 'The Language and Music Worlds of High School Students'. See Norrby & Wirdenäs (1998).

    3. The lifestyle analysis was based on Ungdomsbarometern (1999). Cf. Bourdieu (2002) and Dahl (1997).

    References

    Bourdieu, Pierre. (2002) [1984] Distinction. A Social Critique of the Judgement of Taste. Reprint. London: Routledge & Kegan Paul Ltd.

    Dahl, Henrik. (1997) Hvis din nabo var en bil. En bog om livsstil. København: Akademisk Forlag A/S.

    Grönberg, Anna Gunnarsdotter. (2004) Ungdomar och dialekt i Alingsås. (Nordistica Gothoburgensia 27.) Göteborg: Acta Universitatis Gothoburgensis.

    Grönvall, Camilla. (2005) Lättgöteborgska i Kungsbacka. En beskrivning av några gymnasisters språk 1997. Göteborg. Unpublished manuscript.

    Norrby, Catrin & Karolina Wirdenäs. (1998) The Language and Music Worlds of High School Students. In: Pedersen, Inge Lise & Jann Scheuer (eds.). Sprog, Køn og Kommunikation. Rapport fra 3. Nordiske Konference om Sprog og Køn. København, 11.-13. oktober 1997. København: C.A. Reitzels Forlag. S. 155-164.

    Sandøy, Helge. (1993) Talemål. Oslo: Novus.

    Thelander, Mats. (1979) Språkliga variationsmodeller tillämpade på nutida Burträsktal. Del 1 & 2. (Studia Philologiae Scandinavicae Upsaliensia 14:1 & 14:2.) Uppsala: Acta Universitatis Upsaliensis.

    Ungdomsbarometern. (1999) Livsstil & fritid. Stockholm: Ungdomsbarometern AB.

    Westerberg, Anna. (2004) Norsjömålet under 150 år. (Acta Academiae Regiae Gustavi Adolphi LXXXVI.) Uppsala: Kungl. Gustav Adolfs Akademien för svensk folkkultur.

    On the phonetics of unstressed /e/ in Stockholm Swedish and Finland Swedish

    Yuni Kim
    Department of Linguistics, University of California at Berkeley

    Abstract

    Dialects of Swedish vary in the pronunciation of unstressed /e/ in different phonological environments. In this pilot study, Stockholm Swedish is compared with several Finland Swedish dialects. Stockholm and one Åland dialect lower and back /e/ before [n], while Helsinki and most Nyland dialects lower and back /e/ before [r]. The data provide evidence for the sociolinguistic relevance of unstressed vowel pronunciation.

    Introduction

    Stressed short [e] and [] are in complementary distribution in most Swedish dialects: the allophone [] occurs before [r], and [e] occurs in all other environments. In Finland Swedish, transcription conventions (e.g. in Harling-Kranck 1998) and informal reports by native speakers suggest that the same distribution may hold in unstressed syllables as well. Since it is not clear how widespread this phenomenon is, a pilot study was conducted to investigate the phonetics of unstressed /e/ across several dialects. The following questions were addressed: 1) How is unstressed /e/ pronounced in Stockholm Swedish? 2) Are unstressed [e] and [] in fact in complementary distribution in standard Helsinki Swedish? 3) Do rural Finland Swedish dialects pattern with Helsinki or with Stockholm, or do they show their own patterns? Finally, 4) can the regional differences be explained?

    Materials and methods

    For the Stockholm and Helsinki speech samples, 5-minute news broadcasts from each city were recorded from the Internet into Praat at 22500 Hz. One male and one female newscaster were recorded for each variety. The audio files had been compressed for Internet broadcast, but it was assumed that the compression would not have affected the lower frequencies where the vowel formants were located.

    The data for the rural Finland Swedish dialects consisted of the audio recordings in Harling-Kranck (1998), a transcribed collection of spontaneous narratives by speakers born around 1900. The scope of this study was limited to the southern dialects, and of these, 10 dialects (represented by one speaker each) had enough tokens of unstressed /e/ for consistent patterns to arise. From west to east, these were: Föglö and Kökar in eastern Åland; Houtskär in western Åboland; Tenala and Karis in western Nyland; Sjundeå and Helsinge in central Nyland; and Borgå, Lappträsk, and Pyttis in eastern Nyland.

    F1 and F2 values were measured for unstressed tokens of the phoneme /e/ in word-final syllables. Using Praat, measurements were taken at a stable point at or near the midpoint of each vowel. Formants were calculated by LPC, and FFT spectra were also consulted in cases of inconsistency between the LPC value and visual inspection of the spectrogram. Excluded from the measurements were extremely reduced tokens with unclear formant structure, and tokens with dramatic formant movement throughout the course of the vowel (e.g. a linear drop of 300 Hz in F2). These criteria had the effect of excluding most tokens following velars, but due to the small total number of tokens it seemed safer for purposes of comparability to include only vowels with reasonably stable formant values.
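    The exclusion criterion for formant movement and the midpoint measurement can be illustrated with a short sketch. It is not the author's procedure: the 300 Hz threshold is taken from the example in the text, while the use of the F2 range (maximum minus minimum) as the measure of movement, the function names, and the formant tracks are assumptions made for illustration.

```python
def is_stable(f2_track_hz, max_movement_hz=300.0):
    """True if the F2 range over the vowel stays below the threshold.

    Using max - min as the movement measure is an assumption; the text
    only gives "a linear drop of 300 Hz in F2" as an example.
    """
    return (max(f2_track_hz) - min(f2_track_hz)) < max_movement_hz


def midpoint_value(track_hz):
    """Value at (or nearest to) the midpoint of an evenly sampled track."""
    return track_hz[len(track_hz) // 2]


# Hypothetical F2 tracks in Hz, NOT measurements from the study.
tokens = {
    "tok_a": [1800.0, 1780.0, 1760.0, 1750.0, 1740.0],  # 60 Hz movement
    "tok_b": [1900.0, 1800.0, 1700.0, 1600.0, 1550.0],  # 350 Hz movement
}
kept = {
    name: midpoint_value(track)
    for name, track in tokens.items()
    if is_stable(track)
}
```

    In this invented example only tok_a survives the filter, and its midpoint F2 is what would enter the plots.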

    Results

    Preliminary inspection of the data indicated three categories of environments relevant to the phonetic realization of unstressed /e/: preceding [n], preceding [r], and elsewhere (usually word-final, or preceding [t] or [s]). Below, tokens for these environments are graphed in each dialect. The ellipses are marked "N", "R", and "E", respectively.
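    The three-way grouping of environments described above can be sketched as follows. This is an illustrative reconstruction only: the labels N, R and E come from the text, whereas the function names, the encoding of a word-final token as None, and the formant values are hypothetical.

```python
from statistics import mean


def classify_environment(following_segment):
    """Map the segment after unstressed /e/ to the labels N, R or E."""
    if following_segment == "n":
        return "N"
    if following_segment == "r":
        return "R"
    return "E"  # elsewhere: word-final (None), [t], [s], ...


def mean_formants_by_environment(tokens):
    """tokens: iterable of (following_segment, f1_hz, f2_hz)."""
    grouped = {}
    for segment, f1, f2 in tokens:
        grouped.setdefault(classify_environment(segment), []).append((f1, f2))
    return {
        env: (mean(f1 for f1, _ in vals), mean(f2 for _, f2 in vals))
        for env, vals in grouped.items()
    }


# Hypothetical measurements in Hz, NOT data from the study.
demo = [
    ("n", 600, 1500), ("n", 620, 1520),
    ("r", 500, 1700),
    ("t", 480, 1800), (None, 500, 1760),
]
env_means = mean_formants_by_environment(demo)
```

    A higher mean F1 together with a lower mean F2 for one group relative to E would correspond to the lowering and backing discussed in the following sections.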

    Stockholm

    Unstressed /e/ in Stockholm Swedish was generally realized as schwa, but a pattern emerged for both Stockholm speakers: the schwa had higher F1 and lower F2 when preceding [n] than in other environments. There is little overlap between the [n]-environment tokens and the other tokens in the F2 vs. F1 plots in Figs. 1 and 2.

    Figure 1. Stockholm newscaster. Female, rec. 2005.

    Figure 2. Stockholm newscaster. Male, rec. 2005.

    Helsinki

    The Helsinki newscasters had a very different pattern from Stockholm. Both speakers categorically lowered and backed unstressed /e/ before [r], as in Fig. 3. This result seems to confirm the existence of [e]~[] allophony in unstressed syllables, at least on a phonetic level.

    Figure 3. Helsinki newscaster. Female, rec. 2005.

    Åland and the Åbo archipelago

    The next question is which pattern we find in dialects of Åland and the Åbo archipelago, geographically located halfway between Stockholm and Helsinki. Previously part of Sweden, Åland became an autonomous part of Finland in 1921 and maintains contacts with both countries. Thus it is not immediately obvious whether Åland dialect speakers would orient themselves more toward a Central Swedish or a Finland Swedish norm in unstressed vowel pronunciation.

    The speaker from Föglö in eastern Åland has the Stockholm pattern, as shown in Fig. 4.

    Figure 4. Föglö, eastern Åland. Male, b. 1901.

    On the other hand, the speakers from Kökar and Houtskär (females, born 1900 and 1899, respectively) show a third type of pattern, where /e/ has lower F2 before [r], but with (apparently) less of a difference in F1 between the environments.

    Figure 5. Kökar, eastern Åland. Female, b. 1900.

    Nyland

    In most rural villages of Nyland (the province where Helsinki is located), the Helsinki pattern obtains: before [r], unstressed /e/ approaches an []-like pronunciation.

    Figure 6. Tenala, western Nyland. Female, b. 1885.

    Figure 7. Borgå, eastern Nyland. Male, b. 1900.

    Easternmost Nyland, on the other hand, presents a bit of a mystery. The Lappträsk speaker (not shown here) has a tendency to lower and back /e/ before [r], but unlike in other parts of Nyland there is significant overlap with non-[r]-environment tokens. The Pyttis speaker has an even more divergent pattern, illustrated in Fig. 8. Since the easternmost dialects have undergone heavy phonetic influence from Finnish, it may be possible to relate these divergent patterns to Bergroth's (1917) observation that it is characteristic of Finnish-accented Swedish to pronounce unstressed er as [er] instead of [r]. The easternmost Nyland dialects should be investigated further.

    Figure 8. Pyttis, western Kymmene (E. Nyland dialect group). Male, b. 1895.

    Discussion and conclusion

    Although recordings of only one or two speakers per dialect were available, multiple speakers in each region showed approximately the same patterns. Thus the results, though preliminary, seem to point to robust regional differences in the quality and distribution of unstressed tokens of /e/.

    It may be possible to explain some of this variation. As mentioned in the introduction, the Helsinki pattern, where [e] and [] are in complementary distribution, is parallel to an identical alternation in stressed syllables in many Swedish dialects. The fact that the alternation seems to have generalized to unstressed syllables precisely in Finland Swedish may perhaps be attributable to contact with Finnish, which tends not to reduce vowel quality in unstressed syllables. That is, speakers of Finland Swedish may have acquired a habit of articulating the full or nearly-full quality of unstressed vowels, which could have triggered the [e]~[] alternation. The [] of Finland Swedish is noticeably more open than in the Swedish of Sweden (Reuter 1971), which also seems to contribute to the salience of the allophony. This hypothesis must remain speculation, however, pending further data on vowel reduction in Finland Swedish (as well as evidence that the Helsinki pattern really is an innovation and not archaic).

    Once a wider range of dialects is studied, it may be possible to assemble a more coherent picture of cross-dialectal variation in unstressed vowel pronunciation. In future research, comparing measurements of unstressed /e/ to the rest of the vowel system, for example to stressed realizations of [e] and [], could shed further light on this topic. Normalization of the vowel formants would also allow direct comparison among speakers.
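    The text does not name a normalization method. One widely used option is speaker-intrinsic z-scoring of each formant (often called Lobanov normalization); the sketch below assumes that choice and uses the sample standard deviation, a hypothetical function name, and invented values purely for illustration.

```python
from statistics import mean, stdev


def lobanov(values_hz):
    """Z-score one speaker's values for a single formant (e.g. all F1).

    Assumption: sample standard deviation; some implementations use the
    population standard deviation instead.
    """
    centre = mean(values_hz)
    spread = stdev(values_hz)
    if spread == 0:
        raise ValueError("all values identical; cannot normalize")
    return [(v - centre) / spread for v in values_hz]


# Hypothetical F1 values (Hz) for one speaker, NOT data from the study.
speaker_f1 = [400.0, 500.0, 600.0]
normalized_f1 = lobanov(speaker_f1)
```

    Because each speaker's values are expressed relative to that speaker's own mean and spread, tokens from male and female speakers could be placed on a common scale. Such a normalization is only meaningful when computed over a reasonably full vowel inventory per speaker, which would require measuring more than unstressed /e/.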

    Finally, these results have more general implications. Although sociophonetic research has often focused on stressed vowels (e.g. Labov 1994), the evidence presented here suggests that unstressed vowels can also have sociolinguistic relevance.

    Acknowledgements

    I would like to thank Leanne Hinton for valuable advice and discussion. This research has been supported by a Fulbright Grant and a Jacob K. Javits Graduate Fellowship.

    References

    Bergroth H. (1917) Finlandssvenska: handledning till undvikande av provinsialismer i tal och skrift. Helsinki: Holger Schildts.

    Harling-Kranck G. (1998) Från Pyttis till Nedervetil: tjugonio dialektprov från Nyland, Åboland, Åland och Österbotten. Helsinki: Svenska Litteratursällskapet.

    Kuronen M. and Leinonen K. (2000) Fonetiska skillnader mellan finlandssvenska och rikssvenska. Svenskans beskrivning 24, 125-138. Linköpings universitet.

    Labov W. (1994) Principles of Linguistic Change: Internal Factors. Oxford: Blackwell.

    Reuter M. (1971) Vokalerna i finlandssvenskan: en instrumentell analys och ett försök till systematisering enligt särdrag. Studier i nordisk filologi 58, 240-249. Helsinki: Svenska Litteratursällskapet.


    The interaction of word accent and quantity in Gothenburg Swedish

    My Segerup
    Department of Linguistics and Phonetics, Lund University, Lund

    Abstract

    According to conventional wisdom, the word accent distinction in Swedish (dialects) is maintained chiefly by a difference in the timing of the word accent gesture (Gårding, 1973). Gothenburg Swedish, however, does not conform to this norm, since both pitch height and timing contribute to the word accent distinction in this dialect (Segerup, 2004). In Gothenburg Swedish both word accents have a fall on the stressed vowel, which makes the pitch contours strikingly similar (Segerup, 2004). Up until now the material investigated has consisted of contrastive words in which the stressed vowel is phonologically long. In the present production study we proceed, for comparison, with word-pairs where the stressed vowel is phonologically short. The acoustic analysis involves measurements of fundamental frequency (F0) and segment duration in five speakers' production of seven word-pairs altogether. The results show a significant difference in the duration of the short stressed vowel between accent 1 and accent 2; that is, word accent has an effect on vowel duration.

    INTRODUCTION

    The present paper investigates the interaction between word accent and quantity in Gothenburg Swedish. Minimal pairs with accent 1 and accent 2 and with either a long or a short stressed vowel are examined. How are pitch height and timing affected when the voiced portion of the syllable is minimized by having a short vowel followed by a voiceless consonant, as opposed to a more sonorant environment, i.e. a long vowel or a sonorant consonant? This is related to the general question of truncation or compression of the f0 contour in an intonationally unfavourable environment (see e.g. Bannert & Bredvad-Jensen, 1975).

    Background

    According to the Swedish tonal typology (Gårding, 1973, with Lindblad 1975, Bruce & Gårding, 1978) the timing/alignment of the word accent gesture is essential to the Swedish word accent distinction.

    The traditionally described word accent pattern of the West Swedish prosodic dialect type (see Bruce & Gårding, 1978) involves low pitch on the stressed vowel for accent 1 words and a peak on the stressed vowel for accent 2 words in focal position. Bruce (1998) has suggested that Gothenburg Swedish is characterized by two-peaked pitch contours for both word accents, with an earlier timing in accent 1. A previous production study (Segerup, 2004) confirms that Gothenburg Swedish accent 1 deviates from the generally accepted West Swedish accent 1 pattern by having a fall on the stressed vowel. Furthermore, the fall of accent 2 is only marginally later than that of accent 1, meaning that the expected timing difference between accent 1 and accent 2 is smaller than in other dialect types. Consequently, the overall shapes of the pitch contours are strikingly similar, and yet they are perceptually distinct (Segerup, 2004, Segerup & Nolan, forthc.).

    Pitch height and timing: collaborating cues

    Perhaps the most unexpected finding of the production study summarized above is that accent 2 was often shown to involve higher pitch in the stressed vowel than accent 1. The results of the acoustic analysis show that the word accent distinction is maintained by comparatively small differences in the timing and height of the fall, and further that speakers apparently use different strategies in order to maintain the distinction. Some speakers rely primarily on one cue or the other; other speakers rely on both.

    In order to find out whether listeners attend to pitch height or disregard it, most likely as an unintentional consequence of producing the alignment difference, a perception experiment was carried out (Segerup & Nolan, forthc.). The stimuli used in the experiment were resynthesized from natural utterances with alignment and pitch height varied systematically. Twenty-four native speakers of Gothenburg Swedish served as subjects. The results show that listeners do take note of the height from which the fall takes place as well as the alignment of the fall.

    The results of the production and perception studies above suggest that there is a trading relationship between pitch height and timing, meaning that these two independent dimensions contribute in various proportions to the perception of the word accent distinction (Segerup & Nolan, forthc.).

    Purpose

    The purpose of the present study is to investigate the interaction between word accent and quantity, and, further, to investigate whether there is a difference between accent 1 and 2 as regards the duration of the short stressed vowel and the long stressed vowel, respectively.

    INVESTIGATION

    Materials, subjects, method

    The speech materials comprise seven contrastive disyllabic word-pairs, all of which are listed pair-wise in Table 1 below. (Since the present investigation is part of a large-scale study, the word-pairs included here are not completely symmetric.) The target words, in phrase-final focal position, were extracted from various sets of sentences (statements) spoken in two different speaking styles: normal and clear voice.

    The speakers were five elderly male native speakers of Gothenburg Swedish. The recordings were made using a portable DAT recorder in the subjects' local environment. Two sets of recordings were made on two separate occasions. A Gothenburg Swedish interlocutor read various questions, to which the subjects read the answers (which proved to induce very natural and colloquial speech). At least three (but up to nine) repetitions of every sentence in each speaking mode (in random order) were recorded for each of the 5 speakers.

    Table 1. Contrastive word-pairs included in the present study.

    Accent 1 V:               Accent 2 V:
    Polen (Poland)            pålen (the stake)
    Judith                    ljudit (to have sounded)
    malen (the moths)         malen (ground)
    buren (the cage)          buren (carried)

    Accent 1 V                Accent 2 V
    pollen (pollen)           pållen (horsey)
    tecken (signs)            täcken (quilts)
    tjecker (Czechs)          checker (cheques)

    The total number of tokens (including all 5 speakers' repetitions) in the present analysis varies from approximately 15 to 28 tokens per word in each speaking style.

    Acoustic analysis

    The acoustic analysis includes segment durations and seven measurements of pitch values at specific pre-selected points. These are: the height of the preceding vowel (1), the start of the stressed vowel (2), the top corner of the fall (3), the bottom corner of the fall (4), the start of the rise (5), the phrase accent peak (6), and the end (7). In the case where the stressed vowel is followed by a voiced/voiceless consonant, measurement point (5) is at the onset of the second vowel. In Figures 1-3 below the measurement points are marked by triangles for accent 1 and squares for accent 2.

    RESULTS

    The results of the acoustic analysis are exemplified in Figures 1-4. Figures 1-3 show the average f0 curves for five speakers' production of malen/malen, pollen/pållen and tecken/täcken in clear style, respectively. The duration of the stressed vowel is indicated by a bar (at the top for accent 2 and at the bottom for accent 1). The pitch contours are aligned at the start of the stressed vowel, and earlier points are shown as having negative times relative to the alignment point. In words with a long vowel the duration of the stressed vowel and the overall timing of the two word accents are very similar, meaning that a direct comparison of the timing of pitch events is possible, which is generally not the case in words with a short vowel.
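    The alignment and averaging described above can be sketched as follows. This is a minimal illustration and not the author's analysis script: it assumes each contour is a list of (time in ms, f0 in Hz) measurement points taken in the same order for every speaker, and the function names and values are hypothetical.

```python
from statistics import mean


def align_to_vowel_onset(points, vowel_onset_ms):
    """Shift (time_ms, f0_hz) points so the stressed-vowel onset is 0 ms."""
    return [(t - vowel_onset_ms, f0) for t, f0 in points]


def average_contour(contours):
    """Average time and f0 point-by-point across aligned contours."""
    return [
        (mean(t for t, _ in pts), mean(f0 for _, f0 in pts))
        for pts in zip(*contours)
    ]


# Hypothetical measurement points, NOT data from the study.
speaker_a = align_to_vowel_onset([(100, 120), (250, 150), (400, 100)], 250)
speaker_b = align_to_vowel_onset([(50, 110), (180, 140), (350, 90)], 180)
averaged = average_contour([speaker_a, speaker_b])
```

    Points before the stressed vowel receive negative times, which matches the convention stated for the figures.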

    Figure 4 compares, for accent 1 and accent 2, the average duration of the stressed vowel for the word-pairs malen/malen, pollen/pållen and tecken/täcken.

    It is clear from the acoustic results that words with a short vowel behave differently from words with a long vowel. Words with a long vowel (Judith/ljudit, Polen/pålen and buren/buren) behave nearly the same as malen/malen, which is shown in Figure 1. For both accents the f0 contour is falling throughout the vowel segment from an initial f0 maximum (defined as the top corner of the fall), which starts slightly later and at a higher frequency level for accent 2 than for accent 1, to an f0 minimum (the bottom corner of the fall) at the end of the stressed vowel.


    There is no durational effect of accent evident in long vowel words, whereas there does seem to be one in short vowel words. The difference seen for pollen/pållen (Figure 2) and tecken/täcken (Figure 3) was also seen for tjecker/checker.

    In the short vowel contours of pollen and pållen (Figure 2) the main pattern of the f0 contours of the long vowel words is preserved. Even if the fall of accent 1 pollen starts comparatively late in the vowel, the f0 contour falls rapidly through the second part of the vowel to a final Low, approximately at the VC boundary.

    Figure 3 reveals a further effect. In tecken/täcken and tjecker/checker (not shown here), where the voiced part is very short, it appears that speakers compress the fall in accent 1, so that the Low is achieved at the end of the short vowel, whereas in accent 2 the pitch stays high at the end of the stressed vowel. The graph interpolates to the Low measured at the beginning of the second vowel, so the true slope of the fall cannot be determined because of the voicelessness.

    The falling f0 contour of accent 2 starts at a higher frequency level and with slightly later timing than the fall of accent 1, and reaches Low in the following consonant.

    From the results it is obvious that the gradient of the fall in accent 1 differs between long vowel words and short vowel words. The results suggest that the gradient of the fall also differs between short vowel words, i.e. between words where the stressed vowel is followed by a voiced consonant versus a voiceless consonant. The gradient of the fall in tecken appears to be twice as steep as that of malen, while the gradient of the fall in pollen is approximately in between that of malen and tecken.
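    The gradient comparison above can be made concrete by computing the slope between the top corner and the bottom corner of the fall (measurement points 3 and 4). The sketch below is an assumption about how such a gradient might be computed, not the author's method; the function name and all numbers are hypothetical and chosen only so that one fall is twice as steep as the other.

```python
def fall_gradient(top_corner, bottom_corner):
    """Slope in Hz/ms between two (time_ms, f0_hz) points."""
    (t_top, f_top), (t_bottom, f_bottom) = top_corner, bottom_corner
    if t_bottom <= t_top:
        raise ValueError("bottom corner must come after top corner")
    return (f_bottom - f_top) / (t_bottom - t_top)


# Hypothetical corner points, NOT measurements from the study.
long_vowel_fall = fall_gradient((20.0, 200.0), (220.0, 120.0))   # -0.4 Hz/ms
short_vowel_fall = fall_gradient((20.0, 200.0), (120.0, 120.0))  # -0.8 Hz/ms
steepness_ratio = short_vowel_fall / long_vowel_fall
```

    With the same pitch excursion compressed into half the time, the gradient doubles, which is the kind of relationship reported for tecken relative to malen.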

    Figure 1. Average fundamental frequency contours for accent 1 (triangles) and accent 2 (squares) for five speakers, malen (accent 1) and malen (accent 2). The first point in the curves is in the preceding unstressed vowel. The bars show the duration of the stressed vowel for accent 1 (bottom) and accent 2 (top). Times are expressed relative to the onset of the stressed vowel. [Plot not reproduced. Horizontal axis: time (ms); vertical axis: frequency (Hz).]

    Figure 2. Average fundamental frequency contours for accent 1 (triangles) and accent 2 (squares) for five speakers, pollen (accent 1) and pållen (accent 2). The first point in the curves is in the preceding unstressed vowel. The bars show the duration of the stressed vowel for accent 1 (bottom) and accent 2 (top). Times are expressed relative to the onset of the stressed vowel. [Plot not reproduced. Horizontal axis: time (ms); vertical axis: frequency (Hz).]

    Figure 3. Average fundamental frequency contours for accent 1 (triangles) and accent 2 (squares) for five speakers, tecken (accent 1) and täcken (accent 2). The first point in the curves is in the preceding unstressed vowel. The bars show the duration of the stressed vowel for accent 1 (bottom) and accent 2 (top). Times are expressed relative to the onset of the stressed vowel. [Plot not reproduced. Horizontal axis: time (ms); vertical axis: frequency (Hz).]

    Figure 4 compares the duration of the stressed vowel of accent 1 and accent 2 for malen/malen, pollen/pållen and tecken/täcken. The difference in duration of the short stressed vowel between accent 1 and accent 2 is noticeable. A t-test showed this difference to be significant for each individual speaker in each speaking style (at the 5% level). One exception is tjecker/checker in clear speaking style, where two of the speakers' vowel duration for accent 2 was only marginally shorter than that of accent 1.
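    The paper does not state which form of t-test was used. As an illustration only, the sketch below computes the statistic for Welch's unequal-variance t-test on two sets of vowel durations; the choice of Welch's test, the function name, and the durations are all assumptions. The resulting t would be compared with the critical value for the returned degrees of freedom at the 5% level.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error, sample A
    vb = variance(sample_b) / nb  # squared standard error, sample B
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical stressed-vowel durations in ms, NOT data from the study.
accent1_ms = [100, 104, 98, 102, 96]
accent2_ms = [88, 92, 90, 86, 94]
t_stat, dof = welch_t(accent1_ms, accent2_ms)
```

    A positive t here means that the accent 1 vowels are longer on average than the accent 2 vowels, the direction of the difference reported for the short vowel words.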

    Figure 4. Average duration (ms) of the stressed vowel for malen/malen, pollen/pållen and tecken/täcken, five speakers. Accent 1 is represented by the light bar and accent 2 by the dark bar. [Bar chart not reproduced. Axis: vowel duration (ms).]

    DISCUSSION

    In Gothenburg Swedish short vowel words, accent 2 seems to demonstrate truncation of the pitch fall, and accent 1 seems to demonstrate compression of the fall and also some lengthening of the stressed vowel. It appears to be the case that Gothenburg Swedish speakers' strategy is to preserve the fall on accent 1, while the fall seems to be of less importance for accent 2.

    One interpretation of this is that the falling f0 contour in the stressed vowel of accent 1 and the height from which the fall takes place in accent 2 are enough of a cue to maintain the distinction between the word accents in words with a short stressed vowel. House (1990) has worked with a model of tonal feature perception which may be applied to these findings.

    In order to fully understand the interaction of these cues, a perceptual experiment with synthetic stimuli is in preparation, which will manipulate pitch height and slope in order to discover the relative importance of these factors.

    References

    Bannert R. and Bredvad-Jensen A-C. (1975) Temporal organization of Swedish tonal accents: The effect of vowel duration. Working Papers (Phonetics Laboratory, Department of General Linguistics, Lund University) 10, 1-26.

    Bruce G. and Gårding E. (1978) A prosodic typology for Swedish dialects. In Gårding E., Bruce G. and Bannert R. (eds) Nordic Prosody: Papers from a symposium (Department of Linguistics, Lund University) 219-229.

    Bruce G. (1998) Allmän och svensk prosodi (Department of Linguistics, Lund University) 16.

    Gårding E. (1973) The Scandinavian word accents. Working Papers (Phonetics Laboratory, Lund University) 8.

    Gårding E. and Lindblad P. (1975) Constancy and variation in Swedish word accent patterns. Working Papers (Phonetics Laboratory, Lund University) 3, 36-100.

    House D. (1990) Tonal Perception in Speech. (Travaux de l'Institut de Linguistique de Lund, Lund University) 24.

    Segerup M. (2003) Word accent gestures in West Swedish. In Heldner M. (ed.) Proceedings from FONETIK 2003, Phonum (Department of Philosophy and Linguistics, Umeå University) 9, 25-28.

    Segerup M. (2004) Gothenburg Swedish word accents - a fine distinction. In Branderud P. and Traunmüller H. (eds) Proceedings Fonetik 2004 (Department of Linguistics, Stockholm University) 28-31.

    Segerup M. and Nolan F. (forthc.) Gothenburg Swedish word accents - a case of cue trading? Nordic Prosody (Department of Linguistics and Phonetics, Lund University) IX.

    Visual Acoustic vs. Aural Perceptual Speaker Identification in a Closed Set of Disguised Voices

    Jonas Lindh
    Department of Linguistics, Göteborg University

    AbstractMany studies of automatic speaker recognitionhave investigated which parameters that per-form best. This paper presents an experimentwhere graphic representations of LTAS (LongTime Average Spectrum) were used to identifyspeakers from a closed set of disguised voicesand determine how well the graphic methodperformed compared to an aural approach.

    Nine different speakers were recorded ut-tering a fake threat. The speakers used differentdisguises such as dialect, accent, whisper, fal-setto etc. and the verbatim threat in a normalvoice.

    Using high quality recordings, visual comparison of the Praat vocal tract graphs of LTAS outperformed the aural approach in identifying the disguised voices. Performing speaker identification aurally does not mean analyzing a different sample than the one being analyzed acoustically. Studies of aural perception show a hypothesizing, top-down, active process, which raises interesting questions regarding aural speaker identification with bad quality recordings, noisy backgrounds etc. However, more tests on telephone quality recordings, authentic material and additional types of acoustic measurements must be performed before anything can be said about the implications of LTAS for forensic purposes.

    Background and Introduction

    The so-called voiceprint approach introduced by Lawrence Kersta (1962) suggested a pattern matching procedure comparing broadband spectrograms for speaker identification purposes. It is within this context that an interest in studying visual vs. aural methods arose. Since complex visual pattern matching activates the right hemisphere of the brain and speech and language processes usually the left (Rose, 2002), it would be preferable to find a way to integrate both. There are many problems to be considered when using visual representations of acoustic data within the context of forensic speaker identification, especially considering the effects of low quality recordings. Generally, one can say that primarily aural identification has been the leading method when it comes to casework. Many studies have been carried out to see what parameters are most stable or where effects of low quality can be calculated, for example the telephone effect (Künzel, 2001).

    Generally, LTAS becomes rather stable after 30-40 seconds of speech (Boves, 1984; Fritzell et al., 1974; Keller, 2004). LTAS reflects the energy highs and lows generated by the vocal tract filter on average, which means that it should be more difficult to alter than, for example, F0 or specific phones. This is why the measure is often chosen to visually represent the general energy distribution in frequency for the speech signal. Several studies have been conducted to study energy ratios and level differences for LTAS (Löfqvist, 1986; Löfqvist & Mandersson, 1987; Gauffin & Sundberg, 1977; Kitzing, 1986). Kitzing (1986) recommended that patients should read at the same degree of vocal loudness to avoid the differences that occurred especially in higher frequencies. Kitzing & Åkerlund (1993) pointed out the need for an investigation of the effect of vocal loudness on LTAS curves. Nordenberg & Sundberg (2003) performed such a test and showed that vocal loudness and varied f0 gave variations in Long Time Average Spectra. However, even though an expected variation has been shown, pattern matching on the graphs still seems to be possible. It has been observed that slight differences in identification results between subjects depend on whether they consider distance more important than shape/pattern. Hollien & Majewski (1977) tested long-term spectra as a means of speaker identification under three different speaking conditions, i.e. normal, during stress and disguised speech. LTS for fifty American and fifty Polish male speakers were used under fullband as well as passband conditions. The results demonstrated high levels of correct identification (especially under fullband conditions) for normal speech, with degrading results for stress and disguise.


    Method

    The sixteen disguised voices and the suspect (reference) voices were recorded by six females and three males. The recordings were made with a high quality microphone in front of a personal computer, and each subject recorded one normal voice and as many disguised voices as they wanted, repeating the same fake threat in Swedish. All recordings were between four and six seconds long and sampled at 16 kHz. Forced choice was applied in both the aural and visual tests.

    The Graphic Representations of LTAS

    The vocal tract function in Praat draws the LTAS envelope (in decibels) as if it were a series of vocal tract areas (in square meters). This gives a graph representing the LTAS. The graph does not give the axis values, which is reasonable since the overall absolute amplitude, as a parameter, has no real value (Nordenberg & Sundberg, 2003). The important information lies in the relative spectral envelopes, represented by the line showing the energy distribution as a function of frequency.

    Figure 1. A graph comparison sample (in the test the target is red and each reference blue).

    The graphic representations of LTAS were created from an LTAS object using 100 Hz frequency bins (Boersma & Weenink, 2005).
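    The idea of an LTAS with 100 Hz frequency bins can be illustrated with a small sketch. This is not Praat's implementation: the function name `ltas_db`, the frame length, the Hann window and the half-frame overlap are all assumptions chosen for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def ltas_db(signal, sample_rate, bin_width_hz=100.0, frame_len=1024):
    """Illustrative long-term average spectrum (hypothetical helper, not Praat's).

    Averages the power spectrum over overlapping frames, then pools the FFT
    bins into bands of width bin_width_hz and returns the band levels in dB.
    """
    signal = np.asarray(signal, dtype=float)
    hop = frame_len // 2                       # assumed 50 % overlap
    window = np.hanning(frame_len)             # assumed window shape
    n_frames = 1 + (len(signal) - frame_len) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one analysis frame")

    # Average the power spectrum over all frames.
    power = np.zeros(frame_len // 2 + 1)
    for k in range(n_frames):
        frame = signal[k * hop : k * hop + frame_len] * window
        power += np.abs(np.fft.rfft(frame)) ** 2
    power /= n_frames

    # Pool FFT bins into bands of bin_width_hz (the Nyquist bin joins the last band).
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    n_bands = int(np.ceil((sample_rate / 2) / bin_width_hz))
    band_index = np.minimum((freqs // bin_width_hz).astype(int), n_bands - 1)
    band_power = np.bincount(band_index, weights=power, minlength=n_bands)
    band_count = np.bincount(band_index, minlength=n_bands)
    mean_power = band_power / np.maximum(band_count, 1)

    centres = (np.arange(n_bands) + 0.5) * bin_width_hz
    return centres, 10.0 * np.log10(mean_power + 1e-12)
```

    For a recording sampled at 16 kHz this yields 80 bands between 0 and 8 kHz. Only the relative shape of the returned curve is meaningful, which matches the point made above that absolute amplitude carries no real value.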

    The Visual Comparison Test

    Graphs representing LTAS were created for sixteen disguised voices and paired up with each of the reference samples to be used in a visual identification test performed by ten subjects. The order in which they were presented was randomized. The subjects were all students or employees at the Department of Linguistics, Göteborg University. They had all, at some point, taken an undergraduate course in phonetics and/or speech technology.

    The subjects compared each disguised voice with all the suspects/references in pairs and then decided which one they thought was the most similar, comparing shape and/or distance. The subjects were also told that the graphs had no timeline and that they were supposed to perform pattern matching, answering which graphs were the most similar ones in each test sample. They were also asked to comment on how they reached each conclusion and whether distance or shape was most important when coming to a decision. This was done to be able to interpret how subjects compared the visual input. They were allowed to inspect the graphs as many times and for as long as they wanted.
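    The difference between judging by "distance" and judging by "shape" can be made concrete with a small sketch. The two measures below are assumptions chosen for illustration (root-mean-square level difference for distance, Pearson correlation for shape); the paper does not define how subjects weighed the two, and all function names are hypothetical.

```python
import math

def rms_distance(a, b):
    """Distance criterion: root-mean-square difference between two curves (dB)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def shape_correlation(a, b):
    """Shape criterion: Pearson correlation, which ignores a constant level offset."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                     sum((y - mean_b) ** 2 for y in b))
    return cov / norm

def best_match(target, references, criterion):
    """Return the name of the reference curve closest to the target."""
    if criterion == "distance":
        return min(references, key=lambda name: rms_distance(target, references[name]))
    if criterion == "shape":
        return max(references, key=lambda name: shape_correlation(target, references[name]))
    raise ValueError("criterion must be 'distance' or 'shape'")

# Made-up curves: A has the target's shape at a higher level,
# B lies close to the target but has a flatter shape.
target = [0.0, 10.0, 5.0, 0.0]
references = {"A": [20.0, 30.0, 25.0, 20.0], "B": [1.0, 6.0, 6.0, 1.0]}
```

    With these made-up curves the two criteria pick different references, which is one way to picture why subjects relying on shape and subjects relying on distance could reach different decisions for the same pair of graphs.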

    The Aural Identification Test

    Seven subjects performed aural identification on the same set of samples so that the results could be compared easily.

    The recordings were put in a list in a randomized order. Subjects used headphones and could listen to the samples as many times as they wanted before deciding which one of the references they thought sounded most like the target. All subjects were of the same category as in the visual test. Some test subjects were the same as in the visual test.

    Results and Discussion

    Even though there is a great difference in performance between subjects within each test, it is clear that the visual identification outperforms the aural.

    The Visual Identification Results

    The results for the visual tests show consistency.

    Table 1. Inter-rater Reliability Analysis (Cronbach's alpha).

    N of Disguised Voices    16
    N of Subjects            10
    Alpha                    0.91

    The impression based on the comments is that subjects with a preference for pattern and shape rather than distance generally performed better in the visual test.


    Figure 2. Percent correct visual identifications per sample (16) for 10 subjects.

    Figure 2 shows how many correct identifications were made per disguised voice sample. Some graphs were obviously very difficult to identify. Why that is so, or how those graphs differ, has not yet been investigated.

    Figure 3. Percent correct visual identifications per subject (10) for 16 samples.

    Figure 3 shows the identification results for each subject, which vary from nine correct identifications to five. As mentioned above, the performance was clearly related to whether the subject used pattern/shape matching more than distance. The average identification score for the visual test is 6.9 (out of 16), which could be considered rather low, but given the difficulties shown in the aural test results, it is the comparison between the two tests that is of interest in this study.

    The Aural Identification Results

    The results in the aural test were less correlated. The reason is simply that subjects found the task much more difficult; most subjects thought that "no decision" should have been added as an alternative answer.

    Table 2. Inter-rater Reliability Analysis (Cronbach's alpha).

    N of Disguised Voices    16
    N of Subjects            7
    Alpha                    0.83

    The reliability score is lower in this test than in the visual test. However, it is high enough to be interpreted as a rather high correlation between subjects.
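    For readers unfamiliar with the reliability measure, Cronbach's alpha for k raters is k/(k-1) times (1 minus the sum of the per-rater variances divided by the variance of the summed scores). The sketch below applies that standard formula. How the responses were coded for Tables 1 and 2 is not stated in the paper, so the 1/0 (correct/incorrect) coding and the small data matrix are assumptions for illustration only.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix with one row per case and one column per rater.

    Uses sample variances (n - 1 in the denominator) throughout.
    """
    n_raters = len(scores[0])
    if n_raters < 2 or len(scores) < 2:
        raise ValueError("need at least two raters and two cases")
    columns = list(zip(*scores))
    rater_variance_sum = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("summed scores do not vary; alpha is undefined")
    return n_raters / (n_raters - 1) * (1 - rater_variance_sum / total_variance)

# Hypothetical coding: rows are disguised voices, columns are subjects,
# 1.0 = correct identification, 0.0 = incorrect.
example_scores = [
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
]
example_alpha = cronbach_alpha(example_scores)
```

    Alpha approaches 1 when subjects tend to succeed and fail on the same voices, which is the sense in which 0.91 and 0.83 indicate agreement between subjects.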

    Figure 4. Percent correct aural identifications per sample (16) for 7 subjects.

    Figure 4 gives a result overview, which may be compared with the corresponding Figure 2 for the visual test. The number of correct identifications per sample is markedly lower, although the maximum possible count is also lower (seven subjects vs. ten).

    Figure 5. Percent correct aural identifications per subject (7) for 16 samples.

    Data for Figure 2. % Correct Visual Identifications / Sample (x-axis: disguised voice samples 1-16; y-axis: % correct, 0-100). Bar labels: 0, 90, 70, 40, 10, 80, 10, 30, 40, 40, 40, 30, 10, 80, 100, 20.

    Data for Figure 3. % Correct Visual Identifications / Subject (x-axis: subjects 1-10; y-axis: % correct, 0-100). Bar labels: 31, 56, 44, 31, 44, 38, 50, 50, 44, 44.

    Data for Figure 4. % Correct Aural Identifications / Sample (x-axis: disguised voice samples 1-16; y-axis: % correct, 0-100). Bar labels: 14, 14, 29, 14, 14, 57, 29, 43, 71, 29, 14, 0, 43, 100, 14, 71.

    Data for Figure 5. % Correct Aural Identifications / Subject (x-axis: subjects 1-7; y-axis: % correct, 0-100). Bar labels: 38, 31, 19, 44, 31, 38, 44.

    Figure 5 presents the aural results corresponding to Figure 3 for the visual identification test. The subjects' results are lower overall, although the highest aural score (seven) exceeds the lowest visual score (five), so there seems to be an individual strategy for success involved. The result per subject in the aural test also shows a higher degree of variation than the visual. This is probably due to the difficulties the subjects showed in deciding which reference to choose.
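    The per-subject comparison can be checked with a little arithmetic. The sketch below converts the percentage labels of the per-subject bar charts (Figures 3 and 5) back into counts out of 16 samples. The percentages are read from the bar labels; assigning them to subjects in left-to-right order is an assumption, and the aural mean is derived here rather than stated in the paper.

```python
N_SAMPLES = 16  # disguised voice samples judged by every subject

# Bar labels (% correct) read from Figure 3 (visual) and Figure 5 (aural).
visual_percent = [31, 56, 44, 31, 44, 38, 50, 50, 44, 44]
aural_percent = [38, 31, 19, 44, 31, 38, 44]

def to_counts(percentages, n_samples=N_SAMPLES):
    """Convert rounded percentages back to whole numbers of correct answers."""
    return [round(p * n_samples / 100) for p in percentages]

visual_counts = to_counts(visual_percent)
aural_counts = to_counts(aural_percent)

visual_mean = sum(visual_counts) / len(visual_counts)
aural_mean = sum(aural_counts) / len(aural_counts)

print("visual:", visual_counts, "mean", round(visual_mean, 2))
print("aural: ", aural_counts, "mean", round(aural_mean, 2))
```

    The visual counts range from five to nine and average 6.9, in agreement with the text; under the same conversion the aural counts range from three to seven.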

    Conclusions

    General advantages of graphic representations are:

    - Intra-subjectively applicable (depending on the amount of data).
    - Relatively simple fundamentals for calculation.
    - Rather easy to visualize.

    General disadvantages are:

    - Difficult to quantify and substantiate comparisons.
    - The visualization depends on F0 and vocal loudness variations.
    - An average always ignores specific events in the speech signal.

    Considering the categorical, top-down active human speech perception process (Grosjean, 1980), it is interesting to find visual acoustic information that complements aural methods in forensic speaker identification. When two voice samples are compared, the same input is judged whether the comparison is made aurally or acoustically. The question is how it is analyzed and how the acoustic visual and the aural perceptual information are processed. If a better understanding of the relation between the two is reached, objective methods can be used to judge similarities. Objective acoustic methods, as well as subjective aural ones, can then also more easily be excluded on well-grounded arguments. This could also lead to better statistical data in forensic speaker identification if computer-based methods can be used with more confident supervision. It is clear that aural mistakes are made, especially for disguised voices.

    The graphic representations used in this experiment are not claimed to be complete images reflecting the voice of a speaker. They are but examples showing that in some cases visual acoustic input is better at discriminating between speakers than are ears alone.

    References

    Boersma P. & Weenink D. (2005) Praat: doing phonetics by computer (Version 4.3.01) [Computer program]. Retrieved from

    Boves L. (1984) The phonetic basis of perceptual ratings of running speech. Foris Publications, Dordrecht.

    Gauffin J. & Sundberg J. (1977) Clinical application of acoustic voice analysis. Part II: Acoustic analysis, results 1977/2-3: 39-43.

    Grosjean F. (1980) Spoken word recognition processes and the gating paradigm. Perception and Psychophysics 28: 267-283.

    Hollien H. & Majewski W. (1977) Speaker identification by long-term spectra under normal and distorted speech conditions. Journal of the Acoustical Society of America 62: 975-980.

    Keller E. (2004) The analysis of voice quality in speech processing. In Lecture notes in computer science, Springer Verlag, Berlin.

    Kersta L. G. (1962) Voiceprint identification. Nature 196: 1253-1257.

    Kitzing P. (1986) LTAS criteria pertinent to the measurement of voice quality. Journal of Phonetics 14: 477-482.

    Künzel H. J. (2001) Beware of the 'telephone effect': The influence of telephone transmission on the measurement of formant frequencies. Forensic Linguistics 8: 80-99.

    Löfqvist A. (1986) The long-time-average spectrum as a tool in voice research. Journal of Phonetics 14: 471-475.

    Löfqvist A. & Mandersson B. (1987) Long-time average spectrum of speech and voice analysis. Folia phoniatrica 39: 221-229.

    Nordenberg M. & Sundberg J. (2003) Effect on LTAS of vocal loudness variation. In: TMH-QPSR, KTH, 45: 93-100.

    Rose P. (2002) Forensic Speaker Identification. New York, Taylor & Francis.

    Stevens K. N. (1993) Lexical access from features. In Speech communication group working papers (Vol. VIII, p. 119-144). Research Laboratory of Electronics, Massachusetts Institute of Technology.


    A Model-Based Experiment towards an Emotional Synthesis

    Jonas Lindh
    Department of Linguistics
    Göteborg University

    Abstract

    The most successful methods to induce emotions in state of the art unit selection speech synthesis have been built by switching speech database depending on the desired emotion. These methods require a substantial increase of memory compared to a single database and are computationally slow. The model-based approach is an attempt to reshape a neutrally recorded utterance (comparable to the desired output from a modern unit selection system) so that it simulates a recorded model of a desired emotion.

    Factors for manipulation of duration, amplitude and formant shift ratio are calculated by comparing the recorded neutral utterance with three recorded, basic emotional models in accordance with discrete emotion theory: sadness, happiness and anger. F0 (regarded as the intonation) is copied from the model and is then imposed on the neutrally recorded utterance.

    The evaluation of the experiment shows that subjects easily categorize discrete emotions in a forced choice. They also grade the resynthesized emotional quality from the neutrally recorded utterance almost equally high as the naturally recorded models for the male voice. The female voice created more difficulties and contained more synthetic artifacts, i.e. it was judged to have a lower quality than the recorded models.

    Background and Introduction

    Creating emotional synthesis has been a research area for quite some time. Formant speech synthesis is easily distinguished from human speech not only because of the underdeveloped naturalness, but also due to the lack of expressiveness. Several attempts to implement emotions in formant synthesis have taken place (Cahn, 1988; 1989; 1990; Carlson et al., 1992).

    When dealing with emotional content in speech the point of departure is almost always the neutral utterance. What is neutral speech, i.e. speech without emotions? Normally, neutral speech is thought of as a carrier being modulated to reveal the emotions being communicated. Such a concept is rather useful when it comes to synthesizing expressive speech. One simply treats the relationship as a hierarchy where the abstract underlying expression is neutral and the surface expressions are the emotions we want to induce, in this case the basic emotions from discrete emotion theory: anger, sadness and happiness (Levenson, 1994; Laukka, 2004; Tatham & Morton, 2004; Narayanan & Alwan, 2004).

    A modern state of the art unit selection speech synthesis system normally produces a sentence as neutrally as possible in order to avoid undesired side effects or miscommunication. Neutral in this case means near monotone or containing as few speech fluctuations as possible. This is not always desirable when it comes to, for example, dialogue systems. To be able to judge whether a system succeeds in expressing a certain emotion or desire, it is obviously also important to study how well people in general succeed in communicating emotions.

    The development of conversational systems has increased, meaning that understandable, neutral synthetic speech is barely acceptable anymore. Some success has been reached, but the best systems still depend too much on stored data, including a separate emotional speech database (Bulut et al., 2002).

    The most successful attempts to synthesize emotions have been built by using additional speech databases containing only recordings representing specific emotions uttered (this applies to concatenative/unit selection synthesis systems). The system has to be able to switch database when a specific emotion is desirable. The system must perhaps also use different algorithms/analyses for the different databases since the acoustic content might differ significantly. The databases needed for such a system also mean a substantial increase of data to choose from. A simpler and computationally more efficient method is to induce rules for expressive speech and resynthesize an utterance produced by the system.

    Nowadays, most unit selection systems are created by recording a single professional speaker and then using specified parts (normally diphones as the basic element) of the utterances to concatenate new ones. This normally means that a professional speaker must be available to be recorded for emotional utterances of different lengths. If these recordings are used as models, they will then hopefully not differ more from the utterances produced by the system than utterances by one and the same speaker differ from each other.

    A simpler, rule-based way of inducing emotions in unit selection synthesis has been proposed by, for example, Murray et al. (2000) and Zovato et al. (2004). In this paper, however, an experiment using models to calculate differences between a neutral and an emotional utterance is presented and tested. The results show both difficulties and promising outcomes, which are then discussed with a view to finding ways to induce emotions in synthesis. If emotions are to be created by a system, they cannot be expected to outperform the communication of emotional content from recorded models.

    Method

    Two speakers, one female and one male, were recorded uttering the same sentence (Jag har inte varit där idag, "I have not been there today") in four different expressive styles: neutral, sad, happy and angry. The recordings were made in a studio environment using a high quality microphone. The speakers were told to first consider how to express the emotions in speech concerning duration, amplitude and intonation. They were then told to express the emotions as clearly as possible while recorded, even though the semantic content did not suggest a specific emotion.

    Each recorded emotion was then used both as a model to induce the specific emotion in the neutrally recorded utterance and as a reference against which the resynthesized speech should be compared. If one uses the same speaker and the calculated differences from the same utterance with different emotions, one should be able to resynthesize at least the specific parameters correctly. Six subjects finally evaluated the results by categorizing and grading the neutral recording, the recorded models and the three resynthesized objects for the two speakers, i.e. fourteen utterances of the same sentence.

    The model-based approach

    The approach described and tested in this paper is similar to the rule-based idea described in Zovato et al. (2004) and Murray et al. (2000), except that the rules are based on interactive calculations from comparisons with models. The model calculations are also applied to the complete utterances and not to specific units (e.g. syllables or diphones).

    A state of the art unit selection synthesis attempts to sound as natural and neutral as possible. If the voice used in the system is recorded to produce models of emotions, the neutrally produced output can be seen as the underlying neutral representation. The representation can then be compared to the produced models to be able to calculate variations for specified parameters.

    The aim of the model-based approach is to approach the recognition rate for the models themselves and keep naturalness. The limitations are obvious: when stretching and changing too much, PSOLA will create synthetic artifacts.

    Figure 1. Flow chart showing the script procedure for the model-based experiment.


    Figure 1 shows how a neutral utterance and a model are compared and the neutral utterance finally resynthesized. First, the duration and average amplitude of the neutral utterance and the model are calculated. Equal duration is then calculated for the objects. Pitch tier objects are then created after point processing the framed fundamental frequency values. A point-processed object is a sequence of points $t_i$ in time, defined on a domain $[t_{\min}, t_{\max}]$. The index $i$ runs from 1 to the number of points. The points are sorted by time (i.e. $t_{i+1} > t_i$). Because there is no voiced/unvoiced information, points are generated along the entire time domain of the pitch tier, and the F0 contour is then linearly interpolated. This means that one can easily exchange the point-processed signal tier from one object to another, thus cloning the intonation (Boersma & Weenink, 2005). The formant shift ratio is then calculated for the first three formants and manipulated. Finally, the duration (relative to the model) and the average amplitude are modified and the utterance is resynthesized.
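    The pitch tier idea can be sketched in a few lines. This is not the Praat script used in the experiment: the function names are hypothetical, the pitch points are made up, and rescaling the model's time axis by the duration ratio is only one assumed reading of "equal duration is then calculated for the objects". Holding F0 constant before the first and after the last point is likewise a choice made for this sketch.

```python
from bisect import bisect_right

def clone_intonation(model_times, model_f0, model_duration, neutral_duration):
    """Rescale the model's pitch points to the neutral utterance's duration."""
    if any(later <= earlier for earlier, later in zip(model_times, model_times[1:])):
        raise ValueError("pitch points must be strictly increasing in time")
    scale = neutral_duration / model_duration
    return [t * scale for t in model_times], list(model_f0)

def interpolate_f0(times, f0_values, t):
    """Linearly interpolate F0 at time t; constant outside the outermost points."""
    if t <= times[0]:
        return f0_values[0]
    if t >= times[-1]:
        return f0_values[-1]
    j = bisect_right(times, t)          # times[j - 1] <= t < times[j]
    t0, t1 = times[j - 1], times[j]
    f_lo, f_hi = f0_values[j - 1], f0_values[j]
    return f_lo + (f_hi - f_lo) * (t - t0) / (t1 - t0)

# Made-up example: a model lasting 1.0 s imposed on a neutral utterance of 2.0 s.
times, f0 = clone_intonation([0.1, 0.5], [100.0, 200.0], 1.0, 2.0)
midpoint_f0 = interpolate_f0(times, f0, 0.6)
```

    Because the contour is stored only as sorted (time, F0) points, it can be moved from one utterance to another without touching the rest of the signal, which is what makes the cloning step simple.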

    Results and Discussion

    The result of the modulations was calculated by comparing averages and standard deviations for the resynthesized objects and the models.

    Table 1. Model and modified parameter values for the male voice.

    Male voice        F0 mean  F0 std  Ampl (dB)  F1 mean  F2 mean  F3 mean
    Neutral                95      24         68      519     1482     2644
    Sad    Model          153      18         69      405     1300     2592
    Sad    Resynth        148      16         69      512     1508     2517
    Happy  Model          133      52         75      528     1464     2602
    Happy  Resynth        133      46         72      522     1451     2629
    Angry  Model           84       8         70      517     1367     2672
    Angry  Resynth         83       5         68      507     1452     2629

    Table 2. Model and modified parameter values for the female voice.

    Female voice      F0 mean  F0 std  Ampl (dB)  F1 mean  F2 mean  F3 mean
    Neutral               172      17         70      573     1670     2687
    Sad    Model          328      73         67      587     1535     2783
    Sad    Resynth        311      25         68      610     1651     2681
    Happy  Model          358     119         77      707     1661     2709
    Happy  Resynth        349     107         73      608     1734     2767
    Angry  Model          250      53         77      638     1658     2649
    Angry  Resynth        236      52         74      614     1689     2686

    As can be observed in Table 1, the F0 values are modified fairly well compared to the models. The formant shift ratio should be individualized to each formant and not changed depending on the general averages from the first three. For the female voice (Table 2), the neutral recording contained some traces of creakiness, which led to some failure in the F0 analysis and thereby also in the resynthesis. Generally, the values approach the models.
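    The remark about individualizing the formant shift ratio can be illustrated with the male-voice formant means from Table 1. The sketch below contrasts a single ratio derived from the averages of the first three formants with one ratio per formant. Exactly how the script in the experiment derived and applied its ratio is not specified, so both functions are hypothetical readings and are not expected to reproduce the resynthesized values in the table.

```python
# Formant means (Hz) for the male voice, taken from Table 1.
neutral = {"F1": 519.0, "F2": 1482.0, "F3": 2644.0}
sad_model = {"F1": 405.0, "F2": 1300.0, "F3": 2592.0}

def global_shift_ratio(source, target):
    """One ratio from the averages of the first three formants (assumed reading).

    The ratio of the two averages equals the ratio of the two sums.
    """
    return sum(target.values()) / sum(source.values())

def per_formant_ratios(source, target):
    """A separate shift ratio for every formant."""
    return {name: target[name] / source[name] for name in source}

single_ratio = global_shift_ratio(neutral, sad_model)
individual_ratios = per_formant_ratios(neutral, sad_model)

for name in neutral:
    print(name,
          "global:", round(neutral[name] * single_ratio),
          "individual:", round(neutral[name] * individual_ratios[name]),
          "model:", round(sad_model[name]))
```

    A single ratio scales all formants by the same amount, so it cannot lower F1 by roughly a fifth while leaving F3 almost unchanged, as the sad model requires; per-formant ratios reach each model value by construction.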

    Evaluation Test

    Seven subjects with normal hearing and some previous experience of listening to synthesized speech (six employees and one student at the department of linguistics) performed an evaluation. In the evaluation the subjects listened to sixteen samples, eight male and eight female. The samples were the neutral utterance plus the three recorded models of the same sentence and the three resynthesized samples. When hearing the samples the subjects had to categorize each sample as belonging to one of the four categories neutral, happy,