Auditory Masquing: Wearable Sound Systems for Diegetic ...

4
Auditory Masquing: Wearable Sound Systems for Diegetic Character Voices Alex Stahl 360 Group, Pixar Animation Studios 1200 Park AveEmeryville, CA 94608 USA +1 510 922 3008 [email protected] Patricia Clemens 6575 Farallon Way Oakland, CA 94611 USA +1 510 658 5478 [email protected] ABSTRACT Maintaining a sense of personal connection between increasingly synthetic performers and increasingly diffuse audiences is vital to storytelling and entertainment. Sonic intimacy is important, because voice is one of the highest- bandwidth channels for expressing our real and imagined selves. New tools for highly focused spatialization could help improve acoustical clarity, encourage audience engagement, reduce noise pollution and inspire creative expression. We have a particular interest in embodied, embedded systems for vocal performance enhancement and transformation. This short paper describes work in progress on a toolkit for high-quality wearable sound suits. Design goals include tailored directionality and resonance, full bandwidth, and sensible ergonomics. Engineering details to accompany a demonstration of recent prototypes are presented, highlighting a novel magnetostrictive flextensional transducer. Based on initial observations we suggest that vocal acoustic output from the torso, and spatial perception of situated low frequency sources, are two areas deserving greater attention and further study. Keywords Spatialization, Paralinguistics, Speech Enhancement, Voice Transformation, Sound Reinforcement, Wearable Systems, Magnetostrictive Flextensional Transducer. 1.INTRODUCTION 1.1Motivation Technological enhancement of an actor’s persona is ancient and ubiquitous. Makeup and costumes have been used for millenia to magnify expressions and transform identities. The voice of an invented character is crucial to its believability and impact [25]. Artificial vocal enhancement is also nothing new. Ancient examples of embodied voice transformation include the alaspraka masks of Papua New Guinea and the African mirliton [14,19]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. NIME2010, 15-18 th June, 2010, Sydney, Australia Copyright remains with the author(s). Figure 1. Alaspraka “trumpet” mask. These devices use stretched membranes and long tubes that perform non-linear waveshaping and resonant filtering, interestingly akin to simple vocal tract models [3]. Makeup and costumes have come a long way from wooden masks to articulated animatronic prosthetics [1]. Tools and techniques for acoustical costumes or “auditory masquing” have not advanced as much. 1.2 Focus Modern sound systems can be very good at diffusing sound [16], but we are concerned that they often also diffuse the audience’s attention. The size, weight and operational complexity of traditional system architecture has made extra- diegetic sound the norm in popular performance. We think that disembodied voices should be a creative option, not the default! The first goal of this multidisciplinary project is to develop a flexible, high-quality wearable sound toolkit to facilitate further research in and application of re-embodied sound. We hope to identify tools that can be deployed quickly, as a way to answer some questions and ask others. Initial efforts employed a highly iterative, rapid prototyping approach [10]. In the following three sections, this paper provides background on specific psychoacoustic design goals, describes the technology of the latest working prototype, and concludes with an outline of upcoming work. 2.DESIGN CONSIDERATIONS Certainly a wearable vocal sound system should be comfortable, intelligible, and reliable. In order to be compelling and engaging, we think the system should also pay particular consideration to extended low frequency response, perceivable resonances, and 3D directionality. This section provides context on these concerns. 2.1 The paralanguage of resonance Beyond words, many aspects of a speaker are echoed in the paralinguistic parts of speech [7]. Information about physiology, personality, mental and physical states is usually present and often exceeds the information content of the textual messages. Proceedings of the 2010 Conference on New Interfaces for Musical Expression (NIME 2010), Sydney, Australia 427

Transcript of Auditory Masquing: Wearable Sound Systems for Diegetic ...

Auditory Masquing: Wearable Sound Systems for Diegetic Character Voices

Alex Stahl360 Group, Pixar Animation Studios

1200 Park AveEmeryville, CA 94608 USA+1 510 922 3008

[email protected]

Patricia Clemens6575 Farallon Way

Oakland, CA 94611 USA+1 510 658 5478

[email protected]

ABSTRACTMaintaining a sense of personal connection between increasingly synthetic performers and increasingly diffuse audiences is vital to storytelling and entertainment. Sonic intimacy is important, because voice is one of the highest-bandwidth channels for expressing our real and imagined selves.

New tools for highly focused spatialization could help improve acoustical clarity, encourage audience engagement, reduce noise pollution and inspire creative expression. We have a particular interest in embodied, embedded systems for vocal performance enhancement and transformation.

This short paper describes work in progress on a toolkit for high-quality wearable sound suits. Design goals include tailored directionality and resonance, full bandwidth, and sensible ergonomics. Engineering details to accompany a demonstration of recent prototypes are presented, highlighting a novel magnetostrictive flextensional transducer. Based on initial observations we suggest that vocal acoustic output from the torso, and spatial perception of situated low frequency sources, are two areas deserving greater attention and further study.

KeywordsSpatialization, Paralinguistics, Speech Enhancement, Voice Transformation, Sound Reinforcement, Wearable Systems, Magnetostrictive Flextensional Transducer.

1.INTRODUCTION1.1MotivationTechnological enhancement of an actor’s persona is ancient and ubiquitous. Makeup and costumes have been used for millenia to magnify expressions and transform identities. The voice of an invented character is crucial to its believability and impact [25].Artificial vocal enhancement is also nothing new. Ancient examples of embodied voice transformation include the alaspraka masks of Papua New Guinea and the African mirliton[14,19].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.NIME2010, 15-18th June, 2010, Sydney, AustraliaCopyright remains with the author(s).

Figure 1. Alaspraka “trumpet” mask.These devices use stretched membranes and long tubes that perform non-linear waveshaping and resonant filtering, interestingly akin to simple vocal tract models [3].Makeup and costumes have come a long way from wooden masks to articulated animatronic prosthetics [1]. Tools and techniques for acoustical costumes or “auditory masquing” have not advanced as much.

1.2 FocusModern sound systems can be very good at diffusing sound [16], but we are concerned that they often also diffuse the audience’s attention. The size, weight and operational complexity of traditional system architecture has made extra-diegetic sound the norm in popular performance. We think that disembodied voices should be a creative option, not the default!The first goal of this multidisciplinary project is to develop a flexible, high-quality wearable sound toolkit to facilitate further research in and application of re-embodied sound. We hope to identify tools that can be deployed quickly, as a way to answer some questions and ask others.

Initial efforts employed a highly iterative, rapid prototyping approach [10]. In the following three sections, this paper provides background on specific psychoacoustic design goals, describes the technology of the latest working prototype, and concludes with an outline of upcoming work.

2.DESIGN CONSIDERATIONSCertainly a wearable vocal sound system should be comfortable, intelligible, and reliable. In order to be compelling and engaging, we think the system should also pay particular consideration to extended low frequency response, perceivable resonances, and 3D directionality. This section provides context on these concerns.

2.1 The paralanguage of resonanceBeyond words, many aspects of a speaker are echoed in the paralinguistic parts of speech [7]. Information about physiology, personality, mental and physical states is usually present and often exceeds the information content of the textual messages.

Proceedings of the 2010 Conference on New Interfaces for Musical Expression (NIME 2010), Sydney, Australia

427

Paralinguistic cues are particularly important for trust and rapport-building [13] in relationships including audience-performer. The authors’ empirical knowledge of stage presence, musical acoustics, recording engineering and attachment parenting leads us to hypothesize that the perception of acoustical resonance is an important mechanism of affective resonance.

The value of adjusting vocal resonances to match vocal pitch is well known to singers, although not necessarily in precise acoustical terms [18]. Theatrical voice projection is similar, but seemingly more a matter of adjusting pitch to match resonance. These optimizations improve acoustic output efficiency and they effect the timbre of the voice.

Listeners can infer the size of a sound source, particularly a vocalization, based on the scale of its resonances [27]. We also use formant perception to distinguish the loudness of vocal exertion from the loudness of close proximity. Humans and other animals use resonance cues to estimate both current [8], and desirable [5,6,20], proximity to each other.

Proprioceptive mirror neuron effects have been observed in speech perception brain imaging studies [24]. Listening to a full and resonant voice may invoke the feeling of speaking with one, regardless of relative body size or sound power.

We suggest that in a source-specific audio costume, the timbre of the system may be tuned to the source. Many lightweight and compact speakers have a noticeably “boxy”, “tinny” or otherwise highly colored sound due to uncontrolled resonances. We do not consider this a minor blemish that may be covered up with equalization. Linear transducers tend to be bigger and heavier, and less efficient, which makes the power system bigger and heavier too. Informed by vocal practice and string instrument design, we have been exploring deliberate transducer resonance as a form of voice enhancement.

2.2. Bass is fundamentalA legacy assumption from early telephony is that frequencies below 300 Hz are considered unnecessary for voice messaging. The psychoacoustics concept of missing fundamental, or virtual pitch [17], helps to explain how we can still recognize a deep voice when its fundamental pitch is absent. Nevertheless, while the linguistic content of speech can be recognized with limited bandwidth, paralinguistic information can be lost [4]. To reproduce the full emotional power of dialog in cinema, sound designers prefer bass response down to at least 70Hz [22].

2.3 Situated directionalityWe also observe that the directional sound radiation pattern of a situated source, such as a live actor or musician, makes a significant contribution to its acoustic and theatrical presence. The complex interaction of indirect sound and the environment is an active area in musical acoustics, and we think it is highly relevant to voice costumes as well [9,12]. For bass singers, it has been observed that more than half of the sound power emanates from the chest, not the mouth [23].

Sound systems in general are installed at the periphery of occupied space. The preponderance of 5.1 surround and 2.1 systems is justified by the idea that low frequencies aren’t localizable. Our early experiments (Fig. 2), suggest that auditory perception is different when the bass is in the space. Walking around a woofer, that may itself be moving, promotes a distinct and intriguing spatial impression. While several listeners have independently commented on this effect in our

prototype systems, both authors admittedly play bass and may be biased against being perceived as without a place. We need more cogent study of this, using the novel wearable woofer that will be described in the next section.

Figure 2. Early experiment in computational ventriloquism.

3.CURRENT STATUS

3.1Prototype System ArchitectureThe current system is a two-way design. The typical division of signal into low and high frequency bands is implemented We have also demonstrated a new crossover concept based on voiced/unvoiced classification of speech sounds: one transducer is optimized for harmonic (voiced) components and another for impulsive, noisy features. This approach may have both engineering and perceptual advantages.

3.2ElectronicsThe crossover, time alignment delays, dynamics processing and multiband parametric EQ for each transducer, as well as an external volume control and calibration signal generators, are implemented using an ADI ADAU1701 audio codec/DSP integrated circuit designed for inexpensive portable devices. Power amplifiers for the prototype system are small class-D (switching topology) amplifiers from Sure Electronics,.

3.3Transducers and ResonatorsA variety of unusual transducers have been evaluated and curated in this investigation. Flat Flexible Loudspeakers (FFL) from Warwick Audio Labs resemble frameless electrostatic panels and have several relevant characteristics. These and other thin film possibilities (PVDF piezo film from MSI, ferroelectric film from Emfit)remain interesting candidates, especially for the voiced/unvoiced crossover concept. However none we have seen so far have the power output we needed for an initial application test, without requiring an overly large exposed surface area. Another type of audio transducer potentially suitable for a costume is the inertial exciter or tactile transducer, marketed as a way to turn a wall or a piece of furniture into a loudspeaker. Much of our recent experimentation has been with atypical application of these devices. Many of these are moving-coil based and seem too fragile for this application. Others, from FeONIC and Induction Dynamics, use the magnetostrictive alloy Terfenol-D.

Proceedings of the 2010 Conference on New Interfaces for Musical Expression (NIME 2010), Sydney, Australia

428

Magnetostrictive alloys change their dimensions in response to a magnetic field. They are exceptionally robust; instead of a moving coil or moving magnet, a stationary coil causes a stationary piece of metal to expand and contract in place.These products are typically used to drive a larger panel into a complex vibration pattern. This configuration has been called a distributed mode loudspeaker or DML[15]. We tried using them differently, with the driver housing secured to a harness, and various small (300-600 cm2) rigid panels attached to its active element. A series of vaguely folded horn like enclosures made of flexible foam and other comfortable materials were made and tested as well. (Fig. 3).

Figure 3. Experimental “buzz boards” (front) and resonators (rear)

While it may evoke a few cringes from loudspeaker design experts, this simple approach worked surprisingly well in practice and has been a very convenient experimental apparatus. Not surprisingly, the best results came from very stiff and low mass panels: first 1cm foam core presentation board and later shop-made foam core laminates with wood veneer and carbon fiber.Compared to induction motors, magnetostrictive transducers have very small excursion and high force. Hence driving a lightweight small panel is an inefficient use of them. We tried several other configurations and the current favorite uses two SolidDrives coupled edge-wise to the ends of a thin, curved carbon-fiber panel. The prestressed panel provides mechanical leverage that amplifies the driver vibration and couples it to the air. (Fig. 4) In underwater acoustics, similar designs at an ultrasonic scale are called flextensional transducers [21].

Figure 4. Flextensional bass belt.The inside of the panels close to the wearer’s body vibrates out of polarity like the back side of a loudspeaker cone. This raises the intriguing potential of using the body itself as a resonator. Some energy from the loudspeaker is weakly coupled into the

body, then travels up and out the same path as the wearer’s voice. The inherent dimension ratios are such that this may work to some degree as a transmission line speaker cabinet, meaning that by the time the internally coupled vibrations exit the mouth, they will be in phase at the primary resonant frequency of the panels and of the wearer.At least, in theory. We have received enough positive subjective feedback from expert listeners and surprised bystanders, that it seems there is some serendipitous interaction happening. We look forward to isolating and understanding it in future work.The curved panel model has good high frequency response. However, using a belt-pack speaker for the full frequency range creates a distracting sense of literally speaking from one’s gut.To address this issue without requiring a large expanse of exposed thin film around the head, a line array choker has been created (Fig. 5). This is comprised of a KZ10 ultracompact speaker from K-array in Florence, Italy, held by a comfortable if not couture setting hand-molded from polycaprolactone (Shapelock brand), a low temperature molding plastic. The KZ10 is usually powerful for its size, weighs only 90g, and is small enough to be hidden in plain sight.

Figure 5. Neckline array.

4. SUMMARY and FUTURE WORK

A high-quality prototype wearable sound reproduction system, featuring a novel flextensional radiator has been demonstrated. Subjective audio quality is generally rated as very good. Quantitative measurements and refinements of the design are high priorities. Another near-term step is to build multichannel versions, for example a 5 channel (LCRLsRs) flextensional belt. In the longer term, small integrated modules are envisioned that could be sewn together into self-organizing speaker array. The current system somewhat resembles sonic plate armour; the next evolutionary step should be more flexible, like chain maille.The combination of high-tech rapid prototyping skills and interdisciplinary empiricism has been efficient so far. Moving forward, predictive simulation, e.g. finite element analysis and modeling of the flextensional emitter [2], could facilitate refinements and more practical tailoring of sound suits to specific roles.Another component of a complete audio costume is of course the microphone. We are concurrently investigating some unusual sensors in this context.

An obvious challenge when co-locating mics and speakers is avoiding feedback. One of the authors is working on a novel method for feedback avoidance in speech reinforcement, that will be described in a future paper.

Proceedings of the 2010 Conference on New Interfaces for Musical Expression (NIME 2010), Sydney, Australia

429

5. ACKNOWLEDGEMENTS

Thanks to Joe Marks, Ben Burtt, Loren Carpenter, Janet McAndless and Miranda McDonald-Stahl for their encouragement and support of this project. Special thanks to Eduardo and Roxanne, our tireless fitting models and test subjects.

6. REFERENCES

[1] Berger, Howard, personal communication, April 2010.[2] Bilbao, S. Numerical Sound Synthesis: Finite Difference

Schemes and Simulation in Musical Acoustics. New York, NY: Wiley, 2009.

[3] Blackwood B.M., and Balfour, H. Ritual and Secular Uses of Vibrating Membranes as Voice-Disguisers. Journal of the Royal Anthropological Institute of Great Britain and Ireland, 78, 1/2, (1948), 45-69.

[4] Campbell, N. Getting to the Heart of the Matter: Speech as the Expression of Affect; Rather than Just Text or Language. Language Resources and Evaluation, 39, 1 (Feb 2005), 109-118.

[5] Collins, S. Men's voices and women's choices. Animal Behaviour, 60, 6, (Dec 2000), 773-780.

[6] Collins, S., and Missing, C. Vocal and visual attractiveness are related in women.Animal Behaviour, 65, 5, (May 2003), 997-1004.

[7] Dellwo, V., Huckvale, M., and Ashby, M. How Is Individuality Expressed in Voice? An Introduction to Speech Production and Description for Speaker Classification. in Speaker Classification I. Berlin/Heidelberg: Springer, 2007, 1-20.

[8] Eriksson, A., and Traunmüller, H. Perception of vocal effort and speaker distance on the basis of vowel utterances. In Proceedings of 14th International Congress of Phonetics Sciences (14th ICPhS) San Francisco, USA, Aug 1-7, 1999). 2469–2472.

[9] Flanagan, J. Analog measurements of sound radiation from the mouth. The Journal of the Acoustical Society of America, 32, 12, (Dec 1960), 1613–1620.

[10] Freed, A, Application of new fiber and malleable materials for agile development of augmented instruments and controllers. In Proceedings of New Interfaces for Musical Expression 2008 (NIME 2008) (Genova, Italy, June 5-7, 2008).

[11] Halkosaari, T., and Vaalgamaa, M. Directivity of human and artificial speech. In Proceedings of Baltic-Nordic Acoustics Meeting 2004 (BNAM2004) (Mariehamn, Aland, Finland. June 8-10, 2004). http://www.acoustics.hut.fi/asf/bnam04/webprosari/onlineproc.html

[12] Katz, B., and d’Alessandro, C. Measurement of 3D Phoneme-Specific Radiation Patterns in Speech and Singing. N.p., n.d. Web. 25 Mar. 2010. <http://rs2007.limsi.fr/index.php/PS:Page_14>

[13] Lewis, T., Amini, F., and Lannon, R. A General Theory of Love. New York, NY: Random House, 2000.

[14] Lifschitz, E. Hearing Is Believing: Acoustic Aspects of Masking in Africa. In West African Masks and Cultural Systems. Kasfir, S.L., ed. Tervuren, Belgium: Musee Royal de L'Afrique Centrale, 1988, 221-229.

[15] Mackenzie, N. Distributed Mode Loudspeakers. In Proceedings of Acoustics 2002: The Annual Conference of the Australian Acoustic Society (Adelaide, Australia, Nov 13015 2002), 400-405.

[16] McCarthy, B.. Sound Systems: Design and Optimization, Second Edition: Modern Techniques and Tools for Sound System Design and Alignment. 2 ed. Burlington,VT: Focal Press, 2009.

[17] McDonough, J., and M. Woelfel. Distant Speech Recognition, New ed. New York, NY: Wiley, 2009..

[18] Miller, D.. Resonance in Singing. Princeton, NJ: Inside View Press, 2008.

[19] Peek, P. The Sounds of Silence: Cross-World Communication and the Auditory Arts in African Societies. American Ethnologist, 21,3 (Aug 1994), 474-494.

[20] Reby D., and McComb K. Anatomical constraints generate honesty: Acoustic cues to age and weight in the roars of red deer stags Animal Behaviour, 65, 3, (March 2003), 519-530.

[21] Royster, L. The flextensional concept: a new approach to the design of underwater acoustic transducers. Applied Acoustics, 3. London, England; Elsevier Publishing Company Ltd., 1970. 17-126.

[22] Gary Rydstrom, Ben Burtt, and John Meyer, personal communication, 2008-2009.

[23] Skålevik, M. Sound Radiation from the Chest of Bass Singers. In Proceedings of the Stockholm Music Acoustics Conference 1993 (SMAC 93) (Stockholm, Sweden, Aug 6-9, 1993). <http://www.akutek.info/Papers/MS_Chest_radiation_SMAC93.pdf>

[24] Skipper, J.I., Nusbaum, H.C., and Small, S.L. Listening to talking faces: motor cortical activation during speech perception. Neuroimage, 25,1,(March 2005), 76-89.

[25] Speck, B.P.. Voice and the construction of identity and meaning. In In-roads of Language: Essays in English Studies. Alberola Crespo, M.N., and  Navarro Ferrando, I., ed. Castelló de la Plana, Spain: Universidad Jaume I, 2006. 91-102.

[26] Taylor A.M., Reby D., and McComb K. Human listeners attend to size information in domestic dog growls. The Journal of the Acoustical Society of America, 123, 5, (May 2008), 2903–2909.

[27] van Dinther, R., and Patterson, R.D. Perception of acoustic scale and size in musical instrument sounds. The Journal of the Acoustical Society of America 120, 4, (Oct 2006), 2158–2176.

Proceedings of the 2010 Conference on New Interfaces for Musical Expression (NIME 2010), Sydney, Australia

430