Voice Modulation: A Window into the Origins of Human … et al 2016... · Review Voice Modulation:...

15
Review Voice Modulation: A Window into the Origins of Human Vocal Control? Katarzyna Pisanski, 1,2 Valentina Cartei, 1 Carolyn McGettigan, 3 Jordan Raine, 1 and David Reby 1, * An unresolved issue in comparative approaches to speech evolution is the apparent absence of an intermediate vocal communication system between human speech and the less exible vocal repertoires of other primates. We argue that humansability to modulate nonverbal vocal features evolutionarily linked to expression of body size and sex (fundamental and formant frequen- cies) provides a largely overlooked window into the nature of this intermediate system. Recent behavioral and neural evidence indicates that humansvocal control abilities, commonly assumed to subserve speech, extend to these nonverbal dimensions. This capacity appears in continuity with context-depen- dent frequency modulations recently identied in other mammals, including primates, and may represent a living relic of early vocal control abilities that led to articulated human speech. The Dynamic Human Voice Recent research examining the communicative function of the human voice has effectively established the roles of two key acoustic components the fundamental frequency (F0) (see Glossary) and formants (Box 1) in the expression of many biological and psychological dimensions, including sex and age, body size and shape, hormonal condition, dominance, masculinity or femininity, and attractiveness (for reviews see [15]). For example, men and women whose voices are characterized by low frequencies are typically judged by listeners as more dominant, physically larger and stronger, and more masculine than are speakers with higher-frequency voices [5], and these stereotypes appear partly driven by the physical, anatomical, and physiological mechanisms that inuence and constrain F0 and formant pro- duction. In addition to typically having larger bodies than women, men also tend to have relatively larger larynges, longer vocal tracts, and lower-frequency voices (Box 1). As a consequence, listeners often associate lower frequencies with stereotypically male traits [6,7]. By revealing the many ways in which the human voice is linked physically or perceptually to ecologically relevant traits, this body of research has greatly advanced our understanding of the evolutionary functions of nonverbal vocal communication in humans. In particular it provides compelling evidence that the human voice, and the strong sexual dimorphism that characterizes its main frequency components, has been largely shaped by sexual selection [8] and continues to play a substantial role in human mate choice and mate competition [1,3]. At the same time, by focusing on vocal communication of mate quality, recent studies have overwhelmingly described human voice production and perception in terms of static cues that are almost exclusively studied in the absence of social context. This is clearly an oversimplication, as sexual selection may also favor the ability to exibly manipulate the voice to exploit listenerstendency to Trends Sourcelter theory is the unifying methodological framework in the study of nonverbal vocal communication in humans and other vertebrates. Numerous studies have implicated fun- damental frequency (F0) and vocal tract resonances (formants) in mammalian social communication. In humans, these sexually dimorphic source and lter voice features reliably indicate sex, age, body size, and dominance. By focusing on static rather than dynamic vocal processes, this literature has largely overlooked the[3_TD$DIFF] human capacity to volitionally modulate F0 and formants to express or exaggerate ecologically relevant traits, often in response to specic social contexts. Emerging research suggests that F0 and formant scaling is widespread in humans and present in some other mammals, potentially representing an evolutionary path from simple voice-frequency scal- ing to articulated human speech. 1 Mammal Vocal Communication and Cognition Research Group, School of Psychology, University of Sussex, Brighton, UK 2 Institute of Psychology, University of Wroclaw, Wroclaw, Poland 3 Royal Holloway Vocal Communication Laboratory, Department of Psychology, Royal Holloway, University of London, Egham, UK *Correspondence: [email protected] (D. Reby). TICS 1533 No. of Pages 15 Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy http://dx.doi.org/10.1016/j.tics.2016.01.002 1 © 2016 Elsevier Ltd. All rights reserved.

Transcript of Voice Modulation: A Window into the Origins of Human … et al 2016... · Review Voice Modulation:...

TrendsSource–filter theory is the unifyingmethodological framework in the studyof nonverbal vocal communication inhumans and other vertebrates.

Numerous studies have implicated fun-damental frequency (F0) and vocal tractresonances (formants) in mammaliansocial communication. In humans, thesesexually dimorphic source and filtervoice features reliably indicate sex,age, body size, and dominance.

By focusing on static rather than

TICS 1533 No. of Pages 15

ReviewVoice Modulation: A Windowinto the Origins of HumanVocal Control?Katarzyna Pisanski,1,2 Valentina Cartei,1 Carolyn McGettigan,3

Jordan Raine,1 and David Reby1,*

An unresolved issue in comparative approaches to speech evolution is theapparent absence of an intermediate vocal communication system betweenhuman speech and the less flexible vocal repertoires of other primates. Weargue that humans’ ability to modulate nonverbal vocal features evolutionarilylinked to expression of body size and sex (fundamental and formant frequen-cies) provides a largely overlooked window into the nature of this intermediatesystem. Recent behavioral and neural evidence indicates that humans’ vocalcontrol abilities, commonly assumed to subserve speech, extend to thesenonverbal dimensions. This capacity appears in continuity with context-depen-dent frequency modulations recently identified in other mammals, includingprimates, and may represent a living relic of early vocal control abilities that ledto articulated human speech.

dynamic vocal processes, this literaturehas largely overlooked the [3_TD$DIFF] humancapacity to volitionally modulate F0and formants to express or exaggerateecologically relevant traits, often inresponse to specific social contexts.

Emerging researchsuggests thatF0andformant scaling iswidespread inhumansand present in some other mammals,potentially representing an evolutionarypath from simple voice-frequency scal-ing to articulated human speech.

1Mammal Vocal Communication andCognition Research Group, School ofPsychology, University of Sussex,Brighton, UK2Institute of Psychology, University ofWrocław, Wrocław, Poland3Royal Holloway VocalCommunication Laboratory,Department of Psychology, RoyalHolloway, University of London,Egham, UK

*Correspondence: [email protected](D. Reby).

The Dynamic Human VoiceRecent research examining the communicative function of the human voice has effectivelyestablished the roles of two key acoustic components – the fundamental frequency (F0) (seeGlossary) and formants (Box 1) – in the expression of many biological and psychologicaldimensions, including sex and age, body size and shape, hormonal condition, dominance,masculinity or femininity, and attractiveness (for reviews see [1–5]). For example, men andwomen whose voices are characterized by low frequencies are typically judged by listeners asmore dominant, physically larger and stronger, and more masculine than are speakers withhigher-frequency voices [5], and these stereotypes appear partly driven by the physical,anatomical, and physiological mechanisms that influence and constrain F0 and formant pro-duction. In addition to typically having larger bodies than women, men also tend to have relativelylarger larynges, longer vocal tracts, and lower-frequency voices (Box 1). As a consequence,listeners often associate lower frequencies with stereotypically male traits [6,7].

By revealing the many ways in which the human voice is linked – physically or perceptually – toecologically relevant traits, this body of research has greatly advanced our understanding of theevolutionary functions of nonverbal vocal communication in humans. In particular it providescompelling evidence that the human voice, and the strong sexual dimorphism that characterizesits main frequency components, has been largely shaped by sexual selection [8] and continuesto play a substantial role in human mate choice and mate competition [1,3]. At the same time, byfocusing on vocal communication of mate quality, recent studies have overwhelmingly describedhuman voice production and perception in terms of static cues that are almost exclusivelystudied in the absence of social context. This is clearly an oversimplification, as sexual selectionmay also favor the ability to flexibly manipulate the voice to exploit listeners’ tendency to

Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy http://dx.doi.org/10.1016/j.tics.2016.01.002 1© 2016 Elsevier Ltd. All rights reserved.

TICS 1533 No. of Pages 15

GlossaryDual-pathway model: currentneural models implicate twopathways in human vocal control:sensorimotor cortical systems thatsupport the production of learnedvocalizations such as speech andsong; and the limbic system thatsupports innate vocalizations such aslaughter. Although these twopathways have traditionally beenthought of as separate, recentresearch on vocal control suggestsotherwise.Enculturation: the process by whichan individual is exposed to and learnsthe traditional content of a culture orassimilates its practices and values;for instance, nonhuman primates thathave been raised by humans in ahuman-like cultural and socialenvironment.Flexible: characterizes a behaviorthat can change as a function ofvarious (e.g., social, temporal,environmental) factors.Formant: resonant frequencies ofthe supralaryngeal vocal tract thataffect our perception of voice timbre.The shape of the vocal tractinfluences the relative positions offormants and formant spacing,whereas its length (VTL) affectsformant scaling. Longer vocal tractsproduce lower, more closely spacedformants.Fundamental frequency (F0): therate of vocal fold vibration. All elsebeing equal, longer, more massive,and less tense vocal folds tend tovibrate more slowly and produce arelatively lower F0. Voice pitch is theperceptual correlate of F0 (along withits harmonics, integer multiples ofF0).Language: a symbolic, rule-governed system of communicationshared within a group that need notinvolve verbal communication orspeech (e.g., written or signlanguage).Speech: a vocal communicationsystem involving precise coordinationof vocal anatomical structures (larynx,supralaryngeal vocal tract, andarticulators such as tongue and lips)required to articulate specific sounds.Vocal control: the capacity tocontrol the larynx (affecting theproduction of F0) or thesupralaryngeal vocal tract (affectingthe production of formants) in aflexible or voluntary manner, for

Box 1. Source–Filter Theory of Speech Production

In humans and most other mammals, vocal signals are produced by the larynx (the source) and subsequently filtered bythe supralaryngeal vocal tract (the filter) [88,89]. When humans phonate, air that is expelled from the lungs and forcedthrough the closed glottis causes oscillation of the vocal folds within the larynx, which determines F0, also perceived asvoice pitch. The sound waves produced by this oscillation travel up from the larynx and through the pharynx and oral andnasal cavities, which comprise the supralaryngeal vocal tract, to the mouth, through which vocal sounds are radiated(Figure IC). In the process, the vocal tract filters the sound, attenuating certain frequencies and not others, therebyproducing resonant frequencies or formants that affect our perception of voice timbre [90].

Following the principles of biomechanics and sound physics, larger vocal folds vibrate at a slower rate than [22_TD$DIFF] do smallervocal folds, resulting in a relatively lower F0 and perceived pitch; however, regardless of mass [23_TD$DIFF], F0 increases when thevocal folds are stretched and become tenser [16,17]. Thus, the effective mass, length, and tension of the vocal foldsaffect F0. Independently of this, longer vocal tracts produce relatively lower, more closely spaced formants [formantspacing (DF)] than[22_TD$DIFF] do shorter vocal tracts [91,92] and hence a more perceptually resonant voice (Figure IA,B). Both F0and formants are inversely related to body size in most mammals [5,10,11,92]. [24_TD$DIFF]Among humans, these voice–sizerelationships are particularly salient between sexes, where F0 and formants are substantially lower among men thanwomen [29]. However, controlling for sex and age, formants explain several times more variation in body size and shapethan does F0 [92,93].

Manipulations of the tongue, lips, jaw, and soft palate can also alter the shape (rather than length) of the vocal tract,affecting the relative distribution (rather than absolute scaling) of formant frequencies and thereby giving rise to differentspeech sounds. The relative positions of formants (especially F1 and F2 [90]; see, for example, Figure 1D[25_TD$DIFF], Key Figure) playa critical role in speech production and perception and in non-tonal languages play a much greater role than F0. Thelowered position of the human larynx also allows humans to produce a wider range of sounds than any other primate[27,28]. However, whether the human larynx descended due to selection for a more complex vocal repertoire [94] orselection for apparent larger body size [27] remains unclear.

F2

F3

F4

F1

ΔF = 875 HzVTL = 20 cm

F2

F3

F4

F1

ΔF = 1250 HzVTL = 14 cm

F5

F0 = 100 Hz F0 = 200 Hz

Form

ant s

calin

gFu

ndam

enta

l fre

quen

cysc

alin

g

(A)

(B)

(C)

Larynx(housing thevocal folds)

Tongue

Nasal cavity

Oral cavityPharynx

Trachea

Lips

So� palate

Figure I. Voice Frequency Scaling. Each spectrogram illustrates the vowel /eI/ spoken by the sameman, whose voicefrequencies were manipulated using Praat acoustic software [95]. (A) Formant scaling, wherein lower, more closelyspaced formants (left) correspond to a relatively longer vocal tract[19_TD$DIFF] and a perceptually more resonant voice. Apparentvocal tract length (VTL) was estimated from DF; for algorithms see [91,92]. (B) Fundamental frequency (F0) scaling [20_TD$DIFF],wherein a lower F0 and more closely spaced harmonics (multiple integers of F0) (left) correspond to a perceptually lower-pitched voice. Note when comparing panels (A) and (B) that formants and F0 can vary independently. (C) Labeled sagittalMRI of the human vocal apparatus during the production of the sustained vowel /u:/. Note the back position of the tongueand protrusion of the lips. Spectrogram parameters: y-axis, 0–5 kHz; x-axis, 0.25 s; window length, 0.04.

2 Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy

TICS 1533 No. of Pages 15

example as a function of socialcontext.Vocal learning: the capacity toproduce novel vocalizations throughimitation or experience (which doesnot necessarily involve vocal control).Voice modulation: manipulation ofany nonverbal property of the voiceincluding but not limited to F0 andformant frequencies.Volitional: characterizes a voluntaryand intentional behavior. Volitionalvocal control involves the capacity tocontrol the larynx and supralaryngealvocal tract independent of immediatecontext and physiological state.

associate sex-typical voice patterns with sex-typical traits (i.e., the frequency or gender code[6,7,9]). Although several animal species are known to scale their vocal frequencies to exag-gerate their body size or to express threat [10–12], the unusually strong sexual dimorphism informants and F0 that differentiates humans from other animals, including[35_TD$DIFF] nonhuman primates[8], may offer a particularly effective platform by which humans can exploit sex-typical vocalpatterns above and beyond size exaggeration. Hence, we suggest that men and womenroutinely take advantage of the salient sex differences in F0 and formants, using the broadacoustic space between (and within) these male and female categories to express various sex-related attributes such as dominance, masculinity or femininity, and threat [9,13].

It is well known that, with vocal training, actors and voice impersonators can significantly raise orlower the key frequency components of their voices (F0 and formants), often adopting vocalranges characteristic of the opposite sex [2]. In singing, sopranos can raise their F0 and matchtheir first formant to frequencies above 1200 Hz [14], six times the average pitch of a woman'svoice. Some politicians (such as Margaret Thatcher) have had extensive vocal coaching to lowerthe frequency of their voice to exude amore authoritative, powerful persona [15]. Although theseexamples may be extreme or anecdotal, a growing body of empirical evidence suggests thatpeople routinely and volitionally modulate their voice in everyday social contexts such as firstdates and job interviews [36_TD$DIFF], when making a compelling argument [37_TD$DIFF], or [38_TD$DIFF] when attempting to deceivesomeone: the human voice is a dynamic and flexible channel for self-expression and, beyondits role in encoding linguistic information, functions also as a social tool.

In this reviewwe suggest that the ability of humans to flexibly control the size-related source–filterdimensions of our vocal signals (i.e., vocal control) is likely to predate our ability to articulate theverbal dimensions of speech and therefore may provide an evolutionary pathway from nonhu-man primate vocal communication to human speech. We show that this largely overlookedmodulation of nonverbal components is not only highly prevalent and functional in human vocalcommunication but also present in the vocal systems of other, nonhuman mammal species. Wepropose that voice modulation can provide key insights regarding the evolution of articulatedhuman speech, focusing on how comparative explorations of the behavioral and neural bases ofvocal control have the potential to lead to important advances in our understanding of theevolution of human vocal communication and the origins of speech.

Behavioral Evidence of Human Vocal Modulation and its Social FunctionsVocal control involvesmanipulation of the vocal folds and larynx or the supralaryngeal vocal tract,including the articulators such as tongue and lips. Mechanistically, F0 is controlled by manipu-lating the tension and effective length or surface area of the vocal folds; for example, bycontracting or relaxing the thyroarytenoid and cricothyroid muscles or increasing subglottalpressure. By contrast, formants are altered by manipulating the length of the vocal tract – forexample, by lowering the larynx or protruding the lips [4_TD$DIFF] – within limits imposed by the skull andbody size [ [39_TD$DIFF]2,10,16,17][40_TD$DIFF] (Box 1).

Recent studies investigating voice modulation in humans have focused on its functional role,typically in a mating context, and provide evidence that people dynamically alter sexuallydimorphic and size-related vocal features (F0 and formants) to advertise or exaggerate traitsthat are relevant to mate selection and competition. For example, in studies involving simulateddating scenarios, men who perceived themselves as physically dominant relative to their malecompetitor lowered their F0 in conversations with him (whereas men who felt subordinate raisedtheir F0 [13]) and produced more monotone speech with lower F0 variability when describingthemselves to a potential female partner [18]. Men exhibiting masculine traits such as relativelylarge body size, high levels of androgens, or low formant frequencies also [41_TD$DIFF]appear to articulatevowels less clearly than [42_TD$DIFF]do more feminine men by differentially modulating the lower formants

Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy 3

TICS 1533 No. of Pages 15

[19]. The authors [43_TD$DIFF] of this latter study suggest that some men may opt for communicatively lessintelligible phonetic variants, typically associated with lower social economic status but also withtoughness, as a means of communicating their masculinity [19]. Other studies show that menand women volitionally and spontaneously decrease F0 and formants when asked to soundmasculine and increase these frequencies to sound feminine [9]. A similar ability has beenobserved in children aged 6–9 years [20], suggesting that prepubescent boys and girls learn toexpress their gender through the voice by imitating adult models. Our own ongoing researchfurther indicates that men and women from distinct cultures spontaneously lower their F0 andformants to sound physically large and raise these vocal frequencies to sound small. Thesemodulations are in some cases extreme, as men can increase their apparent vocal tract length(VTL) by 25% to sound larger or raise their voice pitch to frequencies characteristic of a smallchild when attempting to imitate a small body size.

Sex-typical voice patterns (i.e., relatively lower frequencies in men's voices and higher frequen-cies in women's voices) are generally considered attractive [3,5]. Recent research efforts todetermine whether men and women can effectively manipulate vocal attractiveness haveproduced mixed results [21–25][44_TD$DIFF], but unequivocally indicate that both sexes modulate theirF0 when speaking to attractive members of the opposite sex. Men and women also appear tovolitionally raise their F0 to express confidence and intelligence [21]. At a perceptual level,listeners should be selected and/or learn to detect vocal modulation, given that misattribution ofspeaker traits such as dominance and attractiveness can be costly. Evidence for this hypothesisis [45_TD$DIFF] also mixed: although listeners correctly discriminate deliberately modulated from habitualspeech [21], modulated voices can affect listeners’ assessments of target traits includingintelligence, dominance, and attractiveness [21,24,25], suggesting that even when detectable,voice modulation can remain effectual (cf. [24]).

The behavioral studies reviewed above show that humans can spontaneously manipulate vocalfrequencies to deemphasize or accentuate various biosocially relevant traits with meaningfulvariation across social contexts, within the limits imposed by various anatomical or mechanisticconstraints on vocal production (see, for example, Box 1). Volitional modulation of F0 andformants is often achieved by exploiting robust perceptual correspondences (e.g., gender [9] orfrequency code [6,7,12]). Such vocal modulation has obvious potential benefits. For example,between- and within-individual variation in F0 and formants can influence others’ social attri-butions and mate-choice decisions [21,25] and predicts socioeconomic outcomes includingsuccess in a job interview [22] and political votes [26]. We suggest that similar advantagesassociated with vocal modulation during social interactions are likely to have led to the selectionof this capacity in our ancestors [7,12,27].

Is Voice Modulation the Origin of Vocal Control?Vocal modulation of source–filter components is important for conveying (or exaggerating)various indexical characteristics of the speaker, as outlined above, but is also necessary forspeech production (Box 1). Thus, although behavioral studies of humans offer compellingevidence for vocal modulation and its influence on others, they do not alone indicate whetherF0 and formant modulation emerged to subserve speech [27,28], to exploit apparent indexicalfeatures of the speaker [7,12,27], or potentially, for both purposes.

To answer this question, wemust consider at least two additional, comparative lines of research.First, we need to examine vocal modulation in other mammals that lack articulated speech,including our closest primate relatives [46_TD$DIFF]as [47_TD$DIFF]well as other mammals known to scale their vocalfrequencies, and to compare this with human nonverbal vocal communication as well as speech.Here it is important to distinguish between involuntary vocal variation (i.e., automatically elicitedby environmental stimuli or different levels of arousal) and more controlled vocal modulation

4 Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy

TICS 1533 No. of Pages 15

(more goal-directed and less directly dependent on arousal or external stimuli, but not neces-sarily implying intentionality). Second, we must understand the underlying neural mechanisms ofnonverbal vocal control and, crucially, determine whether the neural correlates involved innonverbal vocal modulation in humans (including emotional vocalizations such as laughter) [5_TD$DIFF]and in other primates, differ from those involved in volitional human speech production. Here wereview initial work in these two areas that, together with the behavioral studies outlined above,suggests that control over F0 and formants in our ancestors may have emerged to exaggeratesocially relevant traits such as body size and threat potential even before the advent of articulatedspeech.

Vocal Flexibility in Other MammalsBetween- and within-individual differences in F0 and formants play an important role in the socialcommunication of many animals [10,11,28,29]. Several mammalian species scale or modulatetheir vocal frequencies in social interactions (Figure 1, Key Figure)[48_TD$DIFF] (e.g., [30–32][49_TD$DIFF]). Although, in theexamples given in Figure 1, formant modulation in nonhumanmammals appears largely linked toan external stimulus or arousal state, we suggest that such formant scaling neverthelessrepresents a potential path for the evolution of vowel articulation via an increasing complexityin formant modulation. In particular we argue that vocal exaggeration of apparent body size,which has been documented in animals ranging from red deer [50_TD$DIFF] (Cervus elaphus) to humans(Figure 1), may have paved the way for more volitional forms of source and filter modulation andcould have been the main driver in the evolution of vocal control [7,27].

Humans are highly unique in the advanced nature of our voice modulation abilities for two keyreasons. First, we can voluntarily and independently control the source and filter properties of ourvocalizations, and second we can perform these modulations in the complete absence of anassociated inducing experience or state [33–35]. It is important to note that vocal control asdefined above is a phenomenon distinct from vocal learning, which does not necessarilyinvolve flexible manipulation of the vocal anatomy but rather entails the capacity to acquire orconverge call types or entire vocal repertoires through imitation or learning [36,37]. The use ofspecific [51_TD$DIFF] and pre-existing vocalizations in different or novel contexts, and the ability to responddifferentially to the vocalizations of others through experience, is considered distinct from vocalproduction learning, namely because these behaviors are present in a broader range of animalsand are likely to require comparatively less complex neural control [38]. These latter forms oflearning are typically referred to as usage and contextual learning, respectively [39]. Likehumans, many marine mammals and songbirds are capable of complex vocal learning. Indeed,vocal learning evolved independently in multiple lineages [34,38]. Until recently it was consideredto be largely absent in nonhuman primates [40,41]; however, new evidence suggests thatchimpanzees (Pan troglodytes) and orangutans (Pongo) may be capable of some forms of vocallearning [42–44] (cf. [41]).

Although human vocal control capabilities are unparalleled, below we present recent evidencethat nonhuman primates may share our capacity to modulate F0 and formants to perhaps agreater extent than previously thought. Other mammals, such as red deer [6_TD$DIFF], are capable of simple,uniform scaling of formants (Figure 1), but the comparative literature examining flexible vocalcontrol has focused on nonhuman primates due to their relative intelligence and phylogeneticproximity to humans, and also the long-held notion that vocal control subserves spokenlanguage acquisition [27,28].

Vocal Control in Nonhuman Primates?Much of the evidence for control of F0 and formants and imitation in nonhuman primatevocalizations has surfaced in the past 5 years, most of it from captive or enculturated greatapes. For example, captive orangutans [7_TD$DIFF] spontaneously imitate voiceless whistles produced by

Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy 5

TICS 1533 No. of Pages 15

Key Figure

Spectrograms Illustrating Progressive Functional Complexity in FormantModulation Across Several Mammalian Species

F1

F2

F3

F4

0

1.5

0

2

F1

F2

‘leopard’ ‘eagle’ /u:/

‘small size’ /u:/ ‘large size’ /u:/

/i:/

5

5

F1

F2

F3

F4

F1

F2

F3

F4

F1

F2

F1

F2

F3

F4

F1

F2

F3

F4

F1

F2

F3

F4

Body size exaggera�on by vocal tract lengthening

Red deer Humanstag roar

Predator-specific alarm calls Speech sounds (vowels)

Diana monkey Human

(B)

(C) (D)

F1

F2

F3

F4

0

1.5

Unmodulated vocaliza�ons

Bisonbull bellow

(A)

Baboongrunt series

5

F1

F2

F3

F4

10

F1

F2

F3

F4

Rhesus macaquegrunt series

(Formants’ posi�ons and formant spacing remain constant within a call or call series)

(Formants decrease uniformly as the vocalizer increases vocal tract length)

(Formants vary non-uniformly as a func�on ofpredator context)

(Formants vary non-uniformly as a func�on ofvowel sound)

Figure 1. The figure illustrates how formant scaling [15_TD$DIFF] (formants F1–F4 labeled) may have first emerged in mammals for sizeexaggeration and threat displays, becoming increasingly complex over evolutionary time and ultimately allowing [16_TD$DIFF] for thesophisticated formant modulation that characterizes articulated human speech. (A) Calls lacking formant modulationproduced by a male bison (Bison bison), a baboon (Papio hamadryas ursinus), and a female rhesus macaque (Macacamulatta). (B) Male red deer (Cervus elaphus, left), whose highly mobile larynges are positioned unusually low in the vocal trac[96], scale the formant spacing of their roars downward to project a large body size when encountering a threatening malecompetitor [30]; humans (Homo sapiens, right) scale formants similarly when instructed to sound ‘large’. (C) Diana monkeys(Cercopithecus diana) produce alarm calls in which the lowest two formants vary differentially in response to aerial versusterrestrial predators. Reproduced, with permission, from [31]; see also [32] in meerkats (Suricata suricatta). (D) Humansmodulate relative formant positions, particularly F1 and F2, to produce different speech sounds; here we illustrate thedifferences in formant spacing between the vowels /i:/ and /u:/. y-[17_TD$DIFF]axes, frequency (0–x kHz).

6 Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy

t

TICS 1533 No. of Pages 15

humans, indicating voluntary control over the lips and respiratory musculature [45], whereaschimpanzees produce novel and apparently flexible attention-seeking grunts toward humans[46]. Although this latter call type, with extended duration, is not produced in the wild, itdemonstrates a latent capacity to control vocal fold adduction and airflow that is required toproduce sustained laryngeal vibration [52_TD$DIFF], vocal behavior rarely observed in nonhuman primates.Many other primates also possess a rich repertoire of voiceless calls used in the wild, usuallyinvolving articulator manipulation without vocal-fold vibration homologous to human voicelessconsonants [40]. Case studies of enculturated great apes even document some latent abilitiesto voluntarily control the vocal musculature, thus manipulating F0 and individual formants[47–50], and in some cases producing novel rudimentary speech sounds beyond species-typical repertoires [53_TD$DIFF], with phonological features paralleling those observed in human speech[47,49,50].

There is mounting evidence for behavioral and contextual flexibility in great ape vocalizations. Forinstance, wild chimpanzees are capable of voluntary inhibition of affectively triggered food grunts[51], alarm calls [52], greetings [53], and responses to unfamiliar conspecifics [54]. The prefrontalcortex is thought to play a key role in such behaviors [55]. Chimpanzees also preferentiallyproduce snake alarm calls in the presence of group members unaware of the threat [56] [54_TD$DIFF]. [55_TD$DIFF]Thesealarm calls appear flexible and potentially volitional [57], as do their food grunts [58] and joint-travel calls [59]. [56_TD$DIFF]Despite occupying similar habitats, different populations of wild orangutans arenow known to use different calls in the same context [8_TD$DIFF] [44] [57_TD$DIFF], as well as functionally arbitrary calls[60], suggesting a detachment between the production of the acoustic signal and physiologicalinfluence. Finally, among wild bonobos (Pan paniscus), ‘peeps’ and ‘contest hoots’ with thesame acoustic structure are used in various social contexts [61,62], providing the first evidenceof vocal signals used by a nonhuman primate to express a range of emotional states indepen-dent of context and biological function (i.e., functional flexibility).

These studies offer preliminary evidence that vocal adaptations important in human speechproduction are also present or latent in other mammals, and that among nonhuman primatesmanipulation of the larynx and vocal tract may be more flexible than once thought. In somecases, vocal control among nonhuman primates appears independent of affective state, animportant prerequisite for articulated speech [55]. Studies of enculturated apes provideevidence that the potential for frequency modulation beyond simple, uniform scaling of for-mants exists in nonhuman primates, wherein great apes are able in some cases to manipulateindividual formants to produce articulatory contrasts with speech-like rhythm and someapparent degree of voluntary, self-initiated control of the glottis and supralaryngeal vocal tract.However, critical work is now needed to determine whether nonhuman primates modulate F0and formants within call types to express different levels of aggression/subordination, asposited by several authors [7,12,27]. In this regard, mammals whose F0 and formant scalingis clearly tied to communicating internal states or exaggerating physical traits, such as deer(Figure 1), may provide better convergent models for the study of early vocal modulation inhumans.

A comprehensive understanding of vocal control also demands greater insight into the neuralcorrelates of this behavior. Although neocortical control of the vocal apparatus duringvolitional speech production has traditionally been seen as setting humans apart from otherprimates [63,64], the behavioral studies reviewed above suggest that nonhuman primatesmay possess some cognitive control over vocal behavior. Unfortunately, the neural networksresponsible for modulation of vocal motor systems in animals are poorly understood, andthose involved in source and filter modulation are particularly understudied, even in humans.However, studies comparing volitional and spontaneous vocal production in humans can offersome insight.

Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy 7

TICS 1533 No. of Pages 15

Box 2. The Neuroanatomy of Human Vocal Control

The prevailing understanding of the neural systems controlling human vocal behavior owes much to extensiveinvestigation of nonhuman primates such as the squirrel monkey (Saimiri), as well as to brain lesion and stimulationstudies in human patients. A dual-pathway model is posited where midline and lateral motor systems are associated withthe control of (innate) affective and learned vocalizations, respectively [55,63,97–99].

The production of vocal calls of various types is associated with a complex set of midbrain and brainstem structuresincluding the basal ganglia, the periaqueductal grey (PAG), and brainstem nuclei such as the nucleus ambiguus (whichhouses motoneurons innervating the muscles of the larynx). The PAG has been implicated in the gating of vocalizationpatterns, receiving inputs on the motivational and affective [26_TD$DIFF]states of the organism and activating a motor response. Theneural initiation of innate calls and vocalizations (including, in humans, emotional sounds such as laughter and crying)implicates the frontal lobe in human and nonhuman primates and other mammals (specifically, the ACC).

Vocal flexibility sufficient for human speech requires the control of several source- and filter-related elements: (i) finecontrol of the extrinsic and intrinsic laryngeal muscles; (ii) regulation of airflow from the lungs through the oral and nasalcavities; and (iii) the capacity for precise configuration of the supralaryngeal articulators (lips, tongue, jaw, soft palate) [97].Numerous studies, ranging from cortical stimulation to functional imaging, have demonstrated evidence for somatotopiccontrol of the articulators on the human precentral gyrus (e.g., [100–102]). The direct corticobulbar hypothesis arguesthat the flexibility of human vocal behavior, such as the rapid switching on and off of laryngeal signals, can be attributed tothe presence of direct connections from M1 to brainstem motoneurons, which are indirect (via the reticular formation) innonhuman primates. Interestingly, the first functional imaging study of cortical control of the human larynx implicated amore dorsal portion of the precentral gyrus than the ventral sites described for monkeys [103]. This alternative location,adjacent to both the articulatory and respiratory control centers (see also [104]), might facilitate coordinated processessubserving connected and coarticulated speech[27_TD$DIFF] (Figure 2).

Controlled speech production implicates higher-order structures such as Broca's area in the inferior frontal gyrus; this isargued to host a mental ‘syllabary’ important in the selection of learned motor programs [105]. Auditory and soma-tosensory representations have also been implicated in self-monitoring and error correction during voluntary vocalbehavior (e.g., [106,107]).

Neural Correlates of Volitional and Spontaneous Vocal Control in HumansThe advent of functional neuroimaging (PET and fMRI) has brought with it the possibility toinvestigate the systems regulating speech and voice control in the intact brain and thus hasyielded substantial advances in our understanding of human vocal production. This has led tothe development of [9_TD$DIFF] neurobiological models implicating a cortical network of inferior frontal,auditory, and sensorimotor regions in the selection and articulation of spoken syllables and theuse of sensory feedback to guide output processes (e.g., [65]). Although the literature onmammalian vocalizations suggests two pathways for vocal control (i.e., the dual-pathwaymodel; Box 2 and Figure 2), difficulties in imaging small and deep midbrain and brainstemcenters with fMRI means that many existing accounts of vocal behavior in humans focus oncortical, and predominately perisylvian, sensorimotor systems. Recently, some authors haveargued for greater interaction between the dual paths in human speech, positing meaningfulinvolvement of subcortical regions (the basal ganglia) in the affective andmotivational modulationof laryngeal speech signals [63,66,67]. There are numerous sources of evidence that challengethe independence of these two routes, or at least suggest a greater complexity to the system(see, for example, [68,69]), including evidence that patients with lesions in motor cortical regionsretain the ability to swear [70].

Yet in the same way that the behavioral literature has largely neglected the para- or ‘supra’-linguistic aspects of flexible and voluntary human vocal behavior, so has the field of cognitiveneuroscience. As a result, most studies have focused on what is being said (i.e., choice ofphonemes, syllables, and words) and a very small number of studies have specifically focusedon how speech is produced, or flexibly controlled (i.e., pronunciation, indexical cues, andaffective content). Below we review this modest but growing area of research.

The investigation of novel speech sound imitation and pronunciation has revealed variation inboth the activation and the[10_TD$DIFF] structure of typically left-hemisphere [58_TD$DIFF] frontal sites (e.g., inferior frontal

8 Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy

TICS 1533 No. of Pages 15

Periaqueductalgrey (PAG)

Nucleusambiguus

Anteriorcingulate

cortex(ACC)

Supplementarymotor area

(SMA)

Broca’s area andinsula

Premotorcortex

Primary motorcortex (M1) Somatosensory

cortex (S1)

Auditorycortex

Cerebellum

Caudate nucleus

Putamen andglobus pallidus

Larynx/Phona�on area (LPA)

Lip protrus�on

Ver�cal tongue movement- a�er [103]

Key:

Figure 2. Main Brain Regions Implicated in Human Vocal Control. According to the dual-pathway model,sensorimotor cortical systems (and the cerebellum, shaded orange) support the production of learned vocalizations suchas speech and song and involve direct innervation, whereas the limbic system in the anterior cingulate cortex ([18_TD$DIFF]ACC, blue) isresponsible for the initiation of innate vocalizations such as laughter. Several studies have identified an important role for thedorsal striatum (green) in vocal behavior, including foreign language learning (e.g., [74]). A first direct investigation of thecontrol of the laryngeal muscles in humans identified a more dorsal location than had been observed in monkeys [103]. Weargue that this represents an important evolutionary change in the control of vocalizations in humans, potentially supportingthe emergence of articulated speech.

cortex, insula) as well as regions of the inferior parietal cortex (e.g., Rolandic operculum,supramarginal gyrus) and the superior temporal cortex, all associated with individual differ-ences in task performance [71– [59_TD$DIFF]74]. Studies have also highlighted the role of subcorticalstructures in voluntary phonetic learning contexts, where the dorsal striatum (bilateralcaudate and putamen) is involved in initial attempts to imitate a novel sequence of sounds[74]. One experiment showed that the left anterior insula and inferior frontal gyrus wereengaged during voluntary impersonations of speaking styles, accents, and familiar individu-als [75]. A similar study showed increased responses in the left anterior insula (as well as thesupplementary motor area, the anterior cingulate cortex (ACC), and regions of the superiortemporal cortex) when participants voluntarily adopted a question inflection for a train ofnonsense syllables [76].

Similar to research on speech production, work on the neural correlates of song – [60_TD$DIFF]whereinsinging often involves significant voice frequency modulation (see, for example, [14]) – hasidentified four key neural substrates of auditory–motor control, for example in the production andcorrection of melodic pitch: the auditory cortex, the intraparietal sulcus, the anterior insula, andthe ACC [77]. Researchers have reported experience-dependent differences in the engagementof brain regions such as the anterior insula and sensorimotor cortices associated with the controlof breathing (see [78]) and increased use of sensory feedback during vocal performance [79,80].A recent study of vocal pitch processing showed greater engagement of the basal ganglia(bilateral putamen) in the imitation of melodies compared with non-imitative productions andmelody perception [67]. The authors [61_TD$DIFF] of that study argue that establishing how these subcortical

Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy 9

TICS 1533 No. of Pages 15

Box 3. The Special Case of Laughter: Novel Insight from an Animal-[28_TD$DIFF]like Human Vocalization

Human laughter is ubiquitous in interpersonal interactions and serves various emotional and social signaling functions[108]. Importantly, spontaneous (i.e., authentic) laughter can be elicited in the laboratory with relative ease and can becompared with the volitional production of laughter (on demand). These two forms of laughter are developmentally,acoustically, and perceptually distinct [109–111] (Figure I), suggesting different production mechanisms that may alignwith the midline and lateral pathways outlined in Box 2.

Among humans, the hypothalamus appears to be more active during tickling laughter than during volitional, promptedlaughter [68]. This suggests a potential role for the hypothalamus in conveying motivational and affective information tothe PAG in the implementation of laughter motor patterns. The same study found ACC activation during volitional laughterand [29_TD$DIFF] during the suppression of laughter during tickling [30_TD$DIFF], but not during spontaneous laughter, implicating the ACC involitional vocal control [68]. Nonetheless, spontaneous tickling still engages a range of lateral sensorimotor regionsequivalent to activation during volitional laughter; indeed, these appear even more strongly activated when ticklinglaughter is actively suppressed [68]. This finding suggests fairly continuous involvement of newer cortical control systemsin voice production, even for affective and innate vocalizations, and argues against the rather clear demarcation of neuralsystems suggested in the literature (see also Box 2).

Spontaneous human laughter is argued to be similar in form and function to tickling-induced play vocalizations in otherprimates [112]. Both are often produced without voicing [113] and when spontaneous (but not volitional) human laughsare slowed down and pitch proportionally adjusted, they are largely indiscriminable from nonhuman primate vocalizations[109]. Spontaneous human laughter and homologous macaque and chimpanzee vocalizations are characterized bysimilar interval duration, serial organization, and high intra-bout variability in acoustic parameters [114]. The presence ofregular vocal [31_TD$DIFF] fold vibration and consistently egressive airflow in some ape laughter vocalizations – call characteristicspreviously described as markers of human laughter and speech – thus indicates homology in primate vocal controlmechanisms [115].

In addition to spontaneous laughter, adult (but not infant) chimpanzees produce acoustically distinct (e.g., shorter) laughreplications in response to the laughter of conspecifics, with a comparatively delayed response latency [115], implicatingat least some degree of non-automatic vocal control [108,110,115]. Hence, chimpanzees’ capacity to imitate ormodulate laughter appears highly similar to volitional human laughter in both developmental trajectory and its functionin social cohesion.

103

Freq

uenc

y (H

z)

F0 (H

z)

0

104

75

2.1 seconds 1.2 seconds

Voli�onal laughter Spontaneous laughter

Figure I. Human Spontaneous Amusement Laughter (Left) and Volitional Laughter (Right) Produced by theSameMan. Volitional control of laughter as a normative social signal is refined throughout human development whereasspontaneous laughter emerges early in infancy and appears innate, and relatively more animal-like, in its acousticstructure. As illustrated by these waveforms and spectrograms, bouts of spontaneous laughter tend to be longer inoverall duration with shorter individual calls, higher [21_TD$DIFF]fundamental frequency (F0) (plotted in blue), and differing spectralprofiles compared with volitional laughter [109,110]. Nonlinearities in spontaneous laughter vocalizations (e.g., wheezing,glottal whistles) reflect reduced control of vocal behavior during authentic emotional experience [110,116].

10 Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy

TICS 1533 No. of Pages 15

Outstanding QuestionsHow do speakers modulate or scale F0and formants differently across distinctsocial contexts, including, for example,a first date, job interview, or publicspeech?

Does volitional voice modulationprompted in the laboratory differbehaviorally, acoustically, or neurallyfrom the more spontaneous modula-tion observed in real social contexts?

What happens in the brain when wescale formants or lower F0 to soundphysically larger compared with whenwe modulate these vocal features inother contexts, such as during voli-tional or faked conversational laughter?

What are the neural correlates of F0and formant scaling in wild nonhumanprimates and other mammals and howdo these compare with those ofhumans and enculturated primates?

How do individuals vary in the extent towhich they can achieve voice modula-tion and how reliably do these individ-ual differences relate to the ultimateeffectiveness of modulation in alteringthe social assessment or behaviour oflisteners?

How does the multimodal interplaybetween vocal and visual cues influ-ence the effectiveness of voice modu-lation, particularly when vocal andvisual information conflict?

Why do some aphasic patients withlateral lesions retain the ability to swearor imitate animal vocalizations andwhat are the implications for crosstalkbetween the laryngeal motor cortexand the ACC/midline voice-controlpathway?

sites functionally connect with cortical structures is crucial to understanding how they assist inthe auditory-to-motor transforms necessary for imitative behavior.

Other studies have examined the voluntary production of emotional inflections in speech toinvestigate the influence of limbic regions on vocal processes. A comparison of happy andneutral production of syllables revealed increased activation in many cortical and subcorticalregions, crucially including the anterior cingulate and medial temporal cortex as well as thethalamus and midbrain [76]. Recent studies have linked involvement of the ACC and theamygdala to autonomic arousal during the production of angry speech vocalizations [81].The induction of sad (but not happy) affect also predicts activation of the ACC as well aschanges in F0 range [82]. A recent study argued for a more general role for the anterior cingulatein the control of pitch modulations by showing equivalent activation of this region (and of thelaryngeal motor cortex) in the production of both volitional affective and non-affective syllables.These functional imaging studies have thus revealed clear roles for neocortical sensorimotorsystems in the flexible control of human speech and song and our underlying relative expertise inthese behaviors. However, reports of additional involvement of the limbic and striatal regions –

part of an evolutionarily older control system (Box 2) – in comparatively demanding contextsrequiring voluntary modulation of emotional prosody or in phonetic learning are consistent withthe idea that both vocal pathways play a role in voice modulation.

Crucially, then, current evidence from neuroimaging research in humans seems to suggest thatthe limbic and cortical vocal pathways (Figure 2) are highly interactive and that both cortical andsubcortical routes are involved in volitional and spontaneous vocal production [66]. Moreover,vocal flexibility in nonhuman primates suggests that other species have greater neuroanatomicalelaboration of the direct lateral motor cortical route than previously thought or, alternatively, maybe achieving flexibility with older neural structures (e.g., ACC, limbic system). Although there areissues associated with establishing genuine spontaneity in laboratory settings, laughter mayoffer one of the most promising signals with which to interrogate the influence of the differentneural systems controlling vocal output: it allows us to directly compare what happens in thebrain during volitional and spontaneous vocal production in humans and also to comparespontaneous vocalizations produced by humans and nonhuman primates (Box 3).

Concluding Remarks and Future DirectionsVolitional and flexible control of the anatomical structures involved in speech production (thelarynx producing F0 and the supralaryngeal vocal tract producing formants) has long beenthought to set humans apart from other vocalizing animals and is generally considered aprecursor to human speech acquisition and production. We have reviewed preliminary behav-ioral and neural evidence to suggest that the capacity for vocal control is not uniquely human butpresent to some degree in other mammals, including nonhuman primates [62_TD$DIFF](Box 4[63_TD$DIFF]).

We suggest that flexible scaling of source–filter components of vocalizations (F0 and formants)may provide a path for the evolution of vocal control, wherein control of these acousticcomponents may have emerged in our ancestors to express, exaggerate, or even fake physicaltraits or internal motivational states. The fact that many animal species and humans from distinctcultures are capable of scaling their formants to exaggerate their apparent size imparts thepossibility that formant scaling related to size/threat exaggeration (and perception) may consti-tute a precursor of more complex formant modulation involved in speech articulation [27]. Asillustrated in Figure 1, [11_TD$DIFF] linear formant scaling [64_TD$DIFF] as observed in red deer roars[65_TD$DIFF], or [66_TD$DIFF]slightly morecomplex formant [67_TD$DIFF]modulations as observed in the predator calls of diana monkeys, [68_TD$DIFF]effectivelycommunicating size-related information, could have opened the way for increasingly greatervocal control, ultimately leading to full-blown articulated human speech. The sound-symbolicproperties of high front-vowel sounds associated with smalleness (such as /i:/, in which F2 is

Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy 11

TICS 1533 No. of Pages 15

Box 4. Neural Networks Involved in Nonhuman Primate Vocal Control: Preliminary Studies

Although the neural correlates of nonhuman primate vocal control are poorly understood, preliminary studies suggestthat the cortex plays a role and may underlie aspects of flexible vocal behavior in primates. For example, in pigtailedmacaques (Macaca nemestrina), vocalization-selective neurons in the ventral premotor cortex fire during prompted butnot spontaneous vocalization [117]. Implanted electrocorticographic arrays corroborate motor cortex activity duringrhesus macaque (Macaca mulatta) vocal production [118]. Rhesus macaques also possess neurons in the ventrolateralprefrontal cortex that are specifically responsible for the planning of instructed vocalizations [119,120], some of whichmay integrate higher-order auditory processing with cognitive control of vocal motor action. Most recently, and inperhaps the most convincing substantiation of cortical involvement in nonhuman primate vocal behavior to date,researchers observedmotor production-related changes in commonmarmoset (Callithrix jacchus) frontal cortex neuronsduring natural communicative calls [121].

Chimpanzees – who are (along with bonobos) our closest primate relatives and whose laughter calls most stronglyresemble those of humans [112] –may possess greater direct innervation of laryngeal motor nuclei than our more distantape cousins [122]. Quantitative phylogenetic trees constructed based on laughter acoustics of humans and other greatapes map closely onto trees based on genetic similarity [112] and cross-species examinations of laughter highlightcontinuity in the capacity for controlledmodulatory adaptation of an innate vocal repertoire (see also Box 3). Nevertheless,investigation into the neural correlates of laughter in nonhuman primates is a promising but as yet uninvestigated avenueof research.

Nonhuman primates do not appear to possess direct motor cortex innervation of the laryngeal nuclei [55,97] [32_TD$DIFF]; [33_TD$DIFF]however,complex laryngeal control may not be important in the natural primate environment. In macaques, indirect (i.e., at leastdisynaptic) connections from the primary motor cortex to motoneurons innervating hand muscles are present at birthwhereas monosynaptic innervation pathways develop by age 2 years, coinciding with the capacity for fine fingermovements for the skilled hand tasks [2_TD$DIFF] integral to successful survival [34_TD$DIFF] (see [123]). If the development of such monosynapticpathways is at least partly dependent on experience, the motor cortex of an enculturated great ape with species-atypicalvocal modulation capabilities may establish direct connections with laryngeal motoneurons, and the lack of suchinnervation in normal development may be due to a lack of environmental necessity rather than neural capacity.

raised), versus low back-vowel sounds associated with largeness (such as /#:/, in which F2 islowered) could reflect this original function [ [69_TD$DIFF]83,124].

The current neuroimaging literature on vocal control is representative of a subfield influenced bythe historical motivations of the linguistics and neuropsychology (i.e., aphasia) literatures,focusing largely on speech and language production. Thus, we are still lacking studies thatprimarily and systematically investigate the neural control of core source–filter modulations suchas those that may have facilitated exaggeration of body size and cues to sex and identity in ourpreverbal evolutionary past [4,28] and those used ubiquitously in the voluntary vocal communi-cation of attitude, emotional state, and even aspects of personality [84,85]. Hence, future workshould make greater attempts to link neural activation data to the underlying articulatorydynamics and acoustic content of vocal behaviors, including imitation. Real-time anatomicalimaging of the vocal tract using MRI (up to 80 frames/s [86]) offers the opportunity to imagenaturalistic vocal behavior while also measuring neural activation with echoplanar imaging of thebrain. Using amodal analysis approaches such as representational similarity analysis (RSA) [87] [70_TD$DIFF],it is possible to probe the neural representation of articulatory dynamics.

There exists an opportunity to strengthen the links between the behavioral and neural imagingliteratures and form a more coherent cognitive neuroscience of the voice. We thereforeencourage colleagues across disciplines to engage in research that directly examines voicemodulation as a signal with tremendous social, as well as linguistic, significance. In our quest touncover the origins of speech, understanding the behavioral functions and neural mechanismsof both involuntary and voluntary nonverbal modulation in humans, and looking for emergingequivalents in other mammals, is likely to be our best way forward (see Outstanding Questions).

AcknowledgmentsThe authors thank Tecumseh Fitch, Drew Rendall, and Megan Wyman for sharing vocal recordings used in the creation of

Figure 1.

12 Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy

TICS 1533 No. of Pages 15

References

1. Puts, D. et al. (2012) Sexual selection on human faces and

voices. J. Sex Res. 49, 227–243

2. Kreiman, J. and Sidtis, D. (2011) Foundations of Voice Studies:An Interdisciplinary Approach to Voice Production and Percep-tion, Wiley Blackwell

3. Feinberg, D.R. (2008) Are human faces and voices ornamentssignaling common underlying cues to mate value? Evol. Anthro-pol. Issues News Rev. 17, 112–118

4. Owren, M.J. (2011) Human voice in evolutionary perspective.Acoust. Today 7, 24–33

5. Pisanski, K. and Bryant, G.A. (2016) The evolution of voiceperception. In The Oxford Handbook of Voice Studies (Eidsheim,N.S. and Meizel, K.L., eds), Oxford University Press (in press)

6. Ohala, J.J. (1983) Cross-language use of pitch: an ethologicalview. Phonetica 40, 1–18

7. Ohala, J.J. (1984) An ethological perspective on common cross-language utilization of F0 of voice. Phonetica 41, 1–16

8. Puts, D. et al. (2016) Sexual selection on male vocal fundamentalfrequency in humans and other anthropoids. Proc. Biol. Sci. (inpress)

9. Cartei, V. et al. (2012) Spontaneous voice gender imitation abili-ties in adult speakers. PLoS ONE 7, e31353

10. Fitch, W.T. and Hauser, M.D. (2003) Unpacking “honesty”: ver-tebrate vocal production and the evolution of acoustic signals. InAcoustic Communication (Simmons, A.M. et al., eds), pp. 65–137, Springer

11. Taylor, A.M. and Reby, D. (2010) The contribution of source–filtertheory to mammal vocal communication research: advances invocal communication research. J. Zool. 280, 221–236

12. Morton, E.S. (1977) On the occurrence and significance of moti-vation–structural rules in some bird and mammal sounds. Am.Nat. 111, 855–869

13. Puts, D. et al. (2006) Dominance and the evolution of sexualdimorphism in human voice pitch. Evol. Hum. Behav. 27, 283–296

14. Joliveau, E. et al. (2004) Acoustics: tuning of vocal tract reso-nance by sopranos. Nature 427, 116

15. Karpf, A. (2006) The Human Voice, Bloomsbury

16. Titze, I.R. (2011) Vocal fold mass is not a useful quantity fordescribing F0 in vocalization. J. Speech Lang. Hear. Res. 54,520–522

17. Hollien, H. (2014) Vocal fold dynamics for frequency change. J.Voice 28, 395–405

18. Hodges-Simeon, C.R. et al. (2010) Different vocal parameterspredict perceptions of dominance and attractiveness. Hum. Nat.21, 406–427

19. Kempe, V. et al. (2013) Masculine men articulate less clearly.Hum. Nat. 24, 461–475

20. Cartei, V. and Reby, D. (2013) Effect of formant frequency spac-ing on perceived gender in pre-pubertal children's voices. PLoSONE 8, e81022

21. Hughes, S.M. et al. (2014) The perception and parameters ofintentional voice manipulation. J. Nonverbal Behav. 38, 107–127

22. Hughes, S.M. et al. (2010) Vocal and physiological changes inresponse to the physical attractiveness of conversational part-ners. J. Nonverbal Behav. 34, 155–167

23. Fraccaro, P.J. et al. (2011) Experimental evidence that womenspeak in a higher voice pitch to men they find attractive. J. Evol.Psychol. 9, 57–67

24. Fraccaro, P.J. et al. (2013) Faking it: deliberately altered voicepitch and vocal attractiveness. Anim. Behav. 85, 127–136

25. Leongómez, J.D. et al. (2014) Vocal modulation during courtshipincreases proceptivity even in naive listeners. Evol. Hum. Behav.35, 489–496

26. Klofstad, C.A. et al. (2015) Perceptions of competence, strength,and age influence voters to select leaders with lower-pitchedvoices. PLoS ONE 10, e0133779

27. Fitch,W.T. (2000) The evolution of speech: a comparative review.Trends Cogn. Sci. 4, 258–267

28. Ghazanfar, A.A. and Rendall, D. (2008) Evolution of human vocalproduction. Curr. Biol. 18, R457–R460

29. Rendall, D. et al. (2005) Pitch (F0) and formant profiles of humanvowels and vowel-like baboon grunts: the role of vocalizer bodysize and voice-acoustic allometry. J. Acoust. Soc. Am. 117, 944

30. Reby, D. et al. (2005) Red deer stags use formants as assess-ment cues during intrasexual agonistic interactions. Proc. Biol.Sci. 272, 941–947

31. Riede, T. et al. (2005) Vocal production mechanisms in a non-human primate: morphological data and a model. J. Hum. Evol.48, 85–96

32. Townsend, S.W. et al. (2014) Acoustic cues to identity andpredator context in meerkat barks. Anim. Behav. 94, 143–149

33. Fitch, T.W. (2006) Production of vocalizations in mammals. Vis.Commun. 3, 115–121

34. Janik, V.M. (2014) Cetacean vocal learning and communication.Curr. Opin. Neurobiol. 28, 60–65

35. Pinker, S. and Bloom, P. (1990) Natural language and naturalselection. Behav. Brain Sci. 13, 707–727

36. Janik, V.M. and Slater, P.J. (1997) Vocal learning in mammals.Adv. Study Behav. 26, 59–100

37. Petkov, C.I. and Jarvis, E.D. (2012) Birds, primates, and spokenlanguage origins: behavioral phenotypes and neurobiologicalsubstrates. Front. Evol. Neurosci. 4, 12

38. Nowicki, S. and Searcy, W.A. (2014) The evolution of vocallearning. Curr. Opin. Neurobiol. 28, 48–53

39. Janik, V.M. and Slater, P.J. (2000) The different roles of sociallearning in vocal communication. Anim. Behav. 60, 1–11

40. Lameira, A.R. et al. (2014) Primate feedstock for the evolution ofconsonants. Trends Cogn. Sci. 18, 60–62

41. Fischer, J. et al. (2015) Is there any evidence for vocal learning inchimpanzee food calls? Curr. Biol. 25, R1028–R1029

42. Watson, S.K. et al. (2015) Vocal learning in the functionallyreferential food grunts of chimpanzees. Curr. Biol. 25, 495–499

43. Watson, S.K. et al. (2015) Reply to Fischer et al. Curr. Biol. 25,R1030–R1031

44. Wich, S.A. et al. (2012) Call cultures in orang-utans? PLoS ONE7, e36180

45. Lameira, A.R. et al. (2013) Orangutan (Pongo spp.) whistling andimplications for the emergence of an open-ended call repertoire:a replication and extension. J. Acoust. Soc. Am. 134, 2326–2335

46. Hopkins, W.D. et al. (2007) Chimpanzees differentially producenovel vocalizations to capture the attention of a human. Anim.Behav. 73, 281–286

47. Hayes, C. (1951) The Ape in Our House, Harper

48. Taglialatela, J.P. et al. (2003) Vocal production by a language-competent Pan paniscus. Int. J. Primatol. 24, 1–17

49. Perlman, M. and Clark, N. (2015) Learned vocal and breathingbehavior in an enculturated gorilla. Anim. Cogn. 18, 1165–1179

50. Lameira, A.R. et al. (2015) Speech-like rhythm in a voiced andvoiceless orangutan call. PLoS ONE 10, e116136

51. Wilson, M.L. et al. (2007) Chimpanzees (Pan troglodytes) modifygrouping and vocal behaviour in response to location-specificrisk. Behaviour 144, 1621–1653

52. Cheney, D.L. and Seyfarth, R.M. (1985) Vervet monkey alarmcalls: manipulation through shared information? Behaviour 94,150–166

53. Laporte, M.N.C. and Zuberbühler, K. (2010) Vocal greeting behav-iour in wild chimpanzee females. Anim. Behav. 80, 467–473

54. Wilson, M.L. et al. (2001) Does participation in intergroup conflictdepend on numerical assessment, range location, or rank for wildchimpanzees? Anim. Behav. 61, 1203–1216

55. Owren, M.J. et al. (2011) Two organizing principles of vocalproduction: implications for nonhuman and human primates.Am. J. Primatol. 73, 530–544

56. Crockford, C. et al. (2012) Wild chimpanzees inform ignorantgroup members of danger. Curr. Biol. 22, 142–146

57. Schel, A.M. et al. (2013) Chimpanzee alarm call productionmeetskey criteria for intentionality. PLoS ONE 8, e76674

58. Schel, A.M. et al. (2013) Chimpanzee food calls are directed atspecific individuals. Anim. Behav. 86, 955–965

Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy 13

TICS 1533 No. of Pages 15

59. Gruber, T. and Zuberbühler, K. (2013) Vocal recruitment for jointtravel in wild chimpanzees. PLoS ONE 8, e76073

60. Lameira, A.R. et al. (2013) Population specific use of the sametool-assisted alarm call between two wild orangutan populations(Pongopygmaeus wurmbii) indicates functional arbitrariness.PLoS ONE 8, e69749

61. Clay, Z. et al. (2015) Functional flexibility in wild bonobo vocalbehaviour. PeerJ 3, e1124

62. Genty, E. et al. (2014) Multi-modal use of a socially directed call inbonobos. PLoS ONE 9, e84738

63. Ackermann, H. et al. (2014) Brain mechanisms of acoustic com-munication in humans and nonhuman primates: an evolutionaryperspective. Behav. Brain Sci. 37, 529–546

64. Simonyan, K. and Horwitz, B. (2011) Laryngeal motor cortex andcontrol of speech in humans. Neuroscientist 17, 197–208

65. Tourville, J.A. and Guenther, F.H. (2011) The DIVA model: aneural theory of speech acquisition and production. Lang. Cogn.Process. 26, 952–981

66. Ludlow, C.L. (2015) Central nervous system control of voice andswallowing. J. Clin. Neurophysiol. 32, 294–303

67. Belyk, M. et al. (2015) The neural basis of vocal pitch imitation inhumans. J. Cogn. Neurosci. Published online December 22,2015. http://dx.doi.org/10.1162/jocn_a_00914

68. Wattendorf, E. et al. (2013) Exploration of the neural correlates ofticklish laughter by functional magnetic resonance imaging.Cereb. Cortex 23, 1280–1289

69. Belyk, M. and Brown, S. (2015) Pitch underlies activation of thevocal system during affective vocalization. Soc. Cogn. Affect.Neurosci. Published online June 15, 2015. http://dx.doi.org/10.1093/scan/nsv074

70. Van Lancker, D. and Cummings, J.L. (1999) Expletives: neuro-linguistic and neurobehavioral perspectives on swearing. BrainRes. Rev. 31, 83–104

71. Golestani, N. and Pallier, C. (2007) Anatomical correlates offoreign speech sound production. Cereb. Cortex 17, 929–934

72. Reiterer, S.M. et al. (2011) Individual differences in audio-vocalspeech imitation aptitude in late bilinguals: functional neuro-imaging and brain morphology. Front. Psychol. 2, 271

73. Reiterer, S.M. et al. (2013) Are you a good mimic? Neuro-acous-tic signatures for speech imitation ability. Front. Psychol. 4, 782

74. Simmonds, A.J. et al. (2014) The response of the anterior stria-tum during adult human vocal learning. J. Neurophysiol. 112,792–801

75. McGettigan, C. et al. (2013) T’ain’t what you say, it's the way thatyou say it – left insula and inferior frontal cortex work in interactionwith superior temporal regions to control the performance ofvocal impersonations. J. Cogn. Neurosci. 25, 1875–1886

76. Aziz-Zadeh, L. et al. (2010) Common premotor regions for theperception and production of prosody and correlations withempathy and prosodic ability. PLoS ONE 5, e8759

77. Zarate, J.M. (2013) The neural control of singing. Front. Hum.Neurosci. 7, 237

78. Ackermann, H. and Riecker, A. (2010) The contribution(s) of theinsula to speech production: a review of the clinical and functionalimaging literature. Brain Struct. Funct. 214, 419–433

79. Kleber, B. et al. (2010) The brain of opera singers: experience-dependent changes in functional activation. Cereb. Cortex 20,1144–1152

80. Kleber, B. et al. (2013) Experience-dependent modulation offeedback integration during singing: role of the right anteriorinsula. J. Neurosci. 33, 6070–6080

81. Frühholz, S. et al. (2015) Talking in fury: the cortico-subcorticalnetwork underlying angry vocalizations. Cereb. Cortex 25,2752–2762

82. Barrett, J. et al. (2004) The role of the anterior cingulate cortexin pitch variation during sad affect. Eur. J. Neurosci. 19,458–464

83. Hinton, L. et al. (2006) Sound Symbolism, Cambridge UniversityPress

84. McGettigan, C. (2015) The social life of voices: studying theneural bases for the expression and perception of the self and

14 Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy

others during spoken communication. Front. Hum. Neurosci. 9,129

85. Banse, R. and Scherer, K.R. (1996) Acoustic profiles in vocalemotion expression. J. Pers. Soc. Psychol. 70, 614

86. Lingala, S.J. et al. (2015) High spatio-temporal resolution multi-slice real time MRI of speech using golden angle spiral imagingwith constrained reconstruction, parallel imaging, and a novelupper airway coil. In Proceedings of the 23rd Conference of theInternational Society for Magnetic Resonance in Medicine, p.789, ISMRM

87. Kriegeskorte, N. et al. (2008) Representational similarity analysis– connecting the branches of systems neuroscience. Front. Syst.Neurosci. 2, 4

88. Chiba, T. and Kajiyama, M. (1958) The Vowel: Its Nature andStructure, Phonetic Society of Japan

89. Fant, G. (1960) Acoustic Theory of Speech Production, Mouton

90. Titze, I.R. (1994) Principles of Vocal Production, Prentice Hall

91. Reby, D. and McComb, K. (2003) Anatomical constraints gen-erate honesty: acoustic cues to age and weight in the roars of reddeer stags. Anim. Behav. 65, 519–530

92. Pisanski, K. et al. (2014) Vocal indicators of body size in men andwomen: a meta-analysis. Anim. Behav. 95, 89–99

93. Pisanski, K. et al. (2016) Voice parameters predict sex-specificbody morphology in men and women. Anim. Behav. 112, 13–22

94. Lieberman, P.H. et al. (1969) Vocal tract limitations on the vowelrepertoires of rhesus monkey and other nonhuman primates.Science 164, 1185–1187

95. Boersma, P. and Weenink, D. (2016) Praat: Doing Phonetics byComputer. ( http://www.praat.org)

96. Fitch, W.T. and Reby, D. (2001) The descended larynx is notuniquely human. Proc. Biol. Sci. 268, 1669–1675

97. Jürgens, U. (2009) The neural control of vocalization inmammals:a review. J. Voice 23, 1–10

98. Jürgens, U. (2002) Neural pathways underlying vocal control.Neurosci. Biobehav. Rev. 26, 235–258

99. Wild, B. et al. (2003) Neural correlates of laughter and humour.Brain 126, 2121–2138

100. Penfield, W. and Rasmussen, T. (1950) The Cerebral Cortex ofMan: A Clinical Study of Localization of Function, Macmillan

101. Pulvermüller, F. et al. (2006) Motor cortex maps articulatoryfeatures of speech sounds. Proc. Natl. Acad. Sci. U.S.A. 103,7865–7870

102. Bouchard, K.E. et al. (2013) Functional organization of humansensorimotor cortex for speech articulation. Nature 495, 327–332

103. Brown, S. et al. (2008) A larynx area in the human motor cortex.Cereb. Cortex 18, 837–845

104. Loucks, T.M.J. et al. (2007) Human brain activation during pho-nation and exhalation: common volitional control for two upperairway functions. Neuroimage 36, 131–143

105. Levelt, W.J.M. et al. (1999) A theory of lexical access in speechproduction. Behav. Brain Sci. 22, 1–38

106. Golfinopoulos, E. et al. (2011) fMRI investigation of unexpectedsomatosensory feedback perturbation during speech. Neuro-image 55, 1324–1338

107. Tourville, J.A. et al. (2008) Neural mechanisms underlying audi-tory feedback control of speech. Neuroimage 39, 1429–1443

108. Scott, S.K. et al. (2014) The social life of laughter. Trends Cogn.Sci. 18, 618–620

109. Bryant, G.A. and Aktipis, C.A. (2014) The animal nature of spon-taneous human laughter. Evol. Hum. Behav. 35, 327–335

110. Lavan, N. et al. (2015) Laugh like you mean it: authenticitymodulates acoustic, physiological and perceptual properties oflaughter. J. Nonverbal Behav. Published online December 19,2015. http://dx.doi.org/10.1007/s10919-015-0222-8

111. Makagon, M.M. et al. (2008) An acoustic analysis of laughterproduced by congenitally deaf and normally hearing collegestudents. J. Acoust. Soc. Am. 124, 472–483

112. Davila Ross, M. et al. (2009) Reconstructing the evolution oflaughter in great apes and humans. Curr. Biol. 19, 1106–1111

TICS 1533 No. of Pages 15

113. Bachorowski, J-A. and Owren, M.J. (2001) Not all laughs arealike: voiced but not unvoiced laughter readily elicits positiveaffect. Psychol. Sci. 12, 252–257

114. Vettin, J. and Todt, D. (2005) Human laughter, social play, andplay vocalizations of non-human primates: an evolutionaryapproach. Behaviour 142, 217–240

115. Davila-Ross, M. et al. (2011) Aping expressions? Chimpanzeesproduce distinct laugh types when responding to laughter ofothers. Emotion 11, 1013–1020

116. Ruch, W. and Ekman, P. (2001) The expressive pattern of laugh-ter. In Emotion, Qualia, and Consciousness (Kaszniak, A.W., ed.),pp. 426–443, World Scientific

117. Coudé, G. et al. (2011) Neurons controlling voluntary vocalizationin the macaque ventral premotor cortex. PLoS ONE 6, e26822

118. Fukushima, M. et al. (2014) Modeling vocalization with ECoGcortical activity recorded during vocal production in the macaquemonkey. Proc. Int. Conf. Eng. Med. Biol. Soc. 2014, 6794–6797

119. Hage, S.R. and Nieder, A. (2013) Single neurons in monkeyprefrontal cortex encode volitional initiation of vocalizations.Nat. Commun. 4, 2409

120. Hage, S.R. and Nieder, A. (2015) Audio-vocal interaction in singleneurons of the monkey ventrolateral prefrontal cortex. J. Neuro-sci. 35, 7030–7040

121. Miller, C.T. et al. (2015) Responses of primate frontal cortexneurons during natural vocal communication. J. Neurophysiol.114, 1158–1171

122. Ploog, D.W. (1992) The evolution of vocal communication. InNonverbal Vocal Communication: Comparative and Develop-mental Approaches (Papou, H. et al., eds), pp. 6–30, Maisondes Sciences de l’Homme

123. Simonyan, K. (2014) The laryngeal motor cortex: its organizationand connectivity. Curr. Opin. Neurobiol. 28, 15–21

124. Sapir, E. (1929) A study in phonetic symbolism. J. Exp. Psychol.12, 225

Trends in Cognitive Sciences, Month Year, Vol. xx, No. yy 15