Differentiating Questions from Statements: The Role of Intonation
by
Mathieu R. Saindon
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Department of Psychology University of Toronto
© Copyright by Mathieu R. Saindon 2016
Differentiating Questions from Statements: The Role of Intonation
Mathieu R. Saindon
Doctor of Philosophy
Department of Psychology University of Toronto
2016
Abstract
The most salient cues to yes/no questions and statements are rising and falling terminal pitch
contours. Because young children readily differentiate rising from falling pitch contours, the
presumption is that they can discern speakers’ questioning or declarative intentions from
intonation alone. This thesis examined developmental changes in the use of intonation to identify
questions and statements. In the first of three studies, participants from 5 years of age judged
naturally produced utterances and low-pass filtered versions as questions or statements. Children
correctly identified utterance type above chance levels and achieved adult accuracy levels at 9
years of age, highlighting younger children’s confusion about the links between intonation
contours and pragmatic intentions. In the second study, participants from 8 years of age judged
utterances with graded manipulations of terminal contours as questions or statements. Children’s
question/statement judgments shifted more gradually than those of adults, reflecting greater
uncertainty, but the location of the category shift was comparable across age. Adults'
discrimination of utterance pairs was best for the pair that crossed the category shift, implying
categorical perception of the intonation contours. In the final study, participants from 7 years of
age judged the statement or question status of child- and adult-directed utterances in a gating task
with words added incrementally. After limited exposure to the speaker’s voice, adults and
children from 9 years of age correctly identified questions and statements from the initial word,
demonstrating for the first time that English-speaking adults and children perceive pre-terminal
cues to these utterance types. Identification was more accurate for child-directed than for adult-
directed utterances, indicating that the former speech register exaggerates the distinctions
between questions and statements. Collectively, the findings reveal a protracted course of
development for the questioning and declarative intentions signalled by intonation patterns and,
more generally, for the pragmatic functions of intonation.
Dedication
This thesis is dedicated to my late mother, Suzanne Hébert, who did not live to see me complete
my graduate studies. She always made it clear that she was proud of what I had accomplished
and the person I had become. Merci maman. Je t'aime beaucoup.
Acknowledgments
I would like to thank my supervisors, Drs. Sandra E. Trehub and E. Glenn Schellenberg,
for their guidance and support over the past seven years. I am especially grateful for their
patience—I had much to learn during the course of this degree, and their demanding workloads
and busy lives never prevented them from providing help whenever I needed it. I would also like
to thank Dr. Pascal van Lieshout for his encouragement and constructive feedback during this
process.
Many thanks also to the research and administrative assistants that have helped me over
the years, including Deanna Feltracco, Lisa Hotson, Brad Marks, Ansha Thirumeny, Lily Zhou,
Terilyn Harber, Sarah Lindsay, Sari Park, Snover Dhillon, and Sofia Rokerya. Thank you also to
my fellow researchers and graduate students—Judy Plantinga, Michael Weiss, Stephanie
Stalinski, and Patrick Hunter—for their help and guidance.
Lastly, I would like to thank my parents, Paul and Suzanne, and my wife, Lauren. My
parents' selfless devotion enabled me to pursue doctoral studies, and my wife's love, support, and
unending patience enabled me to survive the process.
Table of Contents
General Abstract ............................................................................................................................. ii
Dedication ...................................................................................................................................... iv
Acknowledgments ........................................................................................................................... v
Table of Contents ........................................................................................................................... vi
List of Tables ............................................................................................................................... viii
List of Figures ................................................................................................................................ ix
Introductory Comments .................................................................................................................. 1
Study 1: Children's Identification of Questions from Rising Terminal Pitch ................................. 6
Introduction ................................................................................................................................. 7
Experiment 1 ............................................................................................................................. 10
Method .................................................................................................................................. 10
Results and Discussion ......................................................................................................... 15
Experiment 2 ............................................................................................................................. 19
Method .................................................................................................................................. 19
Results and Discussion ......................................................................................................... 21
General Discussion ................................................................................................................... 24
Study 2: Children’s and Adults’ Categorization of Questions from Terminal Pitch Contours .... 28
Introduction ............................................................................................................................... 29
Method ...................................................................................................................................... 31
Results ....................................................................................................................................... 36
Discussion ................................................................................................................................. 41
Study 3: When is a Question a Question for Children and Adults? .............................................. 46
Introduction ............................................................................................................................... 47
Experiment 1 ............................................................................................................................. 50
Method .................................................................................................................................. 50
Results and Discussion ......................................................................................................... 52
Experiment 2 ............................................................................................................................. 56
Method .................................................................................................................................. 56
Results and Discussion ......................................................................................................... 56
General Discussion ................................................................................................................... 63
Concluding Statements ................................................................................................................. 68
References ..................................................................................................................................... 75
Copyright Acknowledgments ....................................................................................................... 94
List of Tables
Table 1. Sentences Used in Experiments 1 and 2 (Study 1) ......................................................... 12
Table 2. Sentences Used in the Identification and Discrimination Tasks (Study 2) ..................... 33
Table 3. Stimulus Sentences (Study 3) ......................................................................................... 51
Table 4. Mean Duration and Fundamental Frequency of Adult- and Child-Directed Utterances
(Study 3)........................................................................................................................................ 51
Table 5. Proportions of Participants Identifying Questions Above Chance After 1-5 Words
(Study 3)........................................................................................................................................ 59
Table 6. Mean Statement Responses and Age (Study 3) .............................................................. 60
List of Figures
Figure 1. F0 contour of a typical question and statement from Experiment 1 (Study 1) .............. 13
Figure 2. Black-and-white versions of asking (left) and telling photos (right) for the male talker
(Study 1)........................................................................................................................................ 14
Figure 3. Number of training and practice blocks in Experiment 1 (natural utterances) and
Experiment 2 (low-pass filtered utterances) as a function of age group (Study 1). Error bars are
standard errors ............................................................................................................................... 16
Figure 4. Performance in the test sessions of Experiment 1 (natural utterances) and Experiment 2
(low-pass filtered utterances) as a function of age group (Study 1). Error bars are standard errors
....................................................................................................................................................... 17
Figure 5. Sample F0 contours for the eight steps of one utterance with question stem (upper
panel) and statement stem (lower panel) (Study 2) ...................................................................... 35
Figure 6. Probability of question responses as a function of step and age group (Study 2) ......... 37
Figure 7. Range of indecision for children and adults (Study 2) .................................................. 40
Figure 8. Performance in Experiment 1 (adults) and Experiment 2 (children) as a function of
speech register and utterance increments (Study 3) ...................................................................... 54
Figure 9. Mean number of words required to identify questions and statements as a function of
age and utterance number (Study 3) ............................................................................................. 61
Figure 10. F0 contours and ToBI annotations from sample adult- and child-directed utterances
(Study 3)........................................................................................................................................ 62
Introductory Comments
The term melody refers to somewhat different aspects of speech and music. Musical melodies
are defined by relationships between consecutive tones in terms of their pitches and durations
(Dowling & Fujitani, 1971; Gfeller et al., 2002; Halpern, 1984; Kidd, Boltz, & Jones, 1984;
Kong, Cruz, Jones, & Zeng, 2004; Monahan & Carterette, 1985; Palmer & Krumhansl, 1987;
Sturges & Martin, 1974). The available evidence indicates that pitch patterns (i.e., changes in
fundamental frequency or F0) make greater contributions to the distinctiveness of melodies than
do rhythmic patterns (Hébert & Peretz, 1997; White, 1960). For example, adults readily
recognize melodies in the absence of rhythm or relative timing cues (Galvin, Fu, & Nogaki,
2007; Kang et al., 2009; Kong et al., 2004), but not in the absence of pitch cues (Hébert &
Peretz, 1997).
In speech, melody commonly refers to intonation patterns, or the rising and falling pitch
levels of phrases or utterances (e.g., Bolinger, 1986, 1989). Such melodies are defined primarily
by their contours, or directional changes in pitch (upward, downward, or the same). For musical
melodies, however, pitch relations are defined more precisely by intervals, or the exact pitch
distance between consecutive tones. In Western music, intervals are quantified in logarithmic
scales using semitones, each semitone corresponding to one twelfth of an octave.
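The logarithmic definition of the semitone can be made concrete with a short calculation. This sketch is purely illustrative and not part of the thesis materials; the function name `semitones` is invented here.

```python
import math

def semitones(f1_hz: float, f2_hz: float) -> float:
    """Interval between two frequencies, in semitones.

    An octave is a doubling of frequency, and a semitone is one
    twelfth of an octave, so the interval is 12 * log2(f2 / f1).
    """
    return 12.0 * math.log2(f2_hz / f1_hz)

print(semitones(220.0, 440.0))  # one octave up -> 12.0
print(semitones(440.0, 440.0 * 2 ** (1 / 12)))  # about one semitone
```

Because the scale is logarithmic, equal pitch steps correspond to equal frequency ratios rather than equal differences in hertz.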
Pitch is also used very differently in speech and music. In contrast to the discrete,
sustained pitches of musical melodies, pitch varies almost continuously in speech (Bolinger,
1986, 1989; Cruttenden, 1997), with voices gliding upward and downward to a greater or lesser
degree. Furthermore, musical melodies use specific pitch values from a prescribed pitch set (i.e.,
scales), as determined by cultural conventions, so that one pitch functions as the central or
reference tone for all other pitches (Krumhansl, 1990). There are no comparable constraints on
the component pitches of speech, although directional changes in pitch are associated with the
speaker’s intentions (Bolinger, 1986, 1989).
Changes from one sound to another also occur much more rapidly in speech than in
music. For example, English has roughly 12.5 perceptible sounds per second (Miller, 1982), in
contrast to 2–5 in music (Drake & Bertrand, 2001). The sustained pitches of music may account
for music's smaller range of acceptable pitch variation relative to speech. For example, semitone deviations
are perceived as errors in music (Drayna, Manichaikul, de Lange, Snieder, & Spector, 2001;
Warrier & Zatorre, 2002) but are rarely meaningful in speech, even in tonal languages (Francis,
Ciocca, Ma, & Fenn, 2008).
One example of the use of speech melody is to signal a speaker’s intention—to request
rather than transmit information by manipulating the terminal pitch contours that distinguish
statements from yes/no questions. Yes/no questions differ from wh-questions (i.e., those
beginning with who, what, where, when, and why) in that they seek confirmation (i.e., yes or no)
rather than information from the addressee. Yes/no questions usually invert the subject and verb
(e.g., Is John going?), but the change in structure is not obligatory (e.g., John is going?;
Gunlogson, 2002). Yes/no questions without such inversion are designated echoic or
declarative. In yes/no questions, including declarative versions, the pitch contour rises at the end
of the utterance, which contrasts with the falling pitch contour at the end of statements (Bolinger,
1986; Gårding & Abramson, 1965; Hadding-Koch & Studdert-Kennedy, 1964; Ladd, 1996;
Studdert-Kennedy & Hadding, 1973).
Although adults typically interpret utterances with rising and falling terminal pitch
contours as questions and statements, respectively (Cruttenden, 1981; Eady & Cooper, 1986;
Hadding-Koch & Studdert-Kennedy, 1964; Gårding & Abramson, 1965; Studdert-Kennedy &
Hadding, 1973), the situation is less clear for children. For example, 4-year-old children often
omit rising pitch contours when producing yes/no questions (Patel & Brayton, 2009; Patel &
Grigos, 2006; Snow, 1998, 2001). Moreover, there is limited research on children’s use of
intonation as a means of identifying the speaker’s questioning intentions. In the two studies that
examined children’s perception of questions and statements, they were considerably less
proficient than adults at differentiating questions and statements based on intonation alone
(Doherty, Fitzsimons, Asenbauer, & Staunton, 1999; Gerard & Clément, 1998). Their difficulty
interpreting utterances with rising pitch contours as yes/no questions is unlikely to stem from
limitations in the discrimination of pitch patterns, because the differentiation of rising from
falling pitch contours is evident in infancy (Frota, Butler, & Vigário, 2014; Karzon & Nicholas,
1989; Nazzi, Floccia, & Bertoncini, 1998). As children master the complexities of language use,
they eventually learn to categorize those contours and link the categories to different intentions.
Although young children readily acquire conventional labels for objects and attributes
(e.g., big, small), they often have difficulty with conventional spatial labels for pitch (e.g., high,
low) and pitch direction (rising/falling, up/down; Fancourt, Dick, & Stewart, 2013; Hair, 1977).
With relatively little training and feedback, however, even 5-year-olds succeed in labeling small
directional changes in pitch (Stalinski, Schellenberg, & Trehub, 2008). Little is known, however,
about whether they can use information about pitch contours in utterances to discern the
speaker’s requesting or informing intentions.
The overall goal of the present investigation was to shed light on developmental changes
in the identification of declarative questions on the basis of intonation or speech melody. The
first of three studies aimed to ascertain the earliest age at which children could reliably identify
naturally produced declarative questions and statements and the age at which they achieved
adult-like performance. The second study examined the magnitude of the terminal pitch contour
that was necessary for consistent categorization of yes/no questions and statements. Although
terminal contours provide the most salient acoustic cues to questions and statements (Bolinger,
1986; Ladd, 1996), there has been little descriptive detail about such contours other than their
pitch directional differences (upward for questions; downward for statements). Accordingly, this
study focused on the magnitude of the pitch excursion that is necessary to sustain reliable
categorization in children and adults.
Once it is established that children are able to identify naturally produced questions and
statements, it becomes possible to inquire, as in the third study, about developmental changes in
the ability to use pre-terminal pitch cues to identify declarative questions and statements. The
ability to use such cues has been documented for Dutch (van Heuven & Haan, 2000), Spanish
(Face, 2005), and French (Vion & Collas, 2006) adults, but not for English adults. One would
expect children to have more difficulty than adults with pre-terminal cues to questions, which are
considerably less salient than terminal pitch cues.
There are enormous cognitive changes in early and middle childhood, which are likely to
influence children’s ability to discern a speaker’s questioning intentions on the basis of
intonation alone. For example, age-related improvements in cognitive flexibility (Davidson,
Amso, Anderson, & Diamond, 2006; Deak, 2003; Ceci & Howe, 1978; Hermer-Vazquez,
Moffet, & Munkholm, 2001) entail adjusting behaviour appropriately to varied situations; parallel improvements occur in the ability to focus on relevant stimuli or specific aspects of stimuli (i.e., selective attention;
Doyle, 1973; Maccoby & Konrad, 1966; Pearson & Lane, 1991; Plude, Enns, & Brodeur, 1994).
Such improvements are bound to facilitate children’s use of intonation to interpret the intentions
of speakers. Although one might reasonably expect younger children to perform more poorly on
question-identification tasks because of their lesser cognitive capabilities and experience with
language, it is nevertheless important to specify the time course of developmental improvements,
ultimately relating them to concurrent changes in other skills, including other aspects of prosodic
perception such as the perception of emotion in speech. The goal here was to document age-
related differences in (1) the ability to distinguish statements from questions on the basis of
naturally occurring changes in pitch (Study 1), (2) the size of terminal prosodic cues that allow
for such discrimination (Study 2), and (3) the ability to distinguish statements from questions on
the basis of pre-terminal prosodic cues (Study 3).
Study 1: Children's Identification of Questions from Rising Terminal Pitch
Abstract
Young children are slow to master conventional intonation patterns in their yes/no questions,
which may stem from imperfect understanding of the links between terminal pitch contours and
pragmatic intentions. In Experiment 1, 5- to 10-year-old children and adults were required to
judge utterances as questions or statements on the basis of intonation alone. Children 8 years of
age or younger performed above chance levels but less accurately than adult listeners. To
ascertain whether the verbal content of utterances interfered with young children’s attention to
the relevant acoustic cues, low-pass filtered versions of the same utterances were presented to
children and adults in Experiment 2. Low-pass filtering reduced performance comparably for all
age groups, perhaps because such filtering reduced the salience of critical pitch cues. Young
children’s difficulty in differentiating declarative questions from statements is not attributable to
basic perceptual difficulties but rather to absent or unstable intonation categories.
In contrast to wh-questions, which are marked by words such as what, why, who, and
how, and typical yes/no questions, which are marked by subject/verb inversion, declarative or
echoic questions (e.g., It’s snowing?) are marked exclusively by prosodic cues. The principal cue
to declarative questions is a pronounced rise in terminal fundamental frequency (F0) in contrast
to falling F0 for statements (Cruttenden, 1981; Eady & Cooper, 1986; Gårding & Abramson,
1965; Studdert-Kennedy & Hadding, 1973). Secondary cues include increased intensity (Peng,
Lu, & Chatterjee, 2009) and final-syllable lengthening (Patel & Brayton, 2009; Patel & Grigos,
2006).
The ability to perceive the relevant acoustic distinctions is apparent in infancy. For
example, 5-month-olds differentiate the intonation contours of European Portuguese statements
from those of yes/no questions in the context of single two-syllable words (Frota et al., 2014).
English-learning infants 5–24 months of age exhibit greater attention to uninverted yes/no
questions (i.e., declarative questions) than to statements (Soderstrom, Ko, & Nevzorova, 2011),
perhaps because of the attention-getting properties of rising terminal pitch (Papoušek, Bornstein,
Nuzzo, Papoušek, & Symmes, 1990) and the frequent use of prosodic contours in infant-directed
speech (Snow, 1977). Evidence of discrimination and differential attention does not imply
categorical representations of such acoustic forms. Children must go beyond detecting the
differences between rising and falling pitch, a skill that reflects the salience of terminal pitch contours, to
categorizing these contours and associating them with questioning or declarative intentions. In
fact, young children’s productions suggest protracted acquisition of stable intonational categories
that map onto specific meanings (Patel & Grigos, 2006; Snow, 1994).
Although preverbal infants produce vocalizations with rising as well as falling pitch
contours (Whalen, Levitt, & Wang, 1991), young language users are inconsistent in their use of a
terminal F0 rise for questions (Snow, 1994, 1998), and their imitations of declarative, monotone,
and interrogative patterns are not clearly differentiated until five years of age (Loeb & Allen,
1993). Moreover, when declarative questions are elicited from 4-year-olds, the utterances are
often marked by final-syllable lengthening rather than F0 changes (Patel & Grigos, 2006).
Unfortunately, little is known about children’s spontaneous use of declarative questions or their
understanding of the contextual restrictions that guide their use (Gunlogson, 2002). Nevertheless,
the available production data imply that 5-year-old children have yet to acquire distinct
intonational categories for terminal rise and fall that can be mapped reliably onto question versus
statement functions, respectively.
The present study investigated 5- to 10-year-old children’s ability to interpret utterances
as questions or statements on the basis of intonation alone and compared children’s performance
with that of adults. The goal was to document developmental progression toward adult-like
efficacy in mapping a terminal rise onto a question and a terminal fall onto a statement when all
other variables are held constant. To this end, we used decontextualized declarative questions
(e.g., Bob is funny?) rather than standard yes/no questions (e.g., Is Bob funny?).
Although 5-year-old children differentiate declarative questions from statements in a
same–different task (Doherty et al., 1999), their long-term representations of the contrasting
pitch contours (i.e., the relevant intonational categories) may be insufficiently robust to support
stable mapping onto pragmatic and non-linguistic functions. For example, preschoolers more
readily remember a cartoon character’s favourite melody from its timbre (i.e., instrument) than
from its rising or falling pitch contour (Creel, 2014). Moreover, 5- and 6-year-olds readily
discriminate pitch directional changes (e.g., rising vs. falling), but they do not typically apply
labels such as higher, lower, up, and down to pitch direction (Andrews & Madeira, 1977; Costa-
Giomi & Descombes, 1996) unless they receive targeted training (Stalinski et al., 2008).
Weak categorical representations of pitch contours may lead children to accord less
attention to prosody than adults do, especially in the context of conflicting cues. For example,
when 4- to 9-year-olds are asked to judge a speaker’s feelings (happy or sad) from the sound of
her voice, ignoring what she says, they focus on lexical or semantic cues rather than prosodic
cues (Morton & Trehub, 2001). When situational cues are available, 5- and 7-year-olds judge a
speaker’s feelings (good or bad) from situational rather than prosodic cues (Aguert, Laval, Bigot,
& Bernicot, 2010). In the absence of conflicting or distracting cues, young children succeed in
distinguishing happy from sad expressiveness (Morton & Munakata, 2002; Morton & Trehub,
2001). Their success is facilitated by the availability of multiple acoustic cues to these emotion
categories (e.g., pitch level, pitch contours, speaking rate, amplitude) as well as familiar,
concrete response categories (happy, sad).
In the present study, the acoustic distinctions between statements and declarative
questions were less pronounced than those of happy- and sad-sounding utterances, and the
response categories, question and statement, were less familiar to young children, more abstract,
and less readily amenable to visual depiction. In Experiment 1, adults and children 5 to 10 years
of age were required to identify each of several utterances as questions or statements. To counter
younger children’s potential unfamiliarity with terms such as statements and questions, asking
and telling were used as response labels along with supporting photographs. Children as young
as five understand the meaning of ask and tell, although their responses are dominated, at times,
by contextual and interpersonal factors (Warden, 1981). In principle, the verbal content of
utterances, although irrelevant and non-conflicting, could prove distracting, as in previous
research (Aguert et al., 2010; Morton & Trehub, 2001), because of children’s prepotent bias for
message content (Waxer & Morton, 2011). Accordingly, Experiment 2 featured the same task
with the same utterance low-pass filtered utterances to obscure the content.
Experiment 1
Method
Participants. The final sample consisted of 122 participants, including 30 5- and 6-year-olds (14 girls, 16 boys; M = 6;0, range = 5;0–6;11), 31 7- and 8-year-olds (10 girls, 21 boys; M =
8;1, range = 7;0–8;11), 32 9- and 10-year-olds (17 girls, 15 boys; M = 10;1, range = 9;0–10;11),
and 29 adults (21 women, 8 men; M = 18.52 years, SD = 1.27). Children were recruited from the
community. Adults were college students who received partial course credit for their
participation. Children had normal hearing and overall development, according to parental
report. Inclusion criteria for adults were normal hearing and Canadian birth or arrival in Canada
by eight years of age. An additional nine participants were tested but excluded because of
technical errors (one 5-year-old), parent-reported developmental delay (one 5-year-old, one 7-
year-old, one 8-year-old), failure to meet the criterion during the training phase (one 6-year-old),
and scores that were more than two SDs below the mean for their age group (a common
predetermined criterion that affected two 7-year-olds and two 10-year-olds). The exclusion of
children with atypically low scores had no effect on the findings.
Apparatus and stimuli. Stimulus recording and testing took place in a sound-attenuating
booth (Industrial Acoustics Corporation, Bronx, NY) with loudspeakers (Electro-Medical
Instrument Co., Mississauga, ON) mounted in two corners of the sound booth at 45° azimuth to
the participant. Interactive software created with Affect4 (Spruyt, Clarysse, Vansteenwegen,
Baeyens, & Hermans, 2010) for a Windows 7 computer (outside the booth) presented
instructions and stimuli and recorded participants’ responses. Participants entered their responses
on a 17-in touch-screen monitor (Elo LCD TouchSystems, Berwyn, PA) that faced them.
Two men and two women recorded declarative question and statement versions of each
of ten sentences (see Table 1) using a microphone (Sony T) connected to the computer. They
generated natural-sounding utterances while minimizing distinctive prosodic cues until the final
syllable, which featured a rising F0 glide for questions and a falling glide for statements. The
stimulus set consisted of eighty utterances (10 sentences x 4 speakers x 2 versions). High-quality
digital sound files (44.1 kHz, 16-bit, mono) created with a digital audio editor (Sound Forge Pro
version 10.0; Sony, Tokyo, Japan) were amplitude normalized and cleaned for superfluous noise
with the Sound Forge Noise Reduction plug-in. Stimuli were presented at approximately 65 dB
SPL. The F0 contours of a typical question and statement from the stimulus set are illustrated in
Figure 1, which shows that both contours are relatively flat until the terminal pitch rise or fall.
Audio samples are provided in supplementary materials. Digital photographs of a man and
woman smiling and posing neutrally served as telling pictures; photographs with a quizzical
facial expression and pose served as asking pictures (see Figure 2).
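The amplitude normalization mentioned above can be sketched in a few lines. Peak normalization to 0.9 of full scale is an illustrative assumption; the excerpt does not specify which normalization method Sound Forge applied.

```python
import numpy as np

def normalize_peak(samples: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    """Scale a waveform so its absolute peak equals target_peak.

    Peak normalization is one common way to equate stimulus levels;
    the target of 0.9 full scale here is an illustrative assumption,
    not a value taken from the thesis.
    """
    peak = float(np.max(np.abs(samples)))
    return samples * (target_peak / peak)

utterance = np.array([0.1, -0.45, 0.3])  # toy waveform, peak at 0.45
scaled = normalize_peak(utterance)       # peak becomes 0.9
```

Equating waveform peaks in this way keeps presentation levels comparable across talkers and sentences before playback at a fixed loudspeaker level.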
Procedure. Participants were tested individually. The experimenter remained in the
booth only for children’s testing. Children sat facing the touch-screen, and the experimenter was
seated to one side, controlling trial presentations with a keypad. The task was described as a
game in which children would hear a man or lady asking or telling them something. They were
instructed to touch the asking picture if the person was asking about something and the telling
picture if the person was telling them something. Adults controlled the presentation of trials and
indicated whether each utterance was a question or statement by selecting the appropriate
picture.
Pilot testing indicated that many of the 5- and 6-year-olds had difficulty with the task.
Accordingly, all children were required to meet a training criterion—indicating their
understanding of the task—before proceeding to the practice and test phases. First, children
confirmed that they could correctly identify the asking and telling pictures (e.g., “Which one is
the asking picture?”). The subsequent training phase consisted of a maximum of four blocks of
four trials, with feedback on all trials. Each block consisted of declarative statements and
standard yes/no questions (e.g., “Is the coat in the closet?”). Children who made errors on the last
block of training trials were excluded from the final sample.
Table 1
Sentences Used in Experiments 1 and 2 (Study 1)
The cat ran away
She lost her shoes
Mom made it
It’s snowing
You found it
Mom went to the store
It’s bedtime
He’s watching TV
He’s in the car
You’re staying home
Figure 1. F0 contour of a typical question and statement from Experiment 1 (Study 1).
Figure 2. Black-and-white versions of the asking (left) and telling (right) photos for the male talker (Study 1).
As soon as children achieved a perfect score on any of the training blocks, they
proceeded directly to the practice phase, which featured statements and declarative questions that
differed from those in the test phase. There were a maximum of four blocks of four practice
trials, which were designed to clarify that utterances other than standard yes/no questions could
still be asking something. If children obtained a perfect score in any practice block, they
proceeded directly to the test phase. Thus, there were one to four blocks of training trials and one
to four blocks of practice trials, depending on the individual child’s understanding and
performance. Although children received different numbers of practice trials depending on their
understanding and performance, no children were excluded from the experiment on the basis of
their performance on these trials.
Adults completed a four-trial familiarization phase with statements and declarative
questions (not those used in testing) before the onset of the test phase. The actual test phase
comprised eighty trials, one for each stimulus utterance. Children were told at the start of the test
phase that four pieces of a large smiley face could be exchanged for a prize. One piece of the
smiley face appeared after each set of twenty trials regardless of children’s performance. The
order of trials in the test phase was randomized with the constraint that the same sentence content
did not appear on successive trials. Adults and children received feedback on all trials (training,
practice, and test), consisting of a 1 s presentation of one of ten cartoon characters for correct
answers, and a 1 s blank screen for incorrect answers. The provision of continuous feedback
ensured that age-related differences in performance were not attributable to differential memory
of the task requirements.
Results and Discussion
For each child, we summed the number of training and practice blocks as an index of
initial difficulty with the task. As can be seen in Figure 3, younger children required more blocks
than did older children, indicating greater initial difficulty in differentiating questions from
statements. Data from the test phase were analyzed by converting responses to d' (d-prime)
scores using proportions of hits and false alarms. Whereas d' scores provide an index of listeners’
sensitivity to question intonation (i.e., the issue of interest in the present investigation), percent
correct scores reflect bias as well as sensitivity. For the present purposes, question responses to
question stimuli constituted hits, and question responses to statement stimuli constituted false
alarms. A score of d' = 1 corresponds to 69% correct.
The stimulus set consisted of forty questions and forty statements, which meant that the
maximum number of hits or false alarms was forty. Because d' scores cannot be computed when
the proportion of hits is 1 or the proportion of false alarms is 0 (i.e., statistically infinite scores),
proportions of hits and false alarms were calculated by adding 0.5 to the number of hits and also
to the number of false alarms and dividing those numbers by 41 (total possible hits or false
alarms + 1), as in previous developmental studies (e.g., Thorpe, Trehub, Morrongiello, & Bull,
1988). These proportions were converted to z-scores and then to d' scores (d' =z[hit rate] –
z[false-alarm rate]). The maximum d' score was 4.5. Raw data (percent correct) and standard
errors are illustrated in Figure 4 separately for each age group.
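The correction and conversion described above can be expressed compactly. The following sketch (Python, standard library only; the function name is illustrative) implements the add-0.5/divide-by-41 correction and the z-score difference:

```python
from statistics import NormalDist

def d_prime(hits, false_alarms, n_signal=40, n_noise=40):
    # Correction used in the text (after Thorpe et al., 1988): add 0.5
    # to each count and divide by n + 1 so that proportions of 0 or 1
    # never yield infinite z-scores.
    hit_rate = (hits + 0.5) / (n_signal + 1)
    fa_rate = (false_alarms + 0.5) / (n_noise + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Perfect performance (40 hits, 0 false alarms) gives the stated
# maximum of about 4.5; equal hits and false alarms give chance (d' = 0).
print(round(d_prime(40, 0), 1))   # -> 4.5
print(d_prime(20, 20))            # -> 0.0
```

The correction slightly shrinks extreme proportions toward 0.5, which is why the maximum attainable score is 4.5 rather than infinity.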
Figure 3. Number of training and practice blocks in Experiment 1 (natural utterances) and
Experiment 2 (low-pass filtered utterances) as a function of age group (Study 1). Error bars are
standard errors.
One-sample t-tests conducted separately for each age group revealed that performance
was significantly better than chance (i.e., chance or d' = 0 results from an equal number of hits
and false alarms) for each age group (ps < .001). A one-way analysis of variance (ANOVA),
with age (5–6, 7–8, 9–10, and adults) as the between-subjects variable and d' as the dependent
variable, revealed a significant effect of age (F(3,118) = 35.78, p < .001, η2 = .48). Follow-up
pairwise comparisons (Tukey HSD) revealed that the performance of 5- to 6-year-olds was
significantly poorer than all other age groups (ps < .001), and that 7- to 8-year-olds performed
more poorly than adults (p = .001). The performance of 9- to 10-year-olds did not differ
significantly from adults or from 7- to 8-year-olds (ps > .1).
Figure 4. Performance in the test sessions of Experiment 1 (natural utterances) and Experiment 2
(low-pass filtered utterances) as a function of age group (Study 1). Error bars are standard errors.
We also examined the percentage of individuals in each group whose correct responses
(i.e., correct identification of questions and statements) exceeded chance levels. According to the
normal approximation to the binomial test (one-tailed, correcting for continuity), 48 or more
correct responses out of 80 is significantly better than chance. Only 63% (19 of 30) of 5- and 6-
year-olds obtained a score of 48 or more, in contrast to 100% in each of the three older groups.
Fisher’s exact tests confirmed that the proportion of individuals performing at chance levels was
significantly higher in the youngest group of children compared to any other group (ps < .001).
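The 48-of-80 criterion follows from the continuity-corrected, one-tailed normal approximation assumed in the text (chance p = .5). A quick check, with an illustrative function name:

```python
from math import sqrt
from statistics import NormalDist

def p_above_chance(correct, n=80, p=0.5):
    # One-tailed normal approximation to the binomial test, with
    # continuity correction (count reduced by 0.5).
    mean, sd = n * p, sqrt(n * p * (1 - p))
    z = (correct - 0.5 - mean) / sd
    return 1 - NormalDist().cdf(z)

print(p_above_chance(48) < 0.05)  # True: 48/80 is significantly above chance
print(p_above_chance(47) < 0.05)  # False: 47/80 is not
```

Thus 48 correct responses is the smallest count that reaches significance at the .05 level, consistent with the criterion used throughout.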
We then asked whether participants had a response set, specifically a bias to respond
telling (or statement) more or less often than asking (or question). For each participant, we
calculated the total number of telling responses. Comparisons with 50% (40 of 80) revealed that
the youngest children had no response bias (p > .5), but each of the three older groups tended to
choose the telling response more than half of the time (ps < .03), presumably because of the
syntax of the stimulus sentences. Nevertheless, the mean number of telling responses was under
42 in each instance.
The variance in the number of telling responses was considerably higher for the youngest
children (SD = 15.79) compared to the other three groups (all SDs < 4.10), which motivated us to
examine the number of participants who exhibited a bias to respond either telling or asking.
Using the criterion described above (i.e., 48 or more telling responses, or 48 or more asking
responses), we identified 11 participants who exhibited such a bias: ten 5- to 6-year-olds and one 7- to 8-year-old. Fisher's exact tests confirmed that this bias was greater for the youngest group of
children than for any of the other three groups (ps < .001).
Finally, we examined whether children who required more training and practice blocks
performed more poorly in the test phase than those who required fewer blocks. Positive
skewness of the training/practice variable prompted the use of a Spearman’s correlation. Test
scores were negatively correlated with number of training and practice blocks (rs(n = 92) =
-0.48, p < .001).
In sum, the results indicated that children and adults identified questions based on
prosodic information alone. Nevertheless, 5- to 6-year-old children performed more poorly than
older children and adults. In fact, 37% of children in the youngest age group performed at chance
levels, but no participant in any other group did so, and one-third of the youngest children tended
to respond consistently with telling or asking regardless of the stimuli. The performance of 7- to
8-year-old children was also less accurate than that of adults. Finally, children who required
more training and practice trials had poorer outcomes on test trials.
In light of young children’s propensity to focus on irrelevant verbal content or situational
context when judging a speaker’s feelings (Aguert et al., 2010; Friend, 2000; Morton & Trehub,
2001), the irrelevant verbal content of declarative questions may have distracted them from the
critical prosodic cues. Just as young children achieve greater success in identifying a speaker’s
feelings when the verbal content is obscured (Morton & Trehub, 2001), they may achieve greater
success in identifying declarative questions when the verbal content is unintelligible. This
possibility was examined in Experiment 2.
Experiment 2
The goal of this experiment was to examine whether young children’s identification of
declarative questions would approach the performance of older children when the verbal content
of utterances was obscured by low-pass filtering.
Method
Participants. The final sample consisted of 126 participants: 33 5- and 6-year-olds (18
girls, 15 boys; M = 6;1, range = 5;0–6;9), 34 7- and 8-year-olds (15 girls, 19 boys; M = 7;11,
range = 7;0–8;10), 31 9- and 10-year-olds (14 girls, 17 boys; M = 10;1, range = 9;2–10;9), and
28 adults (20 women, 8 men; M = 18.43 years, SD = 1.26). Recruiting and inclusion criteria were
the same as in Experiment 1. An additional 19 participants were tested but excluded because of
parent-reported developmental delays (one 10-year-old), failure to meet the criterion during the
training phase (seven 5-year-olds and one 6-year-old), failure to pay attention to the task (three
5-year-olds and four 6-year-olds), and scores that were more than two SDs below the mean for
their age group (one 8-year-old, one 9-year-old, and one 10-year-old). Inclusion of these outliers
distorted the group means but did not alter the outcome of the analyses.
Apparatus and stimuli. The apparatus was the same as in Experiment 1. The stimuli
were created by low-pass filtering the sentences from Experiment 1 at 400 Hz (following Friend,
2000; Knoll, Uther & Costall, 2009) using Praat version 5.3.68 (Boersma, 2002) and normalizing
the amplitude of filtered utterances, which were presented at 65 dB SPL. Because low-pass
filtered speech sounds unnatural, children were told that they would hear robots speaking and
that they had to indicate whether the robots were asking or telling them something. The pictures
of the man and woman were replaced with male and female robot cartoon characters, which were
drawn as standing in either a neutral (arms down) pose or a questioning (arms raised, palms
upward) pose. The same pictures were used with adults, who were told that the task was
designed for children.
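For readers without Praat, the filtering manipulation can be approximated in code. The sketch below uses a windowed-sinc FIR filter rather than Praat's own procedure, so it is an approximation of the manipulation, not a reproduction of it:

```python
import numpy as np

def lowpass(signal, sr, cutoff=400.0, numtaps=1001):
    # Windowed-sinc FIR low-pass: preserves the F0 contour (typically
    # below 400 Hz) while removing the spectral detail that makes the
    # words intelligible. Not Praat's exact filtering algorithm.
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = 2 * cutoff / sr * np.sinc(2 * cutoff / sr * n)
    h *= np.hamming(numtaps)        # taper to control stopband leakage
    h /= h.sum()                    # unity gain at 0 Hz
    return np.convolve(signal, h, mode="same")

# A 100 Hz component (F0-like) survives; a 2 kHz component
# (speech detail) is strongly attenuated.
sr = 44100
t = np.arange(sr) / sr
mixed = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 2000 * t)
filtered = lowpass(mixed, sr)
```

Amplitude normalization of the filtered output, as in the study, would be a separate step (e.g., rescaling to a common RMS level).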
Procedure. The procedure was similar to that used in Experiment 1. The experimenter
remained in the booth for the testing of children but not adults. Children were told they were
going to play a game in which they would hear a girl or boy robot asking or telling them
something. They were told that they would not understand what the robots were saying but they
had to listen carefully to decide whether they were asking or telling them something. They were
shown the picture of the girl and boy robot and told that they should touch the asking picture if
the robot was asking something and the telling picture if the robot was telling them something.
Adults also indicated whether each utterance was a question or statement by selecting the
appropriate picture. They were also told that the utterances had been modified to make the words
incomprehensible and that they needed to listen carefully to judge whether the utterances were
questions or statements.
As in Experiment 1, children were required to complete training and practice phases prior
to the test phase. The four blocks of the training phase were identical to Experiment 1. The four
blocks of the practice phase were also identical to Experiment 1 except that the utterances were
low-pass filtered. As soon as children obtained a perfect score in one of the four training blocks,
they proceeded directly to the practice phase. Children who failed to achieve a perfect score on
the fourth block of training trials with unfiltered speech and yes/no questions were excluded
from the final sample. The practice phase comprised four blocks of four trials, but if children
obtained a perfect score in a block, the practice phase was terminated and they began the test
phase. Children and adults received feedback after each response in the training, practice, and
test phases, as in Experiment 1. As before, children were told about gathering pieces of the
smiley face for subsequent prizes. Children and adults had the option of hearing each utterance
for a second time.
Results and Discussion
As shown in Figure 3, children required more training and practice blocks in the present
experiment (M = 3.40, SD = 1.49) than in Experiment 1 (M = 2.91, SD = 1.33), t(187.55) = 2.37,
p = .019 (unequal variances test). The data from the test phase were converted to d' scores, as in
Experiment 1. Raw data (percent correct) are illustrated in Figure 4. One-sample t-tests for each
age group revealed that performance was significantly better than chance in all cases (ps < .001).
An ANOVA with age (5–6, 7–8, 9–10, and adults) as the between-subjects variable and d' as the
dependent variable revealed a significant effect of age (F(3,122) = 33.13, p < .001, η2 = .45). The
performance of 5- to 6-year-olds was significantly worse than that of all other age groups (ps < .001), and the performance of 7- to 8-year-olds was worse than that of 9- to 10-year-olds (p =
.038) and adults (p < .001). The performance of 9- to 10-year-olds and adults did not differ (p >
.2).
We then examined the percentage of individuals in each group whose correct responses
exceeded chance levels (48 or more correct responses). Because performance was not as
consistently good as it was in Experiment 1, we used a 4 (age [5–6, 7–8, 9–10, adults]) by 2
(above chance, chance) chi-square test of independence to examine age-related differences in the
percentage of individuals who successfully distinguished questions from statements (all cells had
expected frequencies > 5). Successful performance varied across age groups (χ2(3, N = 126) =
54.70, p < .001, Φ = .66). All adults performed above chance levels, as did all but one of the 9- to
10-year-olds, and all but three of the 7- to 8-year-olds. For the 5- to 6-year-olds, far fewer
children—just slightly more than a third (36%, 12 of 33)—exceeded chance levels of
performance.
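The reported statistic can be recovered from the cell counts implied by the text (12/21, 31/3, 30/1, and 28/0 above-chance/at-chance splits for the four age groups). A minimal sketch of the Pearson chi-square computation:

```python
def chi_square(table):
    # Pearson chi-square test of independence for an r x c table of
    # observed counts (no continuity correction, as is standard for
    # tables larger than 2 x 2).
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum((obs - r * c / n) ** 2 / (r * c / n)
               for row, r in zip(table, row_totals)
               for obs, c in zip(row, col_totals))
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Rows: 5-6, 7-8, 9-10, adults; columns: above chance, at chance.
table = [[12, 21], [31, 3], [30, 1], [28, 0]]
chi2, dof = chi_square(table)   # chi2 is approximately 54.7, dof = 3
```

The smallest expected frequency here (28 × 25/126, about 5.6) exceeds 5, consistent with the requirement noted in the text.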
Performance on the natural utterances from Experiment 1 and the low-pass filtered
utterances from the present experiment was compared by means of a two-way ANOVA with
stimulus type (natural, filtered) and age group (5–6, 7–8, 9–10, adults) as between-subjects
variables and d' as the dependent variable. There were main effects of age (F(3,240) = 67.92, p <
.001, η2 = .43) and stimulus type (F(1, 240) = 28.86, p < .001, η2 = .06). Overall, performance
improved with age, and participants could more easily identify natural rather than low-pass
filtered questions and statements. The interaction between age and stimulus type was not
significant (F < 1). In other words, the decrement in performance for filtered compared to natural
stimuli was similar across age groups.
If the unusual acoustic quality resulting from low-pass filtering contributed to the
unexpected reduction in performance, then performance may have improved from the first to the
second half of the test session, after listeners adapted to the spectral degradation. Because
individual participants had different numbers of questions and statements in the first and second
half due to randomization of order, we analyzed the number of correct responses rather than d'
scores. A mixed-design ANOVA, with age group (5–6, 7–8, 9–10, adults) as the between-
subjects variable and test phase (first or second half) as the within-subjects variable revealed a
main effect of age (F(3,122) = 37.14, p < .001, η2 = .48), reflecting age-related improvement, and
a small but significant effect of test phase (F(1,122) = 8.40, p = .004, η2 = .06). Performance was
better during the second half of the test session (M = 33.63, SD = 7.49) than in the first half (M =
32.62, SD = 7.73), but the improvement represented only one additional correct answer on forty
trials. There was no interaction between age group and test phase (F < 1), which indicated that
all age groups had comparable adaptation to the low-pass filtered stimuli. In the second half of
trials, overall performance with filtered stimuli (84% correct) remained substantially below
overall performance with natural stimuli in Experiment 1 (91%), t(237.04) = 3.19, p = .002
(unequal variances test).
Examination of a possible response set (i.e., responding telling more than 50% of the
time) revealed no such bias among the three groups of children (ps > .1), but a small bias for
adults (M = 40.93; p = .011). Categorization of participants into those with or without a
systematic bias to respond either telling or asking revealed 21 participants with such a bias: fourteen 5- to 6-year-olds, six 7- to 8-year-olds, and one 9- to 10-year-old. The proportion of participants with
this bias varied reliably across age groups (χ2(3, N = 126) = 25.42, p < .001, Φ = .45).
As in Experiment 1, we examined the relation between number of training and practice
blocks and subsequent test scores for the child participants. A Spearman’s correlational analysis
revealed a significant negative correlation between test scores and number of training and
practice blocks (rs(n = 98) = –0.50, p < .001).
In sum, children differentiated questions from statements in low-pass filtered utterances
at better than chance levels, with the two oldest age groups (9–10, adults) performing
significantly better than the two youngest (5–6, 7–8). Contrary to expectations, all age groups
performed worse on the filtered utterances than on the original versions, which indicates that
young children’s difficulty in differentiating question from statement intonation cannot be
attributed to interference from the verbal content. As in Experiment 1, a substantial portion of
children in the youngest group tended to respond telling or asking in general, and children who
required more training and practice trials tended to perform poorly in the actual test session.
General Discussion
The present study examined the identification of declarative questions and statements by
English-speaking children (5 to 10 years of age) and adults. Age-related differences in
performance were similar for natural utterances (Experiment 1) and low-pass filtered utterances
(Experiment 2), but performance in general was poorer for the filtered utterances. Although 5-
and 6-year-olds performed above chance levels, they performed more poorly than all other age
groups despite having numerous training and practice trials and continuous feedback about
response accuracy throughout the test session. The 7- and 8-year-olds performed no differently
than 9- and 10-year-olds on natural utterances but they performed more poorly on filtered
utterances. Finally, the 9- and 10-year-olds performed no differently than adults.
The findings from the 5- and 6-year-olds are consistent with limited knowledge of the
relevant intonation categories and, consequently, poor mapping between intonational contours
and communicative functions. These results are in line with preschool children’s challenges in
linking cartoon characters with rising or falling melodic sequences (Creel, 2014) and 6-year-
olds’ difficulty with conventional verbal labels for rising and falling pitch (Costa-Giomi &
Descombe, 1996).
It is likely that young children’s difficulties were exacerbated by limitations in attention
allocation and working memory (Cowan, Morey, AuBuchon, Zwilling, & Gilchrist, 2010). To
succeed on the present task, children had to focus on the terminal pitch contour, determine if it
was rising, and designate it as a question if it was or as a statement otherwise. Young children’s
habitual focus on lexical cues at the expense of prosodic cues (Morton & Trehub, 2001;
Snedeker & Trueswell, 2004) and limited cognitive flexibility (Munakata, Snyder, & Chatham,
2012), even in the face of continuous feedback, may have interfered with consistent allocation of
attention to the relevant cues in Experiment 1, which featured natural utterances.
Although all participants were required to meet the same training criterion before
proceeding to the test phase, they did not achieve comparable understanding, as reflected in
persistent performance differences in the test phase. In fact, the number of trials required to meet
the training criterion—one index of initial comprehension—was predictive of subsequent
comprehension, as indexed by performance in the test phase, which also featured feedback on
every trial. Fully 37% of the 5- to 6-year-olds failed to identify utterance type, as reflected in
their chance-level performance.
Low-pass filtering that made the verbal content unintelligible was expected to facilitate
young children’s attention to the relevant acoustic cues, in line with previous studies of
emotional prosody (Morton & Trehub, 2001). Instead, such filtering had the opposite effect,
reducing performance comparably for all age groups. Low-pass filtering decreases the salience
of the component pitches that are relevant for differentiating statements from questions
(Cruttenden, 1981; Eady & Cooper, 1986; Gårding & Abramson, 1965; Peng et al., 2009;
Studdert-Kennedy & Hadding, 1973). Young children’s difficulty with the filtered utterances
confirmed that their problems with the unfiltered utterances in Experiment 1 did not stem from
lexical biases. Although 63% of the 5- and 6-year-old children performed above chance levels on
the natural utterances, only 36% performed above chance on the filtered utterances. Presumably,
their problems with categorizing pitch contours were exacerbated by decreased salience of the
relevant cues.
Low-pass filtering preserves the pitch directional differences of the original utterances,
but the resulting speech is perceived as being reduced in pitch range (i.e., compressed pitch
contours) and pitch variability relative to unfiltered speech (Scherer, Koivumaki & Rosenthal,
1972; van Bezooijen & Boves, 1986). It is likely, then, that reduced pitch salience in the filtered
utterances contributed to the uniform reduction in performance across age.
The unusual sound quality of the filtered speech could have impaired performance even
though listeners adapt to distorted speech after limited exposure (Hervais-Adelman, Davis,
Johnsrude, & Carlyon, 2008; Hervais-Adelman, Davis, Johnsrude, Taylor, & Carlyon, 2011).
Indeed, performance on the filtered utterances improved modestly (i.e., one additional item
correct) during the second half of the test session although it remained below the levels achieved
with unfiltered stimuli. Presumably, adaptation to the filtered speech began during the training
phase and continued during the test phase. Further exposure, either during the training phase or
in a subsequent session, could result in performance levels approaching those attained with
unfiltered utterances.
If young children have difficulty discerning the communicative intent of declarative
questions, one would expect caregivers to avoid such questions. There is evidence, however, that
parents make regular use of declarative questions in their conversations with very young children
(Estigarribia, 2010). One might further expect the acoustic cues to the question/statement
distinction to be exaggerated in child-directed speech, as are other cues to the speaker’s
intentions (Foulkes, Docherty, & Watt, 2005; Jacobson, Boersma, Fields, & Olson, 1983).
Moreover, declarative questions in parent–child discourse are likely to be restricted to face-to-
face contexts that provide supplementary visual cues such as raised eyebrows and head
movements (Ekman, 1976, 1979; Srinivasan & Massaro, 2003). Most importantly, such
questions would not occur in isolation, as in the present study. Instead, parents would adhere to
the contextual constraints on declarative questions (Gunlogson, 2002), which would highlight
their communicative intentions. For example, a child’s request for a cookie just before dinner
might receive an incredulous reply such as “You want a cookie?” Despite the limitations of 5-
and 6-year-olds, as observed in the present study, it is likely that they would comprehend the gist
of the message.
In conclusion, children between 5 and 6 years of age have considerably greater difficulty
than older children and adults in differentiating isolated declarative questions from statements on
the basis of intonation alone. We contend that this difficulty stems primarily from absent or
unstable intonational categories and secondarily from immature working memory (Cowan et al.,
2010), which precluded successful mapping of terminal pitch contours onto communicative
functions. An important challenge for future research is to ascertain the factors that support
children’s transition from continuous representations of pitch contours to categorical
representations of rising and falling contours.
Study 2: Children’s and Adults’ Categorization of Questions from Terminal Pitch
Contours
Abstract
The present study had two goals: (1) to compare children’s and adults’ identification of
declarative questions and statements on the basis of terminal cues alone, and (2) to ascertain
whether adults perceived questions and statements categorically. Children (8–11 years, n = 41)
and adults (n = 21) judged utterances as statements or questions from sentences with natural
statement and question endings and with manipulated endings that featured intermediate pitch
values. Adults were also tested on their discrimination of the utterances. Children’s judgments
shifted more gradually across categories than those of adults, but their category boundaries were
comparable. Adults’ discrimination and identification performance was consistent with
categorical perception of statements and questions. Although questions and statements are
typically characterized by rising and falling terminal pitch contours, listeners in the present study
identified utterances as statements even when the terminal pitch rose by as much as four
semitones. The terminal contours of all utterances included falling and rising portions, but the rising portion was longer for questions and the falling portion was longer for statements. Parametric manipulations of pitch and duration are necessary to specify
their joint contribution to the identification of statements and questions.
Prosody encompasses the rhythm, stress, and intonation (pitch variation) in utterances. It
conveys information about a speaker’s emotional state (Banse & Scherer, 1996; Scherer, 1986),
sincere or ironic intentions (Kreuz & Glucksberg, 1989; Pexman & Glenwright, 2007), focus of interest (TOM bought a car vs. Tom bought a CAR), and utterance form (e.g., statement vs.
question). The present study was concerned with utterance form, specifically with children and
adult listeners’ ability to differentiate questions from statements solely on the basis of the
terminal pitch contour.
In English, the principal cue to yes/no questions, as demonstrated by perception and
production studies, is considered to be rising terminal intonation, in contrast to falling terminal
intonation for statements (Cruttenden, 1981; Gårding & Abramson, 1965; Eady & Cooper,
1986; Studdert-Kennedy & Hadding, 1973). Young children’s questions often lack the
characteristic terminal pitch rise (Patel & Grigos, 2006; Snow, 1994, 1998, 2001), which makes it
difficult to identify their questions on the basis of prosodic cues alone (Patel & Brayton, 2009).
In addition to the primary cues that distinguish declarative questions from statements, there are
secondary cues that vary across languages (Face, 2005; Falé & Faria, 2006; Heeren, Bibyk,
Gunlogson, & Tanenhaus, 2015; van Heuven & Haan, 2000; Vion & Colas, 2006), which
demonstrates that multiple cues can be used to identify questions in naturally produced speech.
Nevertheless, forcing listeners to rely on terminal cues alone can provide information
about the specific terminal cues that differentiate questions from statements. It can also reveal
whether listeners’ judgments shift gradually with increasing difference or abruptly after a certain
criterion or threshold value. In other words, are the intonation patterns underlying questions and
statements perceived categorically or continuously?
In general, categorical perception is established by demonstrating that the discrimination
of acoustic stimuli along various acoustic continua (e.g., /ba/ to /pa/) is considerably more
difficult when the comparison stimuli are from the same category (e.g., different examples of
/ba/) than from different categories (e.g., /ba/ to /pa/) even though the magnitude of acoustic
differences (e.g., voice-onset time) may be identical for within- and between-category
comparisons (e.g., Liberman, Harris, Hoffman, & Griffith, 1957). When the discrimination peak
and labeling threshold converge, perception is considered to be categorical.
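As a toy illustration (a hypothetical logistic labeling function, not data from any study cited here), discrimination predicted from labeling probabilities alone peaks at the pair that straddles the category boundary:

```python
import math

def p_question(step, boundary=4.0, slope=2.0):
    # Hypothetical logistic labeling function over an 8-step
    # terminal-pitch continuum (step 0 = clear fall, step 8 = clear rise).
    return 1 / (1 + math.exp(-slope * (step - boundary)))

# Predicted discriminability of adjacent pairs from labels alone:
# the difference in label probability is largest for the pair that
# straddles the boundary, producing the discrimination peak that,
# together with the labeling threshold, defines categorical perception.
diff = [abs(p_question(s + 1) - p_question(s)) for s in range(8)]
peak = diff.index(max(diff))   # pair adjacent to the boundary at step 4
```

Equal physical steps thus yield unequal perceptual differences, which is the within- versus between-category asymmetry described above.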
Speakers of tonal languages perceive level and rising tones categorically (Francis,
Ciocca, & Ng, 2003; Hallé, Chang, & Best, 2004; Wang, 1976; Xu, Gandour, & Francis, 2006).
In these languages, F0 height and contour signal lexical distinctions. In Mandarin, for example,
the syllable /ma/ means hemp when produced with a rising tone but mother when produced with
a level tone. To date, however, evidence about categorical perception of intonation in non-tonal
languages is mixed. Stress or emphasis in English is labeled categorically but not perceived
categorically (Ladd & Morton, 1997). In Swedish, emotional prosody is thought to be perceived
categorically (Laukka, 2005). For English (Hutchins, Gosselin, & Peretz, 2010) and German
(Meister, Landwehr, Pyschny, Walger, & Wedel, 2009), there are suggestions of a sharp
statement–question boundary, which is also consistent with categorical perception.
The principal goal of the present study was to compare children’s and adults’
identification of declarative questions and statements on the basis of terminal pitch cues alone.
Natural-sounding utterances were achieved by using naturally produced questions and statements
and electronically altering the terminal contours. As noted, listeners can identify statements and
questions from pre-terminal cues, but it is unclear whether such cues are used when more salient
terminal pitch cues are available. The latter question was addressed by using pre-terminal cues
from naturally produced questions for half of the stimuli and from naturally produced statements
for the other half. In essence, the manipulated terminal pitch contour was appended to question
or statement stems. If listeners profit from pre-terminal cues when terminal cues are present, they
should identify utterances more accurately when sentence stems and endings are congruent (e.g.,
question stems with question endings) rather than incongruent (e.g., question stems with
statement endings).
Because young school-age children have lesser sensitivity than adults to the cues that
distinguish declarative questions from statements (Study 1), one might expect them to require
larger terminal pitch contrasts to distinguish questions from statements. In addition, children
might exhibit a more gradual shift from statement to question labeling and a different location of
the boundary between statement and question categories.
A secondary goal was to ascertain whether adults’ perception of statements and questions
is categorical. To this end, adults participated in a discrimination task as well as an identification
task. Children completed the identification task only because it was not feasible for them to
complete both tasks in a single laboratory visit.
Method
Participants
Participants were 21 college students (14 women, 7 men; M = 21.29 years, SD = 4.05)
and 41 children between 8 and 11 years of age. A median split was used to divide the children
into younger and older groups, resulting in younger children from 8;0 to 9;5 years (n = 20, 11
boys, 9 girls, M = 8;9 years, SD = 6 months). The older children ranged from 9;9 to 10;11 years
(n = 21, 9 boys, 12 girls, M = 10;5 years, SD = 4 months). These age groupings were motivated
by developmental changes in children’s understanding of intonational cues to statements and
questions (Study 1).
Adults were born in Canada or living in Canada before 8 years of age and had normal
hearing, according to self-report. Children were born in Canada and had normal hearing and
development, according to parental report. An additional eight children were tested but excluded
because of failure to identify natural (i.e., unaltered) questions above chance levels (16 or fewer
correct out of 24, n = 7) or parent-reported developmental delay (n = 1). Adults received partial
course credit for participation; children received a small gift.
Apparatus and Stimuli
The stimuli consisted of recorded question and statement versions of each of four
utterances (see Table 2) produced by a 27-year-old woman who was a native speaker of
Canadian English. For all utterances, she attempted to maintain relatively constant loudness and
speaking rate and relatively level pitch until the terminal fundamental frequency (F0) contour. A
digital audio editor was used to divide each utterance into a stem (entire utterance except for the
final two syllables) and ending (final two syllables). The amplitude of each sound clip was
normalized to 75 dB SPL, and syllable durations were equalized across statement and question
versions of each utterance. Sample stimuli are available in supplementary materials.
An examination of natural utterance endings revealed that, for questions and statements
alike, the final syllable featured initially falling pitch followed by rising
pitch. The pitch range of the entire set of question utterances was 168.6–397.5 Hz (M = 225.4
Hz) and the corresponding range of the entire set of statements was 162.5–316.7 Hz (M = 215.9
Hz). For natural statement endings, pitch fell initially by a substantial amount (M = 6.5
semitones), then rose much less (M = 1.5 semitones), with the falling portion being of longer
duration (M = 305 ms) than the rising portion (M = 165 ms). For natural question endings, pitch
fell very modestly (M = 0.5 semitones), then rose substantially (M = 12.5 semitones), with the
falling portion being considerably shorter in duration (M = 105 ms) than the rising portion (M =
352 ms). In other words, statements and questions contrasted in the location, extent, and duration
of the rising and falling portions, even though statement endings were primarily falling and
question endings were primarily rising.
Table 2
Sentences Used in the Identification and Discrimination Tasks (Study 2)
The cat ran away
She lost her shoes
He’s in the car
He’s watching TV
Manipulations of the final syllable alone resulted in unnatural-sounding utterances.
Examination of the penultimate syllable of the natural utterances revealed pitch falls that
were more pronounced for questions (M = 4.1 semitones) than for statements (M = 2.3
semitones). Thus, we manipulated the contour of the two final syllables to ensure that the
utterances sounded natural (as in Hutchins et al., 2010).
Sentence endings were analyzed by means of Praat version 5.3.68 (Boersma, 2002) to
reveal the F0 contour with a series of time points separated by intervals of 0.01 s. Points not
deemed crucial to the contour were deleted. New points were then created so that the statement
and question versions had F0 points at the same temporal locations. For example, after the
deletion of superfluous pitch points, if the statement ending had an F0 point at 0.84 s of the
sample but the question ending for the same sentence did not, a new F0 point was created at 0.84
s in the question ending. Subsequently, we calculated the difference in cents (1 semitone = 100
cents) between the matching points of the question and statement versions of each sentence. This
difference was divided into eight steps, and the resulting values were used to create modified
intonation contours. The manipulations based on these values created intonation contours that
shifted gradually from statement to question contours.
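The cents calculation and eight-step interpolation described above can be sketched as follows; this is a minimal illustration with hypothetical F0 values, not the actual Praat procedure used in the study.

```python
import math

def cents_between(f_start_hz, f_end_hz):
    # Pitch distance in cents (1 semitone = 100 cents) between two F0 values.
    return 1200 * math.log2(f_end_hz / f_start_hz)

def step_contour(statement_hz, question_hz, step, n_steps=8):
    # Interpolate matched F0 points between the statement contour (step 1)
    # and the question contour (step n_steps), dividing the pitch distance
    # at each point into equal steps on the cents scale.
    contour = []
    for fs, fq in zip(statement_hz, question_hz):
        fraction = (step - 1) / (n_steps - 1)
        contour.append(fs * 2 ** (cents_between(fs, fq) * fraction / 1200))
    return contour

# Hypothetical matched F0 points (Hz) for one sentence ending
statement = [220.0, 180.0, 200.0]
question = [215.0, 210.0, 400.0]
print(step_contour(statement, question, 1))  # statement endpoint
print(step_contour(statement, question, 8))  # question endpoint
```

Interpolating on the cents scale rather than in raw hertz keeps the perceptual spacing of the steps uniform.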
Stimuli with the modified intonation contours were generated using the Pitch Synchronous
Overlap Add (PSOLA) method in Praat. Because large F0 manipulations can make utterances
sound unnatural, the natural statement ending was used to create half of the stimuli (steps 1
through 4) and the natural question ending was used to create the other half (steps 5 through 8).
Because amplitude and pacing were identical across the two versions, judgments were based
solely on the intonation contours. Finally, to control for differences in prosodic cues that precede
the penultimate syllable, two versions of each step were created, one using the question stem (the
portion of the sentence prior to the penultimate syllable) and one using the statement stem. The
resulting stimulus set consisted of 64 utterances (4 sentences x 8 steps x 2 stems). An example of
the manipulations is shown in Figure 5.
The sentences were recorded in a quiet room with a tube microphone (Apex 460)
connected directly to a MacBook Pro (OS 10.7). Testing was conducted in a double-walled,
sound-attenuating chamber (Industrial Acoustics Company, Bronx, NY). A computer
workstation and amplifier (Harman-Kardon 3380, Stamford, CT) outside of the booth interfaced
with a 17-in. (43.2 cm) touch-screen monitor (Elo TouchSystems, Berwyn, PA) and two
wall-mounted loudspeakers (Electro-Medical Instrument Co., Mississauga, ON) inside the booth.
The touch-screen monitor was used to present instructions and record participants’ responses.
The loudspeakers were mounted at the corners of the sound booth, each located at a distance of
0.76 m and 45° azimuth from the participant, with the touch-screen monitor placed at the
midpoint. An interactive computer program created with Affect4 software (Spruyt et al., 2010)
presented the sentences and recorded responses via the touch-screen. All stimuli were played at a
comfortable listening level of approximately 65 dB SPL.
Figure 5. Sample F0 contours for the eight steps of one utterance with question stem (upper
panel) and statement stem (lower panel) (Study 2).
Procedure
Adults completed the identification and discrimination tasks in a single session, and
children completed the identification task only. The identification task began with a short
familiarization phase during which listeners heard a sample question and statement (not from the
stimulus set used in testing). In the subsequent test phase, each of the 64 stimulus sentences was
presented three times in random order with the constraint that the same sentence content did not
occur on successive trials. Adults and children were required to indicate whether each sentence
was a question or statement, such that they made 192 (3 x 64) responses in total.
In the discrimination task, adults heard two sentences that were identical or that differed
by a single step and judged whether they were the same or different. The sentences were
arranged in three types of pairs: same (1–1, 2–2, and so on), lower-higher step (1–2, 2–3, 3–4, 4–
5, 5–6, 6–7, and 7–8), and higher-lower step (2–1 to 8–7), resulting in a total of 22 pairs per
sentence (8 same, 7 lower-higher, 7 higher-lower) and a grand total of 88 pairs. Each pair was
presented twice, for a total of 176 trials. The discrimination task also began with a short
familiarization phase in which listeners heard an example of a same and different pair (not from
the test set).
The identification tasks for adults and children were nearly identical, but the children’s
task was modified to make it more entertaining. Specifically, the children’s task was presented as
a game in which the goal was to collect pieces of a puzzle (i.e., corresponding to different parts
of an image that appeared on screen) that mirrored their progress toward task completion. A
piece of the puzzle appeared after each set of 12 trials regardless of the children’s responses, and
16 pieces completed the puzzle. In addition, a short bell-like sound was presented after each
response.
Results
Identification Task
For each participant, we calculated the proportion of times stimuli were labeled as
questions separately for each step and each stem, such that each participant had 16 scores.
Standard repeated-measures analysis of variance (ANOVA) has the assumption of sphericity,
which was violated for the main effect of step, p < .001, and the interaction between stem and
step, p < .001. Thus, the first analysis was a mixed-design multivariate analysis of variance
(MANOVA), which has no such assumption. The MANOVA had two repeated measures (two
stems, eight steps) and one between-subjects variable (age: younger children, older children,
adults). This analysis revealed no main effect of stem, p = .704, no two-way interaction between
stem and age group, p = .178, or between stem and step, p = .082, and no three-way interaction,
p = .730 (p-value for Wilks’ lambda), which meant that sentence stem had no discernible effect
on response patterns. Consequently, stem was not considered further, and responses for stimuli
with question and statement stems were combined in subsequent analyses. Descriptive statistics
are illustrated in Figure 6.
Figure 6. Probability of question responses as a function of step and age group (Study 2).
The MANOVA revealed a robust main effect of step, F(7, 53) = 1398.97, p < .001, η2 =
.995, and an interaction between step and age group, F(14, 106) = 2.48, p = .004, η2 = .247. As
shown in Figure 6, compared to both groups of children, adults were more consistent at labeling
stimuli at steps 6 to 8 as questions.
For each participant, we derived a new value that quantified how well defined their
boundary was between statements and questions. Specifically, we used logistic regression to fit a
logit curve to the participant’s responses (0: statement or 1: question) as a function of step
number. The resulting slopes varied such that a steep slope indicated a clear or sharp boundary,
whereas a shallower slope indicated that listeners were less consistent in the step at which they
switched from statement to question responses across trials. A one-way ANOVA revealed
significant differences in slope across age groups, F(2, 59) = 7.92, p = .001, η2 = .212. Pairwise
comparisons (Tukey HSD) indicated that adults’ average slope (M = 2.17, SD = 1.01) was
significantly steeper than the slopes of older children (M = 1.41, SD = 0.64), p = .006, and
younger children (M = 1.29, SD = 0.59), p = .002, but the two groups of children did not differ, p
= .875. Thus, no developmental differences were evident in children between 8 and 11 years of
age.
We then compared the location of the question/statement boundary in adults and children
by calculating the point at which participants responded question for 50% or more of the trials
(i.e., when the logit function of question responses crossed 0.5 on the y-axis). An ANOVA
revealed no age-related differences in the location of the midpoint, F < 1. For adults, older
children, and younger children, the average midpoints were 4.93 (SD = 0.54), 5.10 (SD = 0.69),
and 5.19 (SD = 0.68), respectively. In sum, listeners in all age groups changed from statement to
question responses at approximately step 5, but adults did so more consistently.
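The per-participant logit fit used to quantify boundary sharpness (slope) and location (the 50% crossover) can be sketched as follows. This assumes SciPy and, for brevity, fits response proportions rather than the trial-level binary responses used in the analysis; the proportions shown are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(step, slope, midpoint):
    # Probability of a 'question' response as a function of step number.
    return 1.0 / (1.0 + np.exp(-slope * (step - midpoint)))

def fit_boundary(steps, p_question):
    # Fit slope (boundary sharpness) and midpoint (the step at which the
    # fitted curve crosses 0.5, i.e., the statement/question boundary).
    (slope, midpoint), _ = curve_fit(logistic, steps, p_question,
                                     p0=[1.0, 4.5], maxfev=10000)
    return slope, midpoint

# Hypothetical proportions of 'question' responses at steps 1-8
steps = np.arange(1, 9, dtype=float)
p_question = np.array([0.0, 0.02, 0.05, 0.25, 0.70, 0.95, 1.0, 1.0])
slope, midpoint = fit_boundary(steps, p_question)
```

A steep slope indicates a sharp category boundary; the midpoint locates that boundary on the eight-step continuum.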
Acoustic analyses were conducted to examine the terminal intonation contour of steps 4
and 5 (i.e., the steps at which judgments shifted from statement to question). In step 4, the
average pitch fall—from the pitch level at onset to the pitch level at the lowest portion of the
fall—was 2.9 semitones, occurring over the course of 193 ms, and the average terminal pitch
rise—from the lowest portion of the pitch declination to the highest pitch value at the final rising
portion—was 4.9 semitones, occurring over 285 ms. In step 5, the average pitch fall and rise
were 1.8 semitones and 6.2 semitones, respectively, with corresponding durations of 176 and 289
ms.
We also calculated a range of indecision for each participant, which represented the
number of adjacent steps that were inconsistently labeled as statement or question. Because our
sample included children, consistency was defined as a proportion of question responses .2 or
less or .8 or more and inconsistency as a proportion of question responses greater than .2 but less
than .8. The data are illustrated in Figure 7. Because the older and younger children did not differ
in any of the previously reported analyses, the two groups of children were combined for further
analyses. As shown in Figure 7, the range of indecision varied as a function of age (children vs.
adults), which was confirmed with a 2 x 4 chi-square test of independence with group (adults,
children) and range of steps (0, 1, 2, 3 or more) treated as categorical variables, χ2(3, N = 62) =
8.07, p = .045, φ = .361. The mean range of indecision was 1.47 steps (SD = .81) for adults and
2.00 (SD = .84) for children.
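The range-of-indecision measure reduces to counting the steps whose proportion of question responses falls strictly between the two cutoffs; a minimal sketch with hypothetical data:

```python
def indecision_range(p_question, low=0.2, high=0.8):
    # Count steps labeled inconsistently: proportion of 'question'
    # responses strictly between the low and high cutoffs.
    return sum(1 for p in p_question if low < p < high)

# Hypothetical per-step proportions for one participant (steps 1-8)
print(indecision_range([0.0, 0.0, 0.1, 0.4, 0.6, 0.9, 1.0, 1.0]))  # -> 2
```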
Discrimination Task
For the discrimination task conducted with adults, we calculated performance for each
comparison (i.e., steps 1 and 2, steps 2 and 3, and so on) by subtracting the number of false
alarms, or identifying a same pair as different, from the number of hits, or correctly identifying a
different pair as different. The probability of identifying each pair was calculated by dividing the
resulting score (hits minus false alarms) by 16, which was the maximum number of hits for each
pair.
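The scoring rule for each comparison (hits minus false alarms, divided by the maximum of 16 hits per pair) can be sketched as follows, with hypothetical counts:

```python
def discrimination_score(hits, false_alarms, max_hits=16):
    # Corrected probability of detecting a 'different' pair: hits on
    # different trials minus false alarms on same trials, scaled by the
    # maximum possible number of hits for the pair.
    return (hits - false_alarms) / max_hits

# Hypothetical counts for one step pair
print(discrimination_score(hits=11, false_alarms=3))  # -> 0.5
```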
Figure 7. Range of indecision for children and adults (Study 2).
The results revealed a peak at steps 4 and 5, indicating the highest accuracy for that
comparison. Moreover, adults’ identification data revealed that the probability of labeling a
sentence as a question crossed the 0.5 threshold between steps 4 and 5. Taken together, these
findings are consistent with categorical perception of statements and questions. To test this
possibility, two paired-sample t-tests were conducted. The first test
compared the mean probability of indicating that pairs 4 and 5 were different to the mean
probability of indicating that the adjacent pairs were different (i.e., steps 3–4 and 5–6), and the
second test compared the mean probability of discriminating steps 4–5 to the mean probability of
discriminating all other pairs. The results indicated that the probability of discriminating steps 4
and 5 (M = .52, SD = .20) was higher than the probability of discriminating steps 3–4 and 5–6 (M
= .36, SD = .22), t(20) = 6.02, p < .001, as well as the probability of discriminating the other pairs
(M = .41, SD = .22), t(20) = 3.58, p = .002. Thus, adults seemed to exhibit categorical perception
of terminal pitch, even though they found it difficult to discriminate between the terminal
contours. The average discrimination score (hits minus false alarms) across all sentence pairs
was 42.2%, and the average difference between the final pitch values of the intonation contours
was 1.7 semitones.
Discussion
The principal goal of the present study was to compare children’s and adults’
identification of declarative questions and statements on the basis of terminal pitch contours. A
secondary goal was to ascertain whether adults perceive terminal contours categorically.
Because children are less sensitive than adults to the acoustic cues that distinguish questions
from statements (Study 1), we expected children’s and adults’ perceptual boundary between
statements and questions to differ in sharpness and location. Specifically, we expected children
to transition more gradually than adults from one response category to another, regardless of the
boundary location.
In fact, children shifted more gradually than adults from one category to another,
reflecting their greater uncertainty about the distinctions between statements and questions, but
the location of the category boundary was stable across age, presumably reflecting broad
similarity in adults’ and children’s criteria for statements and questions. Children’s uncertainty
about questions and statements in the region of the category boundary is consistent with previous
evidence of children’s lesser accuracy than adults in identifying questions from natural
utterances (Study 1).
On the basis of listeners’ ability to identify statements and questions from cues appearing
early in an utterance (Face, 2005; van Heuven & Haan, 2000; Vion & Colas, 2006), the absence
of performance advantages for congruent sentence stems and endings (e.g., question stem,
question ending) over incongruent stems and endings (e.g., question stem, statement ending) is
surprising. It is likely, however, that pre-terminal cues, which are much less salient than terminal
pitch cues, are useful primarily in the absence of terminal pitch cues (Majewski & Blasdell,
1968), as in gating procedures (Face, 2005; van Heuven & Haan, 2000; Vion & Colas, 2006).
Adults’ performance on the identification and discrimination tasks is consistent with
categorical perception of statements and questions in that the boundary for statement/question
identification and the discrimination peak occurred between the fourth and fifth steps. Although
the discrimination peak was relatively small, it differed significantly from the discrimination of
adjacent pairs (i.e., steps 3 vs. 4 and 5 vs. 6) as well as other pairs. These findings mirror the
categorical perception of level and rising tones by tone-language speakers, for whom those tones
have linguistic significance, in contrast to the continuous perception shown by listeners who do
not speak or understand a tone language (Francis et al., 2003; Hallé et al., 2004; Wang, 1976; Xu
et al., 2006).
Acoustic analyses revealed that children as well as adults were more likely to label
utterances as statements when the terminal pitch fall ranged from 6.3 to 2.9 semitones, and they
were more likely to label them as questions when the terminal pitch rise ranged from 6.2 to 12.4
semitones. It was not surprising that listeners identified utterances as questions when the
terminal pitch rise was six semitones. In two studies examining the perception of terminal pitch,
adult listeners labeled a two-syllable word (popcorn) as a question approximately 80% of the
time when the F0 rose by six or more semitones over the course of the word (Chatterjee & Peng,
2008; Peng et al., 2009).
An unexpected finding in the present study was that children and adults identified
utterances as statements when the terminal pitch rose by as much as four semitones. In previous
perceptual studies, terminal pitch in natural utterances was reported as falling 4 to 7 semitones
for statements and rising 10 to 12 semitones for questions (Patel, Wong, Foxton, Lochy, &
Peretz, 2008; Patel, Foxton, & Griffiths, 2005). In other studies that manipulated terminal
contour, listeners labeled utterances with level contours as statements and shifted their judgment
from statement to question when the contour rose by as little as two semitones (Chatterjee &
Peng, 2008; Peng et al., 2009). Hutchins et al. (2010) found, however, that listeners label some
utterances with rising contours as statements. Specifically, four utterances labeled as statements
exhibited a rise on the final portion of the last syllable, and the pitch offset of the last syllable
was higher than its onset for two of these utterances.
Rising intonation contours are generally associated with yes/no questions, but they are
also found in other types of utterances, such as statements with patronizing or self-justifying
intentions (Cruttenden, 1985). It is possible, then, that our utterances with a small rise were not
perceived as traditional statements or questions. However, the two-option, forced-choice task did
not allow listeners to indicate degree of certainty about their judgment, and it did not allow them
to indicate that the utterance was neither a question nor a statement. When listeners are required
to label utterances as questions or statements, their default option is the statement response (e.g.,
Vion & Colas, 2006; see also Study 1). Thus, it is possible that listeners labeled some utterances
with rising terminal pitch as statements because of uncertainty.
Acoustic analyses revealed systematic differences in the duration of the falling and rising
portions of the terminal contour. Specifically, the most salient portion of the terminal contour—
the fall in statements and the rise in questions—was extended. In utterances labeled as
statements, the duration of the terminal fall ranged from 193 to 305 ms, and the terminal rise
ranged from 165 to 285 ms. In utterances labeled as questions, the terminal fall ranged from 105
to 176 ms, and the terminal rise ranged from 289 to 352 ms. The relative duration of these
portions of the terminal contour is likely to influence the perception of statements and questions,
although we cannot quantify the contribution of duration to listeners’ judgments
because we did not manipulate duration independently of intonation contour.
Another unexpected finding was adults’ modest performance on the sentence
discrimination task. Although the terminal pitch of comparison stimuli differed by 1.7 semitones,
performance (hits minus false alarms) averaged 42.2% across the seven pairs of stimuli. In the
context of isolated tones, adults discriminate pitch differences as small as 0.1 semitones
(Micheyl, Delhommeau, Perrot, & Oxenham, 2006; Olsho, Sakai, Turpin, & Sperduto, 1982;
Spiegel & Watson, 1984). In the context of melodies, differences of a semitone are readily
apparent in familiar melodies (Drayna et al., 2001; Warrier & Zatorre, 2002) and often in
unfamiliar melodies as well.
Although some scholars claim that adults’ ability to discriminate pitch in vocal and non-
vocal stimuli is comparable (Moore, Estis, Gordon-Hickey, & Watts, 2008), there is also
evidence to the contrary. For example, Chinese and English speakers discriminate pitch
differences of 1.4 semitones more readily in the context of pure tones than in speech stimuli,
with discrimination scores improving by 12 to 18% for non-speech stimuli (Xu et al., 2006).
Similarly, listeners require a lesser pitch difference to identify a rising contour in a tone sequence
than in a speech sequence (Hutchins et al., 2010). There is additional evidence that irrelevant
auditory information can impair selective attention, with adverse effects on cognitive
performance, including memory (Banbury, Macken, Tremblay, & Jones, 2001; Neath, 2000).
Moreover, children have difficulty identifying emotions conveyed in speech on the basis of
prosody alone, especially when utterances have conflicting semantic content (Morton & Trehub,
2001). Thus, it is possible that acoustic information associated with message content interfered
with listeners’ processing of the pitch information.
In short, when the terminal contours of utterances were altered electronically in
successive steps with natural statements and questions as endpoints, the boundary between
statement and question judgments was similar for children and adults, but the shift between
response categories was more gradual for children. Although declarative questions and
statements are typically described as ending with rising and falling pitch contours, respectively,
all natural and manipulated endings in the present study began with a falling portion and ended
with a rising portion. The relative duration of the falling and rising portions contributed to their
relative salience and to their categorization as questions or statements. The present study
systematically manipulated the final pitch excursions of utterances, but systematic manipulations
of the duration of falling and rising portions are necessary to specify the joint contribution of
terminal pitch changes and their durations to the identity of statements and questions. Greater
understanding of these cues to the identification of sentence types could have implications for
interventions with special populations such as cochlear implant users (Marx et al., 2015), who
experience difficulty with various aspects of prosody (Meister et al., 2009; Nakata, Trehub, &
Kanda, 2012), and people diagnosed with autism spectrum disorder (Filipe, Frota, Castro, &
Vicente, 2014) and Parkinson's disease (Ma, Whitehill, & So, 2010).
Study 3: When is a Question a Question for Children and Adults?
Abstract
Terminal changes in fundamental frequency provide the most salient acoustic cues to declarative
questions, but adults sometimes identify such questions from pre-terminal cues. In the present
study, adults and 7- to 10-year-old children judged a single speaker’s adult- and child-directed
utterances as questions or statements in a gating task with word-length increments. Listeners of
all ages successfully used pre-terminal cues to identify utterance type, often from the initial word alone,
and they were more accurate for child-directed than adult-directed utterances. There were age-
related differences in identification accuracy and number of words required for correct
identification. Age differences were already apparent on the initial (first five) utterances,
confirming adults’ superior explicit knowledge of intonation patterns that signify questions and
statements. Adults’ performance improved over the course of the test session, reflecting talker-specific learning, but children exhibited no such learning.
The most salient acoustic differences between declarative questions and statements are
the terminal fundamental frequency (F0) contours, which rise in yes/no questions and fall in
statements (Bolinger, 1986; Ladd, 1996). Artificially increasing the pitch excursion of utterance-
final syllables increases adults’ likelihood of identifying those utterances as questions (Gårding
& Abramson, 1965). Even sine-wave patterns with rising pitch contours are identified as
questions and those with falling pitch contours as statements (Studdert-Kennedy & Harding,
1973).
Production studies confirm the distinctiveness of terminal pitch contours, but they also
reveal differences in pre-terminal positions. In English, for example, successive content words
often exhibit rising pitch in questions but not statements, and stressed syllables are typically
followed by a pitch decrease in statements but not questions (O’Shaughnessy, 1979). In the
present study, we sought to document age-related changes in the ability to differentiate
declarative questions from statements on the basis of pre-terminal cues, and to ascertain whether
speech register—child- or adult-directed—affects such differentiation.
Gating tasks, which were developed to assess the speed of word identification (Grosjean,
1980), provide a means of determining the minimum information necessary for identifying
statements and declarative questions. Listeners judge whether utterances differing only in
prosodic cues are questions or statements from incremental increases in syllables or words.
Findings from these tasks reveal a different time course of identification across languages. For
example, Dutch listeners identify declarative questions after hearing the second accented syllable
in an utterance (van Heuven & Haan, 2000). Portuguese listeners identify statements after the
first stressed vowel, but they need the final stressed vowel to identify questions (Falé & Faria,
2006). Spanish listeners identify questions after hearing the first content word (Face, 2005), but
French listeners do so after hearing the falling F0 contour that precedes the final rise (Vion &
Colas, 2006). Surprisingly, there are no comparable studies with English-speaking children or
adults.
Only one study used a gating task to examine children’s identification of statements and
declarative questions (Gérard & Clément, 1998). French 5-, 7-, and 9-year-olds judged utterance
type from increasing numbers of words (i.e., one word, two words, three, etc.). Even after
hearing complete utterances, 5- and 7-year-olds failed to differentiate questions from statements,
but 9-year-olds were able to do so. Although children’s performance was well below adult levels,
adult gating studies suggest that there are fewer pre-terminal cues to question identity in French
(Vion & Colas, 2006) than in Dutch (van Heuven & Haan, 2000) or Spanish (Face, 2005).
Moreover, English-speaking children as young as 5 (Study 1) and Spanish- and Mandarin-
speaking 4-year-olds (Armstrong, 2012; Zhou, Crain, & Zan, 2012) can differentiate questions
from statements on the basis of prosody alone, although they are much less accurate than adults.
Note, however, that Mandarin questions are cued by duration and intensity contrasts rather than
pitch contrasts (Zhou et al., 2012).
There are also notable cross-language differences in young children’s perception and
production of the relevant acoustic distinctions. For example, 21-month-old Spanish-speaking
toddlers mark their declarative questions with rising intonation contours (Armstrong, 2012), but
English-speaking preschoolers are unlikely to do so (Loeb & Allen, 1993; Patel & Grigos, 2006;
Snow, 1994, 1998). Children’s difficulties, when evident, do not stem from perceptual
limitations. For example, infants differentiate rising from falling pitch contours (Frota et al.,
2014; Karzon & Nicholas, 1989; Nazzi et al., 1998; Soderstrom et al., 2011), and young children
can learn to identify small directional differences (up, down) in pitch (Stalinski et al., 2008).
Instead, their difficulties seem to arise from interpretive issues with sentence-level prosody,
which are resolved progressively with increasing age and cognitive maturity (Cutler & Swinney,
1987; Tyler & Marslen-Wilson, 1978; Wells, Peppé, & Goulandris, 2004). It is also notable that
young children perform much better in conversational contexts that provide contextual support
(e.g., Doherty-Sneddon & Kent, 1996) than in laboratory contexts in which isolated utterances
are presented or elicited (e.g., Patel & Grigos, 2006).
In line with our goal of ascertaining the minimum information required to differentiate
question from statement intentions, we asked English-speaking adults (Experiment 1) and 7- to
10-year-old children (Experiment 2) to identify stimuli as questions or statements from the initial
word of five-word utterances and from incremental additions of words. In contrast to previous
gating studies, which used conventional or adult-directed speech, we used child- and adult-
directed speech. Child-directed speech (or ‘clear’ adult-directed speech) features wider F0 range,
slower speaking rate, increased pitch and intensity accents, and longer pauses than conventional
adult-directed speech (Foulkes et al., 2005; Jacobson et al., 1983; Smiljanić & Bradlow, 2009).
These features enhance the intelligibility of messages for children (Fernald & Mazzie, 1991; Liu,
Tsao, & Kuhl, 2009; Syrett & Kawahara, 2014), older adults (Caporael, 1981; Masataka, 2002),
and non-native speakers (Uther, Knoll, & Burnham, 2007). Because suprasegmental aspects of
child-directed speech typically exaggerate cues to the speaker’s intentions, this speaking style
should highlight the distinctions between statements and questions. Accordingly, we expected
listeners of all ages to differentiate questions from statements earlier (i.e., from fewer words)
with child-directed than with adult-directed utterances. We also predicted age-related changes, in
line with French-speaking children’s performance on a gating task (Gérard & Clément, 1998)
and English-speaking children’s performance on full utterances presented in isolation (Study 1).
The performance of adults was examined in Experiment 1 and that of children in Experiment 2.
Our use of a single talker provided an opportunity to examine attunement to her
intonation patterns over the course of the test session. Adults often exhibit adaptation to
individual phonetic signatures (Theodore, Myers, & Lomibao, 2015) and conversational style
(Pogue, Kurumada, & Tanenhaus, 2016). Children’s ability to perceive accented or atypical
speech implies similar attunement, but younger children experience greater difficulty in this
regard (Creel, Rojo, & Paullada, 2016). In the present study, adaptation to the speaker’s
questioning and stating style would be reflected in progressive improvement in the use of pre-
terminal cues.
Experiment 1
Method
Participants. Participants were 45 college students (36 women, 9 men; mean age = 18.82
years, SD = 2.85), who received partial course credit for their participation. Inclusion criteria
were Canadian birth or arrival in Canada before 8 years of age and absence of hearing difficulty,
according to self-report. Three additional adults were tested but excluded from the final data set
because of inattentiveness, as reflected in their failure to identify questions and statements above
chance levels (14 or more incorrect out of 40) after hearing the complete sentence.
Apparatus and stimuli. A woman with experience in vocal performance and early-
childhood education produced four versions of 10 5-word utterances (see Table 3). The sentence-
length utterances were selected on the basis of their prosodic form rather than their syntax or
content. Specifically, each sentence began with a stressed syllable, and words in the same
position across sentences had the same number of syllables (first word: two syllables; second
word: one syllable; third word: two syllables; fourth word: one syllable; fifth word: two
syllables). Sentences were produced as declarative questions or statements in an adult-directed or
child-directed style. On average, child-directed versions had higher F0, greater F0 range and
variability, and longer duration than adult-directed versions (see Table 4). Each sentence was
amplitude-normalized at 75 dB SPL using Sound Forge Pro (Version 10.0; Sony, Tokyo, Japan).
Table 3
Stimulus Sentences (Study 3)
Judy knows Michael is going.
Lisa seems happy for Alex.
Maybe one apple is enough.
Someone went shopping for balloons.
People played music from Japan.
Puddles make rainy days better.
Donuts with sprinkles are tasty.
Brian is going home today.
Brendan is leaving town again.
Roses are growing on bushes.
Table 4
Mean Duration and Fundamental Frequency of Adult- and Child-Directed Utterances (Study 3)
Speech register Duration (ms) F0 mean (Hz) F0 min (Hz) F0 max (Hz) F0 SD (Hz)
Adult-directed 2798.00 258.01 138.74 522.60 82.90
Child-directed 3074.48 276.61 164.08 588.00 103.83
The stimuli were recorded in a double-walled, sound-attenuating chamber
(Industrial Acoustics Company, Bronx, NY) with a microphone (Sony T) connected
directly to a Windows 7 workstation. Testing was conducted in the same sound-attenuating
chamber. A computer workstation and amplifier (Harman-Kardon 3380, Stamford, CT) outside
the chamber interfaced with a 17-in touch-screen monitor (Elo LCD TouchSystems, Berwyn,
PA) and two wall-mounted loudspeakers (Electro-Medical Instrument Co., Mississauga, ON)
inside the chamber. The touchscreen monitor facing the participant was used to present
instructions and record responses. The loudspeakers were mounted at the corners of the sound
chamber, each located at a distance of 0.76 m and 45° azimuth from the participant. A customized
program created with Affect4 software (Spruyt et al., 2010) presented the auditory stimuli and
recorded responses. All stimuli were played at a comfortable listening level of approximately 65
dB SPL.
Procedure. Listeners heard all 40 stimulus sentences presented in random order,
constrained so that different versions of the same utterance (i.e., question or statement, adult- or
child-directed) could not appear consecutively. (Due to computer malfunction, one listener heard
39 sentences.) Each sentence was presented in a block of five trials, beginning with the one-word
stimulus, with successive trials adding a word, ending with the complete five-word sentence on
the fifth trial. After each trial, listeners were required to select one of four options: question,
maybe a question, maybe a statement, or statement, which combined choice of utterance type
with confidence level. The task began with a short familiarization phase featuring one trial block
with an adult-directed declarative question and another with an adult-directed statement, neither
of which appeared in the test phase. The experimenter provided feedback at the end of each trial
block (i.e., complete sentence) only during the familiarization phase.
Results and Discussion
Preliminary analyses revealed no meaningful differences as a function of whether
responses indicated more confidence (i.e., question) or less confidence (i.e., maybe a question).
Accordingly, each response was considered dichotomously as statement or question, which
allowed us to transform these responses into d' scores. For each participant, d' scores were
computed separately for each number of words and for adult- and child-directed sentences. Thus,
each participant had 10 d' scores. A hit was a response in the question category (question or
maybe a question) when the stimulus was a question. A false-alarm was a response in the
question category when the stimulus was a statement. In terms of signal-detection theory, the
“signal” consisted of the acoustic cues that distinguish questions from statements, and d' scores
reflected sensitivity to these cues.
The stimulus set consisted of 20 adult-directed and 20 child-directed utterances, half
questions and half statements. Thus, the maximum number of successes for adult- and child-
directed questions was 10 for each number of words. Because d' scores cannot be computed
when the hit rate equals 1.0 or the false-alarm rate equals 0 (i.e., infinite z-scores), hit and
false-alarm counts were adjusted by using the following formula: (number of hits or false alarms
+ .5) / (maximum number of successes + 1), as in previous developmental studies (e.g., Thorpe et
al., 1988). The adjusted proportions were then converted to z-scores and subsequently to d' scores (d' = z[hit
rate] – z[false-alarm rate]). The maximum d' score, indicating perfect performance (i.e., 10 hits, 0
false alarms), was 3.38, and chance responding (i.e., equal number of hits and false alarms) was
0. Descriptive statistics are illustrated in Figure 8 as a function of age group, speaking style, and
number of words.
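The adjustment and conversion described above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' analysis code; the function name is ours, and Python's stdlib `statistics.NormalDist` supplies the proportion-to-z conversion:

```python
from statistics import NormalDist

def d_prime(hits, false_alarms, n_max=10):
    """d' with the (count + .5) / (n_max + 1) adjustment, which keeps
    z-scores finite when the hit rate is 1.0 or the false-alarm rate is 0."""
    z = NormalDist().inv_cdf  # converts a proportion to a z-score
    return z((hits + 0.5) / (n_max + 1)) - z((false_alarms + 0.5) / (n_max + 1))

# Perfect performance (10 hits, 0 false alarms) yields the ceiling reported above
print(round(d_prime(10, 0), 2))  # 3.38
# Equal hits and false alarms corresponds to chance responding
print(round(d_prime(5, 5), 2))   # 0.0
```

Note that the 3.38 ceiling is a consequence of the adjustment: without it, a perfect score would produce an infinite d'.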
Initial analyses compared performance with chance levels using one-sample t-tests.
Because there were 10 tests (two speaking styles x five different numbers of words), the alpha-
level was lowered to .005. Nevertheless, performance exceeded chance levels in all cases, ps <
.001, indicating that cues to statements or questions were discernible even from the first word,
regardless of speaking style.
Figure 8. Performance in Experiment 1 (adults) and Experiment 2 (children) as a function of
speech register and utterance increments (Study 3).
When the entire five-word utterance was presented, performance was at ceiling, as
expected, with mean d' scores of 3.20 (SD = 0.36) and 3.12 (SD = 0.38) for adult- and child-
directed speech, respectively, including perfect performance by 34 of 45 listeners for adult-
directed speech and 29 of 45 for child-directed speech. Moreover, performance on five-word
utterances vastly exceeded performance on one-, two-, three-, and four-word stimuli for both
speaking styles, ps < .001, yet there was no difference between speaking styles for five-word
stimuli, p > .2. Thus, further analyses were limited to stimuli with one to four words.
Response patterns were analyzed with an analysis of variance (ANOVA), with speaking
style (adult- or child-directed speech) and number of words (1 through 4) as repeated measures
and d' scores as the dependent variable. A main effect of speaking style revealed that participants
more readily differentiated questions from statements in child-directed than in adult-directed
speech, F(1, 44) = 24.43, p < .001, η2 = .357. A main effect of number of words stemmed from
the fact that performance improved monotonically as the number of words increased, F(3, 132) =
58.41, p < .001, η2 = .570. Both main effects were qualified, however, by a significant two-way
interaction, F(3, 132) = 9.29, p < .001, η2 = .174. Separate analyses of the two speaking styles
revealed a linear improvement in performance for adult-directed speech, F(1, 44) = 97.90, p <
.001, η2 = .690, and for child-directed speech, F(1, 44) = 27.59, p < .001, η2 = .385, with the
interaction highlighting more dramatic improvement for adult-directed speech. As shown in
Figure 8, performance on child-directed speech was relatively good even with one-word stimuli,
which left less room for improvement as the number of words increased.
The present results confirmed the perceptibility of acoustic differences between
statements and declarative questions that featured identical words. When the pitch contour of the
penultimate and final words, whether rising or falling, was present, the identification of
declarative questions and statements was at ceiling. Nevertheless, cues earlier in the utterances
permitted successful identification, especially when the speaker used a child-directed style.
Experiment 2 asked whether children are comparably successful at discriminating statements
from questions when terminal pitch cues are present, whether they capitalize on subtle acoustic
cues in pre-terminal position, and whether they benefit comparably from the child-directed
speaking style.
Experiment 2
Although English-speaking children as young as 5 can differentiate questions from
statements on the basis of intonation alone, 5- and 6-year-olds have much greater difficulty
compared to older children, even after considerable training (Study 1). Accordingly, the present
study was limited to children between 7 and 10 years of age.
Method
Participants. The participants, 32 children from the local community, included 15 7- and
8-year-olds (designated “younger children”: 7 girls, 8 boys; mean age = 7 years, 9 months, range
= 7;2–8;10), and 17 9- and 10-year-olds (designated “older children”: 9 girls, 8 boys; mean age =
10 years, 1 month, range = 9;1–10;11). Inclusion criteria were Canadian birth and no personal or
family history of hearing loss, according to parental report. An additional two 7-year-olds were
tested but excluded from the final sample because they failed to identify questions and
statements above chance levels (14 or more incorrect out of 40) after hearing the complete
sentence, indicating poor understanding or attention. Children received a small gift for their
participation.
Apparatus and stimuli. The apparatus and stimuli were the same as in Experiment 1.
Procedure. The adults’ task from Experiment 1 was modified to make it more engaging
for children. On the basis of previous research indicating young children’s confusion with the
terms statement and question (Study 1), the response options from Experiment 1 were simplified
by changing them to asking, maybe asking, maybe telling, and telling. In addition, non-
contingent feedback was provided during the test phase by presenting the image of one of several
cartoon characters after completion of each trial block.
Results and Discussion
Responses were converted to d' scores, as in Experiment 1, with telling and maybe telling
considered as statement responses, and asking and maybe asking considered as question
responses. Figure 8 illustrates descriptive statistics as a function of age group, speaking style,
and number of words.
Initial analyses examined whether performance exceeded chance levels separately for
both age groups, both speaking styles, and different numbers of words, with correction for
multiple tests, as in Experiment 1. For the older children, performance exceeded chance levels
for three- and five-word stimuli in adult-directed style, ps < .005, and was at the cusp of
significance for four-word stimuli, p = .005. For all stimuli in child-directed style, performance
exceeded chance levels, ps < .005. For the younger children and adult-directed utterances,
performance exceeded chance only for the five-word stimuli, p < .001. For child-directed
utterances, performance exceeded chance for three- and five-word stimuli, ps < .005.
For both speaking styles and age groups, performance for five-word stimuli was much
better than performance for utterances with fewer words, ps < .001, as it was for adults in
Experiment 1 (see Figure 8). Moreover, for both groups of children, performance with five-word
stimuli did not vary as a function of speaking style, ps > .6. As in Experiment 1, the main
analysis focused on stimuli with one to four words.
A mixed-design ANOVA was used to analyze d' scores as a function of two repeated
measures (speaking style, number of words) and one between-subjects factor (age group). A
main effect of speaking style confirmed that children performed better with child-directed than
with adult-directed speech, F(1, 30) = 7.10, p = .012, η2 = .191. There was also a main effect of
number of words, F(3, 90) = 10.65, p < .001, η2 = .262, which was qualified by a two-way
interaction between age group and number of words, F(3, 90) = 5.19, p = .002, η2 = .147. There
were no other main effects or interactions, ps > .2. Separate analyses of the older and younger
children revealed that although both age groups exhibited a monotonic increase in mean d’
scores as the stimuli increased from one to four words, the linear trend was significant for the
older children, F(1, 16) = 28.65, p < .001, η2 = .642, but not for the younger children, p > .3.
Subsequent analyses considered the present samples of children jointly with the adults
tested in Experiment 1. We first examined how many individual participants could identify
questions at above-chance levels. We conducted a series of 3 (age group) by 2 (above or at
chance) chi-square tests of independence (one for each number of words) separately for adult-
and child-directed utterances. According to the normal approximation to the binomial test (one-
tailed, correcting for continuity), participants were significantly above chance if they had at least
15 out of 20 correct responses. The results are summarized in Table 5. For adult-directed
utterances, age-related differences in the proportion of participants who successfully
differentiated questions from statements were evident at three and four words. For child-directed
utterances, age-related differences were evident at one, two, three, and four words. In each
instance, adults had the greatest proportion of participants performing at above chance levels,
followed by older children, and then younger children.
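The 15-out-of-20 criterion follows directly from that approximation. As a quick check (our own sketch, using the conventional one-tailed critical value of 1.645):

```python
from math import sqrt

def above_chance(k, n=20, p=0.5, z_crit=1.645):
    """One-tailed normal approximation to the binomial with continuity
    correction: does k correct out of n exceed chance (p = .5)?"""
    z = (k - 0.5 - n * p) / sqrt(n * p * (1 - p))
    return z > z_crit

# k = 14 gives z = 3.5 / 2.236 = 1.57 (not significant);
# k = 15 gives z = 4.5 / 2.236 = 2.01, so 15/20 is the smallest passing count
print(min(k for k in range(21) if above_chance(k)))  # 15
```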
We also found a bias for responding statement rather than question for adult-directed
utterances by way of one-sample t-tests, which examined whether the mean number of statement
responses differed from the number expected in the absence of bias (50 out of 100), separately
for each speaking style and age group (see Table 6). The results revealed a significant bias for
statement responses for adult-directed (ps < .005) but not child-directed utterances (ps > .06)
across age groups. Perhaps listeners expect more pitch variability in questions than in statements,
choosing the latter category in the absence of obvious cues, as suggested by Vion and Colas
(2006). This would also explain the absence of a comparable bias for child-directed utterances,
which had more pitch variability and were therefore judged as questions more often than adult-
directed utterances. Moreover, because statements occur more commonly than questions in
everyday discourse, a statement response is a reasonable default option in the context of
ambiguous cues.
Table 5
Proportions of Participants Identifying Questions Above Chance After 1–5 Words (Study 3)
Adult-directed Child-directed
Words 7–8 9–10 Adults χ2 p 7–8 9–10 Adults χ2 p
1 .000 .059 .178 4.157 .125 .133 .353 .511 6.899 .032*
2 .067 .059 .244 4.415 .110 .200 .235 .667 15.071 .001*
3 .067 .235 .489 10.083 .006* .200 .412 .778 18.141 .001*
4 .200 .471 .600 7.247 .027* .267 .529 .867 20.590 .001*
5 1.00 1.00 1.00 n/a n/a .933 1.00 1.00 4.188 .123
Note: Each chi-square test of independence had df = 2 and n = 77.
Because the stimuli were limited to a single speaker, listeners had an opportunity to profit
from speaker-specific cues over the course of the test session. We tested this hypothesis with a
multilevel model that had position of correct judgment (1–5 words) as the dependent variable,
group (younger children, older children, adults) and speaking style (adult- or child-directed) as
fixed factors, and utterance number (1–40, centered) as a fixed, continuous variable. A random
intercept was included for each participant. Descriptive statistics are illustrated in Figure 9.
There was a main effect of speaking style, F(1, 2993.045) = 16.01, p < .001, because
performance on child-directed utterances exceeded that on adult-directed utterances, as noted
above. Main effects of age group, F(2, 74.006) = 22.92, p < .001, and utterance number, F(1,
2993.255) = 10.69, p = .001, were qualified by a two-way interaction between age group and
utterance number, F(2, 2993.237) = 4.61, p = .010 (see Figure 9). No other interactions were
evident, Fs < 1. Separate analyses of the three age groups revealed that performance improved
with increasing exposure to the speaker (i.e., fewer words required for correct identification) for
adults, p < .001, but not for older, p > .2, or younger, F < 1, children (see Figure 9). Adults also
outperformed children on the initial utterances (first 5 of 40), t(75) = 2.07, p = .042.
Table 6
Mean Statement Responses and Age (Study 3)
Age group Adult-directed Child-directed
Mean t p Mean t p
7–8 58.6 3.503 .004* 54.7 1.889 .080
9–10 58.6 3.476 .003* 49.2 -.294 .773
Adults 62.2 8.022 .001* 48.1 -1.890 .065
Note: * indicates values that significantly exceeded chance levels (50).
Descriptive analyses provided insight into the pre-terminal cues that enabled listeners to
distinguish questions from statements. Qualitative annotation conventions from Tones and Break
Indices (ToBI) transcriptions (Pierrehumbert, 1980) depict pitch events associated with
intonational boundaries and accented syllables. The transcriptions include two tone levels—high
(H) and low (L). The * symbol denotes the location of an accented syllable, and the % symbol
denotes the end of an intonational phrase. The + symbol is used when a nuclear accent is marked
by a bitonal pitch accent, or an accent comprising a fall-rise (L+H) or rise-fall (H+L) contour.
Figure 9. Mean number of words required to identify questions and statements as a function of
age and utterance number.
Sample question and statement contours are depicted in Figure 10. All questions in the
set began with a low pitch accent followed by a rise (L*+H). By contrast, most statements began
with a pitch rise followed by high pitch accent (L+H*), with the remaining statements having an
initial high pitch accent without the preceding rise (H*). Statements tended to end with a low
phrase accent and low boundary tone (L-L%), and questions tended to end with a high phrase
accent and high boundary tone (H-H%). Some child-directed statements, however, ended with a
low phrase accent and high boundary tone (L-H%). The boundary tone of questions was often
preceded by a low pitch accent (L*), whereas the boundary tone of statements was often
preceded by a high pitch accent (H*). Moreover, downstepped tones—a sequence of high pitch
accents that are not as high as the preceding high (‘H’) accent—were more common in
statements than in questions, reflecting the tendency for pitch to fall gradually over the course of
Figure 10. F0 contours and ToBI annotations from sample adult- and child-directed utterances (Study 3).
the utterance. Finally, differences between adult- and child-directed utterances were inconsistent,
with the patterns being sentence-dependent, presumably because of the location of nuclear tones.
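The dominant regularities just described can be summarized in a toy sketch. This is our own illustrative encoding of the reported patterns, not an analysis tool from the thesis:

```python
# Dominant ToBI patterns in this stimulus set (illustrative summary only)
PATTERNS = {
    "question":  {"initial_accent": "L*+H", "boundary": "H-H%"},
    "statement": {"initial_accent": "L+H*", "boundary": "L-L%"},
}

def guess_type(initial_accent):
    """Toy pre-terminal classifier: questions began with a low accent
    followed by a rise (L*+H); statements with L+H* or a plain H*."""
    return "question" if initial_accent == "L*+H" else "statement"

print(guess_type("L*+H"))  # question
print(guess_type("H*"))    # statement
```

The lookup mirrors why the initial accent alone carried usable information: the first word already distinguishes the timing of the pitch rise relative to the stressed syllable.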
General Discussion
The present study is the first to use a gating task to examine the identification of adult-
and child-directed questions and statements by English-speaking children and adults. We
established that children as young as 7 could use pre-terminal cues to identify questions and
statements. Adults differentiated questions from statements after hearing the first word regardless
of speech register. Older children (9- to 10-year-olds) performed better than younger children (7-
to 8-year-olds), with both groups identifying child-directed utterances more readily than adult-
directed utterances. Register also had more dramatic consequences for children than for adults.
For example, older children identified child-directed utterances after hearing the first word, but
they identified adult-directed utterances only after the first three words. Younger children
correctly identified utterance type after hearing the first three words of child-directed utterances,
but they required the complete adult-directed utterances to differentiate questions from
statements. Finally, repeated exposure to the speaker’s voice enhanced performance for adults
but not for children.
Performance differences between children and adults were evident early in the test
session (the initial 5 of 40 utterances), which indicates that adults’ explicit knowledge of the
intonation patterns that signify questions and statements exceeds that of children. In previous
research that required listeners to judge full sentences, 7- and 8-year-olds identified questions
and statements more poorly than 9- and 10-year-olds, who performed at adult levels (Study 1).
The findings of both studies are consistent with the notion that pitch contour differences,
although discriminable, are less memorable for children than for adults (Creel, 2014).
It is likely that adaptation to the demands of the gating task occurred relatively quickly.
By contrast, adaptation to speaker-specific cues would be expected to unfold more slowly after
exposure to a range of utterances from the single speaker. In fact, improvement in adults’ use of
pre-terminal cues became evident at utterances 26–30, with no further improvement thereafter
(see Figure 9). Children, by contrast, derived no benefit from such exposure within the time
frame of the test session. Talker familiarity advantages are likely to be achieved more readily for
word recognition than for intonation recognition. At times, however, familiarity effects in
school-age children who receive protracted exposure to talkers are restricted to highly familiar
words (Levi, 2015).
For questions in the present study, the pre-nuclear pitch accent was low and followed
rapidly by a rising contour (L*+H). For statements, the contour tended to rise during the pre-
nuclear pitch accent (L+H*). Although little is known about pre-terminal cues to questions and
statements, there is evidence that intonation contours tend to fall during stressed syllables in
statements but not in questions (O'Shaughnessy, 1979). This was not the case for the questions
and statements in the present study, both of which featured a rising pitch contour during the first
word. The main difference between our questions and statements involved the timing of the rise,
which occurred during the primary stress for statements but after the primary stress for questions.
Our findings are consistent with adults’ ability to differentiate Spanish questions from
statements after hearing the first content word, which features higher F0 peaks in questions than
in statements (Face, 2005). There are notable differences, however, in cues to utterance type
across languages and speakers. For example, adults correctly identify Dutch questions and
statements after hearing the second F0 accent, which is larger in questions than in statements for
female speakers but smaller for male speakers (van Heuven & Haan, 2000). Question
identification requires information in the penultimate syllable for French questions (Vion &
Colas, 2006) and in the final syllable for Portuguese questions (Falé & Faria, 2006).
In principle, intensity and duration could influence the perception of utterance type.
Although duration is not a reliable cue to question identity (Peng, Chatterjee, & Nelson, 2012),
longer utterances and those with higher terminal intensity are more likely to be identified as
questions than as statements (Ma et al., 2010; Patel, 2003; Peng et al., 2012). Intensity is highly
correlated with F0 (Bolinger, 1986), but listeners are thought to use intensity and duration cues
for question/statement identification only when F0 cues are unavailable (Ma et al., 2010; Patel,
2003; Peng & Chatterjee, 2009).
Our study is also the first to examine the impact of child-directed speech on listeners’
differentiation of questions and statements. Child-directed speech, which is characterized by
slower speaking rate, more careful pronunciation, and wider pitch range relative to adult-directed
speech (Foulkes et al., 2005; Jacobson et al., 1983), emphasizes the speaker’s affective intentions
and focus of interest. The present findings indicate that it also emphasizes the speaker’s
questioning or declarative intent, as one might expect. Acoustic analyses confirmed that the
differences between questions and statements in adult- and child-directed speech were similar
except for the greater magnitude of differences in child-directed speech. As can be seen in Figure
10, pitch accents were more prominent in child-directed than in adult-directed utterances.
The finding of age-related differences in overall question and statement identification is
consistent with age-related differences in French-speaking children’s identification of utterance
type (Gérard & Clément, 1998). In contrast to French 7-year-olds’ inability to distinguish
declarative questions from statements and 9-year-olds’ ability to do so only after hearing the
entire utterance, English-speaking children in the present study correctly identified statements
and questions early in the utterance (except for 7- and 8-year-olds’ judgments of adult-directed
utterances). As noted, French includes fewer pre-terminal cues to questions and statements (Vion
& Colas, 2006), whereas questions and statements in our English stimuli included distinct pre-
terminal differences (L*+H tone in questions and L+H* tone in statements). It is unlikely,
however, that the cross-language differences are attributable solely to differential cue
distinctiveness or issues of incidence. For example, although the French study also featured a
single talker, it was considerably more challenging because of five response alternatives—
asking, telling, happy, sad, and ironic—rather than the two general response classes—asking and
telling (each qualified by maybe)—in the present study.
It is important to note the limitations of the gating paradigm for assessing listeners’
identification of the speaker’s pragmatic intentions based on intonation alone. In principle, the
units of interest are intonation phrases rather than words, and studies with adults have used
intonation rather than word gates (e.g., Face, 2005; Vion & Colas, 2006; van Heuven & Haan,
2000). Young children’s difficulty with the identification of complete declarative questions and
statements in the absence of conversational context (Study 1) makes it unlikely that they
would be able to interpret the speaker’s intentions from partial words (i.e., gates based on
intonation contour rather than words). Eye-tracking measures (e.g., Zhou et al., 2012) could
provide a viable means of assessing the impact of intonation contour on question and statement
identification, but the use of such measures necessitates utterances with clear visual referents,
unlike those in the present study.
Although 9- and 10-year-olds exhibited considerable success in question identification,
especially for child-directed utterances, their failure to capitalize on speaker-specific intonation
patterns prevented them from achieving adult performance levels. Similar challenges may
underlie the failure of 13-year-olds to match adults’ perception of emotional prosody (Aguert,
Laval, Lacroix, Gil, & Le Bigot, 2013). Future research could examine parallels and differences
in children’s understanding of prosodic cues that signal questions and emotional tone by means
of gating tasks involving familiar talkers.
Concluding Statements
The goal of this thesis was to examine developmental changes in the ability to use
intonation cues to identify declarative questions and statements. In Study 1, 5- to 10-year-old
children and adults were required to identify natural and low-pass filtered questions and
statements. Children and adults were able to identify the utterance type based on intonation cues
alone, but 5- and 6-year-olds performed more poorly than older listeners, and 7- and 8-year-olds
performed more poorly than adults. Performance of the 9- and 10-year-olds was comparable to
that of adults. Performance of all age groups was affected adversely by low-pass filtering of the
stimuli, which removed lexical information and decreased the salience of pitch cues. In Study 2,
8- to 11-year-old children and adults judged utterances with electronically manipulated endings
as questions and statements. Adults were also required to discriminate between pairs of these
utterances. Children shifted their judgments more gradually between statement and question
categories than adults did, but their category boundaries were similar. Moreover, adults’
performance on the discrimination and identification tasks was consistent with categorical
perception of statements and questions. In Study 3, 7- to 10-year-old children and adults were
required to identify adult- and child-directed questions and statements in a gating task that
provided the first word of the utterance, the first two words, the first three words, and so on.
Adults succeeded in identifying the utterance type after hearing the first word of adult- and child-
directed utterances. For children, identification accuracy was greater for child-directed than for
adult-directed utterances. Moreover, adults performed better than 9- to 10-year-olds, who
performed better than 7- to 8-year-olds.
In contrast to the prevailing view that the rules of language are largely mastered by 4 or 5
years of age (e.g., Werker & Hensch, 2015), the present findings reveal that children’s
association of prosodic cues with a speaker’s intentions has a protracted developmental
trajectory. Why do children, as late as 10 years of age, find it more difficult than adults to use
prosodic information to infer a speaker’s intentions? Immature cognitive control is likely to be
implicated, especially in younger children. For example, cognitive flexibility—the ability to
adapt behaviour to different situations—improves between 4 and 13 years of age (Ceci & Howe,
1978; Davidson et al., 2006), with important gains occurring between 5 and 7 years of age
(Hermer-Vazquez et al., 2001). Limited cognitive flexibility may affect children's ability to focus
on prosodic information to interpret a speaker's intentions when their usual focus is elsewhere.
For example, when children 4–10 years of age and adults are asked to judge a speaker's emotions
from the way she speaks, ignoring what she says, the verbal content dominates the responses of
4- to 9-year-olds and continues to influence the responses of 10-year-olds (Morton & Trehub,
2001). Not surprisingly, adults’ responses are based exclusively on prosodic cues.
Similar biases may influence children’s perception of prosodic cues related to questions
and statements. For example, 5- to 9-year-old children judge pairs of questions and statements as
similar if they feature identical lexical information but different prosodic cues (e.g., rising and
falling terminal pitch contours, respectively), and as different if they feature different lexical
information but similar prosodic cues (Ploog, Bannerjee, & Brooks, 2009). In other words, their
judgments of utterance pairs as similar or different are based on lexical rather than prosodic cues.
Taken together with the present findings, these studies suggest that prosodic cues are less
salient in childhood than in adulthood.
The active-latent account (Morton & Munakata, 2002) posits that performance on
cognitive tasks is influenced by the interaction between attentional biases and representations of
task goals or instructions in working memory. Children between 5 and 7 years of age seem to be
unaware that prosody can provide semantic information, which results in weak representations of
a speaker’s true intentions and, at times, in prosody production preceding perception (Cutler &
Swinney, 1987). Limited explicit knowledge of prosody coupled with a lexical bias could
explain the youngest children’s difficulty with identification of questions and statements on the
basis of prosodic information alone.
Lexical cues interfere with children’s use of prosodic cues to emotion whether or not the
lexical cues have emotional implications (Morton & Trehub, 2001; Morton et al., 2003). For
example, children often infer a speaker's feelings from emotionally neutral lexical information or
contexts rather than prosodic information, inventing reasons to explain why the speaker is happy
or sad (Aguert et al., 2013; Morton et al., 2003). In short, children may favour uninformative or
even misleading cues (lexical) over informative cues (prosodic) because of immature cognitive
flexibility or difficulty shifting from their usual strategies to ones that are more appropriate for
the task at hand. Selective auditory attention, the ability to filter out irrelevant sounds, also improves between 5 and 14
years of age (Doyle, 1973; Maccoby & Konrad, 1966; Pearson & Lane, 1991; Plude et al., 1994),
which may contribute further to children’s difficulty in focusing selectively on prosodic cues and
ignoring irrelevant lexical cues.
With respect to question and statement identification, children’s difficulty is mirrored in
production. For example, their production of questions with rising terminal pitch, which is the
most salient prosodic cue to questions, increases between 4 and 11 years of age (Patel & Grigos,
2006; Snow, 1998; Wells et al., 2004). Children also have difficulty in interpreting other aspects
of linguistic prosody such as the distinction between topic and comment and compound versus
phrasal stress (HOT dog vs. hot DOG) (Cutler & Swinney, 1987; Hornby, 1971; Moore, Harris, &
Patriquin, 1993; Quam & Swingley, 2014; Vogel & Raimy, 2002; Wells et al., 2004).
Prosodic comprehension is related to general language and cognitive abilities. For
example, the receptive prosodic skills of 6- to 14-year-old children, such as the ability to identify
sentence focus or compound nouns, are related to grammatical comprehension (Stojanovik,
Setter, & van Ewijk, 2007) and to verbal intelligence (Peppé, McCann, Gibbon, O'Hare, &
Rutherford, 2007). Moreover, third graders who read quickly and accurately also produce shorter
and more adult-like pauses, larger pitch falls at the end of statements, and larger pitch rises at the
end of yes/no questions (Miller & Schwanenflugel, 2006). Those who use larger pitch falls and
rises while reading statements and questions, respectively, also demonstrate greater reading
comprehension. Further evidence of a link between prosodic abilities and general cognitive and
linguistic abilities comes from adolescents with Williams Syndrome, who are markedly poorer at
interpreting linguistic prosody than age-matched controls, with the performance gap narrowing
for controls matched on language level (Catterall, Howard, Stojanovik, Szczerbinski, & Wells,
2006).
Children’s difficulty with identifying questions and statements solely on the basis of
prosodic cues may be ameliorated in more natural situations. For example, face-to-face
conversations provide contextual language cues (i.e., previous utterances) and visual cues in
addition to prosodic cues. Even in the absence of some contextual cues, as in telephone
conversations, prior utterances would be available to facilitate children’s comprehension of
declarative questions.
Although there is some evidence that children use supplementary cues to interpret
emotional prosody, there are important age-related differences. For example, 5- and 7-year-old
children use contextual (i.e., situational) cues rather than prosody to interpret a speaker's
emotions, 9-year-olds use both contextual and prosodic cues, and adults largely rely on prosody
(Aguert et al., 2010). As noted, children between 5 and 13 years of age are more likely than
adults to use contextual rather than prosodic cues to interpret a speaker's emotion even when the
contextual cues are neutral, which provides additional evidence that prosodic cues pose
challenges for children (Aguert et al., 2013). For example, when using contextual cues to explain
a speaker's emotions, children tend to invent reasons why the context was happy or sad. Even
when visual (e.g., facial expressions) and prosodic cues are available, children between 5 and 7
years of age use contextual (i.e., situational) cues to interpret a speaker’s emotions (Gil, Aguert,
Le Bigot, Lacroix, & Laval, 2014).
Although adults prioritize prosodic cues over visual cues when interpreting a speaker's
intentions, they also use the visual information. For example, Dutch listeners give more weight
to prosodic cues when interpreting the location of sentence focus, but they perceive the sentence
focus as more prominent when visual cues (e.g., eyebrow movements, head nods, and
articulatory movements) and prosodic cues are congruent (Krahmer & Swerts, 2008; Swerts &
Krahmer, 2008). Moreover, speakers of American English can identify phrasal and lexical stress
based on facial movements alone (Scarborough, Keating, Mattys, Cho, & Alwan, 2009).
Visual cues also help listeners differentiate questions from statements. For example,
facial cues reinforce auditory cues to Swedish questions when the auditory cues are weak or
ambiguous (House, 2002). Moreover, facial expressions have more influence than auditory cues
when differentiating contrastive focus from echo questions in Catalan, but the combination of
facial and acoustic cues produces the fastest and most accurate judgments (Borràs-Comes &
Prieto, 2011). Visual cues also affect the identification of questions and statements in American
English. For example, listeners rely primarily on auditory cues to differentiate declarative
questions from statements, but they can also distinguish these utterance types based on visual
cues (head tilt and eyebrow movements), and performance is best when auditory and visual cues
are available (Srinivasan & Massaro, 2003). Future developmental research that combined
auditory with visual cues would provide much needed information about age-related differences
in the understanding of pragmatic intentions.
To date, however, there is little information about children’s use of visual cues to
interpret linguistic prosody. A single study by Doherty-Sneddon and Kent (1996) indicated that
6-year-olds performed more poorly than 11-year-olds when the speaker was behind a screen but
not when the speaker was visible. This finding implies that visual cues influence young
children’s interpretation of linguistic prosody. If children have weak representations of prosodic
cues, they may require support from supplementary cues. Future research could examine the role
of visual cues in children’s interpretation of linguistic prosody, including the prosodic cues that
signal questions and statements.
There is increasing interest in the interpretation of linguistic prosody, including question
and statement identification, by special populations. For example, children with Autism
Spectrum Disorder (Peppé et al., 2007), language impairment (Wells & Peppé, 2003), and
Williams Syndrome (Catterall et al., 2006) can discriminate declarative questions and statements
as well as typically developing children, but they have greater difficulty in identifying the
speaker's intentions. In other words, atypically developing children perceive the relevant
prosodic cues but their challenge, like that of young, typically developing children, is mapping
these cues onto meaning or pragmatic intentions.
Deaf individuals who use cochlear implants tend to have difficulty perceiving prosodic
cues because their prostheses transmit degraded spectral information (Galvin, Fu, & Shannon,
2009), which impairs the perception of pitch contours (Gfeller et al., 2005; Gfeller et al., 2012;
McDermott, 2004). As a result, their perception (Chatterjee & Peng, 2008; Meister et al., 2009;
Most & Peled, 2007; Peng et al., 2009) and production (Peng, Tomblin, & Turner, 2008) of
questions and statements is impaired relative to that of their normally hearing peers.
In sum, by 5 years of age, children can identify questions and statements on the basis of
prosody alone, but they are much less proficient in this regard than older children and adults. By
9 years of age, children are as proficient as adults at identifying statements and questions from
complete utterances, but they are less proficient than adults on more challenging tasks involving
electronically altered terminal cues or pre-terminal cues. Children’s difficulties are attributable,
in part, to limitations in their language and cognitive abilities. They can discriminate the requisite
prosodic distinctions, but there are important age-related changes in attending to those cues and
understanding the links between prosodic cues and meaning. Because the interpretation of
linguistic prosody is related to general language abilities, it may be possible to improve
children’s comprehension of prosodic information by focusing on language abilities generally
rather than prosody specifically. Links between the perception and production of prosody (Peppé
et al., 2007; Wells et al., 2004) raise the possibility that interventions designed to improve the
perception of prosodic cues could have consequences for production as well as perception. For
example, an intervention designed to improve the perception of prosodic cues by children with
cochlear implants had favourable consequences for production (Klieve & Jeanes, 2001).
It is important to emphasize, however, that young children are likely to be more
disadvantaged than older children and adults by decontextualized stimuli such as those used in
the present studies. Children’s skills in ecologically valid contexts, or real environments, often
differ substantially from those observed in laboratory environments (Neisworth & Bagnato,
2004). Future research could compare age-related changes in the identification of questions and
statements in relatively natural contexts that include visual and contextual cues with contexts in
which listeners must rely exclusively on prosody, as in the present dissertation.
References
Aguert, M., Laval, V., Lacroix, A., Gil, S., & Le Bigot, L. (2013). Inferring emotions from
speech prosody: Not so easy at age five. Plos One, 8, e83657.
doi:10.1371/journal.pone.0083657
Aguert, M., Laval, V., Le Bigot, L., & Bernicot, J. (2010). Understanding expressive speech acts:
The role of prosody and situational context in French-speaking 5-to 9-year-olds. Journal
of Speech, Language, and Hearing Research, 53, 1629–1641. doi:10.1044/1092-
4388(2010/08-0078)
Andrews, M. L., & Madeira, S. S. (1977). The assessment of pitch discrimination ability in
young children. Journal of Speech and Hearing Disorders, 42, 279–286.
doi:10.1044/jshd.4202.279
Armstrong, M. E. (2012). The development of yes-no question intonation in Puerto Rican
Spanish (Doctoral dissertation, Ohio State University). Retrieved from http://eric.ed.gov/
Banbury, S. P., Macken, W. J., Tremblay, S., & Jones, D. M. (2001). Auditory distraction and
short-term memory: Phenomena and practical implications. Human Factors: The Journal
of the Human Factors and Ergonomics Society, 43, 12–29.
doi:10.1518/001872001775992462
Banse, R., & Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of
Personality and Social Psychology, 70, 614–636. doi:10.1037/0022-3514.70.3.614
Boersma, P. (2002). Praat: A system for doing phonetics by computer. Glot International, 5,
341–345.
Bolinger, D. L. M. (1986). Intonation and its parts: Melody in spoken English. Stanford, CA:
Stanford University Press.
Bolinger, D. L. M. (1989). Intonation and its uses: Melody in grammar and discourse. Stanford,
CA: Stanford University Press.
Borràs-Comes, J., & Prieto, P. (2011). ‘Seeing tunes.’ The role of visual gestures in tune
interpretation. Laboratory Phonology, 2, 355–380. doi:10.1515/labphon.2011.013
Caporael, L. R. (1981). The paralanguage of caregiving: Baby talk to the institutionalized aged.
Journal of Personality and Social Psychology, 40, 876–884. doi:10.1037/0022-
3514.40.5.876
Catterall, C., Howard, S., Stojanovik, V., Szczerbinski, M., & Wells, B. (2006). Investigating
prosodic ability in Williams syndrome. Clinical Linguistics & Phonetics, 20, 531–538.
doi:10.1080/02699200500266380
Ceci, S. J., & Howe, M. J. (1978). Age-related differences in free recall as a function of retrieval
flexibility. Journal of Experimental Child Psychology, 26, 432–442. doi:10.1016/0022-
0965(78)90123-6
Chatterjee, M., & Peng, S. C. (2008). Processing F0 with cochlear implants: Modulation
frequency discrimination and speech intonation recognition. Hearing Research, 235,
143–156. doi:10.1016/j.heares.2007.11.004
Costa-Giomi, E., & Descombes, V. (1996). Pitch labels with single and multiple meanings: A
study with French-speaking children. Journal of Research in Music Education, 44, 204–
214. doi:10.2307/3345594
Cowan, N., Morey, C. C., AuBuchon, A. M., Zwilling, C. E., & Gilchrist, A. L. (2010). Seven‐
year‐olds allocate attention like adults unless working memory is overloaded.
Developmental Science, 13, 120–133. doi:10.1111/j.1467-7687.2009.00864.x
Creel, S. C. (2014). Tipping the scales: Auditory cue weighting changes over development.
Journal of Experimental Psychology: Human Perception and Performance, 40, 1146–
1160. doi:10.1037/a0036057
Creel, S.C., Rojo, D.P., & Paullada, A.N. (2016). Effects of contextual support on preschoolers’
accented speech comprehension. Journal of Experimental Child Psychology, 146, 156–
180. doi: 10.1016/j.jecp.2016.01.018
Cruttenden, A. (1981). Falls and rises: Meanings and universals. Journal of Linguistics, 17, 77–
91. doi:10.1017/S0022226700006782
Cruttenden, A. (1985). Intonation comprehension in ten-year-olds. Journal of Child Language,
12, 643–661. doi:10.1017/S030500090000670X
Cruttenden, A. (1997). Intonation. Cambridge, UK: Cambridge University Press.
Cutler, A., & Swinney, D. A. (1987). Prosody and the development of comprehension. Journal
of Child Language, 14, 145–167. doi:10.1017/s0305000900012782
Davidson, M. C., Amso, D., Anderson, L. C., & Diamond, A. (2006). Development of cognitive
control and executive functions from 4 to 13 years: Evidence from manipulations of
memory, inhibition, and task switching. Neuropsychologia, 44, 2037–2078.
doi:10.1016/j.neuropsychologia.2006.02.006
Deak, G. O. (2003). The development of cognitive flexibility and language abilities. Advances in
Child Development and Behavior, 31, 273–328. doi:10.1016/s0065-2407(03)31007-9
Doherty, C. P., Fitzsimons, M., Asenbauer, B., & Staunton, H. (1999). Discrimination of prosody
and music by normal children. European Journal of Neurology, 6, 221–226.
doi:10.1111/j.1468-1331.1999.tb00016.x
Doherty‐Sneddon, G., & Kent, G. (1996). Visual signals and the communication abilities of
children. Journal of Child Psychology and Psychiatry, 37, 949–959. doi:10.1111/j.1469-
7610.1996.tb01492.x
Dowling, W. J., & Fujitani, D. S. (1971). Contour, interval, and pitch recognition in memory for
melodies. The Journal of the Acoustical Society of America, 49, 524–531.
doi:10.1121/1.1912382
Doyle, A. B. (1973). Listening to distraction: A developmental study of selective attention.
Journal of Experimental Child Psychology, 15, 100–115. doi:10.1016/0022-
0965(73)90134-3
Drake, C., & Bertrand, D. (2001). The quest for universals in temporal processing in music.
Annals of the New York Academy of Sciences, 930, 17–27.
doi:10.1093/acprof:oso/9780198525202.003.0002
Drayna, D., Manichaikul, A., de Lange, M., Snieder, H., & Spector, T. (2001). Genetic correlates
of musical pitch recognition in humans. Science, 291, 1969–1972.
doi:10.1126/science.291.5510.1969
Eady, S. J., & Cooper, W. E. (1986). Speech intonation and focus location in matched statements
and questions. The Journal of the Acoustical Society of America, 80, 402–417.
doi:10.1121/1.394091
Ekman, P. (1976). Movements with precise meanings. Journal of Communication, 26, 14–26.
doi:10.1111/j.1460-2466.1976.tb01898.x
Ekman, P. (1979). About brows: Emotional and conversational signals. In M. von Cranach, K.
Foppa, W. Lepenies & D. Ploog (Eds.), Human ethology: Claims and limits of a new
discipline (pp. 169–248). Cambridge, UK: Cambridge University Press.
Estigarribia, B. (2010). Facilitation by variation: Right-to-left learning of English yes/no
questions. Cognitive Science, 34, 68–93. doi:10.1111/j.1551-6709.2009.01053.x
Face, T. L. (2005). F0 peak height and the perception of sentence type in Castilian Spanish.
Revista Internacional De Lingüística Iberoamericana, 3, 49–65. Retrieved from
http://prosodia.upf.edu/atlesentonacio/recursos/documents/prieto/
Falé, I., & Faria, I. H. (2006). Time course of intonation processing in European Portuguese: A
gating paradigm approach. Iv Jornadas En Tecnologia Del Habla, 265–270. Retrieved
from http://lorien.die.upm.es/~lapiz/rtth/JORNADAS/IV/
Fancourt, A., Dick, F., & Stewart, L. (2013). Pitch-change detection and pitch-direction
discrimination in children. Psychomusicology: Music, Mind, and Brain, 23, 73–81.
doi:10.1037/a0033301
Fernald, A., & Mazzie, C. (1991). Prosody and focus in speech to infants and adults.
Developmental Psychology, 27, 209–221. doi:10.1037//0012-1649.27.2.209
Filipe, M. G., Frota, S., Castro, S. L., & Vicente, S. G. (2014). Atypical prosody in Asperger
syndrome: Perceptual and acoustic measurements. Journal of Autism and Developmental
Disorders, 44, 1972–1981. doi:10.1007/s10803-014-2073-2
Foulkes, P., Docherty, G., & Watt, D. (2005). Phonological variation in child-directed speech.
Language, 80, 177–206. doi:10.1353/lan.2005.0018
Francis, A.L., Ciocca, V., Ma, L., & Fenn, K. (2008). Perceptual learning of Cantonese lexical
tones by tone and non-tone language speakers. Journal of Phonetics, 36, 268–294.
doi:10.1016/j.wocn.2007.06.005
Francis, A. L., Ciocca, V., & Ng, B. K. C. (2003). On the (non) categorical perception of lexical
tones. Perception & Psychophysics, 65, 1029–1044. doi:10.3758/bf03194832
Friend, M. (2000). Developmental changes in sensitivity to vocal paralanguage. Developmental
Science, 3, 148–162. doi:10.1111/1467-7687.00108
Frota, S., Butler, J., & Vigário, M. (2014). Infants' perception of intonation: Is it a statement or a
question? Infancy, 19, 194–213. doi:10.1111/infa.12037
Galvin, J. J., III, Fu, Q. J., & Nogaki, G. (2007). Melodic contour identification by cochlear
implant listeners. Ear and Hearing, 28, 302–319.
doi:10.1097/01.aud.0000261689.35445.20
Galvin, J. J., III, Fu, Q., & Shannon, R. V. (2009). Melodic contour identification and music
perception by cochlear implant users. Annals of the New York Academy of Sciences,
1169, 518–533. doi:10.1111/j.1749-6632.2009.04551.x
Gårding, E., & Abramson, A. S. (1965). A study of the perception of some American English
intonation contours. Studia Linguistica, 19, 61–79. doi:10.1111/j.1467-
9582.1965.tb00527.x
Gérard, C., & Clément, J. (1998). The structure and development of French prosodic
representations. Language and Speech, 41, 117–142. doi:10.1177/002383099804100201
Gfeller, K., Jiang, D., Oleson, J. J., Driscoll, V., Olszewski, C., Knutson, J. F., ... Gantz, B.
(2012). The effects of musical and linguistic components in recognition of real-world
musical excerpts by cochlear implant recipients and normal-hearing adults. Journal of
Music Therapy, 49, 68–101. doi: 10.1093/jmt/49.1.68
Gfeller, K., Olszewski, C., Rychener, M., Sena, K., Knutson, J. F., Witt, S., & Macpherson, B.
(2005). Recognition of “real-world” musical excerpts by cochlear implant recipients and
normal-hearing adults. Ear and Hearing, 26, 237–250. doi:10.1097/00003446-
200506000-00001
Gfeller, K., Turner, C., Mehr, M., Woodworth, G., Fearn, R., Knutson, J. F., ... Stordahl, J.
(2002). Recognition of familiar melodies by adult cochlear implant recipients and
normal‐hearing adults. Cochlear Implants International, 3, 29–53. doi:10.1002/cii.50
Gil, S., Aguert, M., Le Bigot, L., Lacroix, A., & Laval, V. (2014). Children’s understanding of
others’ emotional states: Inferences from extralinguistic or paralinguistic cues?
International Journal of Behavioral Development, 38, 539–549.
doi:10.1177/0165025414535123
Grosjean, F. (1980). Spoken word recognition processes and the gating paradigm. Perception &
Psychophysics, 28, 267–283. doi:10.3758/BF03204386
Gunlogson, C. (2002). Declarative questions. Semantics and Linguistic Theory, 12, 124–143.
doi:10.3765/salt.v12i0.2860
Hadding-Koch, K., & Studdert-Kennedy, M. (1964). An experimental study of some intonation
contours. Phonetica, 11, 175–185. doi:10.1159/000258338
Hair, H. I. (1977). Discrimination of tonal direction on verbal and nonverbal tasks by first grade
children. Journal of Research in Music Education, 25, 197–210. doi:10.2307/3345304
Hallé, P. A., Chang, Y. C., & Best, C. T. (2004). Identification and discrimination of Mandarin
Chinese tones by Mandarin Chinese vs. French listeners. Journal of Phonetics, 32, 395–
421. doi:10.1016/s0095-4470(03)00016-0
Halpern, A. R. (1984). Organization in memory for familiar songs. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 10, 496–512. doi:10.1037/0278-
7393.10.3.496
Hébert, S., & Peretz, I. (1997). Recognition of music in long-term memory: Are melodic and
temporal patterns equal partners? Memory & Cognition, 25, 518–533.
doi:10.3758/bf03201127
Heeren, W. F., Bibyk, S. A., Gunlogson, C., & Tanenhaus, M. K. (2015). Asking or telling–Real-
time processing of prosodically distinguished questions and statements. Language and
Speech, 58, 474–501. doi:10.1177/0023830914564452
Hermer-Vazquez, L., Moffet, A., & Munkholm, P. (2001). Language, space, and the
development of cognitive flexibility in humans: The case of two spatial memory tasks.
Cognition, 79, 263–299. doi:10.1016/s0010-0277(00)00120-7
Hervais-Adelman, A. G., Davis, M. H., Johnsrude, I. S., Taylor, K. J., & Carlyon, R. P. (2011).
Generalization of perceptual learning of vocoded speech. Journal of Experimental
Psychology: Human Perception and Performance, 37, 283–295. doi:10.1037/a0020772
Hervais-Adelman, A., Davis, M. H., Johnsrude, I. S., & Carlyon, R. P. (2008). Perceptual
learning of noise vocoded words: effects of feedback and lexicality. Journal of
Experimental Psychology: Human Perception and Performance, 34, 460–474.
doi:10.1037/0096-1523.34.2.460
Hornby, P. A. (1971). Surface structure and the topic-comment distinction: A developmental
study. Child Development, 42, 1975–1988. doi:10.2307/1127600
House, D. (2002). Intonational and visual cues in the perception of interrogative mode in
Swedish. Proceedings of Interspeech, USA, 1957–1960. Retrieved from http://www.isca-
speech.org/archive/archive_papers/icslp_2002/
Hutchins, S., Gosselin, N., & Peretz, I. (2010). Identification of changes along a continuum of
speech intonation is impaired in congenital amusia. Frontiers in Psychology, 1(236).
doi:10.3389/fpsyg.2010.00236
Jacobson, J. L., Boersma, D. C., Fields, R. B., & Olson, K. L. (1983). Paralinguistic features of
adult speech to infants and small children. Child Development, 54, 436–442.
doi:10.2307/1129704
Kang, R., Nimmons, G. L., Drennan, W., Longnion, J., Ruffin, C., Nie, K., ... Rubinstein, J.
(2009). Development and validation of the University of Washington Clinical
Assessment of Music Perception test. Ear and Hearing, 30, 411–418.
doi:10.1097/aud.0b013e3181a61bc0
Karzon, R. G., & Nicholas, J. G. (1989). Syllabic pitch perception in 2- to 3-month-old infants.
Perception & Psychophysics, 45, 10–14. doi:10.3758/bf03208026
Kidd, G., Boltz, M., & Jones, M. R. (1984). Some effects of rhythmic context on melody
recognition. The American Journal of Psychology, 97,153–173. doi: 10.2307/1422592
Klieve, S., & Jeanes, R. (2001). Perception of prosodic features by children with cochlear
implants: Is it sufficient for understanding meaning differences in language? Deafness &
Education International, 3, 15–37. doi:10.1002/dei.92
Knoll, M. A., Uther, M., & Costall, A. (2009). Effects of low-pass filtering on the judgment of
vocal affect in speech directed to infants, adults and foreigners. Speech Communication,
51, 210–216. doi:10.1016/j.specom.2008.08.001
Kong, Y. Y., Cruz, R., Jones, J. A., & Zeng, F. G. (2004). Music perception with temporal cues
in acoustic and electric hearing. Ear and Hearing, 25(2), 173–185.
Krahmer, E., & Swerts, M. (2006). Perceiving focus. In C. Lee, & M. Gordon (Eds.), Topic and
focus: Cross-linguistic perspectives on meaning and intonation (pp. 121–137). The
Netherlands: Springer Netherlands.
Kreuz, R. J., & Glucksberg, S. (1989). How to be sarcastic: The echoic reminder theory of verbal
irony. Journal of Experimental Psychology: General, 118, 374–386. doi:10.1037/0096-
3445.118.4.374
Krumhansl, C. L., & Keil, F. C. (1982). Acquisition of the hierarchy of tonal functions in music.
Memory & Cognition, 10, 243–251. doi:10.3758/bf03197636
Ladd, D. R., & Morton, R. (1997). The perception of intonational emphasis: continuous or
categorical? Journal of Phonetics, 25, 313–342. doi:10.1006/jpho.1997.0046
Ladd, R. (1996). Intonational phonology. Cambridge, UK: Cambridge University Press.
Laukka, P. (2005). Categorical perception of vocal emotion expressions. Emotion, 5, 277–295.
doi:10.1037/1528-3542.5.3.277
Levi, S.V. (2015). Talker familiarity and spoken word recognition in school-age children.
Journal of Child Language, 42, 843–872. doi: 10.1017/S0305000914000506
Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of
speech sounds within and across phoneme boundaries. Journal of Experimental
Psychology, 54, 358–368. doi:10.1037/h0044417
Liu, H. M., Tsao, F. M., & Kuhl, P. K. (2009). Age-related changes in acoustic modifications of
Mandarin maternal speech to preverbal infants and five-year-old children: A longitudinal
study. Journal of Child Language, 36, 909–922. doi:10.1017/S030500090800929X
Loeb, D. F., & Allen, G. D. (1993). Preschoolers' imitation of intonation contours. Journal of
Speech and Hearing Research, 36, 4–13. doi:10.1044/jshr.3601.04
Lynch, M. P., & Eilers, R. E. (1991). Children's perception of native and nonnative musical
scales. Music Perception: An Interdisciplinary Journal, 9, 121–131.
doi:10.2307/40286162
Ma, J. K. Y., Whitehill, T. L., & So, S. Y. S. (2010). Intonation contrast in Cantonese speakers
with hypokinetic dysarthria associated with Parkinson’s disease. Journal of Speech,
Language, and Hearing Research, 53, 836–849. doi:10.1044/1092-4388(2009/08-0216)
Maccoby, E. E., & Konrad, K. W. (1966). Age trends in selective listening. Journal of
Experimental Child Psychology, 3, 113–122. doi:10.1016/0022-0965(66)90086-5
Majewski, W., & Blasdell, R. (1969). Influence of fundamental frequency cues on the perception
of some synthetic intonation contours. The Journal of the Acoustical Society of America,
45, 450–457. doi:10.1121/1.1911394
Marx, M., James, C., Foxton, J., Capber, A., Fraysse, B., Barone, P., & Deguine, O. (2015).
Speech prosody perception in cochlear implant users with and without residual hearing.
Ear and Hearing, 36, 239–248. doi: 10.1097/aud.0000000000000105
Masataka, N. (2002). Pitch modification when interacting with elders: Japanese women with and
without experience with infants. Journal of Child Language, 29, 939–951.
doi:10.1017/S0305000902005378
McDermott, H. J. (2004). Music perception with cochlear implants: A review. Trends in
Amplification, 8, 49–82. doi:10.1177/108471380400800203
Meister, H., Landwehr, M., Pyschny, V., Walger, M., & Wedel, H. V. (2009). The perception of
prosody and speaker gender in normal-hearing listeners and cochlear implant recipients.
International Journal of Audiology, 48, 38–48. doi:10.1080/14992020802293539
Micheyl, C., Delhommeau, K., Perrot, X., & Oxenham, A. J. (2006). Influence of musical and
psychoacoustical training on pitch discrimination. Hearing Research, 219, 36–47.
doi:10.1016/j.heares.2006.05.004
Miller, J. L. (1982). Effects of speaking rate on segmental distinctions. In P.D. Eimas, & J.L.
Miller (Eds.), Perspectives on the study of speech (pp. 39–74). New York, NY:
Routledge.
Miller, J., & Schwanenflugel, P. J. (2006). Prosody of syntactically complex sentences in the oral
reading of young children. Journal of Educational Psychology, 98, 839–853.
doi:10.1037/0022-0663.98.4.839
Monahan, C. B., & Carterette, E. C. (1985). Pitch and duration as determinants of musical space.
Music Perception: An Interdisciplinary Journal, 3, 1–32. doi:10.2307/40285320
Moore, C., Harris, L., & Patriquin, M. (1993). Lexical and prosodic cues in the comprehension
of relative certainty. Journal of Child Language, 20, 153–167.
doi:10.1017/s030500090000917x
Moore, R. E., Estis, J., Gordon-Hickey, S., & Watts, C. (2008). Pitch discrimination and pitch
matching abilities with vocal and nonvocal stimuli. Journal of Voice, 22, 399–407.
doi:10.1016/j.jvoice.2006.10.013
Morton, J. B., & Munakata, Y. (2002). Are you listening? Exploring a developmental
knowledge–action dissociation in a speech interpretation task. Developmental Science, 5,
435–440. doi:10.1111/1467-7687.00238
Morton, J. B., & Trehub, S. E. (2001). Children's understanding of emotion in speech. Child
Development, 72, 834–843. doi:10.1111/1467-8624.00318
Morton, J. B., Trehub, S. E., & Zelazo, P. D. (2003). Sources of inflexibility in 6-year-olds'
understanding of emotion in speech. Child Development, 74, 1857–1868.
doi:10.1046/j.1467-8624.2003.00642.x
Most, T., & Peled, M. (2007). Perception of suprasegmental features of speech by children with
cochlear implants and children with hearing aids. Journal of Deaf Studies and Deaf
Education, 12, 350–351. doi:10.1093/deafed/enm012
Munakata, Y., Snyder, H. R., & Chatham, C. H. (2012). Developing cognitive control: Three key
transitions. Current Directions in Psychological Science, 21, 71–77.
doi:10.1177/0963721412436807
Nakata, T., Trehub, S. E., & Kanda, Y. (2012). Effect of cochlear implants on children’s
perception and production of speech prosody. The Journal of the Acoustical Society of
America, 131, 1307–1314. doi:10.1121/1.3672697
Nazzi, T., Floccia, C., & Bertoncini, J. (1998). Discrimination of pitch contours by neonates.
Infant Behavior and Development, 21, 779–784. doi:10.1016/s0163-6383(98)90044-3
Neath, I. (2000). Modeling the effects of irrelevant speech on memory. Psychonomic Bulletin &
Review, 7, 403–423. doi:10.3758/bf03214356
Neisworth, J. T., & Bagnato, S. J. (2004). The mismeasure of young children: The authentic
assessment alternative. Infants & Young Children, 17, 198–212. doi:10.1097/00001163-
200407000-00002
Olsho, L. W., Schoon, C., Sakai, R., Turpin, R., & Sperduto, V. (1982). Preliminary data on
frequency discrimination in infancy. The Journal of the Acoustical Society of America,
71, 509–511. doi:10.1121/1.387427
O'Shaughnessy, D. (1979). Linguistic features in fundamental frequency patterns. Journal of
Phonetics, 7, 119–145.
Palmer, C., & Krumhansl, C. L. (1987). Independent temporal and pitch structures in
determination of musical phrases. Journal of Experimental Psychology: Human
Perception and Performance, 13, 116–126. doi:10.1037/0096-1523.13.1.116
Papoušek, M., Bornstein, M. H., Nuzzo, C., Papoušek, H., & Symmes, D. (1990). Infant
responses to prototypical melodic contours in parental speech. Infant Behavior and
Development, 13, 539–545. doi:10.1016/0163-6383(90)90022-Z
Patel, A. D., Foxton, J. M., & Griffiths, T. D. (2005). Musically tone-deaf individuals have
difficulty discriminating intonation contours extracted from speech. Brain and Cognition,
59, 310–313. doi:10.1016/j.bandc.2004.10.003
Patel, A. D., Wong, M., Foxton, J., Lochy, A., & Peretz, I. (2008). Speech intonation perception
deficits in musical tone deafness (congenital amusia). Music Perception, 25, 357–368.
doi:10.1525/mp.2008.25.4.357
Patel, R. (2003). Acoustic characteristics of the question-statement contrast in severe dysarthria
due to cerebral palsy. Journal of Speech, Language, and Hearing Research, 46, 1401–
1415. doi:10.1044/1092-4388(2003/109)
Patel, R., & Brayton, J. T. (2009). Identifying prosodic contrasts in utterances produced by 4-, 7-,
and 11-year-old children. Journal of Speech, Language, and Hearing Research, 52, 790–
801. doi:10.1044/1092-4388(2008/07-0137)
Patel, R., & Grigos, M. I. (2006). Acoustic characterization of the question–statement contrast in
4, 7 and 11 year-old children. Speech Communication, 48, 1308–1318.
doi:10.1016/j.specom.2006.06.007
Pearson, D. A., & Lane, D. M. (1991). Auditory attention switching: A developmental study.
Journal of Experimental Child Psychology, 51, 320–334. doi:10.1016/0022-
0965(91)90039-u
Peng, S. C., Lu, N., & Chatterjee, M. (2009). Effects of cooperating and conflicting cues on
speech intonation recognition by cochlear implant users and normal hearing listeners.
Audiology and Neurotology, 14, 327–337. doi:10.1159/000212112
Peng, S. C., Chatterjee, M., & Lu, N. (2012). Acoustic cue integration in speech intonation
recognition with cochlear implants. Trends in Amplification, 16, 67–82.
doi:10.1177/1084713812451159
Peng, S. C., Tomblin, J. B., & Turner, C. W. (2008). Production and perception of speech
intonation in pediatric cochlear implant recipients and individuals with normal hearing.
Ear and Hearing, 29, 336–351. doi:10.1097/AUD.0b013e318168d94d
Peppé, S., McCann, J., Gibbon, F., O’Hare, A., & Rutherford, M. (2007). Receptive and
expressive prosodic ability in children with high-functioning autism. Journal of Speech,
Language, and Hearing Research, 50, 1015–1028. doi:10.1044/1092-4388(2007/071)
Pexman, P. M., & Glenwright, M. (2007). How do typically developing children grasp the
meaning of verbal irony? Journal of Neurolinguistics, 20, 178–196.
doi:10.1016/j.jneuroling.2006.06.001
Pierrehumbert, J. B. (1980). The phonology and phonetics of English intonation (Doctoral
dissertation, Massachusetts Institute of Technology). Retrieved from
http://dspace.mit.edu/
Ploog, B. O., Banerjee, S., & Brooks, P. J. (2009). Attention to prosody (intonation) and content
in children with autism and in typical children using spoken sentences in a computer
game. Research in Autism Spectrum Disorders, 3, 743–758.
doi:10.1016/j.rasd.2009.02.004
Plude, D. J., Enns, J. T., & Brodeur, D. (1994). The development of selective attention: A life-
span overview. Acta Psychologica, 86, 227–272. doi:10.1016/0001-6918(94)90004-3
Pogue, A., Kurumada, C., & Tanenhaus, M. K. (2015). Talker-specific generalization of
pragmatic inferences based on under- and over-informative prenominal adjective use.
Frontiers in Psychology, 6. doi:10.3389/fpsyg.2015.02035
Quam, C., & Swingley, D. (2014). Processing of lexical stress cues by young children. Journal
of Experimental Child Psychology, 123, 73–89. doi:10.1016/j.jecp.2014.01.010
Scarborough, R., Keating, P., Mattys, S. L., Cho, T., & Alwan, A. (2009). Optical phonetics and
visual perception of lexical and phrasal stress in English. Language and Speech, 52, 135–
175. doi:10.1177/0023830909103165
Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research.
Psychological Bulletin, 99, 143–165. doi:10.1037//0033-2909.99.2.143
Scherer, K. R., Koivumaki, J., & Rosenthal, R. (1972). Minimal cues in the vocal
communication of affect: Judging emotions from content-masked speech. Journal of
Psycholinguistic Research, 1, 269–285. doi:10.1007/bf01074443
Sexton, M. A., & Geffen, G. (1979). Development of three strategies of attention in dichotic
monitoring. Developmental Psychology, 15, 299–310. doi:10.1037/0012-1649.15.3.299
Smiljanić, R., & Bradlow, A. R. (2009). Speaking and hearing clearly: Talker and listener factors
in speaking style changes. Language and Linguistics Compass, 3, 236–264.
doi:10.1111/j.1749-818x.2008.00112.x
Snedeker, J., & Trueswell, J. C. (2004). The developing constraints on parsing decisions: The
role of lexical-biases and referential scenes in child and adult sentence processing.
Cognitive Psychology, 49, 238–299. doi:10.1016/j.cogpsych.2004.03.001
Snow, C. E. (1977). Mothers' speech research: From input to interaction. In C. E. Snow & C. A.
Ferguson (Eds.), Talking to children: Language input and acquisition (pp. 31–49).
Cambridge, UK: Cambridge University Press.
Snow, D. (1994). Phrase-final syllable lengthening and intonation in early child speech. Journal
of Speech, Language, and Hearing Research, 37, 831–840. doi:0022-4685/94/3704-0831
Snow, D. (1998). Children's imitations of intonation contours: Are rising tones more difficult
than falling tones? Journal of Speech, Language, and Hearing Research, 41, 576–587.
doi:1092-4388/98/4103-0576
Snow, D. (2001). Imitation of intonation contours by children with normal and disordered
language development. Clinical Linguistics & Phonetics, 15, 567–584.
doi:10.1080/02699200110078168
Soderstrom, M., Ko, E. S., & Nevzorova, U. (2011). It's a question? Infants attend differently to
yes/no questions and declaratives. Infant Behavior and Development, 34, 107–110.
doi:10.1016/j.infbeh.2010.10.003
Spiegel, M. F., & Watson, C. S. (1984). Performance on frequency‐discrimination tasks by
musicians and nonmusicians. The Journal of the Acoustical Society of America, 76,
1690–1695. doi:10.1121/1.391605
Spruyt, A., Clarysse, J., Vansteenwegen, D., Baeyens, F., & Hermans, D. (2010). Affect 4.0: A
free software package for implementing psychological and psychophysiological
experiments. Experimental Psychology, 57, 36–45. doi:10.1027/1618-3169/a000005
Srinivasan, R. J., & Massaro, D. W. (2003). Perceiving prosody from the face and voice:
Distinguishing statements from echoic questions in English. Language and Speech, 46,
1–22. doi:10.1177/00238309030460010201
Stalinski, S. M., Schellenberg, E. G., & Trehub, S. E. (2008). Developmental changes in the
perception of pitch contour: Distinguishing up from down. The Journal of the Acoustical
Society of America, 124, 1759–1763. doi:10.1121/1.2956470
Stojanovik, V., Setter, J., & van Ewijk, L. (2007). Intonation abilities of children with Williams
syndrome: A preliminary investigation. Journal of Speech, Language, and Hearing
Research, 50, 1606–1617. doi:10.1044/1092-4388(2007/108)
Studdert-Kennedy, M., & Hadding, K. (1973). Auditory and linguistic processes in the
perception of intonation contours. Language and Speech, 16, 293–313. Retrieved from
http://web.haskins.yale.edu
Sturges, P. T., & Martin, J. G. (1974). Rhythmic structure in auditory temporal pattern
perception and immediate memory. Journal of Experimental Psychology, 102, 377–383.
doi:10.1037/h0035866
Swerts, M., & Krahmer, E. (2008). Facial expression and prosodic prominence: Effects of
modality and facial area. Journal of Phonetics, 36, 219–238.
doi:10.1016/j.wocn.2007.05.001
Syrett, K., & Kawahara, S. (2013). Production and perception of listener-oriented clear speech in
child language. Journal of Child Language, 3, 1–17.
Theodore, R. M., Myers, E. B., & Lomibao, J. A. (2015). Talker-specific influences on phonetic
category structure. The Journal of the Acoustical Society of America, 138, 1068–1078.
doi:10.1121/1.4927489
Thorpe, L. A., Trehub, S. E., Morrongiello, B. A., & Bull, D. (1988). Perceptual grouping by
infants and preschool children. Developmental Psychology, 24, 484–491.
doi:10.1037//0012-1649.24.4.484
Trehub, S. E., Schellenberg, E. G., & Kamenetsky, S. B. (1999). Infants' and adults' perception
of scale structure. Journal of Experimental Psychology: Human Perception and
Performance, 25, 965–975. doi:10.1037/0096-1523.25.4.965
Tyler, L. K., & Marslen-Wilson, W. (1978). Some developmental aspects of sentence processing
and memory. Journal of Child Language, 5, 113–129. doi:10.1017/s0305000900001975
Uther, M., Knoll, M. A., & Burnham, D. (2007). Do you speak E-NG-LI-SH? A comparison of
foreigner- and infant-directed speech. Speech Communication, 49, 2–7.
doi:10.1016/j.specom.2006.10.003
van Bezooijen, R., & Boves, L. (1986). The effects of low-pass filtering and random splicing on
the perception of speech. Journal of Psycholinguistic Research, 15, 403–417.
doi:10.1007/bf01067722
van Heuven, V., & Haan, J. (2000). Phonetic correlates of statement versus question intonation
in Dutch. In A. Botinis (Ed.), Intonation: Analysis, modelling, & technology (pp. 119–
144). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Vion, M., & Colas, A. (2006). Pitch cues for the recognition of yes-no questions in French.
Journal of Psycholinguistic Research, 35, 427–445. doi:10.1007/s10936-006-9023-x
Vogel, I., & Raimy, E. (2002). The acquisition of compound vs. phrasal stress: The role of
prosodic constituents. Journal of Child Language, 29, 225–250.
doi:10.1017/s0305000902005020
Wang, W. S. (1976). Language change. Annals of the New York Academy of Sciences, 280, 61–
72. doi:10.1111/j.1749-6632.1976.tb25472.x
Warden, D. (1981). Children's understanding of ask and tell. Journal of Child Language, 8, 139–
149. doi:10.1017/s0305000900003068
Warrier, C. M., & Zatorre, R. J. (2002). Influence of tonal context and timbral variation on
perception of pitch. Perception & Psychophysics, 64, 198–207. doi:10.3758/bf03195786
Waxer, M., & Morton, J. B. (2011). Children’s judgments of emotion from conflicting cues in
speech: Why 6-year-olds are so inflexible. Child Development, 82, 1648–1660.
doi:10.1111/j.1467-8624.2011.01624.x
Wells, B., & Peppé, S. (2003). Intonation abilities of children with speech and language
impairments. Journal of Speech, Language, and Hearing Research, 46, 5–20.
doi:10.1044/1092-4388(2003/001)
Wells, B., Peppé, S., & Goulandris, N. (2004). Intonation development from five to thirteen.
Journal of Child Language, 31, 749–778. doi:10.1017/s030500090400652x
Werker, J. F., & Hensch, T. K. (2015). Critical periods in speech perception: New directions.
Annual Review of Psychology, 66, 173–196. doi:10.1146/annurev-psych-010814-015104
Whalen, D., Levitt, A. G., & Wang, Q. (1991). Intonational differences between the
reduplicative babbling of French- and English-learning infants. Journal of Child
Language, 18, 501–516. doi:10.1017/s0305000900011223
White, B. W. (1960). Recognition of distorted melodies. The American Journal of Psychology,
73, 100–107. doi:10.2307/1419120
Xu, Y., Gandour, J. T., & Francis, A. L. (2006). Effects of language experience and stimulus
complexity on the categorical perception of pitch direction. The Journal of the Acoustical
Society of America, 120, 1063–1074. doi:10.1121/1.2213572
Zhou, P., Crain, S., & Zhan, L. (2012). Sometimes children are as good as adults: The pragmatic
use of prosody in children’s on-line sentence processing. Journal of Memory and
Language, 67, 149–164. doi:10.1016/j.jml.2012.03.005
Copyright Acknowledgments
Study 1: Children's Identification of Questions from Rising Terminal Pitch is the manuscript of
an article published by Cambridge University Press in the Journal of Child Language. The
manuscript was accepted on 5 August 2015 and is available online at
http://dx.doi.org/10.1017/S0305000915000458. Citations should refer to the published text.