Submitted in partial fulfillment of the requirements for the degree
of Doctor of Philosophy in the Graduate School of Arts and Sciences
Agustín Gravano
Columbia University
Turn-Taking and Affirmative Cue Words in Task-Oriented Dialogue
Agustín Gravano - Thesis Defense - Jan 28, 2009
Special thanks to:
Julia Hirschberg
Committee Members: Maxine Eskenazi, Kathy McKeown, Becky Passonneau, Amanda Stent.
The Speech Lab: Stefan Benus, Fadi Biadsy, Sasha Caskey, Bob Coyne, Frank Enos, Martin Jansche, Jackson Liscombe, Sameer Maskey, Andrew Rosenberg.
Collaborators: Gregory Ward and Elisa Sneed German (Northwestern U); Ani Nenkova (UPenn); Héctor Chávez, David Elson, Michel Galley, Enrique Henestroza, Hanae Koiso, Shira Mitchell, Michael Mulley, Kristen Parton, Ilia Vovsha, and Lauren Wilcox.
Family and friends
Introduction
Interactive Voice Response Systems
Quickly spreading. Mostly simple functionality.
Examples of IVR systems:
  Let’s Go!: Bus scheduling information (CMU).
  GOOG-411: Local information (Google).
Most visible components of IVR systems:
  Automatic Speech Recognition (ASR)
  Text-To-Speech (TTS)
ASR+TTS account for most IVR problems.
  ASR: Up to 60% word error rate.
  TTS: Described as ‘odd’ or ‘mechanical’.
As ASR and TTS improve, other problems begin to show:
  Coordination of system-user exchanges.
  Frequent words overloaded with multiple functions.
Coordination of Exchanges
Let’s Go! Demo (http://www.speech.cs.cmu.edu/letsgo/)
S: Thank you for calling the CMU Let's Go! Bus Information System.
[…] What can I do for you?
U: I would like to go to the airport tomorrow morning. [silence]
S: To the airport. When do you want to arrive?
U: I'd like to arrive at 10:30. [silence]
S: Arriving at around 10:30 AM. Where do you want to leave from?
U: I'd like to leave from Carnegie Mellon. [silence]
S: From Carnegie Mellon. There is a 28X leaving Forbes Avenue […]
Turn boundary detection is currently based on silence detection. Problems: latencies and false positives.
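To make the two problems concrete, here is a minimal sketch of a silence-based endpointer. The frame representation, the function name `detect_turn_end` and the 0.5 s threshold are assumptions chosen for illustration; they are not details of the Let's Go! system.

```python
from typing import List, Optional


def detect_turn_end(is_speech: List[bool], frame_s: float = 0.01,
                    silence_threshold_s: float = 0.5) -> Optional[float]:
    """Return the time (in seconds) at which a silence-based endpointer
    would declare the user's turn finished, or None if it never does.

    is_speech holds one voice-activity decision per frame (hypothetical input).
    """
    needed = round(silence_threshold_s / frame_s)  # silent frames required
    heard_speech = False
    silent_run = 0
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                # The decision is taken at the end of the frame that completes the run.
                return (i + 1) * frame_s
    return None


# 1.0 s of speech, a 0.6 s hesitation pause, then 1.0 s more speech.
frames = [True] * 100 + [False] * 60 + [True] * 100
decision = detect_turn_end(frames)
```

In this toy input the endpointer fires 0.5 s after the speaker paused (the latency), and it fires in the middle of the turn because the hesitation pause is longer than the threshold (a false positive).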
Overloaded Cue Words
Cue words: expressions such as by the way, however, after all.
  Frequent in dialogue, used for structuring discourse and shaping conversation.
Affirmative cue words: okay, alright, etc.
  Convey acknowledgment, start a new topic, display continued attention, inter alia.
  Frequent in task-oriented dialogue.
  IVR systems: understanding and generation.
Motivation
Understand and incorporate these and other phenomena into IVR systems, aiming at gradually approaching human-like behavior.
Descriptions of associations between observed phenomena (e.g. turn exchange types) and measurable events (e.g. variations in acoustic features).
No strong claims about the degree of awareness of speakers and listeners.
(1) Columbia Games Corpus
(2) Study of Turn-Taking
(3) Study of Affirmative Cue Words
Columbia Games Corpus
Task-oriented spontaneous dialogues.
Two subjects, each with a laptop computer.
Series of collaborative computer games.
Soundproof booth; head-mounted mics.
No eye contact; only verbal communication.
No restrictions; subjects could speak freely.
Cards Game, Part 1
Player 1: Describer; Player 2: Searcher
Cards Game, Part 2
Player 1: Describer; Player 2: Searcher
Objects Game
Player 1: Describer; Player 2: Follower
Columbia Games Corpus
12 sessions, 13 subjects (6 female, 7 male).
9 hours of dialogue.
Orthographic transcription and alignment.
  70K words, 2K unique words.
  Non-word vocalizations (laughs, coughs, etc.).
Prosodic transcription (ToBI conventions).
Automatically generated session logs.
(2) Study of Turn-Taking
Goals
Speech understanding:
  Detection of the end of the user’s turn.
  Detection of points in the user’s turn where a backchannel response would be welcome.
Speech generation:
  Display of cues signalling the end of system’s turn.
  Display of cues inviting the user to produce a backchannel response.
Previous Work
Sacks, Schegloff & Jefferson 1974.
  General characterization of turn-taking in conversation between two or more persons.
  Transition-relevance place: The current speaker may either yield the turn, or continue speaking.
Duncan 1972, 1973, 1974, inter alia.
  Six turn-yielding cues in face-to-face dialogue.
  Linear relation between the number of displayed cues and the likelihood of a turn-taking attempt.
Previous Work
Corpus and perception studies.
  Formalized and verified some of the turn-yielding cues hypothesized by Duncan.
  Ford & Thompson 1996; Wennerstrom & Siegel 2003; Cutler & Pearson 1986; Wichmann & Caspers 2001.
Implementations of turn-boundary detection.
  Simulations (Ferrer et al. 2002, 2003; Edlund et al. 2005; Schlangen 2006; Atterer et al. 2008; Baumann 2008).
  Actual systems (Raux & Eskenazi 2008, on Let’s Go!).
  Exploiting turn-yielding cues improves performance.
Turn-Yielding Cues
Cues displayed by the speaker when approaching a potential turn boundary.
Method
Smooth switch: Speaker A finishes her utterance; speaker B takes the turn with no overlapping speech.
Trained annotators distinguished Smooth switches from Interruptions and Backchannels using a scheme based on Ferguson 1977, Beattie 1982.
IPU (Inter Pausal Unit): Maximal sequence of words from the same speaker surrounded by silence ≥ 50ms.
[Diagram: IPU1, IPU2 and IPU3 across Speaker A and Speaker B, with a Hold between IPU1 and IPU2 and a Smooth switch before IPU3.]
Compare IPUs preceding Holds and IPUs preceding Smooth switches.
Assumption: Cues are more likely to occur before Smooth switches than before Holds.
[Diagram: IPU1 and IPU2 separated by a Hold; IPU3 follows a Smooth switch.]
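The comparison described on this slide can be sketched as follows: label each IPU by what follows it, then compare a feature across the two groups. This is a simplified heuristic under stated assumptions. In the study the labels came from trained annotators, who also separated Interruptions and Backchannels; the `IPU` record, the function names and the numbers below are hypothetical.

```python
from statistics import mean
from typing import Dict, List, NamedTuple, Tuple


class IPU(NamedTuple):
    speaker: str
    start: float
    end: float
    feature: float  # any per-IPU measurement, e.g. a normalized intensity


def label_transitions(ipus: List[IPU]) -> List[Tuple[IPU, str]]:
    """Pair every IPU except the last with a label for what follows it.

    'H' (Hold): the same speaker continues.
    'S' (Smooth switch): the other speaker starts once this IPU has ended.
    'other': anything else, such as overlapping speech.
    """
    ordered = sorted(ipus, key=lambda u: u.start)
    labelled = []
    for cur, nxt in zip(ordered, ordered[1:]):
        if nxt.speaker == cur.speaker:
            label = "H"
        elif nxt.start >= cur.end:
            label = "S"
        else:
            label = "other"
        labelled.append((cur, label))
    return labelled


def mean_feature_by_label(labelled: List[Tuple[IPU, str]]) -> Dict[str, float]:
    """Average the feature of the IPUs that precede each transition type."""
    groups: Dict[str, List[float]] = {}
    for ipu, label in labelled:
        groups.setdefault(label, []).append(ipu.feature)
    return {label: mean(values) for label, values in groups.items()}


# Invented dialogue, for illustration only.
dialogue = [
    IPU("A", 0.0, 1.0, 0.4),
    IPU("A", 1.3, 2.0, -0.6),
    IPU("B", 2.2, 3.0, 0.2),
    IPU("B", 3.4, 4.0, -0.2),
    IPU("A", 4.1, 5.0, 0.0),
]
labelled = label_transitions(dialogue)
means = mean_feature_by_label(labelled)
```

A lower mean for the 'S' group than for the 'H' group would be the kind of difference the study then tests for significance.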
Individual Turn-Yielding Cues
1. Final intonation
2. Speaking rate
3. Intensity level
4. Pitch level
5. Textual completion
6. Voice quality
7. IPU duration
1. Final Intonation
                    Smooth switch   Hold
H-H%                22.1%           9.1%
[!]H-L%             13.2%           29.9%
L-H%                14.1%           11.5%
L-L%                47.2%           24.7%
No boundary tone    0.7%            22.4%
Other               2.6%            2.4%
Total               100%            100%
(χ² test: p ≈ 0)
Falling, high-rising: turn-final. Plateau: turn-medial.
Examination of final pitch slope shows same results.
2. Speaking Rate
Reduced final lengthening before turn boundaries.
[Figure: z-scores of speaking rate (syllables per second, phonemes per second), over the entire IPU and over the final word, for Smooth switch (S) vs. Hold (H). Asterisks mark significant differences (ANOVA: p < 0.01).]
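The vertical axis of this plot is in z-scores. A minimal sketch of one common way to obtain them, assuming each raw measurement is normalized against the mean and standard deviation of the same speaker; the per-speaker grouping, the use of the population standard deviation and the function name `zscore_by_speaker` are assumptions, since the slide does not spell out the procedure.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import List, Tuple


def zscore_by_speaker(samples: List[Tuple[str, float]]) -> List[float]:
    """Convert (speaker, raw_value) pairs to z-scores, in the same order.

    Each value is normalized against its own speaker's mean and
    (population) standard deviation, so speakers become comparable.
    """
    by_speaker = defaultdict(list)
    for speaker, value in samples:
        by_speaker[speaker].append(value)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}
    result = []
    for speaker, value in samples:
        mu, sigma = stats[speaker]
        # A speaker with no variation gets 0.0 rather than a division by zero.
        result.append((value - mu) / sigma if sigma > 0 else 0.0)
    return result


# Invented speaking rates in syllables per second.
rates = [("A", 4.0), ("A", 6.0), ("B", 2.0), ("B", 3.0), ("B", 4.0)]
z = zscore_by_speaker(rates)
```

After this step, a value of 0 means "typical for this speaker", which is what lets measurements from fast and slow talkers be pooled in one comparison.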
3/4. Intensity and Pitch Levels
Lower intensity, pitch levels before turn boundaries.
[Figure: z-scores of intensity and pitch over the whole IPU, the final 1.0 s and the final 0.5 s, for Smooth switch (S) vs. Hold (H). Asterisks mark significant differences (ANOVA: p < 0.01).]
5. Textual Completion
Syntactic/semantic/pragmatic completion independent of intonation and gesticulation.
Automatic computation of textual completion.
(1) Manually annotated a portion of the data.
    3 labelers; 400 IPUs; Fleiss’ κ = 0.814.
(2) Trained an SVM classifier.
    80% accuracy; baseline: 55%; human: 91%.
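Inter-labeler agreement of the kind reported in step (1) can be computed from a table of per-item category counts. Below is a minimal sketch of the standard Fleiss' kappa formula; the toy counts are made up and are not the thesis data.

```python
from typing import List


def fleiss_kappa(ratings: List[List[int]]) -> float:
    """Fleiss' kappa for a fixed number of raters per item.

    ratings[i][j] is the number of raters who put item i in category j.
    Every row must sum to the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed pairwise agreement per item, averaged over items.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)


# Toy data: 4 IPUs, 3 labelers, categories = [complete, incomplete].
toy = [[3, 0], [0, 3], [2, 1], [3, 0]]
kappa = fleiss_kappa(toy)
```

The labelers disagree on only one of the four toy items, yet kappa is well below 1 because chance agreement is high with two categories.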
5. Textual Completion
Labeled all IPUs in the corpus with the SVM model.
              Smooth switch   Hold
Incomplete    18%             47%
Complete      82%             53%
(χ² test, p ≈ 0)
Textual completion seems to be almost a necessary condition before switches, but not before holds.
6. Voice Quality
Higher jitter, shimmer, NHR before turn boundaries.
[Figure: z-scores of jitter, shimmer and NHR over the whole IPU, the final 1.0 s and the final 0.5 s, for Smooth switch (S) vs. Hold (H). Asterisks mark significant differences (ANOVA: p < 0.01).]
7. IPU Duration
Longer IPUs before turn boundaries.
[Figure: z-scores of IPU duration and IPU word count, for Smooth switch vs. Hold. Asterisks mark significant differences (ANOVA: p < 0.01).]
Combined Cues
[Figure: percentage of turn-taking attempts (y-axis, 0% to 70%) against the number of cues conjointly displayed (x-axis, 0 to 7); r² = 0.969.]
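A figure like this one can be produced by grouping IPUs by how many cues they display, computing the share followed by a turn-taking attempt in each group, and fitting a line. The sketch below uses ordinary least squares as an assumed fitting method; the function names and the observations are invented and are not the corpus data.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def attempt_rate_by_cue_count(observations: List[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each cue count to the fraction of IPUs followed by a turn-taking attempt.

    observations holds one (n_cues_displayed, attempt_followed) pair per IPU.
    """
    totals, attempts = Counter(), Counter()
    for n_cues, attempted in observations:
        totals[n_cues] += 1
        attempts[n_cues] += int(attempted)
    return {n: attempts[n] / totals[n] for n in sorted(totals)}


def linear_fit_r2(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares line y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot


# Made-up observations: 10 IPUs for each of 0, 1 and 2 displayed cues.
obs = ([(0, False)] * 9 + [(0, True)] * 1
       + [(1, False)] * 7 + [(1, True)] * 3
       + [(2, False)] * 5 + [(2, True)] * 5)
rates = attempt_rate_by_cue_count(obs)
a, b, r2 = linear_fit_r2(list(rates.keys()), list(rates.values()))
```

An r² close to 1 means the attempt rate grows almost linearly with the number of cues, which is the relation Duncan proposed.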
Backchannel-Inviting Cues
Cues displayed by the speaker inviting the listener to produce a backchannel response.
Backchannel-Inviting Cues
Method
Compare IPUs preceding Holds and IPUs preceding Backchannels.
Assumption: Cues are more likely to occur before Backchannels than before Holds.
[Diagram: IPU1 to IPU4 across Speaker A and Speaker B, with one transition labeled Hold and one labeled Backchannel.]
Backchannel-Inviting Cues
Individual Cues
1. Final rising intonation: H-H% or L-H%.
2. Higher intensity level.
3. Higher pitch level.
4. Longer IPU duration.
5. Lower NHR.
6. Final POS bigram: DT NN, JJ NN, or NN NN.
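The two categorical cues in this list (1 and 6) are straightforward to check once an IPU has a final boundary-tone label and part-of-speech tags. A minimal sketch follows; the function names and input layout are hypothetical, and the four acoustic cues are left out because the slide gives no thresholds for them.

```python
from typing import List, Sequence

RISING_TONES = {"H-H%", "L-H%"}                       # cue 1
BC_INVITING_BIGRAMS = {("DT", "NN"), ("JJ", "NN"), ("NN", "NN")}  # cue 6


def has_rising_intonation(final_tone: str) -> bool:
    """Cue 1: the IPU ends in a rising boundary tone."""
    return final_tone in RISING_TONES


def has_bc_inviting_pos_bigram(pos_tags: Sequence[str]) -> bool:
    """Cue 6: the last two POS tags of the IPU form one of the listed bigrams."""
    return len(pos_tags) >= 2 and tuple(pos_tags[-2:]) in BC_INVITING_BIGRAMS


def categorical_bc_cues(final_tone: str, pos_tags: Sequence[str]) -> List[str]:
    """Names of the categorical backchannel-inviting cues this IPU displays."""
    cues = []
    if has_rising_intonation(final_tone):
        cues.append("final rising intonation")
    if has_bc_inviting_pos_bigram(pos_tags):
        cues.append("final POS bigram")
    return cues


# Invented example: "on the yellow lion" tagged IN DT JJ NN, ending in L-H%.
example = categorical_bc_cues("L-H%", ["IN", "DT", "JJ", "NN"])
```

Counting how many such cues an IPU displays gives the per-IPU cue count needed for a combined-cue analysis.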
Backchannel-Inviting Cues
Combined Cues
[Figure: percentage of IPUs followed by a BC (y-axis, -5% to 35%) against the number of cues conjointly displayed (x-axis, 0 to 6); two fits reported, r² = 0.812 and r² = 0.993.]
Overlapping Speech
95% of overlaps start during the turn-final intermediate phrase (ip).
We look for turn-yielding cues in the second-to-last intermediate phrase (e.g., ip2).
[Diagram: intermediate phrases ip1, ip2 and ip3 across Speaker A and Speaker B, with transitions labeled Hold and Overlap.]
Overlapping Speech
Cues found in second-to-last ips:
  Higher speaking rate.
  Lower intensity.
  Higher jitter, shimmer, NHR.
All cues match the corresponding cues found in (non-overlapping) smooth switches.
Cues seem to extend further back in the turn, becoming more prominent toward turn endings.
Future research: Generalize the model of discrete turn-yielding cues.
(3) Study of Affirmative Cue Words
Affirmative Cue Words
8% of the words in the Columbia Games Corpus: okay, right, yeah, mm-hm, alright, uh-huh, gotcha, huh, yep, yes, yup.
10 discourse/pragmatic functions:
  Acknowledgment/agreement, Literal modifier, Backchannel, Cue beginning/ending discourse segment, Check with the interlocutor, Stall/Filler, Back from a task, Pivot beginning/ending (Ack+Cue).
Labeled by 3 trained annotators. Fleiss’ κ = 0.69: ‘Substantial’ agreement.
Examples
Literal modifier:
  that’s pretty much okay
Backchannel:
  Speaker 1: between the yellow mermaid and the whale
  Speaker 2: okay
  Speaker 1: and it is
Cue beginning discourse segment:
  okay we’re gonna be placing the blue moon
Interactive Voice Response Systems
Speech understanding: Must interpret the user’s input correctly.
Speech generation: Need to convey potentially ambiguous terms with the appropriate parameters for the intended meaning.
Previous Work
Disambiguation of single-word cue phrases.
  well, now, say, so, like, really, …
  Discourse vs. sentential senses.
  Hirschberg & Litman 1987, 1993; Litman 1994, 1996; Zufferey & Popescu-Belis 2004, Lai 2008.
Affirmative cue words.
  Hockey 1991, 1992; Kowtko 1997: Intonational differences across discourse/pragmatic functions.
  Jurafsky et al. 1998: Lexical identity is a strong cue to word function.
Descriptive statistics
Large contextual differences
  Backchannels always occur as separate turns.
  Cue beginnings occur mostly in turn-initial position.
  Modifier instances of right occur in all positions within the turn, but rarely as separate turns.
  Acknowledgments occur in turn-initial, medial and final positions, and also as separate turns.
Descriptive statistics
Final intonation
  Backchannel: Rising (H-H%, L-H%)
  Cue beginning: Falling (L-L%)
  Check: High-rising (H-H%)
Intensity
  Backchannel: High
  Cue beginning: High
  Cue ending: Low
Perception study of okay
Okay is the most frequent affirmative cue word (ACW) in the corpus.
How do hearers disambiguate its meaning?
  Acoustic/prosodic/phonetic vs. contextual info?
20 subjects classified 54 tokens of okay into {Ack, BC, CueBeg} in two conditions:
  No context available: only the word okay.
  Context available: 2 full speaker turns.
[Diagram: a contextualized ‘okay’ shown within the turns of Speaker A and Speaker B.]
Perception study of okay
No context available
  Very low inter-subject agreement.
  Correlations of word function with acoustic/prosodic/phonetic features.
Context available
  Higher inter-subject agreement.
  Contextual features trump ac/pr/ph features of okay.
  Exception: Final intonation of okay.
Automatic Classification
Identify automatically the function of ACWs.
Classification into discourse vs. sentential function insufficient for ACWs.
  right: 15% discourse, 85% sentential.
  All other ACWs: 99% discourse, 1% sentential.
New classification tasks:
  Detection of an acknowledgment function.
    Acknowledgment vs. No acknowledgment.
  Detection of a discourse segment boundary function.
    SegBeg vs. SegEnd vs. None.
Automatic Classification
Lexical features: Lexical id, POS tags, n-grams.
Discourse features: Position of target word in IPU, turn, conversation.
Timing features: Duration of word, IPU, turn; amount of overlaps; latencies.
Acoustic features: Pitch, intensity, pitch slope, voice quality.
Phonetic features: Id, duration of each phone.
Automatic Classification
                                   Discourse Boundary   Acknowledgment
                                   Error Rate           Error Rate
Baseline (1)                       18.6%                15.3%
SVM: Word-only                     14.4%                15.0%
SVM: Online (up to current IPU)    10.1%                6.7%
SVM: Full model                    6.9%                 4.5%
Human labelers                     5.7%                 3.3%
(1) Discourse Boundary: majority class == no boundary.
    Acknowledgment: {right, huh} no ACK; all others ACK.
(*) Significantly different (Wilcoxon signed rank sum test; p < 0.05). On the original slide, asterisks on brackets mark the pairs of rows that differ significantly.
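The Acknowledgment baseline from footnote (1) is a one-line lexical rule, and the error rate is the share of tokens it gets wrong. A minimal sketch, with hypothetical function names and invented gold labels:

```python
from typing import List

NO_ACK_WORDS = {"right", "huh"}


def baseline_acknowledgment(word: str) -> str:
    """Footnote (1): 'right' and 'huh' are labeled no ACK, every other ACW is ACK."""
    return "no ACK" if word.lower() in NO_ACK_WORDS else "ACK"


def error_rate(predictions: List[str], gold: List[str]) -> float:
    """Fraction of tokens whose predicted label differs from the gold label."""
    wrong = sum(p != g for p, g in zip(predictions, gold))
    return wrong / len(gold)


# Toy tokens with invented gold labels (the last 'right' is an acknowledgment).
tokens = ["okay", "right", "yeah", "huh", "right"]
gold = ["ACK", "no ACK", "ACK", "no ACK", "ACK"]
pred = [baseline_acknowledgment(t) for t in tokens]
err = error_rate(pred, gold)
```

The rule fails exactly where a word is used against its usual function, which is the gap the contextual and acoustic features of the SVM models are meant to close.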
Affirmative Cue Words
Speaker Entrainment
In conversation, people adapt the way they speak to match their partner.
  Referring expressions (Brennan 1996).
  Syntactic constructions (Reitter et al. 2006).
  Intensity (Coulston et al. 2002, Ward & Litman 2007).
Entrainment at different levels (lex, syn, sem):
  Key for both production and understanding, and facilitates interaction (Pickering & Garrod 2004, Goleman 2006).
  Predictor of task success (MapTask; Reitter & Moore 2007).
Affirmative Cue Words
Speaker Entrainment
Two novel measures of entrainment based on usage of high-frequency words (HFW), including ACW.
Entrainment of HFW correlates with:
  Task success:
    (+) Game score
  Dialogue coordination:
    (+) Proportion of overlaps
    (–) Proportion of interruptions
    (–) Latency of smooth switches
Future work: Establish causality relation. Impact on IVR system design and/or evaluation.
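The slide does not define the two entrainment measures. Purely as an illustration of what a measure based on high-frequency word usage could look like, the sketch below scores two speakers by the negated sum of absolute differences in the relative frequency of each target word. The formula, the function name `usage_similarity` and the word lists are assumptions, not the measures from the thesis.

```python
from collections import Counter
from typing import Iterable, Sequence


def usage_similarity(words_a: Sequence[str], words_b: Sequence[str],
                     target_words: Iterable[str]) -> float:
    """Negated sum of absolute differences in relative frequency.

    0.0 means both speakers use the target words at identical rates;
    more negative values mean their usage is less similar.
    """
    count_a, count_b = Counter(words_a), Counter(words_b)
    total_a, total_b = len(words_a), len(words_b)
    return -sum(
        abs(count_a[w] / total_a - count_b[w] / total_b)
        for w in target_words
    )


# Invented word sequences for two speakers in one session.
a = "okay so the lion okay yeah".split()
b = "okay the whale yeah yeah okay".split()
score = usage_similarity(a, b, ["okay", "yeah", "the"])
```

A per-session score of this kind is what one would then correlate with game score or with the proportion of overlaps and interruptions.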
Contributions
Columbia Games Corpus
  Valuable dataset for studying spontaneous task-oriented dialogue.
Study of Turn-Taking
  Turn-yielding cues. Backchannel-inviting cues.
  Objective, automatically computable.
  Combined cues.
  Improve turn-taking decisions of IVR systems.
Contributions
Study of Affirmative Cue Words
  Descriptive statistics and perceptual results.
  Automatic classification.
  Speaker entrainment.
  Understanding and generation in IVR systems.
Results drawn from task-oriented dialogues, thus not necessarily generalizable, but suitable for most IVR domains.
Necessary steps towards the ambitious, long-term goal of human-like speech systems.
Future Work
Additional turn-taking cues. Voice quality?
Novel ways to combine cues. Weights?
Study cues that extend over entire turns, increasing near potential turn boundaries.
Characterize interruptions.
Speaker entrainment
  Affirmative cue words. Turn-taking behavior. Acoustic/prosodic variation.