Page 1

Backchannel-Inviting Cues in Task-Oriented Dialogue

Agustín Gravano (1,2), Julia Hirschberg (1)

(1) Columbia University, New York, USA
(2) Universidad de Buenos Aires, Argentina

Page 2

Agustín Gravano, Interspeech 2009

Introduction

Interactive Voice Response Systems

• Quickly spreading. Mostly simple functionality.
• “Uncomfortable”, “awkward”.
• ASR+TTS account for most IVR problems.
• As ASR and TTS improve, other problems revealed.
  • Coordination of system-user exchanges.
  • Backchannels.

Page 3

Introduction

Backchannels

• Short expressions uttered by listeners to:
  • Convey that they are paying attention.
  • Encourage the speaker to continue.
• Examples: okay, uh-huh, mm-hm, alright.
• Very frequent in task-oriented dialogue.
• Thus, modeling human usage of backchannels (BC) should lead to improved system-user coordination.

Page 4

Introduction

Goal

• Learn when backchannels are likely to occur.
• Find “backchannel-inviting” cues.
  • Cues displayed by the speaker “inviting” the listener to produce a backchannel response.
• This could improve the coordination of IVRs:
  • Speech understanding: Detect points in the user’s turn where a backchannel would be welcome.
  • Speech generation: Display cues inviting the user to produce a backchannel.

Page 5

Talk Outline

• Previous work
• Material
• Method
• Results
• Conclusions

Page 6

Backchannel-Inviting Cues

Previous Work

• Duncan 1972, 1973, 1974, inter alia.
  • Hypothesized six turn-yielding cues in face-to-face dialogue.
  • Several studies continued this line of research, but always excluded backchannels.
• Ward & Tsukahara 2000.
  • Region of low pitch lasting 110ms or more.
• Cathcart et al. 2003.
  • Language model based on pause duration and part-of-speech tags to predict the location of BC.
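The Ward & Tsukahara cue listed above (a region of low pitch lasting 110ms or more) can be pictured as a scan for sufficiently long runs of low-pitch frames. The sketch below is only an illustration: the 10ms frame step, the `low_threshold_hz` parameter and the function name are assumptions, and the slide does not say how "low pitch" is defined.

```python
from typing import List, Optional, Tuple


def low_pitch_regions(
    f0_hz: List[Optional[float]],
    low_threshold_hz: float,          # assumed definition of "low pitch"
    frame_ms: float = 10.0,           # assumed frame step
    min_duration_ms: float = 110.0,   # duration named on the slide
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices (end exclusive) of runs of frames
    whose pitch is below low_threshold_hz and that last at least
    min_duration_ms. None marks an unvoiced frame and ends any current run."""
    regions: List[Tuple[int, int]] = []
    start = None
    # A trailing None acts as a sentinel that closes a run ending at the last frame.
    for i, f0 in enumerate(list(f0_hz) + [None]):
        is_low = f0 is not None and f0 < low_threshold_hz
        if is_low and start is None:
            start = i
        elif not is_low and start is not None:
            if (i - start) * frame_ms >= min_duration_ms:
                regions.append((start, i))
            start = None
    return regions
```

With a made-up track of 5 high frames, 12 low frames, 3 high frames and 5 low frames, only the 12-frame run (120ms) is long enough to be reported.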

Page 7

Material

Columbia Games Corpus

• 12 task-oriented spontaneous dialogues.
• Standard American English.
• 13 subjects: 6 female, 7 male.
• Series of collaborative computer games.
• No eye contact. No speech restrictions.
• 9 hours of dialogue.
• Manual orthographic transcription, alignment.
• Manual prosodic annotations (ToBI).

Page 8

Material

Columbia Games Corpus

[Figure: game screens. Player 1: Describer. Player 2: Follower.]

Page 9

Backchannel-Inviting Cues

• Cues displayed by the speaker “inviting” the listener to produce a backchannel response.

Page 10

Backchannel-Inviting Cues

Method

• 3 trained annotators identified Backchannels using a labeling scheme described in [Gravano et al. 2007].
• To find BC-inviting cues, we compare:
  • IPUs preceding Holds,
  • IPUs preceding Backchannels.
• IPU (Inter-Pausal Unit): Maximal sequence of words from the same speaker surrounded by silence ≥ 50ms.

[Figure: timeline of Speaker A and Speaker B showing IPU1 to IPU4, with a Hold and a Backchannel marked.]
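The IPU definition above (a maximal sequence of words from one speaker surrounded by silence of at least 50ms) can be sketched as a grouping of time-aligned words. This is a minimal illustration, not the corpus tooling: the `Word` tuple layout and the function name are assumptions.

```python
from typing import List, Tuple

# Hypothetical word record: (token, start_sec, end_sec), all from one speaker.
Word = Tuple[str, float, float]


def split_into_ipus(words: List[Word], min_silence_sec: float = 0.050) -> List[List[Word]]:
    """Group one speaker's time-aligned words into IPUs. A new IPU starts
    whenever the silence since the previous word is at least min_silence_sec."""
    ipus: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[1]):
        previous_end = ipus[-1][-1][2] if ipus else None
        if previous_end is not None and word[1] - previous_end < min_silence_sec:
            ipus[-1].append(word)   # gap too short: same IPU
        else:
            ipus.append([word])     # first word, or silence long enough: new IPU
    return ipus
```

For example, words separated by gaps of 20ms and 0ms stay in one IPU, while a word that starts 400ms after the previous one opens a new IPU.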

Page 11

Backchannel-Inviting Cues

Individual Cues

1. Final rising intonation: 81% of IPUs before BC end in H-H% or L-H%.
2. Higher pitch level.
3. Higher intensity level.
4. Lower NHR (noise-to-harmonics ratio; voice quality).
5. Longer IPU duration (seconds, #words).
6. Final POS bigram: 72% of IPUs before BC end in DT NN, JJ NN, or NN NN.

Measurement windows: entire IPU, final 1.0 sec, final 0.5 sec.
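The three windows named on this slide (entire IPU, final 1.0 sec, final 0.5 sec) suggest computing each acoustic measure several times per IPU. The following is a rough sketch under assumed conventions: a frame-level track sampled every 10ms, a plain arithmetic mean, and hypothetical names throughout.

```python
from statistics import mean
from typing import Dict, List


def windowed_means(values: List[float], frame_sec: float = 0.01) -> Dict[str, float]:
    """Mean of a frame-level track (e.g. pitch or intensity) of one IPU over
    the entire IPU, its final 1.0 sec and its final 0.5 sec."""
    if not values:
        raise ValueError("empty track")

    def final(seconds: float) -> List[float]:
        n_frames = max(1, round(seconds / frame_sec))
        return values[-n_frames:]  # IPUs shorter than the window use the whole track

    return {
        "entire_ipu": mean(values),
        "final_1.0_sec": mean(final(1.0)),
        "final_0.5_sec": mean(final(0.5)),
    }
```

A track that rises toward the end of the IPU yields a higher mean over the final 0.5 sec than over the entire IPU, which is the kind of contrast the windows are meant to expose.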

Page 12

Backchannel-Inviting Cues

Defining Presence of a Cue

• 2 representative features for each cue:

Cue                Features
Final intonation   Pitch slope over final 200ms, 300ms.
Intensity level    Mean intensity over final 500ms, 1000ms.
Pitch level        Mean pitch over final 500ms, 1000ms.
Voice quality      NHR over final 500ms, 1000ms.
IPU duration       Duration in ms, and in number of words.
Final POS bigram   {‘DT NN’, ‘JJ NN’, ‘NN NN’} vs. Rest (binary).

• Define presence/absence of a cue based on whether the feature value is closer to its mean before backchannels (BC) or before holds (H).
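The presence rule on this slide (is the value closer to the mean before BC or to the mean before H?) amounts to a nearest-mean comparison per feature. The sketch below assumes that reading; the function names, the feature names and all numbers are made up for illustration, and the slide does not say how the two features of a cue are combined.

```python
from typing import Dict


def cue_present(value: float, mean_before_bc: float, mean_before_hold: float) -> bool:
    """A feature signals the cue when its value lies closer to the mean
    observed before backchannels (BC) than to the mean before holds (H)."""
    return abs(value - mean_before_bc) < abs(value - mean_before_hold)


def features_present(
    features: Dict[str, float],
    means_bc: Dict[str, float],
    means_hold: Dict[str, float],
) -> Dict[str, bool]:
    """Apply cue_present to every named feature of one IPU."""
    return {
        name: cue_present(value, means_bc[name], means_hold[name])
        for name, value in features.items()
    }


# Made-up numbers, for illustration only.
example = features_present(
    {"mean_pitch_500ms": 0.8, "duration_words": 4.0},
    means_bc={"mean_pitch_500ms": 1.0, "duration_words": 9.0},
    means_hold={"mean_pitch_500ms": 0.0, "duration_words": 5.0},
)
```

Here `example` marks the pitch feature as present (0.8 is closer to 1.0 than to 0.0) and the duration feature as absent (4.0 is closer to 5.0 than to 9.0).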

Page 13

Top Frequencies of Complex Cues

BC-inviting cues:
1: Final intonation
2: Intensity level
3: Pitch level
4: IPU duration
5: Voice quality
6: Final POS bigram

Legend: digit = cue present, dot = cue absent.
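Going by the legend above (a digit where a cue is present, a dot where it is absent), a complex cue can be written as a six-character pattern and the patterns counted. This is a guess at the notation rather than the authors' code; the cue identifiers and function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Cue numbering as listed on the slide; the identifier spellings are assumed.
CUE_ORDER = [
    "final_intonation",   # 1
    "intensity_level",    # 2
    "pitch_level",        # 3
    "ipu_duration",       # 4
    "voice_quality",      # 5
    "final_pos_bigram",   # 6
]


def encode_cues(present: Set[str]) -> str:
    """Render a set of present cues as a pattern: digit = present, dot = absent."""
    return "".join(
        str(number) if name in present else "."
        for number, name in enumerate(CUE_ORDER, start=1)
    )


def top_patterns(ipus: Iterable[Set[str]], k: int = 5) -> List[Tuple[str, int]]:
    """Return the k most frequent cue patterns with their counts."""
    return Counter(encode_cues(present) for present in ipus).most_common(k)
```

Under this encoding, an IPU showing final intonation, pitch level and final POS bigram would be written `1.3..6`.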

Page 14

Backchannel-Inviting Cues

Combined Cues

[Figure: percentage of IPUs followed by a BC (y-axis) against number of cues conjointly displayed (x-axis); r² = 0.993.]
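The plot on this slide relates the number of cues displayed to the percentage of IPUs followed by a BC, with a reported r² of 0.993. A sketch of how such a curve and a goodness-of-fit value could be computed is given below. The slide does not state which model was fitted, so the straight-line fit here is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def pct_followed_by_bc(ipus: List[Tuple[int, bool]]) -> Dict[int, float]:
    """ipus holds (number_of_cues_displayed, followed_by_backchannel) pairs.
    Returns, per cue count, the percentage of IPUs followed by a backchannel."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for n_cues, followed in ipus:
        totals[n_cues] += 1
        hits[n_cues] += int(followed)
    return {n: 100.0 * hits[n] / totals[n] for n in sorted(totals)}


def r_squared_linear(xs: Sequence[float], ys: Sequence[float]) -> float:
    """r^2 of an ordinary least-squares line fitted to (xs, ys).
    Assumes xs and ys each contain at least two distinct values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)
```

Feeding the dictionary returned by `pct_followed_by_bc` into `r_squared_linear` (keys as `xs`, values as `ys`) gives one number summarizing how well cue count predicts the backchannel rate under the assumed linear model.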

Page 15

Backchannel-Inviting Cues

IVR Systems

• After each IPU from the user:
    if estimated likelihood > threshold
    then produce a backchannel
• To elicit a backchannel from the user, if desired:
    Include as many cues as possible in the system’s final IPU.
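The decision rule above can be written out as a small function. This is a sketch only: the slide does not specify how the likelihood is estimated, so the lookup table from cue counts to percentages, its made-up values, and the function names are all assumptions.

```python
from typing import Dict


def likelihood_from_cue_count(n_cues: int, pct_by_count: Dict[int, float]) -> float:
    """Assumed estimator: look up the percentage of IPUs followed by a BC
    for this cue count and rescale it to the range 0..1."""
    return pct_by_count[n_cues] / 100.0


def should_backchannel(estimated_likelihood: float, threshold: float) -> bool:
    """Rule from the slide: produce a backchannel if the estimated
    likelihood exceeds the threshold."""
    return estimated_likelihood > threshold


# Made-up table, for illustration only.
HYPOTHETICAL_PCT = {0: 0.0, 3: 50.0, 6: 100.0}

decision = should_backchannel(
    likelihood_from_cue_count(3, HYPOTHETICAL_PCT), threshold=0.4
)
```

With the made-up table, a user IPU displaying 3 cues has an estimated likelihood of 0.5, which exceeds the 0.4 threshold, so `decision` is `True`. The threshold itself would have to be tuned for a given system.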

Page 16

Summary

• Study of backchannel-inviting cues.
  • Objective, automatically computable.
  • Combined cues.
  • Improve turn-taking decisions of IVR systems.
• Results drawn from task-oriented dialogues.
  • Not necessarily generalizable.
  • Suitable for most IVR domains.
• SIGdial 2009: Study of turn-yielding cues.

Page 17

Special thanks to…

• My advisor, Julia Hirschberg
• Thesis Committee Members
  • Maxine Eskenazi, Kathy McKeown, Becky Passonneau, Amanda Stent.
• Speech Lab at Columbia University
  • Stefan Benus, Fadi Biadsy, Sasha Caskey, Bob Coyne, Frank Enos, Martin Jansche, Jackson Liscombe, Sameer Maskey, Andrew Rosenberg.
• Collaborators
  • Gregory Ward and Elisa Sneed German (Northwestern U); Ani Nenkova (UPenn); Héctor Chávez, David Elson, Michel Galley, Enrique Henestroza, Hanae Koiso, Shira Mitchell, Michael Mulley, Kristen Parton, Ilia Vovsha, Lauren Wilcox.