Transcript of "Understanding Spoken Corrections in Human-Computer Dialogue," Gina-Anne Levow, University of Chicago, MAICS, April 1, 2006

Page 1: Understanding Spoken Corrections in Human-Computer Dialogue

Gina-Anne Levow, University of Chicago
http://www.cs.uchicago.edu/~levow
MAICS, April 1, 2006

Page 2: Error Correction Spiral

U: Give me the price for AT&T.
S: Hewlett-Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett-Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?

Page 3: Identifying Corrections

Most "Reasoning" Approaches
  Correction => Violates Task or Belief Constraint
  Requires Tight Task and Belief Model
  Often Requires Accurate Recognition
This Approach
  Uses Acoustic or Lexical Information
  Content- and Context-Independent

Page 4: Accomplishments

Corrections vs. Original Inputs
  Significant Differences: Duration, Pause, Pitch
Corrections vs. Recognizer Models
  Contrasts: Phonology and Duration
Correction Recognition
  Decision Tree Classifier: 65-77% Accuracy
  Human Baseline: ~80%

Page 5: Why Corrections?

Recognizer Error Rates: ~25-40%
REAL Meaning of Utterance: User Intent
Corrections Misrecognized 2.5x as Often
Hard to Correct => Poor-Quality System

Page 6: Why It's Necessary

Error Repair Requires Detection
  Errors Can Be Very Difficult to Detect (e.g., Misrecognitions)
Focus Repair Efforts
  Corrections Decrease Recognition Accuracy
Adaptation Requires Identification

Page 7: Why Is It Hard?

Recognition Failures and Errors
  Repetition <> Correction
  500 Strings => 6700 Instances (80%)
Speech Recognition Technology
  Variation: Undesirable, Suppressed

Page 8: [Figure slide]

Page 9: Roadmap

Data Collection and Description
  SpeechActs System & Field Trial
Characterizing Corrections
  Original-Repeat Pair Data Analysis
  Acoustic and Phonological Measures & Results
Recognizing Corrections
Conclusions and Future Work

Page 10: SpeechActs System

Speech-Only System over the Telephone (Yankelovich, Levow & Marx 1995)
Access to Common Desktop Applications: Email, Calendar, Weather, Stock Quotes
BBN's Hark Speech Recognition; Centigram TruVoice Speech Synthesis
In-House: Natural Language Analysis, Back-End Applications, Dialogue Manager

Page 11: System Data Overview

Approximately 60 Hours of Interactions, Digitized at 8 kHz, 8-bit mu-law Encoding
18 Subjects: 14 Novices, 4 Experts (Single Shots)
7529 User Utterances, 1961 Errors
P(error | correct) = 18%; P(error | error) = 44%
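These conditional probabilities make the error spiral on Page 2 concrete: after an error, the next utterance fails at more than twice the base rate. As a quick illustration (my sketch, not from the talk), treating recognition outcomes as a two-state Markov chain gives the expected spiral length directly:

# Illustrative sketch (not from the talk): model recognition outcomes as a
# two-state Markov chain using the reported conditional error rates.
P_ERR_AFTER_OK = 0.18   # P(error | previous utterance correct)
P_ERR_AFTER_ERR = 0.44  # P(error | previous utterance in error)

def expected_spiral_length(p_err_after_err: float) -> float:
    """Expected number of consecutive error utterances once one occurs.
    Each following utterance fails with probability p_err_after_err,
    so the run length is geometric: E[L] = 1 / (1 - p)."""
    return 1.0 / (1.0 - p_err_after_err)

print(f"{expected_spiral_length(P_ERR_AFTER_ERR):.2f}")  # ~1.79 error utterances per spiral
print(f"{expected_spiral_length(P_ERR_AFTER_OK):.2f}")   # ~1.22 if errors were independent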

Page 12: System: Recognition Error Types

Rejection Errors: Input Below Recognition Level
  U: Switch to Weather
  S (heard): <nothing>
  S (said): Huh?
Misrecognition Errors: Substitution in Text
  U: Switch to Weather
  S (heard): Switch to Calendar
  S (said): On Friday, May 4, you have a talk at Chicago.
1250 Rejections (~2/3); 706 Misrecognitions (~1/3)

Page 13: Analysis: Data

300 Original Input-Repeat Correction Pairs
Lexically Matched, Same Speaker
Example:
  S (said): Please say mail, calendar, weather.
  U: Switch to Weather. [Original]
  S (said): Huh?
  U: Switch to Weather. [Repeat]

Page 14: Analysis: Duration

Automatic Forced Alignment, Hand-Edited
Total: Speech Onset to End of Utterance
Speech: Total - Internal Silence
Contrasts: Original Input vs. Repeat Correction
  Total Duration: Increases 12.5% on Average
  Speech Duration: Increases 9% on Average
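A minimal sketch of the two duration measures, assuming forced-alignment output as (label, start, end) tuples with internal silences labeled "sil"; the segment format is illustrative, not the study's actual alignment files:

# Minimal sketch: total and speech duration from a forced alignment.
# Segment format (label, start_s, end_s) is an assumption for illustration.
Segment = tuple[str, float, float]

def duration_measures(segments: list[Segment]) -> tuple[float, float]:
    """Return (total, speech) in seconds: total runs from speech onset to
    end of utterance; speech is total minus utterance-internal silence."""
    total = segments[-1][2] - segments[0][1]
    internal_silence = sum(end - start
                           for label, start, end in segments[1:-1]
                           if label == "sil")
    return total, total - internal_silence

# "Switch to weather" with a 120 ms internal pause:
utt = [("switch", 0.00, 0.35), ("sil", 0.35, 0.47),
       ("to", 0.47, 0.58), ("weather", 0.58, 1.02)]
total, speech = duration_measures(utt)
print(f"total={total:.2f}s  speech={speech:.2f}s")  # total=1.02s  speech=0.90s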

Page 15: [Figure slide]

Page 16: Analysis: Pause

Utterance-Internal Silence > 10 ms
  Not Preceding Unvoiced Stops (t) or Affricates (ch)
Contrasts: Original Input vs. Repeat Correction
  Absolute Pause Duration: 46% Increase
  Ratio of Silence to Total Duration: 58% Increase
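Continuing the duration sketch above (same illustrative segment format), the pause measures fall out of the same segment list. The exclusion of silences before unvoiced stops and affricates is approximated here by the onset letter of the following word; the study's actual rule is phone-level:

# Sketch of the pause measures; the stop/affricate exclusion is a
# word-onset approximation of the study's phone-level rule.
Segment = tuple[str, float, float]
EXCLUDED_ONSETS = ("t", "ch")

def pause_measures(segments: list[Segment]) -> tuple[float, float]:
    """Return (absolute pause in seconds, pause-to-total-duration ratio)."""
    pause = 0.0
    for i, (label, start, end) in enumerate(segments):
        if label != "sil" or i in (0, len(segments) - 1):
            continue  # only utterance-internal silences count
        if end - start <= 0.010:
            continue  # below the 10 ms threshold
        if segments[i + 1][0].startswith(EXCLUDED_ONSETS):
            continue  # likely a stop/affricate closure, not a real pause
        pause += end - start
    total = segments[-1][2] - segments[0][1]
    return pause, pause / total

utt = [("read", 0.00, 0.30), ("sil", 0.30, 0.45),
       ("message", 0.45, 0.90), ("sil", 0.90, 0.98),
       ("twenty", 0.98, 1.40)]
pause, ratio = pause_measures(utt)
# The second silence is excluded: "twenty" begins with an unvoiced stop.
print(f"pause={pause * 1000:.0f}ms  ratio={ratio:.1%}")  # pause=150ms  ratio=10.7%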

Page 17: [Figure slide]

Page 18: Pitch Tracks [figure slide]

Page 19: Analysis: Pitch I

ESPS/Waves+ Pitch Tracker, Hand-Edited
Normalized Per Subject: (Value - Subject Mean) / (Subject Std Dev)
Measures: Pitch Maximum, Minimum, Range
  Over Whole Utterance & Last Word
Contrasts: Original Input vs. Repeat Correction
  Significant Decrease in Pitch Minimum (Whole Utterance & Last Word)
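The per-subject normalization above is a standard z-score computed within each speaker, so naturally low- or high-pitched voices do not swamp the original-versus-correction contrasts. A minimal sketch; the data layout (subject to raw pitch values) is illustrative:

# Minimal sketch of the per-subject z-score normalization described above.
from statistics import mean, stdev

def normalize_per_subject(values_by_subject: dict[str, list[float]]) -> dict[str, list[float]]:
    """z = (value - subject mean) / subject std dev, computed per speaker."""
    normalized = {}
    for subject, values in values_by_subject.items():
        mu, sigma = mean(values), stdev(values)
        normalized[subject] = [(v - mu) / sigma for v in values]
    return normalized

# Illustrative pitch minima (Hz) for a low- and a high-pitched speaker:
pitch_min_hz = {"s01": [92.0, 88.5, 81.0], "s02": [190.0, 178.0, 165.5]}
print(normalize_per_subject(pitch_min_hz))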

Page 20: Analysis: Pitch II [figure slide]

Page 21: Analysis: Overview

Significant Differences: Original vs. Correction
  Duration & Pause: Significant Increases
  Pitch: Significant Decrease in Pitch Minimum; Increase in Final Falling Contours
Conversational-to-Clear Speech Shift

Page 22: Analysis: Phonology

Reduced Form => Citation Form
  Schwa to Unreduced Vowel (~20), e.g., "Switch t' mail" => "Switch to mail"
  Unreleased or Flapped 't' => Released 't' (~50), e.g., "Read message tweny" => "Read message twenty"
Citation Form => Hyperclear Form
  Extreme Lengthening, Calling Intonation (~20), e.g., "Goodbye" => "Goodba-aye"

Page 23: Durational Model Contrasts

[Figure: Departure from Model Mean (Std Dev) by # of Words, Non-Final vs. Final]

Compared to SR Model (Chung 1995):
  Phrase-Final Lengthening: Words in Final Position Significantly Longer than Non-Final Words and than Model Prediction
  All Significantly Longer in Correction Utterances
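The departure measure expresses each observed word duration in standard deviations from the recognizer duration model's mean for that word. A minimal sketch, with an illustrative model table standing in for the actual SR model (Chung 1995):

# Sketch of the departure measure; the model table is a placeholder,
# not the actual recognizer duration model.
MODEL = {  # word -> (mean_s, std_s)
    "switch": (0.30, 0.05),
    "to": (0.10, 0.03),
    "weather": (0.38, 0.06),
}

def departure(word: str, observed_s: float) -> float:
    """Observed duration in std devs above/below the model mean."""
    mu, sigma = MODEL[word]
    return (observed_s - mu) / sigma

# A phrase-final word in a correction, lengthened well beyond the model mean:
print(f"non-final 'switch': {departure('switch', 0.33):+.1f} sd")  # +0.6 sd
print(f"final 'weather':    {departure('weather', 0.62):+.1f} sd")  # +4.0 sd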

Page 24: Analysis: Overview II

Original vs. Correction & Recognizer Model
Phonology
  Reduced Form => Citation Form => Hyperclear Form
  Conversational-to-(Hyper)Clear Shift
Duration
  Contrast between Final and Non-Final Words
  Departure from ASR Model: Increases for Corrections, Especially Final Words

Page 25: Automatic Recognition of Spoken Corrections

Machine Learning Classifier: Decision Trees
  Trained on Labeled Examples
  Features: Duration, Pause, Pitch
Evaluation:
  Overall: 65% Accuracy (incl. Text Features); Key Features: Absolute and Normalized Duration
  Misrecognitions: 77% Accuracy (incl. Text Features); Key Features: Absolute and Normalized Duration, Pitch
  Acoustic Features Only: 65% Accuracy
Approaches Human Baseline: 79.4%

Page 26: Accomplishments

Contrasts between Originals and Corrections
  Significant Differences in Duration, Pause, Pitch
  Conversational-to-Clear Speech Shifts
  Shifts Away from Recognizer Models
Corrections Recognized at 65-77% Accuracy: Near-Human Levels

Page 27: Future Work

Modify ASR Duration Model for Corrections
  Reflect Phonological and Durational Changes
Identify Locus of Correction for Misrecognitions
  U: Switch to Weather
  S (heard): Switch to Calendar
  S (said): On Friday, May 4, you have a talk at Chicago.
  U: Switch to WEATHER!
Preliminary Tests: 26/28 Corrected Words Detected, 2 False Alarms

Page 28: Future Work

Identify and Exploit Cues to Discourse and Information Structure
Incorporate Prosodic Features into a Model of Spoken Dialogue
Exploit Text and Acoustic Features for Segmentation of Broadcast Audio and Video
  Necessary First Phase for Information Retrieval
Assess Language Independence
  First Phase: Segmentation of Mandarin and Cantonese Broadcast News (in Collaboration with CUHK)

Page 29: Classification of Spoken Corrections

Decision Trees
  + Intelligible, Robust to Irrelevant Attributes
  ? Rectangular Decision Boundaries; Don't Combine Features
Features (38 Total, 15 in Best Trees)
  Duration, Pause, Pitch, and Amplitude; Normalized and Absolute
Training and Testing
  50% Original Inputs, 50% Repeat Corrections
  7-Way Cross-Validation
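A minimal sketch of this training and evaluation setup using scikit-learn as a stand-in (the original work predates scikit-learn and used its own decision-tree tools). The feature matrix and labels below are synthetic placeholders for the 38 prosodic features:

# Sketch of the classification setup; scikit-learn stands in for the
# original tooling, and the data is a synthetic placeholder.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600  # balanced: 50% originals (label 0), 50% repeat corrections (label 1)
y = np.repeat([0, 1], n // 2)

# Placeholder prosodic features (duration, pause, pitch, amplitude;
# normalized and absolute). Corrections get longer durations and pauses.
X = rng.normal(size=(n, 8))
X[y == 1, 0] += 0.5  # normalized duration increase for corrections
X[y == 1, 1] += 0.4  # pause increase for corrections

# min_samples_leaf=10 is an assumed analogue of the "minimum of 10 per
# branch" constraint mentioned on Page 30.
clf = DecisionTreeClassifier(min_samples_leaf=10)
scores = cross_val_score(clf, X, y, cv=7)  # 7-way cross-validation
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")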

Page 30: Recognizer: Results (Overall)

Tree Size: 57 Nodes (Unpruned), 37 (Pruned); Minimum of 10 per Branch Required
First Split: Normalized Duration (All Trees)
Most Important Features: Normalized & Absolute Duration, Speaking Rate
65% Accuracy vs. 50% Null Baseline

Page 31: Example Tree [figure slide]

Page 32: Classifier Results: Misrecognitions

Most Important Features:
  Absolute and Normalized Duration
  Pitch Minimum and Pitch Slope
77% Accuracy (with Text Features); 65% (Acoustic Features Only)
Null Baseline: 50%
Human Baseline: 79.4% (Hauptmann & Rudnicky 87)

Page 33: Misrecognition Classifier [figure slide]

Page 34: Background & Related Work

Detecting and Preventing Miscommunication (Smith & Gordon 96; Traum & Dillenbourg 96)
Identifying Discourse Structure in Speech
  Prosody: (Grosz & Hirschberg 92; Swerts & Ostendorf 95)
  Cue Words + Prosody: (Taylor et al. 96; Hirschberg & Litman 93)
Self-Repairs: (Heeman & Allen 94; Bear et al. 92)
  Acoustic-Only: (Nakatani & Hirschberg 94; Shriberg et al. 97)
Speaking Modes: (Ostendorf et al. 96; Daly & Zue 96)
Spoken Corrections:
  Human Baseline: (Hauptmann & Rudnicky 87)
  (Oviatt et al. 96, 98; Levow 98, 99; Hirschberg et al. 99, 00)
  Other Languages: (Bell & Gustafson 99; Pirker et al. 99; Fischer 99)

Page 35: Learning Method Options

(K)-Nearest Neighbor
  - Needs Commensurable Attribute Values
  - Sensitive to Irrelevant Attributes
  - Labeling Speed, Training Set Size
Neural Nets
  - Hard to Interpret; Can Require More Computation & Training Data
  + Fast, Accurate when Trained
Decision Trees (Selected)
  + Intelligible, Robust to Irrelevant Attributes
  + Fast, Compact when Trained
  ? Rectangular Decision Boundaries; Don't Test Feature Combinations