
Comparing Synthesized versus Pre-Recorded Tutor Speech

in an Intelligent Tutoring Spoken Dialogue System

Kate Forbes-Riley and Diane Litman and Scott Silliman and Joel Tetreault

Learning Research and Development Center, University of Pittsburgh

Outline

Overview

System and Corpora

Evaluation Metrics and Methodology

Results

Conclusions, Future Work

Overview: Motivation

Intelligent tutoring systems are adding speech capabilities (e.g. LISTEN Reading Tutor, SCoT, AutoTutor) to enhance communication richness and increase effectiveness.

Question: What is the relationship between the quality of the speech technology and system effectiveness?

Is a pre-recorded tutor voice (costly, inflexible, human) more effective than a synthesized tutor voice (cheaper, flexible, non-human)?

If not, put the effort elsewhere in system design!

Overview: Recent Work (mixed results)

Math Tutor System: (non-)visual tutor with pre-recorded voice always rated higher and yielded deeper learning (Atkinson et al., 2005)

Instructional Plan Tutor System: pre-recorded voice always rated more engaging. Non-visual tutor: pre-recorded voice yields more motivation. Visual tutor: synthesized voice yields more motivation. (Baylor et al., 2003)

Smart-Home System: the more natural-sounding voice is preferred. Voice characteristics (effort, pleasantness) matter more than voice type. (Moller et al., 2006)

Overview: Our Study

Two tutoring system versions: pre-recorded tutor voice vs. synthesized tutor voice (non-visual tutor)

Evaluate effectiveness: student learning, system usability, dialogue efficiency across corpora (subsets)

Hypothesis: the more human-sounding voice will perform better

Results: tutor voice quality has only a minor impact. It does not impact learning. It may impact usability and efficiency: in certain corpora subsets pre-recorded is preferred, in others synthesized.

Intelligent Tutoring Spoken Dialogue System

• Back-end: text-based Why2-Atlas system (VanLehn et al., 2002)

• Sphinx2 speech recognizer; Why2-Atlas performs NLP on the transcript

• A scrollable dialogue history is available to the student

2 ITSPOKE 2005 Corpora

Pre-recorded voice: paid voice talent, 5.85 hours of audio, 25 hours of paid time (at $120/hr)

Synthesized voice: Cepstral text-to-speech system voice of “Frank” for $29.95

Corpus   #Students   #Dialogues
PR       28          140
SYN      29          145

Example of 2 ITSPOKE Tutor Voices

TUTOR TURN: Right. Let's now analyze what happens to the keys. So what are the forces acting on the keys after they are released? Please, specify their directions (for instance, vertically up).  

ITSPOKE Pre-Recorded Tutor Voice (PR)

ITSPOKE Synthesized Tutor Voice (SYN)

Experimental Procedure

Paid subjects without college physics, recruited via University of Pittsburgh ads:

• Read a small background document
• Took a pretest
• Worked 5 training problems (dialogues) with ITSPOKE
• Took a posttest
• Took a User Satisfaction Survey

Corpus   #Students   #Dialogues
PR       28          140
SYN      29          145

ITSPOKE User Satisfaction Survey

S1. It was easy to learn from the tutor.
S2. The tutor interfered with my understanding of the content.
S3. The tutor believed I was knowledgeable.
S4. The tutor was useful.
S5. The tutor was effective in conveying ideas.
S6. The tutor was precise in providing advice.
S7. The tutor helped me to concentrate.
S8. It was easy to understand the tutor.
S9. I knew what I could say or do at each point in the conversations with the tutor.
S10. The tutor worked the way I expected it to.
S11. Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly.

Scale: ALMOST ALWAYS (5), OFTEN (4), SOMETIMES (3), RARELY (2), ALMOST NEVER (1)

Evaluation Metrics

Student Learning Gains
SLG (standardized gain): posttest score – pretest score
NLG (normalized gain): (posttest score – pretest score) / (1 – pretest score)

Dialogue Efficiency
TOT (time on task): total time over all 5 dialogues (min.)

System Usability
S# (S1–S11): score for each survey statement
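The two learning-gain metrics above can be sketched in Python. This is a minimal illustration; the function name and the 0.0–1.0 score scale are assumptions, not from the slides:

```python
def learning_gains(pretest, posttest):
    """Compute standardized (SLG) and normalized (NLG) learning gains.

    Scores are assumed to be fractions of the maximum test score (0.0-1.0).
    """
    slg = posttest - pretest                      # SLG: raw score difference
    nlg = (posttest - pretest) / (1.0 - pretest)  # NLG: gain relative to room left
    return slg, nlg

# Example: a hypothetical student scoring 0.40 on the pretest, 0.70 on the posttest
slg, nlg = learning_gains(0.40, 0.70)
```

NLG rewards students who start high: gaining 0.30 from a 0.40 pretest yields NLG = 0.50, since half of the available headroom was closed.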

Evaluation Methodology

For each of the 14 evaluation metrics, compute a 2-tailed t-test over the student set in each corpus, for 13 student sets:

• All students (PR and SYN)
• Students who may be more susceptible to tutor voice quality, based on 3 criteria:
  Highest/High/Low/Lowest Time on Task
  Highest/High/Low/Lowest Pretest Score
  Highest/High/Low/Lowest Word Error Rate

• High/Low partition: split at the criterion median in the corpora
• Highest/Lowest partition: cutoffs above/below the median
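The High/Low median split and the two-sample comparison can be sketched as follows. This is a hedged illustration: the data layout (list of dicts), tie handling (ties go to Low), and the use of Welch's unequal-variance t statistic are assumptions; the slides do not specify the t-test variant, and a 2-tailed p-value would additionally require the t distribution (e.g. scipy.stats):

```python
from statistics import mean, median, variance

def high_low_partition(students, key):
    """Split students at the median of a criterion (e.g. pretest score,
    time on task, word error rate) into High/Low subsets.
    Students exactly at the median are placed in Low (an assumption)."""
    cut = median(s[key] for s in students)
    high = [s for s in students if s[key] > cut]
    low = [s for s in students if s[key] <= cut]
    return high, low

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples of metric values."""
    vx, vy = variance(xs), variance(ys)
    return (mean(xs) - mean(ys)) / ((vx / len(xs) + vy / len(ys)) ** 0.5)

# Example with hypothetical pretest scores
students = [{"id": i, "pretest": p}
            for i, p in enumerate([0.3, 0.5, 0.6, 0.8])]
high, low = high_low_partition(students, "pretest")
```

With these four scores the median is 0.55, so the High set holds the 0.6 and 0.8 students and the Low set the other two.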

Student Learning Results

No significant difference (p < .05) in learning gains (SLG or NLG) for any of the 13 student sets

No trend for a significant difference (p < .10) in learning gains (SLG or NLG) for any of the 13 student sets

Students learned significantly in both conditions (p=.000)

Dialogue Efficiency Results

Metric   Student Set       PR Mean (n)   SYN Mean (n)   p
TOT      Highest Pretest   100.9 (6)     121.9 (9)      .09

The most knowledgeable SYN students may take more time to read the transcript (PR most efficient).

The PR voice is marginally slower than the SYN voice (e.g. in our example turn, PR = 13 seconds, SYN = 10 seconds).

System Usability Results (1)

Metric   Student Set   PR Mean (n)   SYN Mean (n)   p
S3       All           3.50 (28)     3.00 (29)      .05
S3       Highest TOT   3.40 (10)     2.64 (11)      .06
S3       High WER      3.43 (14)     2.64 (14)      .07

S3. The tutor believed I was knowledgeable: more human-like qualities are attributed to the more human voice (PR preferred).

System Usability Results (2)

Metric   Student Set   PR Mean (n)   SYN Mean (n)   p
S11      High WER      1.86 (14)     2.64 (14)      .08
S11      Highest WER   1.50 (6)      2.71 (7)       .06

S11. Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly: SYN is more consistent with the overall experience, not too human (SYN preferred).

System Usability Results (3)

Metric   Student Set   PR Mean (n)   SYN Mean (n)   p
S2       Low WER       2.36 (14)     1.93 (14)      .08

S2. The tutor interfered with my understanding of the content: when the voice and WER are human-like, students notice the inflexible natural language understanding/generation (SYN preferred).

Summary

Evaluated the impact of pre-recorded vs. synthesized tutor voice on system effectiveness in ITSPOKE

Student Learning Results: no impact

Dialogue Efficiency Results: little impact
• Highest Pretest students took less time with PR (trend)

System Usability: little impact, mixed voice preference
• All, High Word Error Rate, and Highest TOT students felt PR believed them more knowledgeable (significant for All; trends otherwise)
• Low Word Error Rate students felt SYN interfered less (trend)
• High(est) Word Error Rate students preferred SYN for regular use (trend)

Conclusions and Future Work

Tutor voice quality has minimal impact in ITSPOKE

Why is text-to-speech sufficient in the ITSPOKE context?
• The transcript dilutes the impact of the voice
• Students have time to get used to the voice

Future Work:
• Show the transcript after tutor speech, or not at all
• Extend the survey: how often is the transcript read, how much effort to understand the voice (Moller et al., 2006)
• Try using other voices

Thank You!

Questions?

Further information: http://www.cs.pitt.edu/~litman/itspoke.html