
Benefits and Challenges of Real-Time Uncertainty Detection and Adaptation in a Spoken Dialogue Computer Tutor

Kate Forbes-Riley^a, Diane Litman^b

^a Learning Research and Development Center, University of Pittsburgh, USA
^b Computer Science Department, Learning Research and Development Center, University of Pittsburgh, USA

Abstract

We evaluate the performance of a spoken dialogue system that provides substantive dynamic responses to automatically detected user affective states. We then present a detailed system error analysis that reveals challenges for real-time affect detection and adaptation. This research is situated in the tutoring domain, where the user is a student and the spoken dialogue system is a tutor. Our adaptive system detects uncertainty in each student turn via a model that combines a machine learning approach with hedging phrase heuristics; the learned model uses acoustic-prosodic and lexical features extracted from the speech signal, as well as dialogue features. The adaptive system varies its content based on the automatic uncertainty and correctness labels for each turn. Our controlled experimental evaluation shows that the adaptive system yields higher global performance than two non-adaptive control systems, but the difference is only significant for a subset of students. Our system error analysis indicates that noisy affect labeling is a major performance bottleneck, yielding fewer than expected adaptations and thus lower than expected performance. However, the percentage of received adaptation correlates with higher performance over all students. Moreover, when uncertainty is accurately recognized and adapted to, local performance is significantly improved.

Keywords: fully automated spoken dialogue tutoring system, automatic affect detection and adaptation, accompanying manual annotation, controlled experimental evaluation, error analysis

Preprint submitted to Speech Communication Special Issue: Sensing Emotion and Affect, January 19, 2011


1. Introduction

Within the field of affective computing, researchers hypothesize that incorporating information about affective states into the human-computer interaction model will significantly benefit system performance (e.g., [1, 2, 3, 4, 5, 6, 7]). Typically, this task will involve enhancing the model in two ways. First, a method of automatically detecting the affective state(s) of interest must be developed and implemented. Over the past decade significant advances have been made in the area of affect detection (e.g., [8, 9]). Second, a method of automatically adapting to the affective state(s) must be developed and implemented. To date there have been fewer reported evaluations of affect-adaptive systems (e.g., [10, 11, 7]). Moreover, within the computer tutoring domain, most of these evaluations have taken place under "ideal" conditions that reduce system error by using an unseen human "wizard" to perform the most complex tasks such as affect detection (e.g., [12]).

We recently evaluated a wizarded version of a spoken dialogue computer tutor, UNC-ITSPOKE-WOZ [13], which detects and adapts to student uncertainty, and responds with additional remedial content. Uncertainty detection and natural language understanding were performed by a human. Our evaluation showed that UNC-ITSPOKE-WOZ yielded significantly higher global student learning as compared to a non-adaptive control system. In this paper we take the next step in our research and evaluate UNC-ITSPOKE, a fully automated version of our system which performs all tasks automatically.

Figure 1 shows an example system dialogue with and without the uncertainty adaptation. The figure begins with an initial tutor question (tutor1) and a correct+uncertain student answer (student1). The response to student1 by the non-adaptive system is shown in nonadaptive-tutor2. As shown, the non-adaptive system ignores the student uncertainty and simply provides correctness feedback before moving on to the next question. The response to student1 by the adaptive system is shown in adaptive-tutor2; as shown, the adaptive system provides correctness feedback and then initiates a subdialogue to remediate the student's uncertainty. After this subdialogue completes, the adaptive system moves on to the next question.

Our first main contribution in this paper is to show via a controlled experimental evaluation that the fully automated UNC-ITSPOKE system improves performance (as measured by student learning), but this performance improvement is weaker than we previously found in "ideal" conditions using the UNC-ITSPOKE-WOZ system.


tutor1: What will the velocity of the object be a second after that (where the initial velocity is 9.8 m/s and the acceleration is 9.8 m/s^2)?
student1: Is it 19.6 m/s? [Correct+Uncertain]

nonadaptive-tutor2: Good. Now back to the man and his keys. We've shown that they both have an acceleration of 9.8 m/s^2. So, how do their velocities compare with each other, at every second during the fall?

adaptive-tutor2: Good. A second later, its velocity will be 9.8 m/s + 9.8 m/s = 19.6 m/s. This is because its acceleration tells us that every second, its velocity increases by 9.8 m/s. So, what'll its velocity be, a second after that?
adaptive-student2: Um. I'm thinking. 29.4 m/s. [Correct+Certain]
adaptive-tutor3: Right. Now back to the man and his keys. We've shown that they both have an acceleration of 9.8 m/s^2. So, how do their velocities compare with each other, at every second during the fall?

Figure 1: Example Illustrating How Adaptive and Non-Adaptive System Responses Differ

In particular, UNC-ITSPOKE improved global learning, measured via pretest and posttest, more than two non-adaptive control systems, but the difference was only significant for a subset of students. However, the percentage of received adaptations significantly correlated with learning over all students. Moreover, we found that accurately recognizing and adapting to uncertainty significantly improved overall local learning, measured as the learning state to which turns transition after receiving the uncertainty adaptation.

Our second main contribution in this paper is a detailed system error analysis using post-experimental manual annotations. This analysis reveals how weaknesses in real-time affect detection and adaptation in spoken dialogue systems can negatively impact system performance. In particular, we found that errors in the automatic uncertainty detection and natural language processing components caused UNC-ITSPOKE to produce significantly fewer uncertainty adaptations than expected, with the major performance bottleneck being the noisy uncertainty model. In other words, UNC-ITSPOKE didn't detect enough uncertainty and thus didn't adapt to enough uncertain turns; this likely explains the lack of a significant overall improvement in global learning.


Section 2 provides a summary of the research background for this work. Our fully automated UNC-ITSPOKE system is presented in Sections 3-5. Section 6 discusses our controlled experiment. Section 7 evaluates the global learning impact of UNC-ITSPOKE. Section 8 then analyzes the performance of the automatic uncertainty detection and natural language processing components and discusses the effect of the errors produced by these components. Section 9 evaluates the local learning impact of UNC-ITSPOKE when uncertainty is correctly recognized and adapted to. Section 10 discusses our future work, including a future evaluation of UNC-ITSPOKE with improved student state detection based on the lessons learned from this research.

2. Background

As noted in Section 1, building an affect-adaptive spoken dialogue system requires both detecting and responding to user affect automatically.

Automatic affect detection has received significant attention in the wider affective computing literature. For example, within spoken dialogue systems research, a range of linguistic information, including pitch, energy, and timing information, has been extracted from the user's speech signal and larger dialogue context and used to automatically detect user affective states [3, 5, 14, 15, 16, 17]. Similarly, within the domain of tutoring systems, sophisticated models of learner affective states have been developed that take not only learner-based cues (e.g., linguistic, visual, and physiological), but also deeper features of the learning environment into account (e.g., correctness, dialogue acts, inferred causes, personality types) [5, 2, 18, 19, 20]. However, despite these demonstrated advances, the problem of reliable affect detection has been quite challenging, due largely to two main bottlenecks: 1) the limitations in the type and number of features (particularly deep semantic/pragmatic features) that can be used because the detection must be performed in real-time,^1 and 2) the subjectivity of the emotional experience combined with the spontaneity of emotional expression during the interaction with the spoken dialogue system; human annotators with potentially limitless features and time at their disposal typically display low interannotator reliability on the task, and existing affect detection techniques do not

^1 Note however that there has been tremendous recent progress made on developing real-time linguistic and acoustic features for emotion prediction (cf. [8, 9]), which should help alleviate this bottleneck for working affect-adaptive systems in coming years.


yet consistently perform on par with humans (e.g., [5]).

Due in part to the challenges associated with the affect detection task, there have to date been few reported evaluations of automatic affect adaptations in the affective system literature (e.g., [10, 11, 7]). For example, Klein et al.'s simulated network game system [10] responded with sympathy and apology to user self-reported instances of frustration; the adaptive system yielded increased user persistence and reduced frustration in a follow-on task as compared to two non-adaptive control systems. Liu and Picard's health assessment system [11] used physiological sensors to automatically detect instances of user stress and then responded with empathy; users rated the empathetic system as significantly less stressful than the non-adaptive control. However, it is still largely an open question as to what are the most effective responses to user affective states. To some extent the answer depends on the task domain, the performance metric targeted for improvement, and the affective state(s) targeted for adaptation.

Within the spoken dialogue tutoring system domain, student learning is the primary performance metric, and student uncertainty is an affective state of primary interest. Although uncertainty is not one of the "big 6" emotions such as anger and happiness [21], tutoring research suggests uncertainty and closely related states such as confusion and self-efficacy are among the most frequently occurring affective states in tutoring dialogues (e.g., [22, 23, 4]), and they play an important role in the learning process [23, 24, 25]. Due to the complexity of the information exchange in a tutoring dialogue, most dynamic affect-adaptive tutoring systems - that is, systems that recognize and respond to student affect on a turn-by-turn basis - have been evaluated within a "Wizard of Oz" scenario, where an unseen human "wizard" performs tasks such as speech recognition, natural language understanding, and affect detection or adaptation. Wizarding the system has the benefit of removing the system errors (i.e., noise), such as misrecognition of the user's utterance and/or affective state, that occur when these tasks are automated and that can distract from the dialogue interaction.^2 Another benefit of wizarded affect recognition is that spontaneous affect (even non-prototypical examples) is detected across multiple speakers in real-time, while avoiding the limited-feature bottleneck of automatic affect detection.

^2 Of course, human interaction also contains noise, as exemplified by the fact that interannotator agreement for affect labeling is far from perfect (cf. [13]).


Thus wizarding allows the system design to be tested under the best possible conditions. However, the obvious bottleneck of wizarding is that the system is not fully automated.

Examples of wizarded computer tutor evaluations include Tsukahara and Ward [26], where positive feedback responses to affective states were implemented in a Memory Game computer tutor, with a human wizard performing speech recognition and natural language understanding (affect detection was automatic). The affect-adaptive system was rated more highly than a non-adaptive version, but did not yield improved student learning. Similarly, in Aist et al. [12], a human wizard performed speech recognition and natural language understanding in a spoken Reading Tutor, and then provided positive feedback after detecting affective states. The adaptive system did not yield increased learning, but did yield increased student persistence (measured as how often the student chose to continue reading with the tutor as opposed to quitting), which has been associated with increased learning.

We recently evaluated a wizarded spoken dialogue tutoring system, UNC-ITSPOKE-WOZ, that dynamically responds to uncertain student turns with substantive content, as opposed to positive feedback or empathy [13]. A human wizard performed speech recognition, natural language understanding, and uncertainty detection. The uncertainty adaptation was based on tutoring theory that views both uncertainty and incorrectness as learning opportunities; UNC-ITSPOKE-WOZ provided additional content after every uncertain or incorrect turn to take advantage of all learning opportunities. UNC-ITSPOKE-WOZ yielded significantly higher learning as compared to a non-adaptive control system. To our knowledge only one other study [27] has presented an affective computer tutor that improves learning. In particular, Wang and Johnson [27] conducted a series of Wizard-of-Oz studies in which students either received polite tutorial feedback after every turn, or received direct feedback after every turn. The polite feedback system yielded increased learning as compared to the direct feedback system. This can be viewed as a non-dynamic (static) approach to affect adaptation because the response does not vary depending on the affect in each turn. More generally, this research is situated in the field of "empathetic agents" (e.g., [28, 29]) such as the empathetic systems in other domains mentioned above ([10, 11, 7]). Also related to empathetic agents is recent research on natural language generation in dialogue systems that addresses automatic generation of different system personality styles, such as in Mairesse and Walker [30].

In this paper we take the next step in our research and evaluate UNC-ITSPOKE, which is a fully automated version of our uncertainty-adaptive


spoken dialogue tutoring system. To our knowledge only one other study has evaluated a fully automated affect-adaptive spoken dialogue tutoring system. Like ours, Pon-Barry et al.'s system [31] used substantive dynamic adaptations to respond to uncertain student turns. They found that when used only after uncertain answers, the adaptations did not significantly increase learning. However, when used after all answers (regardless of uncertainty), the adaptations did significantly increase learning. This suggests that one major limitation of the Pon-Barry et al. study was that the adaptations were not actually effective responses to uncertainty; rather, they were effective responses to all answers. While speech recognition and understanding errors likely negatively impacted the effectiveness of the adaptation, the other major limitation of the Pon-Barry et al. study likely came from the automatic uncertainty detection, which was based on a very limited set of features: hedging phrases, filled pauses, and response latency. Our UNC-ITSPOKE system overcomes the two major limitations of the Pon-Barry et al. study. First, our uncertainty adaptation was previously shown to significantly improve learning in the UNC-ITSPOKE-WOZ system when used after uncertain answers, but not when used randomly [13]. This indicates that our adaptation does represent an effective response to uncertainty in "ideal" (wizarded) system conditions. Second, UNC-ITSPOKE automatically detects uncertainty in each student turn via a more complex model that combines a machine learning approach with hedging phrase heuristics; the learned model was trained on the UNC-ITSPOKE-WOZ corpus using acoustic-prosodic and lexical features extracted from the speech signal, as well as dialogue features.

3. ITSPOKE: Non-Adaptive System Behavior

Our non-adaptive tutoring system is simply called ITSPOKE [13].^3 ITSPOKE is a fully automated spoken dialogue tutoring system that was originally built on top of the Why2-Atlas text-based physics tutoring system [32] and was later reimplemented using the TuTalk authoring tool and tutorial dialogue system platform [33].^4

ITSPOKE tutors 6 qualitative physics problems via spoken dialogue, where each problem corresponds to one dialogue. The spoken dialogue is mediated via a web interface.

^3 The abbreviation stands for Intelligent Tutoring SPOKEn dialogue system.
^4 The later version focuses on a system-led dialogue to solve the physics problems and eliminates the essay writing component of Why2-Atlas.


An example screenshot of this interface is shown in Figure 2; the screenshot also shows one physics problem.

Figure 2: Screenshot during ITSPOKE Spoken Tutoring Dialogue

Each dialogue consists of a series of questions about the topics needed to solve the physics problem. The dialogues have a Question-Answer-Response format, and each contains about 20 tutor questions on average.

If the student answers a question correctly, ITSPOKE responds with correctness feedback and then moves on to the next question. Correctness feedback is randomly selected from the following phrases to indicate that the answer was correct:^5 Fine; That's correct; Excellent; That's right; Good; Right; Correct.

If the student answers a question incorrectly, ITSPOKE responds with a remediation. This response begins with correctness feedback randomly selected from the following phrases to indicate that the answer was incorrect: That doesn't sound right; I don't think so; Well...; That's not what I expected. The rest of the response takes one of two forms:

^5 More precisely, feedback is randomly selected from the subset of these phrases that have not already been used in the dialogue. In other words, a selection won't be repeated until all of the alternatives have been used.
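For illustration, footnote 5's no-repeat selection policy can be implemented as a simple shuffle bag. The sketch below is hypothetical (the class and method names are ours, not from the ITSPOKE code):

```python
import random

class FeedbackSelector:
    """Draw feedback phrases at random, never repeating one until all of
    the alternatives have been used (a shuffle bag). Illustrative sketch
    only; not the actual ITSPOKE implementation."""

    def __init__(self, phrases):
        self.phrases = list(phrases)
        self.unused = []

    def next(self):
        if not self.unused:            # bag is empty: refill and reshuffle
            self.unused = self.phrases[:]
            random.shuffle(self.unused)
        return self.unused.pop()

positive_feedback = FeedbackSelector(
    ["Fine", "That's correct", "Excellent", "That's right",
     "Good", "Right", "Correct"])
```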


• For incorrect answers to questions about easier topics, ITSPOKE gives a BottomOut, i.e., provides the correct answer with a brief statement of reasoning. This is illustrated in Figure 3.

tutor1: Upon which vehicle is the impact force greater?
student: the car [Incorrect]
tutor2: That doesn't sound right. [FEEDBACK] We just discussed that by Newton's Third Law, when two objects collide, the forces they exert on each other are equal in magnitude and opposite in direction. This is true regardless of the objects' differing masses. [BOTTOMOUT]

Figure 3: BottomOut Response to an Incorrect Answer in (Non-Adaptive) ITSPOKE

• For incorrect answers to questions about harder topics, ITSPOKE engages the student in a Subdialogue, i.e., one or more additional questions that walk the student through the more complex line of reasoning required to achieve the correct answer. This is illustrated in Figure 4 (only the first question in the subdialogue is shown).

tutor1: What's the overall net force on the truck equal to?
student: zero?? [Incorrect]
tutor2: That doesn't sound right. [FEEDBACK] We can derive the net force on the truck by summing the individual forces on it, just like we did for the car. First, what horizontal force is exerted on the truck during the collision? [SUBDIALOGUE]

Figure 4: Subdialogue Response to an Incorrect Answer in (Non-Adaptive) ITSPOKE

Figure 5 schematizes the ITSPOKE architecture. The black unbroken arrows represent the non-adaptive system, and the red dashed arrows and boxes indicate the enhancements made to implement uncertainty detection and adaptation in the adaptive system (these enhancements are discussed in relation to Figure 5 in Sections 5.4 and 4, respectively).

As shown on the left of the figure, student speech is sent to the Sphinx2 speech recognizer [34] after being digitized from head-mounted microphone input.


The speech recognizer used the default acoustic models distributed with the Sphinx2 software, along with 15 stochastic (trigram) language models which support a vocabulary of 681 words and were trained on the UNC-ITSPOKE-WOZ corpus and other prior ITSPOKE corpora.^6

As shown on the right of the figure, the recognized student answer is then sent to the finite state dialogue manager, where states represent tutor questions and responses and transitions between states represent student answer values. The dialogue manager determines the semantic classification and correctness value for each student answer using Levenshtein distance between the answer string and a set of semantic concepts representing anticipated correct and incorrect answers to the current tutor question. For instance, the semantic concept downward is the semantic classification for answer strings such as "down", "towards earth", etc. There are 51 different semantic concepts shared among the 142 different tutor states; the sets of answer strings which represent each semantic concept were assembled from prior ITSPOKE corpora. If an answer string is successfully classified as an anticipated semantic concept, it receives the correctness value associated with that concept; otherwise, the answer string is labeled as incorrect.
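As a concrete illustration of this matching step, the sketch below pairs a standard dynamic-programming Levenshtein distance with a nearest-concept lookup. The distance threshold (max_dist) and the data layout are our assumptions; the paper does not specify how close a match must be, and this is not the actual dialogue manager code:

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def classify_answer(answer, concepts, max_dist=3):
    """Map a recognized answer string to its nearest semantic concept.

    `concepts` maps a concept name to (anticipated strings, correctness),
    e.g. {"downward": (["down", "towards earth"], True)}. Answers that
    match no anticipated concept are labeled incorrect. `max_dist` is a
    hypothetical threshold; the paper does not give one.
    """
    best = None
    for name, (strings, correct) in concepts.items():
        d = min(levenshtein(answer.lower(), s.lower()) for s in strings)
        if best is None or d < best[0]:
            best = (d, name, correct)
    if best is not None and best[0] <= max_dist:
        return best[1], best[2]
    return None, False   # unmatched answers are labeled incorrect
```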

In non-adaptive ITSPOKE, the next tutor state depends only on the correctness value of the student answer. In uncertainty-adaptive ITSPOKE, the next tutor state depends on both the correctness and uncertainty value of the student answer, where the uncertainty value is determined by prosodic features and other features of the student answer (see Sections 4 and 5).

Finally, the tutor response produced by the dialogue manager is sent to the Cepstral text-to-speech synthesis system^7 and played to the student through the headphone. The tutor text is also displayed on the web interface, as shown in Figure 2.

Earlier implementations of the non-adaptive ITSPOKE system have been used in prior experiments as a platform to compare human-computer versus human-human spoken dialogue tutoring, spoken versus typed dialogue computer tutoring, and synthesized versus pre-recorded tutor speech [36, 37]. In these prior experiments, significant overall learning has been associated with ITSPOKE tutoring (as measured by pretest and posttest before and after system use, respectively). However, there is still significant room for improvement.

^6 The cmuclmtk toolkit with default parameters was used to train the language models. The training corpora contained 9959 sentences, 2188 of which were unique, and 26900 words, 681 of which were unique.
^7 The Cepstral system is a commercial outgrowth of the Festival system [35].


Figure 5: ITSPOKE Architecture

For example, the average student posttest score after using the ITSPOKE system has never exceeded 75% (out of 100%).

4. UNC-ITSPOKE: Automatically Adapting to Uncertainty

The automatic uncertainty adaptation used in this experiment was based on the hypothesis that student learning could be significantly improved by detecting and adapting to student uncertainty during the computer tutoring. In particular, the adaptation was motivated by research that views uncertainty as well as incorrectness as a signal of a "learning impasse", i.e., as an opportunity for the student to better learn the material about which s/he is uncertain or incorrect; this research further suggests that experiencing uncertainty can motivate the student to take an active role in constructing a better understanding of the material being tutored (e.g., [38, 39]).

We further observed that to be motivated to resolve a learning impasse, the student must first perceive that it exists. Incorrectness and uncertainty differ in this perception. Incorrectness simply signals that the student has reached an impasse, while uncertainty signals that the student perceives s/he has reached an impasse. Based on this, we associated each combination of binary uncertainty (UNC, CER) and correctness (INC, COR) with an "impasse severity", as in Figure 6. COR+CER corresponds to a state where a student is not experiencing an impasse, since s/he is correct and not uncertain about it. INC+CER corresponds to a state where a student is experiencing the most severe type of impasse, since s/he is incorrect and not aware of it. INC+UNC


and COR+UNC answers indicate impasses of lesser severity: the student is incorrect but aware s/he may be, and the student is correct but uncertain if s/he is, respectively. In Forbes-Riley et al. [40] we show empirical support for distinguishing impasse severities.

Impasse State:  INC+CER   INC+UNC   COR+UNC   COR+CER
Severity:       most      less      least     none

Figure 6: Different Learning Impasse State Severities

We hypothesized that student learning would increase if UNC-ITSPOKE would provide additional substantive content to remediate all learning impasse states. As discussed in Section 3, both types of incorrectness impasse (INC+CER and INC+UNC) already receive additional substantive content via a BottomOut or Subdialogue in non-adaptive ITSPOKE. However, the COR+UNC impasse is ignored.

UNC-ITSPOKE thus adds additional substantive content to remediate COR+UNC impasses. In particular, UNC-ITSPOKE responds to COR+UNC answers in the same way as INC+UNC and INC+CER answers: for any given question, all three answer types receive the same BottomOut or Subdialogue response. This is illustrated in Figure 7. As shown, the COR+UNC answer in student receives a Subdialogue response in tutor2. Comparison with Figure 4 shows that this is the same Subdialogue response that incorrect answers already receive in non-adaptive ITSPOKE.

tutor1: What's the overall net force on the truck equal to?
student: The force of the car hitting it?? [COR+UNC]
tutor2: Fine. [FEEDBACK] We can derive the net force on the truck by summing the individual forces on it, just like we did for the car. First, what horizontal force is exerted on the truck during the collision? [SUBDIALOGUE]

Figure 7: Example of UNC-ITSPOKE’s Adaptation for COR+UNC Answers

As shown by the red dashed arrows in Figure 5 above, implementing the uncertainty adaptation in UNC-ITSPOKE involved changing the next state transitions in the finite state dialogue manager. Correctness labeling and uncertainty labeling are performed separately but in parallel within the dialogue manager.


Then, instead of transitioning based only on the correctness of the student answer, the transition is based on the answer's combined correctness and uncertainty value: COR+UNC, INC+UNC and INC+CER answers all transition to the same Subdialogue or BottomOut response.

Note that UNC-ITSPOKE does not change the (in)correctness feedback given after student answers. Identically to non-adaptive ITSPOKE, correct answers still receive feedback randomly selected from the following phrases: Fine; That's correct; Excellent; That's right; Good; Right; Correct, and incorrect answers still receive feedback randomly selected from the following phrases: That doesn't sound right; I don't think so; Well...; That's not what I expected.

In sum, UNC-ITSPOKE differs from non-adaptive ITSPOKE only in terms of its response to COR+UNC answers. UNC-ITSPOKE and non-adaptive ITSPOKE are identical in terms of their response to INC+UNC, INC+CER and COR+CER answers.
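The resulting response policy is small enough to state directly. The following sketch summarizes it under the simplifying assumption that each question's remediation (BottomOut or Subdialogue) is fixed in advance; the function and argument names are ours, not from the system:

```python
def tutor_response(correct, uncertain, remediation, adaptive):
    """Select the tutor's next move from a turn's combined
    correctness+uncertainty label (a sketch of Sections 3-4, not the
    actual finite state dialogue manager)."""
    if not correct:                    # INC+CER and INC+UNC
        return "incorrectness feedback", remediation
    if adaptive and uncertain:         # COR+UNC: the uncertainty adaptation
        return "correctness feedback", remediation
    return "correctness feedback", "next question"   # COR+CER
```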

As discussed in Section 1, our uncertainty adaptation was first tested using a wizarded version of UNC-ITSPOKE (UNC-ITSPOKE-WOZ), in which uncertainty detection, speech recognition, and correctness annotation were all performed in real-time by a human "wizard". In this way, we first tested the benefit of our automatic uncertainty adaptation in "ideal" system conditions. Our results showed that students using UNC-ITSPOKE-WOZ had significantly higher learning gains and learning efficiency than students using non-adaptive ITSPOKE [13, 41].

5. UNC-ITSPOKE: Automatically Detecting Uncertainty

For use in UNC-ITSPOKE, we built an uncertainty model with WEKA software [42] to automatically detect uncertainty from features of the student speech and dialogue context. We used the UNC-ITSPOKE-WOZ corpus to train the uncertainty model. The UNC-ITSPOKE-WOZ corpus contains 6561 student turns, which were manually annotated for uncertainty and correctness by the wizard during the experiment, and were manually transcribed after the experiment.

The wizard's annotations in the UNC-ITSPOKE-WOZ corpus were based on annotation schemes for uncertainty and correctness that were previously developed and evaluated on prior ITSPOKE corpora. In particular, this wizard displayed interannotator agreement of 0.85 and 0.62 Kappa on correctness and uncertainty, respectively, when these states were labeled after the experiments in which those corpora were collected [43].


Because these prior evaluations showed that this trained wizard could reliably annotate uncertainty and correctness, no further inter-annotation evaluations were performed in the UNC-ITSPOKE-WOZ corpus. Table 1 shows the distribution of the wizard's uncertainty labels overall and within the correct turns.

Table 1: Distribution of Turns in UNC-ITSPOKE-WOZ Training Corpus (N = 6561)

Turn Label           Total   Percent
Uncertain             1491    22.7%
Certain               5070    77.3%
Correct               5147    78.4%
Correct+Uncertain      727    11.1%
Correct+Certain       4420    67.4%

5.1. Feature Extraction

To build our uncertainty model, we first extracted a set of speech and dialogue features from the student turns in the UNC-ITSPOKE-WOZ corpus. These features are shown in Figure 8. Because the uncertainty model would be used to label uncertain turns in real-time during the experiment, it was necessary that all features be quickly and automatically computable.

Apart from these constraints, our feature set is motivated by previous studies of emotion prediction in spontaneous dialogues [44, 45, 46, 47, 48, 49, 50, 17]. Our acoustic-prosodic features represent knowledge of pitch, energy, duration, and pausing. F0 and RMS values, representing measures of pitch and loudness, respectively, were computed using Entropic Research Laboratory's pitch tracker, get_f0 [51, 52], with no post-correction. Amount of internal silence was approximated as the proportion of zero f0 frames for the turn. Turn duration and prior pause duration were computed from the manually labeled turn boundaries in the UNC-ITSPOKE-WOZ corpus. During the experiment, they were computed via the automatically labeled start and end turn boundaries provided by the speech recognizer. All of these raw acoustic-prosodic features were normalized to the first turn in each dialogue, to help control for variations across students, because our pilot machine learning studies with this data indicated that normalized features slightly outperformed raw features at predicting the uncertainty labels.
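The paper does not state the exact normalization formula, so the sketch below assumes a simple ratio to the first turn's value for each raw feature; it is illustrative only:

```python
def normalize_to_first_turn(turn_feats, first_turn_feats):
    """Normalize a turn's raw acoustic-prosodic features by the values
    observed in the dialogue's first turn, to help control for variation
    across students. Ratio normalization is an assumption here."""
    return {name: value / first_turn_feats[name]
            for name, value in turn_feats.items()
            if first_turn_feats.get(name)}   # skip zero/missing baselines
```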

Our lexical features represent ITSPOKE's best speech recognition hypothesis of what is said in each student turn, as a word occurrence vector (indicating the lexical items that are present in the turn).


Although the UNC-ITSPOKE-WOZ corpus was manually transcribed, we ran it through the automatic speech recognition model to obtain an automatic transcript and then used the 627 unique lexical items from the automatic transcript as our lexical features, since they more closely approximated the lexical items available to the uncertainty model during the actual experiment.

Our dialogue features include the problem name and the tutor goal names that make up the prior tutor turn. Both are potentially important features since they repeat across students. Tutor goal names refer to the elements that make up tutor turns. Tutor turns contain one or more goals, and each goal consists of one or more utterances. For example, in Figure 2, the last tutor turn contains tutor feedback (the first sentence) followed by two tutor goals (the last two sentences, respectively). Although the goal names are not in and of themselves meaningful (e.g., ELEV-REM-2011), they enable tutor goals to be reused multiple times in a dialogue, and they also enable us to determine which tutor utterances are repeated across students. Since particular tutor utterances may be associated with student uncertainty, tutor goal names are a potentially important feature for predicting uncertainty. Like the lexical items, the tutor goal names feature is represented with an occurrence vector, since more than one can be present in a tutor turn.
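Both occurrence-vector features can be computed the same way; a minimal sketch over a fixed vocabulary (the names here are ours):

```python
def occurrence_vector(items, vocabulary):
    """Binary occurrence vector over a fixed vocabulary, as used for the
    recognized lexical items and the tutor goal names."""
    present = set(items)
    return [1 if entry in present else 0 for entry in vocabulary]

# e.g. occurrence_vector(["i", "dont", "know"],
#                        ["i", "dont", "know", "wind"]) -> [1, 1, 1, 0]
```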

The turn number feature provides a representation of where the turn is located in the dialogue, while the running totals and averages provide a generalization of the student's acoustic-prosodic behavior so far in each dialogue.

Finally, we manually recorded a "gender" feature for each turn. Prior studies [53, 54, 48] have shown that gender can play an important role in emotion research. Although some of this prior research has built separate emotion models for each gender, we use gender as an input feature.

5.2. Uncertainty Detection using Machine Learning

Machine learning experiments were then performed on the uncertainty-annotated student turns in the UNC-ITSPOKE-WOZ corpus. Preliminarily, we ran a number of experiments comparing the performance of eight different WEKA learning algorithms on our dataset. Specifically, we compared a support vector machine algorithm (SMO), a decision tree algorithm (J48), a nearest neighbor learner (IB1), two rule learners (PART, OneR), a naive Bayes algorithm (NaiveBayes), and two logistic regression algorithms


• Acoustic-Prosodic Features
  - 4 fundamental frequency (f0) features: maximum, minimum, mean, standard deviation
  - 4 energy (RMS) features: maximum, minimum, mean, standard deviation
  - 3 temporal features: turn duration, prior pause duration, internal silence

• Lexical and Dialogue Features
  - ITSPOKE-recognized lexical items in turn
  - tutor goal name
  - problem name
  - turn number
  - per-dialogue running totals and averages for 11 acoustic-prosodic features

• Identifier Feature
  - subject gender

Figure 8: Features Used to Predict Uncertainty for each Student Turn

(Logistic, SimpleLogistic). We found that the SimpleLogistic algorithm outperformed the other two highest-performing algorithms (J48 and SMO) for both accuracy and uncertainty precision but not recall, although the differences between these three algorithms were not drastic for any metric. We decided that the SimpleLogistic algorithm was preferable for two main reasons. First, we wanted an easily interpretable (and easily implementable) uncertainty model. SimpleLogistic and J48 were better than SMO from this point of view. Because SimpleLogistic has built-in feature selection, it yielded a substantially smaller (and probably more generalizable) model than J48.^8 Second, we decided that precision was more important than recall.^9

After these preliminary experiments, we used the SimpleLogistic algorithm to build our uncertainty model.

^8 Variable tweaking (e.g., pruning) can yield smaller trees, but we found this decreased performance.
^9 In retrospect this conclusion was probably not correct, as discussed in Section 10.


This algorithm outputs an equation representing a linear relationship between the automatically selected weighted input features and the logit-transformed probability that the target categorical variable is equal to 1 (in our case, that the turn is uncertain). The feature weights are chosen by the algorithm via a maximum likelihood method, with larger weights representing more important features with respect to predicting the target category (uncertain).
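In symbols, writing x_i for the selected feature values, w_i for their learned weights, and w_0 for the constant, this is the standard logistic regression form (restated concretely in Figure 9 and the probability computation given later in this section):

\[
\mathrm{logit}(p) \;=\; \ln\frac{p}{1-p} \;=\; R \;=\; w_0 + \sum_i w_i x_i,
\qquad
p \;=\; \frac{e^{R}}{1+e^{R}}
\]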

We trained and tested our uncertainty model using 5-fold cross-validation over the UNC-ITSPOKE-WOZ corpus.^10 Note however that the uncertainty model's performance on incorrect turns is irrelevant, because the uncertainty adaptation is only given to correct+uncertain turns. In other words, the uncertainty model's performance needs to be optimized for the correct turns only. Including incorrect turns in our testing datasets would be misleading, since it would mask the model's performance on correct turns.

There are two possible sets of correct student turns to use for testing: manually-labeled and automatically-labeled. We chose to test on the manually-labeled corrects, but because our automatic correctness labeling was noisy, some turns receiving the adaptation during the controlled experiment will actually be incorrect (because they are mislabeled as correct). Though one could try to incorporate this into the testing set by testing on the automatically-labeled corrects, there is no way to ensure that the same kinds of incorrects will be mislabeled across two different user populations (the UNC-ITSPOKE-WOZ and UNC-ITSPOKE corpora). Moreover, in our prior ITSPOKE corpora, the precision of correctness labeling was 92.5%, and it is 94.7% in the UNC-ITSPOKE corpus, as discussed below in Section 8.1. This indicates that these false positives are quite rare in our system. Therefore we believe that either set of corrects is a reasonable test set for the uncertainty model.

We created 5 training sets from 5 different 4/5 splits of the UNC-ITSPOKE-WOZ corpus. The corresponding 1/5 held-out dataset was used as the test set after discarding the incorrect turns. However, we retained both correct and incorrect turns in our training sets, because preliminary experiments showed that including incorrect turns improved the model's performance during testing. We believe this is because both types of turns have features useful for labeling uncertainty in correct turns.
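A minimal sketch of this split construction, assuming each turn is a dict with a boolean "correct" field (random rather than student-independent folds, per footnote 10; the names are ours):

```python
import random

def corrects_only_folds(turns, k=5, seed=0):
    """Yield k (train, test) splits: training retains all turns, while
    each held-out test fold keeps only the (manually labeled) correct
    turns, mirroring the evaluation described above."""
    turns = list(turns)
    random.Random(seed).shuffle(turns)
    folds = [turns[i::k] for i in range(k)]
    for i in range(k):
        test = [t for t in folds[i] if t["correct"]]   # discard incorrects
        train = [t for j in range(k) if j != i for t in folds[j]]
        yield train, test
```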

^10 The folds were not student-independent, both because real-world tutoring systems are typically used by the same student multiple times, and because in our prior work we found that testing on one speaker and training on the others, averaged across all speakers, does not significantly change the accuracy of emotion prediction in our system corpora [14].


This cross-validation represented our estimate of how the uncertainty model would perform during the controlled experiment, which is described next in Section 6. In Section 8 we evaluate the actual performance of the uncertainty model during that experiment.

Table 2 shows averages and standard deviations for the results of the cross-validation with the logistic regression algorithm in terms of accuracy (ACC), as well as precision (P), recall (R), and F-measure (Fm) for both uncertain (UNC) and certain (CER) turns. The penultimate column shows the average absolute percentage of automatically labeled uncertain turns (Auto UNC), while the last column shows the average absolute percentage of true positive uncertains (TP UNC), that is, the absolute percentage of turns that were both manually and automatically labeled uncertain. Note that the upper bound of the last column is 11.1%, which is the absolute percentage of manually labeled COR+UNC in the UNC-ITSPOKE-WOZ corpus (see Table 1). This means that on average 8.2% of manually labeled uncertain turns were automatically labeled certain.

Table 2: Results of 5-fold Cross-Validation Experiment with SimpleLogistic Algorithm onthe “Corrects-Only” Subset of the UNC-ITSPOKE-WOZ Corpus

      ACC    UNC P   UNC R   UNC Fm   CER P   CER R   CER Fm   Auto UNC   TP UNC
AV   84.4%  40.6%   20.6%   27.0%    87.9%   95.0%   91.3%    7.2%       2.9%
SD    1.5%   6.1%    2.8%    1.7%     2.1%    1.2%    1.0%    1.3%       0.4%

We compare our results in Table 2 with the results of majority class (CER) labeling of the same turns; this is equivalent to the original non-adaptive ITSPOKE system, which ignores uncertainty. Table 1 above showed that 4420/5147 correct turns in the UNC-ITSPOKE-WOZ corpus are certain. Always selecting the majority class (CER) label for these turns thus yields 85.9% accuracy (with 0% precision and recall for uncertainty, and 85.9% precision and 100% recall for certainty).

As a final comparison, we computed the unweighted average recall, (Recall(UNC) + Recall(CER))/2, resulting from the cross-validation; this metric is commonly used for unbalanced two-class problems (cf. [8, 9]). Our unweighted average recall was 57.8%, which is significantly above chance (50%).^11
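Plugging the average recall values from Table 2 into this metric reproduces the reported figure:

\[
\mathrm{UAR} \;=\; \frac{R_{\mathrm{UNC}} + R_{\mathrm{CER}}}{2}
\;=\; \frac{0.206 + 0.950}{2} \;\approx\; 0.578
\]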

Based on these cross-validation results we did not expect our uncertainty model to outperform the majority class labeling with respect to overall accuracy. However, we viewed uncertainty precision and recall as much more important than accuracy, because accuracy conflates performance on the uncertainty and certainty classes. F-measure is also useful as it combines precision and recall into a single value (the harmonic mean). As shown, our model's precision, recall and F-measure were very high for the CER class but much lower for the UNC class, although it well outperformed the majority class labeling on these metrics. As further shown in the table, based on the cross-validation we estimated that the model would average 7.2% automatically-labeled COR+UNC turns but only 2.9% true positive COR+UNC turns (manually and automatically labeled COR+UNC) when used in the fully automated experiment. This meant that on average 4.3% of the automatically-labeled COR+UNC turns would be manually labeled COR+CER. Although we believed uncertainty precision was important, we hypothesized that it would not be harmful for some false positives to receive the adaptation, since it could help increase the certainty of those turns. This hypothesis was based on the results of our prior UNC-ITSPOKE-WOZ system evaluation, which showed that giving the uncertainty adaptation to a percentage of both uncertain and certain turns did not negatively impact learning [13].

We then ran the logistic regression algorithm over the entire dataset. The resulting logistic regression equation is shown in Figure 9. This equation is computed for each student turn, using the extracted feature values for the turn. Then, given the summed result R for the turn, the probability p that the turn is uncertain is computed as p = exp(R)/(1 + exp(R)), where p >= 0.5 indicates that the turn is uncertain and p < 0.5 indicates that the turn is certain.

As shown (reading left to right along each row), the model contains a constant, numerous temporal, pitch and energy features (AV indicates running average features), various lexical items, gender, a specific problem name (Plane), and a wide variety of specific tutor goal names (shown in capital letters).

^11 The 95% confidence interval for our unweighted average recall is +/- 0.012, yielding a lower limit of 56.6%, which is still greater than 50%.


Figure 9 also shows that the most useful features come from a variety of different feature types. To illustrate this, we have bolded all features whose weights are above an absolute value of 0.8. As shown, mean pitch is the most strongly weighted acoustic-prosodic feature. The higher the mean pitch, the more likely the turn is to be labeled as uncertain; this may represent rising intonation, which often accompanies uncertainty.

The more strongly weighted lexical items are all positively weighted, and thus may each be associated with common student answers to questions that cause uncertainty. For example, "wind" is probably a common answer to questions asking about forces besides gravity on freefalling objects. Self-reference ("I") may be associated with common hedging phrases such as "I don't know" and "I guess". The model also includes "not", "sure", and "don't", which may be associated with hedging phrases such as "I'm not sure" and "I don't know".

"ASK-DO-ANOTHER-PROBLEM" is a tutor goal name that refers to the yes/no question given at the end of every dialogue. Not surprisingly, students are highly unlikely to be uncertain when answering this question. "PUMPKIN-DIALOGUE15" refers to a tutor question that asks whether a man exerts force on a pumpkin as he throws it in the air. Based on this model, it appears that students are unlikely to be uncertain about their answer to this question.

Because the model uses so many different feature types to predict uncertainty, and the stronger features come from most of them, speech features alone would likely not provide enough information to accurately detect uncertainty. This conclusion is supported by preliminary machine learning experiments which showed that excluding dialogue features (namely, tutor goal name) significantly decreased performance. Our prior results on detecting affective states using only speech features also support this conclusion [14]. Domain-dependent features are also needed to detect uncertainty in interactions with our system. This indicates that our uncertainty model would not port well to domains where only the speech features would be applicable.

5.3. Adding Uncertainty Bypass Heuristics to the Final Uncertainty Model

The lexical features in the uncertainty equation in Figure 9 only capture unigrams. However, there are a wide variety of common multi-word hedging phrases (e.g., "I guess", "I think") that are well-known indicators of uncertainty in tutoring [24, 31]. Manual inspection of our manually and automatically transcribed student turns indicated that hedging words such


R = -1.35
  + [priorPause] * 0.01            + [maxf0] * 0.14
  + [meanf0] * 0.81                + [maxRMS] * -0.05
  + [minRMS] * -0.11               + [AVpriorPause] * -0.01
  + [AVmaxf0] * -0.29              + [action] * -0.49
  + [always] * 1.23                + [but] * -0.42
  + [changes] * 0.75               + [distance] * 1.22
  + [dont] * 0.55                  + [down] * -0.17
  + [higher] * 0.78                + [I] * 1
  + [increasing] * 0.37            + [mass] * 0.4
  + [not] * 0.77                   + [of] * -0.19
  + [other] * 0.44                 + [out] * 0.68
  + [sure] * 0.53                  + [till] * 0.9
  + [vertically] * -0.4            + [wind] * 0.92
  + [yes] * -0.5                   + [gender] * 0.24
  + [problem=Plane] * -0.1         + [CRASH-DIALOGUE20A] * 0.74
  + [ASK-CALL-EXPERIMENTER] * -0.73 + [ASK-DO-ANOTHER-PROBLEM] * -1.18
  + [CRASH-DIALOGUE20B] * -0.42    + [CRASH-DIALOGUE20C] * 0.75
  + [CRASH-DIALOGUE21B] * 0.68     + [CRASH-DIALOGUE23] * 0.31
  + [ELEV-DIALOGUE10] * 0.73       + [ELEV-DIALOGUE14] * 0.62
  + [ELEV-DIALOGUE18] * 0.51       + [ELEV-DIALOGUE20] * 0.41
  + [ELEV-DIALOGUE22] * 0.49       + [ELEV-DIALOGUE28] * -0.57
  + [ELEV-DIALOGUE32] * 0.67       + [ELEV-DIALOGUE38] * -0.35
  + [ELEV-DIALOGUE39] * 0.48       + [ELEV-DIALOGUE39B] * -0.45
  + [ELEV-REM-2011] * 0.64         + [ELEV-REM-2012A] * 0.76
  + [ELEV-REM-2211] * 0.79         + [ELEV-REM-221A] * 0.69
  + [PLANE-DIALOGUE28] * 0.41      + [PLANE-REM-12112] * 0.66
  + [PLANE-REM-2614] * -0.66       + [PLANE-REM-2811] * 0.55
  + [PUMPKIN-DIALOGUE20] * 0.46    + [PUMPKIN-DIALOGUE15] * -1.2
  + [PUMPKIN-REM-201a] * -0.63     + [PUMPKIN-DIALOGUE28] * -0.57
  + [PUMPKIN-DIALOGUE30A] * -0.41  + [PUMPKIN-DIALOGUE30B] * -0.44
  + [PUMPKIN-DIALOGUE32] * 0.58    + [PUMPKIN-DIALOGUE34] * 0.49
  + [PUMPKIN-DIALOGUE35A] * 0.57   + [PUMPKIN-DIALOGUE35B] * 0.43
  + [PUMPKIN-REM-32121] * 0.75     + [PUMPKIN-REM-3212A] * 0.77
  + [SUN-DIALOGUE12] * 0.48        + [SUN-DIALOGUE14] * -0.65
  + [SUN-DIALOGUE18] * -0.36

Figure 9: Logistic Regression Equation for Predicting Uncertainty

as "guess" and "think" do occur in our data. However, they do not show up in the regression output, even though they were included as lexical input features. It may have been that their sparse occurrence in the data prevented their inclusion in the automatic uncertainty model.

We therefore supplemented our purely data-driven model with additional knowledge about the usefulness of hedge phrases suggested in the uncertainty literature. In particular, we hypothesized that the number of accurately labeled uncertains in UNC-ITSPOKE would be increased by including in our final uncertainty model a set of hedging phrase "Bypass Heuristics" that were not already captured by our logistic regression equation. Our set of Bypass Heuristics is shown in Figure 10. It is based on lists of hedging phrases in prior tutoring research (e.g., [24]); however, each item consists only of the minimal necessary words (e.g., noun phrases aren't included), because the automatic speech recognition cannot necessarily recognize all of the words in a given hedging phrase. The Bypass Heuristics yielded an uncertainty prediction as follows: if a student turn contained a hedging phrase, it was automatically labeled as uncertain.


believe(ed)       guess(ing/ed)       think(ing)/thought
could/can         may(be)/might       should
probabl(e/y)      possibl(e/y)        do not/don't remember
um/uh             do not/don't know   can not/can't remember(ing/ed)
not sure

Figure 10: Hedging Phrase Bypass Heuristics added to the Final Uncertainty Model

5.4. Implementing the Final Uncertainty Model

Returning now to Figure 5, the implementation of our final uncertainty model into UNC-ITSPOKE involved first computing the features in Figure 8 for each student turn. As shown by the red dashed arrows and boxes in Figure 5, the pitch, energy and lexical features were computed from the student speech within the student front end. In particular, the pitch and energy features were computed by running the ESPS program get_f0 [51, 52] on the speech recognizer's audio output file, which yielded an f0 estimate and an RMS estimate for each 10ms frame in the turn. The relevant statistics (i.e., maximum, minimum, mean, running average) were then computed and normalized to the feature values of the first turn in the dialogue. The lexical features were directly provided by the speech recognizer's recognized string. All other features were provided by the dialogue manager. In particular, when beginning the experiment students input their gender to the system; this was recorded in the dialogue manager. The current tutor goal name was also recorded in the dialogue manager.

The dialogue manager then computed the logistic regression equation's prediction from these weighted features. The dialogue manager also computed the Bypass Heuristics' prediction by comparing the stored heuristics to the automatically recognized student turn string.

The final uncertainty value was then produced by combining the predictions of the Bypass Heuristics and the logistic regression equation in an inclusive OR relationship: a student turn was labeled uncertain if either the Bypass Heuristics or the logistic regression equation predicted uncertain (or if both did so).
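A minimal sketch of this final decision, with the Figure 10 phrases approximated as regular expressions (the exact matching procedure is not specified in the paper, so the patterns below are illustrative simplifications):

```python
import re

# Hedging phrase patterns, after Figure 10 (illustrative simplifications).
HEDGES = [
    r"\bbelieved?\b", r"\bguess(ing|ed)?\b", r"\bthink(ing)?\b", r"\bthought\b",
    r"\b(could|can)\b", r"\bmay(be)?\b", r"\bmight\b", r"\bshould\b",
    r"\bprobabl[ey]\b", r"\bpossibl[ey]\b", r"\b(um|uh)\b", r"\bnot sure\b",
    r"\b(do not|don'?t) (know|remember)\b",
    r"\b(can not|can'?t) remember(ing|ed)?\b",
]

def hedge_bypass(recognized_turn):
    """True if the recognized turn string contains any hedging phrase."""
    text = recognized_turn.lower()
    return any(re.search(pattern, text) for pattern in HEDGES)

def label_uncertain(recognized_turn, regression_prob):
    """Final label: inclusive OR of the Bypass Heuristics and the
    logistic regression prediction (p >= 0.5)."""
    return hedge_bypass(recognized_turn) or regression_prob >= 0.5
```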

6. The Experiment

To evaluate the performance of UNC-ITSPOKE, we performed a controlled experiment comparing UNC-ITSPOKE to two control systems. Note that uncertainty was automatically computed and logged in all 3 conditions.

22

Page 23: Benefits and Challenges of Real-Time Uncertainty Detection ...forbesk/papers/spcomm11.pdfExamples of wizarded computer tutor evaluations include Tsukahara and Ward [26], where positive

that uncertainty was automatically computed and logged in all 3 conditions.In the UNCAdapt experimental condition, students used UNC-

ITSPOKE (Section 4); this system gave the uncertainty adaptation to allCOR+UNC student turns. .

In the NOAdapt control condition, students used non-adaptive IT-SPOKE (Section 3); this system ignored student uncertainty.

In the Random control condition, students used a randomly adap-tive version of ITSPOKE, which provided the uncertainty adaptation to arandomly selected 11% of COR+CER turns. The Random condition wasincluded in order to control for the “additional” tutoring content that theexperimental condition received. The Random condition provided this sameadditional tutoring content to 11% of student turns, but those turns were notuncertain (as determined by the automatic uncertainty model). We chose11% because this is the percentage of COR+UNC turns found in the UNC-ITSPOKE-WOZ corpus [13] (see Table 1). We anticipated that the percent-age would be about the same in the UNC-ITSPOKE corpus. To achieve therandom selection, we sent the output of a random number generator to thedialogue manager whenever a turn was labeled by the system as COR+CER;if this number was less than or equal to 0.11 then the turn was given theuncertainty adaptation.
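
The selection rule amounts to the following check (a sketch; the 0.11 threshold is from the text, the function name is hypothetical):

    import random

    def give_random_adaptation(system_label, threshold=0.11):
        """Random condition: adapt to roughly 11% of turns the system
        labels COR+CER (correct and certain)."""
        return system_label == "COR+CER" and random.random() <= threshold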

Subjects were students recruited through flyers posted around campus and were paid for their participation in the experiment. Subjects were required to have never before used an ITSPOKE (or ITSPOKE-related) system. Subjects were randomly assigned to the conditions, except that conditions were gender-balanced and pretest-balanced; to achieve these balances, we kept a running average of pretest score and total males in each condition, and assigned subjects to conditions based on which assignment would keep the averages similar across all conditions. All subjects were native speakers of English who had never taken college-level physics.

The experimental procedure was as follows. Each subject: i.) read a physics text introducing the concepts to be tutored (10-30 minutes); ii.) completed a pretest of 26 multiple-choice questions (10-30 minutes); iii.) used a web/voice interface to work through 5 physics “training problems” with a version of the system (depending on condition) (30-75 minutes); iv.) completed a survey questionnaire (10-20 minutes); v.) completed a posttest isomorphic to the pretest (10-30 minutes); vi.) completed a final physics “test” problem that was isomorphic to the 5th physics problem, using the non-adaptive version of the system (10-20 minutes).12 The total length of the experiment ranged from about 1.5 to 3.0 hours across subjects.

12 The survey and test problem are not evaluated in this paper; see [41, 40] for their uses.

The resulting UNC-ITSPOKE corpus contains 360 dialogues from 72 subjects: 24 subjects in the UNCAdapt condition, 25 in the NOAdapt condition, and 23 in the Random condition. Of these 72 subjects, 47 are female and 25 are male, with 7-10 males per condition. The corpus contains 7216 student turns; these turns were transcribed after the experiment and manually labeled for uncertainty and correctness.13 Table 3 shows the distribution of manually labeled uncertainty and correctness in the UNC-ITSPOKE corpus. As shown, the percentages of uncertain, correct, and COR+UNC turns are slightly lower than in the UNC-ITSPOKE-WOZ corpus (see Table 1).

13 For consistency, this labeling was performed by the wizard from the first, wizarded version of this experiment [13].

Table 3: Distribution of Manually Labeled Turns in UNC-ITSPOKE Corpus (N = 7216)

Manual Turn Label Total Percent

Uncertain              1483    20.6%
Certain                5733    79.4%
Correct                5284    73.2%
Correct+Uncertain       650     9.0%
Correct+Certain        4634    64.2%

7. Evaluating the Uncertainty-Adaptive System’s Global Impact

Our initial goal for evaluating UNC-ITSPOKE can be stated as follows: Did the uncertainty adaptation have a positive global performance impact, measured as overall student learning?

Evaluation Hypothesis #1: We hypothesized that the UNCAdapt condition, which gave the uncertainty adaptation after all COR+UNC turns, would yield significantly higher overall learning than both the NOAdapt condition, which ignored uncertainty, and the Random condition, which gave the uncertainty adaptation to 11% of COR+CER turns. To test this hypothesis, we compared normalized learning gain across conditions, computed as (posttest - pretest)/(1 - pretest). This metric is the preferred measure of learning in the tutoring literature because it controls for the fact that students with higher pretest scores do not have as much room for improvement as those with lower pretest scores [55]. Table 4 shows the means and standard deviations across conditions for normalized learning gain, along with the number of students (N). As shown, the UNCAdapt condition did yield higher normalized learning gain than both the NOAdapt and Random conditions. In addition, the Random condition yielded higher normalized gain than the NOAdapt condition. However, none of these differences were significant.14 Hypothesis #1 was thus not confirmed.

14 We found the same pattern of results for raw learning gain (posttest - pretest) and learning efficiency (learning gain/total tutoring time), whereby UNCAdapt outperformed Random and NOAdapt, but not significantly.

Table 4: Differences in Student Learning Gain Across Condition

Condition N Mean SD

NOAdapt 25 0.41 0.25

UNCAdapt 24 0.49 0.20

Random 23 0.48 0.24

Evaluation Hypothesis #2: We next hypothesized that receiving the uncertainty adaptation would be better for learning than not receiving it. This hypothesis was based on the means in Table 4, where both UNCAdapt and Random have higher mean learning than NOAdapt. To test this hypothesis, we calculated the percentage of uncertainty adaptations received by each student (PctAdapts), and then computed a Pearson’s correlation between PctAdapts and normalized learning gain over all students. Hypothesis #2 was confirmed: PctAdapts significantly correlates with increased learning (R=0.230, p=0.05).15

15 This result is unlikely to be an artifact of increased tutoring time, because PctAdapts is not significantly correlated with total tutoring time (R=0.15, p=0.20) or total student turns (R=0.19, p=0.11).
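
This per-student correlation can be reproduced with standard tools; a minimal sketch, with hypothetical toy data (one entry per student):

    from scipy.stats import pearsonr

    # pct_adapts: percentage of a student's turns that received the adaptation
    # norm_gain:  the student's (posttest - pretest) / (1 - pretest)
    pct_adapts = [0.0, 2.1, 3.6, 5.0, 8.6, 11.0]
    norm_gain = [0.30, 0.35, 0.42, 0.47, 0.50, 0.55]

    r, p = pearsonr(pct_adapts, norm_gain)
    print(f"R = {r:.3f}, p = {p:.3f}")  # the paper reports R = 0.230, p = 0.05 over all 72 students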

Evaluation Hypothesis #3: We next hypothesized that although we didn’t find significant learning differences overall (Hypothesis #1), we would find them for the subset of students who received the most uncertainty adaptations. This hypothesis was based on the significant positive correlation we found between PctAdapts and learning (Hypothesis #2). To test our third hypothesis, students with PctAdapts values below the mean (4.0% over all students) were grouped as “Low PctAdapt”, and all other students were grouped as “High PctAdapt”. A t-test over mean PctAdapts confirmed that the Low and High groups differ significantly (t(70)=-12.63, p<0.001). A 2x3 factorial ANOVA with pairwise simple effects tests showed significant differences between the High PctAdapt group and the Low PctAdapt group (F(1,67) = 5.298, p=0.024). Note however that because the sizes of the UNCAdapt and Random subsets were quite small in one of the groups, our results should be taken as suggestive rather than conclusive. Tables 5-6 show the pairwise test results for the Low and High PctAdapt groups, respectively. The columns show the condition, number of students, mean learning gain, the condition with which a significant difference is found (if any), and the direction (> or <) and p-value of this difference. These tables show that Hypothesis #3 was not confirmed: we predicted that in the High PctAdapt group the UNCAdapt condition would learn the most; instead, in the Low PctAdapt group the UNCAdapt condition learned significantly more than the Random condition.

Table 5: Learning Gain Differences in Low PctAdapt Students

Condition N Mean SD Diff p

NOAdapt 25 0.41 0.25 - -

UNCAdapt 16 0.51 0.21 > Random 0.04

Random 4 0.26 0.15 - -

Table 6: Learning Gain Differences in High PctAdapt Students

Condition N Mean SD Diff p

NOAdapt 0 - - - -

UNCAdapt 8 0.43 0.20 - -

Random 19 0.53 0.22 - -

Evaluation Hypothesis #4: We next hypothesized that the UNCAdapt condition actually received fewer adaptations than the Random condition, despite our experimental design (Section 6), which was intended to equalize the percentage of received adaptations across these two conditions. If confirmed, this hypothesis would explain why only the Low PctAdapt group learned more in the UNCAdapt condition than in the Random condition. Hypothesis #4 was confirmed in two ways. T-tests showed that within the High PctAdapt group (but not the Low PctAdapt group), the UNCAdapt condition received significantly fewer adaptations than the Random condition (t(25)=2.07, p=0.05). In addition, a one-way ANOVA with post-hoc pairwise tests showed that over all subjects, the UNCAdapt condition had a significantly lower PctAdapts than the Random condition (F(2,69) = 36.59, p<0.001).

The full set of pairwise test results is shown in Table 7. As shown, in the Random condition, 8.6% of student turns received the adaptation. This is slightly less than the 11% we expected; the difference is probably due to chance variation in the random number generation we used.16 In the UNCAdapt condition, only 3.6% of student turns received the adaptation. This is much less than the 7.2% we expected based on the percentage of COR+UNCs labeled by the uncertainty model during our cross-validation experiments in the UNC-ITSPOKE-WOZ corpus (Table 2). In Section 8 we investigate why fewer than expected turns received the adaptation.

16 In retrospect, it would have been more accurate to set the random adaptation percentage to 7.2%, which was the percentage of COR+UNCs labeled by the uncertainty model during the cross-validation experiments in the UNC-ITSPOKE-WOZ corpus (see Table 2), rather than to 11%, which was the percentage of manually labeled COR+UNCs in the UNC-ITSPOKE-WOZ corpus.

This unexpectedly small PctAdapts value in the UNCAdapt condition may explain why the UNCAdapt condition didn’t have significantly higher overall learning than the other conditions: although the learning means are in the right direction (Table 4), the UNCAdapt condition simply didn’t receive enough uncertainty adaptations to yield a significant improvement in test scores. The fact that the learning means are so similar between the UNCAdapt and Random conditions (Table 4), even though Random received more adaptations than UNCAdapt, suggests that giving the adaptation to COR+CER answers can benefit learning by increasing the certainty of those answers, but is not as effective as giving the adaptation to COR+UNC answers. If it were, the Random condition would have yielded higher learning gains than the UNCAdapt condition.17

17 The fact that we do not see this also suggests that it would not further benefit learning to simply give the uncertainty adaptation to all correct answers. Our prior results from our wizarded experiment also support this conclusion [13].

Table 7: Differences in Percentage of Uncertainty Adaptation Received (PctAdapts) Across Condition

Condition N Mean SD Diff p

NOAdapt 25 0% 0 < UNCAdapt, Random <0.01

UNCAdapt 24 3.6% 3.2

Random 23 8.6% 5.3 > UNCAdapt <0.01

Finally, a repeated-measures ANOVA confirmed overall learning and determined power and effect size. Students in all conditions did learn a significant amount during tutoring (F(1,69) = 225.688, p<0.001). The observed power in the experiment (at p = 0.05) was 0.14, with an effect size of 0.02 (partial eta squared); this indicates that the uncertainty adaptation had only a small effect on learning in the UNCAdapt condition, so there was little likelihood of detecting a significant learning improvement over the other conditions.

We conclude from the evaluations presented in this section that the uncertainty adaptation had only a small effect on learning because students in the UNCAdapt condition did not receive enough uncertainty adaptations. We speculate that if the UNCAdapt condition had received more adaptations, the adaptation would have had a larger effect size and thus would have yielded significantly improved learning over the control conditions. These conclusions are supported by a power analysis of our prior UNC-ITSPOKE-WOZ experiment, where students in the experimental condition received the adaptation in 11.1% of turns and showed significantly improved learning over the control conditions; the observed power (at p = 0.05) in that experiment was 0.73, with an effect size of 0.11.

8. Evaluating the Speech Recognition, Correctness, and Uncertainty Labeling Components

Our next goal for evaluating UNC-ITSPOKE can be stated as follows: Why didn’t students in the UNCAdapt condition receive enough uncertainty adaptations to significantly improve overall learning?

Evaluation Hypotheses #5 and #6: We hypothesized that poor performance of UNC-ITSPOKE’s automatic correctness and uncertainty labeling components was responsible. In other words, too many COR+UNC turns were mislabeled as either incorrect or certain, and thus these turns didn’t receive the adaptation when they should have.

Because UNC-ITSPOKE’s automatic correctness and uncertainty labeling components are implemented separately, and thus can be optimized independently of each other, we analyzed the performance of these components separately (Hypotheses #5 and #6, respectively).

For illustrative purposes in this section, Figure 11 shows an annotated example from the UNC-ITSPOKE corpus dialogues. The student-auto lines show the automatically recognized speech and the automatic uncertainty and correctness labels for each student turn, while the student-manual lines show the corresponding manual transcript (including punctuation, for clarity) and manual annotations.

tutor1: Let’s begin by looking at the motion of the man and his keys while he’s holding them. How does his velocity compare to that of his keys?

student-auto1: his also the is the same as that of his keys points force (INC CER)

student-manual1: His velocity is the... same... as that of his keys while he’s holding them? (COR UNC)

tutor2: Actually... Their velocities must be the same, because the man’s holding his keys, so they’re not moving with respect to each other. [...] What are the forces exerted on the man after he releases his keys?

student-auto2: the only force is (INC CER)

student-manual2: The only force is... (INC UNC)

tutor3: Well... Gravity is the only force acting on the man. [...] So what’s the direction of the force of gravity on the man (e.g., vertically up)?

student-auto3: that in the pull in the man vertically down (COR CER)

student-manual3: Gravity will be pulling the man vertically down. (COR CER)

Figure 11: Annotated UNC-ITSPOKE Corpus Example

8.1. Evaluating Speech Recognition and Correctness Labeling

Since the performance of the automatic correctness labeling depends on the automatic speech recognition output, we begin by evaluating the performance of that component. Overall speech recognition accuracy in the UNC-ITSPOKE corpus was 74.6%, measured as 100% - Word Error Rate (WER). WER is a standard measure for evaluating the performance of automatic speech recognition (ASR) software. We compute WER using the NIST sclite program to compare our automatically recognized transcripts with our manual transcripts.18
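
Although the paper does not spell it out, the standard definition behind this measure counts the word-level substitutions (S), deletions (D), and insertions (I) in the sclite alignment against the N words of the manual reference transcript:

    WER = (S + D + I) / N,  so that recognition accuracy = 100% - WER.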

Semantic accuracy is more important for spoken dialogue systems than speech recognition accuracy, because semantic accuracy determines the next dialogue transition. In our system, semantic accuracy equates with correctness accuracy: even if the speech recognition is not entirely accurate, if the correctness value is accurate the dialogue will transition to the appropriate next state. We thus measure semantic accuracy in the UNC-ITSPOKE corpus as the percentage of turns in which the automatic correctness label of the ASR input string matches the manual correctness label, scoring 1 for matches and 0 for mismatches. For example, in Figure 11, the first student turn has a semantic accuracy of 0, while the second and third have a semantic accuracy of 1. This comparison yielded a semantic accuracy of 84.7% over the whole UNC-ITSPOKE corpus, which is substantially higher than the speech recognition accuracy (74.6%). For example, in Figure 11, the third student turn has a WER of 62.5%, but since the important words for determining correctness are properly recognized, the semantic accuracy is 1.

Table 8 shows the confusion matrix for the automatic and manual correctness labels across the 7216 turns in the UNC-ITSPOKE corpus. The automatic correctness labeling yielded a precision of 0.947, a recall of 0.837, and an F-measure of 0.889 for the ‘correct’ label.

We conclude that Hypothesis #5 was confirmed: noisy automatic correctness labeling was one factor responsible for the low percentage of uncertainty adaptations in the UNCAdapt condition, because incorrectness was favored over correctness: false negatives (863) were much more frequent than false positives (247).

We thus speculate that it is important to improve correctness recall in future versions of UNC-ITSPOKE, so that more COR+UNC turns receive the uncertainty adaptation.

18 Note that in prior work we evaluated the impact of ITSPOKE’s speech recognition performance, and found that speech recognition errors are not related to learning gain in ITSPOKE [36, 56].


Table 8: Confusion Matrix for Correctness in the UNC-ITSPOKE corpus (N = 7216)

                      AUTOMATIC CORRECT    AUTOMATIC INCORRECT
MANUAL CORRECT              4421                   863
MANUAL INCORRECT             247                  1685
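
The reported precision, recall, and F-measure for the ‘correct’ label follow directly from the Table 8 counts, as this small sketch verifies:

    # Counts from Table 8, treating 'correct' as the positive class.
    tp = 4421  # manual correct, automatic correct
    fn = 863   # manual correct, automatic incorrect
    fp = 247   # manual incorrect, automatic correct

    precision = tp / (tp + fp)   # 4421 / 4668 = 0.947
    recall = tp / (tp + fn)      # 4421 / 5284 = 0.837
    f_measure = 2 * precision * recall / (precision + recall)  # = 0.889
    print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")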

8.2. Evaluating the Uncertainty Model

Because the uncertainty adaptation was given only to correct turns, we focus here on how the uncertainty model performed on the manually-labeled correct turns in the UNC-ITSPOKE corpus. A benefit of focusing on these turns is that we evaluate the uncertainty model’s performance assuming perfect correctness labeling. We can also compare this performance with our earlier cross-validation results in the UNC-ITSPOKE-WOZ corpus (Section 5), which also used manually-labeled correct turns. A drawback of this analysis is that it doesn’t show how the model performed on incorrects that were mislabeled as correct in the UNC-ITSPOKE corpus. However, as discussed in Section 8.1, the precision of correctness labeling in UNC-ITSPOKE was 0.947, indicating that these false positives are quite rare.

First, we present the uncertainty model’s performance on all manually-labeled correct turns in the UNC-ITSPOKE corpus (“All Manual Corrects”). These results show the model’s performance if the automatic correctness labeling had always worked perfectly during the experiment. Table 9 shows the confusion matrix for the manual and automatic uncertainty labels on these 5284 turns. From these labels we computed the performance results shown in the first data row of Table 10. For comparison, the second data row recalls the cross-validation results on the manually-labeled correct turns in the UNC-ITSPOKE-WOZ corpus (originally presented in Table 2). As shown, accuracy and precision for uncertainty are both slightly higher in the UNC-ITSPOKE corpus, while recall and F-measure for uncertainty are slightly lower. Moreover, the model automatically labeled slightly fewer turns as uncertain than expected (Auto UNC), and fewer turns than expected were both manually and automatically labeled uncertain (Agreed UNC).

Note however that the cross-validation results are based only on the logistic regression equation, not on the Bypass Heuristics; both of these elements made up the final uncertainty model and so are jointly responsible for the results for the UNC-ITSPOKE corpus shown in Table 9. We examined the relative usefulness of these two components within the automatically labeled uncertain turns shown in Table 9. Of the 109 turns that were both manually and automatically labeled uncertain, 92 (84.4%) were labeled uncertain by the logistic regression equation, 29 (26.6%) were labeled uncertain by the Bypass Heuristics, and 17 (15.6%) were labeled uncertain only by the Bypass Heuristics. This indicates that, as expected, including the Bypass Heuristics did increase the number of true positive uncertains labeled by our uncertainty model. Of the 255 turns that were manually labeled certain and automatically labeled uncertain, 151 (59.2%) were labeled uncertain by the logistic regression equation, 128 (50.2%) were labeled uncertain by the Bypass Heuristics, and 104 (40.8%) were labeled uncertain only by the Bypass Heuristics. This indicates that including the Bypass Heuristics also increased the number of false positives labeled by our uncertainty model. In future work we will investigate which of the hedge phrases in the Bypass Heuristics performed best and worst, in order to remove those that decreased the precision of the overall model.

Table 9: Uncertainty Confusion Matrix on “All Manual Corrects” Subset (N = 5284), UNC-ITSPOKE Corpus

                      AUTOMATIC UNCERTAIN    AUTOMATIC CERTAIN
MANUAL UNCERTAIN              109                   541
MANUAL CERTAIN                255                  4379

Table 10: Results for Uncertainty Model on “All Manual Corrects”

Dataset                              ACC     UNC P   UNC R   UNC Fm   Auto UNC   Agreed UNC

UNC-ITSPOKE Corpus (N = 5284)        84.9%   0.43    0.17    0.24     6.9%       2.1%
UNC-ITSPOKE-WOZ Corpus (N = 5147)    84.4%   0.41    0.21    0.27     7.2%       2.9%

Next, we present the uncertainty model’s performance on the subset of manually-labeled correct turns that were also automatically labeled correct (“Agreed Manual and Automatic Corrects”). These results represent the model’s performance on the subset of turns where the automatic correctness labeling actually did work perfectly during the experiment. Table 11 shows the confusion matrix for these 4421 turns in the whole UNC-ITSPOKE corpus. Table 12 shows the confusion matrix for these 1408 turns in the UNCAdapt condition only. Table 13 shows the performance results for these two subsets of agreed corrects. Comparison with Table 10 shows that the uncertainty model’s performance on these agreed corrects is higher on accuracy but lower on precision, recall, and F-measure for uncertainty, and the percentages of automatically-labeled uncertain turns and agreed uncertain turns are also lower. These results suggest that the agreed correct turns contain a lower proportion of uncertain turns than all manually-labeled corrects, and that the misrecognized correct turns missing from Table 11 were easier for the uncertainty model to label as uncertain. Finally, comparing the two data rows in Table 13 shows that the uncertainty model performed worse on all metrics in the UNCAdapt condition than in the UNC-ITSPOKE corpus overall.19

Table 11: Uncertainty Confusion Matrix in “Agreed Manual and Automatic Corrects” Subset (N = 4421), UNC-ITSPOKE Corpus

                      AUTOMATIC UNCERTAIN    AUTOMATIC CERTAIN
MANUAL UNCERTAIN               75                   446
MANUAL CERTAIN                141                  3759

Table 12: Uncertainty Confusion Matrix in “Agreed Manual and Automatic Correct” Subset of UNCAdapt Condition (N = 1408), UNC-ITSPOKE Corpus

                      AUTOMATIC UNCERTAIN    AUTOMATIC CERTAIN
MANUAL UNCERTAIN               18                   149
MANUAL CERTAIN                 43                  1198

19 For comparison, note that the Random and NOAdapt conditions yielded 2.6% and 1.2% Agreed UNC, respectively.


Table 13: Results for Uncertainty Model on “Agreed Manual and Automatic Correct” Subsets, UNC-ITSPOKE Corpus

Dataset                          ACC     UNC P   UNC R   UNC Fm   Auto UNC   Agreed UNC

UNC-ITSPOKE Corpus (N = 4421)    86.7%   0.35    0.14    0.20     4.9%       1.7%
UNCAdapt Condition (N = 1408)    86.4%   0.30    0.11    0.16     4.3%       1.3%

We conclude that Hypothesis #6 was confirmed: noisy automatic uncertainty labeling was a major factor responsible for the low percentage of uncertainty adaptations in the UNCAdapt condition. In particular, recall for uncertainty was lower than expected, both over the whole UNC-ITSPOKE corpus and within the UNCAdapt condition. In fact, the recall was so low that the absolute proportion of true positive COR+UNC turns that received the uncertainty adaptation was nearly the same in the UNCAdapt and Random conditions (0.92% and 0.84%, respectively). This is because many of the automatically labeled COR+CER turns randomly selected for adaptation in the Random condition were actually COR+UNC.

We speculate that the difference between the model’s expected performance based on our cross-validation experiments in the UNC-ITSPOKE-WOZ corpus, and its actual performance in the UNC-ITSPOKE corpus, can be attributed to four main factors. First, the two corpora were collected with different systems: UNC-ITSPOKE used automatic semantic and uncertainty labeling while UNC-ITSPOKE-WOZ used human labeling, and this may have led to different student dialogue behaviors (e.g., longer turns). Second, the two corpora represent different user populations; it may be that uncertainty is best represented by different features in these populations. The fact that the two corpora have different uncertainty distributions supports this possibility (see Tables 1 and 3). Third, during training and testing the uncertainty model used temporal features computed from manually labeled turn boundaries (Section 5.2), but during the controlled experiment the temporal features were computed from automatically labeled turn boundaries. This variation may have impacted the effectiveness of these features. Fourth, the uncertainty model was developed to optimize test performance on manually-labeled correct turns in the UNC-ITSPOKE-WOZ corpus (Section 5.2). It would have been more accurate to optimize test performance on the automatically-labeled correct turns in the UNC-ITSPOKE-WOZ corpus (some of which are manually labeled incorrect). However, since the correctness precision in the UNC-ITSPOKE corpus was 0.947, this probably did not have much impact on the uncertainty model’s performance.

We speculate that the best way to improve the model for future use is to improve its recall (as opposed to precision). This belief is based on the fact that the Random condition yielded similar learning gains overall to the UNCAdapt condition, which suggests that it is better from a learning perspective to adapt to both COR+UNCs and COR+CERs than to not adapt at all. However, as discussed in Section 7, the fact that the Random condition did not yield higher learning than UNCAdapt despite adapting to more turns overall suggests that it would not further benefit learning to give the adaptation to all correct answers.

9. Evaluating the Uncertainty-Adaptive System’s Local Impact

Our final goal for evaluating UNC-ITSPOKE can be stated as follows: Did correctly recognizing and adapting to uncertainty have a positive local impact on learning?

Hypothesis #8: We hypothesized that correctly recognizing and adapting to uncertainty would produce significant local learning in the UNCAdapt condition, even though the adaptation didn’t occur often enough to produce significant global learning overall (Section 7).

We tested this hypothesis by comparing impasse state transitions after successful and failed applications of the uncertainty adaptation. Although these failures could be due to mislabeling COR+UNC turns as either incorrect or certain, uncertainty labeling errors were much more severe than correctness labeling errors (Section 8). Thus to streamline our presentation we focused here on failures due to uncertainty labeling errors. We divided the “Agreed Manual and Automatic Correct” turns in the UNCAdapt condition (Table 12) into two groups: the UNC Model Success group contained all turns where the manual and automatic uncertainty labels agreed (the diagonal cells in Table 12), and the UNC Model Failure group contained all turns where the manual and automatic uncertainty labels did not agree (the off-diagonal cells in Table 12).


Within each group, we then computed for each student the likelihood of transitioning to each of the 4 (manually labeled) impasse states (see Figure 6) in turn n+1, both from an uncertain state in turn n and from a certain state in turn n. Hypothesis #8 predicts that the impasse states most likely to be transitioned to in turn n+1 would be less severe in the UNC Model Success group and more severe in the UNC Model Failure group.

Our transition likelihood metric is D’Mello et al.’s L [57]. Our analysis is based on McQuiggan et al. [58], who also use this metric to compare affective state transitions across two groups of turns that each received a different type of affective feedback. L(n→n+1) is computed as shown below, where n refers to the manually labeled impasse state in turn n and n+1 refers to the manually labeled impasse state in turn n+1. As shown, L computes the likelihood that the n→n+1 transition will occur. L=1 indicates that n+1 always follows n, while L=0 and L<0 indicate that the likelihood of transitioning from n to n+1 is equal to chance, and less than chance, respectively.

L(n→n+1) = (P(n+1|n) − P(n+1)) / (1 − P(n+1))
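
A minimal sketch of one simplified reading of this metric, computed for a single student from a sequence of labeled impasse states (the function and variable names are hypothetical; the metric itself is from [57]):

    def likelihood_L(states, prior, post):
        """L(prior -> post) = (P(post | prior) - P(post)) / (1 - P(post)),
        estimated from one student's sequence of impasse-state labels."""
        n = len(states)
        p_post = states.count(post) / n                    # base rate P(n+1)
        prior_idx = [i for i in range(n - 1) if states[i] == prior]
        if not prior_idx or p_post == 1.0:
            return None                                    # undefined for this student
        p_cond = sum(states[i + 1] == post for i in prior_idx) / len(prior_idx)
        return (p_cond - p_post) / (1 - p_post)

    # Example: COR+UNC -> COR+CER likelihood in a toy state sequence.
    seq = ["COR+UNC", "COR+CER", "COR+CER", "INC+UNC", "COR+UNC", "COR+CER"]
    print(likelihood_L(seq, "COR+UNC", "COR+CER"))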

Mean L values across students for the UNC Model Success and Failure groups in the UNCAdapt condition are shown in Tables 14-15. The rows represent each turn n state, and the columns represent the four different turn n+1 impasse severities; the severity ranking (Section 4, Figure 6) is shown below each impasse type. The total counts for each transition type are shown in parentheses beside the mean L values.20

20 These totals sum to 1082 instead of 1408 (Table 12) for two reasons. First, we excluded all turns from the test problem in this analysis, since no turns received the adaptation during the test problem (Section 6). Second, there are fewer turn transitions than total turns because the final turn doesn’t transition.

Within each group, ANOVAs with post-hoc pairwise tests determined if the transition likelihoods to each turn n+1 state were different for each turn n state. Within the UNC Model Success group, the overall ANOVAs were significant for both COR+UNC (F(3,44)=3.764, p=0.03) and COR+CER (F(3,92)=25.638, p<0.001) transitions. The post-hoc pairwise tests showed that the bolded transitions in Table 14 were statistically the most likely (either significantly or as trends (p≤0.10)). Within the UNC Model Failure group, the overall ANOVAs were not significant for either COR+UNC or COR+CER transitions. Finally, t-tests were computed across the two groups for each turn n→turn n+1 transition, to determine if each transition’s likelihood differed across the Success and Failure groups. The t-tests showed that all of the COR+UNC transitions and the COR+CER→INC+UNC transition differed significantly or as a trend across the Success and Failure groups.

Table 14: Mean L Values for COR+UNC and COR+CER Impasse State Transitions in the UNC Model Success Group, UNCAdapt condition, “Agreed Corrects” Subset (N=1082)

                                   turn n+1
turn n       INC+CER       INC+UNC      COR+UNC      COR+CER         p
             (most)        (less)       (least)      (none)
COR+UNC     -.236 (0)      .254 (7)     .248 (5)    -.580 (6)       0.03
COR+CER     -.014 (135)   -.036 (59)    .001 (71)    .115 (628)    <0.01

Table 15: Mean L Values for COR+UNC and COR+CER Impasse State Transitions in the UNC Model Failure Group, UNCAdapt condition, “Agreed Corrects” Subset (N=1082)

                                   turn n+1
turn n       INC+CER       INC+UNC      COR+UNC      COR+CER         p
             (most)        (less)       (least)      (none)
COR+UNC     -.035 (13)    -.050 (20)   -.030 (23)    .191 (74)      0.15
COR+CER      .131 (9)     -.167 (3)    -.013 (2)     .161 (27)      0.29

Regarding COR+UNC turns, our results indicate that COR+UNC turns are more likely to become INC+UNC or remain COR+UNC when they are correctly recognized (and thus adapted to) (Table 14) than when they are incorrectly recognized (and thus not adapted to) (Table 15). When incorrectly recognized, COR+UNC turns are equally likely to transition to any of the 4 impasse severities.

Regarding COR+CER turns, our results indicate that when COR+CER turns are correctly recognized (and thus not adapted to), they are more likely to remain COR+CER than to become any other impasse state. In contrast, when they are incorrectly recognized (and thus adapted to), they are equally likely to transition to any of the 4 impasse severities.

We conclude that Hypothesis #8 is confirmed. We interpret our results as follows:


Correctly recognizing and adapting to COR+UNCs in turn n causes students to question whether an alternative line of thought should be used in turn n+1. The continued uncertainty in turn n+1 indicates that this alternative line of thought is not unreservedly adopted. Prior research (e.g., [23, 39]) suggests that this questioning process is a positive part of the learning process. On the other hand, incorrectly recognizing and thus not adapting to COR+UNCs in turn n is equally likely in turn n+1 to be associated with a negative impact on learning (INC+CER), with students questioning their thinking (INC+UNC, COR+UNC), or with students letting go of their uncertainty (COR+CER). If INC+CER occurs in turn n+1, this means that students with the least severe impasse about a concept in turn n end up displaying the most severe impasse about that concept or about the next concept in turn n+1;21 i.e., they give up the correct line of thought used in turn n and unreservedly adopt an incorrect line of thought in n+1. Our prior research [59], which shows that Average Learning Impasse Severity is strongly negatively correlated with learning, indicates that INC+CER answers are negatively related to learning.

21 Recall that turn n+1 will answer another question about the same concept if the adaptation gives a Subdialogue Remediation, and turn n+1 will answer a question about the next topic if the adaptation gives a Bottom Out (Section 4).

Correctly recognizing and thus not adapting to COR+CER turns is most likely to support continued high cognitive performance. Incorrectly recognizing and thus adapting to COR+CER turns is as likely to have a negative impact on learning as it is to support continued high performance.

Hypothesis #9: Finally, we replicated this analysis in the NOAdapt condition, in order to confirm that our non-adaptive and adaptive systems yielded different local learning behaviors. In particular, since the non-adaptive system ignores uncertainty, there should be no difference in transition likelihoods between the UNC Model Success and Failure groups in the NOAdapt condition. Moreover, the COR+UNC transitions in the NOAdapt condition should pattern like the COR+UNC transitions in the UNC Model Failure group in the UNCAdapt condition. The COR+CER transitions in the NOAdapt condition should pattern like the COR+CER transitions in the UNC Model Success group in the UNCAdapt condition.

Hypothesis #9 is confirmed by the results in Tables 16-17. Since uncertainty is ignored in the NOAdapt condition, the Success and Failure groups pattern identically: ANOVAs for both groups were not significant for COR+UNC (p>0.10), indicating that COR+UNC turns were equally likely to transition to any of the 4 impasse severities in the NOAdapt condition. These results mirror the UNC Model Failure group in the UNCAdapt condition. Moreover, ANOVAs for both groups were significant for COR+CER (p<0.01) in the NOAdapt condition, and post-hoc pairwise tests showed that COR+CER turns were most likely to remain COR+CER in turn n+1 (p<0.11). These results mirror the UNC Model Success group in the UNCAdapt condition.

Table 16: Mean L Values for COR+UNC and COR+CER Impasse Transitions in the UNC Model Success Group, NOAdapt condition, “Agreed Corrects” Subset (N=1164)

                                   turn n+1
turn n       INC+CER       INC+UNC      COR+UNC      COR+CER         p
             (most)        (less)       (least)      (none)
COR+UNC      .028 (3)     -.045 (2)     .244 (6)    -.358 (7)       0.31
COR+CER     -.021 (135)   -.022 (77)   -.004 (63)    .106 (708)    <0.01

Table 17: Mean L Values for COR+UNC and COR+CER Impasse Transitions in the UNC Model Failure Group, NOAdapt condition, “Agreed Corrects” Subset (N=1164)

                                   turn n+1
turn n       INC+CER       INC+UNC      COR+UNC      COR+CER         p
             (most)        (less)       (least)      (none)
COR+UNC      .008 (11)    -.071 (27)   -.006 (18)    .188 (67)      0.18
COR+CER     -.115 (4)     -.103 (4)    -.032 (4)     .478 (28)     <0.01

In sum, we conclude from this evaluation of the adaptive system’s local learning impact that when the correctness labeling is accurate, successfully detecting and responding to uncertainty has a positive local impact on learning. In particular, COR+UNC turns receive the uncertainty adaptation and the likelihood of subsequent INC+CER turns is reduced, while COR+CER turns don’t receive the adaptation and the likelihood of subsequent COR+CER turns is increased. In comparison, ignoring the uncertainty label in COR+UNC turns, and misrecognizing the (un)certainty label in COR+UNC and COR+CER turns, can all lead to subsequent INC+CER turns, which are associated with decreased learning [59].


10. Conclusions and Current Directions

In this paper, we presented UNC-ITSPOKE, a spoken dialogue computer tutor that automatically detects and adapts to uncertainty in student turns. We then evaluated the performance of UNC-ITSPOKE via a controlled experiment. UNC-ITSPOKE detects uncertainty via a machine-learning model supplemented with hedging phrase heuristics. UNC-ITSPOKE responds to correct+uncertain turns with remedial content (incorrect+uncertain turns already receive remedial content because they are incorrect).

First, we evaluated the global performance of UNC-ITSPOKE, in terms of how much students learned by using the system. We found that UNC-ITSPOKE yielded higher learning than two control systems, but the difference was only significant for a subset of students. However, the percentage of received adaptations significantly correlated with higher learning over all students. Moreover, UNC-ITSPOKE produced significantly fewer adaptations than expected; we concluded that this explained the smaller than expected effect on learning. Second, we evaluated the performance of the automatic correctness and uncertainty labeling components. We concluded that errors in both components explained why UNC-ITSPOKE produced fewer uncertainty adaptations than expected, but that the uncertainty model performed much worse than the correctness model. Third, we evaluated the local performance of UNC-ITSPOKE, in terms of the learning state to which turns transition after receiving the uncertainty adaptation. We found that correctly recognizing and adapting to uncertainty leads to less severe learning impasse states, while ignoring or incorrectly recognizing (un)certainty can lead to more severe learning impasse states. We concluded that the adaptation significantly benefited local learning even though it didn’t occur often enough to significantly improve overall learning.

When compared to our prior work, which showed that a wizarded version of UNC-ITSPOKE can significantly improve overall learning because more uncertainty is detected via manual uncertainty and correctness labeling, our current results suggest that the major performance bottleneck in fully automated UNC-ITSPOKE is the poor uncertainty model. In order to yield better performance, UNC-ITSPOKE must detect (and therefore adapt to) more uncertainty.

Our future work thus will focus on improving uncertainty recall in our uncertainty model, since our global evaluation of learning suggests that adapting to some (mislabeled) correct+certain turns as well as correct+uncertain turns yields higher learning than not adapting at all. However, since our local evaluation of learning indicates that adapting to some (mislabeled) correct+certain turns can lead to more severe learning impasse states, we will try to improve uncertainty recall without decreasing uncertainty precision. We also plan to improve the recall of our correctness labeling so that more correct turns are potentially able to receive the uncertainty adaptation.

To improve our uncertainty model, we plan to test additional features. In particular, we believe automatic correctness features will be useful since correctness and uncertainty are related [23, 24, 25]. We plan to include the correctness value of the current turn and a running total and average for correctness to represent dialogue performance so far. We also plan to extend our running total and average features to include running totals and averages that range across dialogues rather than only within dialogues; we believe that this larger generalization may be more useful. In addition, we plan to try using word-level features, which improved emotion recognition at the turn level in our prior work [60]. We will also investigate additional features and learning methods that resulted from the Interspeech 2009 and 2010 Emotion Challenges [8, 9]; these challenges represent the tremendous recent progress that has been made in these areas, which can now filter down to fully implemented affective systems such as ours.

To improve our uncertainty model, we also plan to use alternative datasets for our cross-validation experiments. In particular, we will try combining the UNC-ITSPOKE-WOZ and UNC-ITSPOKE corpora, to create a much larger dataset that contains multiple user populations. We will also try using only the UNC-ITSPOKE corpus, to see if the best features for detecting uncertainty during fully automated tutoring differ from those during wizarded tutoring. Finally, we will also try downsampling (removing certain turns), upsampling (replicating uncertain turns), and more complex approaches (e.g., SMOTE [61]) to increasing the sensitivity of our uncertainty model to our minority class (uncertainty). Although such techniques can decrease precision (e.g., by removing certain turns where the same features that predict uncertainty also predict certainty), they can also increase recall (e.g., by forcing the learning algorithm to predict a higher relative proportion of uncertainty).
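
As a sketch of the two simple resampling variants mentioned above (the names and data layout are hypothetical, and the 'down' mode assumes the uncertain class is the smaller one):

    import random

    def resample(turns, labels, mode="up", seed=0):
        """Rebalance a binary training set around the minority 'uncertain'
        class: 'down' removes certain turns, 'up' replicates uncertain turns."""
        rng = random.Random(seed)
        unc = [(t, l) for t, l in zip(turns, labels) if l == "uncertain"]
        cer = [(t, l) for t, l in zip(turns, labels) if l == "certain"]
        if mode == "down":                  # downsample the majority class
            cer = rng.sample(cer, len(unc))
        else:                               # upsample the minority class
            unc = unc + [rng.choice(unc) for _ in range(len(cer) - len(unc))]
        data = unc + cer
        rng.shuffle(data)
        return data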

Acknowledgments

This work is funded by National Science Foundation (NSF) awards #0914615 and #0631930. We thank Pam Jordan and the ITSPOKE Group for their help with this research.

References

[1] C. Conati, H. Maclaren, Empirically building and evaluating a probabilistic model of user affect, User Modeling and User-Adapted Interaction (UMUAI) 19 (3) (2009) 267–303.

[2] K. Porayska-Pomsta, M. Mavrikis, H. Pain, Diagnosing and acting on student affect: the tutor's perspective, User Modeling and User-Adapted Interaction: The Journal of Personalization Research 18 (2008) 125–173.

[3] A. Batliner, S. Steidl, C. Hacker, E. Noth, Private emotions vs. social interaction - a data-driven approach towards analysing emotion in speech, User Modeling and User-Adapted Interaction: The Journal of Personalization Research 18 (2008) 175–206.

[4] S. McQuiggan, B. Mott, J. Lester, Modeling self-efficacy in intelligent tutoring systems: An inductive approach, User Modeling and User-Adapted Interaction (UMUAI) 18 (1-2) (2008) 81–123.

[5] S. K. D'Mello, S. D. Craig, A. Witherspoon, B. McDaniel, A. Graesser, Automatic detection of learner's affect from conversational cues, User Modeling and User-Adapted Interaction: The Journal of Personalization Research 18 (2008) 45–80.

[6] A. Paiva, R. Prada, R. Picard (Eds.), Affective Computing and Intelligent Interaction, Springer, Heidelberg, 2007.

[7] H. Prendinger, M. Ishizuka, The Empathetic Companion: A character-based interface that addresses users' affective states, International Journal of Applied Artificial Intelligence 19 (3) (2005) 267–285.

[8] B. Schuller, S. Steidl, A. Batliner, The Interspeech 2009 Emotion Challenge, in: Proceedings of the 10th Annual Conference of the International Speech Communication Association (Interspeech), ISCA, Brighton, UK, 2009.

[9] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Muller, S. Narayanan, The Interspeech 2010 Paralinguistic Challenge, in: Proceedings of the 11th Annual Conference of the International Speech Communication Association (Interspeech), Chiba, Japan, 2010.


[10] J. Klein, Y. Moon, R. Picard, This computer responds to user frustration: Theory, design, and results, Interacting with Computers 14 (2002) 119–140.

[11] K. Liu, R. W. Picard, Embedded empathy in continuous, interactive health assessment, in: CHI Workshop on HCI Challenges in Health Assessment, 2005.

[12] G. Aist, B. Kort, R. Reilly, J. Mostow, R. Picard, Experimentally augmenting an intelligent tutoring system with human-supplied capabilities: Adding human-provided emotional scaffolding to an automated reading tutor that listens, in: Proc. Intelligent Tutoring Systems Workshop on Empirical Methods for Tutorial Dialogue Systems, 2002.

[13] K. Forbes-Riley, D. Litman, Designing and evaluating a wizarded uncertainty-adaptive spoken dialogue tutoring system, Computer Speech and Language (CSL) 25 (1) (2010) 105–126.

[14] D. Litman, K. Forbes-Riley, Recognizing student emotions and attitudes on the basis of utterances in spoken tutoring dialogues with both human and computer tutors, Speech Communication 48 (5) (2006) 559–590.

[15] L. Vidrascu, L. Devillers, Detection of real-life emotions in dialogs recorded in a call center, in: Proceedings of INTERSPEECH, Lisbon, Portugal, 2005.

[16] C. M. Lee, S. Narayanan, Towards detecting emotions in spoken dialogs, IEEE Transactions on Speech and Audio Processing 13 (2).

[17] I. Shafran, M. Riley, M. Mohri, Voice signatures, in: Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), St. Thomas, US Virgin Islands, 2003, pp. 31–36.

[18] C. Conati, H. Maclaren, Evaluating a probabilistic model of student affect, in: Proceedings of the Intelligent Tutoring Systems Conference (ITS), Maceio, Brazil, 2004, pp. 55–66.

[19] J. Gratch, S. Marsella, Fight the way you train: The role and limits of emotions in training for combat, The Brown Journal of World Affairs 10 (1) (2003) 63–76.


[20] A. de Vicente, H. Pain, Informing the detection of the students' motivational state: an empirical study, in: Proceedings of the Intelligent Tutoring Systems Conference (ITS), Biarritz, France, 2002, pp. 933–943.

[21] P. Ekman, W. V. Friesen, The facial action coding system: A technique for the measurement of facial movement, Consulting Psychologists Press, Palo Alto, 1978.

[22] D. Litman, K. Forbes-Riley, Annotating student emotional states in spoken tutoring dialogues, in: Proceedings of the SIGdial Workshop on Discourse and Dialogue, Boston, USA, 2004, pp. 144–153.

[23] S. Craig, A. Graesser, J. Sullins, B. Gholson, Affect and learning: an exploratory look into the role of affect in learning with AutoTutor, Journal of Educational Media 29 (3).

[24] K. Bhatt, M. Evens, S. Argamon, Hedged responses and expressions of affect in human/human and human/computer tutorial interactions, in: Proceedings of Cognitive Science (CogSci), Chicago, USA, 2004, pp. 114–119.

[25] K. Forbes-Riley, D. Litman, Investigating human tutor responses to student uncertainty for adaptive system development, in: A. Paiva, R. Prada, R. Picard (Eds.), Proceedings of Affective Computing and Intelligent Interaction (ACII), Springer Press, 2007, pp. 678–689.

[26] W. Tsukahara, N. Ward, Responding to subtle, fleeting changes in the user's internal state, in: Proc. SIGCHI Conference on Human Factors in Computing Systems, 2001.

[27] N. Wang, W. Johnson, R. E. Mayer, P. Rizzo, E. Shaw, H. Collins, The politeness effect: Pedagogical agents and learning outcomes, International Journal of Human-Computer Studies 66 (2) (2008) 98–112.

[28] N. Wang, W. Johnson, P. Rizzo, E. Shaw, R. Mayer, Experimental evaluation of polite interaction tactics for pedagogical agents, in: Proceedings of the Intelligent User Interface Conference (IUI), 2005, pp. 12–19.

[29] L. Hall, S. Woods, D. Sobral, A. Paiva, K. Dautenhahn, D. Wolke, L. Newall, Designing empathic agents: Adults vs. kids, in: Proceedings of the Intelligent Tutoring Systems Conference (ITS), Maceio, Brazil, 2004, pp. 604–613.

[30] F. Mairesse, M. Walker, Trainable generation of big-five personality styles through data-driven parameter estimation, in: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL), Columbus, Ohio, 2008.

[31] H. Pon-Barry, K. Schultz, E. O. Bratt, B. Clark, S. Peters, Responding to student uncertainty in spoken tutorial dialogue systems, International Journal of Artificial Intelligence in Education 16.

[32] K. VanLehn, P. W. Jordan, C. P. Rose, D. Bhembe, M. Bottner, A. Gaydos, M. Makatchev, U. Pappuswamy, M. Ringenberg, A. Roque, S. Siler, R. Srivastava, R. Wilson, The architecture of Why2-Atlas: A coach for qualitative physics essay writing, in: Proc. Intelligent Tutoring Systems, 2002.

[33] P. Jordan, B. Hall, M. Ringenberg, Y. Cui, C. Rose, Tools for authoring a dialogue agent that participates in learning studies, in: Proceedings of Artificial Intelligence in Education (AIED), Los Angeles, 2007, pp. 43–50.

[34] X. D. Huang, F. Alleva, H. W. Hon, M. Y. Hwang, K. F. Lee, R. Rosenfeld, The SphinxII speech recognition system: An overview, Computer Speech and Language.

[35] A. Black, P. Taylor, Festival speech synthesis system: system documentation (1.1.1), Human Communication Research Centre Technical Report 83, The Centre for Speech Technology Research, University of Edinburgh (1997).

[36] D. Litman, C. Rose, K. Forbes-Riley, K. VanLehn, D. Bhembe, S. Silliman, Spoken versus typed human and computer dialogue tutoring, International Journal of Artificial Intelligence in Education 16 (2006) 145–170.

[37] K. Forbes-Riley, D. Litman, S. Silliman, J. Tetreault, Comparing synthesized versus pre-recorded tutor speech in an intelligent tutoring spoken dialogue system, in: Proceedings of the Florida Artificial Intelligence Research Society Conference (FLAIRS), Melbourne Beach, Florida, USA, 2006, pp. 509–514.

[38] K. VanLehn, S. Siler, C. Murray, Why do only some events cause learning during human tutoring?, Cognition and Instruction 21 (3).

[39] B. Kort, R. Reilly, R. Picard, An affective model of interplay between emotions and learning: Reengineering educational pedagogy - building a learning companion, in: T. Okamoto, R. Hartley, J. Kinshuk, P. Klus (Eds.), Proceedings of the IEEE International Conference on Advanced Learning Technology: Issues, Achievements and Challenges, Madison, WI, 2001, pp. 43–48.

[40] K. Forbes-Riley, D. Litman, M. Rotaru, Responding to student uncertainty during computer tutoring: A preliminary evaluation, in: Proceedings of the 9th International Conference on Intelligent Tutoring Systems (ITS), Montreal, Canada, 2008.

[41] K. Forbes-Riley, D. Litman, Adapting to student uncertainty improves tutoring dialogues, in: Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED), Brighton, UK, 2009.

[42] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA data mining software: An update, SIGKDD Explorations 11 (1).

[43] D. Litman, K. Forbes-Riley, Improving (meta)cognitive tutoring by detecting and responding to uncertainty, in: Working Notes of the Cognitive and Metacognitive Educational Systems AAAI Symposium, Arlington, VA, 2009.

[44] H. Ai, D. Litman, K. Forbes-Riley, M. Rotaru, J. Tetreault, A. Purandare, Using system and user performance features to improve emotion detection in spoken tutoring dialogs, in: Proceedings of Interspeech, Pittsburgh, PA, 2006, pp. 797–800.

[45] D. Litman, K. Forbes-Riley, Predicting student emotions in computer-human tutoring dialogues, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Barcelona, Spain, 2004, pp. 352–359.


[46] K. Forbes-Riley, D. Litman, Predicting emotion in spoken dialogue from multiple knowledge sources, in: Proceedings of the Human Language Technology Conference: 4th Meeting of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), Boston, MA, 2004, pp. 201–208.

[47] J. Ang, R. Dhillon, A. Krupski, E. Shriberg, A. Stolcke, Prosody-based automatic detection of annoyance and frustration in human-computer dialog, in: J. H. L. Hansen, B. Pellom (Eds.), Proceedings of the International Conference on Spoken Language Processing (ICSLP), Denver, USA, 2002, pp. 2037–2039.

[48] C. Lee, S. Narayanan, R. Pieraccini, Combining acoustic and language information for emotion recognition, in: Proceedings of the International Conference on Spoken Language Processing (ICSLP), Denver, Colorado, USA, 2002, pp. 873–876.

[49] A. Batliner, K. Fischer, R. Huber, J. Spilker, E. Noth, How to find trouble in communication, Speech Communication 40 (1-2) (2003) 117–143.

[50] L. Devillers, L. Lamel, I. Vasilescu, Emotion detection in task-oriented spoken dialogs, in: Proc. IEEE International Conference on Multimedia & Expo (ICME), 2003.

[51] D. Talkin, D. Lin, Get f0 online documentation, esps/waves release 5.31, Tech. rep., Entropic Research Laboratory (1996).

[52] Speech Coding and Synthesis, Elsevier, New York, 1995, Ch. A robust algorithm for pitch tracking.

[53] W. Burleson, R. Picard, Evidence for gender specific approaches to the development of emotionally intelligent learning companions, IEEE Intelligent Systems 22 (4) (2007) 62–69.

[54] P. Oudeyer, The synthesis of cartoon emotional speech, in: Proceedings of Speech Prosody 2002, Aix-en-Provence, 2002, pp. 551–554.

[55] R. R. Hake, Assessment of physics teaching methods, in: Proceedings of the UNESCO-ASPEN Workshop on Active Learning in Physics, Sri Lanka, 2002.


[56] K. Forbes-Riley, D. Litman, Correlating student acoustic-prosodic profiles with student learning in spoken tutoring dialogues, in: Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech-2005/Eurospeech), Lisbon, Portugal, 2005.

[57] S. D'Mello, R. S. Taylor, A. Graesser, Monitoring affective trajectories during complex learning, in: Proceedings of the 29th Annual Meeting of the Cognitive Science Society, Austin, TX, 2007, pp. 203–208.

[58] S. W. McQuiggan, J. L. Robison, J. C. Lester, Affective transitions in narrative-centered learning environments, in: Proceedings of the 9th International Intelligent Tutoring Systems Conference, Montreal, Canada, 2008.

[59] K. Forbes-Riley, D. Litman, Metacognition and learning in spoken dialogue computer tutoring, in: Proceedings of the International Intelligent Tutoring Systems Conference (ITS), Pittsburgh, PA, 2010.

[60] D. Litman, M. Rotaru, G. Nicholas, Classifying turn-level uncertainty using word-level prosody, in: Proceedings of Interspeech, Brighton, UK, 2009.

[61] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
