Using Reinforcement Learning to Build a Better Model of Dialogue State
Joel Tetreault & Diane Litman, University of Pittsburgh, LRDC, April 7, 2006
Problem
Problems with designing spoken dialogue systems:
- What features to use?
- How to handle noisy data or miscommunications?
- How to hand-tailor policies for complex dialogues?
Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., '02; Walker, '00; Henderson et al., '05]. However, there has been very little empirical work on testing the utility of adding specialized features to construct a better dialogue state.
Goal
Many features can be used to describe the user state; which ones do you use?
Goal: show that adding more complex features to a state is a worthwhile pursuit, since it alters what actions a system should take.
5 features: certainty, student dialogue move, concept repetition, frustration, student performance.
All are important to tutoring systems, but also to dialogue systems in general.
Outline
- Markov Decision Processes (MDPs)
- MDP Instantiation
- Experimental Method
- Results
Markov Decision Processes
What is the best action for an agent to take in any state so as to maximize the reward at the end?
MDP input:
- States
- Actions
- Reward function
MDP Output
Use policy iteration to propagate the final reward back through the states to determine:
- V-value: the worth of each state
- Policy: the optimal action to take in each state
Values and policies are based on the reward function but also on the probabilities of getting from one state to the next given a certain action
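As a concrete sketch of that propagation step, below is a minimal tabular policy-iteration routine in Python/NumPy. It assumes state-attached rewards and a dense transition array P[s, a, s']; the function name, shapes, and discount factor are illustrative, not the actual implementation used in this work (the slides use the INRA Matlab toolkit, described under Infrastructure).

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration (a sketch, not the system's actual code).
    P: (S, A, S) array of transition probabilities P[s, a, s'].
    R: (S,) array of rewards attached to states (e.g., +100/-100 at final states).
    Returns V (the worth of each state) and the optimal action per state."""
    n_states, _n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve the linear system V = R + gamma * P_pi @ V
        P_pi = P[np.arange(n_states), policy]      # (S, S) under current policy
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: act greedily with respect to V
        Q = R[:, None] + gamma * (P @ V)           # (S, A) action values
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):     # stable policy -> optimal
            return V, policy
        policy = new_policy
```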
What’s the best path to the fly?
MDP Frog Example
[Figure: a grid world in which the frog must hop to the fly. The final state (the fly) carries a reward of +1 and every other hop costs -1; a second panel shows the resulting values propagated back from the final state to each square.]
MDP’s in Spoken Dialogue
[Diagram: training data is fed to the MDP, which outputs a policy for the dialogue system; the dialogue system then interacts with a user simulator or a human user. The MDP works offline; the interactions work online.]
ITSPOKE Corpus
- 100 dialogues with the ITSPOKE spoken dialogue tutoring system [Litman et al., '04]
- All possible dialogue paths were authored by physics experts
- Dialogues informally follow a question-answer format
- 50 turns per dialogue on average
- Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned
Corpus Annotations
Manual annotations:
- Tutor and Student Moves (similar to Dialogue Acts) [Forbes-Riley et al., '05]
- Frustration and certainty [Litman et al., '04; Liscombe et al., '05]
Automated annotations:
- Correctness (based on the student's response to the last question)
- Concept Repetition (whether a concept is repeated)
- %Correctness (past performance)
MDP State Features
Feature: Values
- Correctness: Correct (C), Incorrect/Partially Correct (I)
- Certainty: Certain (cer), Neutral (neu), Uncertain (unc)
- Student Move: Shallow (S), Deep/Novel Answer/Assertion (O)
- Concept Repetition: New Concept (0), Repeated (R)
- Frustration: Frustrated (F), Neutral (N)
- % Correctness: High, 50-100% (H); Low, 0-49% (L)
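To make the state representation concrete, here is a hypothetical encoding of one annotated student turn into the abbreviations above; the function and its argument names are invented for illustration and are not part of the original system.

```python
def encode_state(correct, certainty, shallow_move, concept_repeated,
                 frustrated, pct_correct):
    """Map one annotated turn to the feature labels in the table above
    (hypothetical helper, for exposition only)."""
    return (
        "C" if correct else "I",                                  # Correctness
        {"certain": "cer", "neutral": "neu", "uncertain": "unc"}[certainty],
        "S" if shallow_move else "O",                             # Student Move
        "R" if concept_repeated else "0",                         # Concept Repetition
        "F" if frustrated else "N",                               # Frustration
        "H" if pct_correct >= 0.5 else "L",                       # % Correctness
    )

# Example: a correct, certain, shallow answer on a new concept from a calm,
# high-performing student:
# encode_state(True, "certain", True, False, False, 0.8)
#   -> ("C", "cer", "S", "0", "N", "H")
```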
MDP Action Choices
Case (TMove): Example Turn
- Feed (Pos): "Super."
- NonFeed (Hint, Ques.): "To analyze the pumpkin's acceleration we will use Newton's Second Law. What is the definition of the law?"
- Mix (Pos, Rst, Ques.): "Good. So when the truck and car collide they exert a force on each other. What is the relationship between their magnitudes?"
MDP Reward Function
Reward function: use normalized learning gain (NLG) to do a median split on the corpus:
- 10 students are "high learners" and the other 10 are "low learners"
- High-learner dialogues had a final state with a reward of +100; low-learner dialogues had one of -100
NLG = (posttest - pretest) / (1 - pretest)
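A small worked sketch of this reward assignment, assuming pretest and posttest scores are normalized to [0, 1]; the helper names and the tie handling at the median are illustrative, not taken from the original setup.

```python
def nlg(pretest, posttest):
    """Normalized learning gain: the fraction of possible improvement achieved."""
    return (posttest - pretest) / (1 - pretest)

def final_rewards(scores):
    """Median split on NLG (a sketch; tie handling is arbitrary).
    scores: {student_id: (pretest, posttest)} with scores in [0, 1].
    Returns {student_id: +100 or -100} for that student's final states."""
    gains = {s: nlg(pre, post) for s, (pre, post) in scores.items()}
    median = sorted(gains.values())[len(gains) // 2]
    return {s: (100 if g >= median else -100) for s, g in gains.items()}
```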
Infrastructure
1. State Transformer:
- Based on RLDS [Singh et al., '99]
- Outputs the state-action probability matrix and the reward matrix
2. MDP Matlab Toolkit (from INRA) to generate policies
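The slides do not detail the State Transformer's internals, but the state-action probability matrix it outputs could plausibly be estimated as below. This is a hedged sketch, assuming each dialogue is a sequence of (state, action) pairs ending in a terminal state; it is an illustrative stand-in for the RLDS-based tool, not its actual code.

```python
from collections import Counter, defaultdict

def estimate_transitions(dialogues):
    """Count (state, action) -> next_state transitions across a corpus and
    normalize the counts to probabilities.
    dialogues: sequences of the form [(s0, a0), (s1, a1), ..., final_state]."""
    counts = defaultdict(Counter)
    for dialogue in dialogues:
        *turns, final_state = dialogue
        for i, (state, action) in enumerate(turns):
            nxt = turns[i + 1][0] if i + 1 < len(turns) else final_state
            counts[(state, action)][nxt] += 1
    probs = {}
    for sa, nexts in counts.items():
        total = sum(nexts.values())
        probs[sa] = {s2: n / total for s2, n in nexts.items()}
    return probs
```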
Methodology
Construct MDPs to test the inclusion of new state features against a baseline:
- Develop a baseline state and policy
- Add a feature to the baseline and compare policies
- A feature is deemed important if adding it results in a change in policy from the baseline policy ("shifts")
- For each MDP: verify that policies are reliable (V-value convergence)
Hypothetical Policy Change Example
#  B1 State  B1 Policy  +Certainty State  +Cert 1 Policy  +Cert 2 Policy
1  [C]       Feed       [C,Cer]           Feed            Mix
                        [C,Neu]           Feed            Feed
                        [C,Unc]           Feed            Mix
2  [I]       Feed       [I,Cer]           Mix             Mix
                        [I,Neu]           Mix             NonFeed
                        [I,Unc]           Mix             Mix
                                          (0 shifts)      (5 shifts)
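The shift count in this example can be stated precisely: expand each baseline state with the new feature's values and count the expanded states whose optimal action differs from the baseline action. A minimal sketch, assuming policies are stored as dictionaries (names hypothetical):

```python
def count_shifts(base_policy, expanded_policy):
    """base_policy: {base_state: action}.
    expanded_policy: {(base_state, feature_value): action}.
    A 'shift' is an expanded state whose action differs from its baseline action."""
    return sum(action != base_policy[base_state]
               for (base_state, _value), action in expanded_policy.items())

# Example with two expanded states, one of which changes its action:
# count_shifts({"[C]": "Feed"},
#              {("[C]", "Cer"): "Mix", ("[C]", "Neu"): "Feed"})  # -> 1
```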
Tests
[Diagram: Baseline 1 = {Correctness}; Baseline 1 + Certainty = Baseline 2; Baseline 2 is then extended individually with +SMove, +Goal, +%Correct, and +Frustration (B2+).]
Baseline
Actions: {Feed, NonFeed, Mix}
Baseline State: {Correctness}
[Figure: baseline network. The states [C] and [I] transition to each other and to FINAL, with each arc labeled by the action choices F|NF|Mix.]
Baseline 1 Policies
Trend: if you only have student correctness as a model of student state, then regardless of the student's response, the best tactic is to always give simple feedback.

#  State  State Size  Policy
1  [C]    1308        Feed
2  [I]    872         Feed
But are our policies reliable?
- The best way to test is to run real experiments with human users on a new dialogue manager, but that is months of work
- Our tack: check whether our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data to the corpus
- Method: run the MDP on subsets of our corpus (incrementally add one student (5 dialogues) to the data, and rerun the MDP on each subset)
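A sketch of this incremental check, where build_mdp (corpus subset to transition/reward matrices) and solve (e.g., the policy-iteration sketch earlier) are hypothetical stand-ins; it also assumes a fixed state space across cuts so the V-vectors are comparable.

```python
def convergence_trace(students, build_mdp, solve):
    """Re-solve the MDP on growing corpus prefixes (one student = 5 dialogues
    added at a time) and measure how much the V-values still move.
    Returns the max absolute V change between consecutive cuts; values
    shrinking toward 0 suggest the corpus is large enough."""
    v_per_cut = []
    for k in range(1, len(students) + 1):
        P, R = build_mdp(students[:k])   # matrices from the first k students
        V, _policy = solve(P, R)
        v_per_cut.append(V)
    return [max(abs(a - b) for a, b in zip(v1, v2))
            for v1, v2 in zip(v_per_cut, v_per_cut[1:])]
```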
Baseline Convergence Plot
[Figure: V-values for the baseline MDP as students are incrementally added.]
Methodology: Adding More Features
- Create a more complicated baseline by adding the certainty feature (new baseline = B2)
- Add the other 4 features (student moves, concept repetition, frustration, performance) individually to the new baseline
- Check that V-values converge
- Analyze policy changes
Tests
[The feature-addition diagram above is repeated as a progress marker; Baseline 2 (B1 + Certainty) has now been established.]
Certainty
Previous work [Bhatt et al., '04] has shown the importance of certainty in intelligent tutoring systems (ITS).
A student who is certain and correct may not need feedback, but a student who is correct yet showing some doubt may be becoming confused, and should be given more feedback.
B2: Baseline + Certainty Policies
#  B1 State  B1 Policy  +Certainty State  +Certainty Policy
1  [C]       Feed       [C,Cer]           NonFeed
                        [C,Neu]           Feed
                        [C,Unc]           NonFeed
2  [I]       Feed       [I,Cer]           NonFeed
                        [I,Neu]           Mix
                        [I,Unc]           NonFeed

Trend: if neutral, give Feed or Mix; otherwise give NonFeed
Baseline 1 and 2 Convergence Plots
[Figure: V-value convergence plots for Baselines 1 and 2.]
Tests
[The feature-addition diagram is repeated again; the four B2 extensions are examined next.]
% Correct Convergence Plots
[Figure: V-value convergence plots for B2 + %Correct.]
Student Move Policies
#  B2 State  B2 Policy  B2+SMove State  +SMove Policy
1  [Cer,C]   NonFeed    [Cer,C,S]       NonFeed
                        [Cer,C,O]       Feed
2  [Cer,I]   NonFeed    [Cer,I,S]       Mix
                        [Cer,I,O]       Mix
3  [Neu,C]   Feed       [Neu,C,S]       Feed
                        [Neu,C,O]       NonFeed
4  [Neu,I]   Mix        [Neu,I,S]       Mix
                        [Neu,I,O]       NonFeed
5  [Unc,C]   NonFeed    [Unc,C,S]       Mix
                        [Unc,C,O]       NonFeed
6  [Unc,I]   NonFeed    [Unc,I,S]       Mix
                        [Unc,I,O]       NonFeed

Trend: give Mix if Shallow (S), give NonFeed if Other (O)
7 shifts
Concept Repetition Policies
Trend: if a concept is repeated (R), give complex or mixed feedback

#  B2 State  B2 Policy  B2+Concept State  +Concept Policy
1  [Cer,C]   NonFeed    [Cer,C,0]         NonFeed
                        [Cer,C,R]         Feed
2  [Cer,I]   NonFeed    [Cer,I,0]         Mix
                        [Cer,I,R]         Mix
3  [Neu,C]   Feed       [Neu,C,0]         Mix
                        [Neu,C,R]         Feed
4  [Neu,I]   Mix        [Neu,I,0]         Mix
                        [Neu,I,R]         Mix
5  [Unc,C]   NonFeed    [Unc,C,0]         NonFeed
                        [Unc,C,R]         NonFeed
6  [Unc,I]   NonFeed    [Unc,I,0]         NonFeed
                        [Unc,I,R]         NonFeed

4 shifts
Frustration Policies
Trend: if the student is frustrated (F), give NonFeed

#  B2 State  B2 Policy  B2+Frustration State  +Frustration Policy
1  [Cer,C]   NonFeed    [Cer,C,N]             NonFeed
                        [Cer,C,F]             Feed
2  [Cer,I]   NonFeed    [Cer,I,N]             NonFeed
                        [Cer,I,F]             NonFeed
3  [Neu,C]   Feed       [Neu,C,N]             Feed
                        [Neu,C,F]             NonFeed
4  [Neu,I]   Mix        [Neu,I,N]             Mix
                        [Neu,I,F]             NonFeed
5  [Unc,C]   NonFeed    [Unc,C,N]             NonFeed
                        [Unc,C,F]             NonFeed
6  [Unc,I]   NonFeed    [Unc,I,N]             NonFeed
                        [Unc,I,F]             NonFeed

4 shifts
Percent Correct Policies

Trend: if the student is a low performer (L), give NonFeed

#  B2 State  B2 Policy  B2+%Correct State  +%Correct Policy
1  [Cer,C]   NonFeed    [Cer,C,H]          NonFeed
                        [Cer,C,L]          NonFeed
2  [Cer,I]   NonFeed    [Cer,I,H]          Mix
                        [Cer,I,L]          NonFeed
3  [Neu,C]   Feed       [Neu,C,H]          Feed
                        [Neu,C,L]          Feed
4  [Neu,I]   Mix        [Neu,I,H]          NonFeed
                        [Neu,I,L]          Mix
5  [Unc,C]   NonFeed    [Unc,C,H]          Mix
                        [Unc,C,L]          NonFeed
6  [Unc,I]   NonFeed    [Unc,I,H]          NonFeed
                        [Unc,I,L]          NonFeed

3 shifts
Discussion
- Incorporating more information into the representation of the student state has an impact on tutor policies
- Despite not having human or simulated users, we can still claim that our findings are reliable, due to the convergence of V-values and policies
- Including Certainty, Student Moves, and Concept Repetition effected the most change
Future Work
- Developing user simulations and annotating more human-computer experiments to further verify that our policies are correct
- More data allows us to develop more complicated policies, such as:
  - More complex tutor actions (hints, questions)
  - Combinations of state features
  - More refined reward functions (PARADISE)
- Developing more complex convergence tests
Related Work
- [Paek and Chickering, '05]
- [Singh et al., '99]: optimal dialogue length
- [Frampton et al., '05]: last dialogue act
- [Williams et al., '03]: automatically generate good state/action sets
Diff Plots
Diff Plot: compare the final policy (20 students) with the policies generated at smaller cuts of the corpus
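A hedged sketch of how such a diff could be computed, assuming the policy at each cut is stored as a state-to-action dictionary (names illustrative):

```python
def diffs_from_final(policies_per_cut):
    """policies_per_cut[k]: policy learned from the first k+1 students;
    the last entry is the final (20-student) policy. For each cut, count
    the states whose action disagrees with the final policy."""
    final = policies_per_cut[-1]
    return [sum(policy.get(state) != action for state, action in final.items())
            for policy in policies_per_cut]
```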