
Transcript of 2015 EDM Leopard for Adaptive Tutoring Evaluation

Page 1: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Your model is predictive, but is it useful? Theoretical and Empirical Considerations of a New Paradigm for Adaptive Tutoring Evaluation

José P. González-Brenes (Pearson), Yun Huang (University of Pittsburgh)


Page 2: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Main point of the paper: we are not evaluating student models correctly. A new paradigm, Leopard, may help.


Page 3: 2015 EDM Leopard for Adaptive Tutoring Evaluation


Page 4: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Why are we here?


Page 5: 2015 EDM Leopard for Adaptive Tutoring Evaluation


Page 6: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Educational Data Mining = Data Mining?

Page 7: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Should we just publish at KDD*?

*or other data mining venue


Page 8: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Claim: Educational Data Mining helps learners


Page 9: 2015 EDM Leopard for Adaptive Tutoring Evaluation


Is our research helping learners?

Page 10: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Adaptive Intelligent Tutoring Systems: systems that teach and adapt content to humans

[Diagram: a loop in which the tutor teaches and collects evidence]

Page 11: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Paper writing: researchers quantify the improvements of the systems compared. Not a purely academic pursuit: superintendents choose between alternative technologies; teachers choose between systems.


Page 12: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Randomized Controlled Trials may measure the time students spend on tutoring and their performance on post-tests


Page 13: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Difficulties of Randomized Controlled Trials:
•  IRB approval
•  experimental design by an expert
•  recruiting (and often payment!) of enough participants to achieve statistical power
•  data analysis


Page 14: 2015 EDM Leopard for Adaptive Tutoring Evaluation

How do other A.I. disciplines do it?


Page 15: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Bleu [Papineni et al ’01]: machine translation systems
Rouge [Lin et al ’02]: automatic summarization systems
Paradise [Walker et al ’99]: spoken dialogue systems


Page 16: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Using automatic metrics can be very positive:
•  Cheaper experimentation
•  Faster comparisons
•  Competitions that accelerate progress


Page 17: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Automatic metrics do not replace RCTs


Page 18: 2015 EDM Leopard for Adaptive Tutoring Evaluation

What does the Educational Data Mining community do? Evaluate the student model using classification accuracy metrics like RMSE, AUC of ROC, accuracy… (Literature reviews by Pardos, Pelánek, … )
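A minimal sketch (my illustration, not from the slides) of the conventional evaluation being described: scoring a student model's per-item predictions with RMSE and AUC of ROC. The arrays `y_true` and `y_pred` are hypothetical stand-ins for observed responses and a model's predicted probabilities of a correct answer.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1])        # observed correct/incorrect
y_pred = np.array([.9, .4, .7, .8, .3, .6])  # model's predicted P(correct)

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error
auc = roc_auc_score(y_true, y_pred)              # AUC of ROC
print(f"RMSE={rmse:.3f}  AUC={auc:.3f}")
```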


Page 19: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Other fields verify that automatic metrics correlate with the target behavior [e.g., Callison-Burch et al ’06]


Page 20: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Ironically, we have a growing body of evidence that classification evaluation metrics are a BAD way to evaluate adaptive tutors

See the papers by Baker and Beck on the limitations and problems of these evaluation metrics


Page 21: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Surprisingly, in spite of all of the evidence against using classification evaluation metrics, their use is still very widespread in the adaptive tutoring literature.* Can we do better?

* ExpOppNeed [Lee & Brunskill] is an exception


Page 22: 2015 EDM Leopard for Adaptive Tutoring Evaluation

The rest of this talk:
•  Leopard Paradigm
   – Teal
   – White
•  Meta-evaluation
•  Discussion


Page 23: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Leopard


Page 24: 2015 EDM Leopard for Adaptive Tutoring Evaluation

•  Leopard: Learner Effort-Outcomes Paradigm

•  Leopard quantifies the effort and outcomes of students in adaptive tutoring


Page 25: 2015 EDM Leopard for Adaptive Tutoring Evaluation

•  Effort: quantifies how much practice the adaptive tutor gives to students, e.g., number of items assigned to students, amount of time…

•  Outcome: quantifies the performance of students after adaptive tutoring


Page 26: 2015 EDM Leopard for Adaptive Tutoring Evaluation

•  Measuring effort and outcomes is not novel by itself (e.g., RCTs)

•  Leopard’s contribution is measuring both without a randomized controlled trial

•  White and Teal are metrics that operationalize Leopard


Page 27: 2015 EDM Leopard for Adaptive Tutoring Evaluation

White: Whole Intelligent Tutoring system Evaluation. White performs a counterfactual simulation (“What Would the Tutor Do?”) to estimate how much practice students receive.
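A minimal sketch of how I read the “What Would the Tutor Do?” simulation (my own assumptions, not the authors’ code): replay a student’s practice sequence, assume the tutor stops assigning items once the model’s predicted P(correct) reaches a mastery threshold, count the items practiced up to that point as effort, and take the student’s actual response at the stopping point as the outcome.

```python
from typing import Optional, Tuple

def white_effort_outcome(actual, predicted,
                         threshold=0.6) -> Tuple[int, Optional[int]]:
    """actual: observed 0/1 responses; predicted: model's P(correct) per step."""
    for t, p in enumerate(predicted):
        if p >= threshold:           # the tutor would stop teaching here
            return t, actual[t]      # effort so far, actual outcome at stop
    return len(actual), None         # student never reaches the threshold

# Hypothetical data for one student on one skill:
effort, outcome = white_effort_outcome(actual=[0, 0, 1, 1],
                                       predicted=[.5, .55, .72, .91])
print(effort, outcome)  # -> 2 1
```

The `None` branch corresponds to the case addressed later under “visible” imputation: a student who never reaches the target performance.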


Page 28: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Design desiderata: the evaluation metric should be easy to use, and take the same, or similar, input as conventional metrics

Page 29: 2015 EDM Leopard for Adaptive Tutoring Evaluation

[Figure: a worked example of White’s input. Students Bob and Alice practice skill s1 over time steps t = 1, 2, 3, 4; for each student u, skill q, and time t, the student model’s predicted performance $\hat{y}_{u,q,t}$ (e.g., .5, .7, .72, .91) is shown alongside the actual next performance $c_{u,q,t+1}$ (0 or 1).]

Page 30: 2015 EDM Leopard for Adaptive Tutoring Evaluation

[Figure: the same worked example, now with White’s outputs. From the predicted performance $\hat{y}_{u,q,t}$ and the actual performance $c_{u,q,t+1}$, White computes a score (e.g., 2/3 and 4/5) and an effort value for each student.]

Alternatively, we can model error.

Page 31: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Future direction: Present aggregate results


Page 32: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Q: What if a student does not achieve the target performance? A: “Visible” imputation


Page 33: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Meta-Evaluation


Page 34: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Compare:
•  Conventional classification metrics
•  Leopard metrics (White)


Page 35: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Datasets:
•  Data from a commercial middle-school math tutor
   –  1.2 million observations
   –  25,000 students
   –  Item-to-skill mapping:
      •  Coarse: 27 skills
      •  Fine: 90 skills
      •  (Other item-to-skill models not reported)
•  Synthetic data

Page 36: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Assessing an evaluation metric with real student data is difficult because we often do not know the ground truth. Insight: use data whose behavior in an adaptive tutor we know a priori.


Page 37: 2015 EDM Leopard for Adaptive Tutoring Evaluation

For adaptive tutoring to be able to optimize when to stop instruction, student performance should increase with repeated practice (the learning curve should be increasing). A decreasing or flat learning curve = bad data.
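One plausible way (my sketch, not the authors’ procedure) to operationalize this check: estimate each skill’s learning curve as percent correct per practice opportunity, fit a line, and flag skills whose slope is not positive. The column names below are hypothetical.

```python
import numpy as np
import pandas as pd

def flag_bad_skills(df: pd.DataFrame) -> list:
    """df columns: skill, opportunity (1st, 2nd, ... attempt), correct (0/1)."""
    bad = []
    for skill, group in df.groupby("skill"):
        # learning curve: mean correctness at each practice opportunity
        curve = group.groupby("opportunity")["correct"].mean()
        slope = np.polyfit(curve.index, curve.values, deg=1)[0]
        if slope <= 0:               # decreasing or flat learning curve
            bad.append(skill)
    return bad
```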


Page 38: 2015 EDM Leopard for Adaptive Tutoring Evaluation


Page 39: 2015 EDM Leopard for Adaptive Tutoring Evaluation


Page 40: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Procedure:
1.  Select skills with a decreasing/flat learning curve (aka bad data)
2.  Train a student model on those skills
3.  Compare classification metrics with Leopard


Page 41: 2015 EDM Leopard for Adaptive Tutoring Evaluation

                     F1    AUC    Score   Effort
Bad student model    .79   .85
Majority class       0     .50

Page 42: 2015 EDM Leopard for Adaptive Tutoring Evaluation

                     F1    AUC    Score   Effort
Bad student model    .79   .85    .18     10.1
Majority class       0     .50    .18     11.2

Page 43: 2015 EDM Leopard for Adaptive Tutoring Evaluation

What does this mean?
•  High-accuracy models may not be useful for adaptive tutoring
•  We need to change how we report results in adaptive tutoring

Page 44: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Solutions:
•  Report classification accuracy averaged over skills (for models with 1 skill per item)
   ✖ Not useful for comparing or discovering different skill models
•  Report a “difficulty” baseline
   ✖ Experiments suggest that models with baseline performance can be useful
•  Use Leopard

Page 45: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Let’s use all data, and pick an item-to-skill mapping:

                      AUC    Score   Effort
Coarse (27 skills)    .69
Fine (90 skills)      .74

Page 46: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Let’s use all data, and pick an item-to-skill mapping:

                      AUC    Score   Effort
Coarse (27 skills)    .69    .41     55.7
Fine (90 skills)      .74    .36     88.1

Page 47: 2015 EDM Leopard for Adaptive Tutoring Evaluation

With synthetic data we can use Teal as the ground truth. We generate 500 synthetic datasets with known Knowledge Tracing parameters. Which metrics correlate best with the truth?
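For context, a minimal sketch of the standard Knowledge Tracing generative model that such synthetic data would follow. The slides do not show the generation code, and the parameter values below are made up.

```python
import random

def simulate_kt_sequence(p_init, p_learn, p_guess, p_slip, n_items):
    """Sample one practice sequence from Bayesian Knowledge Tracing."""
    known = random.random() < p_init          # initial mastery
    responses = []
    for _ in range(n_items):
        p_correct = (1 - p_slip) if known else p_guess
        responses.append(int(random.random() < p_correct))
        if not known:                         # chance to learn after practice
            known = random.random() < p_learn
    return responses

# e.g., one student's 10 practice opportunities on one skill:
print(simulate_kt_sequence(p_init=.3, p_learn=.2, p_guess=.2, p_slip=.1,
                           n_items=10))
```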


Page 48: 2015 EDM Leopard for Adaptive Tutoring Evaluation


Page 49: 2015 EDM Leopard for Adaptive Tutoring Evaluation


Page 50: 2015 EDM Leopard for Adaptive Tutoring Evaluation

Discussion


Page 51: 2015 EDM Leopard for Adaptive Tutoring Evaluation

In EDM 2014 we proposed the “FAST” toolkit for Knowledge Tracing with Features

Page 52: 2015 EDM Leopard for Adaptive Tutoring Evaluation

“The FAST model improves AUC of ROC by 25%”

Page 53: 2015 EDM Leopard for Adaptive Tutoring Evaluation
Page 54: 2015 EDM Leopard for Adaptive Tutoring Evaluation


Page 55: 2015 EDM Leopard for Adaptive Tutoring Evaluation

          Teal                                   White
Input     Knowledge Tracing family parameters    Student’s correct or incorrect
                                                 responses; sequence length;
                                                 student model’s predictions of
                                                 correct/incorrect
Target    Probability of correct                 Probability of correct