2015EDM: Feature-Aware Student Knowledge Tracing Tutorial
Introduction to Feature-Aware Student Knowledge Tracing (FAST) Model and Toolkit
José P. González-Brenes, Pearson
Yun Huang, University of Pittsburgh
Acknowledging: Peter Brusilovsky, University of Pittsburgh
Outline • Introduction • FAST – Feature-Aware Student Knowledge Tracing
• Toolkit 1-2-3 • Walk-through examples
1. Item difficulty 2. Temporal Item Response Theory
• Conclusion
Motivation
• Personalize learning of students – For example, teach students new material as
they learn, so we don’t teach students material they know
• How? Typically with Knowledge Tracing
[Figure: per-skill sequences of student responses over time, e.g. ✗ ✗ ✓ ✓ ✗ ✗ ✓ ✓ ✓, each sequence labeled by whether the student learns the skill or not]
• Knowledge Tracing fits a two-state HMM per skill
• A binary latent variable indicates whether the student knows the skill
• Four parameters: 1. Initial Knowledge 2. Learning 3. Guess 4. Slip
[Diagram: the HMM's Transition (learning) and Emission (guess/slip) probabilities]
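As an illustrative sketch of how these four parameters drive prediction (this is not the toolkit's implementation, and the parameter values below are made up), one Knowledge Tracing step predicts p(correct) and then updates the mastery estimate:

```python
def kt_step_correct(p_know, guess, slip, learn):
    """One KT step: predict p(correct), then update mastery after observing a correct answer."""
    # Predict: a knowing student may slip; a non-knowing student may guess
    p_correct = p_know * (1 - slip) + (1 - p_know) * guess
    # Bayes rule: posterior mastery given the correct observation
    posterior = p_know * (1 - slip) / p_correct
    # Learning transition: an unlearned student may move to the learned state
    return p_correct, posterior + (1 - posterior) * learn

# Made-up parameters: initial knowledge 0.4, guess 0.2, slip 0.1, learning 0.3
p_correct, p_know = kt_step_correct(0.4, 0.2, 0.1, 0.3)
print(p_correct, p_know)  # 0.48 and 0.825
```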
What’s wrong?
• Only uses performance data (correct or incorrect) • We are now able to capture feature-rich data
– MOOCs & intelligent tutoring systems are able to log fine-grained data
– Used a hint, watched video, after hours practice…
• … these features can carry information or intervene on learning
What’s a researcher gotta do?
• Modify the Knowledge Tracing algorithm • For example, in just a small-scale literature survey, we found at least nine different flavors of Knowledge Tracing
Are all of those models sooooo different? • No! We identify three main variants • We call them the "Knowledge Tracing Family"
Knowledge Tracing Family
• No features: classic Knowledge Tracing
• Emission (guess/slip): item difficulty (Gowda et al '11; Pardos et al '11), student ability (Pardos et al '10), subskills (Xu et al '12), help (Sao Pedro et al '13)
• Transition (learning): student ability (Lee et al '12; Yudelson et al '13), item difficulty (Schultz et al '13)
• Both (guess/slip and learning): help (Becker et al '08)
[Diagrams: graphical models of each variant, with knowledge nodes k, observation nodes y, and feature nodes f attached to the emission and/or transition]
• Each model is successful for an ad hoc purpose only – hard to compare models – doesn't help to build a cognition theory
• Learning scientists have to worry about both features and modeling
• These models are not scalable:
– they rely on Bayes nets' conditional probability tables
– memory grows exponentially with the number of features
– runtime grows exponentially with the number of features (with exact inference)
Example: Emission probabilities with no features:

Knowledge   p(Correct)
False       0.10 (guess)
True        0.85 (1 - slip)

2^(0+1) = 2 parameters!
Example: Emission probabilities with 1 binary feature:

Knowledge   Hint    p(Correct)
False       False   0.06
True        False   0.75
False       True    0.25
True        True    0.99

2^(1+1) = 4 parameters!
Example: Emission probabilities with 10 binary features:

Knowledge   F1 … F10              p(Correct)
False       False False … False   0.06
…           …                     …
True        True True … True      0.90

2^(10+1) = 2048 parameters!
Outline • Introduction • FAST – Feature-Aware Student Knowledge Tracing
1. Low Complexity with Many Features 2. Flexible Feature Engineering 3. Flexible Parameterization 4. High Predictive Performance, Plausibility and
Consistency • Toolkit 1-2-3 • Walk-through Examples
1. Item difficulty 2. Temporal Item Response Theory
• Conclusion
Something old…
[Diagram: graphical model with knowledge node k, observation node y, and feature nodes f]
• Uses the most general model in the Knowledge Tracing Family
• Parameterizes learning or emission (guess and slip) probabilities

Something new…
• Instead of using inefficient conditional probability tables, we use logistic regression [Berg-Kirkpatrick et al '10]
• Exponential complexity → linear complexity
Example (guess and slip):

# of features   # of parameters in KT   # of parameters in FAST*
0               2                       2
1               4                       4
10              2,048                   22
25              67,108,864              52

52 features are not that many, and yet they become intractable for the Knowledge Tracing Family.

* Parameterizing guess and slip probabilities without sharing features.
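The counts in this table can be reproduced directly (a sketch; it assumes k binary features parameterizing both guess and slip with no feature sharing, as the footnote states):

```python
def kt_param_count(k):
    # CPT: one p(correct) cell per (knowledge state x feature combination) = 2^(k+1)
    return 2 ** (k + 1)

def fast_param_count(k):
    # Logistic regression per knowledge state: k feature weights + 1 bias = 2(k+1)
    return 2 * (k + 1)

for k in (0, 1, 10, 25):
    print(k, kt_param_count(k), fast_param_count(k))
# 25 features already require 67,108,864 CPT parameters, but only 52 logistic weights
```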
Something blue?
• Prediction requires few implementation changes
• Training requires quite a few changes – we use a recent modification of the Expectation-Maximization algorithm proposed for computational linguistics problems [Berg-Kirkpatrick et al '10]
KT uses Expectation-Maximization:
• E-step: Forward-Backward algorithm → latent knowledge estimation
• M-step: maximum likelihood → conditional probability table lookup / update

FAST uses a recent E-M algorithm [Berg-Kirkpatrick et al '10]:
• E-step: latent knowledge estimation → "conditional probability table" lookup / update
• M-step: logistic regression weights estimation
E-step
• Slip/guess lookup: use the multiple parameters of the logistic regression to fill the values of a conditional probability table (Mastery → p(Correct)) [Berg-Kirkpatrick et al '10]

M-step
• Latent knowledge estimates become instance weights for the logistic regression
• Train a weighted logistic regression!
P(hidden | observed), i.e., P(Learned_t | O): the probability of being in the Learned state at the t-th practice, given a student's practice sequence O.
[Figure: the feature design matrix. Each observation appears twice: once with the "active when learned" features, given instance weight P(Learned_t|O), and once with the "active when unlearned" features, given instance weight P(Unlearned_t|O); bias features are always active.]
Feature Design Matrix for slip/guess logistic regression
Parameterization example
• To model the impact of example usage, we construct a binary example feature E_t: whether the student clicked an example before the current practice.
• This feature affects the guess and slip probabilities: when a student has checked an example, does he/she have a higher probability to guess (and a lower probability to slip)?
Parameterization example
• Mary attempted a problem twice. On the 1st attempt she failed; she then checked an example, and on the 2nd attempt she succeeded.
• Each attempt is duplicated: the original datapoint for the learned state (slip features, E_t and a bias, active) and a copy for the unlearned state (guess features, E_t and a bias, active).
• Each copy gets a standard logistic regression instance weight from the E-step: P(Learned_t|O) (here 0.3 and 0.6) for the learned copies, and P(Unlearned_t|O) (here 0.7 and 0.4) for the unlearned copies.
• A weighted logistic regression is then trained to estimate the coefficients s2, s1, g2, g1.
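A minimal Python sketch of how this duplicated, instance-weighted training set could be assembled (the feature layout and names are illustrative; the posteriors 0.3 and 0.6 are the values from the slide):

```python
# Mary's two attempts: (E_t = clicked an example before the attempt, outcome)
attempts = [(0, 0), (1, 1)]        # 1st: no example, incorrect; 2nd: example, correct
p_learned = [0.3, 0.6]             # E-step posteriors P(Learned_t | O)

rows = []
for (e, y), p in zip(attempts, p_learned):
    # Copy for the learned state: slip-side features active, weight P(Learned_t|O)
    rows.append(([e, 1, 0, 0], y, p))          # [Et_slip, bias_slip, Et_guess, bias_guess]
    # Copy for the unlearned state: guess-side features active, weight P(Unlearned_t|O)
    rows.append(([0, 0, e, 1], y, 1 - p))

# A standard weighted logistic regression fit on `rows` then recovers the
# coefficients s2, s1, g2, g1 from the slide.
for features, outcome, weight in rows:
    print(features, outcome, weight)
```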
Parameterization example
Slip/Guess logistic regression
[Figure: the feature design matrix, instantiated for the slip/guess logistic regression]
When FAST uses only intercept terms as features for the two levels of mastery, it is equivalent to Knowledge Tracing!
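To see the equivalence: a logistic regression with only a per-state intercept collapses to one constant correctness probability per knowledge state, which is exactly KT's guess and 1 - slip. A sketch, with intercepts chosen to reproduce the 0.10 and 0.85 of the earlier example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Intercept-only emission: one bias weight per mastery level, no other features
bias_unlearned = -2.197   # sigmoid(-2.197) ~ 0.10 -> the guess probability
bias_learned = 1.735      # sigmoid(1.735)  ~ 0.85 -> 1 - slip

p_correct_unlearned = sigmoid(bias_unlearned)
p_correct_learned = sigmoid(bias_learned)
```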
March 28, 2014 31
[Figure: execution time (min.) vs. # of observations (7,100 / 11,300 / 15,500 / 19,800). BNT-SM (no features): 23, 28, 46, 54 min. FAST (no features): 0.08, 0.10, 0.12, 0.15 min.]
FAST is 300x faster than BNT-SM!
(On an old laptop, no parallelization, nothing fancy)
BNT-SM vs FAST
• BNT-SM offers functionality that FAST doesn't have; for example, it allows different ways to learn parameters.
• We recommend exploring different tools to find the best fit.
Outline • Introduction • FAST – Feature-Aware Student Knowledge Tracing
1. Low Complexity with Many Features 2. Flexible Feature Engineering 3. Flexible Parameterization 4. High Predictive Performance, Plausibility and
Consistency • Toolkit 1-2-3 • Walk-through Examples
1. Item difficulty 2. Temporal Item Response Theory
• Conclusion
What kind of features can we put?
• Item dummies → incorporating item difficulties
• Student dummies → incorporating student abilities
• Item and student dummies → temporal Item Response Theory, incorporating both item difficulties and student abilities
• Subskill dummies → incorporating subskill difficulties
What kind of features can we put?
• Binary hint features → whether a student requested a hint or not
• Binary example features → whether a student checked an example or not
…
Outline • Introduction • FAST – Feature-Aware Student Knowledge Tracing
1. Low Complexity with Many Features 2. Flexible Feature Engineering 3. Flexible Parameterization 4. High Predictive Performance, Plausibility and
Consistency • Toolkit Setup • Examples
1. Item difficulty 2. Multiple subskills 3. Temporal Item Response Theory
• Conclusion
What can we parameterize?
• For example, to model the impact of example usage, we can consider the following parameterizations with a binary example feature:
(Huang et al. ’15)
Outline • Introduction • FAST – Feature-Aware Student Knowledge Tracing
1. Low Complexity with Many Features 2. Flexible Feature Engineering 3. Flexible Parameterization 4. High Predictive Performance, Plausibility and Consistency
• Toolkit 1-2-3 • Walk-through Examples
1. Item difficulty 2. Temporal Item Response Theory
• Conclusion
Beyond higher predictive performance….
• FAST promises higher predictive performance than Knowledge Tracing with proper feature engineering.
• Moreover, it increases model plausibility and consistency.
• Details are in our paper at EDM 2015 (http://www.educationaldatamining.org/EDM2015/uploads/papers/paper_164.pdf). A quick introduction to how the FAST toolkit addresses these issues:
– by specifying the number of random restarts, it automatically picks the one with the maximum log likelihood on the train set;
– it outputs plausibility evaluation metrics.
Outline • Introduction • FAST – Feature-Aware Student Knowledge Tracing
1. Low Complexity with Many Features 2. Flexible Feature Engineering 3. Flexible Parameterization 4. High Predictive Performance, Plausibility and
Consistency • Toolkit 1-2-3 • Walk-through Examples
1. Item difficulty 2. Temporal Item Response Theory
• Conclusion
Toolkit Setup -- input
1. Download the latest release from https://github.com/ml-smores/fast/releases
2. Decompress the file (fast-2.1.0-release.zip). The main files for getting started are:
   • fast-2.1.0-final.jar
   • data/item_exp/FAST+item1.conf (configuration file)
   • data/item_exp/train0.csv, test0.csv (data)
3. Open a terminal, go to the directory where fast-2.1.0-final.jar is located, and type:
   java -jar fast-2.1.0-final.jar ++data/item_exp/FAST+item1.conf
Details can be found in our wiki: https://github.com/ml-smores/fast/wiki
Toolkit Setup -- output (data/item_exp/)
• XXX_Prediction.csv – P(Correct); knowledge estimation P(Learned|O) …
• XXX_Evaluation.csv – overall AUC, mean AUC …
• XXX_Parameters.csv – non-parameterized, and parameterized (feature weights)
• Runtime.log
Outline • Introduction • FAST – Feature-Aware Student Knowledge Tracing
1. Low Complexity with Many Features 2. Flexible Feature Engineering 3. Flexible Parameterization 4. High Predictive Performance, Plausibility and
Consistency • Toolkit 1-2-3 • Walk-through Examples
1. Item difficulty 2. Temporal Item Response Theory
• Conclusion
Modeling item difficulty
• Within the same skill, students may perform well on easier items (problems) and worse on harder ones.
• Presumably, harder items have a lower guess and a higher slip probability?
• We use binary item dummies (indicators) as features to parameterize the guess and slip probabilities.
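As a sketch of what such item dummies look like (the item names are hypothetical; the toolkit can also generate these dummies automatically):

```python
# Four practice attempts on three distinct items within one skill
items = ["item_A", "item_B", "item_C", "item_A"]
vocab = sorted(set(items))
# One binary indicator column per item; exactly one is active per attempt
dummies = [[1 if item == v else 0 for v in vocab] for item in items]
print(dummies)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```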
Results on the Java dataset

Model               Overall AUC   Mean AUC
Knowledge Tracing   .71 ± .01     .58
FAST+item           .75 ± .01     .68

(6% overall AUC improvement, 17% mean AUC improvement; 95% confidence intervals)

• Java tutoring system QuizJET (Hsiao et al '10)
• 20,808 observations, 19 skills, 110 students, 70% correct
• Randomly selected 80% in train, 20% in test
• Parameterizing emission
Current experiment
• Here, we experiment on a public dataset from PSLC DataShop, the Geometry dataset (Koedinger et al '10).
• 5,055 observations, 18 skills, 59 students, 75% correct.
• Randomly selected 80% of students in train, remaining in test.

Model        #Random restarts   Parameterization
KT1          1                  /
KT2          20                 /
FAST+item1   1                  emission
FAST+item2   20                 emission
FAST+item3   20                 initial, transition, emission
Have a look at the input data…
Required columns for KT and FAST Feature columns for FAST models
data/item_exp/train0.csv
Have a look at the configuration file KT1.conf
modelName KT1
parameterizing false
parameterizingInit false
parameterizingTran false
parameterizingEmit false
forceUsingAllInputFeatures false
nbRandomRestart 1
inDir ./data/item_exp/
outDir ./data/item_exp/
trainInFilePrefix train
testInFilePrefix test
inFileSuffix .csv
EMMaxIters 500
LBFGSMaxIters 50
EMTolerance 1.0E-6
LBFGSTolerance 1.0E-6

(data/item_exp/KT1.conf)
Let’s run Knowledge Tracing baseline first … • java -jar fast-2.1.0-final.jar ++data/item_exp/KT1.conf • Open KT1_Evaluation.csv (data/item_exp/)
Model #restart Overall AUC Mean AUC Time(s)
KT1 1 .71 .55 1
• KT2.conf only changes nbRandomRestarts to 20 • java -jar fast-2.1.0-final.jar ++data/item_exp/KT2.conf • Open KT2_Evaluation.csv
Model #restart Overall AUC Mean AUC Time(s)
KT2 20 .71 .56 11
Let’s run FAST with item features …
modelName FAST+item1
parameterizing true
parameterizingInit false
parameterizingTran false
parameterizingEmit true
forceUsingAllInputFeatures true
nbRandomRestart 1
inDir ./data/item_exp/
outDir ./data/item_exp/
trainInFilePrefix train
testInFilePrefix test
inFileSuffix .csv
EMMaxIters 500
LBFGSMaxIters 50
EMTolerance 1.0E-6
LBFGSTolerance 1.0E-6

(data/item_exp/FAST+item1.conf)
Let’s run FAST+item parameterizing emission probabilities …
• Run FAST+item1: java -jar fast-2.1.0-final.jar ++data/item_exp/FAST+item1.conf
• Run FAST+item2 (nbRandomRestart=20): java -jar fast-2.1.0-final.jar ++data/item_exp/FAST+item2.conf
• Open FAST+item1_Evaluation.csv, FAST+item2_Evaluation.csv
Model #restart Overall AUC Mean AUC Time(s)
KT2 20 .71 .56 11
FAST+item1 1 .71 .58 10
FAST+item2 20 .72 .60 145
7% improvement
What about parameterizing all the probabilities?
modelName FAST+item3
parameterizing true
parameterizingInit true
parameterizingTran true
parameterizingEmit true
forceUsingAllInputFeatures true
nbRandomRestart 1
inDir ./data/item_exp/
outDir ./data/item_exp/
trainInFilePrefix train
testInFilePrefix test
inFileSuffix .csv
EMMaxIters 500
LBFGSMaxIters 50
EMTolerance 1.0E-6
LBFGSTolerance 1.0E-6

(data/item_exp/FAST+item3.conf)
What about parameterizing all the probabilities?
• java -jar fast-2.1.0-final.jar ++data/item_exp/FAST+item3.conf
• Open FAST+item3_Evaluation.csv
• Running with all probabilities parameterized and 20 restarts takes more than 7 minutes, yet we get the same result with only 1 restart (FAST+item3).

Model        #restarts   Parameterization                 Overall AUC   Mean AUC   Time(s)
KT2          20          /                                .71           .56        11
FAST+item3   1           initial, transition, emission    .72           .62        27

(11% mean AUC improvement)
Outline • Introduction • FAST – Feature-Aware Student Knowledge Tracing
1. Low Complexity with Many Features 2. Flexible Feature Engineering 3. Flexible Parameterization 4. High Predictive Performance, Plausibility and
Consistency • Toolkit 1-2-3 • Walk-through Examples
1. Item difficulty 2. Temporal Item Response Theory
• Conclusion
Two paradigms: (50 years of research in 1 slide) • Knowledge Tracing
– Allows learning – Every item = same difficulty – Every student = same ability
• Item Response Theory – NO learning – Models items difficulties – Models student abilities
Can FAST help merge the paradigms?
Item Response Theory
• In its simplest form, it's the Rasch model
• The Rasch model can be formulated in many ways: typically using latent variables, or as logistic regression
• As logistic regression: a feature per student, and a feature per item • We end up with a lot of features! – Good thing we are using FAST ;-)
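In the logistic-regression form, correctness is predicted from the difference between the weight of the active student dummy (ability) and the weight of the active item dummy (difficulty). A sketch with made-up weights:

```python
import math

def rasch_p_correct(ability, difficulty):
    """Rasch model: P(correct) = sigmoid(student ability - item difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Made-up weights: a stronger student on a moderately hard item
p = rasch_p_correct(ability=1.0, difficulty=0.5)
print(p)  # ~0.622
```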
Results on the Java dataset

Model               Overall AUC   Mean AUC
Knowledge Tracing   .67 ± .03     .56
FAST + IRT          .76 ± .03     .70

(13% overall AUC improvement, 25% mean AUC improvement)

• Java tutoring system QuizJET (Hsiao et al '10)
• 6,549 observations (first attempts), 60% correct
• Randomly selected 50% of students for train; for the remaining students, placed the first half of their practices in train and predicted the rest
• Only considered parameterizing emission
Current experiment
• We choose one skill, "Nested Loops", from the Java dataset (caution: this is a private dataset; please don't distribute this subset).
• Randomly select 50% of students for train; for the remaining students, place the first half of their practices in train and predict the rest.
• Only consider parameterizing emission.
• The toolkit can automatically generate student and item dummy features from the "student" and "problem" columns of the train and test sets.
• Here, we force both hidden states to share features, which means the student ability or item difficulty remains the same whether the student is in the learned or unlearned state.
Have a look at the input data… Train set
Test set
Training Datapoints
Test Datapoints
• We need to put the entire skill-student sequence in the test set (using the "fold" column to differentiate train and test datapoints). This allows the toolkit to update its knowledge estimates from historical practices.
Have a look at the configuration file for FAST …
modelName FAST+IRT1
parameterizing true
parameterizingInit false
parameterizingTran false
parameterizingEmit true
forceUsingAllInputFeatures true
generateStudentDummy true
generateItemDummy true
nbRandomRestart 1
inDir ./data/item_exp/
outDir ./data/item_exp/
trainInFilePrefix train
testInFilePrefix test
inFileSuffix .csv
EMMaxIters 500
LBFGSMaxIters 50
EMTolerance 1.0E-6
LBFGSTolerance 1.0E-6

(data/IRT_exp/FAST+IRT1.conf)
Let’s run baselines and FAST+IRT models …
• Type the following command consecutively for 1) KT1.conf, 2) KT2.conf, 3) FAST+item1.conf, 4) FAST+item2.conf
java -jar fast-2.1.0-final.jar ++data/IRT_exp/XXX.conf • Open XXX_Evaluation.csv
Model       #restarts   AUC   Time(s)
KT1         1           .60   <1
KT2         20          .59   1
FAST+IRT1   1           .71   3
FAST+IRT2   20          .73   39

(24% AUC improvement)
Outline • Introduction • FAST – Feature-Aware Student Knowledge Tracing
1. Low Complexity with Many Features 2. Flexible Feature Engineering 3. Flexible Parameterization 4. High Predictive Performance, Plausibility and
Consistency • Toolkit 1-2-3 • Walk-through Examples
1. Item difficulty 2. Temporal Item Response Theory
• Conclusion
Comparison of existing techniques

Model                                        allows features   slip/guess   recency/ordering   learning
FAST                                         ✓                 ✓            ✓                  ✓
PFA (Pavlik et al '09)                       ✓                 ✗            ✗                  ✓
Knowledge Tracing (Corbett & Anderson '95)   ✗                 ✓            ✓                  ✓
Rasch Model (Rasch '60)                      ✓                 ✗            ✗                  ✗
• FAST lives up to its name
• FAST provides high flexibility in utilizing features, and as our studies show, even simple features improve significantly over Knowledge Tracing
• The effect of features depends on how smartly they are designed and on the dataset
• We look forward to more clever uses of feature engineering for FAST in the community.
Thank you!
Multiple subskills
• Experts annotated items (questions) with a single skill and multiple subskills
Multiple subskills & Knowledge Tracing
• Original Knowledge Tracing cannot model multiple subskills
• Most Knowledge Tracing variants assume equal importance of subskills during training (and then adjust it during testing)
• The state-of-the-art method, LR-DBN [Xu and Mostow '11], assigns importance in both training and testing
FAST can handle multiple subskills
• Parameterize learning • Parameterize slip and guess
• Features: binary variables that indicate presence of subskills
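Unlike item dummies, several subskill indicators can be active at once for the same practice (a sketch with hypothetical subskill and item names):

```python
subskills = ["loops", "arrays", "conditionals"]
# Each item is annotated with one or more subskills
item_annotations = {"q1": ["loops"], "q2": ["loops", "arrays"]}

def subskill_features(item):
    # Binary indicator per subskill; multiple indicators can be 1 for one item
    return [1 if s in item_annotations[item] else 0 for s in subskills]

print(subskill_features("q2"))  # [1, 1, 0]
```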
FAST vs Knowledge Tracing: slip parameters of subskills
• Conventional Knowledge Tracing assumes that all subskills within a skill have the same difficulty (red line)
• FAST can identify different difficulties between subskills (state of the art: Xu & Mostow '11)
• Does it matter?
Benchmark

Model             AUC
FAST              .74
LR-DBN            .71
Single-skill KT   .71
KT - Weakest      .69
KT - Multiply     .62

• The 95% confidence intervals are within +/- .01 points
• We are testing on non-overlapping students; LR-DBN was designed/tested on overlapping students and didn't compare to single-skill KT
Have a look at the input data…
data/others/FAST+subskill_train0.csv
Let’s run FAST+subskill models …
• Move to “data/others/” folder • Copy fast-2.1.0-final.jar under this folder • Create “input” folder under this folder, and put FAST
+subskill_train0.txt and FAST+subskill_test0.txt under it. • Type the following command for
java -jar fast-2.1.0-final.jar ++FAST+subskill.conf