Assessing Students’ Performance Longitudinally: Item Difficulty Parameter vs. Skill Learning Tracking
Mingyu Feng, Worcester Polytechnic Institute
Neil T. Heffernan, Worcester Polytechnic Institute


Page 1

Assessing Students’ Performance Longitudinally: Item Difficulty Parameter vs. Skill Learning Tracking

Mingyu Feng, Worcester Polytechnic Institute
Neil T. Heffernan, Worcester Polytechnic Institute

Page 2

The “ASSISTment” System

• A web-based tutoring system that assists students in learning mathematics and gives teachers an assessment of their students’ progress

Page 3

An ASSISTment
• We break multi-step problems into “scaffolding questions”
• “Hint messages”: given on demand, they suggest what step to do next
• “Buggy messages”: context-sensitive feedback on a specific wrong answer
• (Feng, Heffernan & Koedinger, 2006a)
• Skills
  – The state reports to teachers on 5 areas
  – We seek to report on more and finer grain-sized skills

[Screenshot callouts (demo/movie): the original question, tagged with (a) Congruence, (b) Perimeter, (c) Equation-Solving; the 1st scaffolding question (Congruence); the 2nd scaffolding question (Perimeter); a buggy message; a hint message; Geometry.]

Page 4

The ASSISTment Project

What Level of Tutor Interaction is Best? By Leena Razzaq, Neil Heffernan & Robert Lindeman

Goal
To determine the best level of tutor interaction to help students learn the mathematics required for a state exam, based on their math proficiency.

Background on ASSISTments
The ASSISTment System is a web-based assessment system that tutors students on math problems. The system is freely available at www.assistment.org. As of March 2007, thousands of Worcester middle school students used ASSISTments every two weeks as part of their math class. Teachers use the fine-grained reporting that the system provides to inform their instruction.

The Interaction Hypothesis
When one-on-one tutoring, either by a human tutor or a computer tutor, is compared to a less interactive control condition that covers the same content, students will learn more in the interactive condition than in the control condition.

Experiment Design
3 levels of interaction:
• Scaffolding + hints represents the most interactive experience: students must answer scaffolding questions, i.e., learning by doing.
• Hints on demand are less interactive because students do not have to respond to hints, but they can get the same information as in the scaffolding questions by requesting hints.
• Delayed feedback is the least interactive condition because students must wait until the end of the assignment to get any feedback.

2 levels of math proficiency:
• Students in Honors math classes.
• Students in Regular math classes.

Analysis and Conclusions

566 8th grade students participated. Results showed a significant interaction between condition and math proficiency (p < 0.05), a good case for tailoring tutor interaction to types of students.

• Regular students learned more with scaffolding + hints (p < 0.05): less-proficient students benefit from more interaction and coaching through each step of solving a problem.
• Honors students learned more with delayed feedback (p = 0.075): more-proficient students benefit from seeing problems worked out and getting the big picture.
• Delayed feedback performed better than hints on demand (p = 0.048) for both more- and less-proficient students: students don’t do as well when we depend on student initiative.

Experiment Screen Shots

This work has been accepted for publication at the 2007 Artificial Intelligence in Education Conference in Los Angeles.

[Screen shots show hints on the scaffolding questions, the four scaffolding questions (Scaff. Q. #1 through #4), and a sequence of seven hints (Hint #1 through #7).]

Collaborators Sponsors

Students in this condition interact with the tutor by answering scaffolding questions.

Students in this condition can get hints when they ask for them by pressing the hint button.

Students in this condition get no feedback until the end of the assignment, when they get answers and solutions.

Students see the solution after they finish all of the problems.

Is this hypothesis true? We found evidence to support this hypothesis in some cases, but not in others. Based on the results of Razzaq & Heffernan (2006), we believe the difficulty of the material influences how effective interactive tutoring will be.

Our Hypothesis

More interactive intelligent tutoring will lead to more learning (based on post-test gains) than less interactive tutoring. Differences in learning will be more significant for less-proficient students than for more-proficient students.

Page 5

By tagging items with skills, teachers can 1) get reports on which skills students are doing poorly on, and 2) track them over time.

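The per-skill, over-time reporting described above can be illustrated with a small sketch. This is only a toy aggregation, not the actual ASSISTment reporting code; the skill names, item tagging, and response log below are invented for illustration.

```python
from collections import defaultdict

# Invented item-to-skill tagging (a tiny "Q-matrix"); real ASSISTment items
# are tagged with skills from a chosen skill model.
item_skills = {
    "item_1": ["Congruence"],
    "item_2": ["Perimeter"],
    "item_3": ["Equation-Solving"],
}

# Invented response log: (student, item, correct, month).
responses = [
    ("alice", "item_1", 1, "Sept"),
    ("alice", "item_2", 0, "Sept"),
    ("alice", "item_2", 1, "Oct"),
    ("alice", "item_3", 0, "Oct"),
]

def skill_report(responses, item_skills):
    """Percent correct per (student, skill, month): the kind of fine-grained,
    time-indexed summary a teacher report can show."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, attempted]
    for student, item, correct, month in responses:
        for skill in item_skills.get(item, []):
            cell = totals[(student, skill, month)]
            cell[0] += correct
            cell[1] += 1
    return {key: 100.0 * c / n for key, (c, n) in totals.items()}

for key, pct in sorted(skill_report(responses, item_skills).items()):
    print(key, f"{pct:.0f}% correct")
```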
CAREER: Learning about Learning: Using Intelligent Tutoring Systems as a Research Platform to Investigate Human Learning

Free researcher, teacher and student accounts for 7th-10th grade math preparation at www.assistment.org

ID | Question Text | Correct Answer | % Correct | Hint Req. | # Attempt | Common Errors (Resp. #) | Standard
(original) | Triangles ABC and DEF are congruent. The perimeter of triangle ABC is 23 inches. What is the length of side DF in triangle DEF? | 10 | 16% | 49% | 188 | 16 (14), 5 (12), 8 (7) | Geometry
1 | Which side of triangle ABC has the same length as side DF of triangle DEF? | AC | 23% | 44% | 144 | ab (8), AB (4), 10 (3) | Geometry
2 | What is the perimeter of triangle ABC? | 2x + x + 8 | 41% | 20% | 143 | 2x + 8 (23) | Measurement
3 | Now, given the perimeter of triangle ABC equals 23 inches, you can write the equation 2x + x + 8 = 23 and solve it for x. What is the value of x? | 5 | 36% | 44% | 140 | 15 (4), 8 (4), 10 (2) | Algebra & Number Sense
4 | Remember, we are looking for side DF. Enter the length of side DF: | 10 | 43% | 34% | 135 | 5 (16), 16 (3), 6 (2) | Algebra & Geometry

[Screenshot callouts: what a student sees, the uploaded image, the first scaffold, the second scaffold, the three hint messages for the second scaffold, and the “bottom out” hint.]

This project has 5 main research thrusts. 1) For the designing cognitive models thrust, we report that we can do a better job of modeling students by using finer-grained models (i.e., models that track more knowledge components) than coarser-grained models (Pardos et al., 2006; Feng et al., 2006). 2) For the research thrust of inferring what students know and are learning, we can report two new results. First, we can do a better job of assessing students (as measured by predicting state test scores) by seeing how much tutoring they need to solve a question (Feng et al., 2006a). Second, we have shown that we can do a better job of modeling students’ learning over time by building models that allow us to model different rates of learning for different skills (Feng et al., 2006a). 3) For the optimizing learning thrust, we have new empirical results showing that students learn more with the type of tutoring we provide than with a traditional Computer-Aided Instruction (CAI) control (Razzaq & Heffernan, 2006). 4) For the thrust of informing educators, we have recent publications on the types of feedback we give educators (Feng & Heffernan, 2005, 2006). Additionally, we have work showing that we can track student motivation and then inform educators in novel ways that increase student motivation (Walonoski & Heffernan, 2006a, 2006b). 5) Finally, for the thrust of allowing user adaptation, we have shown that teachers can use the authoring tools we have built to quickly create content for their classes (Heffernan, Turner et al., 2006).

References are at www.assistment.org

What the teacher who builds the tutoring sees.

This shows a student who first guessed 16 (the real answer is 24), then got the first scaffolding question correct with “AC”. The student then clicked on “½*8x” and the system spit out the “bug” message in red. The student, twice in a row, asked for a hint, shown in the green box.

The author wrote the hint message shown in the green box by typing it in here.

This dialog shows that the author has tagged the third scaffold in three different grain-sized models.

Recent Results - 2006

What the State MCAS test provides

Teachers get reports per student, per skill, and per item.

Teacher Reports

Goal
1) To help researchers learn about student learning. 2) To help students learn math and to report valuable information to teachers about their students’ knowledge.

• The Assistment System is a web-based assessment system that tutors students on items they get wrong.
• The system is freely available at www.assistment.org
• Thousands of students in Worcester and surrounding towns use it every two weeks as part of their math class or for homework.
• The system tracks 98 skills for 8th grade math, and reports on those skills to teachers.
• Teachers and schools (and researchers) can use our web-based tools to create their own content quickly.

Funding/People

• PI Neil Heffernan at WPI with collaborator Kenneth Koedinger at Carnegie Mellon.
• Over 50 people have helped contribute.
• Thanks for $3 million in funding from the National Science Foundation (NSF) CAREER program, the US Department of Education, the Office of Naval Research, the Spencer Foundation, and the US Army.
• Contact: Professor Neil T. Heffernan, (508) 831-5569, [email protected]

Do Students Learn from Assistments?

• Yes! We compared 19 pairs of items that address the same concept with 681 students and got significant results (p < .05). See Razzaq et al. (2005) and Razzaq & Heffernan (2006).

Summary

Do Assistments Assess Accurately?

• Yes, the Assistment System can predict a student’s MCAS score quite reliably and can track different rates of learning for different skills. See Feng, Heffernan & Koedinger (2006).

* Feng, M., Heffernan, N. T., & Koedinger, K. R. (2006). Predicting state test scores better with intelligent tutoring systems: Developing metrics to measure assistance required. The 8th International Conference on Intelligent Tutoring Systems, 2006, Taiwan.

* Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N. T., Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T. E., Upalekar, R., Walonoski, J. A., Macasek, M. A., & Rasmussen, K. P. (2005). The Assistment Project: Blending assessment and assisting. The 12th Annual Conference on Artificial Intelligence in Education, 2005, Amsterdam.

* Razzaq, L., & Heffernan, N. T. (2006). Scaffolding vs. hints in the Assistment System. The 8th International Conference on Intelligent Tutoring Systems, 2006, Taiwan.

[Screenshot callouts: the original question, the third scaffold, and the fourth and last scaffold.]

Page 6

Scaling up a Server-Based Web Tutor
Jozsef Patvarczki & Neil Heffernan

[Architecture diagrams (old, current, and new in progress): web clients connect over the Internet to a load balancer that fronts N Apache Tomcat web/application servers (user actions in, HTML rendering out) and N databases with a master and a backup (queries in, query results out). The current setup uses CARP and a virtual IP for www.assistment.org, SLONY-I and pgpool, a cluster data farm, and a monitoring environment.]

Our research team has built a web-based tutor, located at www.ASSISTment.org [1], that is used by hundreds of students a day in Worcester and surrounding towns. The system’s focus is teaching 8th and 10th grade mathematics and MCAS preparation. Because it is easily accessible, it helps lower the entry barrier for teachers and enables both teachers and researchers to collect data and generate reports. Scaling up a server-based intelligent tutoring system requires developers to care about speed and reliability. We will present how the Assistment system can improve performance and reliability with a fault-tolerant, scalable architecture.

Introduction

•Two concerns when running the Intelligent Tutor on a central server are:

•1) building a scalable server architecture; •2) providing reliable service to researchers, teachers, and students.

• We will answer several research questions: 1) Can we reduce the cost of authoring ITSs? 2) How can we improve performance and reliability with a better server architecture?

System Scalability and Reliability
• In order to serve thousands of users, we must achieve high reliability and scalability at different levels.
• Scalability at our first entry point comes from the use of a virtual IP for www.assistment.org, provided by the CARP protocol.
• Random and round-robin redirection algorithms can provide very effective load sharing, and the load balancer distributes load over multiple application servers (a minimal round-robin sketch follows below).
• This allows us to redirect incoming web requests and build a web portal application in a multiple-server environment.
• The monitoring system, which uses Selenium, has allowed us to send text messages to our administrators when the system goes down.
• Multiple database servers with automatic synchronization, pooling, and fail-over detection.

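As a rough illustration of the round-robin redirection mentioned above, here is a minimal sketch. The server names are placeholders, and this is not the project's actual load-balancer or CARP configuration.

```python
from itertools import cycle

# Hypothetical pool of application servers sitting behind the virtual IP.
app_servers = ["tomcat-1:8080", "tomcat-2:8080", "tomcat-3:8080"]

# Round-robin redirection: each incoming request is handed to the next
# server in the pool, spreading load evenly across application servers.
pool = cycle(app_servers)

def route(request_id):
    """Pick the next server for an incoming request."""
    return f"request {request_id} -> {next(pool)}"

for i in range(5):
    print(route(i))
```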
Results

Reference
1. Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N. T., et al. (2005). The Assistment Project: Blending assessment and assisting. 12th Annual Conference on Artificial Intelligence in Education, 2005, Amsterdam.

• Since each public school class has about 20 students, we noticed clusters (shown in ovals in the bottom left) of intervals where a single class was logged on.
• The log-on procedure is the most expensive step in the process, and this data shows that it might be a good place for us to improve.
• We noticed a second cluster of around 40 users, which most likely represents instances where two classes of students were using the system simultaneously.
• There was no appreciable trend toward slower page creation times with more users.

• Three simulated scenarios with a 10 s random delay between student actions:
  – In the first scenario we used 50 threads simulating 50 students working without the load balancer, with one application server and one database.
  – The second scenario used the load balancer and two application servers.
  – The third scenario used a web-cache technique and the load balancer.

• We seem to have been able to get linear speed-up with the help of the load balancer and an additional application server.
• We may be able to reduce the execution time of computation-intensive applications with the help of GRID computing.

This problem uses a pseudo-tutor (state-based implementation) with pre-made scaffolding and hint questions selected based upon student input. Incorrect responses are in red, and hints are in green.

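To make the state-based pseudo-tutor idea concrete, here is a minimal sketch of pre-made scaffolding questions with canned buggy messages and on-demand hints. The question text, answers, and messages are invented for illustration and are not the actual ASSISTment content or implementation.

```python
# Each state is one pre-made scaffolding question with its canned feedback.
scaffolds = [
    {
        "question": "Which side of triangle ABC matches side DF of triangle DEF?",
        "answer": "AC",
        "buggy": {"AB": "AB corresponds to DE, not DF. Check the order of the letters."},
        "hints": ["Corresponding sides are listed in the same order.",
                  "The third letters are C and F, so the matching side is AC."],
    },
    {
        "question": "What is the perimeter of triangle ABC?",
        "answer": "2x + x + 8",
        "buggy": {"2x + 8": "Did you add all three sides?"},
        "hints": ["Perimeter is the sum of all three side lengths."],
    },
]

def respond(state, student_input, hint_index=0):
    """Return the tutor's canned response for one scaffolding state."""
    scaffold = scaffolds[state]
    if student_input == "hint":
        return scaffold["hints"][min(hint_index, len(scaffold["hints"]) - 1)]
    if student_input == scaffold["answer"]:
        return "Correct!"  # the tutor would then advance to the next state
    return scaffold["buggy"].get(student_input, "Try again.")

print(respond(0, "AB"))       # buggy message
print(respond(0, "hint", 1))  # second ("bottom out") hint
print(respond(0, "AC"))       # correct
```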
Assistment Features

Contact: Neil Heffernan, [email protected]

Test Type | Number of Unique Users | Response Time [ms]
Scenario 1 / Normal | 50 | 8955
Scenario 2 / Load-Balancer | 50 | 3624
Scenario 3 / Web-Cache | 50 | 8073

Figure labels:
• Horizontally scaled configuration: scalable, fault-tolerant, dynamically configurable
• Architecture: HTTP server as load balancer; clients’ actions represent the system’s load
• Users begin interacting with our system through the “Portal” that manages all activities
• Additional application servers for load balancing
• Example of a state-based pseudo-tutor
• GRID computing: Bayesian Network application; WPI P-GRADE GRID Portal (http://pgrade.wpi.edu), Workflow Editor and Manager, Visualization and Resource Information System

Page 7

How Were the Skill Models Created?

Page 8
Page 9

• Fine-grained skill models in reporting
  – Teachers get reports that they think are credible and useful. (Feng & Heffernan, 2005, 2006, 2007)

Page 10
Page 11
Page 12

Research Question

• In the ASSISTment project, which approach works better for assessing students’ performance longitudinally?
  – Skill learning tracking?
  – Or using an item difficulty parameter (unidimensional)?

Page 13

Data Source
• 497 students from two middle schools
• Students used the ASSISTment system every other week from Sep. 2004 to May 2005
• Real state test scores from May 2005
• Item-level online data
  – students’ binary responses (1/0) to items that are tagged in different skill models
• Some statistics
  – Average usage: 7.3 days
  – Average questions answered: 250
  – 138,000 data points

Page 14

Data Source

Page 15

Item Difficulty Parameter
Fit a one-parameter logistic (1PL) IRT model (the Rasch model) on our online data:

\Pr(X_{ni} = 1) = \frac{e^{\theta_i - \beta_n}}{1 + e^{\theta_i - \beta_n}}

• The dependent variable: the probability of a correct response by student i to item n
• The independent variables: the person’s trait score \theta_i and the item’s difficulty level \beta_n

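As a minimal sketch of the 1PL (Rasch) response probability above, the snippet below evaluates the model for made-up trait and difficulty values; it shows only the model form, not the project's actual fitting code.

```python
import numpy as np

def rasch_p(theta, beta):
    """P(correct) for a student with trait theta on an item with difficulty beta."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

# Made-up values: a student slightly above average, one easy item and one hard item.
theta = 0.4
for beta in (-1.0, 1.5):
    print(f"difficulty {beta:+.1f}: P(correct) = {rasch_p(theta, beta):.2f}")
```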
Page 16

Longitudinal Modeling

• Mixed-effects logistic regression models

• Models we fitted
  – Model-Beta: time + beta -> item response
  – Model-WPI5: time + skills in WPI-5 -> item response
  – Model-WPI78: time + skills in WPI-78 -> item response

• Evaluation
  – The accuracy of the predicted MCAS test score was used to evaluate the different approaches (a simplified model-fitting sketch follows the references below)

Singer, J. D., & Willett, J. B. (2003). Applied Longitudinal Data Analysis. Oxford University Press: New York.

Hedeker, D., & Gibbons, R. (in preparation). Longitudinal Data Analysis.

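To make the model structure concrete, here is a minimal sketch that fits a plain (fixed-effects) logistic regression of item response on time and skill with statsmodels. It is a simplified stand-in: the actual work used mixed-effects logistic regression with student-level random effects, and the tiny data frame below is invented for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented toy data: one row per item response, with months of use ('time')
# and the skill the item is tagged with in a coarse skill model.
df = pd.DataFrame({
    "correct": [0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    "time":    [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
    "skill":   ["Geometry"] * 4 + ["Measurement"] * 4 + ["Algebra"] * 4,
})

# Model-WPI5-style structure: item response predicted from time and skill.
# Student-level random effects (the "mixed" part) are omitted in this sketch.
model = smf.logit("correct ~ time + C(skill)", data=df).fit(disp=False)
print(model.params)
```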
Page 17

Results

Students | Real MCAS score | Predicted MCAS score (Model-Beta / Model-WPI-5 / Model-WPI-78) | Absolute difference between real and predicted score (Model-Beta / Model-WPI-5 / Model-WPI-78)
Tom | 22 | 20.91 / 19.86 / 17.28 | 1.09 / 2.14 / 3.72
Dick | 26 | 24.15 / 23.76 / 20.96 | 1.85 / 2.24 / 5.04
Harry | 25 | 19.08 / 17.76 / 16.21 | 5.92 / 7.24 / 7.79
Mary | 25 | 20.44 / 19.18 / 18.38 | 4.56 / 5.82 / 5.62
… | | |
Lisa | 9 | 17.04 / 17.35 / 15.87 | 8.04 / 8.35 / 6.87
%Error | | | 13.63% > 13.15% > 11.97%

P-values of both Paired t-tests are below 0.05

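As a small illustration of the percent-error metric and the paired comparison above, the sketch below recomputes them from just the example rows shown in the table. Because it uses only these five students and assumes a 54-point maximum raw MCAS score for normalization (an assumption, not stated on the slide), its numbers will not match the full-sample 13.63% / 13.15% / 11.97%.

```python
import numpy as np
from scipy import stats

# Real and predicted MCAS scores for the example students shown above.
real      = np.array([22, 26, 25, 25, 9])
pred_beta = np.array([20.91, 24.15, 19.08, 20.44, 17.04])  # Model-Beta
pred_w78  = np.array([17.28, 20.96, 16.21, 18.38, 15.87])  # Model-WPI-78

def percent_error(real, pred, max_score=54):
    """Mean absolute prediction error as a share of an assumed 54-point maximum."""
    return 100.0 * np.mean(np.abs(real - pred)) / max_score

print(f"Model-Beta:   {percent_error(real, pred_beta):.2f}% error")
print(f"Model-WPI-78: {percent_error(real, pred_w78):.2f}% error")

# Paired t-test over per-student absolute errors, as in the slide's comparison.
t, p = stats.ttest_rel(np.abs(real - pred_beta), np.abs(real - pred_w78))
print(f"paired t-test: t = {t:.2f}, p = {p:.3f}")
```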
Page 18

Conclusion
• We have found evidence that skill learning tracking can predict MCAS scores better than simply using an item difficulty parameter, and that fine-grained models did even better than the coarse-grained model.
• Our skill mapping is good (though maybe not optimal).
• We are considering using these skill models to select the next best problem to present to a student.
• Although we used the Rasch model to train the item difficulty parameter, we were not modeling students’ responses with IRT. One interesting piece of future work will be comparing our results to predictions made with an item response modeling approach.

Page 19

Modeling Student Knowledge Using Bayesian Networks to Predict Student Performance

By Zach Pardos (Neil Heffernan, advisor), Computer Science. Joint work with Brigham Anderson and Cristina Heffernan.

Goal
To evaluate the predictive performance of various fine-grained student skill models in the ASSISTment tutoring system using Bayesian networks.

• ASSISTment is a web-based assessment system for 8th-10th grade math that tutors students on items they get wrong. There are 1,443 items in the system.
• The system is freely available at www.assistment.org
• Question responses from 600 students using the system during the 2004-2005 school year were used.
• Each student completed around 260 items.

The Skill Models

The skill models were created for use in the online tutoring system called ASSISTment, founded at WPI. They consist of skill names and associations (or tagging) of those skill names with math questions on the system. Models with 1, 5, 39 and 106 skills were evaluated to represent varying degrees of concept generality. The skill models’ ability to predict performance of students on the system as well as on a standardized state test was evaluated.

The skill models used:
• WPI-106: 106 skill names were drafted and tagged to items in the tutoring system and to the questions on the state test by our subject matter expert, Cristina.
• WPI-5 and WPI-39: 5 and 39 skill names drafted by the Massachusetts Department of Education.
• WPI-1: represents unidimensional assessment.

Background on ASSISTment

Predicting student responses within the ASSISTment tutoring system

• The ASSISTment fine-grained skill models excel at assessment of student skills (see Mingyu Feng’s poster for a mixed-effects approach comparison).
• Accurate prediction means teachers can know when their students have attained certain competencies.

1. Skill probabilities are inferred from a student’s responses to questions on the system

Bayesian Belief Network

Student Test Score Prediction Process

This work has been accepted for publication at the 2007 User Modeling Conference in Corfu, Greece.

Sponsors and Collaborators

[Bar charts: Online Data Prediction Error and MCAS Test Prediction Error, showing average error % (0-30 scale) for the WPI-1, WPI-5, WPI-39, and WPI-106 skill models.]

Result: The finer-grained the model, the better the prediction accuracy. The finest-grained model, the WPI-106, performed best, with an average of only 5.5% error in predicting student answers within the system.

Result: The finest-grained model, the WPI-106, came in 2nd to the WPI-39, which may have performed better because 50% of its skills are sampled on the MCAS test vs. only 25% of the WPI-106’s.

Predicting student state test scores

Conclusions

• A Bayesian Network is a probabilistic machine learning method. It is well suited for making predictions about unobserved variables by incorporating prior probabilities with new evidence.

Bayesian Networks

• Arrows represent associations of skills with question items. They also represent conditional dependence in the Bayesian Belief Network.

•Probability of Guess is set to 10% (tutor questions are fill in the blank)

•Probability of getting the item wrong even if the student knows it is set to 5%

2. Inferred skill probabilities from above are used to predict the probability the student will answer each test question correctly

•Probabilities are summed to generate total test score.

•Probability of Guess is set to 25% (MCAS questions are multiple choice)

•Probability of getting the item wrong even if the student knows it is set to 5%

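A minimal sketch of the score-prediction step described above: it converts per-skill knowledge probabilities into an expected test score using the guess and slip values quoted on the slide. The skill probabilities and the item-to-skill tags are invented; in the real system the skill probabilities are inferred from the student's responses with the Bayesian belief network rather than supplied by hand.

```python
# Invented probabilities that the student knows each skill; the real system
# infers these from item responses with a Bayesian belief network (step 1).
p_know = {"Geometry": 0.80, "Measurement": 0.55, "Algebra": 0.70}

# Invented tagging of test items to skills (one skill per item here).
test_items = ["Geometry", "Geometry", "Measurement", "Algebra", "Algebra"]

GUESS = 0.25  # MCAS questions are multiple choice, per the slide
SLIP = 0.05   # chance of answering wrong despite knowing the skill

def p_correct(skill):
    """P(correct) = P(know) * (1 - slip) + (1 - P(know)) * guess."""
    k = p_know[skill]
    return k * (1 - SLIP) + (1 - k) * GUESS

# Step 2: per-item correctness probabilities are summed into an expected score.
expected_score = sum(p_correct(s) for s in test_items)
print(f"expected test score: {expected_score:.2f} out of {len(test_items)}")
```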
Page 20

Growth of 5 Skills over Time for One Student

[Line chart: percent correct (0-80) from Sept through March for five skills: Geometry, Algebra, Measurement, Data Analysis, and Number Sense.]

Tracking skill learning longitudinally