Some Perspectives on CAT for K-12 Assessments

Pearson Copyright 2010. Some Perspectives on CAT for K-12 Assessments. Denny Way, Ph.D. Presented at the 2010 National Conference on Student Assessment, June 20, 2010.

Transcript of Some Perspectives on CAT for K-12 Assessments

Page 1: Some Perspectives on CAT for K-12 Assessments

Pearson Copyright 2010

Some Perspectives on CAT for K-12 Assessments

Denny Way, Ph.D.

Presented at the 2010 National Conference on Student Assessment

June 20, 2010

Page 2: Some Perspectives on CAT for K-12 Assessments

Some CAT Questions

• You want to implement CAT, but which IRT model should you use?

• You want to implement CAT, but how do you put together a CAT pool and how can you best implement CAT?

• You want to implement CAT, but does it have to be limited to on-grade items only?

Page 3: Some Perspectives on CAT for K-12 Assessments

Which Model to Use for CAT?

• The Rasch and three-parameter logistic (3PL) models are the most popular for IRT applications with multiple-choice items

• In applications to conventional fixed-form tests, the differences between the two models are not that great, i.e., when you do parallel forms equating, you get about the same answer based on either model

Page 4: Some Perspectives on CAT for K-12 Assessments

Which Model to Use for CAT?

• With CAT, there are much greater differences between the Rasch and 3PL models. For example:

– Rasch CAT only supports a reduction in test length of about 20% compared to a conventional test

– 3PL CAT supports a reduction in test length of about 40-50% compared to a conventional test

• Why?

– With the Rasch model, the information functions for all items have the same shape, and the information for an “optimally administered” item is not that much greater than a typical item
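One way to read the Rasch claim, offered as a rough sketch rather than the presenter's own calculation: Rasch item information is P(θ)•[1 – P(θ)], which can never exceed 0.25. If an optimal CAT delivered 0.25 per item, the test length needed to match a conventional form would shrink by roughly 1 minus (average information on the form / 0.25). The 50-item form below, with evenly spread difficulties, is hypothetical, as are all function names.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def rasch_info(theta, b):
    """Rasch item information, P(theta) * [1 - P(theta)]; peaks at 0.25 when theta == b."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

def optimal_cat_reduction(theta, difficulties):
    """Fraction of test length saved if every CAT item gave the maximum 0.25 (an idealization)."""
    avg_info = sum(rasch_info(theta, b) for b in difficulties) / len(difficulties)
    return 1.0 - avg_info / 0.25

# Hypothetical 50-item fixed form with difficulties spread evenly from -2.0 to 2.0.
form = [-2.0 + 4.0 * i / 49 for i in range(50)]

for theta in (-2.0, 0.0, 2.0):
    print(theta, round(optimal_cat_reduction(theta, form), 2))
```

Under these assumptions the saving is smallest for students near the middle of the form and larger for students far from it, which is consistent with the point that a typical Rasch item is already fairly close to an optimal one for most students.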

Page 5: Some Perspectives on CAT for K-12 Assessments

Reduced Test Length for an Optimal Rasch CAT

[Figure: % Reduction for Optimal CAT plotted against Average of P(θ)•[1 – P(θ)] on Conventional Test; percentage axis runs from 0% to 80%]

Page 6: Some Perspectives on CAT for K-12 Assessments

Reduced Test Length for an Optimal Rasch CAT

[Figure: % Reduction for Optimal CAT plotted against Average of P(θ)•[1 – P(θ)] on Conventional Test; percentage axis runs from 0% to 80%. Annotation on the curve: “Most conventional tests are about here for most students”]

Page 7: Some Perspectives on CAT for K-12 Assessments

Which Model to Use for CAT?

• With CAT, there are much greater differences between the Rasch and 3PL models. For example:

– 3PL CAT tends to select some items in the pool very often and may never select many perfectly good items

– Rasch CAT selects items in a much more uniform manner

• Why?

– With the 3PL, some highly discriminating items provide much more information than other items and therefore are more attractive to the item selection algorithm
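To make the exposure point concrete, here is a small sketch of 3PL item information and a bare maximum-information selection rule. It uses the standard 3PL information formula on the logistic metric (no 1.7 scaling constant); the four-item pool, its parameter values, and the function names are hypothetical, and operational CAT algorithms add content constraints and exposure control that are not shown.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (a: discrimination, b: difficulty, c: guessing)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """3PL item information: a^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = p_3pl(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def pick_max_info(theta, pool, used):
    """Index of the unused item with the most information at theta."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: info_3pl(theta, *pool[i]))

# Hypothetical pool of (a, b, c) triples with similar difficulty but different discrimination.
pool = [(0.6, 0.0, 0.2), (1.0, 0.1, 0.2), (1.8, -0.1, 0.2), (0.8, 0.0, 0.2)]

first = pick_max_info(0.0, pool, set())
print(first, [round(info_3pl(0.0, *item), 3) for item in pool])
```

In this toy pool the most discriminating item wins the first pick by a wide margin even though the other items are just as well targeted, which is the mechanism behind uneven 3PL exposure. With a = 1 and c = 0 the same function reduces to the Rasch value P(θ)•[1 – P(θ)].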

Page 8: Some Perspectives on CAT for K-12 Assessments

Rasch vs. 3PL Exposure

[Figure: Cumulative Pct. of items plotted against Exposure Rate, with one curve for Rasch and one for 3PL; axis ticks run from 0 to 100]

Page 9: Some Perspectives on CAT for K-12 Assessments

Rasch vs. 3PL Exposure – An Example

[Figure: Cumulative Pct. of items plotted against Exposure Rate, with one curve for Rasch and one for 3PL; axis ticks run from 0 to 100. Annotations: “50% of 3PL items used 5% of the time or less” and “10% of 3PL items used more than 25% of the time”]

Page 10: Some Perspectives on CAT for K-12 Assessments

Which Model to Use for CAT?

• Both Rasch and 3PL models have been used successfully in CAT applications

• Psychometricians will offer different opinions about which model is best for CAT

• Either model is defensible for CAT, but the models do behave quite differently

• The IRT model is just one consideration related to CAT; other design considerations are as important or even more important

Page 11: Some Perspectives on CAT for K-12 Assessments

Preparing Item Pools for CAT Transition

• Ideally, the number of items in a CAT pool should be 10-12 times the number of items to be administered in the CAT (rule of thumb based on M. Stocking from ETS)

• The CAT pool must include a sufficient number of easy and difficult items; this is usually a big challenge

• More items are needed if students test from the same CAT pool multiple times, especially if previously seen items are not eligible to be used in repeat administrations
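As a quick illustration of the 10-12x rule of thumb, the arithmetic can be sketched as follows. The function name is hypothetical, and the 25-item test length is borrowed from the CAT example later in the deck, not from a stated pool design.

```python
def pool_size_range(cat_length, low_multiple=10, high_multiple=12):
    """Rule-of-thumb pool size (min, max) for a CAT of the given length."""
    return cat_length * low_multiple, cat_length * high_multiple

# A 25-item CAT would call for a pool of roughly 250 to 300 items.
print(pool_size_range(25))
```

This is a floor for a single administration; as the last bullet notes, repeat testing from the same pool pushes the requirement higher.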

Page 12: Some Perspectives on CAT for K-12 Assessments

Preparing Item Pools for CAT Transition

• Items for a CAT item pool must be calibrated to the same IRT scale

• Most states have pools of calibrated items with good psychometric properties that might be used for CAT

– These items have gone through extensive reviews

– These items may have been used operationally

– These items have been shown to have good psychometric characteristics

Page 13: Some Perspectives on CAT for K-12 Assessments

Preparing Item Pools for CAT Transition

• However, there are often challenges in using these items with the old statistics, such as:

– The items were calibrated on paper but CAT is online

– The items were in tests measuring old standards and CAT will be measuring new standards

– Minor edits or format changes may be needed

– Items may have come from different places

• How can we make use of these items in a new adaptive test?

Page 14: Some Perspectives on CAT for K-12 Assessments

CAT Transition Strategy: Fixed-form Transition

• Two-year transition strategy

• In year one, construct and administer a number (e.g., 6 to 10) of fixed-forms (field-test items can be embedded) using previous (paper-based) statistics for test construction

• Administer the fixed forms online

• Re-calibrate the data from the fixed forms and link them to a common scale

• Conduct standard setting on a subset of the items from the fixed forms (can be a “synthetic” form)

• Apply new cut to each fixed form for reporting
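The “link them to a common scale” step can be done in several ways, and the slides do not say which one is intended. As one simple possibility under the Rasch model, a mean shift over items shared between two forms might look like the sketch below; the item IDs, difficulty values, and function names are all hypothetical.

```python
def mean_shift(base_form, new_form):
    """Constant that moves new_form difficulties onto the base_form scale,
    estimated from the items the two forms have in common."""
    common = sorted(set(base_form) & set(new_form))
    if not common:
        raise ValueError("forms share no linking items")
    return sum(base_form[i] - new_form[i] for i in common) / len(common)

def apply_shift(difficulties, shift):
    """Return a copy of the difficulties expressed on the base scale."""
    return {item: b + shift for item, b in difficulties.items()}

# Hypothetical separate online calibrations of two fixed forms sharing items i2 and i3.
form_a = {"i1": -0.5, "i2": 0.0, "i3": 0.7}
form_b = {"i2": 0.3, "i3": 1.0, "i4": 1.5}

shift = mean_shift(form_a, form_b)
linked_b = apply_shift(form_b, shift)
print(round(shift, 2), {k: round(v, 2) for k, v in linked_b.items()})
```

A constant shift is only appropriate when the two calibrations differ in origin alone, as with Rasch; a 3PL calibration would need a slope as well as an intercept, and a concurrent calibration of all forms would avoid the step entirely.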

Page 15: Some Perspectives on CAT for K-12 Assessments

CAT Transition Strategy: Fixed-form Transition

• In year two, combine all the items from the online conventional fixed-forms (plus additional field-tested items) to create the CAT pool

• All items in the CAT pool will have item parameters on a common scale based on an online administration

• Issues include:

– Deciding how many fixed-forms to develop

– Making the fixed-forms as parallel as possible

– Building effective equating links between forms

– Determining whether the fixed-forms should count

– Making a smooth transition from fixed-forms to CAT (since the measurement properties will be different)

Page 16: Some Perspectives on CAT for K-12 Assessments

CAT Transition Strategy—Barely Adaptive Tests (BAT)

• Another strategy for transition to CAT is to use “Barely Adaptive Testing” (BAT)

• In this approach, the CAT algorithm is used to administer items from the pool based on paper-based IRT calibrations

• However, the CAT algorithm does not adapt the difficulty to student performance as strongly as it normally would

• The result is that each student takes a unique test that is “slightly” targeted to them

• Some examples help to clarify

Page 17: Some Perspectives on CAT for K-12 Assessments

This slide shows how a conventional test would be administered to three students at different levels of ability

Student A has ability at -2.0; Student B has ability at 0.0; Student C has ability at 2.0

Each student takes the same 50 items

[Figure: the administered items (X) stacked by difficulty along the Item Difficulty / Student Ability axis, -3.0 to 3.0]

Page 18: Some Perspectives on CAT for K-12 Assessments

Conventional tests are better for calibrating items but not so good for targeting measurement

Student A has ability at -2.0

[Figure: the same fixed-form items (X) stacked by difficulty along the Item Difficulty / Student Ability axis, -3.0 to 3.0]

Page 19: Some Perspectives on CAT for K-12 Assessments

This slide shows how CAT would be administered to three students at different levels of ability

Student A has ability at -2.0; Student B has ability at 0.0; Student C has ability at 2.0

Each student takes 25 items

[Figure: the items (X) administered to each student, stacked by difficulty along the Item Difficulty / Student Ability axis, -3.0 to 3.0]

Page 20: Some Perspectives on CAT for K-12 Assessments

CAT is best for targeting measurement but not so good for estimating item statistics

[Figure: students (O = students taking a middle difficulty CAT item) stacked by ability along the Item Difficulty / Student Ability axis, -3.0 to 3.0. Two regions of the scale are annotated “No responses here for calibration”]

Page 21: Some Perspectives on CAT for K-12 Assessments

This slide shows how BAT would be administered to three students at different levels of ability

Student A has ability at -2.0; Student B has ability at 0.0; Student C has ability at 2.0

Each student takes 35 items

[Figure: the items (X) administered to each student, stacked by difficulty along the Item Difficulty / Student Ability axis, -3.0 to 3.0]
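The slides do not specify how BAT weakens the adaptation. Purely as an illustration, one possible mechanism is to pull the difficulty target part of the way back from the student's ability estimate toward the middle of the pool. The `adapt_strength` parameter, the function name, and the 13-item pool below are assumptions made for this sketch, not part of the presentation.

```python
def pick_item(difficulties, theta_hat, used, adapt_strength=1.0, center=0.0):
    """Pick the unused item closest to a (possibly damped) difficulty target.

    adapt_strength=1.0 targets theta_hat directly (full CAT); values between
    0 and 1 pull the target toward `center` (a "barely adaptive" test);
    0.0 ignores the ability estimate altogether.
    """
    target = center + adapt_strength * (theta_hat - center)
    candidates = [i for i in range(len(difficulties)) if i not in used]
    return min(candidates, key=lambda i: abs(difficulties[i] - target))

# Hypothetical pool with difficulties at -3.0, -2.5, ..., 3.0.
pool = [-3.0 + 0.5 * i for i in range(13)]

cat_choice = pick_item(pool, -2.0, set(), adapt_strength=1.0)
bat_choice = pick_item(pool, -2.0, set(), adapt_strength=0.5)
print(pool[cat_choice], pool[bat_choice])
```

For a student like Student A, the damped target yields items that are easier than the pool average but harder than a full CAT would choose, so each item still collects responses from a wider ability range, which is what helps calibration.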

Page 22: Some Perspectives on CAT for K-12 Assessments

Why Does BAT Make Sense?

• BAT is a compromise during a year of transition: it provides better measurement than a conventional test and is better than CAT for calibrating items

• BAT also permits the administration in the transition year to be very similar to the full CAT administration that will occur in year two and beyond (you can even call it CAT!)

Page 23: Some Perspectives on CAT for K-12 Assessments

CAT and Off-Grade-Level Testing

• There are obvious psychometric benefits to including off-grade-level content in K-12 assessments, if supported by vertically articulated content standards

• These benefits would seem particularly apparent for struggling students, including SWDs

– Item pools can be substantially improved for measuring struggling students accurately

– All students start at the same place (no “out of level” labeling)

Page 24: Some Perspectives on CAT for K-12 Assessments

CAT and Off-Grade-Level Testing

• Some advocates for SWDs insist that CAT should consist only of on-grade-level content

• The basis for this position seems to be a concern about washback effect

• A psychometrician’s plea: The important consideration is instruction, not assessment

– The goal of the common core standards is college readiness for all students

– The instructional imperative does not change based on what items are allowed in a CAT item pool

Page 25: Some Perspectives on CAT for K-12 Assessments

CAT and Off-Grade Level Testing

• Could off-grade level content be included in accountability?

– Yes, if ESEA relaxes “on-grade level” requirements

– Perhaps, if content standards span multiple grades

• Some will say it “doesn’t matter” and that CAT works just fine with only on-grade level content

• But it does matter. If we really want to do better at measuring student status and growth and we want to take full advantage of adaptive testing for all students, we need to allow the adaptive test to extend above and below grade level

Page 26: Some Perspectives on CAT for K-12 Assessments

Questions?