Some Perspectives on CAT for K-12 Assessments

Pearson Copyright 2010

Denny Way, Ph.D.

Presented at the 2010 National Conference on Student Assessment

June 20, 2010

Some CAT Questions

• You want to implement CAT, but which IRT model should you use?

• You want to implement CAT, but how do you put together a CAT pool, and how can you best implement it?

• You want to implement CAT, but does it have to be limited to on-grade items only?


Which Model to Use for CAT?

• The Rasch and three-parameter logistic (3PL) models are the most popular for IRT applications with multiple-choice items

• In applications to conventional fixed-form tests, the differences between the two models are not that great, i.e., when you do parallel forms equating, you get about the same answer based on either model



• With CAT, there are much greater differences between the Rasch and 3PL models. For example:

– Rasch CAT only supports a reduction in test length of about 20% compared to a conventional test

– 3PL CAT supports a reduction in test length of about 40-50% compared to a conventional test

• Why?

– With the Rasch model, the information functions for all items have the same shape and the information for an “optimally administered” item is not that much greater than a typical item
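
The point about Rasch information can be made concrete with a small sketch. It assumes the standard Rasch information function P(θ)•[1 – P(θ)], and it assumes that an optimal CAT item is perfectly targeted (information 0.25) while the conventional test delivers some average information per item. The function names and the 0.20 input are illustrative assumptions, not values taken from the slides.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def rasch_info(theta, b):
    """Rasch item information, P(theta) * [1 - P(theta)]."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

def length_reduction(avg_pq):
    """Share of items an optimal Rasch CAT could drop at equal information.

    Assumes every CAT item is perfectly targeted (information 0.25) and that
    the conventional test averages `avg_pq` information per item.
    """
    return 1.0 - avg_pq / 0.25

# Information peaks at 0.25 when difficulty equals ability and falls off slowly.
peak = rasch_info(0.0, 0.0)          # 0.25
off_target = rasch_info(0.0, 1.0)    # about 0.197, one logit off target
reduction = length_reduction(0.20)   # about 0.20, i.e. a 20% shorter test
```

Under these assumptions, an item one full logit off target still gives roughly 79% of the maximum information, which is why the gain from optimal targeting is modest for the Rasch model.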


Reduced Test Length for an Optimal Rasch CAT

[Figure: % Reduction for Optimal CAT (y-axis, 0% to 80%) plotted against Average of P(θ)•[1 – P(θ)] on Conventional Test (x-axis). Annotation: Most conventional tests are about here for most students]



• With CAT, there are much greater differences between the Rasch and 3PL models. For example:

– 3PL CAT tends to select some items in the pool very often and may never select many perfectly good items

– Rasch CAT selects items in a much more uniform manner

• Why?

– With the 3PL, some highly discriminating items provide much more information than other items and therefore are more attractive to the item selection algorithm
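
To see how much discrimination matters, here is a minimal sketch of the usual 3PL item information function. The scaling constant D = 1.7 and the two example items (a = 0.6 and a = 1.8, both with b = 0.0 and c = 0.2) are assumptions chosen for illustration, not items from any real pool.

```python
import math

D = 1.7  # scaling constant; an assumption, some programs use D = 1.0

def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info3pl(theta, a, b, c):
    """3PL item information at ability theta."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

# Two hypothetical items with the same difficulty and guessing parameter.
low = info3pl(0.0, a=0.6, b=0.0, c=0.2)
high = info3pl(0.0, a=1.8, b=0.0, c=0.2)
ratio = high / low  # 9: tripling a multiplies information at theta = b by nine
```

Because information grows with the square of the discrimination parameter, a maximum-information selection rule will keep returning to the few most discriminating items unless something constrains it.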


Rasch vs. 3PL Exposure

[Figure: Cumulative Pct. of items (y-axis) plotted against Exposure Rate (x-axis), with one curve for Rasch and one for 3PL; axis ticks run from 0 to 100]

Rasch vs. 3PL Exposure – An Example

[Figure: Cumulative Pct. of items (y-axis) plotted against Exposure Rate (x-axis), with one curve for Rasch and one for 3PL; axis ticks run from 0 to 100. Annotations: 50% of 3PL items used 5% of the time or less; 10% of 3PL items used more than 25% of the time]
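
Exposure rates like the ones in this example can be tallied with a small simulation. The sketch below is a deliberate simplification and not the study behind the slide: it selects items by maximum information at each simulated student's true ability, uses no exposure control or content constraints, sets D = 1, and draws a hypothetical 200-item pool at random. All pool sizes and parameter distributions are assumptions.

```python
import math
import random

def info(theta, a, b, c):
    """3PL item information with D = 1; a = 1 and c = 0 gives the Rasch case."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def exposure_rates(pool, thetas, test_length):
    """Fraction of simulated students who see each item."""
    counts = [0] * len(pool)
    for theta in thetas:
        # Rank the whole pool by information at this ability, best first.
        ranked = sorted(range(len(pool)),
                        key=lambda i: info(theta, *pool[i]), reverse=True)
        for i in ranked[:test_length]:
            counts[i] += 1
    return [n / len(thetas) for n in counts]

def share_rarely_used(rates, cutoff=0.05):
    """Share of pool items administered to at most `cutoff` of students."""
    return sum(r <= cutoff for r in rates) / len(rates)

rng = random.Random(2010)
difficulties = [rng.gauss(0.0, 1.0) for _ in range(200)]
rasch_pool = [(1.0, b, 0.0) for b in difficulties]
threepl_pool = [(rng.lognormvariate(0.0, 0.4), b, 0.2) for b in difficulties]
thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]

rasch_rates = exposure_rates(rasch_pool, thetas, test_length=20)
threepl_rates = exposure_rates(threepl_pool, thetas, test_length=20)
```

Comparing `share_rarely_used(rasch_rates)` with `share_rarely_used(threepl_rates)`, or sorting each list to draw a cumulative curve, gives the kind of comparison shown in the figure. The actual percentages depend entirely on the assumed pool.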



• Both Rasch and 3PL models have been used successfully in CAT applications

• Psychometricians will offer different opinions about which model is best for CAT

• Either model is defensible for CAT, but the models do behave quite differently

• The IRT model is just one consideration related to CAT; other considerations related to design are as important or even more important


Preparing Item Pools for CAT Transition

• Ideally, the number of items in a CAT pool should be 10-12 times the number of items to be administered in the CAT (rule of thumb based on M. Stocking from ETS)

• The CAT pool must include a sufficient number of easy and difficult items; this is usually a big challenge

• More items are needed if students test from the same CAT pool multiple times, especially if previously seen items are not eligible to be used in repeat administrations
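
As a quick piece of arithmetic, the 10-12 times rule of thumb translates into pool sizes as follows. The function name is hypothetical, and the 25-item test length is only an example.

```python
def pool_size_range(test_length, low=10, high=12):
    """Pool size implied by the 10-12x rule of thumb for a given CAT length."""
    return test_length * low, test_length * high

# A 25-item CAT would call for a pool of roughly 250 to 300 items,
# before allowing for repeat testers or gaps at the extremes of difficulty.
smallest, largest = pool_size_range(25)
```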



• Items for a CAT item pool must be calibrated to the same IRT scale

• Most states have pools of calibrated items with good psychometric properties that might be used for CAT

– These items have gone through extensive reviews

– These items may have been used operationally

– These items have been shown to have good psychometric characteristics



• However, there are often challenges in using these items with the old statistics, such as:

– The items were calibrated on paper but CAT is online

– The items were in tests measuring old standards and CAT will be measuring new standards

– Minor edits or format changes may be needed

– Items may have come from different places

• How can we make use of these items in a new adaptive test?


CAT Transition Strategy: Fixed-form Transition

• Two-year transition strategy

• In year one, construct and administer a number (e.g., 6 to 10) of fixed forms (field-test items can be embedded) using previous (paper-based) statistics for test construction

• Administer the fixed forms online

• Re-calibrate the data from the fixed forms and link them to a common scale

• Conduct standard setting on a subset of the items from the fixed forms (can be a “synthetic” form)

• Apply the new cut to each fixed form for reporting


CAT Transition Strategy: Fixed-form Transition

• In year two, combine all the items from the online conventional fixed-forms (plus additional field-tested items) to create the CAT pool

• All items in the CAT pool will have item parameters on a common scale based on an online administration

• Issues include:

– Deciding how many fixed-forms to develop

– Making the fixed-forms as parallel as possible

– Building effective equating links between forms

– Determining whether the fixed-forms should count

– Making a smooth transition from fixed-forms to CAT (since the measurement properties will be different)
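
The slides do not say which linking method would be used to place the re-calibrated forms on a common scale. As one commonly used possibility, the sketch below applies mean/sigma linking to the difficulties of items shared between a form and the base scale. The function names and the four difficulty values are hypothetical.

```python
from statistics import mean, pstdev

def mean_sigma_link(b_form, b_base):
    """Return slope A and intercept B that place form difficulties on the base scale.

    Both lists hold difficulty estimates for the same linking items,
    in the same order.
    """
    A = pstdev(b_base) / pstdev(b_form)
    B = mean(b_base) - A * mean(b_form)
    return A, B

def to_base_scale(b, A, B):
    """Transform one difficulty estimate onto the base scale."""
    return A * b + B

# Hypothetical difficulties for four linking items, estimated twice.
b_on_new_form = [-1.0, 0.0, 0.5, 1.5]
b_on_base = [-0.8, 0.2, 0.7, 1.7]
A, B = mean_sigma_link(b_on_new_form, b_on_base)  # about A = 1.0, B = 0.2
```

Every other item on the new form would then be moved to the common scale with `to_base_scale`. The quality of the result depends on how many linking items the forms share and how stable they are, which is the "effective equating links" issue listed above.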


CAT Transition Strategy—Barely Adaptive Tests (BAT)

• Another strategy for transition to CAT is to use “Barely Adaptive Testing” (BAT)

• In this approach, the CAT algorithm is used to administer items from the pool based on paper-based IRT calibrations

• However, the CAT algorithm does not adapt the difficulty to student performance as strongly as it normally would

• The result is that each student takes a unique test that is “slightly” targeted to them

• Some examples help to clarify


This slide shows how a conventional test would be administered to three students at different levels of ability

Student A has ability at -2.0; Student B has ability at 0.0; Student C has ability at 2.0

Each student takes the same 50 items

[Figure: dot plot of item difficulties (X = item) on the Item Difficulty / Student Ability scale, -3.0 to 3.0]


Conventional tests are better for calibrating items but not so good for targeting measurement

Student A has ability at -2.0

[Figure: dot plot of item difficulties (X = item), scale from -3.0 to 3.0]
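
The targeting problem can be expressed as a standard error of measurement. This is a rough sketch under the Rasch model, where the standard error is one over the square root of the summed item information. The fixed form here is a hypothetical set of 50 difficulties spread evenly from -3.0 to 3.0, which is not the distribution shown on the slide.

```python
import math

def rasch_info(theta, b):
    """Rasch item information, P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def standard_error(theta, difficulties):
    """Standard error of the ability estimate at theta for a set of items."""
    return 1.0 / math.sqrt(sum(rasch_info(theta, b) for b in difficulties))

# Hypothetical fixed form: 50 difficulties spread evenly from -3.0 to 3.0.
fixed_form = [-3.0 + 6.0 * k / 49 for k in range(50)]

se_middle = standard_error(0.0, fixed_form)    # student in the middle
se_low = standard_error(-2.0, fixed_form)      # Student A, measured less precisely
se_cat = standard_error(-2.0, [-2.0] * 25)     # 25 perfectly targeted items: 0.4
```

On the fixed form, the student at -2.0 gets a larger standard error than the student at 0.0 because fewer items sit near that ability. The last line shows the precision that 25 perfectly targeted items would give the same student, an idealized upper bound for a CAT.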



This slide shows how CAT would be administered to three students at different levels of ability

Each student takes 25 items

[Figure: dot plot of item difficulties (X = item) on the Item Difficulty / Student Ability scale, -3.0 to 3.0]


CAT is best for targeting measurement but not so good for estimating item statistics

[Figure: dot plot of student abilities (O = students taking a middle difficulty CAT item), scale from -3.0 to 3.0. Annotation, shown twice: No responses here for calibration]


This slide shows how BAT would be administered to three students at different levels of ability

Each student takes 35 items

[Figure: dot plot of item difficulties (X = item), scale from -3.0 to 3.0]



Why Does BAT Make Sense?

• BAT is a compromise during a year of transition: it measures better than a conventional test and is better than CAT for calibrating items

• BAT also permits the administration in the transition year to be very similar to the full CAT administration that will occur in year two and beyond (you can even call it CAT!)
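
The slides do not describe how the CAT algorithm is made to adapt less strongly. One hypothetical way to picture it is to damp the difficulty target, so that the next item is aimed only part of the way from the center of the pool toward the student's provisional ability. The function, the `adapt_weight` parameter, and the evenly spaced pool below are all assumptions made for illustration.

```python
def next_item(pool_b, used, theta_hat, adapt_weight):
    """Pick the unused item whose difficulty is closest to a damped target.

    adapt_weight = 1.0 targets the provisional ability (ordinary CAT);
    adapt_weight = 0.0 ignores it and targets 0.0, the assumed pool center.
    """
    target = adapt_weight * theta_hat
    candidates = [i for i in range(len(pool_b)) if i not in used]
    return min(candidates, key=lambda i: abs(pool_b[i] - target))

# Hypothetical pool: 25 difficulties from -3.0 to 3.0 in steps of 0.25.
pool_b = [-3.0 + 0.25 * k for k in range(25)]

# For a student with provisional ability -2.0:
cat_pick = next_item(pool_b, used=set(), theta_hat=-2.0, adapt_weight=1.0)
bat_pick = next_item(pool_b, used=set(), theta_hat=-2.0, adapt_weight=0.5)
# pool_b[cat_pick] is -2.0, while pool_b[bat_pick] is -1.0
```

With a weight between 0 and 1, low-ability students still see somewhat easier items than high-ability students, but each item is also answered by students across a wider ability range, which is what calibration needs.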


CAT and Off-Grade-Level Testing

• There are obvious psychometric benefits to including off-grade-level content in K-12 assessments, if supported by vertically articulated content standards

• These benefits would seem particularly apparent for struggling students, including students with disabilities (SWDs)

– Item pools can be substantially improved for measuring struggling students accurately

– All students start at the same place (no “out of level” labeling)


CAT and Off-Grade-Level Testing

• Some advocates of SWDs insist that CAT should consist only of on-grade-level content

• The basis for this position seems to be a concern about washback effect

• A psychometrician’s plea: The important consideration is instruction, not assessment

– The goal of the common core standards is college readiness for all students

– The instructional imperative does not change based on what items are allowed in a CAT item pool


CAT and Off-Grade-Level Testing

• Could off-grade-level content be included in accountability?

– Yes, if ESEA relaxes “on-grade-level” requirements

– Perhaps, if content standards span multiple grades

• Some will say it “doesn’t matter” and that CAT works just fine with only on-grade level content

• But it does matter. If we really want to do better at measuring student status and growth, and we want to take full advantage of adaptive testing for all students, we need to allow the adaptive test to extend above and below grade level


Questions?
