University of Ostrava, Czech Republic, 26-31 March 2012.


Different forms of a test

Item banking

Achievement monitoring

Classical Test Theory:

Equating is applied only to different test forms

Equating is often ignored (the concept of parallel test forms)

Establishes equivalent scores on different test forms

Does not create a common scale

Item Response Theory:

Satisfies all equating needs

Places all estimates of item and examinee parameters on a common scale

Test equating is a special procedure that establishes a relation between examinee scores on different test forms and places them on the same scale.

As a result, a measure based on responses to one test can be matched to a measure based on responses to another test, and the conclusions drawn about an examinee are identical regardless of which test form produced the measure.

Equating of different test forms is called horizontal equating.

The purpose: comparison of student achievements at different grade levels

Test forms are designed to be of different difficulties

Measures from different tests should be placed on the same linear continuum

This type of test equating is called vertical equating.

• Item bank – a set of items from which test forms that create equivalent measures may be constructed.

• Item bank is composed of a set of test items that have been placed onto a common scale, so that different subsets of these items produce interchangeable measures for an examinee.

• When an item bank is available, no further equating is needed.

Both are designed to place estimated parameters onto a common scale

In test equating the goal is to place person measures from the multiple test forms onto the same scale

In item banking the goal is to place item calibrations on the same scale

Procedures are nearly identical when we use Rasch measurement

Equating – procedure that ensures the examinee measures obtained from different subsets of items are interchangeable. When two tests are equated, the resulting measures are placed onto the same scale.

Scaling – procedure that associates numbers with the performance of examinees. Tests can be scaled identically without having been equated.

Applies only to comparing examinee test scores on two different test forms

The problem can be ignored (introduction of “parallel” test forms)

Implies only establishing a relation between test scores on different test forms

Does not imply creation of a common scale

Linear equating

Equipercentile equating

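Linear equating is based on setting the standard score on test X equal to the standard score on test Y:

$$\frac{x - \mu_x}{\sigma_x} = \frac{y - \mu_y}{\sigma_y}$$

Thus, $y = Ax + B$, where

$$A = \frac{\sigma_y}{\sigma_x}, \qquad B = \mu_y - A\,\mu_x,$$

and $\mu_x$, $\mu_y$ and $\sigma_x$, $\sigma_y$ are the means and standard deviations of the scores on tests X and Y.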
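As a rough illustration of linear equating, the sketch below estimates the slope and intercept of y = Ax + B from two score samples by matching standard scores (A is the ratio of the standard deviations, B is the mean of Y minus A times the mean of X). The function name and the score lists are hypothetical, and the sketch assumes the two groups are equivalent, as classical equating requires.

```python
from statistics import mean, pstdev


def linear_equating(scores_x, scores_y):
    """Return (A, B) such that y = A*x + B matches standard scores on X and Y."""
    a = pstdev(scores_y) / pstdev(scores_x)   # ratio of standard deviations
    b = mean(scores_y) - a * mean(scores_x)   # aligns the two means
    return a, b


# Hypothetical raw scores from two equivalent groups (illustrative only).
form_x = [10, 12, 14, 16, 18]
form_y = [20, 24, 28, 32, 36]

A, B = linear_equating(form_x, form_y)
equated = A * 15 + B  # score on form Y treated as equivalent to 15 on form X
print(A, B, equated)
```

Population standard deviations are used here; using sample standard deviations for both forms would give the same slope when the two samples have equal size.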

In equipercentile equating, scores on tests X and Y are considered equivalent if their percentile ranks in any given group are equal.
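A minimal sketch of the equipercentile idea, assuming a simple "percent of scores at or below" definition of percentile rank and no smoothing or interpolation (operational procedures usually add both). The function names and the score lists are hypothetical.

```python
from bisect import bisect_right


def percentile_rank(scores, value):
    """Percent of scores at or below `value` (one simple definition)."""
    ordered = sorted(scores)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


def equipercentile_equate(scores_x, scores_y, x):
    """Smallest observed Y score whose percentile rank is at least that of x on X."""
    target = percentile_rank(scores_x, x)
    for y in sorted(set(scores_y)):
        if percentile_rank(scores_y, y) >= target:
            return y
    return max(scores_y)


# Hypothetical raw scores from two equivalent groups (illustrative only).
form_x = [8, 10, 12, 14, 16, 18, 20, 22, 24, 26]
form_y = [11, 13, 16, 18, 19, 22, 25, 27, 28, 30]

print(equipercentile_equate(form_x, form_y, 16))
```

With these lists, a score of 16 on form X sits at the 50th percentile, so it is matched to the form Y score that also reaches the 50th percentile.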

Both methods require assumptions about the identity of the test score distributions and the equivalence of the examinee groups

Equating in CTT doesn’t imply creation of a common scale

Measuring the same trait – tests of different content cannot be equated (but can be scaled in a similar manner).

Invariance of equating results across samples of examinees

Independence of equating results from the choice of the reference test

• Method of common items: linkage between two test forms is accomplished by means of a set of items that are common to both test forms

• Method of common persons: linkage between two test forms is accomplished by means of a set of persons who respond to both test forms

• Combined methods: linkage between two test forms is accomplished by means of common items and/or common persons plus common raters

Internal anchor: each test form has one set of items that is shared with other forms and another set of items that is unique to this form

External anchor: each test form has an additional set of items that do not belong to these test forms

All examinees respond to both test forms.

There are two approaches to this design:

- same group/ same time

- same group/ different time

Linkage between two test forms is accomplished by means of a set of examinees who respond to all items.

Selecting an equating method

Parameter estimation

Transformation of parameters from different test forms to the same scale

Evaluating the quality of the links between test forms

Simultaneous calibration: all parameters are estimated simultaneously in one run of the estimation software. Data are automatically scaled to the same scale.

Separate calibration: parameters are estimated for each test form separately. That is, the data are calibrated in multiple runs of the estimation software.

Separate calibration may be more difficult to accomplish because the test developer needs to transform measures to a common scale.

Separate calibration of all test forms, followed by transformation of the measures to the common scale

Simultaneous calibration of all test forms, which places all measures on the common scale

Separate calibration of all test forms with the difficulty values of the common items anchored, which places all parameters on the common scale

As a rule, this procedure is used with the method of common items, which are called nodal items in this case.

Each test form is calibrated separately. As a result, all estimates for each test form lie on that form's own scale. The scales differ only in their origins.

This difference can be removed by calculating a location shift.

It is desirable to have at least 15-20% nodal items (some of them can be deleted from the link later).

Choice of a common scale

Selection of nodal items

Calibration of all test forms

Calculating equating constants

Link quality evaluation

Transforming all parameters onto a common scale

$$t_{12} = \frac{1}{l}\sum_{i=1}^{l}\left(\delta_{i2} - \delta_{i1}\right)$$

where t12 is the shift constant from test form 1 to test form 2; δi1 is the difficulty estimate of item i in test form 1; δi2 is the difficulty estimate of item i in test form 2; l is the number of common items.

Sometimes other formulas are applied: weighted mean, dispersion shift, etc.

δi1' = δi1 + t12,

where δi1 is the difficulty estimate for item i in test form 1; δi1' is the difficulty estimate for the same item on the scale of test form 2; i = 1, …, k; k is the total number of test items;

θn1' = θn1 + t12,

where θn1 is the ability estimate for examinee n who responded to the items of test form 1; θn1' is the ability estimate for the same examinee on the scale of test form 2; n = 1, …, N; N is the total number of examinees who responded to the items of test form 1.

Parameter estimates of test form 1 shifted in this way are placed on the scale of test form 2.
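The location shift can be sketched in a few lines. The example assumes the simple (unweighted) mean difference of the common-item difficulties as the shift constant, taken as form 2 minus form 1 so that adding it moves form 1 estimates onto the form 2 scale. All item numbers, person labels and logit values are hypothetical.

```python
def shift_constant(common_form1, common_form2):
    """t12: mean over common items of (difficulty in form 2 - difficulty in form 1)."""
    items = sorted(set(common_form1) & set(common_form2))
    return sum(common_form2[i] - common_form1[i] for i in items) / len(items)


# Hypothetical nodal-item difficulties (logits) from two separate calibrations.
form1_nodal = {3: -0.50, 7: 0.20, 12: 1.10}
form2_nodal = {3: -0.10, 7: 0.60, 12: 1.50}

t12 = shift_constant(form1_nodal, form2_nodal)

# Every form 1 estimate (items and persons) is moved by the same constant.
form1_items = {1: -1.20, 3: -0.50, 7: 0.20, 12: 1.10}
form1_persons = {"n1": -0.80, "n2": 0.40}
items_on_form2_scale = {i: d + t12 for i, d in form1_items.items()}
persons_on_form2_scale = {n: theta + t12 for n, theta in form1_persons.items()}

print(round(t12, 3), items_on_form2_scale, persons_on_form2_scale)
```

Because only the origin differs between separately calibrated Rasch scales, a single additive constant is enough here; differences between items or between persons within a form are left unchanged.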

Item-within-link (fit analysis of linking items);

Item-between-link (stability of the item calibrations between two test forms)

$$U_i = \frac{\delta_{i1}' - \delta_{i2}}{\sigma_{i12}}, \qquad \sigma_{i12}^2 = \sigma_{i1}^2 + \sigma_{i2}^2,$$

where σi1, σi2 are the standard errors of measurement for item i in the calibrations of test forms 1 and 2; δi1' is the difficulty estimate for item i from test form 1 shifted onto the scale of test form 2; δi2 is the difficulty estimate for the same item in test form 2; Ui ~ N(0,1)

All parameters of all test forms are estimated simultaneously

This is the simplest approach to equating test forms or calibrating an item bank because it requires no subsequent transformation of the estimated measures or calibrations. Data are automatically placed on the same scale in one run of the estimation software.

As a rule, this procedure is used with the method of common items, which are called anchor items in this case.

Common items are estimated once, during calibration of the first test form.

During calibration of another test form, the calibration values for these items are treated as fixed or known and are not estimated. As a result, the remaining parameter estimates are forced onto the same scale as the anchor items.

It is easy to anchor items in most estimation software

IAFILE=*
2 -0.29
4 -1.06
8 -0.49
11 -0.04
17 -0.28
37 -2.20
38 -1.34
*

The entry numbers of the anchor items and their difficulties are specified. These difficulty values are fixed and are not estimated during calibration of the new test form.
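When there are many anchor items, the anchor block can be generated rather than typed by hand. The sketch below writes the layout shown above (an `IAFILE=*` line, one "item number, difficulty" pair per line, and a closing `*`); the helper name `build_iafile` is hypothetical, and the anchor values are the ones listed above.

```python
def build_iafile(anchors):
    """Format anchor item difficulties as an IAFILE=* ... * block."""
    lines = ["IAFILE=*"]
    for item, difficulty in sorted(anchors.items()):
        lines.append(f"{item} {difficulty:.2f}")  # item entry number, fixed difficulty in logits
    lines.append("*")
    return "\n".join(lines)


# Anchor items and difficulties from the first calibration.
anchors = {2: -0.29, 4: -1.06, 8: -0.49, 11: -0.04, 17: -0.28, 37: -2.20, 38: -1.34}
text = build_iafile(anchors)
print(text)
```

The resulting text would be pasted into the control file of the form being calibrated; the rest of that control file is outside the scope of this sketch.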

Choice of a common scale

Selection of anchor items

Calibration of the test form whose scale is accepted as the common scale

Sequential calibration of the other test forms with the difficulty values of the anchor items fixed

Item-within-link fit (fit analysis of linking items)

If different equating procedures are used, the resulting scales will differ and cannot be compared directly. This is because the procedures select the origin of the scale in different ways.

Some publications (for example, Smith R.M. «Applications of Rasch Measurement». Chicago: Mesa Press, 1992) analyze all three procedures. The precision of the estimated examinee and item parameters is approximately the same, and the correlation between measures is high.

Each test form has 26 dichotomous items

Both test forms have 6 common items: No. 4, 6, 7, 14, 20, 24 (23% of the total number of items)

The total number of examinees is 654 for test form 1 and 661 for test form 2

Winsteps software was used for test calibration

Means of examinee measures are -1.07 and -0.72 logits for test forms 1 and 2, respectively

The first test form scale was chosen as the common scale

Item     Test form 1    Test form 1    Test form 2    Test form 2    Shifted        ui
number   difficulty     standard       difficulty     standard       difficulty
         estimate δi    error σi       estimate δi    error σi       estimate δi'

4        -1.39          0.09           -1.07          0.09           -1.368          0.17
6        -0.93          0.1            -0.54          0.09           -0.838          0.69
7        -2.57          0.1            -1.99          0.1            -2.288          2.0
14       -0.44          0.1            -0.32          0.09           -0.618         -1.33
20        0.88          0.12            0.96          0.11            0.662         -1.34

Sum      -4.45                         -2.96                         -4.45
Mean     -0.89                         -0.592                        -0.89

Shift constant from test form 2 to the scale of test form 1: t21 = -0.89 - (-0.592) = -0.298.
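The arithmetic of this example can be retraced with the sketch below. It uses the five common-item estimates listed in the table and assumes the shift is the mean form 1 difficulty minus the mean form 2 difficulty, added to the form 2 estimates. The between-link statistic is computed as the shifted form 2 difficulty minus the form 1 difficulty, divided by the combined standard error; that order of subtraction is an assumption. Because the inputs are rounded, the u values may differ from tabulated ones in the last digit.

```python
from math import sqrt

# Common-item estimates from the example: item -> (difficulty, standard error), in logits.
form1 = {4: (-1.39, 0.09), 6: (-0.93, 0.10), 7: (-2.57, 0.10), 14: (-0.44, 0.10), 20: (0.88, 0.12)}
form2 = {4: (-1.07, 0.09), 6: (-0.54, 0.09), 7: (-1.99, 0.10), 14: (-0.32, 0.09), 20: (0.96, 0.11)}

mean1 = sum(d for d, _ in form1.values()) / len(form1)
mean2 = sum(d for d, _ in form2.values()) / len(form2)
shift = mean1 - mean2  # moves form 2 estimates onto the form 1 scale

results = {}
for item in form1:
    d1, se1 = form1[item]
    d2, se2 = form2[item]
    shifted = d2 + shift                          # form 2 difficulty on the form 1 scale
    u = (shifted - d1) / sqrt(se1**2 + se2**2)    # standardized difference between calibrations
    results[item] = (shifted, u)
    print(f"item {item:2d}: shifted = {shifted:.3f}, u = {u:.2f}")

print(f"shift = {shift:.3f}")
```

Items with a large absolute u are candidates for removal from the link, after which the shift constant would be recomputed from the remaining common items.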

Simultaneous calibration implies creating a common response matrix for both test forms, containing 1315 examinees and 46 different items.

Measures of all examinees and difficulty values of all items will be placed on a common scale that is centered at the mean difficulty of all 46 items.
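Building the common response matrix amounts to stacking the examinees of both forms and marking items a person did not take as missing. In the example above this gives 654 + 661 = 1315 rows and 26 + 26 - 6 = 46 columns. The sketch below shows the idea on a tiny hypothetical data set; the function name, the item labels and the use of `None` as the "not administered" code are assumptions.

```python
def stack_forms(form1_responses, form1_items, form2_responses, form2_items):
    """Combine two forms into one persons x items matrix; None marks items not administered."""
    all_items = sorted(set(form1_items) | set(form2_items))
    column = {item: j for j, item in enumerate(all_items)}
    matrix = []
    for responses, items in ((form1_responses, form1_items), (form2_responses, form2_items)):
        for row in responses:
            combined = [None] * len(all_items)
            for item, score in zip(items, row):
                combined[column[item]] = score
            matrix.append(combined)
    return all_items, matrix


# Hypothetical forms: C1 is the common item, A* and B* are unique to each form.
form1_items = ["A1", "A2", "C1"]
form2_items = ["C1", "B1", "B2"]
form1_responses = [[1, 0, 1], [0, 0, 1]]   # two examinees who took form 1
form2_responses = [[1, 1, 0], [0, 1, 1]]   # two examinees who took form 2

items, matrix = stack_forms(form1_responses, form1_items, form2_responses, form2_items)
print(items)
for row in matrix:
    print(row)
```

The common items are the only columns filled for both groups of examinees, which is what links the two forms in a single calibration run.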

Calibration of test form 1

Calibration of test form 2 with the difficulty values of the anchor items fixed at their values from the first calibration:

IAFILE=*
4 -1.39
6 -0.93
7 -2.57
14 -0.44
20 0.88
*

As a result, examinee measures from both test forms will be on the scale of the first test form.

Comparison of examinee measures from the three equating procedures revealed approximately similar results: the correlation is close to 1.

The choice of equating procedure is determined by the design of the real data and the purpose of the research.
