IRT - Item Response Theory


Transcript of IRT - Item Response Theory

Page 1: IRT - Item Response Theory

The Basics of IRT
by AK Dhamija ©
([email protected])
www.geocities.com/a_k_dhamija
September 9, 2009

Page 2: IRT - Item Response Theory

•What is IRT?

•The Item Characteristic Curve

•Item Characteristic Curve Models

•Estimating Item Parameters

•The Test Characteristic Curve

•Estimating an Examinee's Ability

•The Information Function

•Test Calibration

•Characteristics of a Test

•Computer Adaptive Test

Coverage

Page 3: IRT - Item Response Theory

• Classical Theory

•Dependence of Item Statistics on sample of Respondents

•Dependence of Respondents’ scores on choice of items

•Assumes equal errors of measurement at all levels of ability

•No Modeling of data at item level

•Items and Respondents at different scale

•Difficult to compare scores across two different tests because they are not on the same scale

What is IRT?

Page 4: IRT - Item Response Theory

• IRT

•Links observable respondent performance to unobservable traits

•Theory is general

•One or more abilities or traits

•Various assumptions / models

•Binary / polytomous data

•Fit can be assessed at the level of a specific model

DIF/DTF LOGISTIC / MULTIDIMENSIONAL MODEL

What is IRT?

Page 5: IRT - Item Response Theory

• Specific IRT model assumptions

•Dominant First factor (Multidimensional models exist too)

•No dependency between items

•Mathematical form of the ICC linking performance on items and the trait measured by the instrument

What is IRT?

Page 6: IRT - Item Response Theory

• IRT benefits

•Item parameter estimation is independent of respondent samples

•Trait estimation is independent of the particular choice of items (an invaluable property for CAT)

•Error of measurement for each respondent

•Item level modeling allows for “optimal assessments”

•Items and respondents calibrated on same scale

What is IRT?

Page 7: IRT - Item Response Theory

• IRT Limitations

•Practitioners lack expertise

•IRT software is not straightforward to use

•Large samples are more helpful for estimation

•Doesn’t address construct definition / domain convergence

What is IRT?

Page 8: IRT - Item Response Theory

• Many choices of IRT models

•1PL, 2PL, 3PL and Normal Ogive models (0-1 data)

•Partial Credit, Generalized Partial Credit, Graded Response, and Nominal Response models (polytomous data)

•Multidimensional logistic models (0-1 data)

• Estimation

•Marginal MLE, Bayesian, etc.

• Software

•Bilog-MG, Parscale, Multilog, ConQuest, Winsteps

What is IRT?

Page 9: IRT - Item Response Theory

• Unobservable, or latent, trait

• Generic term ‘ability’ in IRT

• Interval scale of ability

•Theoretical values: [-∞, +∞]

•Practical values: [-3, +3]

• Ideally free response items

• Difficult to score reliably

The Item Characteristic Curve

Page 10: IRT - Item Response Theory

• Data Preparation

• Raw data recoded

• Dichotomously scored items (for polytomous data: Samejima's Graded Response model)

• Investigating dimensionality (a code sketch follows below)

•Examine the eigenvalues following Principal Axis Factoring (PAF), looking for a dominant first factor

•If the data are dichotomous, factor-analyze tetrachoric correlations

•Assume a continuum underlies the item responses
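This dimensionality check can be sketched in Python. Everything below is my own illustration, not code from the presentation: the tetrachoric routine assumes a bivariate-normal continuum and solves for the correlation that reproduces each observed joint pass rate, and plain eigenvalues of the resulting matrix stand in for a full Principal Axis Factoring. The response matrix is hypothetical, included only so the snippet runs.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def tetrachoric(x, y):
    """Tetrachoric correlation of two 0/1 item-score vectors, assuming a
    bivariate-normal continuum underlies the dichotomous responses."""
    tau_x = norm.ppf(1.0 - x.mean())   # threshold above which x = 1
    tau_y = norm.ppf(1.0 - y.mean())
    p11 = float(np.mean((x == 1) & (y == 1)))   # observed joint "correct" rate

    def implied_p11(rho):
        # P(Z1 > tau_x, Z2 > tau_y) for a standard bivariate normal with corr rho
        joint = multivariate_normal(mean=[0.0, 0.0],
                                    cov=[[1.0, rho], [rho, 1.0]])
        return 1.0 - norm.cdf(tau_x) - norm.cdf(tau_y) + joint.cdf([tau_x, tau_y])

    # pick the correlation whose implied joint rate matches the observed one
    return brentq(lambda r: implied_p11(r) - p11, -0.999, 0.999)

# hypothetical 0/1 response matrix (500 examinees x 10 items)
data = np.random.default_rng(0).integers(0, 2, size=(500, 10))
n = data.shape[1]
R = np.array([[1.0 if i == j else tetrachoric(data[:, i], data[:, j])
               for j in range(n)] for i in range(n)])
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
print(eigenvalues)   # a dominant first eigenvalue supports unidimensionality
```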

The Item Characteristic Curve

Page 11: IRT - Item Response Theory

• Classical test theory

• Raw score is the sum of the item scores

• IRT

•Emphasis on individual items

• Assumption: ability score (θ)

• P(θ) is the probability of giving a correct answer

The Item Characteristic Curve

Page 12: IRT - Item Response Theory

• P(θ) is small for low-ability examinees and vice versa

• A smooth S-shaped curve

• Item characteristic curve (ICC)

•A basic building block of IRT

• Two technical properties

• Item Difficulty

• Item Discrimination

The Item Characteristic Curve

Page 13: IRT - Item Response Theory

• Item Difficulty: a location index

• Three ICCs with the same discrimination but different levels of difficulty

The Item Characteristic Curve

Page 14: IRT - Item Response Theory

• Item Discrimination: slope at θ = 0

• Differentiating between abilities below the item location and those above the item location

•Steepness of the ICC in its middle section

• The steeper the curve, the better the item can discriminate

The Item Characteristic Curve

Page 15: IRT - Item Response Theory

• Caution

•These two properties

•say nothing about the validity of the item

•simply describe the form of the ICC

• Figures only show the range -3 to +3

• All ICCs become asymptotic to a probability of zero at one tail and to 1.0 at the other tail

The Item Characteristic Curve

Page 16: IRT - Item Response Theory

• Item with perfect discrimination

•The item discriminates perfectly between those above and below an ability score of 1.5

• This item is useless for discriminating at other ability levels

The Item Characteristic Curve

Page 17: IRT - Item Response Theory

• Difficulty will have the following levels:

very easy, easy, medium, hard, very hard

• Discrimination will have the following levels:

none, low, moderate, high, perfect

The Item Characteristic Curve

Page 18: IRT - Item Response Theory

The Item Characteristic Curve

Page 19: IRT - Item Response Theory

The Item Characteristic Curve

Recap

1. When the item discrimination is less than moderate, the item characteristic curve is nearly linear and appears rather flat.

2. When discrimination is greater than moderate, the item characteristic curve is S-shaped and rather steep in its middle section.

3. When the item difficulty is less than medium, most of the item characteristic curve has a probability of correct response that is greater than .5.

4. When the item difficulty is greater than medium, most of the item characteristic curve has a probability of correct response less than .5.

Page 20: IRT - Item Response Theory

The Item Characteristic Curve

Recap

5. Regardless of the level of discrimination, item difficulty locates the item along the ability scale. Therefore item difficulty and discrimination are independent of each other.

6. When an item has no discrimination, all choices of difficulty yield the same horizontal line at a value of P(θ) = .5. This is because the value of the item difficulty for an item with no discrimination is undefined.

Page 21: IRT - Item Response Theory

Item Characteristic Curve Models

• Enough of being intuitive; let's see the rigor a theory needs

• Three mathematical models for the ICC

• The Logistic Function: 2-Parameter Model (2PL)

P(θ) = 1 / (1 + e^(-L)) = 1 / (1 + e^(-a(θ - b)))

where:
e is the constant 2.718
b is the difficulty parameter
a is the discrimination parameter
L = a(θ - b) is the logistic deviate (logit), and
θ is an ability level.
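As a minimal sketch of this curve in Python (the function name is my own, not something from the presentation's materials):

```python
import math

def p_2pl(theta, a, b):
    """2PL item characteristic curve: probability of a correct response
    at ability theta, given discrimination a and difficulty b."""
    L = a * (theta - b)            # the logistic deviate (logit)
    return 1.0 / (1.0 + math.exp(-L))
```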

Page 22: IRT - Item Response Theory

Item Characteristic Curve Models

• The difficulty parameter (b) is defined as the point on the ability scale at which the probability of correct response to the item is .5.

• Range of b: theoretically [-∞, +∞]; practically [-3, +3]

• The discrimination parameter is proportional to the slope of the ICC at θ = b.

• The actual slope at θ = b is a/4, but taking it as 'a' makes interpretation easier

• Range of a: theoretically [-∞, +∞]; practically [-2.80, +2.80]

• The Normal Ogive model has a different interpretation

Page 23: IRT - Item Response Theory

Item Characteristic Curve Models

• Computational Example

• b = 1.0, a = .5

θ    Logit   exp(-L)   1 + exp(-L)   P(θ)
-3    -2.0     7.389       8.389     .12
-2    -1.5     4.482       5.482     .18
-1    -1.0     2.718       3.718     .27
 0     -.5     1.649       2.649     .38
 1      .0     1.000       2.000     .50
 2      .5      .607       1.607     .62
 3     1.0      .368       1.368     .73
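The same rows can be generated with a few lines of Python; this is a standalone sketch (variable names are mine), not code from the presentation:

```python
import math

a, b = 0.5, 1.0
print("theta  logit  exp(-L)  1+exp(-L)  P(theta)")
for theta in range(-3, 4):
    L = a * (theta - b)
    P = 1.0 / (1.0 + math.exp(-L))
    print(f"{theta:5d}  {L:5.1f}  {math.exp(-L):7.3f}  {1.0 + math.exp(-L):9.3f}  {P:8.2f}")
```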

Page 24: IRT - Item Response Theory

Item Characteristic Curve Models

•The Rasch, or One-Parameter, Logistic Model (1PL)

•a = 1.0 for all items

P(θ) = 1 / (1 + e^(-L)) = 1 / (1 + e^(-(θ - b)))

•Birnbaum (1968) Three-Parameter Model (3PL)

•The probability of correct response includes a small component that is due to guessing:

P(θ) = c + (1 - c) / (1 + e^(-a(θ - b)))
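A single Python function can sketch all three models (the name icc is mine, not from the presentation): c = 0 gives the 2PL, and a = 1 with c = 0 gives the 1PL.

```python
import math

def icc(theta, a=1.0, b=0.0, c=0.0):
    """3PL item characteristic curve; c = 0 gives the 2PL,
    and a = 1 with c = 0 gives the Rasch (1PL) model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta = b the 3PL gives (1 + c)/2, the model's redefined difficulty point
print(icc(1.0, a=1.5, b=1.0, c=0.2))   # 0.6 = (1 + 0.2)/2
```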

Page 25: IRT - Item Response Theory

Item Characteristic Curve Models

• c is the guessing parameter: the probability of getting the item correct by guessing alone

• The value of c does not vary as a function of the ability level

• Range: theoretically [0, 1]; practically [0, .35]

• The definition of the difficulty parameter is changed in the 3PL

• c defines a floor to P(θ)

• At θ = b, P(θ) = (1 + c)/2

Page 26: IRT - Item Response Theory

Item Characteristic Curve Models

• Discrimination parameter a: the slope of the item characteristic curve at θ = b is a(1 - c)/4

• The slope can still be taken as proportional to a

Page 27: IRT - Item Response Theory

Item Characteristic Curve Models

• Negative Discrimination

• P(θ) decreases as θ increases

• Can occur in two ways

• The incorrect response to a two-choice item will have a negative discrimination if the correct response has a positive one

•Something is wrong with the item

•The two curves have the same b and |a|

Page 28: IRT - Item Response Theory

Item Characteristic Curve Models

• Interpreting Item Parameter Values

• a (divide by 1.7 to interpret in the normal ogive model):

Verbal label   Range of values
none           0
very low       .01 - .34
low            .35 - .64
moderate       .65 - 1.34
high           1.35 - 1.69
very high      > 1.70
perfect        +∞

• b: a difficult job, since easy and hard are relative terms

Page 29: IRT - Item Response Theory

Item Characteristic Curve Models

• In classical test theory, 'b' was defined relative to a group of examinees

• The same item could be easy for one group and hard for another group.

•Under IRT, 'b' is the θ where P(θ) is .5 for the 1PL and 2PL models, and (1 + c)/2 for the 3PL model.

• c is interpreted directly as a probability: c = .12 means that, at all θ, P(θ) by guessing alone is .12.

Page 30: IRT - Item Response Theory

Item Characteristic Curve Models

•The verbal labels reflect only the midpoint of the scale

•Item difficulty tells where the item functions on the ability scale

•The slope of the ICC ('a') is at its maximum at the ability level corresponding to the item difficulty

• The item does best in distinguishing between examinees in the neighborhood of its ability level

• An item whose difficulty is -1 functions among the lower-ability examinees

• A value of +1 denotes an item that functions among higher-ability examinees

• So item difficulty is a location parameter

Page 31: IRT - Item Response Theory

Item Characteristic Curve Models

RECAP

1. Under the 1PL model, the slope is always the same; only the location of the item changes.

2. Under the 2PL and 3PL models, the value of a must become quite large (>1.7) before the curve is very steep.

3. Under the 1PL and 2PL with a large positive value of b, the lower tail approaches zero. Under the 3PL it approaches c.

4. c is not apparent when b < 0 and a < 1.0.

5. Under all models, curves with a negative value of a are the mirror image of curves with a positive value of a.

Page 32: IRT - Item Response Theory

Item Characteristic Curve Models

RECAP

6. When b = -3.0, only the upper half of the item characteristic curve appears on the graph. When b = +3.0, only the lower half of the curve appears on the graph.

7. The slope of the item characteristic curve is steepest at the ability level corresponding to the item difficulty. Thus, the difficulty parameter b locates the point on the ability scale where the item functions best.

8. Under IRT, 'b' is the θ where P(θ) is .5 for the 1PL and 2PL models and (1 + c)/2 for the 3PL model. Only when c = 0 are these two definitions equivalent.

Page 33: IRT - Item Response Theory

• IRT Analysis

• M examinees in J ability groups (θj) respond to the N items

• mj examinees within group j, where j = 1, 2, 3, . . . , J

• In the jth group, rj examinees answer the item correctly, so the observed proportion of correct response is p(θj) = rj/mj

Estimating Item Parameters

Page 34: IRT - Item Response Theory

• Find the ICC that best fits the observed proportions of correct response.

•First, select a model for the curve to be fitted.

• Let's fit the 2PL (a chi-square goodness-of-fit index is used to assess fit)

• The procedure is Maximum Likelihood Estimation (initial values b = 0, a = 1)

Estimating Item Parameters

Page 35: IRT - Item Response Theory

χ² = Σ (j = 1 to J) mj [p(θj) - P(θj)]² / [P(θj) Q(θj)] = 28.88, and the criterion value was 45.91

The two-parameter model with b = -.39 and a = 1.27 was a good fit to the observed proportions of correct response. A sketch of this computation follows.
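A Python sketch of this fit statistic, assuming the grouped data are available as group abilities θj, sizes mj, and correct counts rj. The slides do not reproduce the raw data, so the arrays below are hypothetical placeholders:

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def chi_square_fit(thetas, m, r, a, b):
    """Chi-square between observed proportions p_j = r_j / m_j and the
    fitted ICC values P(theta_j), summed over the J ability groups."""
    total = 0.0
    for theta_j, m_j, r_j in zip(thetas, m, r):
        P = p_2pl(theta_j, a, b)
        total += m_j * (r_j / m_j - P) ** 2 / (P * (1.0 - P))
    return total

# hypothetical groups; a = 1.27 and b = -.39 are the slide's fitted values
thetas = [-3, -2, -1, 0, 1, 2, 3]
m = [100] * 7                        # hypothetical group sizes
r = [4, 12, 31, 62, 85, 95, 99]      # hypothetical correct counts
print(chi_square_fit(thetas, m, r, a=1.27, b=-0.39))
```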

Estimating Item Parameters

Page 36: IRT - Item Response Theory

• The Group Invariance of Item Parameters

• Two groups

• Group 1 [θ = -3 to -1] / Group 2 [θ = +1 to +3]

Estimating Item Parameters

Page 37: IRT - Item Response Theory

• The Group Invariance of Item Parameters

• b(1) = b(2) = -.39 and a(1) = a(2) = 1.27

• Combined analysis

Estimating Item Parameters

Page 38: IRT - Item Response Theory

• Group invariance is a powerful feature of IRT

•The values of the item parameters are a property of the item, not of the group that responded to the item.

•Under classical test theory, just the opposite holds. The item difficulty of classical theory is the overall proportion of correct response.

•An item with b = 0 may have a classical item difficulty index of .3 for a low-ability group and .8 for a high-ability group.

•Clearly, the value of the classical item difficulty index is not group invariant.

• Item difficulty has a consistent meaning in IRT

Estimating Item Parameters

Page 39: IRT - Item Response Theory

• Caution

The obtained numerical values will be subject to variation due to sample size, how well-structured the data are, and the goodness-of-fit of the curve to the data.

The item must be used to measure the same latent trait for both groups.

An item's parameters do not retain group invariance when taken out of context, i.e., when used to measure a different latent trait or with examinees from a population for which the test is inappropriate.

Estimating Item Parameters

Page 40: IRT - Item Response Theory

RECAP

1. The estimated item parameters usually gave a good overall fit to the observed proportions of correct response.

2. When two groups are employed, the same item characteristic curve will be fitted, regardless of the range of ability.

3. The number of examinees at each level does not affect the group-invariance property.

4. For an item of positive discrimination, the low-ability group involves the lower left tail of the ICC, and the high-ability group involves the upper right tail.

5. The item parameters were group invariant whether or not the ability ranges of the two groups overlapped.

Estimating Item Parameters

Page 41: IRT - Item Response Theory

RECAP

6. It made no difference which group was the high-ability group. Thus, group labeling is not a consideration.

7. The group-invariance principle holds for all three item characteristic curve models.

8. Item parameter estimates are subject to sampling variation.

Estimating Item Parameters

Page 42: IRT - Item Response Theory

• An examinee's raw test score is obtained by adding up the item scores; the corresponding true score is the sum of the item probabilities:

TSj = Σ (i = 1 to N) Pi(θj)

• e.g., for a 4-item test: TS = .73 + .57 + .69 + .62 = 2.61

•The procedure is the same for all three models (a code sketch follows)
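A minimal Python sketch of this sum; the (a, b) item list is hypothetical, chosen only so the snippet runs:

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def true_score(theta, items):
    """True score at ability theta: the sum of P_i(theta) over the N items."""
    return sum(p_2pl(theta, a, b) for a, b in items)

# hypothetical 4-item test, as (a, b) pairs
items = [(1.0, -1.0), (1.2, -0.5), (0.8, 0.0), (1.5, 0.5)]
print(true_score(0.0, items))   # a value between 0 and N = 4
```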

Test Characteristic Curve

Page 43: IRT - Item Response Theory

• For the 1PL and 2PL, the left tail of the TCC approaches zero and the right tail approaches N. (Under the 3PL, the left tail approaches Σ ck, the sum of the guessing parameters.)

• TS = 0 (or Σ ck under the 3PL) => θ = -∞

• TS = N => θ = +∞

Test Characteristic Curve

Page 44: IRT - Item Response Theory

The TCC transforms ability scores to true scores.

The TCC is a monotonically increasing function (though shapes may differ: S-shaped, or increasing, plateauing, then increasing again).

In all cases, it will be asymptotic to a value of N in the upper tail.

The TCC depends upon a number of factors, including the number of items, the ICC model employed, and the values of the item parameters.

Caution: the TCC (like the ICC) does not depend on the distribution of scores.

Test Characteristic Curve

Page 45: IRT - Item Response Theory

Interpretation

The ability level corresponding to the mid-true score TS = N/2 locates the test along the ability scale.

There is no explicit formula for the TCC, so there are no parameters for the curve.

Test Characteristic Curve

Page 46: IRT - Item Response Theory

RECAP

1. Relation of the true score and the ability level:

a. Given an ability level, the corresponding true score can be found via the test characteristic curve.

b. Given a true score, the corresponding ability level can be found via the test characteristic curve.

c. Both the true scores and ability are continuous variables.

2. Shape of the test characteristic curve:

a. When N = 1, the true score ranges from 0 to 1 and the TCC equals the ICC

Test Characteristic Curve

Page 47: IRT - Item Response Theory

RECAP

b. The TCC may not be similar to an ICC (due to regions of varying steepness and plateaus). It reflects a mixture of item parameter values.

c. The ability level at which the mid-true score (N/2) occurs is an indicator of where the test functions on the ability scale.

d. When the values of the item difficulties have a limited range, the steepness of the TCC depends primarily upon the average value of the item discrimination parameters.

e. When the values of the item difficulties are spread widely over the ability scale, the steepness of the TCC will be reduced.

Test Characteristic Curve

Page 48: IRT - Item Response Theory

RECAP

f. Under a three-parameter model, the lower limit of the true scores is Σ ck, the sum of the guessing parameters.

g. The shape of the test characteristic curve depends upon the number of items, the ICC model, and the mix of values of the item parameters.

3. It would be possible to construct a test characteristic curve that decreases as ability increases. To do so would require items with negative discrimination for the correct response to the items. Such a test would not be considered a good test because the higher an examinee's ability level, the lower the score expected for the examinee.

Test Characteristic Curve

Page 49: IRT - Item Response Theory

• Goal: locate a person on the ability scale (θ = ?)

• An individual's ability

• Comparisons among individuals' abilities

• The list of 1's and 0's for the N items is called the examinee's item response vector.

•Use this item response vector and the known item parameters to estimate the examinee's θ.

• Maximum likelihood procedures are used to estimate an examinee's ability.

• The procedure begins with some a priori value for the ability of the examinee and the known values of the item parameters. These are used to compute the probability of correct response to each item for that examinee.

Estimating an Examinee's Ability

Page 50: IRT - Item Response Theory

• This procedure is based upon an approach that treats each examinee separately:

θ(s+1) = θ(s) + [ Σ (i = 1 to N) ai (ui - Pi(θs)) ] / [ Σ (i = 1 to N) ai² Pi(θs) Qi(θs) ]

where:
θs is the estimated ability of the examinee within iteration s
ai is the discrimination parameter of item i, i = 1, 2, . . . , N
ui is the response made by the examinee to item i: 1 or 0
Pi(θs) is the probability of correct response to item i, under the given item characteristic curve model, at ability level θs within iteration s
Qi(θs) = 1 - Pi(θs)

Estimating an Examinee's Ability

Page 51: IRT - Item Response Theory

• A three-item test

item 1: b = -1, a = 1.0
item 2: b = 0, a = 1.2
item 3: b = 1, a = .8

• The examinee's item responses were:

item   response
1      1
2      0
3      1

• The a priori estimate of the examinee's ability is set to θs = 1.0

Estimating an Examinee's Ability

Page 52: IRT - Item Response Theory

First iteration:

item   u    P     Q     a(u - P)   a² P Q
1      1   .88   .12      .119      .105
2      0   .77   .23     -.922      .255
3      1   .50   .50      .400      .160

sums: -.403 and .520
Δθs = -.403/.520 = -.773, θ(s+1) = 1.0 - .773 = .227

Second iteration:
Δθs = .066/.674 = .097, θ(s+1) = .227 + .097 = .324

Third iteration:
Δθs = .0006/.6615 = .0009, θ(s+1) = .324 + .0009 = .3249

At this point, the process is terminated because the value of the adjustment (.0009) is very small. Thus, the examinee's estimated ability is .33.
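A standalone Python sketch of this procedure, wired to the three-item example above (function and variable names are mine; as the slides note later, the loop diverges for all-correct or all-wrong response vectors):

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(items, u, theta=1.0, tol=1e-3):
    """Maximum likelihood ability estimate using the update rule above.
    items: list of (a, b) pairs; u: list of 0/1 item responses."""
    while True:
        P = [p_2pl(theta, a, b) for a, b in items]
        num = sum(a * (ui - Pi) for (a, _), ui, Pi in zip(items, u, P))
        den = sum(a * a * Pi * (1.0 - Pi) for (a, _), Pi in zip(items, P))
        delta = num / den
        theta += delta
        if abs(delta) < tol:                      # adjustment is negligible: stop
            return theta, 1.0 / math.sqrt(den)    # estimate and its standard error

items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]     # (a, b) from the slide
theta_hat, se = estimate_ability(items, [1, 0, 1])
print(round(theta_hat, 2), round(se, 2))          # about 0.32 and 1.23, as above
```

The returned standard error anticipates the SE(θ) formula on the next slide: 1/sqrt(.6615) = 1.23.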

Estimating an Examinee's Ability

Page 53: IRT - Item Response Theory

Unfortunately, there is no way to know the examinee's actual ability parameter. The best one can do is estimate it.

SE(θ) = 1 / sqrt( Σ ai² Pi(θ) Qi(θ) )

SE(θ) = 1 / sqrt(.6615) = 1.23

The examinee's ability was not estimated very precisely, because the standard error is very large. This is primarily due to the fact that only three items were used.

Estimating an Examinee's Ability

Page 54: IRT - Item Response Theory

Two cases where the MLE procedure fails:

When an examinee answers none of the items correctly, the corresponding ability estimate is negative infinity.

When an examinee answers all the items in the test correctly, the corresponding ability estimate is positive infinity.

In both of these cases it is impossible to obtain an ability estimate for the examinee (the computer literally cannot compute a number as big as infinity).

Estimating an Examinee's Ability

Page 55: IRT - Item Response Theory

Item Invariance of an Examinee’s Ability Estimate

Different sets of items should yield the same ability estimate, within sampling variation. In each set, a different part of the ICC is involved.

This principle rests upon two conditions:

all the items measure the same underlying latent trait

the values of all the item parameters are in a common metric.

Estimating an Examinee's Ability

Page 56: IRT - Item Response Theory

Implication of Item Invariance of an Examinee’s Ability Estimate

A test located anywhere along the ability scale can be used to estimate an examinee's ability.

An examinee could take a test that is "easy" or a test that is "hard" and obtain, on average, the same estimated ability.

This is in sharp contrast to classical test theory, where such an examinee would get a high test score on the easy test and vice versa.

Under item response theory, the examinee's ability is fixed and invariant with respect to the items used to measure it.

Estimating an Examinee's Ability

Page 57: IRT - Item Response Theory

Caution:

An examinee's ability is fixed relative to a given context.

However, if the examinee received remedial instruction between testings, or if there were carryover (memorization) effects, the examinee's underlying ability level would be different for each testing.

The item invariance of an examinee's ability and the group invariance of an item's parameters are two facets of what is referred to, generically, as the invariance principle of item response theory.

Estimating an Examinee's Ability

Page 58: IRT - Item Response Theory

RECAP

1. Distribution of estimated ability.

a. The standard error of the estimates can be quite large when the items are not located near the ability of the examinee.

c. When the values of the item discrimination indices are large, the standard error of the ability estimates is small. When the item discrimination indices are small, the standard error of the ability estimates is large.

e. The optimum set of items for estimating an examinee's ability would have all its item difficulties equal to the examinee's ability parameter and have items with large values for the item discrimination indices.

Estimating an Examinee's Ability

Page 59: IRT - Item Response Theory

RECAP

2. Item invariance of the examinee’s ability.

a. The different sets of items yielded values of estimated ability that were near the examinee's actual ability level.

b. The mean value of these estimates generally was a close approximation of the examinee's ability parameter.

3. Each examinee has an ability score (parameter value) that locates that person (estimated ability) on the scale.

Estimating an Examinee's Ability

Page 60: IRT - Item Response Theory

The statistical meaning of information is credited to Sir R.A. Fisher, who defined information as the precision with which a parameter can be estimated, the reciprocal of its variance:

I = 1/σ², where σ ~ SE(θ)

The Information Function

Page 61: IRT - Item Response Theory

I(θ) is maximum at θ = -1.0 and is about 3 for the ability range -2 <= θ <= 0. Within this range, ability is estimated with some precision.

I(θ) does not depend upon the distribution of examinees over the ability scale.

The ideal I(θ) would be a horizontal line at some large value: hard to achieve.

The precision with which an examinee's ability is estimated depends upon where the examinee's ability is located on the ability scale, which has implications for test construction.

The Information Function

Page 62: IRT - Item Response Theory

The item information function is written Ii(θ), where i indexes the item.

It will be small, since only a single item is involved.

The Information Function

Page 63: IRT - Item Response Theory

An item measures ability with greatest precision at the ability level corresponding to the item's difficulty parameter.

The amount of item information decreases as the ability level departs from the item difficulty, and approaches zero at the extremes of the ability scale.

Test Information Function:

I(θ) = Σ (i = 1 to N) Ii(θ)

The Information Function

Page 64: IRT - Item Response Theory

I(θ) is much higher than Ii(θ).

The longer the test, the higher I(θ), and the greater the precision in measuring an examinee's ability than with shorter tests.

Thus, ability is estimated with some precision near the center of the ability scale.

I(θ) tells how well the test is doing in estimating ability over the whole range of ability scores.

While the ideal test information function often may be a horizontal line, it may not be the best for a specific purpose.

The Information Function

Page 65: IRT - Item Response Theory

For example, if you were interested in constructing a test to award scholarships, this ideal might not be optimal. In this situation, you would like to measure ability with considerable precision at ability levels near the ability used to separate those who will receive the scholarship from those who do not.

The best test information function in this case would have a peak at the cutoff score.

Other specialized uses of tests could require other forms of the test information function.

The Information Function

Page 66: IRT - Item Response Theory

Information Functions

1PL:
Ii(θ) = Pi(θ) Qi(θ)
Pi(θ) = 1 / (1 + exp(-(θ - bi)))

2PL:
Ii(θ) = ai² Pi(θ) Qi(θ)
Pi(θ) = 1 / (1 + exp(-ai(θ - bi)))

3PL:
Ii(θ) = ai² [Qi(θ) / Pi(θ)] [(Pi(θ) - c)² / (1 - c)²]
Pi(θ) = c + (1 - c) / (1 + exp(-ai(θ - bi)))

In all cases, Qi(θ) = 1 - Pi(θ).
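A minimal Python sketch of the item information formulas above (the function name is mine); with c = 0 it reduces to the 2PL form, and with a = 1, c = 0 to the 1PL form. It reproduces the worked 3PL value on the next slide:

```python
import math

def item_information(theta, a, b, c=0.0):
    """Item information at theta; with c = 0 this is the 2PL a^2 * P * Q,
    and with a = 1, c = 0 the 1PL P * Q."""
    P = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    Q = 1.0 - P
    return a * a * (Q / P) * ((P - c) / (1.0 - c)) ** 2

print(round(item_information(1.0, a=1.5, b=1.0, c=0.2), 3))   # 0.375, as in the
                                                              # 3PL table below
```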

The Information Function

Page 67: IRT - Item Response Theory

Information Function Example: 3PL (b = 1.0, a = 1.5, c = .2)

θ      L     Pi(θ)   Qi(θ)   Qi/Pi   (Pi - c)²   Ii(θ)
-3   -6.0    .20     .80     3.950     .000      .000
-2   -4.5    .21     .79     3.785     .000      .001
-1   -3.0    .24     .76     3.202     .001      .016
 0   -1.5    .35     .65     1.890     .021      .142
 1    0.0    .60     .40      .667     .160      .375
 2    1.5    .85     .15      .171     .428      .257
 3    3.0    .96     .04      .040     .481      .082

The general level of the values for the amount of information is lower (because of the presence of the terms (1 - c) and (Pi(θ) - c)).

The maximum occurred at an ability level slightly higher than the value of b.

The Information Function

Page 68: IRT - Item Response Theory

When two items share common values of a and b, their information functions will be the same when c = 0.

When c > 0, the three-parameter model will always yield less information.

Thus, the item information function under a two-parameter model defines the upper bound for the amount of information under a three-parameter model.

This is reasonable, because getting the item correct by guessing should not enhance the precision with which an ability level is estimated.

The Information Function

Page 69: IRT - Item Response Theory

Computing a Test Information Function: a five-item test under the 2PL

Item     b      a
1      -1.0    2.0
2      -0.5    1.5
3       0.0    1.5
4       0.5    1.5
5       1.0    2.0

θ       1      2      3      4      5    Test Information
-3     .071   .051   .024   .012   .001       .159
-2     .420   .194   .102   .051   .010       .777
-1    1.000   .490   .336   .194   .071      2.091
 0     .420   .490   .563   .490   .420      2.383
 1     .071   .194   .336   .490  1.000      2.091
 2     .010   .051   .102   .194   .420       .777
 3     .001   .012   .024   .051   .071       .159
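The table can be reproduced by summing the 2PL item informations across the five items; a standalone sketch (names are my own):

```python
import math

# the five items from the slide, as (b, a) pairs
items = [(-1.0, 2.0), (-0.5, 1.5), (0.0, 1.5), (0.5, 1.5), (1.0, 2.0)]

for theta in range(-3, 4):
    infos = []
    for b, a in items:
        P = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        infos.append(a * a * P * (1.0 - P))         # 2PL item information
    row = "  ".join(f"{i:5.3f}" for i in infos)
    print(f"{theta:3d}  {row}  {sum(infos):6.3f}")  # last column: test information
```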

The Information Function

Page 70: IRT - Item Response Theory

The five item discriminations had a symmetrical distribution around a value of 1.5.

The five item difficulties had a symmetrical distribution about an ability level of zero.

Because of this, the test information function is also symmetric about an ability of zero.

The Information Function

Page 71: IRT - Item Response Theory

σ ~ SE(θ) = 1 / sqrt(I(θ))

A peaked I(θ) measures ability with unequal precision along the ability scale.

It is best for estimating the ability of examinees whose abilities fall near the peak of the test information function.

In some tests, I(θ) is rather flat over some region of the ability scale.

Such a test is desirable for those examinees whose ability falls in the given range.

The maximum amount of test information was 2.383, at an ability level of 0. This translates into a standard error of .65.

Roughly 68 percent of the estimates of this ability level fall between -.65 and +.65 (the ability level is estimated with a modest amount of precision).

The Information FunctionThe Information Function


Page 72: IRT - Item response Theory

RECAP

1. The general level of the test information function depends upon:

   a. The number of items in the test.
   b. The average value of the discrimination parameters of the test items.
   c. Both of the above hold for all three item characteristic curve models.

2. The shape of the test information function depends upon:

   a. The distribution of the item difficulties over the ability scale.
   b. The distribution and the average value of the discrimination parameters of the test items.

3. When the item difficulties are clustered closely around a given value, the test information function is peaked at that point on the ability scale. The maximum amount of information depends upon the values of the discrimination parameters.

The Information Function


Page 73: IRT - Item response Theory

RECAP

4. When the item difficulties are widely distributed over the ability scale, the test information function tends to be flatter than when the difficulties are tightly clustered.

5. Values of a < 1.0 result in a low general level of the amount of test information.

6. Values of a > 1.7 result in a high general level of the amount of test information.

7. Under a three-parameter model, values of the guessing parameter c greater than zero lower the amount of test information at the low-ability levels. In addition, large values of c reduce the general level of the amount of test information.

8. It is difficult to approximate a horizontal test information function. To do so, the values of b must be spread widely over the ability scale, and the values of a must be in the moderate to low range and have a U-shaped distribution.

The Information Function


Page 74: IRT - Item response Theory

Test constructors know beforehand what trait they want the items to measure and the ability level of the examinees at which each item is designed to function

But it is not possible to determine the values of an item's parameters a priori; when a test is administered, the latent trait levels of the examinees are not known

As a result, a major task is to determine the values of the item parameters and the examinee abilities in a metric for the underlying latent trait.

In IRT, this task is called test calibration

Test Calibration


Page 75: IRT - Item response Theory

The Test Calibration Process (Birnbaum, 1968)

Two stages of maximum likelihood estimation

Stage 1: the parameters of the N items in the test are estimated

• The estimated ability of each examinee is treated as if it is expressed in the true metric of the latent trait.

• Then the parameters of each item in the test are estimated via the maximum likelihood procedure

Items' parameters are estimated one item at a time (independence assumption)

Test Calibration


Page 76: IRT - Item response Theory

The Test Calibration Process (Birnbaum, 1968)

Two stages of maximum likelihood estimation

Stage 2: the ability parameters of the M examinees are estimated

• Taking these item parameter estimates, the ability of each examinee is estimated using the maximum likelihood procedure presented earlier

• The ability estimates are obtained one examinee at a time (independence assumption).

The two-stage process is repeated until some suitable convergence criterion is met.

The overall effect is that the parameters of the N test items and the ability levels of the M examinees have been estimated simultaneously.

Test Calibration


Page 77: IRT - Item response Theory

• It does not yield a unique metric for the ability scale.

• The midpoint and the unit of measurement of the obtained ability scale are indeterminate

• The metric is unique up to a linear transformation

• It is necessary to "anchor" the metric via arbitrary rules for determining the midpoint and unit of measurement of the ability scale.

• It is not possible to obtain estimates of the examinee's ability and of the item's parameters in the true metric of the underlying latent trait.

• The best we can do is obtain a metric that depends upon a particular combination of examinees and test items.
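
A quick numerical illustration of this indeterminacy, as a sketch under the Rasch model (logistic ICC with a = 1): adding the same constant to every ability and every difficulty leaves all response probabilities, and hence the likelihood of any data set, unchanged, which is why an arbitrary anchoring rule is needed.

import math

def p_rasch(theta, b):
    # Rasch ICC: depends only on the difference theta - b
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta, b, shift = 0.7, -0.3, 5.0
print(p_rasch(theta, b))                   # 0.7310...
print(p_rasch(theta + shift, b + shift))   # identical, so the midpoint is arbitrary

Under the two- and three-parameter models the unit is indeterminate as well: rescaling every θ and b by a constant can be absorbed by dividing each a by the same constant.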

Test Calibration


Page 78: IRT - Item response Theory

With three different ICC models to choose from, there are several different customised ways to implement the Birnbaum paradigm; you have to devise your own implementation.

Illustration: the BICAL program, as implemented by Benjamin Wright and his co-workers, for the 1PL (Rasch) model

A ten-item test administered to a group of 16 examinees

Test Calibration


Page 79: IRT - Item response Theory

ITEM RESPONSES

(binary response matrix for items 1-10, in which a 1 marks a correct response; RS is the raw score)

Examinee   RS
01          2
02          2
03          5
04          4
05          1
06          3
07          4
08          4
09          4
10          3
11          9
12          9
13          6
14          9
15          9
16         10

Test Calibration


Page 80: IRT - Item response Theory

Examinee 16 answered all of the items correctly and so was removed from the data set

If an item is answered correctly by all of the examinees or by none of the examinees, its item difficulty parameter cannot be estimated.

FREQUENCY COUNTS FOR EDITED DATA

Raw score   Row total of correct responses
1             1
2             4
3             6
4            16
5             5
6             6
9            36

Item (column) totals, items 1-10:  13  8  8  5  10  7  7  6  7  3   (grand total 74)

Test Calibration


Page 81: IRT - Item response Theory

• Given the two frequency vectors (rows and columns), the estimation process can be implemented

• Under the Rasch model, the anchoring procedure takes advantage of a = 1
  - The unit of measurement for estimated ability is set at 1.
  - Each item's difficulty is estimated

• To set the midpoint of the ability scale, the mean item difficulty is subtracted from the value of each item's difficulty estimate
  - The midpoint of the item difficulties becomes 0, and the same holds for the ability estimates.

• The ability estimate corresponding to each raw test score is obtained in the second stage, using the rescaled item difficulties (as if they were the difficulty parameters) and the vector of row marginal totals.
  - The metric is thereby changed to that of the rescaled ability parameters

• The output of this stage is an ability estimate for each raw test score in the data set

Test Calibration


Page 82: IRT - Item response Theory

At this point, the convergence of the overall iterative process is checked.

Wright summed the absolute differences between the values of the item difficulty parameter estimates for two successive iterations of the paradigm.

If this sum was less than .01, the estimation process was terminated.

If it was greater than .01, then another iteration was performed and the two stages were done again

Thus, the process of stage one, anchor the metric, stage two, and check for convergence is repeated until the criterion is met.

When this happens, the current values of the item and ability parameter estimates are accepted and an ability scale metric has been defined.
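
The loop just described can be summarised in a short Python sketch. It assumes the logistic Rasch ICC and plain Newton-Raphson updates, and it is only illustrative of the Birnbaum paradigm, not a reproduction of BICAL (which adds corrections and safeguards omitted here); X is an edited response matrix with perfect and zero rows and columns already removed.

import math

def p(theta, b):
    # Rasch ICC
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def calibrate(X, tol=0.01, max_cycles=100):
    M, N = len(X), len(X[0])          # M examinees, N items
    theta, b = [0.0] * M, [0.0] * N
    for _ in range(max_cycles):
        b_old = b[:]
        # Stage 1: item difficulties, abilities treated as known
        for j in range(N):
            for _ in range(10):       # Newton-Raphson for item j
                g = sum(X[i][j] - p(theta[i], b[j]) for i in range(M))
                h = sum(p(theta[i], b[j]) * (1.0 - p(theta[i], b[j])) for i in range(M))
                b[j] -= g / h
        # Anchor the metric: mean item difficulty set to zero
        mean_b = sum(b) / N
        b = [bj - mean_b for bj in b]
        # Stage 2: abilities, item difficulties treated as known
        for i in range(M):
            for _ in range(10):       # Newton-Raphson for examinee i
                g = sum(X[i][j] - p(theta[i], b[j]) for j in range(N))
                h = sum(p(theta[i], b[j]) * (1.0 - p(theta[i], b[j])) for j in range(N))
                theta[i] += g / h
        # Convergence: summed absolute change in the difficulty estimates
        if sum(abs(x - y) for x, y in zip(b, b_old)) < tol:
            break
    return b, theta

Because the Rasch likelihood for an examinee depends on the data only through the raw score, this procedure assigns the same ability estimate to every examinee with the same raw score, as the next two pages illustrate.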

Test Calibration


Page 83: IRT - Item response Theory

ITEM PARAMETER ESTIMATES

Item   Difficulty
1        -2.37
2        -0.27
3        -0.27
4        +0.98
5        -1.00
6        +0.11
7        +0.11
8        +0.52
9        +0.11
10       +2.06

Because of the anchoring procedures used, these values are actually relative to the average item difficulty of the test for these examinees.

You can verify that the sum of the item difficulties is zero (within rounding errors)

Test Calibration


Page 84: IRT - Item response Theory

ABILITY ESTIMATION

Examinee   Obtained Ability   Raw Score
1              -1.50              2
2              -1.50              2
3              +0.02              5
4              -0.42              4
5              -2.37              1
6              -0.91              3
7              -0.42              4
8              -0.42              4
9              -0.42              4
10             -0.91              3
11             +2.33              9
12             +2.33              9
13             +0.46              6
14             +2.33              9
15             +2.33              9
16             *****             10

All examinees with the same raw score obtained the same ability estimate

Test Calibration


Page 85: IRT - Item response Theory

• This unique feature is a direct consequence of fixing a = 1 for all items

  - the intuitive appeal of the Rasch model

• When 2-PL or 3-PL ICCs are used, an examinee's ability estimate depends upon the particular pattern of item responses rather than the raw score.

  - Under these models, examinees with the same item response pattern will obtain the same ability estimate.

  - Examinees with the same raw score could obtain different ability estimates if they answered different items correctly.

Test Calibration


Page 86: IRT - Item response Theory

Illustration:

Three different ten-item tests measuring the same latent trait will be used. A common group of 16 examinees will take all three of the tests.

The tests were created so that the average difficulty of the first test was matched to the mean ability of the common group of examinees.

The second test was created to be an easy test for this group.

The third test was created to be a hard test for this group. Each of these test-group combinations will be subjected to the Birnbaum paradigm and calibrated separately.

Each test calibration yields a unique metric for the ability scale.

Due to the anchoring process, all three test calibrations yielded a mean item difficulty of zero.

Test Calibration


Page 87: IRT - Item response Theory

Within each calibration, examinees with the same raw test score obtained the same estimated ability.

However, a given raw score will not yield the same estimated ability across the three calibrations.

The value of the mean estimated abilities is expressed relative to the mean item difficulty of the test.

Thus, the mean difficulty of the easy test should result in a positive mean ability. The mean ability on the hard test should have a negative value. The mean ability on the matched test should be near zero.

Test results can be placed on a common ability scale.

Test Calibration


Page 88: IRT - Item response Theory

Putting the Three Tests on a Common Ability Scale (Test Equating)

The principle of the item invariance of an examinee's ability indicates that an examinee should obtain the same ability estimate regardless of the set of items used

However, in the three test calibrations done above, this did not hold. The problem is not in the invariance principle, but in the test calibrations

The invariance principle assumes that the values of the item parameters of the several sets of items are all expressed in the same ability-scale metric. In the present situation, there are three different ability scales, one from each of the calibrations.

The average difficulties of these tests were intended to be different, but the anchoring process forced each test to have a mean item difficulty of zero.

Test Calibration


Page 89: IRT - Item response Theory

The mean ability of the common group was .06 for the matched test, .44 for the easy test, and -.11 for the hard test.

This tells us that the mean ability from the matched test is about what it should be.

The mean from the easy test tells us that the average ability is above the mean item difficulty of the test

The mean ability from the hard test is below the mean item difficulty.

Test Calibration


Page 90: IRT - Item response Theory

We can use the mean abilities to position the tests on a common scale.

But which particular test calibration should be used as the baseline?

Matched test: this calibration yielded a mean ability of .062 and a mean item difficulty of zero.

Because the Rasch model was used, the unit of measurement for all three calibrations is unity. Therefore, bringing the easy and hard test results to the baseline metric only involves adjusting for the differences in midpoints.

Test Calibration


Page 91: IRT - Item response Theory

Easy Test

The shift factor needed is the difference between the mean estimated ability of the common group on the easy test (.444) and on the matched test (.062), which is .382.

To convert the values of the item difficulties for the easy test to the baseline metric, one simply subtracts .382 from each item difficulty.

Similarly, each examinee's ability can be expressed in the baseline metric by subtracting .382 from it.

Test Calibration


Page 92: IRT - Item response Theory

Hard Test

The hard test results can be expressed in the baseline metric by using the differences in mean ability. The shift factor is -.111 - .062, or -.173.

Again, subtracting this value from each of the item difficulty estimates puts them in the baseline metric.

The ability estimates of the common group yielded by the hard test can be transformed to the baseline metric of the matched test by using the same shift factor
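
Because only the midpoints differ under the Rasch model, the whole equating step reduces to one subtraction, as in this small Python sketch (the mean abilities are the ones quoted above; the function name is illustrative):

# Mean ability of the common group under each calibration
mean_ability = {"matched": 0.062, "easy": 0.444, "hard": -0.111}

def to_baseline(values, test, baseline="matched"):
    # Shift item difficulties or ability estimates from a test's own
    # metric to the baseline metric of the matched test.
    shift = mean_ability[test] - mean_ability[baseline]
    return [v - shift for v in values]

print(to_baseline([0.444], "easy"))   # [0.062]: the group mean maps onto the baseline mean
print(to_baseline([-0.111], "hard"))  # [0.062]

The same call applied to the easy and hard tests' item difficulties and ability estimates produces the baseline-metric values tabled on the next two pages.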

Test Calibration


Page 93: IRT - Item response Theory

Item difficulties in the baseline metric

Item   Easy test   Matched test   Hard test
1        -1.492       -2.37         *****
2        -1.492        -.27         -.037
3        -2.122        -.27         -.497
4         -.182         .98         -.497
5         -.562       -1.00          .963
6          .178         .11         -.497
7          .528         .11          .383
8          .582         .52         1.533
9          .880         .11          .443
10         .880        2.06         *****
Mean      -.285        0.00          .224

Test Calibration


Page 94: IRT - Item response Theory

Ability estimates of the common group in the baseline metric

Examinee   Easy test   Matched test   Hard test
1           -2.900       -1.50          *****
2            -.772       -1.50          *****
3           -1.962         .02          *****
4            -.292        -.42          -.877
5            -.292       -2.37          -.877
6             .168        -.91          *****
7            1.968        -.42         -1.637
8             .168        -.42          -.877
9             .638        -.42         -1.637
10            .638        -.91          -.877
11            .638        2.33           .153
12           1.188        2.33           .153
13            .292         .46           .153
14           1.968        2.33          2.003
15           *****        2.33          1.213
16           *****       *****          2.003
Mean          .062         .062           .062
Std. Dev.    1.344        1.566          1.413

Test Calibration


Page 95: IRT - Item response Theory

After transformation:

• The matched test has its mean item difficulty at the midpoint of the baseline ability scale.

• The easy test has a negative mean item difficulty.
• The hard test has a positive mean item difficulty.

• The average difficulty of both tests is about the same distance from the middle of the scale.

In technical terms, we have "equated" the tests, i.e., put them on a common scale.

Test Calibration


Page 96: IRT - Item response Theory

• The mean estimated ability of the common group was the same for all three tests.

• The standard deviations of the ability estimates were nearly the same for the easy and hard tests, and that for the matched test was "in the ballpark."

• Although the summary statistics were quite similar for all three sets of results, the ability estimates for a given examinee varied widely.

• The invariance principle has not gone awry; what you are seeing is sampling variation.

Given the small size of the data sets, it is quite amazing that the results came out as nicely as they did.

This demonstrates rather clearly the powerful capabilities of the Rasch model and Birnbaum's MLE paradigm.

Test Calibration


Page 97: IRT - Item response Theory

• After equating, the numerical values of the item parameters can be used to compare where different items function on the ability scale.

• The examinees' estimated abilities also are expressed in this metric and can be compared.

• It is also possible to compute the test characteristic curve and the test information function for the easy and hard tests in the baseline metric.

• Technically speaking, the tests were equated using the common-group approach with tests of different difficulty.

• The ease with which test equating can be accomplished is one of the major advantages of item response theory over classical test theory.

Test Calibration


Page 98: IRT - Item response Theory

RECAP

1. The end product of the test calibration process is the definition of an ability scale metric.

2. Under the Rasch model, this scale has a unit of measurement of 1 and a midpoint of zero.

3. However, it is not the metric of the underlying latent trait. The obtained metric depends upon the item responses yielded by a particular combination of examinees and test items being subjected to the Birnbaum paradigm.

4. Since the true metric of the underlying latent trait cannot be determined, the metric yielded by the Birnbaum paradigm is used as if it were the true metric. The obtained item difficulty values and the examinee's ability are interpreted in this metric. Thus, the test has been calibrated.

Test Calibration


Page 99: IRT - Item response Theory

5. The outcome of the test calibration procedure is to locate each examinee and item along the obtained ability scale.

6. In the present example, item 5 had a difficulty of -1 and examinee 10 had an ability estimate of -.91. Therefore, the probability of examinee 10 answering item 5 correctly is approximately .5 (see the check below).

7. The capability to locate items and examinees along a common scale is a powerful feature of item response theory: a single, consistent framework.
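
As a quick arithmetic check of point 6, a sketch under the Rasch model (logistic ICC with a = 1), using the estimates from the calibration above:

import math

theta, b = -0.91, -1.00   # examinee 10's ability, item 5's difficulty
p = 1.0 / (1.0 + math.exp(-(theta - b)))
print(round(p, 2))        # 0.52, i.e., approximately .5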

Test Calibration


Page 100: IRT - Item response Theory

During this transitional period in testing practices, many tests have been designed and constructed using classical test theory principles but have been analyzed via item response theory procedures.

This lack of congruence between the construction and analysis procedures has kept the full power of item response theory from being exploited.

In order to obtain the many advantages of item response theory, tests should be designed, constructed, analyzed, and interpreted within the framework of the theory.

Specifying the characteristics of a test


Page 101: IRT - Item response Theory

• Under item response theory, a well-defined set of procedures (item banking) is used to establish and maintain item pools.

• Item parameters are expressed in a known ability-scale metric.

• It is possible to select items from the item pool and determine the major technical characteristics of a test before it is administered.

• If the test characteristics do not meet the design goals, selected items can be replaced by other items from the item pool until the desired characteristics are obtained.

• In this way, considerable time and money that would ordinarily be devoted to piloting the test are saved.

Specifying the characteristics of a test


Page 102: IRT - Item response Theory

• First, define the latent trait the items are to measure, write items to measure this trait, and pilot test the items to weed out poor items.

• This large set of items is then administered to a large group of examinees.

• An item characteristic curve model is selected, the examinees' item response data are analyzed via the Birnbaum paradigm, and the test is calibrated.

• The ability scale resulting from this calibration is considered to be the baseline metric of the item pool.

• From a test construction point of view, we now have a set of items whose item parameter values are known; in technical terms, a "precalibrated item pool" exists.

Specifying the characteristics of a test


Page 103: IRT - Item response Theory

Developing a Test From a Precalibrated Item Pool

Since the items in the precalibrated item pool measure a specific latent trait, tests constructed from it will also measure this trait.

Alternate forms are routinely needed to maintain test security, and special versions of the test can be used to award scholarships.

In such cases, items would be selected from the item pool on the basis of their content and their technical characteristics to meet the particular testing goals.

The advantage of having a precalibrated item pool is that the parameter values of the items included in the test can be used to compute the test characteristic curve and the test information function before the test is administered.

Specifying the characteristics of a test


Page 104: IRT - Item response Theory

This is possible because neither of these curves depends upon the distribution of examinee ability scores over the ability scale.

Given these two curves, the test constructor has a very good idea of how the test will perform before it is given to a group of examinees.

In addition, when the test has been administered and calibrated, test equating procedures can be used to express the ability estimates of the new group of examinees in the metric of the item pool.

Specifying the characteristics of a test


Page 105: IRT - Item response Theory

Screening tests

To distinguish rather sharply between examinees whose abilities are just below a given ability level and those who are at or above that level.

- used to assign scholarships
- and to assign students to specific instructional programs

Broad-range tests

To measure ability over a wide range of the underlying ability scale.

- Tests measuring reading or mathematics are typically broad-range tests.

Specifying the characteristics of a test


Page 106: IRT - Item response Theory

Peaked tests

To measure ability well in a range of ability that is wider than that of a screening test, but not as wide as that of a broad-range test.

Some Ground Rules

a. It is assumed that the items would be selected on the basis of content as well as parameter values.

b. No two items in the item pool possess exactly the same combination of item parameter values.

c. The item parameter values are subject to the following constraints:
   -3 <= b <= +3,   .50 <= a < +2.00,   0 <= c <= .35

The values of the discrimination parameter have been restricted to reflect the range of values usually seen in well-maintained item pools.

Specifying the characteristics of a test


Page 107: IRT - Item response Theory

Example Case

You are to construct a ten-item screening test that will separate examinees into two groups: those who need remedial instruction and those who don't, on the ability measured by the items in the item pool. Students whose ability falls below a value of -1 will receive the instruction.

Solution:

Item     b      a
1      -1.8    1.2
2      -1.6    1.4
3      -1.4    1.1
4      -1.2    1.3
5      -1.0    1.5
6       -.8    1.0
7       -.6    1.4
8       -.4    1.2
9       -.2    1.1
10      0.0    1.3

The logic underlying these choices was one of centering the difficulties on the cutoff level of -1 and using moderate values of discrimination.
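
With a precalibrated pool, the design can be checked before administration. The following Python sketch assumes the logistic 2PL model (no guessing parameter was specified for these items) and evaluates the test characteristic curve (the true score) and the test information function at a grid of ability levels:

import math

# Screening-test items (b, a) from the solution above
items = [(-1.8, 1.2), (-1.6, 1.4), (-1.4, 1.1), (-1.2, 1.3), (-1.0, 1.5),
         (-0.8, 1.0), (-0.6, 1.4), (-0.4, 1.2), (-0.2, 1.1), (0.0, 1.3)]

def p(theta, b, a):
    # 2PL item characteristic curve
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

for theta in [x / 2.0 for x in range(-6, 5)]:   # -3.0 to +2.0 in steps of .5
    true_score = sum(p(theta, b, a) for b, a in items)            # TCC ordinate
    info = sum(a * a * p(theta, b, a) * (1.0 - p(theta, b, a))    # TIF ordinate
               for b, a in items)
    print(theta, round(true_score, 2), round(info, 2))

The printout locates the mid-true score (5 of 10) and the information peak near θ = -1, which is the behaviour described on the next page.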

Specifying the characteristics of a test


Page 108: IRT - Item response Theory

The mid-true score corresponded to an ability level of -1.0.

The test characteristic curve was not particularly steep at the cutoff level, indicating that the test lacked discrimination.

The peak of the information function occurred at an ability level of -1.0, but the maximum was a bit small.

The results suggest that the test was properly positioned on the ability scale but that a better set of items could be found.

Specifying the characteristics of a test


Page 109: IRT - Item response Theory

The following changes would improve the test's characteristics:

First, cluster the values of the item difficulties nearer the cutoff level;

second, use larger values of the discrimination parameters.

These two changes should steepen the test characteristic curve and increase the maximum amount of information at the ability level of -1.0.

Specifying the characteristics of a test


Page 110: IRT - Item response Theory

RECAP

1. Screening tests.

   1. The desired test characteristic curve has the mid-true score at the specified cutoff ability level. The curve should be as steep as possible at that ability level.

   2. The test information function should be peaked, with its maximum at the cutoff ability level.

   3. The optimal case is where all item difficulties are at the cutoff point and the item discriminations are large.

   4. Select items that yield the maximum amount of information at the cutoff point.

Specifying the characteristics of a test


Page 111: IRT - Item response Theory

2. Broad-range tests.

   1. The desired test characteristic curve has its mid-true score at an ability level corresponding to the midpoint of the range of ability of interest.

   2. Most often this is an ability level of zero.

   3. The test characteristic curve should be linear for most of its range.

   4. The desired test information function is horizontal over the widest possible range.

   5. The maximum amount of information should be as large as possible.

   6. The values of the item difficulty parameters should be spread uniformly over the ability scale and as widely as practical.

Specifying the characteristics of a test


Page 112: IRT - Item response Theory

3. Peaked tests.

   1. The desired TCC has its mid-true score in the middle of the ability range of interest. The curve should have a moderate slope at that ability level.

   2. The test information function should be rounded in appearance over the ability range of most interest.

   3. The item difficulties should be clustered around the midpoint of the ability range of interest, but not as tightly as in the case of a screening test.

   4. The values of the discrimination parameters should be large.

   5. Items whose difficulties are within the ability range of interest should have larger values of the discrimination than other items

Specifying the characteristics of a test


Page 113: IRT - Item response Theory

4. Role of item characteristic curve models.

   1. With 'a' fixed at 1.0, the Rasch model has a limit placed upon the maximum amount of information that can be obtained.

   2. The maximum amount of item information is .25, since Pi(θ)Qi(θ) = .25 when Pi(θ) = .5. The maximum amount of information for a test under the Rasch model is therefore .25 times the number of items.

   3. Due to the presence of the guessing parameter, a 3-PL model will yield a more linear test characteristic curve and a test information function with a lower general level than a 2-PL model with the same values of 'a' and 'b'

   4. The information function under a 2-PL model is the upper bound for the information function under a 3-PL model when the values of 'b' and 'a' are the same.

Specifying the characteristics of a test


Page 114: IRT - Item response Theory

5. Role of the number of items.

Increasing the number of items has little impact upon the general form of the TCC if the distribution of the sets of item parameters remains the same.

Increasing the number of items in a test has a significant impact upon the general level of the test information function (TIF).

Select items having high values of 'a' and a distribution of item difficulties consistent with the testing goals.

The pairing of item parameters is vital: e.g., a highly discriminating item whose difficulty is not in the range of interest does little for the test information function or the slope of the TCC.

It is important to inspect the ICC and the IIF to ascertain the item's contribution to the TCC and to the TIF

Partial Credit Model

Specifying the characteristics of a test


Page 115: IRT - Item response Theory

• The computer can update the estimate of the examinee's ability after each item, and this estimate is then used to select the subsequent item.

• With the right item bank and a high variance in examinee ability, CAT can be much more efficient than a traditional paper-and-pencil test.

• Paper-and-pencil tests

  • are "fixed-item" tests

  • Everyone takes every item

  • Easy and hard items act like constants added to the score.

  • They provide little information about the examinee's ability level.

  • Large numbers of items and examinees are needed to obtain a modest degree of precision.

Computer Adaptive Test


Page 116: IRT - Item response Theory

Computer adaptive tests

• The examinee's ability level relative to a norm group can be iteratively estimated.

• Items can be selected based on the current ability estimate.

• Examinees can be given the items that maximize the information (within constraints) about their ability levels from the item responses.

• Examinees will receive few items that are very easy or very hard for them (little information is gained from such items).

• Reduced standard errors (the reciprocal of the square root of the test information) and greater precision are achieved with only a handful of properly selected items.

Computer Adaptive Test


Page 117: IRT - Item response Theory

The CAT algorithm is usually an iterative process

Step 1: All the items that have not yet been administered are evaluated to determine which will be the best one to administer next, given the currently estimated ability level

Step 2: The "best" next item (providing the most information) is administered and the examinee responds

Step 3: A new ability estimate is computed based on the responses to all of the administered items.

Steps 1 through 3 are repeated until a stopping criterion is met.

Similar to the Newton-Raphson iterative method for solving equations
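
The three steps translate directly into code. The sketch below assumes a precalibrated Rasch item pool (under which the most informative remaining item is the one whose difficulty is closest to the current ability estimate), Newton-Raphson MLE for step 3, and the SE <= .33 stopping rule discussed on the next page; real CAT systems add refinements (e.g., exposure control and better handling of all-correct or all-wrong response patterns) that are omitted here. The names pool and answer are illustrative.

import math

def p(theta, b):
    # Rasch ICC
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def cat(pool, answer, max_items=30, target_se=0.33):
    # pool: dict item_id -> difficulty b; answer(item_id) -> observed 0/1 response
    theta, responses = 0.0, []
    while pool and len(responses) < max_items:
        # Step 1: under Rasch, information P*Q peaks where b is nearest theta
        item = min(pool, key=lambda i: abs(pool[i] - theta))
        # Step 2: administer the selected item
        x = answer(item)
        responses.append((pool.pop(item), x))
        # Step 3: re-estimate theta by Newton-Raphson over all responses so far
        for _ in range(10):
            g = sum(x - p(theta, b) for b, x in responses)
            h = sum(p(theta, b) * (1.0 - p(theta, b)) for b, x in responses)
            theta = max(-4.0, min(4.0, theta + g / h))   # clamp to keep the sketch stable
        info = sum(p(theta, b) * (1.0 - p(theta, b)) for b, _ in responses)
        if 1.0 / math.sqrt(info) <= target_se:           # stop once the SE is small enough
            break
    return theta, len(responses)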


Page 118: IRT - Item response Theory

Reliability and standard error

• In classical measurement, with a test reliability of 0.90, the standard error of measurement for the test is about .33 of the standard deviation of examinee test scores.

• In item response theory-based measurement, when ability scores are scaled to a mean of zero and a standard deviation of one (which is common), this level of reliability corresponds to a standard error of about .33 and test information of about 10.

• Thus, it is common in practice to design CATs so that the standard errors are about .33 or smaller (or, correspondingly, so that test information exceeds 10); the arithmetic is worked out below.
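The arithmetic linking these three numbers is a standard CTT-IRT correspondence and can be written out explicitly (not shown on the original slide):

```latex
% With ability scaled to mean 0 and SD 1, reliability and the standard
% error of the ability estimate are linked by \rho = 1 - SE^2, and the
% standard error relates to test information I by SE = 1/\sqrt{I}.
% (The slide rounds 0.316 to "about .33".)
\rho = 0.90 \;\Longrightarrow\;
SE = \sqrt{1 - \rho} = \sqrt{0.10} \approx 0.316,
\qquad
I = \frac{1}{SE^{2}} = \frac{1}{0.10} = 10 .
```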


Page 119: IRT - Item response Theory

Potential of computer adaptive tests

• Tests are given "on demand" and scores are available immediately.

• Neither answer sheets nor trained test administrators are needed.

• Test administrator differences are eliminated as a factor in measurement error.

• Tests are individually paced.

• Test security may be increased.

• Computerized testing offers a number of options for timing and formatting.


Page 120: IRT - Item response Theory

• CATs can reduce testing time by more than 50% while maintaining the same level of reliability. Shorter testing times also reduce fatigue, a factor that can significantly affect an examinee's test results.

• CATs can provide accurate scores over a wide range of abilities, while traditional tests are usually most accurate for average examinees.

CAT and IRT are a powerful combination.


Page 121: IRT - Item response Theory

Limitations of CAT

• CATs are not applicable for all subjects and skills.

• Most CATs are based on an IRT model, yet IRT is not applicable to all skills and item types.

• Hardware limitations

• Items involving detailed artwork and graphs or extensive reading passages, for example, may be hard to present.

• CATs require careful item calibration.

• CATs are only manageable if a facility has enough computers for a large number of examinees and the examinees are at least partially computer-literate. This can be a big limitation.


Page 122: IRT - Item response Theory

The test administration procedures are different, which may cause problems for some examinees.

With each examinee receiving a different set of questions, there can be perceived inequities.

Examinees are not usually permitted to go back and change answers.


Page 123: IRT - Item response Theory

References

Baker, F.B. The Basics of Item Response Theory.

Birnbaum, A. "Some latent trait models and their use in inferring an examinee's ability." Part 5 in F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley, 1968.

Hambleton, R.K., and Swaminathan, H. Item Response Theory: Principles and Applications. Hingham, MA: Kluwer-Nijhoff, 1984.

Hulin, C.L., Drasgow, F., and Parsons, C.K. Item Response Theory: Application to Psychological Measurement. Homewood, IL: Dow Jones-Irwin, 1983.

Lord, F.M. Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Erlbaum, 1980.

Mislevy, R.J., and Bock, R.D. PC-BILOG 3: Item Analysis and Test Scoring with Binary Logistic Models. Mooresville, IN: Scientific Software, Inc., 1986.

Wright, B.D., and Mead, R.J. BICAL: Calibrating Items with the Rasch Model. Research Memorandum No. 23. Statistical Laboratory, Department of Education, University of Chicago, 1976.

Wright, B.D., and Stone, M.A. Best Test Design. Chicago: MESA Press, 1979.

NCME Instructional Modules.


Page 124: IRT - Item response Theory


Thank You

Page 125: IRT - Item response Theory

What is IRT? – DIF/DTF

Classical test theory methods confound "bias" with true mean differences; IRT does not. In IRT terminology, item/test bias is referred to as DIF/DTF (Differential Item/Test Functioning).

DIF refers to a difference in the probability of endorsing an item for members of a reference group (e.g., US workers) and a focal group (e.g., Chinese workers) having the same standing on theta.

DTF refers to a difference in the test characteristic curves, obtained by summing the item response functions for each group.

DTF is perhaps more important for selection because decisions are made based on test scores, not individual item responses.

If DIF is detected, IRT can control for item bias when estimating scores.
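The two DIF patterns shown on the next slide can be reproduced with a small sketch; the 2PL parameters here are invented for illustration. The groups share the same theta scale but face different item parameters.

```python
import numpy as np

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 13)

# Uniform DIF: same a, but the item is harder for the focal group,
# so the reference group is favored at every theta.
gap_uniform = p_2pl(theta, 1.2, 0.0) - p_2pl(theta, 1.2, 0.7)

# Nonuniform (crossing) DIF: the a's differ, so the curves cross --
# focal favored at low theta, reference favored at high theta.
gap_crossing = p_2pl(theta, 1.6, 0.0) - p_2pl(theta, 0.7, 0.0)

print("uniform DIF, ref - focal:", np.round(gap_uniform, 2))    # all >= 0
print("crossing DIF, ref - focal:", np.round(gap_crossing, 2))  # changes sign
```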


Page 126: IRT - Item response Theory

What is IRT? – DIF Examples

[Figure: Uniform DIF against the focal group — probability of a positive response vs. theta for the reference and focal groups; the reference group is favored at all theta levels.]

[Figure: Nonuniform (crossing) DIF — probability of a positive response vs. theta; the focal group is favored at low theta and the reference group at high theta.]


Page 127: IRT - Item response Theory

What is IRT? – DIF Detection

• DIF
  – Parametric: Lord's Chi-Square; Likelihood Ratio Test; Signed and Unsigned Area Methods
  – Nonparametric: SIBTEST; Mantel-Haenszel

• DTF
  – Parametric: Raju's DFIT Method
  – Nonparametric: SIBTEST
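Of the nonparametric methods above, Mantel-Haenszel is the easiest to sketch. The following is a minimal illustrative implementation (my own, not from the slides): examinees are matched on a stratifying score (usually the total test score) and the studied item's 2x2 tables are pooled across strata.

```python
import numpy as np

def mantel_haenszel_dif(score_ref, right_ref, score_foc, right_foc):
    """MH chi-square and common odds ratio for one studied item.

    score_*: matching variable per examinee (e.g., total test score).
    right_*: 0/1 responses to the studied item (numpy arrays).
    """
    A = E = V = num = den = 0.0
    for s in np.unique(np.concatenate([score_ref, score_foc])):
        r = right_ref[score_ref == s]      # reference group, this stratum
        f = right_foc[score_foc == s]      # focal group, this stratum
        nR, nF = len(r), len(f)
        N = nR + nF
        if nR == 0 or nF == 0 or N < 2:
            continue                        # stratum carries no information
        m1 = r.sum() + f.sum()              # total correct in stratum
        m0 = N - m1                         # total incorrect in stratum
        A += r.sum()                        # observed reference-correct
        E += nR * m1 / N                    # expected under no DIF
        V += nR * nF * m1 * m0 / (N ** 2 * (N - 1))
        num += r.sum() * (nF - f.sum()) / N
        den += (nR - r.sum()) * f.sum() / N
    chi2 = (abs(A - E) - 0.5) ** 2 / V      # with continuity correction
    return chi2, num / den                  # chi2 (1 df), alpha_MH
```

An alpha_MH near 1 (chi-square near 0) indicates no DIF; alpha_MH above 1 indicates the item favors the reference group.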


Page 128: IRT - Item response Theory

What is IRT? – Some simple models

Let $\theta_j$ represent the latent (factor) score for individual j. Let $p_{ij}$ be the probability that individual j responds correctly to item i. Then a simple item response model is

$\operatorname{logit}(p_{ij}) = a_i(\theta_j - b_i)$

This is just a ‘binary response’ factor analysis model.


Page 129: IRT - Item response Theory

Lawley (1944) really started it off

Lord (1980) promoted the term 'item response theory' as opposed to 'classical item analysis'.

Technical elaborations include:

– 'Parameters' for 'guessing'
– Partial credit (degrees of correctness) responses
– Multidimensional models

BUT the 'workhorse' is still the Lord model (with the factor assumed to be a random rather than a fixed variable), as follows:

$p_{ij} = \Pr(y_{ij} = 1 \mid \theta_j) = \{1 + \exp[-a_i(\theta_j - b_i)]\}^{-1}$


Page 130: IRT - Item response Theory

What is IRT? – Classical to IRT

• Classical test theory is really an item response model (IRM).

• A reasonable (consistent) estimate of $\theta_j$ (a random variable) is given by the 'raw score', i.e. the percentage (or total) of correct items.

• A somewhat more efficient estimate is given by a weighted percentage, using the $a_i$ as weights.

• The Lord model is simply:

$p_{ij} = \{1 + \exp[-a_i(\theta_j - b_i)]\}^{-1}$
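A toy illustration (invented numbers) of the two score estimates described above:

```python
import numpy as np

a = np.array([0.5, 1.0, 2.0, 1.5])    # item discriminations
u = np.array([1, 0, 1, 1])            # 0/1 responses for one examinee

raw = u.mean()                         # raw-score estimate (consistent)
weighted = (a * u).sum() / a.sum()     # a-weighted percentage (more efficient)
print(raw, weighted)                   # 0.75 vs 0.8
```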


Page 131: IRT - Item response Theory

What is IRT? – Item response relationship

The sigmoid function is a type of logistic function. The general form of the logistic function is:

$f(x) = \dfrac{a}{1 + n\,e^{-(x - m)/\tau}}$

The special case of the logistic function with $a = 1$, $m = 0$, $n = 1$, $\tau = 1$ is:

$f(x) = \dfrac{1}{1 + e^{-x}}$

The logistic function is the inverse of the natural logit function and so can be used to convert the logarithm of odds into a probability; the conversion from the log-likelihood ratio of two alternatives also takes the form of a sigmoid curve.

For a single item in a test, the ICC takes this sigmoid form.


Page 132: IRT - Item response Theory

What is IRT? Rasch ModelWhat is IRT? Rasch Model

$p_{ij} = \{1 + \exp[-(\theta_j - b_i)]\}^{-1}$

Here the 'discrimination' is assumed to be the same for each item.

The resulting (maximum likelihood) factor score estimates are then a one-to-one transformation of the raw scores.

The Rasch model is a special case and will often not fit the data very well.

We can add further predictors, for example social background, that may mediate the relationships.

Item response practitioners go further:

If we assume that the item parameters ($a_i$, $b_i$) are the same across populations and, for example, across tests, then we can form common scales for different populations and different tests.
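Because the Rasch likelihood depends on the responses only through the raw score, one ability estimate can be tabulated per raw score. A minimal Newton-Raphson sketch (with invented item difficulties) makes the one-to-one transformation visible:

```python
import numpy as np

def rasch_theta(r, b, iters=25):
    """Solve the Rasch score equation sum_i P_i(theta) = r for theta."""
    theta = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        grad = r - p.sum()                       # score equation residual
        hess = -(p * (1.0 - p)).sum()            # always negative
        theta = float(np.clip(theta - grad / hess, -5.0, 5.0))
    return theta

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])        # invented difficulties
for r in range(1, len(b)):                       # non-extreme raw scores
    print(r, round(rasch_theta(r, b), 2))        # one theta per raw score
```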


Page 133: IRT - Item response Theory

What is IRT? – Multidimensional Model

We can generalise our logistic model as follows:

$p_{ij} = \{1 + \exp[-(a_{1i}\theta_{1j} + a_{2i}\theta_{2j} - b_i)]\}^{-1}$

Adding a further factor allows an individual to be characterised by two underlying traits.

A sensible analysis will explore the dimensionality structure of a set of item responses.

Assumptions are needed, for example that the factors are independent, or alternatively that they are correlated but each item has a non-zero coefficient (loading) on only one factor, or an intermediate assumption.

What are the consequences of a more complex structure?
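A two-factor version of the response probability, as a sketch (the loadings and the simple-structure zeros are invented):

```python
import numpy as np

def p_2dim(theta1, theta2, a1, a2, b):
    """Two-dimensional logistic item response probability."""
    return 1.0 / (1.0 + np.exp(-(a1 * theta1 + a2 * theta2 - b)))

# Simple structure: each item loads on only one factor, one of the
# identifying assumptions mentioned above.
print(p_2dim(1.0, -0.5, a1=1.2, a2=0.0, b=0.3))  # item measuring trait 1
print(p_2dim(1.0, -0.5, a1=0.0, a2=1.2, b=0.3))  # item measuring trait 2
```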


Page 134: IRT - Item response Theory

It allows a more faithful representation of multi-faceted achievement.

It allows the (multidimensional) structure of achievement to be compared among groups or populations, in the following ways:

– The correlations between factors can vary.
– The values of the loadings can vary.
– The factor scores can be allowed to depend on further variables, such as gender, and the resulting 'regressions' may vary, for example $\theta_j = \beta_0 + \beta_1\,\mathrm{gender}_j + u_j$.

With extensions to multilevel modelling etc., this becomes a 'structural equation model'.


Page 135: IRT - Item response Theory

Partial Credit Model Variation

Object | Item 1 Item 2

Person A >>> | << Item 1.3

| << Item 1.2 <<<<Item 2.2

|

Person B >>> | << <<<<<< <<<<Item 2.1

| << Item 1.1

Central line = scale at interval level of measurement.


Page 136: IRT - Item response Theory

Items:

– Item 1.3 refers to Item 1, Category 3; Item 1.2 refers to Item 1, Category 2; and so on.
– The difficulty associated with Category 3 of Item 1 is greater than the difficulty associated with Category 2 of Item 1, and so on (ordered categories).
– The location of Item 1.3 on the scale indicates the ability associated with a 50% probability of passing Category 3 of Item 1 (or any of the lower categories).

Persons:

– The location of Person A on the scale indicates his ability.
– The probability of Person A passing categories at a lower level of difficulty is more than 50%.
– The probability of Person A passing categories at a higher level of difficulty is less than 50%.
– The probability of Person A passing categories at a level of difficulty that is the same as his ability is 50%.
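These statements can be made concrete with the partial credit model's category probabilities; a minimal sketch with invented step difficulties follows.

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Partial credit model: P(X = k), k = 0..m, for one item with
    step difficulties deltas[0..m-1] (threshold of each step)."""
    # Cumulative sums of (theta - delta_k); category 0 has sum 0.
    cum = np.concatenate([[0.0], np.cumsum(theta - np.asarray(deltas))])
    num = np.exp(cum - cum.max())          # numerically stable softmax
    return num / num.sum()

deltas = [-1.0, 0.5, 1.5]                  # e.g. Item 1 with 4 categories
for theta in (-2.0, 0.0, 2.0):
    print(theta, np.round(pcm_probs(theta, deltas), 2))
```

As theta rises, the probability mass shifts from the lowest category toward the highest, reflecting the ordered categories described above.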
