
Selected topics in psychometrics

L8: Differential Item Functioning

Patrícia Martinková

NMST570, December 3, 2019

Department of Statistical Modelling

Institute of Computer Science, Czech Academy of Sciences

Institute for Research and Development of Education

Faculty of Education, Charles University, Prague

Outline

1. Review

2. Introduction to DIF

3. DIF detection in binary items

4. Conclusion


Review - IRT models

• IRT models for binary data

• Rasch model, 1PL, 2PL, 3PL, 4PL IRT model

• IRT models for ordinal data

• Cumulative logits

• Graded Response Model (GRM)

• Adjacent-category logits

• Partial Credit Model (PCM)

• Generalized Partial Credit Model (GPCM)

• Rating Scale Model (RSM)

• IRT models for nominal data

• Baseline-category logits

• Nominal Response Model (NRM)

Patrıcia Martinkova NMST570, L8: Differential Item Functioning 1/41



Review - IRT models

• Item Characteristic Curve (ICC), Item Response Function (IRF)

• Item Information Function (IIF), Test Information Function (TIF)

• Likelihood function

• Parameter estimation

• Joint maximum likelihood (JML)

• Conditional maximum likelihood (CML)

• Marginal maximum likelihood (MML)

• Model fit

• Item fit

• Person fit




Differential Item Functioning

Differential Item Functioning (DIF)

An item exhibits DIF when two respondents with the same underlying ability but from

different groups have different probabilities of answering the item correctly

• Two groups referred to as reference and focal (usually minority)

• Two types of DIF - uniform and non-uniform




Examples of DIF items

Example (SAT) (Cramp & McDougall, 2018):

Runner is to marathon as

a. envoy is to embassy

b. martyr is to massacre

c. oarsman is to regatta

d. referee is to tournament

e. horse is to stable

Who might have been disadvantaged? (Specify reference and focal group)

Cramp, A., & McDougall, J. (2018). Doing theory on education: Using popular culture to

explore key debates. Routledge.




Examples of DIF items

Tipping example (Martiniello & Wolf, 2012)

Of the following, which is the closest approximation of a 15 percent tip on a

restaurant check of $24.99?

a. $2.50

b. $3.00

c. $3.75

d. $4.50

Example of spelling test (orally administered):

Spell the word "girder"

Martiniello, M., & Wolf, M. (2012). Exploring ELLs’ understanding of word problems in

mathematics assessments: The role of text complexity and student background knowledge.

In S. Celedon-Pattichis & N. Ramirez (Eds.), Beyond good teaching: Strategies that are

imperative for English language learners in the mathematics classroom. Reston, VA: NCTM.




Examples of DIF items - Medical admission tests

Education ”Growth of long bones”

A) occurs in growth cartilage

B) is hormone-controlled

C) usually ends at about 10-13 years of age, in boys earlier than in girls

D) usually ends around 16-19 years of age, in girls earlier than in boys

(Martinkova et al. (2019): more often correctly answered by males)

”Deficiency of vitamin D in childhood could cause”

A) rickets

B) scurvy

C) dwarfism

D) mental retardation

(Drabinova and Martinkova (2017): more often correctly answered by females)

Drabinova, A., & Martinkova, P. (2017). Detection of differential item functioning with

nonlinear regression: A non-IRT approach accounting for guessing. Journal of Educational

Measurement, 54(4), 498–517.




Examples of DIF items - health-related outcome measures

Pain ”How often did pain prevent you from walking more than 1 mile?”

(Amtmann et al. (2010): reported more often by older patients)

”How often did pain prevent you from standing for more than 1 hour?”

(Amtmann et al. (2010): reported more often by older patients)

Depression ”I felt like crying”

(Pilkonis et al. (2011): endorsed more often by females)

Anger ”I was angry when people were unfair”

(endorsed more often by older patients)

“I was angry when I did something stupid”

(Pilkonis et al. (2011): endorsed more often by older patients)

Amtmann, D., Cook, K. F., Jensen, M. P., Chen, W.-H., Choi, S., Revicki, D., ...,

Callahan, L., et al. (2010). Development of a PROMIS item bank to measure pain

interference. Pain, 150(1), 173–182.

Pilkonis, P. A., Choi, S. W., Reise, S. P., Stover, A. M., Riley, W. T., Cella, D., &

PROMIS Cooperative Group (2011). Item banks for measuring emotional distress from the

Patient-Reported Outcomes Measurement Information System (PROMIS®): Depression,

anxiety, and anger. Assessment, 18(3), 263–283.




DIF vs. difference in total scores

Comparing total scores only can lead to incorrect conclusions about item/test

fairness (Martinkova et al., 2017)

• Case study 1: Homeostasis Concept Inventory

• Significant difference (Fig A), but no DIF item

• Case study 2: Simulated dataset based on GMAT

• Identical distributions of total score (Fig B), DIF items present

Martinkova, P., Drabinova, A., Liaw, Y. L., Sanders, E. A., McFarland, J. L., & Price, R.

M. (2017). Checking equity: Why differential item functioning analysis should be a routine

part of developing conceptual assessments. CBE—Life Sciences Education, 16(2), rm2.








DIF vs. difference in item scores

Comparing item scores can lead to incorrect conclusions about item fairness.

In case of different distributions of total scores:

• A difference in item scores (in the same direction as the difference in the

total scores) may be expected

• No difference in item scores may actually be a sign of an unfair item.

Thus, both the item score and the latent ability (total score) need to be taken into

account to assess item fairness.




DIF as multidimensionality problem

DIF as multidimensionality problem:

• Existence of another dimension tested on the particular item besides the

primary latent variable

Exercise:

What is the primary and the secondary latent variable tested in previously

described examples?

• Regatta example

• Spelling example (girder)

• Deficiency of vitamin D in childhood

• Tipping example




DIF and item fairness

DIF items are potentially unfair. However, DIF items are not necessarily

a threat to fairness and validity. Content experts must decide on item fairness

based on the classification of the secondary latent trait causing DIF:

• Unrelated to content being tested

• DIF item is considered unfair, item should be reworded/removed

• Related to content being tested

• DIF item is not considered unfair, item can inform teaching

Exercise:

Classify secondary latent trait in following items:

• Regatta example

• Spelling example (girder)

• Deficiency of vitamin D in childhood

• Tipping example




DIF detection methods in binary items

• Based on total score

• Delta plot

• Mantel-Haenszel test

• Logistic regression

• Generalized logistic regression

• Based on latent ability and Item Response Theory models

• Lord’s (Wald) test

• Raju’s area test

• Likelihood ratio test (LRT)




Delta plot - overview

• Angoff & Ford (1973)

• compares proportions of correct answers

• displays non-linear transformation of proportions (using quantiles)

• item detection threshold:

• fixed to 1.5

• normal approximation (Magis & Facon, 2012).




Delta plot - motivation

Under the assumption of parameter invariance (no DIF), the transformed proportions of

correct responses in the two groups are approximately linearly related

Delta plot provides comparison of transformed proportions of correct

responses per item and by group of respondents (Angoff & Ford, 1973)

Angoff, W. H., & Ford, S. F. (1973). Item-race interaction on a test of scholastic aptitude.

Journal of Educational Measurement, 10(2), 95-105.




Delta plot - introducing delta scores

In more detail:

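• For each item $j = 1, \dots, J$ and each group (reference R and focal F), the proportion

of correct answers is calculated:

$$\pi_{jR} = \frac{1}{I_R} \sum_{i=1}^{I_R} Y_{ij} \quad \text{and} \quad \pi_{jF} = \frac{1}{I_F} \sum_{i=1}^{I_F} Y_{ij}$$

• Transformation into standard normal deviates:

$$z_{jR} = q_Z(1 - \pi_{jR}) \quad \text{and} \quad z_{jF} = q_Z(1 - \pi_{jF}),$$

where $q_Z$ is the quantile function of the standard normal distribution

• Transformation into delta scores:

$$\Delta_{jR} = 4 \cdot z_{jR} + 13 \quad \text{and} \quad \Delta_{jF} = 4 \cdot z_{jF} + 13$$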
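The three steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the implementation used by any package; the function name `delta_score` and the response vectors are hypothetical.

```python
from statistics import NormalDist


def delta_score(responses):
    """Delta score of one item in one group, computed from its 0/1 responses."""
    p = sum(responses) / len(responses)   # proportion of correct answers, pi_j
    z = NormalDist().inv_cdf(1 - p)       # standard normal deviate, z_j
    return 4 * z + 13                     # delta score


# Hypothetical responses to one item: 8 of 10 correct vs. 5 of 10 correct.
reference = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
focal = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(round(delta_score(reference), 2), round(delta_score(focal), 2))
```

Harder items (lower proportion correct) receive higher delta scores. Note that `inv_cdf` is undefined for proportions of exactly 0 or 1, so items answered correctly by everybody or by nobody in a group would need separate handling.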






Delta plot - DIF detection

• Pairs of delta scores (∆jR ,∆jF ) can be displayed on a scatter plot

(so called Delta plot or Diagonal plot)

• Delta scores of the reference group on the X axis and of the focal group on

the Y axis

• Delta scores form an ellipsoid with major axis

$$\Delta_{jF} = a + b \Delta_{jR},$$

where

$$b = \frac{s_F^2 - s_R^2 + \sqrt{(s_F^2 - s_R^2)^2 + 4 s_{RF}^2}}{2 s_{RF}}, \qquad a = m_F - b \cdot m_R,$$

$s_F^2$ and $s_R^2$ are the sample variances of the delta scores, $s_{RF}$ is their sample

covariance, and $m_R$ and $m_F$ are their sample means






Delta plot - DIF detection

• DIF detection is based on the distance of the delta scores from the major axis:

$$D_j = \frac{b \Delta_{jR} + a - \Delta_{jF}}{\sqrt{b^2 + 1}}$$

• Detection threshold:

• Fixed: an item is flagged as DIF if $|D_j| > 1.5$ (Angoff & Ford, 1973)

• Based on normal approximation (Magis & Facon, 2014).


Angoff, W. H., & Ford, S. F. (1973). Item-race interaction on a test of scholastic aptitude.

Journal of Educational Measurement, 10(2), 95-105.

Magis, D., & Facon, B. (2014). deltaPlotR: An R package for differential item functioning

analysis with Angoff’s Delta plot. Journal of Statistical Software, 59(1), 1-19.
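The major axis and the distances from it can be computed directly from paired delta scores. The sketch below follows the formulas for a, b and D_j with the fixed threshold of 1.5; the function name and the delta scores are made up for illustration, and the normal-approximation threshold is not implemented.

```python
from math import sqrt
from statistics import mean, variance


def delta_plot_distances(delta_ref, delta_foc):
    """Return intercept a, slope b and signed distances D_j from the major axis."""
    m_r, m_f = mean(delta_ref), mean(delta_foc)
    s2_r, s2_f = variance(delta_ref), variance(delta_foc)  # sample variances
    n = len(delta_ref)
    # Sample covariance of the paired delta scores (assumed to be nonzero).
    s_rf = sum((r - m_r) * (f - m_f) for r, f in zip(delta_ref, delta_foc)) / (n - 1)
    b = (s2_f - s2_r + sqrt((s2_f - s2_r) ** 2 + 4 * s_rf ** 2)) / (2 * s_rf)
    a = m_f - b * m_r
    distances = [(b * r + a - f) / sqrt(b ** 2 + 1)
                 for r, f in zip(delta_ref, delta_foc)]
    return a, b, distances


# Made-up delta scores for six items; the last item is much harder for the focal group.
delta_ref = [9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
delta_foc = [9.2, 10.1, 10.9, 12.1, 13.0, 17.5]
a, b, distances = delta_plot_distances(delta_ref, delta_foc)
flagged = [j + 1 for j, d in enumerate(distances) if abs(d) > 1.5]
print(f"a = {a:.2f}, b = {b:.2f}, flagged items: {flagged}")
```

Because the major axis is estimated from all items, a single strongly deviating item also pulls the axis towards itself, which is one reason why purification or a data-driven threshold may be preferred in practice.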




Delta plot in ShinyItemAnalysis

• ShinyItemAnalysis offers functionality of deltaPlotR package

• Provides delta plot in ggplot2




Mantel-Haenszel test

MH is an extension of the χ²-test of independence on contingency tables

• Contingency tables summarize item responses by group membership for

given item

• MH test combines all levels of total scores

In more detail:

For each level of total score $k = 0, \dots, K$, a contingency table is created:

                   Y = 1   Y = 0
Reference group    A_k     B_k
Focal group        C_k     D_k

Odds ratio for total score k:

$$\alpha_k = \frac{A_k / B_k}{C_k / D_k} = \frac{A_k D_k}{B_k C_k}$$




Mantel-Haenszel test - introducing αMH

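In case of independence of item score and group membership for total score k:

$$\alpha_k = \frac{A_k D_k}{B_k C_k} \approx 1$$

MH combines all levels of total score:

$$\alpha_{MH} = \frac{\sum_{k=0}^{K} A_k D_k / N_k}{\sum_{k=0}^{K} B_k C_k / N_k}$$

= weighted average of the odds ratios across all levels of the total score, where

$N_k = A_k + B_k + C_k + D_k$

$$\alpha_{MH} \begin{cases} \approx 1, & \text{no DIF} \\ > 1, & \text{DIF favoring reference group} \\ < 1, & \text{DIF favoring focal group} \end{cases}$$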
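The common odds ratio can be computed from the per-level contingency tables in a few lines. The sketch below assumes each table is given as a tuple (A_k, B_k, C_k, D_k); the function name and the counts are hypothetical.

```python
def mantel_haenszel_odds_ratio(tables):
    """Common odds ratio alpha_MH from (A_k, B_k, C_k, D_k) tuples, one per score level."""
    numerator = denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:          # skip total-score levels with no respondents
            continue
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical counts for two total-score levels.
tables = [(10, 10, 5, 15), (20, 5, 12, 8)]
print(round(mantel_haenszel_odds_ratio(tables), 3))
```

A value above 1 would point to DIF favoring the reference group, in line with the interpretation above.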




Mantel-Haenszel test - introducing ∆MH

αMH is often standardized through log transformation, centering the value

around 0:

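$$\Delta_{MH} = -2.35 \cdot \log(\alpha_{MH})$$

However, then the interpretation is different!

$$\Delta_{MH} \begin{cases} \approx 0, & \text{no DIF} \\ > 0, & \text{DIF favoring focal group} \\ < 0, & \text{DIF favoring reference group} \end{cases}$$

$\Delta_{MH}$ can be used to determine DIF effect size (Holland & Thayer, 1985):

$$|\Delta_{MH}| \begin{cases} < 1, & \text{Category A = negligible effect} \\ \in [1, 1.5), & \text{Category B = moderate effect} \\ \geq 1.5, & \text{Category C = large effect} \end{cases}$$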

Holland, P. W., & Thayer, D. T. (1985). An alternate definition of the ETS delta scale of

item difficulty. ETS Research Report Series, 1985(2), i-10.
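The transformation to the delta scale and the effect-size categories can be sketched as follows. This is an illustration using only the magnitude thresholds listed above; the function names are hypothetical, and operational classification rules may additionally involve significance tests.

```python
from math import log


def delta_mh(alpha_mh):
    """Transform the MH common odds ratio to the delta scale."""
    return -2.35 * log(alpha_mh)   # natural logarithm


def ets_category(delta):
    """Classify |Delta_MH| as A (negligible), B (moderate) or C (large)."""
    size = abs(delta)
    if size < 1:
        return "A"
    if size < 1.5:
        return "B"
    return "C"


for alpha in (1.0, 0.6, 3.0):   # hypothetical odds ratios
    d = delta_mh(alpha)
    print(alpha, round(d, 2), ets_category(d))
```

The sign flips with the transformation: an odds ratio above 1 (favoring the reference group) maps to a negative delta.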




Test statistic

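Contingency table for a given item and total score k:

                   Y = 1               Y = 0               Total
Reference group    A_k                 B_k                 N_Rk = A_k + B_k
Focal group        C_k                 D_k                 N_Fk = C_k + D_k
Total              N_1k = A_k + C_k    N_0k = B_k + D_k    N_k

Test statistic for testing whether $\alpha_{MH} = 1$:

$$MH = \frac{\left[\left|\sum_{k=0}^{K} \left(A_k - \frac{N_{Rk} N_{1k}}{N_k}\right)\right| - 0.5\right]^2}{\sum_{k=0}^{K} \frac{N_{Rk} N_{Fk} N_{1k} N_{0k}}{N_k^2 (N_k - 1)}} \approx \chi^2_1$$

Under the null hypothesis, MH approximately follows the χ² distribution with 1 degree of freedom.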
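A minimal sketch of the test statistic, assuming the usual form with the absolute value of the summed deviations and the 0.5 continuity correction. Tables are passed as tuples (A_k, B_k, C_k, D_k); the function name and counts are hypothetical. The p-value uses the identity P(χ²₁ > x) = erfc(√(x/2)).

```python
from math import erfc, sqrt


def mantel_haenszel_test(tables):
    """Return the MH chi-square statistic (1 df) and its p-value."""
    sum_a = sum_expected = sum_variance = 0.0
    for a, b, c, d in tables:
        n_r, n_f = a + b, c + d          # group sizes at this score level
        n_1, n_0 = a + c, b + d          # numbers of correct / incorrect answers
        n = n_r + n_f
        if n < 2:                        # variance term undefined, skip the level
            continue
        sum_a += a
        sum_expected += n_r * n_1 / n
        sum_variance += n_r * n_f * n_1 * n_0 / (n ** 2 * (n - 1))
    statistic = (abs(sum_a - sum_expected) - 0.5) ** 2 / sum_variance
    p_value = erfc(sqrt(statistic / 2))  # upper tail of chi-square with 1 df
    return statistic, p_value


# Hypothetical counts for two total-score levels.
stat, p = mantel_haenszel_test([(10, 10, 5, 15), (20, 5, 12, 8)])
print(f"MH = {stat:.3f}, p = {p:.3f}")
```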




Mantel-Haenszel - pros and cons

Pros:

+ Simple method

+ Easily implemented

+ Detects DIF in small samples

Cons:

− Does not detect non-uniform DIF

ShinyItemAnalysis offers step-by-step calculation of MH statistics.




Logistic regression for DIF detection

• LR models probability of correct answer on item j by respondent i based on

their total score and group membership (Swaminathan & Rogers, 1990)

• Introducing effect of total score, group membership, and their interaction

• Nonzero effect of group membership indicates uniform DIF

• Nonzero effect of interaction indicates non-uniform DIF

• DIF detection based on Wald’s test or likelihood ratio test of the submodel

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using

logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370.





In more detail:

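$$P(Y_{ij} = 1 \mid X_i, G_i) = \begin{cases} \dfrac{e^{b_{0j} + b_{1j} X_i}}{1 + e^{b_{0j} + b_{1j} X_i}} & \text{(no DIF)} \\[2ex] \dfrac{e^{b_{0j} + b_{1j} X_i + b_{2j} G_i}}{1 + e^{b_{0j} + b_{1j} X_i + b_{2j} G_i}} & \text{(uniform DIF)} \\[2ex] \dfrac{e^{b_{0j} + b_{1j} X_i + b_{2j} G_i + b_{3j} X_i : G_i}}{1 + e^{b_{0j} + b_{1j} X_i + b_{2j} G_i + b_{3j} X_i : G_i}} & \text{(non-uniform DIF)} \end{cases}$$

where $X_i$ is the total score, $G_i$ the group membership ($G_i = 0$ for the reference group,

$G_i = 1$ for the focal group), and $X_i : G_i$ their interaction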
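The nested models differ only in which group terms are present, which a small function makes explicit. This is an illustrative sketch; the function name and the coefficient values are hypothetical, not estimates from any data set.

```python
from math import exp


def prob_correct(x, g, b0, b1, b2=0.0, b3=0.0):
    """P(Y = 1 | X = x, G = g) under the logistic regression DIF model.

    b2 = b3 = 0 gives the no-DIF model, b3 = 0 the uniform-DIF model.
    """
    eta = b0 + b1 * x + b2 * g + b3 * x * g   # linear predictor (log-odds)
    return 1 / (1 + exp(-eta))                # equals e^eta / (1 + e^eta)


# Hypothetical coefficients for one item, total score of 10 points.
for label, b2, b3 in [("no DIF", 0.0, 0.0), ("uniform", -0.5, 0.0), ("non-uniform", -0.5, 0.1)]:
    p_ref = prob_correct(10, 0, b0=-2.0, b1=0.2, b2=b2, b3=b3)
    p_foc = prob_correct(10, 1, b0=-2.0, b1=0.2, b2=b2, b3=b3)
    print(f"{label}: reference {p_ref:.3f}, focal {p_foc:.3f}")
```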




Interpretation of parameters

Intercept $b_{0j}$

- Determines the probability of answering item j correctly for a reference group ($G_i = 0$)

respondent with zero total score ($X_i = 0$)

- $P(Y_{ij} = 1 \mid X_i = 0, G_i = 0) = \frac{e^{b_{0j}}}{1 + e^{b_{0j}}}$

Effect of total score $b_{1j}$

- Gives the log odds ratio for answering item j correctly comparing two respondents

from the same group (the reference group if an interaction is present) differing by one point in total score

Effect of group membership $b_{2j}$

- Gives the log odds ratio for answering item j correctly comparing two respondents

from the focal and the reference group with the same total score (when $b_{3j} = 0$)

- $b_{0j} + b_{2j}$ is the intercept (baseline log-odds) for the focal group

Effect of interaction $b_{3j}$

- Indicates how the effect of total score differs between the focal and the reference group

- $b_{1j} + b_{3j}$ is the effect of total score for the focal group



DIF effect size

Determined by Nagelkerke’s R² (Nagelkerke, 1991), compared between the models with and without the tested group terms

• Proportional reduction in ”error variance”

• Different cut-off values:

Source                   Cat. A (negligible)   Cat. B (moderate)     Cat. C (large)
Zumbo & Thomas (1997)    R² < 0.13             0.13 ≤ R² < 0.26      R² ≥ 0.26
Jodoin & Gierl (2001)    R² < 0.035            0.035 ≤ R² < 0.07     R² ≥ 0.07

Nagelkerke, N. J. (1991). A note on a general definition of the coefficient of

determination. Biometrika, 78(3), 691-692.

Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based

approach for studying DIF. Working paper.

Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an

effect size measure with the logistic regression procedure for DIF detection. Applied

Measurement in Education, 14(4), 329-349.
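The effect-size classification can be sketched as below. The sketch assumes that the quantity compared with the cut-offs is the change in Nagelkerke's R² between the models with and without the group terms, and that the log-likelihoods of the fitted models are already available; the function names and log-likelihood values are hypothetical.

```python
from math import exp


def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke's R^2 from log-likelihoods of the intercept-only and the fitted model."""
    cox_snell = 1 - exp(2 * (loglik_null - loglik_model) / n)
    max_cox_snell = 1 - exp(2 * loglik_null / n)   # upper bound of Cox-Snell R^2
    return cox_snell / max_cox_snell


def effect_size_category(delta_r2, scheme="ZT"):
    """Classify a change in R^2 with Zumbo & Thomas ('ZT') or Jodoin & Gierl ('JG') cut-offs."""
    cutoffs = {"ZT": (0.13, 0.26), "JG": (0.035, 0.07)}
    low, high = cutoffs[scheme]
    if delta_r2 < low:
        return "A"
    if delta_r2 < high:
        return "B"
    return "C"


# Hypothetical log-likelihoods for one item and n = 1000 respondents.
n = 1000
r2_no_dif = nagelkerke_r2(-680.0, -600.0, n)   # total score only
r2_dif = nagelkerke_r2(-680.0, -585.0, n)      # total score, group, interaction
delta_r2 = r2_dif - r2_no_dif
print(round(delta_r2, 4), effect_size_category(delta_r2, "ZT"), effect_size_category(delta_r2, "JG"))
```

The two schemes can disagree for the same item, since the Jodoin & Gierl cut-offs are considerably stricter.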




Statistical significance

The statistical significance is determined by test of submodel

• Likelihood-ratio test

• Wald’s test

Testing:

Any DIF: $H_0: b_2 = 0 \ \&\ b_3 = 0$ vs. $H_1: b_2 \neq 0$ or $b_3 \neq 0$

Uniform DIF: $H_0: b_2 = 0 \mid b_3 = 0$ vs. $H_1: b_2 \neq 0 \mid b_3 = 0$

Non-uniform DIF: $H_0: b_3 = 0$ vs. $H_1: b_3 \neq 0$
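Given the maximized log-likelihoods of the null and the alternative model, the likelihood-ratio test reduces to a short calculation. The sketch below covers the two cases needed here: 2 degrees of freedom for any DIF, and 1 degree of freedom for uniform or non-uniform DIF. The function name and the log-likelihood values are hypothetical; fitting the models is not shown.

```python
from math import erfc, exp, sqrt


def likelihood_ratio_test(loglik_null, loglik_alt, df):
    """Return the LR statistic and its chi-square p-value for df = 1 or df = 2."""
    statistic = max(2 * (loglik_alt - loglik_null), 0.0)
    if df == 1:
        p_value = erfc(sqrt(statistic / 2))   # upper tail of chi-square, 1 df
    elif df == 2:
        p_value = exp(-statistic / 2)         # upper tail of chi-square, 2 df
    else:
        raise ValueError("only df = 1 or df = 2 are supported in this sketch")
    return statistic, p_value


# Hypothetical log-likelihoods: no-DIF model vs. model with group and interaction.
stat, p = likelihood_ratio_test(-600.0, -597.0, df=2)
print(f"LR = {stat:.2f}, p = {p:.4f}")
```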




Reparametrization

$$P(Y_{ij} = 1 \mid Z_i, G_i) = \frac{e^{(a_j + a_{jDIF} G_i)(Z_i - b_j - b_{jDIF} G_i)}}{1 + e^{(a_j + a_{jDIF} G_i)(Z_i - b_j - b_{jDIF} G_i)}} = \frac{e^{a_{jG_i}(Z_i - b_{jG_i})}}{1 + e^{a_{jG_i}(Z_i - b_{jG_i})}}$$

• $Z_i$ standardized total score

• $a_{jG_i}$ discrimination of item j for group $G_i$

• $b_{jG_i}$ difficulty of item j for group $G_i$

Interpretation:

$b_{j0} = b_j$: difficulty of item j for the reference group

$b_{j1} = b_{j0} + b_{jDIF}$: difficulty of item j for the focal group

$a_{j0} = a_j$: discrimination of item j for the reference group

$a_{j1} = a_{j0} + a_{jDIF}$: discrimination of item j for the focal group
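The two parametrizations describe the same curves, so one can be converted into the other by matching slopes and intercepts within each group. The sketch below assumes the regression was fitted on the standardized total score Z with coefficients named beta0 to beta3 (intercept, total score, group, interaction); these names and the numeric values are hypothetical.

```python
from math import exp


def to_irt_parametrization(beta0, beta1, beta2, beta3):
    """Convert intercept/slope coefficients into discrimination/difficulty form."""
    a_ref = beta1                                  # discrimination, reference group
    b_ref = -beta0 / beta1                         # difficulty, reference group
    a_foc = beta1 + beta3                          # discrimination, focal group
    b_foc = -(beta0 + beta2) / (beta1 + beta3)     # difficulty, focal group
    return {"a": a_ref, "b": b_ref, "aDIF": a_foc - a_ref, "bDIF": b_foc - b_ref}


def prob_irt_form(z, g, a, b, aDIF, bDIF):
    """P(Y = 1 | Z = z, G = g) in the reparametrized form."""
    return 1 / (1 + exp(-(a + aDIF * g) * (z - b - bDIF * g)))


params = to_irt_parametrization(beta0=-0.5, beta1=1.2, beta2=-0.3, beta3=0.4)
print({name: round(value, 3) for name, value in params.items()})
```

A nonzero bDIF then corresponds to a shift in difficulty between groups, and a nonzero aDIF to a difference in discrimination.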




Logistic regression - pros and cons

Pros:

+ Detects DIF in medium-size samples

+ Detects both uniform and non-uniform DIF

Cons:

− Does not account for possible guessing

− Does not account for possible inattention




Generalized logistic regression for DIF detection

= Extension of logistic regression model for DIF detection accounting for

guessing and inattention (Drabinova & Martinkova, 2017)

In more detail:

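$$P(Y_{ij} = 1 \mid X_i, G_i) = c_{jG_i} + (d_{jG_i} - c_{jG_i}) \frac{e^{a_{jG_i}(X_i - b_{jG_i})}}{1 + e^{a_{jG_i}(X_i - b_{jG_i})}}$$

where $c_{jG_i}$ is the lower asymptote (guessing) and $d_{jG_i}$ the upper asymptote (related to inattention) of item j in group $G_i$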

Drabinova, A., & Martinkova, P. (2017). Detection of Differential Item Functioning with

Nonlinear Regression: A Non-IRT Approach Accounting for Guessing. Journal of Educational

Measurement, 54(4), 498-517.
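The role of the two asymptotes is easy to see by evaluating the curve directly. The following sketch is only an illustration of the model formula; the function name and the group-specific parameter values are hypothetical, and estimation of the parameters is not shown.

```python
from math import exp


def prob_4pl(x, a, b, c=0.0, d=1.0):
    """P(Y = 1 | X = x) with lower asymptote c and upper asymptote d.

    c = 0 and d = 1 give the ordinary logistic regression curve.
    """
    return c + (d - c) / (1 + exp(-a * (x - b)))


# Hypothetical group-specific parameters for one item (x is a standardized score).
params = {
    "reference": {"a": 1.2, "b": 0.0, "c": 0.20, "d": 1.00},
    "focal": {"a": 1.2, "b": 0.5, "c": 0.20, "d": 0.95},
}
for group, p in params.items():
    print(group, [round(prob_4pl(x, **p), 3) for x in (-2, 0, 2)])
```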




Generalized logistic regression for DIF detection

• Also called 4PL non-IRT model

• Offers wide range of models which can be obtained by

• fixing parameters to selected value

(e.g. d = 1 to get 3PL model)

• fixing parameters between groups

(e.g. common guessing and inattention)





R software


Logistic regression

• R package difR (Magis, Beland, Tuerlinckx, & De Boeck, 2010)

• difLogistic() function

Generalized logistic regression

• R package difNLR (Hladka & Martinkova, 2019)

• difNLR() function

ShinyItemAnalysis offers both methods.

Hladka A. & Martinkova P. (2019). difNLR: DIF and DDF detection by non-linear

regression models. R package version 1.3.0.

Magis, D., Beland, S., Tuerlinckx, F. & De Boeck, P. (2010). A general framework and an

R package for the detection of dichotomous differential item functioning. Behavior Research

Methods, 42(3), 847-862.




IRT-based methods

Methods based on IRT models:

• Lord’s (Wald’s) test: Difference between parameters

• Raju’s test: Area between the curves (difference or absolute difference)

• Likelihood ratio test




Lord’s (Wald) test

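DIF detection based on testing the difference in item parameters between the reference

and the focal group (Lord, 1980). For the estimated difficulty of item j:

$$\frac{(b_{jR} - b_{jF})^2}{\mathrm{var}(b_{jR}) + \mathrm{var}(b_{jF})} \xrightarrow{D} \chi^2_1$$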

Lord, F. M. (1980). Application of item response theory to practical testing problems.

Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
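For a single difficulty parameter, the test statistic needs only the two estimates and their variances. The sketch below assumes the estimates from both groups are already on a common scale; the function name and the numeric values are hypothetical, and the multi-parameter version of the test is not covered.

```python
from math import erfc, sqrt


def lord_test_difficulty(b_ref, b_foc, var_ref, var_foc):
    """Wald-type statistic for equal difficulty in both groups, with 1-df p-value."""
    statistic = (b_ref - b_foc) ** 2 / (var_ref + var_foc)
    p_value = erfc(sqrt(statistic / 2))   # upper tail of chi-square with 1 df
    return statistic, p_value


# Hypothetical difficulty estimates and their estimated variances.
stat, p = lord_test_difficulty(b_ref=0.5, b_foc=0.0, var_ref=0.03, var_foc=0.02)
print(f"statistic = {stat:.2f}, p = {p:.4f}")
```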



Raju’s test

Method based on area between two characteristic curves

(Raju, 1988, 1990)

Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika,

53(4), 495-502.

Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas

between two item response functions. Applied Psychological Measurement, 14(2), 197-207.




Raju’s test - unsigned area between curves

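Unsigned area (UA) between two item characteristic curves:

$$UA = \int_{-\infty}^{\infty} \left| P(Y = 1 \mid \theta, G = R) - P(Y = 1 \mid \theta, G = F) \right| \, d\theta$$

$$= \begin{cases} |b_R - b_F| & \text{1PL} \\[1ex] \left| \dfrac{2(a_R - a_F)}{a_R a_F} \log\left(1 + \exp\left(\dfrac{a_R a_F (b_R - b_F)}{a_R - a_F}\right)\right) - (b_R - b_F) \right| & \text{2PL} \\[2ex] (1 - c) \left| \dfrac{2(a_R - a_F)}{a_R a_F} \log\left(1 + \exp\left(\dfrac{a_R a_F (b_R - b_F)}{a_R - a_F}\right)\right) - (b_R - b_F) \right| & \text{3PL} \end{cases}$$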
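The closed-form expressions can be checked against a direct numerical integration of the absolute difference between the two curves. The sketch below assumes logistic item characteristic curves without a scaling constant and, for the 3PL case, a guessing parameter c common to both groups; all function names and parameter values are hypothetical.

```python
from math import exp, log1p


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return x if x > 30 else log1p(exp(x))


def raju_unsigned_area(a_r, b_r, a_f, b_f, c=0.0):
    """Closed-form unsigned area between the reference and focal curves."""
    if a_r == a_f:   # equal discriminations: reduces to the 1PL expression
        return (1 - c) * abs(b_r - b_f)
    k = a_r * a_f / (a_r - a_f)
    return (1 - c) * abs(2 / k * softplus(k * (b_r - b_f)) - (b_r - b_f))


def numeric_unsigned_area(a_r, b_r, a_f, b_f, c=0.0, lo=-40.0, hi=40.0, steps=8000):
    """Trapezoidal approximation of the integral of |P_R(theta) - P_F(theta)|."""
    def icc(theta, a, b):
        return c + (1 - c) / (1 + exp(-a * (theta - b)))

    h = (hi - lo) / steps
    gaps = [abs(icc(lo + i * h, a_r, b_r) - icc(lo + i * h, a_f, b_f))
            for i in range(steps + 1)]
    return h * (sum(gaps) - 0.5 * (gaps[0] + gaps[-1]))


# Hypothetical item parameters for the two groups.
args = dict(a_r=1.5, b_r=0.3, a_f=0.8, b_f=-0.2, c=0.2)
print(round(raju_unsigned_area(**args), 4), round(numeric_unsigned_area(**args), 4))
```

The two printed values should agree up to the discretization error of the numerical integration.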




Likelihood ratio test

DIF detection based on likelihood ratio test of submodel

(Thissen, Steinberg, & Wainer, 1988, 1993)

Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study

of group difference in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp.

147-169). Lawrence Erlbaum Associates, Inc.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning

using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.),

Differential item functioning (pp. 67-113). Lawrence Erlbaum Associates, Inc.




Pros and cons

Pros:

+ Applicable for 1PL-4PL IRT models

+ More precise estimate of latent trait

Cons:

− Computationally demanding

− Needs large sample size



Conclusion

DIF/DDF analysis should be used routinely in test development

• to check for fairness with respect to groups

• to inform teaching

DIF detection methods in binary items

• Delta-Plot

• Mantel-Haenszel test

• Logistic regression

• Generalized logistic regression

• IRT-based methods: Lord’s (Wald) test, Raju’s test, LRT


Thank you for your attention!

www.cs.cas.cz/martinkova


References [1]

Amtmann, D., Cook, K. F., Jensen, M. P., Chen, W.-H., Choi, S., Revicki, D., . . . others

(2010). Development of a PROMIS item bank to measure pain interference. Pain,

150(1), 173–182.

Angoff, W. H., & Ford, S. F. (1973). Item-race interaction on a test of scholastic aptitude.

Journal of Educational Measurement, 10(2), 95–105.

Cramp, A., & McDougall, J. (2018). Doing theory on education: Using popular culture to

explore key debates. Routledge.

Drabinova, A., & Martinkova, P. (2017). Detection of differential item functioning with

nonlinear regression: A non-IRT approach accounting for guessing. Journal of

Educational Measurement, 54(4), 498–517.

Hladka, A., & Martinkova, P. (2019). difNLR: DIF and DDF detection by non-linear

regression models. [Computer software manual]. Retrieved from

https://CRAN.R-project.org/package=difNLR (R package version 1.3.0)

Holland, P. W., & Thayer, D. T. (1985). An alternate definition of the ETS delta scale of

item difficulty. ETS Research Report Series, 1985(2), i–10.

Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect

size measure with the logistic regression procedure for DIF detection. Applied

Measurement in Education, 14(4), 329–349.

Lord, F. M. (1980). Application of item response theory to practical testing problems.

Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

References [2]

Magis, D., Beland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an

R package for the detection of dichotomous differential item functioning. Behavior

Research Methods, 42(3), 847–862.

Magis, D., & Facon, B. (2014). deltaPlotR: An R package for differential item functioning

analysis with Angoff’s Delta plot. Journal of Statistical Software, 59(1), 1–19.

Martiniello, M., & Wolf, M. (2012). Exploring ELLs’ understanding of word problems in

mathematics assessments: The role of text complexity and student background

knowledge. In S. Celedon-Pattichis & N. Ramirez (Eds.), Beyond good teaching:

Strategies that are imperative for English language learners in the mathematics

classroom. Reston, VA: NCTM.

Martinkova, P., Drabinova, A., Liaw, Y.-L., Sanders, E. A., McFarland, J. L., & Price, R. M.

(2017). Checking equity: Why differential item functioning analysis should be a

routine part of developing conceptual assessments. CBE—Life Sciences Education,

16(2), rm2.

Martinkova, P., Hladka, A., Leupen, S., Stepanek, L., & Kralickova, M. (2019). Towards

better admission tests: Routinizing detailed validation of entrance exams in medical

education.

(Submitted)

Nagelkerke, N. J. (1991). A note on a general definition of the coefficient of

determination. Biometrika, 78(3), 691–692.

References [3]

Pilkonis, P. A., Choi, S. W., Reise, S. P., Stover, A. M., Riley, W. T., Cella, D., &

PROMIS Cooperative Group (2011). Item banks for measuring emotional distress from the

Patient-Reported Outcomes Measurement Information System (PROMIS®): Depression,

anxiety, and anger. Assessment, 18(3), 263–283.

Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4),

495–502.

Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas

between two item response functions. Applied Psychological Measurement, 14(2),

197–207.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using

logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370.

Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of

group difference in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity

(pp. 147–169). Lawrence Erlbaum Associates, Inc.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning

using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.),

Differential item functioning (pp. 67–113). Lawrence Erlbaum Associates, Inc.

Zumbo, B., & Thomas, D. (1997). A measure of effect size for a model-based approach for

studying DIF.

(Working paper)
