
Journal of Psychoeducational Assessment, 1–11. DOI: 10.1177/0734282914541337. © 2014 SAGE Publications.

Net and Global Differential Item Functioning in PISA Polytomously Scored Science Items: Application of the Differential Step Functioning Framework

Mutasem Akour1, Saed Sabah1, and Hind Hammouri1

Abstract

The purpose of this study was to apply two types of Differential Item Functioning (DIF), net and global DIF, as well as the framework of Differential Step Functioning (DSF), to real testing data to investigate measurement invariance related to test language. Data from the Program for International Student Assessment (PISA)–2006 polytomously scored science items for four countries with different test languages were used, where French and English represented the reference languages. The findings showed that many items exhibited both types of DIF, although, in most cases, the results were inconsistent for the two source languages. In addition, net and global DIF tests did not always yield the same results depending on the DSF effect pattern. Furthermore, the DSF analysis provided valuable information over and above that provided by the net DIF analysis concerning the nature and the location of the DIF effect.

Keywords

differential step functioning, global DIF, net DIF, PISA 2006, polytomous science items

In polytomous items (e.g., performances that cannot be scored as simply correct or incorrect), differential item functioning (DIF) is present when individuals belonging to two different groups, but having the same level of ability, have differing probabilities of obtaining each score level of the polytomous response variable (Potenza & Dorans, 1995). The increasing use of polytomous item formats has led to the development of various methods for assessing DIF in polytomous items (Penfield & Camilli, 2007). Penfield, Alvarez, and Lee (2009) introduced two conceptions of DIF in polytomous items, net DIF and global DIF. This distinction between net and global DIF is unique to polytomous items, because a polytomous item contains three or more score levels and all score levels are used in the estimation of ability.

1Hashemite University, Zarqa, Jordan

Corresponding Author:
Mutasem Akour, Department of Educational Psychology, Hashemite University, P.O. Box 150459, Zarqa 13115, Jordan. Email: [email protected]


Net and Global DIF

Global DIF statistics are based on the unsigned conditional between-group difference across all score levels of the polytomous item (Penfield, Alvarez, & Lee, 2009). Examples of global polytomous DIF statistics include the generalized Mantel–Haenszel statistic (Somes, 1986), item response theory (IRT) likelihood ratio tests (Kim & Cohen, 1998), polytomous logistic regression approaches (French & Miller, 1996), and the simultaneous step-level (SSL) test of DIF (Penfield, 2007).

In contrast to global DIF, net polytomous DIF statistics are based on the signed conditional between-group difference across all score levels. Thus, it is possible for between-group differences to vary in sign across the score levels of the item (e.g., DIF favoring the reference group for some score levels but favoring the focal group for others), which can yield a net DIF effect of 0 (or near 0) despite the presence of sizable effects within particular score levels (Penfield, 2010). Numerous statistics for evaluating net polytomous DIF exist, including Mantel's chi-square (Mantel, 1963), the standardized mean difference index (Dorans & Schmitt, 1993), polytomous SIBTEST (Chang, Mazzeo, & Roussos, 1996), and the cumulative common log-odds ratio (Penfield & Algina, 2003).

One practical implication of the distinction between these two forms of DIF is that net and global DIF represent different violations of measurement invariance; therefore, it is crucial to clearly identify the form of measurement invariance of interest for a particular assessment. For example, a particular testing program may use the standard of no net DIF as a reasonable threshold for invariance to hold. Other testing programs, however, may be concerned with detecting violations of invariance occurring anywhere within the item and use the standard of no global DIF as a necessary proof of measurement invariance (Penfield, 2010).

Differential Step Functioning (DSF)

Penfield (2007) stated that the current standard practice in conducting DIF analyses with polytomous items is to conduct a single item-level test that examines invariance across all score levels simultaneously; such tests are therefore considered omnibus tests of DIF across all score levels. However, the DIF effect associated with each score level can be investigated using the framework of differential step functioning (DSF).

Using the framework of DSF to examine between-group measurement equivalence in polytomous items has several advantages over the omnibus measures of DIF (Penfield, 2007). First, tests of DSF can be more powerful than net tests of DIF when the magnitude and/or the sign of the DSF effect vary across the steps underlying the polytomous response variable. A second advantage of the DSF framework is that it allows the DIF analyst to determine precisely which score levels (or steps) are responsible for an observed DIF effect.

The framework of DSF is based on the concept of the step function. Several forms of the step function have been proposed; two of these are the cumulative form and the adjacent-categories form. For a polytomous item having r (r > 2) ordinal score levels, the probability of observing each score level, given a particular level of ability, is determined by a set of J = r − 1 step functions. Under the cumulative approach, each of the J step functions specifies the probability of successfully advancing (i.e., stepping) from a lower score level to a higher score level as a function of ability. Given this description of the concept of DSF, net DIF can be conceptualized as the aggregated DSF effects across the J steps (Penfield, Gattamorta, & Childs, 2009). For investigating DSF, three general approaches have been described in the literature: an IRT approach (Cohen, Kim, & Baker, 1993), a logistic regression approach (French & Miller, 1996), and an odds ratio approach (Penfield, 2007). One advantage of the odds ratio approach over the other two is that it is not hindered by the assumption of model fit (Penfield, Alvarez, & Lee, 2009).
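To make the cumulative dichotomization concrete, the following minimal sketch (in Python, using hypothetical response vectors) computes a between-group log-odds ratio for each cumulative step of a three-category item within a single ability stratum. Step j contrasts scores at or above j with scores below j; Penfield's (2007) common log-odds ratio estimator, used in the analyses reported below, pools such 2 × 2 tables across all strata of the matching variable, so this single-stratum computation is illustrative only.

```python
import numpy as np

def cumulative_step_log_odds(ref_scores, foc_scores, step):
    """Between-group log-odds ratio for cumulative step `step` (1-based)
    within a single ability stratum, for an item scored 0..r-1.
    Step j contrasts scores >= j ("success") with scores < j ("failure").
    This is a single-stratum sketch; Penfield's (2007) estimator pools
    such 2x2 tables across all strata of the matching variable."""
    ref_scores = np.asarray(ref_scores)
    foc_scores = np.asarray(foc_scores)
    # 2x2 table counts (0.5 added to each cell to avoid division by zero)
    a = np.sum(ref_scores >= step) + 0.5   # reference "successes"
    b = np.sum(ref_scores < step) + 0.5    # reference "failures"
    c = np.sum(foc_scores >= step) + 0.5   # focal "successes"
    d = np.sum(foc_scores < step) + 0.5    # focal "failures"
    return np.log((a * d) / (b * c))

# Hypothetical responses (scores 0, 1, 2) for one ability stratum
ref = [0, 1, 1, 2, 2, 2, 1, 0, 2, 1]
foc = [0, 0, 1, 1, 2, 1, 0, 1, 2, 0]
for j in (1, 2):
    print(f"step {j}: log-odds ratio = {cumulative_step_log_odds(ref, foc, j):.2f}")
```

A positive value at step j indicates that, within this stratum, the reference group has greater odds of reaching score level j or higher than the focal group.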


Under the odds ratio approach, the null hypothesis of no DSF at the jth step can be tested using the following test statistic:

z(λj) = λj / SE(λj).  (1)

This test statistic is distributed approximately as standard normal, where λj is the estimator of the common log-odds ratio associated with the jth step and SE(λj) is its estimated standard error (Penfield, 2007).
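As an illustration of Equation 1, the short sketch below computes z for a given step-level estimate and its standard error, together with a two-sided p value from the normal approximation. The function name is ours; the example values are the Step 2 estimate and standard error for Item S114 (France vs. Denmark, Booklet 1) reported in Table 1, for which z is approximately 2.27.

```python
from math import erf, sqrt

def dsf_z_test(lambda_hat, se):
    """z statistic of Equation 1 for the null hypothesis of no DSF at step j,
    with a two-sided p value from the standard normal approximation."""
    z = lambda_hat / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * P(Z > |z|)
    return z, p

# Step 2 of Item S114, France vs. Denmark, Booklet 1 (see Table 1)
z, p = dsf_z_test(lambda_hat=0.59, se=0.26)
print(f"z = {z:.2f}, p = {p:.3f}")  # z = 2.27, which exceeds the
# Bonferroni-adjusted critical value of 2.24 used in the Method section
```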

Test Language DIF

The Program for International Student Assessment (PISA) has been translated (and/or adapted) into more than 40 different test languages (The Organisation for Economic Co-Operation and Development [OECD], 2009). The typical PISA procedure includes the development of two parallel source versions (in English and French), with the recommendation that each country develop two separate versions in its language of instruction, one from each source language, prepared by two independent translators, and then reconcile them into a final national version by another independent translator (Grisay, 2003).

Research studies (Arim & Ercikan, 2005; Ercikan, 1999; Le, 2006; Yildirim & Berberoglu, 2009), however, indicated that for PISA and other international tests, DIF exists across different versions of the same test when administered in multiple languages. Le (2006) conducted a comprehensive exploration of DIF in which English was compared with five other languages in the PISA-2006 field trial data. Using IRT, the results indicated the presence of DIF in PISA items across different item formats, with the English group being advantaged over the groups taking the test in other languages. In addition, Yildirim and Berberoglu (2009) found that 24% of the PISA-2003 math items in one booklet displayed DIF, using three different DIF detection methods, when English- and Turkish-speaking examinees were compared.

The present study was motivated by three observations: (a) the lack of published demonstrations of the net versus global distinction of DIF in polytomous items; (b) the lack of published documentation of the DSF framework; and (c) the limited number of published studies exploring language DIF in translated test forms, an issue of growing importance as more and more assessments are translated. Therefore, the purpose of the present study was to examine net and global DIF in the PISA-2006 polytomously scored science items when the two source languages, French and English, were compared with two other languages. In addition, the framework of DSF was used to enhance the analysis of measurement equivalence over that provided solely by traditional omnibus measures of DIF. Given the growing use of polytomous item formats across a range of measurement contexts (innovative item types, automated scoring, and so on), this is an area of increasing interest, and the present study is timely in that respect. In addition, the issues of net DIF, global DIF, and DSF appear to be important components in advancing the validity of these items.

Method

Data

PISA is conducted by the OECD in selected countries to measure how well 15-year-old students are prepared to meet the challenges of the future. In each assessment, one of the three areas (science, reading, and mathematics) is chosen as the major domain and given greater emphasis; the remaining two areas are assessed less thoroughly.


In 2000, the major domain was reading; in 2003, it was mathematics; and in 2006, it was science (OECD, 2009). In PISA-2006, a total of six polytomously scored science items, with three possible scores (0, 1, and 2), were presented to students in 10 (out of 13) test booklets. Each test item appeared in four of the test booklets, with a different position in each booklet.

The data used in the present study came from the PISA-2006 polytomously scored science items for four countries: France (n = 4,716), Denmark (n = 4,532), the United States (n = 5,611), and the Slovak Republic (n = 4,731). For the purposes of the present study, these four countries were grouped into two sets, such that each set contains a country with one of the source languages as its test language. The first set contains France and Denmark, where French was the source language, while the second set contains the United States and the Slovak Republic, where English was the source language. Denmark and the Slovak Republic were selected such that the mean test scores within each set were very similar for both countries. France, with a mean score of 495, and Denmark, with a mean score of 496, were classified at the average, whereas the United States, with a mean score of 489, and the Slovak Republic, with a mean score of 488, were classified below the average. It should be noted that the examination from source to target languages was not fully crossed (i.e., no attempt was made to examine French to Slovak or English to Danish).

Overview of the Analyses

The null hypothesis of no global DIF was evaluated by testing the null hypothesis of no DSF at each of the J steps. If the null hypothesis of no DSF was retained for all J steps, then the null hypothesis of no global DIF was retained. If, however, the null hypothesis of no DSF was rejected for one or more steps, then the null hypothesis of no global DIF was rejected. This approach is referred to as the simultaneous step-level (SSL) test of DIF (Penfield, 2007).

For polytomous items, Penfield, Alvarez, and Lee (2009) recommended using at least one global DIF test and one net DIF test for a comprehensive DIF analysis. Therefore, this study used the SSL test as a global DIF test and Mantel's chi-square test as a net DIF test. If the null hypothesis of no DIF was retained for both global and net DIF, it was concluded that measurement equivalence exists. However, if the null hypothesis of no DIF was rejected for either global or net DIF, a thorough DSF analysis was conducted where the step functions were defined using the cumulative approach. Gattamorta and Penfield (2012) recommended the use of the cumulative step function over the adjacent-categories step function due to its stability.

To judge the magnitude of the DSF effect at each of the J steps for each item, λj was estimated and interpreted according to the five-category DSF classification scheme (Penfield, Alvarez, & Lee, 2009), where |λj| < 0.43 corresponds to a small DSF effect (S), 0.43 ≤ |λj| < 0.64 corresponds to a medium DSF effect (M), and |λj| ≥ 0.64 corresponds to a large DSF effect (L). It is possible to incorporate statistical significance into this classification. However, Penfield, Alvarez, and Lee (2009) stated that relying on the magnitude of the effect size alone is likely adequate, given sufficient sample sizes (i.e., > 300 per group). Therefore, in the present study, the classification of medium and large DSF effects is determined solely by the magnitude of the DSF effect size.
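The classification rule can be summarized in a few lines of code. The sketch below is illustrative (the function name is ours); the example applies the rule to the Booklet 2 step estimates for Item S114 (France vs. Denmark) from Table 1 and reproduces the M+ and L+ labels reported there.

```python
def classify_dsf(lambda_hat):
    """Five-category DSF classification (Penfield, Alvarez, & Lee, 2009):
    |lambda| < 0.43 -> small (S); 0.43 <= |lambda| < 0.64 -> medium (M);
    |lambda| >= 0.64 -> large (L); medium and large effects carry the sign."""
    mag = abs(lambda_hat)
    if mag < 0.43:
        return "S"
    label = "M" if mag < 0.64 else "L"
    return label + ("+" if lambda_hat > 0 else "-")

# Step estimates for Item S114 (France vs. Denmark, Booklet 2) from Table 1
print([classify_dsf(v) for v in (0.57, 0.64)])  # ['M+', 'L+']
```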

Moreover, the taxonomy of DSF forms proposed by Penfield, Alvarez, and Lee (2009) was adopted in the present study to identify and interpret the causes of DIF. This taxonomy categorizes the DSF according to two dimensions. The first dimension distinguishes between pervasive and non-pervasive DSF. Pervasive DSF corresponds to the situation in which all J steps display substantial DSF effects (medium or large in magnitude), suggesting that the factor causing the DIF exerts its influence at the item level. In contrast, the DSF form is labeled non-pervasive when some steps, but not all, display a substantial DSF effect, suggesting that the factor causing the DIF exerts its influence at the level of a particular step.

The second dimension of the DSF taxonomy concerns the consistency of the DSF effect across the affected steps; it distinguishes between constant, convergent, and divergent DSF forms.


Constant DSF occurs when the steps display substantial DSF effects that are relatively equal in magnitude and sign. Convergent DSF concerns the situation whereby all steps displaying a substantial DSF effect have the same sign, but not the same magnitude. However, if different steps display DSF effects that have different signs, then the divergent DSF form is present. When there are only two steps associated with each item, it is not possible to distinguish between constant, convergent, and divergent DSF effects when the DSF form is labeled non-pervasive. Therefore, in the present study, this DSF form is referred to as "potential non-pervasive," a term used by Penfield, Alvarez, and Lee (2009).
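A companion sketch labels the DSF form from step-level classifications such as those produced by the classify_dsf sketch above. It is not the authors' scoring code; in particular, reading "relatively equal in magnitude" as "same effect-size category (M or L)" is an assumed operationalization, although it reproduces the form reported in Table 1 for the example shown.

```python
def dsf_form(step_classes):
    """Label the DSF form from step-level classifications such as ['M+', 'L+']
    (output of the classify_dsf sketch above). "Substantial" means medium or
    large. The constant/convergent split assumes "relatively equal magnitude"
    means the same effect-size category (M or L), an operational reading of
    Penfield, Alvarez, and Lee (2009), not their exact rule."""
    substantial = [c for c in step_classes if c != "S"]
    if not substantial:
        return "No DSF"
    if len(substantial) < len(step_classes):
        # With only two steps, a non-pervasive pattern cannot be classified further
        return "Potential non-pervasive" if len(step_classes) == 2 else "Non-pervasive"
    signs = {c[-1] for c in substantial}
    if len(signs) > 1:
        return "Pervasive divergent"
    sizes = {c[0] for c in substantial}
    return "Pervasive constant" if len(sizes) == 1 else "Pervasive convergent"

print(dsf_form(["M+", "L+"]))  # Pervasive convergent (cf. S114, Booklet 2, Table 1)
```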

PISA uses an imputation methodology, usually referred to as plausible values, to indicate students' proficiency levels based on the observed item responses. The plausible values are random draws from the marginal posterior of the latent distribution. Usually, five plausible values are allocated to each student on each PISA performance scale. It is recommended that statistical analyses be performed independently on each of these five plausible values and that the results be aggregated to obtain the final estimates of the statistics and their respective standard errors (OECD, 2009). In the present study, only a single run of the DIF and DSF analyses (i.e., without item purification) was conducted because there were only six polytomous items (out of 108 items); even if all six items were flagged as having DIF, this small number of affected items would be unlikely to have a notable effect on the results. In addition, Magis and Facon (2013) showed that item purification is not always useful and that a single run of the DIF method may return equally suitable results.

To summarize, the following analyses were conducted using Penfield's (2005) DIFAS computer program (using the plausible values as the stratifying variables and considering the French and the American groups as the reference groups):

a. A test of the null hypothesis of no net DIF was conducted using Mantel’s chi-square test with a Type I error rate of .05. This statistic is distributed as chi square with one degree of freedom. Therefore, the critical value of this statistic is 3.84.

b. DSF analysis was conducted using the estimated step-level common log-odds ratio (λj).

c. A test of no global DIF was conducted using the SSL test with a family-wise Type I error rate of .05. SSL tests were conducted using z(λj) = λj / SE(λj) and evaluated with a Bonferroni-adjusted Type I error rate of .05/J (i.e., .025 for J = 2), as described by Penfield (2007). Critical values of this statistic are ±2.24.

d. The aforementioned analyses were conducted five times (once for each plausible value), and the results were then averaged. A minimal sketch of the decision rules in (a) and (c) follows.
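The sketch below combines decision rules (a) and (c) under the critical values just stated. It assumes that the step-level estimates, their standard errors, and the Mantel chi-square value are already available (e.g., from DIFAS output); the function name is ours, and the example row is Item S114 (the United States vs. the Slovak Republic, Booklet 11) from Table 1, for which net DIF is rejected while global DIF is retained.

```python
def dif_decisions(lambdas, ses, mantel_chi2, chi2_crit=3.84, z_crit=2.24):
    """Combine analyses (a) and (c): net DIF via Mantel's chi-square
    (1 df, critical value 3.84) and global DIF via the SSL test, where each
    step's z = lambda / SE is compared with the Bonferroni-adjusted critical
    value (+/-2.24 for J = 2). Step estimates and the Mantel statistic are
    assumed to come from a program such as DIFAS."""
    z_values = [lam / se for lam, se in zip(lambdas, ses)]
    global_dif = "Reject" if any(abs(z) > z_crit for z in z_values) else "Accept"
    net_dif = "Reject" if mantel_chi2 > chi2_crit else "Accept"
    return {"z": [round(z, 2) for z in z_values],
            "Global DIF": global_dif, "Net DIF": net_dif}

# Item S114, the United States vs. the Slovak Republic, Booklet 11 (Table 1):
# net DIF is rejected (chi-square = 5.65 > 3.84) while global DIF is retained.
print(dif_decisions([-0.55, -0.75], [0.33, 0.39], mantel_chi2=5.65))
```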

Results

To achieve the purposes of this study, the six polytomously scored science items in PISA-2006 were analyzed, and the results of these analyses are displayed in Table 1.

Results of Net and Global DIF Analyses

Only one item (Item S447), out of six items, was free of DIF. This item displayed non-significant tests of both net and global DIF, and small DSF effects across all booklets. However, for each source language, French and English, Table 1 shows that four (67%) polytomous items were flagged as having DIF in the form of net DIF and/or global DIF across at least half of the test booklets. For French as the source language, Items S114, S498, and S519 were flagged as having both net and global DIF across all test booklets. The remaining item (Item S465) displayed significant tests of net and global DIF in two (50%) booklets (Booklets 5 and 11). For English as the source language, Items S114, S465, S485, and S519 were flagged as having net DIF and/or global DIF. However, none of these items displayed both types of DIF across all booklets.


Table 1. Results of Net DIF, Global DIF, and DSF Analyses for PISA-2006 Polytomously Scored Science Items for the Two Source Languages and Across Four Different Booklets.

| Test item | Source language | Booklet | Step 1 (λ1) | Step 2 (λ2) | DSF form | Global DIF | Net DIF |
|---|---|---|---|---|---|---|---|
| S114 | French (France vs. Denmark) | 1 | 0.45 (0.29) M+ | 0.59* (0.26) M+ | Pervasive constant | Reject | Reject (χ² = 5.55) |
| | | 2 | 0.57 (0.28) M+ | 0.64* (0.25) L+ | Pervasive convergent | Reject | Reject (χ² = 7.40) |
| | | 8 | 0.52 (0.29) M+ | 0.69* (0.29) L+ | Pervasive convergent | Reject | Reject (χ² = 6.44) |
| | | 11 | 0.82* (0.34) L+ | 0.88* (0.35) L+ | Pervasive constant | Reject | Reject (χ² = 8.49) |
| | English (the United States vs. the Slovak Republic) | 1 | −0.08 (0.25) S | −0.45 (0.31) M− | Potential non-pervasive | Accept | Accept (χ² = 1.49) |
| | | 2 | −0.62* (0.24) M− | −0.89* (0.27) L− | Pervasive convergent | Reject | Reject (χ² = 12.66) |
| | | 8 | −0.74* (0.28) L− | −0.98* (0.32) L− | Pervasive constant | Reject | Reject (χ² = 11.37) |
| | | 11 | −0.55 (0.33) M− | −0.75 (0.39) L− | Pervasive convergent | Accept | Reject (χ² = 5.65) |
| S465 | French (France vs. Denmark) | 4 | 0.29 (0.26) S | −0.03 (0.22) S | No DSF | Accept | Accept (χ² = 0.76) |
| | | 5 | 0.77* (0.26) L+ | 0.57* (0.22) M+ | Pervasive convergent | Reject | Reject (χ² = 10.52) |
| | | 11 | 0.62* (0.26) M+ | 0.38 (0.22) S | Potential non-pervasive | Reject | Reject (χ² = 6.01) |
| | | 12 | 0.21 (0.26) S | 0.37 (0.25) S | No DSF | Accept | Accept (χ² = 2.34) |
| | English (the United States vs. the Slovak Republic) | 4 | −0.02 (0.24) S | −0.63* (0.22) M− | Potential non-pervasive | Reject | Accept (χ² = 3.46) |
| | | 5 | −0.001 (0.23) S | −0.64* (0.22) L− | Potential non-pervasive | Reject | Accept (χ² = 3.41) |
| | | 11 | 0.17 (0.23) S | −0.24 (0.22) S | No DSF | Accept | Accept (χ² = 0.14) |
| | | 12 | 0.14 (0.23) S | −0.16 (0.24) S | No DSF | Accept | Accept (χ² = 0.71) |
| S485 | French (France vs. Denmark) | 1 | 0.13 (0.27) S | 0.08 (0.28) S | No DSF | Accept | Accept (χ² = 0.47) |
| | | 9 | 0.18 (0.27) S | 0.27 (0.28) S | No DSF | Accept | Accept (χ² = 1.35) |
| | | 10 | 0.49 (0.31) M+ | 0.42 (0.38) S | Potential non-pervasive | Accept | Accept (χ² = 3.35) |
| | | 12 | 0.01 (0.27) S | 0.46 (0.31) M+ | Potential non-pervasive | Accept | Accept (χ² = 0.91) |
| | English (the United States vs. the Slovak Republic) | 1 | 0.03 (0.23) S | −0.80* (0.28) L− | Potential non-pervasive | Reject | Accept (χ² = 2.82) |
| | | 9 | −0.18 (0.26) S | −1.06* (0.37) L− | Potential non-pervasive | Reject | Reject (χ² = 4.80) |
| | | 10 | 0.03 (0.23) S | −0.76 (0.35) L− | Potential non-pervasive | Accept | Accept (χ² = 1.30) |
| | | 12 | 0.09 (0.24) S | −0.69 (0.33) L− | Potential non-pervasive | Accept | Accept (χ² = 1.42) |
| S498 | French (France vs. Denmark) | 2 | 0.77* (0.28) L+ | 0.95* (0.23) L+ | Pervasive constant | Reject | Reject (χ² = 15.74) |
| | | 3 | 1.06* (0.30) L+ | 1.33* (0.25) L+ | Pervasive constant | Reject | Reject (χ² = 29.36) |
| | | 5 | 0.72* (0.25) L+ | 1.03* (0.25) L+ | Pervasive constant | Reject | Reject (χ² = 16.34) |
| | | 9 | 0.87* (0.26) L+ | 0.84* (0.23) L+ | Pervasive constant | Reject | Reject (χ² = 16.09) |
| | English (the United States vs. the Slovak Republic) | 2 | 0.13 (0.21) S | −0.04 (0.20) S | No DSF | Accept | Accept (χ² = 0.48) |
| | | 3 | 0.02 (0.22) S | −0.06 (0.21) S | No DSF | Accept | Accept (χ² = 0.21) |
| | | 5 | 0.25 (0.20) S | −0.001 (0.20) S | No DSF | Accept | Accept (χ² = 0.48) |
| | | 9 | 0.34 (0.20) S | 0.05 (0.20) S | No DSF | Accept | Accept (χ² = 1.58) |
| S519 | French (France vs. Denmark) | 2 | 0.58* (0.24) M+ | 0.46 (0.23) M+ | Pervasive constant | Reject | Reject (χ² = 6.63) |
| | | 3 | 0.74* (0.27) L+ | 0.87* (0.31) L+ | Pervasive constant | Reject | Reject (χ² = 9.76) |
| | | 5 | 0.72* (0.24) L+ | 0.77* (0.27) L+ | Pervasive constant | Reject | Reject (χ² = 10.65) |
| | | 9 | 0.80* (0.25) L+ | 0.69* (0.25) L+ | Pervasive constant | Reject | Reject (χ² = 11.60) |
| | English (the United States vs. the Slovak Republic) | 2 | 0.60* (0.21) M+ | 0.57 (0.27) M+ | Pervasive constant | Reject | Reject (χ² = 9.15) |
| | | 3 | 0.54* (0.21) M+ | 0.58* (0.25) M+ | Pervasive constant | Reject | Reject (χ² = 8.41) |
| | | 5 | 0.02 (0.23) S | 0.36 (0.34) S | No DSF | Accept | Accept (χ² = 0.38) |
| | | 9 | 0.65* (0.23) L+ | 0.34 (0.30) S | Potential non-pervasive | Reject | Reject (χ² = 6.50) |

Note. Standard errors are reported in parentheses, and the numbers preceding the parentheses are the estimated cumulative step-level log-odds ratios λj; asterisks indicate statistical significance at the appropriate Bonferroni-adjusted Type I error rate for the DSF effects (i.e., .025 for 2 steps). S = small DSF effect; M− = medium negative DSF effect; M+ = medium positive DSF effect; L− = large negative DSF effect; L+ = large positive DSF effect. DIF = Differential Item Functioning; DSF = Differential Step Functioning; PISA = Program for International Student Assessment.


When comparing the net and global DIF findings under each source language, it was clear that for French as the source language, more administrations of the PISA polytomous items were flagged as having both types of DIF than for English as the source language. Because each of these four items appeared in four different booklets, there were 16 different administrations of these items. Fourteen administrations were flagged as having net DIF where French was the source language, compared with 7 administrations where English was the source language. Similarly, for French as the source language, 14 administrations were flagged as having global DIF, compared with 9 administrations for the other source language.

Results of DSF Analysis

For French as the source language, five (83%) items (Items S114, S465, S485, S498, and S519) displayed different DSF effects in almost all booklets. Three items (Items S114, S498, and S519) displayed substantial DSF effects that were either medium or large across the two steps (the pervasive form of DSF) in all booklets, indicating that a potentially biasing factor may exist at the item level. Item S465 displayed substantial DSF effects in just two booklets, a pervasive form in Booklet 5 and a potential non-pervasive form in Booklet 11. However, Item S485 showed non-substantial DSF effects in two booklets and a potential non-pervasive form of DSF in the other two booklets, indicating that the potentially biasing factor resides in the lowest score category in Booklet 10, whereas it resides in the highest score category in Booklet 12.

For English as the source language, four (67%) items (Items S114, S465, S485, and S519) displayed different DSF effects in almost all booklets. For Item S114, DSF effects were of the pervasive form in all but one booklet (Booklet 1), indicating that a potentially biasing factor may exist at the item level. However, for Item S485, a potential non-pervasive form of DSF resulted in all booklets, suggesting that a potentially biasing factor may exist and resides in the highest score category. For one of the two remaining items (Item S465), a potential non-pervasive form of DSF resulted in two of the booklets (Booklets 4 and 5), while the item appeared to be free of DSF in the other two booklets (Booklets 11 and 12). By contrast, Item S519 displayed pervasive DSF effects in two booklets (Booklets 2 and 3), but a potential non-pervasive form in Booklet 9 and no DSF in Booklet 5.

It is worth noting that for just one item (Item S519), the DSF effects were of the same sign (positive) for both source languages across all booklets. This indicates that in both sets of countries, this item gave a relative advantage to the source-language group over the target-language group, after conditioning on ability, for whichever test booklet was being administered. However, for the remaining items, the DSF effects for the two source languages had different signs. The DSF effects for French as the source language were positive across four items (Items S114, S465, S485, and S498), indicating that French examinees were advantaged by those items over Danish examinees with the same proficiency levels. A different finding emerged for the other source language, English: the DSF effects for three items (Items S114, S465, and S485) were negative, indicating that the Slovak group was given a relative advantage over the English group after conditioning on ability. However, Item S498 yielded negligible DSF effects and was therefore considered free of DSF.

Conclusion

Five major conclusions emerged from analyzing the PISA-2006 polytomously scored science items for test-language-related net DIF, global DIF, and DSF. First, test-language DIF (net and global) is present in the PISA-2006 polytomously scored items. For each source language, 67% of the items displayed either or both types of DIF.


The results resemble findings in previous studies that showed the existence of test-language-related DIF in international tests such as PISA (Arim & Ercikan, 2005; Ercikan, 1999; Le, 2006; Yildirim & Berberoglu, 2009).

Second, the DIF results differed between the two source languages. The hypotheses of no net DIF and no global DIF were rejected for French as the source language in more administrations of the PISA items than for English as the source language. French examinees were given a relative advantage over Danish examinees, after conditioning on ability, whereas the English group was disadvantaged by the items as compared with the other group of examinees (the Slovak Republic). The differences may be due to problems in translation (Le, 2006). However, other factors may affect item equivalence across language versions of PISA, such as cultural and curriculum differences between the groups (Le, 2009; Sireci & Berberoglu, 2000; van de Vijver & Tanzer, 2004).

Third, for both source languages and for each item, the findings showed that under the condition of no DSF for each of the two steps, no DIF of either type exists, which agrees with what would be theoretically expected (Penfield, Gattamorta, & Childs, 2009). Examination of such items that perform well (e.g., Items S447 and S498) might prove helpful to item developers.

For items that were flagged for DIF, however, the analysis of DSF provided valuable information concerning the nature of the DIF effect (i.e., whether the DIF is an item-level effect or an effect isolated to specific score levels) and the location of the DIF effect (i.e., precisely which score levels are manifesting the DIF effect). For some items that were flagged for DIF (e.g., Items S114, S498, and S519 for French vs. Denmark), the DSF effects were of the pervasive form, indicating that a biasing factor may exist at the item level, possibly located in the content of the item stem or the general properties of the item itself. This implies a need for a thorough item content revision and testing on a one-to-one basis to pinpoint the potential causes of DIF. For the remaining items, which yielded a potential non-pervasive form of DSF, the DSF analysis showed that a potentially biasing factor may reside in one of the steps, indicating that the biasing factor may well be in the scoring criteria for that step. For example, potential non-pervasive DSF observed in just the first step suggests that the cause of the DIF likely resides in the second-lowest score level, because one group of examinees experiences a relative difficulty in making the transition from the lowest score level to a higher score level. This suggests that the scoring criteria should be revised for those items and for the affected steps to clarify the potential causes of DIF and DSF.

Fourth, net and global DIF tests did not always yield the same result. The pattern of results observed in this study is what would be theoretically expected: net and global tests can yield different conclusions depending on the DSF effect pattern. Specifically, when there was significant net DIF but not significant global DIF (e.g., S114, the United States vs. the Slovak Republic), the DSF effects were pervasive convergent, and thus the net DIF test is more powerful than the global test because all step-level effects were in the same direction. However, when there was significant global DIF but not significant net DIF (e.g., S465, the United States vs. the Slovak Republic), the DSF effects were potential non-pervasive, and thus the global DIF test is more powerful. The large DSF effects in one, and only one, step of these items might be diluted by the negligible (near-zero) DSF effects of the other step, yielding a relatively small aggregated (net) DIF.

Fifth, in this study, PISA polytomous items had three score levels resulting in just two step functions, which precludes distinguishing between different forms of non-pervasive DSF, that is, non-pervasive constant, non-pervasive convergent, and non-pervasive divergent. Therefore, when there are only two steps and the DSF is non-pervasive, the distinction between constant, convergent, and divergent forms of DSF is irrelevant.

It is suggested that polytomous items with more than three score levels (i.e., at least four score levels, or J ≥ 3 steps) be analyzed so that non-pervasive DSF can be further characterized as constant, convergent, or divergent. Moreover, this study compared each source language with just one other test language.


It would be beneficial to replicate the analyses with data from other PISA-participating countries encompassing more divergent test languages, and/or other countries that differ statistically from the average. Finally, it would be interesting to investigate DSF using the other two approaches available in the literature, the IRT approach and the logistic regression approach, and to compare the results with those yielded by the odds ratio approach.

Acknowledgments

The authors would like to thank Professor Randall Penfield at the University of North Carolina at Greensboro, Professor Margaret Wu at Victoria University, and the two anonymous reviewers for their valuable comments, suggestions, and edits.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Arim, R. G., & Ercikan, K. (2005, April). Comparability between the US and Turkish versions of the Third International Mathematics and Science Study's (TIMSS) mathematics test results. Paper presented at the meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada.

Chang, H. H., Mazzeo, J., & Roussos, L. (1996). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33, 333-353. doi:10.1111/j.1745-3984.1996.tb00496.x

Cohen, A. S., Kim, S. H., & Baker, F. B. (1993). Detection of differential item functioning in the graded response model. Applied Psychological Measurement, 17, 335-350. doi:10.1177/014662169301700402

Dorans, N. J., & Schmitt, A. P. (1993). Constructed response and differential item functioning: A pragmatic approach. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 135-165). Hillsdale, NJ: Lawrence Erlbaum.

Ercikan, K. (1999, April). Translation DIF on TIMSS. Paper presented at the meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada.

French, A. W., & Miller, T. R. (1996). Logistic regression and its use in detecting differential item functioning in polytomous items. Journal of Educational Measurement, 33, 315-332. doi:10.1111/j.1745-3984.1996.tb00495.x

Gattamorta, K. A., & Penfield, R. D. (2012). A comparison of adjacent categories and cumulative differential step functioning effect estimators. Applied Measurement in Education, 25, 142-161. doi:10.1080/08957347.2012.660387

Grisay, A. (2003). Translation procedures in OECD/PISA 2000 international assessment. Language Testing, 20, 225-240. doi:10.1191/0265532203lt254oa

Kim, S. H., & Cohen, A. S. (1998). Detection of differential item functioning under the graded response model with the likelihood ratio test. Applied Psychological Measurement, 22, 345-355. doi:10.1177/014662169802200403

Le, L. T. (2006, April). Analysis of differential item functioning. Paper presented at the annual meeting of American Educational Research Association, San Francisco, CA.

Le, L. T. (2009). Investigating gender differential item functioning across countries and test languages for PISA science items. International Journal of Testing, 9, 122-133. doi:10.1080/15305050902880769

Magis, D., & Facon, B. (2013). Item purification does not always improve DIF detection: A counterexample with Angoff's delta plot. Educational and Psychological Measurement, 73, 293-311. doi:10.1177/0013164412451903


Mantel, N. (1963). Chi-square tests with one degree of freedom: Extensions of the Mantel-Haenszel procedure. Journal of the American Statistical Association, 58, 690-700. doi:10.1080/01621459.1963.10500879

The Organisation for Economic Co-Operation and Development. (2009). PISA 2006 technical report. Retrieved from http://www.oecd.org/pisa/pisaproducts/42025182.pdf

Penfield, R. D. (2005). DIFAS: Differential item functioning analysis system. Applied Psychological Measurement, 29, 150-151. doi:10.1177/0146621603260686

Penfield, R. D. (2007). Assessing differential step functioning in polytomous items using a common odds ratio estimator. Journal of Educational Measurement, 44, 187-210. doi:10.1111/j.1745-3984.2007.00034.x

Penfield, R. D. (2010). Distinguishing between net and global DIF in polytomous items. Journal of Educational Measurement, 47, 129-149. doi:10.1111/j.1745-3984.2010.00105.x

Penfield, R. D., & Algina, J. (2003). Applying the Liu–Agresti estimator of the cumulative common odds ratio to DIF detection in polytomous items. Journal of Educational Measurement, 40, 353-370. doi:10.1111/j.1745-3984.2003.tb01151.x

Penfield, R. D., Alvarez, K., & Lee, O. (2009). Using a taxonomy of differential step functioning to improve the interpretation of DIF in polytomous items: An illustration. Applied Measurement in Education, 22, 61-78. doi:10.1080/08957340802558367

Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In S. Sinharay & C. R. Rao (Eds.), Handbook of statistics: Psychometrics (Vol. 26, pp. 125-167). New York: Elsevier.

Penfield, R. D., Gattamorta, K., & Childs, R. A. (2009). An NCME instructional module on using differential step functioning to refine the analysis of DIF in polytomous items. Educational Measurement: Issues and Practice, 28, 38-49. doi:10.1111/j.1745-3992.2009.01135.x

Potenza, M. T., & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A framework for classification and evaluation. Applied Psychological Measurement, 19, 23-37. doi:10.1177/014662169501900104

Sireci, S. G., & Berberoglu, G. (2000). Using bilingual respondents to evaluate translated–adapted items. Applied Measurement in Education, 13, 229-248. doi:10.1207/S15324818AME1303_1

Somes, G. W. (1986). The generalized Mantel–Haenszel statistic. The American Statistician, 40, 106-108. doi:10.1080/00031305.1986.10475369

van de Vijver, F., & Tanzer, N. K. (2004). Bias and equivalence in cross-cultural assessment: An overview. Revue Européenne de Psychologie Appliquée/European Review of Applied Psychology, 54, 119-135. doi:10.1016/j.erap.2003.12.004

Yildirim, H. H., & Berberoğlu, G. (2009). Judgmental and statistical DIF analyses of the PISA-2003 mathematics literacy items. International Journal of Testing, 9, 108-121. doi:10.1080/15305050902880736
