getd.libs.uga.edu · USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-RELATED DIF...
-
USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-
RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE
by
KATHERINE RACZYNSKI
(Under the Direction of Seock-Ho Kim)
ABSTRACT
This study uses the multiple-indicator/multiple cause (MIMIC) latent variable model and multiple group analysis to evaluate Likert-type items for gender-related differential item functioning (DIF) on an aggression measure in an adolescent population. The MIMIC model allows for the simultaneous examination of group differences in the latent factor of interest (i.e., aggression) and response to measurement (i.e., DIF). Multiple group analysis provides an overall examination of measurement invariance across groups. This study will test for gender DIF in two scales of an aggression measure, physical aggression and relational aggression.
INDEX WORDS: Measurement invariance, MIMIC model, Differential item functioning
-
USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-
RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE
by
KATHERINE RACZYNSKI
B.S.Ed., University of Georgia, 2002
A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment
of the Requirements for the Degree
MASTER OF ARTS
ATHENS, GEORGIA
2008
-
© 2008
Katherine Raczynski
All Rights Reserved
-
USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-
RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE
by
KATHERINE RACZYNSKI
Major Professor: Seock-Ho Kim
Committee: Deborah Bandalos, Stephen Olejnik
Electronic Version Approved:
Maureen Grasso
Dean of the Graduate School
The University of Georgia
December 2008
-
ACKNOWLEDGEMENTS
I would like to thank my major advisor, Seock-Ho Kim, for providing support and
guidance, along with the other members of my committee, Deborah Bandalos and Stephen
Olejnik. The suggestions and assistance provided by my committee were of tremendous value.
I also owe a debt of gratitude to Andy Horne, Pamela Orpinas, and the Youth Violence
Prevention Project, for allowing me access to the data and for being great friends and role
models. Finally, thank you to my family for providing unflagging support and encouragement.
-
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS........................................................................................................... iv
LIST OF TABLES........................................................................................................................ vii
CHAPTER
1 INTRODUCTION AND THEORETICAL FRAMEWORK ........................................1
Introduction ...............................................................................................................1
Measurement .............................................................................................................1
Measurement Invariance ...........................................................................................3
Differential Item Functioning....................................................................................3
Item Response Theory...............................................................................................4
Structural Equation Modeling ...................................................................................6
Connections between CFA and IRT..........................................................................8
2 LITERATURE REVIEW ............................................................................................10
The MIMIC model ..................................................................................................10
Advantages of the MIMIC model ...........................................................................11
DIF Detection using MIMIC Models: A Comparison to IRT Models....................12
Prior Studies using MIMIC Modeling to Detect DIF..............................................13
Gender-related DIF in Measures of Aggression......................................................16
-
3 PROCEDURE..............................................................................................................18
Sample .....................................................................................................................18
Instrumentation........................................................................................................19
Computer Program ..................................................................................................20
Detection of Gender-related DIF.............................................................................20
Multiple-Indicator/Multiple Cause Modeling .........................................................21
Multiple Group Analysis .........................................................................................23
4 RESULTS ....................................................................................................................25
Descriptive Statistics ...............................................................................................25
Outliers ....................................................................................................................29
Missing Value Treatment ........................................................................................29
Physical Aggression Scale.......................................................................................29
Relational Aggression Scale....................................................................................34
5 SUMMARY AND DISCUSSION...............................................................................38
Summary .................................................................................................................38
Discussion ...............................................................................................................39
Limitations and Future Research.............................................................................42
REFERENCES ..............................................................................................................................45
APPENDICES ...............................................................................................................................51
A MPLUS SYNTAX.......................................................................................................51
-
LIST OF TABLES
Page
Table 1: Physical and Relational Aggression Items Means and Standard Deviations for Boys and
Girls ...............................................................................................................................27
Table 2: Physical and Relational Aggression Items Intercorrelations, Skewness and Kurtosis ...28
Table 3: Multiple-Indicator/Multiple Causes (MIMIC) Model Estimates for the Physical
Aggression Scale ...........................................................................................................30
Table 4: Sequential Chi Square Tests of Invariance for the Physical and Relational Aggression
Scales.............................................................................................................................32
Table 5: Multiple-Indicator/Multiple Causes (MIMIC) Model Estimates for the Relational
Aggression Scale ...........................................................................................................35
-
CHAPTER 1
INTRODUCTION AND THEORETICAL FRAMEWORK
Introduction
This study uses the multiple-indicator/multiple cause (MIMIC) latent variable model and
multiple group analysis to evaluate Likert-type items for gender-related differential item
functioning (DIF) on an aggression measure in an adolescent population. The MIMIC model
allows for the simultaneous examination of group differences in the latent factor of interest (i.e.,
aggression) and response to measurement (i.e., DIF). Multiple group analysis provides an
overall examination of measurement invariance across groups. This study tests for gender DIF
in two scales of an aggression measure, physical aggression and relational aggression. Because
gender differences in levels of victimization may contribute to differences in levels of
aggression, the model includes a measure of victimization as a covariate.
Measurement
Measurement is the term given to the systematic act of assigning numbers to variables to
represent properties or characteristics of people, events, or objects (Stevens, 1946; Lord &
Novick, 1968, p. 16). Within education and psychology, measurement is used to aid in the
understanding of unobserved or latent variables that are of interest to the researcher, such as
academic achievement or attitudes toward violence. While researchers believe that these internal
characteristics exist, there is no direct way to observe them. Instead, researchers rely on theory
to develop survey instruments that indirectly measure constructs of interest.
-
The objective of any well-designed survey instrument is to obtain observed item
responses that are reflective of respondents’ levels of an unobserved latent trait. Ideally,
individuals with the same level of the underlying trait should obtain the same score on an
instrument measuring that trait. However, according to classical test theory, survey responses
always include an unknown quantity of error or unexplained variation. That is, other factors,
apart from the level of latent trait, influence how participants respond to items. Classical test
theory models this variation using the equation
X = T + E, (1)
where X is the observed score, T is the true score, and E is the error or unexplained variation
(Lord & Novick, 1968, p. 34). Theoretically, E is normally distributed, with a mean of zero and
variance σ²E. The error component is also assumed to be nonsystematic in nature and
uncorrelated with T. Because T and E are theoretically uncorrelated, it follows that the variance
of the observed scores (σ²X) can be parceled out into two components, true score variation (σ²T)
and error variation (σ²E), as modeled in
σ²X = σ²T + σ²E . (2)
A derivation of this model allows researchers to conceptualize test reliability:
rXX′ = σ²T /(σ²T + σ²E) = σ²T /σ²X . (3)
The above model demonstrates that high reliability is dependent on a large proportion of
variation in true scores (T) to variation in observed scores (X). Because T is not directly
observable, researchers rely on other techniques (e.g., Pearson correlation, test-retest reliability)
to estimate reliability. Regardless of the measure, the goal in any testing situation is to obtain
observed scores (X) that are substantially made up of T, with negligible amounts of E. It is the
-
responsibility of the researcher to minimize the amount of error likely to be contained in
responses through rigorous confirmation of the instrument’s reliability and validity evidence.
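The decomposition in Equations 1 through 3 can be illustrated with a small simulation; the sample size, means, and variances below are arbitrary choices for the sketch, not values from this study.

```python
import numpy as np

# Illustrative simulation of the classical test theory model X = T + E
# (Equation 1); all distributional choices here are made up for the sketch.
rng = np.random.default_rng(0)
n = 100_000

true_scores = rng.normal(loc=50, scale=10, size=n)   # T, variance 100
errors = rng.normal(loc=0, scale=5, size=n)          # E, mean 0, variance 25
observed = true_scores + errors                      # X = T + E

# Because T and E are uncorrelated, var(X) ~= var(T) + var(E) (Equation 2)
print(observed.var(), true_scores.var() + errors.var())

# Reliability = var(T) / var(X) (Equation 3); here expected near 100/125 = 0.8
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))
```

With a large sample the observed-score variance closely matches the sum of the two components, and the reliability ratio recovers the proportion of true-score variance.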
Classical test theory provides a framework for understanding the measurement properties
of observed variables in terms of their reliability and validity. However, more recent analytical
techniques have allowed researchers to address additional aspects of reliability and validity that
were previously inaccessible using solely classical test theory methods of inquiry (Vandenberg &
Lance, 2000). One such topic is the systematic evaluation of measurement invariance.
Measurement Invariance
Measurement invariance refers to a test’s ability to measure the same latent variable
under different measurement conditions, such as with different populations of respondents (Horn
& McArdle, 1992). Therefore, measurement invariance is primarily concerned with the
generalizability of interpretations of test responses across different sets of circumstances. Before
conducting comparisons across groups on a common measure, researchers should evaluate
whether different groups of respondents conceptually respond to and interpret the measure in a
similar way.
Without evidence of adequate measurement invariance, interpretations of observed scores
may be flawed. Differences in group means may be related to actual group differences (e.g., one
group has more of the latent variable assessed) or to group differences in response to the measure
(e.g., differences in frame of reference). In order to meaningfully explore true differences, it is
necessary to discount the possibility of substantial measurement differences.
Differential Item Functioning
This paper is concerned with a type of measurement invariance analysis called
differential item functioning (DIF). DIF analysis involves evaluating differences in item
-
performance across distinct groups of respondents after matching the groups on ability (on
achievement measures) or “severity” (on psychological measures) (Angoff, 1993). DIF occurs
when subgroups of respondents with equal amounts of the latent trait respond differently to
items, causing potentially serious threats to the validity of the test.
DIF can be uniform or non-uniform. Uniform DIF occurs when one group consistently
scores higher than the other tested group, across all levels of ability. An example of uniform DIF
is in the case when a group of girls outperforms a group of boys on a math test when the two
groups of children possess an equal amount of underlying math ability. That is, some other
factor is interfering to give the girls a consistent advantage over boys.
Non-uniform DIF occurs when items discriminate differently between different ability
levels within groups. An example would be a math problem that average ability girls can answer
correctly, but only high ability boys (and not average ability boys) are able to answer. In effect,
this item differentiates between low-ability and average-ability girls and average-ability and
high-ability boys.
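The two kinds of DIF described above can be sketched with two-parameter logistic item response functions; all item parameters below are hypothetical.

```python
import numpy as np

# Sketch of uniform vs. non-uniform DIF using two-parameter logistic
# item response functions; parameter values are invented for illustration.
def p_correct(theta, a, b):
    """Probability of a correct response under a 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)   # matched ability levels

# Uniform DIF: same discrimination, shifted difficulty, so one group is
# consistently favored at every ability level.
girls_uniform = p_correct(theta, a=1.0, b=-0.5)
boys_uniform = p_correct(theta, a=1.0, b=0.5)
print(np.all(girls_uniform > boys_uniform))   # True at every theta

# Non-uniform DIF: discriminations differ, so the curves cross and the
# advantaged group changes across the ability range.
girls_nonuni = p_correct(theta, a=0.5, b=0.0)
boys_nonuni = p_correct(theta, a=2.0, b=0.0)
diff = girls_nonuni - boys_nonuni
print(diff[0] > 0 and diff[-1] < 0)           # the advantage flips sign
```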
Item Response Theory
Researchers have primarily examined DIF using item response theory (IRT) techniques.
IRT models evaluate variation among respondents by analyzing item-level data (Edelen et al.,
2006). In effect, IRT modeling assigns respondents as well as items to a scale of measurement,
which is conceptualized as having a range of negative infinity to positive infinity, a mean of
zero, and a unit of measurement of one (Baker, 2001).
The fundamental elements of IRT are a latent variable (such as ability or severity),
generally called θ , and an item characteristic curve (ICC) for each item. The ICC graphically
represents the probability of correct answer choice or endorsement along a continuum of θ.
-
Therefore, the ICC is conceptualized as a function of θ. When analyzing a polytomous item
(e.g., Likert-type), IRT methods assume that the item true score function, a nonlinear monotonic
function, connects θ to the expected answer choice. There are also additional assumptions
underlying polytomous IRT methods, namely, that the set of items of interest is
unidimensional and locally independent (i.e., the items are uncorrelated after controlling for θ).
One IRT model that can be used to examine polytomous items is the graded response
model (Samejima, 1969). This model is applied when items have Likert-type or categorical
answer choices, which is a common characteristic of personality and attitude measures
(Embretson & Reise, 2000, p. 308). In the graded response model, when the item has K answer
choices or categories, k = 1, 2, …, K, the item true score function of θ, T(θ), is modeled as
T(θ) = Σk=1K k × Pk(θ), (4)
where the item category response function for category k is
Pk(θ) = P*k−1(θ) − P*k(θ). (5)
The boundary response function, or probability of responding above k, is
P*k(θ) = 1/{1 + exp[−a(θ − bk)]}, (6)
where a is the slope parameter and the bk are the threshold parameters. Additionally, P*0(θ) = 1 and
P*K(θ) = 0. For an item with five response categories, there are five item response functions and
four boundary response functions. In this case, there would be five total item parameters.
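As a rough illustration of Equations 4 through 6, the following sketch computes the category probabilities and item true score for one hypothetical five-category item (one slope, four thresholds); the parameter values are invented.

```python
import numpy as np

# Sketch of Samejima's graded response model (Equations 4-6) for one
# hypothetical five-category Likert item: one slope and four thresholds.
a = 1.2
b = np.array([-1.5, -0.5, 0.5, 1.5])   # b_1..b_4, K - 1 = 4 thresholds

def boundary(theta):
    """P*_k(theta): probability of responding above category k (Equation 6),
    with the trivial endpoints P*_0 = 1 and P*_K = 0 appended."""
    core = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.concatenate(([1.0], core, [0.0]))

def category_probs(theta):
    """P_k(theta) = P*_{k-1}(theta) - P*_k(theta) (Equation 5)."""
    p_star = boundary(theta)
    return p_star[:-1] - p_star[1:]

def item_true_score(theta):
    """T(theta) = sum over k of k * P_k(theta) (Equation 4), k = 1..5."""
    return np.sum(np.arange(1, 6) * category_probs(theta))

probs = category_probs(0.0)
print(probs.round(3), probs.sum())      # five probabilities summing to 1
print(round(item_true_score(0.0), 3))   # expected response at theta = 0
```

Because these made-up thresholds are symmetric about zero, the expected response at θ = 0 is the scale midpoint, 3.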
In order to evaluate an instrument for DIF using a graded response model, the researcher
designates a reference group and a focal group of respondents. DIF is said to be present if item
true score functions are not equal among these groups such that TR(θ) ≠ TF(θ), where the
subscripts signify the reference and focal groups, respectively.
-
Structural Equation Modeling
Another class of techniques used to evaluate measurement invariance is structural
equation modeling (SEM). In particular, confirmatory factor analysis (CFA), a type of SEM
analysis, has been used extensively to study measurement invariance. CFA methods for
detecting measurement invariance involve an overall test of comparability of parameter values
across groups, followed by a series of more specialized comparisons to identify the source of
lack of equivalence, if indicated.
The measurement model for CFA can be written as
x = τ + Λxξ + δ , (7)
(Vandenberg & Lance, 2000). In this equation, x is a q × 1 vector of observed variables, ξ is an
n × 1 vector of latent variables, Λx is a q × n matrix of factor loadings, and δ is a q × 1 vector of
measurement errors in x. This equation also includes the τ vector of intercepts, although
generally the intercepts are assumed to be zero and are not estimated. In order to obtain a
covariance matrix, Λxξ + δ is multiplied by its transpose. Following the assumption that the
measurement errors are uncorrelated with each other and with the latent construct, the covariance
matrix (Σx) may be expressed as
Σx = ΛxΦΛx′ + Θδ , (8)
where Φ is the covariance matrix of the latent variables and Θδ is the diagonal matrix of
measurement error variances. While Equation 8 is identical to the measurement model for
exploratory factor analysis (EFA), CFA places restrictions on Λx, which differentiates CFA from
EFA.
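Equation 8 can be illustrated numerically; the loadings and error variances below are invented for a one-factor, three-indicator example.

```python
import numpy as np

# Sketch of the model-implied covariance matrix from Equation 8,
# Sigma_x = Lambda Phi Lambda' + Theta_delta, with made-up values
# for a one-factor model with three indicators (q = 3, n = 1).
Lambda = np.array([[0.8], [0.7], [0.6]])   # q x n factor loadings
Phi = np.array([[1.0]])                    # n x n latent covariance matrix
Theta = np.diag([0.36, 0.51, 0.64])        # diagonal error variances

Sigma = Lambda @ Phi @ Lambda.T + Theta
print(Sigma.round(2))

# Diagonal entries: loading^2 * factor variance + error variance;
# off-diagonal entries: products of loadings (e.g., 0.8 * 0.7 = 0.56).
```

Constraining elements of Λx, Φ, or Θδ to be equal across groups, as in the invariance tests below, amounts to forcing the groups' implied Σx matrices to share those components.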
Measurement invariance can be examined on many different levels using CFA. Because
Σx can differ across populations, it is possible to test for the equivalence of Σx , Λx , Φ,
-
and Θδ (Raju, Laffitte, & Byrne, 2002). A test of Σx = Σx′, where Σx refers to the covariance
matrix of the reference group and Σx′ refers to the covariance matrix of the comparison group,
can be thought of as an omnibus test of measurement invariance. If the null hypothesis is not
rejected using chi-square and other goodness-of-fit methods, the measure is generally accepted
as invariant and further tests are unnecessary. However, if the null hypothesis is rejected, lack of
invariance is indicated, and further tests should be conducted to identify the source of invariance
(Schmitt, 1982).
There are several types of tests of invariance to identify the source of lack of
measurement equivalence in a measure. Although there has been some inconsistency in the
literature with regard to terminology, and number of necessary tests and order of necessary tests
of invariance (Vandenberg & Lance, 2000), I will summarize the tests described in Vandenberg
and Lance (2000) in the order in which they are recommended.
The test of configural invariance evaluates whether patterns of significant and non-
significant factor loadings across groups are similar. The test of metric invariance (Λx = Λx′) is
concerned with the equality of the values of factor loadings across groups. The test of scalar
invariance (τx = τx′) indicates whether intercepts on the latent variable are equivalent across
groups. The test of the invariance of the unique variances across groups (Θδ = Θδ′) examines
whether like items’ uniquenesses are equivalent between groups. The test of factor variance
invariance (Φ = Φ′) examines whether respondents in different groups utilized a similar range of
responses along the answer continuum. Vandenberg and Lance (2000) also discuss the test of
equal factor covariances, although they do not find this test useful.
-
There have been several sets of step-by-step recommendations for undertaking the
aforementioned tests (e.g., Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000).
While different sequences of testing have been reported, the overarching idea is that each test is
undertaken sequentially, and at each step, more restrictive constraints are added to the model
(e.g., setting equal factor loadings on like items across groups). The more restricted model is
compared in terms of goodness of fit (i.e., χ2 value and other goodness of fit indices) to the less
restricted or baseline model. The source of lack of measurement invariance is indicated at the
level of testing when the more restricted model does not meet acceptable standards for goodness-
of-fit.
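The sequential comparison described above can be sketched as a chi-square difference test; the fit statistics below are invented for illustration, not results from this study.

```python
# Sketch of a chi-square difference (likelihood ratio) comparison between a
# baseline model and a more constrained model; the fit statistics here
# are hypothetical numbers, not values from this study.
baseline_chi2, baseline_df = 112.4, 48         # less restricted model
constrained_chi2, constrained_df = 131.9, 54   # e.g., loadings set equal

delta_chi2 = constrained_chi2 - baseline_chi2  # 19.5
delta_df = constrained_df - baseline_df        # 6

# The .05 critical value of chi-square with df = 6 is 12.59; a difference
# exceeding it suggests the added constraints significantly worsen fit,
# flagging lack of invariance at this step of testing.
CRITICAL_05_DF6 = 12.59
print(delta_chi2, delta_df, delta_chi2 > CRITICAL_05_DF6)
```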
Connections between CFA and IRT
Raju, Laffitte, and Byrne (2002) compared and contrasted CFA and IRT techniques for
evaluating measurement invariance. They reported that CFA and IRT methods are alike in that
they are both concerned with determining whether true scores are equivalent across groups for
respondents with equal amounts of the latent trait. The two techniques similarly allow for group
differences in the distributions of scores across subgroups. CFA and IRT also both give
information about the source and extent of any lack of measurement invariance identified.
While CFA and IRT are alike in certain respects, Raju, Laffitte, and Byrne (2002) also
noted several differences between the two techniques. Primarily, CFA modeling assumes a
linear relationship between the construct and items, while IRT assumes a non-linear relationship.
They found that with dichotomously scored items, the IRT approach is a more appropriate model
in terms of expressing the relationship between the measured variable and the continuous latent
construct. They found that the literature describing CFA models is more advanced in terms of
simultaneously looking at multiple latent constructs and multiple populations, although the IRT
-
literature is progressing in this direction. While CFA has been used to examine equivalence of
error variances across populations, IRT methods have not in practice examined an equivalent
statistic: the invariance of the standard error of measurement for θ across populations of
respondents. The CFA framework, on the other hand, does not have a means for determining the
probability of a respondent with a given θ selecting a particular response category. These two
techniques, when used in concert, can provide complementary information (Meade &
Lautenschlager, 2004; Reise et al., 1993).
Several connections between the statistical frameworks of IRT and CFA models have
been discussed recently. In particular, Takane and de Leeuw (1987) provided proofs showing
that a two-parameter normal ogive IRT model is equivalent to a CFA with dichotomous
variables. Researchers have extended this parallel to investigate DIF using a multiple-
indicator/multiple cause (MIMIC) model. The MIMIC model is a special case of the factor
analysis model which includes causal variables. Work by Muthén et al. (1991) provided
equations for converting MIMIC model parameters to IRT discrimination and difficulty
parameters. MacIntosh and Hashim (2003) presented a procedure for converting the standard
errors of these parameters from the MIMIC model estimates.
-
CHAPTER 2
LITERATURE REVIEW
The MIMIC Model
This paper is primarily concerned with demonstrating the use of MIMIC modeling to
identify DIF. The MIMIC model is an SEM-based alternative to the multiple-sample CFA
analysis described earlier. In a MIMIC model, one or more grouping (or background) variables
function simultaneously as contributors to differences in the latent trait and as covariates upon
which the outcome variables are regressed (Muthén, 1989). In this study, one dichotomous
grouping variable (gender) is included in the model.
The MIMIC model with a dichotomous grouping variable can be expressed as
η = γx′x + γz′z + ζ , (9)
where η is the latent trait, x represents observed background variables, z represents a dummy
variable, and ζ represents an error term, which is normally distributed and independent of x and
z (MacIntosh & Hashim, 2003). MIMIC modeling with categorical data, like IRT-based DIF
detection, includes comparing a latent response variable (yj*) to a threshold (τj). If the threshold
is exceeded (yj* > τj), then the indicator of the latent response variable (yj) is one. If the
threshold is not exceeded, yj is zero.
The latent response variable (yj*) can be modeled as a combination of the indirect effects
through the latent trait variable (η) and the direct effects of the dummy variable (zk), which is a
measure of the grouping variable (e.g., gender) that is being examined for potential contributions
to DIF:
-
yj* = λjη + βjzk + εj , (10)
where λj is the factor loading, βj is the slope relating the grouping variable to the response
variable, and εj is the random error (Finch, 2005).
The procedure for assessing DIF using MIMIC modeling entails estimating the direct and
indirect effects of group membership on the latent trait and item response (Finch, 2005).
Significant indirect effects of group membership on the item indicate that differences in item
response are influenced by group differences on the mean of the latent factor. For example,
assessing indirect effects can indicate whether a greater level of the latent trait “aggression” in
boys contributes to higher endorsement of items on an aggression scale.
Significant direct effects between the grouping variable and the item indicate that group
membership directly impacts item response, apart from any group difference on the latent trait.
Evaluating direct effects, after controlling for indirect effects, is the procedure for assessing
uniform DIF. For example, this procedure can indicate whether boys are endorsing higher levels
of aggression items after controlling for differences in the underlying trait “aggression” (Finch,
2005).
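The data-generating side of Equations 9 and 10 can be sketched with a small simulation; all parameter values below are hypothetical, and a nonzero direct effect β plays the role of uniform DIF.

```python
import numpy as np

# Sketch of the MIMIC data-generating model in Equations 9-10 for one
# item; every parameter value here is invented for illustration.
rng = np.random.default_rng(1)
n = 50_000

z = rng.integers(0, 2, size=n)           # grouping variable (e.g., gender)
gamma = 0.4                              # group effect on the latent trait
eta = gamma * z + rng.normal(size=n)     # Equation 9 (no other covariates)

lam, beta, tau = 0.9, 0.5, 0.8           # loading, DIF slope, threshold
y_star = lam * eta + beta * z + rng.normal(size=n)   # Equation 10
y = (y_star > tau).astype(int)           # observed indicator: 1 if y* > tau

# Both the indirect path (gamma * lam) and the direct path (beta) raise
# endorsement for z = 1; a MIMIC analysis separates these two effects.
print(y[z == 1].mean() > y[z == 0].mean())
```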
Advantages of the MIMIC model
There are several unique advantages to using a MIMIC model to identify DIF. A MIMIC
model can be used to simultaneously obtain estimates of group difference in item response (DIF)
and amount of the latent trait. That is, the MIMIC model provides information about the
structural model and the measurement model (Muthén, 1989). MIMIC models can be estimated
for data of ordinal or continuous scales, data with multiple grouping variables (including
grouping variables with more than two groups), and data with multiple independent variables,
including categorical or continuous variables (Glöckner-Rist & Hoijtink, 2003; Muthén, 1988).
-
Whereas IRT models require large sample sizes (Reise, Widaman, & Pugh, 1993), MIMIC
models can accommodate smaller sample sizes. Finally, researchers may be more familiar with
CFA-based procedures than analyses requiring knowledge of IRT and IRT software.
There are also some disadvantages of MIMIC modeling cited in the literature. Jones
(2006) notes that MIMIC modeling cannot account for a guessing parameter, unlike IRT-based
methods. However, in the behavior rating scales evaluated in this study, guessing is unlikely.
Another disadvantage is that a single-group MIMIC model can only identify uniform DIF,
although this study utilizes a multiple-group MIMIC approach, which is able to test for uniform
and non-uniform DIF.
DIF Detection using MIMIC Models: A Comparison to IRT Models
Finch (2005) conducted Monte-Carlo simulations to compare MIMIC model detection of
DIF to SIBTEST, IRT LR, and the Mantel-Haenszel (MH) statistic. He evaluated Type I error
rate and power under several sets of conditions, varying the size of the reference group (100 or
500), the number of items (20 or 50), the parameter model assessed, (three parameter logistic
model or two parameter logistic model), the amount of DIF contamination in the anchor items
(none or 15%), and the amount of DIF present in the target item (0 or .6). The study conditions
were completely crossed, and each combination of specifications was tested with 500
replications.
He found that the MIMIC model performed as well or better than traditional DIF
detection methods under some, but not all, conditions. Specifically, MIMIC performed well in
the 50 item test and when the two parameter logistic model was used. For the 50 item test, the
MIMIC model was more resistant than the other techniques to Type I error inflation when DIF
contamination was present in the anchor items. However, in the case where the exam was short
-
(20 items) and the three parameter logistic model was used, the MIMIC model had an
undesirably high rate of Type I error.
In terms of power, the MIMIC model performed well. Power was especially good when
the exam had 50 items and when the two parameter logistic model was used. Under these
conditions, the MIMIC model matched or exceeded the power of the other techniques.
Specifically, the MIMIC model outperformed the MH statistic and the SIBTEST when the
reference group was smaller (100) and the level of DIF contamination in the anchor items was
higher (15%).
Prior Studies Using MIMIC Modeling to Detect DIF
Several prior studies have used MIMIC modeling to check for DIF. Grayson et al. (2000)
used a MIMIC model to assess a depression scale for uniform DIF associated with demographic
(e.g., age), disability and physical disorder variables. They conducted their analyses in three
steps. First, a confirmatory factor analysis tested the acceptability of the structural model.
Second, each of the predictor variables (demographic, disability, physical disorder) was
introduced into the model serially. Each model included an indirect path from the predictor to
each item via the latent variable as well as a direct path to each item. Notably, the direct paths
from the predictor to the items were estimated in the same model and not sequentially. One
referent item was constrained to have zero bias for model identification reasons. For each
predictor, the researchers flagged significant parameter estimates for these paths linking items to
the predictor. The researchers frame this step as a screening procedure. That is, they were
interested in identifying predictors that had no significant direct effects on the items, and were
thus unlikely to be contributing to DIF. These predictors were then eliminated from the final
model. In the third and final step of the analyses, each of the significant predictors was added
-
into the model together, and the resulting multivariate model was estimated. The researchers
were again interested in identifying significant effects from the predictor to the items, although
in the multivariate model, all other predictors are held constant in the estimation procedure.
The researchers conducted the analysis using maximum likelihood parameter estimates.
The goodness of fit index (GFI), the Tucker-Lewis index (TLI), and the root mean square error
of approximation (RMSEA) values assessed model fit. The researchers were concerned about
violations to multivariate normality; therefore they used bootstrapping to obtain confidence
intervals. The researchers determined that a biased loading parameter estimate that exceeded
twice its standard error (z score of 2) was statistically significant in this context. The loadings on
the latent variable required a z score of 1.5 to reach significance.
In reporting the results, the researchers partitioned the effects of each predictor into a bias
effect (i.e., the sum of the direct effects from the predictor to the items) and an actual effect (i.e.,
the sum of the direct effects from the latent variable to the items, multiplied by the effect from
the predictor to the latent variable). These effects were compared to the critical ratios described
above (2 for bias parameters, 1.5 for direct effect from the predictor to the latent variable). Items
that exceeded the bias cut-off on one or more predictors were identified as exhibiting DIF.
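The partition into a bias effect and an actual effect can be illustrated with invented estimates for one predictor and three items.

```python
# Sketch of the effect partition described above, with invented estimates
# for one predictor and three items: the "bias effect" sums the direct
# predictor-to-item paths, while the "actual effect" routes through the
# latent variable.
direct_paths = [0.10, -0.05, 0.30]   # predictor -> item (bias paths)
loadings = [0.8, 0.7, 0.6]           # latent variable -> item
predictor_to_latent = 0.25           # predictor -> latent variable

bias_effect = sum(direct_paths)
actual_effect = sum(loadings) * predictor_to_latent

print(round(bias_effect, 3), round(actual_effect, 3))
```

In the procedure above, each parameter would then be compared against its critical ratio (2 for the bias parameters, 1.5 for the path from the predictor to the latent variable).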
Gallo et al. (1994) used MIMIC modeling to identify age-related uniform DIF in a
depression scale with marital status, minority status, cognitive status, and recent unemployment
in the model. The latent variable, depression, was regressed on each covariate. The analysis was
conducted by successively testing for significant parameter estimates between age and each item.
Analyses were conducted using LISCOMP’s limited-information generalized least
squares estimator for dichotomous responses. Model fit was reported using the descriptive fit
value, the goodness-of-fit index, the adjusted goodness-of-fit index, and the critical number.
Christensen et al. (1999) examined a depression scale and an anxiety scale for age-related
uniform DIF using a MIMIC model. They evaluated a two-factor measurement model including
five demographic covariates (age, sex, marital status, financial status, and level of education).
First, the researchers conducted a CFA to assess the fit of the measurement model.
Next, the researchers conducted analyses to select a referent item for the substantive DIF
analysis. For identification purposes, one item must be selected as the referent (i.e., no-DIF) item. The researchers tested each item individually (i.e., testing the significance of the direct paths to the item from each covariate consecutively) to determine which items showed no DIF across all of the covariates. An item that showed weak direct associations with the covariates
was selected as the referent.
Finally, the substantive model was analyzed. The model included paths from each
covariate to both latent variables, and to each item. This model was similar to the model
described in Grayson et al. (2000), in that DIF was estimated for each of the items
simultaneously (i.e., paths from the covariate to each of the items were included in one model).
Analyses were conducted with maximum likelihood estimation using Amos 3.6.1. Model fit was reported using the goodness-of-fit index (GFI), the non-normed fit index
(NNFI) and the root mean square error of approximation (RMSEA). The researchers were
concerned about violations to multivariate normality; therefore bootstrapping was used to obtain
confidence intervals.
In an article that demonstrated a procedure for calculating the standard error of the
estimates of IRT difficulty and discrimination from MIMIC model parameters, MacIntosh and
Hashim (2003) employed MIMIC modeling to identify uniform gender-related DIF on a scale
measuring racial prejudice. They included gender in the model as the predictor variable, along
with three other covariates (educational status, political conservatism, and religious
fundamentalism). The researchers conducted their analyses in two steps. First, they ran the
model with a path from gender to the latent variable and with one path directly from gender to an
item. The researchers obtained the squared multiple correlation of the latent variable (R²) from the output of this analysis. This value was used to set the variance of the latent variable for the second run to 1 − R². Once this variance was set, the model was sequentially run to test each item for DIF in the same manner as Gallo et al. (1994); that is, with parameter estimates from gender to the item.
The Mplus program was used to run the analysis. The researchers reported model fit
using the chi-squared value.
Gender-related DIF in Measures of Aggression
Obtaining accurate measures of self-reported aggression in adolescents is of interest to
researchers, educators, and policy-makers alike. Aggression scales have been used to calculate
the prevalence of aggression in schools (e.g., Nansel et al., 2001) and as outcome variables for
evaluating the impact of violence-prevention programs (e.g., Farrell, Meyer, & White, 2001).
Evaluating gender differences in levels of aggression and types of perpetration (e.g., physical, relational) has been a topic of particular consideration in the literature. Boys have generally scored higher than girls on measures of physical aggression (e.g., Bongers et al., 2004; Broidy, Nagin, Tremblay, Bates, Brame, Dodge, et al., 2003), although researchers have recently argued that adolescent aggression is not constrained only to physical acts. Notably, Crick and Grotpeter (1995) coined the term “relational aggression” to encompass behaviors that are purposefully damaging to the victim’s peer relationships, and argued that these aggressive behaviors are more common in girls than physical aggression.
While measuring gender differences on different types of aggression has been a topic of
considerable interest, there are significant gaps in the literature regarding the validity of the
instruments used, particularly in terms of measurement invariance. Typically, researchers have
evaluated gender differences in aggression by comparing means (e.g., Crick & Grotpeter, 1995;
Björkqvist, Lagerspetz, & Kaukiainen, 1992). As discussed earlier, a simple means comparison
may not be interpretable without first obtaining evidence of measurement invariance.
The purpose of this study is to evaluate the measurement invariance of two measures of
aggression (physical and relational) across gender groups in order to evaluate whether
differences in self-reported aggression are due solely to differences in the latent trait, or if
differences are partially due to factors related to measurement. This topic may be of particular
importance because the wording of items on the aggression scale may lend itself to differential
item functioning. In particular, the Crick and Grotpeter (1995) relational aggression scale was
developed with the aim of capturing behaviors that they theorized were common in girls. If the
authors were particularly concerned with female-oriented behavior, the wording of items may
reflect subtle gender-related bias. On the other hand, because physical aggression is more
commonly associated with boys, measures of physical aggression may include an over-
representation of items that are more salient to male respondents.
CHAPTER 3
PROCEDURE
Sample
Data for this study are from the GREAT Schools and Families project (GSF), a seven-year, multi-site violence prevention project described in detail in a supplement to the American Journal of Preventive Medicine (Horne, 2004). The data used here comprise the Spring 2004
student survey of a randomly-selected sample of students from one cohort at the University of
Georgia (UGA) site. Students attended one of nine middle schools in Northeast Georgia. Of the
719 students who were eligible to participate in GSF, 623 (87%) completed the Spring 2004
student assessment. At this assessment wave, all respondents were in 7th grade, unless they had
been retained.
The sample was 49% female. Of the 612 students who selected one race, 53% were
white, 34% were black, less than 1% were American Indian or Asian Indian, 2% were other
Asian, and 10% were some other race. There were 14 students (2%) who selected more than one
race. Twelve percent of students were Hispanic. Students ranged in age from 12 to 15.
Participants took the survey via computer assisted survey interview (CASI) using laptop
computers. Participants wore headphones, and the CASI program read the questions aloud.
Respondents recorded their answers via the keyboard. All surveys were proctored by GSF staff.
Students were assigned an ID number, and the survey was confidential.
This study examines three types of behaviors: physical aggression, relational aggression,
and overt victimization.
Instrumentation
The items that are examined in this study are taken from the problem behavior frequency
scale, a collection of 47 items grouped into 7 subscales that assess the 30-day frequency of
problem behaviors, such as aggression and delinquency. This study is concerned with three
subscales (henceforth called scales): physical aggression, relational aggression, and overt
victimization.
Physical aggression is a 7-item scale that measures self-reported physical aggression in
the past 30 days. The stem is, “In the past 30 days, how many times have you,” followed by
descriptions of physical aggression (e.g., hitting) or other serious aggressive behavior (e.g.,
threatening someone with a weapon). Relational aggression is a 6-item scale that measures self-
reported relational aggression in the past 30 days. The stem is, “In the past 30 days, how many
times have you,” followed by descriptions of relational aggression, such as spreading a false
rumor about someone. Overt victimization is a 6-item scale that measures self-reported
victimization in the past 30 days. The stem is, “In the past 30 days, how many times has this
happened to you,” followed by descriptions of victimization (e.g., been pushed).
For each item, the six Likert-type response categories range from “never” to “20 or more
times.” These responses are coded from one to six, respectively, and the scale score is the
average of the item scores.
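The scoring rule can be sketched as follows; the middle category labels and the example responses are assumptions for illustration:

```python
# Likert responses from "never" to "20 or more times" are coded 1-6;
# the scale score is the mean of the item codes. The two middle labels
# below are assumed (only the end and top categories are documented).
CATEGORIES = ["never", "1-2 times", "3-5 times", "6-9 times",
              "10-19 times", "20 or more times"]

def code_response(label: str) -> int:
    """Map a response label to its 1-6 code."""
    return CATEGORIES.index(label) + 1

def scale_score(item_codes):
    """Scale score = average of the item codes."""
    return sum(item_codes) / len(item_codes)

responses = ["never", "1-2 times", "never", "3-5 times"]  # hypothetical respondent
codes = [code_response(r) for r in responses]             # [1, 2, 1, 3]
print(scale_score(codes))                                 # -> 1.75
```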
In terms of reliability, each scale demonstrated acceptable internal consistency based on
the current dataset (physical aggression, α = .85; relational aggression, α = .81; overt
victimization, α = .86). High internal consistency indicates that respondents demonstrated
consistency in answer selection across subsets of items (Crocker & Algina, 1986, p. 135). These
values are consistent with or higher than internal consistencies reported for other data sources (Farrell et al., 2001; Crick & Bigbee, 1998; Orpinas & Frankowski, 2001).
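Coefficient alpha itself can be computed directly from a respondents-by-items score matrix. A minimal sketch using an invented dataset:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x items score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Tiny hypothetical dataset: 6 respondents x 3 Likert items (codes 1-6)
data = np.array([
    [1, 1, 2],
    [2, 1, 2],
    [4, 5, 4],
    [1, 2, 1],
    [3, 4, 4],
    [2, 2, 3],
])
print(round(cronbach_alpha(data), 3))  # -> 0.934
```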
In general, published validity evidence on the scales used is scarce. In many cases, the
internal consistency of the measure is offered as the only indicator of validity evidence (see
Dahlberg et al., 2005). More comprehensive validity evidence is available for one of the
measures used. The physical aggression scale was adapted in part from Orpinas and
Frankowski’s 2001 Aggression Scale. Validity studies conducted on this scale using three
samples of middle school students indicated that scores on the Aggression Scale were
significantly positively related to teacher-rated aggression, and self-reports of drug use, weapon-
carrying, and injuries due to fights (Orpinas & Frankowski, 2001).
Computer Program
The data were analyzed using SPSS Version 16.0, and Mplus Version 5 (Muthén &
Muthén, 2007). Syntax for all analyses is provided in Appendix A.
Detection of Gender-related DIF
Gender-related DIF detection was conducted on the two scales of interest: physical
aggression and relational aggression. A measure of victimization was included in the model as a
covariate. Victimization scores were categorized into three groups: no victimization reported
(30% of respondents), one instance of victimization reported (13% of respondents) and more
than one instance of victimization reported (57% of respondents). By adding victimization to the
model, it is possible to evaluate gender-related DIF while adjusting for the effect of victimization on the aggression measure. Experiences of victimization may account for part of the relationship between gender and the tendency to endorse items on the aggression scale. In other
words, differences in victimization may explain some of the gender differences in item
endorsement that I would have otherwise attributed to DIF.
Two approaches were used to evaluate the measurement invariance of the aggression
measures. First, the MIMIC approach was used to test each item for DIF. Second, I employed
multiple group analysis to obtain a test of the overall invariance of the measures across groups.
Each of the procedures was conducted for the physical aggression scale and the relational
aggression scale independently.
Multiple-Indicator/Multiple Cause Modeling
Single group MIMIC modeling was used to identify items that exhibit DIF. For
identification purposes, the mean of the latent variable was set to zero and the variance set to one
according to a procedure documented in MacIntosh and Hashim (2003). To set the latent mean
to zero, the exogenous variables (i.e., gender, victimization) are mean centered. To set the total variance of the latent factor to one, MacIntosh and Hashim (2003) demonstrate a two-step procedure. First, the model is run with no constraints on the variance of the latent factor to obtain the R² value for the latent factor. Second, the model is estimated again with the variance of the latent factor set to 1 − R². I received a warning when running the first step of the procedure regarding the identification of the model, and the program was not able to calculate the R² value for the latent factor. In order to identify the model for the first step, I set the variance of the latent variable to 1. After re-running the model, I was able to obtain the R² value for the latent factor. The second step was then run exactly as outlined above, and no further problems were encountered.
First, a baseline model containing no direct effects from gender to item responses (i.e., no
DIF) was analyzed. Girls are coded zero, and boys are coded one. Model fit was evaluated
using the chi-square (χ²) statistic along with two other fit indices: the root mean square error of approximation (RMSEA) and the comparative fit index (CFI). The χ² test evaluates the hypothesis that the original covariance matrix and the estimated or reproduced covariance matrix are identical. However, the χ² statistic has been shown to be sensitive to trivial differences among these matrices under large sample sizes (Bentler, 1990). In other words, significant p-values may be obtained even for models that fit the data well. In order to obtain a more comprehensive view of model fit, additional fit indices are therefore typically reported alongside χ².
The RMSEA is a stand-alone fit index that adjusts for the complexity of the model,
favoring parsimony. It is a standardized measure of the degree to which the population data do
not fit the model. Hu and Bentler (1998) suggest that values of .06 or lower are indicative of
good fit. The CFI is an incremental fit index that compares the amount of non-centrality in the χ2
distribution of the specified model to a baseline (null) model. Hu and Bentler (1998) recommend
using .95 (or above) as a cut off for good model fit for the CFI.
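Both indices can be computed from χ² values. The sketch below uses the study's baseline physical aggression values (χ² = 73.308, df = 20, N = 621) for the RMSEA under one common formula (software implementations differ in detail), while the null-model values for the CFI are invented:

```python
from math import sqrt

def rmsea(chi2: float, df: int, n: int) -> float:
    """RMSEA under one common formula: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """CFI compares the non-centrality (chi2 - df) of the specified model
    to that of the baseline (null) model."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 - d_m / d_b if d_b > 0 else 1.0

# chi2 = 73.308, df = 20, N = 621 are the study's baseline model values;
# the null-model chi2 and df below are invented for illustration.
print(round(rmsea(73.308, 20, 621), 3))       # -> 0.066
print(round(cfi(73.308, 20, 1500.0, 28), 3))  # -> 0.964
```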
Weighted least squares means and variances (WLSMV) estimation was used for all
analyses. WLSMV is a robust weighted least squares (WLS) estimator which, like WLS, relies on an asymptotic covariance matrix, making both estimators attractive options for use with non-normal data. However, WLS requires extremely large sample sizes and is thus not practical for most datasets. WLSMV is less computationally intensive than WLS, which results in smaller sample size requirements. WLSMV adjusts the mean and variance of the χ² value, along with parameter estimates and standard errors, to account for the level of non-normality in the data (Finney & DiStefano, 2006, p. 298).
Next, a series of models were run to test for uniform DIF. Each item was tested
sequentially by adding a direct path (β) from gender to the item in the model. A significant
parameter estimate for this direct path is indicative of uniform DIF. That is, gender is explaining
differences in item means above and beyond that which is explained via the indirect path of
gender to the item through the latent variable.
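The logic of this test can be illustrated with an observed-score analogue: regress an item on a proxy for the latent trait plus gender, and a nonzero gender slope mirrors a significant direct path. This is only a simplified sketch with simulated continuous data and invented effect sizes, not the latent-variable MIMIC model estimated in Mplus:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated latent aggression trait and gender (0 = girls, 1 = boys);
# all loadings and effect sizes below are invented for illustration.
theta = rng.normal(size=n)
gender = rng.integers(0, 2, size=n).astype(float)

# Three continuous "items" load on the trait; item 2 also receives a
# direct gender effect, i.e., uniform DIF.
items = np.column_stack([
    0.8 * theta + rng.normal(scale=0.6, size=n),
    0.7 * theta + 0.5 * gender + rng.normal(scale=0.6, size=n),  # DIF item
    0.6 * theta + rng.normal(scale=0.6, size=n),
])

def gender_coefficient(y, trait_proxy, g):
    """OLS of an item on [intercept, trait proxy, gender]; the gender slope
    is the observed-score analogue of the direct path beta."""
    X = np.column_stack([np.ones_like(y), trait_proxy, g])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[2]

coefs = {}
for j in range(3):
    # The rest score (sum of the other items) stands in for the latent trait
    rest = items[:, [k for k in range(3) if k != j]].sum(axis=1)
    coefs[j + 1] = gender_coefficient(items[:, j], rest, gender)
    print("item", j + 1, "gender coefficient:", round(coefs[j + 1], 2))
```

The planted DIF item recovers a clearly positive gender slope, while the clean items do not; in the actual analysis the analogous judgment is made on the z statistic of the estimated β.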
In order to test whether including DIF effects (β) results in significantly improved model fit, a
DIF model was created by including all significant β’s in the model. The fit of the DIF model
was compared to that of the baseline model using a χ² difference test. It is important to note that under WLSMV estimation, the χ² values and degrees of freedom function differently than under maximum likelihood (ML) based estimators. According to comments posted by Linda Muthén on the Mplus Discussion Board (2007), the χ² values and degrees of freedom obtained from WLSMV estimation cannot be interpreted in the same way as values obtained by ML estimation (e.g., degrees of freedom calculated as the number of elements in the covariance matrix minus the number of estimated parameters), because the values obtained under WLSMV are adjusted to yield accurate p-values.
two nested models estimated using WLSMV cannot be conducted in a straightforward manner
by subtracting the χ2 value and degrees of freedom of the unconstrained model from the
constrained model. Instead, the DIFFTEST command in Mplus must be used to obtain the
accurate p-value for this test.
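For contrast, the naive difference test that is valid under ML estimation can be sketched as follows (the χ² values are hypothetical; under WLSMV this computation would be incorrect and DIFFTEST is required):

```python
# Critical values of the chi-square distribution at alpha = .05
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 7: 14.067, 15: 24.996}

def chi2_difference_test(chi2_constrained, df_constrained, chi2_free, df_free):
    """Naive ML chi-square difference test for nested models: the difference
    is significant at .05 if it exceeds the critical value for the df
    difference. NOT valid for WLSMV chi-squares (use Mplus's DIFFTEST)."""
    d_chi2 = chi2_constrained - chi2_free
    d_df = df_constrained - df_free
    return d_chi2, d_df, d_chi2 > CHI2_CRIT_05[d_df]

# Hypothetical ML chi-square values for a constrained vs. a freed model
print(chi2_difference_test(88.0, 20, 70.0, 17))  # -> (18.0, 3, True)
```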
Multiple Group Analysis
Multiple group analysis was used to further evaluate the equivalence of the factor
structures of the physical and relational aggression scales across gender. Multiple group
analysis considers the fit of the model as equality constraints are placed on two single-gender
groups. For each scale, the data from each single-gender group are first fit to a baseline or
configural model without equality constraints to determine whether the same pattern of
23
-
significant and non-significant parameter estimates holds across groups. If so, there is evidence
of configural invariance, and a more restrictive model is placed on the data. According to the
guidelines provided in the Mplus manual for conducting multiple group analysis with ordered
categorical data, the tests of metric invariance and scalar invariance described earlier are
combined into one test. That is, the factor loadings and thresholds are constrained all in one step,
“because the item probability curve is influenced by both parameters” (Muthén & Muthén, 2007,
p. 399). Because the two models are nested, χ2 difference tests are conducted to determine if the
added constraints result in significantly poorer fit.
CHAPTER 4
RESULTS
Descriptive Statistics
Means and standard deviations for the physical aggression and relational aggression
scales are presented separately by gender in Table 1. Spearman item intercorrelations, skewness,
and kurtosis, are provided in Table 2.
An examination of descriptive statistics (i.e., skewness, kurtosis) revealed violations of
univariate normality in several items. Because these scales measure the 30-day frequency of
aggressive behaviors, it is unsurprising that responses are skewed toward “never.” Two items on the physical aggression scale and four items on the relational aggression scale had skewness and kurtosis values outside of the acceptable range suggested by Kline (2005): |3| for skewness and |10| for kurtosis. Because these values fall outside of the acceptable range, these items cannot be assumed to be univariate normal (D’Agostino, 1986), and the data
as a whole cannot be assumed to be multivariate normal.
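The screening rule can be sketched as follows, using Kline's cut-offs of |3| for skewness and |10| for kurtosis and an invented item distribution:

```python
import statistics as st

def skewness(x):
    """Third standardized moment (population formula)."""
    m, s, n = st.fmean(x), st.pstdev(x), len(x)
    return sum((v - m) ** 3 for v in x) / (n * s ** 3)

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (population formula)."""
    m, s, n = st.fmean(x), st.pstdev(x), len(x)
    return sum((v - m) ** 4 for v in x) / (n * s ** 4) - 3

def flag_nonnormal(x, skew_cut=3.0, kurt_cut=10.0):
    """Flag an item whose |skewness| or |kurtosis| exceeds the cut-offs."""
    return abs(skewness(x)) > skew_cut or abs(excess_kurtosis(x)) > kurt_cut

# Hypothetical item: almost everyone answers "never" (coded 1), a few extremes
item = [1] * 95 + [2, 2, 3, 5, 6]
print(round(skewness(item), 2), round(excess_kurtosis(item), 2))
print(flag_nonnormal(item))
```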
Because of the non-normality in the data, I selected an estimation method that was not
based on normal theory, weighted least squares means and variances (WLSMV). As discussed
earlier, WLSMV accounts for the categorical nature of the data and adjusts for non-normality,
specifically through weighted least squares parameter estimates, mean- and variance- adjusted χ2
values, and scaled standard errors.
Prior to conducting the MIMIC and multiple groups analyses, means and standard
deviations of items were examined for differences in gender. For the physical aggression scale,
boys reported more aggression than girls across all items except for one. The mean of item 3
(“threatened to hurt a teacher”) was slightly higher for girls (M=1.17) than boys (M=1.11).
Overall, the item standard deviations were smaller for girls than for boys. For the relational
aggression scale, item means for girls’ and boys’ scores were quite similar, except for one item.
The mean of item 6 (“said things about another student to make other students laugh”) was
higher for boys (M=2.43) than girls (M=2.08). All of the item standard deviations were smaller
for girls than for boys.
Table 1
Physical and Relational Aggression Items: Means and Standard Deviations for Boys and Girls

                                                                   Boys          Girls
Scale and Item                                                    M     SD      M     SD
Physical Aggression
  1. Thrown something at another student to hurt them             1.87  1.283   1.59  0.999
  2. Been in a fight in which someone was hit                     1.81  1.253   1.41  0.910
  3. Threatened to hurt a teacher                                 1.11  0.498   1.17  0.705
  4. Shoved or pushed another kid                                 2.50  1.583   2.05  1.391
  5. Threatened someone with a weapon (gun, knife, club, etc.)    1.19  0.755   1.10  0.535
  6. Hit or slapped another kid                                   2.06  1.486   1.82  1.154
  7. Threatened to hit or physically harm another kid             1.80  1.392   1.56  1.133
Relational Aggression
  1. Didn't let another student be in your group anymore
     because you were mad at them                                 1.63  1.094   1.64  0.987
  2. Told another kid you wouldn't like them unless they did
     what you wanted them to do                                   1.19  0.701   1.20  0.656
  3. Tried to keep others from liking another kid by saying
     mean things about him/her                                    1.46  1.065   1.44  0.869
  4. Spread a false rumor about someone                           1.35  0.940   1.31  0.757
  5. Left another kid out on purpose when it was time to do
     an activity                                                  1.46  1.068   1.43  0.940
  6. Said things about another student to make other
     students laugh                                               2.43  1.640   2.08  1.314

N = 621 (313 boys, 308 girls)
Table 2
Physical and Relational Aggression Items: Intercorrelations, Skewness, and Kurtosis

Item        PA1   PA2   PA3   PA4   PA5   PA6   PA7   RA1   RA2   RA3   RA4   RA5   RA6
PA1          --
PA2         .45    --
PA3         .21   .21    --
PA4         .48   .42   .20    --
PA5         .36   .32   .31   .31    --
PA6         .49   .44   .19   .65   .31    --
PA7         .48   .43   .30   .50   .41   .54    --
RA1         .34   .30   .16   .30   .11   .24   .29    --
RA2         .30   .29   .22   .29   .23   .22   .28   .36    --
RA3         .35   .31   .17   .38   .24   .33   .36   .38   .44    --
RA4         .38   .30   .22   .36   .21   .30   .39   .36   .42   .53    --
RA5         .35   .25   .16   .32   .19   .29   .36   .44   .36   .56   .44    --
RA6         .41   .31   .13   .55   .26   .51   .48   .31   .26   .41   .31   .37    --
Skewness   2.11  2.23  5.63  1.21  5.69  1.68  2.20  2.31  4.59  3.02  3.62  3.05  1.31
Kurtosis   4.51  4.93 34.88  0.51 34.85  2.24  4.16  6.05 24.39 10.18 14.88  9.90  0.78

Note. PA = physical aggression item; RA = relational aggression item. Correlations are Spearman coefficients.
Outliers
The existence of outliers in data can be problematic. On one hand, outliers can exert
disproportionate influence on results and can sometimes be the result of data entry errors or
untruthful respondents. On the other hand, outliers can reflect true variation in respondents (e.g.,
highly deviant behaviors) and are therefore of interest to the researcher.
In order to screen for outliers, I used the macro given in DeCarlo (1997). DeCarlo’s
macro calculates the significance of multivariate outliers and the Mahalanobis distance for each
observation. Fifty-three observations, or 9% of the total sample, met the criteria for multivariate
outliers, using the critical value at the .05 level of significance. These outliers were not removed
from the dataset because I hypothesized that these values were most likely due to true differences
in students and not errors in the dataset or untruthful responses.
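Setting the SPSS macro aside, the same screening criterion can be sketched in Python: the squared Mahalanobis distance of each observation is compared to the χ² critical value (.05 level) with degrees of freedom equal to the number of variables. The data below are simulated:

```python
import numpy as np

def mahalanobis_sq(data: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row from the sample centroid."""
    centered = data - data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    # Row-wise quadratic form: d_i^2 = (x_i - mean)' S^{-1} (x_i - mean)
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

rng = np.random.default_rng(1)
sample = rng.normal(size=(200, 3))   # hypothetical 3-variable dataset
sample[0] = [8.0, -8.0, 8.0]         # planted multivariate outlier

d2 = mahalanobis_sq(sample)
crit = 7.815                         # chi-square .05 critical value, df = 3
outliers = np.where(d2 > crit)[0]
print(0 in outliers)                 # -> True (the planted outlier is flagged)
```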
Missing Value Treatment
There was very little missing data in the dataset. In total, only two students did not
answer every item. The Mplus program accommodates missing values using full information
maximum likelihood estimation.
Physical Aggression Scale
Parameter estimates and standard errors for the MIMIC analysis for the physical
aggression scale are presented in Table 3. For the baseline (i.e., no DIF) model, the fit of the
physical aggression scale was adequate (χ²(20) = 73.308, p < .001). Three items had statistically significant β estimates: item 2 (“been in a fight in which someone was hit”), item 3 (“threatened to hurt a teacher”), and item 6 (“hit or slapped another kid”). These significant values provide evidence that these items may exhibit
different measurement characteristics based on the gender of the respondent.
Next, a DIF model was estimated by including all three significant β’s in the model. The fit of the model was adequate (χ²(18) = 54.971, p < .001), and the χ² difference test was significant (Δχ²(3) = 23.711, p < .001), indicating that including the DIF effects significantly improved model fit.
Table 3 (continued)

                           Estimate    SE      Est./S.E.
τ (thresholds, continued)
Y3              4           2.290     0.157    14.593
                5           2.560     0.196    13.052
Y4              1          -0.259     0.055    -4.719
                2           0.594     0.056    10.688
                3           1.062     0.063    16.757
                4           1.393     0.068    20.586
                5           1.661     0.076    21.868
Y5              1          -3.121     0.308   -10.120
                2           1.608     0.118    13.592
                3           2.002     0.139    14.417
                4           2.234     0.163    13.725
                5           2.357     0.178    13.244
Y6              1           0.055     0.053     1.039
                2           0.882     0.061    14.535
                3           1.320     0.070    18.943
                4           1.628     0.080    20.344
                5           1.829     0.090    20.401
Y7              1           0.532     0.057     9.323
                2           1.142     0.068    16.809
                3           1.526     0.075    20.248
                4           1.704     0.081    21.114
                5           1.901     0.087    21.822
β
Y1                          0.023     0.082     0.286
Y2                          0.364     0.095     3.817
Y3                         -0.319     0.155    -2.061
Y4                          0.101     0.077     1.318
Y5                          0.008     0.165     0.050
Y6                         -0.222     0.076    -2.903
Y7                         -0.101     0.084    -1.204
γ
Gender                      0.136     0.081
Victimization               0.519     0.046
Ψ
Physical Aggression         0.297
Table 4
Sequential Chi-Square Tests of Invariance for Physical and Relational Aggression Scales

Scale                  Analysis        Model      χ²       df    p      Δχ²     Δdf    p
Physical Aggression    MIMIC           Model 0    73.308   20   .000     --     --    --
                                       Model 1    54.971   18   .000   23.711    3   .000
                       Multiple Group  Model 2   119.991   22   .000     --     --    --
                                       Model 3   112.916   28   .000   37.041   15   .001
Relational Aggression  MIMIC           Model 0    56.385   15   .000     --     --    --
                                       Model 1    50.111   14   .000    6.154    1   .013
                       Multiple Group  Model 2    56.414   22   .000     --     --    --
                                       Model 3    35.598   18   .007    0.340    7   .999

Model 0: No direct DIF effects
Model 1: All significant DIF effects included
Model 2: No invariance constraints
Model 3: Invariance of loadings and thresholds
I next used multiple group analysis to further evaluate the equivalence of the factor
structures across gender. Before beginning the process, it was necessary to collapse the fifth and
sixth response categories (“10-19 times” and “20 or more times”) for item 2 (“been in a fight in
which someone was hit”) because no females endorsed the fourth item response. In this case, the
program cannot estimate a threshold and the categories must be collapsed. Because of problems
running the model (i.e., non-positive definite matrix), victimization was removed from the
model. After removing victimization, no additional problems were encountered running the
analyses.
In order to identify the model, I followed the recommendations from the Mplus Manual
(Muthén & Muthén, 2007, p. 398). That is, for the baseline model (i.e., test of configural
invariance), all of the item error variances are set to one, the factor means are set to zero for both
groups, one item loading is set to one in both groups, and one threshold per item is held invariant
(one additional threshold is held invariant for the item with the factor loading set to one). It is
important to note that by setting loadings and thresholds to one across groups, these reference
parameters are assumed to be invariant. The loading of the first item was chosen as the reference
parameter because this item showed little evidence of lack of invariance in the MIMIC analysis.
The highest threshold was chosen to be held invariant on the theoretical grounds that there may
be less variation across gender on the top end of the scale (i.e., differences between respondents
endorsing one of two answer categories corresponding to high frequency of aggression) than at
the bottom (i.e., differences between respondents endorsing one of two answer categories
corresponding to no aggression versus low frequency of aggression).
Table 4 provides a summary of the sequential χ2 difference tests for the multiple group
analysis. The fit of the baseline model was mediocre (χ²(22) = 119.991, p < .001). After the factor loadings and thresholds were held equal across groups (χ²(28) = 112.916, p < .001), the χ² difference test was significant (Δχ²(15) = 37.041, p = .001). This result indicates that the added constraints resulted in a statistically significant decrease in model fit, suggesting that the physical aggression scale does not exhibit full measurement invariance across gender.
Relational Aggression Scale
Parameter estimates and standard errors for the MIMIC analysis for the relational
aggression scale are presented in Table 5. The fit of the relational aggression scale for the
MIMIC analysis was adequate (χ²(15) = 56.385, p < .001). One item, item 6 (“said things about another student to make other students laugh”), had a statistically significant direct effect from gender (β = .211, z = 2.479). A DIF model including this effect fit the data significantly better than the baseline model (Δχ²(1) = 6.154, p = .013).
Table 5
Multiple-Indicator/Multiple-Causes (MIMIC) Model Estimates for the Relational Aggression Scale

                           Estimate    SE      Est./S.E.
λ
Y1                          0.730     0.035    20.561
Y2                          0.852     0.040    21.320
Y3                          0.960     0.024    40.630
Y4                          0.856     0.036    23.545
Y5                          0.905     0.027    33.667
Y6                          0.641     0.033    19.639
τ (thresholds)
Y1              1           0.278     0.052     5.361
                2           1.174     0.066    17.853
                3           1.652     0.084    19.704
                4           1.901     0.097    19.621
                5           2.055     0.112    18.370
Y2              1           1.221     0.069    17.781
                2           1.749     0.089    19.557
                3           2.099     0.122    17.137
                4           2.353     0.150    15.736
                5           2.411     0.156    15.477
Y3              1           0.651     0.056    11.634
                2           1.401     0.078    18.060
                3           1.806     0.095    19.010
                4           2.029     0.110    18.375
                5           2.057     0.112    18.330
Y4              1           0.933     0.065    14.328
                2           1.713     0.091    18.767
                3           2.061     0.111    18.620
                4           2.199     0.121    18.199
                5           2.342     0.133    17.634
Y5              1           0.710     0.057    12.501
                2           1.429     0.083    17.129
                3           1.779     0.097    18.284
                4           1.960     0.106    18.466
                5           2.062     0.114    18.006
Y6              1          -0.300     0.052    -5.710
                2           0.630     0.058    10.919
                3           1.030     0.061    16.760
                4           1.315     0.068    19.346
                5           1.534     0.075    20.438
β
Y1                         -0.092     0.086    -1.061
Y2                         -0.015     0.114    -0.135
Y3                         -0.121     0.083    -1.458
Y4                          0.016     0.097     0.161
Y5                         -0.013     0.091    -0.146
Y6                          0.211     0.085     2.479
γ
Gender                     -0.060     0.093    -0.649
Victimization               0.468     0.052     9.067
Ψ
Relational Aggression       0.175      --

-- No responses
I next used multiple group analysis to further evaluate the equivalence of the factor
structures across gender. Similar to the physical aggression scale, it was necessary to collapse
the fifth and sixth response categories (“10-19 times” and “20 or more times”) for two items. For item 2 (“told another kid you wouldn’t like them unless they did what you wanted them to”), no respondents endorsed the fifth response category. For item 3 (“tried to keep others from liking another kid by saying mean things about him/her”), no girls endorsed the fifth response category.
No other problems were encountered, and the model was estimated with victimization in the
model.
The same procedures that were described earlier were followed to identify the model for
multiple groups analysis. The first item loading was held equal across groups, because this item
did not display evidence of DIF in the MIMIC modeling procedure. The highest threshold for
each item was held invariant for the same reason described in the physical aggression section.
Table 4 provides a summary of the sequential χ2 difference tests for the multiple group
analysis. The fit of the baseline model was mediocre (χ²(22) = 56.414, p < .001; CFI = .98). After holding the factor loadings and thresholds equal across groups, the model fit adequately (χ²(18) = 35.598, p = .007; RMSEA = .056, CFI = .99). The χ² difference test was not significant (Δχ²(7) = 0.340, p = .999). This result indicates that adding the additional constraints to the model does not result in a statistically significant decrease in model fit, which supports the hypothesis that the relational aggression scale does exhibit measurement invariance across gender.
CHAPTER 5
SUMMARY AND DISCUSSION
Summary
The purpose of this study was to evaluate two measures of aggression in adolescents for
gender-related DIF. MIMIC modeling was used to test for direct effects of gender on each item
in the two scales. Another procedure, multiple group analysis, was used to provide additional
information about the measurement invariance of the scales across gender.
For the physical aggression scale, three items displayed evidence of DIF. The model that
did not include these DIF effects fit the data significantly worse than the model that did. This finding suggests that there are measurement differences between boys and girls on these items. The results of the multiple
group analysis buttressed this finding. Placing equality constraints on the loadings and
thresholds across the single-gender groups resulted in significantly poorer model fit.
For the relational aggression scale, one item displayed evidence of DIF. The model that
did not account for this DIF effect fit the data significantly worse than the model that did include
the DIF effect, indicating a lack of invariance for this item. Although there is evidence that one
item displays DIF, the results of the multiple group analysis indicated that the scale as a whole
does exhibit gender-related measurement invariance. Imposing equality constraints on the
loadings and thresholds of the two groups did not significantly decrease the fit of the model in
comparison to the unconstrained model.
Discussion
While statistical procedures can flag items for DIF and identify lack of invariance in
scales, it is up to the researcher to provide possible explanations for the measurement
differences. Although it is not possible to conclusively speak about the reasons for the
measurement differences, additional investigation of the flagged items can provide some
tentative explanations of the findings.
In the physical aggression scale, item 2 (“been in a fight in which someone was hit”)
exhibited DIF. The estimate of β for this item was .364. Because boys are coded 1, this estimate
indicates that in the model either boys endorsed higher values for this item due to gender above
and beyond the indirect effect of gender through latent physical aggression or, alternatively, that
girls endorsed lower values. In order to gain additional information about patterns of responses
among boys and girls, I examined item intercorrelations and response frequencies from the
marginal tables (not shown), although this information is not based on the structural equation
modeling estimated parameters. An examination of item intercorrelations did not reveal any
clear pattern of differences between item 2 and the other items (i.e., item 2 correlations were
mostly consistent with the magnitude of the correlations among other items). An examination of
the response frequencies provides additional information. Boys are less likely than girls to
choose the response “never” (56.9% to 77.3%, respectively). There are differences at the top end
of the scale as well. While 4.9% of girls selected one of the last three response categories (“6-9
times,” “10-19 times,” or “20 or more times”), 9.9% of boys chose one of these categories. It is
possible that boys are more likely than girls to endorse higher levels of this item for reasons
other than increased physical aggression. For example, boys may be more likely to engage in
play fighting, which may result in one of the participants getting hit. Although this behavior can
be considered aggressive, it is different in nature than the malevolent aggression that this scale is
intended to measure.
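The interpretation of β as a direct gender effect over and above latent aggression can be illustrated with a small simulation. This is a minimal sketch with made-up loading, indirect-effect, and variance values, not the fitted model; only the β of .364 is taken from the results above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
gender = rng.integers(0, 2, size=n)        # 1 = boy, 0 = girl
# Latent physical aggression, with a modest indirect gender effect (made up)
eta = 0.3 * gender + rng.normal(0.0, 1.0, n)

lam, beta = 0.8, 0.364                     # illustrative loading; beta from the text
# MIMIC-style latent item response: factor effect + direct gender (DIF) effect
y_star = lam * eta + beta * gender + rng.normal(0.0, 1.0, n)

# Removing the factor's contribution isolates the DIF effect: the remaining
# group gap reflects gender-related measurement difference, not aggression
resid = y_star - lam * eta
gap = resid[gender == 1].mean() - resid[gender == 0].mean()
print(f"residual gender gap ~ {gap:.2f}")  # close to beta = 0.364
```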
Item 3 (“threatened to hurt a teacher”) also exhibited DIF. The estimate of β for this item
was -.319, indicating that in the model either girls were more likely or boys were less likely to
endorse higher values for this item due to gender-related factors beyond differences in latent
physical aggression. Examining the item intercorrelation matrix revealed that the correlations
for this item were somewhat lower than other items on the physical relation scale. In terms of
response category distribution from the marginal table, the pattern for girls and boys looks quite
similar. However, more girls than boys indicated that they had threatened to hurt a teacher 10-19
or 20 or more times (2.0% versus 0.6%, respectively). While it is difficult to determine why this
item exhibits DIF, boys who are otherwise highly aggressive may be less likely to threaten, or to
admit to threatening, teachers. This could be because many teachers are
female and there is a strong social norm against males hitting or threatening females in the
Southern U.S.
Item 6 (“hit or slapped another kid”) was also flagged for DIF. The β estimate for this
item was -.222, which indicates that in the model females may be more likely or males may be
less likely to select higher values on this item due to measurement differences rather than true
differences in physical aggression. Examining the item intercorrelation matrix revealed that the
correlations for this item were fairly consistent with the correlations between the other items. In
terms of response category frequencies from the marginal table, substantial differences exist in
the high end of the scale. Where 2.6% of girls reported hitting or slapping another kid 20 or
more times, 7% of boys reported this behavior. It is possible that the word “slapping” resonates
more with girls, as this behavior in our culture is more often associated with females.
For the relational aggression scale, item 6 (“said things about another student to make
other students laugh”) demonstrated DIF. The β estimate for this item was .211, which indicates
that males may be more likely or females may be less likely to select higher values on this item
than would be expected based on their level of latent relational aggression. Interitem correlations
for this item were somewhat lower than correlations among other items on the relational
aggression scale. In terms of response category frequencies from the marginal table, boys were
more likely than girls to select the response “20 or more times” (9.9% versus 5.5%,
respectively). It is possible that for adolescent boys, the act of making fun of others is a more
common experience that is not necessarily tied to a malevolent or aggressive motive.
One important aspect of testing for measurement invariance is deciding what to do if full
measurement invariance is not supported. It may not be practical to remove all items from a
scale that exhibit DIF. Hambleton (2006) suggests that it is common to find many items that
display DIF in psychological, attitudinal, or personality measures. He notes that it is possible
that these effects, taken as a whole, may cancel each other out. That is, while some items may
exhibit DIF in one direction, others counter those effects, so overall the measure does not favor
one group over the other.
It is also possible that while statistically significant DIF effects may exist in the model,
the impact of these effects is not of practical importance. With larger sample sizes the power to
detect DIF increases. Because the present dataset is reasonably large, it is possible that some of
the significant DIF effects correspond to measurement differences that are not practically
meaningful. In future studies, it may be desirable to examine effect size measures of the DIF
items.
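The point about power can be made concrete with back-of-the-envelope arithmetic: for a fixed DIF effect, the standard error of β shrinks roughly with the square root of the sample size, so the same small effect moves from non-significant to highly significant as n grows. The numbers below are hypothetical.

```python
import math

beta = 0.10          # a small, fixed DIF effect (hypothetical)
se_at_500 = 0.06     # hypothetical standard error of beta at n = 500

for n in (500, 2000, 8000):
    se = se_at_500 * math.sqrt(500 / n)   # SE shrinks ~ 1/sqrt(n)
    z = beta / se
    print(f"n = {n:5d}: z = {z:.2f}")     # crosses |z| > 1.96 as n grows
```

This is why an effect-size measure, rather than significance alone, would be informative for the flagged items.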
Limitations and Future Research
There are several limitations inherent to the research presented in this paper. An
important limitation is that for all of the tested models, fit was less than ideal. While the fit of
the MIMIC models was adequate, the fit was mediocre for the multiple group analysis models,
especially for physical aggression. When model fit is borderline in terms of acceptability, it
becomes harder to draw conclusions based on those models. For example, a non-significant χ2
difference test may be reflective of problems with the overall fit of the model, rather than lack of
fit caused only by introducing equality constraints. The less than ideal fit of the models for
multiple groups analysis may be due to constraints that are put on the data for identification
purposes. This possibility is discussed later in this section.
Another problem is that the generalizability of the findings may be limited by the study
sample. All of the participants were middle school students from one region of the U.S. While
in this population the aggression measures examined may not demonstrate full measurement
invariance, this result may not hold for other groups of students, or for the same students at a
different point in time (e.g., in high school).
Another limitation concerns estimation method. Although WLSMV was likely the most
appropriate estimator for the data, based on the categorical and non-normal nature of the data,
there are also some disadvantages to using WLSMV. First, WLSMV is a relatively new
estimation method, and little research has examined its functioning. Overall,
however, research has supported its use as an improvement upon WLS estimation (Muthén,
1993).
WLSMV also provides challenges in terms of identifying the model when thresholds are
being estimated. In multiple group analysis, one threshold per item must be held invariant across
groups. Ideally, the constrained threshold would truly be invariant across groups, although this
is difficult to verify. Systematically testing each threshold becomes time-consuming
when there are many thresholds per item. However, if a non-invariant threshold is set to equality
in the multiple group analysis, the overall model fit may be worsened. Additional research into
this aspect of multiple group analysis with categorical data is warranted.
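In a single group, an item's thresholds correspond to inverse-normal transforms of the cumulative response proportions: τk marks the point on the underlying standard normal response distribution below which category k or lower is chosen. The sketch below (hypothetical counts, assuming numpy and scipy) computes such thresholds for one group; the invariance constraints discussed above amount to requiring these quantities to be equal across groups.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical response counts for one 6-category item in one gender group
counts = np.array([770, 120, 50, 30, 20, 10])
props = counts / counts.sum()

# Cumulative proportions below each of the 5 thresholds
cum = np.cumsum(props)[:-1]
# tau_k = Phi^{-1}( P(response <= category k) )
thresholds = norm.ppf(cum)
print(np.round(thresholds, 2))
```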
Second, WLSMV provides less information in multiple group analysis than estimators for
continuous data. For example, under MLM estimation, tests of metric and scalar
invariance could have been conducted separately, whereas under WLSMV these tests are
combined. By conducting metric and scalar tests separately, the source of lack of invariance
(slopes or intercepts) can be investigated. While under WLSMV it is possible to identify the
model in such a way as to estimate some differences in loadings and thresholds, this procedure is
not recommended in the Mplus manual because the loadings and thresholds together contribute
to the item characteristic curve (Muthén & Muthén, 2007, p. 299).
Another drawback to using WLSMV estimation is that modification indices are not
available. Under multiple group analysis, modification indices can be used to identify items that
potentially are problematic in terms of non-invariant loadings or intercepts. Modification indices
are often used to test for partial invariance, where some parameters are held invariant, while
problematic items are freely estimated. Without modification indices, I was unable to compare
whether the two methods (MIMIC modeling and multiple groups analysis) identified the same
problematic items.
Certain properties of the present dataset may make accurate estimation difficult.
Specifically, several of the items used were severely non-normal, with kurtosis values as high as
34.882. While WLSMV does adjust parameter estimates and χ2 values for non-normality, the
accuracy of WLSMV adjustments in cases of severe non-normality has not been studied
extensively. In some cases, the WLSMV estimates run counter to intuition, such as more
restricted models having better fit index values than less restricted models.
Additional studies of the functioning of WLSMV under conditions of severe non-normality
would be useful.
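The kind of severe non-normality at issue is easy to reproduce: a behavior-frequency item on which most respondents answer "never" yields large positive skewness and excess kurtosis. A minimal sketch (simulated data, assuming numpy and scipy; the category probabilities are made up):

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(1)
# Simulate a behavior-frequency item: most respondents report "never" (0),
# with a long right tail of higher categories
item = rng.choice([0, 1, 2, 3, 4, 5], size=2000,
                  p=[0.77, 0.12, 0.05, 0.03, 0.02, 0.01])

# scipy's kurtosis() returns excess kurtosis (0 for a normal distribution)
print(f"skewness = {skew(item):.2f}, excess kurtosis = {kurtosis(item):.2f}")
```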
The findings of this study did not support the full measurement invariance of two
measures of adolescent aggression. While lack of invariance was indicated for several items, the
findings of this paper are less than conclusive. Particularly, for the relational aggression scale,
one item displayed DIF, yet invariance of factor loadings and thresholds was supported. It is
clear that additional research is needed to investigate the measurement properties of the scales
included in this study, as well as other measures of adolescent aggression.
REFERENCES
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W.
Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-23). Hillsdale, NJ,
England: Lawrence Erlbaum Associates Inc.
Baker, F. B. (2001). The basics of item response theory (2nd ed.). College Park, MD: ERIC
Clearinghouse on Assessment and Evaluation.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107,
238-246.
Björkqvist, K., Lagerspetz, K. M., & Kaukiainen, A. (1992). Do girls manipulate and boys fight?
Developmental trends in regard to direct and indirect aggression. Aggressive Behavior,
18, 117-127.
Bongers, I. L., Koot, H. M., van der Ende, J., & Verhulst, F. C. (2004). Developmental
trajectories of externalizing behaviors in childhood and adolescence. Child Development,
75, 1523-1537.
Broidy, L. M., Nagin, D. S., Tremblay, R. E., Bates, J. E., Brame, B., Dodge, K. A., et al. (2003).
Developmental trajectories of childhood disruptive behaviors and adolescent
delinquency: A six-site, cross-national study. Developmental Psychology, 39, 222-245.
Christensen, H., Jorm, A. F., Mackinnon, A. J., Korten, A. E., Jacomb, P. A., Henderson, A. S.,
et al. (1999). Age differences in depression and anxiety symptoms: A structural equation
modelling analysis of data from a general population sample. Psychological Medicine,
29, 325-339.
Crick, N. R., & Bigbee, M. A. (1998). Relational and overt forms of peer victimization: A
multiinformant approach. Journal of Consulting and Clinical Psychology, 66, 337-347.
Crick, N. R., & Grotpeter, J. K. (1995). Relational aggression, gender, and social-psychological
adjustment. Child Development, 66, 710-722.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth,
TX: Harcourt Brace.
D'Agostino, R. B. (1986). Tests for the normal distribution. In R. B. D'Agostino & M. A.
Stephens (Eds.), Goodness-of-fit techniques (pp. 367-419). New York: Marcel Dekker.
Dahlberg, L. L., Toal, S. B., Swahn, M., & Behrens, C. B. (2005). Measuring violence-related
attitudes, behaviors, and influences among youths: A compendium of assessment tools.
(2nd ed.) Atlanta, GA: Centers for Disease Control and Prevention, National Center for
Injury Prevention and Control.
DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2, 292-307.
Edelen, M. O., Thissen, D., Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2006).
Identification of differential item functioning using item response theory and the
likelihood-based model comparison approach: Application to the Mini-Mental State
Examination. Medical Care, 44, 134-142.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum Associates Publishers.
Farrell, A. D., Meyer, A. L., & White, K. S. (2001). Evaluation of responding in peaceful and
positive ways (RIPP): A school-based prevention program for reducing violence among
urban adolescents. Journal of Clinical Child Psychology, 30, 451-463.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-
Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement,
29(4), 278-295.
Finney, S. J., & Distefano, C. (2006). Non-normal and categorical data in structural equation
modeling. In G. R. Hancock (Ed.), Structural equation modeling: A second course (pp.
269-313). Greenwich, CT: Information Age Publishing.
Gallo, J. J., Anthony, J. C., & Muthén, B. O. (1994). Age differences in the symptoms of
depression: A latent trait analysis. Journals of Gerontology, 49, 251-264.
Glockner-Rist, A., & Hoijtink, H. (2003). The best of both worlds: Factor analysis of
dichotomous data using item response theory and structural equation modeling.
Structural Equation Modeling: A Multidisciplinary Journal, 10, 544-565.
Grayson, D. A., Mackinnon, A., Jorm, A. F., Creasey, H., & Broe, G. A. (2000). Item bias in the
Center for Epidemiologic Studies Depression Scale: Effects of physical disorders and
disability in an elderly community sample. Journals of Gerontology Series B:
Psychological Sciences and Social Sciences, 55, 273-282.
Hambleton, R. K. (2006). Good practices for identifying differential item functioning -
Commentary. Medical Care, 44(11), S182-S188.
Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance
in aging research. Experimental Aging Research, 18, 117-144.
Horne, A. M. (2004). The multisite violence prevention project: Background and overview.
American Journal of Preventive Medicine, 26(Suppl1), 3-11.
Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to
underparameterized model misspecification. Psychological Methods, 3, 424-453.
Jones, R. N. (2003). Racial bias in the assessment of cognitive functioning of older adults. Aging
& Mental Health, 7, 83-102.
Jones, R. N. (2006). Identification of measurement differences between English and
Spanish-language versions of the Mini-Mental State Examination: Detecting differential item
functioning using MIMIC modeling. Medical Care, 44, 124-133.
Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New
York: The Guilford Press.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Oxford, England:
Addison-Wesley.
MacIntosh, R., & Hashim, S. (2003). Variance estimation for converting MIMIC model
parameters to IRT parameters in DIF analysis. Applied Psychological Measurement, 27,
372-379.
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and
confirmatory factor analytic methodologies for e