getd.libs.uga.edu · USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-RELATED DIF...
-
USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-
RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE
by
KATHERINE RACZYNSKI
(Under the Direction of Seock-Ho Kim)
ABSTRACT
This study uses the multiple-indicator/multiple cause (MIMIC) latent variable model and multiple group analysis to evaluate Likert-type items for gender-related differential item functioning (DIF) on an aggression measure in an adolescent population. The MIMIC model allows for the simultaneous examination of group differences in the latent factor of interest (i.e., aggression) and response to measurement (i.e., DIF). Multiple group analysis provides an overall examination of measurement invariance across groups. This study will test for gender DIF in two scales of an aggression measure, physical aggression and relational aggression.
INDEX WORDS: Measurement invariance, MIMIC model, Differential item functioning
-
USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-
RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE
by
KATHERINE RACZYNSKI
B.S.Ed., University of Georgia, 2002
A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment
of the Requirements for the Degree
MASTER OF ARTS
ATHENS, GEORGIA
2008
-
© 2008
Katherine Raczynski
All Rights Reserved
-
USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-
RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE
by
KATHERINE RACZYNSKI
Major Professor: Seock-Ho Kim
Committee: Deborah Bandalos, Stephen Olejnik
Electronic Version Approved:
Maureen Grasso
Dean of the Graduate School
The University of Georgia
December 2008
-
ACKNOWLEDGEMENTS
I would like to thank my major advisor, Seock-Ho Kim, for providing support and
guidance, along with the other members of my committee, Deborah Bandalos and Stephen
Olejnik. The suggestions and assistance provided by my committee were of tremendous value.
I also owe a debt of gratitude to Andy Horne, Pamela Orpinas, and the Youth Violence
Prevention Project, for allowing me access to the data and for being great friends and role
models. Finally, thank you to my family for providing unflagging support and encouragement.
-
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS........................................................................................................... iv
LIST OF TABLES........................................................................................................................ vii
CHAPTER
1 INTRODUCTION AND THEORETICAL FRAMEWORK ........................................1
Introduction ...............................................................................................................1
Measurement .............................................................................................................1
Measurement Invariance ...........................................................................................3
Differential Item Functioning....................................................................................3
Item Response Theory...............................................................................................4
Structural Equation Modeling ...................................................................................6
Connections between CFA and IRT..........................................................................8
2 LITERATURE REVIEW ............................................................................................10
The MIMIC model ..................................................................................................10
Advantages of the MIMIC model ...........................................................................11
DIF Detection using MIMIC Models: A Comparison to IRT Models....................12
Prior Studies using MIMIC Modeling to Detect DIF..............................................13
Gender-related DIF in Measures of Aggression......................................................16
-
3 PROCEDURE..............................................................................................................18
Sample .....................................................................................................................18
Instrumentation........................................................................................................19
Computer Program ..................................................................................................20
Detection of Gender-related DIF.............................................................................20
Multiple-Indicator/Multiple Cause Modeling .........................................................21
Multiple Group Analysis .........................................................................................23
4 RESULTS ....................................................................................................................25
Descriptive Statistics ...............................................................................................25
Outliers ....................................................................................................................29
Missing Value Treatment ........................................................................................29
Physical Aggression Scale.......................................................................................29
Relational Aggression Scale....................................................................................34
5 SUMMARY AND DISCUSSION...............................................................................38
Summary .................................................................................................................38
Discussion ...............................................................................................................39
Limitations and Future Research.............................................................................42
REFERENCES ..............................................................................................................................45
APPENDICES ...............................................................................................................................51
A MPLUS SYNTAX.......................................................................................................51
-
LIST OF TABLES
Page
Table 1: Physical and Relational Aggression Items Means and Standard Deviations for Boys and
Girls ...............................................................................................................................27
Table 2: Physical and Relational Aggression Items Intercorrelations, Skewness and Kurtosis ...28
Table 3: Multiple-Indicator/Multiple Causes (MIMIC) Model Estimates for the Physical
Aggression Scale ...........................................................................................................30
Table 4: Sequential Chi Square Tests of Invariance for the Physical and Relational Aggression
Scales.............................................................................................................................32
Table 5: Multiple-Indicator/Multiple Causes (MIMIC) Model Estimates for the Relational
Aggression Scale ...........................................................................................................35
-
CHAPTER 1
INTRODUCTION AND THEORETICAL FRAMEWORK
Introduction
This study uses the multiple-indicator/multiple cause (MIMIC) latent variable model and
multiple group analysis to evaluate Likert-type items for gender-related differential item
functioning (DIF) on an aggression measure in an adolescent population. The MIMIC model
allows for the simultaneous examination of group differences in the latent factor of interest (i.e.,
aggression) and response to measurement (i.e., DIF). Multiple group analysis provides an
overall examination of measurement invariance across groups. This study tests for gender DIF
in two scales of an aggression measure, physical aggression and relational aggression. Because
gender differences in levels of victimization may contribute to differences in levels of
aggression, the model includes a measure of victimization as a covariate.
Measurement
Measurement is the term given to the systematic act of assigning numbers to variables to
represent properties or characteristics of people, events, or objects (Stevens, 1946; Lord &
Novick, 1968, p. 16). Within education and psychology, measurement is used to aid in the
understanding of unobserved or latent variables that are of interest to the researcher, such as
academic achievement or attitudes toward violence. While researchers believe that these internal
characteristics exist, there is no direct way to observe them. Instead, researchers rely on theory
to develop survey instruments that indirectly measure constructs of interest.
-
The objective of any well-designed survey instrument is to obtain observed item
responses that are reflective of respondents’ levels of an unobserved latent trait. Ideally,
individuals with the same level of the underlying trait should obtain the same score on an
instrument measuring that trait. However, according to classical test theory, survey responses
always include an unknown quantity of error or unexplained variation. That is, other factors,
apart from the level of latent trait, influence how participants respond to items. Classical test
theory models this variation using the equation
X = T + E, (1)
where X is the observed score, T is the true score, and E is the error or unexplained variation
(Lord & Novick, 1968, p. 34). Theoretically, E is normally distributed, with a mean of zero and
variance σ²E. The error component is also assumed to be nonsystematic in nature and
uncorrelated with T. Because T and E are theoretically uncorrelated, it follows that the variance
of the observed scores (σ²X) can be parceled out into two components, true score variation (σ²T)
and error variation (σ²E), as modeled in
σ²X = σ²T + σ²E . (2)
A derivation of this model allows researchers to conceptualize test reliability:
rXX′ = σ²T /(σ²T + σ²E) = σ²T /σ²X . (3)
The above model demonstrates that high reliability is dependent on a large proportion of
variation in true scores (T) to variation in observed scores (X). Because T is not directly
observable, researchers rely on other techniques (e.g., Pearson correlation, test-retest reliability)
to estimate reliability. Regardless of the measure, the goal in any testing situation is to obtain
observed scores (X) that are substantially made up of T, with negligible amounts of E. It is the
-
responsibility of the researcher to minimize the amount of error likely to be contained in
responses through rigorous confirmation of the instrument’s reliability and validity evidence.
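The decomposition in Equations 1 through 3 can be illustrated with a small simulation; the sample size, means, and variances below are arbitrary choices for the sketch, not values from this study.

```python
import numpy as np

# Illustrative simulation of the classical test theory model X = T + E
# (Equation 1); all distributional choices here are made up for the sketch.
rng = np.random.default_rng(0)
n = 100_000

true_scores = rng.normal(loc=50, scale=10, size=n)   # T, variance 100
errors = rng.normal(loc=0, scale=5, size=n)          # E, mean 0, variance 25
observed = true_scores + errors                      # X = T + E

# Because T and E are uncorrelated, var(X) ~= var(T) + var(E) (Equation 2)
print(observed.var(), true_scores.var() + errors.var())

# Reliability = var(T) / var(X) (Equation 3); here expected near 100/125 = 0.8
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))
```

With a large sample the observed-score variance closely matches the sum of the two components, and the reliability ratio recovers the proportion of true-score variance.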
Classical test theory provides a framework for understanding the measurement properties
of observed variables in terms of their reliability and validity. However, more recent analytical
techniques have allowed researchers to address additional aspects of reliability and validity that
were previously inaccessible using solely classical test theory methods of inquiry (Vandenberg &
Lance, 2000). One such topic is the systematic evaluation of measurement invariance.
Measurement Invariance
Measurement invariance refers to a test’s ability to measure the same latent variable
under different measurement conditions, such as with different populations of respondents (Horn
& McArdle, 1992). Therefore, measurement invariance is primarily concerned with the
generalizability of interpretations of test responses across different sets of circumstances. Before
conducting comparisons across groups on a common measure, researchers should evaluate
whether different groups of respondents conceptually respond to and interpret the measure in a
similar way.
Without evidence of adequate measurement invariance, interpretations of observed scores
may be flawed. Differences in group means may be related to actual group differences (e.g., one
group has more of the latent variable assessed) or to group differences in response to the measure
(e.g., differences in frame of reference). In order to meaningfully explore true differences, it is
necessary to discount the possibility of substantial measurement differences.
Differential Item Functioning
This paper is concerned with a type of measurement invariance analysis called
differential item functioning (DIF). DIF analysis involves evaluating differences in item
-
performance across distinct groups of respondents after matching the groups on ability (on
achievement measures) or “severity” (on psychological measures) (Angoff, 1993). DIF occurs
when subgroups of respondents with equal amounts of the latent trait respond differently to
items, causing potentially serious threats to the validity of the test.
DIF can be uniform or non-uniform. Uniform DIF occurs when one group consistently
scores higher than the other tested group, across all levels of ability. An example of uniform DIF
is in the case when a group of girls outperforms a group of boys on a math test when the two
groups of children possess an equal amount of underlying math ability. That is, some other
factor is interfering to give the girls a consistent advantage over boys.
Non-uniform DIF occurs when items discriminate differently between different ability
levels within groups. An example would be a math problem that average ability girls can answer
correctly, but only high ability boys (and not average ability boys) are able to answer. In effect,
this item differentiates between low-ability and average-ability girls and average-ability and
high-ability boys.
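The two kinds of DIF described above can be sketched with two-parameter logistic item response functions; all item parameters below are hypothetical.

```python
import numpy as np

# Sketch of uniform vs. non-uniform DIF using two-parameter logistic
# item response functions; parameter values are invented for illustration.
def p_correct(theta, a, b):
    """Probability of a correct response under a 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)   # matched ability levels

# Uniform DIF: same discrimination, shifted difficulty, so one group is
# consistently favored at every ability level.
girls_uniform = p_correct(theta, a=1.0, b=-0.5)
boys_uniform = p_correct(theta, a=1.0, b=0.5)
print(np.all(girls_uniform > boys_uniform))   # True at every theta

# Non-uniform DIF: discriminations differ, so the curves cross and the
# advantaged group changes across the ability range.
girls_nonuni = p_correct(theta, a=0.5, b=0.0)
boys_nonuni = p_correct(theta, a=2.0, b=0.0)
diff = girls_nonuni - boys_nonuni
print(diff[0] > 0 and diff[-1] < 0)           # the advantage flips sign
```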
Item Response Theory
Researchers have primarily examined DIF using item response theory (IRT) techniques.
IRT models evaluate variation among respondents by analyzing item-level data (Edelen et al.,
2006). In effect, IRT modeling assigns respondents as well as items to a scale of measurement,
which is conceptualized as having a range of negative infinity to positive infinity, a mean of
zero, and a unit of measurement of one (Baker, 2001).
The fundamental elements of IRT are a latent variable (such as ability or severity),
generally called θ , and an item characteristic curve (ICC) for each item. The ICC graphically
represents the probability of correct answer choice or endorsement along a continuum of θ.
-
Therefore, the ICC is conceptualized as a function of θ. When analyzing a polytomous item
(e.g., Likert-type), IRT methods assume that the item true score function, a nonlinear monotonic
function, connects θ to the expected answer choice. There are also additional assumptions
underlying polytomous IRT methods, namely, that the set of items of interest is
unidimensional and locally independent (i.e., the items are uncorrelated after controlling for θ).
One IRT model that can be used to examine polytomous items is the graded response
model (Samejima, 1969). This model is applied when items have Likert-type or categorical
answer choices, which is a common characteristic of personality and attitude measures
(Embretson & Reise, 2000, p. 308). In the graded response model, when the item has K answer
choices or categories, k = 1, 2, …, K, the item true score function of θ, T(θ), is modeled as
T(θ) = Σk=1K k × Pk(θ), (4)
where the item category response function for category k is
Pk(θ) = P*k−1(θ) − P*k(θ). (5)
The boundary response function, or probability of responding above k, is
P*k(θ) = 1/{1 + exp[−a(θ − bk)]}, (6)
where a is the slope parameter and the bk are the threshold parameters. Additionally, P*0(θ) = 1 and
P*K(θ) = 0. For an item with five response categories, there are five item response functions and
four boundary response functions. In this case, there would be five total item parameters.
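As a rough illustration of Equations 4 through 6, the following sketch computes the category probabilities and item true score for one hypothetical five-category item (one slope, four thresholds); the parameter values are invented.

```python
import numpy as np

# Sketch of Samejima's graded response model (Equations 4-6) for one
# hypothetical five-category Likert item: one slope and four thresholds.
a = 1.2
b = np.array([-1.5, -0.5, 0.5, 1.5])   # b_1..b_4, K - 1 = 4 thresholds

def boundary(theta):
    """P*_k(theta): probability of responding above category k (Equation 6),
    with the trivial endpoints P*_0 = 1 and P*_K = 0 appended."""
    core = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.concatenate(([1.0], core, [0.0]))

def category_probs(theta):
    """P_k(theta) = P*_{k-1}(theta) - P*_k(theta) (Equation 5)."""
    p_star = boundary(theta)
    return p_star[:-1] - p_star[1:]

def item_true_score(theta):
    """T(theta) = sum over k of k * P_k(theta) (Equation 4), k = 1..5."""
    return np.sum(np.arange(1, 6) * category_probs(theta))

probs = category_probs(0.0)
print(probs.round(3), probs.sum())      # five probabilities summing to 1
print(round(item_true_score(0.0), 3))   # expected response at theta = 0
```

Because these made-up thresholds are symmetric about zero, the expected response at θ = 0 is the scale midpoint, 3.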
In order to evaluate an instrument for DIF using a graded response model, the researcher
designates a reference group and a focal group of respondents. DIF is said to be present if item
true score functions are not equal among these groups such that TR(θ) ≠ TF(θ), where the
subscripts signify the reference and focal groups, respectively.
-
Structural Equation Modeling
Another class of techniques used to evaluate measurement invariance is structural
equation modeling (SEM). In particular, confirmatory factor analysis (CFA), a type of SEM
analysis, has been used extensively to study measurement invariance. CFA methods for
detecting measurement invariance involve an overall test of comparability of parameter values
across groups, followed by a series of more specialized comparisons to identify the source of
lack of equivalence, if indicated.
The measurement model for CFA can be written as
x = τ + Λxξ + δ , (7)
(Vandenberg & Lance, 2000). In this equation, x is a q × 1 vector of observed variables, ξ is an
n × 1 vector of latent variables, Λx is a q × n matrix of factor loadings, and δ is a q × 1 vector of
measurement errors in x. This equation also includes the τ vector of intercepts, although
generally the intercepts are assumed to be zero and are not estimated. In order to obtain a
covariance matrix, Λxξ + δ is multiplied by its transpose. Following the assumption that the
measurement errors are uncorrelated with each other and with the latent construct, the covariance
matrix (Σx) may be expressed as
Σx = ΛxΦΛx′ + Θδ , (8)
where Φ is the covariance matrix of the latent variables and Θδ is the diagonal matrix of
measurement error variances. While Equation 8 is identical to the measurement model for
exploratory factor analysis (EFA), CFA places restrictions on Λx, which differentiates CFA from
EFA.
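Equation 8 can be illustrated numerically; the loadings and error variances below are invented for a one-factor, three-indicator example.

```python
import numpy as np

# Sketch of the model-implied covariance matrix from Equation 8,
# Sigma_x = Lambda Phi Lambda' + Theta_delta, with made-up values
# for a one-factor model with three indicators (q = 3, n = 1).
Lambda = np.array([[0.8], [0.7], [0.6]])   # q x n factor loadings
Phi = np.array([[1.0]])                    # n x n latent covariance matrix
Theta = np.diag([0.36, 0.51, 0.64])        # diagonal error variances

Sigma = Lambda @ Phi @ Lambda.T + Theta
print(Sigma.round(2))

# Diagonal entries: loading^2 * factor variance + error variance;
# off-diagonal entries: products of loadings (e.g., 0.8 * 0.7 = 0.56).
```

Constraining elements of Λx, Φ, or Θδ to be equal across groups, as in the invariance tests below, amounts to forcing the groups' implied Σx matrices to share those components.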
Measurement invariance can be examined on many different levels using CFA. Because
Σx can differ across populations, it is possible to test for the equivalence of Σx , Λx , Φ,
-
and Θδ (Raju, Laffitte, & Byrne, 2002). A test of Σx = Σx′, where Σx refers to the covariance
matrix of the reference group and Σx′ refers to the covariance matrix of the comparison group,
can be thought of as an omnibus test of measurement invariance. If the null hypothesis is not
rejected using chi-square and other goodness-of-fit methods, the measure is generally accepted
as invariant and further tests are unnecessary. However, if the null hypothesis is rejected, lack of
invariance is indicated, and further tests should be conducted to identify the source of invariance
(Schmitt, 1982).
There are several types of tests of invariance to identify the source of lack of
measurement equivalence in a measure. Although there has been some inconsistency in the
literature with regard to terminology, and number of necessary tests and order of necessary tests
of invariance (Vandenberg & Lance, 2000), I will summarize the tests described in Vandenberg
and Lance (2000) in the order in which they are recommended.
The test of configural invariance evaluates whether patterns of significant and non-
significant factor loadings across groups are similar. The test of metric invariance (Λx = Λx′) is
concerned with the equality of the values of factor loadings across groups. The test of scalar
invariance (τx = τx′) indicates whether intercepts on the latent variable are equivalent across
groups. The test of the invariance of the unique variances across groups (Θδ = Θδ′) examines
whether like items’ uniquenesses are equivalent between groups. The test of factor variance
invariance (Φ = Φ′) examines whether respondents in different groups utilized a similar range of
responses along the answer continuum. Vandenberg and Lance (2000) also discuss the test of
equal factor covariances, although they do not find this test useful.
-
There have been several sets of step-by-step recommendations for undertaking the
aforementioned tests (e.g., Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000).
While different sequences of testing have been reported, the overarching idea is that each test is
undertaken sequentially, and at each step, more restrictive constraints are added to the model
(e.g., setting equal factor loadings on like items across groups). The more restricted model is
compared in terms of goodness of fit (i.e., χ2 value and other goodness of fit indices) to the less
restricted or baseline model. The source of lack of measurement invariance is indicated at the
level of testing when the more restricted model does not meet acceptable standards for goodness-
of-fit.
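The sequential comparison described above can be sketched as a chi-square difference test; the fit statistics below are invented for illustration, not results from this study.

```python
# Sketch of a chi-square difference (likelihood ratio) comparison between a
# baseline model and a more constrained model; the fit statistics here
# are hypothetical numbers, not values from this study.
baseline_chi2, baseline_df = 112.4, 48         # less restricted model
constrained_chi2, constrained_df = 131.9, 54   # e.g., loadings set equal

delta_chi2 = constrained_chi2 - baseline_chi2  # 19.5
delta_df = constrained_df - baseline_df        # 6

# The .05 critical value of chi-square with df = 6 is 12.59; a difference
# exceeding it suggests the added constraints significantly worsen fit,
# flagging lack of invariance at this step of testing.
CRITICAL_05_DF6 = 12.59
print(delta_chi2, delta_df, delta_chi2 > CRITICAL_05_DF6)
```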
Connections between CFA and IRT
Raju, Laffitte, and Byrne (2002) compared and contrasted CFA and IRT techniques for
evaluating measurement invariance. They reported that CFA and IRT methods are alike in that
they are both concerned with determining whether true scores are equivalent across groups for
respondents with equal amounts of the latent trait. The two techniques similarly allow for group
differences in the distributions of scores across subgroups. CFA and IRT also both give
information about the source and extent of any lack of measurement invariance identified.
While CFA and IRT are alike in certain respects, Raju, Laffitte, and Byrne (2002) also
noted several differences between the two techniques. Primarily, CFA modeling assumes a
linear relationship between the construct and items, while IRT assumes a non-linear relationship.
They found that with dichotomously scored items, the IRT approach is a more appropriate model
in terms of expressing the relationship between the measured variable and the continuous latent
construct. They found that the literature describing CFA models is more advanced in terms of
simultaneously looking at multiple latent constructs and multiple populations, although the IRT
-
literature is progressing in this direction. While CFA has been used to examine equivalence of
error variances across populations, IRT methods have not in practice examined an equivalent
statistic: the invariance of the standard error of measurement for θ across populations of
respondents. The CFA framework, on the other hand, does not have a means for determining the
probability of a respondent with a given θ selecting a particular response category. These two
techniques, when used in concert, can provide complementary information (Meade &
Lautenschlager, 2004; Reise et al., 1993).
Several connections between the statistical frameworks of IRT and CFA models have
been discussed recently. In particular, Takane and de Leeuw (1987) provided proofs showing
that a two-parameter normal ogive IRT model is equivalent to a CFA with dichotomous
variables. Researchers have extended this parallel to investigate DIF using a multiple-
indicator/multiple cause (MIMIC) model. The MIMIC model is a special case of the factor
analysis model which includes causal variables. Work by Muthén et al. (1991) provided
equations for converting MIMIC model parameters to IRT discrimination and difficulty
parameters. MacIntosh and Hashim (2003) presented a procedure for converting the standard
errors of these parameters from the MIMIC model estimates.
-
CHAPTER 2
LITERATURE REVIEW
The MIMIC Model
This paper is primarily concerned with demonstrating the use of MIMIC modeling to
identify DIF. The MIMIC model is an SEM-based alternative to the multiple-sample CFA
analysis described earlier. In a MIMIC model, one or more grouping (or background) variables
function simultaneously as contributors to differences in the latent trait and as covariates upon
which the outcome variables are regressed (Muthén, 1989). In this study, one dichotomous
grouping variable (gender) is included in the model.
The MIMIC model with a dichotomous grouping variable can be expressed as
η = γx′x + γz′z + ζ , (9)
where η is the latent trait, x represents observed background variables, z represents a dummy
variable, and ζ represents an error term, which is normally distributed and independent of x and
z (MacIntosh & Hashim, 2003). MIMIC modeling with categorical data, like IRT-based DIF
detection, includes comparing a latent response variable (yj*) to a threshold (τj). If the threshold
is exceeded (yj* > τj), then the indicator of the latent response variable (yj) is one. If the
threshold is not exceeded, yj is zero.
The latent response variable (yj*) can be modeled as a combination of the indirect effects
through the latent trait variable (η) and the direct effects of the dummy variable (zk), which is a
measure of the grouping variable (e.g., gender) that is being examined for potential contributions
to DIF:
-
yj* = λjη + βjzk + εj , (10)
where λj is the factor loading, βj is the slope relating the grouping variable to the response
variable, and εj is the random error (Finch, 2005).
The procedure for assessing DIF using MIMIC modeling entails estimating the direct and
indirect effects of group membership on the latent trait and item response (Finch, 2005).
Significant indirect effects of group membership on the item indicate that differences in item
response are influenced by group differences on the mean of the latent factor. For example,
assessing indirect effects can indicate whether a greater level of the latent trait “aggression” in
boys contributes to higher endorsement of items on an aggression scale.
Significant direct effects between the grouping variable and the item indicate that group
membership directly impacts item response, apart from any group difference on the latent trait.
Evaluating direct effects, after controlling for indirect effects, is the procedure for assessing
uniform DIF. For example, this procedure can indicate whether boys are endorsing higher levels
of aggression items after controlling for differences in the underlying trait “aggression” (Finch,
2005).
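The data-generating side of Equations 9 and 10 can be sketched with a small simulation; all parameter values below are hypothetical, and a nonzero direct effect β plays the role of uniform DIF.

```python
import numpy as np

# Sketch of the MIMIC data-generating model in Equations 9-10 for one
# item; every parameter value here is invented for illustration.
rng = np.random.default_rng(1)
n = 50_000

z = rng.integers(0, 2, size=n)           # grouping variable (e.g., gender)
gamma = 0.4                              # group effect on the latent trait
eta = gamma * z + rng.normal(size=n)     # Equation 9 (no other covariates)

lam, beta, tau = 0.9, 0.5, 0.8           # loading, DIF slope, threshold
y_star = lam * eta + beta * z + rng.normal(size=n)   # Equation 10
y = (y_star > tau).astype(int)           # observed indicator: 1 if y* > tau

# Both the indirect path (gamma * lam) and the direct path (beta) raise
# endorsement for z = 1; a MIMIC analysis separates these two effects.
print(y[z == 1].mean() > y[z == 0].mean())
```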
Advantages of the MIMIC model
There are several unique advantages to using a MIMIC model to identify DIF. A MIMIC
model can be used to simultaneously obtain estimates of group difference in item response (DIF)
and amount of the latent trait. That is, the MIMIC model provides information about the
structural model and the measurement model (Muthén, 1989). MIMIC models can be estimated
for data of ordinal or continuous scales, data with multiple grouping variables (including
grouping variables with more than two groups), and data with multiple independent variables,
including categorical or continuous variables (Glöckner-Rist & Hoijtink, 2003; Muthén, 1988).
-
Whereas IRT models require large sample sizes (Reise, Widaman, & Pugh, 1993), MIMIC
models can accommodate smaller sample sizes. Finally, researchers may be more familiar with
CFA-based procedures than analyses requiring knowledge of IRT and IRT software.
There are also some disadvantages of MIMIC modeling cited in the literature. Jones
(2006) notes that MIMIC modeling cannot account for a guessing parameter, unlike IRT-based
methods. However, in the behavior rating scales evaluated in this study, guessing is unlikely.
Another disadvantage is that a single-group MIMIC model can only identify uniform DIF,
although this study utilizes a multiple-group MIMIC approach, which is able to test for uniform
and non-uniform DIF.
DIF Detection using MIMIC Models: A Comparison to IRT Models
Finch (2005) conducted Monte-Carlo simulations to compare MIMIC model detection of
DIF to SIBTEST, IRT LR, and the Mantel-Haenszel (MH) statistic. He evaluated Type I error
rate and power under several sets of conditions, varying the size of the reference group (100 or
500), the number of items (20 or 50), the parameter model assessed, (three parameter logistic
model or two parameter logistic model), the amount of DIF contamination in the anchor items
(none or 15%), and the amount of DIF present in the target item (0 or .6). The study conditions
were completely crossed, and each combination of specifications was tested with 500
replications.
He found that the MIMIC model performed as well or better than traditional DIF
detection methods under some, but not all, conditions. Specifically, MIMIC performed well in
the 50 item test and when the two parameter logistic model was used. For the 50 item test, the
MIMIC model was more resistant than the other techniques to Type I error inflation when DIF
contamination was present in the anchor items. However, in the case where the exam was short
-
(20 items) and the three parameter logistic model was used, the MIMIC model had an
undesirably high rate of Type I error.
In terms of power, the MIMIC model performed well. Power was especially good when
the exam had 50 items and when the two parameter logistic model was used. Under these
conditions, the MIMIC model matched or exceeded the power of the other techniques.
Specifically, the MIMIC model outperformed the MH statistic and the SIBTEST when the
reference group was smaller (100) and the level of DIF contamination in the anchor items was
higher (15%).
Prior Studies Using MIMIC Modeling to Detect DIF
Several prior studies have used MIMIC modeling to check for DIF. Grayson et al. (2000)
used a MIMIC model to assess a depression scale for uniform DIF associated with demographic
(e.g., age), disability and physical disorder variables. They conducted their analyses in three
steps. First, a confirmatory factor analysis tested the acceptability of the structural model.
Second, each of the predictor variables (demographic, disability, physical disorder) was
introduced into the model serially. Each model included an indirect path from the predictor to
each item via the latent variable as well as a direct path to each item. Notably, the direct paths
from the predictor to the items were estimated in the same model and not sequentially. One
referent item was constrained to have zero bias for model identification reasons. For each
predictor, the researchers flagged significant parameter estimates for these paths linking items to
the predictor. The researchers frame this step as a screening procedure. That is, they were
interested in identifying predictors that had no significant direct effects on the items, and were
thus unlikely to be contributing to DIF. These predictors were then eliminated from the final
model. In the third and final step of the analyses, each of the significant predictors was added
-
into the model together, and the resulting multivariate model was estimated. The researchers
were again interested in identifying significant effects from the predictor to the items, although
in the multivariate model, all other predictors are held constant in the estimation procedure.
The researchers conducted the analysis using maximum likelihood parameter estimates.
The goodness of fit index (GFI), the Tucker-Lewis index (TLI), and the root mean square error
of approximation (RMSEA) values assessed model fit. The researchers were concerned about
violations to multivariate normality; therefore they used bootstrapping to obtain confidence
intervals. The researchers determined that a biased loading parameter estimate that exceeded
twice its standard error (z score of 2) was statistically significant in this context. The loadings on
the latent variable required a z score of 1.5 to reach significance.
In reporting the results, the researchers partitioned the effects of each predictor into a bias
effect (i.e., the sum of the direct effects from the predictor to the items) and an actual effect (i.e.,
the sum of the direct effects from the latent variable to the items, multiplied by the effect from
the predictor to the latent variable). These effects were compared to the critical ratios described
above (2 for bias parameters, 1.5 for direct effect from the predictor to the latent variable). Items
that exceeded the bias cut-off on one or more predictors were identified as exhibiting DIF.
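The partition into a bias effect and an actual effect can be illustrated with invented estimates for one predictor and three items.

```python
# Sketch of the effect partition described above, with invented estimates
# for one predictor and three items: the "bias effect" sums the direct
# predictor-to-item paths, while the "actual effect" routes through the
# latent variable.
direct_paths = [0.10, -0.05, 0.30]   # predictor -> item (bias paths)
loadings = [0.8, 0.7, 0.6]           # latent variable -> item
predictor_to_latent = 0.25           # predictor -> latent variable

bias_effect = sum(direct_paths)
actual_effect = sum(loadings) * predictor_to_latent

print(round(bias_effect, 3), round(actual_effect, 3))
```

In the procedure above, each parameter would then be compared against its critical ratio (2 for the bias parameters, 1.5 for the path from the predictor to the latent variable).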
Gallo et al. (1994) used MIMIC modeling to identify age-related uniform DIF in a
depression scale with marital status, minority status, cognitive status, and recent unemployment
in the model. The latent variable, depression, was regressed on each covariate. The analysis was
conducted by successively testing for significant parameter estimates between age and each item.
Analyses were conducted using LISCOMP’s limited-information generalized least
squares estimator for dichotomous responses. Model fit was reported using the descriptive fit
value, the goodness-of-fit index, the adjusted goodness-of-fit index, and the critical number.
Christensen et al. (1999) examined a depression scale and an anxiety scale for age-related
uniform DIF using a MIMIC model. They evaluated a two-factor measurement model including
five demographic covariates (age, sex, marital status, financial status, and level of education).
First, the researchers conducted a CFA to assess the fit of the measurement model.
Next, the researchers conducted analyses to select a referent item for the substantive DIF
analysis. For identification purposes, one item must be selected as the referent (i.e., no-DIF) item. The researchers tested each item individually (i.e., testing the significance of the direct paths to the item from each covariate consecutively) to determine which items showed no DIF across all of the covariates. An item that showed weak direct associations with the covariates
was selected as the referent.
Finally, the substantive model was analyzed. The model included paths from each
covariate to both latent variables, and to each item. This model was similar to the model
described in Grayson et al. (2000), in that DIF was estimated for each of the items
simultaneously (i.e., paths from the covariate to each of the items were included in one model).
Analyses were conducted with maximum likelihood estimation using Amos 3.6.1. Model fit was reported using the goodness-of-fit index (GFI), the non-normed fit index
(NNFI) and the root mean square error of approximation (RMSEA). The researchers were
concerned about violations to multivariate normality; therefore bootstrapping was used to obtain
confidence intervals.
In an article that demonstrated a procedure for calculating the standard error of the
estimates of IRT difficulty and discrimination from MIMIC model parameters, MacIntosh and
Hashim (2003) employed MIMIC modeling to identify uniform gender-related DIF on a scale
measuring racial prejudice. They included gender in the model as the predictor variable, along
with three other covariates (educational status, political conservatism, and religious
fundamentalism). The researchers conducted their analyses in two steps. First, they ran the
model with a path from gender to the latent variable and with one path directly from gender to an
item. The researchers obtained the squared multiple correlation of the latent variable (R²) from the output of this analysis. This value was used to set the variance of the latent variable for the second run to 1 − R². Once this variance was set, the model was sequentially run to test each item for DIF in the same manner as Gallo et al. (1994); that is, with parameter estimates from gender to the item.
The Mplus program was used to run the analysis. The researchers reported model fit
using the chi-squared value.
Gender-related DIF in Measures of Aggression
Obtaining accurate measures of self-reported aggression in adolescents is of interest to
researchers, educators, and policy-makers alike. Aggression scales have been used to calculate
the prevalence of aggression in schools (e.g., Nansel et al., 2001) and as outcome variables for
evaluating the impact of violence-prevention programs (e.g., Farrell, Meyer, & White, 2001).
Evaluating gender differences in levels of aggression and types of perpetration (e.g., physical, relational) has been a topic of particular consideration in the literature. Boys have generally scored higher than girls on measures of physical aggression (e.g., Bongers et al., 2004; Broidy, Nagin, Tremblay, Bates, Brame, Dodge, et al., 2003), although researchers have recently argued that adolescent aggression is not constrained only to physical acts. Notably, Crick and Grotpeter (1995) coined the term “relational aggression” to encompass behaviors that are purposefully damaging to the victim’s peer relationships, and argued that these aggressive behaviors are more common in girls than physical aggression.
While measuring gender differences on different types of aggression has been a topic of
considerable interest, there are significant gaps in the literature regarding the validity of the
instruments used, particularly in terms of measurement invariance. Typically, researchers have
evaluated gender differences in aggression by comparing means (e.g., Crick & Grotpeter, 1995;
Björkqvist, Lagerspetz, & Kaukiainen, 1992). As discussed earlier, a simple means comparison
may not be interpretable without first obtaining evidence of measurement invariance.
The purpose of this study is to evaluate the measurement invariance of two measures of
aggression (physical and relational) across gender groups in order to evaluate whether
differences in self-reported aggression are due solely to differences in the latent trait, or if
differences are partially due to factors related to measurement. This topic may be of particular
importance because the wording of items on the aggression scale may lend itself to differential
item functioning. In particular, the Crick and Grotpeter (1995) relational aggression scale was
developed with the aim of capturing behaviors that they theorized were common in girls. If the
authors were particularly concerned with female-oriented behavior, the wording of items may
reflect subtle gender-related bias. On the other hand, because physical aggression is more
commonly associated with boys, measures of physical aggression may include an over-
representation of items that are more salient to male respondents.
CHAPTER 3
PROCEDURE
Sample
Data for this study are from the GREAT Schools and Families project (GSF), a seven-year, multi-site violence prevention project described in detail in a supplement to the American Journal of Preventive Medicine (Horne, 2004). The data used here comprise the Spring 2004
student survey of a randomly-selected sample of students from one cohort at the University of
Georgia (UGA) site. Students attended one of nine middle schools in Northeast Georgia. Of the
719 students who were eligible to participate in GSF, 623 (87%) completed the Spring 2004
student assessment. At this assessment wave, all respondents were in 7th grade, unless they had
been retained.
The sample was 49% female. Of the 612 students who selected one race, 53% were
white, 34% were black, less than 1% were American Indian or Asian Indian, 2% were other
Asian, and 10% were some other race. There were 14 students (2%) who selected more than one
race. Twelve percent of students were Hispanic. Students ranged in age from 12 to 15.
Participants took the survey via computer assisted survey interview (CASI) using laptop
computers. Participants wore headphones, and the CASI program read the questions aloud.
Respondents recorded their answers via the keyboard. All surveys were proctored by GSF staff.
Students were assigned an ID number, and the survey was confidential.
This study examines three types of behaviors: physical aggression, relational aggression,
and overt victimization.
Instrumentation
The items that are examined in this study are taken from the problem behavior frequency
scale, a collection of 47 items grouped into 7 subscales that assess the 30-day frequency of
problem behaviors, such as aggression and delinquency. This study is concerned with three
subscales (henceforth called scales): physical aggression, relational aggression, and overt
victimization.
Physical aggression is a 7-item scale that measures self-reported physical aggression in
the past 30 days. The stem is, “In the past 30 days, how many times have you,” followed by
descriptions of physical aggression (e.g., hitting) or other serious aggressive behavior (e.g.,
threatening someone with a weapon). Relational aggression is a 6-item scale that measures self-
reported relational aggression in the past 30 days. The stem is, “In the past 30 days, how many
times have you,” followed by descriptions of relational aggression, such as spreading a false
rumor about someone. Overt victimization is a 6-item scale that measures self-reported
victimization in the past 30 days. The stem is, “In the past 30 days, how many times has this
happened to you,” followed by descriptions of victimization (e.g., been pushed).
For each item, the six Likert-type response categories range from “never” to “20 or more
times.” These responses are coded from one to six, respectively, and the scale score is the
average of the item scores.
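The scoring rule can be sketched as follows; the middle category labels and the example responses are assumptions for illustration:

```python
# Likert responses from "never" to "20 or more times" are coded 1-6;
# the scale score is the mean of the item codes. The two middle labels
# below are assumed (only the end and top categories are documented).
CATEGORIES = ["never", "1-2 times", "3-5 times", "6-9 times",
              "10-19 times", "20 or more times"]

def code_response(label: str) -> int:
    """Map a response label to its 1-6 code."""
    return CATEGORIES.index(label) + 1

def scale_score(item_codes):
    """Scale score = average of the item codes."""
    return sum(item_codes) / len(item_codes)

responses = ["never", "1-2 times", "never", "3-5 times"]  # hypothetical respondent
codes = [code_response(r) for r in responses]             # [1, 2, 1, 3]
print(scale_score(codes))                                 # -> 1.75
```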
In terms of reliability, each scale demonstrated acceptable internal consistency based on
the current dataset (physical aggression, α = .85; relational aggression, α = .81; overt
victimization, α = .86). High internal consistency indicates that respondents demonstrated
consistency in answer selection across subsets of items (Crocker & Algina, 1986, p. 135). These
values are consistent with or higher than internal consistencies reported for other data sources (Farrell et al., 2001; Crick & Bigbee, 1998; Orpinas & Frankowski, 2001).
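Coefficient alpha itself can be computed directly from a respondents-by-items score matrix. A minimal sketch using an invented dataset:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x items score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Tiny hypothetical dataset: 6 respondents x 3 Likert items (codes 1-6)
data = np.array([
    [1, 1, 2],
    [2, 1, 2],
    [4, 5, 4],
    [1, 2, 1],
    [3, 4, 4],
    [2, 2, 3],
])
print(round(cronbach_alpha(data), 3))  # -> 0.934
```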
In general, published validity evidence on the scales used is scarce. In many cases, the
internal consistency of the measure is offered as the only indicator of validity evidence (see
Dahlberg et al., 2005). More comprehensive validity evidence is available for one of the
measures used. The physical aggression scale was adapted in part from Orpinas and
Frankowski’s 2001 Aggression Scale. Validity studies conducted on this scale using three
samples of middle school students indicated that scores on the Aggression Scale were
significantly positively related to teacher-rated aggression, and self-reports of drug use, weapon-
carrying, and injuries due to fights (Orpinas & Frankowski, 2001).
Computer Program
The data were analyzed using SPSS Version 16.0, and Mplus Version 5 (Muthén &
Muthén, 2007). Syntax for all analyses is provided in Appendix A.
Detection of Gender-related DIF
Gender-related DIF detection was conducted on the two scales of interest: physical
aggression and relational aggression. A measure of victimization was included in the model as a
covariate. Victimization scores were categorized into three groups: no victimization reported
(30% of respondents), one instance of victimization reported (13% of respondents) and more
than one instance of victimization reported (57% of respondents). By adding victimization to the
model, it is possible to evaluate gender-related DIF while adjusting for the effect of victimization on the aggression measure. Experiences of victimization may account for part of the relationship between gender and the tendency to endorse items on the aggression scale. In other
words, differences in victimization may explain some of the gender differences in item
endorsement that I would have otherwise attributed to DIF.
Two approaches were used to evaluate the measurement invariance of the aggression
measures. First, the MIMIC approach was used to test each item for DIF. Second, I employed
multiple group analysis to obtain a test of the overall invariance of the measures across groups.
Each of the procedures was conducted for the physical aggression scale and the relational
aggression scale independently.
Multiple-Indicator/Multiple Cause Modeling
Single group MIMIC modeling was used to identify items that exhibit DIF. For
identification purposes, the mean of the latent variable was set to zero and the variance set to one
according to a procedure documented in MacIntosh and Hashim (2003). To set the latent mean
to zero, the exogenous variables (i.e., gender, victimization) are mean centered. To set the total variance of the latent factor to one, MacIntosh and Hashim (2003) demonstrate a two-step procedure. First, the model is run with no constraints on the variance of the latent factor to obtain the R² value for the latent factor. Second, the model is estimated again with the variance of the latent factor set to 1 − R². I received a warning when running the first step of the procedure regarding the identification of the model, and the program was not able to calculate the R² value for the latent factor. In order to identify the model for the first step, I set the variance of the latent variable to 1. After re-running the model, I was able to obtain the R² value for the latent factor. The second step was then run exactly as outlined above, and no further problems were encountered.
First, a baseline model containing no direct effects from gender to item responses (i.e., no
DIF) was analyzed. Girls are coded zero, and boys are coded one. Model fit was evaluated
using the chi-square (χ²) statistic along with two other fit indices: the root mean square error of approximation (RMSEA) and the comparative fit index (CFI). The χ² test evaluates the hypothesis that the original covariance matrix and the estimated or reproduced covariance matrix are identical. However, the χ² statistic has been shown to be sensitive to trivial differences among these matrices under large sample sizes (Bentler, 1990). In other words, significant p-values may be obtained even for models that fit the data well. In order to obtain a more comprehensive view of model fit, additional fit indices are therefore typically reported alongside χ².
The RMSEA is a stand-alone fit index that adjusts for the complexity of the model,
favoring parsimony. It is a standardized measure of the degree to which the population data do
not fit the model. Hu and Bentler (1998) suggest that values of .06 or lower are indicative of
good fit. The CFI is an incremental fit index that compares the amount of non-centrality in the χ2
distribution of the specified model to a baseline (null) model. Hu and Bentler (1998) recommend
using .95 (or above) as a cut off for good model fit for the CFI.
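Both indices can be computed from χ² values. The sketch below uses the study's baseline physical aggression values (χ² = 73.308, df = 20, N = 621) for the RMSEA under one common formula (software implementations differ in detail), while the null-model values for the CFI are invented:

```python
from math import sqrt

def rmsea(chi2: float, df: int, n: int) -> float:
    """RMSEA under one common formula: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """CFI compares the non-centrality (chi2 - df) of the specified model
    to that of the baseline (null) model."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 - d_m / d_b if d_b > 0 else 1.0

# chi2 = 73.308, df = 20, N = 621 are the study's baseline model values;
# the null-model chi2 and df below are invented for illustration.
print(round(rmsea(73.308, 20, 621), 3))       # -> 0.066
print(round(cfi(73.308, 20, 1500.0, 28), 3))  # -> 0.964
```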
Weighted least squares means and variances (WLSMV) estimation was used for all
analyses. WLSMV is a robust weighted least squares (WLS) estimator which, like WLS, relies on an asymptotic covariance matrix, making both estimators attractive options for use with non-normal data. However, WLS requires extremely large sample sizes and is thus not practical for most datasets. WLSMV is less computationally intensive than WLS, which results in smaller sample size requirements. WLSMV adjusts the mean and variance of the χ² value, along with parameter estimates and standard errors, to account for the level of non-normality in the data (Finney & DiStefano, 2006, p. 298).
Next, a series of models were run to test for uniform DIF. Each item was tested
sequentially by adding a direct path (β) from gender to the item in the model. A significant
parameter estimate for this direct path is indicative of uniform DIF. That is, gender is explaining
differences in item means above and beyond that which is explained via the indirect path of
gender to the item through the latent variable.
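The logic of this test can be illustrated with an observed-score analogue: regress an item on a proxy for the latent trait plus gender, and a nonzero gender slope mirrors a significant direct path. This is only a simplified sketch with simulated continuous data and invented effect sizes, not the latent-variable MIMIC model estimated in Mplus:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated latent aggression trait and gender (0 = girls, 1 = boys);
# all loadings and effect sizes below are invented for illustration.
theta = rng.normal(size=n)
gender = rng.integers(0, 2, size=n).astype(float)

# Three continuous "items" load on the trait; item 2 also receives a
# direct gender effect, i.e., uniform DIF.
items = np.column_stack([
    0.8 * theta + rng.normal(scale=0.6, size=n),
    0.7 * theta + 0.5 * gender + rng.normal(scale=0.6, size=n),  # DIF item
    0.6 * theta + rng.normal(scale=0.6, size=n),
])

def gender_coefficient(y, trait_proxy, g):
    """OLS of an item on [intercept, trait proxy, gender]; the gender slope
    is the observed-score analogue of the direct path beta."""
    X = np.column_stack([np.ones_like(y), trait_proxy, g])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[2]

coefs = {}
for j in range(3):
    # The rest score (sum of the other items) stands in for the latent trait
    rest = items[:, [k for k in range(3) if k != j]].sum(axis=1)
    coefs[j + 1] = gender_coefficient(items[:, j], rest, gender)
    print("item", j + 1, "gender coefficient:", round(coefs[j + 1], 2))
```

The planted DIF item recovers a clearly positive gender slope, while the clean items do not; in the actual analysis the analogous judgment is made on the z statistic of the estimated β.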
In order to test whether including DIF effects (β) results in significantly improved model fit, a
DIF model was created by including all significant β’s in the model. The fit of the DIF model
was compared to that of the baseline model using a χ² difference test. It is important to note that under WLSMV estimation, the χ² values and degrees of freedom function differently than under maximum likelihood (ML) based estimators. According to comments posted by Linda Muthén on the Mplus Discussion Board (2007), the χ² values and degrees of freedom obtained from WLSMV estimation cannot be interpreted in the same way as values obtained by ML estimation (e.g., degrees of freedom calculated as the number of elements in the covariance matrix minus the number of estimated parameters), because the values obtained under WLSMV are adjusted to yield accurate p-values.
two nested models estimated using WLSMV cannot be conducted in a straightforward manner
by subtracting the χ2 value and degrees of freedom of the unconstrained model from the
constrained model. Instead, the DIFFTEST command in Mplus must be used to obtain the
accurate p-value for this test.
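For contrast, the naive difference test that is valid under ML estimation can be sketched as follows (the χ² values are hypothetical; under WLSMV this computation would be incorrect and DIFFTEST is required):

```python
# Critical values of the chi-square distribution at alpha = .05
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 7: 14.067, 15: 24.996}

def chi2_difference_test(chi2_constrained, df_constrained, chi2_free, df_free):
    """Naive ML chi-square difference test for nested models: the difference
    is significant at .05 if it exceeds the critical value for the df
    difference. NOT valid for WLSMV chi-squares (use Mplus's DIFFTEST)."""
    d_chi2 = chi2_constrained - chi2_free
    d_df = df_constrained - df_free
    return d_chi2, d_df, d_chi2 > CHI2_CRIT_05[d_df]

# Hypothetical ML chi-square values for a constrained vs. a freed model
print(chi2_difference_test(88.0, 20, 70.0, 17))  # -> (18.0, 3, True)
```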
Multiple Group Analysis
Multiple group analysis was used to further evaluate the equivalence of the factor
structures of the physical and relational aggression scales across gender. Multiple group
analysis considers the fit of the model as equality constraints are placed on two single-gender
groups. For each scale, the data from each single-gender group are first fit to a baseline or
configural model without equality constraints to determine whether the same pattern of
23
-
significant and non-significant parameter estimates holds across groups. If so, there is evidence
of configural invariance, and a more restrictive model is placed on the data. According to the
guidelines provided in the Mplus manual for conducting multiple group analysis with ordered
categorical data, the tests of metric invariance and scalar invariance described earlier are
combined into one test. That is, the factor loadings and thresholds are constrained all in one step,
“because the item probability curve is influenced by both parameters” (Muthén & Muthén, 2007,
p. 399). Because the two models are nested, χ2 difference tests are conducted to determine if the
added constraints result in significantly poorer fit.
CHAPTER 4
RESULTS
Descriptive Statistics
Means and standard deviations for the physical aggression and relational aggression
scales are presented separately by gender in Table 1. Spearman item intercorrelations, skewness,
and kurtosis, are provided in Table 2.
An examination of descriptive statistics (i.e., skewness, kurtosis) revealed violations of
univariate normality in several items. Because these scales measure the 30-day frequency of
aggressive behaviors, it is unsurprising that responses are skewed toward “never.” Two items on the physical aggression scale and four items on the relational aggression scale had skewness and kurtosis values outside of the acceptable range suggested by Kline (2005): |3| for skewness and |10| for kurtosis. Because these values fall outside of the acceptable range, these items cannot be assumed to be univariate normal (D’Agostino, 1986), and the data
as a whole cannot be assumed to be multivariate normal.
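The screening rule can be sketched as follows, using Kline's cut-offs of |3| for skewness and |10| for kurtosis and an invented item distribution:

```python
import statistics as st

def skewness(x):
    """Third standardized moment (population formula)."""
    m, s, n = st.fmean(x), st.pstdev(x), len(x)
    return sum((v - m) ** 3 for v in x) / (n * s ** 3)

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (population formula)."""
    m, s, n = st.fmean(x), st.pstdev(x), len(x)
    return sum((v - m) ** 4 for v in x) / (n * s ** 4) - 3

def flag_nonnormal(x, skew_cut=3.0, kurt_cut=10.0):
    """Flag an item whose |skewness| or |kurtosis| exceeds the cut-offs."""
    return abs(skewness(x)) > skew_cut or abs(excess_kurtosis(x)) > kurt_cut

# Hypothetical item: almost everyone answers "never" (coded 1), a few extremes
item = [1] * 95 + [2, 2, 3, 5, 6]
print(round(skewness(item), 2), round(excess_kurtosis(item), 2))
print(flag_nonnormal(item))
```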
Because of the non-normality in the data, I selected an estimation method that was not
based on normal theory, weighted least squares means and variances (WLSMV). As discussed
earlier, WLSMV accounts for the categorical nature of the data and adjusts for non-normality,
specifically through weighted least squares parameter estimates, mean- and variance- adjusted χ2
values, and scaled standard errors.
Prior to conducting the MIMIC and multiple groups analyses, means and standard
deviations of items were examined for differences in gender. For the physical aggression scale,
boys reported more aggression than girls across all items except for one. The mean of item 3
(“threatened to hurt a teacher”) was slightly higher for girls (M=1.17) than boys (M=1.11).
Overall, the item standard deviations were smaller for girls than for boys. For the relational
aggression scale, item means for girls’ and boys’ scores were quite similar, except for one item.
The mean of item 6 (“said things about another student to make other students laugh”) was
higher for boys (M=2.43) than girls (M=2.08). All of the item standard deviations were smaller
for girls than for boys.
Table 1
Physical and Relational Aggression Items: Means and Standard Deviations for Boys and Girls

                                                                   Boys          Girls
Scale and Item                                                    M     SD      M     SD
Physical Aggression
  1. Thrown something at another student to hurt them             1.87  1.283   1.59  0.999
  2. Been in a fight in which someone was hit                     1.81  1.253   1.41  0.910
  3. Threatened to hurt a teacher                                 1.11  0.498   1.17  0.705
  4. Shoved or pushed another kid                                 2.50  1.583   2.05  1.391
  5. Threatened someone with a weapon (gun, knife, club, etc.)    1.19  0.755   1.10  0.535
  6. Hit or slapped another kid                                   2.06  1.486   1.82  1.154
  7. Threatened to hit or physically harm another kid             1.80  1.392   1.56  1.133
Relational Aggression
  1. Didn't let another student be in your group anymore
     because you were mad at them                                 1.63  1.094   1.64  0.987
  2. Told another kid you wouldn't like them unless they did
     what you wanted them to do                                   1.19  0.701   1.20  0.656
  3. Tried to keep others from liking another kid by saying
     mean things about him/her                                    1.46  1.065   1.44  0.869
  4. Spread a false rumor about someone                           1.35  0.940   1.31  0.757
  5. Left another kid out on purpose when it was time to do
     an activity                                                  1.46  1.068   1.43  0.940
  6. Said things about another student to make other
     students laugh                                               2.43  1.640   2.08  1.314

N = 621 (313 boys, 308 girls)
Table 2
Physical and Relational Aggression Items: Intercorrelations, Skewness, and Kurtosis

Item        PA1   PA2   PA3   PA4   PA5   PA6   PA7   RA1   RA2   RA3   RA4   RA5   RA6
PA1          --
PA2         .45    --
PA3         .21   .21    --
PA4         .48   .42   .20    --
PA5         .36   .32   .31   .31    --
PA6         .49   .44   .19   .65   .31    --
PA7         .48   .43   .30   .50   .41   .54    --
RA1         .34   .30   .16   .30   .11   .24   .29    --
RA2         .30   .29   .22   .29   .23   .22   .28   .36    --
RA3         .35   .31   .17   .38   .24   .33   .36   .38   .44    --
RA4         .38   .30   .22   .36   .21   .30   .39   .36   .42   .53    --
RA5         .35   .25   .16   .32   .19   .29   .36   .44   .36   .56   .44    --
RA6         .41   .31   .13   .55   .26   .51   .48   .31   .26   .41   .31   .37    --
Skewness   2.11  2.23  5.63  1.21  5.69  1.68  2.20  2.31  4.59  3.02  3.62  3.05  1.31
Kurtosis   4.51  4.93 34.88  0.51 34.85  2.24  4.16  6.05 24.39 10.18 14.88  9.90  0.78

Note. PA = physical aggression item; RA = relational aggression item. Correlations are Spearman coefficients.
Outliers
The existence of outliers in data can be problematic. On one hand, outliers can exert
disproportionate influence on results and can sometimes be the result of data entry errors or
untruthful respondents. On the other hand, outliers can reflect true variation in respondents (e.g.,
highly deviant behaviors) and are therefore of interest to the researcher.
In order to screen for outliers, I used the macro given in DeCarlo (1997). DeCarlo’s
macro calculates the significance of multivariate outliers and the Mahalanobis distance for each
observation. Fifty-three observations, or 9% of the total sample, met the criteria for multivariate
outliers, using the critical value at the .05 level of significance. These outliers were not removed
from the dataset because I hypothesized that these values were most likely due to true differences
in students and not errors in the dataset or untruthful responses.
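Setting the SPSS macro aside, the same screening criterion can be sketched in Python: the squared Mahalanobis distance of each observation is compared to the χ² critical value (.05 level) with degrees of freedom equal to the number of variables. The data below are simulated:

```python
import numpy as np

def mahalanobis_sq(data: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row from the sample centroid."""
    centered = data - data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    # Row-wise quadratic form: d_i^2 = (x_i - mean)' S^{-1} (x_i - mean)
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

rng = np.random.default_rng(1)
sample = rng.normal(size=(200, 3))   # hypothetical 3-variable dataset
sample[0] = [8.0, -8.0, 8.0]         # planted multivariate outlier

d2 = mahalanobis_sq(sample)
crit = 7.815                         # chi-square .05 critical value, df = 3
outliers = np.where(d2 > crit)[0]
print(0 in outliers)                 # -> True (the planted outlier is flagged)
```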
Missing Value Treatment
There was very little missing data in the dataset. In total, only two students did not
answer every item. The Mplus program accommodates missing values using full information
maximum likelihood estimation.
Physical Aggression Scale
Parameter estimates and standard errors for the MIMIC analysis for the physical
aggression scale are presented in Table 3. For the baseline (i.e., no DIF) model, the fit of the
physical aggression scale was adequate (χ²(20) = 73.308, p < .001). Three items had statistically significant β estimates: item 2 (“been in a fight in which someone was hit”), item 3 (“threatened to hurt a teacher”), and item 6 (“hit or slapped another kid”). These significant values provide evidence that these items may exhibit
different measurement characteristics based on the gender of the respondent.
Next, a DIF model was estimated by including all three significant β’s in the model. The fit of the model was adequate (χ²(18) = 54.971, p < .001), and the χ² difference test was significant (Δχ²(3) = 23.711, p < .001), indicating that including the DIF effects significantly improved model fit.
Table 3 (continued)

                           Estimate    SE      Est./S.E.
τ (thresholds, continued)
Y3              4           2.290     0.157    14.593
                5           2.560     0.196    13.052
Y4              1          -0.259     0.055    -4.719
                2           0.594     0.056    10.688
                3           1.062     0.063    16.757
                4           1.393     0.068    20.586
                5           1.661     0.076    21.868
Y5              1          -3.121     0.308   -10.120
                2           1.608     0.118    13.592
                3           2.002     0.139    14.417
                4           2.234     0.163    13.725
                5           2.357     0.178    13.244
Y6              1           0.055     0.053     1.039
                2           0.882     0.061    14.535
                3           1.320     0.070    18.943
                4           1.628     0.080    20.344
                5           1.829     0.090    20.401
Y7              1           0.532     0.057     9.323
                2           1.142     0.068    16.809
                3           1.526     0.075    20.248
                4           1.704     0.081    21.114
                5           1.901     0.087    21.822
β
Y1                          0.023     0.082     0.286
Y2                          0.364     0.095     3.817
Y3                         -0.319     0.155    -2.061
Y4                          0.101     0.077     1.318
Y5                          0.008     0.165     0.050
Y6                         -0.222     0.076    -2.903
Y7                         -0.101     0.084    -1.204
γ
Gender                      0.136     0.081
Victimization               0.519     0.046
Ψ
Physical Aggression         0.297
Table 4
Sequential Chi-Square Tests of Invariance for Physical and Relational Aggression Scales

Scale                  Analysis        Model      χ²       df    p      Δχ²     Δdf    p
Physical Aggression    MIMIC           Model 0    73.308   20   .000     --     --    --
                                       Model 1    54.971   18   .000   23.711    3   .000
                       Multiple Group  Model 2   119.991   22   .000     --     --    --
                                       Model 3   112.916   28   .000   37.041   15   .001
Relational Aggression  MIMIC           Model 0    56.385   15   .000     --     --    --
                                       Model 1    50.111   14   .000    6.154    1   .013
                       Multiple Group  Model 2    56.414   22   .000     --     --    --
                                       Model 3    35.598   18   .007    0.340    7   .999

Model 0: No direct DIF effects
Model 1: All significant DIF effects included
Model 2: No invariance constraints
Model 3: Invariance of loadings and thresholds
I next used multiple group analysis to further evaluate the equivalence of the factor
structures across gender. Before beginning the process, it was necessary to collapse the fifth and
sixth response categories (“10-19 times” and “20 or more times”) for item 2 (“been in a fight in
which someone was hit”) because no females endorsed the fourth item response. In this case, the
program cannot estimate a threshold and the categories must be collapsed. Because of problems
running the model (i.e., non-positive definite matrix), victimization was removed from the
model. After removing victimization, no additional problems were encountered running the
analyses.
In order to identify the model, I followed the recommendations from the Mplus Manual
(Muthén & Muthén, 2007, p. 398). That is, for the baseline model (i.e., test of configural
invariance), all of the item error variances are set to one, the factor means are set to zero for both
groups, one item loading is set to one in both groups, and one threshold per item is held invariant
(one additional threshold is held invariant for the item with the factor loading set to one). It is
important to note that by setting loadings and thresholds to one across groups, these reference
parameters are assumed to be invariant. The loading of the first item was chosen as the reference
parameter because this item showed little evidence of lack of invariance in the MIMIC analysis.
The highest threshold was chosen to be held invariant on the theoretical grounds that there may
be less variation across gender on the top end of the scale (i.e., differences between respondents
endorsing one of two answer categories corresponding to high frequency of aggression) than at
the bottom (i.e., differences between respondents endorsing one of two answer categories
corresponding to no aggression versus low frequency of aggression).
Table 4 provides a summary of the sequential χ2 difference tests for the multiple group
analysis. The fit of the baseline model was mediocre (χ²(22) = 119.991, p < .001). After the factor loadings and thresholds were held equal across groups (χ²(28) = 112.916, p < .001), the χ² difference test was significant (Δχ²(15) = 37.041, p = .001). This result indicates that the added constraints resulted in a statistically significant decrease in model fit, suggesting that the physical aggression scale does not exhibit full measurement invariance across gender.
Relational Aggression Scale
Parameter estimates and standard errors for the MIMIC analysis for the relational
aggression scale are presented in Table 5. The fit of the relational aggression scale for the
MIMIC analysis was adequate (χ²(15) = 56.385, p < .001). One item, item 6 (“said things about another student to make other students laugh”), had a statistically significant direct effect from gender (β = .211, z = 2.479). A DIF model including this effect fit the data significantly better than the baseline model (Δχ²(1) = 6.154, p = .013).
Table 5
Multiple-Indicator/Multiple-Causes (MIMIC) Model Estimates for the Relational Aggression Scale

                           Estimate    SE      Est./S.E.
λ
Y1                          0.730     0.035    20.561
Y2                          0.852     0.040    21.320
Y3                          0.960     0.024    40.630
Y4                          0.856     0.036    23.545
Y5                          0.905     0.027    33.667
Y6                          0.641     0.033    19.639
τ (thresholds)
Y1              1           0.278     0.052     5.361
                2           1.174     0.066    17.853
                3           1.652     0.084    19.704
                4           1.901     0.097    19.621
                5           2.055     0.112    18.370
Y2              1           1.221     0.069    17.781
                2           1.749     0.089    19.557
                3           2.099     0.122    17.137
                4           2.353     0.150    15.736
                5           2.411     0.156    15.477
Y3              1           0.651     0.056    11.634
                2           1.401     0.078    18.060
                3           1.806     0.095    19.010
                4           2.029     0.110    18.375
                5           2.057     0.112    18.330
Y4              1           0.933     0.065    14.328
                2           1.713     0.091    18.767
                3           2.061     0.111    18.620
                4           2.199     0.121    18.199
                5           2.342     0.133    17.634
Y5              1           0.710     0.057    12.501
                2           1.429     0.083    17.129
                3           1.779     0.097    18.284
                4           1.960     0.106    18.466
                5           2.062     0.114    18.006
Y6              1          -0.300     0.052    -5.710
                2           0.630     0.058    10.919
                3           1.030     0.061    16.760
                4           1.315     0.068    19.346
                5           1.534     0.075    20.438
β
Y1                         -0.092     0.086    -1.061
Y2                         -0.015     0.114    -0.135
Y3                         -0.121     0.083    -1.458
Y4                          0.016     0.097     0.161
Y5                         -0.013     0.091    -0.146
Y6                          0.211     0.085     2.479
γ
Gender                     -0.060     0.093    -0.649
Victimization               0.468     0.052     9.067
Ψ
Relational Aggression       0.175      --

-- No responses
I next used multiple group analysis to further evaluate the equivalence of the factor
structures across gender. Similar to the physical aggression scale, it was necessary to collapse
the fifth and sixth response categories (“10-19 times” and “20 or more times”) for two items. For item 2 (“told another kid you wouldn’t like them unless they did what you wanted them to”), no respondents endorsed the fifth response category. For item 3 (“tried to keep others from liking another kid by saying mean things about him/her”), no girls endorsed the fifth response category.
No other problems were encountered, and the model was estimated with victimization in the
model.
The same procedures that were described earlier were followed to identify the model for
multiple groups analysis. The first item loading was held equal across groups, because this item
did not display evidence of DIF in the MIMIC modeling procedure. The highest threshold for
each item was held invariant for the same reason described in the physical aggression section.
Table 4 provides a summary of the sequential χ2 difference tests for the multiple group
analysis. The fit of the baseline model was mediocre (χ²(22) = 56.414, p < .001; CFI = .98). After holding the factor loadings and thresholds equal across groups, the model fit adequately (χ²(18) = 35.598, p = .007; RMSEA = .056, CFI = .99). The χ² difference test was not significant (Δχ²(7) = 0.340, p = .999). This result indicates that adding the additional constraints to the model does not result in a statistically significant decrease in model fit, which supports the hypothesis that the relational aggression scale does exhibit measurement invariance across gender.
CHAPTER 5
SUMMARY AND DISCUSSION
Summary
The purpose of this study was to evaluate two measures of aggression in adolescents for
gender-related DIF. MIMIC modeling was used to test for direct effects of gender on each item
in the two scales. Another procedure, multiple group analysis, was used to provide additional
information about the measurement invariance of the scales across gender.
For the physical aggression scale, three items displayed evidence of DIF. The model that
did not include these DIF effects fit the data significantly worse than the model that did. This finding suggests that there are measurement differences between boys and girls on these items. The results of the multiple
group analysis buttressed this finding. Placing equality constraints on the loadings and
thresholds across the single-gender groups resulted in significantly poorer model fit.
For the relational aggression scale, one item displayed evidence of DIF. The model that
did not account for this DIF effect fit the data significantly worse than the model that did include
the DIF effect, indicating a lack of invariance for this item. Although there is evidence that one
item displays DIF, the results of the multiple group analysis indicated that the scale as a whole
does exhibit gender-related measurement invariance. Imposing equality constraints on the
loadings and thresholds of the two groups did not significantly decrease the fit of the model in
comparison to the unconstrained model.
Discussion
While statistical procedures can flag items for DIF and identify lack of invariance in
scales, it is up to the researcher to provide possible explanations for the measurement
differences. Although it is not possible to conclusively speak about the reasons for the
measurement differences, additional investigation of the flagged items can provide some
tentative explanations of the findings.
In the physical aggression scale, item 2 (“been in a fight in which someone was hit”)
exhibited DIF. The estimate of β for this item was .364. Because boys are coded 1, this estimate
indicates that in the model either boys endorsed higher values for this item due to gender above
and beyond the indirect effect of gender through latent physical aggression or, alternatively, that
girls endorsed lower values. In order to gain additional information about patterns of responses
among boys and girls, I examined item intercorrelations and response frequencies from the
marginal tables (not shown), although this information is not based on the structural equation
modeling estimated parameters. An examination of item intercorrelations did not reveal any
clear pattern of differences between item 2 and the other items (i.e., item 2 correlations were
mostly consistent with the magnitude of the correlations among other items). An examination of
the response frequencies provides additional information. Boys are less likely than girls to
choose the response “never” (56.9% to 77.3%, respectively). There are differences at the top end
of the scale as well. While 4.9% of girls selected one of the last three response categories (“6-9
times,” “10-19 times,” or “20 or more times”), 9.9% of boys chose one of these categories. It is
possible that boys are more likely than girls to endorse higher levels of this item for reasons
other than increased physical aggression. For example, boys may be more likely to engage in
play fighting, which may result in one of the participants getting hit. Although this behavior can
be considered aggressive, it is different in nature than the malevolent aggression that this scale is
intended to measure.
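The interpretation of β as a direct gender effect over and above latent aggression can be illustrated with a small simulation. This is a minimal sketch with made-up loading, indirect-effect, and variance values, not the fitted model; only the β of .364 is taken from the results above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
gender = rng.integers(0, 2, size=n)        # 1 = boy, 0 = girl
# Latent physical aggression, with a modest indirect gender effect (made up)
eta = 0.3 * gender + rng.normal(0.0, 1.0, n)

lam, beta = 0.8, 0.364                     # illustrative loading; beta from the text
# MIMIC-style latent item response: factor effect + direct gender (DIF) effect
y_star = lam * eta + beta * gender + rng.normal(0.0, 1.0, n)

# Removing the factor's contribution isolates the DIF effect: the remaining
# group gap reflects gender-related measurement difference, not aggression
resid = y_star - lam * eta
gap = resid[gender == 1].mean() - resid[gender == 0].mean()
print(f"residual gender gap ~ {gap:.2f}")  # close to beta = 0.364
```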
Item 3 (“threatened to hurt a teacher”) also exhibited DIF. The estimate of β for this item
was -.319, indicating that in the model either girls were more likely or boys were less likely to
endorse higher values for this item due to gender-related factors beyond differences in latent
physical aggression. Examining the item intercorrelation matrix revealed that the correlations
for this item were somewhat lower than other items on the physical relation scale. In terms of
response category distribution from the marginal table, the pattern for girls and boys looks quite
similar. However, more girls than boys indicated that they had threatened to hurt a teacher 10-19
or 20 or more times (2.0% versus 0.6%, respectively). While it is difficult to determine why this
item exhibits DIF, boys who are otherwise highly aggressive may be less likely to threaten, or to
admit to threatening, teachers. This could be because many teachers are
female and there is a strong social norm against males hitting or threatening females in the
Southern U.S.
Item 6 (“hit or slapped another kid”) was also flagged for DIF. The β estimate for this
item was -.222, which indicates that in the model females may be more likely or males may be
less likely to select higher values on this item due to measurement differences rather than true
differences in physical aggression. Examining the item intercorrelation matrix revealed that the
correlations for this item were fairly consistent with the correlations between the other items. In
terms of response category frequencies from the marginal table, substantial differences exist in
the high end of the scale. Where 2.6% of girls reported hitting or slapping another kid 20 or
more times, 7% of boys reported this behavior. It is possible that the word “slapping” resonates
more with girls, as this behavior in our culture is more often associated with females.
For the relational aggression scale, item 6 (“said things about another student to make
other students laugh”) demonstrated DIF. The β estimate for this item was .211, which indicates
that males may be more likely or females may be less likely to select higher values on this item
than would be expected based on their level of latent relational aggression. Interitem correlations
for this item were somewhat lower than correlations among other items on the relational
aggression scale. In terms of response category frequencies from the marginal table, boys were
more likely than girls to select the response “20 or more times” (9.9% versus 5.5%,
respectively). It is possible that for adolescent boys, the act of making fun of others is a more
common experience that is not necessarily tied to a malevolent or aggressive motive.
One important aspect of testing for measurement invariance is deciding what to do if full
measurement invariance is not supported. It may not be practical to remove all items from a
scale that exhibit DIF. Hambleton (2006) suggests that it is common to find many items that
display DIF in psychological, attitudinal, or personality measures. He notes that it is possible
that these effects, taken as a whole, may cancel each other out. That is, while some items may
exhibit DIF in one direction, others counter those effects, so overall the measure does not favor
one group over the other.
It is also possible that while statistically significant DIF effects may exist in the model,
the impact of these effects is not of practical importance. With larger sample sizes the power to
detect DIF increases. Because the present dataset is reasonably large, it is possible that some of
the significant DIF effects correspond to measurement differences that are not practically
meaningful. In future studies, it may be desirable to examine effect size measures of the DIF
items.
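The point about power can be made concrete with back-of-the-envelope arithmetic: for a fixed DIF effect, the standard error of β shrinks roughly with the square root of the sample size, so the same small effect moves from non-significant to highly significant as n grows. The numbers below are hypothetical.

```python
import math

beta = 0.10          # a small, fixed DIF effect (hypothetical)
se_at_500 = 0.06     # hypothetical standard error of beta at n = 500

for n in (500, 2000, 8000):
    se = se_at_500 * math.sqrt(500 / n)   # SE shrinks ~ 1/sqrt(n)
    z = beta / se
    print(f"n = {n:5d}: z = {z:.2f}")     # crosses |z| > 1.96 as n grows
```

This is why an effect-size measure, rather than significance alone, would be informative for the flagged items.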
Limitations and Future Research
There are several limitations inherent to the research presented in this paper. An
important limitation is that for all of the tested models, fit was less than ideal. While the fit of
the MIMIC models was adequate, the fit was mediocre for the multiple group analysis models,
especially for physical aggression. When model fit is borderline in terms of acceptability, it
becomes harder to draw conclusions based on those models. For example, a non-significant χ2
difference test may be reflective of problems with the overall fit of the model, rather than lack of
fit caused only by introducing equality constraints. The less than ideal fit of the models for
multiple groups analysis may be due to constraints that are put on the data for identification
purposes. This possibility is discussed later in this section.
Another problem is that the generalizability of the findings may be limited by the study
sample. All of the participants were middle school students from one region of the U.S. While
in this population the aggression measures examined may not demonstrate full measurement
invariance, this result may not hold for other groups of students, or for the same students at a
different point in time (e.g., in high school).
Another limitation concerns estimation method. Although WLSMV was likely the most
appropriate estimator for the data, based on the categorical and non-normal nature of the data,
there are also some disadvantages to using WLSMV. First, WLSMV is a relatively new
estimation method, and little research has examined its functioning. Overall,
however, research has supported its use as an improvement upon WLS estimation (Muthén,
1993).
WLSMV also provides challenges in terms of identifying the model when thresholds are
being estimated. In multiple group analysis, one threshold per item must be held invariant across
groups. Ideally, the constrained threshold would truly be invariant across groups, although this
is difficult to verify. Systematically testing each threshold becomes time-consuming
when there are many thresholds per item. However, if a non-invariant threshold is set to equality
in the multiple group analysis, the overall model fit may be worsened. Additional research into
this aspect of multiple group analysis with categorical data is warranted.
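In a single group, an item's thresholds correspond to inverse-normal transforms of the cumulative response proportions: τk marks the point on the underlying standard normal response distribution below which category k or lower is chosen. The sketch below (hypothetical counts, assuming numpy and scipy) computes such thresholds for one group; the invariance constraints discussed above amount to requiring these quantities to be equal across groups.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical response counts for one 6-category item in one gender group
counts = np.array([770, 120, 50, 30, 20, 10])
props = counts / counts.sum()

# Cumulative proportions below each of the 5 thresholds
cum = np.cumsum(props)[:-1]
# tau_k = Phi^{-1}( P(response <= category k) )
thresholds = norm.ppf(cum)
print(np.round(thresholds, 2))
```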
Second, WLSMV provides less information in multiple group analysis than estimators for
continuous data. For example, under MLM estimation, tests of metric and scalar
invariance could have been conducted separately, whereas under WLSMV these tests are
combined. By conducting metric and scalar tests separately, the source of lack of invariance
(slopes or intercepts) can be investigated. While under WLSMV it is possible to identify the
model in such a way as to estimate some differences in loadings and thresholds, this procedure is
not recommended in the Mplus manual because the loadings and thresholds together contribute
to the item characteristic curve (Muthén & Muthén, 2007, p. 299).
Another drawback to using WLSMV estimation is that modification indices are not
available. Under multiple group analysis, modification indices can be used to identify items that
potentially are problematic in terms of non-invariant loadings or intercepts. Modification indices
are often used to test for partial invariance, where some parameters are held invariant, while
problematic items are freely estimated. Without modification indices, I was unable to compare
whether the two methods (MIMIC modeling and multiple groups analysis) identified the same
problematic items.
Certain properties of the present dataset may make accurate estimation difficult.
Specifically, several of the items used were severely non-normal, with kurtosis values as high as
34.882. While WLSMV does adjust parameter estimates and χ2 values for non-normality, the
accuracy of WLSMV adjustments in cases of severe non-normality has not been studied
extensively. In some cases, the WLSMV estimates run counter to intuition, such as more
restricted models having better fit index values than less restricted models.
Additional studies of the functioning of WLSMV under conditions of severe non-normality
would be useful.
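The kind of severe non-normality at issue is easy to reproduce: a behavior-frequency item on which most respondents answer "never" yields large positive skewness and excess kurtosis. A minimal sketch (simulated data, assuming numpy and scipy; the category probabilities are made up):

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(1)
# Simulate a behavior-frequency item: most respondents report "never" (0),
# with a long right tail of higher categories
item = rng.choice([0, 1, 2, 3, 4, 5], size=2000,
                  p=[0.77, 0.12, 0.05, 0.03, 0.02, 0.01])

# scipy's kurtosis() returns excess kurtosis (0 for a normal distribution)
print(f"skewness = {skew(item):.2f}, excess kurtosis = {kurtosis(item):.2f}")
```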
The findings of this study did not support the full measurement invariance of two
measures of adolescent aggression. While lack of invariance was indicated for several items, the
findings of this paper are less than conclusive. Particularly, for the relational aggression scale,
one item displayed DIF, yet invariance of factor loadings and thresholds was supported. It is
clear that additional research is needed to investigate the measurement properties of the scales
included in this study, as well as other measures of adolescent aggression.
REFERENCES
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W.
Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-23). Hillsdale, NJ,
England: Lawrence Erlbaum Associates Inc.
Baker, F. B. (2001). The basics of item response theory (2nd ed.). College Park, MD: ERIC
Clearinghouse on Assessment and Evaluation.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107,
238-246.
Björkqvist, K., Lagerspetz, K. M., & Kaukiainen, A. (1992). Do girls manipulate and boys fight?
Developmental trends in regard to direct and indirect aggression. Aggressive Behavior,
18, 117-127.
Bongers, I. L., Koot, H. M., van der Ende, J., & Verhulst, F. C. (2004). Developmental
trajectories of externalizing behaviors in childhood and adolescence. Child Development,
75, 1523-1537.
Broidy, L. M., Nagin, D. S., Tremblay, R. E., Bates, J. E., Brame, B., Dodge, K. A., et al. (2003).
Developmental trajectories of childhood disruptive behaviors and adolescent
delinquency: A six-site, cross-national study. Developmental Psychology, 39, 222-245.
Christensen, H., Jorm, A. F., Mackinnon, A. J., Korten, A. E., Jacomb, P. A., Henderson, A. S.,
et al. (1999). Age differences in depression and anxiety symptoms: A structural equation
modelling analysis of data from a general population sample. Psychological Medicine,
29, 325-339.
Crick, N. R., & Bigbee, M. A. (1998). Relational and overt forms of peer victimization: A
multiinformant approach. Journal of Consulting and Clinical Psychology, 66, 337-347.
Crick, N. R., & Grotpeter, J. K. (1995). Relational aggression, gender, and social-psychological
adjustment. Child Development, 66, 710-722.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth,
TX: Harcourt Brace.
D'Agostino, R. B. (1986). Tests for the normal distribution. In R. B. D'Agostino & M. A.
Stephens (Eds.), Goodness-of-fit techniques (pp. 367-419). New York: Marcel Dekker.
Dahlberg, L. L., Toal, S. B., Swahn, M., & Behrens, C. B. (2005). Measuring violence-related
attitudes, behaviors, and influences among youths: A compendium of assessment tools.
(2nd ed.) Atlanta, GA: Centers for Disease Control and Prevention, National Center for
Injury Prevention and Control.
DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2, 292-307.
Edelen, M. O., Thissen, D., Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2006).
Identification of differential item functioning using item response theory and the
likelihood-based model comparison approach: Application to the Mini-Mental State
Examination. Medical Care, 44, 134-142.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum Associates Publishers.
Farrell, A. D., Meyer, A. L., & White, K. S. (2001). Evaluation of responding in peaceful and
positive ways (RIPP): A school-based prevention program for reducing violence among
urban adolescents. Journal of Clinical Child Psychology, 30, 451-463.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-
Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement,
29(4), 278-295.
Finney, S. J., & Distefano, C. (2006). Non-normal and categorical data in structural equation
modeling. In G. R. Hancock (Ed.), Structural equation modeling: A second course (pp.
269-313). Greenwich, CT: Information Age Publishing.
Gallo, J. J., Anthony, J. C., & Muthén, B. O. (1994). Age differences in the symptoms of
depression: A latent trait analysis. Journals of Gerontology, 49, 251-264.
Glockner-Rist, A., & Hoijtink, H. (2003). The best of both worlds: Factor analysis of
dichotomous data using item response theory and structural equation modeling.
Structural Equation Modeling: A Multidisciplinary Journal, 10, 544-565.
Grayson, D. A., Mackinnon, A., Jorm, A. F., Creasey, H., & Broe, G. A. (2000). Item bias in the
Center for Epidemiologic Studies Depression Scale: Effects of physical disorders and
disability in an elderly community sample. Journals of Gerontology Series B:
Psychological Sciences and Social Sciences, 55, 273-282.
Hambleton, R. K. (2006). Good practices for identifying differential item functioning -
Commentary. Medical Care, 44(11), S182-S188.
Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance
in aging research. Experimental Aging Research, 18, 117-144.
Horne, A. M. (2004). The multisite violence prevention project: Background and overview.
American Journal of Preventive Medicine, 26(Suppl1), 3-11.
Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to
underparameterized model misspecification. Psychological Methods, 3, 424-453.
Jones, R. N. (2003). Racial bias in the assessment of cognitive functioning of older adults. Aging
& Mental Health, 7, 83-102.
Jones, R. N. (2006). Identification of measurement differences between English and
Spanish-language versions of the Mini-Mental State Examination: Detecting differential item
functioning using MIMIC modeling. Medical Care, 44, 124-133.
Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New
York: The Guilford Press.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Oxford, England:
Addison-Wesley.
MacIntosh, R., & Hashim, S. (2003). Variance estimation for converting MIMIC model
parameters to IRT parameters in DIF analysis. Applied Psychological Measurement, 27,
372-379.
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and
confirmatory factor analytic methodologies for e