
• USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE

    by

    KATHERINE RACZYNSKI

    (Under the Direction of Seock-Ho Kim)

    ABSTRACT

This study uses the multiple-indicator/multiple cause (MIMIC) latent variable model and multiple group analysis to evaluate Likert-type items for gender-related differential item functioning (DIF) on an aggression measure in an adolescent population. The MIMIC model allows for the simultaneous examination of group differences in the latent factor of interest (i.e., aggression) and response to measurement (i.e., DIF). Multiple group analysis provides an overall examination of measurement invariance across groups. This study tests for gender DIF in two scales of an aggression measure, physical aggression and relational aggression.

    INDEX WORDS: Measurement invariance, MIMIC model, Differential item functioning

• USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE

    by

    KATHERINE RACZYNSKI

    B.S.Ed., University of Georgia, 2002

    A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment

    of the Requirements for the Degree

    MASTER OF ARTS

    ATHENS, GEORGIA

    2008

  • © 2008

    Katherine Raczynski

    All Rights Reserved

• USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE

    by

    KATHERINE RACZYNSKI

    Major Professor: Seock-Ho Kim

Committee: Deborah Bandalos, Stephen Olejnik

Electronic Version Approved:

Maureen Grasso
Dean of the Graduate School
The University of Georgia
December 2008

  • ACKNOWLEDGEMENTS

    I would like to thank my major advisor, Seock-Ho Kim, for providing support and

    guidance, along with the other members of my committee, Deborah Bandalos and Stephen

    Olejnik. The suggestions and assistance provided by my committee were of tremendous value.

    I also owe a debt of gratitude to Andy Horne, Pamela Orpinas, and the Youth Violence

    Prevention Project, for allowing me access to the data and for being great friends and role

    models. Finally, thank you to my family for providing unflagging support and encouragement.


  • TABLE OF CONTENTS

    Page

    ACKNOWLEDGEMENTS........................................................................................................... iv

    LIST OF TABLES........................................................................................................................ vii

    CHAPTER

    1 INTRODUCTION AND THEORETICAL FRAMEWORK ........................................1

    Introduction ...............................................................................................................1

    Measurement .............................................................................................................1

    Measurement Invariance ...........................................................................................3

    Differential Item Functioning....................................................................................3

    Item Response Theory...............................................................................................4

    Structural Equation Modeling ...................................................................................6

    Connections between CFA and IRT..........................................................................8

    2 LITERATURE REVIEW ............................................................................................10

    The MIMIC model ..................................................................................................10

    Advantages of the MIMIC model ...........................................................................11

    DIF Detection using MIMIC Models: A Comparison to IRT Models....................12

    Prior Studies using MIMIC Modeling to Detect DIF..............................................13

    Gender-related DIF in Measures of Aggression......................................................16


  • 3 PROCEDURE..............................................................................................................18

    Sample .....................................................................................................................18

    Instrumentation........................................................................................................19

    Computer Program ..................................................................................................20

    Detection of Gender-related DIF.............................................................................20

    Multiple-Indicator/Multiple Cause Modeling .........................................................21

    Multiple Group Analysis .........................................................................................23

    4 RESULTS ....................................................................................................................25

    Descriptive Statistics ...............................................................................................25

    Outliers ....................................................................................................................29

    Missing Value Treatment ........................................................................................29

    Physical Aggression Scale.......................................................................................29

    Relational Aggression Scale....................................................................................34

    5 SUMMARY AND DISCUSSION...............................................................................38

    Summary .................................................................................................................38

    Discussion ...............................................................................................................39

    Limitations and Future Research.............................................................................42

    REFERENCES ..............................................................................................................................45

    APPENDICES ...............................................................................................................................51

    A MPLUS SYNTAX.......................................................................................................51


• LIST OF TABLES

    Page

    Table 1: Physical and Relational Aggression Items Means and Standard Deviations for Boys and

    Girls ...............................................................................................................................27

    Table 2: Physical and Relational Aggression Items Intercorrelations, Skewness and Kurtosis ...28

    Table 3: Multiple-Indicator/Multiple Causes (MIMIC) Model Estimates for the Physical

    Aggression Scale ...........................................................................................................30

    Table 4: Sequential Chi Square Tests of Invariance for the Physical and Relational Aggression

    Scales.............................................................................................................................32

    Table 5: Multiple-Indicator/Multiple Causes (MIMIC) Model Estimates for the Relational

    Aggression Scale ...........................................................................................................35

  • CHAPTER 1

    INTRODUCTION AND THEORETICAL FRAMEWORK

    Introduction

    This study uses the multiple-indicator/multiple cause (MIMIC) latent variable model and

    multiple group analysis to evaluate Likert-type items for gender-related differential item

    functioning (DIF) on an aggression measure in an adolescent population. The MIMIC model

    allows for the simultaneous examination of group differences in the latent factor of interest (i.e.,

    aggression) and response to measurement (i.e., DIF). Multiple group analysis provides an

    overall examination of measurement invariance across groups. This study tests for gender DIF

    in two scales of an aggression measure, physical aggression and relational aggression. Because

    gender differences in levels of victimization may contribute to differences in levels of

    aggression, the model includes a measure of victimization as a covariate.

    Measurement

Measurement is the term given to the systematic act of assigning numbers to

    represent properties or characteristics of people, events, or objects (Stevens, 1946; Lord &

    Novick, 1968, p. 16). Within education and psychology, measurement is used to aid in the

    understanding of unobserved or latent variables that are of interest to the researcher, such as

    academic achievement or attitudes toward violence. While researchers believe that these internal

    characteristics exist, there is no direct way to observe them. Instead, researchers rely on theory

    to develop survey instruments that indirectly measure constructs of interest.


  • The objective of any well-designed survey instrument is to obtain observed item

    responses that are reflective of respondents’ levels of an unobserved latent trait. Ideally,

    individuals with the same level of the underlying trait should obtain the same score on an

    instrument measuring that trait. However, according to classical test theory, survey responses

    always include an unknown quantity of error or unexplained variation. That is, other factors,

    apart from the level of latent trait, influence how participants respond to items. Classical test

    theory models this variation using the equation

    X = T + E, (1)

    where X is the observed score, T is the true score, and E is the error or unexplained variation

(Lord & Novick, 1968, p. 34). Theoretically, E is normally distributed, with a mean of zero and variance σ²_E. The error component is also assumed to be nonsystematic in nature and uncorrelated with T. Because T and E are theoretically uncorrelated, it follows that the variance of the observed scores (σ²_X) can be parceled out into two components, true score variation (σ²_T) and error variation (σ²_E), as modeled in

σ²_X = σ²_T + σ²_E. (2)

A derivation of this model allows researchers to conceptualize test reliability:

r_XX′ = σ²_T / (σ²_T + σ²_E) = σ²_T / σ²_X. (3)

The above model demonstrates that high reliability depends on a high ratio of true score variation (T) to observed score variation (X). Because T is not directly

    observable, researchers rely on other techniques (e.g., Pearson correlation, test-retest reliability)

    to estimate reliability. Regardless of the measure, the goal in any testing situation is to obtain

    observed scores (X) that are substantially made up of T, with negligible amounts of E. It is the


  • responsibility of the researcher to minimize the amount of error likely to be contained in

    responses through rigorous confirmation of the instrument’s reliability and validity evidence.
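To make Equations 1 through 3 concrete, the decomposition can be checked numerically. The following Python sketch is illustrative only (it is not part of the thesis analyses, and all values are hypothetical): it simulates true scores and errors and then recovers the reliability ratio of Equation 3.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Equation 1: X = T + E, with E centered at zero and uncorrelated with T
T = rng.normal(50, 10, n)   # true scores
E = rng.normal(0, 5, n)     # random measurement error
X = T + E                   # observed scores

# Equation 2: var(X) is approximately var(T) + var(E)
print(X.var(), T.var() + E.var())

# Equation 3: reliability as the ratio of true to observed score variance
print(T.var() / X.var())    # close to 10**2 / (10**2 + 5**2) = 0.80
```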

    Classical test theory provides a framework for understanding the measurement properties

    of observed variables in terms of their reliability and validity. However, more recent analytical

    techniques have allowed researchers to address additional aspects of reliability and validity that

    were previously inaccessible using solely classical test theory methods of inquiry (Vandenberg &

    Lance, 2000). One such topic is the systematic evaluation of measurement invariance.

    Measurement Invariance

    Measurement invariance refers to a test’s ability to measure the same latent variable

    under different measurement conditions, such as with different populations of respondents (Horn

    & McArdle, 1992). Therefore, measurement invariance is primarily concerned with the

    generalizability of interpretations of test responses across different sets of circumstances. Before

    conducting comparisons across groups on a common measure, researchers should evaluate

    whether different groups of respondents conceptually respond to and interpret the measure in a

    similar way.

    Without evidence of adequate measurement invariance, interpretations of observed scores

    may be flawed. Differences in group means may be related to actual group differences (e.g., one

    group has more of the latent variable assessed) or to group differences in response to the measure

    (e.g., differences in frame of reference). In order to meaningfully explore true differences, it is

    necessary to discount the possibility of substantial measurement differences.

    Differential Item Functioning

    This paper is concerned with a type of measurement invariance analysis called

    differential item functioning (DIF). DIF analysis involves evaluating differences in item


  • performance across distinct groups of respondents after matching the groups on ability (on

    achievement measures) or “severity” (on psychological measures) (Angoff, 1993). DIF occurs

    when subgroups of respondents with equal amounts of the latent trait respond differently to

    items, causing potentially serious threats to the validity of the test.

    DIF can be uniform or non-uniform. Uniform DIF occurs when one group consistently

    scores higher than the other tested group, across all levels of ability. An example of uniform DIF

    is in the case when a group of girls outperforms a group of boys on a math test when the two

    groups of children possess an equal amount of underlying math ability. That is, some other

    factor is interfering to give the girls a consistent advantage over boys.

    Non-uniform DIF occurs when items discriminate differently between different ability

    levels within groups. An example would be a math problem that average ability girls can answer

    correctly, but only high ability boys (and not average ability boys) are able to answer. In effect,

    this item differentiates between low-ability and average-ability girls and average-ability and

    high-ability boys.

    Item Response Theory

    Researchers have primarily examined DIF using item response theory (IRT) techniques.

    IRT models evaluate variation among respondents by analyzing item-level data (Edelen et al.,

    2006). In effect, IRT modeling assigns respondents as well as items to a scale of measurement,

    which is conceptualized as having a range of negative infinity to positive infinity, a mean of

    zero, and a unit of measurement of one (Baker, 2001).

    The fundamental elements of IRT are a latent variable (such as ability or severity),

    generally called θ , and an item characteristic curve (ICC) for each item. The ICC graphically

    represents the probability of correct answer choice or endorsement along a continuum of θ.


Therefore, the ICC is conceptualized as a function of θ. When analyzing a polytomous item

    (e.g., Likert-type), IRT methods assume that the item true score function, a nonlinear monotonic

    function, connects θ to the expected answer choice. There are also additional assumptions

underlying polytomous IRT methods, namely, that the items of interest are

    unidimensional and locally independent (i.e., uncorrelated after controlling for θ).

One IRT model that can be used to examine polytomous items is the graded response

    model (Samejima, 1969). This model is applied when items have Likert-type or categorical

    answer choices, which is a common characteristic of personality and attitude measures

(Embretson & Reise, 2000, p. 308). In the graded response model, when the item has K answer choices or categories, k = 1, 2, …, K, the item true score function of θ, T(θ), is modeled as

T(θ) = Σ_{k=1}^{K} k · P_k(θ), (4)

where the item category response function for category k is

P_k(θ) = P*_{k−1}(θ) − P*_k(θ). (5)

The boundary response function, or probability of responding above category k, is

P*_k(θ) = 1 / (1 + exp[−a(θ − b_k)]), (6)

where a is the slope parameter and the b_k are the threshold parameters. Additionally, P*_0(θ) = 1 and P*_K(θ) = 0. For an item with five response categories, there are five item response functions and four boundary response functions. In this case, there would be five total item parameters.
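A brief sketch may help fix Equations 4 through 6. The Python code below is a hedged illustration, not part of the thesis; the slope and threshold values are invented for a hypothetical five-category item.

```python
import numpy as np

def boundary_probs(theta, a, b):
    """P*_k(theta), Equation 6: probability of responding above category k."""
    return 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b))))

def category_probs(theta, a, b):
    """P_k(theta), Equation 5, using P*_0 = 1 and P*_K = 0."""
    p_star = np.concatenate(([1.0], boundary_probs(theta, a, b), [0.0]))
    return p_star[:-1] - p_star[1:]

def true_score(theta, a, b):
    """Item true score T(theta), Equation 4."""
    k = np.arange(1, len(b) + 2)          # category scores 1..K
    return np.sum(k * category_probs(theta, a, b))

# hypothetical parameters: one slope, four thresholds, K = 5 categories
a, b = 1.5, [-1.0, 0.0, 0.8, 1.6]
print(category_probs(0.0, a, b))          # five category probabilities
print(true_score(0.0, a, b))              # expected answer choice at theta = 0
```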

    In order to evaluate an instrument for DIF using a graded response model, the researcher

    designates a reference group and a focal group of respondents. DIF is said to be present if item

true score functions are not equal across these groups, such that T_R(θ) ≠ T_F(θ), where the subscripts signify the reference and focal groups, respectively.


  • Structural Equation Modeling

    Another class of techniques used to evaluate measurement invariance is structural

    equation modeling (SEM). In particular, confirmatory factor analysis (CFA), a type of SEM

    analysis, has been used extensively to study measurement invariance. CFA methods for

    detecting measurement invariance involve an overall test of comparability of parameter values

    across groups, followed by a series of more specialized comparisons to identify the source of

    lack of equivalence, if indicated.

    The measurement model for CFA can be written as

x = τ + Λ_x ξ + δ, (7)

(Vandenberg & Lance, 2000). In this equation, x represents a q × 1 vector of observed variables, ξ represents an n × 1 vector of latent variables, Λ_x is a q × n matrix of factor loadings, and δ is a q × 1 vector of measurement errors in x. This equation also includes the τ vector of intercepts, although generally the intercepts are assumed to be zero and are not estimated. To obtain a covariance matrix, Λ_x ξ + δ is multiplied by its transpose. Following the assumption that the measurement errors are uncorrelated with each other and with the latent construct, the covariance matrix (Σ_x) may be expressed as

Σ_x = Λ_x Φ Λ_x′ + Θ_δ, (8)

where Φ is the covariance matrix of the latent variables and Θ_δ is the diagonal matrix of measurement error variances. While Equation 8 is identical to the measurement model for exploratory factor analysis (EFA), CFA places restrictions on Λ_x, which differentiates CFA from EFA.
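Equation 8 is easy to verify numerically. The sketch below builds the model-implied covariance matrix for a hypothetical one-factor, four-indicator model; the loadings and unique variances are invented for illustration, not estimates from this study.

```python
import numpy as np

Lam = np.array([[0.8], [0.7], [0.6], [0.5]])   # Lambda_x: 4 x 1 factor loadings
Phi = np.array([[1.0]])                        # factor variance
Theta = np.diag([0.36, 0.51, 0.64, 0.75])      # diagonal unique variances

Sigma = Lam @ Phi @ Lam.T + Theta              # Equation 8: implied covariance
print(Sigma)
```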

    Measurement invariance can be examined on many different levels using CFA. In that

Σ_x can be different for different populations, it is possible to test for the equivalence of Σ_x, Λ_x, Φ,


• and Θ_δ (Raju, Laffitte, & Byrne, 2002). A test of Σ_x = Σ_x′, where Σ_x refers to the covariance matrix of the reference group and Σ_x′ refers to the covariance matrix of the comparison group, can be thought of as an omnibus test of measurement invariance. If the null hypothesis is not rejected using chi-square and other goodness-of-fit methods, the measure is generally accepted as invariant and further tests are unnecessary. However, if the null hypothesis is rejected, lack of invariance is indicated, and further tests should be conducted to identify the source of the noninvariance (Schmitt, 1982).

    There are several types of tests of invariance to identify the source of lack of

    measurement equivalence in a measure. Although there has been some inconsistency in the

    literature with regard to terminology, and number of necessary tests and order of necessary tests

    of invariance (Vandenberg & Lance, 2000), I will summarize the tests described in Vandenberg

    and Lance (2000) in the order in which they are recommended.

The test of configural invariance evaluates whether patterns of significant and non-significant factor loadings across groups are similar. The test of metric invariance (Λ_x = Λ_x′) is concerned with the equality of the values of factor loadings across groups. The test of scalar invariance (τ_x = τ_x′) indicates whether intercepts on the latent variable are equivalent across groups. The test of the invariance of the unique variances across groups (Θ_δ = Θ_δ′) examines whether like items' uniquenesses are equivalent between groups. The test of factor variance invariance (Φ = Φ′) examines whether respondents in different groups utilized a similar range of

    responses along the answer continuum. Vandenberg and Lance (2000) also discuss the test of

    equal factor covariances, although they do not find this test useful.


  • There have been several sets of step-by-step recommendations for undertaking the

aforementioned tests (e.g., Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000).

    While different sequences of testing have been reported, the overarching idea is that each test is

    undertaken sequentially, and at each step, more restrictive constraints are added to the model

    (e.g., setting equal factor loadings on like items across groups). The more restricted model is

    compared in terms of goodness of fit (i.e., χ2 value and other goodness of fit indices) to the less

    restricted or baseline model. The source of lack of measurement invariance is indicated at the

    level of testing when the more restricted model does not meet acceptable standards for goodness-

    of-fit.
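Under maximum likelihood estimation, each step of this sequence reduces to a chi-square difference test between nested models. The sketch below uses hypothetical fit statistics purely for illustration (as Chapter 3 notes, this simple subtraction is not valid under WLSMV estimation).

```python
from scipy.stats import chi2

# hypothetical ML fit statistics for two nested invariance models
chisq_restricted, df_restricted = 112.4, 38   # e.g., loadings constrained equal
chisq_baseline, df_baseline = 95.1, 30        # configural (less restricted)

diff = chisq_restricted - chisq_baseline      # chi-square difference
ddf = df_restricted - df_baseline             # difference in degrees of freedom
p = chi2.sf(diff, ddf)                        # upper-tail probability
print(f"delta chi2 = {diff:.1f}, df = {ddf}, p = {p:.3f}")
```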

    Connections between CFA and IRT

    Raju, Laffitte, and Byrne (2002) compared and contrasted CFA and IRT techniques for

    evaluating measurement invariance. They reported that CFA and IRT methods are alike in that

    they are both concerned with determining whether true scores are equivalent across groups for

    respondents with equal amounts of the latent trait. The two techniques similarly allow for group

    differences in the distributions of scores across subgroups. CFA and IRT also both give

    information about the source and extent of any lack of measurement invariance identified.

    While CFA and IRT are alike in certain respects, Raju, Laffitte, and Byrne (2002) also

    noted several differences between the two techniques. Primarily, CFA modeling assumes a

    linear relationship between the construct and items, while IRT assumes a non-linear relationship.

    They found that with dichotomously scored items, the IRT approach is a more appropriate model

    in terms of expressing the relationship between the measured variable and the continuous latent

    construct. They found that the literature describing CFA models is more advanced in terms of

    simultaneously looking at multiple latent constructs and multiple populations, although the IRT


  • literature is progressing in this direction. While CFA has been used to examine equivalence of

    error variances across populations, IRT methods have not in practice examined an equivalent

    statistic: the invariance of the standard error of measurement for θ across populations of

    respondents. The CFA framework, on the other hand, does not have a means for determining the

    probability of a respondent with a given θ selecting a particular response category. These two

techniques, when used in concert, can provide complementary information (Meade & Lautenschlager, 2004; Reise et al., 1993).

    Several connections between the statistical frameworks of IRT and CFA models have

    been discussed recently. In particular, Takane and de Leeuw (1987) provided proofs showing

    that a two-parameter normal ogive IRT model is equivalent to a CFA with dichotomous

    variables. Researchers have extended upon this parallel to investigate DIF using a multiple-

    indicator/multiple cause (MIMIC) model. The MIMIC model is a special case of the factor

    analysis model which includes causal variables. Work by Muthén et al. (1991) provided

    equations for converting MIMIC model parameters to IRT discrimination and difficulty

parameters. MacIntosh and Hashim (2003) presented a procedure for deriving standard errors for these parameters from the MIMIC model estimates.


  • CHAPTER 2

LITERATURE REVIEW

    The MIMIC Model

    This paper is primarily concerned with demonstrating the use of MIMIC modeling to

    identify DIF. The MIMIC model is an SEM-based alternative to the multiple-sample CFA

    analysis described earlier. In a MIMIC model, one or more grouping (or background) variables

    function simultaneously as contributors to differences in the latent trait and as covariates upon

    which the outcome variables are regressed (Muthén, 1989). In this study, one dichotomous

    grouping variable (gender) is included in the model.

    The MIMIC model with a dichotomous grouping variable can be expressed as

η = γ_x′x + γ_z′z + ζ, (9)

where η is the latent trait, x represents observed background variables, z represents a dummy variable, and ζ represents an error term, which is normally distributed and independent of x and z (MacIntosh & Hashim, 2003). MIMIC modeling with categorical data, like IRT-based DIF detection, involves comparing a latent response variable (y*_j) to a threshold (τ_j). If the threshold is exceeded (y*_j > τ_j), then the indicator of the latent response variable (y_j) is one. If the threshold is not exceeded, y_j is zero.

The latent response variable (y*_j) can be modeled as a combination of the indirect effects through the latent trait variable (η) and the direct effects of the dummy variable (z_k), which is a measure of the grouping variable (e.g., gender) that is being examined for potential contributions to DIF:

y*_j = λ_j η + β_j z_k + ε_j, (10)

where λ_j is the factor loading, β_j is the slope relating the grouping variable to the response variable, and ε_j is the random error (Finch, 2005).

    The procedure for assessing DIF using MIMIC modeling entails estimating the direct and

    indirect effects of group membership on the latent trait and item response (Finch, 2005).

    Significant indirect effects of group membership on the item indicate that differences in item

    response are influenced by group differences on the mean of the latent factor. For example,

    assessing indirect effects can indicate whether a greater level of the latent trait “aggression” in

    boys contributes to higher endorsement of items on an aggression scale.

Significant direct effects between the grouping variable and the item indicate that group

    membership directly impacts item response, apart from any group difference on the latent trait.

    Evaluating direct effects, after controlling for indirect effects, is the procedure for assessing

    uniform DIF. For example, this procedure can indicate whether boys are endorsing higher levels

    of aggression items after controlling for differences in the underlying trait “aggression” (Finch,

    2005).
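The distinction between the indirect and direct routes can be seen by simulating data from Equations 9 and 10. All parameter values in this sketch are invented; a nonzero β builds uniform DIF into one item over and above the group difference on the latent trait.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 600
z = rng.integers(0, 2, n)              # grouping variable (0 = girls, 1 = boys)

# Equation 9: the latent trait depends on group membership (indirect route)
gamma_z = 0.4
eta = gamma_z * z + rng.normal(0, 1, n)

# Equation 10: latent response for one item; beta != 0 produces uniform DIF
lam, beta = 0.8, 0.5
y_star = lam * eta + beta * z + rng.normal(0, 1, n)

tau = 0.5                              # threshold
y = (y_star > tau).astype(int)         # observed dichotomous indicator
print(y[z == 1].mean() - y[z == 0].mean())   # group gap beyond the trait gap
```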

    Advantages of the MIMIC model

    There are several unique advantages to using a MIMIC model to identify DIF. A MIMIC

    model can be used to simultaneously obtain estimates of group difference in item response (DIF)

    and amount of the latent trait. That is, the MIMIC model provides information about the

    structural model and the measurement model (Muthén, 1989). MIMIC models can be estimated

    for data of ordinal or continuous scales, data with multiple grouping variables (including

    grouping variables with more than two groups), and data with multiple independent variables,

including categorical or continuous variables (Glöckner-Rist & Hoijtink, 2003; Muthén, 1988).


  • Whereas IRT models require large sample sizes (Reise, Widaman, & Pugh, 1993), MIMIC

    models can accommodate smaller sample sizes. Finally, researchers may be more familiar with

    CFA-based procedures than analyses requiring knowledge of IRT and IRT software.

    There are also some disadvantages of MIMIC modeling cited in the literature. Jones

    (2006) notes that MIMIC modeling cannot account for a guessing parameter, unlike IRT-based

    methods. However, in the behavior rating scales evaluated in this study, guessing is unlikely.

Another disadvantage is that a single-group MIMIC model can only identify uniform DIF; this study therefore pairs the MIMIC analysis with a multiple-group approach, which is able to test for uniform and non-uniform DIF.

    DIF Detection using MIMIC Models: A Comparison to IRT Models

    Finch (2005) conducted Monte-Carlo simulations to compare MIMIC model detection of

    DIF to SIBTEST, IRT LR, and the Mantel-Haenszel (MH) statistic. He evaluated Type I error

    rate and power under several sets of conditions, varying the size of the reference group (100 or

    500), the number of items (20 or 50), the parameter model assessed, (three parameter logistic

    model or two parameter logistic model), the amount of DIF contamination in the anchor items

    (none or 15%), and the amount of DIF present in the target item (0 or .6). The study conditions

    were completely crossed, and each combination of specifications was tested with 500

    replications.

• He found that the MIMIC model performed as well as or better than traditional DIF detection methods under some, but not all, conditions. Specifically, the MIMIC model performed well on the 50-item test and when the two-parameter logistic model was used. For the 50-item test, the

    MIMIC model was more resistant than the other techniques to Type I error inflation when DIF

    contamination was present in the anchor items. However, in the case where the exam was short


(20 items) and the three-parameter logistic model was used, the MIMIC model had an

    undesirably high rate of Type I error.

    In terms of power, the MIMIC model performed well. Power was especially good when

the exam had 50 items and when the two-parameter logistic model was used. Under these

    conditions, the MIMIC model matched or exceeded the power of the other techniques.

    Specifically, the MIMIC model outperformed the MH statistic and the SIBTEST when the

    reference group was smaller (100) and the level of DIF contamination in the anchor items was

    higher (15%).

    Prior Studies Using MIMIC Modeling to Detect DIF

    Several prior studies have used MIMIC modeling to check for DIF. Grayson et al. (2000)

    used a MIMIC model to assess a depression scale for uniform DIF associated with demographic

    (e.g., age), disability and physical disorder variables. They conducted their analyses in three

    steps. First, a confirmatory factor analysis tested the acceptability of the structural model.

    Second, each of the predictor variables (demographic, disability, physical disorder) was

    introduced into the model serially. Each model included an indirect path from the predictor to

    each item via the latent variable as well as a direct path to each item. Notably, the direct paths

    from the predictor to the items were estimated in the same model and not sequentially. One

    referent item was constrained to have zero bias for model identification reasons. For each

    predictor, the researchers flagged significant parameter estimates for these paths linking items to

    the predictor. The researchers frame this step as a screening procedure. That is, they were

    interested in identifying predictors that had no significant direct effects on the items, and were

    thus unlikely to be contributing to DIF. These predictors were then eliminated from the final

    model. In the third and final step of the analyses, each of the significant predictors was added


  • into the model together, and the resulting multivariate model was estimated. The researchers

    were again interested in identifying significant effects from the predictor to the items, although

    in the multivariate model, all other predictors are held constant in the estimation procedure.

    The researchers conducted the analysis using maximum likelihood parameter estimates.

    The goodness of fit index (GFI), the Tucker-Lewis index (TLI), and the root mean square error

    of approximation (RMSEA) values assessed model fit. The researchers were concerned about

    violations to multivariate normality; therefore they used bootstrapping to obtain confidence

    intervals. The researchers determined that a biased loading parameter estimate that exceeded

    twice its standard error (z score of 2) was statistically significant in this context. The loadings on

    the latent variable required a z score of 1.5 to reach significance.

    In reporting the results, the researchers partitioned the effects of each predictor into a bias

    effect (i.e., the sum of the direct effects from the predictor to the items) and an actual effect (i.e.,

    the sum of the direct effects from the latent variable to the items, multiplied by the effect from

    the predictor to the latent variable). These effects were compared to the critical ratios described

    above (2 for bias parameters, 1.5 for direct effect from the predictor to the latent variable). Items

    that exceeded the bias cut-off on one or more predictors were identified as exhibiting DIF.
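As a sketch of this partitioning, with invented numbers rather than values from Grayson et al. (2000), the two effects for a single predictor could be computed as follows.

```python
import numpy as np

direct = np.array([0.00, 0.12, -0.05, 0.30, 0.00])    # predictor -> item paths
loadings = np.array([0.70, 0.65, 0.60, 0.55, 0.50])   # latent -> item paths
gamma = 0.25                                          # predictor -> latent path

bias_effect = direct.sum()                # sum of direct (potential DIF) paths
actual_effect = (loadings * gamma).sum()  # effect routed through the latent trait
print(bias_effect, actual_effect)
```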

    Gallo et al. (1994) used MIMIC modeling to identify age-related uniform DIF in a

    depression scale with marital status, minority status, cognitive status, and recent unemployment

    in the model. The latent variable, depression, was regressed on each covariate. The analysis was

    conducted by successively testing for significant parameter estimates between age and each item.

    Analyses were conducted using LISCOMP’s limited-information generalized least

    squares estimator for dichotomous response. Model fit was reported using the descriptive fit

    value, the goodness-of-fit index, the adjusted goodness-of-fit index, and the critical number.


  • Christensen et al. (1999) examined a depression scale and an anxiety scale for age-related

    uniform DIF using a MIMIC model. They evaluated a two-factor measurement model including

    five demographic covariates (age, sex, marital status, financial status, and level of education).

    First, the researchers conducted a CFA to assess the fit of the measurement model.

    Next, the researchers conducted analyses to select a referent item for the substantive DIF

analysis. For identification purposes, one item must be selected as the referent (i.e., no-DIF)

    item. The researchers tested each item individually (e.g., testing the significance of the direct

    paths to the item from each covariate consecutively) to determine which items show no DIF

    across all of the covariates. An item that showed weak direct associations with the covariates

    was selected as the referent.

    Finally, the substantive model was analyzed. The model included paths from each

    covariate to both latent variables, and to each item. This model was similar to the model

    described in Grayson et al. (2000), in that DIF was estimated for each of the items

    simultaneously (i.e., paths from the covariate to each of the items were included in one model).

    Analyses were conducted using maximum likelihood parameter estimates using Amos

    3.6.1. Model fit was reported using the goodness-of-fit index (GFI), the non-normed fit index

    (NNFI) and the root mean square error of approximation (RMSEA). The researchers were

    concerned about violations to multivariate normality; therefore bootstrapping was used to obtain

    confidence intervals.

    In an article that demonstrated a procedure for calculating the standard error of the

    estimates of IRT difficulty and discrimination from MIMIC model parameters, MacIntosh and

    Hashim (2003) employed MIMIC modeling to identify uniform gender-related DIF on a scale

    measuring racial prejudice. They included gender in the model as the predictor variable, along


  • with three other covariates (educational status, political conservatism, and religious

    fundamentalism). The researchers conducted their analyses in two steps. First, they ran the

    model with a path from gender to the latent variable and with one path directly from gender to an

item. The researchers obtained the R² value for the latent variable from the output of this analysis. This value was used to set the variance of the latent variable for the second run to 1 − R². Once this variance was set, the model was sequentially run to test each item for DIF in the same manner as Gallo et al. (1994); that is, with parameter estimates from gender to the item.

    The Mplus program was used to run the analysis. The researchers reported model fit

    using the chi-squared value.

    Gender-related DIF in Measures of Aggression

    Obtaining accurate measures of self-reported aggression in adolescents is of interest to

    researchers, educators, and policy-makers alike. Aggression scales have been used to calculate

    the prevalence of aggression in schools (e.g., Nansel et al., 2001) and as outcome variables for

    evaluating the impact of violence-prevention programs (e.g., Farrell, Meyer, & White, 2001).

Gender differences in levels of aggression and in type of perpetration (e.g., physical, relational) have been topics of particular consideration in the literature. Boys have generally scored higher than girls on measures of physical aggression (e.g., Bongers et al., 2004; Broidy, Nagin, Tremblay, Bates, Brame, Dodge, et al., 2003), although researchers have recently argued that adolescent aggression is not constrained to physical acts. Notably, Crick and Grotpeter (1995) coined the term “relational aggression” to encompass behaviors that are purposefully damaging to the victim’s peer relationships. Crick and Grotpeter (1995) argue that, among girls, these aggressive behaviors are more common than physical aggression.


  • While measuring gender differences on different types of aggression has been a topic of

    considerable interest, there are significant gaps in the literature regarding the validity of the

    instruments used, particularly in terms of measurement invariance. Typically, researchers have

    evaluated gender differences in aggression by comparing means (e.g., Crick & Grotpeter, 1995;

    Björkqvist, Lagerspetz, & Kaukiainen, 1992). As discussed earlier, a simple means comparison

    may not be interpretable without first obtaining evidence of measurement invariance.

    The purpose of this study is to evaluate the measurement invariance of two measures of

    aggression (physical and relational) across gender groups in order to evaluate whether

    differences in self-reported aggression are due solely to differences in the latent trait, or if

    differences are partially due to factors related to measurement. This topic may be of particular

    importance because the wording of items on the aggression scale may lend itself to differential

    item functioning. In particular, the Crick and Grotpeter (1995) relational aggression scale was

    developed with the aim of capturing behaviors that they theorized were common in girls. If the

    authors were particularly concerned with female-oriented behavior, the wording of items may

    reflect subtle gender-related bias. On the other hand, because physical aggression is more

    commonly associated with boys, measures of physical aggression may include an over-

    representation of items that are more salient to male respondents.


  • CHAPTER 3

    PROCEDURE

    Sample

Data for this study are from the GREAT Schools and Families project (GSF), a seven-year, multi-site violence prevention project described in detail in a supplement to the American Journal of Preventive Medicine (Horne, 2004). The data used here comprise the Spring 2004

    student survey of a randomly-selected sample of students from one cohort at the University of

    Georgia (UGA) site. Students attended one of nine middle schools in Northeast Georgia. Of the

    719 students who were eligible to participate in GSF, 623 (87%) completed the Spring 2004

    student assessment. At this assessment wave, all respondents were in 7th grade, unless they had

    been retained.

    The sample was 49% female. Of the 612 students who selected one race, 53% were

    white, 34% were black, less than 1% were American Indian or Asian Indian, 2% were other

    Asian, and 10% were some other race. There were 14 students (2%) who selected more than one

    race. Twelve percent of students were Hispanic. Students ranged in age from 12 to 15.

    Participants took the survey via computer assisted survey interview (CASI) using laptop

    computers. Participants wore headphones, and the CASI program read the questions aloud.

    Respondents recorded their answers via the keyboard. All surveys were proctored by GSF staff.

    Students were assigned an ID number, and the survey was confidential.

    This study examines three types of behaviors: physical aggression, relational aggression,

    and overt victimization.


  • Instrumentation

    The items that are examined in this study are taken from the problem behavior frequency

    scale, a collection of 47 items grouped into 7 subscales that assess the 30-day frequency of

    problem behaviors, such as aggression and delinquency. This study is concerned with three

    subscales (henceforth called scales): physical aggression, relational aggression, and overt

    victimization.

    Physical aggression is a 7-item scale that measures self-reported physical aggression in

    the past 30 days. The stem is, “In the past 30 days, how many times have you,” followed by

    descriptions of physical aggression (e.g., hitting) or other serious aggressive behavior (e.g.,

    threatening someone with a weapon). Relational aggression is a 6-item scale that measures self-

    reported relational aggression in the past 30 days. The stem is, “In the past 30 days, how many

    times have you,” followed by descriptions of relational aggression, such as spreading a false

    rumor about someone. Overt victimization is a 6-item scale that measures self-reported

    victimization in the past 30 days. The stem is, “In the past 30 days, how many times has this

    happened to you,” followed by descriptions of victimization (e.g., been pushed).

    For each item, the six Likert-type response categories range from “never” to “20 or more

    times.” These responses are coded from one to six, respectively, and the scale score is the

    average of the item scores.

    In terms of reliability, each scale demonstrated acceptable internal consistency based on

    the current dataset (physical aggression, α = .85; relational aggression, α = .81; overt

    victimization, α = .86). High internal consistency indicates that respondents demonstrated

    consistency in answer selection across subsets of items (Crocker & Algina, 1986, p. 135). These


values are consistent with or higher than internal consistencies reported for other data sources (Farrell et al., 2001; Crick & Bigbee, 1998; Orpinas & Frankowski, 2001).
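For reference, coefficient alpha can be computed directly from an item-response matrix with the standard formula. The sketch below is generic; the example call in the comment assumes a hypothetical array of scored responses, not the study data.

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n respondents) x (k items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# e.g., cronbach_alpha(responses[:, :7]) for a hypothetical 7-item scale
```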

    In general, published validity evidence on the scales used is scarce. In many cases, the

    internal consistency of the measure is offered as the only indicator of validity evidence (see

    Dahlberg et al., 2005). More comprehensive validity evidence is available for one of the

    measures used. The physical aggression scale was adapted in part from Orpinas and

    Frankowski’s 2001 Aggression Scale. Validity studies conducted on this scale using three

    samples of middle school students indicated that scores on the Aggression Scale were

    significantly positively related to teacher-rated aggression, and self-reports of drug use, weapon-

    carrying, and injuries due to fights (Orpinas & Frankowski, 2001).

    Computer Program

    The data were analyzed using SPSS Version 16.0, and Mplus Version 5 (Muthén &

    Muthén, 2007). Syntax for all analyses is provided in Appendix A.

    Detection of Gender-related DIF

    Gender-related DIF detection was conducted on the two scales of interest: physical

    aggression and relational aggression. A measure of victimization was included in the model as a

    covariate. Victimization scores were categorized into three groups: no victimization reported

    (30% of respondents), one instance of victimization reported (13% of respondents) and more

    than one instance of victimization reported (57% of respondents). By adding victimization to the

model, it is possible to evaluate gender-related DIF while adjusting for the effect of

    victimization on the aggression measure. Experiences of victimization may moderate the

    relationship between gender and the tendency to endorse items on the aggression scale. In other


  • words, differences in victimization may explain some of the gender differences in item

    endorsement that I would have otherwise attributed to DIF.

    Two approaches were used to evaluate the measurement invariance of the aggression

    measures. First, the MIMIC approach was used to test each item for DIF. Second, I employed

    multiple group analysis to obtain a test of the overall invariance of the measures across groups.

    Each of the procedures was conducted for the physical aggression scale and the relational

    aggression scale independently.

    Multiple-Indicator/Multiple Cause Modeling

    Single group MIMIC modeling was used to identify items that exhibit DIF. For

    identification purposes, the mean of the latent variable was set to zero and the variance set to one

    according to a procedure documented in MacIntosh and Hashim (2003). To set the latent mean

    to zero, the exogenous variables (i.e., gender, victimization) are mean centered. To set the latent

    factor variance to one, a two-step procedure was demonstrated in MacIntosh and Hashim (2003).

First, the model is run with no constraints on the variance of the latent factor to obtain the R² value for the latent factor. Second, the model is estimated again with the variance of the latent factor set to (1 − R²). I received a warning when running the first step of the procedure regarding the identification of the model, and the program was not able to calculate the R² value for the latent factor. In order to identify the model for the first step, I set the variance of the latent

    variable to 1. After re-running the model, I was able to obtain the R2 value for the latent factor.

    The second step of the model was run exactly as outlined above and no further problems were

    encountered.
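The arithmetic of the constraint itself is simple; a one-line sketch with a hypothetical R² value shows what is entered into the second run.

```python
# Step 1 (unconstrained run) reports R-squared for the latent factor; step 2
# fixes the factor's residual variance at 1 - R^2 so its total variance is one.
r_squared = 0.21                  # hypothetical value from the first run
residual_variance = 1 - r_squared
print(residual_variance)          # the value fixed in the second run
```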

    First, a baseline model containing no direct effects from gender to item responses (i.e., no

    DIF) was analyzed. Girls are coded zero, and boys are coded one. Model fit was evaluated


using the chi-square (χ²) statistic along with two other fit indices, the root mean square error of approximation (RMSEA) and the comparative fit index (CFI). The χ² test evaluates the hypothesis that the original covariance matrix and the estimated or reproduced covariance matrix are identical. However, the χ² statistic has been shown to be sensitive to trivial differences between these matrices under large sample sizes (Bentler, 1990). In other words, significant p-values may be obtained even for models that fit the data well. In order to obtain a more comprehensive view of model fit, additional fit indices beyond χ² are typically reported.

    The RMSEA is a stand-alone fit index that adjusts for the complexity of the model,

    favoring parsimony. It is a standardized measure of the degree to which the population data do

    not fit the model. Hu and Bentler (1998) suggest that values of .06 or lower are indicative of

    good fit. The CFI is an incremental fit index that compares the amount of non-centrality in the χ2

    distribution of the specified model to a baseline (null) model. Hu and Bentler (1998) recommend

    using .95 (or above) as a cut off for good model fit for the CFI.
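Both indices can be computed from chi-square statistics using their standard noncentrality-based formulas, as in this sketch; the null-model values below are hypothetical and are used only to show the calculation.

```python
import math

def rmsea(chisq, df, n):
    """RMSEA point estimate from the model chi-square."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

def cfi(chisq_m, df_m, chisq_0, df_0):
    """CFI from the noncentrality of the target (m) and null (0) models."""
    d_m = max(chisq_m - df_m, 0.0)
    d_0 = max(chisq_0 - df_0, d_m)
    return 1.0 - d_m / d_0

# hypothetical inputs for illustration
print(rmsea(73.3, 20, 621))         # about .066
print(cfi(73.3, 20, 1500.0, 28))    # about .964
```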

    Weighted least squares means and variances (WLSMV) estimation was used for all

analyses. WLSMV is a robust weighted least squares (WLS) estimator which, like WLS, relies on an asymptotic covariance matrix, making both estimators attractive options for use with non-normal data. However, WLS requires extremely large sample sizes and is thus not practical for most datasets. WLSMV is less computationally intensive than WLS, which results in smaller sample size requirements. WLSMV adjusts the mean and variance of the χ² value, along with parameter estimates and standard errors, to account for the level of non-normality in the data (Finney & DiStefano, 2006, p. 298).

    Next, a series of models were run to test for uniform DIF. Each item was tested

    sequentially by adding a direct path ( β ) from gender to the item in the model. A significant


  • parameter estimate for this direct path is indicative of uniform DIF. That is, gender is explaining

    differences in item means above and beyond that which is explained via the indirect path of

    gender to the item through the latent variable.

    In order to test if including DIF effects (β ) results in significantly improved model fit, a

    DIF model was created by including all significant β’s in the model. The fit of the DIF model

    was compared to the baseline model using a χ2 difference test. It is important to note that under

    WLSMV estimation, the χ2 values and degrees of freedom function differently than maximum

    likelihood based estimators. According to comments posted by Linda Muthén on the Mplus

    Discussion Board (2007), the χ2 values and degrees of freedom obtained from WLSMV

estimation cannot be interpreted in the same way as values obtained under ML estimation

    (e.g., degrees of freedom calculated as a function of the number of estimated parameters

    subtracted from the number of elements in the covariance matrix), because these values obtained

    under WLSMV are adjusted to obtain accurate p-values. Similarly, a χ2 difference test between

    two nested models estimated using WLSMV cannot be conducted in a straightforward manner

    by subtracting the χ2 value and degrees of freedom of the unconstrained model from the

    constrained model. Instead, the DIFFTEST command in Mplus must be used to obtain the

    accurate p-value for this test.
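In short, the difference statistic must come from the DIFFTEST output rather than from hand subtraction. Given the adjusted difference statistic and degrees of freedom that DIFFTEST reports (the numbers below are hypothetical), the p-value follows in the usual way.

```python
from scipy.stats import chi2

# values as reported by Mplus's DIFFTEST (hypothetical); naively subtracting
# WLSMV chi-square values and df would NOT yield a valid test
diff_stat, diff_df = 14.2, 3
print(chi2.sf(diff_stat, diff_df))   # p-value for the nested-model comparison
```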

    Multiple Group Analysis

    Multiple group analysis was used to further evaluate the equivalence of the factor

    structures of the physical and relational aggression scales across gender. Multiple group

    analysis considers the fit of the model as equality constraints are placed on two single-gender

    groups. For each scale, the data from each single-gender group are first fit to a baseline or

    configural model without equality constraints to determine whether the same pattern of


  • significant and non-significant parameter estimates holds across groups. If so, there is evidence

    of configural invariance, and a more restrictive model is placed on the data. According to the

    guidelines provided in the Mplus manual for conducting multiple group analysis with ordered

    categorical data, the tests of metric invariance and scalar invariance described earlier are

    combined into one test. That is, the factor loadings and thresholds are constrained all in one step,

    “because the item probability curve is influenced by both parameters” (Muthén & Muthén, 2007,

    p. 399). Because the two models are nested, χ2 difference tests are conducted to determine if the

    added constraints result in significantly poorer fit.


  • CHAPTER 4

    RESULTS

    Descriptive Statistics

    Means and standard deviations for the physical aggression and relational aggression

    scales are presented separately by gender in Table 1. Spearman item intercorrelations, skewness,

and kurtosis are provided in Table 2.

    An examination of descriptive statistics (i.e., skewness, kurtosis) revealed violations of

    univariate normality in several items. Because these scales measure the 30-day frequency of

aggressive behaviors, it is unsurprising that responses are skewed toward “never.” Two items on the physical aggression scale and four items on the relational aggression scale had skewness and kurtosis values outside of the acceptable range suggested by Kline (2005): |3| for skewness and |10| for kurtosis. Because these values fall outside of the acceptable

    range, these items cannot be assumed to be univariate normal (D’Agostino, 1986), and the data

    as a whole cannot be assumed to be multivariate normal.
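A screening pass of this kind is straightforward to reproduce. The sketch below flags a simulated stand-in item against the Kline (2005) cutoffs; it does not use the study data.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(7)
item = rng.poisson(0.3, size=621) + 1      # toy 30-day frequency item, coded 1+

s, k = skew(item), kurtosis(item)          # kurtosis() returns excess kurtosis
print(s, k, abs(s) > 3 or abs(k) > 10)     # flag per the Kline (2005) cutoffs
```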

    Because of the non-normality in the data, I selected an estimation method that was not

    based on normal theory, weighted least squares means and variances (WLSMV). As discussed

    earlier, WLSMV accounts for the categorical nature of the data and adjusts for non-normality,

specifically through weighted least squares parameter estimates, mean- and variance-adjusted χ²

    values, and scaled standard errors.

    Prior to conducting the MIMIC and multiple groups analyses, means and standard

    deviations of items were examined for differences in gender. For the physical aggression scale,


    boys reported more aggression than girls across all items except for one. The mean of item 3

    (“threatened to hurt a teacher”) was slightly higher for girls (M=1.17) than boys (M=1.11).

    Overall, the item standard deviations were smaller for girls than for boys. For the relational

    aggression scale, item means for girls’ and boys’ scores were quite similar, except for one item.

    The mean of item 6 (“said things about another student to make other students laugh”) was

    higher for boys (M=2.43) than girls (M=2.08). All of the item standard deviations were smaller

    for girls than for boys.

• Table 1
    Physical and Relational Aggression Items: Means and Standard Deviations for Boys and Girls

                                                                              Boys           Girls
    Scale                 Item                                              M      SD      M      SD
    Physical Aggression   1. Thrown something at another student to
                             hurt them                                    1.87   1.283   1.59   0.999
                          2. Been in a fight in which someone was hit     1.81   1.253   1.41   0.910
                          3. Threatened to hurt a teacher                 1.11   0.498   1.17   0.705
                          4. Shoved or pushed another kid                 2.50   1.583   2.05   1.391
                          5. Threatened someone with a weapon (gun,
                             knife, club, etc.)                           1.19   0.755   1.10   0.535
                          6. Hit or slapped another kid                   2.06   1.486   1.82   1.154
                          7. Threatened to hit or physically harm
                             another kid                                  1.80   1.392   1.56   1.133
    Relational Aggression 1. Didn't let another student be in your group
                             anymore because you were mad at them         1.63   1.094   1.64   0.987
                          2. Told another kid you wouldn't like them
                             unless they did what you wanted them to do   1.19   0.701   1.20   0.656
                          3. Tried to keep others from liking another
                             kid by saying mean things about him/her      1.46   1.065   1.44   0.869
                          4. Spread a false rumor about someone           1.35   0.940   1.31   0.757
                          5. Left another kid out on purpose when it
                             was time to do an activity                   1.46   1.068   1.43   0.940
                          6. Said things about another student to make
                             other students laugh                         2.43   1.640   2.08   1.314

    N = 621 (313 boys, 308 girls)


• Table 2
    Physical and Relational Aggression Items: Intercorrelations, Skewness, and Kurtosis

              Physical Aggression items                 Relational Aggression items
    Item      1     2     3     4     5     6     7     1     2     3     4     5     6
    Physical Aggression
     1        --
     2       .45    --
     3       .21   .21    --
     4       .48   .42   .20    --
     5       .36   .32   .31   .31    --
     6       .49   .44   .19   .65   .31    --
     7       .48   .43   .30   .50   .41   .54    --
    Relational Aggression
     1       .34   .30   .16   .30   .11   .24   .29    --
     2       .30   .29   .22   .29   .23   .22   .28   .36    --
     3       .35   .31   .17   .38   .24   .33   .36   .38   .44    --
     4       .38   .30   .22   .36   .21   .30   .39   .36   .42   .53    --
     5       .35   .25   .16   .32   .19   .29   .36   .44   .36   .56   .44    --
     6       .41   .31   .13   .55   .26   .51   .48   .31   .26   .41   .31   .37    --

    Skewness 2.11  2.23  5.63  1.21  5.69  1.68  2.20  2.31  4.59  3.02  3.62  3.05  1.31
    Kurtosis 4.51  4.93 34.88  0.51 34.85  2.24  4.16  6.05 24.39 10.18 14.88  9.90  0.78

  • Outliers

    The existence of outliers in data can be problematic. On one hand, outliers can exert

    disproportionate influence on results and can sometimes be the result of data entry errors or

    untruthful respondents. On the other hand, outliers can reflect true variation in respondents (e.g.,

    highly deviant behaviors) and are therefore of interest to the researcher.

    In order to screen for outliers, I used the macro given in DeCarlo (1997). DeCarlo’s

macro calculates the Mahalanobis distance for each observation and tests each distance for

significance as a multivariate outlier. Fifty-three observations, or 9% of the total sample, met the criteria for multivariate

    outliers, using the critical value at the .05 level of significance. These outliers were not removed

    from the dataset because I hypothesized that these values were most likely due to true differences

    in students and not errors in the dataset or untruthful responses.
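
DeCarlo's (1997) macro is distributed for SPSS and SAS; for readers working elsewhere, the sketch below shows an equivalent screen in Python, assuming an item-response matrix like the hypothetical one in the earlier sketch. Squared Mahalanobis distances are compared against a χ² critical value with degrees of freedom equal to the number of items:

    import numpy as np
    from scipy.stats import chi2

    def mahalanobis_outliers(X: np.ndarray, alpha: float = 0.05) -> np.ndarray:
        """Boolean mask flagging multivariate outliers at the given significance level."""
        diff = X - X.mean(axis=0)
        inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
        d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared Mahalanobis distances
        return d2 > chi2.ppf(1 - alpha, df=X.shape[1])

    # Usage: mask = mahalanobis_outliers(items.to_numpy()); mask.sum() gives the number
    # of flagged observations (53 of 621, or 9%, in the present data).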

    Missing Value Treatment

    There was very little missing data in the dataset. In total, only two students did not

    answer every item. The Mplus program accommodates missing values using full information

    maximum likelihood estimation.

    Physical Aggression Scale

    Parameter estimates and standard errors for the MIMIC analysis for the physical

    aggression scale are presented in Table 3. For the baseline (i.e., no DIF) model, the fit of the

physical aggression scale was adequate (χ²(20) = 73.308, p < .001). Inspection of the β estimates in Table 3 revealed statistically significant direct effects of gender for item 2 (“been in a fight in which someone was hit”), item 3 (“threatened to hurt a teacher”), and item 6 (“hit or

slapped another kid”). These significant values provide evidence that these items may exhibit

    different measurement characteristics based on the gender of the respondent.
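
For reference, the MIMIC model with a direct effect can be stated compactly using the notation of Table 3; this is a sketch of the standard formulation rather than the exact Mplus specification, with y_j^* the latent response underlying ordinal item j, η latent physical aggression, and x_g, x_v the gender and victimization covariates:

$$ y_j^* = \lambda_j \eta + \beta_j x_g + \varepsilon_j, \qquad \eta = \gamma_1 x_g + \gamma_2 x_v + \zeta, $$

where the observed response is determined by the thresholds τ_{j1} < … < τ_{j5}. A nonzero β_j is a direct effect of gender on the item, over and above the indirect effect of gender through η, and is the MIMIC criterion for DIF.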

Next, a DIF model was estimated by including all three significant β’s in the model. The

fit of the model was adequate (χ²(18) = 54.971, p < .001), and the χ² difference test relative to the baseline model was significant (Δχ²(3) = 23.711, p < .001; see Table 4), indicating that accounting for the DIF effects significantly improved model fit.

  • 4 2.290 0.157 14.593 5 2.560 0.196 13.052

    1 -0.259 0.055 -4.719 2 0.594 0.056 10.688 3 1.062 0.063 16.757 4 1.393 0.068 20.586

    Y4

    5 1.661 0.076 21.868

    1 -3.121 0.308 -10.120 2 1.608 0.118 13.592 3 2.002 0.139 14.417 4 2.234 0.163 13.725

    Y5

    5 2.357 0.178 13.244

    1 0.055 0.053 1.039 2 0.882 0.061 14.535 3 1.320 0.070 18.943 4 1.628 0.080 20.344

    Y6

    5 1.829 0.090 20.401

    1 0.532 0.057 9.323 2 1.142 0.068 16.809 3 1.526 0.075 20.248 4 1.704 0.081 21.114

    Y7

    5 1.901 0.087 21.822 β Y1 0.023 0.082 0.286 Y2 0.364 0.095 3.817 Y3 -0.319 0.155 -2.061 Y4 0.101 0.077 1.318 Y5 0.008 0.165 0.050 Y6 -0.222 0.076 -2.903 Y7 -0.101 0.084 -1.204 γ Gender 0.136 0.081 Victimization 0.519 0.046 Ψ Physical Aggression 0.297


• Table 4

    Sequential Chi-Square Tests of Invariance for Physical and Relational Aggression Scales

    Scale                  Analysis         Model      χ²        df     p      Δχ²     Δdf     p
    Physical Aggression    MIMIC            Model 0    73.308    20   .000      --      --     --
                                            Model 1    54.971    18   .000    23.711     3   .000
                           Multiple Group   Model 2   119.991    22   .000      --      --     --
                                            Model 3   112.916    28   .000    37.041    15   .001
    Relational Aggression  MIMIC            Model 0    56.385    15   .000      --      --     --
                                            Model 1    50.111    14   .000     6.154     1   .013
                           Multiple Group   Model 2    56.414    22   .000      --      --     --
                                            Model 3    35.598    18   .007     0.340     7   .999

    Model 0: No direct DIF effects

    Model 1: All significant DIF effects included

    Model 2: No invariance constraints

    Model 3: Invariance of loadings and thresholds

    I next used multiple group analysis to further evaluate the equivalence of the factor

    structures across gender. Before beginning the process, it was necessary to collapse the fifth and

    sixth response categories (“10-19 times” and “20 or more times”) for item 2 (“been in a fight in

which someone was hit”) because no girls endorsed the fifth response category. When a response

category is empty, the program cannot estimate the corresponding threshold, and the categories must be collapsed. Because of problems

    running the model (i.e., non-positive definite matrix), victimization was removed from the


  • model. After removing victimization, no additional problems were encountered running the

    analyses.
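
Collapsing sparse categories is a simple recode of the raw responses before estimation. A minimal sketch in Python, with a hypothetical column name and coding (responses 1-6, where 5 = “10-19 times” and 6 = “20 or more times”):

    import pandas as pd

    data = pd.DataFrame({"phys_item2": [1, 2, 6, 3, 5, 1]})  # hypothetical responses

    # Merge categories 5 and 6 into a single "10 or more times" category; the item
    # then has five categories and four estimable thresholds in both groups.
    data["phys_item2"] = data["phys_item2"].replace({6: 5})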

    In order to identify the model, I followed the recommendations from the Mplus Manual

(Muthén & Muthén, 2007, p. 398). That is, for the baseline model (i.e., the test of configural

    invariance), all of the item error variances are set to one, the factor means are set to zero for both

    groups, one item loading is set to one in both groups, and one threshold per item is held invariant

    (one additional threshold is held invariant for the item with the factor loading set to one). It is

important to note that by fixing these reference loadings to one and equating the reference thresholds

across groups, these parameters are assumed to be invariant. The loading of the first item was chosen as the reference

    parameter because this item showed little evidence of lack of invariance in the MIMIC analysis.

    The highest threshold was chosen to be held invariant on the theoretical grounds that there may

    be less variation across gender on the top end of the scale (i.e., differences between respondents

    endorsing one of two answer categories corresponding to high frequency of aggression) than at

    the bottom (i.e., differences between respondents endorsing one of two answer categories

    corresponding to no aggression versus low frequency of aggression).
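
Collected in one place, the identification constraints just described can be written as follows. This is a compact restatement of the scheme above, using notation consistent with Tables 3 and 5, where g indexes group, j indexes items, α^(g) is the factor mean in group g, item 1 is the reference item, and τ_{jK} is the highest threshold of item j (K = 5, or K = 4 for collapsed items):

$$ \mathrm{Var}\big(\varepsilon_j^{(g)}\big) = 1 \;\; \forall j,g, \qquad \alpha^{(g)} = 0, \qquad \lambda_1^{(g)} = 1, \qquad \tau_{jK}^{(\mathrm{boys})} = \tau_{jK}^{(\mathrm{girls})} \;\; \forall j, $$

with one additional threshold of item 1 also held equal across groups.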

Table 4 provides a summary of the sequential χ² difference tests for the multiple group

analysis. The fit of the baseline model was mediocre (χ²(22) = 119.991, p < .001). Constraining the factor loadings and thresholds to be equal across groups yielded χ²(28) = 112.916 (p < .001), and the χ² difference test was significant (Δχ²(15) = 37.041, p = .001). The invariance constraints thus resulted in significantly poorer fit, which is evidence against full measurement invariance of the physical aggression scale across gender.

  • Relational Aggression Scale

    Parameter estimates and standard errors for the MIMIC analysis for the relational

    aggression scale are presented in Table 5. The fit of the relational aggression scale for the

MIMIC analysis was adequate (χ²(15) = 56.385, p < .001). One item, item 6 (“said things about another student to make other students laugh”), had a statistically significant direct effect of gender (β = .211; see Table 5). A model including this DIF effect also fit adequately (χ²(14) = 50.111, p < .001), and the χ² difference test was significant (Δχ²(1) = 6.154, p = .013), indicating that including the DIF effect significantly improved model fit (see Table 4).

• Table 5
    Multiple-Indicator/Multiple-Causes (MIMIC) Model Estimates for the Relational Aggression Scale

                                  Estimate      SE    Est./S.E.
    λ   Y1                           0.730   0.035      20.561
        Y2                           0.852   0.040      21.320
        Y3                           0.960   0.024      40.630
        Y4                           0.856   0.036      23.545
        Y5                           0.905   0.027      33.667
        Y6                           0.641   0.033      19.639
    τ   Y1   Threshold 1             0.278   0.052       5.361
             Threshold 2             1.174   0.066      17.853
             Threshold 3             1.652   0.084      19.704
             Threshold 4             1.901   0.097      19.621
             Threshold 5             2.055   0.112      18.370
        Y2   Threshold 1             1.221   0.069      17.781
             Threshold 2             1.749   0.089      19.557
             Threshold 3             2.099   0.122      17.137
             Threshold 4             2.353   0.150      15.736
             Threshold 5             2.411   0.156      15.477
        Y3   Threshold 1             0.651   0.056      11.634
             Threshold 2             1.401   0.078      18.060
             Threshold 3             1.806   0.095      19.010
             Threshold 4             2.029   0.110      18.375
             Threshold 5             2.057   0.112      18.330
        Y4   Threshold 1             0.933   0.065      14.328
             Threshold 2             1.713   0.091      18.767
             Threshold 3             2.061   0.111      18.620
             Threshold 4             2.199   0.121      18.199
             Threshold 5             2.342   0.133      17.634
        Y5   Threshold 1             0.710   0.057      12.501
             Threshold 2             1.429   0.083      17.129
             Threshold 3             1.779   0.097      18.284
             Threshold 4             1.960   0.106      18.466
             Threshold 5             2.062   0.114      18.006
        Y6   Threshold 1            -0.300   0.052      -5.710
             Threshold 2             0.630   0.058      10.919
             Threshold 3             1.030   0.061      16.760
             Threshold 4             1.315   0.068      19.346
             Threshold 5             1.534   0.075      20.438
    β   Y1                          -0.092   0.086      -1.061
        Y2                          -0.015   0.114      -0.135
        Y3                          -0.121   0.083      -1.458
        Y4                           0.016   0.097       0.161
        Y5                          -0.013   0.091      -0.146
        Y6                           0.211   0.085       2.479
    γ   Gender                      -0.060   0.093      -0.649
        Victimization                0.468   0.052       9.067
    Ψ   Relational Aggression        0.175

    I next used multiple group analysis to further evaluate the equivalence of the factor

    structures across gender. Similar to the physical aggression scale, it was necessary to collapse

the fifth and sixth response categories (“10-19 times” and “20 or more times”) for two items. For

item 2 (“told another kid you wouldn’t like them unless they did what you wanted them to”), no

respondents endorsed the fifth response category. For item 3 (“tried to keep others from liking

another kid by saying mean things about him/her”), no girls endorsed the fifth response category.

    No other problems were encountered, and the model was estimated with victimization in the

    model.

    The same procedures that were described earlier were followed to identify the model for

    multiple groups analysis. The first item loading was held equal across groups, because this item

    did not display evidence of DIF in the MIMIC modeling procedure. The highest threshold for

    each item was held invariant for the same reason described in the physical aggression section.

Table 4 provides a summary of the sequential χ² difference tests for the multiple group

analysis. The fit of the baseline model was mediocre (χ²(22) = 56.414, p < .001; CFI = .98). After holding the factor loadings and thresholds equal across groups, the model fit

adequately (χ²(18) = 35.598, p = .007; RMSEA = .056, CFI = .99). The χ² difference test was not

significant (Δχ²(7) = 0.340, p = .999). This result indicates that adding the additional constraints to the

model does not result in a statistically significant decrease in model fit, which supports the

hypothesis that the relational aggression scale does exhibit measurement invariance across

gender.

  • CHAPTER 5

    SUMMARY AND DISCUSSION

    Summary

    The purpose of this study was to evaluate two measures of aggression in adolescents for

    gender-related DIF. MIMIC modeling was used to test for direct effects of gender on each item

    in the two scales. Another procedure, multiple group analysis, was used to provide additional

    information about the measurement invariance of the scales across gender.

    For the physical aggression scale, three items displayed evidence of DIF. The model that

    did not include these DIF effects fit the data significantly worse than the model that did include

the DIF effects. This improvement in model fit suggests that there

    are measurement differences between boys and girls on these items. The results of the multiple

    group analysis buttressed this finding. Placing equality constraints on the loadings and

    thresholds across the single-gender groups resulted in significantly poorer model fit.

    For the relational aggression scale, one item displayed evidence of DIF. The model that

    did not account for this DIF effect fit the data significantly worse than the model that did include

    the DIF effect, indicating a lack of invariance for this item. Although there is evidence that one

    item displays DIF, the results of the multiple group analysis indicated that the scale as a whole

    does exhibit gender-related measurement invariance. Imposing equality constraints on the

    loadings and thresholds of the two groups did not significantly decrease the fit of the model in

    comparison to the unconstrained model.


  • Discussion

    While statistical procedures can flag items for DIF and identify lack of invariance in

    scales, it is up to the researcher to provide possible explanations for the measurement

differences. Although it is not possible to speak conclusively about the reasons for the

    measurement differences, additional investigation of the flagged items can provide some

    tentative explanations of the findings.

    In the physical aggression scale, item 2 (“been in a fight in which someone was hit”)

    exhibited DIF. The estimate of β for this item was .364. Because boys are coded 1, this estimate

    indicates that in the model either boys endorsed higher values for this item due to gender above

and beyond the indirect effect of gender through latent physical aggression, or alternatively, that

    girls endorsed lower values. In order to gain additional information about patterns of responses

    among boys and girls, I examined item intercorrelations and response frequencies from the

marginal tables (not shown), although this information is not based on the estimated structural

equation model parameters. An examination of item intercorrelations did not reveal any

    clear pattern of differences between item 2 and the other items (i.e., item 2 correlations were

    mostly consistent with the magnitude of the correlations among other items). An examination of

the response frequencies contributes additional information. Boys were less likely than girls to

choose the response “never” (56.9% versus 77.3%, respectively). There are differences at the top end

    of the scale as well. While 4.9% of girls selected one of the last three response categories (“6-9

    times,” “10-19 times,” or “20 or more times”), 9.9% of boys chose one of these categories. It is

    possible that boys are more likely than girls to endorse higher levels of this item for reasons

    other than increased physical aggression. For example, boys may be more likely to engage in

    play fighting, which may result in one of the participants getting hit. Although this behavior can


  • be considered aggressive, it is different in nature than the malevolent aggression that this scale is

    intended to measure.

    Item 3 (“threatened to hurt a teacher”) also exhibited DIF. The estimate of β for this item

    was -.319, indicating that in the model either girls were more likely or boys were less likely to

    endorse higher values for this item due to gender-related factors beyond differences in latent

    physical aggression. Examining the item intercorrelation matrix revealed that the correlations

for this item were somewhat lower than those for other items on the physical aggression scale. In terms of

    response category distribution from the marginal table, the pattern for girls and boys looks quite

similar. However, more girls than boys indicated that they had threatened to hurt a teacher 10-19

times or 20 or more times (2% versus .6%, respectively). While it is difficult to speculate about why this item

exhibits DIF, it is possible that boys who are otherwise highly aggressive are less likely

to threaten, or to admit to threatening, teachers. This may be because many teachers are

    female and there is a strong social norm against males hitting or threatening females in the

    Southern U.S.

    Item 6 (“hit or slapped another kid”) was also flagged for DIF. The β estimate for this

    item was -.222, which indicates that in the model females may be more likely or males may be

    less likely to select higher values on this item due to measurement differences rather than true

    differences in physical aggression. Examining the item intercorrelation matrix revealed that the

    correlations for this item were fairly consistent with the correlations between the other items. In

    terms of response category frequencies from the marginal table, substantial differences exist in

the high end of the scale. Whereas 2.6% of girls reported hitting or slapping another kid 20 or

    more times, 7% of boys reported this behavior. It is possible that the word “slapping” resonates

    more with girls, as this behavior in our culture is more often associated with females.


  • For the relational aggression scale, item 6 (“said things about another student to make

    other students laugh”) demonstrated DIF. The β estimate for this item was .211, which indicates

    that males may be more likely or females may be less likely to select higher values on this item

than would be expected based on their level of latent relational aggression. Interitem correlations

    for this item were somewhat lower than correlations among other items on the relational

    aggression scale. In terms of response category frequencies from the marginal table, boys were

    more likely than girls to select the response “20 or more times” (9.9% versus 5.5%,

    respectively). It is possible that for adolescent boys, the act of making fun of others is a more

    common experience that is not necessarily always tied to a malevolent or aggressive motive.

    One important aspect of testing for measurement invariance is deciding what to do if full

measurement invariance is not supported. It may not be practical to remove from a scale all

items that exhibit DIF. Hambleton (2006) suggests that it is common to find many items that

    display DIF in psychological, attitudinal, or personality measures. He notes that it is possible

    that these effects, taken as a whole, may cancel each other out. That is, while some items may

    exhibit DIF in one direction, others counter those effects, so overall the measure does not favor

    one group over the other.

    It is also possible that while statistically significant DIF effects may exist in the model,

    the impact of these effects is not of practical importance. With larger sample sizes the power to

    detect DIF increases. Because the present dataset is reasonably large, it is possible that some of

    the significant DIF effects correspond to measurement differences that are not practically

    meaningful. In future studies, it may be desirable to examine effect size measures of the DIF

    items.


  • Limitations and Future Research

    There are several limitations inherent to the research presented in this paper. An

    important limitation is that for all of the tested models, fit was less than ideal. While the fit of

    the MIMIC models was adequate, the fit was mediocre for the multiple group analysis models,

    especially for physical aggression. When model fit is borderline in terms of acceptability, it

becomes harder to draw conclusions based on those models. For example, a non-significant χ²

difference test may reflect problems with the overall fit of the model rather than demonstrating

that the added equality constraints introduce no meaningful misfit.

    multiple groups analysis may be due to constraints that are put on the data for identification

    purposes. This possibility is discussed later in this section.

    Another problem is that the generalizability of the findings may be limited by the study

    sample. All of the participants were middle school students from one region of the U.S. While

the aggression measures examined may not demonstrate full measurement invariance in this

population, this result may not hold for other groups of students or for the same students at a

    different point in time (e.g., in high school).

Another limitation concerns the estimation method. Although WLSMV was likely the most

appropriate estimator given the categorical and non-normal nature of the data,

    there are also some disadvantages to using WLSMV. First, WLSMV is a relatively new

    estimation method, and there is relatively little research examining its functioning. Overall,

    however, research has supported its use as an improvement upon WLS estimation (Muthén,

    1993).

WLSMV also poses challenges for identifying the model when thresholds are

    being estimated. In multiple group analysis, one threshold per item must be held invariant across


groups. Ideally, the threshold chosen would truly be invariant across groups, although it is difficult

to determine whether this is the case. Systematically testing each threshold becomes time-consuming

    when there are many thresholds per item. However, if a non-invariant threshold is set to equality

    in the multiple group analysis, the overall model fit may be worsened. Additional research into

    this aspect of multiple group analysis with categorical data is warranted.

Second, WLSMV provides less information for multiple group analysis than estimators

designed for continuous data. For example, under MLM estimation, tests of metric and scalar

    invariance could have been conducted separately, whereas under WLSMV these tests are

    combined. By conducting metric and scalar tests separately, the source of lack of invariance

    (slopes or intercepts) can be investigated. While under WLSMV it is possible to identify the

    model in such a way as to estimate some differences in loadings and thresholds, this procedure is

    not recommended in the Mplus manual because the loadings and thresholds together contribute

to the item characteristic curve (Muthén & Muthén, 2007, p. 399).

    Another drawback to using WLSMV estimation is that modification indices are not

    available. Under multiple group analysis, modification indices can be used to identify items that

    potentially are problematic in terms of non-invariant loadings or intercepts. Modification indices

    are often used to test for partial invariance, where some parameters are held invariant, while

    problematic items are freely estimated. Without modification indices, I was unable to compare

    whether the two methods (MIMIC modeling and multiple groups analysis) identified the same

    problematic items.

    Certain properties of the present dataset may make accurate estimation difficult.

    Specifically, several of the items used were severely non-normal, with kurtosis values as high as

34.882. While WLSMV does adjust parameter estimates and χ² values for non-normality, the


  • accuracy of WLSMV adjustments in cases of severe non-normality has not been studied

extensively. In some cases, the WLSMV results defy intuition; for example, a more restricted

model can show better fit values than a less restricted one (as occurred here, where the constrained

multiple group model for physical aggression yielded a smaller χ² than the unconstrained model despite having more degrees of freedom; see Table 4).

    Additional studies of the functioning of WLSMV under conditions of severe non-normality

    would be useful.

    The findings of this study did not support the full measurement invariance of two

    measures of adolescent aggression. While lack of invariance was indicated for several items, the

findings of this paper are less than conclusive. In particular, for the relational aggression scale,

    one item displayed DIF, yet invariance of factor loadings and thresholds was supported. It is

    clear that additional research is needed to investigate the measurement properties of the scales

    included in this study, as well as other measures of adolescent aggression.


  • REFERENCES

    Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W.

Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-23). Hillsdale, NJ:

Lawrence Erlbaum Associates.

Baker, F. B. (2001). The basics of item response theory (2nd ed.). College Park, MD: ERIC

Clearinghouse on Assessment and Evaluation.

    Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107,

    238-246.

    Björkqvist, K., Lagerspetz, K. M., & Kaukiainen, A. (1992). Do girls manipulate and boys fight?

    Developmental trends in regard to direct and indirect aggression. Aggressive Behavior,

    18, 117-127.

    Bongers, I. L., Koot, H. M., van der Ende, J., & Verhulst, F. C. (2004). Developmental

    trajectories of externalizing behaviors in childhood and adolescence. Child Development,

    75, 1523-1537.

    Broidy, L. M., Nagin, D. S., Tremblay, R. E., Bates, J. E., Brame, B., Dodge, K. A., et al. (2003).

    Developmental trajectories of childhood disruptive behaviors and adolescent

    delinquency: A six-site, cross-national study. Developmental Psychology, 39, 222-245.

    Christensen, H., Jorm, A. F., Mackinnon, A. J., Korten, A. E., Jacomb, P. A., Henderson, A. S.,

    et al. (1999). Age differences in depression and anxiety symptoms: A structural equation

    modelling analysis of data from a general population sample. Psychological Medicine,

    29, 325-339.


  • Crick, N. R., & Bigbee, M. A. (1998). Relational and overt forms of peer victimization: A

    multiinformant approach. Journal of Consulting and Clinical Psychology, 66, 337-347.

    Crick, N. R., & Grotpeter, J. K. (1995). Relational aggression, gender, and social-psychological

    adjustment. Child Development, 66, 710-722.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth,

    TX: Harcourt Brace.

    D'Agostino, R. B. (1986). Tests for the normal distribution. In R. B. D'Agostino & M. A.

    Stephens (Eds.), Goodness-of-fit techniques (pp. 367-419). New York: Marcel Dekker.

    Dahlberg, L. L., Toal, S. B., Swahn, M., & Behrens, C. B. (2005). Measuring violence-related

    attitudes, behaviors, and influences among youths: A compendium of assessment tools.

(2nd ed.). Atlanta, GA: Centers for Disease Control and Prevention, National Center for

    Injury Prevention and Control.

    DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2, 292-307.

    Edelen, M. O., Thissen, D., Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2006).

    Identification of differential item functioning using item response theory and the

    likelihood-based model comparison approach: Application to the Mini-Mental State

    Examination. Medical Care, 44, 134-142.

    Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ:

    Lawrence Erlbaum Associates Publishers.

    Farrell, A. D., Meyer, A. L., & White, K. S. (2001). Evaluation of responding in peaceful and

    positive ways (RIPP): A school-based prevention program for reducing violence among

    urban adolescents. Journal of Clinical Child Psychology, 30, 451-463.


  • Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-

    Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement,

    29(4), 278-295.

    Finney, S. J., & Distefano, C. (2006). Non-normal and categorical data in structural equation

modeling. In G. R. Hancock (Ed.), Structural equation modeling: A second course (pp.

    269-313). Greenwich, CT: Information Age Publishing.

Gallo, J. J., Anthony, J. C., & Muthén, B. O. (1994). Age differences in the symptoms of

depression: A latent trait analysis. Journals of Gerontology, 49, 251-264.

    Glockner-Rist, A., & Hoijtink, H. (2003). The best of both worlds: Factor analysis of

    dichotomous data using item response theory and structural equation modeling.

Structural Equation Modeling: A Multidisciplinary Journal, 10, 544-565.

Grayson, D. A., Mackinnon, A., Jorm, A. F., Creasey, H., & Broe, G. A. (2000). Item bias in the

Center for Epidemiologic Studies Depression Scale: Effects of physical disorders and

disability in an elderly community sample. Journals of Gerontology, Series B:

Psychological Sciences and Social Sciences, 55, 273-282.

Hambleton, R. K. (2006). Good practices for identifying differential item functioning:

Commentary. Medical Care, 44(11), S182-S188.

    Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance

    in aging research. Experimental Aging Research, 18, 117-144.

Horne, A. M. (2004). The Multisite Violence Prevention Project: Background and overview.

    American Journal of Preventive Medicine, 26(Suppl1), 3-11.

    Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to

    underparameterized model misspecification. Psychological Methods, 3, 424-453.


  • Jones, R. N. (2003). Racial bias in the assessment of cognitive functioning of older adults. Aging

    & Mental Health, 7, 83-102.

Jones, R. N. (2006). Identification of measurement differences between English and Spanish

    language versions of the Mini-Mental State Examination: Detecting differential item

    functioning using MIMIC modeling. Medical Care, 44, 124-133.

    Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New

    York: The Guilford Press.

    Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Oxford, England:

    Addison-Wesley.

    MacIntosh, R., & Hashim, S. (2003). Variance estimation for converting MIMIC model

    parameters to IRT parameters in DIF analysis. Applied Psychological Measurement, 27,

    372-379.

    Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and

    confirmatory factor analytic methodologies for e