An Introduction to Using Multidimensional Item Response Theory to Assess Latent Factor Structures


Journal of the Society for Social Work and Research, October 2010, Volume 1, Issue 2, 66-82
ISSN 1948-822X   DOI: 10.5243/jsswr.2010.6


    Philip Osteen

    University of Maryland

    This study provides an introduction to the use of multidimensional item response theory (MIRT)

    analysis for assessing latent factor structure, and compares this statistical technique to

    confirmatory factor analysis (CFA) in the evaluation of an original measure developed to assess

students' motivations for entering a social work community of practice. The Participation in a Social Work Community of Practice Scale (PSWCoP) was administered to 506 master's of social

    work students from 11 accredited graduate programs. The psychometric properties and latent

    factor structure of the scale are evaluated using MIRT and CFA techniques. Although designed as

a 3-factor measure, analyses of model fit using both CFA and MIRT do not support this solution.

    Instead, analyses using both methods produce convergent results supporting a 4-factor solution.

    Discussion includes methodological implications for social work research, focusing on the

    extension of MIRT analysis to assessment of measurement invariance in differential item

    functioning, differential test functioning, and differential factor functioning.

Keywords: item response theory, factor analysis, psychometrics

Author note: Philip J. Osteen is an assistant professor in the University of Maryland School of Social Work. All correspondence concerning this article should be directed to [email protected]

    In comparison to classical test theory (CTT), item

response theory (IRT) is considered the standard, if

    not preferred, method for conducting psychometric

    evaluations of new and established measures

    (Embretson & Reise, 2000; Fries, Bruce, & Cella,

    2005; Lord, 1980; Ware, Bjorner, & Kosinski, 2000).

    Dubbed the modern test theory, IRT is used across scientific disciplines, including psychology, education,

    nursing, and public health. Considered a superior

method because of IRT's ability to overcome inherent limitations of CTT, IRT provides researchers with an

    array of statistical tools for assessing measure

characteristics. Unfortunately, there is a notable paucity of published research in social work using IRT.

    A review of measurement-based articles appearing in

    journals specific to the social work field published

    between 2000 and 2006 showed that fewer than 5% of

    studies used IRT analysis to evaluate the psychometric

    properties of new and existing measures (Unick &

    Stone, 2010). Unick and Stone hypothesized several

    reasons for the absence of IRT analyses from social

    work journals, one of which was a lack of familiarity

    with key conceptual and practical components of IRT.

    Regardless of the reasons underlying the absence

    of IRT-based analyses in the social work literature, the

    field of social work will benefit from researchers

    becoming more familiar with IRT methods and

    incorporating these analyses into social work-based


measurement studies. Although IRT was historically regarded as a method for evaluating latent skill and ability traits in education, its application to measures of affective latent traits is becoming more common and

    accepted. As outlined in this article, drawing on the

    strengths of IRT as an alternative to, or ideally in

    conjunction with, CTT analyses supports social work

researchers' development of rigorously substantiated measures. This article provides social work researchers

    with a basic overview of IRT and a demonstration of

    the utility of IRT as compared with CTT-based factor

    analysis by using actual data obtained with the

    implementation of a novel measure of professional

motivations of master's of social work (MSW) students. Published studies comparing IRT and

    confirmatory factor analysis (CFA) have focused

    almost exclusively on assessing measurement

    invariance. This study takes a different approach in

    comparing IRT and CTT by applying these theories to

    the assessment of multidimensional latent factor

    structures.

    IRT

    IRT is based on the premise that only two

elements are responsible for a person's response on any given item: the person's ability, and the characteristics of the item (Bond & Fox, 2001). The most common

    IRT model, called the Rasch or one-parameter logistic

    model, assumes the probability of a given response is a

function of the person's ability and the difficulty of the item (Bond & Fox, 2001). More complex IRT models

    estimate the probability of a given response based on

    additional item characteristics such as discrimination

    and guessing (Bond & Fox, 2001). Derived from its


    early use in educational measurement, the term ability

    may seem mismatched to psychosocial constructs; thus,

    the term latent trait may be more intuitive, and

    references to level of ability are synonymous with level

    of the latent trait. The IRT model produces estimates

    for both of these elements by calculating item-

    difficulty parameters on the basis of the total number of

    persons who correctly answered an item, and person-

    trait parameters on the basis of the total number of

    items successfully answered (Bond & Fox, 2001). The

    assumptions underlying these estimates are (a) that a

    person with more of the trait will always have a greater

    likelihood of success than a person with less of the

    trait, and (b) that any person will have a greater

    likelihood of endorsing items requiring less of the trait

than items requiring more of the trait (Müller, Sokol, &

    Overton, 1999). Samejima (1969) and Andrich (1978)

    extended this model to measures with polytomous

    response formats (i.e., Likert scales) by adding an

estimate to account for the difficulty in crossing the

    threshold from one level of response to the next (e.g.,

    moving from agree to strongly agree).
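To make the polytomous extension concrete, the sketch below computes category probabilities under an Andrich-style rating scale model. It is an illustration written for this discussion rather than output from any package cited here, and the threshold values in the example are hypothetical.

```python
import numpy as np

def rating_scale_probs(theta, b, taus):
    """Andrich-style rating scale model: probability of each response category.

    A minimal sketch: `theta` is the person's latent trait, `b` the item
    difficulty, and `taus` the step thresholds for moving from one response
    category to the next.
    """
    taus = np.asarray(taus, dtype=float)
    # Cumulative sum of (theta - b - tau_k) for steps 1..m; category 0 gets 0.
    exponents = np.concatenate(([0.0], np.cumsum(theta - b - taus)))
    numerators = np.exp(exponents)
    return numerators / numerators.sum()

# Example: a 6-point item with five hypothetical step thresholds.
print(rating_scale_probs(theta=0.5, b=0.0, taus=[-2.0, -1.0, 0.0, 1.0, 2.0]))
```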

    Scale Evaluation Using IRT

    The basic unit of IRT is the item response

    function (IRF) or item characteristic curve. The

relationship between a respondent's performance and the characteristics underlying item performance can be

    described by a monotonically increasing function

    called the item characteristic curve (ICC; Henard,

    2000). The ICC is typically a sigmoid curve estimating

the probability of a given response based on a person's level of latent trait. The shape of the ICC is determined

    by the item characteristics estimated in the model. The

    ICC in a three-parameter IRT model is derived using

the formula P(θ) = c + (1 − c) e^{a(θ − b − f)} / (1 + e^{a(θ − b − f)}), where P(θ), the probability of a response given a person's level of the latent trait, denoted by theta (θ), is a function of guessing (c parameter), item discrimination (a parameter), item difficulty (b parameter), and the category threshold (f) if using a polytomous response format.
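As a minimal sketch of the formula above (not code from the study), the following function evaluates the three-parameter ICC; setting c = 0 and a = 1 gives the one-parameter special case described in the next paragraph.

```python
import math

def icc_probability(theta, b, a=1.0, c=0.0, f=0.0):
    """Item characteristic curve from the three-parameter formula:
    P(theta) = c + (1 - c) * exp(a*(theta - b - f)) / (1 + exp(a*(theta - b - f))).
    """
    z = a * (theta - b - f)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# A respondent with theta = 1.0 facing an item of difficulty b = 0.5:
print(round(icc_probability(theta=1.0, b=0.5), 3))  # about 0.62
```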

    For the one-parameter IRT model, the guessing

    parameter, c, is constrained to zero, assuming little or

no impact of guessing. For example, a person cannot

    guess the correct response to an item using a Likert

    scale because items are not scored as right or wrong.

    The item discrimination parameter, a, is set to 1 under

    the assumption that there is equal discrimination across

    items. In a one-parameter model the probability of a

response is determined only by the person's level of the latent trait and the difficulty of the item. Item difficulty

    is an indication of the level of the underlying trait that

    is needed to endorse or respond in a certain way to the

    item. For items on a rating scale, the IRF is a

    mathematical function describing the relation between

    where an individual falls on the continuum of a given

    construct such as motivation and the probability that he

    or she will give a particular response to a scale item

    designed to measure that construct (Reise, Ainsworth,

    & Haviland, 2005). The basic goal of IRT modeling is

    to create a sample-free measure.

    Multidimensional item response theory, or MIRT,

    is an extension of IRT and is used to explore the

    underlying dimensionality of an IRT model. Advances

    in computer software (e.g., Conquest, MULTILOG, &

    Mplus) allow for testing and evaluation of more

    complex multidimensional item response models and

    enable researchers to statistically compare competing

    dimensional models. ACER Conquest 2.0 (Wu,

    Adams, & Wilson, 2008), the software used in this

    study, produces marginal maximum likelihood

    estimates for the parameters of the models. The fit of

    the models is ascertained by generalizations of the

    Wright and Masters (1982) residual-based methods.

    Alternative dimensional models are evaluated using a

likelihood ratio chi-squared statistic (χ²LR; Barnes, Chard, Wolfe, Stassen, & Williams, 2007).

    Core statistical output of an IRT analysis of a one-

    parameter rating scale model includes estimates of

    person latent trait, item difficulty, model fit, person-

    fit, item-fit, person reliability, item reliability, and step

    calibration. A two-parameter model would include

    estimates for item discrimination, and a three-

    parameter model would include an additional estimate

    for guessing. Person latent trait is an estimate of the

    underlying trait present for each respondent. Persons

    with high person-ability scores possess more of the

    underlying trait than persons with low scores. Item

    difficulty is an estimate of the level of underlying trait

    at which a person has a 50% probability of endorsing

    the item. Items with higher item-difficulty scores

    require a respondent to have more of the underlying

    trait to endorse or correctly respond to the item than

    items with lower item difficulty scores. Consider a

    measure of reading comprehension. An item requiring

    a 12th grade reading level is more difficult than an item

    requiring a 6th grade reading level. The same concept

    applies to a measure of motivation; an item requiring a

    high amount of motivation is more difficult than an item requiring a low amount of motivation. This idea

    translates to the concept of person-ability or latent trait.

    A person who reads at a 12th grade level has more

    ability than a person who reads at a 6th grade level; a

    person who is more motivated has more of the latent

    trait than a person who is less motivated.

    Analysis of item fit. Fit statistics in IRT analysis

    include infit and outfit mean square (MNSQ) statistics.


    Infit and outfit are statistical representations of how

    well the data match the prescriptions of the IRT model

    (Bond & Fox, 2001). Outfit statistics are based on

    conventional sum of squared standardized residuals,

    and infit statistics are based on information-weighted

    sum squared standardized residuals (Bond & Fox,

    2001). Infit and outfit have expected MNSQ values of

    1.00; values greater than or less than 1 indicate the

    degree of variation from the expected score. For

example, an item with an infit MNSQ of 1.33 (1.00 + .33) indicates 33% more variation in responses to that item than was predicted by the model. Mean infit and

    outfit values represent a degree of overall fit of the data

    to the model, but infit and outfit statistics are also

    available for assessing fit at the individual item level

    (item-fit) and the individual person level (person-fit).

    Item-fit refers to how well the IRT model explains the

    responses to a particular item (Embretson & Reise,

    2000). Person-fit refers to the consistency of an

individual's pattern of responses across items (Embretson & Reise, 2000).
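A simplified sketch of how these statistics are formed for dichotomous responses, assuming model-expected probabilities are already available (polytomous items replace p(1 − p) with the category variance), follows.

```python
import numpy as np

def infit_outfit(observed, expected):
    """Outfit and infit MNSQ for one item's dichotomous responses.

    observed: 0/1 responses across persons.
    expected: model-expected probabilities of endorsing the item.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = expected * (1.0 - expected)                    # information per response
    z_squared = (observed - expected) ** 2 / variance          # squared standardized residuals
    outfit = z_squared.mean()                                  # unweighted mean square
    infit = np.sum(variance * z_squared) / np.sum(variance)    # information-weighted mean square
    return infit, outfit

# Example: six persons' responses to one item and the model-expected probabilities.
infit, outfit = infit_outfit([1, 0, 1, 1, 0, 1], [0.8, 0.4, 0.6, 0.9, 0.3, 0.5])
print(round(infit, 2), round(outfit, 2))
```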

    One limitation of IRT is the need for large

    samples. No clear standards exist for minimum sample

    size, although Embretson and Reise (2000) briefly

    noted that a sample of 500 respondents was

    recommended, and cautioned that parameter

    estimations might become unstable with samples of

    less than 350 respondents. Reeve and Fayers (2005)

    suggested that useful information about item

    characteristics could be obtained with samples of as

    few as 250 respondents. One-parameter models may

    yield reliable estimates with as few as 50 to 100

    respondents (Linacre, 1994). As the complexity of the

    IRT model increases and more parameters are

    estimated, sample size should increase accordingly.

Smith, Schumacker, and Bush (1998) provided the following sample-size-dependent cutoffs for

    determining poor fit: misfit is evident when MNSQ

    infit or outfit values are larger than 1.3 for samples less

    than 500, 1.2 for samples between 500 and 1,000, and

    1.1 for samples larger than 1,000 respondents.

    According to Adams and Khoo (1996), items with

    adequate fit will have weighted MNSQs between .75

and 1.33. Bond and Fox (2001) stated that items routinely accepted as having adequate fit will have t-

    values between -2 and +2. According to Wilson (2005),

    when working with large sample sizes, the researcher

    can expect the t-statistic to show significant values for

    several items regardless of fit; therefore, Wilson

    suggested that the researcher consider items

    problematic only if items are identified as misfitting

    based on both the weighted MNSQ and t-statistic.
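As an illustration of applying these cutoffs jointly, the helper below (the function and its name are illustrative, not part of any cited software) flags an item only when both the sample-size-dependent MNSQ ceiling and the |t| > 2 rule are exceeded, following Wilson's suggestion.

```python
def flag_misfit(mnsq, t_value, sample_size):
    """Flag an item as problematic only if it misfits on both criteria:
    the Smith, Schumacker, and Bush (1998) MNSQ ceiling for the sample size
    and the |t| > 2 rule."""
    if sample_size < 500:
        ceiling = 1.3
    elif sample_size <= 1000:
        ceiling = 1.2
    else:
        ceiling = 1.1
    return (mnsq > ceiling) and (abs(t_value) > 2.0)

# Infit values like those of item P_3 in Table 3 (MNSQ = 1.17, t = 2.1), sample of 506:
print(flag_misfit(1.17, 2.1, 506))  # False: the MNSQ ceiling for this sample size is not exceeded
```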

    For rating scale models, category thresholds are

    provided in the IRT analysis. A category threshold is

    the point at which the probability of endorsing one

    category is equal to the probability of endorsing a

    corresponding category one step away. Although

    thresholds are ideally equidistant, that characteristic is

    not necessarily the reality. Guidelines indicate that

thresholds should advance by at least 1.4 logits but no more than

    5 logits (Linacre, 1999). Logits are the scale units for

    the log odds transformation. When thresholds have

    small logits, response categories may be too similar

    and nondiscriminant. Conversely, when the threshold

    logit is large, response categories may be too dissimilar

    and far apart, indicating the need for more response

    options as intermediate points. Infit and outfit statistics

    are also available for step calibrations. Outfit MNSQ

    values greater than 2.0 indicate that a particular

    response category is introducing noise into the measurement process, and should be evaluated as a

    candidate for collapsing with an adjacent category

    (Bond & Fox, 2001; Linacre, 1999).
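A small illustrative check of the threshold-spacing guideline follows; the threshold values in the example are hypothetical.

```python
def check_thresholds(thresholds):
    """Flag adjacent category thresholds that advance by less than 1.4 logits
    (categories may be too similar) or by more than 5 logits (an intermediate
    category may be needed), per the guideline described above."""
    problems = []
    for i in range(1, len(thresholds)):
        gap = thresholds[i] - thresholds[i - 1]
        if gap < 1.4:
            problems.append((i, round(gap, 2), "categories may be too similar"))
        elif gap > 5.0:
            problems.append((i, round(gap, 2), "an intermediate category may be needed"))
    return problems

# Five thresholds for a 6-point rating scale (hypothetical values):
print(check_thresholds([-3.1, -1.9, -0.2, 0.6, 2.4]))
```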

    In conjunction with the standard output of IRT

    analysis, MIRT analysis provides information about

    dimensionality, the underlying latent factor structure.

    Acer Conquest 2.0 (Wu et al., 2008) software provides

    estimations of population parameters for the

    multidimensional model, which include factor means,

    factor variances, and factor covariances/correlations.

    Acer Conquest 2.0 also produces maps of latent

    variable distributions and response model parameter

    estimates.

    Analysis of nested models. Two models are

    considered as being nested if one is a subset of the

    second. Overall model fit of an IRT model is based on

    the deviance statistic, which follows a chi-square

    distribution. The deviance statistic changes as

    parameters are added or deleted from the model, and

    changes in fit between nested models can be

    statistically tested. The chi-square difference statistic

(χ²D) can be used to test the statistical significance of the change in model fit (Kline, 2005). The χ²D is calculated as the difference between the model chi-square (χ²M) values of two nested models using the same data; the df for the χ²D statistic is the difference in dfs for the two nested models. The χ²D statistic tests the null hypothesis of identical fit of the two models to the

    population. Failure to reject the null hypothesis means

    that the two models fit the population equally. When

    two nested models fit the population equally well, the

    more parsimonious model is generally considered the

    more favorable.
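A minimal sketch of the chi-square difference test for nested models, using hypothetical fit values, is shown below.

```python
from scipy.stats import chi2

def chi_square_difference(chisq_full, df_full, chisq_restricted, df_restricted):
    """Chi-square difference test for two nested models fit to the same data.
    Returns the difference statistic, its df, and the p-value for the null
    hypothesis that the two models fit the population identically."""
    delta_chisq = chisq_restricted - chisq_full   # the restricted model fits no better
    delta_df = df_restricted - df_full
    p_value = chi2.sf(delta_chisq, delta_df)
    return delta_chisq, delta_df, p_value

# Hypothetical nested pair: constraining 3 parameters raises chi-square by 9.2.
print(chi_square_difference(chisq_full=64.5, df_full=35,
                            chisq_restricted=73.7, df_restricted=38))
```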

    Scale Evaluation Using CFA

    Factor analysis is a more traditional method for

    analyzing the underlying dimensionality of a set of

    observed variables. Derived from CTT, factor analysis

    includes a variety of statistical procedures for exploring

    the relationships among a set of observed variables

    with the intent of identifying a smaller number of


    factors, the unobserved latent variables, thought to be

    responsible for these relationships among the observed

variables (Tabachnick & Fidell, 2007). CFA is used

    primarily as a means of testing hypotheses about the

    latent structure underlying a set of observed data.

    A common and preferred method for conducting

    CFA is structural equation modeling (SEM). The term

    SEM refers to a family of statistical procedures for

    assessing the degree of fit between observed data and

    an a priori hypothetical model in which the researcher

    specifies the relevant variables, which variables affect

    other variables, and the direction of those effects. The

    two main goals of SEM analysis are to explore patterns

    of correlations among a set of variables, both observed

    and unobserved, and to explain as much variance as

    possible using the model specified by the researcher

    (Klem, 2000; Kline, 2005).

    Analysis of SEM models. Analysis of SEM

    models is based on the fit of the observed variance-

    covariance matrix to the proposed model. Although

    maximum likelihood (ML) estimation is the common

    method for deriving parameter estimates, it is not the

    only estimation method available. ML estimation

    produces parameter estimates that minimize the

    discrepancies between the observed covariances in the

    data and those predicted by the specified SEM model

    (Kline, 2005). Parameters are characteristics of the

    population of interest; without making observations of

    the entire population, parameters cannot be known and

    must be estimated from sample statistics. ML

    estimation assumes interval level data, and alternative

    methods, such as weighted least squares estimation,

    should be used with dichotomous and ordinal level

    data. Guo, Perron, and Gillespie (2009) noted in their

    review of social work SEM publications that ML

    estimation was sometimes used and reported

    inappropriately.
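For readers who want to see the criterion that ML estimation minimizes, a sketch of the ML fit function follows; it illustrates the discrepancy measure itself, not any particular SEM package.

```python
import numpy as np

def ml_discrepancy(sample_cov, model_cov):
    """Maximum likelihood fit function for SEM,
    F_ML = ln|Sigma| - ln|S| + tr(S Sigma^{-1}) - p,
    which ML estimation minimizes over the model-implied covariance matrix Sigma."""
    S = np.asarray(sample_cov, dtype=float)
    sigma = np.asarray(model_cov, dtype=float)
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(sigma)
    _, logdet_s = np.linalg.slogdet(S)
    trace_term = np.trace(S @ np.linalg.inv(sigma))
    return logdet_sigma - logdet_s + trace_term - p

# When the model-implied matrix equals the sample matrix, the discrepancy is zero.
S = np.array([[1.0, 0.4], [0.4, 1.0]])
print(round(ml_discrepancy(S, S), 6))  # 0.0
```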

    Analysis of model fit. Kline (2005) defined

    model fit as how well the model as a whole explained

the data. When a model is overidentified, it is expected

    that model fit will not be perfect; it is therefore

    necessary to determine the actual degree of model fit,

    and whether the model fit is statistically acceptable.

    Ideally, indicators should load only on the specific

    latent variable identified in the measurement model.

    This type of model can be tested by constraining the

    direct effects between indicators and other factors to

zero. According to Kline (2005), "indicators are expected to be correlated with all factors in CFA models, but they should have higher estimated correlations with the factors they are believed to measure" (emphasis in original, p. 177). A measurement model with indicators loading only on a

    single factor is desirable but elusive in practice with

    real data. Statistical comparison of models with cross-

    loadings to models without cross-loadings allows the

    researcher to make stronger assertions about the

    underlying latent variable structure of a measure. As

    Guo et al. (2009) noted, modified models allowing

    cross-loadings between items and factors have been

    frequently published in social work literature without

    fully explaining how they related to models without

    cross-loadings.

    Analysis of nested models. As noted in the

    discussion of MIRT analysis, two models are

    considered to be nested if one is a subset of the second.

    Overall model fit based on the chi-square distribution

    will change as paths are added to or deleted from a

model. Kline's (2005) chi-square difference statistic (χ²D) can be used to test the statistical significance of the

    change in model fit.

    MIRT versus CFA

    MIRT and CFA analyses can be used to assess the

    dimensionality or underlying latent variable structure

    of a measurement. The choice of statistical procedures

    raises questions about differences between analyses,

    whether the results of the two analyses are consistent,

    and what information can be obtained from one

    analysis but not the other. IRT addresses two problems

    inherent in CTT. First, IRT overcomes the problem of

    item-person confounding found in CTT. IRT analysis

    yields estimates of item difficulties and person-abilities

    that are independent of each other, whereas in CTT

    item difficulty is assessed as a function of the abilities

    of the sample, and the abilities of respondents are

    assessed as a function of item difficulty (Bond & Fox,

    2001), a limitation that extends to CFA.

    Second, the use of ordinal level data (i.e., rating

    scales), which are routinely treated in statistical

    analyses as continuous, interval-level data, may violate

    the scale and distributional assumptions of CFA (Wirth

    & Edwards, 2007). Violating these assumptions may

    result in model parameters that are biased and

"impossible to interpret" (Wirth & Edwards, 2007, p. 58; DiStefano, 2002). The logarithmic transformation

    of ordinal level raw data into interval level data in IRT

    analysis overcomes this problem.

    IRT and CTT also differ in the treatment of the

    standard error of measurement. The standard error of

    measurement is an indication of variability in scores

    due to error. Under CTT, the standard error of

    measurement is averaged across persons in the sample

    or population and is specific to that sample or

    population. Under IRT, the standard error of

    measurement is considered to vary across scores in the

    same population and to be population-general

    (Embretson & Reise, 2000). The IRT approach to the

    standard error of measurement offers the following

    benefits: (a) the precision of measurement can be

    evaluated at any level of the latent trait instead of


    averaged over trait levels as in CTT, and (b) the

    contribution of each item to the overall precision of the

    measure can be assessed and used in item selection

    (Hambleton & Swaminathan, 1985).

    MIRT and CFA differ in the estimation of item

    fit. Where item fit is assessed through error variances,

    communalities, and factor loadings in CFA, item fit is

    assessed through unweighted (outfit) and weighted

    (infit) mean square errors in IRT analyses (Bond &

    Fox, 2001). Further, the treatment of the relationship

    between indicator and latent variable, which is

    constrained to a linear relationship in CFA, can be

    nonlinear in IRT (Greguras, 2005). CFA uses one

    number, the factor loading, to represent the relationship

    between the indicator and the latent variable across all

    levels of the latent variable; in IRT, the relationship

    between indicator and latent variable is given across

    the range of possible values for the latent variable

    (Greguras, 2005). Potential implications of these

    differences include inconsistencies in parameter

    estimates, indicator and factor structure, and model fit

    across MIRT and CFA analyses.

    Both IRT and CFA provide statistical indicators

    of psychometric performance not available in the other

    analysis. Using the item information curve (IIC), IRT

    analysis allows the researcher to establish both item

    information functions (IIF) and test information

    functions (TIF). The IIF estimates the precision and

    reliability of individual items independent of other

    items on the measure; the TIF provides the same

    information for the total test or measure, which is a

    useful tool in comparing and equating multiple tests

    (Hambleton et al., 1991; Embretson & Reise, 2000).

    IRT for polytomous response formats also provides

    estimated category thresholds for the probability of

    endorsing a given response category as a function of

    the level of underlying trait. These indices of item and

    test performance and category thresholds are not

    available in CFA in which item and test performance

    are conditional on the other items on the measure.
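A sketch of item and test information for the one-parameter case is shown below; it also shows how a trait-level-specific standard error of measurement follows from the test information function. The difficulty values in the example are illustrative.

```python
import numpy as np

def rasch_information(theta, difficulties):
    """Item and test information for a one-parameter (Rasch) model.
    For each dichotomous item, I_i(theta) = P_i(theta) * (1 - P_i(theta));
    the test information is their sum, and the trait-level-specific standard
    error of measurement is 1 / sqrt(test information). Polytomous and
    multi-parameter models use more general forms."""
    difficulties = np.asarray(difficulties, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
    item_information = p * (1.0 - p)
    test_information = item_information.sum()
    standard_error = 1.0 / np.sqrt(test_information)
    return item_information, test_information, standard_error

# Hypothetical item difficulties spanning roughly -1 to +1 on the theta metric:
_, tif, se = rasch_information(theta=0.0, difficulties=[-1.05, -0.5, 0.0, 0.3, 0.94])
print(round(tif, 2), round(se, 2))
```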

    Conversely, CFA offers a wide range of indices for

    evaluating model fit, whereas IRT is limited to the use

of the χ² deviance statistic. Reise, Widaman, and Pugh (1993) explicitly identified the need for modification

    indices and additional model fit indicators for IRT

    analyses as a limitation.

    Participation in a Social Work Community of

    Practice Scale

    Although the content of the Participation in a

    Social Work Community of Practice Scale (PSWCoP)

    is less important in the current discussion than the

    methodologies used to evaluate the scale, a brief

    overview will provide context for interpreting the

    results of the analyses. The PSWCoP scale is an

assessment of students' motivations for entering a master's of social work (MSW) program as conceptualized in Wenger, McDermott, and Snyder's (2002) three-dimensional model of motivation for

    participation in a community of practice. Wenger et al.

    (2002) asserted that all communities of practice are

comprised of "three fundamental elements" (p. 27): a

    domain of knowledge defining a set of issues; a

    community of people who care about the domain; and,

    the shared practice developed to be effective in that

    domain. Some individuals are motivated to participate

    because they care about the domain and are interested

    in its development. Some individuals are motivated to

    participate because they value being part of a

    community as well as the interaction and sharing with

    others that is part of having a community. Finally,

    some individuals are motivated to participate by a

    desire to learn about the practice as a means of

    improving their own techniques and approaches. The

    PSWCoP was developed as a multidimensional

    measure of the latent constructs domain motivation,

    community motivation, and practice motivation (Table

    1). Data were collected from a convenience sample of

    students enrolled in MSW programs using a cross-

    sectional survey design and compared to the three-

    factor model developed from Wenger et al.

    Method

    Participants

    A convenience sample of 528 current MSW

    students was drawn from 11 social work programs

    accredited by the Council on Social Work Education

    (CSWE). Participants were enrolled during two

    separate recruitment periods. The first round of

    recruitment yielded a nonrandom sample of 268

    students drawn from nine academic institutions. The

    second round of recruitment yielded a nonrandom

    sample of 260 students drawn from eight institutions.

    Six institutions participated in both rounds of data

    collection, three institutions participated in only the

    first round of data collection, and two institutions

    participated in only the second round of data collection.

    The response rate for the study could not be calculated

    because there was no way to determine the total

    number of students who received information about the

    study or had access to the online survey. Twenty-two

    cases (4.1%) were removed because of missing data,

    yielding a final sample of 506 students; listwise

    deletion was used given the extremely small amount of

    missing data.

    Data were collected on multiple student

    characteristics including age, gender, race/ethnicity,

    sexual orientation, religious affiliation, participation in

    religious activities, family socioeconomic status (SES),

    and enrollment status. The mean age of participants


    Table 1

    Original Items on the Participation in a Social Work Community of Practice Scale

Factor (item code): Item

Community (C_1): My main interest for entering the MSW program was to be a part of a community of social workers.
Community (C_2): I wanted to attend a MSW program so that I could be around people with similar values to me.
Community (C_3): I chose a MSW program because I thought social work values were more similar to my values than those of other professions.
Community (C_4)*: There is more diversity of values among students than I expected.
Community (C_5)*: Before entering the program, I was worried about whether or not I would fit in with my peers.
Community (C_6)*: Learning about the social work profession is less important to me than being part of a community of social workers.
Practice (P_1): Without a MSW degree, I am not qualified to be a social worker.
Practice (P_2): A MSW degree is necessary to be a good social worker.
Practice (P_3): Learning new social work skills was not a motivating factor in my decision to enter the MSW program.
Practice (P_4): My main reason for entering the MSW program was to acquire knowledge and/or skills.
Practice (P_5)*: A MSW degree will give me more professional opportunities than other professional degrees.
Practice (P_6)*: Being around students with similar goals is less important to me than developing my skills as a social worker.
Practice (P_7)*: Learning how to be a social worker is more important to me than learning about the social work profession.
Domain (D_1): I find social work appealing because it is different than the type of work I have done in the past.
Domain (D_2): I decided to enroll in a MSW program to see if social work is a good fit for me.
Domain (D_3): I wanted to attend a MSW program so that I could learn about the social work profession.
Domain (D_4): Entering the MSW program allowed me to explore a new area of professional interest.
Domain (D_5): My main reason for entering the MSW program was to decide if social work is the right profession for me.

*Items deleted from the final version of the PSWCoP

    was 30.2 years (SD = 8.7 years). The majority of

    students were female (92%). The majority of the

    participants were Caucasian (82.6%), with 7.3% of

    students self-identifying as African American or Black;

    4.1% as Hispanic; 1.8% as Asian/Pacific Islander; and

    4.1% as a nonspecified race/ethnicity. Students

    identified their enrollment status as either part-time

    (19.5%), first year (32.7%), advanced standing (27%),

    or second year (20.8%).

    Measures

    Analyses were conducted on an original measure

of students' motivations for entering a social work community of practice, defined as pursuing a MSW

    degree. The PSWCoP was developed and evaluated

using steps outlined by Benson and Clark (1982)

    and DeVellis (2003). The pilot measure contained 18 items designed to measure three constructs (domain,

    community, and practice). Items were measured on a 6-

    point rating scale from strongly disagree to strongly

    agree. Items from the pilot measure organized by

    subscale are listed in Table 1. In addition to items on

    the PSWCoP, students were asked to provide

    demographic information.

    Procedures

    Participants completed the PSWCoP survey as part

    of a larger study exploring the relationship between

students' motivations to pursue the MSW degree, their attitudes about diversity and historically marginalized

    groups, and their endorsement of professional social

    work values as identified in the National Association of

Social Workers' (2009) Code of Ethics. This research

    was approved by the University of Denver Institutional

    Review Board prior to recruitment and data collection.

    Recruitment consisted of a two-pronged approach: (a)

    an e-mail providing an overview of the study and a link


    to the online survey was sent to students currently

    enrolled in the MSW program; and (b) an

    announcement providing an overview of the study and

    a link to the online survey posted to student-oriented

    informational Web sites. Interested participants were

    able to access the anonymous, online survey through

    www.surveymonkey.com, which is a frequently used

    online survey provider. Participants were presented

    with a project information sheet and were required to

    indicate their consent to participate by clicking on the

    appropriate response before being allowed to access the

    actual survey.

    Results

    Reliability of scores from the PSWCoP was

    assessed using both CTT and IRT methods. SPSS

    (v.16.0.0, 2007) was used to calculate internal

consistency reliability (Cronbach's α; inter-item correlations). Acer Conquest 2.0 (Wu et al., 2008) was

    used to assess item reliability. The dimensionality and

    factor structure of the PSWCoP were evaluated using

    both a MIRT and a CFA approach. Acer Conquest 2.0

    (Wu et al., 2008) was used to conduct the MIRT

analysis and LISREL 8.8 (Jöreskog & Sörbom, 2007) was

    used to conduct the CFA analysis. Acer Conquest 2.0

    was used to evaluate the PSWCoP with respect to

    estimates of levels of latent trait and item difficulty

    using a one-parameter logistic model. Assessment of

    the measure was based on model fit, person-fit, item

    fit, person reliability, item reliability, step calibration,

    and population parameters for the multidimensional

    model.

    Item Selection

    Items were identified for possible deletion from

each subscale using Cronbach's alpha, IRT MNSQ infit/outfit results, and theory. Poorly performing

    items identified through statistical analyses were

    further assessed using conceptual and theoretical

    frameworks. A combination of results led to the

    removal of three items from the community subscale,

    three items from the practice subscale, but no items

    from the domain subscale (Table 1). Items C_6, P_6,

    and P_7 addressed relationships between types of

    motivations by asking respondents to rate whether one

    type of motivation was more important than another

    type. Quantitative differences between types of

    motivations were not addressed in community of

    practice theory, and therefore these items were deemed

    not applicable in the measurement of each type of

    motivation. Items C_4 and C_5 were deleted from the

    community subscale because these items specifically

    addressed relationships between respondents and peers.

    Community-based motivation arises out of perceived

    value congruence between the individual and the

    practice (i.e., professional social work), and not

    between the individual and other members of the

    community of practice. All analyses indicated

    problems with the practice subscale and ultimately

exploratory factor analysis (EFA) was used with this subscale only. The results of

    the EFA suggested items P_1 and P_2 formed one

    factor, and items P_3 and P_4 constituted a second

    factor. Item P_5 did not load on either factor and was

    deleted.

    The results of the item selection process yielded

    two competing models. The first model consisted of

    three factors in which all items developed for the

    practice subscale were kept together; this model most

    closely reflected the original hypothetical model

    developed based on community of practice theory. The

    second model had four factors with the items from the

    hypothesized practice subscale split into the two factors

    suggested by the EFA. Internal consistency for each of

    the subscales on the final version of the PSWCoP was

assessed using Cronbach's alpha. Cronbach's alpha was 0.64 for scores from the domain subscale, 0.68 for

    scores from the community subscale, and 0.47 for

    scores from the practice subscale (three-factor model).

    Splitting the practice subscale into two factors yielded

a Cronbach's alpha of 0.58 for scores from the skills subscale and 0.68 for scores from the competency

    subscale. Although ultimately indicative of a poor

    measure, low internal consistency did not prohibit the

    application and comparison of factor analysis using

    CFA and MIRT.
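For reference, a minimal sketch of the Cronbach's alpha computation reported above follows; the ratings in the example are hypothetical.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a persons-by-items array of scores:
    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1)
    total_variance = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical 6-point ratings from five respondents on a three-item subscale:
ratings = [[5, 4, 5], [3, 3, 4], [6, 5, 6], [2, 3, 2], [4, 4, 5]]
print(round(cronbach_alpha(ratings), 2))
```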

    Factor Structure

CFA. CFA analyses of the PSWCoP were conducted using LISREL 8.8 (Jöreskog & Sörbom, 2007).

    The data collected using the PSWCoP were considered

    ordinal based on the 6-point rating scale. When data

are considered ordinal, Jöreskog and Sörbom (2007)

    advocated the use of PRELIS to calculate asymptotic

covariances and polychoric correlations of all items

    modeled, and LISREL or SIMPLIS with weighted least

    squares estimation to test the structure of the data.

    Failure to use these guidelines may result in

    underestimated parameters, biased standard errors, and

an inflated chi-square (χ²) model fit statistic (Flora & Curran, 2004). The chi-square difference statistic (χ²D) was used to test the statistical significance of the change in model fit between nested models (Kline, 2005). The χ²D was calculated as the difference between the model chi-square (χ²M) values of nested models using the same data; the df for the χ²D statistic is the difference in dfs for nested models. The χ²D statistic tested the null hypothesis of identical fit of two

    models to the population. In all, three nested models

    were evaluated and compared sequentially: a four-

    factor model with cross-loadings served as the baseline

    model, followed by a four-factor model without cross-

    loadings, and a three-factor model without cross-

    loadings. The four-factor model with cross-loadings


    was chosen as the baseline model because it was

presumed, having the fewest degrees of freedom, to demonstrate the best fit. The primary models of interest

    were then compared against this baseline to estimate

    the change in model fit.

    Sun (2005) recommended considering fit indices

    in four categories: sample-based absolute fit indices,

    sample-based relative fit indices, population-based

    absolute indices, and population-based relative fit

    indices. Sample-based fit indices are indicators of

    observed discrepancies between the reproduced

    covariance matrix and the sample covariance matrix.

    Population-based fit indices are estimations of

    difference between the reproduced covariance matrix

    and the unknown population covariance matrix. At a

    minimum, Kline (2005) recommended interpreting and

    reporting four indices: the model chi-square (sample-

based), the Steiger-Lind root mean square error of

    approximation (RMSEA; population-based), the

    Bentler comparative fit index (CFI; population-based),

    and the standardized root mean square residual

    (SRMR; sample-based). In addition to these fit indices,

    this study examined the Akaike information criteria

(AIC; sample-based) and the goodness-of-fit index

    (GFI; sample-based). According to Jackson, Gillaspy,

    and Purc-Stephenson (2009), a review of CFA journal

    articles published over the past decade identified these

    six fit indices as the most commonly reported.

    The range of values indicating good fit of observed

    data to the measurement model varies depending on the

    specific fit index. The model chi-square statistic tests

    the null hypothesis that the model has perfect fit in the

    population. Degrees-of-freedom for the chi-square

statistic are equal to the number of observations minus the

    number of parameters to be estimated. Given its

    sensitivity to sample size, the chi-square test is often

    statistically significant. Kline (2005) suggested using a

    normed chi-square statistic obtained by dividing chi-

    square by df; ideally, these values should be less than

    three. The SRMR is a measure of the differences

    between observed and predicted correlations; in a

    model with good fit, these residuals will be close to

zero. Hu and Bentler (1999) suggested that a SRMR value close to .08 or lower indicates good fit.


    Figure 1. Standardized solution for four-factor PSWCoP model

    Three Factor Model without Cross-Loadings. The

    three-factor model corresponded to the original model

    of the PSWCoP (Figure 2). Three latent variables were

included in this model: domain, community, and

    practice. Items were constrained to load on the factor

    for which they were designed. The four items

    originally developed for the practice subscale were

    constrained to load on a single latent variable, which

    represented a perfect correlation between the

    previously used latent variables competency and skills.

    Based on the six fit indices previously described, the

overall fit of the model was poor: χ² = 359.90, df = 51, p < 0.001; RMSEA = 0.11 [90% CI: .10, .12]; CFI = 0.80; SRMR = 0.12; AIC = 413.90; GFI = 0.85. When compared with the four-factor model without cross-loadings, this model demonstrated a significant increase in model misfit [(χ²1 − χ²2)(df1 − df2) = 174.38(3),

    p < .001]. All of the fit statistics indicated that the data

    did not fit the model.

    Figure 2. Standardized solution for three-factor PSWCoP model


    Summary of CFA of the PSWCoP. A summary of

    fit indices across nested models is provided in Table 2.

    The model with the best overall fit was the four-factor

    model in which items were allowed to load across all

    factors. The fit of this model was good, but the model

    lacked conceptual support and was not interpretable

    with respect to the underlying latent structure of the

    PSWCoP. Although the four-factor model with

    constrained loadings had a significant increase in

    model misfit over the four-factor model with cross-

    loadings, the four-factor model with constrained

    loadings demonstrated acceptable fit. The results of the

    CFA on the four-factor model without cross-loadings

    supported the hypothesis of a multidimensional

    measure because correlations between latent variables

    were computed and there were no significant

    correlations between any pair of latent variables

(α = .01).

    The four-factor model with constrained loadings

    was compared with a three-factor model based on the

    originally proposed measurement model for the

    PSWCoP. The conceptual difference between the two

    models was the placement of the items developed for

    the practice subscale. Constraining these four items to

    load on a single latent variable resulted in a large

    increase in model misfit. All of the reported fit

    statistics indicated a model with poor fit.

    Table 2

    Comparison of Fit Indices across Nested Models

                          Model 1:          Model 2:                  Model 3:
                          4 Factor Model    Unidimensional            Unidimensional
                                            4 Factor Model*           3 Factor Model**

    χ²(df)                64.48(35)         185.52(48)                359.90(51)
    Normed χ² (χ²/df)     1.84              3.86                      7.05
    p-value (model)       .002


In this analysis, the latent trait was the amount of motivation a given

    student possessed. In general, the range of the latent

    trait of the sample and item difficulties were the same,

    and the distribution of persons and items about the

    mean were relatively symmetrical, indicating a good

    match between the latent trait of students and the

    difficulty of endorsing items. Exact numerical values

    for item difficulty are provided in Table 3 and ranged

    from -1.05 to +0.94. Item difficulty was scaled

    according to the theta metric, and indicated the level of

    the latent trait at which the probability of a given

response to the item was .50. Theta (θ) is the level of the latent trait being measured and scaled with a mean

    of zero and a standard deviation of one. Negative

    values indicated items that were easier to endorse, and

    positive values indicated items that were harder to

    endorse.

    Item fit is an indication of how well an item

    performs according to the underlying IRT model being

    tested, and it is based on the comparison of observed

    responses to expected responses for each item. Adams

    and Khoo (1996) suggested that items with good fit

    have infit scores between 0.75 and 1.33; Bond and Fox

    (2001) suggested that items with good fit have t values

    between -2 and +2. Table 3 provides the fit statistics

    for the items of the PSWCoP survey; according to this

output, only item P_3_R exceeded Bond and Fox's guideline, and no items exceeded Adams and Khoo's guideline.

    Table 3

    Rasch Analysis of Full Survey Item Difficulty and Fit

Item  Label   Model (Est., S.E.)   Infit (MNSQ, ZSTD)   Outfit (MNSQ, ZSTD)

    1 C_1 0.30 .04 1.05 0.9 1.06 1.2

    2 C_2 0.04 .04 0.93 -1.1 0.94 -1.0

    3 C_3 0.05 .03 1.02 0.5 1.06 1.1

    4 P_1 -0.56 .04 1.01 0.1 1.06 0.8

    5 P_2 0.30 .04 0.98 -0.4 1.00 0.1

    6 D_1 0.68 .04 0.94 -1.1 0.93 -1.1

    7 D_2 -0.11 .04 0.91 -1.4 0.89 -1.6

    8 D_3 0.24 .04 1.01 0.1 1.05 0.9

    9 D_4 -0.33 .04 1.07 1.1 1.08 1.1

    10 D_5 0.94 .04 0.97 -0.4 0.95 -0.7

    11 P_3 -0.51 .04 1.17 2.1 1.35 4.0

    12 P_4 -1.05 .06 0.93 -0.7 0.92 -0.9

    IRT analysis produced an item reliability index

    indicating the extent to which item estimates would be

    consistent across different samples of respondents with

    similar abilities. High item reliability indicates that the

ordering of items by difficulty will be somewhat

    consistent across samples. The reliability index of

    items for the PSWCoP pilot survey was 0.99, and

    indicated consistency in ordering of items by difficulty.

    IRT analysis also produced a person-reliability index

    that indicated the extent of consistency in respondent

    ordering based on level of latent trait if given an

    equivalent set of items (Bond & Fox, 2001). The

    reliability index of persons for the PSWCoP was 0.60,

    and indicated low consistency in ordering of persons

by level of latent trait, which was possibly due

    to a constricted range of the latent trait in the sample or

    a constricted range of item difficulty.

    MIRT factor structure. One of the core

    assumptions of IRT is unidimensionality; in other

    words, that person-ability can be attributed to a single,

    latent construct, and that each item contributes to the

    measure of that construct (Bond & Fox, 2001).

    However, whether intended or not, item responses may

    be attributable to more than one latent construct. MIRT

    analyses allow the researcher to assess the

    dimensionality of the measure. Multidimensional

    models can be classified as either within items or between items (Adams, Wilson, & Wang, 1997). Within-items multidimensional models have items that

    can function as indicators of more than one dimension,


    and between-items multidimensional models have

    subsets of items that are mutually exclusive and

    measure only one dimension.

    Competing multidimensional models can be

    evaluated based on changes in model deviance and

    number of parameters estimated. A chi-square statistic

is calculated as the difference in deviance (G²) between

    two nested models with df equal to the difference in

    number of parameters for the nested models. A

    statistically significant result indicates a difference in

    model fit. When a difference in fit is found, the model

    with the smallest deviance is selected; when a

    difference in model fit is not found, the more

    parsimonious model is selected.

    The baseline MIRT model corresponded to the

    four-factor model with no cross-loadings estimated in

    the CFA (Figure 1). This baseline model was a

    between-items multidimensional model with items

    placed in mutually exclusive subsets. The four

    dimensions in the model were community,

    competency, domain, and skills. The baseline model fit

statistic was G² = 17558.64 with 26 parameters. A three

    dimensional, between-items, multidimensional model,

    corresponding to the theoretical model of the PSWCoP

    (Figure 2) was tested against the baseline model. The

three-dimensional model fit statistic was G² = 17728.83

    with 22 parameters. When compared with the four-

    dimensional model, the change in model fit was

    statistically significant and indicated that the fit of the

    three-dimensional model was worse than the fit of the

four-dimensional model (χ²(4) = 170.19, p < .001).

    Table 4

    Comparison of Model Fit Across Nested Models

                                 Four Factor (Between)   Three Factor* (Between)
    Deviance (G²)                17558.64                17728.83
    df                           26                      22
    G²1 − G²2                                            −170.19
    df1 − df2                                            4
    (G²1 − G²2)/(df1 − df2)                              42.55
    p-value                                              < .001

    * Compared to the Four Factor, Between-Items Model
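As a quick check of the likelihood ratio test reported above, the snippet below uses the deviance values and parameter counts from Table 4 (scipy is assumed only for the chi-square tail probability).

```python
from scipy.stats import chi2

# Deviance (G^2) values and parameter counts reported in Table 4.
deviance_four_factor, params_four_factor = 17558.64, 26
deviance_three_factor, params_three_factor = 17728.83, 22

# Likelihood ratio test: the restricted (three-factor) model has the larger deviance.
delta_g2 = deviance_three_factor - deviance_four_factor   # 170.19
delta_df = params_four_factor - params_three_factor       # 4
p_value = chi2.sf(delta_g2, delta_df)

print(round(delta_g2, 2), delta_df, p_value < .001)  # 170.19 4 True
```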

    Based on the change in model fit between nested

    models, the four dimensional, between-items model

    had the better fit. This model resulted in a more

    accurate reproduction of the probability of endorsing a

    specific level or step of an item for a person with a

    particular level of the latent trait (Reckase, 1997).

    Thus, the four-dimensional model yielded the greatest

    reduction in discrepancy between observed and

    expected responses.

    Item difficulty. MIRT analyses yielded an item-

    person map by dimension. The output of the MIRT

    item-person map (Figure 3) provided a visual estimate

    of the latent trait in the sample, item difficulty, and

    each dimension. Items are ranked in the right-hand

    column by difficulty, with items at the top being more

    difficult than items at the bottom. Although the range

    of item difficulty was narrow, items were well

    dispersed around the mean. Each dimension or factor

has its own column with estimates for respondents' abilities. Two inferences were made based on the

    MIRT item-person map. First, although the range of

    item difficulty was narrow, items appeared to be

    dispersed in terms of difficulty with a range of -0.81 to

    +0.84. Furthermore, regarding Dimensions 1, 2, and 3,

    the item difficulties appeared to be well matched to

    levels of the latent trait, though over a limited range of

    the construct as scaled via the theta metric. Second,

    based on the means of the dimensions, Dimension 2

(competency, x̄2 = 0.069) and Dimension 3 (domain, x̄3 = −0.074) did a better job of representing all levels of these types of motivation than the other two dimensions. The small positive mean of Dimension 1 (community, x̄1 = 0.335) indicated that students sampled

    for this study found it somewhat easier to endorse those

    items, whereas the large positive mean of Dimension 4

(skills, x̄4 = 1.42) indicated that students sampled for this

    study found it very easy to endorse those items.


    Figure 3. MIRT Latent Variable Item-Person Map

Item fit. Table 5 summarizes the items' characteristics. In addition to the estimation of item

    difficulties, infit and outfit statistics are reported. Using

Adams and Khoo's (1996) guideline, only item C_2 showed poor fit (MNSQ = 0.68). In contrast, Bond and Fox's (2001) guideline identified several items as having poor fit (based on a 95% CI for MNSQ): C_1,

    C_2, D_1, D_3, and D_4.

    Table 5

    Item Parameter Estimates for 4 Dimensional Model

Item  Label   Model (Est., S.E.)   Infit (MNSQ, ZSTD)   Outfit (MNSQ, ZSTD)

    1 C_1 0.40 0.03 0.77 -3.8 0.77 -4.5

    2 C_2 0.21 0.03 0.68 -5.6 0.67 -6.5

    3 C_3 -0.61* 0.04 1.02 0.4 1.04 0.6

    4 P_1 -0.14 0.03 1.01 0.2 1.00 0.0

    5 P_2 0.14* 0.03 0.96 -0.5 0.93 -1.1

    6 D_1 0.11 0.03 1.21 3.0 1.18 3.0

    7 D_2 0.51 0.03 1.02 0.4 1.04 0.7

    8 D_3 -0.65 0.03 1.29 4.1 1.30 4.4

    9 D_4 -0.81 0.03 1.17 2.5 1.22 3.1

    10 D_5 0.84* 0.06 0.95 -0.7 0.98 -0.2

    11 P_3_R 0.33 0.04 1.00 -0.0 1.02 0.4

    12 P_4 -0.33* 0.04 0.99 -0.2 1.00 -0.0

    *Indicates that a parameter estimate is constrained


    Discussion

    The rigor and sophistication with which social

    workers conduct psychometric assessments can be

    strengthened. Guo et al. (2010) found that social

workers underutilize CFA, and more generally SEM,

    analyses. Further, even when those approaches are used

    appropriately, considerable room remains for

    improvement in reporting (Guo et al., 2010). Similarly,

    Unick and Stone (2010) found the use of IRT analyses

    for psychometric evaluation was noticeably missing

    from the social work literature. Developing familiarity

    and proficiency with strong psychometric methods will

    empower social workers in developing and selecting

    appropriate measures for research, policy, and practice.

    Integration of CFA and MIRT Results

    The primary result from both the CFA and MIRT

    analyses was the establishment of the PSWCoP as a

    multidimensional measure. Both sets of analyses

    identified a four-factor model in which items loaded on

    a single factor as having the best model fit when

    compared with the three-factor model. In addition, both

    analytic strategies identified significant problems with

    the PSWCoP. Low subscale internal consistencies

    might be due to the small number of items for the

    community, skills, and competency subscales, as well

    as the inability to capture the complexity of different

    types of motivation for participating in a social work

    community of practice. CFA identified multiple items

    with high (>.7) error variances, and IRT analyses

    indicated poor fit for several items. Although the

    results of the analyses identified the PSWCoP as

    having limited utility, these poor psychometric

    properties did not prohibit CFA and MIRT analyses.

    The CFA analysis was found to be more

    informative at the subscale level, whereas the MIRT

    analysis was found to be more informative at the item

    level. CFA was more informative regarding subscale

    composition and assessing associations among factors.

    The CFA analysis led to a final form of the PSWCoP

    with four subscales, and beginning evidence supporting

    the factorial validity of the measure. As indicated by

    the nonsignificant correlations among factors, each

    subscale appeared to be tapping separate constructs.

    Although MIRT allows the researcher to model factor

    structure, this approach does not estimate relationships

    between factors.

    MIRT analyses were found to be more informative

    for assessing individual item performance. Item

    difficulty estimates were obtained for the PSWCoP as a

    whole and for each subscale. Items on the PSWCoP

appeared to be well matched to the respondents' levels of the latent trait with regard to the community, domain, and competency factors, but too easy for the

    skills factor. Based on infit and outfit statistics, MIRT

    analyses identified additional items exhibiting poor fit

    as compared with the CFA. Specifically, two items on

    the community subscale had large standardized fit

    scores in the IRT analysis but displayed high factor

    loadings and low error variances in the CFA. The IRT

    analyses also provided estimates of the item

    information function and test information function,

    making it possible to get specific estimates of standard

    errors of measurement instead of relying on an

    averaged standard error of measurement obtained from

    the CFA.
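The link between information and measurement precision noted above can be made concrete with a small numerical sketch. The example below assumes a simple dichotomous Rasch model with hypothetical item difficulties (the PSWCoP items are polytomous, so this is conceptual only): the conditional standard error of measurement at a given trait level is the reciprocal of the square root of the test information at that level.

import math

def rasch_prob(theta, b):
    # Probability of endorsement under a dichotomous Rasch model
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def conditional_sem(theta, difficulties):
    # Test information is the sum of item informations, p*(1-p) for Rasch;
    # the conditional SEM is 1 / sqrt(test information).
    info = sum(rasch_prob(theta, b) * (1.0 - rasch_prob(theta, b))
               for b in difficulties)
    return 1.0 / math.sqrt(info)

# Hypothetical item difficulties (logits); not the PSWCoP estimates.
difficulties = [-0.8, -0.3, 0.0, 0.2, 0.5, 0.9]
for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}: SEM = {conditional_sem(theta, difficulties):.2f}")

Precision is greatest (the SEM is smallest) where the items are concentrated, which is why the targeting summarized by the dimension means reported earlier matters for interpretation.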

    Strengths and Limitations

    Reliance on a convenience sample is a significant

    limitation of this study. The extent to which

participants in this study were representative of the larger population of MSW students could not be determined. Although IRT purports to generate sample-independent item characteristic estimates, the stability of those estimates is enhanced when the sample is heterogeneous with regard to the latent trait. It is possible that students who self-selected to complete the measure were overly similar.

    A study strength is its contribution to the field of

    psychometric assessment. Previous studies comparing

    IRT and CFA have dealt almost exclusively with

    assessing measurement invariance across multiple

samples (e.g., Meade & Lautenschlager, 2004; Raju, Laffitte, & Byrne, 2002; Reise et al., 1993). The current

    study addresses emerging issues in measurement

    theory by applying IRT analyses to multidimensional

    latent variable measures, and comparing MIRT and

    CFA assessments of factor structure in a novel

    measure.

    Implications for Social Work Research

    In addition to the benefits of using IRT/MIRT

    analytic procedures outlined in this paper, the ability of

    these techniques to assess differential item functioning

    (DIF) and differential test functioning (DTF) is a major

advantage over CTT methods. Wilson (1985) described DIF as an indication of whether an item performs the same for members of different groups who have the same level of the latent trait, whereas DTF concerns whether a set of items performs the same across different groups (Badia, Prieto, Roset, Díez-Pérez, & Herdman, 2002). If DIF/DTF exists, "respondents from the subgroups who share the same level of a latent trait do not have the same probability of endorsing a test item" (Embretson & Reise, 2000, p. 252). The ability to

    assess potential bias in items and tests provides a

    powerful method for developing culturally competent

    measures (Teresi, 2006). Valid comparisons between

    groups require measurement invariance, and IRT

    provides an additional tool for examining both items

    and tests.
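In model terms, the definition of DIF quoted above can be illustrated by comparing endorsement probabilities for two groups that share the same trait level but face different effective item difficulties. The sketch below assumes a dichotomous Rasch-type item with invented difficulty values for a reference group and a focal group; it is illustrative only and does not reflect any analysis of the PSWCoP.

import math

def endorse_prob(theta, b):
    # Rasch-type endorsement probability for trait level theta, difficulty b
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta = 0.5        # same latent trait level in both groups
b_reference = 0.0  # hypothetical difficulty for the reference group
b_focal = 0.6      # hypothetical difficulty for the focal group (uniform DIF)

p_ref = endorse_prob(theta, b_reference)
p_foc = endorse_prob(theta, b_focal)
print(f"Reference group: P = {p_ref:.2f}; focal group: P = {p_foc:.2f}")
# Unequal probabilities at the same theta are the signature of DIF;
# DTF aggregates such differences across the full set of items.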


    Additional benefits of IRT analyses include the

    ability to conduct test equating and develop adaptive

    testing. The core question of test equating is the extent

    to which scores from two measures presumed to

    measure the same construct are comparable. For

example, can scores on the Beck Depression Inventory and the Center for Epidemiological Studies Depression Scale (CES-D) be treated as equivalent? Adaptive testing allows the researcher to match specific items to different levels of ability to more finely discern a person's ability; persons estimated to have high ability may receive a different set of items than persons estimated to have low ability. With the increasing availability of

    statistical software for conducting MIRT analyses, the

    potential also exists for developing models with greater

    complexity for testing differential factor functioning

    (DFF). Akin to testing measurement invariance using

    CFA techniques, DFF analyses will provide researchers

    with an assessment of potential bias in performance of

    factors (e.g., subscales) across groups.
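The adaptive testing logic described above, administering the item that is most informative at the respondent's current ability estimate, can be sketched briefly. The item bank and two-parameter logistic (2PL) parameters below are hypothetical and serve only to show the selection rule.

import math

def p_2pl(theta, a, b):
    # 2PL probability of endorsement for ability theta
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    # Fisher information of a 2PL item at ability theta
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical item bank: (label, discrimination a, difficulty b)
bank = [("easy", 1.2, -1.5), ("medium", 1.0, 0.0), ("hard", 1.4, 1.5)]

def next_item(theta_hat, available):
    # Choose the available item with maximum information at theta_hat
    return max(available,
               key=lambda item: item_information(theta_hat, item[1], item[2]))

print(next_item(-1.0, bank)[0])  # low provisional estimate -> "easy"
print(next_item(1.2, bank)[0])   # high provisional estimate -> "hard"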

    A final consideration is choosing between the

    different psychometric strategies outlined in this paper.

    Ideally, both methods should be integrated. Doing so

    gives the researcher access to unique information

    available only from each analytic method, allows the

researcher to compare common elements of both analyses, and minimizes the impact of each method's limitations. If applying both methods is not possible,

    theoretical and practical considerations can inform the

    decision. IRT is a stronger choice when data are

    dichotomous or ordinal because raw scores are

    transformed to an interval scale. If the relationship

between items and factors is nonlinear or unknown,

    IRT will yield less biased estimates than CFA. If the

    construct to be measured is presumed to be

    unidimensional, IRT is a better strategy because of the

    additional information provided in the item analysis.

    Both MIRT and CFA are informative in assessing

    latent factor structures, but only CFA allows the

    researcher to estimate relationships between factors.

    Both strategies perform better with large sample sizes,

    but IRT is affected more negatively by smaller samples

    given the larger number of parameters being estimated.

If possible, IRT/MIRT analysis should be reserved for samples of 200 or more respondents. Conversely, IRT

    analyses yield stable results with very few items,

    whereas CTT reliability varies in part as a function of

    the number of items.

    References

    Adams, R. J., & Khoo, S. T. (1996). ACER Quest

    [Computer software]. Melbourne, Australia: ACER.

    Adams, R. J., Wilson, M., & Wang, W. (1997). The

    multidimensional random coefficients multinomial

    logit model. Applied Psychological Measurement,

    21, 1-24. doi:10.1177/0146621697211001

    Andrich, D. (1988). Rasch models for measurement.

    Newbury Park, CA: Sage.

Badia, X., Prieto, L., Roset, M., Díez-Pérez, A., & Herdman, M. (2002). Development of a short

    osteoporosis quality of life questionnaire by

    equating items from two existing instruments.

    Journal of Clinical Epidemiology, 55, 32-40.

    doi:10.1016/S0895-4356(01)00432-2

    Benson, J., & Clark, F. (1982). A guide for instrument

    development and validation. American Journal of

    Occupational Therapy, 36, 789-800.

    Bond, T. G., & Fox, C. M. (2001). Applying the Rasch

    model (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.

    DeVellis, R. F. (2003). Scale development: Theory and

    applications. Thousand Oaks, CA: Sage.

    DiStefano, C. (2002). The impact of categorization

    with confirmatory factor analysis. Structural

    Equation Modeling, 9, 327-346.

    doi:10.1207/S15328007SEM0903_2

    Embretson, S. E., & Reise, S. P. (2000). Item response

    theory for psychologists. Mahwah, NJ: Lawrence

    Erlbaum.

    Flora, D. B., & Curran, P. J. (2004). An empirical

    evaluation of alternative methods of estimation for

    confirmatory factor analysis with ordinal data.

    Psychological Methods, 9, 466-491.

    doi:10.1037/1082-989X.9.4.466

    Fries, J., Bruce, B., & Cella, D. (2005). The promise of

    PROMIS: Using item response theory to improve

    assessment of patient-reported outcomes. Clinical

    and Experimental Rheumatology, 23(5, Suppl. 39),

S53-S57.

    Greguras, G. J. (2005). Managerial experience and the

    measurement equivalence of performance ratings.

Journal of Business and Psychology, 19(3), 383-397. doi:10.1007/s10869-004-2234-y

    Guo, B., Perron, B. E., & Gillespie, D. F. (2009). A

    systematic review of structural equation modeling

    in social work research. British Journal of Social

    Work, 39, 1556-1574. doi:10.1093/bjsw/bcn101

    Hambleton, R. K., & Swaminathan, H. (1985). Item

    response theory: Principles and applications.

    Boston, MA: Kluwer/Nijhoff.


    Hambleton, R. K., Swaminathan, H., & Rogers, H. J.

    (1991). Fundamentals of item response theory.

    Newbury Park, CA: Sage.

    Henard, D. H. (2000). Item response theory. In L. G.

Grimm & P. R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 67-98).

    Washington, DC: American Psychological

    Association.

    Jackson, D. L., Gillaspy, J. A, & Purc-Stephenson, R.

    (2009). Reporting practices in confirmatory factor

    analysis: An overview and some recommendations.

    Psychological Methods, 14(1), 6-23.

    doi:10.1037/a0014694

Jöreskog, K. G., & Sörbom, D. (2007). LISREL 8.80 for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.

    Klem, L. (2000). Structural equation modeling. In L.

    G. Grimm and P. R. Yarnold (Eds.), Reading and

    understanding more multivariate statistics, (pp.

    227-260). Washington, DC: American

    Psychological Association.

Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New York,

    NY: Guilford Press.

    Linacre, J. K. (1994). Sample size and item calibration

    stability. Rasch Measurement Transactions, 7(4),

    328.

    Linacre, J. K. (1999). Investigating rating scale

    category utility. Journal of Outcome Measurement,

    3(2), 103-122.

    Linacre, J. K. (2006). Winsteps Rasch measurement

    3.68.0 [Software]. Chicago, IL: Author.

    Lord, F. M. (1980). Applications of item response

    theory to practical testing problems. Hillsdale, NJ:

    Lawrence Erlbaum.

    Meade, A.W., & Lautenschlager, G. J. (2004). A

    comparison of item response theory and

    confirmatory factor analytic methodologies for

    establishing measurement equivalence/invariance.

Organizational Research Methods, 7(4), 361-388.

    doi:10.1177/1094428104268027

Müller, U., Sokol, B., & Overton, W. F. (1999). Developmental sequences in class reasoning and

    propositional reasoning. Journal of Experimental

    Child Psychology, 74, 69-106.

    doi:10.1006/jecp.1999.2510

Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002).

    Measurement equivalence: A comparison of

    methods based on confirmatory factor analysis and

    item response theory. Journal of Applied

    Psychology, 87, 517-529. doi:10.1037/0021-

    9010.87.3.517

    Rasch, G. (1980). Probabilistic models for some

    intelligence and attainment tests. Chicago, IL:

    MESA Press.

    Reckase, M. D. (1997). The past and future of

    multidimensional item response theory. Applied

    Psychological Measurement, 21, 25-27.

    doi:10.1177/0146621697211002

    Reeve, B. B., & Fayers, P. (2005). Applying item

    response theory modeling for evaluating

    questionnaire items and scale properties. In P. M.

    Fayers & R. D. Hays (Eds.), Assessing quality of

    life in clinical trials: Methods and practice (2nd

    ed., pp. 55-73). New York, NY: Oxford University

    Press.

    Reise, S. P., Ainsworth, A. T., & Haviland, M. G.

    (2005). Item response theory: Fundamentals,

    applications, and promises in psychological

research. Current Directions in Psychological Science, 14(2), 95-101.

    Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993).

    Confirmatory factor analysis and item response

    theory: Two approaches for exploring measurement

    invariance. Psychological Bulletin, 114, 552-566.

    doi:10.1037/0033-2909.114.3.552

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometric Monograph No. 17). Richmond, VA: Psychometric Society. Retrieved from http://www.psychometrika.org/journal/online/MN17.pdf

Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998).

    Using item mean squares to evaluate fit to the

    Rasch model. Journal of Outcome Measurement,

    2(1), 66-78. PMid:9661732

    SPSS. (2007). SPSS for Windows, Rel. 16.0.0

    [Software]. Chicago: SPSS, Inc.

    Sun, J. (2005). Assessing goodness of fit in

    confirmatory factor analysis. Measurement and

    Evaluation in Counseling and Development, 37(4),

    240-256.

    Swaminathan, H., & Gifford, J. A. (1979). Estimation

    of parameters in the three-parameter latent-trait

    model. Laboratory of Psychometric and Evaluation

    Research (Report No. 90). Amherst: University of

    Massachusetts.

Tabachnick, B. G., & Fidell, L. S. (2001). Using

    multivariate statistics (4th ed.). Boston, MA: Allyn

    and Bacon.

    Teresi, J. A. (2006). Overview of quantitative

    measurement methods: Equivalence, invariance and

    differential item functioning in health applications.

Medical Care, 44, S39-S49.


    Unick, G. J., & Stone, S. (2010). State of modern

    measurement approaches in social work research

    literature. Social Work Research, 34(2), 94-101.

    Ware, J. E., Jr., Bjorner, J. B., & Kosinski, M. (2000).

    Practical implications of item response theory and

    computerized adaptive testing: A brief summary of

    ongoing studies of widely used headache impact

scales. Medical Care, 38, 1173-1182. doi:10.1097/00005650-200009002-00011

    Wenger, E., McDermott, R., & Snyder, W. M. (2002).

    Cultivating communities of practice. Boston, MA:

    Harvard Business School Press.

    Wirth, R. J., & Edwards, M. C. (2007). Item factor

    analysis: Current approaches and future directions.

    Psychological Methods, 12(1), 58-79.

    doi:10.1037/1082-989X.12.1.58

    Wright, B. D., & Masters, G. N. (1982). Rating scale

    analysis: Rasch measurement. Chicago, IL: MESA

    Press.

Wu, M. L., Adams, R. J., Wilson, M., & Haldane, S. (2008). ACER ConQuest 2.0: Generalized item response modeling software [Computer software]. Hawthorn, Australia: ACER.