An Introduction to Using Multidimensional Item Response Theory to Assess Latent Factor Structures
Journal of the Society for Social Work and Research, October 2010
Volume 1, Issue 2, 66-82. ISSN 1948-822X. DOI: 10.5243/jsswr.2010.6
An Introduction to Using Multidimensional Item Response Theory
to Assess Latent Factor Structures
Philip Osteen
University of Maryland
This study provides an introduction to the use of multidimensional item response theory (MIRT)
analysis for assessing latent factor structure, and compares this statistical technique to
confirmatory factor analysis (CFA) in the evaluation of an original measure developed to assess
students' motivations for entering a social work community of practice. The Participation in a Social Work Community of Practice Scale (PSWCoP) was administered to 506 master of social
work students from 11 accredited graduate programs. The psychometric properties and latent
factor structure of the scale are evaluated using MIRT and CFA techniques. Although designed as
a 3-factor measure, analysis of model fit using both CFA and MIRT does not support this solution.
Instead, analyses using both methods produce convergent results supporting a 4-factor solution.
Discussion includes methodological implications for social work research, focusing on the
extension of MIRT analysis to assessment of measurement invariance in differential item
functioning, differential test functioning, and differential factor functioning.
Keywords: item response theory, factor analysis, psychometrics
In comparison to classical test theory (CTT), item
response theory (IRT) is considered the standard, if
not preferred, method for conducting psychometric
evaluations of new and established measures
(Embretson & Reise, 2000; Fries, Bruce, & Cella,
2005; Lord, 1980; Ware, Bjorner, & Kosinski, 2000).
Dubbed the "modern test theory," IRT is used across scientific disciplines, including psychology, education,
nursing, and public health. Considered a superior
method because of IRT's ability to overcome inherent limitations of CTT, IRT provides researchers with an
array of statistical tools for assessing measure
characteristics. Unfortunately, there is a notable
paucity of published research in social work using IRT.
A review of measurement-based articles appearing in
journals specific to the social work field published
between 2000 and 2006 showed that fewer than 5% of
studies used IRT analysis to evaluate the psychometric
properties of new and existing measures (Unick &
Stone, 2010). Unick and Stone hypothesized several
reasons for the absence of IRT analyses from social
work journals, one of which was a lack of familiarity
with key conceptual and practical components of IRT.
Regardless of the reasons underlying the absence
of IRT-based analyses in the social work literature, the
field of social work will benefit from researchers
becoming more familiar with IRT methods and
incorporating these analyses into social work-based
measurement studies.

Author note. Philip J. Osteen is an assistant professor in the
University of Maryland School of Social Work. All correspondence
concerning this article should be directed to [email protected]

Historically regarded as a
method for evaluating latent skill and ability traits in
education, the application of IRT to measures of
affective latent traits is becoming more common and
accepted. As outlined in this article, drawing on the
strengths of IRT as an alternative to, or ideally in
conjunction with, CTT analyses supports social work
researchers development of rigorously substantiated measures. This article provides social work researchers
with a basic overview of IRT and a demonstration of
the utility of IRT as compared with CTT-based factor
analysis by using actual data obtained with the
implementation of a novel measure of professional
motivations of master of social work (MSW) students. Published studies comparing IRT and
confirmatory factor analysis (CFA) have focused
almost exclusively on assessing measurement
invariance. This study takes a different approach in
comparing IRT and CTT by applying these theories to
the assessment of multidimensional latent factor
structures.
IRT
IRT is based on the premise that only two
elements are responsible for a person's response on any given item: the person's ability and the characteristics of the item (Bond & Fox, 2001). The most common
IRT model, called the Rasch or one-parameter logistic
model, assumes the probability of a given response is a
function of the person's ability and the difficulty of the item (Bond & Fox, 2001). More complex IRT models
estimate the probability of a given response based on
additional item characteristics such as discrimination
and guessing (Bond & Fox, 2001). Derived from its
early use in educational measurement, the term ability
may seem mismatched to psychosocial constructs; thus,
the term latent trait may be more intuitive, and
references to level of ability are synonymous with level
of the latent trait. The IRT model produces estimates
for both of these elements by calculating item-
difficulty parameters on the basis of the total number of
persons who correctly answered an item, and person-
trait parameters on the basis of the total number of
items successfully answered (Bond & Fox, 2001). The
assumptions underlying these estimates are (a) that a
person with more of the trait will always have a greater
likelihood of success than a person with less of the
trait, and (b) that any person will have a greater
likelihood of endorsing items requiring less of the trait
than items requiring more of the trait (Müller, Sokol, &
Overton, 1999). Samejima (1969) and Andrich (1978)
extended this model to measures with polytomous
response formats (i.e., Likert scales) by adding an
estimate to account for the difficulty of crossing the
threshold from one level of response to the next (e.g.,
moving from "agree" to "strongly agree").
Scale Evaluation Using IRT
The basic unit of IRT is the item response
function (IRF) or item characteristic curve. The
relationship between a respondent's performance and the characteristics underlying item performance can be
described by a monotonically increasing function
called the item characteristic curve (ICC; Henard,
2000). The ICC is typically a sigmoid curve estimating
the probability of a given response based on a person's level of latent trait. The shape of the ICC is determined
by the item characteristics estimated in the model. The
ICC in a three-parameter IRT model is derived using
the formula

P(θ) = c + (1 − c) * e^[a(θ − b − f)] / (1 + e^[a(θ − b − f)]),

where P(θ), the probability of a response given a person's level of the latent trait, denoted by theta (θ), is a function of guessing (the c parameter), item discrimination (the a parameter), item difficulty (the b parameter), and the category threshold (f) when using a
polytomous response format.
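To make the formula concrete, the following Python sketch (mine, not code from the article; it assumes NumPy) evaluates the three-parameter ICC. Setting c = 0 and a = 1 recovers the one-parameter model described next; all example values are purely illustrative.

```python
import numpy as np

def icc_3pl(theta, a=1.0, b=0.0, c=0.0, f=0.0):
    """Probability of a given response at latent trait level theta.

    a = discrimination, b = difficulty, c = guessing, f = category threshold
    (f = 0 for a dichotomous item), mirroring the parameters in the text.
    """
    logit = a * (theta - b - f)
    return c + (1.0 - c) * np.exp(logit) / (1.0 + np.exp(logit))

# Illustrative values only: a moderately discriminating item (a = 1.2),
# difficulty b = 0.5, and a 20% guessing floor.
theta = np.linspace(-3, 3, 7)
print(icc_3pl(theta, a=1.2, b=0.5, c=0.2))
```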
For the one-parameter IRT model, the guessing
parameter, c, is constrained to zero, assuming little or
no impact of guessing. For example, a person cannot
guess the correct response to an item using a Likert
scale because items are not scored as right or wrong.
The item discrimination parameter, a, is set to 1 under
the assumption that there is equal discrimination across
items. In a one-parameter model the probability of a
response is determined only by the person's level of the latent trait and the difficulty of the item. Item difficulty
is an indication of the level of the underlying trait that
is needed to endorse or respond in a certain way to the
item. For items on a rating scale, the IRF is a
mathematical function describing the relation between
where an individual falls on the continuum of a given
construct such as motivation and the probability that he
or she will give a particular response to a scale item
designed to measure that construct (Reise, Ainsworth,
& Haviland, 2005). The basic goal of IRT modeling is
to create a sample-free measure.
Multidimensional item response theory, or MIRT,
is an extension of IRT and is used to explore the
underlying dimensionality of an IRT model. Advances
in computer software (e.g., Conquest, MULTILOG, &
Mplus) allow for testing and evaluation of more
complex multidimensional item response models and
enable researchers to statistically compare competing
dimensional models. ACER Conquest 2.0 (Wu,
Adams, & Wilson, 2008), the software used in this
study, produces marginal maximum likelihood
estimates for the parameters of the models. The fit of
the models is ascertained by generalizations of the
Wright and Masters (1982) residual-based methods.
Alternative dimensional models are evaluated using a
likelihood ratio chi-squared statistic (χ²LR; Barnes, Chard, Wolfe, Stassen, & Williams, 2007).
Core statistical output of an IRT analysis of a one-
parameter rating scale model includes estimates of
person latent trait, item difficulty, model fit, person-
fit, item-fit, person reliability, item reliability, and step
calibration. A two-parameter model would include
estimates for item discrimination, and a three-
parameter model would include an additional estimate
for guessing. Person latent trait is an estimate of the
underlying trait present for each respondent. Persons
with high person-ability scores possess more of the
underlying trait than persons with low scores. Item
difficulty is an estimate of the level of underlying trait
at which a person has a 50% probability of endorsing
the item. Items with higher item-difficulty scores
require a respondent to have more of the underlying
trait to endorse or correctly respond to the item than
items with lower item difficulty scores. Consider a
measure of reading comprehension. An item requiring
a 12th grade reading level is more difficult than an item
requiring a 6th grade reading level. The same concept
applies to a measure of motivation; an item requiring a
high amount of motivation is more difficult than an item requiring a low amount of motivation. This idea
translates to the concept of person-ability or latent trait.
A person who reads at a 12th grade level has more
ability than a person who reads at a 6th grade level; a
person who is more motivated has more of the latent
trait than a person who is less motivated.
Analysis of item fit. Fit statistics in IRT analysis
include infit and outfit mean square (MNSQ) statistics.
Infit and outfit are statistical representations of how
well the data match the prescriptions of the IRT model
(Bond & Fox, 2001). Outfit statistics are based on
conventional sum of squared standardized residuals,
and infit statistics are based on information-weighted
sum squared standardized residuals (Bond & Fox,
2001). Infit and outfit have expected MNSQ values of
1.00; values greater than or less than 1 indicate the
degree of variation from the expected score. For
example, an item with an infit MNSQ of 1.33 (1.00 + .33) indicates 33% more variation in responses to that
item than predicted by the model. Mean infit and
outfit values represent a degree of overall fit of the data
to the model, but infit and outfit statistics are also
available for assessing fit at the individual item level
(item-fit) and the individual person level (person-fit).
Item-fit refers to how well the IRT model explains the
responses to a particular item (Embretson & Reise,
2000). Person-fit refers to the consistency of an
individuals pattern of responses across items (Embretson & Reise, 2000).
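As a rough illustration of how these fit statistics are assembled, here is a hedged sketch of item infit and outfit mean squares for a dichotomous Rasch model. This is my own rendering of the general formulas, not the ConQuest implementation, and all variable names are illustrative.

```python
import numpy as np

def item_fit(theta, b, X):
    """Infit/outfit MNSQ per item for a dichotomous Rasch model.

    theta: (N,) person estimates; b: (I,) item difficulties; X: (N, I) 0/1 responses.
    """
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # expected probabilities
    w = p * (1.0 - p)                                          # model variance of each response
    z2 = (X - p) ** 2 / w                                      # squared standardized residuals
    outfit = z2.mean(axis=0)                                   # unweighted mean square
    infit = (w * z2).sum(axis=0) / w.sum(axis=0)               # information-weighted mean square
    return infit, outfit
```

Values near 1.00 indicate that responses to an item vary about as much as the model predicts.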
One limitation of IRT is the need for large
samples. No clear standards exist for minimum sample
size, although Embretson and Reise (2000) briefly
noted that a sample of 500 respondents was
recommended, and cautioned that parameter
estimations might become unstable with samples of
less than 350 respondents. Reeve and Fayers (2005)
suggested that useful information about item
characteristics could be obtained with samples of as
few as 250 respondents. One-parameter models may
yield reliable estimates with as few as 50 to 100
respondents (Linacre, 1994). As the complexity of the
IRT model increases and more parameters are
estimated, sample size should increase accordingly.
Smith, Schumacker, and Bush (1998) provided the
following sample size dependent cutoffs for
determining poor fit: misfit is evident when MNSQ
infit or outfit values are larger than 1.3 for samples less
than 500, 1.2 for samples between 500 and 1,000, and
1.1 for samples larger than 1,000 respondents.
According to Adams and Khoo (1996), items with
adequate fit will have weighted MNSQs between .75
and 1.33. Bond and Fox (2001) stated that items
routinely accepted as having adequate fit will have t
values between -2 and +2. According to Wilson (2005),
when working with large sample sizes, the researcher
can expect the t-statistic to show significant values for
several items regardless of fit; therefore, Wilson
suggested that the researcher consider items
problematic only if items are identified as misfitting
based on both the weighted MNSQ and t-statistic.
For rating scale models, category thresholds are
provided in the IRT analysis. A category threshold is
the point at which the probability of endorsing one
category is equal to the probability of endorsing a
corresponding category one step away. Although
thresholds are ideally equidistant, that characteristic is
not necessarily the reality. Guidelines indicate that
adjacent thresholds should be separated by at least 1.4 logits but no more than
5 logits (Linacre, 1999). Logits are the scale units for
the log odds transformation. When thresholds have
small logits, response categories may be too similar
and nondiscriminant. Conversely, when the threshold
logit is large, response categories may be too dissimilar
and far apart, indicating the need for more response
options as intermediate points. Infit and outfit statistics
are also available for step calibrations. Outfit MNSQ
values greater than 2.0 indicate that a particular
response category is introducing noise into the measurement process, and should be evaluated as a
candidate for collapsing with an adjacent category
(Bond & Fox, 2001; Linacre, 1999).
In conjunction with the standard output of IRT
analysis, MIRT analysis provides information about
dimensionality, the underlying latent factor structure.
Acer Conquest 2.0 (Wu et al., 2008) software provides
estimations of population parameters for the
multidimensional model, which include factor means,
factor variances, and factor covariances/correlations.
Acer Conquest 2.0 also produces maps of latent
variable distributions and response model parameter
estimates.
Analysis of nested models. Two models are
considered as being nested if one is a subset of the
second. Overall model fit of an IRT model is based on
the deviance statistic, which follows a chi-square
distribution. The deviance statistic changes as
parameters are added or deleted from the model, and
changes in fit between nested models can be
statistically tested. The chi-square difference statistic (χ²D) can be used to test the statistical significance of the change in model fit (Kline, 2005). The χ²D is calculated as the difference between the model chi-square (χ²M) values of two nested models using the same data; the df for the χ²D statistic is the difference in dfs for the two nested models. The χ²D statistic tests the null hypothesis of identical fit of the two models to the
population. Failure to reject the null hypothesis means
that the two models fit the population equally. When
two nested models fit the population equally well, the
more parsimonious model is generally considered the
more favorable.
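A minimal sketch of this nested-model comparison, assuming SciPy is available; the numbers in the example are hypothetical, not values from the study.

```python
from scipy.stats import chi2

def chi_square_difference(chi_restricted, df_restricted, chi_full, df_full):
    """Chi-square difference test for two nested models fit to the same data."""
    diff = chi_restricted - chi_full      # the more constrained model has the larger chi-square
    df_diff = df_restricted - df_full     # and the larger degrees of freedom
    return diff, df_diff, chi2.sf(diff, df_diff)

# Hypothetical example: a nonsignificant p-value (> .05) would mean the two
# models fit equally well, so the more parsimonious model would be preferred.
print(chi_square_difference(118.1, 50, 112.1, 47))
```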
Scale Evaluation Using CFA
Factor analysis is a more traditional method for
analyzing the underlying dimensionality of a set of
observed variables. Derived from CTT, factor analysis
includes a variety of statistical procedures for exploring
the relationships among a set of observed variables
with the intent of identifying a smaller number of
factors, the unobserved latent variables, thought to be
responsible for these relationships among the observed
variables (Tabachnik & Fidell, 2007). CFA is used
primarily as a means of testing hypotheses about the
latent structure underlying a set of observed data.
A common and preferred method for conducting
CFA is structural equation modeling (SEM). The term
SEM refers to a family of statistical procedures for
assessing the degree of fit between observed data and
an a priori hypothetical model in which the researcher
specifies the relevant variables, which variables affect
other variables, and the direction of those effects. The
two main goals of SEM analysis are to explore patterns
of correlations among a set of variables, both observed
and unobserved, and to explain as much variance as
possible using the model specified by the researcher
(Klem, 2000; Kline, 2005).
Analysis of SEM models. Analysis of SEM
models is based on the fit of the observed variance-
covariance matrix to the proposed model. Although
maximum likelihood (ML) estimation is the common
method for deriving parameter estimates, it is not the
only estimation method available. ML estimation
produces parameter estimates that minimize the
discrepancies between the observed covariances in the
data and those predicted by the specified SEM model
(Kline, 2005). Parameters are characteristics of the
population of interest; without making observations of
the entire population, parameters cannot be known and
must be estimated from sample statistics. ML
estimation assumes interval level data, and alternative
methods, such as weighted least squares estimation,
should be used with dichotomous and ordinal level
data. Guo, Perron, and Gillespie (2009) noted in their
review of social work SEM publications that ML
estimation was sometimes used and reported
inappropriately.
Analysis of model fit. Kline (2005) defined
model fit as how well the model as a whole explained
the data. When a model is overidentified, it is expected
that model fit will not be perfect; it is therefore
necessary to determine the actual degree of model fit,
and whether the model fit is statistically acceptable.
Ideally, indicators should load only on the specific
latent variable identified in the measurement model.
This type of model can be tested by constraining the
direct effects between indicators and other factors to
zero. According to Kline (2005), indicators "are expected to be correlated with all factors in CFA models, but they should have higher estimated correlations with the factors they are believed to measure" (emphasis in original, p. 177). A measurement model with indicators loading only on a
single factor is desirable but elusive in practice with
real data. Statistical comparison of models with cross-
loadings to models without cross-loadings allows the
researcher to make stronger assertions about the
underlying latent variable structure of a measure. As
Guo et al. (2009) noted, modified models allowing
cross-loadings between items and factors have been
frequently published in social work literature without
fully explaining how they related to models without
cross-loadings.
Analysis of nested models. As noted in the
discussion of MIRT analysis, two models are
considered to be nested if one is a subset of the second.
Overall model fit based on the chi-square distribution
will change as paths are added to or deleted from a
model. Kline's (2005) chi-square difference statistic (χ²D) can be used to test the statistical significance of the
change in model fit.
MIRT versus CFA
MIRT and CFA analyses can be used to assess the
dimensionality or underlying latent variable structure
of a measurement. The choice of statistical procedures
raises questions about differences between analyses,
whether the results of the two analyses are consistent,
and what information can be obtained from one
analysis but not the other. IRT addresses two problems
inherent in CTT. First, IRT overcomes the problem of
item-person confounding found in CTT. IRT analysis
yields estimates of item difficulties and person-abilities
that are independent of each other, whereas in CTT
item difficulty is assessed as a function of the abilities
of the sample, and the abilities of respondents are
assessed as a function of item difficulty (Bond & Fox,
2001), a limitation that extends to CFA.
Second, the use of ordinal level data (i.e., rating
scales), which are routinely treated in statistical
analyses as continuous, interval-level data, may violate
the scale and distributional assumptions of CFA (Wirth
& Edwards, 2007). Violating these assumptions may
result in model parameters that are "biased and
impossible to interpret" (Wirth & Edwards, 2007, p. 58; DiStefano, 2002). The logarithmic transformation
of ordinal level raw data into interval level data in IRT
analysis overcomes this problem.
IRT and CTT also differ in the treatment of the
standard error of measurement. The standard error of
measurement is an indication of variability in scores
due to error. Under CTT, the standard error of
measurement is averaged across persons in the sample
or population and is specific to that sample or
population. Under IRT, the standard error of
measurement is considered to vary across scores in the
same population and to be population-general
(Embretson & Reise, 2000). The IRT approach to the
standard error of measurement offers the following
benefits: (a) the precision of measurement can be
evaluated at any level of the latent trait instead of
averaged over trait levels as in CTT, and (b) the
contribution of each item to the overall precision of the
measure can be assessed and used in item selection
(Hambleton & Swaminathan, 1985).
MIRT and CFA differ in the estimation of item
fit. Whereas item fit is assessed through error variances,
communalities, and factor loadings in CFA, item fit is
assessed through unweighted (outfit) and weighted
(infit) mean square errors in IRT analyses (Bond &
Fox, 2001). Further, the treatment of the relationship
between indicator and latent variable, which is
constrained to a linear relationship in CFA, can be
nonlinear in IRT (Greguras, 2005). CFA uses one
number, the factor loading, to represent the relationship
between the indicator and the latent variable across all
levels of the latent variable; in IRT, the relationship
between indicator and latent variable is given across
the range of possible values for the latent variable
(Greguras, 2005). Potential implications of these
differences include inconsistencies in parameter
estimates, indicator and factor structure, and model fit
across MIRT and CFA analyses.
Both IRT and CFA provide statistical indicators
of psychometric performance not available in the other
analysis. Using the item information curve (IIC), IRT
analysis allows the researcher to establish both item
information functions (IIF) and test information
functions (TIF). The IIF estimates the precision and
reliability of individual items independent of other
items on the measure; the TIF provides the same
information for the total test or measure, which is a
useful tool in comparing and equating multiple tests
(Hambleton et al., 1991; Embretson & Reise, 2000).
IRT for polytomous response formats also provides
estimated category thresholds for the probability of
endorsing a given response category as a function of
the level of underlying trait. These indices of item and
test performance and category thresholds are not
available in CFA in which item and test performance
are conditional on the other items on the measure.
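The sketch below (my illustration, assuming NumPy and a two-parameter model; all parameter values are made up) shows how item information functions sum to a test information function, and how the conditional standard error of measurement follows from it.

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information contributed by one 2PL item at each theta value."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 61)
a = np.array([1.0, 1.4, 0.8])          # illustrative discriminations
b = np.array([-0.5, 0.2, 1.1])         # illustrative difficulties
iif = np.array([item_information(theta, ai, bi) for ai, bi in zip(a, b)])
tif = iif.sum(axis=0)                  # test information: sum of item information
sem = 1.0 / np.sqrt(tif)               # conditional standard error of measurement
```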
Conversely, CFA offers a wide range of indices for
evaluating model fit, whereas IRT is limited to the use
of the χ² deviance statistic. Reise, Widaman, and Pugh (1993) explicitly identified the lack of modification
indices and additional model fit indicators for IRT
analyses as a limitation.
Participation in a Social Work Community of
Practice Scale
Although the content of the Participation in a
Social Work Community of Practice Scale (PSWCoP)
is less important in the current discussion than the
methodologies used to evaluate the scale, a brief
overview will provide context for interpreting the
results of the analyses. The PSWCoP scale is an
assessment of students' motivations for entering a master of social work (MSW) program as
conceptualized in Wenger, McDermott, and Snyder's (2002) three-dimensional model of motivation for
participation in a community of practice. Wenger et al.
(2002) asserted that all communities of practice
comprise three fundamental elements (p. 27): a
domain of knowledge defining a set of issues; a
community of people who care about the domain; and,
the shared practice developed to be effective in that
domain. Some individuals are motivated to participate
because they care about the domain and are interested
in its development. Some individuals are motivated to
participate because they value being part of a
community as well as the interaction and sharing with
others that is part of having a community. Finally,
some individuals are motivated to participate by a
desire to learn about the practice as a means of
improving their own techniques and approaches. The
PSWCoP was developed as a multidimensional
measure of the latent constructs domain motivation,
community motivation, and practice motivation (Table
1). Data were collected from a convenience sample of
students enrolled in MSW programs using a cross-
sectional survey design and compared to the three-
factor model developed from Wenger et al.
Method
Participants
A convenience sample of 528 current MSW
students was drawn from 11 social work programs
accredited by the Council on Social Work Education
(CSWE). Participants were enrolled during two
separate recruitment periods. The first round of
recruitment yielded a nonrandom sample of 268
students drawn from nine academic institutions. The
second round of recruitment yielded a nonrandom
sample of 260 students drawn from eight institutions.
Six institutions participated in both rounds of data
collection, three institutions participated in only the
first round of data collection, and two institutions
participated in only the second round of data collection.
The response rate for the study could not be calculated
because there was no way to determine the total
number of students who received information about the
study or had access to the online survey. Twenty-two
cases (4.1%) were removed because of missing data,
yielding a final sample of 506 students; listwise
deletion was used given the extremely small amount of
missing data.
Data were collected on multiple student
characteristics including age, gender, race/ethnicity,
sexual orientation, religious affiliation, participation in
religious activities, family socioeconomic status (SES),
and enrollment status. The mean age of participants
Table 1
Original Items on the Participation in a Social Work Community of Practice Scale

Item | Factor
My main interest for entering the MSW program was to be a part of a community of social workers. | Community (C_1)
I wanted to attend a MSW program so that I could be around people with similar values to me. | Community (C_2)
I chose a MSW program because I thought social work values were more similar to my values than those of other professions. | Community (C_3)
There is more diversity of values among students than I expected. | Community (C_4)*
Before entering the program, I was worried about whether or not I would fit in with my peers. | Community (C_5)*
Learning about the social work profession is less important to me than being part of a community of social workers. | Community (C_6)*
Without a MSW degree, I am not qualified to be a social worker. | Practice (P_1)
A MSW degree is necessary to be a good social worker. | Practice (P_2)
Learning new social work skills was not a motivating factor in my decision to enter the MSW program. | Practice (P_3)
My main reason for entering the MSW program was to acquire knowledge and/or skills. | Practice (P_4)
A MSW degree will give me more professional opportunities than other professional degrees. | Practice (P_5)*
Being around students with similar goals is less important to me than developing my skills as a social worker. | Practice (P_6)*
Learning how to be a social worker is more important to me than learning about the social work profession. | Practice (P_7)*
I find social work appealing because it is different than the type of work I have done in the past. | Domain (D_1)
I decided to enroll in a MSW program to see if social work is a good fit for me. | Domain (D_2)
I wanted to attend a MSW program so that I could learn about the social work profession. | Domain (D_3)
Entering the MSW program allowed me to explore a new area of professional interest. | Domain (D_4)
My main reason for entering the MSW program was to decide if social work is the right profession for me. | Domain (D_5)

*Items deleted from the final version of the PSWCoP
was 30.2 years (SD = 8.7 years). The majority of
students were female (92%). The majority of the
participants were Caucasian (82.6%), with 7.3% of
students self-identifying as African American or Black;
4.1% as Hispanic; 1.8% as Asian/Pacific Islander; and
4.1% as a nonspecified race/ethnicity. Students
identified their enrollment status as either part-time
(19.5%), first year (32.7%), advanced standing (27%),
or second year (20.8%).
Measures
Analyses were conducted on an original measure
of students' motivations for entering a social work community of practice, defined as pursuing a MSW
degree. The PSWCoP was developed and evaluated
using steps outlined by Benson and Clark (1982)
and DeVellis (2003). The pilot measure contained 18 items designed to measure three constructs (domain,
community, and practice). Items were measured on a 6-
point rating scale from strongly disagree to strongly
agree. Items from the pilot measure organized by
subscale are listed in Table 1. In addition to items on
the PSWCoP, students were asked to provide
demographic information.
Procedures
Participants completed the PSWCoP survey as part
of a larger study exploring the relationship between
students' motivations to pursue the MSW degree, their attitudes about diversity and historically marginalized
groups, and their endorsement of professional social
work values as identified in the National Association of
Social Workers (2009) Code of Ethics. This research
was approved by the University of Denver Institutional
Review Board prior to recruitment and data collection.
Recruitment consisted of a two-pronged approach: (a)
an e-mail providing an overview of the study and a link
to the online survey was sent to students currently
enrolled in the MSW program; and (b) an
announcement providing an overview of the study and
a link to the online survey posted to student-oriented
informational Web sites. Interested participants were
able to access the anonymous, online survey through
www.surveymonkey.com, which is a frequently used
online survey provider. Participants were presented
with a project information sheet and were required to
indicate their consent to participate by clicking on the
appropriate response before being allowed to access the
actual survey.
Results
Reliability of scores from the PSWCoP was
assessed using both CTT and IRT methods. SPSS
(v.16.0.0, 2007) was used to calculate internal
consistency reliability (Cronbach's α; inter-item correlations). Acer Conquest 2.0 (Wu et al., 2008) was
used to assess item reliability. The dimensionality and
factor structure of the PSWCoP were evaluated using
both a MIRT and a CFA approach. Acer Conquest 2.0
(Wu et al., 2008) was used to conduct the MIRT
analysis and Lisrel 8.8 (Jöreskog & Sörbom, 2007) was
used to conduct the CFA analysis. Acer Conquest 2.0
was used to evaluate the PSWCoP with respect to
estimates of levels of latent trait and item difficulty
using a one-parameter logistic model. Assessment of
the measure was based on model fit, person-fit, item
fit, person reliability, item reliability, step calibration,
and population parameters for the multidimensional
model.
Item Selection
Items were identified for possible deletion from
each subscale using Cronbach's alpha, IRT MNSQ infit/outfit results, and theory. Poorly performing
items identified through statistical analyses were
further assessed using conceptual and theoretical
frameworks. A combination of results led to the
removal of three items from the community subscale,
three items from the practice subscale, but no items
from the domain subscale (Table 1). Items C_6, P_6,
and P_7 addressed relationships between types of
motivations by asking respondents to rate whether one
type of motivation was more important than another
type. Quantitative differences between types of
motivations were not addressed in community of
practice theory, and therefore these items were deemed
not applicable in the measurement of each type of
motivation. Items C_4 and C_5 were deleted from the
community subscale because these items specifically
addressed relationships between respondents and peers.
Community-based motivation arises out of perceived
value congruence between the individual and the
practice (i.e., professional social work), and not
between the individual and other members of the
community of practice. All analyses indicated
problems with the practice subscale, and ultimately
exploratory factor analysis (EFA) was used with this subscale only. The results of
the EFA suggested items P_1 and P_2 formed one
factor, and items P_3 and P_4 constituted a second
factor. Item P_5 did not load on either factor and was
deleted.
The results of the item selection process yielded
two competing models. The first model consisted of
three factors in which all items developed for the
practice subscale were kept together; this model most
closely reflected the original hypothetical model
developed based on community of practice theory. The
second model had four factors with the items from the
hypothesized practice subscale split into the two factors
suggested by the EFA. Internal consistency for each of
the subscales on the final version of the PSWCoP was
assessed using Cronbach's alpha. Cronbach's alpha was 0.64 for scores from the domain subscale, 0.68 for
scores from the community subscale, and 0.47 for
scores from the practice subscale (three-factor model).
Splitting the practice subscale into two factors yielded
a Cronbach's alpha of 0.58 for scores from the skills subscale and 0.68 for scores from the competency
subscale. Although ultimately indicative of a poor
measure, low internal consistency did not prohibit the
application and comparison of factor analysis using
CFA and MIRT.
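A compact sketch of the internal consistency computation reported here (the standard Cronbach's alpha formula, assuming NumPy; the array name is mine, with respondents in rows and one subscale's items in columns):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (N respondents x k items) array of ratings."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)
```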
Factor Structure
CFA. CFA analyses of the PSWCoP were
conducted using Lisrel 8.8 (Jöreskog & Sörbom, 2007).
The data collected using the PSWCoP were considered
ordinal based on the 6-point rating scale. When data
are considered ordinal, Jöreskog and Sörbom (2007)
advocated the use of PRELIS to calculate asymptotic
covariances and polychoric correlations of all items
modeled, and LISREL or SIMPLIS with weighted least
squares estimation to test the structure of the data.
Failure to use these guidelines may result in
underestimated parameters, biased standard errors, and
an inflated chi-square (χ²) model fit statistic (Flora & Curran, 2004). The chi-square difference statistic (χ²D) was used to test the statistical significance of the
change in model fit between nested models (Kline,
2005). The χ²D was calculated as the difference between the model chi-square (χ²M) values of nested models using the same data; the df for the χ²D statistic is the difference in dfs for nested models. The χ²D statistic tested the null hypothesis of identical fit of two
models to the population. In all, three nested models
were evaluated and compared sequentially: a four-
factor model with cross-loadings served as the baseline
model, followed by a four-factor model without cross-
loadings, and a three-factor model without cross-
loadings. The four-factor model with cross-loadings
was chosen as the baseline model because it was
presumed to demonstrate the best fit, having the fewest
degrees of freedom. The primary models of interest
were then compared against this baseline to estimate
the change in model fit.
Sun (2005) recommended considering fit indices
in four categories: sample-based absolute fit indices,
sample-based relative fit indices, population-based
absolute indices, and population-based relative fit
indices. Sample-based fit indices are indicators of
observed discrepancies between the reproduced
covariance matrix and the sample covariance matrix.
Population-based fit indices are estimations of
difference between the reproduced covariance matrix
and the unknown population covariance matrix. At a
minimum, Kline (2005) recommended interpreting and
reporting four indices: the model chi-square (sample-
based), the Steiger-Land root mean square error of
approximation (RMSEA; population-based), the
Bentler comparative fit index (CFI; population-based),
and the standardized root mean square residual
(SRMR; sample-based). In addition to these fit indices,
this study examined the Akaike information criteria
(AIC; sample-based) and the goodness-of-fit index
(GFI; sample-based). Jackson, Gillaspy, and
Purc-Stephenson's (2009) review of CFA journal
articles published over the past decade identified these
six fit indices as the most commonly reported.
The range of values indicating good fit of observed
data to the measurement model varies depending on the
specific fit index. The model chi-square statistic tests
the null hypothesis that the model has perfect fit in the
population. Degrees of freedom for the chi-square
statistic equal the number of observations minus the
number of parameters to be estimated. Given its
sensitivity to sample size, the chi-square test is often
statistically significant. Kline (2005) suggested using a
normed chi-square statistic obtained by dividing chi-
square by df; ideally, these values should be less than
three. The SRMR is a measure of the differences
between observed and predicted correlations; in a
model with good fit, these residuals will be close to
zero. Hu and Bentler (1999) suggested that an SRMR value at or below .08 indicates good fit.
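As a quick illustration of two of these indices, the reported three-factor values can be roughly reproduced with the sketch below. This is my own calculation; the RMSEA line uses the common sqrt((χ² − df)/(df(N − 1))) approximation, which is an assumption and not necessarily the exact LISREL computation.

```python
def normed_chi_square(chi2_stat, df):
    """Kline's normed chi-square: model chi-square divided by its df."""
    return chi2_stat / df

def rmsea(chi2_stat, df, n):
    """Point estimate of RMSEA from the model chi-square and sample size."""
    return (max(chi2_stat - df, 0.0) / (df * (n - 1))) ** 0.5

# Three-factor model values reported below (chi-square = 359.90, df = 51)
# with the study's N = 506 respondents.
print(normed_chi_square(359.90, 51))   # about 7.06
print(rmsea(359.90, 51, 506))          # about 0.11
```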
Figure 1. Standardized solution for four-factor PSWCoP model
Three Factor Model without Cross-Loadings. The
three-factor model corresponded to the original model
of the PSWCoP (Figure 2). Three latent variables were
included in this model: domain, community, and
practice. Items were constrained to load on the factor
for which they were designed. The four items
originally developed for the practice subscale were
constrained to load on a single latent variable, which
represented a perfect correlation between the
previously used latent variables competency and skills.
Based on the six fit indices previously described, the
overall fit of the model was poor: χ² = 359.90, df = 51, p < .001; RMSEA = 0.11 [90% CI: .10, .12]; CFI = 0.80;
SRMR = 0.12; AIC = 413.90; GFI = 0.85. When
compared with the four-factor model without cross-loadings, this model demonstrated a significant
increase in model misfit [(χ²1 − χ²2)(df1 − df2) = 174.38(3),
p < .001]. All of the fit statistics indicated that the data
did not fit the model.
Figure 2. Standardized solution for three-factor PSWCoP model
Summary of CFA of the PSWCoP. A summary of
fit indices across nested models is provided in Table 2.
The model with the best overall fit was the four-factor
model in which items were allowed to load across all
factors. The fit of this model was good, but the model
lacked conceptual support and was not interpretable
with respect to the underlying latent structure of the
PSWCoP. Although the four-factor model with
constrained loadings had a significant increase in
model misfit over the four-factor model with cross-
loadings, the four-factor model with constrained
loadings demonstrated acceptable fit. The results of the
CFA on the four-factor model without cross-loadings
supported the hypothesis of a multidimensional
measure because correlations between latent variables
were computed and there were no significant
correlations between any pair of latent variables
(α = .01).
The four-factor model with constrained loadings
was compared with a three-factor model based on the
originally proposed measurement model for the
PSWCoP. The conceptual difference between the two
models was the placement of the items developed for
the practice subscale. Constraining these four items to
load on a single latent variable resulted in a large
increase in model misfit. All of the reported fit
statistics indicated a model with poor fit.
Table 2
Comparison of Fit Indices across Nested Models

Index | Model 1: 4 Factor Model | Model 2: Unidimensional 4 Factor Model* | Model 3: Unidimensional 3 Factor Model**
χ²(df) | 64.48(35) | 185.52(48) | 359.90(51)
Normed χ² (χ²/df) | 1.84 | 3.86 | 7.05
p-value (model) | .002
latent trait was the amount of motivation a given
student possessed. In general, the range of the latent
trait of the sample and item difficulties were the same,
and the distribution of persons and items about the
mean were relatively symmetrical, indicating a good
match between the latent trait of students and the
difficulty of endorsing items. Exact numerical values
for item difficulty are provided in Table 3 and ranged
from -1.05 to +0.94. Item difficulty was scaled
according to the theta metric, and indicated the level of
the latent trait at which the probability of a given
response to the item was .50. Theta () is the level of the latent trait being measured and scaled with a mean
of zero and a standard deviation of one. Negative
values indicated items that were easier to endorse, and
positive values indicated items that were harder to
endorse.
Item fit is an indication of how well an item
performs according to the underlying IRT model being
tested, and it is based on the comparison of observed
responses to expected responses for each item. Adams
and Khoo (1996) suggested that items with good fit
have infit scores between 0.75 and 1.33; Bond and Fox
(2001) suggested that items with good fit have t values
between -2 and +2. Table 3 provides the fit statistics
for the items of the PSWCoP survey; according to this
output, only item P_3_R exceeded Bond and Fox's guideline, and no items exceeded Adams and Khoo's guideline.
Table 3
Rasch Analysis of Full Survey Item Difficulty and Fit

Item | Label | Est. | S.E. | Infit MNSQ | Infit ZSTD | Outfit MNSQ | Outfit ZSTD
1 | C_1 | 0.30 | .04 | 1.05 | 0.9 | 1.06 | 1.2
2 | C_2 | 0.04 | .04 | 0.93 | -1.1 | 0.94 | -1.0
3 | C_3 | 0.05 | .03 | 1.02 | 0.5 | 1.06 | 1.1
4 | P_1 | -0.56 | .04 | 1.01 | 0.1 | 1.06 | 0.8
5 | P_2 | 0.30 | .04 | 0.98 | -0.4 | 1.00 | 0.1
6 | D_1 | 0.68 | .04 | 0.94 | -1.1 | 0.93 | -1.1
7 | D_2 | -0.11 | .04 | 0.91 | -1.4 | 0.89 | -1.6
8 | D_3 | 0.24 | .04 | 1.01 | 0.1 | 1.05 | 0.9
9 | D_4 | -0.33 | .04 | 1.07 | 1.1 | 1.08 | 1.1
10 | D_5 | 0.94 | .04 | 0.97 | -0.4 | 0.95 | -0.7
11 | P_3 | -0.51 | .04 | 1.17 | 2.1 | 1.35 | 4.0
12 | P_4 | -1.05 | .06 | 0.93 | -0.7 | 0.92 | -0.9

Note. Est. = item difficulty estimate; MNSQ = mean square; ZSTD = standardized fit statistic.
IRT analysis produced an item reliability index
indicating the extent to which item estimates would be
consistent across different samples of respondents with
similar abilities. High item reliability indicates that the
ordering of items by difficulty will be somewhat
consistent across samples. The reliability index of
items for the PSWCoP pilot survey was 0.99, and
indicated consistency in ordering of items by difficulty.
IRT analysis also produced a person-reliability index
that indicated the extent of consistency in respondent
ordering based on level of latent trait if given an
equivalent set of items (Bond & Fox, 2001). The
reliability index of persons for the PSWCoP was 0.60,
and indicated low consistency in ordering of persons
by level of latent trait, which was possibly due
to a constricted range of the latent trait in the sample or
a constricted range of item difficulty.
MIRT factor structure. One of the core
assumptions of IRT is unidimensionality; in other
words, that person-ability can be attributed to a single,
latent construct, and that each item contributes to the
measure of that construct (Bond & Fox, 2001).
However, whether intended or not, item responses may
be attributable to more than one latent construct. MIRT
analyses allow the researcher to assess the
dimensionality of the measure. Multidimensional
models can be classified as either within items or between items (Adams, Wilson, & Wang, 1997). Within-items multidimensional models have items that
can function as indicators of more than one dimension,
and between-items multidimensional models have
subsets of items that are mutually exclusive and
measure only one dimension.
Competing multidimensional models can be
evaluated based on changes in model deviance and
number of parameters estimated. A chi-square statistic
is calculated as the difference in deviance (G²) between
two nested models with df equal to the difference in
number of parameters for the nested models. A
statistically significant result indicates a difference in
model fit. When a difference in fit is found, the model
with the smallest deviance is selected; when a
difference in model fit is not found, the more
parsimonious model is selected.
The baseline MIRT model corresponded to the
four-factor model with no cross-loadings estimated in
the CFA (Figure 1). This baseline model was a
between-items multidimensional model with items
placed in mutually exclusive subsets. The four
dimensions in the model were community,
competency, domain, and skills. The baseline model fit
statistic was G² = 17558.64 with 26 parameters. A three-
dimensional, between-items, multidimensional model,
corresponding to the theoretical model of the PSWCoP
(Figure 2) was tested against the baseline model. The
three-dimensional model fit statistic was G² = 17728.83
with 22 parameters. When compared with the four-
dimensional model, the change in model fit was
statistically significant and indicated that the fit of the
three-dimensional model was worse than the fit of the
four-dimensional model (χ²(4) = 170.19, p < .001).
Table 4
Comparison of Model Fit Across Nested Models

Statistic | Four Factor (Between) | Three Factor* (Between)
Deviance (G²) | 17558.64 | 17728.83
df | 26 | 22
G²1 − G²2 | | -170.19
df1 − df2 | | 4
(G²1 − G²2)/(df1 − df2) | | 42.55
p-value | | < .001

* Compared to the Four Factor, Between-Items Model
Based on the change in model fit between nested
models, the four dimensional, between-items model
had the better fit. This model resulted in a more
accurate reproduction of the probability of endorsing a
specific level or step of an item for a person with a
particular level of the latent trait (Reckase, 1997).
Thus, the four-dimensional model yielded the greatest
reduction in discrepancy between observed and
expected responses.
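The deviance comparison in Table 4 can be checked in a few lines; this is a sketch assuming SciPy, using only the values reported above.

```python
from scipy.stats import chi2

g2_four, n_params_four = 17558.64, 26     # four-dimensional model
g2_three, n_params_three = 17728.83, 22   # three-dimensional model

diff = g2_three - g2_four                 # 170.19: larger (worse) deviance for the 3D model
df_diff = n_params_four - n_params_three  # 4 additional parameters in the 4D model
print(diff, df_diff, chi2.sf(diff, df_diff))  # p < .001, favoring the four-dimensional model
```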
Item difficulty. MIRT analyses yielded an item-
person map by dimension. The output of the MIRT
item-person map (Figure 3) provided a visual estimate
of the latent trait in the sample, item difficulty, and
each dimension. Items are ranked in the right-hand
column by difficulty, with items at the top being more
difficult than items at the bottom. Although the range
of item difficulty was narrow, items were well
dispersed around the mean. Each dimension or factor
has its own column with estimates for respondents'
abilities. Two inferences were made based on the
MIRT item-person map. First, although the range of
item difficulty was narrow, items appeared to be
dispersed in terms of difficulty with a range of -0.81 to
+0.84. Furthermore, regarding Dimensions 1, 2, and 3,
the item difficulties appeared to be well matched to
levels of the latent trait, though over a limited range of
the construct as scaled via the theta metric. Second,
based on the means of the dimensions, Dimension 2
(competency, mean = 0.069) and Dimension 3 (domain,
mean = -0.074) did a better job of representing all levels of
these types of motivation than the other two
dimensions. The small positive mean of Dimension 1
(community, 0.335) indicated that students sampled
for this study found it somewhat easier to endorse those
items, whereas the large positive mean of Dimension 4
(skills, 1.42) indicated that students sampled for this
study found it very easy to endorse those items.
Figure 3. MIRT Latent Variable Item-Person Map
Item fit. Table 5 summarizes the items' characteristics. In addition to the estimation of item
difficulties, infit and outfit statistics are reported. Using
Adams and Khoo's (1996) guideline, only item C_2
showed poor fit (MNSQ = 0.68). In contrast, Bond and
Fox's (2001) guideline identified several items as having poor fit (based on a 95% CI for MNSQ): C_1,
C_2, D_1, D_3, and D_4.
Table 5
Item Parameter Estimates for 4 Dimensional Model

Item | Label | Est. | S.E. | Infit MNSQ | Infit ZSTD | Outfit MNSQ | Outfit ZSTD
1 | C_1 | 0.40 | 0.03 | 0.77 | -3.8 | 0.77 | -4.5
2 | C_2 | 0.21 | 0.03 | 0.68 | -5.6 | 0.67 | -6.5
3 | C_3 | -0.61* | 0.04 | 1.02 | 0.4 | 1.04 | 0.6
4 | P_1 | -0.14 | 0.03 | 1.01 | 0.2 | 1.00 | 0.0
5 | P_2 | 0.14* | 0.03 | 0.96 | -0.5 | 0.93 | -1.1
6 | D_1 | 0.11 | 0.03 | 1.21 | 3.0 | 1.18 | 3.0
7 | D_2 | 0.51 | 0.03 | 1.02 | 0.4 | 1.04 | 0.7
8 | D_3 | -0.65 | 0.03 | 1.29 | 4.1 | 1.30 | 4.4
9 | D_4 | -0.81 | 0.03 | 1.17 | 2.5 | 1.22 | 3.1
10 | D_5 | 0.84* | 0.06 | 0.95 | -0.7 | 0.98 | -0.2
11 | P_3_R | 0.33 | 0.04 | 1.00 | -0.0 | 1.02 | 0.4
12 | P_4 | -0.33* | 0.04 | 0.99 | -0.2 | 1.00 | -0.0

*Indicates that a parameter estimate is constrained
Discussion
The rigor and sophistication with which social
workers conduct psychometric assessments can be
strengthened. Guo et al. (2009) found that social
workers underutilize CFA, and more generally SEM,
analyses. Further, even when those approaches are used
appropriately, considerable room remains for
improvement in reporting (Guo et al., 2009). Similarly,
Unick and Stone (2010) found the use of IRT analyses
for psychometric evaluation was noticeably missing
from the social work literature. Developing familiarity
and proficiency with strong psychometric methods will
empower social workers in developing and selecting
appropriate measures for research, policy, and practice.
Integration of CFA and MIRT Results
The primary result from both the CFA and MIRT
analyses was the establishment of the PSWCoP as a
multidimensional measure. Both sets of analyses
identified a four-factor model in which items loaded on
a single factor as having the best model fit when
compared with the three-factor model. In addition, both
analytic strategies identified significant problems with
the PSWCoP. Low subscale internal consistencies
might be due to the small number of items for the
community, skills, and competency subscales, as well
as the inability to capture the complexity of different
types of motivation for participating in a social work
community of practice. CFA identified multiple items
with high (>.7) error variances, and IRT analyses
indicated poor fit for several items. Although the
results of the analyses identified the PSWCoP as
having limited utility, these poor psychometric
properties did not prohibit CFA and MIRT analyses.
The CFA analysis was found to be more
informative at the subscale level, whereas the MIRT
analysis was found to be more informative at the item
level. CFA was more informative regarding subscale
composition and assessing associations among factors.
The CFA analysis led to a final form of the PSWCoP
with four subscales, and beginning evidence supporting
the factorial validity of the measure. As indicated by
the nonsignificant correlations among factors, each
subscale appeared to be tapping separate constructs.
Although MIRT allows the researcher to model factor
structure, this approach does not estimate relationships
between factors.
MIRT analyses were found to be more informative
for assessing individual item performance. Item
difficulty estimates were obtained for the PSWCoP as a
whole and for each subscale. Items on the PSWCoP
appeared to be a good match for the levels of latent
trait of the respondents with regards to the community,
domain, and competency factors, but too easy for the
skills factor. Based on infit and outfit statistics, MIRT
analyses identified additional items exhibiting poor fit
as compared with the CFA. Specifically, two items on
the community subscale had large standardized fit
scores in the IRT analysis but displayed high factor
loadings and low error variances in the CFA. The IRT
analyses also provided estimates of the item
information function and test information function,
making it possible to get specific estimates of standard
errors of measurement instead of relying on an
averaged standard error of measurement obtained from
the CFA.
Strengths and Limitations
Reliance on a convenience sample is a significant
limitation of this study. The extent to which
participants in this study were representative of the
larger population of MSW students was indiscernible.
Although IRT purports to generate sample-independent
item characteristic estimations, the stability of these
estimations is enhanced when the sample is
heterogeneous with regard to the latent trait. It is
possible that students who self-selected to complete the
measure were overly similar.
A study strength is its contribution to the field of
psychometric assessment. Previous studies comparing
IRT and CFA have dealt almost exclusively with
assessing measurement invariance across multiple
samples (e.g., Meade & Lautenschlager, 2004; Raju,
Lafitte, & Byrne, 2002; Reise et al., 1993). The current
study addresses emerging issues in measurement
theory by applying IRT analyses to multidimensional
latent variable measures, and comparing MIRT and
CFA assessments of factor structure in a novel
measure.
Implications for Social Work Research
In addition to the benefits of using IRT/MIRT
analytic procedures outlined in this paper, the ability of
these techniques to assess differential item functioning
(DIF) and differential test functioning (DTF) is a major
advantage over CTT methods. Wilson (1985) described
DIF as an indication of whether an item performs the
same for members of different groups who have the
same level of the latent trait, whereas DTF is invariant
performance of a set of items across different groups
(Badia, Prieto, Roset, Díez-Pérez, & Herdman, 2002).
If DIF/DTF exists, respondents from the subgroups
who share the same level of a latent trait "do not have the same probability of endorsing a test item" (Embretson & Reise, 2000, p. 252). The ability to
assess potential bias in items and tests provides a
powerful method for developing culturally competent
measures (Teresi, 2006). Valid comparisons between
groups require measurement invariance, and IRT
provides an additional tool for examining both items
and tests.
Additional benefits of IRT analyses include the
ability to conduct test equating and develop adaptive
testing. The core question of test equating is the extent
to which scores from two measures presumed to
measure the same construct are comparable. For
example, are the Beck Depression Inventory and the
Center for Epidemiological Studies Depression Scale
(CES-D) equatable? Adaptive testing allows the
researcher to match specific items to different levels of
ability to more finely discern a person's ability; persons estimated to have high ability may receive a
different set of items than a person estimated to have
low ability. With the increasing availability of
statistical software for conducting MIRT analyses, the
potential also exists for developing models with greater
complexity for testing differential factor functioning
(DFF). Akin to testing measurement invariance using
CFA techniques, DFF analyses will provide researchers
with an assessment of potential bias in performance of
factors (e.g., subscales) across groups.
A final consideration is choosing between the
different psychometric strategies outlined in this paper.
Ideally, both methods should be integrated. Doing so
gives the researcher access to unique information
available only from each analytic method, allows the
researcher to compare common elements of both
analyses, and minimizes the impact of each method's limitations. If applying both methods is not possible,
theoretical and practical considerations can inform the
decision. IRT is a stronger choice when data are
dichotomous or ordinal because raw scores are
transformed to an interval scale. If the relationship
between items and factors is nonlinear or unknown,
IRT will yield less biased estimates than CFA. If the
construct to be measured is presumed to be
unidimensional, IRT is a better strategy because of the
additional information provided in the item analysis.
Both MIRT and CFA are informative in assessing
latent factor structures, but only CFA allows the
researcher to estimate relationships between factors.
Both strategies perform better with large sample sizes,
but IRT is affected more negatively by smaller samples
given the larger number of parameters being estimated.
If possible, IRT/MIRT analysis should be reserved for samples of 200 or more respondents. Conversely, IRT
analyses yield stable results with very few items,
whereas CTT reliability varies in part as a function of
the number of items.
References
Adams, R. J., & Khoo, S. T. (1996). ACER Quest
[Computer software]. Melbourne, Australia: ACER.
Adams, R. J., Wilson, M., & Wang, W. (1997). The
multidimensional random coefficients multinomial
logit model. Applied Psychological Measurement,
21, 1-24. doi:10.1177/0146621697211001
Andrich, D. (1988). Rasch models for measurement.
Newbury Park, CA: Sage.
Badia, X., Prieto, L., Roset, M., Díez-Pérez, A., &
Herdman, M. (2002). Development of a short
osteoporosis quality of life questionnaire by
equating items from two existing instruments.
Journal of Clinical Epidemiology, 55, 32-40.
doi:10.1016/S0895-4356(01)00432-2
Benson, J., & Clark, F. (1982). A guide for instrument
development and validation. American Journal of
Occupational Therapy, 36, 789-800.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch
model (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
DeVellis, R. F. (2003). Scale development: Theory and
applications. Thousand Oaks, CA: Sage.
DiStefano, C. (2002). The impact of categorization
with confirmatory factor analysis. Structural
Equation Modeling, 9, 327-346.
doi:10.1207/S15328007SEM0903_2
Embretson, S. E., & Reise, S. P. (2000). Item response
theory for psychologists. Mahwah, NJ: Lawrence
Erlbaum.
Flora, D. B., & Curran, P. J. (2004). An empirical
evaluation of alternative methods of estimation for
confirmatory factor analysis with ordinal data.
Psychological Methods, 9, 466-491.
doi:10.1037/1082-989X.9.4.466
Fries, J., Bruce, B., & Cella, D. (2005). The promise of
PROMIS: Using item response theory to improve
assessment of patient-reported outcomes. Clinical
and Experimental Rheumatology, 23(5, Suppl. 39),
S53-S57.
Greguras, G. J. (2005). Managerial experience and the
measurement equivalence of performance ratings.
Journal of Business and Psychology, 19(3), 383-
397. doi:10.1007/s10869-004-2234-y
Guo, B., Perron, B. E., & Gillespie, D. F. (2009). A
systematic review of structural equation modeling
in social work research. British Journal of Social
Work, 39, 1556-1574. doi:10.1093/bjsw/bcn101
Hambleton, R. K., & Swaminathan, H. (1985). Item
response theory: Principles and applications.
Boston, MA: Kluwer/Nijhoff.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J.
(1991). Fundamentals of item response theory.
Newbury Park, CA: Sage.
Henard, D. H. (2000). Item response theory. In L. G.
Grimm & P. R. Yarnold (Eds.), Reading and
understanding more multivariate statistics (pp. 67-98).
Washington, DC: American Psychological
Association.
Jackson, D. L., Gillaspy, J. A, & Purc-Stephenson, R.
(2009). Reporting practices in confirmatory factor
analysis: An overview and some recommendations.
Psychological Methods, 14(1), 6-23.
doi:10.1037/a0014694
Jöreskog, K. G., & Sörbom, D. (2007). LISREL 8.80 for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.
Klem, L. (2000). Structural equation modeling. In L.
G. Grimm and P. R. Yarnold (Eds.), Reading and
understanding more multivariate statistics, (pp.
227-260). Washington, DC: American
Psychological Association.
Kline, R. B. (2005). Principles and practice of
structural equation modeling (2nd ed.). New York,
NY: Guilford Press.
Linacre, J. M. (1994). Sample size and item calibration
stability. Rasch Measurement Transactions, 7(4),
328.
Linacre, J. M. (1999). Investigating rating scale
category utility. Journal of Outcome Measurement,
3(2), 103-122.
Linacre, J. M. (2006). Winsteps Rasch measurement
3.68.0 [Software]. Chicago, IL: Author.
Lord, F. M. (1980). Applications of item response
theory to practical testing problems. Hillsdale, NJ:
Lawrence Erlbaum.
Meade, A.W., & Lautenschlager, G. J. (2004). A
comparison of item response theory and
confirmatory factor analytic methodologies for
establishing measurement equivalence/invariance.
Organizational Research Methods, 7(4), 361-388.
doi:10.1177/1094428104268027
Müller, U., Sokol, B., & Overton, W. F. (1999).
Developmental sequences in class reasoning and
propositional reasoning. Journal of Experimental
Child Psychology, 74, 69-106.
doi:10.1006/jecp.1999.2510
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002).
Measurement equivalence: A comparison of
methods based on confirmatory factor analysis and
item response theory. Journal of Applied
Psychology, 87, 517-529. doi:10.1037/0021-
9010.87.3.517
Rasch, G. (1980). Probabilistic models for some
intelligence and attainment tests. Chicago, IL:
MESA Press.
Reckase, M. D. (1997). The past and future of
multidimensional item response theory. Applied
Psychological Measurement, 21, 25-27.
doi:10.1177/0146621697211002
Reeve, B. B., & Fayers, P. (2005). Applying item
response theory modeling for evaluating
questionnaire items and scale properties. In P. M.
Fayers & R. D. Hays (Eds.), Assessing quality of
life in clinical trials: Methods and practice (2nd
ed., pp. 55-73). New York, NY: Oxford University
Press.
Reise, S. P., Ainsworth, A. T., & Haviland, M. G.
(2005). Item response theory: Fundamentals,
applications, and promises in psychological
research. Current Directions in Psychological Science, 14(2), 95-101.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993).
Confirmatory factor analysis and item response
theory: Two approaches for exploring measurement
invariance. Psychological Bulletin, 114, 552-566.
doi:10.1037/0033-2909.114.3.552
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometric Monograph No. 17). Richmond, VA: Psychometric Society. Retrieved from http://www.psychometrika.org/journal/online/MN17.pdf
Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998).
Using item mean squares to evaluate fit to the
Rasch model. Journal of Outcome Measurement,
2(1), 66-78. PMid:9661732
SPSS. (2007). SPSS for Windows, Rel. 16.0.0 [Software]. Chicago, IL: SPSS, Inc.
Sun, J. (2005). Assessing goodness of fit in
confirmatory factor analysis. Measurement and
Evaluation in Counseling and Development, 37(4),
240-256.
Swaminathan, H., & Gifford, J. A. (1979). Estimation
of parameters in the three-parameter latent-trait
model. Laboratory of Psychometric and Evaluation
Research (Report No. 90). Amherst: University of
Massachusetts.
Tabachnick, B. G., & Fidell, L. S. (2001). Using
multivariate statistics (4th ed.). Boston, MA: Allyn
and Bacon.
Teresi, J. A. (2006). Overview of quantitative
measurement methods: Equivalence, invariance and
differential item functioning in health applications.
Medical Care, 44, S39-S49.
Unick, G. J., & Stone, S. (2010). State of modern
measurement approaches in social work research
literature. Social Work Research, 34(2), 94-101.
Ware, J. E., Jr., Bjorner, J. B., & Kosinski, M. (2000).
Practical implications of item response theory and
computerized adaptive testing: A brief summary of
ongoing studies of widely used headache impact
scales. Medical Care, 38, 1173-1182. doi:10.1097/00005650-200009002-00011
Wenger, E., McDermott, R., & Snyder, W. M. (2002).
Cultivating communities of practice. Boston, MA:
Harvard Business School Press.
Wirth, R. J., & Edwards, M. C. (2007). Item factor
analysis: Current approaches and future directions.
Psychological Methods, 12(1), 58-79.
doi:10.1037/1082-989X.12.1.58
Wright, B. D., & Masters, G. N. (1982). Rating scale
analysis: Rasch measurement. Chicago, IL: MESA
Press.
Wu, M. L., Adams, R. J., Wilson, M., & Haldane, S. (2008). ACER ConQuest 2.0: Generalized item response modeling software [Computer software]. Hawthorn, Australia: ACER.