An Introduction to Using Multidimensional Item Response Theory to Assess Latent Factor Structures
Journal of the Society for Social Work and Research, October 2010
Volume 1, Issue 2, 66-82. ISSN 1948-822X. DOI: 10.5243/jsswr.2010.6
An Introduction to Using Multidimensional Item Response Theory
to Assess Latent Factor Structures
Philip Osteen
University of Maryland
This study provides an introduction to the use of multidimensional item response theory (MIRT)
analysis for assessing latent factor structure, and compares this statistical technique to
confirmatory factor analysis (CFA) in the evaluation of an original measure developed to assess
students' motivations for entering a social work community of practice. The Participation in a Social Work Community of Practice Scale (PSWCoP) was administered to 506 master of social
work students from 11 accredited graduate programs. The psychometric properties and latent
factor structure of the scale are evaluated using MIRT and CFA techniques. Although designed as
a 3-factor measure, analysis of model fit using both CFA and MIRT does not support this solution.
Instead, analyses using both methods produce convergent results supporting a 4-factor solution.
Discussion includes methodological implications for social work research, focusing on the
extension of MIRT analysis to assessment of measurement invariance in differential item
functioning, differential test functioning, and differential factor functioning.
Keywords: item response theory, factor analysis, psychometrics
In comparison to classical test theory (CTT), item
response theory (IRT) is considered the standard, if
not preferred, method for conducting psychometric
evaluations of new and established measures
(Embretson & Reise, 2000; Fries, Bruce, & Cella,
2005; Lord, 1980; Ware, Bjorner, & Kosinski, 2000).
Dubbed the "modern test theory," IRT is used across scientific disciplines, including psychology, education,
nursing, and public health. Considered a superior
method because of IRT's ability to overcome inherent limitations of CTT, IRT provides researchers with an
array of statistical tools for assessing measure
characteristics. Unfortunately, there is a notable
paucity of published research in social work using IRT.
A review of measurement-based articles appearing in
journals specific to the social work field published
between 2000 and 2006 showed that fewer than 5% of
studies used IRT analysis to evaluate the psychometric
properties of new and existing measures (Unick &
Stone, 2010). Unick and Stone hypothesized several
reasons for the absence of IRT analyses from social
work journals, one of which was a lack of familiarity
with key conceptual and practical components of IRT.
Regardless of the reasons underlying the absence
of IRT-based analyses in the social work literature, the
field of social work will benefit from researchers
becoming more familiar with IRT methods and
incorporating these analyses into social work-based
measurement studies.

Author note. Philip J. Osteen is an assistant professor in the
University of Maryland School of Social Work. All correspondence
concerning this article should be directed to [email protected]

Historically regarded as a
method for evaluating latent skill and ability traits in
education, the application of IRT to measures of
affective latent traits is becoming more common and
accepted. As outlined in this article, drawing on the
strengths of IRT as an alternative to, or ideally in
conjunction with, CTT analyses supports social work
researchers development of rigorously substantiated measures. This article provides social work researchers
with a basic overview of IRT and a demonstration of
the utility of IRT as compared with CTT-based factor
analysis by using actual data obtained with the
implementation of a novel measure of professional
motivations of master of social work (MSW) students. Published studies comparing IRT and
confirmatory factor analysis (CFA) have focused
almost exclusively on assessing measurement
invariance. This study takes a different approach in
comparing IRT and CTT by applying these theories to
the assessment of multidimensional latent factor
structures.
IRT
IRT is based on the premise that only two
elements are responsible for a person's response on any given item: the person's ability and the characteristics of the item (Bond & Fox, 2001). The most common
IRT model, called the Rasch or one-parameter logistic
model, assumes the probability of a given response is a
function of the person's ability and the difficulty of the item (Bond & Fox, 2001). More complex IRT models
estimate the probability of a given response based on
additional item characteristics such as discrimination
and guessing (Bond & Fox, 2001). Derived from its
early use in educational measurement, the term ability
may seem mismatched to psychosocial constructs; thus,
the term latent trait may be more intuitive, and
references to level of ability are synonymous with level
of the latent trait. The IRT model produces estimates
for both of these elements by calculating item-
difficulty parameters on the basis of the total number of
persons who correctly answered an item, and person-
trait parameters on the basis of the total number of
items successfully answered (Bond & Fox, 2001). The
assumptions underlying these estimates are (a) that a
person with more of the trait will always have a greater
likelihood of success than a person with less of the
trait, and (b) that any person will have a greater
likelihood of endorsing items requiring less of the trait
than items requiring more of the trait (Müller, Sokol, &
Overton, 1999). Samejima (1969) and Andrich (1978)
extended this model to measures with polytomous
response formats (i.e., Likert scales) by adding an
estimate to account for the difficulty of crossing the
threshold from one level of response to the next (e.g.,
moving from "agree" to "strongly agree").
Scale Evaluation Using IRT
The basic unit of IRT is the item response
function (IRF) or item characteristic curve. The
relationship between a respondent's performance and the characteristics underlying item performance can be
described by a monotonically increasing function
called the item characteristic curve (ICC; Henard,
2000). The ICC is typically a sigmoid curve estimating
the probability of a given response based on a person's level of latent trait. The shape of the ICC is determined
by the item characteristics estimated in the model. The
ICC in a three-parameter IRT model is derived using
the formula

P(θ) = c + (1 − c) * e^[a(θ − b − f)] / (1 + e^[a(θ − b − f)]),

where P(θ), the probability of a response given a person's level of the latent trait, denoted by theta (θ), is a function of guessing (the c parameter), item discrimination (the a parameter), item difficulty (the b parameter), and the category threshold (f) when using a
polytomous response format.
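To make the formula concrete, the following Python sketch (mine, not code from the article; it assumes NumPy) evaluates the three-parameter ICC. Setting c = 0 and a = 1 recovers the one-parameter model described next; all example values are purely illustrative.

```python
import numpy as np

def icc_3pl(theta, a=1.0, b=0.0, c=0.0, f=0.0):
    """Probability of a given response at latent trait level theta.

    a = discrimination, b = difficulty, c = guessing, f = category threshold
    (f = 0 for a dichotomous item), mirroring the parameters in the text.
    """
    logit = a * (theta - b - f)
    return c + (1.0 - c) * np.exp(logit) / (1.0 + np.exp(logit))

# Illustrative values only: a moderately discriminating item (a = 1.2),
# difficulty b = 0.5, and a 20% guessing floor.
theta = np.linspace(-3, 3, 7)
print(icc_3pl(theta, a=1.2, b=0.5, c=0.2))
```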
For the one-parameter IRT model, the guessing
parameter, c, is constrained to zero, assuming little or
no impact of guessing. For example, a person cannot
guess the correct response to an item using a Likert
scale because items are not scored as right or wrong.
The item discrimination parameter, a, is set to 1 under
the assumption that there is equal discrimination across
items. In a one-parameter model the probability of a
response is determined only by the person's level of the latent trait and the difficulty of the item. Item difficulty
is an indication of the level of the underlying trait that
is needed to endorse or respond in a certain way to the
item. For items on a rating scale, the IRF is a
mathematical function describing the relation between
where an individual falls on the continuum of a given
construct such as motivation and the probability that he
or she will give a particular response to a scale item
designed to measure that construct (Reise, Ainsworth,
& Haviland, 2005). The basic goal of IRT modeling is
to create a sample-free measure.
Multidimensional item response theory, or MIRT,
is an extension of IRT and is used to explore the
underlying dimensionality of an IRT model. Advances
in computer software (e.g., Conquest, MULTILOG, &
Mplus) allow for testing and evaluation of more
complex multidimensional item response models and
enable researchers to statistically compare competing
dimensional models. ACER Conquest 2.0 (Wu,
Adams, & Wilson, 2008), the software used in this
study, produces marginal maximum likelihood
estimates for the parameters of the models. The fit of
the models is ascertained by generalizations of the
Wright and Masters (1982) residual-based methods.
Alternative dimensional models are evaluated using a
likelihood ratio chi-squared statistic (χ²LR; Barnes, Chard, Wolfe, Stassen, & Williams, 2007).
Core statistical output of an IRT analysis of a one-
parameter rating scale model includes estimates of
person latent trait, item difficulty, model fit, person-
fit, item-fit, person reliability, item reliability, and step
calibration. A two-parameter model would include
estimates for item discrimination, and a three-
parameter model would include an additional estimate
for guessing. Person latent trait is an estimate of the
underlying trait present for each respondent. Persons
with high person-ability scores possess more of the
underlying trait than persons with low scores. Item
difficulty is an estimate of the level of underlying trait
at which a person has a 50% probability of endorsing
the item. Items with higher item-difficulty scores
require a respondent to have more of the underlying
trait to endorse or correctly respond to the item than
items with lower item difficulty scores. Consider a
measure of reading comprehension. An item requiring
a 12th grade reading level is more difficult than an item
requiring a 6th grade reading level. The same concept
applies to a measure of motivation; an item requiring a
high amount of motivation is more difficult than an item requiring a low amount of motivation. This idea
translates to the concept of person-ability or latent trait.
A person who reads at a 12th grade level has more
ability than a person who reads at a 6th grade level; a
person who is more motivated has more of the latent
trait than a person who is less motivated.
Analysis of item fit. Fit statistics in IRT analysis
include infit and outfit mean square (MNSQ) statistics.
Infit and outfit are statistical representations of how
well the data match the prescriptions of the IRT model
(Bond & Fox, 2001). Outfit statistics are based on
conventional sum of squared standardized residuals,
and infit statistics are based on information-weighted
sum squared standardized residuals (Bond & Fox,
2001). Infit and outfit have expected MNSQ values of
1.00; values greater than or less than 1 indicate the
degree of variation from the expected score. For
example, an item with an infit MNSQ of 1.33 (1.00 + .33) indicates 33% more variation in responses to that
item than predicted by the model. Mean infit and
outfit values represent a degree of overall fit of the data
to the model, but infit and outfit statistics are also
available for assessing fit at the individual item level
(item-fit) and the individual person level (person-fit).
Item-fit refers to how well the IRT model explains the
responses to a particular item (Embretson & Reise,
2000). Person-fit refers to the consistency of an
individuals pattern of responses across items (Embretson & Reise, 2000).
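As a rough illustration of how these fit statistics are assembled, here is a hedged sketch of item infit and outfit mean squares for a dichotomous Rasch model. This is my own rendering of the general formulas, not the ConQuest implementation, and all variable names are illustrative.

```python
import numpy as np

def item_fit(theta, b, X):
    """Infit/outfit MNSQ per item for a dichotomous Rasch model.

    theta: (N,) person estimates; b: (I,) item difficulties; X: (N, I) 0/1 responses.
    """
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # expected probabilities
    w = p * (1.0 - p)                                          # model variance of each response
    z2 = (X - p) ** 2 / w                                      # squared standardized residuals
    outfit = z2.mean(axis=0)                                   # unweighted mean square
    infit = (w * z2).sum(axis=0) / w.sum(axis=0)               # information-weighted mean square
    return infit, outfit
```

Values near 1.00 indicate that responses to an item vary about as much as the model predicts.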
One limitation of IRT is the need for large
samples. No clear standards exist for minimum sample
size, although Embretson and Reise (2000) briefly
noted that a sample of 500 respondents was
recommended, and cautioned that parameter
estimations might become unstable with samples of
less than 350 respondents. Reeve and Fayers (2005)
suggested that useful information about item
characteristics could be obtained with samples of as
few as 250 respondents. One-parameter models may
yield reliable estimates with as few as 50 to 100
respondents (Linacre, 1994). As the complexity of the
IRT model increases and more parameters are
estimated, sample size should increase accordingly.
Smith, Schumacker, and Bush (1998) provided the
following sample size dependent cutoffs for
determining poor fit: misfit is evident when MNSQ
infit or outfit values are larger than 1.3 for samples less
than 500, 1.2 for samples between 500 and 1,000, and
1.1 for samples larger than 1,000 respondents.
According to Adams and Khoo (1996), items with
adequate fit will have weighted MNSQs between .75
and 1.33. Bond and Fox (2001) stated that items
routinely accepted as having adequate fit will have t
values between -2 and +2. According to Wilson (2005),
when working with large sample sizes, the researcher
can expect the t-statistic to show significant values for
several items regardless of fit; therefore, Wilson
suggested that the researcher consider items
problematic only if items are identified as misfitting
based on both the weighted MNSQ and t-statistic.
For rating scale models, category thresholds are
provided in the IRT analysis. A category threshold is
the point at which the probability of endorsing one
category is equal to the probability of endorsing a
corresponding category one step away. Although
thresholds are ideally equidistant, that characteristic is
not necessarily the reality. Guidelines indicate that
adjacent thresholds should be separated by at least 1.4 logits but no more than
5 logits (Linacre, 1999). Logits are the scale units for
the log odds transformation. When thresholds have
small logits, response categories may be too similar
and nondiscriminant. Conversely, when the threshold
logit is large, response categories may be too dissimilar
and far apart, indicating the need for more response
options as intermediate points. Infit and outfit statistics
are also available for step calibrations. Outfit MNSQ
values greater than 2.0 indicate that a particular
response category is introducing noise into the measurement process, and should be evaluated as a
candidate for collapsing with an adjacent category
(Bond & Fox, 2001; Linacre, 1999).
In conjunction with the standard output of IRT
analysis, MIRT analysis provides information about
dimensionality, the underlying latent factor structure.
Acer Conquest 2.0 (Wu et al., 2008) software provides
estimations of population parameters for the
multidimensional model, which include factor means,
factor variances, and factor covariances/correlations.
Acer Conquest 2.0 also produces maps of latent
variable distributions and response model parameter
estimates.
Analysis of nested models. Two models are
considered as being nested if one is a subset of the
second. Overall model fit of an IRT model is based on
the deviance statistic, which follows a chi-square
distribution. The deviance statistic changes as
parameters are added or deleted from the model, and
changes in fit between nested models can be
statistically tested. The chi-square difference statistic (χ²D) can be used to test the statistical significance of the change in model fit (Kline, 2005). The χ²D is calculated as the difference between the model chi-square (χ²M) values of two nested models using the same data; the df for the χ²D statistic is the difference in dfs for the two nested models. The χ²D statistic tests the null hypothesis of identical fit of the two models to the
population. Failure to reject the null hypothesis means
that the two models fit the population equally. When
two nested models fit the population equally well, the
more parsimonious model is generally considered the
more favorable.
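A minimal sketch of this nested-model comparison, assuming SciPy is available; the numbers in the example are hypothetical, not values from the study.

```python
from scipy.stats import chi2

def chi_square_difference(chi_restricted, df_restricted, chi_full, df_full):
    """Chi-square difference test for two nested models fit to the same data."""
    diff = chi_restricted - chi_full      # the more constrained model has the larger chi-square
    df_diff = df_restricted - df_full     # and the larger degrees of freedom
    return diff, df_diff, chi2.sf(diff, df_diff)

# Hypothetical example: a nonsignificant p-value (> .05) would mean the two
# models fit equally well, so the more parsimonious model would be preferred.
print(chi_square_difference(118.1, 50, 112.1, 47))
```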
Scale Evaluation Using CFA
Factor analysis is a more traditional method for
analyzing the underlying dimensionality of a set of
observed variables. Derived from CTT, factor analysis
includes a variety of statistical procedures for exploring
the relationships among a set of observed variables
with the intent of identifying a smaller number of
factors, the unobserved latent variables, thought to be
responsible for these relationships among the observed
variables (Tabachnik & Fidell, 2007). CFA is used
primarily as a means of testing hypotheses about the
latent structure underlying a set of observed data.
A common and preferred method for conducting
CFA is structural equation modeling (SEM). The term
SEM refers to a family of statistical procedures for
assessing the degree of fit between observed data and
an a priori hypothetical model in which the researcher
specifies the relevant variables, which variables affect
other variables, and the direction of those effects. The
two main goals of SEM analysis are to explore patterns
of correlations among a set of variables, both observed
and unobserved, and to explain as much variance as
possible using the model specified by the researcher
(Klem, 2000; Kline, 2005).
Analysis of SEM models. Analysis of SEM
models is based on the fit of the observed variance-
covariance matrix to the proposed model. Although
maximum likelihood (ML) estimation is the common
method for deriving parameter estimates, it is not the
only estimation method available. ML estimation
produces parameter estimates that minimize the
discrepancies between the observed covariances in the
data and those predicted by the specified SEM model
(Kline, 2005). Parameters are characteristics of the
population of interest; without making observations of
the entire population, parameters cannot be known and
must be estimated from sample statistics. ML
estimation assumes interval level data, and alternative
methods, such as weighted least squares estimation,
should be used with dichotomous and ordinal level
data. Guo, Perron, and Gillespie (2009) noted in their
review of social work SEM publications that ML
estimation was sometimes used and reported
inappropriately.
Analysis of model fit. Kline (2005) defined
model fit as how well the model as a whole explained
the data. When a model is overidentified, it is expected
that model fit will not be perfect; it is therefore
necessary to determine the actual degree of model fit,
and whether the model fit is statistically acceptable.
Ideally, indicators should load only on the specific
latent variable identified in the measurement model.
This type of model can be tested by constraining the
direct effects between indicators and other factors to
zero. According to Kline (2005), indicators "are expected to be correlated with all factors in CFA models, but they should have higher estimated correlations with the factors they are believed to measure" (emphasis in original, p. 177). A measurement model with indicators loading only on a
single factor is desirable but elusive in practice with
real data. Statistical comparison of models with cross-
loadings to models without cross-loadings allows the
researcher to make stronger assertions about the
underlying latent variable structure of a measure. As
Guo et al. (2009) noted, modified models allowing
cross-loadings between items and factors have been
frequently published in social work literature without
fully explaining how they related to models without
cross-loadings.
Analysis of nested models. As noted in the
discussion of MIRT analysis, two models are
considered to be nested if one is a subset of the second.
Overall model fit based on the chi-square distribution
will change as paths are added to or deleted from a
model. Kline's (2005) chi-square difference statistic (χ²D) can be used to test the statistical significance of the
change in model fit.
MIRT versus CFA
MIRT and CFA analyses can be used to assess the
dimensionality or underlying latent variable structure
of a measurement. The choice of statistical procedures
raises questions about differences between analyses,
whether the results of the two analyses are consistent,
and what information can be obtained from one
analysis but not the other. IRT addresses two problems
inherent in CTT. First, IRT overcomes the problem of
item-person confounding found in CTT. IRT analysis
yields estimates of item difficulties and person-abilities
that are independent of each other, whereas in CTT
item difficulty is assessed as a function of the abilities
of the sample, and the abilities of respondents are
assessed as a function of item difficulty (Bond & Fox,
2001), a limitation that extends to CFA.
Second, the use of ordinal level data (i.e., rating
scales), which are routinely treated in statistical
analyses as continuous, interval-level data, may violate
the scale and distributional assumptions of CFA (Wirth
& Edwards, 2007). Violating these assumptions may
result in model parameters that are "biased and
impossible to interpret" (Wirth & Edwards, 2007, p. 58; DiStefano, 2002). The logarithmic transformation
of ordinal level raw data into interval level data in IRT
analysis overcomes this problem.
IRT and CTT also differ in the treatment of the
standard error of measurement. The standard error of
measurement is an indication of variability in scores
due to error. Under CTT, the standard error of
measurement is averaged across persons in the sample
or population and is specific to that sample or
population. Under IRT, the standard error of
measurement is considered to vary across scores in the
same population and to be population-general
(Embretson & Reise, 2000). The IRT approach to the
standard error of measurement offers the following
benefits: (a) the precision of measurement can be
evaluated at any level of the latent trait instead of
averaged over trait levels as in CTT, and (b) the
contribution of each item to the overall precision of the
measure can be assessed and used in item selection
(Hambleton & Swaminathan, 1985).
MIRT and CFA differ in the estimation of item
fit. Whereas item fit is assessed through error variances,
communalities, and factor loadings in CFA, item fit is
assessed through unweighted (outfit) and weighted
(infit) mean square errors in IRT analyses (Bond &
Fox, 2001). Further, the treatment of the relationship
between indicator and latent variable, which is
constrained to a linear relationship in CFA, can be
nonlinear in IRT (Greguras, 2005). CFA uses one
number, the factor loading, to represent the relationship
between the indicator and the latent variable across all
levels of the latent variable; in IRT, the relationship
between indicator and latent variable is given across
the range of possible values for the latent variable
(Greguras, 2005). Potential implications of these
differences include inconsistencies in parameter
estimates, indicator and factor structure, and model fit
across MIRT and CFA analyses.
Both IRT and CFA provide statistical indicators
of psychometric performance not available in the other
analysis. Using the item information curve (IIC), IRT
analysis allows the researcher to establish both item
information functions (IIF) and test information
functions (TIF). The IIF estimates the precision and
reliability of individual items independent of other
items on the measure; the TIF provides the same
information for the total test or measure, which is a
useful tool in comparing and equating multiple tests
(Hambleton et al., 1991; Embretson & Reise, 2000).
IRT for polytomous response formats also provides
estimated category thresholds for the probability of
endorsing a given response category as a function of
the level of underlying trait. These indices of item and
test performance and category thresholds are not
available in CFA in which item and test performance
are conditional on the other items on the measure.
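The sketch below (my illustration, assuming NumPy and a two-parameter model; all parameter values are made up) shows how item information functions sum to a test information function, and how the conditional standard error of measurement follows from it.

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information contributed by one 2PL item at each theta value."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 61)
a = np.array([1.0, 1.4, 0.8])          # illustrative discriminations
b = np.array([-0.5, 0.2, 1.1])         # illustrative difficulties
iif = np.array([item_information(theta, ai, bi) for ai, bi in zip(a, b)])
tif = iif.sum(axis=0)                  # test information: sum of item information
sem = 1.0 / np.sqrt(tif)               # conditional standard error of measurement
```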
Conversely, CFA offers a wide range of indices for
evaluating model fit, whereas IRT is limited to the use
of the χ² deviance statistic. Reise, Widaman, and Pugh (1993) explicitly identified the lack of modification
indices and additional model fit indicators for IRT
analyses as a limitation.
Participation in a Social Work Community of
Practice Scale
Although the content of the Participation in a
Social Work Community of Practice Scale (PSWCoP)
is less important in the current discussion than the
methodologies used to evaluate the scale, a brief
overview will provide context for interpreting the
results of the analyses. The PSWCoP scale is an
assessment of students' motivations for entering a master of social work (MSW) program as
conceptualized in Wenger, McDermott, and Snyder's (2002) three-dimensional model of motivation for
participation in a community of practice. Wenger et al.
(2002) asserted that all communities of practice
comprise three fundamental elements (p. 27): a
domain of knowledge defining a set of issues; a
community of people who care about the domain; and,
the shared practice developed to be effective in that
domain. Some individuals are motivated to participate
because they care about the domain and are interested
in its development. Some individuals are motivated to
participate because they value being part of a
community as well as the interaction and sharing with
others that is part of having a community. Finally,
some individuals are motivated to participate by a
desire to learn about the practice as a means of
improving their own techniques and approaches. The
PSWCoP was developed as a multidimensional
measure of the latent constructs domain motivation,
community motivation, and practice motivation (Table
1). Data were collected from a convenience sample of
students enrolled in MSW programs using a cross-
sectional survey design and compared to the three-
factor model developed from Wenger et al.
Method
Participants
A convenience sample of 528 current MSW
students was drawn from 11 social work programs
accredited by the Council on Social Work Education
(CSWE). Participants were enrolled during two
separate recruitment periods. The first round of
recruitment yielded a nonrandom sample of 268
students drawn from nine academic institutions. The
second round of recruitment yielded a nonrandom
sample of 260 students drawn from eight institutions.
Six institutions participated in both rounds of data
collection, three institutions participated in only the
first round of data collection, and two institutions
participated in only the second round of data collection.
The response rate for the study could not be calculated
because there was no way to determine the total
number of students who received information about the
study or had access to the online survey. Twenty-two
cases (4.1%) were removed because of missing data,
yielding a final sample of 506 students; listwise
deletion was used given the extremely small amount of
missing data.
Data were collected on multiple student
characteristics including age, gender, race/ethnicity,
sexual orientation, religious affiliation, participation in
religious activities, family socioeconomic status (SES),
and enrollment status. The mean age of participants
Table 1
Original Items on the Participation in a Social Work Community of Practice Scale

Item | Factor
My main interest for entering the MSW program was to be a part of a community of social workers. | Community (C_1)
I wanted to attend a MSW program so that I could be around people with similar values to me. | Community (C_2)
I chose a MSW program because I thought social work values were more similar to my values than those of other professions. | Community (C_3)
There is more diversity of values among students than I expected. | Community (C_4)*
Before entering the program, I was worried about whether or not I would fit in with my peers. | Community (C_5)*
Learning about the social work profession is less important to me than being part of a community of social workers. | Community (C_6)*
Without a MSW degree, I am not qualified to be a social worker. | Practice (P_1)
A MSW degree is necessary to be a good social worker. | Practice (P_2)
Learning new social work skills was not a motivating factor in my decision to enter the MSW program. | Practice (P_3)
My main reason for entering the MSW program was to acquire knowledge and/or skills. | Practice (P_4)
A MSW degree will give me more professional opportunities than other professional degrees. | Practice (P_5)*
Being around students with similar goals is less important to me than developing my skills as a social worker. | Practice (P_6)*
Learning how to be a social worker is more important to me than learning about the social work profession. | Practice (P_7)*
I find social work appealing because it is different than the type of work I have done in the past. | Domain (D_1)
I decided to enroll in a MSW program to see if social work is a good fit for me. | Domain (D_2)
I wanted to attend a MSW program so that I could learn about the social work profession. | Domain (D_3)
Entering the MSW program allowed me to explore a new area of professional interest. | Domain (D_4)
My main reason for entering the MSW program was to decide if social work is the right profession for me. | Domain (D_5)

*Items deleted from the final version of the PSWCoP
was 30.2 years (SD = 8.7 years). The majority of
students were female (92%). The majority of the
participants were Caucasian (82.6%), with 7.3% of
students self-identifying as African American or Black;
4.1% as Hispanic; 1.8% as Asian/Pacific Islander; and
4.1% as a nonspecified race/ethnicity. Students
identified their enrollment status as either part-time
(19.5%), first year (32.7%), advanced standing (27%),
or second year (20.8%).
Measures
Analyses were conducted on an original measure
of students' motivations for entering a social work community of practice, defined as pursuing a MSW
degree. The PSWCoP was developed and evaluated
using steps outlined by Benson and Clark (1982)
and DeVellis (2003). The pilot measure contained 18 items designed to measure three constructs (domain,
community, and practice). Items were measured on a 6-
point rating scale from strongly disagree to strongly
agree. Items from the pilot measure organized by
subscale are listed in Table 1. In addition to items on
the PSWCoP, students were asked to provide
demographic information.
Procedures
Participants completed the PSWCoP survey as part
of a larger study exploring the relationship between
students' motivations to pursue the MSW degree, their attitudes about diversity and historically marginalized
groups, and their endorsement of professional social
work values as identified in the National Association of
Social Workers (2009) Code of Ethics. This research
was approved by the University of Denver Institutional
Review Board prior to recruitment and data collection.
Recruitment consisted of a two-pronged approach: (a)
an e-mail providing an overview of the study and a link
to the online survey was sent to students currently
enrolled in the MSW program; and (b) an
announcement providing an overview of the study and
a link to the online survey posted to student-oriented
informational Web sites. Interested participants were
able to access the anonymous, online survey through
www.surveymonkey.com, which is a frequently used
online survey provider. Participants were presented
with a project information sheet and were required to
indicate their consent to participate by clicking on the
appropriate response before being allowed to access the
actual survey.
Results
Reliability of scores from the PSWCoP was
assessed using both CTT and IRT methods. SPSS
(v.16.0.0, 2007) was used to calculate internal
consistency reliability (Cronbach's α; inter-item correlations). Acer Conquest 2.0 (Wu et al., 2008) was
used to assess item reliability. The dimensionality and
factor structure of the PSWCoP were evaluated using
both a MIRT and a CFA approach. Acer Conquest 2.0
(Wu et al., 2008) was used to conduct the MIRT
analysis and Lisrel 8.8 (Jöreskog & Sörbom, 2007) was
used to conduct the CFA analysis. Acer Conquest 2.0
was used to evaluate the PSWCoP with respect to
estimates of levels of latent trait and item difficulty
using a one-parameter logistic model. Assessment of
the measure was based on model fit, person-fit, item
fit, person reliability, item reliability, step calibration,
and population parameters for the multidimensional
model.
Item Selection
Items were identified for possible deletion from
each subscale using Cronbach's alpha, IRT MNSQ infit/outfit results, and theory. Poorly performing
items identified through statistical analyses were
further assessed using conceptual and theoretical
frameworks. A combination of results led to the
removal of three items from the community subscale,
three items from the practice subscale, but no items
from the domain subscale (Table 1). Items C_6, P_6,
and P_7 addressed relationships between types of
motivations by asking respondents to rate whether one
type of motivation was more important than another
type. Quantitative differences between types of
motivations were not addressed in community of
practice theory, and therefore these items were deemed
not applicable in the measurement of each type of
motivation. Items C_4 and C_5 were deleted from the
community subscale because these items specifically
addressed relationships between respondents and peers.
Community-based motivation arises out of perceived
value congruence between the individual and the
practice (i.e., professional social work), and not
between the individual and other members of the
community of practice. All analyses indicated
problems with the practice subscale, and ultimately
exploratory factor analysis (EFA) was used with this subscale only. The results of
the EFA suggested items P_1 and P_2 formed one
factor, and items P_3 and P_4 constituted a second
factor. Item P_5 did not load on either factor and was
deleted.
The results of the item selection process yielded
two competing models. The first model consisted of
three factors in which all items developed for the
practice subscale were kept together; this model most
closely reflected the original hypothetical model
developed based on community of practice theory. The
second model had four factors with the items from the
hypothesized practice subscale split into the two factors
suggested by the EFA. Internal consistency for each of
the subscales on the final version of the PSWCoP was
assessed using Cronbach's alpha. Cronbach's alpha was 0.64 for scores from the domain subscale, 0.68 for
scores from the community subscale, and 0.47 for
scores from the practice subscale (three-factor model).
Splitting the practice subscale into two factors yielded
a Cronbach's alpha of 0.58 for scores from the skills subscale and 0.68 for scores from the competency
subscale. Although ultimately indicative of a poor
measure, low internal consistency did not prohibit the
application and comparison of factor analysis using
CFA and MIRT.
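A compact sketch of the internal consistency computation reported here (the standard Cronbach's alpha formula, assuming NumPy; the array name is mine, with respondents in rows and one subscale's items in columns):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (N respondents x k items) array of ratings."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)
```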
Factor Structure
CFA. CFA analyses of the PSWCoP were
conducted using Lisrel 8.8 (Jöreskog & Sörbom, 2007).
The data collected using the PSWCoP were considered
ordinal based on the 6-point rating scale. When data
are considered ordinal, Jöreskog and Sörbom (2007)
advocated the use of PRELIS to calculate asymptotic
covariances and polychoric correlations of all items
modeled, and LISREL or SIMPLIS with weighted least
squares estimation to test the structure of the data.
Failure to use these guidelines may result in
underestimated parameters, biased standard errors, and
an inflated chi-square (χ²) model fit statistic (Flora & Curran, 2004). The chi-square difference statistic (χ²D) was used to test the statistical significance of the
change in model fit between nested models (Kline,
2005). The χ²D was calculated as the difference between the model chi-square (χ²M) values of nested models using the same data; the df for the χ²D statistic is the difference in dfs for nested models. The χ²D statistic tested the null hypothesis of identical fit of two
models to the population. In all, three nested models
were evaluated and compared sequentially: a four-
factor model with cross-loadings served as the baseline
model, followed by a four-factor model without cross-
loadings, and a three-factor model without cross-
loadings. The four-factor model with cross-loadings
was chosen as the baseline model because it was
presumed to demonstrate the best fit, having the fewest
degrees of freedom. The primary models of interest
were then compared against this baseline to estimate
the change in model fit.
Sun (2005) recommended considering fit indices
in four categories: sample-based absolute fit indices,
sample-based relative fit indices, population-based
absolute indices, and population-based relative fit
indices. Sample-based fit indices are indicators of
observed discrepancies between the reproduced
covariance matrix and the sample covariance matrix.
Population-based fit indices are estimations of
difference between the reproduced covariance matrix
and the unknown population covariance matrix. At a
minimum, Kline (2005) recommended interpreting and
reporting four indices: the model chi-square (sample-
based), the Steiger-Land root mean square error of
approximation (RMSEA; population-based), the
Bentler comparative fit index (CFI; population-based),
and the standardized root mean square residual
(SRMR; sample-based). In addition to these fit indices,
this study examined the Akaike information criteria
(AIC; sample-based) and the goodness-of-fit index
(GFI; sample-based). Jackson, Gillaspy, and
Purc-Stephenson's (2009) review of CFA journal
articles published over the past decade identified these
six fit indices as the most commonly reported.
The range of values indicating good fit of observed
data to the measurement model varies depending on the
specific fit index. The model chi-square statistic tests
the null hypothesis that the model has perfect fit in the
population. Degrees of freedom for the chi-square
statistic equal the number of observations minus the
number of parameters to be estimated. Given its
sensitivity to sample size, the chi-square test is often
statistically significant. Kline (2005) suggested using a
normed chi-square statistic obtained by dividing chi-
square by df; ideally, these values should be less than
three. The SRMR is a measure of the differences
between observed and predicted correlations; in a
model with good fit, these residuals will be close to
zero. Hu and Bentler (1999) suggested that an SRMR value at or below .08 indicates good fit.
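As a quick illustration of two of these indices, the reported three-factor values can be roughly reproduced with the sketch below. This is my own calculation; the RMSEA line uses the common sqrt((χ² − df)/(df(N − 1))) approximation, which is an assumption and not necessarily the exact LISREL computation.

```python
def normed_chi_square(chi2_stat, df):
    """Kline's normed chi-square: model chi-square divided by its df."""
    return chi2_stat / df

def rmsea(chi2_stat, df, n):
    """Point estimate of RMSEA from the model chi-square and sample size."""
    return (max(chi2_stat - df, 0.0) / (df * (n - 1))) ** 0.5

# Three-factor model values reported below (chi-square = 359.90, df = 51)
# with the study's N = 506 respondents.
print(normed_chi_square(359.90, 51))   # about 7.06
print(rmsea(359.90, 51, 506))          # about 0.11
```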
Figure 1. Standardized solution for four-factor PSWCoP model
Three Factor Model without Cross-Loadings. The
three-factor model corresponded to the original model
of the PSWCoP (Figure 2). Three latent variables were
included in this model: domain, community, and
practice. Items were constrained to load on the factor
for which they were designed. The four items
originally developed for the practice subscale were
constrained to load on a single latent variable, which
represented a perfect correlation between the
previously used latent variables competency and skills.
Based on the six fit indices previously described, the
overall fit of the model was poor: χ² = 359.90, df = 51, p < .001; RMSEA = 0.11 [90% CI: .10, .12]; CFI = 0.80;
SRMR = 0.12; AIC = 413.90; GFI = 0.85. When
compared with the four-factor model without cross-loadings, this model demonstrated a significant
increase in model misfit [(χ²1 − χ²2)(df1 − df2) = 174.38(3),
p < .001]. All of the fit statistics indicated that the data
did not fit the model.
Figure 2. Standardized solution for three-factor PSWCoP model
Summary of CFA of the PSWCoP. A summary of
fit indices across nested models is provided in Table 2.
The model with the best overall fit was the four-factor
model in which items were allowed to load across all
factors. The fit of this model was good, but the model
lacked conceptual support and was not interpretable
with respect to the underlying latent structure of the
PSWCoP. Although the four-factor model with
constrained loadings had a significant increase in
model misfit over the four-factor model with cross-
loadings, the four-factor model with constrained
loadings demonstrated acceptable fit. The results of the
CFA on the four-factor model without cross-loadings
supported the hypothesis of a multidimensional
measure because correlations between latent variables
were computed and there were no significant
correlations between any pair of latent variables
(α = .01).
The four-factor model with constrained loadings
was compared with a three-factor model based on the
originally proposed measurement model for the
PSWCoP. The conceptual difference between the two
models was the placement of the items developed for
the practice subscale. Constraining these four items to
load on a single latent variable resulted in a large
increase in model misfit. All of the reported fit
statistics indicated a model with poor fit.
Table 2
Comparison of Fit Indices across Nested Models

Index | Model 1: 4 Factor Model | Model 2: Unidimensional 4 Factor Model* | Model 3: Unidimensional 3 Factor Model**
χ²(df) | 64.48(35) | 185.52(48) | 359.90(51)
Normed χ² (χ²/df) | 1.84 | 3.86 | 7.05
p-value (model) | .002
latent trait was the amount of motivation a given
student possessed. In general, the range of the latent
trait of the sample and item difficulties were the same,
and the distribution of persons and items about the
mean were relatively symmetrical, indicating a good
match between the latent trait of students and the
difficulty of endorsing items. Exact numerical values
for item difficulty are provided in Table 3 and ranged
from -1.05 to +0.94. Item difficulty was scaled
according to the theta metric, and indicated the level of
the latent trait at which the probability of a given
response to the item was .50. Theta () is the level of the latent trait being measured and scaled with a mean
of zero and a standard deviation of one. Negative
values indicated items that were easier to endorse, and
positive values indicated items that were harder to
endorse.
Item fit is an indication of how well an item
performs according to the underlying IRT model being
tested, and it is based on the comparison of observed
responses to expected responses for each item. Adams
and Khoo (1996) suggested that items with good fit
have infit scores between 0.75 and 1.33; Bond and Fox
(2001) suggested that items with good fit have t values
between -2 and +2. Table 3 provides the fit statistics
for the items of the PSWCoP survey; according to this
output, only item P_3_R exceeded Bond and Fox's guideline, and no items exceeded Adams and Khoo's guideline.
Table 3
Rasch Analysis of Full Survey Item Difficulty and Fit

Item | Label | Est. | S.E. | Infit MNSQ | Infit ZSTD | Outfit MNSQ | Outfit ZSTD
1 | C_1 | 0.30 | .04 | 1.05 | 0.9 | 1.06 | 1.2
2 | C_2 | 0.04 | .04 | 0.93 | -1.1 | 0.94 | -1.0
3 | C_3 | 0.05 | .03 | 1.02 | 0.5 | 1.06 | 1.1
4 | P_1 | -0.56 | .04 | 1.01 | 0.1 | 1.06 | 0.8
5 | P_2 | 0.30 | .04 | 0.98 | -0.4 | 1.00 | 0.1
6 | D_1 | 0.68 | .04 | 0.94 | -1.1 | 0.93 | -1.1
7 | D_2 | -0.11 | .04 | 0.91 | -1.4 | 0.89 | -1.6
8 | D_3 | 0.24 | .04 | 1.01 | 0.1 | 1.05 | 0.9
9 | D_4 | -0.33 | .04 | 1.07 | 1.1 | 1.08 | 1.1
10 | D_5 | 0.94 | .04 | 0.97 | -0.4 | 0.95 | -0.7
11 | P_3 | -0.51 | .04 | 1.17 | 2.1 | 1.35 | 4.0
12 | P_4 | -1.05 | .06 | 0.93 | -0.7 | 0.92 | -0.9

Note. Est. = item difficulty estimate; MNSQ = mean square; ZSTD = standardized fit statistic.
IRT analysis produced an item reliability index
indicating the extent to which item estimates would be
consistent across different samples of respondents with
similar abilities. High item reliability indicates that the
ordering of items by difficulty will be somewhat
consistent across samples. The reliability index of
items for the PSWCoP pilot survey was 0.99, and
indicated consistency in ordering of items by difficulty.
IRT analysis also produced a person-reliability index
that indicated the extent of consistency in respondent
ordering based on level of latent trait if given an
equivalent set of items (Bond & Fox, 2001). The
reliability index of persons for the PSWCoP was 0.60,
and indicated low consistency in ordering of persons
by level of latent trait, which was possibly due
to a constricted range of the latent trait in the sample or
a constricted range of item difficulty.
MIRT factor structure. One of the core
assumptions of IRT is unidimensionality; in other
words, that person-ability can be attributed to a single,
latent construct, and that each item contributes to the
measure of that construct (Bond & Fox, 2001).
However, whether intended or not, item responses may
be attributable to more than one latent construct. MIRT
analyses allow the researcher to assess the
dimensionality of the measure. Multidimensional
models can be classified as either within items or between items (Adams, Wilson, & Wang, 1997). Within-items multidimensional models have items that
can function as indicators of more than one dimension,
and between-items multidimensional models have
subsets of items that are mutually exclusive and
measure only one dimension.
Competing multidimensional models can be
evaluated based on changes in model deviance and
number of parameters estimated. A chi-square statistic
is calculated as the difference in deviance (G²) between
two nested models with df equal to the difference in
number of parameters for the nested models. A
statistically significant result indicates a difference in
model fit. When a difference in fit is found, the model
with the smallest deviance is selected; when a
difference in model fit is not found, the more
parsimonious model is selected.
The baseline MIRT model corresponded to the
four-factor model with no cross-loadings estimated in
the CFA (Figure 1). This baseline model was a
between-items multidimensional model with items
placed in mutually exclusive subsets. The four
dimensions in the model were community,
competency, domain, and skills. The baseline model fit
statistic was G² = 17558.64 with 26 parameters. A three-
dimensional, between-items, multidimensional model,
corresponding to the theoretical model of the PSWCoP
(Figure 2) was tested against the baseline model. The
three-dimensional model fit statistic was G² = 17728.83
with 22 parameters. When compared with the four-
dimensional model, the change in model fit was
statistically significant and indicated that the fit of the
three-dimensional model was worse than the fit of the
four-dimensional model (χ²(4) = 170.19, p < .001).
Table 4
Comparison of Model Fit Across Nested Models

Statistic | Four Factor (Between) | Three Factor* (Between)
Deviance (G²) | 17558.64 | 17728.83
df | 26 | 22
G²1 − G²2 | | -170.19
df1 − df2 | | 4
(G²1 − G²2)/(df1 − df2) | | 42.55
p-value | | < .001

* Compared to the Four Factor, Between-Items Model
Based on the change in model fit between nested
models, the four dimensional, between-items model
had the better fit. This model resulted in a more
accurate reproduction of the probability of endorsing a
specific level or step of an item for a person with a
particular level of the latent trait (Reckase, 1997).
Thus, the four-dimensional model yielded the greatest
reduction in discrepancy between observed and
expected responses.
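The deviance comparison in Table 4 can be checked in a few lines; this is a sketch assuming SciPy, using only the values reported above.

```python
from scipy.stats import chi2

g2_four, n_params_four = 17558.64, 26     # four-dimensional model
g2_three, n_params_three = 17728.83, 22   # three-dimensional model

diff = g2_three - g2_four                 # 170.19: larger (worse) deviance for the 3D model
df_diff = n_params_four - n_params_three  # 4 additional parameters in the 4D model
print(diff, df_diff, chi2.sf(diff, df_diff))  # p < .001, favoring the four-dimensional model
```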
Item difficulty. MIRT analyses yielded an item-
person map by dimension. The output of the MIRT
item-person map (Figure 3) provided a visual estimate
of the latent trait in the sample, item difficulty, and
each dimension. Items are ranked in the right-hand
column by difficulty, with items at the top being more
difficult than items at the bottom. Although the range
of item difficulty was narrow, items were well
dispersed around the mean. Each dimension or factor
has its own column with estimates for respondents'
abilities. Two inferences were made based on the
MIRT item-person map. First, although the range of
item difficulty was narrow, items appeared to be
dispersed in terms of difficulty with a range of -0.81 to
+0.84. Furthermore, regarding Dimensions 1, 2, and 3,
the item difficulties appeared to be well matched to
levels of the latent trait, though over a limited range of
the construct as scaled via the theta metric. Second,
based on the means of the dimensions, Dimension 2
(competency, mean = 0.069) and Dimension 3 (domain,
mean = -0.074) did a better job of representing all levels of
these types of motivation than the other two
dimensions. The small positive mean of Dimension 1
(community, 0.335) indicated that students sampled
for this study found it somewhat easier to endorse those
items, whereas the large positive mean of Dimension 4
(skills, 1.42) indicated that students sampled for this
study found it very easy to endorse those items.
Figure 3. MIRT Latent Variable Item-Person Map
Item fit. Table 5 summarizes the items' characteristics. In addition to the estimation of item
difficulties, infit and outfit statistics are reported. Using
Adams and Khoo's (1996) guideline, only item C_2
showed poor fit (MNSQ = 0.68). In contrast, Bond and
Fox's (2001) guideline identified several items as having poor fit (based on a 95% CI for MNSQ): C_1,
C_2, D_1, D_3, and D_4.
Table 5
Item Parameter Estimates for 4 Dimensional Model

Item | Label | Est. | S.E. | Infit MNSQ | Infit ZSTD | Outfit MNSQ | Outfit ZSTD
1 | C_1 | 0.40 | 0.03 | 0.77 | -3.8 | 0.77 | -4.5
2 | C_2 | 0.21 | 0.03 | 0.68 | -5.6 | 0.67 | -6.5
3 | C_3 | -0.61* | 0.04 | 1.02 | 0.4 | 1.04 | 0.6
4 | P_1 | -0.14 | 0.03 | 1.01 | 0.2 | 1.00 | 0.0
5 | P_2 | 0.14* | 0.03 | 0.96 | -0.5 | 0.93 | -1.1
6 | D_1 | 0.11 | 0.03 | 1.21 | 3.0 | 1.18 | 3.0
7 | D_2 | 0.51 | 0.03 | 1.02 | 0.4 | 1.04 | 0.7
8 | D_3 | -0.65 | 0.03 | 1.29 | 4.1 | 1.30 | 4.4
9 | D_4 | -0.81 | 0.03 | 1.17 | 2.5 | 1.22 | 3.1
10 | D_5 | 0.84* | 0.06 | 0.95 | -0.7 | 0.98 | -0.2
11 | P_3_R | 0.33 | 0.04 | 1.00 | -0.0 | 1.02 | 0.4
12 | P_4 | -0.33* | 0.04 | 0.99 | -0.2 | 1.00 | -0.0

*Indicates that a parameter estimate is constrained
Discussion
The rigor and sophistication with which social
workers conduct psychometric assessments can be
strengthened. Guo et al. (2009) found that social
workers underutilize CFA, and more generally SEM,
analyses. Further, even when those approaches are used
appropriately, considerable room remains for
improvement in reporting (Guo et al., 2009). Similarly,
Unick and Stone (2010) found the use of IRT analyses
for psychometric evaluation was noticeably missing
from the social work literature. Developing familiarity
and proficiency with strong psychometric methods will
empower social workers in developing and selecting
appropriate measures for research, policy, and practice.
Integration of CFA and MIRT Results
The primary result from both the CFA and MIRT
analyses was the establishment of the PSWCoP as a
multidimensional measure. Both sets of analyses
identified a four-factor model in which items loaded on
a single factor as having the best model fit when
compared with the three-factor model. In addition, both
analytic strategies identified significant problems with
the PSWCoP. Low subscale internal consistencies
might be due to the small number of items for the
community, skills, and competency subscales, as well
as the inability to capture the complexity of different
types of motivation for participating in a social work
community of practice. CFA identified multiple items
with high (>.7) error variances, and IRT analyses
indicated poor fit for several items. Although the
results of the analyses identified the PSWCoP as
having limited utility, these poor psychometric
properties did not prohibit CFA and MIRT analyses.
The CFA analysis was found to be more
informative at the subscale level, whereas the MIRT
analysis was found to be more informative at the item
level. CFA was more informative regarding subscale
composition and assessing associations among factors.
The CFA analysis led to a final form of the PSWCoP
with four subscales, and beginning evidence supporting
the factorial validity of the measure. As indicated by
the nonsignificant correlations among factors, each
subscale appeared to be tapping separate constructs.
Although MIRT allows the researcher to model factor
structure, this approach does not estimate relationships
between factors.
MIRT analyses were found to be more informative
for assessing individual item performance. Item
difficulty estimates were obtained for the PSWCoP as a
whole and for each subscale. Items on the PSWCoP
appeared to be a good match for the levels of latent
trait of the respondents with regards to the community,
domain, and competency factors, but too easy for the
skills factor. Based on infit and outfit statistics, MIRT
analyses identified additional items exhibiting poor fit
as compared with the CFA. Specifically, two items on
the community subscale had large standardized fit
scores in the IRT analysis but displayed high factor
loadings and low error variances in the CFA. The IRT
analyses also provided estimates of the item
information function and test information function,
making it possible to get specific estimates of standard
errors of measurement instead of relying on an
averaged standard error of measurement obtained from
the CFA.
Strengths and Limitations
Reliance on a convenience sample is a significant
limitation of this study. The extent to which
participants in this study were representative of the
larger population of MSW students was indiscernible.
Although IRT purports to generate sample-independent
item characteristic estimations, the stability of these
estimations is enhanced when the sample is
heterogeneous with regard to the latent trait. It is
possible that students who self-selected to complete the
measure were overly similar.
A study strength is its contribution to the field of
psychometric assessment. Previous studies comparing
IRT and CFA have dealt almost exclusively with
assessing measurement invariance across multiple
samples (e.g., Meade & Lautenschlager, 2004; Raju,
Lafitte, & Byrne, 2002; Reise et al., 1993). The current
study addresses emerging issues in measurement
theory by applying IRT analyses to multidimensional
latent variable measures, and comparing MIRT and
CFA assessments of factor structure in a novel
measure.
Implications for Social Work Research
In addition to the benefits of using IRT/MIRT
analytic procedures outlined in this paper, the ability of
these techniques to assess differential item functioning
(DIF) and differential test functioning (DTF) is a major
advantage over CTT methods. Wilson (1985) described
DIF as an indication of whether an item performs the
same for members of different groups who have the
same level of the latent trait, whereas DTF is invariant
performance of a set of items across different groups
(Badia, Prieto, Roset, Díez-Pérez, & Herdman, 2002).
If DIF/DTF exists, respondents from the subgroups
who share the same level of a latent trait "do not have the same probability of endorsing a test item" (Embretson & Reise, 2000, p. 252). The ability to
assess potential bias in items and tests provides a
powerful method for developing culturally competent
measures (Teresi, 2006). Valid comparisons between
groups require measurement invariance, and IRT
provides an additional tool for examining both items
and tests.
Additional benefits of IRT analyses include the
ability to conduct test equating and develop adaptive
testing. The core question of test equating is the extent
to which scores from two measures presumed to
measure the same construct are comparable. For
example, are the Beck Depression Inventory and the
Center for Epidemiological Studies Depression Scale
(CES-D) equatable? Adaptive testing allows the
researcher to match specific items to different levels of
ability to more finely discern a person's ability; persons estimated to have high ability may receive a
different set of items than a person estimated to have
low ability. With the increasing availability of
statistical software for conducting MIRT analyses, the
potential also exists for developing models with greater
complexity for testing differential factor functioning
(DFF). Akin to testing measurement invariance using
CFA techniques, DFF analyses will provide researchers
with an assessment of potential bias in performance of
factors (e.g., subscales) across groups.
A final consideration is choosing between the
different psychometric strategies outlined in this paper.
Ideally, both methods should be integrated. Doing so
gives the researcher access to unique information
available only from each analytic method, allows the
researcher to compare common elements of both
analyses, and minimizes the impact of each method's limitations. If applying both methods is not possible,
theoretical and practical considerations can inform the
decision. IRT is a stronger choice when data are
dichotomous or ordinal because raw scores are
transformed to an interval scale. If the relationship
between items and factors is nonlinear or unknown,
IRT will yield less biased estimates than CFA. If the
construct to be measured is presumed to be
unidimensional, IRT is a better strategy because of the
additional information provided in the item analysis.
Both MIRT and CFA are informative in assessing
latent factor structures, but only CFA allows the
researcher to estimate relationships between factors.
Both strategies perform better with large sample sizes,
but IRT is affected more negatively by smaller samples
given the larger number of parameters being estimated.
If possible, IRT/MIRT analysis should be reserved for samples of 200 or more respondents. Conversely, IRT
analyses yield stable results with very few items,
whereas CTT reliability varies in part as a function of
the number of items.
References
Adams, R. J., & Khoo, S. T. (1996). ACER Quest
[Computer software]. Melbourne, Australia: ACER.
Adams, R. J., Wilson, M., & Wang, W. (1997). The
multidimensional random coefficients multinomial
logit model. Applied Psychological Measurement,
21, 1-24. doi:10.1177/0146621697211001
Andrich, D. (1988). Rasch models for measurement.
Newbury Park, CA: Sage.
Badia, X., Prieto, L., Roset, M., Díez-Pérez, A., &
Herdman, M. (2002). Development of a short
osteoporosis quality of life questionnaire by
equating items from two existing instruments.
Journal of Clinical Epidemiology, 55, 32-40.
doi:10.1016/S0895-4356(01)00432-2
Benson, J., & Clark, F. (1982). A guide for instrument
development and validation. American Journal of
Occupational Therapy, 36, 789-800.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch
model (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
DeVellis, R. F. (2003). Scale development: Theory and
applications. Thousand Oaks, CA: Sage.
DiStefano, C. (2002). The impact of categorization
with confirmatory factor analysis. Structural
Equation Modeling, 9, 327-346.
doi:10.1207/S15328007SEM0903_2
Embretson, S. E., & Reise, S. P. (2000). Item response
theory for psychologists. Mahwah, NJ: Lawrence
Erlbaum.
Flora, D. B., & Curran, P. J. (2004). An empirical
evaluation of alternative methods of estimation for
confirmatory factor analysis with ordinal data.
Psychological Methods, 9, 466-491.
doi:10.1037/1082-989X.9.4.466
Fries, J., Bruce, B., & Cella, D. (2005). The promise of
PROMIS: Using item response theory to improve
assessment of patient-reported outcomes. Clinical
and Experimental Rheumatology, 23(5, Suppl. 39),
S53-S57.
Greguras, G. J. (2005). Managerial experience and the
measurement equivalence of performance ratings.
Journal of Business and Psychology, 19(3), 383-
397. doi:10.1007/s10869-004-2234-y
Guo, B., Perron, B. E., & Gillespie, D. F. (2009). A
systematic review of structural equation modeling
in social work research. British Journal of Social
Work, 39, 1556-1574. doi:10.1093/bjsw/bcn101
Hambleton, R. K., & Swaminathan, H. (1985). Item
response theory: Principles and applications.
Boston, MA: Kluwer/Nijhoff.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J.
(1991). Fundamentals of item response theory.
Newbury Park, CA: Sage.
Henard, D. H. (2000). Item response theory. In L. G.
Grimm & P. R. Yarnold (Eds.), Reading and
understanding more multivariate statistics (pp. 67-98).
Washington, DC: American Psychological
Association.
Jackson, D. L., Gillaspy, J. A, & Purc-Stephenson, R.
(2009). Reporting practices in confirmatory factor
analysis: An overview and some recommendations.
Psychological Methods, 14(1), 6-23.
doi:10.1037/a0014694
Jöreskog, K. G., & Sörbom, D. (2007). LISREL 8.80 for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.
Klem, L. (2000). Structural equation modeling. In L.
G. Grimm and P. R. Yarnold (Eds.), Reading and
understanding more multivariate statistics, (pp.
227-260). Washington, DC: American
Psychological Association.
Kline, R. B. (2005). Principles and practice of
structural equation modeling (2nd ed.). New York,
NY: Guilford Press.
Linacre, J. M. (1994). Sample size and item calibration
stability. Rasch Measurement Transactions, 7(4),
328.
Linacre, J. M. (1999). Investigating rating scale
category utility. Journal of Outcome Measurement,
3(2), 103-122.
Linacre, J. M. (2006). Winsteps Rasch measurement
3.68.0 [Software]. Chicago, IL: Author.
Lord, F. M. (1980). Applications of item response
theory to practical testing problems. Hillsdale, NJ:
Lawrence Erlbaum.
Meade, A.W., & Lautenschlager, G. J. (2004). A
comparison of item response theory and
confirmatory factor analytic methodologies for
establishing measurement equivalence/invariance.
Organizational Research Methods, 7(4), 361-388.
doi:10.1177/1094428104268027
Müller, U., Sokol, B., & Overton, W. F. (1999).
Developmental sequences in class reasoning and
propositional reasoning. Journal of Experimental
Child Psychology, 74, 69-106.
doi:10.1006/jecp.1999.2510
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002).
Measurement equivalence: A comparison of
methods based on confirmatory factor analysis and
item response theory. Journal of Applied
Psychology, 87, 517-529. doi:10.1037/0021-
9010.87.3.517
Rasch, G. (1980). Probabilistic models for some
intelligence and attainment tests. Chicago, IL:
MESA Press.
Reckase, M. D. (1997). The past and future of
multidimensional item response theory. Applied
Psychological Measurement, 21, 25-27.
doi:10.1177/0146621697211002
Reeve, B. B., & Fayers, P. (2005). Applying item
response theory modeling for evaluating
questionnaire items and scale properties. In P. M.
Fayers & R. D. Hays (Eds.), Assessing quality of
life in clinical trials: Methods and practice (2nd
ed., pp. 55-73). New York, NY: Oxford University
Press.
Reise, S. P., Ainsworth, A. T., & Haviland, M. G.
(2005). Item response theory: Fundamentals,
applications, and promises in psychological
research. Current Directions in Psychological Science, 14(2), 95-101.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993).
Confirmatory factor analysis and item response
theory: Two approaches for exploring measurement
invariance. Psychological Bulletin, 114, 552-566.
doi:10.1037/0033-2909.114.3.552
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometric Monograph No. 17). Richmond, VA: Psychometric Society. Retrieved from http://www.psychometrika.org/journal/online/MN17.pdf
Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998).
Using item mean squares to evaluate fit to the
Rasch model. Journal of Outcome Measurement,
2(1), 66-78. PMid:9661732
SPSS. (2007). SPSS for Windows, Rel. 16.0.0 [Software]. Chicago, IL: SPSS, Inc.
Sun, J. (2005). Assessing goodness of fit in
confirmatory factor analysis. Measurement and
Evaluation in Counseling and Development, 37(4),
240-256.
Swaminathan, H., & Gifford, J. A. (1979). Estimation
of parameters in the three-parameter latent-trait
model. Laboratory of Psychometric and Evaluation
Research (Report No. 90). Amherst: University of
Massachusetts.
Tabachnick, B. G., & Fidell, L. S. (2001). Using
multivariate statistics (4th ed.). Boston, MA: Allyn
and Bacon.
Teresi, J. A. (2006). Overview of quantitative
measurement methods: Equivalence, invariance and
differential item functioning in health applications.
Medical Care, 44, S39-S49.
Unick, G. J., & Stone, S. (2010). State of modern
measurement approaches in social work research
literature. Social Work Research, 34(2), 94-101.
Ware, J. E., Jr., Bjorner, J. B., & Kosinski, M. (2000).
Practical implications of item response theory and
computerized adaptive testing: A brief summary of
ongoing studies of widely used headache impact
scales. Medical Care, 38, 1173-1182. doi:10.1097/00005650-200009002-00011
Wenger, E., McDermott, R., & Snyder, W. M. (2002).
Cultivating communities of practice. Boston, MA:
Harvard Business School Press.
Wirth, R. J., & Edwards, M. C. (2007). Item factor
analysis: Current approaches and future directions.
Psychological Methods, 12(1), 58-79.
doi:10.1037/1082-989X.12.1.58
Wright, B. D., & Masters, G. N. (1982). Rating scale
analysis: Rasch measurement. Chicago, IL: MESA
Press.
Wu, M. L., Adams, R. J., Wilson, M., & Haldane, S. (2008). ACER ConQuest 2.0: Generalized item response modeling software [Computer software]. Hawthorn, Australia: ACER.