
Page 1: Paradoxical Results in Multidimensional Item Response Theoryfaculty.bscb.cornell.edu/~hooker/paradoxpaper.pdfMultidimensional item response theory assumes that questions (hereafter

Paradoxical Results in Multidimensional Item Response Theory

Giles Hooker, Matthew Finkelman and Armin Schwartzman

Abstract

In multidimensional item response theory (MIRT), it is possible for the estimate of a subject’s ability in some dimension to decrease after they have answered a question correctly. This paper investigates how and when this type of paradoxical result can occur. We demonstrate that many response models and statistical estimates can produce paradoxical results and that in the popular class of linearly compensatory models, maximum likelihood estimates are guaranteed to do so. In light of these findings, the appropriateness of multidimensional item response methods for assigning scores in high-stakes testing is called into question.

1 Introduction

Jane and Jill are fast friends who are nonetheless intensely competitive. At the end of high school they each take an entrance exam for a prestigious university. After the exam, they compare notes and discover that they gave the same answers for every question but the last. On checking their materials, it is clear that Jane answered this question correctly, but Jill answered incorrectly. They are therefore very surprised, when the test results are published, to find that Jill passed but Jane did not!

Lawsuits ensue. The university maintains that it followed well-established statistical procedures: the questions on the test were designed to simultaneously examine both language and analytic skills, and a multiple-hurdle rule (Segall 2000) based on maximum likelihood estimates of each student’s abilities was used to ensure that admitted students were proficient in both. The university had re-checked its calculations many times and was satisfied the correct decision had been made. Jane’s lawyers countered that, whatever the statistical correctness of the agency’s procedures, it is unreasonable that an examinee should be penalized for getting more questions correct.

Could such a situation occur? ***** (2007) demonstrated empirically that it could. On examining an operational data set they found that, were a passing threshold for a test set injudiciously, nearly 6% of students could move from fail to pass by changing answers from “correct” to “incorrect”. The possibility of obtaining a lower score by getting more questions correct (or vice versa) was labeled a “paradoxical result” and has clear implications for fairness. This is a property of statistical estimates for multidimensional latent abilities, even when the models for student responses appear reasonable.

How does this occur? ***** (2007) provide an intuitive explanation. Suppose that each question on the test given to Jane and Jill required both language and analytical skills to answer correctly, and Jane and Jill got some of these correct and some incorrect. The final question, however, was very difficult in terms of analysis, but did not require strong language skills. That Jane got this question correct suggests that her analytical skills must be very good indeed. This being the case, the only explanation for her previous incorrect answers is that her language skills must be quite low. By contrast, Jill, in getting the final question incorrect, has demonstrated fewer analytic skills and must have relied on stronger language skills to answer previous questions correctly. The estimate of Jane’s language ability therefore dipped below the required threshold, while Jill’s was pushed upwards; both obtained satisfactory analysis scores.

This provides an intuitive explanation for how paradoxical results occur and makes clear that they may not be unreasonable. The authors nonetheless feel that it is better not to put students in the position of second-guessing when their best answer may be harmful to them. This paper provides a mathematical analysis of paradoxical results that began as an attempt to find conditions under which statistical estimation would always avoid them. Our analysis is not encouraging: paradoxical results can occur across a wide range of two-dimensional item response models using a large class of statistical ability estimates. Moreover, in the popular class of linearly compensatory models, every non-separable test has a response sequence for which maximum likelihood estimates (MLEs) of abilities are paradoxical. Indeed, for any ability dimension, almost every answer sequence for every test is paradoxical in the sense that the estimate of ability can either be made to increase by changing a correctly answered item to incorrect, or to decrease by changing an incorrectly answered item to correct, provided that question is chosen appropriately.

The results in this paper establish the existence of paradoxical results under general conditions. They imply, for example, that within a broad class of multidimensional models, it is not enough to restrict the parametric form of an item response function in order to avoid paradoxical results. The exact conditions and assumptions are given in Sec. 2. They essentially reduce to the following sufficient, but not necessary, conditions:

• All second derivatives of the log of the item response surface should be strictly negative.

• Either MLEs, or priors with no correlation between abilities, are being used to estimate a subject’s ability.

These can be unrealistic in practice; the first excludes response functions with guessing parameters and the second may assume an unlikely correlation. Neither of these conditions is necessary for paradoxical results, however, and we investigate the extent to which they may be violated and still produce such results. We also believe that our analysis sheds light on the general mathematical causes of paradoxical results in a way that makes an analysis in specific cases possible. As an example, our study of linearly compensatory models uses the framework developed in the general case to provide sufficient conditions for an individual item to produce a paradoxical result, including for Bayesian estimates with non-independent priors.

We note that tests that exhibit simple structure (every question depending on only one ability dimension) violate the first of the above conditions, and these can be guaranteed not to produce paradoxical results in two ability dimensions or when independence priors are used.

The paper is structured as follows. We provide a basic framework for multidimensional item response theory in Sec. 2. Some basic theoretical results and constructions are given in Sec. 3. We apply these results to frequentist estimates in Sec. 4 and to Bayesian estimates in Sec. 5. The specific case of linearly compensatory models is explored in Sec. 6. Sec. 7 considers extensions to models not covered in our assumptions: in particular, models that involve guessing parameters, non-independence priors, discrete ability spaces and spaces of three or more dimensions; empirically we show that paradoxical results remain common under these conditions. Sec. 8 examines a limited solution given in terms of enforcing regularity conditions.


2 Definitions and Assumptions

The assumptions made here and in Sec. 3 are used throughout Sections 4, 5 and 6. We discuss the effect of relaxing these assumptions in Sec. 7. Multidimensional item response theory assumes that questions (hereafter referred to as “items”) may measure more than one latent trait, with $\theta \in \mathbb{R}^d$ indicating a given subject’s proficiency in each of $d$ dimensions. In the scenario above, there are dimensions representing language and analytic skills. We assume that all item parameters have been estimated and are defined with respect to fixed dimensions. We further assume that subjects’ abilities are modeled as being unrestricted elements of $\mathbb{R}^d$.

In order to simplify the analysis below, we restrict to abilities indexed in $\mathbb{R}^2$, but state as much as possible in more general terms. Some comments regarding higher dimensional models are given in Sec. 7. We assume that a subject’s answers to items are marked either as “correct” or “incorrect”. The item’s response function, $P$, gives the probability of a student getting a particular test item correct as a function of the student’s abilities and parameters specific to the item. We will frequently refer to an item as meaning both the specifics of a test question and its response function. Common choices include

Compensatory: $P(\theta, a_i) = g(a_{i0} + a_i^T \theta)$ where $g$ is a log-concave cumulative distribution function and each $a_i$ is a $d$-vector with non-negative entries. Most commonly used are

Logistic: $g(t) = e^t/(1 + e^t)$ (e.g., Ackerman 1996; Reckase 1985).

Normal ogive: $g(t)$ is the cumulative distribution function of a Gaussian random variable (e.g., Bock et al. 1988). See Reckase (1997) for a review including a history of multidimensional IRT models.

The label “compensatory” is used since a subject’s lack of ability in dimension 1 may be made up for by proficiency in dimension 2 and vice versa. We describe the models above as being linearly compensatory; they admit particularly easy analysis.

Non-compensatory: $P(\theta, a_i) = \prod_{j=1}^{d} g_j(a_{ij0} + a_{ij1}\theta_j)$ in which each $g_j$ is a log-concave cumulative distribution function and $a_{ij1} > 0$ (Bolt and Lall 2003; Whitely 1980). The term “non-compensatory” is applied here since no degree of proficiency in dimension 2 will overcome a lack of ability on dimension 1.
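As a rough illustration of the two families above, the following sketch evaluates a compensatory and a non-compensatory response function at the same ability vector. The function names and the item parameters are hypothetical choices made for this example only; they do not come from any operational test.

```python
import math


def logistic(t):
    """Logistic link g(t) = e^t / (1 + e^t)."""
    return 1.0 / (1.0 + math.exp(-t))


def normal_ogive(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def compensatory(theta, a0, a, g=logistic):
    """P(theta) = g(a0 + a^T theta): abilities enter through one linear combination."""
    return g(a0 + sum(aj * tj for aj, tj in zip(a, theta)))


def non_compensatory(theta, a0, a1, g=logistic):
    """P(theta) = prod_j g(a0[j] + a1[j] * theta[j]): every dimension must be adequate."""
    p = 1.0
    for a0j, a1j, tj in zip(a0, a1, theta):
        p *= g(a0j + a1j * tj)
    return p


# A subject who is very weak in dimension 1 and very strong in dimension 2
# (illustrative values only).
theta = (-10.0, 10.0)
p_comp = compensatory(theta, 0.0, (1.0, 1.0))
p_noncomp = non_compensatory(theta, (0.0, 0.0), (1.0, 1.0))
```

With these made-up parameters the compensatory item gives the subject an even chance of success because the two abilities cancel in the linear combination, whereas the non-compensatory item gives a probability close to zero.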

Our results are more general than these particular models, however, and we set out very general conditions below.

Definition 2.1. We define a test to be a collection of $N$ items with probability functions $P_i(\theta_1, \ldots, \theta_d)$, $i = 1, \ldots, N$ for ability parameters $\theta = (\theta_1, \ldots, \theta_d)$.

When $\theta$ is used as an argument to a function, it is taken to be shorthand for the vector of arguments $(\theta_1, \ldots, \theta_d)$, or a sub-vector of these, and we use both notations interchangeably.

It makes sense that increasing a subject’s ability should not reduce their probability of getting an item correct, hence the following natural restriction:

Definition 2.2. A response function $P(\theta_1, \ldots, \theta_d)$ is monotone if $P$ is a monotone increasing function of $\theta_j$ for each $\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_d$.

Monotone response functions are considered in Junker and Sijtsma (2000) and Antal (2007); non-monotone response functions can clearly be seen to produce paradoxical results. This is also a concern for multiple-choice models, for example in Thissen and Steinberg (1997). We make the distinction that our results are specifically based on monotone response functions. These cannot produce paradoxical results in unidimensional IRT, but will do so in MIRT.

In addition, we require a somewhat more restrictive assumption about models for a correct response:

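Definition 2.3. A response function $P(\theta)$ has log negative second derivatives if its second derivatives exist and

$$\frac{\partial^2 \log P(\theta)}{\partial\theta_j\,\partial\theta_k} \le 0 \quad\text{and}\quad \frac{\partial^2 \log(1 - P(\theta))}{\partial\theta_j\,\partial\theta_k} \le 0 \tag{1}$$

for each $j$ and $k$ (possibly equal) everywhere in $\mathbb{R}^d$.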

Straightforward calculation gives that each of the models listed above satisfies this condition. Importantly, $P(\theta)$ and $1 - P(\theta)$ are log-concave as functions of each $\theta_j$ with the other ability dimensions fixed.

In what follows, we will be interested in properties of classes of items; specifically, are there simple restrictions that ensure a test cannot produce paradoxical results? Our main concern here is whether there are families of response functions that can be guaranteed to avoid paradoxical results.

For any class of items, our mathematical analysis consists of finding a test comprised of items drawn from that class that will produce a paradoxical result. If such a test exists, then the combination of items within the test becomes important. In order to make this argument formal, we need to define which items can be selected to form a test.

Definition 2.4. A class of two-dimensional items is said to be parametric if the response functions for each item can be written as $P(a_1\theta_1, a_2\theta_2, \gamma)$ for real $a_1 \ge 0$, $a_2 \ge 0$ and $\gamma$ finite-dimensional. A parametric class is complete if it contains items for all $a_1 \ge 0$ and $a_2 \ge 0$.

In particular, a complete class contains an item for which $a_1 = 0$; that is, $\theta_1$ does not affect the probability of getting the item correct. If a complete parametric class could be shown to avoid paradoxical results, we could safely design tests using items from this class. Sec. 6 removes the assumption of completeness for linearly compensatory models.

As noted in the introduction, two-dimensional tests exhibiting simple structure are not subject to paradoxical results, and we restrict to

Definition 2.5. A class of items with log negative second derivatives is non-simple if it contains at least one item for which the inequalities in (1) are strict for at least one pair $(j, k)$ with $j \neq k$.

We now turn to the outcome of administering a test to a subject:

Definition 2.6. For a test, each subject produces a response vector: a series of $N$ binary answers $y = (y_1, \ldots, y_N)$, with $y_i$ coded 1 if item $i$ is answered correctly and 0 otherwise.

Definition 2.7. For any given test, a score is a function mapping from the response vectors to real numbers, $S : \{0, 1\}^N \to \mathbb{R}$.

An important class of scores are those defined with respect to some estimate of a subject’s ability vector, $\hat\theta$. Examples include

• Multiple hurdle scores (Segall 2000): $S(y) = \prod_i I(\hat\theta_i \ge c_i)$.

• Composite scores (van der Linden 1999): $S(y) = \sum_i s_i\hat\theta_i$ or $I\left(\sum_i s_i\hat\theta_i \ge C\right)$.


Either score can represent nuisance dimensions by setting $c_i = -\infty$ or $s_i = 0$.

Our results will be concerned with the estimate $\hat\theta_1$, which can be thought of as a composite score with $(s_1, s_2, \ldots, s_d) = (1, 0, \ldots, 0)$. However, paradoxical results for this score can clearly have implications for multiple hurdle or other composite scores. Specific conditions on when this can happen for composite scores and linearly compensatory models are investigated in Sec. 6.

As discussed, one of the properties of scores that we would like to retain is regularity.

Definition 2.8. We define a partial ordering on responses by stating $y \prec z$ if $y_i \le z_i$ for all $i$ and this inequality is strict for some $i$. A score is regular if

$$y \prec z \Rightarrow S(y) \le S(z).$$

A paradoxical result is defined to be the existence of response vectors $y$ and $z$ such that $y \prec z$ and $S(z) < S(y)$.
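For a short test, regularity in the sense of Definition 2.8 can be checked by brute force. The sketch below is a hypothetical helper, not a procedure from this paper. It relies on the observation that any pair $y \prec z$ can be connected by changing one item at a time from incorrect to correct, so it is enough to compare response vectors that differ in a single item.

```python
from itertools import product


def is_regular(score, n_items):
    """Check Definition 2.8 by enumeration.

    Returns (True, None) if the score never drops when a single answer is
    changed from 0 to 1, and otherwise (False, (y, z)) where y precedes z
    in the partial order but score(z) < score(y).
    """
    for y in product((0, 1), repeat=n_items):
        for i in range(n_items):
            if y[i] == 0:
                z = y[:i] + (1,) + y[i + 1:]  # same answers, item i now correct
                if score(z) < score(y):
                    return False, (y, z)
    return True, None


# The number-correct score is regular.
regular_ok, _ = is_regular(sum, 4)

# A contrived score that penalizes the second item is not.
contrived_ok, witness = is_regular(lambda y: y[0] - 2 * y[1], 2)
```

The enumeration visits $2^N$ response vectors, so it is only practical for small $N$; any score built from an ability estimate can be passed in as `score`.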

As with items, we define a partial ordering of functions and a monotonicity condition for functionals of them.

Definition 2.9. We define a partial ordering of unidimensional functions by

$$g(t) \prec f(t) \ \text{ if } \ g(t) < f(t), \ \forall t \in \mathbb{R}.$$

A functional $T$ mapping functions to the real line is said to be monotone increasing if

$$g \prec f \Rightarrow T[g] < T[f];$$

it is monotone decreasing if

$$g \prec f \Rightarrow T[g] > T[f].$$

Most unidimensional frequentist and Bayesian estimates are monotone increasing functionals of the derivative of the log likelihood.

3 Basic Results and Constructions

Before proceeding to demonstrate that scores based on statistical estimates of an ability parameter are not regular, we provide some basic notation, a few general results and constructions for the results given below.

3.1 Statistical Constructs

Most statistical estimates may be written as functionals of the log likelihood $l(\theta; y)$ under the assumption that responses are conditionally independent across items given subjects. In our proofs, we will use $l_k(\theta; y)$ to denote the log likelihood after $k$ items have been administered:

$$l_k(\theta; y) = \sum_{i=1}^{k} \left[ y_i \log P_i(\theta) + (1 - y_i) \log(1 - P_i(\theta)) \right].$$

$l_k(\theta; y)$ is a function only of the first $k$ entries in $y$, and ignores the following entries. Throughout, $l(\theta; y)$ will be taken to imply $l_N(\theta; y)$: the log likelihood at the end of the test. We note that the log-negative second derivative condition (Def. 2.3) implies that all the second derivatives of $l(\theta; y)$ are negative. Frequentist estimates for $\theta$ are most easily expressed in terms of the derivatives of $l(\theta; y)$.

When using Bayesian methods, we also place a restriction on the priors available:


Definition 3.1. A density $\mu(\theta)$ is independent and log-concave if $\mu(\theta) = \prod_{j=1}^{d} \mu_j(\theta_j)$ where each $\mu_j$ is a log-concave density.

We assume that Bayesian estimates are conducted with a prior in which the traits are modeled as being independent. This is a substantial restriction that ensures that the posterior distribution after $k$ items,

$$f^k(\theta|y) = \frac{e^{l_k(\theta; y)}\mu(\theta)}{\int e^{l_k(\theta; y)}\mu(\theta)\,d\theta},$$

also has log-negative second derivatives. We discuss relaxing this condition in Sec. 7.

We want to emphasize that Def. 3.1 is used to define a statistical estimation procedure rather than to make any statement about the distribution of traits in a population. Our results refer to the behavior of individual statistical estimates and do not depend on the population of subjects.

We denote the marginal posterior density of $\theta_i$ by $f_i^k(\theta_i|y)$ and its density conditional on all the other ability parameters by $f_i^k(\theta_i|\theta_{-i}, y)$. The corresponding cumulative distribution functions are written $F_i^k(\theta_i|y)$ and $F_i^k(\theta_i|\theta_{-i}, y)$.

3.2 Basic Results

We begin with two central lemmas. These describe the behavior of the derivatives of the likelihood with respect to $\theta_i$, keeping the other entries in $\theta$ fixed. For simplicity, throughout this section, we take $i = 1$ and let $\theta_{-1} = (\theta_2, \ldots, \theta_d)$. The results will clearly not depend on this choice of $i$. We use $t$ as an argument in place of $\theta_1$ when examining a likelihood as a function of $\theta_1$. All proofs are given in Appendix A.

Lemma 3.1. Assume that a test consists of $N$ items which are monotone in $d$ abilities and have log negative second derivatives, at least one of which is strict. If $\theta^*_{-1} \succ \theta_{-1}$ (Def. 2.8), then

$$\frac{\partial l}{\partial\theta_1}\left(t, \theta^*_{-1}; y\right) \prec \frac{\partial l}{\partial\theta_1}\left(t, \theta_{-1}; y\right).$$

Lemma 3.2. Define

$$\theta_1(\theta_{-1}, c; y) = \left\{ t : \frac{\partial l}{\partial\theta_1}(t, \theta_{-1}; y) = c \right\}.$$

Under the conditions of Lemma 3.1, $d\theta_1/d\theta_j < 0$ for $j = 2, \ldots, d$.

For $c = 0$ this is the MLE for $\theta_1$ with all other parameters held fixed. We can define likelihood ratio confidence intervals using other values of $c$. We note that since the log likelihood is concave, $\partial l/\partial\theta_1$ is monotone decreasing and hence this function is well defined.

Lemma 3.2 provides the basic intuition for the mathematical explanation of paradoxical results. As $\theta_2$, say, increases, the value of $\theta_1$ at which $\partial l/\partial\theta_1$ crosses zero decreases. This already implies that the MLE for $\theta_1$ conditional on $\theta_2, \ldots, \theta_d$ decreases as $\theta_2$ increases. Figure 1 provides a visual representation of this intuition using the likelihood for the first subject in the data set analyzed in ***** (2007). In this figure, the contours of the log likelihood form ellipses with a negatively-oriented major axis. This will always be the case when the conditions in Def. 1 hold.
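The behavior described by Lemma 3.2 can be checked numerically in a small case. The sketch below assumes a logistic linearly compensatory model, for which $\partial l/\partial\theta_1 = \sum_i a_{i1}(y_i - P_i(\theta))$, and locates the zero of this derivative by bisection for several fixed values of $\theta_2$. The three items and the response pattern are invented for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dl_dtheta1(t, theta2, items, y):
    """Derivative of the logistic log likelihood in theta_1, evaluated at (t, theta2).

    Each item is a tuple (a1, a2, a0) with P = sigmoid(a0 + a1*theta1 + a2*theta2).
    """
    return sum(a1 * (yi - sigmoid(a0 + a1 * t + a2 * theta2))
               for (a1, a2, a0), yi in zip(items, y))


def conditional_mle_theta1(theta2, items, y, lo=-20.0, hi=20.0):
    """Solve dl/dtheta_1 = 0 for theta_1 by bisection (the c = 0 case of Lemma 3.2).

    Assumes the derivative is positive at `lo` and negative at `hi`; it is
    decreasing in t because the log likelihood is concave.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if dl_dtheta1(mid, theta2, items, y) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical items (a1, a2, a0) and responses, chosen only for illustration.
ITEMS = [(1.0, 0.5, 0.0), (0.5, 1.0, 0.0), (1.0, 1.0, 0.0)]
Y = [1, 0, 1]

curve = [conditional_mle_theta1(th2, ITEMS, Y) for th2 in (-1.0, 0.0, 1.0)]
```

Under these assumptions the entries of `curve` decrease as $\theta_2$ increases, which is the negative slope of the solid curve described for the right panel of Figure 1.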

Figure 1: An illustration of the nature of likelihoods in two-dimensional item response theory using likelihoods from an example test. Left panel: we plot $\partial l(\theta_1, \theta_2)/\partial\theta_1$ for fixed $\theta_2^*$ and $\theta_2^+ > \theta_2^*$. The latter is everywhere below the former. We have also given the maximizing values of $\theta_1$ for $\theta_2^*$ and $\theta_2^+$. Right panel: contours of the likelihood and the MLE for $\theta_1$ (solid) plotted as a function of $\theta_2$ (dashed). If $\theta_2$ is increased the maximizing value for $\theta_1$ must decrease.

For Bayesian inference, the analogous result to Lemma 3.1 follows from the same reasoning and is stated without proof:

Lemma 3.3. Under the conditions of Lemma 3.1, let $f(\theta|y)$ be the posterior density based upon an independent and log-concave prior. Let $\theta^* \succ \theta$. Then

$$\frac{\partial \log f}{\partial\theta_1}\left(t, \theta^*|y\right) \prec \frac{\partial \log f}{\partial\theta_1}\left(t, \theta|y\right).$$

While Bayesian inference can be written in terms of the derivatives of a log posterior density, it is far more convenient to produce an analysis in terms of cumulative distribution functions. The following result demonstrates that the ordering relation is preserved between these two.

Lemma 3.4. Let $f_1(t)$ and $f_2(t)$ be twice-differentiable log-concave densities on the real line. Then

$$\frac{d \log f_1}{dt} \prec \frac{d \log f_2}{dt} \ \Rightarrow\ \int_{-\infty}^{t} f_1(s)\,ds \succ \int_{-\infty}^{t} f_2(s)\,ds.$$

This result demonstrates a monotone relationship between derivatives of log densities and cumulative distribution functions. In particular, letting $X$ and $Y$ correspond to cumulative distribution functions $F_1 \prec F_2$, then $X$ is said to be stochastically greater than $Y$ and consequently for a monotone increasing function $g$, $E\,g(X) > E\,g(Y)$. A discussion of the application of stochastic ordering to IRT is given in van der Linden (1998).

3.3 A Paradoxical Test

In this section, we construct a test that exhibits paradoxical results for a wide range of statistical estimates of $\theta_1$. We make use of this test throughout the proofs of our results in Sections 4 and 5. Our main message is that among all non-simple complete parametric classes of items with log-negative second derivatives, it is impossible to use the form of the response functions alone to rule out paradoxical results, and one must therefore consider the particular combination of items in a test.

Assume that $\mathcal{F}$ is any complete parametric non-simple class of two-dimensional monotone items with log-negative second derivatives (see Definitions 2.2 - 2.4). Let $P_1, \ldots, P_{N-1}$ be items selected from this family with responses $y^{N-1} = (y_1, \ldots, y_{N-1})$. We assume that the second derivatives of $\log P_i$ are strictly negative for at least one item $i$ and that $l_{N-1}(\theta; y^{N-1})$ has a unique maximum. We take $P_N$ to be such that

$$P_N(\theta_1, \theta_2, \gamma_N) = P_N(0, \theta_2, \gamma_N) = P_N(\theta_2, \gamma_N).$$

This item is available by the completeness of $\mathcal{F}$. We define response sequences $y^1 = (y^{N-1}, 1)$ and $y^0 = (y^{N-1}, 0)$ so that $y^0 \prec y^1$. Each of the results in Sections 4 and 5 shows that a statistical estimate of ability in dimension 1 has $\hat\theta_1(y^1) < \hat\theta_1(y^0)$. There is nothing special about the first dimension or the final item; they have been chosen for notational, and conceptual, convenience.

The test constructed here requires the final item to exhibit simple structure. This is not required to produce paradoxical results. The continuity arguments of Sec. 7 demonstrate that the result will continue to hold when the final item is replaced by $P_N(\varepsilon, \theta_2, \gamma_N)$ for $\varepsilon > 0$.
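A small numerical instance of this construction may help fix ideas. The sketch below assumes logistic linearly compensatory items with made-up parameters, four items loading on both dimensions followed by a final item that depends on $\theta_2$ only, and computes the two-dimensional MLE with a damped Newton iteration. The solver, function names and parameter values are illustrative assumptions rather than anything prescribed in the paper.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def softplus(x):
    # log(1 + exp(x)) without overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_lik(theta, items, y):
    """Logistic log likelihood; each item is (a1, a2, a0)."""
    total = 0.0
    for (a1, a2, a0), yi in zip(items, y):
        eta = a0 + a1 * theta[0] + a2 * theta[1]
        total -= yi * softplus(-eta) + (1 - yi) * softplus(eta)
    return total


def mle(items, y, iters=200, tol=1e-12):
    """Two-dimensional MLE by Newton's method with step halving."""
    th = [0.0, 0.0]
    for _ in range(iters):
        g1 = g2 = h11 = h12 = h22 = 0.0
        for (a1, a2, a0), yi in zip(items, y):
            p = sigmoid(a0 + a1 * th[0] + a2 * th[1])
            w = p * (1.0 - p)
            g1 += a1 * (yi - p)
            g2 += a2 * (yi - p)
            h11 -= w * a1 * a1
            h12 -= w * a1 * a2
            h22 -= w * a2 * a2
        det = h11 * h22 - h12 * h12
        # Newton direction d = -H^{-1} g for the 2x2 Hessian H.
        d1 = -(h22 * g1 - h12 * g2) / det
        d2 = -(-h12 * g1 + h11 * g2) / det
        step, base = 1.0, log_lik(th, items, y)
        while step > 1e-8 and log_lik([th[0] + step * d1, th[1] + step * d2], items, y) < base:
            step *= 0.5
        th = [th[0] + step * d1, th[1] + step * d2]
        if abs(step * d1) + abs(step * d2) < tol:
            break
    return th


# Hypothetical test: the final item has a1 = 0, so it depends on theta_2 only.
ITEMS = [(1.0, 0.5, 0.0), (0.5, 1.0, 0.0), (1.0, 1.0, 0.5), (0.8, 0.3, -0.5), (0.0, 1.0, 0.0)]
Y_PREFIX = [1, 0, 1, 0]

theta_y1 = mle(ITEMS, Y_PREFIX + [1])  # final item answered correctly
theta_y0 = mle(ITEMS, Y_PREFIX + [0])  # final item answered incorrectly
paradox = theta_y1[0] < theta_y0[0]
```

For this invented test the flag `paradox` evaluates to true: answering the final item correctly raises the estimate of $\theta_2$ and lowers the estimate of $\theta_1$, as the construction predicts.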

4 Frequentist Estimates in Two-Dimensional MIRT

Many frequentist estimates can be regarded as functionals of the derivative of the log likelihood. Our particular examples are the MLE and the extrema of likelihood ratio confidence intervals. Our purpose in this section is to demonstrate that in any complete parametric class of models there are tests for which these statistics are paradoxical scores. Without loss of generality, we consider statistical estimates for $\theta_1$.

4.1 Maximum Likelihood Estimates

The MLE for $\theta$ is given by the vector $\theta_{MLE}(y)$ that maximizes $l(\theta; y)$, or alternatively that satisfies

$$\frac{\partial l}{\partial\theta}(\theta_{MLE}; y) = 0$$

assuming that this exists and is unique. We use $\theta_{1,MLE}(y)$ to denote the first entry of $\theta_{MLE}(y)$.

Theorem 4.1. Let $\mathcal{F}$ be a non-simple complete parametric class of monotone two-dimensional response models with log negative second derivatives. There exists a test with response functions $P_1, \ldots, P_N$ selected from $\mathcal{F}$, and response vectors $y^1$ and $y^0$ to this test that give rise to a paradoxical result.

The technique of this proof is to show that in the test defined in Sec. 3.3, $\theta_{2,MLE}$ must increase after the final item is answered correctly and that $\theta_{1,MLE}$ must decrease by Lemma 3.2. This technique is central to the analysis of the paper. In this proof, we rely on the final item not depending on $\theta_1$. However, in Sec. 6 we demonstrate how the same idea can be applied to analyze more general tests when linearly compensatory models are employed. The formal proof is given in Appendix B.1.

4.2 Profile Likelihood Ratio Confidence Bounds

We can extend paradoxical results for MLEs to confidence bounds on estimates. A general method for constructing confidence intervals is based on the acceptance region of a likelihood ratio test. A profile likelihood ratio lower confidence bound is the smallest value for $\theta_1$ such that a likelihood ratio test for this value against the MLE would not be rejected:

$$\theta_{1,PLB}(y) = \min\left\{ \theta_1 : l\left(\theta_1, \theta_{2,MLE}(\theta_1; y); y\right) \ge l\left(\theta_{MLE}; y\right) - K \right\}.$$

Here $K$ is a cut-off value. This is usually taken to be a quantile of a $\chi^2_1$ random variable.

Theorem 4.2. Under the conditions of Theorem 4.1, there exists a test $P_1, \ldots, P_N$ drawn from $\mathcal{F}$ for which $\theta_{1,PLB}(y)$ is not a regular score.

The proof for this theorem is given in Appendix B.2. The proof for the equivalent unconditional upper profile confidence bound follows the same construction.

We note that the use of confidence bounds based on the Fisher information matrix is conspicuously absent from this analysis. This is because the generality of our assumptions does not allow us to exclude local features in the likelihood which may produce quite different bounds from those obtained by likelihood ratio confidence intervals. In such cases we consider the likelihood ratio confidence bounds to be preferable.

5 Bayesian Estimates in Two-Dimensional MIRT

We begin with the maximum a posteriori estimate (MAP):

$$\theta_{MAP} = \left\{ \theta : \frac{\partial f}{\partial\theta}(\theta|y) = 0 \right\}.$$

As with the MLE, $\theta_{1,MAP}(y)$ indicates the first element of $\theta_{MAP}$.

Theorem 5.1. Under the conditions of Theorem 4.1, let $\mu$ be an independent log-concave prior. There exists a test $P_1, \ldots, P_N$ drawn from $\mathcal{F}$ for which $\theta_{1,MAP}(y)$ is not a regular score.

The proof of this theorem is similar to that of Theorem 4.1 and is omitted. The analogous result to Theorem 4.2 demonstrates that lower (or upper) bounds based on contours of the log posterior also exhibit paradoxical results.

5.1 Marginal Inference

Many Bayesian estimates for $\theta_1$, including credible intervals, can be expressed in terms of monotone functionals of their marginal cumulative distribution function. The most commonly found of these is the expected a posteriori estimate (EAP): $\theta_{EAP} = \int_{\mathbb{R}^d} \theta f(\theta)\,d\theta$. We can demonstrate paradoxical results for a much more general class of estimates.

Theorem 5.2. Let $T$ be a decreasing monotone functional of cumulative distribution functions and $\mathcal{F}$ a non-simple complete parametric class of monotone response models with log-negative second derivatives. There exists a test $P_1, \ldots, P_N$ drawn from $\mathcal{F}$ for which $T[F_1(\theta_1; y)]$ is not a regular score.

The proof of this theorem is given in Appendix C.

It is reasonable to expect any sensible Bayesian estimate to be a monotone decreasing function of the posterior cumulative distribution function; that is, to preserve the stochastic ordering of posterior distributions. The following two corollaries are immediate:


Corollary 5.1. Under the conditions of Theorem 5.2, there exists a test for which the EAP for $\theta_1$ is not a regular score.

Corollary 5.2. Under the conditions of Theorem 5.2, there exists a test for which any upper (or lower) credible bound for $\theta_1$ is not a regular score.

6 Linearly Compensatory Models

We can refine our analysis somewhat for the class of linearly compensatory models such as the logistic and normal ogive, in which $\theta$ enters the model only as a linear combination $\theta^T a$. In this case, we can provide more precise conditions under which paradoxical results may occur without using completeness of the class.

In particular, we consider a test using items with linearly compensatory response functions $P_1(\theta^T a_1), \ldots, P_{N-1}(\theta^T a_{N-1})$. We assume that the $P_i$ have log-negative second derivatives with $w_{i1} \le d^2 \log P_i(t)/dt^2 \le w_{i2}$, and $w_{i1} \le d^2 \log(1 - P_i(t))/dt^2 \le w_{i2}$ for all $t$ and some $w_{i1}, w_{i2} < 0$, with $w_{i1}$ possibly infinite.

Let $A$ be the matrix with $a_i^T$ on the $i$th row. We also let $W_1$ and $W_2$ be diagonal matrices with $i$th diagonal elements $w_{i1}$ and $w_{i2}$ respectively. We note that the Hessian of the log likelihood can now be written as $A^T W A$ for $W$ a diagonal matrix with $W_1 \prec W \prec W_2$.

6.1 Frequentist Inference

Theorem 6.1. Let P1(θᵀa1), . . . , PN−1(θᵀaN−1) be response functions for a test employing linearly compensatory monotone response models for abilities θ ∈ R^d. Let $y^{N-1} = (y_1, \ldots, y_{N-1})$ be any response sequence such that $\theta^{N-1}_{MLE}(y^{N-1})$ exists and is unique. Let PN(θᵀb) be a further item in the test and y1 = (y, 1), y0 = (y, 0). A sufficient condition for θ1,MLE(y1) < θ1,MLE(y0) is

$$e_1^T \left(A^T W A\right)^{-1} b > 0 \tag{2}$$

for all diagonal matrices W such that W1 ≺ W ≺ W2, where e1 = (1, 0, . . . , 0)ᵀ ∈ R^d is the first Euclidean d-vector.

The condition in this theorem is not immediately interpretable. It will be simplified below.

The mathematical strategy for the proof is as follows: we consider a transformation of ability space θ → ψ so that, without loss of generality, PN depends only on ψd. From the arguments used to prove Theorem 4.1, we know that the MLE for ψd must increase if the final item is answered correctly. We consider the MLE for θ1 as a function of ψd and show that its derivative is negative under the condition (2). Since ψd increases, the MLE for θ1 must decrease. We have given the formal details for this proof in Appendix D.1.

The requirement that the MLE exist and be unique is a technical necessity. It rules out tests where the item parameters all lie on a subspace of R^d. This can be expected to be unusual in practice with the notable exception of multidimensional Rasch models. The condition also rules out some answer sequences. Most immediately, all-correct and all-incorrect answer sequences do not have unique MLEs regardless of the item parameters (see, for example, ***** (2007)). Note also that this result excludes tests made up solely of simple structure items. In this case each row of A has only one non-zero entry and AᵀWA is diagonal and negative definite; (2) is therefore either negative or zero.
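Theorem 6.1 can be illustrated numerically. The following is a minimal sketch, assuming a two-dimensional logistic model with no intercepts; the loadings, the responses and the helper names (`sigmoid`, `log_lik`, `mle_2d`) are hypothetical and not taken from the paper. The final item has b = (0, 1), so that e1ᵀ(AᵀWA)⁻¹b is a positive multiple of minus the off-diagonal entry of AᵀWA, which is positive whenever all loadings are positive and all curvatures negative. The MLE is found by Newton's method with step halving.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_lik(theta, items, y):
    total = 0.0
    for (a1, a2), yi in zip(items, y):
        p = sigmoid(a1 * theta[0] + a2 * theta[1])
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total

def mle_2d(items, y, iters=100):
    """Newton's method with step halving for a 2-D logistic log likelihood."""
    t1, t2 = 0.0, 0.0
    for _ in range(iters):
        g1 = g2 = h11 = h12 = h22 = 0.0
        for (a1, a2), yi in zip(items, y):
            p = sigmoid(a1 * t1 + a2 * t2)
            r = yi - p             # score contribution
            w = -p * (1.0 - p)     # curvature (an entry of W)
            g1 += r * a1
            g2 += r * a2
            h11 += w * a1 * a1
            h12 += w * a1 * a2
            h22 += w * a2 * a2
        det = h11 * h22 - h12 * h12
        # Newton direction: -H^{-1} g for the 2x2 Hessian H = A^T W A
        s1 = -(h22 * g1 - h12 * g2) / det
        s2 = -(-h12 * g1 + h11 * g2) / det
        step = 1.0
        base = log_lik((t1, t2), items, y)
        while step > 1e-8 and log_lik((t1 + step * s1, t2 + step * s2), items, y) < base:
            step *= 0.5
        t1 += step * s1
        t2 += step * s2
    return t1, t2

# Hypothetical first N-1 items and responses; the MLE exists and is unique.
items = [(1.0, 0.5), (1.0, 0.5), (0.5, 1.0), (0.5, 1.0)]
y = [1, 0, 1, 0]
final_item = (0.0, 1.0)  # b: loads on dimension 2 only

theta_base = mle_2d(items, y)
theta_correct = mle_2d(items + [final_item], y + [1])
theta_wrong = mle_2d(items + [final_item], y + [0])
print(theta_base, theta_correct, theta_wrong)
```

In this example the estimate of θ1 is lower when the final item is answered correctly than when it is answered incorrectly, while the estimate of θ2 moves in the expected direction.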


Corollary 6.1. Let P1(θᵀa1), . . . , PN(θᵀaN) be response functions for a test comprised of monotone linearly compensatory items with log negative derivatives. Let the test be ordered such that

$$\frac{a_{N1}}{\|a_N\|} < \frac{a_{i1}}{\|a_i\|}, \quad i = 1, \ldots, N-1. \tag{3}$$

Then for any response pattern $y^{N-1} = (y_1, \ldots, y_{N-1})$ such that $\theta_{1,MLE}(y^{N-1})$ is defined and unique, setting $y_1 = (y^{N-1}, 1)$ and $y_0 = (y^{N-1}, 0)$ gives θ1,MLE(y0) > θ1,MLE(y1).

The proof of this corollary is quite involved and is given in Appendix D.2.

This result demonstrates that for every test using linearly compensatory models, the MLE exhibits paradoxical results. Moreover, almost all answer sequences produce paradoxical results, either increasing the estimate of θ1 by getting more items incorrect, or decreasing it by getting more correct. It also provides a rule for finding which item to change: the one with least relative weight on dimension 1. Should a student be able to identify this item, they could be guaranteed to obtain a higher score by answering it incorrectly.
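The rule in Corollary 6.1 amounts to ranking items by the ratio in (3). A minimal sketch follows; the loading vectors and the function name `least_relative_weight_item` are hypothetical and serve only to show the computation.

```python
import math

def least_relative_weight_item(loadings, dim=0):
    """Return the index of the item minimising a_i[dim] / ||a_i||,
    together with the list of ratios for all items."""
    ratios = [a[dim] / math.sqrt(sum(x * x for x in a)) for a in loadings]
    index = min(range(len(loadings)), key=lambda i: ratios[i])
    return index, ratios

# Hypothetical two-dimensional loadings (a_i1, a_i2) for four items.
loadings = [(1.0, 0.5), (0.8, 0.8), (0.3, 1.2), (1.5, 0.2)]
index, ratios = least_relative_weight_item(loadings)
print(index, [round(r, 3) for r in ratios])
```

Here the third item (index 2) has the least relative weight on dimension 1, so under the corollary it is the item whose response change produces a paradoxical result for θ1, provided the minimum in (3) is strict.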

We want to emphasize here that the above theorem provides sufficient conditions for an item to produce a paradoxical result; these conditions are not necessary. In addition to the item guaranteed by the corollary, there may be numerous other items that produce paradoxical results. Indeed any item that satisfies (2) is guaranteed to produce a paradoxical result so long as the answers to the other items produce a unique MLE. However, there may be further items that produce paradoxical results for particular values of W. Similar comments can be made for the results below.

As a final application, we can extend our analysis above to composite scores of the form S(y) = αᵀθMLE(y). The use of such scores has been suggested, for example, in van der Linden (1999).

Corollary 6.2. Under the conditions of Theorem 6.1, let S(y) = αᵀθMLE(y). A sufficient condition for S(y1) < S(y0) is

$$\alpha^T \left(A^T W A\right)^{-1} b > 0$$

for all diagonal W1 ≺ W ≺ W2.

The proof for this corollary is given in Sec. D.3.

6.2 Bayesian Inference

Bayesian analysis for linearly compensatory models follows along similar lines. In particular, we observe that for an independent and log-concave prior, taking ψ = B*θ gives

$$\frac{\partial^2 \log \mu}{\partial\psi\,\partial\psi^T} = B^* K(\theta) B^{*T}$$

for K = ∂² log µ/∂θ∂θᵀ. Adding this term to the second derivatives throughout the proof for Theorem 6.1 leads to:

Theorem 6.2. Under the conditions of Theorem 6.1, let µ be an independent and log-concave prior with second derivatives bounded below by diagonal matrix $K^-$. A sufficient condition for θ1,MAP(y1) < θ1,MAP(y0) is

$$e_1^T \left(A^T W A + K\right)^{-1} b > 0 \tag{4}$$

for all diagonal W such that W1 ≺ W ≺ W2 and $K^- \prec K \prec 0$.

We note that this condition no longer admits paradoxical results across all answer sequences for which the estimate is defined. In particular, in the case of a two-dimensional θ and Gaussian prior with variances σ1² and σ2², (4) can be reduced to

$$\frac{b_2}{b_1} > \frac{-\sum_{i=1}^{N-1} w_i a_{i2}^2 + \frac{1}{\sigma_2^2}}{-\sum_{i=1}^{N-1} w_i a_{i1} a_{i2}}. \tag{5}$$


It is possible that there is no ordering of the answers for which this is true. The condition becomes more stringent as σ2² decreases. Using a smaller value of σ2² reduces the variation of θ2 in the posterior distribution and we can think of this strategy as shrinking the test towards being a unidimensional test on dimension 1. However, we also note that the condition becomes less stringent as N increases and (5) approaches (3). Alternatively, reversing the inequality in (5) can be shown to guarantee that θ1,MAP(y) behaves regularly at the final item. This could be turned into a general criterion for designing tests. However, doing so would restrict to either short tests, low discrimination parameters or strong priors.
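The dependence of (5) on the prior variance and on test length can be checked with a few lines of arithmetic. The sketch below evaluates the right-hand side of (5) for one fixed set of curvatures wi; the loadings, the value −0.25 (the most negative curvature of a logistic item) and the function name `ratio_threshold` are assumptions for illustration. The theorem itself requires the inequality for every admissible W, so a single W only indicates the trend.

```python
def ratio_threshold(loadings, weights, sigma2_sq):
    """Right-hand side of (5) for fixed curvatures w_i < 0 and prior
    variance sigma2_sq on the second ability."""
    num = -sum(w * a2 * a2 for (a1, a2), w in zip(loadings, weights)) + 1.0 / sigma2_sq
    den = -sum(w * a1 * a2 for (a1, a2), w in zip(loadings, weights))
    return num / den

# Hypothetical two-item block of loadings with logistic-type curvatures.
loadings = [(1.0, 0.5), (0.5, 1.0)]
weights = [-0.25, -0.25]

strong_prior = ratio_threshold(loadings, weights, 1.0)          # sigma_2^2 = 1
weak_prior = ratio_threshold(loadings, weights, 4.0)            # sigma_2^2 = 4
long_test = ratio_threshold(loadings * 10, weights * 10, 1.0)   # 20 items
print(strong_prior, weak_prior, long_test)
```

A final item produces a paradoxical MAP result (for this W) when b2/b1 exceeds the printed threshold. The threshold falls as the prior variance grows or as items are added, matching the discussion above.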

The condition (5) also provides conditions for paradoxical marginal inference:

Theorem 6.3. Under the conditions in Theorem 6.2, let the dimension of θ be 2, and T[F1(θ1;y)] a decreasing monotone functional of the marginal distribution for θ1. Let the second derivative of the log prior be bounded below by k1. A sufficient condition for T[F1(θ1|y1)] < T[F1(θ1|y0)] is

$$\frac{b_2}{b_1} > \frac{\sum_{i=1}^{N-1} w_i a_{i2}^2 + k_1}{\sum_{i=1}^{N-1} w_i a_{i1} a_{i2}}$$

for all wi1 < wi < wi2.

The proof of this Theorem is given in Appendix D.4.

7 Model Extensions

We have made several assumptions in order to simplify our analysis. In particular, the use of log-concave response models and prior densities with no correlation between ability traits is not always realistic in practice. The relaxation of these assumptions significantly complicates the analysis. This section outlines a number of specific model extensions that are commonly used in practice and some ways to extend the mathematical results above.

7.1 Extending Existence Results

7.1.1 Embedding in a Larger Space

The following result is a direct consequence of the continuity of the statistical estimates we have examined on the various parameters that define the test and estimation procedure:

Theorem 7.1. Let s(θ(y; γ)) be a monotone function of a MLE, MAP or EAP of θ, or profile confidence bound or credible bound for θ. Let γ be the collection of test parameters including item parameters and parameters of the prior such that the prior and item response functions vary continuously with γ. If y0 ≺ y1 and s(θ(y0; γ)) > s(θ(y1; γ)), then for some ε > 0, ‖γ* − γ‖ < ε ⇒ s(θ(y0; γ*)) > s(θ(y1; γ*)).

This theorem demonstrates that paradoxical results can occur in more general settings than those covered by our assumptions, so long as our assumptions hold in a subspace of the model. In particular, this covers the following models:

No Item Has Simple Structure: The results in Sections 4 and 5 were predicated on a test in which the final item response function did not depend on θ1, but at least one of the previous items depended on both θ1 and θ2. This test is a special case of the set of the tests in a complete parametric class of items. Theorem 7.1 implies that paradoxical results can occur when all items load onto both dimensions.


Three or More Ability Dimensions: Our results in Sections 4 and 5 may be viewed as a special case of using a d-dimensional ability vector in which no item depends on the final d − 2 abilities. Theorem 7.1 implies that paradoxical results can occur if the loadings on the final d − 2 abilities are sufficiently small.

More generally, if the multidimensional analogue of PN(θ) in Sec. 3.3 places no weight on θ1 and $\theta^N_{i,MLE}(y_1) \ge \theta^{N-1}_{i,MLE}(y_1)$ for i = 2, . . . , d, the arguments used above give that $\theta^N_{1,MLE}(y_1) < \theta^N_{1,MLE}(y_0)$. If we do not have $\theta^N_{i,MLE}(y_1) \ge \theta^{N-1}_{i,MLE}(y_1)$, a paradoxical result must hold for some ability dimension other than 1. These ideas can also be extended to our results for Bayesian inference.

Guessing Parameters: A common variant of the item response models examined so far is to introduce guessing parameters of the form

$$P_i^*(\theta) = c_i + (1 - c_i) P_i(\theta).$$

This ensures that a subject always has probability of at least ci of answering the item correctly. These models violate our log-negative second derivative condition unless ci = 0 for all i. Theorem 7.1 implies that paradoxical results will occur for some ci > 0.

Non-independence Priors: Bayesian estimates that use priors in which abilities are positively correlated do not always produce log posteriors with positive second derivatives. Letting µ(θ, ρ) be such that µ(θ, 0) factorizes into log-concave densities, Theorem 7.1 implies that Bayesian estimates using ρ > 0 can also exhibit paradoxical results. In longer tests, the influence of the prior will be overwhelmed by the data, whatever correlation is used.

In each of these cases, our results demonstrate that paradoxical results can occur in tests whose parameters are near those defined in Sec. 3.3. Exactly how far away from these parameters it is still possible to produce paradoxical results will depend on the number of items, the shape of the item response function and the specific parameters in the test.
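To see concretely why a guessing parameter breaks the log-negative second derivative condition, the following sketch evaluates d² log P(t)/dt² in closed form for a logistic item with lower asymptote c. The function name `d2_log_p`, the evaluation point t = −3 and the value c = 0.2 are assumptions chosen for illustration.

```python
import math

def d2_log_p(t, c):
    """Second derivative in t of log P(t), where
    P(t) = c + (1 - c) * sigmoid(t) is a logistic item with guessing c."""
    s = 1.0 / (1.0 + math.exp(-t))
    p = c + (1.0 - c) * s
    dp = (1.0 - c) * s * (1.0 - s)    # P'(t)
    d2p = dp * (1.0 - 2.0 * s)        # P''(t)
    return d2p / p - (dp / p) ** 2    # (log P)'' = P''/P - (P'/P)^2

no_guessing = d2_log_p(-3.0, 0.0)     # equals -s(1 - s), always negative
with_guessing = d2_log_p(-3.0, 0.2)   # positive for low abilities
print(no_guessing, with_guessing)
```

With c = 0 the curvature is negative everywhere, while with c = 0.2 it is positive at t = −3, so the condition fails for sufficiently low abilities.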

7.1.2 Local Conditions

Theorems 4.1 and 5.1 only require the log likelihood or log posterior to have negative second derivatives in a local region of θ1(y0) and θ1(y1).

Theorem 7.2. Let $l^{N-1}(\theta; y)$ be the log likelihood after N − 1 items in a test. Define $\theta^{N-1}_1(\theta_2; y)$ to be the conditional MLE for θ1 as in (10) and assume this is unique. Assume that the response function of the final item does not depend on θ1 and let $\theta^N(y_0)$ and $\theta^N(y_1)$ be the MLEs when the final item is answered incorrectly and correctly, respectively. A sufficient condition for $\theta^N_1(y_0) > \theta^N_1(y_1)$ is that the second derivatives of $l^{N-1}(\theta; y)$ are negative on the line defined by $(\theta^{N-1}_1(\theta_2; y), \theta_2)$ for θ2 between $\theta^N_2(y_0)$ and $\theta^N_2(y_1)$.

Proof. This is a direct consequence of the implicit function theorem as used in Theorem 4.1 restricted to the line $(\theta^{N-1}_1(\theta_2), \theta_2)$.

The equivalent result can be shown for MAPs by replacing the log likelihood above with the log posterior.

Verifying that these points of inflection do not occur near the MLEs or MAPs appears to be analytically difficult, but the result provides an intuitive motivation for expecting paradoxical results to occur in more general settings than we examine. A similar analysis of EAPs is also possible, but complex.


7.2 Correlated Priors and Linearly Compensatory Models

In order to refine our analysis in a specific case, we examine linearly compensatory models with Gaussian priors. It is clear from the results in Sec. 6 that if K in Theorem 6.2 is constant, it need not be diagonal and a general sufficient condition for a paradoxical result is simply (4). In two dimensions, if we set the prior variances to σ1² and σ2² with prior correlation ρ, condition (5) becomes

$$\frac{a_{N2}}{a_{N1}} > \frac{-\sum_{i=1}^{N-1} w_i a_{i2}^2 + \frac{1}{\sigma_2^2(1-\rho^2)}}{-\sum_{i=1}^{N-1} w_i a_{i1} a_{i2} - \frac{\rho}{(1-\rho^2)\sigma_1\sigma_2}} \tag{6}$$

and we see that making ρ closer to 1 increases the right hand side of (6), resulting in a smaller set of items that will produce paradoxical results.
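Condition (4) with a correlated Gaussian prior can also be evaluated directly, taking K to be minus the inverse prior covariance matrix. The sketch below does this for a single fixed W in two dimensions; the loadings, the curvature value −0.25, the final-item vector b and the function names are hypothetical, and a single W is only indicative since the theorem requires the inequality over the whole admissible range.

```python
def first_row_inverse_times_b(M, b):
    """e1^T M^{-1} b for a symmetric 2x2 matrix M = [[m11, m12], [m12, m22]]."""
    (m11, m12), (_, m22) = M
    det = m11 * m22 - m12 * m12
    return (m22 * b[0] - m12 * b[1]) / det

def posterior_hessian(loadings, weights, s1, s2, rho):
    """A^T W A + K, with K = -(prior covariance)^{-1} for a bivariate
    Gaussian prior with standard deviations s1, s2 and correlation rho."""
    h11 = sum(w * a1 * a1 for (a1, a2), w in zip(loadings, weights))
    h12 = sum(w * a1 * a2 for (a1, a2), w in zip(loadings, weights))
    h22 = sum(w * a2 * a2 for (a1, a2), w in zip(loadings, weights))
    c = 1.0 - rho * rho
    k11 = -1.0 / (c * s1 * s1)
    k12 = rho / (c * s1 * s2)
    k22 = -1.0 / (c * s2 * s2)
    return [[h11 + k11, h12 + k12], [h12 + k12, h22 + k22]]

# Hypothetical 20-item test and a final item weighted mostly on dimension 2.
loadings = [(1.0, 0.5), (0.5, 1.0)] * 10
weights = [-0.25] * 20
b = (0.2, 1.0)

value_rho0 = first_row_inverse_times_b(posterior_hessian(loadings, weights, 1.0, 1.0, 0.0), b)
value_rho08 = first_row_inverse_times_b(posterior_hessian(loadings, weights, 1.0, 1.0, 0.8), b)
print(value_rho0, value_rho08)
```

For this example the quantity in (4) is positive with an uncorrelated prior and negative with ρ = 0.8, in line with the observation that a larger prior correlation shrinks the set of items satisfying the sufficient condition.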

7.3 An Empirical Study

By way of empirically studying the prevalence of paradoxical results under the model violations discussed above, we investigated the operational data in ***** (2007). In that paper, a linearly compensatory two-dimensional logistic model was used that incorporated guessing parameters and a standard normal prior for a 67-item test of English skills, given to 2500 grade 5 students. Among these students, four pairs exhibited the Jill-and-Jane scenario in the introduction, one answering every question as well or better than the other but yet obtaining a worse result on dimension 1. Depending on the consequences of the test, this may be regarded as being rare, although it may still be unacceptable in high-stakes settings.

Beyond directly comparing subjects, we can also ask the hypothetical “Would this subject have done better by deliberately answering some items incorrectly or vice versa?”. For every one of these subjects, changing the item with largest relative weight on θ2 caused θ1,EAP(y) to produce paradoxical results. To illustrate the concerns that these results might produce, ***** (2007) investigated setting passing thresholds for these data. They found it was possible to choose a threshold on θ1 so that 6% of subjects could be moved from “fail” to “pass” by getting more questions wrong. ***** (2007) also investigated varying the prior correlation, ρ; while larger values reduced the incidence of paradoxical results, setting ρ as high as 0.8 did not eliminate them entirely.

7.4 Discrete Ability Spaces

Some tests (or models) such as cognitive diagnosis models (CDMs; see Haertel (1990), Roussos et al. (2007), and von Davier (2005)) classify subjects into one of multiple ordered categories along each dimension. In this context, an assignment is made to one of the categories and thus the use of MLE and MAP estimation is standard. We illustrate the existence of paradoxical results on a two-dimensional ability space partitioned into categories master and non-master. Here, items are monotone in the sense of Definition 2.2 if the probability of getting an item correct for a master is greater than the probability of getting an item correct for a non-master.

Table 1 gives numerical values for a hypothetical three-item test that produces a paradoxical result along with the log likelihood values of the answer sequences (1,0,1) and (1,0,0). We observe that the first two items create a “ridge” giving highest values to the assignments (non-master, master) and (master, non-master) with the latter being highest. The final item only changes with dimension 2; getting it correct moves the MLE for that dimension from non-master to master. The effect of this is to move the assignment for dimension 1 the other way, creating a paradoxical result. We note that in this example the only way to achieve a paradoxical result is to move from the (non-master, master) category into (master, non-master) or vice versa. Due to the discrete nature of the estimates, this is less common than paradoxical results for continuous ability spaces.

            P(Item Correct)         Cumulative log likelihood
Category    1      2      3         1        2        3_1      3_0
(n,n)       0.40   0.20   0.40      -0.92    -1.27    -2.19    -1.78
(n,m)       0.50   0.40   0.80      -0.69    -1.20    -1.43*   -2.81
(m,n)       0.70   0.50   0.40      -0.36    -1.05*   -1.97    -1.56*
(m,m)       0.90   0.70   0.80      -0.11*   -1.31    -1.53    -2.92

Table 1: A paradoxical test on a two-dimensional discrete ability space. Left: probability of a correct answer in each class, coded “n” for non-master and “m” for master. Right: cumulative log likelihood following a (1,0,1) response pattern. The final column (3_0) gives the log likelihood after a (1,0,0) response pattern. MLEs are starred. Notice that the category maximizing the likelihood moves from (n,m) to (m,n) when the third item is changed from correct to incorrect.
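The classification in Table 1 can be recomputed from the item probabilities. The following is a minimal sketch using the probabilities as printed in the table; the names `p_correct`, `log_lik` and `mle_category` are introduced here for illustration. Log likelihoods recomputed this way for the (n,n) row come out slightly higher than the printed cumulative values, but the maximizing category is unaffected for the two final response patterns.

```python
import math

categories = ["(n,n)", "(n,m)", "(m,n)", "(m,m)"]

# P(item correct) for items 1-3 in each category, as printed in Table 1.
p_correct = {
    "(n,n)": (0.40, 0.20, 0.40),
    "(n,m)": (0.50, 0.40, 0.80),
    "(m,n)": (0.70, 0.50, 0.40),
    "(m,m)": (0.90, 0.70, 0.80),
}

def log_lik(category, responses):
    """Log likelihood of a 0/1 response pattern for one category."""
    return sum(math.log(p if r == 1 else 1.0 - p)
               for p, r in zip(p_correct[category], responses))

def mle_category(responses):
    """Category with the highest log likelihood."""
    return max(categories, key=lambda c: log_lik(c, responses))

mle_correct = mle_category((1, 0, 1))   # third item answered correctly
mle_wrong = mle_category((1, 0, 0))     # third item answered incorrectly
print(mle_correct, mle_wrong)
```

Answering the third item correctly classifies the subject as a non-master on dimension 1, whereas answering it incorrectly classifies the subject as a master on dimension 1.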

8 Enforcing Regularity

Our analysis indicates that paradoxical results are endemic to multidimensional item response models. In the case of two-dimensional linearly compensatory models, it is not possible to design a test using frequentist estimates that avoids this problem. This fact suggests a need to review the statistical estimates that we use.

One immediate method for modifying a MAP or MLE is to impose constraints on the estimate. That is, we seek ability estimates for all subjects such that no paradoxical results occur for the observed response sequences. For the case of MAP estimates, we want estimated vectors θ1, . . . , θn for n subjects with

$$(\theta_{MAP,1}, \ldots, \theta_{MAP,n}) = \arg\max_{\theta_1,\ldots,\theta_n} \; l(\theta_1; y_1) + \ldots + l(\theta_n; y_n) + \log\mu(\theta_1) + \ldots + \log\mu(\theta_n)$$

subject to the constraints

$$\theta_{MAP,j} \prec \theta_{MAP,k} \quad \text{if } y_j \prec y_k. \tag{7}$$

The number of these constraints may be reduced by taking account of the transitivity relations

θMAP,i < θMAP,j and θMAP,j < θMAP,k ⇒ θMAP,i < θMAP,k. (8)

Here these inequalities could be enforced just for a single dimension of interest, or for all dimensions, depending on the purpose of the test. Similar estimates could be obtained by defining appropriate objective functions for confidence bounds and Bayesian estimates.
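The constraint set (7) and its reduction through the transitivity relations (8) can be sketched as follows. This is one simple way to build the constraints, not necessarily the procedure used for the operational data; the response patterns and the function names `dominates` and `order_constraints` are hypothetical.

```python
def dominates(yk, yj):
    """True if pattern yk is at least as good as yj on every item
    and the two patterns differ (that is, yj precedes yk)."""
    return yj != yk and all(a <= b for a, b in zip(yj, yk))

def order_constraints(patterns):
    """Return all ordered pairs (j, k) with y_j preceding y_k, and the
    subset that remains after dropping pairs implied by transitivity."""
    n = len(patterns)
    pairs = {(j, k) for j in range(n) for k in range(n)
             if dominates(patterns[k], patterns[j])}
    reduced = {(j, k) for (j, k) in pairs
               if not any((j, m) in pairs and (m, k) in pairs for m in range(n))}
    return pairs, reduced

# Hypothetical response patterns for five subjects on a three-item test.
patterns = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 0)]
pairs, reduced = order_constraints(patterns)
print(len(pairs), sorted(reduced))
```

Only the reduced pairs need to be passed to the optimizer, since the remaining inequalities follow from them. The triple loop is cubic in the number of subjects, which is a simplification adequate only for small illustrations.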

In the operational data investigated in ***** (2007), there were 2500 ability vectors to estimate, with 51,354 inequalities of the form (7). Accounting for transitivity relations (8) reduced the number of constraints to 31,764. We performed the constrained MAP estimation using the IPOPT constrained optimization routines (Wachter and Biegler 2006), starting from the unconstrained EAP estimates.¹ It took 2.5 seconds of CPU time when no constraints were imposed; this increased to 10.4 seconds when the constraints were imposed for θ1 only and 35.5 CPU seconds when the estimates of both θ1 and θ2 were constrained to satisfy (7). In all, the mean squared difference between the unconstrained EAP and MAP estimates was 0.0015. The mean squared difference between the unconstrained MAP and the MAP constrained to be regular in both θ1 and θ2 was 0.02. Thus, the effect of enforcing regularity constraints has a somewhat larger magnitude than the choice of estimate for θ. Of these, those subjects who appeared in the four pairs violating the constraints had twice the squared displacement on average as the others. When the constraint was only imposed on θ2, the mean squared difference was reduced to 1.15 × 10⁻⁹. Some of this difference is due to the numerical approximations used for constrained optimization within IPOPT.

¹ The EAP is used here rather than the MAP because it may be calculated without requiring a non-linear optimization. We have elected to use a Bayesian analysis in order to avoid the lack of identifiability of MLEs studied in ***** (2007).

It appears, then, that imposing constraints to avoid what might be termed observed paradoxical results is computationally feasible for at least moderate sizes of cohorts and creates moderate distortion in the statistical estimates of a subject’s ability. By way of understanding this, we give the following theorem:

Theorem 8.1. Let $\{\theta_k\}_{k=1}^n$ be a fixed set of finite item parameters for n subjects. Let $\{\theta^u_{kN}\}_{k=1}^n$ be a set of θ estimates for each of these n subjects based on N items without constraints. Let $\theta^c_{kN}$, k = 1, . . . , n be the corresponding estimate under constraints (7). Assume that N → ∞ with Pj(θk) uniformly bounded away from zero and one for all k ∈ 1, . . . , n and j ∈ 1, . . . , ∞. Then $\|\theta^c_{kN} - \theta^u_{kN}\| \to 0$ almost surely.

The proof of this theorem is given in Appendix E. It relies on the fact that observing two subjects with partial inequality is rare for long tests. In short tests, using Bayesian methods is likely to reduce the incidence of observed non-monotonicity; in long tests it is unlikely that students can be placed in a definite order at all. Thus, if observed paradoxical results are the only concern, enforcing regularity constraints will remove the few instances of it without unduly distorting the estimated parameters.

The constrained optimization above would avoid the scenario described in the introduction. However, it has the disadvantage of changing the results if more subjects are added. It also does not address the counterfactual “if I had only gotten this question wrong, I would have passed”. There is, of course, no reason that we need restrict to only those response patterns seen. One way to avoid these concerns is to estimate parameters for all possible response patterns, rather than just those observed, under all relevant regularity conditions. Estimates for θ then amount to matching the observed response pattern to our enumeration of all possible responses and assigning the corresponding score. For an N-item test with a d-dimensional ability space we would need to estimate all $d\,2^N$ possible ability parameters under $d\sum_{k=0}^{N}\binom{N}{k}(N-k)$ constraints. Doing so is clearly infeasible for all but very short tests.
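The size of this enumeration is easy to tabulate. The sketch below evaluates both counts for a given N and d; the function name `enumeration_size` is introduced here for illustration. It also uses the binomial identity Σ_k C(N,k)(N − k) = N·2^(N−1), so the number of constraints equals d·N·2^(N−1).

```python
from math import comb

def enumeration_size(n_items, d):
    """Number of ability parameters and of order constraints needed to
    enumerate every response pattern of an n_items-item test with a
    d-dimensional ability space."""
    n_params = d * 2 ** n_items
    n_constraints = d * sum(comb(n_items, k) * (n_items - k)
                            for k in range(n_items + 1))
    return n_params, n_constraints

small = enumeration_size(3, 2)     # a three-item, two-dimensional test
large = enumeration_size(67, 2)    # a 67-item, two-dimensional test
print(small, large)
```

For a three-item test the enumeration is trivial, while for a test of the length studied in Sec. 7.3 the number of parameters already exceeds 10^20.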

9 Conclusion

Jane and Jill from our opening story were meant to demonstrate the real-world consequences of paradoxical results. Clearly, any testing procedure where doing better has adverse consequences has implications for fairness. Theirs is not a purely hypothetical case, nor a case that could occur only under pathological conditions. ***** (2007) found four pairs of subjects that matched their situation in a real-world data set and a much larger number who could have done better by deliberately answering questions incorrectly; the current paper complements these empirical findings by providing a mathematical explanation for how and when statistical estimation can produce paradoxical results.

Our discussion has uncovered sufficient conditions for paradoxical results in multidimensional item response models. Paradoxical results in statistical estimates for multidimensional item response models are common. They are ubiquitous in some of the most popular models. This does not make them unreasonable from a statistical standpoint and should not preclude the use of multidimensional models in diagnostic testing. However, our results do provide reason to be cautious about their use in high-stakes tests.

Our results do not provide a complete analysis for the extensions cited in Sec. 7 of the form given in Sec. 6. Further extensions such as an analysis of item bundles should also be investigated. Should the user wish to both use statistical methods and maintain regularity, the constrained estimates discussed in Sec. 8 provide a tractable means of at least removing observable paradoxical results. When the broader counterfactual is of concern, the practitioner may reconsider the use of multidimensional testing.

References

Ackerman, T. (1996). Graphical representation of multidimensional item response theory analyses. Applied Psychological Measurement 20 (4), 311–329.

Antal, T. (2007). On multidimensional item response theory – a coordinate free approach. Electronic Journal of Statistics 1, 290–306.

Bock, R., R. Gibbons, and E. Muraki (1988). Full-information item factor analysis. Applied Psychological Measurement 12, 261–280.

Bolt, D. and V. Lall (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement 27 (6), 395–414.

Finkelman, M., G. Hooker, and J. Wang (2007). Unidentifiability and lack of monotonicity in the multidimensional three-parameter logistic model. Under review.

Haertel, E. (1990). Continuous and discrete latent structure models of item response data. Psychometrika 55, 477–494.

Junker, B. and K. Sijtsma (2000). Latent and manifest monotonicity in item response models. Applied Psychological Measurement 24, 65–81.

Reckase, M. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement 9, 401–412.

Reckase, M. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement 21, 25–36.

Roussos, L., L. DiBello, W. Stout, S. Hartz, R. Henson, and J. Templin (2007). The fusion model skills diagnosis system, pp. 275–318. Cambridge, UK: Cambridge University Press.

Segall, D. O. (2000). Principles of multidimensional adaptive testing. In W. J. van der Linden and C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and Practice, pp. 27–52. Boston: Kluwer Academic Publishers.

Thissen, D. and L. Steinberg (1997). A response model for multiple choice items. In W. J. van der Linden and R. K. Hambleton (Eds.), Handbook of item response theory, pp. 51–65. New York: Springer-Verlag.

van der Linden, W. J. (1998). Stochastic order in dichotomous item response models for fixed, adaptive and multidimensional tests. Psychometrika 63 (3), 211–226.

van der Linden, W. J. (1999). Multidimensional adaptive testing with a minimum error-variance criterion. Journal of Educational and Behavioral Statistics 24, 398–412.

von Davier, M. (2005). A general diagnostic model applied to language testing data. ETS research report 05-16, Educational Testing Service, Princeton, NJ.

Wachter, A. and L. T. Biegler (2006). On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming 106, 25–57.

Whitely, S. (1980). Multicomponent latent trait models for ability tests. Psychometrika 45, 479–494.

A Proofs for Lemmas in Sec. 3

We begin with a notational convenience throughout the proofs below: where y is used it is taken to mean the common first N − 1 items and will not be used in a function that depends on item N. y1 is taken to be the concatenation (y, 1) and y0 to mean (y, 0).

A.1 Proof of Lemma 3.1

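Consider a Taylor expansion at $\theta_{-1}$, for some $\tilde\theta_{-1}$ on the segment between $\theta_{-1}$ and $\theta^*_{-1}$:

$$\frac{\partial l}{\partial\theta_1}\left(t, \theta^*_{-1}; y\right) = \frac{\partial l}{\partial\theta_1}\left(t, \theta_{-1}; y\right) + \left(\theta^*_{-1} - \theta_{-1}\right)^T \frac{\partial^2 l}{\partial\theta_1\,\partial\theta_{-1}}\left(t, \tilde\theta_{-1}; y\right) \prec \frac{\partial l}{\partial\theta_1}\left(t, \theta_{-1}; y\right),$$

since $\left(\theta^*_{-1} - \theta_{-1}\right) \succ 0$ and $\partial^2 l\left(t, \tilde\theta_{-1}; y\right)/\partial\theta_1\,\partial\theta_{-1} \prec 0$.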

A.2 Proof of Lemma 3.2

First we give a preliminary lemma:

Lemma A.1. Let µ(t) be a unidimensional density and p(t) a positive strictly increasing function with ∫ p(t)µ(t)dt < ∞. Then

$$F(t) = \int_{-\infty}^{t} \mu(s)\,ds \succ \frac{\int_{-\infty}^{t} \mu(s)p(s)\,ds}{\int_{-\infty}^{\infty} \mu(s)p(s)\,ds} = F_p(t).$$

If p(t) is a positive strictly decreasing function, F(t) ≺ Fp(t).

Proof. Without loss of generality, assume that $\int_{-\infty}^{\infty} p(s)\mu(s)\,ds = 1$. We consider p to be increasing and proceed by contradiction. Suppose that F(t) < Fp(t) for some t. Then if p(t) > 1, for T > t

$$F_p(T) = F_p(t) + \int_t^T p(s)\mu(s)\,ds > F(t) + \int_t^T \mu(s)\,ds = F(T),$$

so that in particular $\int_{-\infty}^{\infty} p(s)\mu(s)\,ds = F_p(\infty) > F(\infty) = 1$. The case for p(t) < 1 follows similar arguments.

If p(t) = 1, since p is strictly increasing and F and Fp are continuous, there is some ε > 0 such that p(t + ε) > 1 and Fp(t + ε) > F(t + ε). The case for p(t) decreasing may be made analogously.

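Proof of Lemma 3.2

By the implicit function theorem we have that

$$\frac{d}{d\theta_j}\theta_1(\theta_{-1}, c; y) = -\left[\frac{\partial^2 l}{\partial\theta_1^2}\right]^{-1} \frac{\partial^2 l}{\partial\theta_1\,\partial\theta_j} < 0, \tag{9}$$

where the inequality follows directly from the negative second derivatives of the log likelihood.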


A.3 Proof of Lemma 3.4

Writing d log f2(t)/dt = d log f1(t)/dt + k(t) for some k(t) > 0, we have that

$$f_2(t) = \frac{f_1(t)\,e^{\int_0^t k(s)\,ds}}{\int f_1(t)\,e^{\int_0^t k(s)\,ds}\,dt}$$

and the result follows from Lemma A.1 with µ(t) = f1(t), $p(t) = \exp\int_0^t k(s)\,ds$.

B Proofs of Results in Sec. 4

B.1 Proof of Theorem 4.1

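Consider the test in Sec. 3.3; by assumption $\theta^{N-1}_{MLE}(y)$ exists. Define

$$\theta^{N-1}_{1,MLE}(\theta_2; y) = \left\{ t : \frac{\partial l^{N-1}}{\partial\theta_1}(t, \theta_2; y) = 0 \right\} \tag{10}$$

to be the MLE for θ1 after N − 1 items, conditional on θ2. That is, for each θ2, $\theta^{N-1}_{1,MLE}(\theta_2; y)$ is the value of θ1 that maximizes the likelihood with θ2 held fixed. By Lemma 3.2, substituting c = 0, $\theta^{N-1}_{1,MLE}(\theta_2; y)$ is a decreasing function of θ2.

We observe that since l is concave for each θ1, the MLE for θ2 satisfies

$$\theta^{N-1}_{2,MLE}(y) = \left\{ t : \frac{\partial l^{N-1}}{\partial\theta_2}\left(\theta^{N-1}_{1,MLE}(t; y), t; y\right) = 0 \right\}.$$

In other words, it maximizes the likelihood along the line $(\theta^{N-1}_{1,MLE}(\theta_2; y), \theta_2)$. The unconditional MLE for θ1 is found by setting θ2 to its MLE in (10). We write this as

$$\theta^{N-1}_{1,MLE}(y) = \theta^{N-1}_{1,MLE}\left(\theta^{N-1}_{2,MLE}(y); y\right).$$

We observe that since the response function for the final item is unaffected by θ1, $\theta^{N}_{1,MLE}(\theta_2; y)$ does not depend on the value of the final response. PN is increasing in θ2 so

$$\frac{\partial l^{N}}{\partial\theta_2}(\theta; y_1) = \frac{\partial l^{N-1}}{\partial\theta_2}(\theta; y) + \frac{\partial \log P_N}{\partial\theta_2}(\theta_2) > \frac{\partial l^{N-1}}{\partial\theta_2}(\theta; y)$$

for all (θ1, θ2) and in particular

$$\frac{\partial l^{N}}{\partial\theta_2}\left(\theta^{N}_{1,MLE}(\theta_2; y_1), \theta_2; y_1\right) > \frac{\partial l^{N-1}}{\partial\theta_2}\left(\theta^{N-1}_{1,MLE}(\theta_2; y), \theta_2; y\right).$$

Thus the MLE for θ2 must increase, implying that the MLE for θ1 must decrease:

$$\theta^{N}_{2,MLE}(y_1) > \theta^{N-1}_{2,MLE}(y) \;\Rightarrow\; \theta^{N}_{1,MLE}\left(\theta^{N}_{2,MLE}(y_1); y_1\right) < \theta^{N-1}_{1,MLE}\left(\theta^{N-1}_{2,MLE}(y); y\right).$$

Analogously, $\theta^{N}_{1,MLE}\left(\theta^{N}_{2,MLE}(y_0); y_0\right) > \theta^{N-1}_{1,MLE}\left(\theta^{N-1}_{2,MLE}(y); y\right)$, so that θ1,MLE(y1) < θ1,MLE(y0).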


B.2 Proof of Theorem 4.2

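We use the test constructed in Sec. 3.3. θ1,PLB is such that

$$K = \int_{\theta_{1,PLB}}^{\theta_{1,MLE}} \frac{\partial l}{\partial\theta_1}\left(s, \theta_{2,MLE}(s; y); y\right) ds \tag{11}$$

since ∂l(s, θ2,MLE(s;y);y)/∂θ2 = 0 from the definition of θ2,MLE(s;y).

From our construction, the last item is unaffected by θ1 and θ2,MLE(s;y0) < θ2,MLE(s;y1), and ∂l(θ1, θ2;y0)/∂θ1 = ∂l(θ1, θ2;y1)/∂θ1 so by Lemma 3.1,

$$\frac{\partial l}{\partial\theta_1}\left(\theta_1, \theta_{2,MLE}(\theta_1; y_1); y_1\right) \prec \frac{\partial l}{\partial\theta_1}\left(\theta_1, \theta_{2,MLE}(\theta_1; y_0); y_0\right).$$

We now proceed by contradiction and suppose that θ1,PLB(y0) ≤ θ1,PLB(y1). By Theorem 4.1, θ1,MLE(y1) < θ1,MLE(y0) and

$$K = \int_{\theta_{1,PLB}(y_1)}^{\theta_{1,MLE}(y_1)} \frac{\partial l}{\partial\theta_1}\left(s, \theta_{2,MLE}(s; y_1); y_1\right) ds$$

$$< \int_{\theta_{1,PLB}(y_0)}^{\theta_{1,MLE}(y_0)} \left[\frac{\partial l}{\partial\theta_1}\left(s, \theta_{2,MLE}(s; y_1); y_1\right)\right]_+ ds \tag{12}$$

$$< \int_{\theta_{1,PLB}(y_0)}^{\theta_{1,MLE}(y_0)} \frac{\partial l}{\partial\theta_1}\left(s, \theta_{2,MLE}(s; y_0); y_0\right) ds = K$$

where the subscripted ’+’ indicates the positive part. Note that (12) expands the range of integration beyond where ∂l(s, θ2,MLE(s;y1);y1)/∂θ1 crosses zero.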

C Proof of Theorem 5.2

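We make use of the test constructed in Sec. 3.3. We write

$$F^{N-1}_1(\theta_1|\theta_2,\mathbf{y}) = \frac{\int_{-\infty}^{\theta_1} f^{N-1}(t, \theta_2|\mathbf{y})\,dt}{\int_{-\infty}^{\infty} f^{N-1}(t, \theta_2|\mathbf{y})\,dt}$$

for the posterior distribution of $\theta_1$ given $\theta_2$ and $\mathbf{y}$. Applying Lemma 3.4 with $f_1(\theta_1) = f(\theta_1, \theta_2|\mathbf{y})$, $f_2(\theta_1) = f(\theta_1, \theta_2^*|\mathbf{y})$ and $\theta_2 < \theta_2^*$, we have that $F^{N-1}_1(\theta_1|\theta_2,\mathbf{y})$ is an increasing function of $\theta_2$. The condition for Lemma 3.4 is guaranteed by Lemma 3.1.

Since $P_N(\theta)$ does not depend on $\theta_1$, $F^{N}_1(\theta_1|\theta_2,\mathbf{y}^1) = F^{N-1}_1(\theta_1|\theta_2,\mathbf{y})$. Applying Lemma A.1 twice, there is an ordering on the marginal posterior distributions for $\theta_2$: $F^{N}_2(\theta_2|\mathbf{y}^0) > F^{N-1}_2(\theta_2|\mathbf{y}) > F^{N}_2(\theta_2|\mathbf{y}^1)$. This implies $\theta_2|\mathbf{y}^0$ is stochastically smaller than $\theta_2|\mathbf{y}^1$. Hence:

$$F^{N}_1(\theta_1|\mathbf{y}^1) = \int F^{N-1}_1(\theta_1|\theta_2,\mathbf{y})\,dF^{N}_2(\theta_2|\mathbf{y}^1) > \int F^{N-1}_1(\theta_1|\theta_2,\mathbf{y})\,dF^{N}_2(\theta_2|\mathbf{y}^0) = F^{N}_1(\theta_1|\mathbf{y}^0).$$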
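The ordering of the marginal posteriors for $\theta_1$ can be checked numerically on a grid. The sketch below is only an illustration under stated assumptions: a logistic response model, independent standard normal priors, hypothetical item vectors in `A` (the last item loading on $\theta_2$ only), and a grid approximation to the posterior. None of these choices come from the paper.

```python
import numpy as np

# Hypothetical items: rows are a_i; the last item depends on theta_2 only.
A = np.array([[1.0, 0.5], [1.0, 0.5], [0.5, 1.0], [0.5, 1.0], [0.0, 1.0]])
y_first = [1, 0, 1, 0]  # assumed responses to the first N-1 items

grid = np.linspace(-6.0, 6.0, 241)
T1, T2 = np.meshgrid(grid, grid, indexing="ij")  # T1 varies along axis 0


def posterior(y):
    """Grid posterior for (theta_1, theta_2) given the response vector y."""
    log_post = -0.5 * (T1 ** 2 + T2 ** 2)  # independent N(0, 1) priors (assumed)
    for (a1, a2), y_i in zip(A, y):
        t = a1 * T1 + a2 * T2
        # log P = -log(1 + exp(-t)); log(1 - P) = -log(1 + exp(t))
        log_post = log_post - (np.logaddexp(0.0, -t) if y_i == 1
                               else np.logaddexp(0.0, t))
    post = np.exp(log_post - log_post.max())
    return post / post.sum()


def theta1_marginal(y):
    """Marginal posterior mass of theta_1 on the grid (sum over theta_2)."""
    return posterior(y).sum(axis=1)


marg_correct = theta1_marginal(y_first + [1])    # last item correct
marg_incorrect = theta1_marginal(y_first + [0])  # last item incorrect

cdf_correct = np.cumsum(marg_correct)
cdf_incorrect = np.cumsum(marg_incorrect)
mean_correct = float(grid @ marg_correct)
mean_incorrect = float(grid @ marg_incorrect)

print("posterior mean of theta_1, last item correct:  ", mean_correct)
print("posterior mean of theta_1, last item incorrect:", mean_incorrect)
```

In this sketch the posterior CDF of $\theta_1$ after a correct final response should lie on or above the CDF after an incorrect one, so the posterior mean of $\theta_1$ is lower after the correct response.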


D Proofs for Sec. 6

D.1 Proof of Theorem 6.1

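Without loss of generality, we assume $b^T b = 1$. Define $B^* = [B^T \; b]^T$, where $B$ is a $(d-1)$-by-$d$ orthonormal matrix with rows orthogonal to $b$. We re-parameterize our ability space $\psi = B^*\theta$ so that $P_i(\theta^T a_i) = P_i(\psi^T B^* a_i)$. We note, importantly, that $P_N(\psi^T B^* b) = P_N(\psi_d)$, so the final item now only depends on $\psi_d$.

For convenience, let $\psi_{(d-1)} = (\psi_1, \ldots, \psi_{d-1})$ and $\psi^{k}_{(d-1),\mathrm{MLE}}(\psi_d;\mathbf{y})$ be the vector maximizing the likelihood after $k$ items for each $\psi_d$. The MLE for $\theta$ can be expressed as $\theta^{k}_{\mathrm{MLE}}(\mathbf{y}) = B^{*T}\psi^{k}_{\mathrm{MLE}}(\mathbf{y})$. We re-express this as a function of a fixed $\psi_d$:

$$\theta^{k}_{\mathrm{MLE}*}(\psi_d;\mathbf{y}) = B^T\psi^{k}_{(d-1),\mathrm{MLE}}(\psi_d;\mathbf{y}) + \psi_d b.$$

We now have

$$\frac{d\theta^{N-1}_{\mathrm{MLE}*}}{d\psi_d}(\psi_d;\mathbf{y}) = B^T\frac{d\psi^{N-1}_{(d-1),\mathrm{MLE}}}{d\psi_d}(\psi_d;\mathbf{y}) + b. \tag{13}$$

If the first component is negative then the arguments employed in Lemma 3.1 and Theorem 4.1 continue to hold: $\psi^{N}_{d,\mathrm{MLE}}(\mathbf{y}^1) > \psi^{N-1}_{d,\mathrm{MLE}}(\mathbf{y}^1)$ and hence $\theta^{N}_{1,\mathrm{MLE}}(\mathbf{y}^1) < \theta^{N-1}_{1,\mathrm{MLE}}(\mathbf{y}^1)$. Conversely, $\theta^{N}_{1,\mathrm{MLE}}(\mathbf{y}^0) > \theta^{N-1}_{1,\mathrm{MLE}}(\mathbf{y}^0)$.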

Expanding (13) for the first component of $\theta$, we make use of the implicit function theorem. Let $W = \mathrm{diag}\left[y_i\, d^2\log P_i(\theta^T a_i)/dt^2 + (1-y_i)\, d^2\log(1-P_i(\theta^T a_i))/dt^2\right]$, where the derivatives are those of $P_i(t)$ evaluated at $t = \theta^T a_i$. By assumption $W_1 \prec W \prec W_2$. It is easy to calculate

$$\frac{d\theta_{1,\mathrm{MLE}*}}{d\psi_d}(\psi_d;\mathbf{y}) = -B_{(1)}^T\left(BA^TWAB^T\right)^{-1}BA^TWAb + b_1, \tag{14}$$

where $B_{(1)}$ is the first column of $B$. Recall that $B^TB = I - bb^T$ and

$$\left(BA^TWAB^T\right)^{-1} = B\left(A^TWA\right)^{-1}B^T - \frac{B\left(A^TWA\right)^{-1}bb^T\left(A^TWA\right)^{-1}B^T}{b^T\left(A^TWA\right)^{-1}b}.$$

Substituting these into (14) gives

$$\frac{d\theta_{1,\mathrm{MLE}*}}{d\psi_d}(\psi_d;\mathbf{y}) = \frac{e_1^T\left(A^TWA\right)^{-1}b}{b^T\left(A^TWA\right)^{-1}b}.$$

Since $A^TWA$ is negative definite, the denominator is negative and we obtain the result.
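The algebra taking (14) to the final ratio can be checked numerically. The sketch below draws a hypothetical design matrix `A`, negative weights `W` and a unit vector `b` at random (all assumed values, not taken from the paper), builds `B` from a complete QR factorization of `b`, and compares the two expressions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 8

A = rng.uniform(0.2, 1.5, size=(N, d))          # hypothetical item vectors a_i (rows)
W = np.diag(-rng.uniform(0.05, 0.25, size=N))   # assumed negative diagonal weights
b = rng.uniform(0.2, 1.5, size=d)
b /= np.linalg.norm(b)                          # enforce b^T b = 1

# Complete QR of b: the first column of Q is +/- b, the remaining d-1 columns
# form an orthonormal basis of the orthogonal complement of b.
Q, _ = np.linalg.qr(b.reshape(-1, 1), mode="complete")
B = Q[:, 1:].T                                  # (d-1) x d, rows orthogonal to b

M = A.T @ W @ A                                 # A^T W A, negative definite here

# Right-hand side of (14): -B_(1)^T (B M B^T)^{-1} B M b + b_1
lhs = -B[:, 0] @ np.linalg.solve(B @ M @ B.T, B @ M @ b) + b[0]

# Simplified form: e_1^T M^{-1} b / (b^T M^{-1} b)
Minv_b = np.linalg.solve(M, b)
rhs = Minv_b[0] / (b @ Minv_b)

print("expression (14):", lhs)
print("simplified form:", rhs)
print("denominator b^T M^{-1} b:", b @ Minv_b)
```

The two printed derivatives should agree up to rounding, and the denominator should be negative because `M` is negative definite, so the sign of the derivative is determined by the numerator alone.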

D.2 Proof of Corollary 6.1

We denote by $C^d_N$ the collection of 1:1 maps $c : \{2, \ldots, d\} \to \{1, \ldots, N\}$. We define

$$A_c = \left[a_{c(2)} \; a_{c(3)} \; \cdots \; a_{c(d)}\right]^T$$

to be the design matrix of the items selected by $c$. We let

$$M^c_{1j} = (-1)^{1+j}\sum_{\sigma\in\pi(j)}(-1)^{\mathrm{sgn}(\sigma)}\prod_{i\notin\{1,j\}} a_{c(i)\sigma(i)}, \qquad j = 1, \ldots, d$$


be the $(1, j)$th co-factor of $A_c$. We let $\pi(j)$ be the permutations of $\{2, \ldots, j-1, j+1, \ldots, d\}$ with $\pi(1)$ indicating the permutations of the integers $\{2, \ldots, d\}$. We also let $S^d_N$ be the restriction of $C^d_N$ to the order preserving transformations: $S^d_N = \{c \in C^d_N : i < j \Rightarrow c(i) < c(j)\}$. This is isomorphic to the set of all cardinality $d-1$ subsets of $\{1, \ldots, N\}$. We observe that $C^d_N = S^d_N \circ \pi(1)$, indicating the composition of all pairs $s \in S^d_N$, $\sigma \in \pi(1)$. Writing $c = s\circ\sigma$ we have $M^{s\circ\sigma}_{1j} = (-1)^{\mathrm{sgn}(\sigma)}M^{s}_{1j}$ since applying $\sigma$ before $s$ is simply a permutation of the rows of $A_c$.

For each $s \in S^d_N$ we now define $a^*_s = M^s_{11}(M^s_{11}, \ldots, M^s_{1d})^T$. This vector is normal to the space spanned by $a_{s(2)}, \ldots, a_{s(d)}$. As we show below, the condition (2) can be guaranteed by $a^{*T}_s b < 0$. Geometrically, we can think of this in the following terms:

1. Select any $d-1$ items, with this collection indexed by $s$. Their parameters make up a $(d-1)$-dimensional hyperplane in $\mathbb{R}^d$.

2. Now form the unique 1-dimensional normal to this plane; choose the direction that most points towards $e_1$. This is $a^*_s$. Note that if the rows of $A_s$ are not linearly independent, the normal is not unique but $a^*_s = 0$.

3. The condition (2) will always be satisfied if $b$ makes an obtuse angle with $a^*_s$ for each $s$.

4. We can guarantee that this will be the case if $b$ places less relative emphasis on the first dimension than any of the $a_{s(i)}$.
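The four steps above can be sketched for $d = 3$, where the vector of first-row co-factors of two item vectors is their cross product. The item vectors `a2`, `a3` and the direction `b` below are hypothetical values chosen so that `b` places less relative emphasis on the first dimension; they are not taken from the paper.

```python
import numpy as np

# Step 1: two hypothetical items (d - 1 = 2) spanning a plane in R^3.
a2 = np.array([1.0, 1.0, 0.0])
a3 = np.array([1.0, 0.0, 1.0])

# Hypothetical final-item direction with little weight on dimension 1.
b = np.array([0.2, 1.0, 1.0])

# Step 2: for d = 3 the co-factors (M_11, M_12, M_13) equal the cross product.
cofactors = np.cross(a2, a3)
# a*_s = M_11 (M_11, M_12, M_13)^T, so its first component is M_11^2 >= 0.
a_star = cofactors[0] * cofactors


def first_dim_emphasis(v):
    """Relative emphasis on the first dimension, v_1 / ||v||."""
    return v[0] / np.linalg.norm(v)


print("a*_s:", a_star)
print("a*_s . a2, a*_s . a3:", a_star @ a2, a_star @ a3)  # normal to both items

# Step 4: b has less relative emphasis on dimension 1 than either item.
print("emphasis of b, a2, a3:",
      first_dim_emphasis(b), first_dim_emphasis(a2), first_dim_emphasis(a3))

# Step 3: the angle between b and a*_s is obtuse when the inner product is negative.
print("a*_s . b:", a_star @ b)
```

For these particular values the normal is orthogonal to both items, points towards $e_1$, and has a negative inner product with `b`, which is the obtuse-angle condition of step 3.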

Formally, we can show:

Lemma D.1.

$$e_1^T(A^TWA)^{-1}b = \frac{1}{\det(A^TWA)}\sum_{s\in S^d_N}\left(\prod_{k=2}^{d} w_{s(k)}\right)a^{*T}_s b.$$


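Proof. For $j \ne 2$ (the result for $j = 2$ follows analogously) the $(1, j)$th co-factor of $A^TWA$ is

$$\begin{aligned}
C_{1j} &= (-1)^{1+j}\sum_{\sigma\in\pi(j)}(-1)^{\mathrm{sgn}(\sigma)}\prod_{k\notin\{1,j\}}\sum_{i=1}^{N} w_i a_{ik}a_{i\sigma(k)} \\
&= \sum_{i'=1}^{N} w_{i'}a_{i'2}\left[(-1)^{1+j}\sum_{\sigma\in\pi(j)}(-1)^{\mathrm{sgn}(\sigma)}a_{i'\sigma(2)}\prod_{k\notin\{1,2,j\}}\sum_{i=1}^{N} w_i a_{ik}a_{i\sigma(k)}\right] \\
&\;\;\vdots \\
&= \sum_{c\in C^d_N}\left(\prod_{k=2}^{d} w_{c(k)}a_{c(k)k}\right)(-1)^{1+j}\sum_{\sigma\in\pi(j)}(-1)^{\mathrm{sgn}(\sigma)}\prod_{k\notin\{1,j\}}a_{c(k)\sigma(k)} \\
&= \sum_{c\in C^d_N}\left(\prod_{k=2}^{d} w_{c(k)}a_{c(k)k}\right)M^c_{1j} \\
&= \sum_{s\in S^d_N}\left(\prod_{k=2}^{d} w_{s(k)}\right)\left[\sum_{\sigma\in\pi(1)}(-1)^{\mathrm{sgn}(\sigma)}\left(\prod_{k=2}^{d} a_{s(k)\sigma(k)}\right)\right]M^s_{1j} \\
&= \sum_{s\in S^d_N}\left(\prod_{k=2}^{d} w_{s(k)}\right)M^s_{11}M^s_{1j}.
\end{aligned}$$

The result is obtained by observing

$$e_1^T(A^TWA)^{-1}b = \frac{1}{\det(A^TWA)}\sum_{j=1}^{d}C_{1j}b_j.$$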
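The identity in Lemma D.1 can be read as an instance of the Cauchy-Binet formula applied to the co-factors of $A^TWA$, and it is straightforward to check numerically. In the sketch below the matrix `A`, the weights `w`, the vector `b` and the helper `first_row_cofactors` are all hypothetical; subsets of $d-1$ items are enumerated with `itertools.combinations`, which plays the role of $S^d_N$.

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
d, N = 3, 6

A = rng.uniform(0.2, 1.5, size=(N, d))   # hypothetical item vectors a_i (rows)
w = -rng.uniform(0.05, 0.25, size=N)     # assumed negative weights w_i
b = rng.uniform(0.2, 1.5, size=d)


def first_row_cofactors(rows):
    """Co-factors M_1j (j = 1..d) for a (d-1) x d block of item vectors."""
    cof = np.empty(rows.shape[1])
    for j in range(rows.shape[1]):
        minor = np.delete(rows, j, axis=1)          # drop column j
        cof[j] = (-1) ** j * np.linalg.det(minor)   # 0-based j gives (-1)^(1+j)
    return cof


M = A.T @ (w[:, None] * A)                # A^T W A with W = diag(w)

# Right-hand side: sum over all (d-1)-subsets s of prod_k w_s(k) * a*_s^T b.
total = 0.0
for s in itertools.combinations(range(N), d - 1):
    idx = list(s)
    cof = first_row_cofactors(A[idx])
    a_star = cof[0] * cof                 # a*_s = M_11 (M_11, ..., M_1d)^T
    total += np.prod(w[idx]) * (a_star @ b)
rhs = total / np.linalg.det(M)

# Left-hand side: e_1^T (A^T W A)^{-1} b.
lhs = np.linalg.solve(M, b)[0]

print("e_1^T (A^T W A)^{-1} b:", lhs)
print("subset expansion:      ", rhs)
```

The two printed values should agree up to rounding for any full-rank choice of `A` and nonzero weights.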

Lemma D.2. If $b_1/\|b\| < a_{s(i)1}/\|a_{s(i)}\|$ for $i = 2, \ldots, d$, then $a^{*T}_s b \le 0$. The inequality is strict if $A_s$ is linearly independent.

Proof. Assume that $A_s$ is linearly independent, else $a^*_s = 0$. Let $b_s$ be the projection of $b$ onto $A_s$. Then $b - b_s$ is normal to the hyperplane defined by $A_s$. Hence $b - b_s = ka^*_s$ for some $k$ since $a^*_s$ is the unique normal to $A_s$. Therefore $a^{*T}_s b = a^{*T}_s(b - b_s) = k\|a^*_s\|^2$.

We now claim $b_1 - b_{s1} < 0$. This is true since $b$ lies in the positive orthant between the hyper-planes defined by $A_s$ and by $a_1 = 0$. Since $a^*_{s1} = (M^s_{11})^2 > 0$ we must have $k < 0$, providing the result.

Proof of Corollary 6.1. Since $w_i < 0$ for all $i$, $\mathrm{sgn}(\det(A^TWA)) = (-1)^d$ and for any $s \in S^d_N$, $\mathrm{sgn}\left(\prod_{k=2}^{d} w_{s(k)}\right) = (-1)^{d-1}$. By Lemma D.1, (2) holds for $a^{*T}_s b \le 0$ with the inequality strict for some $s$. By Lemma D.2, this is the case for the ordering given by (3).


D.3 Proof of Corollary 6.2

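Following the proof of Theorem 6.1 we have that for any $i$,

$$\frac{d\theta^{N-1}_{i,\mathrm{MLE}}}{d\psi_d}(\psi_d;\mathbf{y}) = \frac{e_i^T\left(A^TWA\right)^{-1}b}{b^T\left(A^TWA\right)^{-1}b},$$

and thus

$$\frac{dS(\mathbf{y})}{d\psi_d} = \frac{\alpha^T\left(A^TWA\right)^{-1}b}{b^T\left(A^TWA\right)^{-1}b}.$$

Since $\psi^{N}_d(\mathbf{y}^1) > \psi^{N}_d(\mathbf{y}^0)$ we must have $S(\mathbf{y}^1) < S(\mathbf{y}^0)$.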

D.4 Proof of Theorem 6.3

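We begin by making the change of variables used to prove Theorem 6.1 and note that the first row of $B^*$ is $B = (b_2, -b_1)$. We also note that since $B$ is orthonormal, the posterior density for $\psi = (\psi_1, \psi_2)$ is $f^{N-1}(B^{*T}\psi|\mathbf{y}) = f^*(\psi|\mathbf{y}) = f^*_1(\psi_1|\psi_2,\mathbf{y})f^*_2(\psi_2|\mathbf{y})$.

Noting that $\theta_1 = b_2\psi_1 + b_1\psi_2$, we can write the conditional distribution of $\theta_1$ given $\psi_2$ as

$$f^{N-1}_1(\theta_1|\psi_2,\mathbf{y}) = f^*_1\left(\frac{\theta_1 - b_1\psi_2}{b_2}\,\Big|\,\psi_2,\mathbf{y}\right).$$

Following the argument in the proof of Theorem 6.1, we note that $P_N(\theta) = P_N(\psi_2)$ is only a function of $\psi_2$. Following the proof of Theorem 5.2, if $\partial\log f^{N-1}(\theta_1|\psi_2,\mathbf{y})/\partial\theta_1$ is a decreasing function of $\psi_2$, then $T[F_1(\theta_1|\mathbf{y}^1)] < T[F_1(\theta_1|\mathbf{y}^0)]$. From the proof of Lemma 3.1, this will be true if

$$0 > \frac{\partial^2\log f^{N-1}_1}{\partial\theta_1\partial\psi_2} = \frac{1}{b_2}\frac{\partial^2\log f^*_1}{\partial\psi_1\partial\psi_2} - \frac{b_1}{b_2^2}\frac{\partial^2\log f^*_1}{\partial\psi_1^2},$$

where the second term carries a minus sign because $\psi_1 = (\theta_1 - b_1\psi_2)/b_2$ decreases in $\psi_2$ at rate $b_1/b_2$ when $\theta_1$ is held fixed. Since $\partial^2\log f^*_1/\partial\psi_1^2 < 0$, this is equivalent to

$$-b_2\left[\frac{\partial^2\log f^*_1}{\partial\psi_1^2}\right]^{-1}\frac{\partial^2\log f^*_1}{\partial\psi_1\partial\psi_2} + b_1 < 0.$$

The left hand side of this inequality is now exactly (14) with $A^TWA$ replaced by $A^TWA + K$, from which we obtain (5).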

E Proof of Theorem 8.1

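Essentially, the probability of seeing any constraint in (7) becomes zero. Writing $y_{j,k}$ as the $j$th response for the $k$th subject, we will observe $\mathbf{y}_{k_1} \prec \mathbf{y}_{k_2}$ only if the event $\{y_{2j-1,k_1} < y_{2j-1,k_2} \text{ and } y_{2j,k_1} > y_{2j,k_2}\}$ does not occur for any of the successive disjoint pairs of items $2j-1$ and $2j$. When $N$ is even, there are exactly $N/2$ successive pairs of items and we have that

$$\begin{aligned}
P(\mathbf{y}_{k_1} \prec \mathbf{y}_{k_2}) &\le \prod_{j=1}^{N/2}\Big(1 - P\left(y_{2j-1,k_1} < y_{2j-1,k_2} \text{ and } y_{2j,k_1} > y_{2j,k_2}\right)\Big) \\
&= \prod_{j=1}^{N/2}\Big(1 - \left[1 - P_{2j-1}(\theta_{k_1})\right]P_{2j-1}(\theta_{k_2})\,P_{2j}(\theta_{k_1})\left[1 - P_{2j}(\theta_{k_2})\right]\Big).
\end{aligned}$$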


The second line follows since $y_{2j-1,k_1} < y_{2j-1,k_2}$ implies $y_{2j-1,k_1} = 0$ and $y_{2j-1,k_2} = 1$, with the equivalent implication for $y_{2j,k_1} > y_{2j,k_2}$. Each of the terms inside the product is uniformly bounded away from zero and one and hence the right hand side has finite sum from $N = 1, \ldots, \infty$. If there are no observed partial orderings, the constrained and unconstrained estimates will be equal. Hence

$$\sum_{N=1}^{\infty} P\left(\theta^{c}_{kN} \ne \theta^{u}_{kN}\right) \le \sum_{N=1}^{\infty}\sum_{k_1 \ne k_2} P(\mathbf{y}_{k_1} \prec \mathbf{y}_{k_2}) < \infty,$$

since the number of pairs of subjects is finite. The result now follows from the Borel-Cantelli Lemma.
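The summability step can be made explicit with one extra piece of notation. Suppose every factor in the product bound is at most $1 - \epsilon$ for some $\epsilon \in (0, 1)$; the symbol $\epsilon$ is introduced here for illustration and does not appear in the argument above. Using the $\lfloor N/2 \rfloor$ disjoint successive pairs available when $N$ is odd, a sketch of the resulting geometric bound is

```latex
\[
P(\mathbf{y}_{k_1} \prec \mathbf{y}_{k_2}) \le (1-\epsilon)^{\lfloor N/2 \rfloor},
\qquad
\sum_{N=1}^{\infty} (1-\epsilon)^{\lfloor N/2 \rfloor}
  = 1 + 2\sum_{m=1}^{\infty} (1-\epsilon)^{m}
  = 1 + \frac{2(1-\epsilon)}{\epsilon}
  = \frac{2-\epsilon}{\epsilon} < \infty .
\]
```

Multiplying by the finite number of ordered pairs $(k_1, k_2)$ keeps the sum finite, which is the condition the Borel-Cantelli Lemma requires.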
