
JOURNAL OF CLINICAL AND EXPERIMENTAL NEUROPSYCHOLOGY, 2013
http://dx.doi.org/10.1080/13803395.2012.762974

Parallel psychometric and cognitive modeling analyses of the Penn Face Memory Test in the Army Study to Assess Risk and Resilience in Servicemembers

Michael L. Thomas1, Gregory G. Brown1,2, Ruben C. Gur3,4, John A. Hansen3,4, Matthew K. Nock5, Steven Heeringa6, Robert J. Ursano7, and Murray B. Stein1

1Department of Psychiatry, University of California, San Diego, San Diego, CA, USA
2VA San Diego Healthcare System, San Diego, CA, USA
3Department of Psychiatry, University of Pennsylvania, Philadelphia, PA, USA
4Philadelphia VA Medical Center, Philadelphia, PA, USA
5Department of Psychology, Harvard University, Cambridge, MA, USA
6Institute for Social Research, University of Michigan, Ann Arbor, MI, USA
7Department of Psychiatry, Uniformed Services University of the Health Sciences, Bethesda, MD, USA

Objective: The psychometric properties of the Penn Face Memory Test (PFMT) were investigated in a large sample (4,236 participants) of U.S. Army Soldiers undergoing computerized neurocognitive testing. Data were drawn from the initial phase of the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS), a large-scale study directed towards identifying risk and resilience factors for suicidal behavior and other stress-related disorders in Army Soldiers. In this paper, we report parallel psychometric and cognitive modeling analyses of the PFMT to determine whether ability estimates derived from the measure are precise and valid indicators of memory in the Army STARRS sample. Method: Single-sample cross-validation methodology combined with exploratory factor and multidimensional item response theory techniques were used to explore the latent structure of the PFMT. To help resolve rotational indeterminacy of the exploratory solution, latent constructs were aligned with parameter estimates derived from an unequal-variance signal detection model. Results: Analyses suggest that the PFMT measures two distinct latent constructs, one associated with memory strength and one associated with response bias, and that test scores are generally precise indicators of ability for the majority of Army STARRS participants. Conclusions: These findings support the use of the PFMT as a measure of major constructs related to recognition memory and have implications for further cognitive–psychometric model development.

Keywords: Psychometric modeling; Item response theory; Cognitive modeling; Penn Face Memory Test; Army Study to Assess Risk and Resilience in Servicemembers.

Psychometric analyses provide the bedrock for accurate test interpretation in clinical neuropsychological practice and research (Campbell & Fiske, 1959; Cronbach & Meehl, 1955; Loevinger, 1957). It is essential that studies of mental health provide clear evidence that test scores are reliable and valid indicators of psychological constructs.

Army STARRS was sponsored by the Department of the Army and funded under cooperative agreement U01MH087981 with the U.S. Department of Health and Human Services, National Institutes of Health, National Institute of Mental Health (NIH/NIMH). The contents are solely the responsibility of the authors and do not necessarily represent the views of the Department of Health and Human Services, NIMH, the Department of the Army, or the Department of Defense.

Address correspondence to Gregory G. Brown, San Diego VA Medical Center, Psychology Service (116b), 3350 La Jolla Village Drive, San Diego, CA 92161-0002, USA (E-mail: [email protected]).

Modern psychometric techniques, including multidimensional item response theory (MIRT; see Reckase, 2009), constitute the preeminent methodology for such analyses. In this paper, we explore MIRT models of the Penn Face Memory Test (PFMT; Gur et al., 1993; Gur, Jaggi, Shtasel, & Ragland, 1994; Gur et al., 2001; Gur et al., 1997;

© 2013 Taylor & Francis


Gur et al., 2010) and improve interpretations of parameter estimates using results from a parallel cognitive modeling analysis. These results are used to assess the measurement precision of the PFMT as a component of the neuropsychological battery being administered to U.S. Soldiers in the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS; Army STARRS, 2011). Army STARRS is a large-scale study directed towards the identification of biological and psychosocial risk and resilience factors for suicidal behavior and other stress-related disorders. The present study serves to establish the measurement properties of the PFMT and provides a basis for future development of predictive models.

Context of analyses

Army STARRS is a collaborative research effort in response to the alarming increase in suicide rates among Army Soldiers over the past several years (U.S. Army, 2010). The study, which is divided into several components, includes historical data, prospective data collection, and biological sample collection across a number of substudies. The New Soldier Study, a component of Army STARRS, involves the administration of computerized questionnaires, including psychiatric symptom inventories, psychological assessments, and neuropsychological tests, to Army Soldiers at the onset of basic training. The final sample for the New Soldier Study is expected to comprise tens of thousands of participants. Data collection for the project began in 2011 and was still underway at the time of writing this manuscript.

The neuropsychological battery being administered to Soldiers establishes baseline cognitive functioning with the future aim of identifying markers of risk and resilience. Tests were designed to provide efficient assessment of neurocognitive domains that can be linked to specific brain systems (see Penn Computerized Neurocognitive Battery; Gur, Erwin, & Gur, 1992; Gur et al., 2012; Gur et al., 2010). Army STARRS test data will be linked with demographic data, measures of mental health, and genetic markers collected in the final sample. Therefore, it is important that these data are reliable and valid indicators of Army Soldiers' neurocognitive functioning.

Model development

Analyses of data drawn from psychological instruments rely on models that relate test items to latent abilities. Reliability coefficients from both classical test theory (e.g., Cronbach's alpha) and common factor theory (e.g., omega) are based on specific assumptions about underlying measurement structures (e.g., tau-equivalent or common factor; see McDonald, 1999). Verification of such assumptions, including a proper understanding of the test space, pattern of loadings, and factor correlations, is needed to derive accurate reliability and validity coefficients. Unfortunately, the process of mapping relations between observed data and latent abilities in psychological assessment is complex. There is no guarantee, for example, that item success within a specific scale requires just a single cognitive ability; there is also no guarantee that success is determined equally by the same abilities for all test takers (see Haynes, Smith, & Hunsley, 2011). In neuropsychological assessment, this opacity is worsened when separate cognitive processes are involved in task performance (see Brandt, 2007), further hindering the goal of establishing the construct validity of test scores (see Strauss & Smith, 2009).
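To make the dependence on measurement assumptions concrete, the sketch below contrasts the two coefficient families using the "psych" package cited later in the Method section; the simulated loadings and sample size are invented for illustration.

```r
library(psych)

set.seed(1)
lambda <- c(0.8, 0.7, 0.6, 0.5)   # hypothetical common-factor loadings
theta  <- rnorm(500)              # latent ability
resp   <- sapply(lambda, function(l) l * theta + rnorm(500, sd = sqrt(1 - l^2)))

alpha(resp)$total$raw_alpha          # Cronbach's alpha: assumes tau-equivalence
omega(resp, nfactors = 1)$omega.tot  # McDonald's omega: common factor model
```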

A combination of theory and empirical evidence is needed to overcome the difficulties associated with measuring latent abilities. Combining principles from correlational and experimental psychology in model development (cf. Cronbach, 1957) is exemplified in the current methodology known as cognitive psychometrics (see Batchelder, 2010). In analyses of the PFMT, cognitive–psychometric modeling can be used to explain unknown aspects of the test's latent factor structure with experimentally derived cognitive architecture. To do so, we start with a firm understanding of the PFMT's theoretical bases.

The PFMT was designed to measure visual episodic memory for unstructured stimuli, a latent ability sensitive to right temporal lobe dysfunction (Gur et al., 1993; Gur et al., 1994). The task parallels aspects of commonly administered measures of verbal episodic memory (e.g., Delis, Kramer, Kaplan, & Ober, 2000), requiring examinees first to memorize a list of faces and then to identify test items as either targets (old faces) or distractors (new faces). Although there is strong evidence that neuropsychological tests can dissociate visual episodic memory from other cognitive abilities (Cabeza & Kingstone, 2006; Carroll, 1993; Lezak, Howieson, Loring, Hannay, & Fischer, 2004), it is more difficult to delineate specific subsystems and/or levels of processing within the domain itself (e.g., Baddeley & Hitch, 1974; Cowan, 1988). Indeed, it is well known that episodic memory tasks activate several regions of the prefrontal cortex and medial temporal lobe in brain imaging studies


(Cabeza & Kingstone, 2006; Gur et al., 1997; Kelley et al., 1998).

A parsimonious explanation of recognition memory test performance is offered by cognitive modeling studies based on signal detection theory (see Green & Swets, 1966; Wickens, 2002), which have suggested that differences between target and distractor responses can be attributed to two general constructs: (a) response bias; and (b) memory strength. Interplay between the constructs can be modeled using two unequal-variance Gaussian distributions (one for targets and one for distractors) and a set of decision criteria (Wixted, 2007). The free parameters of the model are d′, the distance between the means of the target and distractor distributions; σ²T, the variance of the target distribution; and C, response bias (the distance between criterion placement and the midpoint of d′; see Snodgrass and Corwin, 1988; additional details are provided in the Appendix). Memory strength is related to d′ and σ²T, where larger values of d′ and smaller values of σ²T equate with better accuracy. Response bias is represented by the C parameter, where higher values of C lead to conservative responding (favoring distractors), and lower values of C lead to liberal responding (favoring targets). Note that this definition of response bias is not synonymous with poor effort. Because the signal detection model has been widely discussed in the literature, specific formulas are not presented here. Importantly, parameters of the model can be conceptualized at the level of individual respondents (e.g., Ingham, 1970) so that differences in d′, σ²T, and C can be directly related to parameter values from psychometric models to better understand a test's internal structure.
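As an illustration of the model just described, the sketch below computes predicted hit and false-alarm rates from d′, σ²T, and C under one common parameterization (distractor evidence distributed N(0, 1), criterion expressed as a distance C from the midpoint of d′); the parameterization and example values are assumptions, since the article deliberately omits the formulas.

```r
# Unequal-variance signal detection model: predicted response rates
uvsd_rates <- function(dprime, sigma2_T, C) {
  crit <- dprime / 2 + C                           # criterion location
  hit  <- pnorm((dprime - crit) / sqrt(sigma2_T))  # P(target evidence > criterion)
  fa   <- pnorm(-crit)                             # P(distractor evidence > criterion)
  c(hit = hit, false_alarm = fa)
}

uvsd_rates(dprime = 1.5, sigma2_T = 1.4, C = 0)    # neutral responding
uvsd_rates(dprime = 1.5, sigma2_T = 1.4, C = 0.5)  # conservative: favors distractors
```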

The psychometric models used in our analyses are all based on the multidimensional extension of an item response theory graded response model (M-GRM; de Ayala, 1994; Muraki & Carlson, 1995; Samejima, 1997). The M-GRM is a psychometric framework for test data comprising ordered categorical responses. For the PFMT, examinees must choose from four responses for both target and distractor items, indicating their confidence in having seen the test item previously: "definitely yes," "probably yes," "probably no," or "definitely no." The M-GRM first relies on a logistic distribution to model the probability that examinee i (N total) will respond with the kth or higher category (K total) to item j (J total) as a function of the item category threshold (djk), item discrimination (ajm), and examinee ability¹ (θim) parameters by

¹The terms "ability," "factor," and "construct" are similarly used to indicate a latent variable.

$$P(u_{ij} \ge k \mid d_{jk}, \mathbf{a}_j, \boldsymbol{\theta}_i) = \frac{\exp\left(\sum_{m=1}^{M} a_{jm}\theta_{im} + d_{jk}\right)}{1 + \exp\left(\sum_{m=1}^{M} a_{jm}\theta_{im} + d_{jk}\right)}, \quad (1)$$

where aj and θi are vectors with M elements (one per factor), and uij is the realization of the response variate. From these values, the probabilities of responding in each of the categorical response options are then given by

$$P(u_{ij} = 1 \mid d_{j1}, \mathbf{a}_j, \boldsymbol{\theta}_i) = 1 - P(u_{ij} \ge 2 \mid d_{j1}, \mathbf{a}_j, \boldsymbol{\theta}_i), \quad (2)$$

$$\ldots$$

$$P(u_{ij} = k \mid d_{jk}, \mathbf{a}_j, \boldsymbol{\theta}_i) = P(u_{ij} \ge k \mid d_{jk}, \mathbf{a}_j, \boldsymbol{\theta}_i) - P(u_{ij} \ge k + 1 \mid d_{j,k+1}, \mathbf{a}_j, \boldsymbol{\theta}_i), \quad (3)$$

$$\ldots$$

$$P(u_{ij} = K \mid d_{jK}, \mathbf{a}_j, \boldsymbol{\theta}_i) = P(u_{ij} \ge K \mid d_{jK}, \mathbf{a}_j, \boldsymbol{\theta}_i). \quad (4)$$

As can be seen, the probability of responding with the first categorical response option is simply one minus the probability of responding with any other option. The probability of responding with an intermediate categorical response option is the difference between the probability of responding in category k or higher and the probability of responding in category k + 1 or higher. Finally, the probability of responding with the last categorical response option is just the probability of responding with this option alone. Equations 2, 3, and 4 rely on the restriction that the item threshold parameters (djk) are strictly ordered from largest to smallest from the first to last categorical response options, respectively. This assures that the response probabilities for a fixed ability level decrease from one category to the next (i.e., that correct PFMT response options are more difficult than incorrect PFMT response options).
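A compact sketch of Equations 1 through 4 for one hypothetical two-dimensional item may help; the parameter values below are invented, and the 1.702 scaling constant mentioned later in the Method section is omitted for simplicity.

```r
# Category probabilities for a graded response item with K = 4 options
grm_probs <- function(theta, a, d) {
  p_ge <- c(1, plogis(sum(a * theta) + d), 0)  # P(u >= k); P(u >= 1) = 1, P(u >= K+1) = 0
  diff(-p_ge)                                  # P(u = k) = P(u >= k) - P(u >= k+1)
}

a <- c(0.9, 0.4)         # discriminations on two latent dimensions
d <- c(2.0, 1.0, -0.5)   # strictly decreasing thresholds d_jk
grm_probs(theta = c(0, 0), a, d)   # four category probabilities; sums to 1
```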

The item threshold (djk) parameter is inversely related to item difficulty (i.e., higher values indicate lower difficulty). The item discrimination (aj) parameter can be interpreted similarly to a factor loading or an item-by-total biserial correlation. In a graphical representation of the item response theory (IRT) model, the djk and aj parameters determine the locations and slopes of the items' category response functions. An example is illustrated in the top panel of Figure 1 for a hypothetical PFMT item. The plot shows that the probability of responding in the kth of K categories (y-axis) is a function of ability (x-axis). The exact shapes of the curves depend on the item parameters.


Figure 1. Category response curves for hypothetical unidimensional (top) and multidimensional (bottom) Penn Face Memory Test items. DY = "definitely yes," PY = "probably yes," PN = "probably no," and DN = "definitely no."

The θ parameter represents examinees' standings on the cognitive ability assumed to influence individual differences in item responses. The model represents item response variance as a function of examinee ability, item threshold (inverse of difficulty), and item discrimination.


Figure 2. Unidimensional (left), between-item multidimensional (center), and within-item multidimensional (right) models.

The vector notations aj and θi used in Equations 1, 2, 3, and 4 imply that multiple cognitive abilities can influence responses. In the case where all items can discriminate between distinct levels of all abilities, the model is said to have within-item multidimensional structure. When items can discriminate just one of two or more abilities, the model is said to have between-item multidimensional structure. And when all items can discriminate just a single ability, the model is said to have unidimensional structure. The within-item multidimensional model is the most general, and the unidimensional model is the most restrictive. These three nested variants are depicted in Figure 2. Examples of within-item multidimensional item category response functions are provided in the bottom panel of Figure 1. Note that the hypothetical item can discriminate between levels of multiple abilities (i.e., the curvatures of the category response functions change with level of ability along both axes). Several sources can be consulted for further description of item response theory models in psychological and clinical assessment (Embretson & Reise, 2000; Reise & Waller, 2009; Thomas, 2011).
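For readers who want to see the three structures side by side, the sketch below shows how they might be specified with the "mirt" package named later in the Method section; the item-to-factor assignments are illustrative only (the PFMT interleaves targets and distractors rather than blocking items 1-20 and 21-40).

```r
library(mirt)

# Unidimensional: all 40 items load on a single factor
uni_spec <- mirt.model("F1 = 1-40")

# Between-item multidimensional: each item loads on exactly one factor
# (illustrative split; not the PFMT's actual target/distractor ordering)
between_spec <- mirt.model("
  F1 = 1-20
  F2 = 21-40")

# Within-item multidimensional: every item loads on both factors, which is
# what an exploratory two-factor call such as mirt(resp, 2) estimates
```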


Measurement precision

Item response theory (IRT) offers a robust and comprehensive alternative to classical test theory (CTT). Both IRT and CTT provide descriptions of and formulas for measurement precision; however, there are important differences between them. In CTT, standard error (SEtot) is derived from test reliability by

$$SE_{\text{tot}} = \sigma_{\text{tot}}\sqrt{1 - \rho}, \quad (5)$$

where σtot is the population standard deviation of the total number-correct true score, and ρ is the population reliability coefficient (Magnusson, 1966). Standard error in IRT is derived from item information: the instantaneous change in the probability of a correct response on a particular item at a particular ability level, normalized by item variance at that ability (Lord, 1980). An item where a small change from a particular ability value produces a large normalized change in the probability of a correct response is informative at that ability. This occurs when item difficulty is closely matched with ability and when item discrimination is high (Baker & Kim, 2004). Test information at a particular ability (TIθ) is simply the sum of item information. From this, standard error as a function of ability, SEθ, is given by

$$SE_{\theta} = \frac{1}{\sqrt{TI_{\theta}}}. \quad (6)$$

Whereas standard expositions of CTT provide a single estimate of total-score measurement error based on reliability, IRT provides distinct estimates of measurement error that can vary depending on ability; that is, potentially unique values of SEθ can be assumed for unique values of θ.
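A small numeric contrast of Equations 5 and 6 may be useful; all values below, including the hypothetical test information function, are invented for illustration.

```r
# CTT (Equation 5): one standard error for the whole test
sigma_tot <- 5.0                         # SD of total scores
rho       <- 0.75                        # reliability coefficient
se_tot    <- sigma_tot * sqrt(1 - rho)   # 2.5 score points, regardless of ability

# IRT (Equation 6): standard error varies with ability via test information
test_info <- function(theta) 8 * dnorm(theta, 0, 1.5)    # hypothetical TI(theta)
se_theta  <- function(theta) 1 / sqrt(test_info(theta))
se_theta(c(-2, 0, 2))   # smallest error where the test is most informative
```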

It has been shown that CTT could be extended to include the concept of test information and that IRT could be extended to include the concept of reliability (e.g., Mellenbergh, 1996). For example, reliability could be calculated in IRT across a range of ability estimates using the average standard error (Andrich, 1988). Yet, in practice, information and reliability are rarely both computed from a particular test model. CTT estimates of measurement precision are almost always reported as true-score reliability, and IRT estimates of measurement precision are almost always reported as information. Furthermore, because CTT estimates of reliability are for particular items administered to particular groups of examinees, standard error is often reported as a single value that confounds examinee and test properties (Embretson & Reise, 2000). Within the IRT framework, item properties, such as item difficulty, are modeled in addition to ability, facilitating estimates of a test's psychometric properties that are independent of examinee properties (Embretson & Reise, 2000).

As a practical tool for researchers, IRT could easily offer both SEθ and ρ for evaluating measurement precision. Evaluations of SEθ involve comparisons among differing levels of ability. Usually, graphical representations of the SEθ function are used to determine ability locations with relatively better or worse measurement precision. By doing so, a test evaluator can determine whether a test is optimized for measurement within a specific population and a specific range of ability. Cutoffs for acceptable SEθ, however, are less commonly offered, which can sometimes lead to ambiguous interpretations of data. Unlike SEθ, acceptable cutoffs for ρ are commonly offered (.60 to .80, depending on the specific type of coefficient; see Haynes et al., 2011). Typically, we should expect that a favorable ρ will accompany a favorable SEθ function with respect to a specific population. That is, reliability should be high when areas of low measurement error overlap with areas of high density in ability. However, because reliability is a reflection of average standard error, implying that it is possible to find equivalent reliabilities for tests with very different standard error functions, both indices of measurement precision should be evaluated.
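One convention that makes this link explicit is marginal reliability, computed from the average squared standard error when ability is scaled to unit variance; this is only one of several possible definitions, sketched here with made-up values.

```r
marginal_reliability <- function(se) 1 - mean(se^2)   # assumes var(theta) = 1

se_hat <- c(0.45, 0.50, 0.62, 0.70, 0.95)   # hypothetical per-examinee SEs
marginal_reliability(se_hat)                # larger average SE -> lower reliability
```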

Study goals

The first goal of this study was to find an appropriate measurement model for the PFMT. This was accomplished by fitting nested measurement structures based on the M-GRM to a training data set and then cross-validating the best fitting model in a validation data set. To improve interpretability, the psychometric results were then compared to cognitive modeling parameters derived from the unequal-variance signal detection model. Our second goal was to evaluate the measurement precision of PFMT latent ability scores. To do so, we compared SEθ values and evaluated ρ over the range of ability observed in Army STARRS.

METHOD

Participants

Army Soldiers were recruited to volunteer without compensation for the Army STARRS New Soldier Study at the start of basic training.


The current sample comprises participants from three Army bases in the United States tested between February and June of 2011. All Soldiers were asked to provide informed, written consent prior to participation in research. Army commanders provided sufficient time to complete all surveys and tests, which were administered in a group format using laptop computers. Research proctors monitored the testing environment and assisted with questions and technical difficulties. Surveys and tests were administered in a fixed order in 90-min sessions over two days of testing. The PFMT was administered on the second day of testing.

A total of 5,551 Army Soldiers recruited into the study participated in the New Soldier Survey during the specified time frame. Of these, approximately 15% did not complete the PFMT (e.g., some did not show for the second day of testing), and another 8% were removed due to technical problems or potentially invalid data. Specifically, during one administration a subset of participants experienced a testing software malfunction. The software anomalies were corrected, and incomplete response profiles were flagged as errant. Also, Soldiers who provided the same response to 12 or more consecutive items were flagged as errant. This automated rule for identifying invalid profiles on the PFMT was developed based on previous large-scale studies using the Penn Computerized Neurocognitive Battery (Gur et al., 2012). Profiles flagged as errant were not used for psychometric modeling. Automated quality control is required due to the complexities of validating thousands of individual response profiles in large-scale studies. Other large-scale studies have reported similar exclusion rates due to technical problems (Hoerger, Quirk, & Weed, 2011).

These procedures resulted in a final sample of 4,236 participants for psychometric modeling. Demographic characteristics of participants with PFMT data and participants without PFMT data are shown in Table 1. No group differences were found for age and handedness; however, there tended to be fewer males, better educational attainment, and less ethnic and racial diversity in the group of participants with valid PFMT data.

TABLE 1
Demographic characteristics of participants with Penn Face Memory Test data present versus absent or excluded

Characteristic                           Data present   Data absent/excluded   p(a)
Age
  M                                      21.49          21.35                  .30
  SD                                     4.34           4.44
Gender
  Male                                   76             84                     <.001
Education
  GED                                    7              9                      .04
  HS Diploma                             46             54                     <.001
  Post high school no certificate       26             18                     <.001
  Post high school certificate           4              5                      .41
  2-year degree                          7              8                      .14
  4-year degree                          9              5                      <.001
  Some graduate                          2              2                      .91
Handedness
  Right-handed                           86             87                     .47
  Left-handed                            11             10                     .48
  Either                                 3              3                      .94
Race
  White                                  68             62                     <.001
  Black                                  21             23                     .04
  American Indian or Alaskan Native      2              2                      .90
  Asian                                  3              3                      .62
  Native Hawaiian or Pacific Islander    1              1                      .25
  Other                                  6              9                      <.001
Ethnicity
  Not Spanish/Hispanic/Latino            86             82                     .003
  Mexican                                5              7                      .005
  Puerto Rican                           4              5                      .17
  Cuban                                  <1             1                      .51
  Other Spanish/Hispanic/Latino          5              5                      .86

Note. (a)Significance (p) based on t tests for continuous variables and χ2 tests for categorical variables.


Measure

The PFMT was administered as part of a neurocognitive battery designed for efficient computerized assessment (Gur et al., 2001, 2010). The test begins by showing examinees 20 faces that they will be asked to identify later. Faces are shown in succession for an encoding period of 5 seconds each. After this initial learning period, examinees are immediately shown a series of 40 faces (20 targets and 20 distractors) and are asked to decide whether they have seen each face before by choosing 1 of 4 ordered categorical response options: 1 = "definitely yes"; 2 = "probably yes"; 3 = "probably no"; or 4 = "definitely no." For consistency with IRT's assumption of a monotonically increasing response function, we recoded responses prior to analyses so that higher valued responses always implied more accurate responses (i.e., 1 is the least correct answer, and 4 is the most correct answer for all items). Stimuli consist of black-and-white photographs of faces presented on a black background. All faces were rated as having neutral expressions and were balanced for gender and age (Gur et al., 1993, 2001). Examinees' responses and response times are recorded during test administration; however, there are no time limits during recognition testing or explicit instructions to work quickly. The PFMT takes approximately 4 minutes to administer.
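The recoding step can be sketched as follows; the function and object names are hypothetical, and the rule simply restates the convention described above (for targets, "definitely yes" is the most correct answer and must be reverse-scored).

```r
# resp: hypothetical N x 40 matrix of raw responses (1-4); is_target: logical flags
recode_pfmt <- function(resp, is_target) {
  resp[, is_target] <- 5 - resp[, is_target]  # reverse-code targets
  resp                                        # distractors are already ordered
}
```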

Typical PFMT scoring procedures produce the number of correctly recognized targets, the number of correctly rejected distractors, and median response times for correct responses. There are no cutoffs for qualitative interpretations of performance. In previously published studies, the reliability of accuracy scores was .75, and the reliability of time scores was .87 (Gur et al., 2010). PFMT scores have also demonstrated expected associations with demographic variables such as gender (women perform better), age (younger examinees perform better), and education (more educated examinees perform better), and its strongest intercorrelations are with measures of word memory and spatial memory (Gur et al., 2010).

Model analysis

Data were analyzed using a single-sample cross-validation. The sample was first randomly split into a training set (N = 2,118) and a validation set (N = 2,118). Exploratory psychometric analyses were used to determine the number of latent factors present in the data by examining eigenvalues (including a parallel analysis that compared observed eigenvalues to 10 randomly simulated eigenvalue sets) and factor loadings for multiple varimax-rotated (orthogonal) solutions (polychoric correlations were analyzed in both techniques). We next fit exploratory between-item multidimensional and within-item multidimensional graded response models to the data (both variations of Equation 1; see Figure 2) and compared the relative fit of each using Akaike information criterion (AIC) and Bayesian information criterion (BIC) values. AIC and BIC values penalize overparameterized models and become smaller with better fit (for the use of AIC and BIC in IRT, see de Ayala, 2009).
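A hedged reconstruction of this pipeline, reusing the specification sketch shown earlier and the packages named below in the Method section, might look as follows; 'train' stands for a hypothetical training-half response matrix, and none of this is the authors' exact code.

```r
library(psych)
library(mirt)

# Parallel analysis on polychoric correlations, with 10 simulated sets as in the text
fa.parallel(train, cor = "poly", fa = "fa", n.iter = 10)

# Between- vs. within-item 2-factor graded response models; MHRM is the
# stochastic (Metropolis-Hastings Robbins-Monro) algorithm named below
fit_between <- mirt(train, between_spec, itemtype = "graded", method = "MHRM")
fit_within  <- mirt(train, 2, itemtype = "graded", method = "MHRM")
anova(fit_between, fit_within)   # reports AIC and BIC; smaller is better
```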

The best fitting exploratory model from the training data was then cross-validated in the validation data to determine whether the measurement structure would generalize across samples. Item fit was evaluated by assessing the amount of local dependence remaining between items (Chen & Thissen, 1997). Large values of local dependence [residual association (ϕc²) > .2] imply poor model fit (see de Ayala, 2009). This is because local dependencies, that is, unaccounted-for common variance between items, can distort parameter estimates and lead to incorrect conclusions regarding the reliability and validity of test scores. After verifying fit of the chosen model in the validation data, the entire sample of 4,236 participants was pooled in order to estimate final item and examinee parameter values. Lastly, to better understand the latent dimensions themselves, we compared fitted values to individual parameter estimates derived from a signal detection theory analysis.
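One way to run this item-fit check with the "mirt" package is sketched below; 'validate' stands for a hypothetical validation-half matrix, and residuals() returns the pairwise Chen–Thissen statistics that the article summarizes as ϕc² values against the .2 cutoff.

```r
# Fit the chosen 2-factor model to the held-out validation half
fit_val <- mirt(validate, 2, itemtype = "graded", method = "MHRM")

# Pairwise local dependence matrix (Chen & Thissen, 1997)
ld <- residuals(fit_val, type = "LD")
hist(abs(ld[lower.tri(ld)]))   # distribution of dependencies, as in Figure 3
```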

All analyses were conducted using the R language (R Development Core Team, 2011). Eigenvalues and factor loadings were estimated using the "psych" package for R (Revelle, 2011). Exploratory factor analyses were based on maximum likelihood estimation with orthogonal (varimax) rotations. MIRT model parameters were estimated using the "mirt" package for R (Chalmers, 2011), which uses a stochastic approximation algorithm (see Cai, 2010) to fit exploratory parameter values to data. Data missing at random are accommodated in the package within a Metropolis–Hastings Robbins–Monro algorithm. Parameter estimates are based on the logistic model with the added multiplicative constant 1.702, which produces results that are nearly identical to the normal ogive model (see Haberman, 1974). Procedures and software for estimating individual differences in the unequal-variance signal detection model are less well established. Therefore, we used R to program a Metropolis–Hastings within-Gibbs sampler for a hierarchical Bayesian representation of the model and used approximated expected a posteriori (EAP) values as the final parameter estimates. Additional details about this procedure are provided in the Appendix.
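The following is a deliberately simplified, single-examinee sketch of this kind of sampler, reduced to binary old/new decisions with assumed weakly informative priors; the authors' hierarchical implementation over confidence ratings (see their Appendix) is more elaborate.

```r
# Log posterior for one examinee's (d', C, log sigma_T) given hit and
# false-alarm counts out of 20 targets and 20 distractors
log_post <- function(par, hits, fas, n = 20) {
  dprime <- par[1]; C <- par[2]; sigma_T <- exp(par[3])
  crit <- dprime / 2 + C                       # criterion on the evidence axis
  ph <- pnorm((dprime - crit) / sigma_T)       # hit rate
  pf <- pnorm(-crit)                           # false-alarm rate
  dbinom(hits, n, ph, log = TRUE) + dbinom(fas, n, pf, log = TRUE) +
    dnorm(dprime, 1, 1, log = TRUE) +          # assumed priors
    dnorm(C, 0, 1, log = TRUE) + dnorm(par[3], 0, 0.5, log = TRUE)
}

# Random-walk Metropolis; EAP estimates are posterior means after burn-in
mh_uvsd <- function(hits, fas, n_iter = 5000) {
  draws <- matrix(NA, n_iter, 3)
  cur <- c(1, 0, 0)
  for (i in seq_len(n_iter)) {
    prop <- cur + rnorm(3, 0, 0.1)
    if (log(runif(1)) < log_post(prop, hits, fas) - log_post(cur, hits, fas))
      cur <- prop
    draws[i, ] <- cur
  }
  colMeans(draws[-(1:1000), ])   # EAP for d', C, and log sigma_T
}

mh_uvsd(hits = 16, fas = 5)
```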



RESULTS

Exploratory analysis

Training data

The training data were analyzed using exploratory psychometric techniques. The results reveal that the first 10 eigenvalues are all greater than one (8.03, 4.56, 1.76, 1.42, 1.31, 1.23, 1.21, 1.11, 1.08, and 1.04), which is sometimes considered to be the criterion for retention. In addition, when the eigenvalues were compared against values based on random simulations (parallel analysis), a total of 13 were sufficient to retain. Despite this, the first two eigenvalues appear to be dominant, and the amount of variance explained quickly diminishes to a nearly constant value by the third factor (20%, 11%, 4%, 4%, 3%, 3%, 3%, 3%, 3%, and 3%).

The loadings for 1-, 2-, and 3-factor exploratory solutions are presented in Table 2. The results appear to favor a 2-factor solution.

TABLE 2
Loadings for exploratory 1-, 2-, and 3-factor solutions

                     1-Factor    2-Factor          3-Factor
Type        No.      λ           λ1      λ2        λ1      λ2      λ3
Target      3        0.40        0.19    0.47      0.15    0.50    0.18
Target      4        0.38        0.16    0.49      0.11    0.56    0.16
Target      5        0.32        0.08    0.52      0.03    0.57    0.19
Target      6        0.36        0.09    0.56      0.04    0.63    0.20
Target      8        0.14       −0.06    0.37     −0.07    0.29    0.24
Target      9        0.40        0.21    0.43      0.21    0.30    0.31
Target      10       0.21        0.04    0.36      0.01    0.37    0.15
Target      14       0.41        0.14    0.58      0.15    0.36    0.46
Target      19       0.33        0.08    0.53      0.07    0.41    0.35
Target      20       0.34        0.09    0.52      0.08    0.41    0.33
Target      21       0.49        0.20    0.66      0.20    0.43    0.50
Target      22       0.39        0.11    0.60      0.12    0.37    0.47
Target      25       0.11       −0.08    0.34     −0.07    0.21    0.27
Target      28       0.21        0.00    0.41      0.03    0.18    0.39
Target      29       0.43        0.18    0.56      0.23    0.19    0.60
Target      31       0.23        0.01    0.45      0.03    0.19    0.44
Target      34       0.14       −0.05    0.37     −0.02    0.08    0.44
Target      35       0.18       −0.04    0.42      0.00    0.11    0.48
Target      36       0.22       −0.03    0.49     −0.01    0.23    0.46
Target      40      −0.01       −0.14    0.23     −0.10   −0.04    0.36
Distractor  1        0.51        0.46    0.23      0.51    0.01    0.31
Distractor  2        0.20        0.24   −0.01      0.27   −0.11    0.10
Distractor  7        0.29        0.40   −0.10      0.44   −0.21    0.06
Distractor  11       0.38        0.44    0.01      0.46   −0.07    0.09
Distractor  12       0.45        0.49    0.05      0.51   −0.01    0.09
Distractor  13       0.48        0.54    0.03      0.57   −0.03    0.09
Distractor  15       0.45        0.53   −0.01      0.54   −0.03    0.03
Distractor  16       0.35        0.47   −0.10      0.49   −0.10   −0.03
Distractor  17       0.70        0.60    0.35      0.60    0.28    0.22
Distractor  18       0.68        0.60    0.30      0.59    0.27    0.15
Distractor  23       0.49        0.56    0.00      0.56    0.05   −0.03
Distractor  24       0.50        0.57    0.04      0.57    0.05    0.01
Distractor  26       0.53        0.60    0.03      0.57    0.16   −0.10
Distractor  27       0.63        0.65    0.16      0.62    0.26   −0.02
Distractor  30       0.30        0.45   −0.16      0.43    0.00   −0.22
Distractor  32       0.52        0.59    0.03      0.57    0.16   −0.10
Distractor  33       0.67        0.65    0.20      0.62    0.33   −0.03
Distractor  37       0.70        0.65    0.27      0.63    0.34    0.05
Distractor  38       0.51        0.54    0.08      0.51    0.21   −0.09
Distractor  39       0.70        0.66    0.26      0.64    0.31    0.06

Note. λ = factor loading. Largest values shown in bold for 2- and 3-factor solutions.


Specifically, 2 factors maximized the number of items for which there is just one large, primary loading per item. For ease of interpretation, items in Table 2 are ordered by targets (first 20 rows; item numbers 3, 4, . . . 40) and then distractors (second 20 rows; item numbers 1, 2, . . . 39). As can be seen, this provides a clear distinction in the 2-factor solution: Factor 1 is primarily related to correct rejections, and Factor 2 is primarily related to hits. For clarity, the first of these is referred to as the distractor-loaded factor, and the second is referred to as the target-loaded factor. The 1-factor solution is difficult to interpret, as it appears to create one amalgam of factors with no consistent pattern of loadings. The 3-factor solution retains the distractor-loaded factor, but splits the target-loaded factor into two subfacets, with items tending to load onto the first of these early in the test and onto the second late in the test. The results of 4- or more-factor solutions (not reported here) suggest a high number of difficult-to-interpret single-item factors or factors with no dominant loadings at all. Because the amount of variance accounted for by the first two factors could be readily distinguished from higher factor solutions, coupled with the interpretability of the two-factor solution and its parsimony, we decided to proceed with the analysis using the 2-factor model. We return to this issue in the Discussion.

To explore the need for nonprimary ability loadings, 2-factor M-GRMs were fit to the data to determine whether items are better modeled by within- or between-item multidimensionality (i.e., if the nonprimary ability loadings could be eliminated; center panel of Figure 2 vs. right panel of Figure 2). Both within- and between-item models converged without difficulty. The AIC and BIC values for the within-item model (162,433.30 and 163,564.90) were consistently lower than the AIC and BIC values for the between-item model (163,104.50 and 164,009.80). This suggests that within-item multidimensional structure provides a better fit than between-item multidimensional structure, even after accounting for the former's increased complexity due to a higher number of parameters that must be estimated.

Cross validation

The training data suggest that the latent structure of the PFMT is adequately fit by a 2-factor within-item M-GRM. To confirm this, the model was next applied to the validation data and was evaluated for item fit. Local dependence values are shown in the top panel of Figure 3. As can be seen, the majority of values are less than .10, and none are greater than .20. The average value was .08, which implies overall good fit. Specifically, it appears as though all major sources of common variance between items can be attributed to just two latent variables without great ill effect on the final parameter estimates.

A correlation plot (heat map) showing specific item-pair local dependencies is shown in the bottom panel of Figure 3. Areas of concentration (darker colors) are indicative of systematic model fit issues. As was done in Table 2, the items are grouped by targets (Items 3, 4, . . . , 40) and then distractors (Items 1, 2, . . . , 39). Again, it can be seen that most item-pair associations were very low. However, there does appear to be a tendency for items administered later in the PFMT task (i.e., Items 30 through 40) to show higher local dependence values than other item pairings.

Comparison to cognitive model

Item parameter estimates derived from the entire sample are reported in Table 3, and graphical representations in the form of item vectors are shown in Figure 4 (for the generation of these plots, see Reckase, 2009). Panel A of Figure 4 shows item vectors plotted in a coordinate system comprising orthogonal axes (θ1 and θ2), which we have called the distractor- and target-loaded factors, respectively. The angles formed between the vectors and axes convey the directions of association between the items and factors (i.e., smaller angles imply greater relative discrimination). The overall vector lengths, which are equal to the items' multidimensional discrimination parameters, convey the magnitudes of association. The origins of the vectors are related to the items' thresholds; smaller valued coordinates imply less difficulty. Because each item is associated with three thresholds, each item has three vectors as well.
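The vector quantities described here follow Reckase (2009) and can be computed directly from the Table 3 estimates; the short sketch below uses Target 3 as a worked example.

```r
a <- c(0.44, 0.37)                    # a1, a2 for Target 3 (Table 3)
mdisc  <- sqrt(sum(a^2))              # multidimensional discrimination (vector length)
angles <- acos(a / mdisc) * 180 / pi  # angles with the theta1 and theta2 axes
mdiff  <- -1.60 / mdisc               # multidimensional difficulty for threshold d1
```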

The item threshold parameters (d1, d2, and d3) are generally positive, indicating that most items were of easy to moderate difficulty (recall that difficulty and threshold are inversely related). This is also conveyed by the high prevalence of item vectors originating in the upper left and lower left quadrants of Panel A in Figure 4. Yet, while nearly all of the d3 parameters for targets in Table 3 are positive (d3 item vector coordinates less than 0), d3 parameters for distractors are mostly negative (d3 item vector coordinates greater than 0).



Figure 3. Top: Histogram of local dependence (ϕc²) between items after fitting the within-item multidimensional graded response model to the data. Bottom: Heat map of local dependence (ϕc²) between items. Darker colors imply greater local dependence.


TABLE 3
Parameter estimates for the within-item multidimensional graded response model

                     Item threshold            Item discrimination
Type        No.      d1      d2      d3        a1      a2      a1*     a2*
Target      3        1.60    1.05    0.61      0.44    0.37    0.56    0.14
Target      4        2.09    1.48    1.02      0.48    0.47    0.64    0.22
Target      5        1.76    1.24    0.91      0.35    0.41    0.49    0.22
Target      6        2.32    1.59    1.00      0.42    0.53    0.61    0.29
Target      8        1.34    0.55    0.08      0.09    0.38    0.25    0.30
Target      9        1.56    0.99    0.65      0.44    0.39    0.57    0.15
Target      10       1.24    0.52    0.18      0.23    0.37    0.37    0.23
Target      14       2.21    1.59    1.01      0.51    0.56    0.70    0.28
Target      19       1.76    1.30    0.57      0.36    0.48    0.53    0.28
Target      20       1.67    1.07    0.62      0.40    0.51    0.58    0.28
Target      21       2.18    1.62    1.22      0.66    0.64    0.87    0.28
Target      22       1.72    1.06    0.65      0.41    0.55    0.61    0.31
Target      25       0.92    0.31   −0.07      0.13    0.37    0.28    0.28
Target      28       1.34    0.72    0.10      0.19    0.47    0.38    0.34
Target      29       1.98    1.40    1.06      0.58    0.54    0.76    0.23
Target      31       1.50    0.76    0.23      0.23    0.49    0.42    0.34
Target      34       0.98    0.40   −0.06      0.16    0.43    0.34    0.31
Target      35       1.16    0.48    0.03      0.19    0.47    0.38    0.34
Target      36       1.55    0.82    0.22      0.22    0.56    0.44    0.40
Target      40       0.71    0.06   −0.46      0.00    0.32    0.14    0.29
Distractor  1        1.74    1.00    0.14      0.69   −0.02    0.61   −0.32
Distractor  2        0.52   −0.14   −0.70      0.30   −0.15    0.20   −0.26
Distractor  7        0.54    0.13   −0.68      0.39   −0.29    0.23   −0.43
Distractor  11       0.96    0.47   −0.40      0.48   −0.20    0.34   −0.39
Distractor  12       1.24    0.73   −0.24      0.61   −0.18    0.46   −0.43
Distractor  13       1.27    0.60   −0.38      0.70   −0.21    0.54   −0.50
Distractor  15       1.22    0.78   −0.18      0.65   −0.29    0.46   −0.55
Distractor  16       0.74    0.14   −0.67      0.48   −0.35    0.28   −0.52
Distractor  17       2.98    2.31    0.86      1.20    0.08    1.11   −0.46
Distractor  18       2.58    2.02    0.91      1.12    0.04    1.03   −0.46
Distractor  23       1.41    0.87   −0.26      0.73   −0.27    0.54   −0.56
Distractor  24       1.38    0.79   −0.26      0.69   −0.29    0.50   −0.56
Distractor  26       1.65    1.11   −0.01      0.81   −0.28    0.61   −0.61
Distractor  27       2.19    1.48    0.08      1.06   −0.23    0.85   −0.67
Distractor  30       0.55    0.05   −0.85      0.41   −0.42    0.18   −0.56
Distractor  32       1.46    0.92   −0.28      0.75   −0.27    0.56   −0.57
Distractor  33       2.27    1.67    0.32      1.09   −0.12    0.93   −0.59
Distractor  37       2.02    1.52    0.38      1.09   −0.02    0.97   −0.50
Distractor  38       1.07    0.61   −0.08      0.62   −0.15    0.50   −0.41
Distractor  39       2.54    1.85    0.44      1.18   −0.07    1.03   −0.58

Note. a1* and a2* are rotated parameters.

The item discrimination parameter estimates (a1 and a2) indicate that the first latent ability (θ1) is more strongly associated with distractors (i.e., the distractor-loaded factor) and that the second latent ability (θ2) is more strongly associated with targets (i.e., the target-loaded factor). This can also be seen in Panel A of Figure 4, where the light-gray item vectors (distractors) tend to align with θ1 and the dark-gray item vectors (targets) tend to align with θ2. Unfortunately, this manifestation of the model is complicated by rotational indeterminacy. That is, due to the exploratory nature of the M-GRM, there is no correct orientation to the x- and y-axes (θ1 and θ2). Any consistent orthogonal rotation of the vector and axis values will produce the same model fit, with just the item weights changing. A proper rotation of the axes cannot be determined by the observed data alone, but instead must be "resolved" using a rotation procedure based on a theory-based criterion.

Our approach is to resolve this rotational indeterminacy using cognitive–psychometric modeling. Specifically, results from the M-GRM were correlated with individual parameter estimates derived from a specific application of the signal detection theory of recognition memory: the unequal-variance model. Correlations between the unrotated M-GRM dimensions (θ1 and θ2) and the estimated signal detection parameters are depicted in Panel B of Figure 4.


Figure 4. Unrotated (A and B) and rotated (C and D) solutions to the within-item multidimensional graded response model. A: Unrotated orientation of the axes (θ1 and θ2) plotted with item vectors. B: Unrotated orientation plotted with unequal-variance signal detection model parameter vectors. C: Rotated orientation of the axes (θ1* and θ2*) plotted with unequal-variance signal detection model parameter vectors. D: Rotated orientation plotted with item vectors. Signal detection parameters are distance between the means of the target and distractor distributions (d′), variance of the target distribution (σ²T), and response bias (C).

The vectors for d′, σ²T, and C are plotted at angles and lengths² that convey the direction and magnitude of association, respectively, between each parameter and the unrotated θ1 and θ2 dimensions (smaller angles and longer vectors indicate stronger association). The unrotated M-GRM results provide no more than an ambiguous interpretation of the distractor- and target-loaded abilities with respect to the signal detection parameters. The dimension θ1 is positively correlated with both d′ and C. The dimension θ2 is positively correlated with d′ but negatively correlated with C. Neither ability is strongly correlated with σ²T, which could reflect poor estimation of the parameter. The unrotated solution was unexpected from the perspective of signal detection models of recognition memory. In particular, the latent constructs, which ought to align with memory strength and response bias, should manifest themselves equally in target responses and distractor responses.

²Because the angles and vector lengths were based on the sample data, rather than the population model, the values do not maintain orthogonal geometry within the axes. Therefore, we used the average angle between each dimension to plot the vectors.



To improve clarity on this issue, we next orthogonally rotated the solution to align θ1 with d′ and θ2 with C (an angle of approximately 26°). Specifically, by first averaging the angle of separation between θ1 and d′ with the angle of separation between θ2 and C, and then consistently rotating all parameter estimates by this amount, results from the M-GRM model were brought into alignment with results from the signal detection model. This is similar to classic targeted rotation procedures (see Mulaik, 2010), except that the angle of rotation was determined directly from associations among estimates of ability and the signal detection parameters rather than a targeted matrix for the item parameters. The resulting dimensions after rotation, along with the signal detection parameter vectors, are shown in Panel C of Figure 4. It can be seen that the rotation improves interpretation of the M-GRM latent dimensions considerably. The first ability (θ1*) can now be regarded as an indicator of memory strength (i.e., closely aligned with d′), and the second ability (θ2*) can now be regarded as an indicator of response bias (i.e., closely aligned with C, although in the opposite direction). Because θ1 and θ2 were rotated by the same angle (i.e., an orthogonal rotation), the dimensions remain orthogonal to one another.
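A minimal sketch of this targeted orthogonal rotation follows; the 26° angle comes from the text, while the rotation-matrix convention and variable names are assumptions.

```r
# Rotate a two-column matrix of coordinates by a fixed angle
rotate2d <- function(x, degrees) {
  r <- degrees * pi / 180
  R <- matrix(c(cos(r), sin(r), -sin(r), cos(r)), 2, 2)  # orthogonal rotation
  x %*% R
}

# Applying the same rotation to ability estimates (N x 2) and discrimination
# parameters (J x 2) leaves the model fit and vector lengths unchanged:
# theta_star <- rotate2d(theta_hat, 26)
# a_star     <- rotate2d(a_mat, 26)
```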

The rotated dimensions θ1* and θ2*, along with the PFMT item vectors, are shown in Panel D of Figure 4. Note that θ1* bisects the average angle between distractor and target item vectors. This suggests that θ1* reflects a general ability, where higher values of θ1* are associated with higher probabilities of correct response for both targets and distractors. Also note that θ2* is positively related to targets and negatively related to distractors. This suggests that θ2* reflects a tradeoff, where higher values of θ2* are associated with higher probabilities of correct response to targets but lower probabilities of correct response to distractors, and lower values of θ2* are associated with lower probabilities of correct response to targets but higher probabilities of correct response to distractors.

The rotated discrimination parameters are reported in Table 3 (a1* and a2*). Importantly, the items' overall multidimensional discriminating powers (and communalities) have not changed. Discrimination parameters for the first rotated dimension (a1*) are all positive and generally moderate in size: better memory strength leads to a higher number of hits and correct rejections. Discrimination parameters for the second rotated dimension (a2*) are of the opposite sign for targets and distractors. That is, high θ2* implies bias towards targets, and low θ2* implies bias towards distractors. The average absolute value of a1* is larger than the average absolute value of a2*, which is more true of targets (|Ma1*| = .50 vs. |Ma2*| = .28) than of distractors (|Ma1*| = .60 vs. |Ma2*| = .50).

Measurement precision for Army STARRS

The SEθ function for the PFMT is plotted in the top panel of Figure 5. Curvature in the SEθ surface shows that precision in measurement varies as a function of ability. The function reaches a low point around one standard deviation below average memory strength and average response bias (the coordinate θ1* = –1.0, θ2* = –1.3). Notably, there is one "valley" of SEθ running diagonally³ from about (θ1* = –4.0, θ2* = –4.0) to (θ1* = 4.0, θ2* = 4.0). High points in the SEθ surface fall in regions where memory strength is very high and response bias is very low (e.g., θ1* = 4.0, θ2* = –4.0) or where memory strength is very low and response bias is very high (e.g., θ1* = –4.0, θ2* = 4.0).

The bottom panel of Figure 5 plots the bivariate distribution of ability estimates in the Army STARRS sample. The highest density of ability occurs at (θ1* = 0, θ2* = 0), with both memory strength and response bias symmetrically distributed about that point. By visually lining up the ability distribution in Figure 5 with the SEθ function in Figure 5, it can be seen that there is a close match. That is, low values of SEθ tend to occur near dense ability regions. However, the SEθ function does not appear to be optimized for this specific sample. The low point in the SEθ function suggests that ability estimates are most precise for slightly less able individuals. Indeed, estimated reliability in the current sample is .61; however, if we consider only those Soldiers with ability estimates falling within one standard deviation of the low point in the SEθ surface (θ1* = –1.0 and θ2* = –1.3), reliability jumps to .71. If we consider only those Soldiers with ability estimates falling within one standard deviation of an arbitrary high point in the SEθ surface (say θ1* = 1.0 and θ2* = –1.0), reliability drops to .42 and would be much lower had we chosen more extreme values.

³SEθ functions are difficult to summarize for within-item multidimensional models. Whereas changes in ability for unidimensional models can occur in just one dimension, changes in ability for multidimensional models can occur in multiple dimensions. Therefore, in order to determine how SEθ changes as a function of ability, a gradient describing the direction of movement within the ability plane must first be established. For simplicity, we aligned this gradient with the direction of steepest descent. In practice, observed SEθ values are likely to vary depending on changes in this angle.


Figure 5. Top panel: standard error function for the rotated solution to the within-item multidimensional graded response model. Bottom panel: multidimensional distribution of memory strength and response bias in the Army STARRS sample.


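The reliability figures above can be reproduced, in outline, from the ability estimates and their standard errors. The sketch below is a plausible reading rather than the authors' exact procedure: `theta_hat` and `se` are hypothetical vectors of ability estimates and their standard errors, the latent variance is assumed to be standardized, and `near_point` implements one reasonable interpretation of "within one standard deviation of" a point in the ability plane.

```r
# Empirical reliability: estimated true-score variance divided by total
# variance, with mean squared standard error as the error component.
empirical_rxx <- function(theta_hat, se) {
  var(theta_hat) / (var(theta_hat) + mean(se^2))
}

# One reading of "within one SD of (x0, y0)" in the two-dimensional
# (memory strength, response bias) plane; both dimensions are standardized,
# so the radius defaults to 1.
near_point <- function(theta1, theta2, x0, y0, radius = 1) {
  sqrt((theta1 - x0)^2 + (theta2 - y0)^2) <= radius
}
```

For example, recomputing `empirical_rxx` on the subset flagged by `near_point(theta1, theta2, -1.0, -1.3)` would correspond to the .71 figure reported above for Soldiers near the low point of the SEθ surface.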

DISCUSSION

Parallel psychometric and cognitive modeling analyses were used to better understand the measurement properties of the PFMT as a component of the neuropsychological battery being administered to Soldiers in Army STARRS. Item response theory (the M-GRM) was used to confirm the number of latent cognitive abilities measured by the instrument. Then, cognitive theory (the unequal-variance signal detection model) was used to apply psychological meaning to those abilities. Specifically, these analyses supported the construct validity of independent neurocognitive markers of memory strength (d′) and response bias (C). We then investigated the measurement precision of PFMT test scores and found adequate precision overall, but higher precision in low-average ability regions. Thus, PFMT scores were most precise for examinees with low-average memory strength. The results suggest that although the PFMT can provide useful measurements across a wide range of memory abilities, it is particularly useful for identifying differences among impaired individuals.

Cognitive–psychometric methodology helped explain unknown aspects of the test's latent factor structure using experimentally derived cognitive architecture. The approach provided a synthesis of statistical modeling and cognitive theory. Although the modeling procedures were complex in comparison to conventional approaches, the final measurements had a priori psychological meaning and did not require a posteriori adjustments (e.g., multiple regression or pattern analysis) to be made interpretable. In addition, the modeling procedures allowed us to examine standard error as a function of cognitive ability. This facilitated a more nuanced inspection of the PFMT's psychometric properties and allowed us to determine specific levels of memory associated with relatively better or worse measurement precision.

Penn Face Memory Test model structure

The within-item M-GRM (bottom panel of Figure 1 and right panel of Figure 2) is considered to be a relatively complex measurement structure. Within-item multidimensional models require greater effort to estimate and to interpret than do between-item models (McDonald, 2000). A common solution to such complexity is to force between-item (simple) structure by allowing factors to correlate with each other. In Panel A of Figure 4, for example, it is possible to obliquely rotate the x- and y-axes so that the target and distractor items primarily align with just one dimension. That is, it is possible to avoid models where items are indicators of multiple cognitive constructs by assuming that the constructs are highly correlated. Although such a solution is methodologically simpler, it fails to agree with cognitive theories suggesting that target and distractor responses should be similarly determined by two underlying constructs. A more psychologically meaningful, albeit complex, solution was to interpret the psychometric model (within-item M-GRM) from a cognitive perspective (i.e., targeted rotation using unequal-variance signal detection model parameters). This methodology produced clear psychological interpretations for the latent abilities—one representing memory strength and one representing response bias.

Aligning the exploratory MIRT results with a well-validated cognitive model helps to establish the PFMT's construct representation—that is, how test scores (item responses) are determined by the properties of items, tasks, and underlying cognitive processes (see Embretson, 1983, 2010). In our analyses, for example, the function of the first rotated ability (θ1*) was shown to be consistent with the construct representation of memory strength, and the function of the second rotated ability (θ2*) was shown to be consistent with the construct representation of response bias. This stringent form of validity is supported by cognitive–psychometric modeling methodology (see Batchelder, 2010). Long recognized as the ideal synthesis of theories from experimental and correlational psychology (Cronbach, 1957), this methodology is rapidly expanding its applications (e.g., Brandt, 2007; Brown, Turner, Mano, Bolden, & Thomas, 2012; Embretson & Gorin, 2001; Townsend & Neufeld, 2010) and has the potential to improve neuropsychological assessment, particularly in an age of computerized measurement.

The model applied in the current study is regarded as a preliminary attempt to build a cognitive–psychometric architecture for neurocognitive measures of episodic memory. Although the model was made consistent with basic results from signal detection theory (a descriptive cognitive model), it remains unclear how specific cognitive processes, as well as other commonly studied cognitive phenomena (e.g., list length), can be accommodated by the measurement structure. Whether the cognitive–psychometric model described in this paper needs to be replaced by a synthesis of psychometric theory with a more explanatory cognitive theory of episodic memory (see Clark & Gronlund, 1996, for examples), or can be amended to handle the inherent complexity in episodic memory testing, is unknown. Obvious limitations of the applied model are that changing the number of study items (faces) and the time between study and test phases will affect item difficulty. Also, the process of face recognition itself could affect memory strength for other stored but not yet tested faces. Consistent with these concerns, our results suggest that in addition to the two dominant factors underlying test performance (memory strength and response bias), nearly a dozen smaller factors were present in the data. Our analysis of local dependence (Figure 3) suggests that most of this residual common variance is found among items administered later in the PFMT task. We assume that this variance reflects the impact of unaccounted-for cognitive processes (e.g., learning to learn, network activation), but this hypothesis was not tested.

If our assumption is accurate, a more comprehensive model comprising structures for both cognitive abilities and cognitive processes is required to accurately account for episodic memory task performance. For example, there is some agreement in the field of cognitive modeling that memory strength can be further deconstructed into the additive (see Wixted, 2007) and/or complementary (see Onyper, Zhang, & Howard, 2010; Yonelinas, 2002) effects of familiarity and recollection. Whereas familiarity is characterized by the weak sense of having experienced some past event, recollection conveys clear and unequivocal memory, probably related to memory search. Imaging studies have even suggested that these processes rely on distinct neuroanatomical structures (Davachi & Wagner, 2002; Diana, Yonelinas, & Ranganath, 2007; Kim & Cabeza, 2007; Wheeler & Buckner, 2003), though perhaps not in a clear and unambiguous manner (Wixted & Squire, 2011).

Integrative frameworks for such findings are offered by computational global memory models (for a review, see Clark & Gronlund, 1996). Most of these assume that memories, which are represented parametrically as associative networks of item and context information, are accessed through a process combining elements of global matching and interactive cueing. Recollection equates to a specific match between a test item and a stored memory; familiarity equates to the marginal associative strength between a test item and the complete stored network of memories (e.g., Gronlund & Shiffrin, 1986; Raaijmakers & Shiffrin, 1980). These computational models are unique in that they are designed to explain underlying memory structures and processes at a level of specificity not feasible with more universally applicable models (e.g., the generalized linear model). Further research is needed to determine whether computational models can be given a cognitive–psychometric framework.

Measurement properties of the PFMT and implications for Army STARRS

Standard error is determined by item and examinee characteristics. Specifically, measurement precision is optimized when item difficulty is closely matched with ability and when item discrimination is high. This can be visualized in Panel D of Figure 4. The area of the two-dimensional ability space (θ1* by θ2*) covered by the dark- and light-gray arrows (item vectors representing combined item difficulty and discrimination characteristics) is the area of ability where precise measurement occurs. This implies that the PFMT should be most precise when measuring low-average levels of memory strength (θ1*) and response bias (θ2*), which can also be observed in the standard error function displayed in the top panel of Figure 5. It is noteworthy that PFMT distractor items and PFMT target items contribute towards precision of measurement at complementary angles within the multidimensional ability space. That is, hits and correct rejections contribute unique value to the assessment of episodic memory and should not be regarded as interchangeable data.

Most of the PFMT items were estimated to have low difficulties (the inverse of d1, d2, and d3), which resulted in the test's standard error function reaching a minimum (maximum measurement precision) in the low-average range of ability. In addition, the difficulties of specific PFMT response options (i.e., "definitely yes," "probably yes," "probably no," or "definitely no") were slightly disproportionate for targets and distractors. Targets were easier than distractors. This implies that distractors contributed relatively greater measurement precision in the high-average range of ability. All things being equal, adding distractor items to the PFMT should improve measurement precision for nonimpaired examinees, while adding target items should improve measurement precision for impaired examinees, as the adjustment should make the test more or less difficult, respectively.

Most of the PFMT items were estimated to have moderately sized discrimination parameters, but memory strength values (a1*) were generally larger than response bias values (a2*). Thus, memory strength was measured more precisely than response bias. As a result, variation in memory strength had a greater effect on standard error than did variation in response bias. Although somewhat difficult to discern in Figure 5, marginal variation in the memory strength axis is greater than marginal variation in the response bias axis. The surface does not form a consistent "U" shape. An implication is that, because the variances of both latent dimensions were standardized, memory strength was a greater relative determinant of response performance than was response bias, which could reflect relatively less true variance in Army Soldiers' response biases than in Army Soldiers' memory strengths. Practically, PFMT users can consider the instrument a better indicator of memory strength than response bias, which is consistent with the purpose of the test.

Overall, analyses of the PFMT suggest that standard error values were low for the majority of Army STARRS participants' ability estimates. The reliability of all estimates in the sample was .61. Thus, it can be argued that the PFMT approaches the general standard for adequate measurement precision for the Army STARRS sample as a whole. However, the reliability is much higher (.71) if we consider only those estimates in the low-average range of ability. This confirms that the PFMT's measurement precision is optimized for examinees with low-average ability. Item difficulty would have to be raised (item thresholds lowered) by approximately one standard deviation in order for measurement precision to peak in the center of the Army STARRS ability distribution. However, doing so may not be desirable in a study of psychopathology, where the impaired end of the ability spectrum is the primary focus of predictive modeling. Because Army STARRS seeks to identify potential risk factors, such as below-average cognitive functioning, the results generally support maintaining the test in its current instantiation. The capability to determine this is a benefit of modeling both SEθ and reliability.

Two aspects of the PFMT likely contribute to its favorable measurement precision: (a) Items are generally of moderate difficulty; and (b) examinees are asked to respond using ordered categorical response options. Items with moderate difficulties are optimal for assessment of normative samples. PFMT items, which are somewhat geared towards clinical populations, can be characterized as being easy, but not too easy. Developers of fixed-length tests, particularly tests administered in time-limited settings, must choose a range of ability for which items will provide maximum precision. For the PFMT, even though ability measurements are most precise for examinees in the low-average range, examinees in the above-average range are challenged enough by the content to avoid dramatic ceiling effects in the Army STARRS sample. Standard error values are also lowered by the PFMT's rating scale format. Modeling intermediate response options (i.e., "probably") tends to minimize standard error in the bodies of ability distributions; extreme response options (i.e., "definitely") tend to minimize standard error near the tails. The PFMT's polytomous format (ordered categorical response options) has the effect of creating a pseudocontinuous response scale, which helps to overcome some of the limitations due to categorical response data.
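The role of the intermediate and extreme response options can be illustrated with graded response model category probabilities. The sketch below uses made-up parameter values for a single item and one latent dimension for simplicity; the slope–intercept form (a·θ + d) matches the parameterization used elsewhere in the paper.

```r
# Graded response model category probabilities for one four-option item.
# Parameter values are illustrative only; the three decreasing intercepts
# make successively higher categories harder to endorse.
grm_probs <- function(theta, a = 1.0, d = c(2.0, 0.5, -1.0)) {
  p_star <- c(1, plogis(a * theta + d), 0)  # P(X >= k) at each boundary
  -diff(p_star)                             # the four category probabilities
}
grm_probs(theta = 0)  # "definitely no" ... "definitely yes" at average ability
```

Each boundary curve is steepest, and therefore most informative, near its own location on the ability scale; boundaries located farther from zero contribute information farther into the tails, which is the pattern described above.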

It is important to note that an accurate characterization of standard error goes beyond evaluations of a test's measurement properties. Standard error can also be used to improve predictive modeling, as is planned for risk and resilience variables in Army STARRS. A key tenet of structural equation modeling (see Bollen, 1989) is that a properly specified measurement model can alleviate some of the problems that arise from regression with latent variables. Most prominently, regression coefficients become attenuated (shrink towards zero) as measurement error grows larger. Because standard error is a property of ability score estimates, it should be examined in predictive modeling to ensure that poorly measured examinees do not disproportionately impact analyses of a test's predictive value.
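The attenuation referred to here is the classical psychometric result, stated below in standard notation (a general identity, not specific to Army STARRS):

```latex
% Spearman's attenuation of a correlation by unreliability:
\[
r_{x'y'} = r_{xy}\sqrt{\rho_{xx'}\,\rho_{yy'}}
\]
% For simple regression with measurement error in the predictor only,
% the standardized slope shrinks by the predictor's reliability:
\[
\beta_{\mathrm{observed}} = \rho_{xx'}\,\beta_{\mathrm{true}}
\]
```

As an illustration, at the sample reliability of .61 reported above, a true standardized coefficient of .30 would be observed as roughly .30 × .61 ≈ .18 if no correction were applied.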

Limitations

A limitation of all latent variable studies is the inability to unequivocally define measured constructs. Although the rotated dimensions appear to represent memory strength and response bias, there is no guarantee that the PFMT measures these two latent variables. Evidence of nomothetic span (convergent and discriminant validity), much of which has been collected for the PFMT (e.g., Gur et al., 2001; Gur et al., 1997; Gur et al., 2012; Gur et al., 2010), further supports the test's construct validity. In addition, our choice to align latent ability estimates from the M-GRM with parameter values from the unequal-variance signal detection model was based on an a priori chosen theoretical framework for interpreting response data. As with all models, it is not possible to prove that the rotation is correct or is a true reflection of reality. Indeed, any rotation of the parameter space (e.g., based on simple structure, targeted item parameters, etc.) would result in the same model fit. By showing how the rotation solution can be linked to a particular theory, however, we refocus the discussion of factor rotation strategy from mathematical criteria (varimax, promax, etc.) to criteria for adequate theory construction.

We noted that the response profiles of approximately 8% of Army Soldiers in the initial sample were flagged as errant and were therefore not included in the analyses. The exclusions may have eliminated some valid response profiles, resulting in overestimates of the PFMT's model fit and measurement precision. It was also noted that demographic characteristics of Soldiers with data present were not identical to demographic characteristics of Soldiers with data absent or excluded. Disparities in gender, education, and race and ethnicity, though small, were observed between the two groups. These differences threaten the validity of findings from this study. However, perhaps more concerning than differences between Soldiers tested and Soldiers not tested is the more general possibility of group biases. Specifically, it is possible that differential item functioning between demographically diverse groups of examinees, even within the same calibration sample, led to poor estimation of some parameter values. A potential solution is to divide the current sample into subsets based on demographic characteristics and then investigate differential item functioning between groups. Such analyses, while valuable, would add considerable length to the current paper. They will be pursued in future studies.

Finally, unlike a true cross-validation, where independent samples are drawn from the same population, our analyses were based on independent subsamples drawn from the same sample. This implies that our results were cross-validated to the sample, but not to the population, of Army Soldiers.

CONCLUSIONS

The PFMT was found to measure visual memory strength and response bias with adequate precision for the majority of Soldiers participating in Army STARRS. The results are promising and suggest that the test has the potential to provide useful measurements of Army Soldiers' cognitive functioning. Although the psychometric model applied to the PFMT in the current study likely does not account for all relevant sources of task and item variance, it does provide a basis for future cognitive–psychometric modeling of the instrument.

Original manuscript received 6 May 2012
Revised manuscript accepted 27 December 2012

First published online 30 January 2013

REFERENCES

Andrich, D. (1988). Rasch models for measurement.Newbury Park, CA: Sage.

Army STARRS. (2011). National Institute of MentalHealth Strategic Plan. Army STARRS Home.Retrieved October 11, 2011, from http://www.armystarrs.org/

Baddeley, A. D., & Hitch, G. (1974). Working mem-ory. In G. H. Bower (Ed.), The psychology of learningand motivation: Advances in research and theory (pp.47–89). New York, NY: Academic Press.

Baker, F. B., & Kim, S.-H. (2004). Item response theory:Parameter estimation techniques (2nd ed.). New York,NY: Dekker.

Batchelder, W. H. (2010). Cognitive psychometrics:Using multinomial processing tree models as mea-surement tools. In S. E. Embretson (Ed.), Measuringpsychological constructs: Advances in model-basedapproaches (pp. 71–93). Washington, DC: AmericanPsychological Association.

Bollen, K. A. (1989). Structural equations with latentvariables. New York, NY: Wiley.

Brandt, M. (2007). Bridging the gap between measure-ment models and theories of human memory. Journalof Psychology, 215, 72–85.

Brown, G. G., Turner, T., Mano, Q. R., Bolden, K., &Thomas, M. L. (2012). Experimental manipulation ofworking memory model parameters: An exercise in con-struct validity. Manuscript submitted for publication.

Cabeza, R., & Kingstone, A. (Eds.). (2006). Handbookof functional neuroimaging of cognition (2nd ed.).Cambridge, MA: MIT Press.

Cai, L. (2010). High-dimensional exploratory item factoranalysis by a Metropolis–Hastings Robbins–Monroalgorithm. Psychometrika, 75, 33–57.

Campbell, D. T., & Fiske, D. W. (1959). Convergentand discriminant validation by the multitrait-multi-method matrix. Psychological Bulletin, 56,81–105.

Carroll, J. B. (1993). Human cognitive abilities: A surveyof factor-analytic studies. New York, NY: CambridgeUniversity Press.

Chalmers, P. (2011). mirt: Multidimensional item response theory. Retrieved from http://CRAN.R-project.org/package=mirt

Chen, W.-H., & Thissen, D. (1997). Local dependenceindexes for item pairs using item response theory.Journal of Educational and Behavioral Statistics, 22,265–289.

Clark, S. E., & Gronlund, S. D. (1996). Global match-ing models of recognition memory: How the modelsmatch the data. Psychonomic Bulletin & Review, 3,37–60.

Cowan, N. (1988). Evolving conceptions of memorystorage, selective attention, and their mutual con-straints within the human information-processing sys-tem. Psychological Bulletin, 104, 163–191.

Cronbach, L. J. (1957). The two disciplines of sci-entific psychology. American Psychologist, 12,671–684.

Cronbach, L. J., & Meehl, P. E. (1955). Construct valid-ity in psychological tests. Psychological Bulletin, 52,281–302.

Davachi, L., & Wagner, A. (2002). Hippocampal contributions to episodic encoding: Insights from relational and item-based learning. Journal of Neurophysiology, 88, 982–990.

de Ayala, R. J. (1994). The influence of dimensionalityon the graded response model. Applied PsychologicalMeasurement, 18, 155–170.

de Ayala, R. J. (2009). The theory and practice of itemresponse theory. New York, NY: Guilford Press.

Delis, D. C., Kramer, J. H., Kaplan, E., & Ober, B. A.(2000). The California Verbal Learning Test (2nd ed.).San Antonio, TX: The Psychological Corporation.

Diana, R. A., Yonelinas, A. P., & Ranganath, C. (2007). Imaging recollection and familiarity in the medial temporal lobe: A three-component model. Trends in Cognitive Sciences, 11, 379–386.

Embretson, S. E. (1983). Construct validity: Constructrepresentation versus nomothetic span. PsychologicalBulletin, 93, 179–197.

Embretson, S. E. (2010). Cognitive design systems: A structural modeling approach applied to developing a spatial ability test. In S. E. Embretson (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 247–273). Washington, DC: American Psychological Association.

Embretson, S. E., & Gorin, J. (2001). Improving con-struct validity with cognitive psychology principles.Journal of Educational Measurement, 38, 343–368.

Embretson, S. E., & Reise, S. P. (2000). Item responsetheory for psychologists. Mahwah, NJ: LawrenceErlbaum Associates.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B.(2004). Bayesian data analysis (2nd ed.). Boca Raton,FL: Chapman and Hall/CRC.

Green, D. M., & Swets, J. A. (1966). Signal detectiontheory and psychophysics. New York, NY: Wiley.

Gronlund, S. D., & Shiffrin, R. M. (1986). Retrievalstrategies in recall of natural categories and catego-rized lists. Journal of Experimental Psychology, 12,550–561.

Gur, R. C., Erwin, R. J., & Gur, R. E. (1992).Neurobehavioral probes for physiologic neuroimagingstudies. Archives of General Psychiatry, 49, 409–414.

Gur, R. C., Jaggi, J. L., Ragland, J. D., Resnick, S. M.,Shtasel, D., Muenz, L., et al. (1993). Effects of mem-ory processing on regional brain activation: Cerebralblood flow in normal subjects. The InternationalJournal of Neuroscience, 72, 31–44.

Gur, R. E., Jaggi, J. L., Shtasel, D. L., & Ragland,J. D. (1994). Cerebral blood flow in schizophrenia:Effects of memory processing on regional activation.Biological Psychiatry, 35, 3–15.

Gur, R. C., Ragland, J. D., Moberg, P. J., Turner, T. H.,Bilker, W. B., Kohler, C., et al. (2001). Computerizedneurocognitive scanning: I. Methodology and valida-tion in healthy people. Neuropsychopharmacology, 25,766–776.

Gur, R. C., Ragland, J. D., Mozley, L. H., Mozley, P. D.,Smith, R., Alavi, A., et al. (1997). Lateralized changesin regional cerebral blood flow during performance ofverbal and facial recognition tasks: Correlations withperformance and “effort.” Brain and Cognition, 33,388–414.

Gur, R. C., Richard, J., Calkins, M. E., Chiavacci, R.,Hansen, J. A., Bilker, W. B., . . . Gur, R. E. (2012). Agegroup and sex differences in performance on a com-puterized neurocognitive battery in children age 8–21.Neuropsychology, 26, 251–265.

Gur, R. C., Richard, J., Hughett, P., Calkins, M. E.,Macy, L., Bilker, W. B., . . . Gur, R. E. (2010).A cognitive neuroscience-based computerized bat-tery for efficient measurement of individual dif-ferences: Standardization and initial construct val-idation. Journal of Neuroscience Methods, 187,254–262.

Haberman, S. J. (1974). The analysis of frequency data.Chicago, IL: University of Chicago Press.

Haynes, S. N., Smith, G. T., & Hunsley, J. D. (2011).Scientific foundations of clinical assessment. NewYork, NY: Routledge.

Hoerger, M., Quirk, S. W., & Weed, N. C. (2011).Development and validation of the DelayingGratification Inventory. Psychological Assessment,23, 725–738.

Ingham, J. (1970). Individual differences in signal detec-tion. Acta Psychologica, 34, 39–50.

Kelley, W., Miezin, F., McDermott, K., Buckner, R.,Raichle, M., Cohen, N., . . . Petersen, S. E.(1998). Hemispheric specialization in human dorsalfrontal cortex and medial temporal lobe for ver-bal and nonverbal memory encoding. Neuron, 20,927–936.

Kim, H., & Cabeza, R. (2007). Differential contribu-tions of prefrontal, medial temporal, and sensory-perceptual regions to true and false memory forma-tion. Cerebral Cortex, 17, 2143–2150.

Lezak, M. D., Howieson, D. B., Loring, D. W., Hannay,H. J., & Fischer, J. S. (2004). Neuropsychologicalassessment (4th ed.). New York, NY: OxfordUniversity Press.

Loevinger, J. (1957). Objective tests as instrumentsof psychological theory. Psychological Reports,Monograph Supplement, 3, 635–694.

Lord, F. M. (1980). Applications of item response theoryto practical testing problems. Hillsdale, NJ: LawrenceErlbaum Associates.

Magnusson, D. (1966). Test theory. Reading, MA:Addison-Wesley.

McDonald, R. P. (1999). Test theory: A unified treatment.Mahwah, NJ: Lawrence Erlbaum Associates.

McDonald, R. P. (2000). A basis for multidimensionalitem response theory. Applied PsychologicalMeasurement, 24, 99–114.

Mellenbergh, G. J. (1996). Measurement precision intest score and item response models. PsychologicalMethods, 1, 293–299.

Mickes, L., Wixted, J. T., & Wais, P. E. (2007). A directtest of the unequal-variance signal-detection model ofrecognition memory. Psychonomic Bulletin & Review,14, 858–865.

Mulaik, S. A. (2010). Foundations of factor analysis (2nded.). Boca Raton, FL: Chapman & Hall/CRC.

Muraki, E., & Carlson, J. E. (1995). Full-informationfactor analysis for polytomous item responses.Applied Psychological Measurement. Special Issue:Polytomous Item Response Theory, 19, 73–90.

Onyper, S. V., Zhang, Y. X., & Howard, M. W. (2010).Some-or-none recollection: Evidence from item andsource memory. Journal of Experimental Psychology,139, 341–364.

R Development Core Team. (2011). R: A language andenvironment for statistical computing. Vienna, Austria:R Foundation for Statistical Computing. Retrievedfrom http://www.R-project.org


Raaijmakers, J. G. W., & Shiffrin, R. M. (1980). SAM: Atheory of probabilistic search of associative memory.In G. H. Bower (Ed.), The psychology of learning andmotivation (pp. 207–262). New York, NY: AcademicPress.

Reckase, M. D. (2009). Multidimensional item responsetheory. New York, NY: Springer.

Reise, S. P., & Waller, N. G. (2009). Item response theoryand clinical measurement. Annual Review of ClinicalPsychology, 5, 25–46.

Revelle, W. (2011). psych: Procedures for personalityand psychological research. Retrieved from http://personality-project.org/r/psych.manual.pdf

Samejima, F. (1997). The graded response model. In W. J.van der Linden & R. K. Hambleton (Eds.), Handbookof modern item response theory (pp. 85–100). NewYork, NY: Springer.

Snodgrass, J., & Corwin, J. (1988). Pragmatics of measur-ing recognition memory—Applications to dementiaand amnesia. Journal of Experimental Psychology:General, 117, 34–50.

Strauss, M. E., & Smith, G. T. (2009). Construct validity:Advances in theory and methodology. Annual Reviewof Clinical Psychology, 5, 1–25.

Thomas, M. L. (2011). The value of item response the-ory in clinical assessment: A review. Assessment, 18,291–307.

Townsend, J. T., & Neufeld, R. W. J. (2010). Introductionto special issue on contributions of mathematical psy-chology to clinical science and assessment. Journal ofMathematical Psychology, 54, 1–4.

U.S. Army. (2010). Army health promotion, riskreduction, and suicide prevention: Report 2010.Retrieved October 11, 2011, from http://csf.army.mil/downloads/HP-RR-SPReport2010.pdf

Wheeler, M., & Buckner, R. (2003). Functional dis-sociation among components of remembering:Control, perceived oldness, and content. Journal ofNeuroscience, 23, 3869–3880.

Wickens, T. D. (2002). Elementary signal detection theory.New York, NY: Oxford University Press.

Wixted, J. T. (2007). Dual-process theory andsignal-detection theory of recognition memory.Psychological Review, 114, 152–176.

Wixted, J. T., & Squire, L. R. (2011). The medial temporallobe and the attributes of memory. Trends in CognitiveSciences, 15, 210–217.

Yonelinas, A. P. (2002). The nature of recollection and familiarity: A review of 30 years of research. Journal of Memory & Language, 46, 441–517.

APPENDIX

MODELING PROCEDURES

Because the densities of the ordered categorical response options were estimated to be slightly disproportionate for targets and distractors, an unequal-variance version of the signal detection model was used to model the data. The unequal-variance signal detection model is not identifiable from the observed data alone. Therefore, the mean of the distractor distribution was fixed to 0, and the variance was fixed to 1 for all examinees. The mean of the target distribution for each examinee was set equal to estimates of d′, and the variance (σ²T) was set equal to 1 plus estimates of σ² (see below). The middle criterion for each examinee was found by adding estimates of bias (C) to the midpoint of estimates of d′. Individual differences in the placement of the upper and lower criteria were not uniquely estimated, but were instead fixed by adding and subtracting 0.5, respectively, to each examinee's middle criterion estimate. The value of 0.5 was found to provide good fit for data aggregated at the group level. All free parameters (d′, σ², and C) were estimated using a hierarchical Bayesian approach with a Metropolis–Hastings-within-Gibbs sampler (see Gelman, Carlin, Stern, & Rubin, 2004) written in R. The chain was thinned by every 10th draw and was run for 3,000 iterations after burn-in. We used informative prior distributions to improve parameter recovery. The priors were specified as follows: d′ ∼ normal(1.5, 0.25); σ² ∼ inv-gamma(5, 1); and C ∼ normal(0, 0.25). The means of the d′ and C prior distributions were based on preliminary analyses fitted at the group level. The shape and scale parameters of the inverse gamma prior result in a prior mean of 0.25 for σ² and a prior mean of 1.25 for σ²T. This value is consistent with the commonly reported ratio of distractor to target variance of 0.8 (i.e., σ²F/σ²T = 1.00/1.25; see Mickes, Wixted, & Wais, 2007).
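A toy, non-hierarchical version of this model can be sketched as follows. This is an illustration of the likelihood and priors described above for a single examinee, not the authors' sampler (which was hierarchical and used Metropolis–Hastings within Gibbs): `tgt` and `dst` are hypothetical counts of the four ordered responses to targets and distractors, the prior value 0.25 is read as a variance (SD = 0.5), and a plain random-walk Metropolis step stands in for the within-Gibbs scheme.

```r
# Toy unequal-variance signal detection model for a single examinee.
# Response categories, in order: "definitely no", "probably no",
# "probably yes", "definitely yes".
log_post <- function(par, tgt, dst) {
  dprime <- par[1]; lsig2 <- par[2]; C <- par[3]
  sig2 <- exp(lsig2)                          # sample sigma^2 on the log scale
  cuts <- dprime / 2 + C + c(-0.5, 0, 0.5)    # lower, middle, upper criteria
  p_dst <- diff(pnorm(c(-Inf, cuts, Inf), mean = 0,      sd = 1))
  p_tgt <- diff(pnorm(c(-Inf, cuts, Inf), mean = dprime, sd = sqrt(1 + sig2)))
  ll <- sum(tgt * log(p_tgt)) + sum(dst * log(p_dst))   # multinomial likelihood
  # Priors from the text: d' ~ N(1.5, 0.25), sigma^2 ~ inv-gamma(5, 1),
  # C ~ N(0, 0.25); 0.25 is read here as a variance (an assumption).
  lp <- dnorm(dprime, 1.5, 0.5, log = TRUE) +
        dnorm(C,      0,   0.5, log = TRUE) +
        (-5) * lsig2 - 1 / sig2               # inv-gamma(5, 1) plus log-Jacobian
  ll + lp
}

mh_sample <- function(tgt, dst, n_iter = 3000, thin = 10) {
  par <- c(1.5, log(0.25), 0)                 # start at the prior means
  lp  <- log_post(par, tgt, dst)
  draws <- matrix(NA, n_iter, 3,
                  dimnames = list(NULL, c("dprime", "sigma2", "C")))
  for (i in seq_len(n_iter * thin)) {
    prop    <- par + rnorm(3, sd = 0.1)       # random-walk proposal
    lp_prop <- log_post(prop, tgt, dst)
    if (log(runif(1)) < lp_prop - lp) { par <- prop; lp <- lp_prop }
    if (i %% thin == 0) draws[i / thin, ] <- c(par[1], exp(par[2]), par[3])
  }
  draws
}

# e.g., hypothetical counts for 20 targets and 20 distractors:
# draws <- mh_sample(tgt = c(1, 3, 7, 9), dst = c(8, 7, 4, 1))
```

With "yes" responses falling to the right of the criteria, a larger C pushes all three criteria upward and produces more conservative responding, following standard signal detection conventions.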
