Comprehensive Clinical Psychology (Bellack, 1998), Chapter 4.01: The Role of Assessment in Clinical Psychology



4.01.9 OTHER MEASURES USED IN CLINICAL PSYCHOLOGY 27

4.01.9.1 The Thematic Apperception Test 27
4.01.9.2 Sentence Completion Tests 28
4.01.9.3 Objective Testing 28
4.01.9.4 The Clinician as a Clinical Instrument 28
4.01.9.5 Structured Interviews 29

4.01.10 CONCLUSIONS 29

4.01.11 REFERENCES 29

4.01.1 INTRODUCTION

In this chapter we will describe the current state of affairs with respect to assessment in clinical psychology and then attempt to show how clinical psychology got to that state, both in terms of positive influences on the directions that efforts in assessment have taken and in terms of missed opportunities for alternative developments that might have been more productive for psychology. For one thing, we really do not think the history is particularly interesting in its own right. The account and views that we will give here are our own; we are not taking a neutral (and innocuous) position. Readers will not find a great deal of equivocation, not much in the way of "a glass half-empty is, after all, half-full" placation. By assessment in this chapter, we refer to formal assessment procedures: activities that can be named, described, delimited, and so on. We assume that all clinical psychologists are more or less continuously engaged in informal assessment of the clients with whom they work. Informal assessment, however, does not follow any particular pattern, involves no rules for its conduct, and is not set off in any way from other clinical activities. We have in mind assessment procedures that would be readily defined as such, that can be studied systematically, and whose value can be quantified. We will not be taking account of neuropsychological assessment or of behavioral assessment, both of which are covered in other chapters in this volume. It will help, we think, if we begin by noting the limits within which our critique of clinical assessment is meant to apply. We, ourselves, are regularly engaged in assessment activities, including development of new measures, and we are clinicians, too.

4.01.1.1 Useful Clinical Assessment is Difficult but not Impossible

Many of the comments about clinical assessment that follow may seem to some readers to be pessimistic and at odds with the experiences of professional clinicians. We think our views are quite in accord with both research and the theoretical underpinnings for assessment activities, but in at least some respects we are not so negative in our outlook as we may seem. Let us explain. In general, tests and related instruments are devised to measure constructs, for example, intelligence, ego strength, anxiety, antisocial tendencies. In that context, it is reasonable to focus on the construct validity of the test at hand: how well does the test measure the construct it is intended to measure? Generally speaking, evaluations of tests for construct validity do not produce single quantified indexes. Rather, evidence for construct validity consists of a "web of evidence" that fits together at least reasonably well and that persuades a test user that the test does, in fact, measure the construct at least passably well. The clinician examiner, especially if he or she is acquainted in other ways with the examinee, may form impressions, perhaps compelling, of the validity of test results. The situation may be something like the following:

Test <-- construct

That is, the clinician uses a test that is a measure of a construct. The path coefficient relating the test to the construct (in the convention of structural equation modeling, the construct causes the test performance) may well be substantial. A more concrete example is provided by the following diagram:

IQ Test <--0.80-- intelligence

This diagram indicates that the construct of intelligence causes performance on an IQ test. We believe that IQ tests may actually be quite good measures of the construct of "intelligence." Probably clinicians who give intelligence tests believe that in most instances the test gives them a pretty good estimate of what we mean by intelligence, for example, 0.80 in this example. To use a term that will be invoked later, the clinician is "enlightened" by the results from the test.

As long as the clinical use of tests is confined to enlightenment about constructs, many tests may have reasonably good, maybe even very good "validity." The tests are good measures of the constructs. In many, if not most, clinical uses of tests, however, the tests are used in order to make decisions. Tests are used, for example, to decide whether a parent should have custody of a child, to decide whether a patient is likely to benefit from some form of therapy, to decide whether a child "should" be placed in a special classroom, or to decide whether a patient should be put on some particular medication. Using our IQ test example, we get a diagram of the following sort:

IQ Test <--0.80-- intelligence --0.50--> School grades

This diagram, which represents prediction rather than simply enlightenment, has two paths, and the second path is almost certain to have a far lower validity coefficient than the first one. Intelligence has a stronger relationship to performance on an IQ test than to performance in school. If an IQ test had construct validity of 0.80, and if intelligence as a construct were correlated 0.50 with school grades, which means that intelligence would account for 25% of the total variance in school grades, then the correlation between the IQ test and school grades would be only 0.80 x 0.50 = 0.40 (which is about what is generally found to be the case).

IQ Test <--0.40--> School grades

A very good measure of ego strength may not be a terribly good predictor of resistance to stress in some particular set of circumstances. Epstein (1983) pointed out some time ago that tests cannot be expected to relate especially well to specific behaviors, but it is in relation to specific behaviors that tests are likely to be used in clinical settings.
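The path arithmetic above can be checked directly. A minimal sketch, using the chapter's illustrative coefficients (0.80 and 0.50 are hypothetical values, not empirical estimates):

```python
# Path-tracing sketch: the implied test-criterion correlation is the
# product of the two path coefficients in the diagram above.
test_construct_validity = 0.80   # IQ Test <-- intelligence
construct_criterion_r = 0.50     # intelligence --> school grades

# Implied correlation between the IQ test and school grades
implied_r = test_construct_validity * construct_criterion_r

# Share of grade variance attributable to the intelligence construct
variance_explained = construct_criterion_r ** 2

print(implied_r)           # 0.4
print(variance_explained)  # 0.25
```

The multiplication makes the chapter's point concrete: even an excellent measure of a construct inherits the (usually much weaker) construct-criterion link when it is used for prediction.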

It could be argued, and has been (e.g., Meyer & Handler, 1997), that even modest validities like 0.40 are important. Measures with a validity of 0.40, for example, can improve one's prediction from the expectation that 50% of a group of persons will succeed at some task to the prediction that 70% will succeed. If the provider of a service cannot serve all eligible or needy persons, that improvement in prediction may be quite useful.
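The 50%-to-70% figure corresponds to the binomial effect size display (Rosenthal and Rubin), which converts a validity coefficient r into the success rates expected for favorably versus unfavorably predicted cases. The authors do not name the method, so this is a sketch under that assumption:

```python
def besd(r):
    """Binomial effect size display: split an overall 50% success rate
    into the rates expected for favorably vs. unfavorably predicted cases."""
    return 0.50 + r / 2, 0.50 - r / 2

favorable, unfavorable = besd(0.40)
print(round(favorable, 2), round(unfavorable, 2))  # 0.7 0.3
```

The same arithmetic yields the 30% figure discussed next: the unfavorably predicted group drops to 0.50 - 0.40/2 = 0.30.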

In clinical settings, however, decisions are made about individuals, not groups. To recommend that one person should not receive a service because the chances of benefit from the service are only 30% instead of the 50% that would be predicted without a test could be regarded as a rather bold decision for a clinician to make about a person in need of help. Hunter and Schmidt (1990) have developed very useful approaches to validity generalization that usually result in estimates of test validity well above the correlations reported in actual use, but their estimates apply at the level of theory, construct validity, rather than at the level of specific application as in clinical settings.

A recommendation to improve the clinical uses of tests can actually be made: test for more things. Think of the determinants of performance in school, say college, as an example. College grades depend on motivation, persistence, physical health, mental health, study habits, and so on. If clinical psychologists are serious about predicting performance in college, then they probably will need to measure several quite different constructs and then combine all those measures into a prediction equation. The measurement task may seem onerous, but it is worth remembering Cronbach's (1960) bandwidth vs. fidelity argument: it is often better to measure more things less well than to measure one thing extraordinarily well. A lot of measurement could be squeezed into the times usually allotted to low-bandwidth tests. The genius of the profession will come in the determination of what to measure and how to measure it. The combination of all the information, however, is likely best to be done by a statistical algorithm, for reasons that we will show later.
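As a sketch of what "combine all those measures into a prediction equation" could look like, here is a two-predictor least-squares combination. The variables and scores are invented purely for illustration, not real data or the authors' own analysis:

```python
# Hypothetical students: (ability score, study-habits rating, GPA).
students = [
    (110, 7, 3.4), (95, 5, 2.6), (120, 6, 3.6),
    (100, 8, 3.2), (90, 4, 2.3), (105, 6, 3.0),
]
x1 = [s[0] for s in students]
x2 = [s[1] for s in students]
y = [s[2] for s in students]

def mean(v):
    return sum(v) / len(v)

m1, m2, my = mean(x1), mean(x2), mean(y)
d1 = [v - m1 for v in x1]
d2 = [v - m2 for v in x2]
dy = [v - my for v in y]

# Sums of squares and cross-products for the 2x2 normal equations.
s11 = sum(a * a for a in d1)
s22 = sum(a * a for a in d2)
s12 = sum(a * b for a, b in zip(d1, d2))
sy1 = sum(a * b for a, b in zip(d1, dy))
sy2 = sum(a * b for a, b in zip(d2, dy))

# Solve for the regression weights by Cramer's rule, then the intercept.
det = s11 * s22 - s12 ** 2
b1 = (sy1 * s22 - sy2 * s12) / det
b2 = (sy2 * s11 - sy1 * s12) / det
b0 = my - b1 * m1 - b2 * m2

predicted = [b0 + b1 * a + b2 * b for a, b, _ in students]
```

With an intercept fitted, the residuals sum to zero, which is a quick sanity check on the algebra. The point is not the particular weights but that the combination rule is explicit and statistical rather than impressionistic.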

We are not negative toward psychological testing, but we think it is a lot more difficult and complicated than it is generally taken to be in practice. An illustrative case is provided by the differential diagnosis of attention deficit hyperactivity disorder (ADHD). There might be an ADHD scale somewhere, but a more responsible clinical study would recognize that the diagnosis can be difficult, and that the validity and certainty of the diagnosis of ADHD is greatly improved by using multiple measures and multiple reporting agents across multiple contexts. For example, one authority recommended beginning with an initial screening interview, in which the possibility of an ADHD diagnosis is ruled in, followed by an extensive assessment battery addressing multiple domains and usually including (depending upon age): a Wechsler Intelligence Scale for Children (WISC-III; McCraken & McCallum, 1993), a behavior checklist (e.g., Youth Self-Report (YSR); Achenbach & Edelbrock, 1987), an academic achievement battery (e.g., Kaufmann Assessment Battery for Children; Kaufmann & Kaufmann, 1985), a personality inventory (e.g., Millon Adolescent Personality Inventory (MAPI); Millon & Davis, 1993), a computerized sustained attention and distractibility test (Gordon Diagnostic System [GDS]; McClure & Gordon, 1984), and a semistructured or a structured clinical interview (e.g., Diagnostic Interview Schedule for Children [DISC]; Costello, Edelbrock, Kalas, Kessler, & Klaric, 1982).

The results from the diagnostic assessment may be used to further rule in or rule out ADHD as a diagnosis, in conjunction with child behavior checklists (e.g., CBCL, Achenbach & Edelbrock, 1983; Teacher Rating Scales, Goyette, Conners, & Ulrich, 1978) completed by the parent(s) and teacher, and additional school performance information. The parent and teacher complete both a historical list and then a daily behavior checklist for a period of two weeks in order to sample behaviors adequately. The information from home and school domains may be collected concurrently with evaluation of the diagnostic assessment battery, or the battery may be used initially to continue to rule in the diagnosis as a possibility, with collateral data collection proceeding afterwards. We are impressed with the recommended ADHD diagnostic process, but we do recognize that it would involve a very extensive clinical process that would probably not be reimbursable under most health insurance plans. We would also note, however, that the overall diagnostic approach is not based on any decision-theoretic approach that might guide the choice of instruments corresponding to a process of decision making; nor, alternatively, is the process guided by any algorithm for combining information so as to produce a decision. Our belief is that assessment in clinical psychology needs the same sort of attention and systematic study as is occurring in medical areas through such organizations as the Society for Medical Decision Making.

In summary, we think the above scenario, or similar procedures using similar instruments (e.g., Atkins, Pelham, & White, 1990; Hoza, Vollano, & Pelham, 1995), represents an exemplar of assessment practice. It should be noted, however, that the development of such multimodal batteries is an iterative process. One will soon reach the point of diminishing returns in the development of such batteries, and the incremental validity (Sechrest, 1963) of instruments should be assessed. ADHD is an example in which the important domains of functioning are understood, and thus can be assessed. We know of no examples other than ADHD of such systematic approaches to assessment for decision making. Although approaches such as those described here and by Pelham and his colleagues appear to be far from standard practice in the diagnosis of ADHD, we think they ought to be. The outlined procedure is modeled after a procedure developed by Gerald Peterson, Ph.D., Institute for Motivational Development, Bellevue, WA.
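The notion of incremental validity can be made concrete with the standard two-predictor multiple-correlation formula. The correlations below are invented for illustration only; they are not estimates for any ADHD instrument:

```python
def incremental_r2(r1, r2, r12):
    """Gain in squared multiple correlation from adding a second predictor.
    r1, r2: criterion validities of predictors 1 and 2; r12: their
    intercorrelation. Uses the standard two-predictor R^2 formula."""
    r_squared = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
    return r_squared - r1**2

# An instrument highly redundant with the existing battery adds little...
print(incremental_r2(0.40, 0.38, 0.90))
# ...while one tapping a distinct domain adds considerably more.
print(incremental_r2(0.40, 0.38, 0.30))
```

This is the point of diminishing returns described above: each added instrument improves prediction only to the extent that it is both valid and nonredundant with what is already in the battery.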

4.01.2 WHY ARE ASSESSMENTS DONE?

Why do we "test" in the first place? It is worth thinking about all the instances in which we do not test. For example, we usually do not test our own children, nor our spouses. That is because we have ample opportunities to observe the "performances" in which we are interested. That may be one reason that psychotherapists are disinclined to test their own clients: they have many opportunities to observe the behaviors in which they are interested, that is, if not the actual behaviors then reasonably good indicators of them. As we see it, testing is done primarily for one or more of three reasons: efficiency of observation, revealing cryptic conditions, and quantitative tagging.

Testing may provide for more efficient observation than most alternatives. For example, "tailing" a person, that method so dear to detective story writers, would prove definitive for many dispositions, but it would be expensive and often impractical or even unethical (Webb, Campbell, Schwartz, Sechrest, & Grove, 1981). It seems unlikely that any teacher would not have quite a good idea of the intelligence and personality of any of her pupils after at most a few weeks of a school year, but appropriate tests might provide useful information from the very first day. Probably clinicians involved in treating patients do not anticipate much gain in useful information after having held a few sessions with a patient. In fact, they may not anticipate much gain under most circumstances, which could account for the apparent infrequent use of assessment procedures in connection with psychological treatment.

Testing is also done in order to uncover "cryptic" conditions, that is, characteristics that are hidden from view or otherwise difficult to discern. In medicine, for example, a great many conditions are cryptic, blood pressure being one example: it can be made visible only by some device. Cryptic conditions have always been of great interest in clinical psychology, although their importance may have been exaggerated considerably. The Rorschach, a prime example of a putative decrypter, was hailed upon its introduction as "providing a window on the mind," and it was widely assumed that in skillful hands the Rorschach would make visible a wide range of hidden dispositions, even those unknown to the respondent (i.e., in "the unconscious"). Similarly, the Thematic Apperception Test was said to "expose underlying inhibited tendencies" of which the subject is unaware and to permit the subject to leave the test "happily unaware that he has presented the psychologist with what amounts to an X-ray picture of his inner self" (Murray, 1943, p. 1).

Finally, testing may be done, and often is done, in order to provide a quantitative "tag" for some disposition or other characteristic. In foot races, to take a mundane example, no necessity exists to time the races; it is sufficient to determine simply the order of the finish. Nonetheless, races are timed so that each one may be quantitatively tagged for sorting and other uses, for example, making comparisons between races. Similarly, there is scarcely ever any need for more than a crude indicator of a child's intelligence, for example, "well above average," such as a teacher might provide. Nonetheless, the urge toward seemingly precise quantification is strong, even if the precision is specious, and tests are used regularly to provide such estimates as "at the 78th percentile in aggression" or "IQ = 118." Although quantitative tags are used, and may be necessary, for some decision making, for example, the awarding of scholarships based on SAT scores, it is to be doubted that such tags are ever of much use in clinical settings.

4.01.2.1 Bounded vs. Unbounded Inference and Prediction

Bounded prediction is the use of a test or measure to make some limited inference or prediction about an individual, couple, or family, a prediction that might be limited in time, situation, or range of behavior (Levy, 1963; Sechrest, 1968). Some familiar examples of bounded prediction are predicting a college student's grade point average from an SAT score, assessing the likely response of an individual to psychotherapy for depression based on MMPI scores and a SCID interview, or prognosticating the outcome for a couple in marital therapy given their history. These predictions are bounded because they use particular measures to predict a specified outcome in a given context. Limits to bounded predictions are primarily based on knowledge of two areas: first, the reliability of the information, that is, the interview or test, for the population from which the individual is drawn; second, and most important, the relationship between the predictor and the outcome. That is to say, bounded predictions are limited by the validity of the predictor for the particular context in question.

Unbounded inference or prediction, which is common in clinical practice, is the practice of making a general assessment of an individual's tendencies, dispositions, and behavior, and inferring prognoses for situations that may not have been specified at the time of assessment. These are general statements made about individuals, couples, and families, based on interviews, diagnostic tests, responses to projective stimuli, and so forth, that indicate how these people are likely to behave across situations. Some unbounded predictions are simply descriptive statements, for example, with respect to personality, from which at some future time the clinician or another person might make an inference about a behavior not even imagined at the time of the original assessment. A clinician might be asked to apply previously obtained assessment information to an individual's ability to work, ability as a parent, likelihood of behaving violently, or even the probability that an individual might have behaved in some way in the past (e.g., abused a spouse or child). Thus, they are unbounded in context. Since reliability and validity require context, that is, a measure is reliable in particular circumstances, one cannot readily estimate the reliability and validity of a measure for unspecified circumstances.

To the extent that the same measures are used repeatedly to make the same type of prediction or judgment about individuals, the prediction becomes more bounded in nature. Thus, an initially unbounded prediction becomes bounded by the consistency of circumstances of repeated use. Under these circumstances, reliability, utility, and validity can be assessed in a standard manner (Sechrest, 1968). Without empirical data, unbounded predictions rest solely upon the judgment of the clinician, which has proven problematic (see Dawes, Faust, & Meehl, 1989; Grove & Meehl, 1996; Meehl, 1954). Again, the contrast with medical testing is instructive. In medicine, tests are generally associated with gathering additional information about specific problems or systems. Although one might have a "wellness" visit to detect level of functioning and signs of potential problems, it would be scandalous to have a battery of medical tests to "see how your health might be" under an unspecified set of circumstances. Medical tests are bounded. They are for specific purposes at specific times.

4.01.2.2 Prevalence and Incidence of Assessment

It is interesting to speculate about how much assessment is actually done in clinical psychology today. It is equally interesting to realize how little is known about how much assessment is done in clinical psychology today. What little is known has to do with the "incidence" of assessment, and that only from the standpoint of the clinician and only in summary form. Clinical psychologists report that a modest amount of their time is taken up by assessment activities. The American Psychological Association's (APA's) Committee for the Advancement of Professional Practice (1996) conducted a survey in 1995 of licensed APA members. With a response rate of 33.8%, the survey suggested that psychologists spend about 14% of their time conducting assessments, roughly six or seven hours per week. The low response rate, which ought to be considered disgraceful in a profession that claims to survive by science, is indicative of the difficulties involved in getting useful information about the practice of psychology in almost any area. The response rate was described as "excellent" in the report of the survey. Other estimates converge on about the same proportion of time devoted to assessment (Wade & Baker, 1977; Watkins, 1991; Watkins, Campbell, Nieberding, & Hallmark, 1995). Using data across a sizable number of surveys over a considerable period of time, Watkins (1991) concludes that about 50-75% of clinical psychologists provide at least some assessment services. We will say more later about the relative frequency of use of specific assessment procedures, but Watkins et al. (1995) did not find much difference in relative use across seven diverse work settings.

Think about what appears not to be known: the number of psychologists who do assessments in any period of time; the number of assessments that psychologists who do them actually do; the number or proportion of assessments that use particular assessment devices; the proportion of patients who are subjected to assessments; the problems for which assessments are done. And that does not exhaust the possible questions that might be asked. If, however, we take seriously the estimate that psychologists spend six or seven hours per week on assessment, then it is unlikely that those psychologists who do assessments could manage more than one or two per week; hence, only a very small minority of patients being seen by psychologists could be undergoing assessment. Wade and Baker (1977) found that psychologists claimed to be doing an average of about six objective tests and three projective tests per week, and that about a third of their clients were given at least one or the other of the tests, some maybe both. Those estimates do not make much sense in light of the overall estimate of only 15% of time (6-8 hours) spent in testing.

It is almost certain that those assessment activities in which psychologists do engage are carried out on persons who are referred by some other professional person or agency specifically for assessment. What evidence exists indicates that very little assessment is carried out by clinical psychologists on their own clients, either for diagnosis or for planning of treatment. Nor is there any likelihood that clinical psychologists refer their own clients to some other clinician for assessment. Some years ago, one of us (L. S.) began a study, never completed, of referrals made by clinical psychologists to other mental health professionals. The study was never completed in part because referrals were, apparently, very infrequent, mostly having to do with troublesome patients. A total of about 40 clinicians were queried, and in no instance did any of those clinical psychologists refer any client for psychological assessment.

Thus, we conclude that only a small minority of clients or patients of psychologists are subjected to any formal assessment procedures, a conclusion supported by Wade and Baker (1977), who found that relatively few clinicians appear to use standard methods of administration and scoring. Despite Wade and Baker's findings, it also seems likely that clinical psychologists do very little assessment on their own clients; most assessments are almost certainly on referral. Now contrast that state of affairs with the practice of medicine: assessment is at the heart of medical practice. Scarcely a medical patient ever gets any substantial treatment without at least some assessment. Merely walking into a medical clinic virtually guarantees that body temperature and blood pressure will be measured. Any indication of a problem that is not completely obvious will result in further medical tests, including referral of patients from the primary care physician to other specialists.

The available evidence also suggests that psychologists do very little in the way of formal assessment of clients prior to therapy or other forms of intervention. For example, books on psychological assessment, even in clinical psychology, may not even mention psychotherapy or other interventions (e.g., see Maloney & Ward, 1976), and the venerated and authoritative Handbook of psychotherapy and behavior change (Bergin & Garfield, 1994) does not deal with assessment except in relation to diagnosis, the prediction of response to therapy, and determining the outcomes of therapy; that is, there is no mention of assessment for planning therapy at any stage in the process. That is, we think, anomalous, especially when one contemplates the assessment activities of other professions. It is almost impossible even to get to speak to a physician without at least having one's temperature and blood pressure measured, and once in the hands of a physician, almost all patients are likely to undergo further explicit assessment procedures, for example, auscultation of the lungs, heart, and carotid arteries. Unless the problem is completely obvious, patients are likely to undergo blood or other body-fluid tests, imaging procedures, assessments of functioning, and so on. The same contrast could be made for chiropractors, speech and hearing specialists, optometrists, and, probably, nearly all other clinical specialists. Clinical psychology appears to have no standard procedures, not much interest in them, and no instruments for carrying them out in any case. Why is that?


One reason, we suspect, is that clinical psychology has never shown much interest in normal functioning and, consequently, does not have a very good capacity to identify normal responses or functioning. A competent specialist in internal medicine can usefully palpate a patient's liver, an organ he or she cannot see, because that specialist has been taught what a normal liver should feel like and what its dimensions should (approximately) be. A physician knows what normal respiratory sounds are. An optometrist certainly knows what constitutes normal vision and a normal eye. Presumably, a chiropractor knows a normal spine when he or she sees one.

Clinical psychology has no measures equivalent to body temperature and blood pressure, that is, quick, inexpensive screeners (vital signs) that can yield "normal" as a conclusion just as well as "abnormal." Moreover, clinical psychologists appear to have a substantial bias toward detection of psychopathology. The consequence is that clinical psychological assessment is not likely to provide a basis for a conclusion that a given person is "normal" and that no intervention is required. Obviously, the case is different for "intelligence," for which a conclusion of "average" or some such is quite common.

By their nature, psychological tests are not likely to offer many surprises. A medical test may reveal a completely unexpected condition of considerable clinical importance, for example, even in a person merely being subjected to a routine examination. Most persons who come to the attention of psychologists and other mental health professionals are there because their behavior has already betrayed important anomalies, either to themselves or to others. A clinical psychologist would be quite unlikely to administer an intelligence test to a successful businessman and discover, completely unexpectedly, that the man was really "stupid." Tests are likely to be used only for further exploration or verification of problems already evident. If the problems are already evident, then the clinician managing the case may not see any particular need for further assessment.

A related reason that clinical psychologists appear to show so little inclination to do assessment of their own patients probably has to do with the countering inclination of clinical psychologists, and other similarly placed clinicians, to arrive at early judgments of patients based on initial impressions. Meehl (1960) noted that phenomenon many years ago, and it likely has not changed. Under those circumstances, testing of clients would have very little incremental value (Sechrest, 1963) and would seem unnecessary. At this point, it may be worth repeating that apparently no information is available on the specific questions for which psychologists make assessments when they do so.

Finally, we do believe that current limitations on practice imposed by managed care organizations are likely to limit even further the use of assessment procedures by psychologists. Pressures are toward very brief interventions, and that probably means even briefer assessments.

4.01.2.3 Proliferation of Assessment Devices

Clinical psychology has experienced an enormous proliferation of tests since the 1960s. We are referring here to commercially published tests, available for sale and for use in relation to clinical problems. For example, inspection of four current test catalogs indicates that there are at least a dozen different tests (scales, inventories, checklists, etc.) related to attention deficit disorder (ADD) alone, including forms of ADD that may not even exist, for example, adult ADD. One of the test catalogs is 100 pages, two are 176 pages, and the fourth is an enormous 276 pages. Even allowing for the fact that some catalog pages are taken up with advertisements for books and other such, the amount of test material available is astonishing. These are only four of perhaps a dozen or so catalogs we have in our files.

In the mid-1930s Buros published the first listings of psychological tests to help guide users in a variety of fields in choosing an appropriate assessment instrument. These early uncritical listings of tests developed into the Mental measurements yearbook, and by 1937 the listings had expanded to include published test reviews. The Yearbook, which includes tests and reviews of new and revised tests published for commercial use, has continued to grow and is now in its 12th edition (1995). The most recent edition reviewed 418 tests available for use in education, psychology, business, and psychiatry. Buros' Mental Measurements Yearbook is a valuable resource for testers, but it also charts the growth of assessment instruments. In addition to instruments published for commercial use, there are scores of other tests developed yearly for noncommercial use that are never reviewed by Buros. Currently, there are thousands of assessment instruments available for researchers and practitioners to choose from.

The burgeoning growth in the number of tests has been accompanied by increasing commercialization as well. The monthly Monitor published by the APA is replete with ads for test instruments for a wide spectrum of purposes. Likewise, APA conference attendees are inundated with preconference mailings advertising tests and detailing the location of the test publisher's booth at the conference site. Once at the conference, attendees are often struck by the slick presentation of the booths and hawking of the tests. Catalogs put out by test publishers are now also slick, in more ways than one. They are printed in color on coated paper and include a lot of messages about how convenient and useful the tests are, with almost no information at all about reliability and validity beyond assurances that one can count on them.

The proliferation of assessment instruments and commercial development are not inherently detrimental to the field of clinical psychology. They simply make it more difficult to choose an appropriate test that is psychometrically sound, as glib ads can be used as a substitute for the presentation of sound psychometric properties and critical reviews. This is further complicated by the availability of computer scoring and software that can generate assessment reports. The ease of computer-based applications such as these can lead to their uncritical application by clinicians. Intense marketing of tests may contribute to their misuse, for example, by persuading clinical psychologists that the tests are remarkably simple and by convincing those same psychologists that they know more than they actually do about tests and their appropriate uses.

Multiple tests, even several tests for every construct, might not necessarily be a bad idea in and of itself, but we believe that the resources in psychology are simply not sufficient to support the proper development of so many tests. Few of the many tests available can possibly be used on more than a very few thousand cases per year, and perhaps not even that. The consequence is that profit margins are not sufficient to support really adequate test development programs. Tests are put on the market and remain there with small normative samples, with limited evidence for validity, which is much more expensive to produce than evidence for reliability, and with almost no prospect for systematic exploration of the other psychometric properties of the items, for example, discrimination functions or tests of their calibration (Sechrest, McKnight, & McKnight, 1996).

One of us (L. S.) happens to have been a close spectator of the development of the SF-36, a now firmly established and highly valued measure of health and functional status (Ware & Sherbourne, 1992). The SF-36 took 15-20 years for its development, having begun as an item pool of more than 300 items. Over the years literally millions of dollars were invested in the development of the test, and it was subjected, often repeatedly, to the most sophisticated psychometric analyses and to detailed scrutiny of every individual item. The SF-36 has now been translated into at least 37 languages and is being used in an extraordinarily wide variety of research projects. More important, however, the SF-36 is also being employed routinely in evaluating outcomes of clinical medical care. Plans are well advanced for use of the SF-36 that will result in its administration to 300 000 patients in managed care every year. It is possible that over the years the Wechsler intelligence tests might have a comparable history of development, and the Minnesota Multiphasic Personality Inventory (MMPI) has been the focus of a great many investigations, as has the Rorschach. Neither of the latter, however, has been the object of systematic development efforts funded centrally, and scarcely any of the many other tests now available are likely to be subjected to anything like the same level of development effort (e.g., consider that in its more than 70-year history, the Rorschach has never been subjected to any sort of revision of its original items).

Several factors undoubtedly contribute to the proliferation of psychological tests (not the least, we suspect, being their eponymous designation and the resultant claim to fame), but surely one of the most important would be the fragmentation of psychological theory, or what passes for theory. In 1995 a task force was assembled under the auspices of the APA to try to devise a uniform test (core) battery that would be used in all psychotherapy research studies (Strupp, Horowitz, & Lambert, 1997). The effort failed, in large part because of the many points of view that seemingly had to be represented and the inability of the conferees to agree even on any outcomes that should be common to all therapies. Again, the contrast with medicine and the nearly uniform acceptance of the SF-36 is stark.

Another reason for the proliferation of tests in psychology is, unquestionably, the seeming ease with which they may be "constructed." Almost anyone with a reasonable "construct" can write eight or 10 self-report items to "measure" it, and most likely the new little scale will have "acceptable" reliability. A correlation or two with some other measure will establish its "construct validity," and the rest will eventually be history. All that is required to establish a new projective test, it seems, is to find a set of stimuli that have not, according to the published literature, been used before and then show that responses to the stimuli are suitably strange, perhaps stranger for some folks than others. For example, Sharkey and Ritzler (1985) noted a new Picture Projective Test that was created by using photographs from a photo essay. The pictures were apparently selected based on the authors' opinions about their ability to elicit "meaningful projective material," meaning responses with affective content and activity themes. No information was given pertaining to comparison of various pictures and their responses nor relationships to other measures of the target constructs; no comparisons were made to pictures that were deemed inappropriate. The "validation" procedure simply compared diagnoses to those in charts and results of the TAT. Although rater agreement was assessed, there was no formal measurement of reliability.

New tests are cheap, it seems. One concern is that so many new tests appear also to imply new constructs, and one wonders whether clinical psychology can support anywhere near as many constructs as are implied by the existence of so many measures of them. Craik (1986) made the eminently sensible suggestion that every "new" or infrequently used measure used in a research project should be accompanied by at least one well-known and widely used measure from the same or a closely related domain. New measures should be admitted only if it is clear that they measure something of interest and are not redundant, that is, have discriminant validity. That recommendation, if followed, would likely reduce the array of measures in clinical psychology by a remarkable degree.

The number of tests that are taught in graduate school for clinical psychology is far lower than the number available for use. The standard stock-in-trade are IQ tests such as the Wechsler Adult Intelligence Scale (WAIS), personality profiles such as the MMPI, diagnostic instruments (Structured Clinical Interview for DSM-III-R [SCID]), and at some schools, the Rorschach as a projective test. This list is rounded out by a smattering of other tests like the Beck Depression Inventory and the Millon. Recent standard application forms for clinical internships developed by the Association of Psychology Postdoctoral and Internship Centers (APPIC) asked applicants to report on their experience with 47 different tests and procedures used for adult assessment and 78 additional tests used with children! It is very doubtful that training programs actually provide training in more than a handful of the possible devices.

Training in testing (assessment) is not at all the same as training in measurement and psychometrics. Understanding how to administer a test is useful but cannot substitute for evaluating the psychometric soundness of tests. Without grounding in such principles, it is easy to fall prey to glib ads and ease of computer administration without questioning the quality of the test. Psychology programs appear, unfortunately, to be abandoning training in basic measurement and its theory (Aiken, West, Sechrest, & Reno, 1990).

4.01.2.4 Over-reliance on Self-report

"Where does it hurt?" is a question often heard in physicians' offices. The physician is asking the patient to self-report on the subjective experience of pain. Depending on the answer, the physician may prescribe some remedy, or may order tests to examine the pain more thoroughly and obtain objective evidence about the nature of the affliction before pursuing a course of treatment. The analog heard in psychologists' offices is "How do you feel?" Again, the inquiry calls forth self-report on a subjective experience, and like the physician, the psychologist may determine that tests are in order to better understand what is happening with the client.

When the medical patient goes for testing, she or he is likely to be poked, prodded, or pricked so that blood samples and X-rays can be taken. The therapy client, in contrast, will most likely be responding to a series of questions in an interview or answering a pencil-and-paper questionnaire. The basic difference between these is that the client in clinical psychology will continue to use self-report in providing a sample, whereas the medical patient will provide objective evidence.

Despite the proliferation of tests in recent years, few rely on evidence other than the client's self-report for assessing behavior, symptoms, or mood state. Often assessment reports remark that the information gleaned from testing was corroborated by interview data, or vice versa, without recognizing that both rely on self-report alone. The problems with self-report are well documented: poor recall of past events, motivational differences in responding, social desirability bias, and malingering, for example. Over-reliance on self-report is a major criticism of psychological assessment as it is currently conducted and was the topic of a recent conference sponsored by the National Institute of Mental Health.

What alternatives are there to self-report? Methods of obtaining data on a client's behavior that do not rely on self-report do exist. Behavioral observation with rating by judges can permit the assessment of behavior, often without the client's awareness or outside the confines of an office setting. Use of other informants such as family members or co-workers to provide data can yield valuable information about a client. Yet, all too often these alternatives are not pursued because they involve time or resources; in short, they are demanding approaches. Compared with asking a client about his or her mood state over the last week, organizing field work or contacting informants involves a great deal more work and time.

Instruments are available to facilitate collection of data not relying so strongly on self-report and for collection of data outside the office setting, for example, the Child Behavior Checklist (CBCL; Achenbach & Edelbrock, 1983). The CBCL is meant to assist in diagnosing a range of psychological and behavior problems in children, and it relies on parent, teacher, and self-reports of behavior. Likewise, neuropsychological tests utilize functional performance measures much more than self-report. However, as Craik (1986) noted with respect to personality research, methods such as field studies are not widely used as alternatives to self-report. This problem of over-reliance on self-report is not new (see Webb, Campbell, Schwartz, & Sechrest, 1966).

4.01.3 PSYCHOMETRIC ISSUES WITH RESPECT TO CURRENT MEASURES

Consideration of the history and current status of clinical assessment must deal with some fundamental psychometric issues and practices. Although psychometric is usually taken to refer to reliability and validity of measures, matters are much more complicated than that, particularly in light of developments in psychometric theory and method since the 1960s, which seem scarcely to have penetrated clinical assessment as an area. Specifically, generalizability theory and Item Response Theory (IRT) offer powerful tools with which to explore and develop clinical assessment procedures, but they have seen scant use in that respect.

4.01.3.1 Reliability

The need for "reliable" measures is by now well accepted in all of psychology, including clinical assessment. What is not so widespread is the necessary understanding of what constitutes reliability and the various uses of that term. In their now classic presentation of generalizability theory, Cronbach and his associates (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) used the term "dependability" in a way that is close to what is meant by reliability, but they made especially clear, as classical test theory had not, that measures are dependable (generalizable) in very specific ways, that is, that they are dependable across some particular conditions of use (facets), and assessments of dependability are not at all interchangeable. For example, a given assessment may be highly dependable across particular items but not necessarily across time. An example might be a measure of mood, which ought to have high internal consistency (i.e., across items) but that might not, in fact, should not, have high dependability over time, else the measure would be better seen as a trait rather than as a mood measure.
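The mood-measure example can be loosely illustrated with two conventional estimates, one per facet. This is a sketch with invented data, not a full generalizability analysis (which would estimate variance components); the scale and scores are hypothetical.

```python
# Sketch: the same hypothetical 3-item mood scale can be highly
# dependable across items yet not across time. Data are invented.
import statistics

def cronbach_alpha(rows):
    """Internal consistency across items (rows = respondents x items)."""
    k = len(rows[0])
    item_vars = [statistics.pvariance([r[i] for r in rows]) for i in range(k)]
    total_var = statistics.pvariance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def pearson_r(x, y):
    """Retest dependability proxy: correlation of totals over occasions."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / (sx * sy) ** 0.5

# Four respondents, three items, one occasion: items cohere.
time1 = [[1, 2, 1], [4, 5, 4], [2, 2, 3], [5, 4, 5]]
totals_t1 = [sum(r) for r in time1]
# Invented retest totals, nearly unrelated to time 1 (mood shifted).
totals_t2 = [10, 10, 8, 8]
```

Here `cronbach_alpha(time1)` is high (about 0.95) while the retest correlation is near zero; quoting either one alone as "the reliability" would answer two quite different questions.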

An assessment procedure might be highly dependable in terms of internal consistency and across time but not satisfactorily dependable across users, for example, being susceptible to a variety of biases characteristic of individual clinicians. Or an assessment procedure might not be adequately dependable across conditions of its use, as might be the case when a measure is taken from a research to a clinical setting. Or an assessment procedure might not be dependable across populations, for example, a projective instrument useful with mental patients might be misleading if used with imaginative and playful college students.

Issues of dependability are starkly critical when one notes the regrettably common practice of justifying the use of a measure on the ground that it is "reliable," often without even minimal specification of the facet(s) across which that reliability was established. The practice is even more regrettable when, as is often the case, only a single value for reliability is given when many are available and when one suspects that the figure reported was not chosen randomly from those available. Moreover, it is all too frequently the case that the reliability estimate reported is not directly relevant to the decisions to be made. Internal consistency, for example, may not be as important as generalizability over time when one is using a screening instrument. That is, if one is screening a population for psychopathology, it may not be of great interest that two persons with the same scores are different in terms of their manifestations of pathology, but it is of great interest whether, if one retested them a day or so later, the scores would be roughly consistent.

In short, clinical assessment in psychology is unfortunately casual in its use of reliability estimates, and it is shamefully behind the curve in its attention to the advantages provided by generalizability theory, originally proposed in 1963 (Cronbach, Rajaratnam, & Gleser, 1963).

4.01.3.2 Validity

It is customary to treat validity of measures as a topic separate from reliability, but we think that is not only unnecessary but undesirable. In our view, the validity of measures is simply an extension of generalizability theory to the question of what other performances, aside from those involved in the test itself, the score is generalizable to. A test score that is generalizable to another very similar performance, say on the same set of test items or over a short period of time, is said to be reliable. A test score that is generalizable to a score on another similar test is sometimes said to be "valid," but we think that a little reflection will show that unless the tests demand very different kinds of performances, generalizability from one test to another is not much beyond the issues usually regarded as having to do with reliability. When, however, a test produces a score that is informative about another very different kind of performance, we gradually move over into the realm termed validity, such as when a paper-and-pencil test of "readiness for change" (Prochaska, DiClemente, & Norcross, 1992) predicts whether a client will benefit from treatment or even just stay in treatment.

We will say more later about construct validity, but a test or other assessment procedure may be said to have construct validity if it produces generalizable information and if that information relates to performances that are conceptually similar to those implied by the name or label given to the test. Essentially, however, any measure that does not produce scores by some random process is by that definition generalizable to some other performance and, hence, to that extent may be said to be valid. What a given measure is valid for, that is, generalizable to, however, is a matter of discovery as much as of plan. All instruments used in clinical assessment should be subjected to comprehensive and continuing investigation in order to determine the sources of variance in scores. An instrument that has good generalizability over time and across raters may turn out to be, among other things, a very good measure of some response style or other bias. The MMPI includes a number of "validity" scales designed to assess various biases in performance on it, and it has been subjected to many investigations of bias. The same cannot be said of some other widely used clinical assessment instruments and procedures. To take the most notable example, of the more than 1000 articles on the Rorschach that are in the current PsychInfo database, only a handful, about 1%, appear to deal with issues of response bias, and virtually all of those are on malingering, and most of them are unpublished dissertations.

4.01.3.3 Item Response Theory

Although Item Response Theory (IRT) is a potentially powerful tool for the development and study of measures of many kinds, its use to date has not been extensive beyond the area of ability testing. The origins of IRT go back at least to the early 1950s and the publication of Lord's (1952) monograph, A theory of test scores, but it has had little impact on measurement outside the arena of ability testing (Meier, 1994). Certainly it has had almost no impact on clinical assessment. The current PsychInfo database includes only two references to IRT in relation to the MMPI and only one to the Rorschach, and the latter one, now 10 years old, is an entirely speculative mention of a potential application of IRT (Samejima, 1988).

IRT, perhaps to some extent narrowly imagined to be relevant only to test construction, can be of great value in exploring the nature of measures and improving their interpretation. For example, IRT can be useful in understanding just when scores may be interpreted as unidimensional and then in determining the size of gaps in underlying traits represented by adjacent scores. An example could be the interpretation of Whole responses on the Rorschach. Is the W score a unidimensional score, and, if so, is each increment in that score to be interpreted as an equal increment? Some cards are almost certainly more difficult stimuli to which to produce a W response, and IRT could calibrate that aspect of the cards. IRT would be even more easily used for standard paper-and-pencil inventory measures, but the total number of applications to date is small, and one can only conclude that clinical assessment is being short-changed in its development.
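The idea that items (or cards) differ in difficulty for the same response can be made concrete with the standard two-parameter logistic (2PL) item response function. The item parameters below are invented for illustration; they are not calibrated Rorschach values.

```python
# Sketch of the two-parameter logistic (2PL) IRT model.
import math

def p_response(theta, difficulty, discrimination):
    """Probability that a person at trait level theta produces the
    keyed response to an item with the given 2PL parameters."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Two hypothetical "cards" keyed for the same response: one easy
# (difficulty -1.0), one hard (difficulty +1.0), same discrimination.
easy_card = p_response(0.0, difficulty=-1.0, discrimination=1.5)
hard_card = p_response(0.0, difficulty=1.0, discrimination=1.5)
```

A person of average trait level is far more likely to produce the keyed response to the easy card than to the hard one, which is why an unweighted count of such responses treats unequal evidence as equal; calibrating difficulties is exactly what IRT would contribute.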

4.01.3.4 Scores on Tests

Lord's (1952) monograph was aimed at tests with identifiable underlying dimensions such as ability. Clinical assessment appears never to have had any theory of scores on instruments included under that rubric. That is, there seems never to have been proposed or adapted any unifying theory about how test scores on clinical instruments come about. Rather, there seems to have been a passive, but not at all systematic, adoption of general test theory, that is, the idea that test scores are in some manner generated by responses representing some underlying trait. That casual approach cannot forward the development of the field.

Fiske (1971) has come about as close as anyone to formulating a theory of test scores for clinical assessment, although his ideas pertain more to how such tests are scored than to how they come about, and his presentation was directed toward personality measurement rather than clinical assessment. He suggested several models for scoring test, or otherwise observed, responses. The simplest model is what we may call the cumulative frequency model, which simply increments the score by 1 for every observed response. This is the model that underlies many Rorschach indices. It assumes that every response is equivalent to every other one, and it ignores the total number of opportunities for observation. Thus, each Rorschach W response counts as 1 for that index, and the index is not adjusted to take account of the total number of responses. A second model is the relative frequency model, which forms an index by dividing the number of observed critical responses by some indicator of opportunities to form a rate of responding, for example, as would be accomplished by counting W responses and dividing by the total number of responses or by counting W responses only for the first response to each card. Most paper-and-pencil inventories are scored implicitly in that way, that is, they count the number of critical responses in relation to the total number possible.

A long story must be made short here, but Fiske describes other models, and still more are possible. One may weight responses according to the inverse of their frequency in a population on the grounds that common responses should count for less than rare responses. Or one may weight responses according to the judgments of experts. One can assign the average weight across a set of responses, a common practice, but one can also assign as the score the weight of the most extreme response, for example, as runners are often rated on the basis of their fastest time for any given distance. Pathology is often scored in that way, for example, a pathognomic response may outweigh many mundane, ordinary responses.
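The scoring models just described can be sketched as simple functions. The protocol, population rates, and weights below are invented for illustration only; they correspond to no actual scoring system.

```python
# Sketch of the scoring models discussed above, with invented data.

def cumulative_frequency(responses, critical):
    """Increment the score by 1 for every critical response observed."""
    return sum(1 for r in responses if r in critical)

def relative_frequency(responses, critical):
    """Rate of critical responding: critical count / opportunities."""
    return cumulative_frequency(responses, critical) / len(responses)

def inverse_frequency_score(responses, population_rates):
    """Weight each response by the inverse of its population frequency,
    so rare responses count for more than common ones."""
    return sum(1.0 / population_rates[r] for r in responses)

def extreme_response_score(responses, weights):
    """Score by the single most heavily weighted response, as when one
    pathognomic response outweighs many mundane, ordinary ones."""
    return max(weights[r] for r in responses)

# A toy five-response protocol using hypothetical location codes.
protocol = ["W", "D", "W", "Dd", "W"]
rates = {"W": 0.5, "D": 0.4, "Dd": 0.1}   # invented base rates
weights = {"W": 1, "D": 1, "Dd": 3}        # invented expert weights
```

On this toy protocol the cumulative W count is 3 regardless of protocol length, while the relative frequency (0.6) adjusts for the number of opportunities; the inverse-frequency and extreme-response scores would rank the same protocol quite differently, which is the point of making the scoring model explicit.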

The point is that clinical assessment instruments and procedures only infrequently have any explicit basis in a theory of responses. For the most part, scores appear to be derived in some standard way without much thought having been given to the process. It is not clear how much improvement in measures might be achieved by more attention to the development of a theory of scores, but it surely could not hurt to do so.

4.01.3.5 Calibration of Measures

A critical limitation on the utility of psychological measures of any kind, but certainly in their clinical application, is the fact that the measures do not produce scores in any directly interpretable metric. We refer to this as the calibration problem (Sechrest, McKnight, & McKnight, 1996). The fact is that we have only a very general knowledge of how test scores may be related to any behavior of real interest. We may know in general that a score of 70, let us say, on an MMPI scale is "high," but we do not know very well what might be expected in the behavior of a person with such a score. We would know even less about what difference it might make if the score were reduced to 60 or increased to 80, except that in one case we might expect some diminution in problems and in the other some increase. In part, the lack of calibration of measures in clinical psychology stems from lack of any specific interest and diligence in accomplishing the task. Clinical psychology has been satisfied with "loose calibration," and that stems in part, as we will assert later, from adoption of the uninformative model of significance testing as a standard for validation of measures.

4.01.4 WHY HAVE WE MADE SO LITTLE PROGRESS?

It is difficult to be persuaded that progress in assessment in clinical psychology has been substantial in the past 75 years, that is, since the introduction of the Rorschach. Several arguments may be adduced in support of that statement, even though we recognize that it will be met with protests. We will summarize what we think are telling arguments in terms of theory, formats, and validities of tests.

First, we do not discern any particular improvements in theories of clinical testing and assessments over the past 75 years. The Rorschach, and the subsequent formulation of the projective hypothesis, may be regarded as having been to some extent innovations; they are virtually the last ones in the modern history of assessment. As noted, clinical assessment lags well behind the field in terms of any theory of either the stimuli or responses with which it deals, let alone the connections between them. No theory of assessment exists that would guide selection of stimuli to be presented to subjects, and certainly none pertains to the specific format of the stimuli nor to the nature of the responses required. Just to point to two simple examples of the deficiency in understanding of response options, we note that there is no theory to suggest whether in the case of a projective test responses should be followed by any sort of inquiry about their origins, and there is no theory to suggest in the case of self-report inventories whether items should be formulated so as to produce endorsements of the "this is true of me" nature or so as to produce descriptions such as "this is what I do."

Given the lack of any gains in theory about the assessment enterprise, it is not surprising that there have also not been any changes in test formats since the introduction of the Rorschach. Projective tests based on the same simple (and inadequate) hypothesis are still being devised, but not one has proven itself in any way better than anything that has come before. Item writers may be a bit more sophisticated than those in the days of the Bernreuter, but items are still constructed in the same way, and response formats are the same as ever, "agree-disagree," "true-false," and so on.

Even worse, however, is the fact that absolutely no evidence exists to suggest that there have been any mean gains in the validities of tests over the past 75 years. Even for tests of intellectual functioning, typical correlations with any external criterion appear to average around 0.40, and for clinical and personality tests the typical correlations are still in the range of 0.30, the so-called "personality coefficient." This latter point, that validities have remained constant, may, of course, be related to the lack of development of theory and to the fact that the same test formats are still in place.

Perhaps some psychologists may take exception to the foregoing and cite considerable advances. Such claims are made for the Exner (1986) improvements on the Rorschach, known as the "comprehensive system," and for the MMPI-2, but although both claims are superficially true, there is absolutely no evidence for either claim from the standpoint of validity of either test. The Exner comprehensive system seems to have "cleaned up" some aspects of Rorschach scoring, but the improvements are marginal, for example, it is not as if inter-rater reliability increased from 0.0 to 0.8, and no improvements in validity have been established. Even the improvements in scoring have been demonstrated for only a portion of the many indexes. The MMPI-2 was only a cosmetic improvement over the original, for example, getting rid of some politically incorrect items, and no increase in the validity of any score or index seems to have been demonstrated, nor is any likely.

An additional element in the lack of evident "progress" in the validity of test scores may be lack of reliability (and validity!) of the people being predicted. (One wise observer suggested that we would not really like it at all if behavior were 90% predictable! Especially our own.) We may just have reached the limits of our ability to predict what is going to happen with and to people, especially with our simple-minded and limited assessment efforts. As long as we limit our assessment efforts to the dispositions of the individuals who are clients and ignore their social milieus, their real environmental circumstances, their genetic possibilities, and so on, we may not be able to get beyond correlations of 0.3 or 0.4.

The main "advance" in assessment over the past 75 years is not that we do anything really better but that we do it much more widely. We have many more scales than existed in the past, and we can at least assess more things than ever before, even if we can do that assessment only, at best, passably well.

Woodworth (1937/1992) wrote in his article on the future of clinical psychology that, "There can be no doubt that it will advance, and in its advance throw into the discard much guesswork and half-knowledge that now finds baleful application in the treatment of children, adolescents and adults" (p. 16). It appears to us that the opposite has occurred. Not only have we failed to discard guesswork and half-knowledge, that is, tests and treatments with years of research indicating little effect or utility, we have continued to generate procedures based on the same flawed assumptions with the misguided notion that if we just make a bit of a change here and there, we will finally get it right. Projective assessments that tell us, for example, that a patient is psychotic are of little value; psychologists have more reliable and less expensive ways of determining this. More direct methods have higher validity in the majority of cases. The widespread use of these procedures at high actual and opportunity cost is not justified by the occasional addition of information. It is not possible to know ahead of time which individuals might give more information via an indirect method, and most of the time it is not even possible to know afterwards whether indirectly obtained "information" is correct unless the information has also been obtained in some other way, that is, by asking the person, asking a relative, or doing a structured interview. It is unlikely that projective test responses will alter clinical intervention in most cases, nor should they.

Is it fair to say that clinical psychology has no standards (see Sechrest, 1992)? Clinical psychology gives the appearance of standards with accreditation of programs, internships, licensure, ethical standards, and so forth. It is our observation, however, that there is little to no monitoring of the purported standards. For example, in reviewing recent literature as background to this chapter, we found articles published in peer-reviewed journals using projective tests as outcome measures for treatment. The APA ethical code of conduct states that psychologists ". . . use psychological assessment . . . for purposes that are appropriate in light of the research on or evidence of the . . . proper application of the techniques." The APA document, Standards for educational and psychological testing, states:

Page 13: Comprehensive Clinical Psychology Bellack 1998

. . . Validity, however, is a unitary concept. Although evidence may be accumulated in many ways, validity always refers to the degree to which that evidence supports the inferences that are made from the scores. The inferences regarding specific uses of a test are validated, not the test itself. (APA, 1985, p. 9)

Further, the section titled Professional standards for test use (APA, 1985, p. 42, Standard 6.3) states:

When a test is to be used for a purpose for which it has not been previously validated, or for which there is no supported claim for validity, the user is responsible for providing evidence of validity.

No body of research exists to support the validity of any projective instrument as the sole outcome measure for treatment, or as the sole measure of anything. So not only do questionable practices go unchecked, they can result in publication.

4.01.4.1 The Absence of the Autopsy

Medicine has always been disciplined by the regular occurrence of the autopsy. A physician makes a diagnosis and treats a patient, and if the patient dies, an autopsy will be done, and the physician will receive feedback on the correctness of his or her diagnosis. If the diagnosis were wrong, the physician would to some extent be called to account for that error; at least the error would be known, and the physician could not simply shrug it off. We know that the foregoing is idealized, that autopsies are not done in more than a fraction of cases, but the model makes our point. Physicians make predictions, and they get feedback, often quickly, on the correctness of those predictions. Surgeons send tissue to be biopsied by pathologists who are disinterested; internists make diagnoses based on various signs and symptoms and then order laboratory procedures that will inform them about the correctness of their diagnoses; family practitioners make diagnoses and prescribe treatment, which, if it does not work, they are virtually certain to hear about.

Clinical psychology has no counterpart to the autopsy, no systematic provision for checking on the correctness of a conclusion and then providing feedback to the clinician. Without some form of systematic checking and feedback, it is difficult to see how either instruments or clinicians' use of them could be regularly and incrementally improved. Psychologist clinicians have been allowed the slack involved in making unbounded predictions and then not getting any sort of feedback on the potential accuracy of even those loose predictions. We are not sure how much improvement in clinical assessment might be possible even with exact and fairly immediate feedback, but we are reasonably sure that very little improvement can occur without it.

4.01.5 FATEFUL EVENTS CONTRIBUTING TO THE HISTORY OF CLINICAL ASSESSMENT

The history of assessment in clinical psychology is somewhat like the story of the evolution of an organism in that at critical junctures, when the development of assessment might well have gone one way, it went another. We want to review here several points that we consider to be critical in the way clinical assessment developed within the broader field of psychology.

4.01.5.1 The Invention of the Significance Test

The advent of hypothesis testing in psychology had fateful consequences for the development of clinical assessment, as well as for the rest of psychology (Gigerenzer, 1993). Hypothesis testing encouraged a focus on the question whether any predictions or other consequences of assessment were "better than chance," a distinctly loose and undemanding criterion of "validity" of assessment. The typical validity study for a clinical instrument would identify two groups that would be expected to differ in some "score" derived from the instrument and then ask the question whether the two groups did in fact (i.e., to a statistically significant degree) differ in that score. It scarcely mattered by how much they differed or in what specific way, for example, an overall mean difference vs. a difference in proportions of individuals scoring beyond some extreme or otherwise critical value. The existence of any "significant" difference was enough to justify triumphant claims of validity.
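The weakness of the "better than chance" criterion is easy to demonstrate numerically. In the sketch below, all numbers are invented for illustration: two simulated groups differ by only a tenth of a standard deviation, so their score distributions overlap almost completely, yet with large enough samples the t statistic sails past any conventional significance cutoff.

```python
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

random.seed(42)
n = 20_000
controls = [random.gauss(0.0, 1.0) for _ in range(n)]
patients = [random.gauss(0.1, 1.0) for _ in range(n)]  # true difference: 0.1 SD

t = welch_t(patients, controls)

# standardized effect size (Cohen's d, pooled SD)
pooled_sd = ((statistics.variance(controls) + statistics.variance(patients)) / 2) ** 0.5
d = (statistics.mean(patients) - statistics.mean(controls)) / pooled_sd

print(f"t = {t:.1f}")  # far beyond the 1.96 cutoff: highly "significant"
print(f"d = {d:.2f}")  # yet the effect is trivially small
```

The point of the sketch is that "significant" answers only whether a difference is plausibly nonzero; it says nothing about whether the instrument discriminates well enough to be clinically useful.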

4.01.5.2 Ignoring Decision Making

One juncture had to do with bifurcation of the development of clinical psychology from other streams of assessment development. Specifically, intellectual assessment and assessment of various capacities and propensities relevant to performance in work settings veered in the direction of assessment for decision-making (although not terribly sharply nor completely), while assessment in clinical psychology went in the direction of assessment for enlightenment. What eventually happened is that clinical psychology failed to adopt any rigorous criterion of correctness of decisions made on the basis of assessed performance, but adopted instead a conception of assessments as generally informative or "correct."

Simply to make the alternative clear, the examples provided by medical assessment are instructive. The model followed in psychology would have resulted in medical research of some such nature as showing that two groups that "should" have differed in blood pressure, for example, persons having just engaged in vigorous exercise vs. persons having just experienced a rest period, differed significantly in blood pressure readings obtained by a sphygmomanometer. Never mind by how much they differed or what the overlap between the groups. The very existence of a "significant" difference would have been taken as evidence for the "validity" of the sphygmomanometer.

Instead, however, medicine focused more sharply on the accuracy of decisions made on the basis of assessment procedures. The aspect of biomedical assessment that most clearly distinguishes it from clinical psychological assessment is its concern for the sensitivity and specificity of measures (instruments) (Kraemer, 1992). Kraemer's book, Evaluating medical tests: Objective and quantitative guidelines, has not even a close counterpart in psychology, which is, itself, revealing. These two characteristics of measures are radically different from the concepts of validity used in psychology, although "criterion validity" (now largely abandoned) would seem to require such concepts.

Sensitivity refers to the proportion of cases having a critical characteristic that are identified by the test. For example, if a test were devised to select persons likely to benefit from some form of therapy, sensitivity would refer to the proportion of cases that would actually benefit which would be identified correctly by the test. These cases would be referred to as "true positives." Any cases that would benefit from the treatment but that could not be identified by the test would be "false negatives" in this example. Conversely, a good test should have high specificity, which would be avoiding "false positives," or incorrectly identifying as good candidates for therapy persons who would not actually benefit. The "true negative" group would be those persons who would not benefit from treatment, and a good test should correctly identify a large proportion of them.
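In code, these two quantities are simple ratios over the four cells of the decision table just described. A minimal sketch, with invented counts for the therapy-selection example:

```python
def sensitivity(true_pos, false_neg):
    # proportion of cases that would benefit which the test correctly flags
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    # proportion of cases that would not benefit which the test correctly excludes
    return true_neg / (true_neg + false_pos)

# hypothetical screening test: 100 cases would benefit from therapy, 100 would not
tp, fn = 80, 20   # 80 of the 100 amenable cases are flagged by the test
tn, fp = 70, 30   # 70 of the 100 non-amenable cases are correctly excluded

print(sensitivity(tp, fn))  # 0.8
print(specificity(tn, fp))  # 0.7
```

Note that neither quantity can be computed without a trustworthy criterion for who "really" belongs in each row, which is exactly the gold-standard problem discussed below.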

As Kraemer (1992) points out, sensitivity and specificity as test requirements are nearly always in opposition to each other, and are reciprocal: maximizing one requirement reduces the other. Perfect sensitivity can be attained by, in our example, a test that identifies every case as suitable for therapy; no amenable cases are missed. Unfortunately, that maneuver would also maximize the number of false positives; that is, many cases would be identified as suitable for therapy who, in fact, were not. Obviously, the specificity of the test could be maximized by declaring all cases unsuitable for therapy, thus ensuring that the number of false positives would be zero, while at the same time ensuring that the number of false negatives would be maximal, and no one would be treated.
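The reciprocity Kraemer describes can be made concrete by sweeping a decision threshold over hypothetical test scores; the two extreme thresholds reproduce the two degenerate strategies just described. All scores here are invented for illustration:

```python
scores_benefit = [0.2, 0.4, 0.5, 0.7, 0.9]      # scores of cases that would benefit
scores_no_benefit = [0.1, 0.3, 0.35, 0.6, 0.8]  # scores of cases that would not

def rates(threshold):
    """Sensitivity and specificity when cases scoring at or above threshold are treated."""
    sens = sum(s >= threshold for s in scores_benefit) / len(scores_benefit)
    spec = sum(s < threshold for s in scores_no_benefit) / len(scores_no_benefit)
    return sens, spec

print(rates(0.0))   # treat everyone: (1.0, 0.0) -- perfect sensitivity, no specificity
print(rates(1.01))  # treat no one:   (0.0, 1.0) -- perfect specificity, no sensitivity
print(rates(0.45))  # a compromise:   (0.6, 0.6)
```

Plotting sensitivity against (1 - specificity) across all thresholds is precisely what produces the ROC curve discussed next.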

We go into these issues in some detail in order to make clear how very different such thinking is from usual practices in clinical psychological assessment. The requirements for Receiver Operating Characteristic (ROC) curves, which is the way issues of sensitivity and specificity of measures are often labeled and portrayed, are stringent. They are not satisfied by simple demonstrations that measures, for example, of suitability for treatment, are "significantly related to" other measures of interest, for example, response to treatment. The development of ROC statistics almost always occurs in the context of the use of tests for decision-making: treat or not treat, hire or not hire, do further tests or no further tests. Those kinds of uses of tests in clinical psychological assessment appear to be rare.

Issues of sensitivity-specificity require the existence of some reasonably well-defined criterion, for example, a definition of what is meant by favorable response to treatment and a way of measuring it. In biomedical research, ROC statistics are often developed in the context of a "gold standard," a definitive criterion. For example, an X ray might serve as a gold standard for a clinical judgment about the existence of a fracture, or a pathologist's report on a cytological analysis might serve as a gold standard for a screening test designed to detect cancer. Clinical psychology has never had anything like a gold standard against which its various tests might have been validated. Psychiatric diagnosis has sometimes been of interest as a criterion, and tests of different types have been examined to determine the extent to which they produce a conclusion in agreement with diagnosis (e.g., Somoza, Steer, Beck, & Clark, 1994), but in that case the gold standard is suspect, and it is by no means clear that disagreement means that the test is wrong.

The result is that for virtually no psychological instrument is it possible to produce a useful quantitative estimate of its accuracy. Tests and other assessment devices in clinical psychology have been used for the most part to produce general enlightenment about a target of interest rather than to make a specific prediction of some outcome. People who have been tested are described as "high in anxiety," "clinically depressed," or "of average intelligence." Statements of that sort, which we have referred to previously as unbounded predictions, are possibly enlightening about the nature of a person's functioning or about the general range within which problems fall, but they are not specific predictions, and are difficult to refute.

4.01.5.3 Seizing on Construct Validity

In 1955, Cronbach and Meehl published what is arguably the most influential article in the field of measurement: Construct validity in psychological tests (Cronbach & Meehl, 1955). This is the same year as the publication of Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores (Meehl & Rosen, 1955). It is safe to say that no two more important articles about measurement were ever published in the same year. The propositions set forth by Cronbach and Meehl about the validity of tests were provocative and rich with implications and opportunities. In particular, the idea of construct validity required that measures be incorporated into an elaborated theoretical structure, which was labeled the "nomological net." Unfortunately, the fairly daunting requirements for embedding measures in theory were mostly ignored in clinical assessment (the same could probably be said about most other areas of psychology, but it is not our place here to say so), and the idea of construct validity was trivialized.

The trivialization of construct validity reflects in part the fact that no standards for construct validity exist (and probably none can be written) and the general failure to distinguish between necessary and sufficient conditions for the inference of construct validity. In their presentation of construct validity, Cronbach and Meehl did not specify any particular criteria for sufficiency of evidence, and it would be difficult to do so. Construct validity exists when everything fits together, but trying to specify the number and nature of the specific pieces of evidence would be difficult and, perhaps, antithetical to the idea itself. It is also not possible to quantify the level or degree of construct validity other than in a very rough way, and such quantifications are, in our experience, rare. It is difficult to think of an instance of a measure described as having "moderate" or "low" construct validity, although "high" construct validity is often implied.

It is possible to imagine what some of the necessary conditions for construct validity might be, one notable requirement being convergent validity (Campbell & Fiske, 1959). In some manner that we have not tried to trace, conditions necessary for construct validity came to be viewed as sufficient. Thus, for example, construct validity usually requires that one measure of a construct correlate with another. Such a correlation is not, however, a sufficient condition for construct validity, but, nonetheless, a simple zero-order correlation between two tests is often cited as "evidence" for the construct validity of one measure or the other. Even worse, under the pernicious influence of the significance-testing paradigm, any statistically significant correlation may be taken as evidence of "good construct validity." Or, for another example, construct validity usually requires a particular factor structure for a measure, but the verification of the required factor structure is not sufficient evidence for the construct validity of the measure involved. The fact that a construct is conceived as unidimensional does not mean that a measure alleged to represent the construct does so simply because it appears to form a single factor.

The net result of the dependence on significance testing and the poor implementation of the ideas represented by construct validity has been that the standards of evidence for the validity of psychological measures have been distressingly low.

4.01.5.4 Adoption of the Projective Hypothesis

The projective hypothesis (Frank, 1939) is a general proposition stating that whatever an individual does when exposed to an ambiguous stimulus will reveal important aspects of his or her personality. Further, the projective hypothesis suggests that indirect responses, that is, those to ambiguous stimuli, are more valid than direct responses, that is, those to interviews or questionnaires. There is little doubt that indirect responses reveal something about people, although whether that which is revealed is, in fact, important is more doubtful. Moreover, what one eats, wears, listens to, reads, and so on are rightly considered to reveal something about that individual. While the general proposition about responses to ambiguous stimuli appears quite reasonable, the use of such stimuli in the form of projective tests has proven problematic and of limited utility.

The course of development of clinical assessment might have been different and more useful had it been realized that projection was the wrong term for the link between ambiguous stimuli and personality. A better term would have been the "expressive hypothesis," the notion that an individual's personality may be manifest (expressed) in response to a wide range of stimuli, including ambiguous stimuli. Personality style might have come to be of greater concern, and unconscious determinants of behavior, implied by projection, might have received less emphasis.

In any case, when clinical psychology adopted the projective hypothesis and bought wholesale into the idea of unconscious determinants of behavior, that set the field on a course that has been minimally productive but that still affects an extraordinarily wide range of clinical activities. Observable behaviors have been downplayed and objective measures treated with disdain or dismissed altogether. The idea of peering into the unconscious appealed both to psychological voyeurs and to those bent on achieving the glamour attributed to the psychoanalyst.

Research on projective stimuli indicates that highly structured stimuli which limit the dispositions tapped increase the reliability of such tests (e.g., Kagan, 1959). In achieving acceptable reliability, the nature of the test is altered in such a way that the stimulus is less ambiguous and the likelihood of an individual "projecting" some aspect of their personality in an unusual way becomes reduced. Thus, the dependability of responses to projective techniques probably depends to an important degree on sacrificing their projective nature. In part, projective tests seem to have failed to add to assessment information because most of the variance in responses to projective stimuli is accounted for by the stimuli themselves. For example, "popular" responses on the Rorschach are popular because the stimulus is the strongest determinant of the response (Murstein, 1963).

Thorndike (Thorndike & Hagen, 1955, p. 418), in describing the state of affairs with projective tests some 40 years ago, stated:

A great many of the procedures have received very little by way of rigorous and critical test and are supported only by the faith and enthusiasm of their backers. In those few cases, most notably that of the Rorschach, where a good deal of critical work has been done, results are varied and there is much inconsistency in the research picture. Modest reliability is usually found, but consistent evidence of validity is harder to come by.

The picture has not changed substantially in the ensuing 40 years, and we doubt that it is likely to change much in the next 40. As Adcock (1965, cited in Anastasi, 1988) noted, "There are still enthusiastic clinicians and doubting statisticians." As noted previously (Sechrest, 1963, 1968), these expensive and time-consuming projective procedures add little if anything to the information gained by other methods, and their abandonment by clinical psychology would not be a great loss. Despite lack of incremental validity after decades of research, not only do tests such as the Rorschach and TAT continue to be used, but new projective tests continue to be developed. That could be considered a pseudoscientific enterprise that, at best, yields procedures telling clinical psychologists what they at least should already know or have obtained in some other manner, and that, at worst, wastes time and money and further damages the credibility of clinical psychology.

4.01.5.5 The Invention of the Objective Test

At one time we had rather supposed, without thinking about it too much, that objective tests had always been around in some form or other. Samelson (1987), however, has shown that at least the multiple-choice test was invented in the early part of the twentieth century, and it seems likely that the true-false test had been devised not too long before then. The objective test revolutionized education in ways that Samelson makes clear, and it was not long before that form of testing infiltrated into psychology. Bernreuter (1933) is given credit for devising the first multiphasic (multidimensional) personality inventory, only 10 years after the introduction of the Rorschach into psychology.

Since 1933, objective tests have flourished. In fact, they are now much more widely used than projective tests and are addressed toward almost every imaginable problem and aspect of human behavior. The Minnesota Multiphasic Personality Inventory (1945) was the truly landmark event in the course of development of paper-and-pencil instruments for assessing clinical aspects of psychological functioning. "Paper-and-pencil" is often used synonymously with "objective" in relation to personality. From that time on, other measures flourished, of late in great profusion.

Paper-and-pencil tests freed clinicians from the drudgery of test administration, and in that way they also made testing relatively inexpensive as a clinical enterprise. They also made tests readily available to psychologists not specifically trained on them, including psychologists at subdoctoral levels. Paper-and-pencil measures also seemed so easy to administer, score, and interpret. As we have noted previously, the ease of creation of new measures had very substantial effects on the field, including clinical assessment.

4.01.5.6 Disinterest in Basic Psychological Processes

Somewhere along the way in its development, clinical assessment became detached from the mainstream of psychology and, therefore, from the many developments in basic psychological theory and knowledge. The Rorschach was conceived not as a test of personality per se but in part as an instrument for studying perception, and Rorschach referred to it as his "experiment" (Hunt, 1956). Unfortunately, the connections of the Rorschach to perception and related mental processes were lost, and clinical psychology became preoccupied not with explaining how Rorschach responses come to be made but with explaining how Rorschach responses reflect back on a narrow range of potential determinants: the personality characteristics of respondents, and primarily their pathological characteristics at that.

It is testimony to the stasis of clinical assessment that three-quarters of a century after the introduction of the Rorschach, a period of time marked by stunning (relatively) advances in understanding of such basic psychological processes as perception, cognition, learning, and motivation and by equivalent or even greater advances in understanding of the biological structures and processes that underlie human behavior, the Rorschach continues, virtually unchanged, to be the favorite instrument for clinical assessment. The Exner System, although a revision of the scoring system, in no way reflects any basic changes in our advancement of understanding of the psychological knowledge base in which the Rorschach is, or should be, embedded. Take, just for one instance, the great increase of interest in and understanding of "priming" effects in cognition; those effects would clearly be relevant to the understanding of Rorschach responses, but there is no indication at all of any awareness on the part of those who write about the Rorschach that any such effect even exists. It was known a good many years ago that Rorschach responses could be affected by the context of their administration (Sechrest, 1968), but without any notable effect on their use in assessment.

Nor do any other psychological instruments show any particular evidence of any relationship to the rest of the field of psychology. Clinical assessment could have benefited greatly from a close and sensitive connection to basic research in psychology. Such a connection might have fostered interest in clinical assessment in the development of instruments for the assessment of basic psychological processes.

Clinical psychology has (is afflicted with, we might say) an extraordinary number of different tests, instruments, procedures, and so on. It is instructive to consider the nature of all these tests; they are quite diverse. (We use the term "test" in a somewhat generic way to refer to the wide range of mechanisms by which psychologists carry out assessments.) Whether the great diversity is a curse or a blessing depends on one's point of view. We think that a useful perspective is provided by contrasting psychological measures with those typically used in medicine, although, obviously, a great many differences exist between the two enterprises. Succinctly, however, we can say that most medical tests are very narrow in their intent, and they are devised to tap basic states or processes. A screening test for tuberculosis, for example, involves subcutaneous injection of tuberculin which, in an infected person, causes an inflammation at the point of injection. The occurrence of the inflammation then leads to further narrowly focused tests. The inflammation is not tuberculosis but a sign of its potential existence. A creatinine clearance test is a test of renal function based on the rate of clearance of ingested creatinine from the blood. A creatinine clearance test can indicate abnormal renal functioning, but it is a measure of a fundamental physiological process, not a state, a problem, a disease, or anything of that sort. A physician who is faced with the task of diagnosing some disease process involving renal malfunction will use a variety of tests, not necessarily specified by a protocol (battery), to build an information base that will ultimately lead to a diagnosis.

By contrast, psychological assessment is, by and large, not based on measurement of basic psychological processes, with few exceptions. Memory is one function that is of interest to neuropsychologists, and occasionally to others, and instruments to measure memory functions do exist. Memory can be measured independently of any other functions and without regard to any specific causes of deficiencies. Reaction time is another basic psychological process. It is currently used by cognitive psychologists as a proxy for mental processing time, and since the 1970s, interest in reaction time as a marker for intelligence has grown and become an active research area.

For the most part, however, clinical assessment has not been based on tests of basic psychological functions, although the Wechsler intelligence scales might be regarded as an exception to that assertion. A very large number of psychological instruments and procedures are aimed at assessing syndromes or diagnostic conditions, whole complexes of problems. Scales for assessing attention deficit disorder (ADD), suicide probability, or premenstrual syndrome (PMS) are instances. Those instruments are the equivalent of a medical "Test for Diabetes," which does not exist. The Conners' Rating Scales (teachers) for ADD, for example, have subscales for Conduct Problem, Hyperactivity, Emotional Overindulgent, Asocial, Anxious-Passive, and Daydream-Attendance. Several of the very same problems might well be represented on other instruments for entirely different disorders. But if they were, they would involve a different set of items, perhaps with a slightly different twist, to be integrated in a different way. Psychology has no standard ways of assessing even such fundamental dispositions as "asocial."

One advantage of the medical way of doing things is that tests like creatinine clearance have been used on millions of persons, are highly standardized, have extremely well-established norms, and so on. Another set of ADD scales, the Brown, assesses "ability to activate and organize work tasks." That sounds like an important characteristic of children, so important that one might think it would be widely used and useful. Probably, however, it appears only on the Brown ADD Scales, and it is probably little understood otherwise.

Clinical assessment has also not had the benefit of careful study from the standpoint of basic psychological processes that affect the clinician and his or her use and interpretation of psychological tests. Achenbach (1985), to cite a useful perspective, discusses clinical assessment in relation to the common sources of error in human judgment. Achenbach refers to such problems as illusory correlation, inability to assess covariation, and the representativeness and availability heuristics and confirmatory bias described by Kahneman, Slovic, and Tversky (1982). Consideration of these sources of human, that is, general, error in judgment would be more likely if clinical assessment were more attuned to and integrated into the mainstream developments of psychology.

We do not suppose that clinical assessment should be limited to basic psychological processes; there may well be a need for syndrome-oriented or condition-oriented instruments. Without any doubt, however, clinical assessment would be on a much firmer footing if from the beginning psychologists had tried to define and measure well a set of fundamental psychological processes that could be tapped by clinicians faced with diagnostic or planning problems.

Unfortunately, measurement has never been taken seriously in psychology, and it is still lightly regarded. One powerful indicator of the casual way in which measurement problems are met in clinical assessment is the emphasis placed on brevity of measures. ". . . entire exam can be completed . . . in just 20 to 30 minutes" (for head injury), "completed in just 15-20 minutes" (childhood depression), "39 items" (to measure six factors involved in ADD) are just a few of the notations concerning tests that are brought to the attention of clinician-assessors by advertisers. It would be astonishing to think of a medical test advertised as "diagnoses brain tumors in only 15 minutes," or "complete diabetes workup in only 30 minutes." An MRI examination for a patient may take up to several hours from start to finish, and no one suggests a "short form" of one. Is it imaginable that one could get more than the crudest notion of childhood depression in 15-20 minutes?

4.01.6 MISSED SIGNALS

At various times in the development of clinical psychology, opportunities existed to guide, or even redirect, assessment activities in one way or another. Clinical psychology might very well have taken quite a different direction than it has (Sechrest, 1992). Unfortunately, in our view, a substantial number of critical "signals" to the field were missed, and entailed in missing them was failure to redirect the field in what would have been highly constructive ways.

4.01.6.1 The Scientist-Practitioner Model

We do not have the space to go into the intricacies of the scientist-practitioner model of training and practice, but it appears to be an idea whose time has come and gone. Suffice it to say here that full adoption of the model would not have required every clinical practitioner to be a researcher, but it would have fostered the idea that to some extent every practitioner is responsible for the scientific integrity of his or her own practice, including the validity of assessment procedures. The scientist-practitioner model might have helped clinical psychologists to be involved in research, even if only as contributors rather than as independent investigators.

That involvement could have been of vital importance to the field. The development of psychological procedures will never be supported commercially to any appreciable extent, and if they are to be adequately developed, it will have to be with the voluntary, and enthusiastic, participation of large numbers of practitioners who will have to contribute data, be involved in the identification of problems, and so on. That participation would have been far more likely had clinical psychology stuck to its original views of itself (Sechrest, 1992).

4.01.6.2 Construct Validity

We have already discussed construct validity at some length, and we have explained our view



that the idea has been trivialized, in essence abandoned. That is another lost opportunity, because the power of the original formulation by Cronbach and Meehl (1955) was great. Had their work been better understood and honestly adopted, clinical psychology would by this time almost certainly have had a set of well-understood and dependable measures and procedures. The number and variety of such measures would have been far smaller than now exists, and their dependability would have been circumscribed, but surely it would have been better to have good measures than simply many.

4.01.6.3 Assumptions Underlying Assessment Procedures

In 1952, Lindzey published a systematic analysis of assumptions underlying the use of projective techniques (Lindzey, 1952). His paper was a remarkable achievement, or would have been had anyone paid any attention to it. The Lindzey paper could have served as a model and stimulus for further formulations leading to a theory, comprehensive and integrated, of performance on clinical instruments. A brief listing of several of the assumptions must suffice to illustrate what he was up to:

IV. The particular response alternatives emitted are determined not only by characteristic response tendencies (enduring dispositions) but also by intervening defenses and his cognitive style.

XI. The subject's characteristic response tendencies are sometimes reflected indirectly or symbolically in the response alternatives selected or created in the test situation.

XIII. Those responses that are elicited or produced under a variety of different stimulus conditions are particularly likely to mirror important aspects of the subject.

XV. Responses that deviate from those typically made by other subjects to this situation are more likely to reveal important characteristics of the subject than modal responses which are more like those made by most other subjects.

These and other assumptions listed by Lindzey could have provided a template for systematic development of both theory and programs of research aimed at supporting the empirical base for projective, and other, testing. Assumption XI, for example, would lead rather naturally to the development of explicit theory, buttressed by empirical data, which would indicate just when responses probably should and should not be interpreted as symbolic.

Unfortunately, Lindzey's paper appears to have been only infrequently cited and to have been substantially ignored by those who were engaged in turning out all those projective tests, inventories, scales, and so on. At this point we know virtually nothing more about the performance of persons on clinical instruments than was known by Lindzey in 1952. Perhaps even less.

4.01.6.4 Antecedent Probabilities

In 1955 Meehl and Rosen published an exceptional article on antecedent probabilities and the problem of base rates. The article was, perhaps, a bit mathematical for clinical psychology, but it was not really difficult to understand, and its implications were clear. Whenever one is trying to predict (or diagnose) a characteristic that is quite unevenly distributed in a population, the difficulty in beating the accuracy of the simple base rates is formidable, sometimes awesomely so. For example, even in a population considered at high risk for suicide, only a very few persons will actually commit suicide. Therefore, unless a predictive measure is extremely precise, the attempt to identify those persons who will commit suicide will identify as suicidal a relatively large number of "false positives"; that is, if one wishes to be sure not to miss any truly suicidal people, one will include in the "predicted suicide" group a substantial number of people not so destined. That problem is a serious, even severe, limitation when the cost of missing a true positive is high but the cost of having to deal with a false positive is, relatively speaking, also substantial.
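The arithmetic behind Meehl and Rosen's point can be made concrete with a short sketch. The base rate, sensitivity, and specificity figures below are hypothetical, chosen only to illustrate how a rare characteristic defeats even an accurate test:

```python
def positive_predictive_value(base_rate, sensitivity, specificity):
    """Bayes' theorem: the probability that a person flagged by a test
    actually has the characteristic being predicted."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# A hypothetical suicide-risk screen that is 90% sensitive and 90%
# specific, applied to a high-risk population in which 2% of persons
# are truly at risk:
ppv = positive_predictive_value(0.02, 0.90, 0.90)  # about 0.155

# Even this unrealistically accurate screen is wrong for roughly five
# of every six people it flags, while the do-nothing base-rate rule
# ("predict that no one is suicidal") is 98% accurate overall.
```

The example shows why only an extremely precise measure can beat the base rates for a rare characteristic: the mass of false positives from the large unaffected group swamps the few true positives.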

More attention to the difficulties described by Meehl and Rosen (1955) would have moved psychological assessment in the direction taken by medicine, that is, toward the use of receiver operating characteristic (ROC) curves. Although ROCs do not make the problem go away, they keep it in the forefront of attention and require that those involved, whether researchers or clinicians, deal with it. That signal was missed in clinical psychology, and it is scarcely mentioned in the field today. Many indications exist that a large proportion of clinical psychologists are quite unaware that the problem even exists, let alone have an understanding of it.
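An ROC simply traces, across every possible cutting score, the trade-off Meehl and Rosen described. A minimal sketch, using invented scale scores rather than any real instrument:

```python
def roc_points(positive_scores, negative_scores, thresholds):
    """(false-positive rate, true-positive rate) at each cutting score;
    scores at or above a threshold count as 'predicted positive'."""
    points = []
    for t in sorted(thresholds, reverse=True):
        tpr = sum(s >= t for s in positive_scores) / len(positive_scores)
        fpr = sum(s >= t for s in negative_scores) / len(negative_scores)
        points.append((fpr, tpr))
    return points

# Hypothetical scale scores for truly at-risk and not-at-risk groups:
at_risk     = [0.9, 0.8, 0.7, 0.4]
not_at_risk = [0.6, 0.5, 0.3, 0.2]
curve = roc_points(at_risk, not_at_risk, [0.65, 0.35])
# Lowering the cut raises the hit rate only at the cost of more false
# alarms; no choice of cut makes the base-rate problem vanish.
```

Plotting such points keeps the hit-rate versus false-alarm trade-off visibly in front of whoever must choose a cutting score, which is exactly the discipline the authors argue clinical assessment never adopted.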

4.01.6.5 Need for Integration of Information

Many trends over the years converge on the conclusion that psychology will make substantial progress only to the extent that it is able to integrate its theories and knowledge base with those developing in other fields. We can address this issue only on the basis of personal experience; we can find no evidence for our



view. Our belief is that clinical assessment in psychology rarely results in a report in which information related to a subject's genetic disposition, family structure, social environment, and so on is integrated in a systematic and effective way.

For example, we have seen many reports on patients evaluated for alcoholism without any attention, let alone systematic attention, to a potential genetic basis for their difficulty. At most a report might include a note to the effect that the patient has one or more relatives with similar problems. Never was any attempt made to construct a genealogy that would include other conditions likely to exist in the families of alcoholics. The same may be said for depressed patients. It might be objected that the responsibilities of the psychologist do not extend into such realms as genetics and family and social structure, but surely that is not true if the psychologist aspires to be more than a mere technician, for example, one serving the same function as a laboratory technician who provides a number for the creatinine clearance rate and leaves it to someone else, "the doctor," to put it all together.

That integration of psychological and other information is of great importance has been implicitly known for a very long time. That knowledge has simply never penetrated training programs and clinical practice. That missed opportunity is to the detriment of the field.

4.01.6.6 Method Variance

The explicit formulation of the concept of method variance was an important development in the history of assessment, but one whose import was missed or largely ignored. The concept is quite simple: to some extent, the value obtained for the measurement of any variable depends in part on the characteristics of the method used to obtain the estimate. (A key idea is the understanding that any specific value is, in fact, an estimate.) The first explicit formulation of the idea of method variance was the seminal Campbell and Fiske paper on the "multitrait-multimethod matrix" (Campbell & Fiske, 1959). (That paper also introduced the very important concepts of "convergent" and "discriminant" validity, now widely employed but, unfortunately, not always very well understood.) There had been precursors of the idea of method variance. In fact, much of the interest in projective techniques stemmed from the idea that they would reveal aspects of personality that would not be discernible from, for example, self-report measures. The MMPI, first published in 1943 (Hathaway & McKinley), included "validity" scales that were meant to detect, and, in the case of the K-scale, even correct for, method effects such as lying, random responding, faking, and so on. By 1960 or so, Jackson and Messick had begun to publish their work on response styles in objective tests, including the MMPI (e.g., Jackson & Messick, 1962). At about the same time, Berg (1961) was describing the "deviant response tendency," the hypothesis that systematic variance in test scores could be attributed to general tendencies on the part of some respondents to respond in deviant ways. Nonetheless, it was the Campbell and Fiske (1959) paper that brought the idea of method variance to the attention of the field.

Unfortunately, the cautions expressed by Campbell and Fiske, as well as by others working on response styles and other method effects, appear to have had little effect on developments in clinical assessment. For the most part, the problems raised by method effects and response styles appear to have been pretty much ignored in the literature on clinical assessment. A search of a current electronic database in psychology turned up, for example, only one article over the past 30 years or so linking the Rorschach to any discussion of method effects (Meyer, 1996). When one considers the hundreds of articles having to do with the Rorschach that were published during that period of time, the conclusion that method effects have not come to the attention of the clinical assessment community is unavoidable. The consequence almost surely is that clinical assessments are not being corrected, at least not in any systematic way, for method effects and response biases.
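The logic of the multitrait-multimethod matrix can be sketched in a few lines. The correlations below are invented, and the comparison shown is only the simplest of Campbell and Fiske's criteria, not a full analysis of such a matrix:

```python
def campbell_fiske_criterion(convergent, heterotrait):
    """Simplest Campbell-Fiske check: every same-trait, cross-method
    (convergent) correlation should exceed every different-trait
    correlation, whether from the same method or different methods."""
    return min(convergent) > max(heterotrait)

# Two traits (say, anxiety and depression) measured by two methods
# (self-report and clinician rating); all values are hypothetical:
convergent_validities = [0.55, 0.60]  # same trait, different methods
heterotrait_rs = [0.30, 0.25, 0.45]   # different traits; the 0.45
                                      # same-method entry illustrates
                                      # shared method variance
campbell_fiske_criterion(convergent_validities, heterotrait_rs)  # True
```

When the check fails, the measure's scores owe more to how they were collected than to the trait they are supposed to reflect, which is precisely the uncorrected contamination the authors describe.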

4.01.6.7 Multiple Measures

At least a partial response to the problem of method effects in assessment is the use of multiple measures, particularly measures that do not appear to share sources of probable error or bias. That recommendation was explicit in Campbell and Fiske (1959), and it was echoed and elaborated upon in 1966 (Webb et al., 1966), and again in 1981 (Webb et al., 1981). Moreover, Webb and his colleagues warned specifically against the very heavy reliance on self-report measures in psychology (and other social sciences). That warning, too, appears to have made very little difference in practice. Examination of catalogs of instruments meant to be used in clinical assessment will show that a very large proportion of them depend upon self-reports of individual subjects about their own dispositions, and measures that do not rely



directly on self-reports nonetheless nearly all rely solely on the verbal responses of subjects. Aside from rating scales to be used with parents, teachers, or other observers of behavior, measures of characteristics of interest such as personality and psychopathology almost never require anything of a subject other than a verbal report. By contrast, ability tests almost always require subjects to do something: solve a problem, complete a task, or whatever. Wallace (1966) suggested that it might be useful to think of traits as abilities, and following that lead might very well have expanded the views of those interested in furthering clinical assessment.

4.01.7 THE ORIGINS OF CLINICAL ASSESSMENT

The earliest interest in clinical assessment was probably that used for the classification of the "insane" and mentally retarded in the early 1800s. Because there was growing interest in understanding and implementing the humane treatment of these individuals, it was first necessary to distinguish between the two types of problems. Esquirol (1838), a French physician, published a two-volume document outlining a continuum of retardation based primarily upon language (Anastasi, 1988).

Assessment in one form or another has been part of clinical psychology from its beginnings. The establishment of Wundt's psychological laboratory at Leipzig in 1879 is considered by many to represent the birth of psychology. Wundt and the early experimental psychologists were interested in uniformity rather than assessment of the individual. In the Leipzig lab, experiments investigated psychological processes affected by perception, in which Wundt considered individual differences to be error. Accordingly, he believed that since sensitivity to stimuli differs from person to person, using a standard stimulus would compensate for and thus eliminate individual differences (Wundt, Creighton, & Titchener, 1894/1896).

4.01.7.1 The Tradition of Assessment in Psychology

Sir Francis Galton's efforts in intelligence and heritability pioneered both the formal testing movement and the field testing of ideas. Through his Anthropometric Laboratory at the International Exposition in 1884, and later at the South Kensington Museum in London, Galton gathered a large database on individual differences in vision, hearing, reaction time, other sensorimotor functions, and physical characteristics. It is interesting to note that Galton's proposition that sensory discrimination is indicative of intelligence continues to be promoted and investigated (e.g., Jensen, 1992). Galton also used questionnaire, rating scale, and free association techniques to gather data.

James McKeen Cattell, the first American student of Wundt, is credited with initiating the individual differences movement. Cattell, an important figure in American psychology (fourth president of the American Psychological Association and the first psychologist elected to the National Academy of Sciences), became interested in whether individual differences in reaction time might shed light on consciousness and, despite Wundt's opposition, completed his dissertation on the topic. He wondered whether, for example, some individuals might be observed to have fast reaction times across situations, and he supposed that such differences may have been lost in the averaging techniques used by Wundt and other experimental psychologists (Wiggins, 1973). Cattell later became interested in the work of Galton and extended it by applying reaction time and other physiological processes as measures of intelligence. Cattell is credited with the first published reference to a mental test in the psychological literature (Cattell, 1890).

Cattell remained influenced by Wundt in his emphasis on psychophysical processes. Although physiological functions could be easily and accurately measured, attempts to relate them to other criteria, such as teacher ratings of intelligence and grades, yielded poor results (Anastasi, 1988).

Alfred Binet conducted extensive and varied research on the measurement of intelligence. His many approaches included measurements of cranial, facial, and hand form, handwriting analysis, and inkblot tests. Binet is best known for his work in the development of intelligence scales for children. The earliest form of the scale, the Binet-Simon, was developed following Binet's appointment to a governmental commission to study the education of retarded children (Binet & Simon, 1905). The scale assessed a range of abilities with emphasis on comprehension, reasoning, and judgment. Sensorimotor and perceptual abilities were relatively less prominent, as Binet considered the broader processes, for example, comprehension, to be central to intelligence. The Binet-Simon scale consisted of 30 problems arranged in order of difficulty. These problems were normed on 50 normal children aged 3-11 years and a few retarded children and adults.

A second iteration, the 1908 scale, was somewhat longer and was normed on approximately 300 normal children aged 3-13 years. Performance was grouped
