Note to Duke colleagues: This manuscript is in many ways more preliminary than the typical workshop paper. Not only does it need refinement of analysis, modification of argumentative structure, addition of references, and polishing of prose, but it could also benefit greatly from many more examples (and counterexamples). We think there is a genuine idea here, but much more needs to be done, and we are grateful for your assistance. Fred Schauer & Bobbie Spellman

10/26/2015

CALIBRATING LEGAL JUDGMENTS

Frederick Schauer1 & Barbara A. Spellman2

Legal decision-makers must frequently assess the judgments of other legal

decision-makers. The Supreme Court is required to evaluate the judgments of

federal courts of appeals and state supreme courts, and those courts must evaluate

the legal and sometimes the factual rulings of trial courts. Courts are also routinely

called upon to determine the legal sufficiency of decisions by Presidents, governors,

members of Congress, state legislators, heads of executive departments, police

officers, school principals, and countless administrative officials. So too must

1 David and Mary Harrison Distinguished Professor of Law, University of Virginia.

2 Professor of Law, University of Virginia. Earlier versions of this Article have been presented at the University of Arizona College of Law, the University of Pennsylvania School of Law, and the 2015 Convention of the International Association of Legal and Social Philosophy. We are grateful for audience comments and questions on those occasions, and to Will Baude, Paul Mahoney, Stephen Morse, and Andrew Vollmer for valuable information and references.

judges and magistrates in issuing search warrants assess the reliability of the

representations made to them by law enforcement officers seeking the warrant.

And of course jurors as well as judges must evaluate the veracity and credibility of

the witnesses who testify in court.

When such situations arise in everyday life, the evaluator of someone else’s

decision or judgment often seeks to inform her evaluation by considering, explicitly

or implicitly, the other decisions or judgments made by the decision-maker whose

decision is now under review. When a friend recommends a restaurant or movie to

us, we would like to know something about that friend’s reactions to other

restaurants or movies, preferably in the context of restaurants or movies we have

both seen, so that we can assess whether the judgment on this occasion is worth

following. If another friend says that we will like someone we have yet to meet, we

want to have some idea of who the friend likes and does not like, in order that we

can decide what to make of the endorsement on this occasion.3 And when an

applicant for admission to graduate school reports having an undergraduate GPA of

3.6, the graduate school wants to know where the 3.6 stands in relation to other

students (hence the value of class rankings), just as an employer of a student wants

to know much the same thing, and just as the recipient of a recommendation wants

to know whether the recommender tends to like everyone, or no one, or something

in between.

3 Yes, we are thinking of blind dates, but the situation arises in other contexts as well.

In ordinary talk, this process of attempting to gauge another’s decisions is

often referred to as calibration, although, as we will explain, this ordinary sense of

calibration both diverges from an important technical sense and may also be a

covering term that encompasses a number of diverse processes, which it will be

important to distinguish. But for the moment we can describe “calibration” as the

process by which a user of some measuring device (or an assessor of some

measurer) attempts to establish the relationship between the indicated measure

and some relevant standard.4 Thus, we calibrate a (weight) scale by attempting to

establish the relationship between what the scale reads and what something

actually weighs. If the scale systematically reads five pounds low, and if (perhaps

counterfactually) we wish to know what we actually weigh, we calibrate the scale

(or calibrate ourselves to the scale) by adding five pounds to the scale’s reading or

by adjusting the scale to take account of the error. So too when we take a

measurement knowing that some ruler gives a measure slightly longer than actual

length,5 or when a rifleman aims slightly lower than where the gunsight tells him to

aim, knowing from previous experience that the gunsight leads him consistently to

miss high.

4 The Oxford English Dictionary, for example, tells us that to calibrate is to “make allowance for [the] irregularities” of a measuring instrument. 1 Compact Edition of the Oxford English Dictionary 318 (1971).

5 We understand, of course, that neither one foot nor one inch nor one pound has some sort of metaphysical reality. Feet, inches, and pounds, like meters and kilograms, are human-created standards. But although the standard meter, for example, is a creation of human beings, it is nevertheless the case that a ruler is accurate insofar as what it reads approaches the standard measure.

At times the desire for this kind of calibration is systematized. Colleges and

universities often provide information so that employers and graduate schools can

calibrate whether the 3.6 is extraordinarily good or barely above average.6 And,

increasingly, the user of on-line restaurant and hotel reviews at sites such as

TripAdvisor can examine the review history of particular reviewers in order to

determine whether a reviewer’s rave review should be discounted because this

particular reviewer says nice things about everyplace, or whether some reviewer’s

brutally negative assessment should similarly be discounted (or, to put it differently,

inflated) because that reviewer has something bad to say about every establishment

he patronizes.7

Yet although such calibration is ubiquitous in ordinary life, and is at first

glance seemingly desirable, the legal system appears implicitly to resist it. A court

evaluating an administrative law judge’s denial of Veterans or Medicare or Social

Security benefits would presumably wish to know whether that administrative law

judge is a frequent denier or instead whether this denial should be entitled to great

deference because that judge denies benefits so rarely. Much the same desire for a

6 For example, prior to the emphasis on absolute grade point averages prompted by the formula used by the US News and World Report law school ranking system, some law schools would take into account in their admissions algorithm the previous grade point averages and law school performances of applicants from particular undergraduate colleges and universities. This knowledge would enable the law school, implicitly in its algorithm, to calibrate (or adjust) the undergraduate grade point average of a particular applicant according to the prior performances of prior applicants from that same school.

7 We have in mind TripAdvisor and Yelp, although there are other restaurant and hotel rating sites that do much the same thing. Indeed, the TripAdvisor model, in which a reviewer’s review history is easily accessible to the user, is one to which we will refer repeatedly in this Article.

decisional history would seem to apply as well to the decision by that administrative

law judge in evaluating the grant or denial by the administrative official who made

the initial decision. Similarly, the judges of an appellate court reviewing a trial

judge’s legal or factual ruling against a labor union might want to know whether

they are assessing the decision of a judge who typically rules for (or against) unions,

so as better to calibrate their assessment of the trial judge’s ruling on this occasion.

In the same way, the Supreme Court’s review of a lower court’s decision to uphold a

death sentence might usefully be informed by knowing whether the court under

review is generally sympathetic or hostile to capital sentences. And a jury

attempting to determine whether a witness who describes a person as drunk is

accurate or exaggerating would ideally benefit from information about the full range

of instances in which that witness has described someone as drunk, or not.

However sensible it might seem for judges, jurors, and appellate courts to

wish to calibrate in just this way, and however much the TripAdvisor model8 seems

to embody a larger rationality, the legal system typically, albeit implicitly, avoids

this form of calibration. Whatever the judges of a United States Court of Appeals

might actually be thinking in evaluating a decision by a sentencing District Judge to

depart upwards from the sentencing guidelines, or whatever those appellate judges

might actually know from informal research, personal contact, or hallway gossip, it

would be inappropriate for the appellate court to say explicitly in an opinion that it

was rejecting the upward departure because this judge is known to be an especially

8 See note 7 and accompanying text, supra.

tough sentencer, or has a history of upward departures.9 Indeed, it would even be

considered inappropriate for an appellate court openly to seek such information,

however much the appellate judges might be aware of the information from

previous cases or word of mouth. And cross-examination of a witness about other

similar assessments she has made – of drunkenness, say – would likely be excluded

by rules limiting cross-examination to matters addressed on direct examination.10

Our goal in this Article is to examine the ways in which just this kind of

calibration might be useful in various legal settings, to trace why the legal system

seems implicitly to be hostile to what so many other aspects of life and

decision-making have embraced, and to suggest ways in which calibration might valuably

become more accepted than it is now in various legal contexts. Our aim is less,

9 On departures from the Sentencing Guidelines generally, see Michael S. Gelacak, Ilene H. Nagel, & Barry L. Johnson, Departures Under the Federal Sentencing Guidelines: An Empirical and Jurisprudential Analysis, 81 Minn. L. Rev. 299 (1996). Upward departures are now constrained by, inter alia, United States v. Booker, 543 U.S. 220 (2005), but for present purposes we need not get into the details about departures or review of them, other than to note that the standard of review is a deferential one. See Gall v. United States, 552 U.S. 38, 45-52 (2007) (mandating an abuse-of-discretion standard in reviewing departures from the sentencing guidelines).

10 Fed. R. Evid. 611(b). This is a topic about which one of us has previously done experimental research. See Barbara A. Spellman & Elizabeth R. Tenney, Credible Testimony In and Out of Court, 17 Psychonomic Bull. & Rev. 168 (2010); Barbara A. Spellman, Elizabeth R. Tenney, & Margaret J. Scalia, Relying on Other People’s Metamemory, in Successful Remembering and Successful Forgetting: A Festschrift in Honor of Robert A. Bjork 387 (Aaron S. Benjamin ed., 2011); Elizabeth R. Tenney, Barbara A. Spellman, & Robert J. MacCoun, The Benefits of Knowing What You Know (and What You Don’t): How Calibration Affects Credibility, 44 J. Experiment. Soc. Psych. 1368 (2008); Elizabeth R. Tenney, Barbara A. Spellman, Robert J. MacCoun, & Reid Hastie, Calibration Trumps Confidence as a Basis for Witness Credibility, 18 Psych. Sci. 46 (2007); Elizabeth R. Tenney, Barbara A. Spellman, & Robert J. MacCoun, Expanding the Scope of Cross Examination So that Jurors Can Infer Witness Calibration, available at papers.ssrn.com/sol3/papers.cfm?abstract_id=998593.htm.

however, to make recommendations for reform than it is to open up for analysis and

discussion a topic that seems to have been unfortunately ignored for too long in the

design of legal institutions.

I. Three Concepts of Calibration

As mentioned above, we use “calibration” as a covering term to describe three

quite different processes, or three different contexts in which an assessor might be

faced with assessing someone else’s assessment. For purposes of clarity, and also to

sharpen the focus on the one of the three that is our principal concern here, we will

describe each of the three processes that might all be considered as calibration in

one sense or another.

Before turning to the three processes, we again start, broadly speaking, with

the idea that calibration is the process of setting or assessing the relationship

between some measuring device (or measurer) and some relevant standard, in order

to conform the measurement to that standard. As in the examples of scales, rulers, and

gunsights, we treat the measuring device as accurate insofar as what the measuring

device reports accords with the relevant standard, and we treat the measuring

device as well calibrated insofar as its “judgments” over a range of measurements

demonstrate a consistent and therefore predictable difference between what the

device indicates and what the underlying “truth” actually is. So a scale that reads

150 pounds when the actual weight is 145 is inaccurate by five pounds, but could be

considered well calibrated to the extent that the scale is always five pounds high.

Even though the scale is inaccurate, its inaccuracy would be reliable, and the

calibration is effective insofar as it uses reliability to compensate for inaccuracy.

And if the scale were inaccurate but not in any reliable way, attempts at calibration

would be less effective, making the scale, or our use of it, less well calibrated. Thus,

we calibrate our use of a reliably inaccurate scale by subtracting five pounds, and

insofar as this calibration produces consistent results our assessments of weight

become well calibrated.
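
The arithmetic of the scale example can be made concrete with a minimal sketch in Python. The figures and the function names (offset_and_spread, calibrate) are our own illustrative assumptions, not anything drawn from the literature: one hypothetical scale always reads five pounds high, while another errs by a similar amount on average but erratically. The sketch treats the average error as the correction to apply and the spread of the errors as a rough indicator of how far that correction can be trusted.

```python
from statistics import mean, pstdev


def offset_and_spread(readings, true_weights):
    """Return the mean error (reading minus truth) and the spread of the errors."""
    errors = [r - t for r, t in zip(readings, true_weights)]
    return mean(errors), pstdev(errors)


def calibrate(reading, offset):
    """Correct a single reading by subtracting the scale's known average error."""
    return reading - offset


# Hypothetical actual weights, in pounds (assumed for illustration).
true_weights = [120, 145, 160, 180, 200]

# A reliably inaccurate scale: always exactly five pounds high.
reliable = [w + 5 for w in true_weights]

# An unreliably inaccurate scale: errors of +7, -4, +8, -2, +11 (assumed).
erratic = [127, 141, 168, 178, 211]

reliable_offset, reliable_spread = offset_and_spread(reliable, true_weights)
erratic_offset, erratic_spread = offset_and_spread(erratic, true_weights)

# A reading of 150 on the reliable scale, corrected by its offset.
corrected = calibrate(150, reliable_offset)
```

On these assumed figures the reliable scale has a spread of zero, so subtracting its offset recovers the actual weight exactly. The erratic scale has a comparable average error but a large spread, so the same subtraction would leave substantial residual error in any single reading.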

When we move from the judgments of mechanical devices like scales, rulers, and

gunsights to the judgments of human beings, however, things become more

complex. Indeed, there are multiple ways of assessing the calibration of human

judgment, and it is important at the outset to get clear just what it is that we are

talking about. Specifically, disaggregating the multiple ideas encompassed by the

covering term “calibration” is especially important because much of the

psychological literature on calibration turns out to address an issue related to but

importantly different from the type of calibration that is our primary focus here.

And that is our principal justification for commencing the analysis by distinguishing

three concepts of calibration.

A. Confidence and Calibration

1. Calibration in Psychology Research.

The first concept of calibration, and the one that dominates the psychological

research, focuses on the relationship between the degree of confidence a

decision-maker has expressed in some judgment and the actual accuracy of that judgment.11

11 See, e.g., Linda Bol & Douglas J. Hacker, Calibration Research: Where Do We Go from Here?, 3 Frontiers Psych. 229 (2012) (“Calibration is the degree of fit between a person’s judgment of performance and his or her actual performance.”). See also Douglas J. Hacker, Linda Bol, & Matt C. Keener, Metacognition in Education: A Focus on Calibration, in Handbook of Metamemory and Memory 429 (John Dunlosky &

Let us label this confidence-accuracy calibration. Accuracy is of course itself a

relation between a judgment and reality (or “ground truth”), and thus when we talk

about confidence-accuracy calibration we are talking about two types of accuracy,

each the subject of considerable research by psychologists. One is absolute

accuracy, which typically is described as “calibration” in a narrow sense, and the

other is relative accuracy, also known as “resolution” or “discrimination.”

Decision makers may be more or less confident in their judgments, and those

judgments may be more or less accurate. Insofar as the decision maker’s degree of

confidence aligns with the likelihood that the judgment is correct, the decision

maker is considered to be well-calibrated (in the absolute accuracy sense). And the

decision maker is, accordingly, understood as less well-calibrated to the extent that

the degree of confidence over- or under-predicts the likely accuracy of the

judgment.

Suppose, for example, that someone – the decision maker, or exerciser of

judgment – judges the speed of a passing car to be 55 miles per hour, plus or minus

five miles per hour. And suppose that the decision maker has a degree of confidence

such that she is 80% sure that her judgment is correct. And if it turns out, over

some number of trials, that she is in fact correct 80% of the time in assessing speeds

within this range, then we could conclude that she is well-calibrated – her

confidence level is a reliable predictor of the likelihood of her accuracy. But if she

Robert A. Bjork, eds., 2008); Kevin Krug, The Relationship between Confidence and Accuracy: Current Thoughts of the Literature and a New Area of Research, 3 Applied Psych. in Crim. Just. 7 (2007); Karlos Luna & Beatriz Martin-Luengo, Confidence-Accuracy Calibration with General Knowledge and Eyewitness Memory Cued Recall Questions, 26 Applied Cognitive Psych. 289 (2012).

were accurate only 40% of the time that she expressed 80% confidence, and 30% of

the time she expressed 50% confidence, and 50% of the time she expressed 30%

confidence, then she would be poorly calibrated because although it looks like she is

overconfident on average (i.e., her confidence overstates her accuracy), there is still

no way to use her estimates to predict much of use.

On the other hand, consider someone who is generally overconfident but is so in

a systematic way; that is, who does not show absolute accuracy but does show good

relative accuracy. Take the example from Scott Plous: “suppose a decision maker

were 50 percent accurate when 70 percent confident, 60 percent accurate when 80

percent confident, and 70 percent accurate when 90 percent confident. In such a

case confidence would be perfectly correlated with accuracy, even though the

decision maker would be uniformly overconfident by 20 percent.”12 Even though

the decision maker was inaccurate in her assessments, and even though the decision

maker was inaccurate in her degree of confidence in her assessments, the

uniformity of the overconfidence, her reliability, would make it easy to calibrate our

use of that decision maker’s conclusions.
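
The contrast between the two hypothetical decision makers just described can be sketched numerically. The following Python fragment is a rough illustration under our own assumptions: it treats each confidence level as a single (confidence, accuracy) pair taken from the text, weights the levels equally, and uses a simple Pearson correlation as a crude stand-in for relative accuracy. The psychological literature uses more refined measures, and the function names here are ours.

```python
from math import sqrt
from statistics import mean


def overconfidence_gaps(pairs):
    """Confidence minus accuracy (in percentage points) at each confidence level."""
    return [confidence - accuracy for confidence, accuracy in pairs]


def pearson(pairs):
    """Pearson correlation between confidence and accuracy across levels."""
    xs = [c for c, _ in pairs]
    ys = [a for _, a in pairs]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (confidence %, accuracy %) pairs as given in the text.
plous = [(70, 50), (80, 60), (90, 70)]    # uniformly overconfident by 20 points
erratic = [(80, 40), (50, 30), (30, 50)]  # confidence does not track accuracy

plous_gaps = overconfidence_gaps(plous)
erratic_gaps = overconfidence_gaps(erratic)
plous_r = pearson(plous)
erratic_r = pearson(erratic)
```

For the Plous decision maker every gap is the same twenty points and the correlation is perfect, so an observer could simply subtract twenty from each stated confidence. For the other decision maker the gaps vary in size and sign and the correlation is negative, so no single adjustment makes her stated confidence informative.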

2. How Legal Judgments are Different

Although these analyses of confidence-accuracy calibration dominate the

psychological literature, they have obvious shortcomings when applied to legal

contexts. And that is because, first, legal decision makers at all levels and in all

contexts rarely articulate the degree of confidence they have in their conclusions;

12 Scott Plous, The Psychology of Judgment and Decision Making 225 (1993).

and, second, unlike in psychology research, legal decision makers typically do not

have access to the “ground truth.”

Although witnesses at trial, and especially in response to cross-examination,

may express varying degrees of certainty about the facts and observations that they

are reporting, such expressions of less than complete confidence, or indeed even

actual expressions of complete or even partial confidence, are largely absent (or at

least invisible) in the context of judicial judgments.13 Justice Brandeis thus captured

the phenomenon well when he observed that he ordinarily convinced himself to a

lower degree of certainty (fifty-one percent, he said) than that with which he

expressed his judgment in writing an opinion.14 And under one understanding (or,

perhaps more accurately, our understanding) of Ronald Dworkin’s well-known “one

right answer thesis,”15 Dworkin agrees with Brandeis. It is not as if judges actually

believe that there is no plausible alternative answer, Dworkin is best understood as

claiming, but that it is a feature of the phenomenology of judging that judges believe

that their answer is correct, and believe that any other answer is incorrect,

independent of the actual strengths of those beliefs.16

13 And so too, typically, with administrative decisions and judgments.

14 Brandeis made the statement in the context of comparing himself to Justice Cardozo, who, Brandeis believed, found it necessary to convince himself one hundred percent before reaching a judgment or writing an opinion. See Joseph L. Rauh, et al., A Personal View of Justice Benjamin N. Cardozo: Recollections of Four Cardozo Law Clerks, 1 Cardozo L. Rev. 5, 12, 18 (1979).

15 Ronald Dworkin, Justice in Robes 41-43 & 266 nn. 3-5 (2006); Ronald Dworkin, Is There Really No Right Answer in Hard Cases?, in A Matter of Principle 119 (1985).

16 See Dworkin, Justice in Robes, supra note 15, at 266 nn. 3, 5.

Because judges (as well as police officers seeking search warrants and

administrative officials making administrative decisions) are thus typically loath to

describe in their opinions the degree of confidence they have in their judgments,17

and because judges seem especially reluctant to admit to relatively low levels of

confidence, the principal psychological concept of calibration described above may

be of limited value in most legal contexts. It would be nice to know, in theory,

whether a given legal judgment was correct or incorrect and how much confidence a

judge had that her conclusion was correct, thus enabling an observer to determine

the degree to which the judge’s confidence was calibrated with the likelihood that

the judge reached the correct conclusion. But with ground truths rarely accessible

(or existent) for such decisions, and with degrees of confidence even more rarely

expressed, it turns out that this precise sense of calibration is of limited value in

thinking about the nature of legal judgment.

B. Leaving Confidence Behind – Calibration for Accuracy

Because explicit expressions of degrees of confidence are so rare in legal and

judicial contexts, a more relevant conception of calibration in the context of the

17 We can think of two possible but indirect exceptions to the statement in the text. One might arise in the context of civil actions against public officials under 42 U.S.C. § 1983 (or Bivens v. Six Unknown Named Agents, 403 U.S. 388 (1971)), where only violations of “clearly established law,” see Wilson v. Layne, 526 U.S. 603 (1999); Anderson v. Creighton, 483 U.S. 635 (1987); Harlow v. Fitzgerald, 457 U.S. 800 (1982), can produce liability in the face of a qualified immunity claim. And the other would arise in the contexts (legal malpractice actions being the most obvious) in which questions of what the law is are treated as questions of fact. Thus a judge in a civil rights action might rule (on a motion for summary judgment, for example) that the state of the law was sufficiently uncertain that there could be no violation of clearly established law, and we can imagine an expert witness in a legal malpractice action testifying that the state of the law was, for example, probably such-and-such, but not certainly such-and-such.

analysis here is one that is concerned not with the alignment between confidence

and accuracy, but rather with the seemingly simpler question of the alignment

between an expressed judgment and the ground truth – the actual fact of the matter.

As noted previously, the relation between judgment and truth is what is ordinarily

understood as accuracy,18 and so we can think of the effort to align our judgments

with the ground truth as accuracy calibration.

Accuracy calibration is closer to the examples of the scale that reads five pounds

high or the ruler whose indication is an eighth of an inch short. So if we substitute a

human observer – a witness – for a mechanical scale, we could imagine a human

being who, like the “Guess Your Weight” booths at carnivals, estimated the weight of

the people she observed. And if the estimate were consistently five pounds over the

actual weight,19 we could calibrate her judgments by subtracting five pounds from

each of her estimates. That calibration would then bring an increase in the accuracy

of the post-calibration determination.

So now suppose that we are dealing with a witness who is testifying at a trial, or

a bystander who has witnessed a crime and is reporting what she saw to the police.

The witness or bystander reports that the person she saw running out of the bank

waving a gun and wearing a ski mask appeared to weigh about 200 pounds. If this

were a report to the police, the police officer might (in theory, even if rarely in

practice) ask the witness if her estimates of weight were usually high, or usually

18 Or, more precisely, as absolute accuracy, in the sense described above, and as distinguished from relative accuracy.

19 And on “actual” weight, see also note 5, supra.

low, or usually close to accurate. And if the estimate of 200 pounds were part of a

witness’s testimony at a trial, on cross-examination the witness might, in

theory, have been asked about the accuracy of her other estimates of weight on

other occasions. Alternatively, opposing counsel might have offered evidence about

previous weight estimates by this witness that had proved to be inaccurate. Under

either scenario, the idea would be to calibrate the accuracy of the witness on this

occasion by looking at the degree of her accuracy on other occasions. A history of

inaccuracy would lead the rational evaluator of the testimony to discount it, just as a

history of consistent overestimates would lead the rational evaluator to subtract

from the estimate provided by the witness.

If such matters had been raised on cross-examination, it is likely that the inquiry

would have been excluded, possibly under something like Federal Rule of Evidence

611, which as applied typically excludes matters relating to events other than the

ones being litigated in the case at hand.20 However useful it might be to the trier of

fact to be able to calibrate the witness’s testimony in just the way just described, the

legal system appears resistant to allowing a trier of fact to calibrate a factual report

20 The statement in the text is an overstatement, partly because Rule 611 does allow inquiries into credibility, partly because of variation among jurisdictions with respect to whether they have wide or narrow scope limitations on cross-examination, see Christopher B. Mueller & Laird C. Kirkpatrick, Evidence §6.63 at 603-06 (5th ed. 2012), partly because so much is left to the trial judge’s discretion, and partly because trial judges vary in terms of how widely they understand Rule 611’s allowance in all cases of cross-examination on “matters affecting the witness’s credibility.” As a result, all we claim here is that the issues we offer in this Article might lead to a broader scope of cross-examination and rebuttal evidence than now generally exists in both the federal and state systems.

by examining the accuracy of other and even similar reports made by the same

observer.21

Although we will return to the factual witness example presently, this is not the

place to pursue it, in large part because it is not clear that even this type of

calibration is especially relevant to the kinds of legal, as opposed to factual,

judgments that are often made by the courts and other legal actors whose

judgments are being reviewed. Unlike estimates of weight and other factual reports,

locating the ground truth of a legal judgment is more elusive. Such a task is not

impossible, of course. For example, a reviewing court might wish to know how

often a trial judge had made obvious errors of law occasioning reversal.22 Especially

under circumstances in which a reviewing appellate court would perceive itself to

be highly knowledgeable about the area of law at issue, that court might engage in

rigorous scrutiny of the legal judgments of a trial judge known to be frequently

reversed for making obvious mistakes of law, while at the same time being highly

21 The testimony of expert witnesses represents an obvious exception, because here it is in fact common for cross-examination to focus on the other assessments made by that expert.

22 Implicit in this statement is the belief that reversal is some (admittedly imperfect) guide to the conformity of a judgment below with some notion of legal accuracy or legal correctness. One example is Chief Justice Marshall’s observation in Marbury v. Madison, 1 Cranch (5 U.S.) 137 (1803), that a law allowing a conviction for treason on the testimony of only one witness would be plainly unconstitutional in light of the two-witness rule in Article III, Section 3, of the Constitution. We believe that there are, in like fashion, other examples of legal decisions that are simply wrong independently of some court declaring them so, see H.L.A. Hart, The Concept of Law 124-47 (Penelope A. Bulloch, Joseph Raz, & Leslie Green, eds., 3d ed., 2012), but we also recognize that, especially at the appellate level, and especially in light of the selection effect (see George L. Priest & Benjamin Klein, The Selection of Disputes for Litigation, 13 J. Legal Stud. 1 (1984)), such examples of plain legal error or inaccuracy are rare.

deferential, under conditions of legal uncertainty, to the judgments of a trial judge

whose decisions on matters of law were routinely upheld on appeal. The reviewing

court would use the reversal rate as a way of calibrating its judgment of the

accuracy, under conditions of legal uncertainty, of the trial judge’s legal conclusions.

Although this kind of calibration is thus possible in theory, it is likely that, in

practice, and especially given the operation of a selection effect making cases

involving clear right or wrong answers disproportionately unlikely to be litigated,23

it is rare that we are able to characterize decisions on matters of law, or even of

mixed questions of law and fact, as simply right or wrong. Rather, such decisions

are more likely to involve questions in which there is no ground truth or in which

we do not know what the ground truth is. In such cases, a reviewing body may be

less concerned with the degree of accuracy of some reviewed decision as a matter of

ground truth than it is with just how to evaluate the evaluative judgment that is

being reviewed. And it is this concern not with fact but with evaluating an

evaluation that leads us to our third conception of calibration.

C. The Calibration of Evaluative Judgments

Although reviewing courts and other reviewing institutions are sometimes

required to review factual determinations and legal determinations that have

relatively clear right or wrong answers, the assessment of the judgments of others

even more often arises in contexts in which the judgments being assessed are far

more evaluative than factual or otherwise straightforward. The question then is

23 See especially Priest & Klein, supra note 22. A good overview of the central issues is Leandra Lederman, Which Cases Go to Trial?: An Empirical Study of Predictions of Failure to Settle, 49 Case West. Res. L. Rev. 315 (1999).

how do courts assess an assessment, and how do they evaluate an evaluation?

When an appellate court is evaluating a lower court’s determination that a

defendant had (or did not have) the effective assistance of counsel, that a

warrantless search was or was not reasonable, that a state interest was or was not

substantial, that a regulatory mechanism was or was not the least restrictive (of

some constitutionally-recognized interest), or that a defendant should or should not

prevail on summary judgment because of the insufficiency of the plaintiff’s potential

evidence, the appellate court is faced with the task of evaluating what is itself an

evaluative judgment. And in such cases, we might hypothesize that the evaluator

might usefully wish to know just what scale the original decision-maker was

employing in making the decision now under review.

In review contexts such as these, calibration takes on a different meaning,

and we can label it evaluative calibration. Just as the graduate school or employer

wants to know what a 3.6 from some university means, and just as the potential

hotel or restaurant patron wants to know what two stars means,24 so too might

evaluators often wish to be able to calibrate the kinds of earlier evaluations that

have no intrinsic meaning, or at least have a broad enough range of meaning that

there is no particular conception of a clearly right or clearly wrong answer. And

although one might be (and should be) a metaphysical realist – a believer in a

mind-independent reality – about water and gravity and gold, and maybe even about the

24 Even apart from questions of evaluative calibration, we would also, of course, want to know what the scale is. For restaurants, for example, three stars is the best you can do in the Michelin Guide, four stars is the maximum for the New York Times, and other guides go up to a maximum of five stars.

rightness of altruism and the wrongness of child abuse, there are few metaphysical

realists about the star ratings for hotels and restaurants, the wine ratings on the 100

point scale commonly used by wine experts, and even the idea of an A- or a B+ on a

grade scale. And so when we want to know whether an 88 point wine or a two star

review or rating or a 3.6 grade point average is good or mediocre, we would, ideally,

like to know about the other ratings of the rater.25 If the rater has given a

restaurant three stars out of a possible four when we thought the restaurant was

terrible, and if the rater consistently gives high ratings to restaurants we believed

on the basis of our own experiences to be mediocre or worse, we might then well

ignore or discount the rater’s rating of an establishment we were considering

patronizing for the first time. Similarly, when some law schools algorithmically

lowered (or raised) the grade point averages of students coming from particular

institutions,26 their view was based on having seen how students with those grade

averages from those schools actually performed in law schools. If the students

consistently underperformed their undergraduate grade point averages compared

to students from other undergraduate institutions, then this differential would be

reflected in an adjustment of the admission index, and the adjustment can be

considered a form of calibration.
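The arithmetic of such an adjustment can be made concrete with a minimal sketch in Python. It is offered only as an illustration under stated assumptions, not as a description of any law school's actual admission index: the function names, the record layout, and the choice to shift each grade point average by the institution's average over- or under-performance relative to the pool as a whole are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def institution_offsets(records):
    """Estimate a per-institution adjustment from past students.

    `records` is an iterable of (institution, undergrad_gpa, law_school_gpa)
    tuples. The layout is hypothetical and chosen only for illustration.
    """
    gaps = defaultdict(list)
    for institution, undergrad_gpa, law_gpa in records:
        # How far law school performance departed from the undergraduate record.
        gaps[institution].append(law_gpa - undergrad_gpa)

    # Average departure across all past students, regardless of institution.
    overall_gap = mean(g for inst_gaps in gaps.values() for g in inst_gaps)

    # An institution's offset is its average departure relative to the pool.
    return {inst: mean(inst_gaps) - overall_gap for inst, inst_gaps in gaps.items()}


def calibrated_gpa(institution, undergrad_gpa, offsets):
    """Shift a reported GPA by the institution's offset (zero if no history)."""
    return undergrad_gpa + offsets.get(institution, 0.0)


# Hypothetical history: students from "A" underperform their grades in law
# school by more than students from "B" do.
history = [
    ("A", 3.8, 3.0),
    ("A", 3.6, 2.8),
    ("B", 3.4, 3.2),
    ("B", 3.2, 3.0),
]
offsets = institution_offsets(history)
same_grade_from_a = calibrated_gpa("A", 3.6, offsets)
same_grade_from_b = calibrated_gpa("B", 3.6, offsets)
```

On these invented figures the same reported 3.6 is read down for an applicant from institution A and read up for an applicant from institution B, which is the sense in which the adjustment calibrates the grade to the grader rather than measuring its accuracy.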

25 The Michelin Guide says that three-star restaurants are “worth a journey.” But if half the restaurants in the Guide were worth a journey in the Guide’s opinion, we might be more reluctant to actually make the journey than if such a rating were given, as is actually the case, to less than one percent of the establishments rated.

26 See note 6, supra.

Thus, when we are speaking of evaluative calibration, we are not primarily

interested in accuracy. Rather, the concern is with just how we should understand

what someone else’s judgment means in light of the other decisions or judgments

that that decision maker has reached, and thus in light of what we can infer that

decision maker’s evaluation scale to be. And if we then engage in our subsequent

evaluation of that earlier evaluation in light of this knowledge, we can be said to

have engaged in a process of calibration.

II. Some Potential Applications

A. The Norm of Non-Calibration

We have offered some potential applications of the possibility of calibration, and

it is now time to explore several of them in greater depth. We do so in order to

hypothesize the existence of what appears to be a norm of non-calibration, the

seeming norm of judicial behavior that prohibits or discourages courts from

officially examining the previous judgments of the body under review in order to

calibrate those judgments, or from officially acknowledging that such calibration has

occurred even if it has taken place surreptitiously.27

27 In saying that calibration sometimes occurs “surreptitiously,” we do not mean to imply anything pernicious, but rather to suggest that judges often know of the past behavior of the legislatures, courts, and agencies whose judgments they are reviewing, and might well be influenced by that knowledge even as they believe that it is officially the kind of information that they should not take into account. Cf. Andrew J. Wistrich, Chris Guthrie, & Jeffrey J. Rachlinski, Can Judges Ignore Inadmissible Information? The Difficulty of Deliberately Disregarding, 153 U. Pa. L. Rev. 1251 (2005) (reporting a study in which judges were often unable to ignore information they actually had but knew was legally unusable).

To start with a relatively straightforward example, consider a Supreme Court

Justice, especially one with no extreme28 views one way or another about the death

penalty, who is faced with evaluating29 a decision by a state supreme court to

uphold the death penalty as against defense claims of, for example, procedural

defects, ineffective assistance of counsel, or cruel and unusual administration. For

that Justice, we can ask whether it would make a difference that the state supreme

court whose judgment is under review almost always affirms death penalty

sentences, or instead almost always vacates them.30 If the state supreme court had a

long and persistent history of affirming capital sentences against such objections,

then the reviewing Justice might suppose that this case presented a more or less

typical case, and would evaluate the decision below according to the standard that

she generally applied to such matters. But if the court being reviewed was a court

that often or almost always vacated capital sentences, then the reviewing Justice

might calibrate this decision accordingly, concluding that here the causes for

potential reversal might be especially absent. And as a result she might be inclined

to be more deferential in her review. Conversely, if the case arose on a government

appeal from a lower appellate reversal, the reviewing Justice might conclude that a

reversal of a sentence or conviction by a court that routinely upholds convictions or

28 We use this term not as a pejorative, but just as a way of describing both tails of a distribution of views.

29 Possibly in deciding on the merits, possibly in deciding whether to vote to grant certiorari, and possibly in deciding whether when sitting as a single Justice to grant a request for a stay.

30 Cf. Charles Fried, Impudence, 1992 Sup. Ct. Rev. 155 (recounting the persistent anti-death penalty actions of the Ninth Circuit in the early 1990s).

sentences is an action that is very high on the reviewed court’s scale of error.31

Insofar as the reviewing Justice has information of this variety about other decisions

by the court being reviewed, she can be understood to be calibrating the current

decision, or, more precisely, to be calibrating her attitude to the current decision in

light of what she knows from other cases to be the relevant scale.

If we posit that actually knowing about these other results would enable the

reviewing Justice more accurately to calibrate the decision under review, or to

calibrate her degree of deference to the court being reviewed, the question then

arises as to whether she would or should in fact be permitted to examine those

other decisions. If she did calibrate on the basis of other decisions not now before

her, would this be a practice that could be openly acknowledged, for example by

making reference to these other decisions in an opinion? Could a judge actually say

that she is applying especially close scrutiny on review, for example, because of the

31 The phenomenon here is related but not identical to the occasional practice of the Supreme Court in signaling extreme easiness by the assignment of the opinion to the Justice least likely on the basis of past performance to be perceived as sympathetic to the claim now being upheld. Obviously unanimity itself will sometimes send such a signal, and Brown v. Board of Education, 347 U.S. 483 (1954), Cooper v. Aaron, 358 U.S. 1 (1958), and United States v. Nixon, 418 U.S. 683 (1974), are well-known examples. But there is also a signal sent when Justice Rehnquist writes for a unanimous Supreme Court in Jenkins v. Georgia, 418 U.S. 153 (1974), making clear the limits of the “local standards” idea in obscenity law, and so too with Justice White writing for a unanimous Court in Sable Communications v. FCC, 492 U.S. 115 (1989), again dealing with the limits of obscenity and communications indecency law, here in the context of a ban on sexually explicit telephone services. In such cases, the assignment of the opinion to the Justice known to be least receptive to the kinds of claims now being upheld is perhaps a way of telling the audience for the opinion that the case is especially easy, and this form of signal depends on an implicit calibration by the audience for the opinion.

previous decisions by the particular court or the particular judge whose judgment

she is now reviewing?

The death penalty example is unrepresentative in some ways, but

representative in others. It is definitional of the appellate process that judges are

reviewing the decisions of other judges,32 and whether it be a death penalty appeal,

a grant or denial of a motion for summary judgment, a decision to support or reject

a constitutional challenge to some state law or practice, or any of a large number of

other contexts in which there is an appellate review of an evaluative decision below,

the basic dynamic remains one of a legal evaluation of an earlier evaluative legal

judgment. And especially when the governing law requires that the evaluation be

something other than de novo,33 the reviewing judges would seem to benefit from

being able to calibrate the judgments they are being asked to review. And those

reviewing judges would also seem to benefit by being able to know as much about

the other judgments of the reviewed judge as a reader of a restaurant review

32 We recognize that appellate courts often review jury verdicts, but even in such cases the appellate court is in the position of reviewing the decision of a trial judge to let the case go to the jury, or to refuse to set aside a jury verdict.

33 When review is genuinely de novo, we might imagine that the basis for the decision being reviewed makes little or no difference. But if the standard of review is anything above de novo, some degree of deference is required, and it is in those situations that the reviewing body would like, we suppose, to have some idea of where on some scale the particular decision lies for the body being reviewed. When the Supreme Court is engaged in genuine rational basis review, for example, as in cases like New Orleans v. Dukes, 427 U.S. 297 (1976), or Williamson v. Lee Optical of Oklahoma, Inc., 348 U.S. 483 (1955), it is engaged in extreme deference to the administrative or legislative judgment below, and in evaluating that judgment it might wish to know just what kinds of decisions that body makes, so as better to be able to calibrate its implicit standard of review to the decision and the decision-maker being reviewed on this occasion.

benefits from knowing about the other judgments of the reviewer. But what

appears to be a norm of non-calibration precludes knowing about such other

judgments. Congress, state legislatures, administrative officials, administrative law

judges, and lower court judges all make decisions that are part of a large collection

of decisions by that institution or judge, and thus any particular decision being

reviewed lies somewhere on a scale for that institution or that judge. But under a

norm of non-calibration there is no way, at least officially and openly, for a

reviewing court or other institution to obtain or overtly use just this sort of

presumably valuable information.

As described briefly above, a similar opportunity for calibration also arises in the

context of jurors (or judges operating as triers of fact) in evaluating the testimony of

witnesses. But again calibration by reference to the analog of a decisional history is

rarely permitted. Sometimes, of course, witnesses will testify as to matters of fact

that have straightforward answers or testify about questions where the answer is a

simple yes or no. But witnesses also testify about speed, height, weight,

temperature, attitude (“he seemed angry”), condition (“he was drunk”), and a vast

number of other matters that are as much evaluative (even if not normatively

evaluative) as they are simply factual, and that in important ways can be

characterized in terms of a scale. Speed is variable, as are height and weight and so

on, but there are also degrees of anger, degrees of drunkenness, and degrees of a

very large number of things about which witnesses routinely offer evidence.

When witnesses testify about such scalar matters, the issues appear to be similar

to those arising when a reviewing court is evaluating an earlier determination by a

lower court, legislature, administrative official, or administrative law judge. As with

these latter examples, the evaluator of a witness’s testimony wants to know where

on the witness’s scale some conclusion lies, so as to be able better to understand and

use it. Indeed, the ideal testimony, in theory, would be testimony in which the

assessor – judge or jury – would know about a previous judgment by the witness on

some question whose answer is already known by the assessor. If a witness testifies

that Susan was angry, it would be ideal, in theory, if the assessor knew how the

witness would characterize the attitude of Harry, known by the assessor, on an

occasion also known by the assessor. This is close to what happens when we hear

or read a restaurant review, for what we would really like to know is how the

reviewer rated another restaurant about which we have already formed an opinion.

With that information in hand, we would be best able to calibrate the review of the

reviewer on this occasion, and thus, with analogous information on hand, the trier of

fact would be best able to calibrate the testimony of the witness on this occasion.

Such ideal information will, of course, rarely be available. Nevertheless, a

second-best solution would be for the trier of fact to be able to have information

about other judgments made by the witness, even if not about things already known

to the trier of fact. If the trier of fact at least knew about the other judgments,

and could form some impression about the alignment of those judgments with some

assumed reference standard, the trier of fact could engage in a better calibration of

the testimony than might be possible under current practice, where even with

respect to witnesses at trial a norm of non-calibration appears to make such matters

largely inaccessible.34

B. A Large Exception and Its Implications

We have suggested that there appears to exist a prevailing rule of

non-calibration. That is, courts and other reviewing bodies will not typically calibrate

their standard of review or degree of deference to what they know or might find out

about the previous decisions of the body being reviewed. Or, if the reviewers do

engage in such calibration, whether consciously or not, they typically will not, again

because of the force of the norm of unavailability, admit to doing it.

A significant exception to the norm of non-calibration appears to exist with

respect to judicial review of administrative agencies. In this context, there is some

indication that the norm of non-calibration is weaker, and that the past decisions

and behavior of an agency being reviewed are thought to influence the attitude of

the reviewing court on a particular occasion.

The practice of “agency-specific”35 standards of review appears to have

started with reactions to what were perceived to be National Labor Relations Board

34 Witnesses can, of course, be cross-examined or impeached on issues going to their credibility. Fed. R. Evid. 607, 611(b). Moreover, cross-examination or impeachment going to witness credibility is typically not limited by the “scope of direct” rule. See, e.g., United States v. Moore, 917 F.2d 215, 222 (6th Cir. 1990); United States v. Sullivan, 803 F.2d 87, 90-91 (3d Cir. 1986). But credibility is rarely understood so broadly as to allow wide-ranging cross-examination or impeachment going to the kind of calibration we are discussing here, especially because credibility is widely understood to be focused, even if not exclusively, on issues of veracity and not on issues of perception or judgment.

35 See Richard E. Levy & Robert L. Glicksman, Agency-Specific Precedents, 89 Texas L. Rev. 499 (2011).

practices that diverged from those of other agencies, in particular the use by the

NLRB of adjudication as a rule-making tool in order to avoid the structures and

strictures imposed by the Administrative Procedure Act on agency rulemaking, and

also a differentially (compared to other agencies) large gap between articulated

standards and adjudicative outcomes.36 As a result, it appeared to some observers

(although not acknowledged by the reviewing courts) that judicial review of NLRB

decisions was different from and more intrusive than judicial review of other

agencies, a consequence of the unadmitted judicial knowledge of exactly this

differential behavior.

More recently, others have noticed the more widespread existence of the

same phenomenon,37 and have gone on to endorse it,38 with Richard Pildes, for

example, arguing that taking differences among agencies and their past behavior

into account in evaluating the decisions of those agencies is a sensible and realistic

reaction to the differences among agencies in their political makeup, their structure,

the method by which their senior members are appointed, the matters they are

called upon to decide, and much else.39 To fail to do so, Pildes argues, is a formalistic

insistence on treating all agencies the same when in fact they plainly are not.40

36 See Joan Flynn, The Costs and Benefits of “Hiding the Ball”: NLRB Policymaking and the Failure of Judicial Review, 75 B.U. L. Rev. 387 (1995), and earlier, and especially, Ralph Winter, Judicial Review of Agency Decisions: The Labor Board and the Court, 1968 Sup. Ct. Rev. 53.

37 E.g., Jennifer Nou, Sub-Regulating Elections, 2013 Sup. Ct. Rev. 135.

38 See Richard H. Pildes, Institutional Formalism and Realism in Constitutional and Public Law, 2013 Sup. Ct. Rev. 1, 21-30.

39 Id.

We do not in this Article purport to make a contribution to administrative

law. But the fact that the past decisions of a particular agency seem often to be

relevant to courts reviewing agency decisions suggests a possible generalization. If

it is at times useful and appropriate for reviewing courts to take an agency’s past

judgments into account in determining the degree and type of scrutiny to be applied

to that agency’s judgment on a particular occasion, then might it also at times be

useful and appropriate for a reviewing appellate court to do the same with different

trial courts and trial judges? Similarly, might it be useful and appropriate for a

magistrate to do the same with different police officers and different police

departments who are seeking search warrants? And might it be also useful and

appropriate for an administrative law judge to do much the same thing with the

different officials and different parts of the agency whose judgments she is asked to

review?

At times such history-based calibration practices do exist when legal

decision-makers are evaluating the judgments and actions of citizens. The

Securities and Exchange Commission, for example, is frequently required to

evaluate the accuracy or non-misleadingness of representations made in, for

example, registration statements, proxy statements, and periodic reports. But in

some of these areas the Commission has an overt process of allowing some

registrants the benefit of a fast-track or similarly cursory review process, a process

40 Id. Pildes is correct that treating different agencies, or different phenomena in general, in the same way, is formalistic, although for one of us such formalism is not necessarily always to be condemned. See Frederick Schauer, Formalism, 97 Yale L.J. 509 (1988).

that is available only to registrants with proven track records of accuracy, and a

process that is revocable upon evidence of inaccuracy on some occasion.41 In

evaluating the accuracy of such filings, therefore, the Commission is openly

calibrating its degree of scrutiny to what it knows from the past practices of the

individual or entity being reviewed. The audit practices of the Internal Revenue

Service are similar even if less overt and less (publicly) systematized, where again

the degree of scrutiny at the audit and subsequent stages appears to take account of

the particular history of the particular taxpayer.

The practice of agency-specific review suggests that the form of calibration

we describe here is hardly unknown to the law. At times the norm of unavailability

therefore does not prevail, and reviewing bodies calibrate their assessments of the

judgment being reviewed in light of the full array of judgments made over time by

the subject of the review, just as the user of a TripAdvisor review calibrates her

assessment of a review in light of the full array of reviews made by a particular

reviewer. But agency-specific review is still more the exception than the rule, and

the question is then presented about what might explain the rarity in law of a

practice that characterizes not only TripAdvisor, but also the way that most people

make most of their decisions in most aspects of their lives.

III. Barriers to Calibration

41 See, e.g., the “safe harbor” provisions in Section 21E of the Securities Exchange Act, 15 U.S.C. §78u-5 (2014), the “seasoned issuer” provisions under SEC Rule 405, 17 C.F.R. §230.405 (2015), the “bad actor disqualifications” under Regulation D, Rule 506, 17 C.F.R. §230.506(d) (2015), and the streamlined disclosure procedures provided for in Rule 144(c)(1), 17 C.F.R. §230.144(c)(1).

If we are correct in believing, along with much of the non-legal world and some

of the legal world, that calibration, whether of witness testimony or legal judgments

being reviewed by legal reviewers, is at least sometimes potentially valuable, if we

are correct that calibration will often be assisted by knowing about judgments made

by the witness or reviewee body on other occasions (and maybe even the outcome

of those judgments), and if we are correct that such information is routinely

unavailable in the legal system, then we might usefully think about why this is so.

One possibility is that law imagines itself as a pervasively particularistic

institution. Whether it be legal scholars urging that matters be decided “one case at

a time,”42 or the fact-specific and particularistic orientation of the common law, or

the general impermissibility of using evidence of acts or behavior on other

occasions to prove conformity on this occasion,43 there are important ways in which

the legal system, whether for reasons of supposed particularistic justice or for

reasons of efficiency or just because of a pervasive legal ideology of particularism,44

is persistently averse to spending too much time dealing with cases or issues other

42 Cass R. Sunstein, One Case at a Time: Judicial Minimalism on the Supreme Court (2001).

43 This is the so-called propensity or character rule, embodied in, for example, Rule 404 of the Federal Rules of Evidence.

44 Which one of us has already spent far too much time and ink challenging. See Frederick Schauer, Thinking Like a Lawyer: A New Introduction to Legal Reasoning (2009); Frederick Schauer, Profiles, Probabilities, and Stereotypes (2003); Frederick Schauer, Playing By the Rules: A Philosophical Examination of Rule-Based Decision-Making in Law and in Life (1991). The other one of us, however, has spent less time and less ink arguing that people can be generalizers, and is concerned with the frequency with which people are often – depending on context – particularists. See Barbara A. Spellman, Individual Reasoning, in Intelligence Analysis: Behavioral and Social Scientific Foundations 117 (Baruch Fischhoff & Cherie Chauvin eds., 2011).

than the one now under consideration. Whether this is actually true is a difficult

question, but it appears as if the law believes that it is true. And whether it is

desirable is also a difficult question, but, again, the law appears to believe that it is

desirable. And thus it may be that the aversion to calibrating by use of decisional

history, although most overtly manifested in the kinds of examples we have been

discussing, is in fact best explained by something far more pervasive in law in

general.

Other obstacles to history-based calibration may be more pragmatic. How will

judges (or jurors) get the kind of information they might need to engage in the

process of calibration? How will they find out about other decisions, and other

cases, and about how those other decisions turned out? There is a serious risk that

opening the door to this kind of information will lead to such an expansion of the

domain of usable legal information that the disadvantages, even if only logistical and

pragmatic, would far outweigh the advantages.

Moreover, the aversion to calibration may also embody non-epistemic goals.

Just as deference in general may often reflect a non-epistemic respect for the

decision-making powers of others,45 the unwillingness of an appellate body to

examine the decisional history of those it is reviewing may manifest a form of

respect, even if not epistemically justified, for those it is reviewing. Relatedly, there

may be non-epistemic institutional equality goals that would militate in favor of

treating all agencies or all lower courts as equivalent even if they are not. An

appellate court that is willing, openly, to treat the decisions of some lower court

45 See Philip Soper, The Ethics of Deference: Learning from Law’s Morals (2002).

judges with less deference than it treats others is displaying a lack of respect which,

even when epistemically justified, may be thought inconsistent with some of the

goals of the legal system.

Yet it is important to recognize that calibration, even under existing procedures,

is not absent. It is just that legal decision makers calibrate against their own views,

or against an assumed average, or an assumption that the witness or lower court

judge is more or less like them. Or it may be that legal decision makers, especially

judges, do exactly what we are suggesting here, but do it in the halls by making use

of gossip in the country’s courthouses, or do it by reading the newspapers, or in

other ways obtain information that they are, in theory, not supposed to have, and

use information that they are not permitted to acknowledge publicly that they have

in the first place. Insofar as this is the case, and thus insofar as there is far more

calibration from other events and other decisions than legal actors are willing or

allowed to admit, then it is possible that bringing the entire practice out in the open,

making it more systematic, and making it more legitimate, may have some salutary

effects.

IV. Conclusion

The law hovers ambivalently between generality and particularity. The law’s

generality is exemplified by its heavy reliance on rules, on precedent, and on various

principles, maxims, canons, and other vehicles of generality. But the law’s

particularity manifests itself in, for example, the very idea of common law method,

in the calls to make decisions one case at a time or on the facts of particular

controversies, and in the reluctance of the law of evidence to allow evidence of past

practices to be used as proof of current behavior, however epistemically and

probabilistically rational such a course of action may be.

Law’s pervasive but not universal reluctance to allow reviewers to take account

of the past decisions of the individuals and institutions it is reviewing reflects this

ambivalence. By looking only at the particular decision under review, and not

calibrating the posture of review on the basis of a history of decisions, reviewing

courts and other reviewing institutions embody the particularism that is one part of

the American and common law legal traditions.46 But generality is also a part of the

legal tradition, both in the United States and elsewhere. In exercising review

without calibrating in light of the reviewee’s history, reviewing institutions choose a

form of particularism not only over generality, but over accuracy as well. In some

review environments that may be the right choice to make. But in others it may not

be, and it is not implausible to suggest that, at times, law may have at least a small

bit to learn from institutions such as TripAdvisor.

46 Of course distinguishing among different administrative agencies, or among different judges whose decisions are being reviewed, is itself a form of particularism, just as treating all agencies the same despite their differences, or treating all reviewee courts and judges the same despite their differences, is a form of generalization. This may suggest that the dimensions of generality and particularity, at least as applied to law, are themselves complex, but exploring this question would take us far beyond the focus of this Article.
