Note to Duke colleagues: This manuscript is in many ways more preliminary than the typical workshop paper. Not only does it need refinement of analysis, modification of argumentative structure, addition of references, and polishing of prose, but it could also benefit greatly from many more examples (and counterexamples). We think there is a genuine idea here, but much more needs to be done, and we are grateful for your assistance. Fred Schauer & Bobbie Spellman
10/26/2015
CALIBRATING LEGAL JUDGMENTS
Frederick Schauer1 & Barbara A. Spellman2
Legal decision-makers must frequently assess the judgments of other legal
decision-makers. The Supreme Court is required to evaluate the judgments of
federal courts of appeals and state supreme courts, and those courts must evaluate
the legal and sometimes the factual rulings of trial courts. Courts are also routinely
called upon to determine the legal sufficiency of decisions by Presidents, governors,
members of Congress, state legislators, heads of executive departments, police
officers, school principals, and countless administrative officials. So too must
1 David and Mary Harrison Distinguished Professor of Law, University of Virginia.
2 Professor of Law, University of Virginia. Earlier versions of this Article have been presented at the University of Arizona College of Law, the University of Pennsylvania School of Law, and at the 2015 Convention of the International Association of Legal and Social Philosophy. We are grateful for audience comments and questions on those occasions, and to Will Baude, Paul Mahoney, Stephen Morse, and Andrew Vollmer for valuable information and references.
judges and magistrates in issuing search warrants assess the reliability of the
representations made to them by law enforcement officers seeking the warrant.
And of course jurors as well as judges must evaluate the veracity and credibility of
the witnesses who testify in court.
When such situations arise in everyday life, the evaluator of someone else’s
decision or judgment often seeks to inform her evaluation by considering, explicitly
or implicitly, the other decisions or judgments made by the decision-maker whose
decision is now under review. When a friend recommends a restaurant or movie to
us, we would like to know something about that friend’s reactions to other
restaurants or movies, preferably in the context of restaurants or movies we have
both seen, so that we can assess whether the judgment on this occasion is worth
following. If another friend says that we will like someone we have yet to meet, we
want to have some idea of who the friend likes and does not like, in order that we
can decide what to make of the endorsement on this occasion.3 And when an
applicant for admission to graduate school reports having an undergraduate GPA of
3.6, the graduate school wants to know where the 3.6 stands in relation to other
students (hence the value of class rankings), just as an employer of a student wants
to know much the same thing, and just as the recipient of a recommendation wants
to know whether the recommender tends to like everyone, or no one, or something
in between.
3 Yes, we are thinking of blind dates, but the situation arises in other contexts as well.
In ordinary talk, this process of attempting to gauge another’s decisions is
often referred to as calibration, although, as we will explain, this ordinary sense of
calibration both diverges from an important technical sense and may also be a
covering term that encompasses a number of diverse processes, which it will be
important to distinguish. But for the moment we can describe “calibration” as the
process by which a user of some measuring device (or an assessor of some
measurer) attempts to establish the relationship between the indicated measure
and some relevant standard.4 Thus, we calibrate a (weight) scale by attempting to
establish the relationship between what the scale reads and what something
actually weighs. If the scale systematically reads five pounds low, and if (perhaps
counterfactually) we wish to know what we actually weigh, we calibrate the scale
(or calibrate ourselves to the scale) by adding five pounds to the scale’s reading or
by adjusting the scale to take account of the error. So too when we take a
measurement knowing that some ruler gives a measure slightly longer than actual
length,5 or when a rifleman aims slightly lower than where the gunsight tells him to
aim, knowing from previous experience that the gunsight leads him consistently to
miss high.
4 The Oxford English Dictionary, for example, tells us that to calibrate is to “make allowance for [the] irregularities” of a measuring instrument. 1 Compact Edition of the Oxford English Dictionary 318 (1971).
5 We understand, of course, that neither one foot nor one inch nor one pound has some sort of metaphysical reality. Feet, inches, and pounds, like meters and kilograms, are human-created standards. But although the standard meter, for example, is a creation of human beings, it is nevertheless the case that a ruler is accurate insofar as what it reads approaches the standard measure.
At times the desire for this kind of calibration is systematized. Colleges and
universities often provide information so that employers and graduate schools can
calibrate whether the 3.6 is extraordinarily good or barely above average.6 And,
increasingly, the user of on-line restaurant and hotel reviews at sites such as
TripAdvisor can examine the review history of particular reviewers in order to
determine whether a reviewer’s rave review should be discounted because this
particular reviewer says nice things about every place, or whether some reviewer’s
brutally negative assessment should similarly be discounted (or, to put it differently,
inflated) because that reviewer has something bad to say about every establishment
he patronizes.7
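The review-history adjustment just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `relative_rating`, the five-star scale, and both review histories are hypothetical and are not drawn from TripAdvisor or any other site. The sketch expresses a new rating as a departure from the reviewer's own average rather than as a raw number of stars.

```python
from statistics import mean

def relative_rating(rating, history):
    """How far a new rating sits above (+) or below (-) the reviewer's own average."""
    return rating - mean(history)

# Hypothetical review histories on a five-star scale (illustrative data only).
generous_history = [5, 5, 4, 5, 5]   # says nice things about nearly every place
harsh_history = [1, 2, 1, 2, 2]      # finds fault with nearly every place

# The generous reviewer now awards five stars; the harsh reviewer awards four.
generous_signal = relative_rating(5, generous_history)
harsh_signal = relative_rating(4, harsh_history)

# Compare the two ratings after adjusting for each reviewer's habits.
print(round(generous_signal, 2), round(harsh_signal, 2))
```

On these assumed histories, the harsh reviewer's four stars depart much further from that reviewer's norm than the generous reviewer's five stars do, which is the sense in which a rave from a habitual enthusiast is discounted.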
Yet although such calibration is ubiquitous in ordinary life, and is at first
glance seemingly desirable, the legal system appears implicitly to resist it. A court
evaluating an administrative law judge’s denial of Veterans or Medicare or Social
Security benefits would presumably wish to know whether that administrative law
judge is a frequent denier or instead whether this denial should be entitled to great
deference because that judge denies benefits so rarely. Much the same desire for a
6 For example, prior to the emphasis on absolute grade point averages prompted by the formula used by the US News and World Report law school ranking system, some law schools would take into account in their admissions algorithm the previous grade point averages and law school performances of applicants from particular undergraduate colleges and universities. This knowledge would enable the law school, implicitly in its algorithm, to calibrate (or adjust) the undergraduate grade point average of a particular applicant according to the prior performances of prior applicants from that same school.
7 We have in mind TripAdvisor and Yelp, although there are other restaurant and hotel rating sites that do much the same thing. Indeed, the TripAdvisor model, in which a reviewer’s review history is easily accessible to the user, is one to which we will refer repeatedly in this Article.
decisional history would seem to apply as well to the decision by that administrative
law judge in evaluating the grant or denial by the administrative official who made
the initial decision. Similarly, the judges of an appellate court reviewing a trial
judge’s legal or factual ruling against a labor union might want to know whether
they are assessing the decision of a judge who typically rules for (or against) unions,
so as better to calibrate their assessment of the trial judge’s ruling on this occasion.
In the same way, the Supreme Court’s review of a lower court’s decision to uphold a
death sentence might usefully be informed by knowing whether the court under
review is generally sympathetic or hostile to capital sentences. And a jury
attempting to determine whether a witness who describes a person as drunk is
accurate or exaggerating would ideally benefit from information about the full range
of instances in which that witness has described someone as drunk, or not.
However sensible it might seem for judges, jurors, and appellate courts to
wish to calibrate in just this way, and however much the TripAdvisor model8 seems
to embody a larger rationality, the legal system typically, albeit implicitly, avoids
this form of calibration. Whatever the judges of a United States Court of Appeals
might actually be thinking in evaluating a decision by a sentencing District Judge to
depart upwards from the sentencing guidelines, or whatever those appellate judges
might actually know from informal research, personal contact, or hallway gossip, it
would be inappropriate for the appellate court to say explicitly in an opinion that it
was rejecting the upward departure because this judge is known to be an especially
8 See note 7 and accompanying text, supra.
tough sentencer, or has a history of upward departures.9 Indeed, it would even be
considered inappropriate for an appellate court openly to seek such information,
however much the appellate judges might be aware of the information from
previous cases or word of mouth. And cross-examination of a witness about other
similar assessments she has made – of drunkenness, say – would likely be excluded
by rules limiting cross-examination to matters addressed on direct examination.10
Our goal in this Article is to examine the ways in which just this kind of
calibration might be useful in various legal settings, to trace why the legal system
seems implicitly to be hostile to what so many other aspects of life and decision-making
have embraced, and to suggest ways in which calibration might valuably
become more accepted than it is now in various legal contexts. Our aim is less,
9 On departures from the Sentencing Guidelines generally, see Michael S. Gelacak, Ilene H. Nagel, & Barry L. Johnson, Departures Under the Federal Sentencing Guidelines: An Empirical and Jurisprudential Analysis, 81 Minn. L. Rev. 299 (1996). Upward departures are now constrained by, inter alia, United States v. Booker, 543 U.S. 220 (2005), but for present purposes we need not get into the details about departures or review of them, other than to note that the standard of review is a deferential one. See Gall v. United States, 552 U.S. 38, 45-52 (2007) (mandating an abuse-of-discretion standard in reviewing departures from the sentencing guidelines).
10 Fed. R. Evid. 611(b). This is a topic about which one of us has previously done experimental research. See Barbara A. Spellman & Elizabeth R. Tenney, Credible Testimony In and Out of Court, 17 Psychonomic Bull. & Rev. 168 (2010); Barbara A. Spellman, Elizabeth R. Tenney, & Margaret J. Scalia, Relying on Other People’s Metamemory, in Successful Remembering and Successful Forgetting: A Festschrift in Honor of Robert A. Bjork 387 (Aaron S. Benjamin ed., 2011); Elizabeth R. Tenney, Barbara A. Spellman, & Robert J. MacCoun, The Benefits of Knowing What You Know (and What You Don’t): How Calibration Affects Credibility, 44 J. Experiment. Soc. Psych. 1368 (2008); Elizabeth R. Tenney, Barbara A. Spellman, Robert J. MacCoun, & Reid Hastie, Calibration Trumps Confidence as a Basis for Witness Credibility, 18 Psych. Sci. 46 (2007); Elizabeth R. Tenney, Barbara A. Spellman, & Robert MacCoun, Expanding the Scope of Cross Examination So that Jurors Can Infer Witness Calibration, available at papers.ssrn.com/sol3/papers.cfm?abstract_id=998593.htm.
however, to make recommendations for reform than it is to open up for analysis and
discussion a topic that seems to have been unfortunately ignored for too long in the
design of legal institutions.
I. Three Concepts of Calibration
As mentioned above, we use “calibration” as a covering term to describe three
quite different processes, or three different contexts in which an assessor might be
faced with assessing someone else’s assessment. For purposes of clarity, and also to
sharpen the focus on the one of the three that is our principal concern here, we will
describe each of the three processes that might all be considered as calibration in
one sense or another.
Before turning to the three processes, we again start, broadly speaking, with
the idea that calibration is the process of setting or assessing the relationship
between some measuring device (or measurer) and some relevant standard, in order to
conform the measurement to that standard. As in the examples of scales, rulers, and
gunsights, we treat the measuring device as accurate insofar as what the measuring
device reports accords with the relevant standard, and we treat the measuring
device as well calibrated insofar as its “judgments” over a range of measurements
demonstrate a consistent and therefore predictable difference between what the
device indicates and what the underlying “truth” actually is. So a scale that reads
150 pounds when the actual weight is 145 is inaccurate by five pounds, but could be
considered well calibrated to the extent that the scale is always five pounds high.
Even though the scale is inaccurate, its inaccuracy would be reliable, and the
calibration is effective insofar as it uses reliability to compensate for inaccuracy.
And if the scale were inaccurate but not in any reliable way, attempts at calibration
would be less effective, making the scale, or our use of it, less well calibrated. Thus,
we calibrate our use of a reliably inaccurate scale by subtracting five pounds, and
insofar as this calibration produces consistent results our assessments of weight
become well calibrated.
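The distinction between accuracy and reliability in the scale example can be made concrete with a short Python sketch. Everything here is an assumption for illustration: the function names `error_profile` and `corrected`, the list of "true" weights, and the readings of both scales are hypothetical. The sketch treats the average error as the offset to be subtracted, and the spread of the errors as a rough indication of how well that subtraction will work.

```python
from statistics import mean, pstdev

def error_profile(readings, true_values):
    """Return (offset, spread): the average error and how much the error varies."""
    errors = [r - t for r, t in zip(readings, true_values)]
    return mean(errors), pstdev(errors)

def corrected(reading, offset):
    """Calibrate one reading by removing the device's average error."""
    return reading - offset

true_weights = [145, 160, 180, 200]      # hypothetical "ground truth" weights

reliable_scale = [150, 165, 185, 205]    # always five pounds high
erratic_scale = [150, 150, 190, 200]     # errors of +5, -10, +10, and 0

reliable_offset, reliable_spread = error_profile(reliable_scale, true_weights)
erratic_offset, erratic_spread = error_profile(erratic_scale, true_weights)

# A spread of zero means the same correction works for every reading.
print(reliable_offset, reliable_spread)
print(erratic_offset, round(erratic_spread, 2))
print(corrected(150, reliable_offset))
```

For the reliably inaccurate scale the spread is zero, so subtracting the offset recovers the true weight every time. For the erratic scale the average error is small but the spread is large, so no single correction is of much help.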
When we move from the judgments of mechanical devices like scales, rulers, and
gunsights to the judgments of human beings, however, things become more
complex. Indeed, there are multiple ways of assessing the calibration of human
judgment, and it is important at the outset to get clear just what it is that we are
talking about. Specifically, disaggregating the multiple ideas encompassed by the
covering term “calibration” is especially important because much of the
psychological literature on calibration turns out to address an issue related to but
importantly different from the type of calibration that is our primary focus here.
And that is our principal justification for commencing the analysis by distinguishing
three concepts of calibration.
A. Confidence and Calibration
1. Calibration in Psychology Research.
The first concept of calibration, and the one that dominates the psychological
research, focuses on the relationship between the degree of confidence a decision-maker
has expressed in some judgment and the actual accuracy of that judgment.11
11 See, e.g., Linda Bol & Douglas J. Hacker, Calibration Research: Where Do We Go from Here?, 3 Frontiers Psych. 229 (2012) (“Calibration is the degree of fit between a person’s judgment of performance and his or her actual performance.”). See also Douglas J. Hacker, Linda Bol, & Matt C. Keener, Metacognition in Education: A Focus on Calibration, in Handbook of Metamemory and Memory 429 (John Dunlosky &
Let us label this confidence-accuracy calibration. Accuracy is of course itself a
relation between a judgment and reality (or “ground truth”), and thus when we talk
about confidence-accuracy calibration we are talking about two types of accuracy,
each the subject of considerable research by psychologists. One is absolute
accuracy, which typically is described as “calibration” in a narrow sense, and the
other is relative accuracy, also known as “resolution” or “discrimination.”
Decision makers may be more or less confident in their judgments, and those
judgments may be more or less accurate. Insofar as the decision maker’s degree of
confidence aligns with the likelihood that the judgment is correct, the decision
maker is considered to be well-calibrated (in the absolute accuracy sense). And the
decision maker is, accordingly, understood as less well-calibrated to the extent that
the degree of confidence over- or under-predicts the likely accuracy of the
judgment.
Suppose, for example, that someone – the decision maker, or exerciser of
judgment – judges the speed of a passing car to be 55 miles per hour, plus or minus
five miles per hour. And suppose that the decision maker has a degree of confidence
such that she is 80% sure that her judgment is correct. And if it turns out, over
some number of trials, that she is in fact correct 80% of the time in assessing speeds
within this range, then we could conclude that she is well-calibrated – her
confidence level is a reliable predictor of the likelihood of her accuracy. But if she
Robert A. Bjork, eds., 2008); Kevin Krug, The Relationship between Confidence and Accuracy: Current Thoughts of the Literature and a New Area of Research, 3 Applied Psych. in Crim. Just. 7 (2007); Karlos Luna & Beatriz Martin-Luengo, Confidence-Accuracy Calibration with General Knowledge and Eyewitness Memory Cued Recall Questions, 26 Applied Cognitive Psych. 289 (2012).
were accurate only 40% of the time that she expressed 80% confidence, and 30% of
the time she expressed 50% confidence, and 50% of the time she expressed 30%
confidence, then she would be poorly calibrated: although she is overconfident on
average (i.e., her confidence overstates her accuracy), her errors follow no consistent
pattern, and so there is no way to use her stated confidence to predict her accuracy.
On the other hand, consider someone who is generally overconfident but is so in
a systematic way; that is, who does not show absolute accuracy but does show good
relative accuracy. Take the example from Scott Plous: “suppose a decision maker
were 50 percent accurate when 70 percent confident, 60 percent accurate when 80
percent confident, and 70 percent accurate when 90 percent confident. In such a
case confidence would be perfectly correlated with accuracy, even though the
decision maker would be uniformly overconfident by 20 percent.”12 Even though
the decision maker is inaccurate in her assessments, and even though she is
inaccurate in her degree of confidence in those assessments, the uniformity of her
overconfidence, that is, her reliability, would make it easy to calibrate our
use of that decision maker’s conclusions.
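The contrast between the two decision makers just described can be checked with a small Python sketch. The figures are those given in the text; everything else is an assumption made for illustration. In particular, the helper names `gaps` and `pearson` are hypothetical, and the Pearson correlation is used here only as a rough stand-in for relative accuracy (“resolution”), which the psychological literature may measure in other ways.

```python
from math import sqrt

def gaps(pairs):
    """Confidence minus accuracy (in percentage points) at each confidence level."""
    return [conf - acc for conf, acc in pairs]

def pearson(pairs):
    """Pearson correlation between stated confidence and observed accuracy."""
    confs = [c for c, _ in pairs]
    accs = [a for _, a in pairs]
    mean_c = sum(confs) / len(confs)
    mean_a = sum(accs) / len(accs)
    cov = sum((c - mean_c) * (a - mean_a) for c, a in pairs)
    var_c = sum((c - mean_c) ** 2 for c in confs)
    var_a = sum((a - mean_a) ** 2 for a in accs)
    return cov / sqrt(var_c * var_a)

# Each pair is (stated confidence %, observed accuracy %).
plous = [(70, 50), (80, 60), (90, 70)]     # the uniformly overconfident decision maker
erratic = [(80, 40), (50, 30), (30, 50)]   # the poorly calibrated speed estimator

# Uniform gaps plus a perfect correlation mean a single correction suffices.
print(gaps(plous), pearson(plous))
# Gaps that change size and sign mean no single correction is available.
print(gaps(erratic), round(pearson(erratic), 2))
```

For the first decision maker the gap is the same twenty points at every level, so an observer can simply subtract twenty from her stated confidence. For the second, the gap varies in both size and direction, and her confidence does not even rise and fall with her accuracy.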
2. How Legal Judgments are Different
Although these analyses of confidence-accuracy calibration dominate the
psychological literature, they have obvious shortcomings when applied to legal
contexts. And that is because, first, legal decision makers at all levels and in all
contexts rarely articulate the degree of confidence they have in their conclusions;
12 Scott Plous, The Psychology of Judgment and Decision Making 225 (1993).
and, second, unlike in psychology research, legal decision makers typically do not
have access to the “ground truth”.
Although witnesses at trial, and especially in response to cross-examination,
may express varying degrees of certainty about the facts and observations that they
are reporting, such expressions of less than complete confidence, or indeed even
actual expressions of complete or even partial confidence, are largely absent (or at
least invisible) in the context of judicial judgments.13 Justice Brandeis thus captured
the phenomenon well when he observed that he ordinarily convinced himself to a
lower degree of certainty (fifty-one percent, he said) than that with which he
expressed his judgment in writing an opinion.14 And under one understanding (or,
perhaps more accurately, our understanding) of Ronald Dworkin’s well-known “one
right answer thesis,”15 Dworkin agrees with Brandeis. It is not as if judges actually
believe that there is no plausible alternative answer, Dworkin is best understood as
claiming, but that it is a feature of the phenomenology of judging that judges believe
that their answer is correct, and believe that any other answer is incorrect,
independent of the actual strengths of those beliefs.16
13 And so too, typically, with administrative decisions and judgments.
14 Brandeis made the statement in the context of comparing himself to Justice Cardozo, who, Brandeis believed, found it necessary to convince himself one hundred percent before reaching a judgment or writing an opinion. See Joseph L. Rauh, et al., A Personal View of Justice Benjamin N. Cardozo: Recollections of Four Cardozo Law Clerks, 1 Cardozo L. Rev. 5, 12, 18 (1979).
15 Ronald Dworkin, Justice in Robes 41-43 & 266 nn. 3-5 (2006); Ronald Dworkin, Is There Really No Right Answer in Hard Cases?, in A Matter of Principle 119 (1985).
16 See Dworkin, Justice in Robes, supra note 15, at 266 nn. 3, 5.
Because judges (as well as police officers seeking search warrants and
administrative officials making administrative decisions) are thus typically loath to
describe in their opinions the degree of confidence they have in their judgments,17
and because judges seem especially reluctant to admit to relatively low levels of
confidence, the principal psychological concept of calibration described above may
be of limited value in most legal contexts. It would be nice to know, in theory,
whether a given legal judgment was correct or incorrect and how much confidence a
judge had that her conclusion was correct, thus enabling an observer to determine
the degree to which the judge’s confidence was calibrated with the likelihood that
the judge reached the correct conclusion. But with ground truths rarely accessible
(or existent) for such decisions, and with degrees of confidence even more rarely
expressed, it turns out that this precise sense of calibration is of limited value in
thinking about the nature of legal judgment.
B. Leaving Confidence Behind – Calibration for Accuracy
Because explicit expressions of degrees of confidence are so rare in legal and
judicial contexts, a more relevant conception of calibration in the context of the
17 We can think of two possible but indirect exceptions to the statement in the text. One might arise in the context of civil actions against public officials under 42 U.S.C. §1983 (or Bivens v. Six Unknown Named Agents, 403 U.S. 388 (1971)), where only violations of “clearly established law,” see Wilson v. Layne, 526 U.S. 603 (1999); Anderson v. Creighton, 483 U.S. 635 (1987); Harlow v. Fitzgerald, 457 U.S. 800 (1982), can produce liability in the face of a qualified immunity claim. And the other would arise in the contexts (legal malpractice actions being the most obvious) in which questions of what the law is are treated as questions of fact. Thus a judge in a civil rights action might rule (on a motion for summary judgment, for example) that the state of the law was sufficiently uncertain that there could be no violation of clearly established law, and we can imagine an expert witness in a legal malpractice action testifying that the state of the law was, for example, probably such-and-such, but not certainly such-and-such.
analysis here is one that is concerned not with the alignment between confidence
and accuracy, but rather with the seemingly simpler question of the alignment
between an expressed judgment and the ground truth – the actual fact of the matter.
As noted previously, the relation between judgment and truth is what is ordinarily
understood as accuracy,18 and so we can think of the effort to align our judgments
with the ground truth as accuracy calibration.
Accuracy calibration is closer to the examples of the scale that reads five pounds
high or the ruler whose indication is an eighth of an inch short. So if we substitute a
human observer – a witness – for a mechanical scale, we could imagine a human
being who, like the “Guess Your Weight” booths at carnivals, estimated the weight of
the people she observed. And if the estimate were consistently five pounds over the
actual weight,19 we could calibrate her judgments by subtracting five pounds from
each of her estimates. That calibration would then bring an increase in the accuracy
of the post-calibration determination.
So now suppose that we are dealing with a witness who is testifying at a trial, or
a bystander who has witnessed a crime and is reporting what she saw to the police.
The witness or bystander reports that the person she saw running out of the bank
waving a gun and wearing a ski mask appeared to weigh about 200 pounds. If this
were a report to the police, the police officer might (in theory, even if rarely in
practice) ask the witness if her estimates of weight were usually high, or usually
18 Or, more precisely, as absolute accuracy, in the sense described above, and as distinguished from relative accuracy.
19 And on “actual” weight, see also note 5, supra.
low, or usually close to accurate. And if the estimate of 200 pounds were part of a
witness’s testimony at a trial, on cross-examination the witness might, in
theory, have been asked about the accuracy of her other estimates of weight on
other occasions. Alternatively, opposing counsel might have offered evidence about
previous weight estimates by this witness that had proved to be inaccurate. Under
either scenario, the idea would be to calibrate the accuracy of the witness on this
occasion by looking at the degree of her accuracy on other occasions. A history of
inaccuracy would lead the rational evaluator of the testimony to discount it, just as a
history of consistent overestimates would lead the rational evaluator to subtract
from the estimate provided by the witness.
If such matters had been raised on cross-examination, it is likely that the inquiry
would have been excluded, possibly under something like Federal Rule of Evidence
611, which as applied typically excludes matters relating to events other than the
ones being litigated in the case at hand.20 However useful it might be to the trier of
fact to be able to calibrate the witness’s testimony in the way just described, the
legal system appears resistant to allowing a trier of fact to calibrate a factual report
20 The statement in the text is an overstatement, partly because Rule 611 does allow inquiries into credibility, partly because of variation among jurisdictions with respect to whether they have wide or narrow scope limitations on cross-examination, see Christopher B. Mueller & Laird C. Kirkpatrick, Evidence §6.63 at 603-06 (5th ed. 2012), partly because so much is left to the trial judge’s discretion, and partly because trial judges vary in terms of how widely they understand Rule 611’s allowance in all cases of cross-examination on “matters affecting the witness’s credibility.” As a result, all we claim here is that the issues we offer in this Article might lead to a broader scope of cross-examination and rebuttal evidence than now generally exists in both the federal and state systems.
by examining the accuracy of other and even similar reports made by the same
observer.21
Although we will return to the factual witness example presently, this is not the
place to pursue it, in large part because it is not clear that even this type of
calibration is especially relevant to the kinds of legal, as opposed to factual,
judgments that are often made by the courts and other legal actors whose
judgments are being reviewed. Unlike estimates of weight and other factual reports,
locating the ground truth of a legal judgment is more elusive. Such a task is not
impossible, of course. For example, a reviewing court might wish to know how
often a trial judge had made obvious errors of law occasioning reversal.22 Especially
under circumstances in which a reviewing appellate court would perceive itself to
be highly knowledgeable about the area of law at issue, that court might engage in
rigorous scrutiny of the legal judgments of a trial judge known to be frequently
reversed for making obvious mistakes of law, while at the same time being highly
21 The testimony of expert witnesses represents an obvious exception, because here it is in fact common for cross-examination to focus on the other assessments made by that expert.
22 Implicit in this statement is the belief that reversal is some (admittedly imperfect) guide to the conformity of a judgment below with some notion of legal accuracy or legal correctness. One example is Chief Justice Marshall’s observation in Marbury v. Madison, 1 Cranch (5 U.S.) 137 (1803), that a law allowing a conviction for treason on the testimony of only one witness would be plainly unconstitutional in light of the two-witness rule in Article III, Section 3, of the Constitution. We believe that there are, in like fashion, other examples of legal decisions that are simply wrong independently of some court declaring them so, see H.L.A. Hart, The Concept of Law 124-47 (Penelope A. Bulloch, Joseph Raz, & Leslie Green, eds., 3d ed., 2012), but we also recognize that, especially at the appellate level, and especially in light of the selection effect (see George L. Priest & Benjamin Klein, The Selection of Disputes for Litigation, 13 J. Legal Stud. 1 (1984)), such examples of plain legal error or inaccuracy are rare.
deferential, under conditions of legal uncertainty, to the judgments of a trial judge
whose decisions on matters of law were routinely upheld on appeal. The reviewing
court would use the reversal rate as a way of calibrating its judgment of the
accuracy, under conditions of legal uncertainty, of the trial judge’s legal conclusions.
Although this kind of calibration is thus possible in theory, it is likely that, in
practice, and especially given the operation of a selection effect making cases
involving clear right or wrong answers disproportionally unlikely to be litigated,23
it is rare that we are able to characterize decisions on matters of law, or even of
mixed questions of law and fact, as simply right or wrong. Rather, such decisions
are more likely to involve questions in which there is no ground truth or in which
we do not know what the ground truth is. In such cases, a reviewing body may be
less concerned with the degree of accuracy of some reviewed decision as a matter of
ground truth than it is with just how to evaluate the evaluative judgment that is
being reviewed. And it is this concern not with fact but with evaluating an
evaluation that leads us to our third conception of calibration.
C. The Calibration of Evaluative Judgments
Although reviewing courts and other reviewing institutions are sometimes
required to review factual determinations and legal determinations that have
relatively clear right or wrong answers, the assessment of the judgments of others
even more often arises in contexts in which the judgments being assessed are far
more evaluative than factual or otherwise straightforward. The question then is
23 See especially Priest & Klein, supra note 22. A good overview of the central issues is Leandra Lederman, Which Cases Go to Trial?: An Empirical Study of Predictions of Failure to Settle, 49 Case West. Res. L. Rev. 315 (1999).
how do courts assess an assessment, and how do they evaluate an evaluation?
When an appellate court is evaluating a lower court’s determination that a
defendant had (or did not have) the effective assistance of counsel, that a
warrantless search was or was not reasonable, that a state interest was or was not
substantial, that a regulatory mechanism was or was not the least restrictive (of
some constitutionally-recognized interest), or that a defendant should or should not
prevail on summary judgment because of the insufficiency of the plaintiff’s potential
evidence, the appellate court is faced with the task of evaluating what is itself an
evaluative judgment. And in such cases, we might hypothesize that the evaluator
might usefully wish to know just what scale the original decision-‐maker was
employing in making the decision now under review.
In review contexts such as these, calibration takes on a different meaning,
and we can label it evaluative calibration. Just as the graduate school or employer
wants to know what a 3.6 from some university means, and just as the potential
hotel or restaurant patron wants to know what two stars means,24 so too might
evaluators often wish to be able to calibrate the kinds of earlier evaluations that
have no intrinsic meaning, or at least have a broad enough range of meaning that
there is no particular conception of a clearly right or clearly wrong answer. And
although one might be (and should be) a metaphysical realist – a believer in a mind-independent
reality – about water and gravity and gold, and maybe even about the
24 Even apart from questions of evaluative calibration, we would also, of course, want to know what the scale is. For restaurants, for example, three stars is the best you can do in the Michelin Guide, four stars is the maximum for the New York Times, and other guides go up to a maximum of five stars.
rightness of altruism and the wrongness of child abuse, there are few metaphysical
realists about the star ratings for hotels and restaurants, the wine ratings on the 100
point scale commonly used by wine experts, and even the idea of an A-‐ or a B+ on a
grade scale. And so when we want to know whether an 88 point wine or a two star
review or rating or a 3.6 grade point average is good or mediocre, we would, ideally,
like to know about the other ratings of the rater.25 If the rater has given a
restaurant three stars out of a possible four when we thought the restaurant was
terrible, and if the rater consistently gives high ratings to restaurants we believed
on the basis of our own experiences to be mediocre or worse, we might then well
ignore or discount the rater’s rating of an establishment we were considering
patronizing for the first time. Similarly, when some law schools algorithmically
lowered (or raised) the grade point averages of students coming from particular
institutions,26 their view was based on having seen how students with those grade
averages from those schools actually performed in law schools. If the students
consistently underperformed their undergraduate grade point averages compared
to students from other undergraduate institutions, then this differential would be
reflected in an adjustment of the admission index, and the adjustment can be
considered a form of calibration.
25 The Michelin Guide says that three-star restaurants are “worth a journey.” But if half the restaurants in the Guide were worth a journey in the Guide’s opinion, we might be more reluctant to actually make the journey than if such a rating were given, as is actually the case, to less than one percent of the establishments rated.
26 See note 6, supra.
Thus, when we are speaking of evaluative calibration, we are not primarily
interested in accuracy. Rather, the concern is with just how we should understand
what someone else’s judgment means in light of the other decisions or judgments
that that decision maker has reached, and thus in light of what we can infer that
decision maker’s evaluation scale to be. And if we then engage in our subsequent
evaluation of that earlier evaluation in light of this knowledge, we can be said to
have engaged in a process of calibration.
II. Some Potential Applications
A. The Norm of Non-Calibration
We have offered some potential applications of the possibility of calibration, and
it is now time to explore several of them in greater depth. We do so in order to
hypothesize the existence of what appears to be a norm of non-calibration, the
seeming norm of judicial behavior that prohibits or discourages courts from
officially examining the previous judgments of the body under review in order to
calibrate those judgments, or from officially acknowledging that such calibration has
occurred even if it has taken place surreptitiously.27
27 In saying that calibration sometimes occurs “surreptitiously,” we do not mean to imply anything pernicious, but rather to suggest that judges often know of the past behavior of the legislatures, courts, and agencies whose judgments they are reviewing, and might well be influenced by that knowledge even as they believe that it is officially the kind of information that they should not take into account. Cf. Andrew J. Wistrich, Chris Guthrie, & Jeffrey J. Rachlinski, Can Judges Ignore Inadmissible Information? The Difficulty of Deliberately Disregarding, 153 U. Pa. L. Rev. 1251 (2005) (reporting a study in which judges were often unable to ignore information they actually had but knew was legally unusable).
To start with a relatively straightforward example, consider a Supreme Court
Justice, especially one with no extreme28 views one way or another about the death
penalty, who is faced with evaluating29 a decision by a state supreme court to
uphold the death penalty as against defense claims of, for example, procedural
defects, ineffective assistance of counsel, or cruel and unusual administration. For
that Justice, we can ask whether it would make a difference that the state supreme
court whose judgment is under review almost always affirms death penalty
sentences, or instead almost always vacates them.30 If the state supreme court had a
long and persistent history of affirming capital sentences against such objections,
then the reviewing Justice might suppose that this case presented a more or less
typical case, and would evaluate the decision below according to the standard that
she generally applied to such matters. But if the court being reviewed was a court
that often or almost always vacated capital sentences, then the reviewing Justice
might calibrate this decision accordingly, concluding that here the causes for
potential reversal might be especially absent. And as a result she might be inclined
to be more deferential in her review. Conversely, if the case arose on a government
appeal from a lower appellate reversal, the reviewing Justice might conclude that a
reversal of a sentence or conviction by a court that routinely upholds convictions or
28 We use this term not as a pejorative, but just as a way of describing both tails of a distribution of views.
29 Possibly in deciding on the merits, possibly in deciding whether to vote to grant certiorari, and possibly in deciding whether when sitting as a single Justice to grant a request for a stay.
30 Cf. Charles Fried, Impudence, 1992 Sup. Ct. Rev. 155 (recounting the persistent anti-death penalty actions of the Ninth Circuit in the early 1990s).
sentences is an action that is very high on the reviewed court’s scale of error.31
Insofar as the reviewing Justice has information of this variety about other decisions
by the court being reviewed, she can be understood to be calibrating the current
decision, or, more precisely, to be calibrating her attitude to the current decision in
light of what she knows from other cases to be the relevant scale.
If we posit that actually knowing about these other results would enable the
reviewing Justice more accurately to calibrate the decision under review, or to
calibrate her degree of deference to the court being reviewed, the question then
arises as to whether she would or should in fact be permitted to examine those
other decisions. If she did calibrate on the basis of other decisions not now before
her, would this be a practice that could be openly acknowledged, for example by
making reference to these other decisions in an opinion? Could a judge actually say
that she is applying especially close scrutiny on review, for example, because of the
31 The phenomenon here is related but not identical to the occasional practice of the Supreme Court in signaling extreme easiness by the assignment of the opinion to the Justice least likely on the basis of past performance to be perceived as sympathetic to the claim now being upheld. Obviously unanimity itself will sometimes send such a signal, and Brown v. Board of Education, 347 U.S. 483 (1954), Cooper v. Aaron, 358 U.S. 1 (1958), and United States v. Nixon, 418 U.S. 683 (1974), are well-known examples. But there is also a signal sent when Justice Rehnquist writes for a unanimous Supreme Court in Jenkins v. Georgia, 418 U.S. 153 (1974), making clear the limits of the “local standards” idea in obscenity law, and so too with Justice White writing for a unanimous Court in Sable Communications v. FCC, 492 U.S. 115 (1989), again dealing with the limits of obscenity and communications indecency law, here in the context of a ban on sexually explicit telephone services. In such cases, the assignment of the opinion to the Justice known to be least receptive to the kinds of claims now being upheld is perhaps a way of telling the audience for the opinion that the case is especially easy, and this form of signal is dependent on an implicit calibration by the audience for the opinion.
previous decisions by the particular court or the particular judge whose judgment
she is now reviewing?
The death penalty example is unrepresentative in some ways, but
representative in others. It is definitional of the appellate process that judges are
reviewing the decisions of other judges,32 and whether it be a death penalty appeal,
a grant or denial of a motion for summary judgment, a decision to support or reject
a constitutional challenge to some state law or practice, or any of a large number of
other contexts in which there is an appellate review of an evaluative decision below,
the basic dynamic remains one of a legal evaluation of an earlier evaluative legal
judgment. And especially when the governing law requires that the evaluation be
something other than de novo,33 the reviewing judges would seem to benefit from
being able to calibrate the judgments they are being asked to review. And those
reviewing judges would also seem to benefit by being able to know as much about
the other judgments of the reviewed judge as a reader of a restaurant review
32 We recognize that appellate courts often review jury verdicts, but even in such cases the appellate court is in the position of reviewing the decision of a trial judge to let the case go to the jury, or to refuse to set aside a jury verdict.
33 When review is genuinely de novo, we might imagine that the basis for the decision being reviewed makes little or no difference. But if the standard of review is anything above de novo, some degree of deference is required, and it is in those situations where the reviewing body would like, we suppose, to have some idea of where on some scale the particular decision lies for the body being reviewed. When the Supreme Court is engaged in genuine rational basis review, for example, as in cases like New Orleans v. Dukes, 427 U.S. 297 (1976), or Williamson v. Lee Optical of Oklahoma, Inc., 348 U.S. 483 (1955), it is engaged in extreme deference to the administrative or legislative judgment below, and in evaluating that judgment it might wish to know just what kinds of decisions that body makes, so as better to be able to calibrate its implicit standard of review to the decision and the decision-maker being reviewed on this occasion.
benefits from knowing about the other judgments of the reviewer. But what
appears to be a norm of non-calibration precludes knowing about such other
judgments. Congress, state legislatures, administrative officials, administrative law
judges, and lower court judges all make decisions that are part of a large collection
of decisions by that institution or judge, and thus any particular decision being
reviewed lies somewhere on a scale for that institution or that judge. But under a
norm of non-calibration there is no way, at least officially and openly, for a
reviewing court or other institution to obtain or overtly use just this sort of
presumably valuable information.
As described briefly above, a similar opportunity for calibration also arises in the
context of jurors (or judges operating as triers of fact) in evaluating the testimony of
witnesses. But again calibration by reference to the analog of a decisional history is
rarely permitted. Sometimes, of course, witnesses will testify as to matters of fact
that have straightforward answers or testify about questions where the answer is a
simple yes or no. But witnesses also testify about speed, height, weight,
temperature, attitude (“he seemed angry”), condition (“he was drunk”), and a vast
number of other matters that are as much evaluative (even if not normatively
evaluative) as they are simply factual, and that in important ways can be
characterized in terms of a scale. Speed is variable, as are height and weight and so
on, but there are also degrees of anger, degrees of drunkenness, and degrees of a
very large number of things about which witnesses routinely offer evidence.
When witnesses testify about such scalar matters, the issues appear to be similar
to those arising when a reviewing court is evaluating an earlier determination by a
lower court, legislature, administrative official, or administrative law judge. As with
these latter examples, the evaluator of a witness’s testimony wants to know where
on the witness’s scale some conclusion lies, so as to be able better to understand and
use it. Indeed, the ideal testimony, in theory, would be testimony in which the
assessor – judge or jury – would know about a previous judgment by the witness on
some question whose answer is already known by the assessor. If a witness testifies
that Susan was angry, it would be ideal, in theory, if the assessor knew how the
witness would characterize the attitude of Harry, known by the assessor, on an
occasion also known by the assessor. This is close to what happens when we hear
or read a restaurant review, for what we would really like to know is how the
reviewer rated another restaurant about which we have already formed an opinion.
With that information in hand, we would be best able to calibrate the review of the
reviewer on this occasion, and thus, with analogous information on hand, the trier of
fact would be best able to calibrate the testimony of the witness on this occasion.
Such ideal information will, of course, rarely be available. Nevertheless, a
second-best solution would be for the trier of fact to be able to have information
about other judgments made by the witness, even if not about things already known
to the trier of fact. If the trier of fact at least knew about the other judgments,
and could form some impression about the alignment of those judgments with some
assumed reference standard, the trier of fact could engage in a better calibration of
the testimony than might be possible under current practice, where even with
respect to witnesses at trial a norm of non-calibration appears to make such matters
largely inaccessible.34
B. A Large Exception and Its Implications
We have suggested that there appears to exist a prevailing rule of
non-calibration. That is, courts and other reviewing bodies will not typically calibrate
their standard of review or degree of deference to what they know or might find out
about the previous decisions of the body being reviewed. Or, if the reviewers do
engage in such calibration, whether consciously or not, they typically will not, again
because of the force of the norm of unavailability, admit to doing it.
A significant exception to the norm of non-calibration appears to exist with
respect to judicial review of administrative agencies. In this context, there is some
indication that the norm of non-calibration is weaker, and that the past decisions
and behavior of an agency being reviewed are thought to influence the attitude of
the reviewing court on a particular occasion.
The practice of “agency-specific”35 standards of review appears to have
started with reactions to what were perceived to be National Labor Relations Board
34 Witnesses can, of course, be cross-examined or impeached on issues going to their credibility. Fed. R. Evid. 607, 611(b). Moreover, cross-examination or impeachment going to witness credibility is typically not limited by the “scope of direct” rule. See, e.g., United States v. Moore, 917 F.2d 215, 222 (6th Cir. 1990); United States v. Sullivan, 803 F.2d 87, 90-91 (3d Cir. 1986). But credibility is rarely understood so broadly as to allow wide-ranging cross-examination or impeachment going to the kind of calibration we are discussing here, especially because credibility is widely understood to be focused, even if not exclusively, on issues of veracity and not on issues of perception or judgment.
35 See Richard E. Levy & Robert L. Glicksman, Agency-Specific Precedents, 89 Texas L. Rev. 499 (2011).
practices that diverged from those of other agencies, in particular the use by the
NLRB of adjudication as a rule-making tool in order to avoid the structures and
strictures imposed by the Administrative Procedure Act on agency rulemaking, and
also a differentially (compared to other agencies) large gap between articulated
standards and adjudicative outcomes.36 As a result, it appeared to some observers
(although not acknowledged by the reviewing courts) that judicial review of NLRB
decisions was different from and more intrusive than judicial review of other
agencies, a consequence of the unadmitted judicial knowledge of exactly this
differential behavior.
More recently, others have noticed the more widespread existence of the
same phenomenon,37 and have gone on to endorse it,38 with Richard Pildes, for
example, arguing that taking differences among agencies and their past behavior
into account in evaluating the decisions of those agencies is a sensible and realistic
reaction to the differences among agencies in their political makeup, their structure,
the method by which their senior members are appointed, the matters they are
called upon to decide, and much else.39 To fail to do so, Pildes argues, is to insist,
formalistically, on treating all agencies the same when in fact they plainly are not.40
36 See Joan Flynn, The Costs and Benefits of “Hiding the Ball”: NLRB Policymaking and the Failure of Judicial Review, 75 B.U. L. Rev. 387 (1995), and earlier, and especially, Ralph Winter, Judicial Review of Agency Decisions: The Labor Board and the Court, 1968 Sup. Ct. Rev. 53.
37 E.g., Jennifer Nou, Sub-Regulating Elections, 2013 Sup. Ct. Rev. 135.
38 See Richard H. Pildes, Institutional Formalism and Realism in Constitutional and Public Law, 2013 Sup. Ct. Rev. 1, 21-30.
39 Id.
We do not in this Article purport to make a contribution to administrative
law. But the fact that the past decisions of a particular agency seem often to be
relevant to courts reviewing agency decisions suggests a possible generalization. If
it is at times useful and appropriate for reviewing courts to take an agency’s past
judgments into account in determining the degree and type of scrutiny to be applied
to that agency’s judgment on a particular occasion, then might it also at times be
useful and appropriate for a reviewing appellate court to do the same with different
trial courts and trial judges? Similarly, might it be useful and appropriate for a
magistrate to do the same with different police officers and different police
departments who are seeking search warrants? And might it be also useful and
appropriate for an administrative law judge to do much the same thing with the
different officials and different parts of the agency whose judgments she is asked to
review?
At times such history-based calibration practices do exist when legal
decision-makers are evaluating the judgments and actions of citizens. The
Securities and Exchange Commission, for example, is frequently required to
evaluate the accuracy or non-misleadingness of representations made in, for
example, registration statements, proxy statements, and periodic reports. But in
some of these areas the Commission has an overt process of allowing some
registrants the benefit of a fast-track or similarly cursory review process, a process
40 Id. Pildes is correct that treating different agencies, or different phenomena in general, in the same way, is formalistic, although for one of us such formalism is not necessarily always to be condemned. See Frederick Schauer, Formalism, 97 Yale L.J. 509 (1988).
that is available only to registrants with proven track records of accuracy, and a
process that is revocable upon evidence of inaccuracy on some occasion.41 In
evaluating the accuracy of such filings, therefore, the Commission is openly
calibrating its degree of scrutiny to what it knows from the past practices of the
individual or entity being reviewed. The audit practices of the Internal Revenue
Service are similar even if less overt and less (publicly) systematized, where again
the degree of scrutiny at the audit and subsequent stages appears to take account of
the particular history of the particular taxpayer.
The practice of agency-specific review suggests that the form of calibration
we describe here is hardly unknown to the law. At times the norm of unavailability
therefore does not prevail, and reviewing bodies calibrate their assessments of the
judgment being reviewed in light of the full array of judgments made over time by
the subject of the review, just as the user of a TripAdvisor review calibrates her
assessment of a review in light of the full array of reviews made by a particular
reviewer. But agency-specific review is still more the exception than the rule, and
the question is then presented about what might explain the rarity in law of a
practice that characterizes not only TripAdvisor, but also the way that most people
make most of their decisions in most aspects of their lives.
III. Barriers to Calibration
41 See, e.g., the “safe harbor” provisions in Section 21E of the Securities Exchange Act, 15 U.S.C. §78u-5 (2014), the “seasoned issuer” provisions under SEC Rule 405, 17 C.F.R. §230.405 (2015), the “bad actor disqualifications” under Regulation D, Rule 506, 17 C.F.R. §230.506(d) (2015), and the streamlined disclosure procedures provided for in Rule 144(c)(1), 17 C.F.R. §230.144(c)(1).
If we are correct in believing, along with much of the non-legal world and some
of the legal world, that calibration, whether of witness testimony or legal judgments
being reviewed by legal reviewers, is at least sometimes potentially valuable, if we
are correct that calibration will often be assisted by knowing about judgments made
by the witness or reviewee body on other occasions (and maybe even the outcome
of those judgments), and if we are correct that such information is routinely
unavailable in the legal system, then we might usefully think about why this is so.
One possibility is that law imagines itself as a pervasively particularistic
institution. Whether it be legal scholars urging that matters be decided “one case at
a time,”42 or the fact-specific and particularistic orientation of the common law, or
the general impermissibility of using evidence of acts or behavior on other
occasions to prove conformity on this occasion,43 there are important ways in which
the legal system, whether for reasons of supposed particularistic justice or for
reasons of efficiency or just because of a pervasive legal ideology of particularism,44
is persistently averse to spending too much time dealing with cases or issues other
42 Cass R. Sunstein, One Case at a Time: Judicial Minimalism on the Supreme Court (2001).
43 This is the so-called propensity or character rule, embodied in, for example, Rule 404 of the Federal Rules of Evidence.
44 Which one of us has already spent far too much time and ink challenging. See Frederick Schauer, Thinking Like a Lawyer: A New Introduction to Legal Reasoning (2009); Frederick Schauer, Profiles, Probabilities, and Stereotypes (2003); Frederick Schauer, Playing By the Rules: A Philosophical Examination of Rule-Based Decision-Making in Law and in Life (1991). The other one of us, however, has spent less time and less ink arguing that people can be generalizers, and is concerned with the frequency with which people are – depending on context – particularists. See Barbara A. Spellman, Individual Reasoning, in Intelligence Analysis: Behavioral and Social Scientific Foundations 117 (Baruch Fischhoff & Cherie Chauvin eds., 2011).
than the one now under consideration. Whether this is actually true is a difficult
question, but it appears as if the law believes that it is true. And whether it is
desirable is also a difficult question, but, again, the law appears to believe that it is
desirable. And thus it may be that the aversion to calibrating by use of decisional
history, although most overtly manifested in the kinds of examples we have been
discussing, is in fact best explained by something far more pervasive in law in
general.
Other obstacles to history-based calibration may be more pragmatic. How will
judges (or jurors) get the kind of information they might need to engage in the
process of calibration? How will they find out about other decisions, and other
cases, and about how those other decisions turned out? There is a serious risk that
opening the door to this kind of information will lead to such an expansion of the
domain of usable legal information that the disadvantages, even if only logistical and
pragmatic, would far outweigh the advantages.
Moreover, the aversion to calibration may also embody non-epistemic goals.
Just as deference in general may often reflect a non-epistemic respect for the
decision-making powers of others,45 the unwillingness of an appellate body to
examine the decisional history of those it is reviewing may manifest a form of
respect, even if not epistemically justified, for those it is reviewing. Relatedly, there
may be non-epistemic institutional equality goals that would militate in favor of
treating all agencies or all lower courts as equivalent even if they are not. An
appellate court that is willing, openly, to treat the decisions of some lower court
45 See Philip Soper, The Ethics of Deference: Learning from Law’s Morals (2002).
judges with less deference than it treats others is displaying a lack of respect which,
even when epistemically justified, may be thought inconsistent with some of the
goals of the legal system.
Yet it is important to recognize that calibration, even under existing procedures,
is not absent. It is just that legal decision makers calibrate against their own views,
or against an assumed average, or an assumption that the witness or lower court
judge is more or less like them. Or it may be that legal decision makers, especially
judges, do exactly what we are suggesting here, but do it in the halls by making use
of gossip in the country’s courthouses, or do it by reading the newspapers, or in
other ways obtain information that they are, in theory, not supposed to have, and
use information that they are not permitted to acknowledge publicly that they have
in the first place. Insofar as this is the case, and thus insofar as there is far more
calibration from other events and other decisions than legal actors are willing or
allowed to admit, then it is possible that bringing the entire practice out in the open,
making it more systematic, and making it more legitimate, may have some salutary
effects.
IV. Conclusion
The law hovers ambivalently between generality and particularity. The law’s
generality is exemplified by its heavy reliance on rules, on precedent, and on various
principles, maxims, canons, and other vehicles of generality. But the law’s
particularity manifests itself in, for example, the very idea of common law method,
in the calls to make decisions one case at a time or on the facts of particular
controversies, and in the reluctance of the law of evidence to allow evidence of past
practices to be used as proof of current behavior, however epistemically and
probabilistically rational such a course of action may be.
Law’s pervasive but not universal reluctance to allow reviewers to take account
of the past decisions of the individuals and institutions it is reviewing reflects this
ambivalence. By looking only at the particular decision under review, and not
calibrating the posture of review on the basis of a history of decisions, reviewing
courts and other reviewing institutions embody the particularism that is one part of
the American and common law legal traditions.46 But generality is also a part of the
legal tradition, both in the United States and elsewhere. In exercising review
without calibrating in light of the reviewee’s history, reviewing institutions choose a
form of particularism not only over generality, but over accuracy as well. In some
review environments that may be the right choice to make. But in others it may not
be, and it is not implausible to suggest that, at times, law may have at least a small
bit to learn from institutions such as TripAdvisor.
46 Of course distinguishing among different administrative agencies, or among different judges whose decisions are being reviewed, is itself a form of particularism, just as treating all agencies the same despite their differences, or treating all reviewee courts and judges the same despite their differences, is a form of generalization. This may suggest that the dimensions of generality and particularity, at least as applied to law, are themselves complex, but exploring this question would take us far beyond the focus of this article.