VARIETIES OF NATURALIZED EPISTEMOLOGY: CRITICISMS AND ALTERNATIVES
1
Professor Deborah G. Mayo
Office Hrs: T405: TBA
PH500 Research Seminar in the Philosophy of Science:
Autumn 2008:
Topics in the Philosophy and History of Inductive/Statistical
Inference and the Foundations of Statistical Science:
10:00-12:00, on 15, 29 Oct; 12, 19 Nov; 10 Dec.
This seminar will focus on central problems in contemporary
philosophy of statistics, and their interrelationships with general
problems of inference and evidence in philosophy of science. We
will study some relevant work by statisticians, e.g., R.A. Fisher, J.
Neyman, E.S. Pearson, L. J. Savage, G. Barnard, D.R. Cox, E.
Lehmann, J. Berger, notable exchanges between them, and related
arguments by philosophers of statistics and confirmation theory.
We will trace out a handful of problems and principles that underlie
contrasting philosophies about evidence and inference, as well as
contemporary statistical foundations (frequentist, Bayesian, other).
Questions we consider include: What are the roles of probability in
uncertain inference in science? Which methods can be shown to
ensure reliability? And how is control of long-run error
probabilities relevant for warranting inductive inference in
science? We will explore how contrasting answers to these
questions directly connect to long-standing problems about
inductive inference and methodological issues about data collection
and hypotheses construction, e.g., data-dependent (versus
predesignated) hypotheses, double-counting, data-mining,
"stopping-rules", and "selection" effects.
2
• No knowledge of statistics is expected*, only an interest in
learning something about its foundations from a philosophical
perspective.
• The full syllabus lists some broad areas of possible discussion
for general (optional) additional meetings, to be held at
agreed-upon times, given interest (among seminar participants
and/or non-participants). Input, as well as presentations by
others, would be welcome.
• I expect to supply (hard copies or on line) all readings for the
seminar, and ongoing notes and slides as the course proceeds,
depending on the interests of participants.
• The material of this seminar will relate to a manuscript I am
writing entitled Learning From Error.
*However, if there is interest, I would expect to hold 1-2 optional
sessions on some of the more formal aspects.
3
Tentative List of Topics
1. October 15: INTRODUCTION TO THE SEMINAR:
Designing our Seminar. Introduction and Overview: (i) The
relevance of philosophy of statistics to philosophy of science;
(ii) 4 Waves of Controversy in the Philosophy of Statistics
2. October 29: Statistical Significance Tests: Family Feuds
and 50+ years of fallout. Some Classical Exchanges between R.
A. Fisher, J. Neyman, and E.S. Pearson
3. November 12: Some Problems of Frequentist Error
Statistics (Behavioristic and Evidential Interpretations) and
Bayesian Statistics (Subjective and Objective interpretations);
Fallacies of Testing (and their avoidance)
4. November 19: Error Statistical and Bayesian Principles
and Their Consequences: The (strong) Likelihood Principle;
Optional Stopping, (and/or Double-Counting, Non-novel data,
and Data-Dredging)
5. December 10: The O-Bayesian Movement: Bayesian-
Frequentist "reconciliations" and methodological
"unifications"; impersonal and reference Bayesians (and
resulting Bayesian family feuds); Highly Probable vs. Highly
Probed hypotheses
4
What is the Philosophy of Statistics?
At one level of analysis at least, statisticians and philosophers of science
ask many of the same questions:
• What should be observed and what may justifiably be inferred from
the resulting data?
• How well do data confirm or fit a model?
• What is a good test?
• Must predictions be “novel” in some sense? (selection effects,
double counting, data mining)
• How can spurious relationships be distinguished from genuine
regularities? from causal regularities?
• How can we infer more accurate and reliable observations from
less accurate ones?
• When does a fitted model account for regularities in the data?
That these very general questions are entwined with long standing
debates in philosophy of science helps to explain why the field of
statistics tends to cross over so often into philosophical territory.
That statistics is a kind of “applied philosophy of science” is not
too far off the mark (Kempthorne, 1976).
5
Before launching into philosophy of statistics proper, let's briefly
consider its relevance for philosophy of science:
Statistics → Philosophy
3 ways statistical accounts are used in philosophy of science
(1) Model Scientific Inference — to capture either the actual or
rational ways to arrive at evidence and inference
(2) Resolve Philosophical Problems about scientific inference,
observation, experiment;
(problem of induction, objectivity of observation, reliable
evidence, Duhem's problem, underdetermination).
(3) Perform a Metamethodological Critique
-scrutinize methodological rules, e.g., accord special weight to
"novel" facts, avoid ad hoc hypotheses, avoid "data mining",
require randomization.
Philosophy → Statistics
Central job to help resolve the conceptual, logical, and
methodological discomforts of scientists as to: how to make
reliable inferences despite uncertainties and errors?
Tackling the problems of philosophy of statistics helps in obtaining
a general account of inductive inference that solves or makes
progress on:
the philosopher's problems of induction, objective evidence,
underdetermination.
6
Philosophy of statistics and the goal of a philosophy of science
relevant for philosophical problems in scientific practice
To many of us who came of age, philosophically, during the
early 80’s, breaking out from the grips of the logical positivist
orthodoxy felt like a new birth of freedom: as we marched with
our placards “no more armchair philosophy of science,” we firmly
resolved to dedicate ourselves to be dynamically engaged with,
and genuinely relevant to scientific practice.
(1) Rather than a “white glove” analysis of the logical relations
between statements of ‘evidence’ and theory, we would get down
to the real nitty-gritty of the procedures for obtaining data in the
first place, and we would confront the complex linkages between
data, experiment, and theory.
(2) Rather than hand down pronouncements on ideally rational
methodology, we would examine methodologies of science
empirically and naturalistically.
Under (1), two broad trends: the “new experimentalism” and the
“new modeling”, respectively.
• Moving away from high–level theory, we should focus on the
manifold local experimental tasks of distinguishing real effects
from artifacts, of checking instruments, and estimating and
subtracting out effects of backgrounds.
• We should accept the disunified and pluralistic strategies by
which models mediate between data, hypotheses, and the world.
7
(3) Alongside, and intermingled with, these trends, one may
identify a broad area of naturalized philosophy of science: look to
psychology, sociology, biology, cognitive science, and various
spin-offs, and/or to the scientific record.
As interesting and invigorating as many of the new moves have
been, if one stops to survey the current landscape, by and large, it
appears that the problems of evidence and inference and method
have not been solved but rather shelved, in one way or another.
The thinking, perhaps, was that we would come back to them after
being steeped for a while in the practices of various sciences,
biology, psychology, economics, and others.
If so, the time seems ripe to return to them!
8
Fresh methodological problems arise in practice surrounding
a panoply of methods and models relied on to learn from
incomplete, and often non-experimental, data.
Examples abound:
Disputes over hypothesis-testing in psychology (e.g., the recently
proposed “significance test ban”);
Disputes over the proper uses of regression in applied statistics;
Disputes over dose-response curves in estimating risks;
Disputes about the use of computer simulations in “observational”
sciences;
Disputes about “external validity” in experimental economics;
and,
Across the huge landscape of fields using the latest, high-
powered, computer methods, there are disputes about data-
mining, algorithmic searches, and model validation.
Equally important are the methodological presuppositions that are
not, but perhaps ought to be, disputed, debated, or at least laid out
in the open — often, ironically, in the very fields in which
philosophers of science immerse themselves.
Scientists who look to contemporary philosophy of science
for insight (e.g., from psychology, epidemiology, ecology,
economics), find philosophical discussions either divorced from
issues of current practice, or in such disarray (regarding general
issues of evidence and inference) that they may be found reverting
back to the logical empiricist tracts (most commonly Popper) that
we were supposed to replace!
9
What happened? (“Was it all a dream?”)
In the face of the shortcomings of philosophical attempts to
arrive at uniform accounts of confirmation or testing, philosophers
of science have tended to turn away from the kind of normative
role that would be involved in scrutinizing the epistemic
credentials of methods and models used in practice.
The very idea of philosophers criticizing, much less
improving upon, methods, models, and strategies for learning
from data, many seem to fear, is too redolent of the now
discredited image of the philosopher operating from a privileged
vantage point.
10
Philosophers can offer little more, Alan Chalmers (1999) suggests,
than trivial platitudes such as “take evidence seriously” (ibid., p. 171).
“Scientists themselves” he concludes, “are the practitioners
best able to conduct science and are not in need of advice from
philosophers” (ibid. p. 252).
Achinstein (2001):
“scientists do not and should not take such philosophical accounts
of evidence seriously” (p. 9).
Philosophical accounts of evidence fail for actual science
evidence:
(i) they make it too easy to have evidence, and
(ii) they are based on a priori computations, whereas scientists
evaluate evidence empirically.
But once it is admitted that the problem of evidence is
empirical “then it is scientists, not philosophers, who are in the
best position to judge whether e, if true, is evidence that h, and
how strong that evidence is”.
The disarray and self-doubt within contemporary philosophy
of science has not gone unnoticed by scientists, especially those in
fields that have traditionally interacted with philosophy in
grappling with methodological debates: e.g., economics,
psychology, and statistics:
11
“If philosophers and others within science theory can’t agree
about the constitution of the scientific method (or even whether
asking about a scientific ‘method’ makes any sense), doesn’t it
seem a little dubious for economists to continue blithely taking
things off the shelf and attempting to apply them to economics?”
(Hands, 2001, p. 6).
Deciding that it is, methodologists of economics, no longer feeling
restricted by principles of scientific legitimacy espoused by
philosophers (Popper, Lakatos, or other), increasingly look to
sociology of science, rhetoric, evolutionary psychology.
The problem is not merely how this cuts philosophers of science
out of being engaged in methodological practice; equally serious
is how it encourages practitioners to assume there are no deep
epistemological problems with the ways they collect and base
inferences on data.
Especially surprising is the dwindling of genuine
interdisciplinary work (between philosophers and scientists) in
foundations of statistics — one of the few cases where a
“philosophy of X” has historically involved statistical
practitioners as much as philosophers.
Here too, practitioners are not waiting for philosophers to
sort things out.
12
In a recent lead article in Statistical Science, we hear:
“Professional agreement on statistical philosophy is not on the
immediate horizon, but this should not stop us from agreeing on
methodology”, as if “what is correct methodologically” does not
depend on “what is correct philosophically” (Berger, 2003, p. 2).
The controversies among non-statistician practitioners are
even more charged, and how they are settled directly affects
science-based policy (in medicine, ecology, risk assessment).
In addition to the resurgence of the age-old controversies
(significance tests vs. confidence intervals, frequentist vs. Bayesian
measures), the latest statistical modeling techniques have
introduced brand new methodological issues.
High-powered computer science packages offer a welter of
algorithms for “automatically” selecting among this explosion of
models, but as each boasts different, and incompatible, selection
criteria, we are thrown back to the basic question of inductive
inference: what is required to severely discriminate among well-fitting
models such that, when a claim (or hypothesis or model)
survives a test, the resulting data count as good evidence for the
claim’s correctness or dependability or adequacy?
13
Granted, analytic epistemology often appeals to probability to carry
out its analytic work (e.g., Bayesian epistemology), but this is not
to connect up with philosophical problems of statistics — nor is it
seen to be.
Still, the upshot of developing an adequate philosophy of statistics
would be relevant to this reconstruction effort (we may need
different "logics" to adequately capture evidence and inference.)
The philosophy and history of statistics is also of great interest in
its own right…..so what are some of the issues we'll want to
consider?
A romp through 3-4 "waves in philosophy of statistics"
14
History and philosophy of statistics is a huge territory marked by
70 years of debates widely known for reaching unusual heights
both of passion and of technical complexity.
To get a handle on the movements and cycles without too much
distortion, I propose to identify three main “battle waves”—
Wave I ~ 1930 –1955/60
Wave II~ 1955/60-1980
Wave III~1980-2005 & beyond
15
A core question: What is the nature and role of probabilistic
concepts, methods, and models in making inferences in the face of
limited data, uncertainty and error?
1. Two Roles For Probability:
Degrees of Confirmation and Degrees of Well-Testedness
a. To provide a post-data assignment of degree of probability,
confirmation, support or belief in a hypothesis;
b. To assess the probativeness, reliability, trustworthiness, or
severity of a test or inference procedure.
These two contrasting philosophies of the role of probability in
statistical inference are very much at the heart of the central points
of controversy in the “three waves” of philosophy of statistics…
16
Having conceded loss in the battle for justifying induction, philosophers
appeal to logic to capture scientific method
Inductive Logics (“Confirmation Theory”)
  Rules to assign degrees of probability or confirmation to
  hypotheses given evidence e
  Carnap: C(H,e)
  Inductive Logicians: we can build and try to justify
  “inductive logics”
  straight rule: Assign degrees of confirmation/credibility
  Statistical affinity: Bayesian (and likelihoodist) accounts

Logic of Falsification (Methodological falsification)
  Rules to decide when to “prefer” or accept hypotheses
  Popper
  Deductive Testers: we can reject induction and uphold the
  “rationality” of preferring or accepting H if it is “well tested”
  Statistical affinity: Fisherian, Neyman-Pearson methods:
  probability enters to ensure reliability and severity of
  tests with these methods.
17
I. Philosophy of Statistics: “The First Wave”
WAVE I: circa 1930-1955:
Fisher, Neyman, Pearson, Savage, and Jeffreys.
Statistical inference tools use data x0 to probe aspects of the data
generating source:
In statistical testing, these aspects are in terms of statistical hypotheses
about parameters governing a statistical distribution
H tells us the “probability of x under H”, written P(x;H)
(probabilistic assignments under a model)
Important to avoid confusion with conditional probabilities in Bayes’s
theorem, P(x|H).
Testing model assumptions extremely important, though will not
discuss.
18
Modern Statistics Begins with Fisher:
“Simple” Significance Tests
Fisher strongly objected to Bayesian inference, in particular to the use of
prior distributions (relevant for psychology not science).
Looks to develop ways to express the uncertainty of inferences without
deviating from frequentist probabilities.
Example. Let the sample X = (X1, …, Xn) be IID from a Normal
distribution (NIID) with σ = 1.
1. A null hypothesis H0: µ = 0
e.g., 0 mean concentration of lead, no difference in mean survival
in a given group, in mean risk, mean deflection of light.
2. A function of the sample, d(X), the test statistic: which reflects the
difference between the data x0 = (x1, …,xn), and H0;
The larger d(x0) the further the outcome is from what is expected under
H0, with respect to the particular question being asked.
3. The p-value is the probability of a difference larger than d(x0), under
the assumption that H0 is true:
p(x0)=P(d(X) > d(x0); H0)
19
Mini-recipe for p-value calculation:
The observed significance level (p-value) with observed sample mean x̄ = .1:
p(x0) = P(d(X) > d(x0); H0).
The relevant test statistic d(X) is:
d(X) = (X̄ - µ0)/σx = (Observed - Expected (under H0))/σx,
where X̄ is the sample mean with standard deviation σx = σ/√n.
Since σx = σ/√n = 1/5 = .2, the observed difference x̄ - µ0 = .1 - 0
in units of σx yields
d(x0) = .1/.2 = .5.
Under the null, d(X) is distributed as standard Normal, denoted by
d(X) ~ N(0,1).
(Area to the right of .5) ~ .3, i.e., not very significant.
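The arithmetic of the mini-recipe can be checked with a short sketch. It assumes the slide's numbers (x̄ = .1, µ0 = 0, σ = 1) and n = 25, which is what σx = 1/5 implies; the function name is hypothetical and only Python's standard library is used.

```python
from math import sqrt
from statistics import NormalDist

def p_value_one_sided(xbar, mu0, sigma, n):
    """Return (d, p): d = (xbar - mu0) / (sigma / sqrt(n)) and p = P(d(X) > d; H0)."""
    sigma_x = sigma / sqrt(n)          # standard deviation of the sample mean
    d = (xbar - mu0) / sigma_x         # observed difference in units of sigma_x
    p = 1.0 - NormalDist().cdf(d)      # area to the right of d under N(0, 1)
    return d, p

# Slide values; n = 25 is inferred from sigma_x = 1/5, not stated on the slide.
d_obs, p_obs = p_value_one_sided(xbar=0.1, mu0=0.0, sigma=1.0, n=25)
print(f"d(x0) = {d_obs:.2f}, p-value = {p_obs:.2f}")
```

The computed area to the right of .5 is roughly .31, in line with the "~.3" on the slide.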
20
Logic of Simple Significance Tests: Statistical Modus Tollens
“Every experiment may be said to exist only in order to give the
facts a chance of disproving the null hypothesis” (Fisher, 1956,
p.160).
Statistical analogy to the deductively valid pattern modus tollens:
If the hypothesis H0 is correct then, with high probability, 1-p,
the data would not be statistically significant at level p.
x0 is statistically significant at level p.
____________________________
Thus, x0 is evidence against H0, or x0 indicates the falsity of H0.
Fisher described the significance test as a procedure for
rejecting the null hypothesis and inferring that the phenomenon
has been “experimentally demonstrated” once one is able to
generate “at will” a statistically significant effect (Fisher,
1935a, p. 14).
21
The Alternative or “Non-Null” Hypothesis
Evidence against H0 seems to indicate evidence for some
alternative.
Fisherian significance tests strictly consider only the H0
Neyman and Pearson (N-P) tests introduce an alternative H1
(even if only to serve as a direction of departure).
Example. X = (X1, …,Xn), NIID with σ =1:
H0: µ = 0 vs. H1: µ > 0
Despite the bitter disputes with Fisher that were to erupt soon
after ~1935, Neyman and Pearson at first saw their work as merely
placing Fisherian tests on firmer logical footing.
Much of Fisher’s hostility toward N-P methods reflects
professional and personality conflicts more than philosophical
differences.
22
Neyman-Pearson (N-P) Tests
N-P hypothesis test: maps each outcome x = (x1, …,xn) into either
the null hypothesis H0, or an alternative hypothesis H1 (where the
two exhaust the parameter space) to ensure the probabilities of
erroneous rejections (type I errors) and erroneous acceptances (type
II errors) are controlled at prespecified values, e.g., 0.05 or 0.01,
the significance level of the test.
Test T(α): X = (X1, …, Xn), NIID with σ = 1,
H0: µ = µ0 vs. H1: µ > µ0
■ if d(x0) > cα, "reject" H0, (or declare the result statistically
significant at the α level);
■ if d(x0) < cα, "accept" H0,
e.g. cα=1.96 for α=.025, i.e.
“Accept/Reject” uninterpreted parts of the mathematical apparatus.
Type I error probability = P(d(X) > cα; H0) ≤ α. The Type II error probability:
P(Test T(α) does not reject H0; µ = µ1)
= P(d(X) ≤ cα; µ = µ1) = ß(µ1), for any µ1 > µ0.
The "best" test at level α at the same time minimizes the value of ß
for all µ1 > µ0, or equivalently, maximizes the power:
POW(T(α); µ1) = P(d(X) > cα; µ1).
T(α) is a Uniformly Most Powerful (UMP) level α test
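As a rough illustration of how the cut-off, the Type II error probability and the power of T(α) fit together, here is a minimal sketch for the one-sided Normal test with known σ. The function names are hypothetical, and the values n = 25 and µ1 = .2 are assumptions chosen for illustration rather than part of this slide.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal, the distribution of d(X) under H0

def cutoff(alpha):
    """c_alpha such that P(d(X) > c_alpha; H0) = alpha."""
    return Z.inv_cdf(1.0 - alpha)

def power(mu1, mu0, sigma, n, alpha):
    """POW(T(alpha); mu1) = P(d(X) > c_alpha; mu = mu1)."""
    shift = (mu1 - mu0) / (sigma / sqrt(n))   # mean of d(X) when mu = mu1
    return 1.0 - Z.cdf(cutoff(alpha) - shift)

def type_ii_error(mu1, mu0, sigma, n, alpha):
    """beta(mu1) = P(d(X) <= c_alpha; mu = mu1) = 1 - power."""
    return 1.0 - power(mu1, mu0, sigma, n, alpha)

# Illustrative values (assumed): alpha = .025, sigma = 1, n = 25, mu1 = .2
c = cutoff(0.025)
pow_02 = power(0.2, 0.0, 1.0, 25, 0.025)
beta_02 = type_ii_error(0.2, 0.0, 1.0, 25, 0.025)
print(f"c_alpha = {c:.2f}, power = {pow_02:.3f}, beta = {beta_02:.3f}")
```

Setting µ1 = µ0 in `power` returns α itself, which is a quick sanity check that the Type I error probability is held at the chosen level.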
23
Inductive Behavior Philosophy
Philosophical issues and debates arise once one begins to consider the
interpretations of the formal apparatus
‘Accept/Reject’ are identified with deciding to take specific
actions, e.g., publishing a result, announcing a new effect.
The justification for optimal tests is that
“it may often be proved that if we behave according to such a rule
... we shall reject H when it is true not more, say, than once in a
hundred times, and in addition we may have evidence that we shall
reject H sufficiently often when it is false.”
Neyman: Tests are not rules of inductive inference but rules of
behavior:
The goal is not to adjust our beliefs but rather to “adjust our behavior” to
limited amounts of data
Is he just drawing a stark contrast between N-P tests and Fisherian as
well as Bayesian methods? Or is the behavioral interpretation essential
to the tests?
24
“Inductive behavior” vs. “Inductive inference” battle
commingles philosophical, statistical and personality clashes.
Fisher (1955) denounced the way that Neyman and Pearson
transformed ‘his’ significance tests into ‘acceptance procedures’.
• They’ve turned my tests into mechanical rules or ‘recipes’ for
‘deciding’ to accept or reject statistical hypothesis H0,
• The concern has more to do with speeding up production or
making money than in learning about phenomena
N-P followers are like:
“Russians (who) are made familiar with the ideal that
research in pure science can and should be geared to
technological performance, in the comprehensive organized
effort of a five-year plan for the nation.” (1955, 70)
“In the U.S. also the great importance of organized
technology has I think made it easy to confuse the process
appropriate for drawing correct conclusions, with those aimed
rather at…speeding production, or saving money”.
25
Pearson distanced himself from Neyman’s “inductive
behavior” jargon, calling it “Professor Neyman’s field rather than
mine”.
But the most impressive mathematical results were in the
decision-theoretic framework of Neyman-Pearson-Wald.
Many of the qualifications by Neyman and Pearson in the first
wave are overlooked in the philosophy of statistics literature.
Admittedly, these “evidential” practices were not made
explicit *. (Had they been, the subsequent waves of philosophy of
statistics might have looked very different).
*Mayo’s goal in ~ 1978
26
The Second Wave: ~1955/60 -1980
“Post-data criticisms of N-P methods”:
Ian Hacking (1965) framed the main lines of criticism by philosophers:
“Neyman-Pearson tests as suitable for before-trial betting, but not for
after-trial evaluation” (p. 99).
Battles: “initial precision vs. final precision”,
“before-data vs. after data”
After the data, he claimed, the relevant measure of support is the
(relative) likelihood
Two data sets x and y may afford the same "support" to H, yet
warrant different inferences [on significance test reasoning]
because x and y arose from tests with different error
probabilities.
o This is just what error statisticians want!
o But (at least early on) Hacking (1965) held to the
“Law of Likelihood”: x supports hypothesis H1 more than H2 if,
P(x;H1) > P(x;H2).
Yet, as Barnard notes, “there always is such a rival hypothesis:
That things just had to turn out the way they actually did” .
Since such a maximally likely alternative H2 can always be
constructed, H1 may always be found less well supported, even if
H1 is true—no error control
Hacking soon rejected the likelihood approach on such grounds, though
likelihoodist accounts are advocated by others.
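Barnard's point can be made concrete with a toy sketch. The coin-toss data and both function names are hypothetical; H1 is taken to be "fair coin" and H2 the rival constructed after the fact to say that exactly the observed sequence had to occur.

```python
def likelihood_fair_coin(outcomes):
    """P(x; H1): each toss is independently heads or tails with probability 1/2."""
    return 0.5 ** len(outcomes)

def likelihood_rigged(outcomes, predicted):
    """P(x; H2): H2 asserts that exactly the sequence `predicted` had to occur."""
    return 1.0 if list(outcomes) == list(predicted) else 0.0

observed = "HTHHTHTTHH"                        # hypothetical data x
lik_h1 = likelihood_fair_coin(observed)        # 0.5 ** 10
lik_h2 = likelihood_rigged(observed, observed) # H2 built from x itself
print(lik_h1, lik_h2)
```

Since H2 is constructed from the data, P(x;H2) = 1 exceeds P(x;H1) whatever the data are, so on the Law of Likelihood alone H1 comes out less well supported even if it is true.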
27
Perhaps THE key issue of controversy in the philosophy of
statistics battles
The (strong) likelihood principle, likelihoods suffice to convey “all
that the data have to say” —
According to Bayes’s theorem, P(x|µ) ... constitutes the entire
evidence of the experiment, that is, it tells all that the experiment
has to tell. More fully and more precisely, if y is the datum of some
other experiment, and if it happens that P(x|µ) and P(y|µ) are
proportional functions of µ (that is, constant multiples of each
other), then each of the two data x and y have exactly the same
thing to say about the values of µ… (Savage 1962, p. 17.)
—the error probabilist needs to consider, in addition, the sampling
distribution of the likelihoods.
—significance levels and other error probabilities all violate the
likelihood principle (Savage 1962).
28
Paradox of Optional Stopping
Instead of fixing the sample size n in advance, in some tests, n is
determined by a stopping rule:
In Normal testing, 2-sided H0: µ = 0 vs. H1: µ ≠ 0
Keep sampling until H0 is rejected at the .05 level
(i.e., keep sampling until |X̄| ≥ 1.96 σ/√n).
Nominal vs. Actual significance levels: with n fixed the Type I error
probability is .05.
With this stopping rule the actual significance level differs from, and
will be greater than .05.
By contrast, since likelihoods are unaffected by the stopping rule, the LP
follower denies there really is an evidential difference between the two
cases (i.e., n fixed and n determined by the stopping rule).
Should it matter if I decided to toss the coin 100 times and happened to
get 60% heads, or if I decided to keep tossing until I could reject at the
.05 level (2-sided) and this happened to occur on trial 100?
Should it matter if I kept going until I found statistical significance?
Error statistical principles: Yes! — penalty for perseverance!
The LP says NO!
Savage Forum 1959: Savage audaciously declares that the lesson
to draw from the optional stopping effect is that “optional stopping
is no sin” so the problem must lie with the use of significance
levels. But why accept the likelihood principle (LP)? (simplicity
and freedom?)
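The gap between nominal and actual significance levels under this stopping rule can be seen in a small simulation. This is only a sketch: H0 (µ = 0, σ = 1) is true throughout, sampling is capped at a hypothetical maximum of 100 observations so that every trial ends, and the trial count, seed and function names are arbitrary choices.

```python
import random
from math import sqrt

def rejects_with_optional_stopping(rng, max_n, sigma=1.0, z=1.96):
    """Sample under H0 (mu = 0), testing after every observation up to max_n.

    Returns True if |sample mean| >= z * sigma / sqrt(n) at any look.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sigma)
        if abs(total / n) >= z * sigma / sqrt(n):
            return True
    return False

def actual_type_i_error(max_n, trials=2000, seed=0):
    """Relative frequency of rejecting a true H0 under the stopping rule."""
    rng = random.Random(seed)
    hits = sum(rejects_with_optional_stopping(rng, max_n) for _ in range(trials))
    return hits / trials

single_look = actual_type_i_error(max_n=1)    # one look only: close to the nominal .05
try_and_try = actual_type_i_error(max_n=100)  # up to 100 looks: well above .05
print(single_look, try_and_try)
```

With a single look the rejection rate stays near the nominal .05, while allowing up to 100 looks pushes it far higher; without a cap, the rule is guaranteed to reject eventually.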
29
The likelihood principle emphasized in Bayesian statistics implies,
… that the rules governing when data collection stops are irrelevant to
data interpretation. It is entirely appropriate to collect data until a point
has been proved or disproved (p. 193)…This irrelevance of stopping
rules to statistical inference restores a simplicity and freedom to
experimental design that had been lost by classical emphasis on
significance levels (in the sense of Neyman and Pearson) (Edwards,
Lindman, Savage 1963, p. 239).
For frequentists this only underscores the point raised years before by
Pearson and Neyman:
A likelihood ratio (LR) may be a criterion of relative fit but it “is
still necessary to determine its sampling distribution in order to control
the error involved in rejecting a true hypothesis, because a knowledge of
[LR] alone is not adequate to insure control of this error” (Pearson and
Neyman, 1930, p. 106).
The key difference: likelihood fixes the actual outcome, i.e., just
d(x), while error statistics considers outcomes other than the one
observed in order to assess the error properties
LP → irrelevance of, and no control over, error probabilities.
("why you cannot be just a little bit Bayesian" EGEK 1996)
Update: A famous argument (1962, Birnbaum) purports to
show that plausible error statistical principles entail the LP!
"Radical!" "Breakthrough!" (since the LP entails the
irrelevance of error probabilities!)
But the "proof" is flawed! (Mayo 2007-8 and forthcoming).
30
The Statistical Significance Test Controversy
(Morrison and Henkel, 1970) – contributors chastise social scientists for
slavish use of significance tests
o Focus on simple Fisherian significance tests
o Philosophers direct criticisms mostly to N-P tests.
Fallacies of Rejection: Statistical vs. Substantive Significance
(i) take statistical significance as evidence of substantive theory
that explains the effect
(ii) Infer a discrepancy from the null beyond what the test
warrants
(i) Paul Meehl: It is fallacious to go from a statistically significant
result, e.g., at the .001 level, to infer that “one’s substantive theory T,
which entails the [statistical] alternative H1, has received .. quantitative
support of magnitude around .999”
A statistically significant difference (e.g., in child rearing) is not
automatically evidence for a Freudian theory.
T is subjected to only “a feeble risk”, violating Popper.
“After reading Meehl (1967) one wonders whether the function of
statistical techniques in the social sciences is not primarily to provide
a machinery for producing phoney corroborations and thereby a
semblance of ‘scientific progress’ where, in fact, there is nothing but
an increase in pseudo-intellectual garbage.” (Lakatos 1978, Note 4:
88-9)
31
Fallacies of rejection:
(i) Take statistical significance as evidence of substantive theory
that explains the effect
(ii) Infer a discrepancy from the null beyond what the test
warrants.
Finding a statistically significant effect, d(x0) > cα (cut-off for
rejection), need not be indicative of large or meaningful effect sizes
(the test may be too sensitive).
Large N Problem: an α-significant rejection of H0 can be very
probable, even with a substantively trivial discrepancy from H0.
This is often taken as a criticism because it is assumed that
statistical significance at a given level is more evidence against the
null the larger the sample size (n) — fallacy!
"The thesis implicit in the [NP] approach [is] that a hypothesis may be
rejected with increasing confidence or reasonableness as the power of
the test increases" (Howson and Urbach 1989 and later editions).
In fact, it is indicative of less of a discrepancy from the null than if it
resulted from a smaller sample size.
(analogy with smoke detector: an alarm from one that often goes off
from merely burnt toast (overly powerful or sensitive), vs. alarm from
one that rarely goes off unless the house is ablaze)
Comes also in the form of the “Jeffreys-Good-Lindley” paradox:
Even a highly statistically significant result can, with n sufficiently
large, correspond to a high posterior probability for the null hypothesis.
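The large n problem can be put in numbers with a brief sketch for the one-sided Normal test with σ = 1 and cα = 1.96. The sample sizes and the "trivial" discrepancy of .01 are assumptions chosen for illustration, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
C_ALPHA = 1.96  # cut-off for alpha = .025, one-sided

def smallest_significant_mean(n, sigma=1.0):
    """Observed sample mean that just reaches the cut-off when mu0 = 0."""
    return C_ALPHA * sigma / sqrt(n)

def power_against(delta, n, sigma=1.0):
    """P(d(X) > c_alpha; mu = delta): chance of rejecting H0 when mu = delta."""
    return 1.0 - Z.cdf(C_ALPHA - delta / (sigma / sqrt(n)))

for n in (25, 10_000, 1_000_000):
    print(n, round(smallest_significant_mean(n), 4), round(power_against(0.01, n), 3))
```

As n grows, a just-significant result corresponds to an ever smaller observed difference, and rejection becomes nearly certain even when µ exceeds 0 by only .01. This is why the same significance level indicates less of a discrepancy with a larger sample.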
32
Fallacy of Non-Statistically Significant Results
Test T(α) fails to reject the null, when the test statistic fails to reach
the cut-off point for rejection, i.e., d(x0) ≤ cα .
A classic fallacy is to construe such a “negative” result as evidence FOR
the correctness of the null hypothesis (common in risk assessment
contexts).
“No evidence against” is not “evidence for”
Merely surviving the statistical test is too easy, occurs too frequently,
even when the null is false.
— results from tests lacking sufficient sensitivity or power.
The Power Analytic Movement of the 60’s in psychology
Jacob Cohen: By considering ahead of time the Power of the test,
select a test capable of detecting discrepancies of interest.
– pre-data use of power (for planning).
(Power is a feature of N-P tests, but apparently the prevalence of
Fisherian tests in the social sciences, coupled, perhaps, with the
difficulty in calculating power, resulted in ignoring power)
A multitude of tables were supplied (Cohen, 1988), but until his
death he bemoaned their all-too-rare use.
33
Post-data use of power to avoid fallacies of insensitive tests
If there's a low probability of a statistically significant result, even
if a non-trivial discrepancy δnon-trivial is present (low power against
δnon-trivial), then a non-significant difference is not good evidence
that a non-trivial discrepancy is absent.
This still retains an unacceptable coarseness: power is always
calculated relative to the cut-off point cα for rejecting H0.
Consider test T(α= .025) , σ = 1, n = 25, and suppose
δnon-trivial = .2
No matter what the non-significant outcome,
Power to detect δnon-trivial is only .16!
So we’d have to deny the data were good evidence that µ < .2
This suggested to me (in writing my dissertation around 1978) that
rather than calculating
(1) P(d(X) > cα; µ =.2) Power
one should calculate
(2) P(d(X) > d(x0); µ=.2). observed power (severity)
Even if (1) is low, (2) may be high. We return to this in the
developments of Wave III.
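As a rough check on the power quoted above, calculation (1) can be carried out directly. This is a minimal sketch, assuming the one-sided Normal test with known σ, a null value µ0 = 0, and test statistic d(X) = √n(X̄ − µ0)/σ. The function name `power` and its parameters are illustrative, not part of the seminar materials.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def power(mu1, mu0=0.0, sigma=1.0, n=25, alpha=0.025):
    """P(d(X) > c_alpha; mu = mu1) for the one-sided test (assumed setup)."""
    c_alpha = Z.inv_cdf(1 - alpha)          # cut-off for rejection, about 1.96
    shift = sqrt(n) * (mu1 - mu0) / sigma   # mean of d(X) when mu = mu1
    return 1 - Z.cdf(c_alpha - shift)

# Power against the discrepancy .2 with sigma = 1, n = 25, alpha = .025
print(round(power(0.2), 3))
```

Note that the observed outcome d(x0) appears nowhere in this calculation, which is the coarseness complained of: the same number results whatever the non-significant result happens to be.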
34
III. The Third Wave: Relativism, Reformulations, Reconciliations
~1980-2005+
Rational Reconstruction and Relativism in Philosophy of Science
Fighting Kuhnian battles over the very idea of a unified method of
scientific inference, statistical inference became less prominent in philosophy
— largely used rational reconstructions of scientific episodes,
— in appraising methodological rules,
— in classic philosophical problems e.g., Duhem’s problem—
reconstruct a given assignment of blame so as to be “warranted” by
Bayesian probability assignments.
no normative force.
With the recognition that science involves subjective judgments and values,
reconstructions often appeal to a subjective Bayesian account (Salmon’s
“Tom Kuhn Meets Tom Bayes”).
(Kuhn thought this was confused: no reason to suppose an algorithm
remains through theory change)
Naturalisms, HPS —immersed in biology, psychology, etc.,
philosophers of science recoil from unified inferential accounts, content
with rich details of historical and current practice.
Achinstein (2001): “scientists do not and should not take
such philosophical accounts of evidence seriously” (p. 9).
35
Wave III in Scientific Practice
— Statisticians turn to eclecticism.
— Non-statistician practitioners (e.g., in psychology, ecology,
medicine), bemoan “unholy hybrids”
a mixture of ideas from N-P methods, Fisherian tests, and Bayesian
accounts that is “inconsistent from both perspectives and burdened with
conceptual confusion”. (Gigerenzer, 1993, p. 323).
• Faced with foundational questions, non-statistician practitioners
raise anew the questions from the first and second waves.
• Finding the automaticity and fallacies still rampant, most, if they
are not calling for an outright “ban” on significance tests in
research, insist on reforms and reformulations of statistical tests.
Task Force to consider Test Ban in Psychology: 1990s
Reforms and Reinterpretations Within Error Probability Statistics
Any adequate reformulation must:
(i) Show how to avoid classic fallacies (of acceptance and of
rejection) —on principled grounds,
(ii) Show that it provides an account of inductive inference
36
Avoiding Fallacies
I have to skip discussion of various attempts in Wave III to avoid
fallacies of acceptance and rejection (e.g., using confidence interval
estimates—please see paper)
To quickly note my own recommendation (for test T(α)):
Move away from coarse accept/reject rule; use specific result
(significant or insignificant) to infer those discrepancies from the null
that are well ruled-out, and those which are not.
e.g., Interpretation of Non-Significant results:
If d(x) is not statistically significant, and the test
had a very high probability of a more statistically
significant difference if µ > µ0 + γ, then d(x) is good
grounds for inferring µ ≤ µ0 + γ.
Use specific outcome to infer an upper bound
µ ≤ µ* (values beyond are ruled out by given severity.)
If d(x) is not statistically significant, but the test had a very
low probability of a more statistically significant
difference if µ > µ0 + γ, then d(x) is poor evidence for
inferring µ ≤ µ0 + γ.
The test had too little probative power to have detected
such discrepancies even if they existed!
Alternatively, you give me the inference of interest and I
tell you how severely or inseverely it is warranted.
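The upper bound µ* described above can be sketched numerically. This is only an illustration, assuming the one-sided Normal test with known σ, where the severity for µ ≤ µ* given observed mean x̄ is P(X̄ > x̄; µ = µ*). Setting this equal to a chosen severity level and solving gives µ* = x̄ + z·σ/√n. The function names and the example values (x̄ = −.2, σ = 1, n = 25, severity .95) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def severity_leq(mu_star, xbar, sigma, n):
    """SEV(mu <= mu_star) = P(Xbar > xbar; mu = mu_star), assumed Normal model."""
    return 1 - Z.cdf(sqrt(n) * (xbar - mu_star) / sigma)

def severity_upper_bound(xbar, sigma, n, sev=0.95):
    """Smallest mu_star such that mu <= mu_star passes with severity sev."""
    return xbar + Z.inv_cdf(sev) * sigma / sqrt(n)

# Illustrative non-significant outcome
mu_star = severity_upper_bound(xbar=-0.2, sigma=1.0, n=25, sev=0.95)
print(round(mu_star, 3), round(severity_leq(mu_star, -0.2, 1.0, 25), 3))
```

Read the other way round, `severity_leq` answers the "you give me the inference" question: supply µ* and it reports how severely µ ≤ µ* is warranted by the outcome.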
37
Takes us back to the post-data version of power:
Rather than construe “a miss as good as a mile”, parity of logic
suggests that the post-data power assessment should replace the usual
calculation of power against µ1:
POW(T(α), µ1) = P(d(X) > cα; µ=µ1),
with what might be called the power actually attained or, to have a
distinct term, the severity (SEV):
SEV(T(α), µ1) = P(d(X) > d(x0); µ=µ1),
where d(x0) is the observed (non-statistically significant) result.
38
Figure 1 compares power and severity for different outcomes
Figure 1. The graph shows that whereas POW(T(.025), µ1=.2) =.168,
irrespective of the value of d(x0) ; see solid curve, the severity
evaluations are data-specific:
The severity for the inference µ < .2:
Both X̄ = .39 and X̄ = -.2 fail to reject H0, but
with X̄ = .39, SEV(µ < .2) is low (.17);
with X̄ = -.2, SEV(µ < .2) is high (.97).
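The two severity values just quoted can be checked with a few lines. This is a sketch under stated assumptions: the observed values .39 and −.2 are read as sample means, σ = 1, n = 25, and SEV(µ < µ1) is computed as P(d(X) > d(x0); µ = µ1) under the Normal model. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def severity_less(mu1, xbar, sigma=1.0, n=25):
    """SEV(mu < mu1) = P(Xbar > xbar; mu = mu1) for a non-significant result."""
    z = sqrt(n) * (xbar - mu1) / sigma   # standardized distance of xbar from mu1
    return 1 - NormalDist().cdf(z)

for xbar in (0.39, -0.2):
    print(xbar, round(severity_less(0.2, xbar), 3))
```

Under these assumptions the results come out near .171 and .977, in line with the low and high severities reported above.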
39
Fallacies of Rejection: The Large n-Problem
While with a nonsignificant result, the concern is erroneously inferring
that a discrepancy from µ0 is absent;
With a significant result x0, the concern is erroneously inferring that it is
present.
Utilizing the severity assessment, an α-significant difference with n1
passes µ > µ1 less severely than with n2, where n1 > n2.
Figure 2 compares test T(α), with three different sample sizes:
n = 25, n = 100, n = 400, denoted by T(α,n);
where in each case d(x0) = 1.96 – reject at the cut-off point.
In this way we solve the problems of tests too sensitive or not sensitive
enough, but there’s one more thing ... showing how it supplies an
account of inductive inference
Many argue in wave III that error statistical methods cannot supply an
account of inductive inference because error probabilities conflict with
posterior probabilities.
40
Figure 2 compares test T(α) with three different sample sizes:
n = 25, n = 100, n = 400, denoted by T(α,n);
in each case d(x0) = 1.96 – reject at the cut-off point.
Figure 2. In test T(α) (H0: µ ≤ 0 against H1: µ > 0, and σ = 1),
α=.025, cα = 1.96 and d(x0) = 1.96.
The severity for the inference µ > .1:
n = 25, SEV(µ >.1) is .93
n = 100, SEV(µ >.1) is .83
n = 400, SEV(µ >.1) is .5
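The pattern in these three values can be reproduced with a short calculation. This is a sketch, assuming the Normal model with σ = 1, µ0 = 0, and SEV(µ > µ1) taken as P(d(X) ≤ d(x0); µ = µ1) for a result that just reaches the cut-off, d(x0) = 1.96. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def severity_greater(mu1, d_obs, n, mu0=0.0, sigma=1.0):
    """SEV(mu > mu1) = P(d(X) <= d(x0); mu = mu1) after a significant result."""
    shift = sqrt(n) * (mu1 - mu0) / sigma   # mean of d(X) when mu = mu1
    return NormalDist().cdf(d_obs - shift)

# Same observed d(x0) = 1.96, three sample sizes
for n in (25, 100, 400):
    print(n, round(severity_greater(0.1, 1.96, n), 2))
```

With d(x0) held fixed at the cut-off, the computed severity for µ > .1 falls from about .93 to about .83 to roughly one half as n grows, which is the sense in which the same level of significance indicates less of a discrepancy with a larger sample.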
41
P-values vs. Bayesian Posteriors
A statistically significant difference from H0 can correspond to a large
posterior probability for H0
From the Bayesian perspective, it follows that p-values come up
short as a measure of inductive evidence,
• the significance testers balk at the fact that the recommended
priors result in highly significant results being construed as no
evidence against the null — or even evidence for it!
The conflict often considers the two sided T(2α) test
H0: µ = 0 vs. H1: µ ≠ 0.
(The difference between p-values and posteriors is far less marked
with one-sided tests).
“Assuming a prior of .5 to H0, with n = 50 one can classically ‘reject H0
at significance level p = .05,’ although P(H0|x) = .52 (which would
actually indicate that the evidence favors H0).”
This is taken as a criticism of p-values only because it is assumed that
the .52 posterior is the appropriate measure of beliefworthiness.
As the sample size increases, the conflict becomes more
noteworthy.
42
If n = 1000, a result statistically significant at the .05 level
leads to a posterior to the null of .82!
SEV (H1) = .95 while the corresponding posterior has gone
from .5 to .82. What warrants such a prior?
Posterior probability of H0, P(H0|x), by sample size n (prior of .5 to H0)
_____________________________________________________________
p       t        n=10    n=20    n=50    n=100   n=1000
.10     1.645    .47     .56     .65     .72     .89
.05     1.960    .37     .42     .52     .60     .82
.01     2.576    .14     .16     .22     .27     .53
.001    3.291    .024    .026    .034    .045    .124
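To see where posteriors of this kind can come from, here is a sketch of one standard calculation. It rests on assumptions not stated in the slides: H0: µ = 0 gets prior .5, and under H1 the remaining .5 is spread over µ by a Normal prior centered at 0 with the same variance σ² as a single observation. Under that setup the posterior depends only on the standardized statistic t and on n. The function name is illustrative.

```python
from math import exp, sqrt

def posterior_null(t, n, prior=0.5):
    """P(H0 | x) for a point null vs. a N(0, sigma^2) prior on mu (assumed setup)."""
    # Ratio of the marginal likelihood under H1 to the likelihood under H0
    bf_alt = exp(t ** 2 / (2 * (1 + 1 / n))) / sqrt(1 + n)
    return 1 / (1 + (1 - prior) / prior * bf_alt)

# The p = .05 row (t = 1.960) across the tabled sample sizes
for n in (10, 20, 50, 100, 1000):
    print(n, round(posterior_null(1.960, n), 2))
```

Under these assumptions the p = .05 row is matched to two decimals. The factor 1/√(1 + n) is what drives the posterior for H0 upward as n grows while t stays fixed.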
(1) Some claim the prior of .5 is a warranted frequentist
assignment:
H0 was randomly selected from an urn in which 50% are true
(*) Therefore P(H0) = .5
H0 may be 0 change in extinction rates, 0 lead concentration, etc.
What should go in the urn of hypotheses?
For the frequentist, H0 is either true or false; the probability in (*)
is fallacious and results from an unsound instantiation.
43
We are very interested in how false it might be, which is what we
can assess by means of a severity assessment.
(2) Subjective degree of belief assignments will not ensure the error
probability, and thus the severity assessments we need.
(3) Some suggest an “impartial” or “uninformative” Bayesian prior
gives .5 to H0, the remaining .5 probability being spread out over the
alternative parameter space (Jeffreys).
This “spiked concentration of belief in the null” is at odds with the
prevailing view “we know all nulls are false”.
The “Bayesian” has recently co-opted 'error probability' to describe a
posterior, but it is not a frequentist error probability, which measures
something very different.
44
Fisher: The Function of the p-Value Is Not Capable of Finding
Expression
Faced with conflicts between error probabilities and Bayesian posterior
probabilities, the error probabilist would conclude that the flaw lies with
the latter measure.
Fisher: Discussing a test of the hypothesis that the stars are
distributed at random, Fisher takes the low p-value (about 1 in 33,000)
to “exclude at a high level of significance any theory involving a
random distribution” (Fisher, 1956, page 42).
Even if one were to imagine that H0 had an extremely high prior
probability, Fisher continues — never minding “what such a statement
of probability a priori could possibly mean”— the resulting high
a posteriori probability to H0, he thinks, would only show that “reluctance
to accept a hypothesis strongly contradicted by a test of significance”
(ibid, page 44) . . . “is not capable of finding expression in any
calculation of probability a posteriori” (ibid, page 43).
— sampling theorists do not deny there is ever a legitimate
frequentist prior ….: one may consider hypotheses about such
distributions and subject them to probative tests.
— were one to consider the claim about the a priori probability to be
itself a hypothesis, Fisher suggests, it would be rejected by the data!
(general problem of temporal incoherence of priors)
45
Update: Wave IV? 2005+ The Reference Bayesians Abandon
Coherence, the LP, and strive to match frequentist error probabilities!
Contemporary “Impersonal” Bayesianism: (Cox and Mayo 2007)
Because of the difficulty of eliciting subjective priors, and because
of the reluctance among scientists to allow subjective beliefs to be
conflated with the information provided by data, much current
Bayesian work in practice favors conventional “default”,
“uninformative,” or “reference” priors.
1. What do reference posteriors measure?
• A classic conundrum: there is no unique “noninformative”
prior. (Supposing there is one leads to inconsistencies in
calculating posterior marginal probabilities).
• Any representation of ignorance or lack of information that
succeeds for one parameterization will, under a different
parameterization, entail having knowledge.
Contemporary “reference” Bayesians seek priors that are simply
conventions to serve as weights for reference posteriors.
• not to be considered expressions of uncertainty, ignorance, or
degree of belief.
• may not even be probabilities; flat priors may not sum to one
(improper prior). If priors are not probabilities, what then is
the interpretation of a posterior? (a serious problem I would
like to see Bayesian philosophers tackle).
46
2. Priors for the same hypothesis change according to what
experiment is to be done! Bayesian incoherence
If the prior is to represent information why should it be influenced
by the sample space of a contemplated experiment?
Violates the likelihood principle — the cornerstone of Bayesian
coherency
Reference Bayesians: it is “the price” of objectivity.
— seems to wreak havoc with basic Bayesian foundations, but
without the payoff of an objective, interpretable output — even
subjective Bayesians object — all this demands study by
Bayesian philosophers
3. Reference posteriors with good frequentist properties
Reference priors are touted as having some good frequentist
properties, at least in one-dimensional problems.
They are deliberately designed to match frequentist error
probabilities.
If you want error probabilities, why not use techniques that provide
them directly?
47
Philosophers who wish to champion Bayesian accounts
assuming they can provide rational degrees of belief must
grapple with these questions!
Note: using conditional probability, which is part and parcel of
probability theory (as in “Bayes nets”), does not make one a Bayesian
—no priors to hypotheses…
Of course, I have a particular statistical philosophy that I have
been developing, and I will wish to present parts of it, inviting
feedback from you, especially on some ongoing work (which is
to be a basis for a new book, Learning From Error)
My main goal, however, is to explore and try to explain the
issues in contemporary philosophy of statistics, leading up to
where we are today.