
1

Professor Deborah G. Mayo

[email protected] Office Hrs: T405: TBA

PH500 Research Seminar in the Philosophy of Science:

Autumn 2008:

Topics in the Philosophy and History of Inductive/Statistical

Inference and the Foundations of Statistical Science:

10:00-12:00, on 15, 29 Oct; 12, 19 Nov; 10 Dec.

This seminar will focus on central problems in contemporary

philosophy of statistics, and their interrelationships with general

problems of inference and evidence in philosophy of science. We

will study some relevant work by statisticians, e.g., R.A. Fisher, J.

Neyman, E.S. Pearson, L. J. Savage, G. Barnard, D.R. Cox, E.

Lehmann, J. Berger, notable exchanges between them, and related

arguments by philosophers of statistics and confirmation theory.

We will trace out a handful of problems and principles that underlie

contrasting philosophies about evidence and inference, as well as

contemporary statistical foundations (frequentist, Bayesian, other).

Questions we consider include: What are the roles of probability in

uncertain inference in science? Which methods can be shown to

ensure reliability? And how is control of long-run error

probabilities relevant for warranting inductive inference in

science? We will explore how contrasting answers to these

questions directly connect to long-standing problems about

inductive inference and methodological issues about data collection

and hypothesis construction, e.g., data-dependent (versus

predesignated) hypotheses, double-counting, data-mining,

"stopping-rules", and "selection" effects.

2

• No knowledge of statistics is expected*, only an interest in

learning something about its foundations from a philosophical

perspective.

• The full syllabus lists some broad areas of possible discussion

for general (optional) additional meetings, to be held at

agreed-upon times, given interest (among seminar participants

and/or non-participants). Input, as well as presentations by

others, would be welcome.

• I expect to supply (hard copies or online) all readings for the

seminar, and ongoing notes and slides as the course proceeds,

depending on the interests of participants.

• The material of this seminar will relate to a manuscript I am

writing entitled Learning From Error.

*However, if there is interest, I would expect to hold 1-2 optional

sessions on some of the more formal aspects.

3

Tentative List of Topics

1. October 15: INTRODUCTION TO THE SEMINAR:

Designing our Seminar. Introduction and Overview: (i) The

relevance of philosophy of statistics to philosophy of science;

(ii) 4 Waves of Controversy in the Philosophy of Statistics

2. October 29: Statistical Significance Tests: Family Feuds

and 50+ years of fallout. Some Classical Exchanges between R.

A. Fisher, J. Neyman, and E.S. Pearson

3. November 12: Some Problems of Frequentist Error

Statistics (Behavioristic and Evidential Interpretations) and

Bayesian Statistics (Subjective and Objective interpretations);

Fallacies of Testing (and their avoidance)

4. November 19: Error Statistical and Bayesian Principles

and Their Consequences: The (strong) Likelihood Principle;

Optional Stopping, (and/or Double-Counting, Non-novel data,

and Data-Dredging)

5. December 10: The O-Bayesian Movement: Bayesian-

Frequentist "reconciliations" and methodological

"unifications"; impersonal and reference Bayesians (and

resulting Bayesian family feuds); Highly Probable vs. Highly

Probed hypotheses

4

What is the Philosophy of Statistics?

At one level of analysis at least, statisticians and philosophers of science

ask many of the same questions:

• What should be observed and what may justifiably be inferred from

the resulting data?

• How well do data confirm or fit a model?

• What is a good test?

• Must predictions be “novel” in some sense? (selection effects,

double counting, data mining)

• How can spurious relationships be distinguished from genuine

regularities? from causal regularities?

• How can we infer more accurate and reliable observations from

less accurate ones?

• When does a fitted model account for regularities in the data?

That these very general questions are entwined with long standing

debates in philosophy of science helps to explain why the field of

statistics tends to cross over so often into philosophical territory.

That statistics is a kind of “applied philosophy of science” is not

too far off the mark (Kempthorne, 1976).

5

Before launching into philosophy of statistics proper, let's briefly

consider its relevance for philosophy of science:

Statistics → Philosophy

3 ways statistical accounts are used in philosophy of science

(1) Model Scientific Inference — to capture either the actual or

rational ways to arrive at evidence and inference

(2) Resolve Philosophical Problems about scientific inference,

observation, experiment;

(problem of induction, objectivity of observation, reliable

evidence, Duhem's problem, underdetermination).

(3) Perform a Metamethodological Critique

-scrutinize methodological rules, e.g., accord special weight to

"novel" facts, avoid ad hoc hypotheses, avoid "data mining",

require randomization.

Philosophy → Statistics

Central job to help resolve the conceptual, logical, and

methodological discomforts of scientists as to: how to make

reliable inferences despite uncertainties and errors?

Tackling the problems of philosophy of statistics helps in obtaining

a general account of inductive inference that solves or makes

progress on:

the philosopher's problems of induction, objective evidence,

underdetermination.

6

Philosophy of statistics and the goal of a philosophy of science

relevant for philosophical problems in scientific practice

To many of us who came of age, philosophically, during the

early 80’s, breaking out from the grips of the logical positivist

orthodoxy felt like a new birth of freedom: as we marched with

our placards “no more armchair philosophy of science,” we firmly

resolved to dedicate ourselves to be dynamically engaged with,

and genuinely relevant to scientific practice.

(1) Rather than a “white glove” analysis of the logical relations

between statements of ‘evidence’ and theory, we would get down

to the real nitty-gritty of the procedures for obtaining data in the

first place, and we would confront the complex linkages between

data, experiment, and theory.

(2) Rather than hand down pronouncements on ideally rational

methodology, we would examine methodologies of science

empirically and naturalistically.

Under (1), two broad trends: the “new experimentalism” and the

“new modeling”, respectively.

• Moving away from high–level theory, we should focus on the

manifold local experimental tasks of distinguishing real effects

from artifacts, of checking instruments, and estimating and

subtracting out effects of backgrounds.

• We should accept the disunified and pluralistic strategies by

which models mediate between data, hypotheses, and the world.

7

(3) Alongside, and intermingled with, these trends, one may

identify a broad area of naturalized philosophy of science: look to

psychology, sociology, biology, cognitive science, and various

spin-offs, and/or to the scientific record.

As interesting and invigorating as many of the new moves have

been, if one stops to survey the current landscape, by and large, it

appears that the problems of evidence and inference and method

have not been solved but rather shelved, in one way or another.

The thinking, perhaps, was that we would come back to them after

being steeped for a while in the practices of various sciences,

biology, psychology, economics, and others.

If so, the time seems ripe to return to them!

8

Fresh methodological problems arise in practice surrounding

a panoply of methods and models relied on to learn from

incomplete, and often non-experimental, data.

Examples abound:

Disputes over hypothesis-testing in psychology (e.g., the recently

proposed “significance test ban”);

Disputes over the proper uses of regression in applied statistics;

Disputes over dose-response curves in estimating risks;

Disputes about the use of computer simulations in “observational”

sciences;

Disputes about “external validity” in experimental economics;

and,

Across the huge landscape of fields using the latest, high-

powered, computer methods, there are disputes about data-

mining, algorithmic searches, and model validation.

Equally important are the methodological presuppositions that are

not, but perhaps ought to be, disputed, debated, or at least laid out

in the open — often, ironically, in the very fields in which

philosophers of science immerse themselves.

Scientists who look to contemporary philosophy of science

for insight (e.g., from psychology, epidemiology, ecology,

economics), find philosophical discussions either divorced from

issues of current practice, or in such disarray (regarding general

issues of evidence and inference) that they may be found reverting

back to the logical empiricist tracts (most commonly Popper) that

we were supposed to replace!

9

What happened? (“Was it all a dream?”)

In the face of the shortcomings of philosophical attempts to

arrive at uniform accounts of confirmation or testing, philosophers

of science have tended to turn away from the kind of normative

role that would be involved in scrutinizing the epistemic

credentials of methods and models used in practice.

The very idea of philosophers criticizing, much less

improving upon, methods, models, and strategies for learning

from data, many seem to fear, is too redolent of the now

discredited image of the philosopher operating from a privileged

vantage point.

10

Philosophers can offer, Alan Chalmers (1999) suggests, only trivial platitudes such as

“take evidence seriously” (ibid., p. 171).

“Scientists themselves” he concludes, “are the practitioners

best able to conduct science and are not in need of advice from

philosophers” (ibid. p. 252).

Achinstein (2001):

“scientists do not and should not take such philosophical accounts

of evidence seriously” (p. 9).

Philosophical accounts of evidence fail for actual science

evidence:

(i) they make it too easy to have evidence, and

(ii) they are based on a priori computations, whereas scientists

evaluate evidence empirically.

But once it is admitted that the problem of evidence is

empirical “then it is scientists, not philosophers, who are in the

best position to judge whether e, if true, is evidence that h, and

how strong that evidence is”.

The disarray and self-doubt within contemporary philosophy

of science has not gone unnoticed by scientists, especially those in

fields that have traditionally interacted with philosophy in

grappling with methodological debates: e.g., economics,

psychology, and statistics:

11

“If philosophers and others within science theory can’t agree

about the constitution of the scientific method (or even whether

asking about a scientific ‘method’ makes any sense), doesn’t it

seem a little dubious for economists to continue blithely taking

things off the shelf and attempting to apply them to economics?”

(Hands, 2001, p. 6).

Deciding that it is, methodologists of economics, no longer feeling

restricted by principles of scientific legitimacy espoused by

philosophers (Popper, Lakatos, or other), increasingly look to

sociology of science, rhetoric, evolutionary psychology.

The problem is not merely how this cuts philosophers of science

out of being engaged in methodological practice; equally serious,

is how it encourages practitioners to assume there are no deep

epistemological problems with the ways they collect and base

inferences on data.

Especially surprising is the dwindling of genuine

interdisciplinary work (between philosophers and scientists) in

foundations of statistics — one of the few cases where a

“philosophy of X” has historically involved statistical

practitioners as much as philosophers.

Here too, practitioners are not waiting for philosophers to

sort things out.

12

In a recent lead article in Statistical Science, we hear:

“Professional agreement on statistical philosophy is not on the

immediate horizon, but this should not stop us from agreeing on

methodology”, as if “what is correct methodologically” does not

depend on “what is correct philosophically” (Berger, 2003, p. 2).

The controversies among non-statistician practitioners are

even more charged, and how they are settled directly affects

science-based policy (in medicine, ecology, risk assessment).

In addition to the resurgence of the age-old controversies —

significance test vs. confidence intervals, frequentist vs. Bayesian

measures, — the latest statistical modeling techniques have

introduced brand new methodological issues.

High-powered computer science packages offer a welter of

algorithms for “automatically” selecting among this explosion of

models, but as each boasts different, and incompatible, selection

criteria, we are thrown back to the basic question of inductive

inference: what is required to severely discriminate among well-fitting

models such that, when a claim (or hypothesis or model)

survives a test, the resulting data count as good evidence for the

claim’s correctness or dependability or adequacy?

13

Granted, analytic epistemology often appeals to probability to carry

out its analytic work (e.g., Bayesian epistemology), but this is not

to connect up with philosophical problems of statistics — nor is it

seen to be.

Still, the upshot of developing an adequate philosophy of statistics

would be relevant to this reconstruction effort (we may need

different "logics" to adequately capture evidence and inference.)

The philosophy and history of statistics is also of great interest in

its own right ... so what are some of the issues we'll want to

consider?

A romp through 3-4 "waves in philosophy of statistics"

14

History and philosophy of statistics is a huge territory marked by

70 years of debates widely known for reaching unusual heights

both of passion and of technical complexity.

To get a handle on the movements and cycles without too much

distortion, I propose to identify three main “battle waves”—

Wave I ~ 1930-1955/60

Wave II ~ 1955/60-1980

Wave III ~ 1980-2005 & beyond

15

A core question: What is the nature and role of probabilistic

concepts, methods, and models in making inferences in the face of

limited data, uncertainty and error?

1. Two Roles For Probability:

Degrees of Confirmation and Degrees of Well-Testedness

a. To provide a post-data assignment of degree of probability,

confirmation, support or belief in a hypothesis;

b. To assess the probativeness, reliability, trustworthiness, or

severity of a test or inference procedure.

These two contrasting philosophies of the role of probability in

statistical inference are very much at the heart of the central points

of controversy in the “three waves” of philosophy of statistics…

16

Having conceded loss in the battle for justifying induction, philosophers

appeal to logic to capture scientific method

Inductive Logics

“Confirmation Theory”

Rules to assign degrees of

probability or confirmation to

hypotheses given evidence e

Carnap C(H,e)

Logic of falsification

Methodological falsification

Rules to decide when to

“prefer” or accept hypotheses

Popper

Inductive Logicians

we can build and try to justify

“inductive logics”

straight rule: Assign degrees of

confirmation/credibility

Statistical affinity

Bayesian (and likelihoodist)

accounts

Deductive Testers

we can reject induction and

uphold the “rationality” of

preferring or accepting

H if it is “well tested”

Statistical affinity

Fisherian, Neyman-Pearson

methods: probability enters to

ensure reliability and severity of

tests with these methods.

17

I. Philosophy of Statistics: “The First Wave”

WAVE I: circa 1930-1955:

Fisher, Neyman, Pearson, Savage, and Jeffreys.

Statistical inference tools use data x0 to probe aspects of the data

generating source:

In statistical testing, these aspects are in terms of statistical hypotheses

about parameters governing a statistical distribution

H tells us the “probability of x under H”, written P(x;H)

(probabilistic assignments under a model)

Important to avoid confusion with conditional probabilities in Bayes’s

theorem, P(x|H).

Testing model assumptions is extremely important, though we will not

discuss it here.

18

Modern Statistics Begins with Fisher:

“Simple” Significance Tests

Fisher strongly objected to Bayesian inference, in particular to the use of

prior distributions (relevant for psychology not science).

Looks to develop ways to express the uncertainty of inferences without

deviating from frequentist probabilities.

Example. Let the sample X = (X1, …, Xn) be IID from a Normal

distribution (NIID) with σ = 1.

1. A null hypothesis H0: µ = 0

e.g., 0 mean concentration of lead, no difference in mean survival

in a given group, in mean risk, mean deflection of light.

2. A function of the sample, d(X), the test statistic: which reflects the

difference between the data x0 = (x1, …,xn), and H0;

The larger d(x0) the further the outcome is from what is expected under

H0, with respect to the particular question being asked.

3. The p-value is the probability of a difference larger than d(x0), under

the assumption that H0 is true:

p(x0)=P(d(X) > d(x0); H0)

19

Mini-recipe for p-value calculation:

The observed significance level (p-value) with observed sample mean x̄ = .1:

p(x0) = P(d(X) > d(x0); H0).

The relevant test statistic d(X) is:

d(X) = (X̄ − µ0)/σx̄,

where X̄ is the sample mean with standard deviation σx̄ = σ/√n.

d(X) = (Observed − Expected (under H0))/σx̄

Since σx̄ = σ/√n = 1/5 = .2, d(X) = .1 − 0 in units of σx̄ yields

d(x0) = .1/.2 = .5

Under the null, d(X) is distributed as standard Normal, denoted by

d(X) ~ N(0,1).

(Area to the right of .5) ~.3, i.e. not very significant.
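
The mini-recipe can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the function name p_value_one_sided is hypothetical, and n = 25 is inferred from σ/√n = 1/5 with σ = 1.

```python
from math import sqrt
from statistics import NormalDist


def p_value_one_sided(xbar, mu0, sigma, n):
    """Return (d(x0), p-value) for the test of H0: mu = mu0 against larger means."""
    sigma_xbar = sigma / sqrt(n)           # standard deviation of the sample mean
    d_obs = (xbar - mu0) / sigma_xbar      # observed test statistic d(x0)
    # Under H0, d(X) ~ N(0, 1), so the p-value is the area to the right of d(x0).
    p = 1.0 - NormalDist().cdf(d_obs)
    return d_obs, p


# Values from the slide: observed mean .1, mu0 = 0, sigma = 1, n = 25 (inferred).
d_obs, p = p_value_one_sided(xbar=0.1, mu0=0.0, sigma=1.0, n=25)
print(f"d(x0) = {d_obs:.2f}, p-value = {p:.3f}")   # d(x0) = 0.50, p-value = 0.309
```

The computed area, about .31, agrees with the "~.3" read off the standard Normal curve.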

20

Logic of Simple Significance Tests: Statistical Modus Tollens

“Every experiment may be said to exist only in order to give the

facts a chance of disproving the null hypothesis” (Fisher, 1956,

p.160).

Statistical analogy to the deductively valid pattern modus tollens:

If the hypothesis H0 is correct then, with high probability, 1-p,

the data would not be statistically significant at level p.

x0 is statistically significant at level p.

____________________________

Thus, x0 is evidence against H0, or x0 indicates the falsity of H0.

Fisher described the significance test as a procedure for

rejecting the null hypothesis and inferring that the phenomenon

has been “experimentally demonstrated” once one is able to

generate “at will” a statistically significant effect (Fisher,

1935a, p. 14).

21

The Alternative or “Non-Null” Hypothesis

Evidence against H0 seems to indicate evidence for some

alternative.

Fisherian significance tests strictly consider only the H0

Neyman and Pearson (N-P) tests introduce an alternative H1

(even if only to serve as a direction of departure).

Example. X = (X1, …,Xn), NIID with σ =1:

H0: µ = 0 vs. H1: µ > 0

Despite the bitter disputes with Fisher that were to erupt soon

after ~1935, Neyman and Pearson at first saw their work as merely

placing Fisherian tests on firmer logical footing.

Much of Fisher’s hostility toward N-P methods reflects

professional and personality conflicts more than philosophical

differences.

22

Neyman-Pearson (N-P) Tests

N-P hypothesis test: maps each outcome x = (x1, …,xn) into either

the null hypothesis H0, or an alternative hypothesis H1 (where the

two exhaust the parameter space) to ensure the probabilities of

erroneous rejections (type I errors) and erroneous acceptances (type

II errors) are controlled at prespecified values, e.g., 0.05 or 0.01,

the significance level of the test.

Test T(α): X = (X1, …,Xn), NIID with σ =1,

H0: µ = µ0 vs. H1: µ > µ0

■ if d(x0) > cα, "reject" H0, (or declare the result statistically

significant at the α level);

■ if d(x0) ≤ cα, "accept" H0,

e.g., cα = 1.96 for α = .025.

“Accept/Reject” are uninterpreted parts of the mathematical apparatus.

Type I error probability = P(d(X) > cα; H0) ≤ α. The Type II error probability:

P(Test T(α) does not reject H0; µ = µ1) =

= P(d(X) ≤ cα; µ = µ1) = β(µ1), for any µ1 > µ0.

The "best" test at level α at the same time minimizes the value of β

for all µ1 > µ0, or equivalently, maximizes the power:

POW(T(α); µ1) = P(d(X) > cα; µ1)

T(α) is a Uniformly Most Powerful (UMP) level α test
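
The error probabilities of test T(α) can be sketched numerically. This is an illustrative Python sketch under stated assumptions, not the author's own computation: the function names are hypothetical, and n = 25 and the listed alternatives µ1 are chosen only for illustration (this slide does not fix them).

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # distribution of d(X) under H0


def cutoff(alpha):
    """Cut-off c_alpha satisfying P(d(X) > c_alpha; H0) = alpha."""
    return STD_NORMAL.inv_cdf(1.0 - alpha)


def power(mu1, mu0, sigma, n, alpha):
    """POW(T(alpha); mu1) = P(d(X) > c_alpha; mu = mu1)."""
    # Under mu = mu1, d(X) is Normal with variance 1 and mean shifted by
    # (mu1 - mu0) measured in units of sigma / sqrt(n).
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    return 1.0 - STD_NORMAL.cdf(cutoff(alpha) - shift)


def type_ii_error(mu1, mu0, sigma, n, alpha):
    """beta(mu1) = P(d(X) <= c_alpha; mu = mu1) = 1 - power."""
    return 1.0 - power(mu1, mu0, sigma, n, alpha)


# Illustrative settings (assumed): mu0 = 0, sigma = 1, n = 25, alpha = .025.
for mu1 in (0.0, 0.2, 0.4, 0.6):
    print(mu1, round(power(mu1, 0.0, 1.0, 25, 0.025), 3))
```

At µ1 = µ0 the power equals α, the type I error probability, and it rises as µ1 moves further above µ0.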

23

Inductive Behavior Philosophy

Philosophical issues and debates arise once one begins to consider the

interpretations of the formal apparatus

‘Accept/Reject’ are identified with deciding to take specific

actions, e.g., publishing a result, announcing a new effect.

The justification for optimal tests is that

“it may often be proved that if we behave according to such a rule

... we shall reject H when it is true not more, say, than once in a

hundred times, and in addition we may have evidence that we shall

reject H sufficiently often when it is false.”

Neyman: Tests are not rules of inductive inference but rules of

behavior:

The goal is not to adjust our beliefs but rather to “adjust our behavior” to

limited amounts of data

Is he just drawing a stark contrast between N-P tests and Fisherian as

well as Bayesian methods? Or is the behavioral interpretation essential

to the tests?

24

“Inductive behavior” vs. “Inductive inference” battle

commingles philosophical, statistical and personality clashes.

Fisher (1955) denounced the way that Neyman and Pearson

transformed ‘his’ significance tests into ‘acceptance procedures’.

• They’ve turned my tests into mechanical rules or ‘recipes’ for

‘deciding’ to accept or reject statistical hypothesis H0,

• The concern has more to do with speeding up production or

making money than in learning about phenomena

N-P followers are like:

“Russians (who) are made familiar with the ideal that

research in pure science can and should be geared to

technological performance, in the comprehensive organized

effort of a five-year plan for the nation.” (1955, 70)

“In the U.S. also the great importance of organized

technology has I think made it easy to confuse the process

appropriate for drawing correct conclusions, with those aimed

rather at…speeding production, or saving money”.

25

Pearson distanced himself from Neyman’s “inductive

behavior” jargon, calling it “Professor Neyman’s field rather than

mine”.

But the most impressive mathematical results were in the

decision-theoretic framework of Neyman-Pearson-Wald.

Many of the qualifications by Neyman and Pearson in the first

wave are overlooked in the philosophy of statistics literature.

Admittedly, these “evidential” practices were not made

explicit *. (Had they been, the subsequent waves of philosophy of

statistics might have looked very different).

*Mayo’s goal in ~ 1978

26

The Second Wave: ~1955/60 -1980

“Post-data criticisms of N-P methods”:

Ian Hacking (1965) framed the main lines of criticism by philosophers:

“Neyman-Pearson tests as suitable for before-trial betting, but not for

after-trial evaluation” (p. 99).

Battles: “initial precision vs. final precision”,

“before-data vs. after data”

After the data, he claimed, the relevant measure of support is the

(relative) likelihood

Two data sets x and y may afford the same "support" to H, yet

warrant different inferences [on significance test reasoning]

because x and y arose from tests with different error

probabilities.

o This is just what error statisticians want!

o But (at least early on) Hacking (1965) held to the

“Law of Likelihood”: x supports hypothesis H1 more than H2 if

P(x;H1) > P(x;H2).

Yet, as Barnard notes, “there always is such a rival hypothesis:

That things just had to turn out the way they actually did”.

Since such a maximally likely alternative H2 can always be

constructed, H1 may always be found less well supported, even if

H1 is true: no error control.

Hacking soon rejected the likelihood approach on such grounds, though

likelihoodist accounts are advocated by others.
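
The lack of error control can be illustrated with a small simulation. This Python sketch is a simplification under stated assumptions: instead of Barnard's rival ("things just had to turn out the way they did"), it uses the best-fitting Normal mean µ = x̄, chosen after seeing the data, as the constructed alternative H2, while H1: µ = 0 is true. The function name, n = 25, the seed, and the number of trials are all hypothetical choices.

```python
import random
from math import exp


def likelihood_ratio_best_fit_vs_true(n, rng, mu_true=0.0, sigma=1.0):
    """Return L(mu = xbar) / L(mu = mu_true) for one simulated Normal sample."""
    xs = [rng.gauss(mu_true, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    # With known sigma, log L(xbar) - log L(mu_true) = n (xbar - mu_true)^2 / (2 sigma^2),
    # which is never negative, so the ratio is never below 1.
    return exp(n * (xbar - mu_true) ** 2 / (2.0 * sigma ** 2))


rng = random.Random(1)  # fixed seed so the run is reproducible
ratios = [likelihood_ratio_best_fit_vs_true(25, rng) for _ in range(1000)]

# Share of samples in which the data-dependent H2 is at least as well
# "supported" as the true H1.
share = sum(r >= 1.0 for r in ratios) / len(ratios)
print(share)   # 1.0
```

Because the true hypothesis is never favored over the constructed rival, the probability of being misled on this comparison is maximal, which is the "no error control" point.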

27

Perhaps THE key issue of controversy in the philosophy of

statistics battles

The (strong) likelihood principle: likelihoods suffice to convey “all

that the data have to say”:

According to Bayes’s theorem, P(x|µ) ... constitutes the entire

evidence of the experiment, that is, it tells all that the experiment

has to tell. More fully and more precisely, if y is the datum of some

other experiment, and if it happens that P(x|µ) and P(y|µ) are

proportional functions of µ (that is, constant multiples of each

other), then each of the two data x and y have exactly the same

thing to say about the values of µ… (Savage 1962, p. 17.)

—the error probabilist needs to consider, in addition, the sampling

distribution of the likelihoods.

—significance levels and other error probabilities all violate the

likelihood principle (Savage 1962).

28

Paradox of Optional Stopping

Instead of fixing the sample size n in advance, in some tests, n is

determined by a stopping rule:

In Normal testing, 2-sided H0: µ = 0 vs. H1: µ ≠ 0

Keep sampling until H0 is rejected at the .05 level

(i.e., keep sampling until |X̄| ≥ 1.96 σ/√n).

Nominal vs. Actual significance levels: with n fixed the type 1 error

probability is .05.

With this stopping rule the actual significance level differs from, and

will be greater than .05.

By contrast, since likelihoods are unaffected by the stopping rule, the LP

follower denies there really is an evidential difference between the two

cases (i.e., n fixed and n determined by the stopping rule).

Should it matter if I decided to toss the coin 100 times and happened to

get 60% heads, or if I decided to keep tossing until I could reject at the

.05 level (2-sided) and this happened to occur on trial 100?

Should it matter if I kept going until I found statistical significance?

Error statistical principles: Yes! — penalty for perseverance!

The LP says NO!
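
The gap between nominal and actual significance levels can be seen in a simulation. This is a minimal Python sketch under stated assumptions, not a computation from the slides: the stopping rule here is capped at a hypothetical maximum of max_n observations (the rule as stated has no cap), data are generated with H0 true, and the function names, seed, and number of trials are illustrative.

```python
import random
from math import sqrt


def rejects_under_optional_stopping(max_n, rng, sigma=1.0, z_cut=1.96):
    """Sample under H0 (mu = 0) one observation at a time and stop at the
    first n with |xbar| >= z_cut * sigma / sqrt(n), up to max_n observations."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sigma)
        if abs(total / n) >= z_cut * sigma / sqrt(n):
            return True    # nominally "significant at the .05 level"
    return False           # reached max_n without rejecting


def actual_type_i_rate(max_n, trials=2000, seed=0):
    """Relative frequency of (erroneous) rejections of a true H0."""
    rng = random.Random(seed)
    rejections = sum(rejects_under_optional_stopping(max_n, rng) for _ in range(trials))
    return rejections / trials


rate_fixed = actual_type_i_rate(max_n=1)     # a single look: an ordinary fixed-n test
rate_stop = actual_type_i_rate(max_n=100)    # up to 100 looks at the accumulating data
print(rate_fixed, rate_stop)
```

With a single look the rejection rate stays near the nominal .05, while allowing up to 100 looks pushes the actual rate well above it. The likelihood function for the observed data is the same in both cases.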

Savage Forum 1959: Savage audaciously declares that the lesson

to draw from the optional stopping effect is that “optional stopping

is no sin” so the problem must lie with the use of significance

levels. But why accept the likelihood principle (LP)? (simplicity

and freedom?)

29

The likelihood principle emphasized in Bayesian statistics implies,

… that the rules governing when data collection stops are irrelevant to

data interpretation. It is entirely appropriate to collect data until a point

has been proved or disproved (p. 193)…This irrelevance of stopping

rules to statistical inference restores a simplicity and freedom to

experimental design that had been lost by classical emphasis on

significance levels (in the sense of Neyman and Pearson) (Edwards,

Lindman, Savage 1963, p. 239).

For frequentists this only underscores the point raised years before by

Pearson and Neyman:

A likelihood ratio (LR) may be a criterion of relative fit but it “is

still necessary to determine its sampling distribution in order to control

the error involved in rejecting a true hypothesis, because a knowledge of

[LR] alone is not adequate to insure control of this error” (Pearson and

Neyman, 1930, p. 106).

The key difference: likelihood fixes the actual outcome, i.e., just

d(x), while error statistics considers outcomes other than the one

observed in order to assess the error properties

LP → irrelevance of, and no control over, error probabilities.

("why you cannot be just a little bit Bayesian" EGEK 1996)

Update: A famous argument (Birnbaum, 1962) purports to

show that plausible error statistical principles entail the LP!

"Radical!" "Breakthrough!" (since the LP entails the

irrelevance of error probabilities!)

But the "proof" is flawed! (Mayo 2007-8 and forthcoming).

30

The Statistical Significance Test Controversy

(Morrison and Henkel, 1970) – contributors chastise social scientists for

slavish use of significance tests

o Focus on simple Fisherian significance tests

o Philosophers direct criticisms mostly to N-P tests.

Fallacies of Rejection: Statistical vs. Substantive Significance

(i) take statistical significance as evidence of substantive theory

that explains the effect

(ii) Infer a discrepancy from the null beyond what the test

warrants

(i) Paul Meehl: It is fallacious to go from a statistically significant

result, e.g., at the .001 level, to infer that “one’s substantive theory T,

which entails the [statistical] alternative H1, has received ... quantitative

support of magnitude around .999”

A statistically significant difference (e.g., in child rearing) is not

automatically evidence for a Freudian theory.

T is subjected to only “a feeble risk”, violating Popper.

“After reading Meehl (1967) one wonders whether the function of

statistical techniques in the social sciences is not primarily to provide

a machinery for producing phoney corroborations and thereby a

semblance of ‘scientific progress’ where, in fact, there is nothing but

an increase in pseudo-intellectual garbage.” (Lakatos 1978, Note 4:

88-9)

31

Fallacies of rejection:

(i) Take statistical significance as evidence of substantive theory

that explains the effect

(ii) Infer a discrepancy from the null beyond what the test

warrants.

Finding a statistically significant effect, d(x0) > cα (cut-off for

rejection) need not be indicative of large or meaningful effect sizes — test too sensitive

Large N Problem: an α significant rejection of H0 can be very

probable, even with a substantively trivial discrepancy from H0.

This is often taken as a criticism because it is assumed that

statistical significance at a given level is more evidence against the

null the larger the sample size (n) — fallacy!

"The thesis implicit in the [NP] approach [is] that a hypothesis may be

rejected with increasing confidence or reasonableness as the power of

the test increases" (Howson and Urbach 1989 and later editions).

In fact, it is indicative of less of a discrepancy from the null than if it

resulted from a smaller sample size.

(analogy with smoke detector: an alarm from one that often goes off

from merely burnt toast (overly powerful or sensitive), vs. alarm from

one that rarely goes off unless the house is ablaze)
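
The point that a just-significant result indicates less of a discrepancy as n grows can be made concrete. The following Python sketch is illustrative only: the function name is hypothetical, and σ = 1, cα = 1.96, and the listed sample sizes are assumed for the example.

```python
from math import sqrt


def just_significant_mean(n, sigma=1.0, c_alpha=1.96):
    """Smallest sample mean that reaches the cut-off: xbar = c_alpha * sigma / sqrt(n)."""
    return c_alpha * sigma / sqrt(n)


# The same "significant at the .025 level" report corresponds to an
# ever smaller observed departure from H0: mu = 0 as n increases.
for n in (25, 100, 10_000, 1_000_000):
    print(n, just_significant_mean(n))
```

The observed difference needed to trigger the "alarm" shrinks in proportion to 1/√n, so the same significance level from a larger sample points to a smaller discrepancy.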

Comes also in the form of the “Jeffreys-Good-Lindley” paradox:

Even a highly statistically significant result can, with n sufficiently

large, correspond to a high posterior probability for a null hypothesis.
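
A numerical sketch of the paradox follows. Everything about the prior is an assumption made for illustration and does not come from the slides: prior probability .5 on H0: µ = 0, and under the alternative µ ~ N(0, τ²) with τ = 1. The function name is hypothetical. The observed result is held at z = 1.96 while n grows.

```python
from math import sqrt
from statistics import NormalDist


def posterior_prob_null(z, n, sigma=1.0, tau=1.0, prior_null=0.5):
    """Posterior probability of H0: mu = 0 given a sample mean at z standard errors."""
    xbar = z * sigma / sqrt(n)
    # Sampling density of xbar under H0: N(0, sigma^2 / n).
    like_null = NormalDist(0.0, sigma / sqrt(n)).pdf(xbar)
    # Marginal density of xbar under the alternative: N(0, tau^2 + sigma^2 / n).
    like_alt = NormalDist(0.0, sqrt(tau ** 2 + sigma ** 2 / n)).pdf(xbar)
    numerator = prior_null * like_null
    return numerator / (numerator + (1.0 - prior_null) * like_alt)


# Same significance level (z = 1.96) at increasing sample sizes.
for n in (10, 1_000, 1_000_000):
    print(n, round(posterior_prob_null(1.96, n), 3))
```

Under these assumed priors the posterior probability of the null rises toward 1 as n increases, even though each result is statistically significant at the same level.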

32

Fallacy of Non-Statistically Significant Results

Test T(α) fails to reject the null, when the test statistic fails to reach

the cut-off point for rejection, i.e., d(x0) ≤ cα .

A classic fallacy is to construe such a “negative” result as evidence FOR

the correctness of the null hypothesis (common in risk assessment

contexts).

“No evidence against” is not “evidence for”

Merely surviving the statistical test is too easy, occurs too frequently,

even when the null is false.

— results from tests lacking sufficient sensitivity or power.

The Power Analytic Movement of the 60’s in psychology

Jacob Cohen: By considering ahead of time the Power of the test,

select a test capable of detecting discrepancies of interest.

– pre-data use of power (for planning).

(Power is a feature of N-P tests, but apparently the prevalence of

Fisherian tests in the social sciences, coupled, perhaps, with the

difficulty in calculating power, resulted in ignoring power)

A multitude of tables were supplied (Cohen, 1988), but until his

death he bemoaned their all-too-rare use.

33

Post-data use of power to avoid fallacies of insensitive tests

If there's a low probability of a statistically significant result, even

if a non-trivial discrepancy δnon-trivial is present (low power against

δnon-trivial), then a non-significant difference is not good evidence that a

non-trivial discrepancy is absent.

This still retains an unacceptable coarseness: power is always

calculated relative to the cut-off point cα for rejecting H0.

Consider test T(α= .025) , σ = 1, n = 25, and suppose

δnon-trivial = .2

No matter what the non-significant outcome,

Power to detect δnon-trivial is only about .17!

So we’d have to deny the data were good evidence that µ < .2
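
The power figure can be checked with a short calculation. The sketch below assumes the one-sided Normal test with known σ, null value µ0 = 0, and test statistic d(X) = √n(X̄ - µ0)/σ; the function name `power` and its parameters are illustrative and not taken from the slides.

```python
from statistics import NormalDist

def power(mu1, n, sigma=1.0, alpha=0.025, mu0=0.0):
    """P(d(X) > c_alpha; mu = mu1) for the one-sided Normal test (assumed form)."""
    z = NormalDist()                          # standard Normal distribution
    c_alpha = z.inv_cdf(1 - alpha)            # cut-off, about 1.96 for alpha = .025
    shift = (mu1 - mu0) * n ** 0.5 / sigma    # mean of d(X) when mu = mu1
    return 1 - z.cdf(c_alpha - shift)

print(power(0.2, n=25))
```

With n = 25 and σ = 1 the result is just under .17, whatever the observed non-significant outcome happens to be.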

This suggested to me (in writing my dissertation around 1978) that

rather than calculating

(1) P(d(X) > cα; µ =.2) Power

one should calculate

(2) P(d(X) > d(x0); µ=.2). observed power (severity)

Even if (1) is low, (2) may be high. We return to this in the

developments of Wave III.


III. The Third Wave: Relativism, Reformulations, Reconciliations

~1980-2005+

Rational Reconstruction and Relativism in Philosophy of Science

Amid Kuhnian battles over the very idea of a unified method of
scientific inference, statistical inference is less prominent in philosophy

— largely used rational reconstructions of scientific episodes,

— in appraising methodological rules,

— in classic philosophical problems e.g., Duhem’s problem—

reconstruct a given assignment of blame so as to be “warranted” by

Bayesian probability assignments.

no normative force.

Given the recognition that science involves subjective judgments and values,
reconstructions often appeal to a subjective Bayesian account (Salmon’s
“Tom Kuhn Meets Tom Bayes”).

(Kuhn thought this was confused: no reason to suppose an algorithm

remains through theory change)

Naturalisms, HPS —immersed in biology, psychology, etc.,

philosophers of science recoil from unified inferential accounts, content

with rich details of historical and current practice.

Achinstein (2001): “scientists do not and should not take

such philosophical accounts of evidence seriously” (p. 9).


Wave III in Scientific Practice

— Statisticians turn to eclecticism.

— Non-statistician practitioners (e.g., in psychology, ecology,

medicine) bemoan “unholy hybrids”:
a mixture of ideas from N-P methods, Fisherian tests, and Bayesian
accounts that is “inconsistent from both perspectives and burdened with
conceptual confusion” (Gigerenzer, 1993, p. 323).

• Faced with foundational questions, non-statistician practitioners

raise anew the questions from the first and second waves.

• Finding the automaticity and fallacies still rampant, most, if they

are not calling for an outright “ban” on significance tests in

research, insist on reforms and reformulations of statistical tests.

Task Force to consider Test Ban in Psychology: 1990s

Reforms and Reinterpretations Within Error Probability Statistics

Any adequate reformulation must:

(i) Show how to avoid classic fallacies (of acceptance and of

rejection) —on principled grounds,

(ii) Show that it provides an account of inductive inference


Avoiding Fallacies

I have to skip discussion of various attempts in Wave III to avoid

fallacies of acceptance and rejection (e.g., using confidence interval

estimates—please see paper)

To quickly note my own recommendation (for test T(α)):

Move away from coarse accept/reject rule; use specific result

(significant or insignificant) to infer those discrepancies from the null

that are well ruled-out, and those which are not.

e.g., Interpretation of Non-Significant results:

If d(x) is not statistically significant, and the test
had a very high probability of a more statistically
significant difference if µ > µ0 + γ, then d(x) is good
grounds for inferring µ ≤ µ0 + γ.

Use specific outcome to infer an upper bound

µ ≤ µ* (values beyond are ruled out by given severity.)

If d(x) is not statistically significant, but the test had a very
low probability of a more statistically significant
difference if µ > µ0 + γ, then d(x) is poor evidence for
inferring µ ≤ µ0 + γ.

The test had too little probative power to have detected

such discrepancies even if they existed!

Alternatively, you give me the inference of interest and I

tell you how severely or inseverely it is warranted.


Takes us back to the post-data version of power:

Rather than construe “a miss as good as a mile”, parity of logic

suggests that the post-data power assessment should replace the usual

calculation of power against µ1:

POW(T(α), µ1) = P(d(X) > cα; µ=µ1),

with what might be called the power actually attained or, to have a

distinct term, the severity (SEV):

SEV(T(α), µ1) = P(d(X) > d(x0); µ=µ1),

where d(x0) is the observed (non-statistically significant) result.
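
Both quantities reduce to standard Normal tail areas. As a sketch, assume (the form is not restated on this slide) that d(X) = √n(X̄ - µ0)/σ with σ known, so that d(X) is Normal with mean √n(µ1 - µ0)/σ and unit variance when µ = µ1; Φ denotes the standard Normal distribution function.

```latex
\begin{align*}
\mathrm{POW}(T(\alpha), \mu_1) &= P(d(X) > c_\alpha;\ \mu = \mu_1)
  = 1 - \Phi\!\left(c_\alpha - \frac{\sqrt{n}\,(\mu_1 - \mu_0)}{\sigma}\right), \\
\mathrm{SEV}(T(\alpha), \mu_1) &= P(d(X) > d(x_0);\ \mu = \mu_1)
  = 1 - \Phi\!\left(d(x_0) - \frac{\sqrt{n}\,(\mu_1 - \mu_0)}{\sigma}\right).
\end{align*}
```

The two expressions differ only in replacing the fixed cut-off cα by the observed d(x0), which is why severity varies with the outcome while power does not.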


Figure 1 compares power and severity for different outcomes

Figure 1. The graph shows that whereas POW(T(.025), µ1 = .2) = .168
irrespective of the value of d(x0) (see solid curve), the severity
evaluations are data-specific.

The severity for the inference µ < .2:

Both X̄ = .39 and X̄ = -.2 fail to reject H0, but

with X̄ = .39, SEV(µ < .2) is low (.17);

with X̄ = -.2, SEV(µ < .2) is high (.97).
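
These two severity values can be reproduced numerically. A minimal sketch, assuming σ = 1, n = 25, and d(X) = √n(X̄ - µ0)/σ as in the running example; `sev_upper` is a hypothetical helper name, not notation from the slides.

```python
from statistics import NormalDist

def sev_upper(xbar, mu1, n=25, sigma=1.0):
    """Severity for inferring mu < mu1 from a non-significant sample mean xbar:
    P(d(X) > d(x0); mu = mu1), i.e. P(Xbar > xbar) computed under mu = mu1."""
    se = sigma / n ** 0.5                     # standard error of the sample mean
    return 1 - NormalDist().cdf((xbar - mu1) / se)

for xbar in (0.39, -0.2):
    print(f"xbar = {xbar:+.2f}: SEV(mu < .2) = {sev_upper(xbar, 0.2):.3f}")
```

The first outcome gives a severity of about .17 and the second one of nearly .98, even though both fail to reject H0 and the power against µ = .2 is the same in each case.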


Fallacies of Rejection: The Large n-Problem

While with a nonsignificant result, the concern is erroneously inferring

that a discrepancy from µ0 is absent;

With a significant result x0, the concern is erroneously inferring that it is

present.

Utilizing the severity assessment, an α-significant difference with n1
passes µ > µ1 less severely than with n2, where n1 > n2.

Figure 2 compares test T(α), with three different sample sizes:

n = 25, n = 100, n = 400, denoted by T(α,n);

where in each case d(x0) = 1.96 – reject at the cut-off point.

In this way we solve the problems of tests too sensitive or not sensitive

enough, but there’s one more thing ... showing how it supplies an

account of inductive inference

Many argue in Wave III that error statistical methods cannot supply an

account of inductive inference because error probabilities conflict with

posterior probabilities.



Figure 2. In test T(α) (H0: µ ≤ 0 against H1: µ > 0, and σ = 1),
α = .025, cα = 1.96 and d(x0) = 1.96.

The severity for the inference µ > .1:

n = 25, SEV(µ >.1) is .93

n = 100, SEV(µ >.1) is .83

n = 400, SEV(µ >.1) is .5
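
The three values follow from one calculation with the observed statistic fixed at the cut-off. A sketch under the Figure 2 assumptions (µ0 = 0, σ = 1, d(x0) = 1.96, and d(X) = √n(X̄ - µ0)/σ); `sev_lower` is an illustrative name only.

```python
from statistics import NormalDist

def sev_lower(d_obs, mu1, n, sigma=1.0, mu0=0.0):
    """Severity for inferring mu > mu1 from a significant result d_obs:
    P(d(X) <= d_obs; mu = mu1)."""
    shift = (mu1 - mu0) * n ** 0.5 / sigma    # mean of d(X) when mu = mu1
    return NormalDist().cdf(d_obs - shift)

for n in (25, 100, 400):
    print(n, round(sev_lower(1.96, 0.1, n), 2))
```

The same just-significant d(x0) warrants µ > .1 with severity of roughly .93, .83 and .48 as n grows from 25 to 400: the larger the sample, the smaller the discrepancy the rejection indicates.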


P-values vs. Bayesian Posteriors

A statistically significant difference from H0 can correspond to large

posteriors in H0

From the Bayesian perspective, it follows that p-values come up

short as a measure of inductive evidence.

• The significance testers balk at the fact that the recommended

priors result in highly significant results being construed as no

evidence against the null — or even evidence for it!

Discussions of the conflict often consider the two-sided T(2α) test

H0: µ = 0 vs. H1: µ ≠ 0.

(The differences between p-values and posteriors are far less marked
with one-sided tests.)

“Assuming a prior of .5 to H0, with n = 50 one can classically ‘reject H0

at significance level p = .05,’ although P(H0|x) = .52 (which would

actually indicate that the evidence favors H0).”

This is taken as a criticism of p-values only because it is assumed that
the .52 posterior is the appropriate measure of beliefworthiness.

As the sample size increases, the conflict becomes more

noteworthy.


If n = 1000, a result statistically significant at the .05 level

leads to a posterior to the null of .82!

SEV (H1) = .95 while the corresponding posterior has gone

from .5 to .82. What warrants such a prior?

Posterior probability P(H0|x) with prior P(H0) = .5, by p-value (p), test statistic (t), and sample size (n):

  p      t       n=10    n=20    n=50    n=100   n=1000
  ------------------------------------------------------
  .10    1.645   .47     .56     .65     .72     .89
  .05    1.960   .37     .42     .52     .60     .82
  .01    2.576   .14     .16     .22     .27     .53
  .001   3.291   .024    .026    .034    .045    .124
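
One way to see how posteriors of this size arise, as a sketch: put a point prior P(H0) = .5 on µ = 0 and, under H1, a Normal prior µ ~ N(0, σ²). The Normal form of the prior on the alternative is an assumption made here for illustration; the slide does not state it. With t = √n·X̄/σ, the Bayes factor in favour of H0 is then √(1 + n)·exp(-t²n / (2(n + 1))). The name `posterior_null` is hypothetical.

```python
import math

def posterior_null(t, n, prior_null=0.5):
    """P(H0 | x) for H0: mu = 0 against H1: mu != 0, with t = sqrt(n) * xbar / sigma.
    Assumes, for illustration only, a N(0, sigma^2) prior on mu under H1."""
    # Bayes factor for H0: ratio of the marginal densities of xbar under H0 and H1
    bf01 = math.sqrt(1 + n) * math.exp(-0.5 * t * t * n / (n + 1))
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * bf01
    return post_odds / (1 + post_odds)

for n in (50, 1000):
    print(n, round(posterior_null(1.960, n), 2))
```

Under that assumed prior, a result just significant at the .05 level gives a posterior for H0 of about .52 at n = 50 and about .82 at n = 1000. The growth with n comes from the √(1 + n) factor: the spike at the null gains on the diffuse alternative as the sample size increases.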

(1) Some claim the prior of .5 is a warranted frequentist

assignment:

H0 was randomly selected from an urn in which 50% are true

(*) Therefore P(H0) = .5

H0 may be 0 change in extinction rates, 0 lead concentration, etc.

What should go in the urn of hypotheses?

For the frequentist: either H0 is true or it is false; the probability in (*)
is fallacious and results from an unsound instantiation.


We are very interested in how false it might be, which we
can assess by means of a severity assessment.

(2) Subjective degree of belief assignments will not ensure the error
probabilities, and thus the severity assessments, we need.

(3) Some suggest an “impartial” or “uninformative” Bayesian prior
that gives .5 to H0, the remaining .5 probability being spread out over the
alternative parameter space (Jeffreys).

This “spiked concentration of belief in the null” is at odds with the

prevailing view “we know all nulls are false”.

The “Bayesian” has recently co-opted 'error probability' to describe a
posterior, but this is not a frequentist error probability, which measures
something very different.


Fisher: The Function of the p-Value Is Not Capable of Finding

Expression

Faced with conflicts between error probabilities and Bayesian posterior

probabilities, the error probabilist would conclude that the flaw lies with

the latter measure.

Fisher: Discussing a test of the hypothesis that the stars are

distributed at random, Fisher takes the low p-value (about 1 in 33,000)

to “exclude at a high level of significance any theory involving a

random distribution” (Fisher, 1956, page 42).

Even if one were to imagine that H0 had an extremely high prior

probability, Fisher continues — never minding “what such a statement

of probability a priori could possibly mean”— the resulting high

a posteriori probability to H0, he thinks, would only show that “reluctance

to accept a hypothesis strongly contradicted by a test of significance”

(ibid, page 44) . . . “is not capable of finding expression in any

calculation of probability a posteriori” (ibid, page 43).

— sampling theorists do not deny there is ever a legitimate

frequentist prior ….: one may consider hypotheses about such

distributions and subject them to probative tests.

— were one to consider the claim about the a priori probability to be
itself a hypothesis, Fisher suggests, it would be rejected by the data!

(general problem of temporal incoherence of priors)


Update: Wave IV? 2005+ The Reference Bayesians Abandon

Coherence, the LP, and strive to match frequentist error probabilities!

Contemporary “Impersonal” Bayesianism: (Cox and Mayo 2007)

Because of the difficulty of eliciting subjective priors, and because

of the reluctance among scientists to allow subjective beliefs to be

conflated with the information provided by data, much current

Bayesian work in practice favors conventional “default,”
“uninformative,” or “reference” priors.

1. What do reference posteriors measure?

• A classic conundrum: there is no unique “noninformative”

prior. (Supposing there is one leads to inconsistencies in

calculating posterior marginal probabilities).

• Any representation of ignorance or lack of information that

succeeds for one parameterization will, under a different

parameterization, entail having knowledge.
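
A small example of the second point, as a sketch: take a flat prior on a proportion θ in [0, 1] as the supposed expression of ignorance, and reparameterize to φ = θ². The choice of θ, φ, and the cut-off .25 is purely illustrative and not from the slides.

```python
def prob_phi_below(c):
    """P(phi <= c) when theta ~ Uniform(0, 1) and phi = theta**2.
    phi <= c exactly when theta <= sqrt(c), and a uniform theta gives
    P(theta <= s) = s for s in [0, 1]."""
    return c ** 0.5

# A prior that was flat in phi would give P(phi <= 0.25) = 0.25;
# the prior induced by a flat theta gives 0.5 instead.
print(prob_phi_below(0.25))
```

So "ignorance" about θ amounts to a definite preference for small values of φ, although the two parameters describe the same unknown.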

Contemporary “reference” Bayesians seek priors that are simply

conventions to serve as weights for reference posteriors.

• not to be considered expressions of uncertainty, ignorance, or

degree of belief.

• may not even be probabilities; flat priors may not sum to one

(improper prior). If priors are not probabilities, what then is

the interpretation of a posterior? (a serious problem I would

like to see Bayesian philosophers tackle).


2. Priors for the same hypothesis change according to what
experiment is to be done! Bayesian incoherence

If the prior is to represent information why should it be influenced

by the sample space of a contemplated experiment?

Violates the likelihood principle — the cornerstone of Bayesian

coherency

Reference Bayesians: it is “the price” of objectivity.

— seems to wreak havoc with basic Bayesian foundations, but

without the payoff of an objective, interpretable output — even

subjective Bayesians object — all this demands study by

Bayesian philosophers

3. Reference posteriors with good frequentist properties

Reference priors are touted as having some good frequentist

properties, at least in one-dimensional problems.

They are deliberately designed to match frequentist error

probabilities.

If you want error probabilities, why not use techniques that provide

them directly?


Philosophers who wish to champion Bayesian accounts

assuming they can provide rational degrees of belief must

grapple with these questions!

Note: using conditional probability — which is part and parcel of

probability theory, as in “Bayes nets” does not make one a Bayesian

—no priors to hypotheses…

Of course, I have a particular statistical philosophy that I have

been developing, and I will wish to present parts of it, inviting

feedback from you, especially on some ongoing work (which is

to be a basis for a new book, Learning From Error)

My main goal, however, is to explore and try to explain the

issues in contemporary philosophy of statistics, leading up to

where we are today.
