CHAPTER 14: Frequentist Hypothesis Testing: A Coherent Account
Aris Spanos [Fall 2013]

1 Inherent difficulties in learning statistical testing

Statistical testing is arguably the most important, but also the most difficult and confusing, chapter of statistical inference for several reasons, including the following.

(i) The need to introduce numerous new notions, concepts and procedures before one can paint — even in broad brushes — a coherent picture of hypothesis testing.

(ii) The current textbook discussion of statistical testing is both highly confusing and confused. There are several sources of confusion.

I (a) Testing is conceptually one of the most sophisticated sub-fields of any scientific discipline.

I (b) Inadequate knowledge by textbook writers, who often do not have the technical skills to read and understand the original sources, and have to rely on second-hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of these textbooks hypothesis testing is poorly explained as an idiot's guide to combining off-the-shelf formulae with statistical tables like the Normal, the Student's t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the background.

I (c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact the latter has much greater affinity with Bayesian than with frequentist inference.

I (d) A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian statisticians who revel in (unfairly) maligning frequentist inference in their misguided attempt to motivate their preferred viewpoint of statistical inference. You will often hear such Bayesians tell you that "what you really want (require) is probabilities attached to hypotheses" and other such misadvised promptings! In frequentist inference probabilities are always attached to the different possible values of the sample x∈ℝⁿ_X, never to hypotheses!

(iii) The discussion of frequentist testing is rather incomplete in so far as it has been beleaguered by serious foundational problems since the 1930s.


As a result, different applied fields have generated their own secondary literatures attempting to address these problems, but often making things much worse! Indeed, in some fields like psychology it has reached the stage where one has to correct the 'corrections' of those chastising the initial 'correctors'!

In an attempt to alleviate problem (i), the discussion that follows uses a sketchy historical development of frequentist testing. To ameliorate problem (ii), the discussion includes 'red flag' pointers (¥) designed to highlight important points that shed light on certain erroneous interpretations or misleading arguments. The discussion will pay special attention to (iii), addressing some of the key foundational problems.

2 Francis Edgeworth

A typical example of a testing procedure at the end of the 19th century is provided by Edgeworth (1885). From today's perspective his testing takes the form of viewing data x₀ := (x₁₁, x₁₂, ..., x₁ₙ; x₂₁, x₂₂, ..., x₂ₙ) in the context of the following bivariate simple Normal model:

Xₜ := (X₁ₜ, X₂ₜ)′ ~ NIID( (μ₁, μ₂)′, diag(σ², σ²) ), t = 1, 2, ...,

E(X₁ₜ) = μ₁, E(X₂ₜ) = μ₂, Var(X₁ₜ) = Var(X₂ₜ) = σ², Cov(X₁ₜ, X₂ₜ) = 0.

The hypothesis of interest concerns the equality of the two means:

Hypothesis of interest: μ₁ = μ₂.

The 2-dimensional sample is X := (X₁₁, X₁₂, ..., X₁ₙ; X₂₁, X₂₂, ..., X₂ₙ). Common sense, combined with the statistical knowledge at the time, suggested using the difference between the estimated means:

μ̂₁ − μ̂₂, where μ̂₁ = (1/n)Σₜ₌₁ⁿ X₁ₜ and μ̂₂ = (1/n)Σₜ₌₁ⁿ X₂ₜ,

as a basis for deciding whether the two means are different or not. To render the difference (μ̂₁ − μ̂₂) free of the units of measurement, Edgeworth divided it by its standard deviation √(σ̂₁² + σ̂₂²), where

σ̂₁² = (1/n)Σₜ₌₁ⁿ (X₁ₜ − μ̂₁)², σ̂₂² = (1/n)Σₜ₌₁ⁿ (X₂ₜ − μ̂₂)²,


to define the distance function:

d(X) = |μ̂₁ − μ̂₂| / √(σ̂₁² + σ̂₂²).

The last issue he needed to address is how big the observed distance d(x₀) should be to justify inferring that the two means are different. Edgeworth argued that (μ̂₁ − μ̂₂) ≠ 0 could not be 'accidental' (due to chance) if:

d(X) = |μ̂₁ − μ̂₂| / √(σ̂₁² + σ̂₂²) > 2√2.   (1)

I Where did the threshold 2√2 come from? The tail area of N(0, 1) beyond ±2√2 is approximately .005, which was viewed as a reasonable value for the probability of a 'chance' error, i.e. the error of inferring a significant discrepancy when none exists. But why Normal? At the time statistical inference relied heavily on large sample size (asymptotic) results by invoking the Central Limit Theorem.

In summary, Edgeworth introduced 3 features of statistical testing:

(i) a hypothesis of interest: μ₁ = μ₂,
(ii) the notion of a standardized distance: d(X),
(iii) a threshold value for significance: d(X) > 2√2.
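As a quick numerical illustration, here is a minimal Python sketch of Edgeworth's procedure as reconstructed above (the function name edgeworth_distance and the simulated data are illustrative only; the 1/n variance convention and the normalization follow the formula in the text):

```python
import numpy as np

def edgeworth_distance(x1, x2):
    """Edgeworth's standardized distance between two sample means,
    using the (1/n) variance estimates defined in the text."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    m1, m2 = x1.mean(), x2.mean()
    s1_sq = np.mean((x1 - m1) ** 2)   # biased (1/n) variance estimate
    s2_sq = np.mean((x2 - m2) ** 2)
    return abs(m1 - m2) / np.sqrt(s1_sq + s2_sq)

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=50)    # hypothetical data, mu1 = 0
x2 = rng.normal(0.5, 1.0, size=50)    # hypothetical data, mu2 = 0.5
d = edgeworth_distance(x1, x2)
print(d, d > 2 * np.sqrt(2))          # compare with Edgeworth's threshold
```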

3 Karl Pearson

The Pearson approach to statistics can be summarized as follows:

Data x₀ := (x₁, x₂, ..., xₙ)  ⟹  Histogram  ⟹  Fitting a frequency curve f(x; θ̂₁, θ̂₂, θ̂₃, θ̂₄) from the Pearson family.

The Pearson family of frequency curves can be expressed in terms of the following differential equation:

d ln f(x)/dx = (x − θ₁) / (θ₂ + θ₃x + θ₄x²).   (2)

Depending on the values taken by the parameters (θ₁, θ₂, θ₃, θ₄), this equation can generate numerous frequency curves, such as the Normal, the Student's t, the Beta, the Gamma, the Laplace, the Pareto, etc.


The first four raw data moments μ̂′₁, μ̂′₂, μ̂′₃ and μ̂′₄ are sufficient to determine (θ̂₁, θ̂₂, θ̂₃, θ̂₄), and that enables one to select the particular f(x) from the Pearson family.

Having selected a particular frequency curve f₀(x) on the basis of f(x; θ̂₁, θ̂₂, θ̂₃, θ̂₄) = f(x; θ̂), Karl Pearson proposed to assess the appropriateness of this choice. His hypothesis of interest is of the form:

H: f₀(x) = f*(x) ∈ Pearson(θ₁, θ₂, θ₃, θ₄), where f*(x) is the true density.

Pearson proposed the standardized distance function:

η(X) = Σₖ₌₁ᵐ (f̂ₖ − fₖ)²/fₖ ~ χ²(m−1), asymptotically,   (3)

where f̂ₖ (k = 1, 2, ..., m) and fₖ (k = 1, 2, ..., m) denote the empirical and assumed (as specified by f₀(x)) frequencies, and introduced the notion of a p-value:

P(η(X) > η(x₀)) = p(x₀)   (4)

as a basis for inferring whether the choice of f₀(x) was appropriate or not. The smaller the p(x₀), the worse the fit.

¥ It is important to note that this hypothesis of interest, f₀(x) = f*(x), is not about a particular parameter, but about the adequacy of the choice of the probability model. In this sense, this was the first Mis-Specification (M-S) test for a distributional assumption!

Example. Mendel's cross-breeding experiments (1865-6) were based on pea plants with different shape and color, which can be framed in terms of two Bernoulli random variables:

X(round, R) = 0, X(wrinkled, W) = 1;  Y(yellow, Y) = 0, Y(green, G) = 1.

His theory of heredity was based on two assumptions:
(i) the two random variables X and Y are independent,
(ii) round (X=0) and yellow (Y=0) are dominant traits, but 'wrinkled' (X=1) and 'green' (Y=1) are recessive traits, with probabilities:

P(X=0) = P(Y=0) = .75, P(X=1) = P(Y=1) = .25.

This substantive theory gives rise to Mendel's model based on the bivariate distribution below:

y\x       0        1        f_Y(y)
0         .5625    .1875    .750
1         .1875    .0625    .250
f_X(x)    .750     .250     1.00

Mendel's theory model f(x, y)

Data. The 4 gene-pair experiments Mendel carried out with n = 556 gave rise to the observed frequencies and relative frequencies given below:

          (R,Y)=(0,0)   (R,G)=(0,1)   (W,Y)=(1,0)   (W,G)=(1,1)
counts    315           108           101           32

Observed frequencies

y\x       0        1        p̂_Y(y)
0         .5666    .1942    .7608
1         .1817    .0576    .2393
p̂_X(x)   .7483    .2518    1.0001

Observed relative frequencies

How adequate is Mendel's theory in light of the data? One can answer that question using Pearson's goodness-of-fit chi-square test. The chi-square test statistic

η(X) = n Σₖ₌₁ᵐ (p̂ₖ − pₖ)²/pₖ

compares how close the observed relative frequencies are to the expected ones. Using the above data this test statistic yields:

η(x₀) = 556[ (.5666−.5625)²/.5625 + (.1942−.1875)²/.1875 + (.1817−.1875)²/.1875 + (.0576−.0625)²/.0625 ] = .463.

In light of the fact that the tail area of χ²(3) yields P(η(X) > .470) = .927, the data suggest accordance with Mendel's theory.
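This computation is easy to check numerically; a minimal Python sketch using scipy (working from the exact counts, which give the slightly less rounded value η ≈ .470):

```python
import numpy as np
from scipy.stats import chi2

# Mendel's observed counts for (R,Y), (R,G), (W,Y), (W,G) and the
# theoretical cell probabilities from the bivariate table above.
observed = np.array([315, 108, 101, 32])
p_theory = np.array([0.5625, 0.1875, 0.1875, 0.0625])
n = observed.sum()                              # 556

p_hat = observed / n                            # observed relative frequencies
eta = n * np.sum((p_hat - p_theory) ** 2 / p_theory)
p_value = chi2.sf(eta, df=len(observed) - 1)    # tail area of chi-square(3)
print(f"eta = {eta:.3f}, p-value = {p_value:.3f}")  # ~0.47, ~0.93
```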

Karl Pearson's primary contributions to testing were:

(a) the broadening of the scope of the hypothesis of interest, initiating misspecification testing with a goodness-of-fit test,
(b) a distance function whose distribution is known (at least asymptotically), and
(c) the use of the tail probability as a basis for deciding how good the fit with data x₀ is.

4 William Gosset (aka ‘Student’)

Gosset's 1908 seminal paper provided the cornerstone upon which Fisher founded modern statistical inference. At that time it was known that in the case where (X₁, X₂, ..., Xₙ) are NIID(μ, σ²) (the simple Normal model), the MLE estimator μ̂ := X̄ₙ = (1/n)Σₖ₌₁ⁿ Xₖ had the following 'sampling' distribution:

X̄ₙ ~ N(μ, σ²/n)  ⟹  √n(X̄ₙ − μ)/σ ~ N(0, 1).

It was also known that in the case where σ² is replaced by its estimator:

s² = (1/(n−1)) Σₖ₌₁ⁿ (Xₖ − X̄ₙ)²,

the distribution of √n(X̄ₙ − μ)/s is unknown, i.e.

√n(X̄ₙ − μ)/s ~ D(0, 1), with D unknown,

but asymptotically it can be approximated by:

√n(X̄ₙ − μ)/s ≈ N(0, 1).

Gosset (1908) derived the unknown D(·), showing that for any n > 1:

√n(X̄ₙ − μ)/s ~ St(n−1),   (5)

where St(n−1) denotes a Student's t distribution with (n−1) degrees of freedom. The subtle question that needs to be answered is: what does this result mean in light of the fact that μ is unknown?

I This was the first finite sample result [a distributional result that is valid for any n > 1], and it inspired R. A. Fisher to pioneer modern frequentist inference.

5 R. A. Fisher’s significance testing

The result (5) was formally proved and extended by Fisher (1915) and used subsequently as a basis for several tests of hypotheses associated with a number of different statistical models in a series of papers. A key element in Fisher's framework was the introduction of the notion of a statistical model, whose generic form is:

M_θ(x) = {f(x; θ), θ∈Θ}, x∈ℝⁿ_X,   (6)


where f(x; θ), x∈ℝⁿ_X, denotes the (joint) distribution of the sample X := (X₁, ..., Xₙ) that encapsulates the prespecified probabilistic structure of the underlying stochastic process {Xₜ, t∈ℕ}. The link to the phenomenon of interest comes from viewing data x₀ := (x₁, x₂, ..., xₙ) as a 'truly typical' realization of the process {Xₜ, t∈ℕ}.

Fisher used the result (5) to construct a test of significance for the:

null hypothesis: H₀: μ = μ₀   (7)

in the context of the simple Normal model (table 1).

Table 1 - The simple Normal model
Statistical GM: Xₜ = μ + uₜ, t∈ℕ
[1] Normal: Xₜ ~ N(·, ·)
[2] Constant mean: E(Xₜ) = μ, for all t∈ℕ
[3] Constant variance: Var(Xₜ) = σ², for all t∈ℕ
[4] Independence: {Xₜ, t∈ℕ} is an independent process.

¥ Fisher must have realized that the result in (5) by Gosset could not be used directly as a basis for a test because it involved the unknown parameter μ. One can only speculate, but it must have dawned on him that the result can be rendered operational if it is interpreted in terms of factual reasoning, in the sense that it holds under the True State of Nature (TSN) (μ = μ*, the true value). Note that, in general, the expression 'μ* denotes the true value of μ' is a shorthand for saying that 'data x₀ constitute a realization of the sample X with distribution f(x; μ*)'. That is, (5) is a pivot (pivotal quantity):

τ(X; μ) = √n(X̄ₙ − μ*)/s ~ St(n−1) under TSN,   (8)

in the sense that it is a function of both X and μ whose distribution under TSN is known.

Fisher (1956), p. 47, was very explicit about the nature of the reasoning underlying his significance testing:

"In general, tests of significance are based on hypothetical probabilities calculated from their null hypotheses." [emphasis in the original]

His problem was to adapt the result in (8) to hypothetical reasoning with a view to constructing a test. Fisher accomplished that in three steps.


Step 1. He replaced the unknown μ* in (8) with the known null value μ₀, transforming the pivot τ(X; μ) into a statistic τ(X) = √n(X̄ₙ − μ₀)/s. A statistic is only a function of the sample X := (X₁, X₂, ..., Xₙ).

Step 2. He adapted the factual reasoning underlying the pivot (8) into hypothetical reasoning by evaluating the test statistic under H₀:

τ(X) = √n(X̄ₙ − μ₀)/s ~ St(n−1) under H₀.   (9)

Step 3. He employed (9) to define the p-value:

P(τ(X) > τ(x₀); H₀) = p(x₀)   (10)

as an indicator of disagreement (inconsistency, contradiction) between data x₀ and H₀; the bigger the value of τ(x₀), the smaller the p-value.

£ It is crucial to note that it is a serious mistake — committed often by misguided authors — to interpret the above probabilistic statement defining the p-value as conditional on the null hypothesis H₀. Avoid the highly misleading notation

P(τ(X) > τ(x₀) | H₀) = p(x₀),  ✗

with the vertical line (|) instead of a semi-colon (;). Conditioning on H₀ makes no sense in frequentist inference since μ is not a random variable; the evaluation in (10) is made under the scenario that H₀: μ = μ₀ is true.

Although it is impossible to pin down Fisher's interpretation of the p-value, because his views changed over time and he held several interpretations at any one time, there is one fixed point among his numerous articulations (Fisher, 1925, 1955):

'a small p-value can be interpreted as a simple logical disjunction: either an extremely rare event has occurred or H₀ is not true'.

The focus on "a small p-value" needs to be read in conjunction with Fisher's falsificationist stance about testing, in the sense that significance tests can falsify but never verify hypotheses (Fisher, 1955):

"... tests of significance, when used accurately, are capable of rejecting or invalidating hypotheses, in so far as these are contradicted by the data; but that they are never capable of establishing them as certainly true."


His logical disjunction combined with the falsificationist stance are not helpful in shedding light on the key issue of learning from data: what do data x₀ convey about the truth or falsity of H₀? Occasionally, Fisher would go further and express what sounds more like an aspiration for an evidential interpretation:

"The actual value of P ... indicates the strength of evidence against the hypothesis" (see Fisher, 1925, p. 80),

but he never articulated a coherent evidential interpretation based on the p-value. Indeed, as late as 1956 he offered 'a degree of rational disbelief' interpretation (Fisher, 1956, pp. 46-47).

¥ A more pertinent interpretation of the p-value (section 8) is:

'the p-value is the probability — evaluated under the scenario that H₀ is true — of all possible outcomes x∈ℝⁿ_X that accord less well with H₀ than x₀ does.'

In light of the fact that data x₀ were generated by the 'true' (θ = θ*) statistical Data Generating Mechanism (DGM):

M*(x) = {f(x; θ*)}, x∈ℝⁿ_X,

and the test statistic τ(X) = √n(X̄ₙ − μ₀)/s measures the standardized difference between μ* and μ₀, the p-value P(τ(X) > τ(x₀); H₀) = p(x₀) seeks to measure the discordance between H₀ and M*(x).

This precludes several erroneous interpretations of the p-value:

£ The p-value is the probability that H₀ is true.
£ 1 − p(x₀) is the probability that H₁ is true.
£ The p-value is the probability that the particular result is due to 'chance error'.
£ The p-value is an indicator of the size or the substantive importance of the observed effect.
£ The p-value is the conditional probability of obtaining a test statistic τ(x) more extreme than τ(x₀) given H₀; there is no conditioning involved.

Example 1. Consider testing the null hypothesis H₀: μ = 70 in the context of the simple Normal model (see table 1), using the exam score data in table 1.6 (see chapter 1), where X̄ₙ = 71.686, s² = 13.606 and n = 70. The test statistic (9) yields:

τ(x₀) = √70(71.686 − 70)/√13.606 = 3.824,  P(τ(X) > 3.824; μ₀ = 70) = .00007,


where p(x₀) = .00007 is found from the St(69) tables. The tiny p-value suggests that x₀ indicates strong discordance with H₀.
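A quick numerical check of this example from the summary statistics, using scipy's Student's t survival function for the tail area:

```python
import numpy as np
from scipy.stats import t

# Summary statistics from the exam-score example above.
xbar, mu0, s2, n = 71.686, 70.0, 13.606, 70

tau = np.sqrt(n) * (xbar - mu0) / np.sqrt(s2)   # test statistic (9), ~3.824
p_value = t.sf(tau, df=n - 1)                   # one-sided tail area of St(69)
print(f"tau(x0) = {tau:.3f}, p-value = {p_value:.5f}")
```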

Table 2 - The simple Bernoulli model
Statistical GM: Xₜ = θ + uₜ, t∈ℕ
[1] Bernoulli: Xₜ ~ Ber(θ)
[2] Constant mean: E(Xₜ) = θ, for all t∈ℕ
[3] Constant variance: Var(Xₜ) = θ(1−θ), for all t∈ℕ
[4] Independence: {Xₜ, t∈ℕ} is an independent process.

Example 2. Arbuthnot's 1710 conjecture: the ratio of males to females in newborns might not be 'fair'. This can be tested using the statistical null hypothesis:

H₀: θ = θ₀, where θ₀ = .5 denotes 'fair',

in the context of the simple Bernoulli model (table 2), based on the random variable X defined by: {male} = {X=1}, {female} = {X=0}. Using the MLE of θ, θ̂ₙ := X̄ₙ = (1/n)Σₖ₌₁ⁿ Xₖ, whose sampling distribution is:

X̄ₙ ~ Bin(θ, θ(1−θ)/n; n),

one can derive the test statistic:

d(X) = √n(X̄ₙ − θ₀)/√(θ₀(1−θ₀)) ~ Bin(0, 1; n) under H₀.   (11)

Data: n = 30762 newborns during the period 1993-5 in Cyprus, out of which 16029 were boys and 14833 girls.

The test statistic takes the form (11), and with θ̂ₙ = 16029/30762 = .521:

d(x₀) = √30762(.521 − .5)/√(.5(.5)) = 7.366,  P(d(X) > 7.366; θ = .5) = .0000017.

The tiny p-value indicates strong discordance with H₀. In general, one needs to specify a certain threshold for deciding when the p-value is 'small enough' to falsify H₀. Fisher suggested several such thresholds, .01, .025, .05, .1, but insisted that the choice has to be made on a case by case basis.
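A sketch of this calculation in Python; alongside the standardized statistic (11) it also reports the exact Binomial tail area that (11) approximates (binom is from scipy; the exact tail is shown only for comparison, not as the slide's own number):

```python
import numpy as np
from scipy.stats import binom

n, boys, theta0 = 30762, 16029, 0.5
theta_hat = boys / n

# Standardized distance (11) under H0: theta = 0.5
d = np.sqrt(n) * (theta_hat - theta0) / np.sqrt(theta0 * (1 - theta0))

# Exact Binomial tail area that the standardized statistic approximates
p_exact = binom.sf(boys - 1, n, theta0)     # P(Y >= 16029; theta = 0.5)
print(f"d(x0) = {d:.3f}, exact tail = {p_exact:.2e}")
```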


The main elements of a Fisher significance test are {d(X), p(x₀)}.

Components of a Fisher significance test
(i) a prespecified statistical model: M_θ(x),
(ii) a null hypothesis: H₀: θ = θ₀,
(iii) a test statistic (distance function) d(X),
(iv) the distribution of d(X) under H₀,
(v) the p-value: P(d(X) > d(x₀); H₀) = p(x₀),
(vi) a threshold value p₀ [e.g. .01, .025, .05], such that: p(x₀) ≤ p₀ ⇒ x₀ falsifies (rejects) H₀.

Example. Consider a Fisher test based on the simple (one parameter) Normal model (table 3).

Table 3 - The simple (one parameter) Normal model
Statistical GM: Xₜ = μ + uₜ, t∈ℕ
[1] Normal: Xₜ ~ N(·, ·)
[2] Constant mean: E(Xₜ) = μ, for all t∈ℕ
[3] Constant variance: Var(Xₜ) = σ², known, for all t∈ℕ
[4] Independence: {Xₜ, t∈ℕ} is an independent process.

(i) M_θ(x): Xₖ ~ N(μ, σ²) [σ² is known], k = 1, 2, ..., n,
(ii) null hypothesis: H₀: μ = μ₀ (e.g. μ₀ = 0),
(iii) a test statistic: d(X) = √n(X̄ₙ − μ₀)/σ,
(iv) the distribution of d(X) under H₀: d(X) ~ N(0, 1) under H₀,
(v) the p-value: P(d(X) > d(x₀); H₀) = p(x₀),
(vi) a threshold value for discordance, say p₀ = .05.

[Two distribution plots of the standard Normal density, N(0, 1): left, the lower tail P(Z ≤ −1.64) = .05; right, the upper tail P(Z > 1.96) = .025.]


6 The Neyman-Pearson (N-P) framework

Neyman and Pearson (1933) motivated their own approach to testing as an attempt to improve upon Fisher's testing by addressing what they considered to be ad hoc features of that approach:

[a] Fisher's ad hoc choice of a test statistic d(X), pertaining to the optimality of tests,
[b] Fisher's use of a post-data [τ(x₀) is known] threshold for the p-value to indicate falsification (rejection) of H₀, and
[c] Fisher's denial that p(x₀) can indicate confirmation (accordance) of H₀.

Neyman-Pearson (N-P) proposed solutions for [a]-[c] by:

¥ [i] Introducing the notion of an alternative hypothesis H₁ as the complement to the null with respect to Θ, within the same statistical model:

M_θ(x) = {f(x; θ), θ∈Θ}, x∈ℝⁿ_X.   (12)

[ii] Replacing the post-data p-value and its threshold with a pre-data [before τ(x₀) is available] significance level α determining the decision rules:

(i) Reject H₀,  (ii) Accept H₀,

which are calibrated using the error probabilities:

type I: P(Reject H₀ when H₀ is true) = α,
type II: P(Accept H₀ when H₀ is false) = β.

These error probabilities enabled Neyman and Pearson to introduce the key notion of an optimal N-P test: one that minimizes β subject to a fixed α, i.e.

[iii] Selecting the particular d(X), in conjunction with a rejection region for a given α, that has the highest pre-data capacity (1 − β) to detect discrepancies in the direction of H₁.

In this sense, the N-P proposed solutions to the perceived weaknesses of Fisher's significance testing changed the original Fisher framework in four important respects:

¥ (a) N-P testing takes place within a prespecified M_θ(x), by partitioning it into the null and alternative hypotheses,


(b) Fisher's post-data p-value has been replaced with a pre-data α,
(c) the combination of (a)-(b) renders ascertainable the pre-data capacity (power) of the test to detect discrepancies in the direction of H₁,
(d) Fisher's evidential aspirations for the p-value based on discordance were replaced by the more behavioristic accept/reject decision rules that aim to provide input for alternative actions: when you accept H₀ take one action, when you reject H₀ take another.

As claimed by Neyman and Pearson (1928):

"the tests themselves give no final verdict, but as tools help the worker who is using them to form his final decision."

6.1 The archetypal N-P hypotheses specification

In N-P testing there has been a long debate concerning the proper way to specify the null and alternative hypotheses. It is often thought to be rather arbitrary and subject to abuse. This view stems primarily from inadequate understanding of the role of statistical vs. substantive information.

¥ It is argued here that there is nothing arbitrary about specifying the null and alternative hypotheses in N-P testing. The default alternative is always the complement to the null relative to the parameter space of M_θ(x). Consider a generic statistical model:

M_θ(x) = {f(x; θ), θ∈Θ}, x∈X := ℝⁿ_X.

The archetypal way to specify the null and alternative hypotheses for N-P testing is:

H₀: θ∈Θ₀ vs. H₁: θ∈Θ₁,   (13)

where Θ₀ and Θ₁ constitute a partition of the parameter space Θ:

Θ₀ ∩ Θ₁ = ∅,  Θ₀ ∪ Θ₁ = Θ.

In the case where the set Θ₀ or Θ₁ contains a single point, say Θ₀ = {θ₀}, which determines f(x; θ₀), the hypothesis is said to be simple; otherwise it is composite.

Example. For the simple Bernoulli model, H₀: θ = θ₀ is simple, but H₁: θ ≠ θ₀ is composite.


¥ An equivalent formulation, which brings out most clearly the fact that N-P testing poses questions pertaining to the 'true' statistical Data Generating Mechanism (DGM) M*(x) = {f*(x)}, x∈X, is to rewrite (13) in the equivalent form:

H₀: f*(x) ∈ M₀(x) vs. H₁: f*(x) ∈ M₁(x),

where f*(x) denotes the 'true' distribution of the sample, and:

M₀(x) = {f(x; θ), θ∈Θ₀},  M₁(x) = {f(x; θ), θ∈Θ₁}, x∈X := ℝⁿ_X,

constitute a partition of M_θ(x); P(x) denotes the set of all possible statistical models that could have given rise to data x₀ := (x₁, x₂, ..., xₙ).

[Diagram: the set P(x) of all possible models, with M_θ(x) partitioned into H₀ and H₁; N-P testing within M_θ(x).]

¥ When properly understood in the context of a statistical model M_θ(x), N-P testing leaves very little leeway in the specification of the H₀ and H₁ hypotheses. This is because the whole of Θ [all possible values of θ] is relevant on statistical grounds, despite the fact that only a small subset is often deemed relevant on substantive grounds. The reasoning underlying this argument is that N-P tests invariably pose hypothetical questions — framed in terms of θ∈Θ — pertaining to the true statistical DGM M*(x) = {f*(x)}, x∈X. Partitioning M_θ(x) provides the most effective way to learn from data by zeroing in on the true θ, in the same way one can zero in on an unknown prespecified city on a map using sequential partitioning.

¥ This foregrounds the importance of securing the statistical adequacy of M_θ(x) [validating its probabilistic assumptions] before it provides the basis of any inference. Hence, whenever the union of Θ₀ and Θ₁, the values of θ postulated by H₀ and H₁ respectively, does not

exhaust Θ, there is always the possibility that the true θ lies in the complement: Θ − (Θ₀ ∪ Θ₁).

What constitutes an N-P test?

A misleading idea that needs to be done away with immediately is the notion that an N-P test is a formula with some associated statistical tables! It is a combination of a test statistic, whose sampling distribution under the null and alternative hypotheses is tractable, in conjunction with a rejection region.

Test statistic. Like an estimator, an N-P test statistic is a mapping from the sample space (X := ℝⁿ_X) to the real line:

d(·): X → ℝ,

that partitions the sample space into an acceptance region C₀ and a rejection region C₁:

C₀ ∩ C₁ = ∅,  C₀ ∪ C₁ = X,

in a way that corresponds to the partition (Θ₀, Θ₁) of Θ:

X = { C₀ ↔ Θ₀ ; C₁ ↔ Θ₁ } = Θ.

The N-P decision rules, based on a test statistic d(X), take the form:

[i] if x₀∈C₀, accept H₀;  [ii] if x₀∈C₁, reject H₀,   (14)

and the associated error probabilities are:

type I: P(x₀∈C₁; H₀(θ) true) = α(θ), for θ∈Θ₀,
type II: P(x₀∈C₀; H₁(θ₁) true) = β(θ₁), for θ₁∈Θ₁.

                       True State of Nature
N-P rule        H₀ true          H₀ false
Accept H₀       ✓                Type II error
Reject H₀       Type I error     ✓

£ Some misinformed statistics books go to great lengths to explain that the above error probabilities should be viewed as conditional on H₀ or H₁, using the misleading notation (|):

P(x₀∈C₁ | H₀(θ) true) = α(θ), for θ∈Θ₀.

Such conditioning makes no sense in a frequentist framework because θ is treated as an unknown constant. Instead, the error probabilities should be interpreted as evaluations of the sampling distribution of the test statistic under different scenarios pertaining to different hypothesized values of θ.

¥ Some statistics books make a big deal of the distinction 'accept H₀' vs. 'fail to reject H₀', in their commendable attempt to bring out the problem of misinterpreting 'accept H₀' as tantamount to 'there is evidence for H₀'; this is known as the fallacy of acceptance. By the same token, there is an analogous distinction between 'reject H₀' vs. 'accept H₁', which highlights the problem of misinterpreting 'reject H₀' as tantamount to 'there is evidence for H₁'; this is known as the fallacy of rejection. What is objectionable about this practice is that the verbal distinctions by themselves do nothing but perpetuate these fallacies; we need to address them, not play clever with words!

Table 3 - The simple (one parameter) Normal model
Statistical GM: Xₜ = μ + uₜ, t∈ℕ
[1] Normal: Xₜ ~ N(·, ·)
[2] Constant mean: E(Xₜ) = μ, for all t∈ℕ
[3] Constant variance: Var(Xₜ) = σ², known, for all t∈ℕ
[4] Independence: {Xₜ, t∈ℕ} is an independent process.

Example 1. Consider the following hypotheses in the context of the simple (one parameter) Normal model:

H₀: μ ≤ μ₀ vs. H₁: μ > μ₀.   (15)

Let us choose a test T_α := {d(X), C₁(α)} of the form:

test statistic: d(X) = √n(X̄ₙ − μ₀)/σ, where X̄ₙ = (1/n)Σₜ₌₁ⁿ Xₜ,
rejection region: C₁(α) = {x: d(x) > c_α}.   (16)

To evaluate the two types of error probabilities, the distribution of d(X) under both the null and the alternatives is needed:

[I] d(X) = √n(X̄ₙ − μ₀)/σ ~ N(0, 1) under H₀ (μ = μ₀),
[II] d(X) = √n(X̄ₙ − μ₀)/σ ~ N(δ₁, 1) under H₁ (μ = μ₁), for all μ₁ > μ₀,   (17)


where the non-zero mean δ₁ takes the form:

δ₁ = √n(μ₁ − μ₀)/σ > 0, for all μ₁ > μ₀.

The evaluation of the type I error probability is based on [I]:

α = max_{μ≤μ₀} P(d(X) > c_α; H₀(μ)) = P(d(X) > c_α; μ = μ₀).

Notice that in cases where the null is of the form H₀: μ ≤ μ₀, the type I error probability is defined as the maximum over all values of μ in the interval (−∞, μ₀], which turns out to be the one evaluated at the last point of the interval, μ = μ₀. Hence, the same test will be optimal also in the case where the null is of the form H₀: μ = μ₀.

The evaluation of type II error probabilities is based on [II]:

β(μ₁) = P(d(X) ≤ c_α; H₁(μ₁)), for all μ₁ > μ₀.

The power [the probability of rejecting the null when false] is equal to 1 − β(μ₁), i.e.

P(μ₁) = P(d(X) > c_α; H₁(μ₁)), for all μ₁ > μ₀.

Why do we care about power? The power P(μ₁) measures the pre-data (generic) capacity of test T_α := {d(X), C₁(α)} to detect a discrepancy, say γ = μ₁ − μ₀ = 1, when present. Hence, when P(μ₁) = .35 this test has very low capacity to detect such a discrepancy.

How is this information helpful in practice? If γ = 1 is the discrepancy of substantive interest, this test is practically useless for that purpose, because we know beforehand that it does not have enough capacity to detect γ even if present!


What can one do in such a case? The power of the above test is monotonically increasing with δ₁ = √n(μ₁ − μ₀)/σ, and thus increasing the sample size n increases the power.

Hypothesis testing reasoning.

¥ As with Fisher's significance testing, the reasoning underlying N-P testing is hypothetical, in the sense that both error probabilities are based on evaluating the relevant test statistic under several hypothetical scenarios, e.g. under H₀ (μ = μ₀) or under H₁ (μ = μ₁) for all μ₁ > μ₀. These hypothetical sampling distributions are then used to compare H₀ or H₁, via d(x₀), to the True State of Nature (TSN) represented by data x₀. This is in contrast to frequentist estimation, which is factual in nature, i.e. the sampling distributions of estimators are evaluated under the TSN, i.e. θ = θ*.

Significance level vs. p-value

It is important to note that there is a mathematical relationship between the type I error probability (significance level) and the p-value. Placing them side by side:

P(type I error): P(d(X) > c_α; μ = μ₀) = α,
p-value: P(d(X) > d(x₀); μ = μ₀) = p(x₀),   (18)

it becomes obvious that:

(a) they are both specified in terms of the distribution of the same test statistic d(X) under H₀,
(b) they both evaluate the probability of tail events of the form {x: d(x) > c}, x∈X, but
(c) they differ in terms of the threshold: c_α vs. d(x₀); in that sense α is a pre-data and p(x₀) a post-data error probability.

In light of (a)-(c), three comments are called for. First, the p-value can be viewed as the smallest significance level α at which H₀ would have been rejected with data x₀. For that reason the p-value is often referred to as the observed significance level. Hence, it should come as no surprise to learn that the above N-P decision rules (14) can be recast in terms of the p-value:

[i]* if p(x₀) > α, accept H₀;  [ii]* if p(x₀) ≤ α, reject H₀.


Indeed, practitioners often prefer to use the rules [i]*-[ii]* because the p-value conveys additional information when it is not close to the pre-designated threshold α. For instance, rejecting H₀ with p(x₀) = .0001 seems more informative than just reporting that H₀ was rejected at α = .05.

¥ Second, the p-value seems to have an implicit alternative built into the choice of the tail area. For instance, a 2-sided alternative H₁: μ ≠ μ₀ would require one to evaluate P(|d(X)| > |d(x₀)|; μ = μ₀). How did Fisher get away with such an obvious breach of his own preaching? Speculating on that, his answer might have been: 2-sided evaluation of the p-value makes no sense because, post-data, the sign of the test statistic d(x₀) determines the direction of departure!

¥ Third, neither the significance level nor the p-value can be interpreted as probabilities attached to particular values of μ associated with H₀ or H₁, since the probabilities in (18) are firmly attached to the sample realizations x∈X. Indeed, attaching probabilities to the unknown constant μ makes no sense in frequentist statistics. As argued by Fisher (1921), p. 25: "We may discuss the probability of occurrence of quantities which can be observed or deduced from observations, in relation to any hypotheses which may be suggested to explain these observations. We can know nothing of the probability of hypotheses..."

This scotches another two erroneous interpretations of the p-value:

£ p(x₀) = .05 does not mean that there is a 5% chance of a Type I error (i.e. a false positive).
£ p(x₀) = .05 does not mean that there is a 95% chance that the results would replicate if the study were repeated.

Example 1 (continued). Consider applying test T_α := {d(X), C₁(α)} for: μ₀ = 10, σ = 1, n = 100, x̄ₙ = 10.175, α = .05 ⇒ c_α = 1.645:

d(x₀) = √n(x̄ₙ − μ₀)/σ = √100(10.175 − 10)/1 = 1.75 > 1.645 ⇒ Reject H₀.

The p-value is: P(d(X) > 1.75; μ = μ₀) = .04.

Standard Normal tables
One-sided values           Two-sided values
α = .100: c_α = 1.28       α = .100: c_α = 1.645
α = .050: c_α = 1.645      α = .050: c_α = 1.96
α = .025: c_α = 1.96       α = .025: c_α = 2.24
α = .010: c_α = 2.33       α = .010: c_α = 2.58
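The numbers in this example can be verified with a few lines of Python (scipy's norm supplies the N(0, 1) quantiles and tail areas):

```python
import numpy as np
from scipy.stats import norm

mu0, sigma, n, xbar, alpha = 10.0, 1.0, 100, 10.175, 0.05

c_alpha = norm.ppf(1 - alpha)             # one-sided threshold, 1.645
d = np.sqrt(n) * (xbar - mu0) / sigma     # test statistic, 1.75
p_value = norm.sf(d)                      # one-sided p-value, ~.04
print(f"d = {d:.3f}, reject H0: {d > c_alpha}, p = {p_value:.3f}")
```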


The trade-off between type I and II error probabilities

I How does the introduction of type I and II error probabilities address the arbitrariness associated with Fisher's choice of a test statistic? The ideal test is one whose error probabilities are zero, but no such test exists for a given n; an ideal test, like the ideal estimator, exists only as n→∞! Worse, there is a trade-off between the above error probabilities: as one increases, the other decreases, and vice versa.

Example. In the case of test T_α := {d(X), C₁(α)} in (16), decreasing the probability of type I error from α = .05 to α = .01 increases the threshold from c_α = 1.645 to c_α = 2.33, which makes it easier to accept H₀, and this increases the probability of type II error.

To address this trade-off Neyman and Pearson (1933) proposed:
(i) to specify H₀ and H₁ in such a way as to render the type I error the more serious one, and then
(ii) to choose an α-significance level test {d(X), C₁(α)} so as to maximize its power for all θ∈Θ₁.

N-P rationale. Consider the analogy with a criminal offense trial, where the jury are instructed by the judge to find the defendant 'not guilty' unless they have been convinced 'beyond any reasonable doubt' by the evidence:

H₀: not guilty, vs. H₁: guilty.

The clause 'beyond any reasonable doubt' amounts to fixing the type I error to a very small value, to reduce the risk of sending innocent people to prison, or even death. At the same time, one would want the system to minimize the risk of letting guilty people get off scot-free.

More formally, the choice of an optimal N-P test is based on fixing an upper bound α for the type I error probability:

P(x∈C₁; H₀(θ) true) ≤ α, for all θ∈Θ₀,

and then selecting a test {d(X), C₁(α)} that minimizes the type II error probability, or equivalently, maximizes the power:

P(θ) = P(x₀∈C₁; H₁(θ) true) = 1 − β(θ), for all θ∈Θ₁.

I The general rule is that in selecting an optimal N-P test the whole of the parameter space Θ is relevant. This is why partitioning both Θ and the sample space X := ℝⁿ_X using a test statistic d(X) provides the key to N-P testing.


Optimal properties of tests

The property that sets the gold standard for an optimal test is:

[1] Uniformly Most Powerful (UMP): A test T_α := {d(X), C₁(α)} is said to be UMP if it has higher power than any other α-level test T̃_α for all values θ∈Θ₁, i.e.

P(θ; T_α) ≥ P(θ; T̃_α), for all θ∈Θ₁.

Example. Test T_α := {d(X), C₁(α)} in (16) is UMP; see Lehmann (1986).

Additional properties of N-P tests

[2] Unbiasedness: A test T_α := {d(X), C₁(α)} is said to be unbiased if the probability of rejecting H₀ when false is always greater than that of rejecting H₀ when true, i.e.

max_{θ∈Θ₀} P(x₀∈C₁; H₀(θ)) ≤ P(x₀∈C₁; H₁(θ)), for all θ∈Θ₁.

[3] Consistency: A test T_α := {d(X), C₁(α)} is said to be consistent if its power goes to one for all θ∈Θ₁ as n→∞, i.e.

lim_{n→∞} P(θ) = P(x₀∈C₁; H₁(θ) true) = 1, for all θ∈Θ₁.

As in estimation, this is a minimal property for tests.

Evaluating power. How does one evaluate the power of test T_α := {d(X), C₁(α)} for different discrepancies γ = μ₁ − μ₀?

Step 1. In light of the fact that the relevant sampling distribution for the evaluation of P(μ₁) is:

[II] d(X) = √n(X̄ₙ − μ₀)/σ ~ N(δ₁, 1) under H₁ (μ = μ₁), for all μ₁ > μ₀,

the use of the N(0, 1) tables requires one to split the test statistic into:

√n(X̄ₙ − μ₀)/σ = √n(X̄ₙ − μ₁)/σ + δ₁, δ₁ = √n(μ₁ − μ₀)/σ,

since under H₁(μ₁) the distribution of the first component is:

√n(X̄ₙ − μ₁)/σ = [√n(X̄ₙ − μ₀)/σ − δ₁] ~ N(0, 1) under H₁ (μ = μ₁), for μ₁ > μ₀.

Step 2. Evaluating the power of the test T_α := {d(X), C₁(α)} for different discrepancies γ = μ₁ − μ₀, with δ₁ = √n(μ₁ − μ₀)/σ:

P(μ₁) = P(Z > −δ₁ + c_α; H₁(μ₁)),

which for μ₀ = 10, σ = 1, n = 100, α = .05 (c_α = 1.645) yields:

γ = .1, δ₁ = 1: P(10.1) = P(Z > −1 + 1.645) = .259,
γ = .2, δ₁ = 2: P(10.2) = P(Z > −2 + 1.645) = .639,
γ = .3, δ₁ = 3: P(10.3) = P(Z > −3 + 1.645) = .913,

where Z is a generic standard Normal r.v., i.e. Z ~ N(0, 1).
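This power calculation is easily reproduced in Python; a minimal sketch (the function name power is introduced here for illustration; it implements P(μ₁) = P(Z > c_α − δ₁) for the Normal model above):

```python
import numpy as np
from scipy.stats import norm

def power(mu1, mu0=10.0, sigma=1.0, n=100, alpha=0.05):
    """Power of the one-sided test T_alpha at the alternative mu1."""
    c_alpha = norm.ppf(1 - alpha)                  # 1.645 for alpha = .05
    delta1 = np.sqrt(n) * (mu1 - mu0) / sigma      # non-zero mean under H1
    return norm.sf(c_alpha - delta1)               # P(Z > c_alpha - delta1)

for mu1 in (10.1, 10.2, 10.3):
    print(f"P({mu1}) = {power(mu1):.3f}")          # approx .26, .64, .91
```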

The power of this test is typical of an optimal test, since P(μ₁) increases with the non-zero mean δ₁ = √n(μ₁ − μ₀)/σ, and thus:

(a) the power increases with the sample size n,
(b) the power increases with the discrepancy γ = (μ₁ − μ₀), and
(c) the power decreases with σ.

It is very important to emphasize three features of an optimal test. First, a test is not just a formula associated with particular statistical tables. It is a combination of a test statistic and a rejection region; hence the notation T_α := {d(X), C₁(α)}.

Second, the optimality of an N-P test is inextricably bound up with the optimality of the estimator the test statistic is based on. Hence, it is no accident that most optimal N-P tests are based on consistent, fully efficient and sufficient estimators.

Example. In the case of the simple (one parameter) Normal model, consider replacing X̄ₙ with the unbiased estimator μ̂₂ = (X₁ + Xₙ)/2 ~ N(μ, σ²/2). The resulting test T*_α := {d*(X), C*₁(α)}, where d*(X) = √2(μ̂₂ − μ₀)/σ and C*₁(α) = {x: d*(x) > c_α}, will not be optimal, because its power is much lower than that of T_α, and it is inconsistent! This is because the non-zero mean of its sampling distribution under μ = μ₁ is δ* = √2(μ₁ − μ₀)/σ, which does not change as n→∞.

Third, it is important to note that by changing the rejection region one can render an optimal N-P test useless! For instance, replacing the


rejection region of T_α := {d(X), C₁(α)} with:

C̄₁(α) = {x: d(x) < c_α},

the resulting test T̄_α := {d(X), C̄₁(α)} is practically useless, because it is biased and its power decreases as the discrepancy γ increases.

Example 2 - the Student's t test. Consider the 1-sided hypotheses:

(1-s): H₀: μ = μ₀ vs. H₁: μ > μ₀, or (1-s*): H₀: μ ≤ μ₀ vs. H₁: μ > μ₀,   (19)

in the context of the simple Normal model (table 1). In this case a UMP (consistent) test T_α := {τ(X), C₁(α)} exists. It is the well-known Student's t test, and takes the form:

test statistic: τ(X) = √n(X̄ₙ − μ₀)/s, s² = (1/(n−1)) Σₜ₌₁ⁿ (Xₜ − X̄ₙ)²,
rejection region: C₁(α) = {x: τ(x) > c_α},   (20)

where c_α can be evaluated using the Student's t tables. To evaluate the two types of error probabilities, the distribution of τ(X) under both the null and the alternatives is needed:

[I]* τ(X) = √n(X̄ₙ − μ₀)/s ~ St(n−1) under H₀ (μ = μ₀),
[II]* τ(X) = √n(X̄ₙ − μ₀)/s ~ St(δ₁; n−1) under H₁ (μ = μ₁), for all μ₁ > μ₀,   (21)

where δ₁ = √n(μ₁ − μ₀)/σ is the non-centrality parameter.

Student's t tables (df = 60)
One-sided values           Two-sided values
α = .100: c_α = 1.296      α = .100: c_α = 1.671
α = .050: c_α = 1.671      α = .050: c_α = 2.000
α = .010: c_α = 2.390      α = .010: c_α = 2.660
α = .001: c_α = 3.232      α = .001: c_α = 3.460

In summary, the main components of a Neyman-Pearson (N-P) test {τ(X), C₁(α)} are given below.


Components of a Neyman-Pearson (N-P) test
(i) a prespecified statistical model: M_θ(x),
(ii) a null (H₀) and an alternative (H₁) hypothesis within M_θ(x),
(iii) a test statistic (distance function) d(X),
(iv) the distribution of d(X) under H₀,
(v) a prespecified significance level α [.01, .025, .05],
(vi) the rejection region C₁(α),
(vii) the distribution of d(X) under H₁.

6.2 The N-P Lemma and its extensions

The cornerstone of the Neyman-Pearson (N-P) approach is the Neyman-Pearson lemma. Contemplate the simple generic statistical model:

M_θ(x) = {f(x; θ), θ∈Θ := {θ₀, θ₁}}, x∈ℝⁿ_X,   (22)

and consider the problem of testing the simple hypotheses:

H₀: θ = θ₀ vs. H₁: θ = θ₁.   (23)

¥ The fact that the assumed parameter space is Θ := {θ₀, θ₁}, so that (23) constitutes a partition, is often left out of most statistics textbook discussions of this famous lemma!

Existence. There exists an α-significance level Uniformly Most Powerful (UMP) [α-UMP] test, whose generic form is:

d(X) = h( f(x; θ₁)/f(x; θ₀) ), C₁(α) = {x: d(x) > c_α},   (24)

where h(·) is a monotone function.

Sufficiency. If an α-level test of the form (24) exists, then it is UMP for testing (23).

Necessity. If {d(X), C₁(α)} is an α-UMP test, then it will be given by (24).

At first sight the N-P lemma seems rather contrived, because it is an existence result for a simple statistical model M_θ(x) whose parameter space Θ := {θ₀, θ₁} is artificial, but it fits perfectly into the archetypal formulation. In addition, to implement this result one would need to find h(·) yielding a test statistic d(X) whose distribution is known under


both H₀ and H₁. Worse, this lemma is often misconstrued as suggesting that for an α-UMP test to exist one needs to confine testing to simple-vs-simple cases, even when Θ is uncountable!

¥ The truth of the matter is that the construction of an α-UMP test in more realistic cases has nothing to do with simple-vs-simple hypotheses; instead it is invariably based on the archetypal N-P testing formulation in (13), and relies primarily on monotone likelihood ratios and other features of the prespecified statistical model M_θ(x).

Example. To illustrate these comments consider the simple-vs-simple hypotheses:

(i) (1-1): H₀: μ = μ₀ vs. H₁: μ = μ₁,   (25)

in the context of the simple Normal (one parameter) model (table 3). Applying the N-P lemma requires setting up the ratio:

f(x; μ₁)/f(x; μ₀) = exp{ (n/σ²)(μ₁ − μ₀)X̄ₙ − (n/(2σ²))(μ₁² − μ₀²) },   (26)

which is clearly not a test statistic as it stands. However, there exists a monotone function h(·) which transforms (26) into a familiar test statistic (Spanos, 1999, pp. 708-9):

d(X) = h( f(x; μ₁)/f(x; μ₀) ) = [ (σ/(√n(μ₁ − μ₀))) ln( f(x; μ₁)/f(x; μ₀) ) + δ₁/2 ] = √n(X̄ₙ − μ₀)/σ,

with ascertainable type I and II error probabilities based on:

d(X) ~ N(0, 1) under μ = μ₀, d(X) ~ N(δ₁, 1) under μ = μ₁, δ₁ = √n(μ₁ − μ₀)/σ.

In this particular case it is clear that the Neyman-Pearson lemma does not apply, because the hypotheses in (25) do not constitute a partition of the parameter space Θ = ℝ, as in (13). However, it turns out that when the test statistic d(X) = √n(X̄ₙ − μ₀)/σ is combined with information relating to the values μ₀ and μ₁, it can provide the basis for constructing several optimal tests (Lehmann, 1986):

[1] For μ₁ > μ₀, the test T_α^> := {d(X), C₁^>(α)}, where C₁^>(α) = {x: d(x) > c_α}, is α-UMP for the hypotheses:

(ii) (1-s≥): H₀: μ ≤ μ₀ vs. H₁: μ > μ₀,
(iii) (1-s>): H₀: μ = μ₀ vs. H₁: μ > μ₀.


Despite the difference between (ii) and (iii), the relevant error probabilities coincide. The type I error probability for (ii) is defined as the maximum over all μ ≤ μ₀, but it coincides with that of (iii) (μ = μ₀):

α = max_{μ≤μ₀} P(d(X) > c_α; H₀(μ)) = P(d(X) > c_α; μ = μ₀).

Similarly, the p-value is the same, because:

p(x₀) = max_{μ≤μ₀} P(d(X) > d(x₀); H₀(μ)) = P(d(X) > d(x₀); μ = μ₀).

[2] For μ₁ < μ₀, the test T_α^< := {d(X), C₁^<(α)}, where C₁^<(α) = {x: d(x) < c_α}, is α-UMP for the hypotheses:

(iv) (1-s≤): H₀: μ ≥ μ₀ vs. H₁: μ < μ₀,
(v) (1-s<): H₀: μ = μ₀ vs. H₁: μ < μ₀.

Again, the type I error probability and p-value are defined by:

α = max_{μ≥μ₀} P(d(X) < c_α; H₀(μ)) = P(d(X) < c_α; μ = μ₀),
p(x₀) = max_{μ≥μ₀} P(d(X) < d(x₀); H₀(μ)) = P(d(X) < d(x₀); μ = μ₀).

The existence of these α-UMP tests extends the N-P lemma to more realistic cases by invoking two regularity conditions:

¥ [A] The ratio (26) is a monotone function of the statistic X̄ₙ, in the sense that for any two values μ₁ > μ₀, f(x; μ₁)/f(x; μ₀) changes monotonically with X̄ₙ. This implies that f(x; μ₁)/f(x; μ₀) > k if and only if X̄ₙ > c, for some constants k and c. This regularity condition is valid for most statistical models of interest in practice, including the one parameter Exponential family of distributions [Normal, Gamma, Beta, Binomial, Negative Binomial, Poisson, etc.], the Uniform, the Exponential, the Logistic, the Hypergeometric, etc.; see Lehmann (1986).

¥ [B] The parameter space under H₁, say Θ₁, is convex, i.e. for any two values (θ₁, θ₂)∈Θ₁, their convex combinations λθ₁ + (1−λ)θ₂ ∈ Θ₁, for any 0 ≤ λ ≤ 1.

When convexity does not hold, as with the 2-sided alternative:

(vi) (2-s): H₀: μ = μ₀ vs. H₁: μ ≠ μ₀,

[3] the test T_α^≠ := {d(X), C₁^≠(α)}, where C₁^≠(α) = {x: |d(x)| > c_{α/2}}, is α-UMPU (Unbiased); the α-level and p-value are:

α = P(|d(X)| > c_{α/2}; μ = μ₀), q(x₀) = P(|d(X)| > |d(x₀)|; μ = μ₀).


6.3 Constructing optimal tests: Likelihood Ratio

The likelihood ratio (LR) test procedure can be viewed as a generalization of the Neyman-Pearson lemma to more realistic cases where the null and/or the alternative are composite hypotheses.

(a) The hypotheses of interest are of the general form:

H₀: θ∈Θ₀ vs. H₁: θ∈Θ₁.

(b) The test statistic is related to the 'likelihood' ratio:

λ(X) = max_{θ∈Θ} L(θ; X) / max_{θ∈Θ₀} L(θ; X) = L(θ̂; X)/L(θ̃; X).   (27)

Note that the max in the numerator is over all θ∈Θ [yielding the MLE θ̂], but that of the denominator is confined to the values under H₀, θ∈Θ₀ [yielding the constrained MLE θ̃].

(c) The rejection region is defined in terms of a transformed ratio, where h(·) is chosen to yield a known sampling distribution under H₀:

C₁(α) = {x: ψ(X) = h(λ(X)) > c_α}.   (28)

In terms of procedures to construct optimal tests, the LR procedure can be shown to yield several optimal tests; see Lehmann (1986). Moreover, in cases where no such h(·) can be found, one can use the asymptotic distribution, which states that under certain restrictions (Wilks, 1938):

2 ln λ(X) = 2[ ln L(θ̂; X) − ln L(θ̃; X) ] ~ χ²(r) asymptotically under H₀,

where r denotes the number of restrictions involved in Θ₀.

Example. Consider testing the hypotheses:

H₀: σ² = σ₀² vs. H₁: σ² ≠ σ₀²,

in the context of the simple Normal model (table 1). From the previous chapter we know that the Maximum Likelihood method gives rise to:

max_{θ∈Θ} L(θ; x) = L(θ̂; x) ⇒ μ̂ = X̄ₙ = (1/n)Σₜ₌₁ⁿ Xₜ and σ̂² = (1/n)Σₜ₌₁ⁿ (Xₜ − X̄ₙ)²,
max_{θ∈Θ₀} L(θ; x) = L(θ̃; x) ⇒ μ̃ = X̄ₙ and σ̃² = σ₀².


Moreover, the two estimated likelihoods are:

max_{θ∈Θ} L(θ; x) = (2πσ̂²)^(−n/2) exp{−n/2},
max_{θ∈Θ₀} L(θ; x) = (2πσ₀²)^(−n/2) exp{−nσ̂²/(2σ₀²)}.

Hence, the likelihood ratio is:

λ(X) = L(θ̂; X)/L(θ̃; X) = [ (σ̂²/σ₀²) exp{ 1 − (σ̂²/σ₀²) } ]^(−n/2).

The function h(·) that transforms this ratio into a test statistic is such that:

ψ(X) = h(λ(X)) = nσ̂²/σ₀² ~ χ²(n−1) under H₀,

with a rejection region of the form:

C₁ = {x: ψ(X) ≤ c₁ or ψ(X) ≥ c₂},

where, for a size α test, the constants c₁, c₂ are chosen in such a way that:

∫₀^{c₁} f(v) dv = ∫_{c₂}^∞ f(v) dv = α/2,

with f(v) the chi-square density function. The test defined by {ψ(X), C₁(α)} turns out to be UMP Unbiased (UMPU); see Lehmann (1986).
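A short sketch of this variance test in Python (the helper name lr_variance_test is introduced here for illustration; it uses the equal-tail thresholds as a common simplification, since the exact UMPU constants differ slightly):

```python
import numpy as np
from scipy.stats import chi2

def lr_variance_test(x, sigma0_sq, alpha=0.05):
    """Two-sided chi-square test of H0: sigma^2 = sigma0^2 (simple Normal model)."""
    x = np.asarray(x, float)
    n = len(x)
    sigma_hat_sq = np.mean((x - x.mean()) ** 2)   # MLE of sigma^2 (1/n convention)
    psi = n * sigma_hat_sq / sigma0_sq            # ~ chi-square(n-1) under H0
    c1 = chi2.ppf(alpha / 2, df=n - 1)            # lower equal-tail threshold
    c2 = chi2.ppf(1 - alpha / 2, df=n - 1)        # upper equal-tail threshold
    return psi, bool(psi <= c1 or psi >= c2)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.2, size=50)                 # hypothetical data, sigma = 1.2
psi, reject = lr_variance_test(x, sigma0_sq=1.0)
print(f"psi = {psi:.2f}, reject H0: {reject}")
```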

6.4 Substantive vs. statistical significance in N-P testing

We consider two substantive values of interest that concern the ratio of Boys (B) to Girls (G) in newborns:

Arbuthnot: #B = #G,  Bernoulli: 18B to 17G.   (29)

Statistical hypotheses. The Arbuthnot and Bernoulli conjectured values for the proportion can be embedded into a simple Bernoulli model (table 2), where θ = P(X=1) = P(Boy):

Arbuthnot: θ_A = 1/2,  Bernoulli: θ_B = 18/35.   (30)

How should one proceed to appraise θ_A and θ_B vis-a-vis the above data? The archetypal specification (13) suggests probing each value separately, using the difference (θ_B − θ_A) as the discrepancy of interest.


Despite the fact that on substantive grounds the only relevant values of θ are (θ_A, θ_B), on statistical inference grounds the rest of the parameter space is relevant; M_θ(x) provides the relevant inductive premises.

This argument, however, has been questioned in the literature on the grounds that N-P testing should probe these two values using simple-vs-simple hypotheses. After all, the argument goes, the Neyman-Pearson lemma would secure an α-UMP test! This argument shows insufficient understanding of the Neyman-Pearson lemma, because the parameter space of the simple Bernoulli model is not Θ = {θ_A, θ_B} but Θ = [0, 1]. Moreover, the fact that the simple Bernoulli model has a monotone likelihood ratio, i.e.:

f(x; θ₁)/f(x; θ₀) = ((1−θ₁)/(1−θ₀))ⁿ · [θ₁(1−θ₀)/(θ₀(1−θ₁))]^Y, Y = Σₖ₌₁ⁿ Xₖ,

for (θ₀, θ₁) s.t. 0 < θ₀ < θ₁ < 1, which is a monotonically increasing function of the statistic Y = Σₖ₌₁ⁿ Xₖ, since θ₁(1−θ₀)/(θ₀(1−θ₁)) > 1, ensures the existence of several α-UMP tests.

In particular, [i] T_α^> := {d(X), C₁^>(α)}:

d(X) = √n(X̄ₙ − θ₀)/√(θ₀(1−θ₀)) ~ Bin(0, 1; n) under H₀, C₁^>(α) = {x: d(x) > c_α},   (31)

is an α-UMP test for the N-P hypotheses:

(ii) (1-s≥): H₀: θ ≤ θ₀ vs. H₁: θ > θ₀,
(iii) (1-s>): H₀: θ = θ₀ vs. H₁: θ > θ₀,

and [ii] T_α^< := {d(X), C₁^<(α)}:

d(X) = √n(X̄ₙ − θ₀)/√(θ₀(1−θ₀)) ~ Bin(0, 1; n) under H₀, C₁^<(α) = {x: d(x) < c_α},   (32)

is an α-UMP test for the N-P hypotheses:

(iv) (1-s≤): H₀: θ ≥ θ₀ vs. H₁: θ < θ₀,
(v) (1-s<): H₀: θ = θ₀ vs. H₁: θ < θ₀.

These results stem from the fact that the formulations (ii)-(v) ensure the convexity of the parameter space under H₁. Moreover, in the case of the two-sided hypotheses:

(i) (2-s): H₀: θ = θ₀ vs. H₁: θ ≠ θ₀,


the test [iii] T_α^≠ := {d(X), C₁^≠(α)}:

d(X) = √n(X̄ₙ − θ₀)/√(θ₀(1−θ₀)) ~ Bin(0, 1; n) under H₀, C₁^≠(α) = {x: |d(x)| > c_{α/2}},   (33)

is α-UMP Unbiased; see Lehmann (1986).

¥ Randomization? No! In cases where the test statistic has a discrete sampling distribution under H₀, as in (32)-(33), one might not be able to attain a prespecified α exactly. The traditional way of dealing with this problem is to randomize, but this 'solution' raises more problems than it solves. Instead, the best way to address the discreteness issue is to select a value of α that is attainable for the given n, or to approximate the discrete distribution with a continuous one to circumvent the problem. In practice there are several continuous distributions one can use, depending on whether the original distribution is symmetric or not. In the above case, even for moderately small sample sizes, say n = 20, the Normal distribution provides an excellent approximation for values of θ around .5; see the graph below. With n = 30762 the approximation is nearly exact.

[Distribution plot: Binomial(n = 20, θ = .5) overlaid with its Normal approximation (mean 10, st. dev. 2.236). Normal approx. of Binomial f(y; θ = .5, n = 20).]
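The quality of this approximation can also be checked directly; a minimal sketch comparing exact Binomial tail areas with their continuity-corrected Normal counterparts for n = 20, θ = .5 (the particular cutoffs y are illustrative only):

```python
import numpy as np
from scipy.stats import binom, norm

n, theta = 20, 0.5
mean, sd = n * theta, np.sqrt(n * theta * (1 - theta))   # 10, 2.236

# Compare the exact Binomial tail with its Normal approximation.
for y in (13, 14, 15):
    exact = binom.sf(y - 1, n, theta)          # P(Y >= y)
    approx = norm.sf(y - 0.5, mean, sd)        # with continuity correction
    print(f"y = {y}: exact = {exact:.4f}, normal approx = {approx:.4f}")
```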

What about the choice between one-sided and two-sided hypotheses? The answer depends crucially on whether there exists reliable substantive information that renders part of the parameter space Θ irrelevant for both substantive and statistical purposes.

6.5 N-P testing of Arbuthnot’s value

In this case there is reliable substantive information that the probability of a newborn being male, θ = P(X=1), is systematically slightly above 1/2; the human sex ratio at birth is approximately θ = .512 (Hardy,


2002). Hence, a statistically more informative way to assess θ = .5 might be to use one-sided (1-s) directional probing.

Data: n = 30762 newborns during the period 1993-5 in Cyprus, out of which 16029 were boys and 14833 girls. In view of the huge sample size it is advisable to choose a smaller significance level, say α = .01 ⇒ c_α = 2.326.

Case 1. For testing Arbuthnot's value θ₀ = 1/2 the relevant hypotheses are:

(1-s>): H₀: θ = 1/2 vs. H₁: θ > 1/2, or (1-s*≤): H₀: θ ≤ 1/2 vs. H₁: θ > 1/2,   (34)

and T_α^> := {d(X), C₁^>(α)} is an α-UMP test. In view of the fact that:

d(x₀) = √30762(16029/30762 − 1/2)/√(.5(.5)) = 7.389, p(x₀) = P(d(X) > 7.389; H₀) = .0000,   (35)

the null hypothesis H₀: θ = 1/2 is strongly rejected.

What if one were to ignore the substantive information and apply a 2-sided test?

Case 2. For testing Arbuthnot's value this takes the form:

(2-s): H₀: θ = 1/2 vs. H₁: θ ≠ 1/2,   (36)

and T_α^≠ := {d(X), C₁^≠(α)} in (33) is an α-UMPU test; for α = .01, c_{α/2} = 2.576. In view of the fact that:

d(x₀) = √30762(16029/30762 − 1/2)/√(.5(.5)) = 7.389 > 2.576,
q(x₀) = P(|d(X)| > 7.389; H₀) = .0000,

the null hypothesis H₀: θ = 1/2 is strongly rejected.

6.6 N-P testing of Bernoulli’s value

The first question we need to answer concerns the appropriate form of the alternative in testing Bernoulli's conjecture θ = 18/35. Using the same substantive information indicating that θ is systematically slightly above 1/2 (Hardy, 2002), it seems reasonable to use a 1-sided alternative in the direction of 1/2.

Case 1. For probing Bernoulli's value the relevant hypotheses are:

(1-s<): H₀: θ = 18/35 vs. H₁: θ < 18/35.   (37)


The optimal (UMP) test for (37) is T_α^< := {d(X), C₁^<(α)}, where for α = .01 ⇒ c_α = −2.326. Using the same data:

d(x₀) = √30762(16029/30762 − 18/35)/√((18/35)(1 − 18/35)) = 2.379, p(x₀) = P(d(X) < 2.379; H₀) = .991,

so the null hypothesis H₀: θ = 18/35 is accepted with a high p-value.

What if one were to ignore the substantive information as pertaining only to Arbuthnot's value, and apply a 2-sided test?

Case 2. For testing Bernoulli's value using the 2-sided probing:

(2-s): H₀: θ = 18/35 vs. H₁: θ ≠ 18/35,   (38)

based on the UMPU test T_α^≠ := {d(X), C₁^≠(α)} in (33):

d(x₀) = √30762(16029/30762 − 18/35)/√((18/35)(1 − 18/35)) = 2.379 < 2.576,
q(x₀) = P(|d(X)| > 2.379; H₀) = .0174,   (39)

which yields accepting H₀, but the p-value is considerably smaller than p(x₀) = .991 for the 1-sided test.

The question that arises is: which one is the relevant p-value? Indeed, a moment's reflection suggests that these two are not the only choices. There is also the p-value for:

(1-s>): H₀: θ = 18/35 vs. H₁: θ > 18/35,   (40)

which yields a much smaller p-value:

p(x₀) = P(d(X) > 2.379; H₀) = .009,   (41)

reversing the previous results and leading to rejecting H₀.
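The three competing p-values can be reproduced under the Normal approximation of (33); a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

n, boys = 30762, 16029
theta0 = 18 / 35
d = np.sqrt(n) * (boys / n - theta0) / np.sqrt(theta0 * (1 - theta0))  # ~2.379

p_less = norm.cdf(d)            # (1-s<): H1: theta < 18/35  -> ~.991
p_two = 2 * norm.sf(abs(d))     # (2-s):  H1: theta != 18/35 -> ~.017
p_greater = norm.sf(d)          # (1-s>): H1: theta > 18/35  -> ~.009
print(f"d = {d:.3f}: p< = {p_less:.3f}, p!= = {p_two:.3f}, p> = {p_greater:.3f}")
```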

¥ This raises the question: on what basis should one choose the relevant p-value? The short answer, 'on substantive grounds', seems questionable because, first, it ignores the fact that the whole of the parameter space is relevant on statistical grounds, and second, one applies a test to assess the validity of substantive information, not to foist it on the data at the outset. Even more questionable is the argument that one should choose a simple alternative on substantive grounds in order to apply the Neyman-Pearson lemma that guarantees an α-UMP test; this is based on a misapprehension of the lemma. The longer answer, which takes into account the fact that post-data τ(x₀) is known, and thus there is often a clear direction of departure indicated by data x₀ associated with the sign of τ(x₀), will be elaborated upon in section 8.


7 Foundational issues raised by N-P testing

Let us now appraise how successful the N-P framework was in addressing the perceived weaknesses of Fisher's testing:

[a] his ad hoc choice of a test statistic d(X),
[b] his use of a post-data [τ(x₀) is known] threshold for the p-value that indicates discordance (reject) with H₀, and
[c] his denial that p(x₀) can indicate accordance with H₀.

Very briefly, the N-P framework partly succeeded in addressing [a] and [c], but did not provide a coherent evidential account that answers the basic question (Mayo, 1996):

when do data x₀ provide evidence for or against a hypothesis or a claim H?   (42)

This is primarily because one could not interpret the decision results 'Accept H₀' ('Reject H₀') as data x₀ providing evidence for (against) H₀ [and thus, in the latter case, evidence for H₁]. Why?

(i) A particular test T_α := {d(X), C₁(α)} could have led to 'Accept H₀' because the power of that test to detect an existing discrepancy γ was very low; the test had no generic capacity to detect the existing discrepancy. This can easily happen when the sample size n is small.

(ii) A particular test T_α := {d(X), C₁(α)} could have led to 'Reject H₀' simply because the power of that test was high enough to detect 'trivial' discrepancies from H₀, i.e. the generic capacity of the test was extremely high. This can easily happen when the sample size n is very large.

In light of the fact that the N-P framework did not provide a bridge between the accept/reject rules and what scientific inquiry is after, i.e. when data x₀ provide evidence for or against a hypothesis H, it should come as no surprise to learn that most practitioners sought such answers by trying, unsuccessfully (and misleadingly), to distill an evidential interpretation out of Fisher's p-value. In their eyes the pre-designated α is equally vulnerable to manipulation [pre-designation to keep practitioners 'honest' is unenforceable in practice], but the p-value is more informative and data-specific.

Is Fisher's p-value better at answering the question in (42)? No. A small (large) — relative to a certain threshold — p-value could not be interpreted as evidence for the presence (absence) of a substantive


discrepancy γ, for the same reasons as (i)-(ii). The generic capacity of the test (its power) affects the p-value. For instance, a very small p-value can easily arise in the case of a very large sample size n.

Fallacies of Acceptance and Rejection

The issues with the accept/reject0 and p-value results raised abovecan be formalized into two classic fallacies.(a) The fallacy of acceptance: no evidence against 0 is misinterpretedas evidence for 0.This fallacy can easily arise in cases where the test in question has

low power to detect discrepancies of interest.(b) The fallacy of rejection: evidence against 0 is misinterpreted asevidence for a particular 1.This fallacy arises in cases where the test in question has high power

to detect substantively minor discrepancies. Since the power of a testincreases with the sample size this renders N-P rejections, as well astiny p-values, with large highly susceptible to this fallacy.In the statistics literature, as well as in the secondary literatures in

several applied fields, there have been numerous attempts to circumvent these two fallacies, but none succeeded.
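To make (a) and (b) concrete, here is a minimal numerical sketch (an added illustration, not from the original slides; the helper name and the sample sizes are assumptions) of how the generic capacity of the one-sided Bernoulli test scales with n, using the Normal approximation to its power. A substantively trivial discrepancy γ=.001 is nearly undetectable for moderate n, yet detected almost surely for huge n:

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf
theta0, c_alpha = 0.5, 2.326   # one-sided test with alpha = .01

def power(theta1, n):
    """pi(theta1) = P(d(X) > c_alpha; theta = theta1), Normal approximation."""
    delta = sqrt(n) * (theta1 - theta0) / sqrt(theta0 * (1 - theta0))
    v = theta1 * (1 - theta1) / (theta0 * (1 - theta0))
    return 1 - Phi((c_alpha - delta) / sqrt(v))

# Tiny discrepancy (theta1 = .501): negligible power for moderate n
# (fallacy of acceptance), near-certain rejection for huge n (fallacy of rejection).
for n in [10_000, 1_000_000, 100_000_000]:
    print(n, round(power(0.501, n), 3))   # -> .017, .372, 1.0
```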

8 Post-data severity evaluation

A moment's reflection about the inability of the accept/reject H0 and p-value results to provide a satisfactory answer to the question in (42) suggests that the problem arises primarily when such results are detached from the test itself and treated as providing the same evidence for a particular hypothesis H (H0 or H1), regardless of the generic capacity (the power) of the test in question. That is, whether data x0 provide evidence for or against a particular hypothesis H (H0 or H1) depends crucially on the generic capacity (power) of the test in question to detect discrepancies from H0. This stems from the intuition that a small p-value or a rejection of H0 based on a test with low power (e.g. a small n) for detecting a particular discrepancy γ provides stronger evidence for the presence of that discrepancy than using a test with much higher power (e.g. a large n). Mayo (1996) proposed a frequentist evidential account based on harnessing this intuition in the form of a post-data


severity evaluation of the accept/reject results. This is based on custom-tailoring the generic capacity of the test to establish the discrepancy γ warranted by data x0. This evidential account can be used to circumvent the above fallacies, as well as other (misplaced) charges against frequentist testing.

The severity evaluation is a post-data appraisal of the accept/reject and p-value results that revolves around the discrepancy γ from H0 warranted by data x0.

I A hypothesis H passes a severe test Tα with data x0 if:
(S-1) x0 accords with H, and
(S-2) with very high probability, test Tα would have produced a result that accords less well with H than x0 does, if H were false.

Severity can be viewed as a feature of a test Tα as it relates to a particular data x0 and a specific claim H being considered. Hence, the severity function has three arguments, SEV(Tα, x0, H), denoting the severity with which H passes Tα with x0; see Mayo and Spanos (2006).

Example. To explain how the above severity evaluation can be

applied in practice, let us return to the problem of assessing the two substantive values of interest:

Arbuthnot: θ = 1/2,   Bernoulli: θ = 18/35.

When these values are viewed in the context of the error statistical perspective, it becomes clear that the way to frame the probing is to choose one of the values as the null hypothesis, and let the difference between them (18/35 − 1/2) represent the discrepancy of substantive interest.

Data: n=30762 newborns during the period 1993-5 in Cyprus, of which 16029 are boys and 14833 girls.

Let us return to probing the Arbuthnot value based on:

(1-s): H0: θ = 1/2 vs. H1: θ > 1/2,

that gave rise to the observed test statistic:

d(x0) = √30762 (16029/30762 − 1/2)/√(.5(.5)) = 7.389,   (43)

leading to a rejection of H0 at α=.01 ⇒ cα=2.326. Indeed, H0 would have also been rejected by a (2-s) test, since cα/2=2.576 < 7.389.

An important feature of the severity evaluation is that it is post-data, and thus the sign of the observed test statistic d(x0) provides


directional information that indicates the directional inferential claims that 'passed'. In relation to the above example, the severity 'accordance' condition (S-1) implies that the rejection of H0: θ = 1/2 with d(x0)=7.389 > 0 indicates that the inferential claim that 'passed' is of the generic form:

θ1 = θ0 + γ, for some γ ≥ 0.   (44)

The directional feature of the severity evaluation is very important in addressing several criticisms of N-P testing, including the potential arbitrariness and possible abuse of:
[a] switching between one-sided, two-sided or simple-vs-simple hypotheses,
[b] interchanging the null and alternative hypotheses, and
[c] manipulating the level of significance in an attempt to get the desired testing result.

To establish the particular discrepancy γ warranted by data x0, the

severity post-data 'discordance' condition (S-2) calls for evaluating the probability of the event "outcomes x that accord less well with θ1 than x0 does", i.e. [x: d(x) ≤ d(x0)], yielding:

SEV(Tα; θ > θ1) = P(x: d(x) ≤ d(x0); θ > θ1 is false).   (45)

The evaluation of this probability is based on the sampling distribution:

d(X) = √n(θ̂n − θ0)/√(θ0(1−θ0))  ~[θ=θ1]  Bin(δ(θ1), V(θ1); n), for θ1 > θ0,   (46)

where δ(θ1) = √n(θ1 − θ0)/√(θ0(1−θ0)) ≥ 0,  V(θ1) = θ1(1−θ1)/(θ0(1−θ0)),  0 < V(θ1) ≤ 1.

Since the hypothesis that 'passed' is of the form θ1=θ0+γ, the primary objective of SEV(Tα; θ > θ1) is to determine the largest discrepancy γ ≥ 0 warranted by data x0.

Table 4 evaluates SEV(Tα; θ > θ1) for different values of γ, with the evaluations based on the standardized d(X):

[d(X) − δ(θ1)]/√V(θ1)  ~[θ=θ1]  Bin(0, 1; n) ≈ N(0, 1).   (47)

For example, when θ1 = .5142:

√n(θ̂n − θ0)/√(θ0(1−θ0)) − δ(θ1) = 7.389 − √30762(.5142−.5)/√(.5(.5)) = 2.408,


and Φ(Z ≤ 2.408)=.992, where Φ(·) is the cumulative distribution function (cdf) of the standard Normal distribution. Note that the scaling √V(θ1)=.9996 ≈ 1 can be ignored.

Table 4: Reject H0: θ = .5 vs. H1: θ > .5

γ       Relevant claim θ > θ1=.5+γ    Severity P(x: d(X) ≤ d(x0); θ1)
.010    θ > .510                      .999
.013    θ > .513                      .997
.0142   θ > .5142                     .992
.0143   θ > .5143                     .991
.015    θ > .515                      .983
.016    θ > .516                      .962
.017    θ > .517                      .923
.018    θ > .518                      .859
.020    θ > .520                      .645
.021    θ > .521                      .500
.025    θ > .525                      .084

An evidential interpretation. Taking a very high probability, say .95, as a threshold, the largest discrepancy from the null warranted by these data is:

γ ≤ .01637, since SEV(Tα; θ > .51637) = .950.
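The Table 4 entries and the .95 threshold value γ=.01637 can be reproduced with a short numerical sketch (added here for illustration; the helper name severity_gt and the γ grid are assumptions, and the computation uses the Normal approximation in (47)):

```python
from math import sqrt
from statistics import NormalDist

# Cyprus data: n = 30762 newborns, 16029 boys; Arbuthnot null theta0 = .5
n, theta_hat, theta0 = 30762, 16029 / 30762, 0.5
d0 = sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))  # = 7.389, cf. (43)

def severity_gt(theta1):
    """SEV(theta > theta1) = P(d(X) <= d(x0); theta = theta1), via (46)-(47)."""
    delta = sqrt(n) * (theta1 - theta0) / sqrt(theta0 * (1 - theta0))
    v = theta1 * (1 - theta1) / (theta0 * (1 - theta0))
    return NormalDist().cdf((d0 - delta) / sqrt(v))

for gamma in [.0142, .0143, .015, .016, .017, .018, .020]:
    print(gamma, round(severity_gt(theta0 + gamma), 3))
# -> .992, .991, .983, .962, .923, .859, .645, matching the Table 4 rows
print(round(severity_gt(0.51637), 3))   # -> .95, the threshold discrepancy
```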

Is this discrepancy substantively significant? In general, to answer this question one needs to appeal to substantive subject matter information to assess the warranted discrepancy on substantive grounds. In human biology it is commonly accepted that the sex ratio at birth is approximately 105 boys to 100 girls; see Hardy (2002). Translating this ratio in terms of θ yields θ̄ = 105/205 = .5122, which suggests that the above warranted discrepancy γ = .01637 is substantively significant, since the warranted claim θ > .51637 exceeds θ̄.

In light of that, the substantive discrepancy of interest between θA and θB, γ* = (18/35) − .5 = .0142857, yields SEV(Tα; θ > .5143) = .991, indicating that there is excellent evidence for the claim θ1 = θ0 + γ*.

I In terms of the ultimate objective of statistical inference, learning from data, this evidential interpretation seems highly effective because it narrows down an infinite set Θ to a very small subset!


8.1 Revisiting issues pertaining to N-P testing

The above evidential interpretation based on the severity assessment can also be used to shed light on a number of issues raised in the previous sections.

(a) The arbitrariness of N-P hypotheses specification

The question that naturally arises at this stage is whether having good evidence for inferring θ > 18/35 depends on the particular way one has chosen to specify the hypotheses for the N-P test. Intuitively, one would expect that the evidence should not be dependent on the specification of the null and alternative hypotheses as such, but on x0 and the statistical model Mθ(x), x∈R^n_X.

To explore that issue, let us focus on probing the Bernoulli value H0: θ0 = 18/35, where the test statistic yields:

d(x0) = √30762 (16029/30762 − 18/35)/√((18/35)(1 − 18/35)) = 2.379.   (48)

This value leads to rejecting H0 at α=.01 when one uses a (1-s) test [cα=2.326], but accepting H0 when one uses a (2-s) test [cα/2=2.576]. Which result should one believe?

Irrespective of these conflicting results, the severity 'accordance' condition (S-1) implies that, in light of the observed test statistic d(x0)=2.379 > 0, the directional claim that 'passed' on the basis of (48) is of the form:

θ1 = θ0 + γ.

To evaluate the particular discrepancy warranted by data x0, condition (S-2) calls for the evaluation of the probability of the event 'outcomes x that accord less well with θ1 than x0 does', i.e. [x: d(x) ≤ d(x0)], giving rise to (45), but with a different θ0.

Table 5 provides several such evaluations of SEV(Tα; θ > θ1) for different values of γ; note that these evaluations are also based on (46). A number of negative values of γ are included in order to address the original substantive hypotheses of interest, as well as to bring out the fact that the results of tables 4 and 5 are identical when viewed as inferences pertaining to θ; they simply have different null values θ0, and thus different discrepancies γ, but identical θ1 values.


The severity evaluations in tables 4 and 5 not only render the choice between (2-s), (1-s) and simple-vs-simple irrelevant, they also scotch the well-rehearsed argument pertaining to the asymmetry between the null and alternative hypotheses. The N-P convention of selecting a small α is generally viewed as reflecting a strong bias against rejection of the null. On the other hand, Bayesian critics of frequentist testing argue the exact opposite: N-P testing is biased against accepting the null; see Lindley (1993). The evaluations in tables 4 and 5 demonstrate that the evidential interpretation of frequentist testing based on severe testing addresses all such asymmetries, notional or real.

Table 5: Accept/Reject H0: θ = 18/35 vs. H1: θ ≷ 18/35

γ        Relevant claim θ > θ1=.5143+γ    Severity P(x: d(x) ≤ d(x0); θ1)
−.0043   θ > .510                         .999
−.0013   θ > .513                         .997
−.0001   θ > .5142                        .992
.0000    θ > .5143                        .991
.0007    θ > .515                         .983
.0017    θ > .516                         .962
.0027    θ > .517                         .923
.0037    θ > .518                         .859
.0057    θ > .520                         .645
.0067    θ > .521                         .500
.0107    θ > .525                         .084
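The invariance of the severity assessment to the choice of null value can also be checked numerically; a self-contained sketch (added here, not from the slides; the θ1 grid is an illustrative assumption) reproduces representative Table 5 rows using the Bernoulli null θ0 = 18/35:

```python
from math import sqrt
from statistics import NormalDist

# Same Cyprus data, but with Bernoulli's value as the null: theta0 = 18/35
n, theta_hat, theta0 = 30762, 16029 / 30762, 18 / 35
d0 = sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))  # = 2.379, cf. (48)

for theta1 in [.5143, .517, .518]:
    delta = sqrt(n) * (theta1 - theta0) / sqrt(theta0 * (1 - theta0))
    v = theta1 * (1 - theta1) / (theta0 * (1 - theta0))
    sev = NormalDist().cdf((d0 - delta) / sqrt(v))
    print(theta1, round(sev, 3))
# -> .991, .923, .859: the same theta1 rows as in Tables 4 and 5
```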

(b) Addressing the Fallacy of Rejection

The potential arbitrariness of the N-P specification of the null and alternative hypotheses, and of the associated p-values, is brought out in probing the Bernoulli value θ = 18/35 using different alternatives:

(1-s) H1: θ < 18/35:  p(x0) = P(d(X) < 2.379; θ0) = .991,
(2-s) H1: θ ≠ 18/35:  p(x0) = P(|d(X)| > 2.379; θ0) = .017,
(1-s) H1: θ > 18/35:  p(x0) = P(d(X) > 2.379; θ0) = .009.
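These three numbers are just the tail areas of N(0, 1) at the observed d(x0)=2.379 from (48); a quick check (an added illustration):

```python
from statistics import NormalDist

Phi = NormalDist().cdf
d0 = 2.379   # observed test statistic from (48)

print(round(Phi(d0), 3))            # .991  (1-s, H1: theta < 18/35)
print(round(2 * (1 - Phi(d0)), 3))  # .017  (2-s, H1: theta != 18/35)
print(round(1 - Phi(d0), 3))        # .009  (1-s, H1: theta > 18/35)
```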

Can the above severity evaluation explain away these highly conflicting and confusing results?


The first choice, H1: θ < 18/35, was driven solely by substantive information relating to θA = 1/2, which is often a bad idea because it ignores the statistical dimension of inference. The second choice was based on lack of information about the direction of departure, which makes sense pre-data, but not post-data. The third choice reflects the post-data direction of departure indicated by d(x0)=2.379 > 0. In light of that, the severity evaluation renders p(x0)=.009 the relevant p-value, after all!

I The key problem with the p-value. In addition, the severity

evaluation brings out the vulnerability of both the N-P reject rule and the p-value to the fallacy of rejection in cases where n is large. Viewing the relevant p-value (p(x0)=.009) from the severity vantage point, it is directly related to θ1 passing a severe test, because the probability that test Tα would have produced a result that accords less well with θ1 than x0 does (x: d(x) ≤ d(x0)), if θ1 were false (θ0 true):

Sev(Tα; x0; θ > θ0) = P(d(X) ≤ d(x0); θ ≤ θ0) = 1 − P(d(X) > d(x0); θ=θ0) = .991,

is very high; Mayo (1996). The key problem with the p-value, however, is that it establishes the existence of some discrepancy γ ≥ 0, but provides no information concerning the magnitude licensed by x0. The severity evaluation remedies that by relating Sev(Tα; x0; θ > θ0)=.991 to the inferential claim θ > 18/35, with the implicit discrepancy associated with the p-value being γ=0.

(c) Addressing the Fallacy of Acceptance

Example. Let us return to the hypotheses:

(1-s): H0: θ = 1/2 vs. H1: θ > 1/2,

in the context of the simple Bernoulli model (table 2), and consider applying test Tα

with α=.025 ⇒ cα=1.96 to the case where data x0 are based on n=2000 and yield θ̂n=.503:

d(x0) = √2000(.503−.5)/√(.5(.5)) = .268,   p(x0) = P(d(X) > .268; θ0) = .394.

d(x0) = .268 leads to accepting H0, and the p-value indicates no rejection of H0. Does this mean that data x0 provide evidence for H0? Or, more ac-

curately, does x0 provide evidence for no substantive discrepancy from θ0? Not necessarily!


To answer that question one needs to apply the post-data severe testing reasoning to evaluate the discrepancy γ ≥ 0, for θ1=θ0+γ, warranted by data x0. Test Tα:={d(X), C1(α)} passes H0, and the idea is to establish the smallest discrepancy γ ≥ 0 from θ0 warranted by data x0 by evaluating (post-data) the claim θ ≤ θ1, for θ1=θ0+γ.

Condition (S-1) of severity is satisfied since x0 accords with H0, because d(x0) < cα, but (S-2) requires one to evaluate the probability of "all outcomes x for which test Tα accords less well with H0 than x0 does", i.e. (x: d(x) > d(x0)), under the hypothetical scenario that "θ ≤ θ1 is false", or equivalently "θ > θ1 is true":

Sev(Tα; x0; θ ≤ θ1) = P(x: d(x) > d(x0); θ > θ1), for γ ≥ 0.   (49)

The evaluation of (49) relies on the sampling distribution (46), and for different discrepancies (γ ≥ 0) yields the results in table 6. Note that Sev(Tα; x0; θ ≤ θ1) is evaluated at θ=θ1 because the probability increases with θ.

The numerical evaluations are based on the standardized d(X) from

(47). For example, when θ1 = .515:

√n(θ̂n − θ0)/√(θ0(1−θ0)) − δ(θ1) = .268 − √2000(.515−.5)/√(.5(.5)) = −1.074,

and P(Z > −1.074) = .859, where Z ~ N(0, 1) with cdf Φ(·). Note that the scaling factor √V(θ1) = √(.515(1−.515)/(.5(.5))) = .9996 can be ignored because its value is so close to 1.

Table 6: Severity Evaluation of 'Accept H0' with (Tα; x0)

Relevant claim: θ ≤ θ1 = θ0 + γ
γ            .001  .005  .01   .015  .017  .0173  .018  .02   .025  .03
Sev(θ ≤ θ1)  .429  .571  .734  .859  .895  .900   .910  .936  .975  .992

Assuming a severity threshold of, say, .90 is high enough, the above results indicate that the 'smallest' discrepancy warranted by x0 is γ ≥ .0173, since:

Sev(Tα; x0; θ ≤ .5173) = .9.
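Table 6 and the γ=.0173 threshold can likewise be reproduced numerically; a minimal sketch (added for illustration; the helper name sev_le and the γ grid are assumptions), evaluating at θ=θ1 as noted above:

```python
from math import sqrt
from statistics import NormalDist

# Accept-H0 example: n = 2000, theta_hat = .503, theta0 = .5
n, theta_hat, theta0 = 2000, 0.503, 0.5
d0 = sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))   # = 0.268

def sev_le(gamma):
    """Sev(theta <= theta0 + gamma) = P(d(X) > d(x0); theta = theta0 + gamma)."""
    theta1 = theta0 + gamma
    delta = sqrt(n) * (theta1 - theta0) / sqrt(theta0 * (1 - theta0))
    v = theta1 * (1 - theta1) / (theta0 * (1 - theta0))
    return 1 - NormalDist().cdf((d0 - delta) / sqrt(v))

for gamma in [.001, .005, .01, .015, .0173, .02, .03]:
    print(gamma, round(sev_le(gamma), 3))
# -> .429, .571, .734, .859, .900, .936, .992, matching Table 6
```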


That is, data x0 in conjunction with test Tα provide evidence (with severity at least .9) for the presence of a discrepancy as large as γ ≥ .0173. Is this discrepancy substantively insignificant? No!

As mentioned above, the accepted ratio at birth in human biology is θ̄=.512, which suggests that the warranted discrepancy γ ≥ .0173 is, indeed, substantively significant, since it outputs θ ≥ .5173, which exceeds θ̄.

This is an example where the post-data severity evaluation can be used to address the fallacy of acceptance. The severity assessment indicates that the statistically insignificant result d(x0)=.268 at α=.025 actually provides evidence against H0: θ ≤ .5 and for θ ≥ .5173, with severity at least .90.

(d) The arbitrariness of choosing the significance level

It is often argued by critics of frequentist testing that the choice of the significance level α in the context of the N-P approach, or of the threshold for the p-value in the case of Fisherian significance testing, is totally arbitrary and manipulable.

To see how the severity evaluations can address this problem, let us try to 'manipulate' the original significance level (α=.01) to a different one that would alter the accept/reject results for H0: θ=18/35. Looking back at the results for Bernoulli's conjectured value using the data from Cyprus, it is easy to see that choosing a larger significance level, say α=.02 ⇒ cα=2.053, would lead to rejecting H0 for both N-P formulations:

(2-s) (H1: θ ≠ 18/35) and (1-s) (H1: θ > 18/35),

since d(x0)=2.379 > 2.053:

(2-s): p(x0) = P(|d(X)| > 2.379; θ0) = .0174,
(1-s): p(x0) = P(d(X) > 2.379; θ0) = .009.

How does this change the original severity evaluations? The short answer is: it doesn't! The severity evaluations remain invariant to any changes in α because they depend on d(x0)=2.379 > 0, and not on α, which indicates that the generic form of the hypothesis that 'passed' is θ1=θ0+γ, and the same evaluations apply.


9 Summary and conclusions

In frequentist inference, learning from data x0 about the stochastic phenomenon of interest is accomplished by applying optimal inference procedures with ascertainable error probabilities in the context of a statistical model:

Mθ(x) = {f(x; θ), θ∈Θ}, x∈R^n_X.   (50)

Hypothesis testing gives rise to learning from data by partitioning Mθ(x) into two subsets:

M0(x) = {f(x; θ), θ∈Θ0} or M1(x) = {f(x; θ), θ∈Θ1}, x∈R^n_X,   (51)

framed in terms of the parameter(s):

H0: θ∈Θ0 vs. H1: θ∈Θ1,   (52)

and inquiring x0 for evidence in favor of one of the two subsets in (51) using hypothetical reasoning.

Note that one could specify (52) more perceptively as:

H0: f*(x) ∈ M0(x) = {f(x; θ), θ∈Θ0} vs. H1: f*(x) ∈ M1(x) = {f(x; θ), θ∈Θ1},

where f*(x) = f(x; θ*) denotes the 'true' distribution of the sample. This notation elucidates the basic question posed in hypothesis testing:

I Given that the true M*(x) lies within the boundaries of Mθ(x), can x0 be used to narrow it down to a smaller subset M0(x)?

A test Tα:={d(X), C1(α)} is defined in terms of a test statistic

(d(X)) and a rejection region (C1(α)), and its optimality is calibrated in terms of the relevant error probabilities:

type I:  P(x0∈C1; H0(θ) true) ≤ α, for θ∈Θ0,
type II: P(x0∈C0; H1(θ1) true) = β(θ1), for θ1∈Θ1.

These error probabilities specify how often these procedures lead to erroneous inferences, and thus determine the pre-data capacity (power) of the test in question to give rise to valid inferences:

π(θ1) = P(x0∈C1; H1(θ1) true) = 1 − β(θ1), for all θ1∈Θ1.

An inference is reached by an inductive procedure which, with high probability, will reach true conclusions from valid (or approximately


valid) premises (statistical model). Hence, the trustworthiness of frequentist inference depends on two interrelated pre-conditions:
(a) adopting optimal inference procedures, in the context of
(b) a statistically adequate model.

I How does hypothesis testing relate to the other forms of frequentist inference?

It turns out that testing is inextricably bound up with estimation (point and interval), in the sense that an optimal estimator is often the cornerstone of an optimal test and an optimal confidence interval. For instance, consistent, fully efficient and minimal sufficient estimators provide the basis for most of the optimal tests and confidence intervals!

Having said that, optimal point estimation is considered rather non-controversial, but optimal interval estimation and hypothesis testing have raised several foundational issues that have bedeviled frequentist inference since the 1930s. In particular, the use and abuse of the relevant error probabilities calibrating these latter forms of inductive inference have contributed significantly to the general confusion and controversy when applying these procedures.

Addressing foundational issues

The first such issue concerns the presupposition of the (approximately) true premises, i.e. that the true M*(x) lies within the boundaries of Mθ(x). This can be addressed by securing the statistical adequacy of Mθ(x) vis-à-vis data x0 using trenchant Mis-Specification (M-S) testing, before applying frequentist testing; see chapter 15. As shown above, when Mθ(x) is misspecified, the nominal and actual error probabilities can be very different, rendering the reliability of the test in question at best unknown and at worst highly misleading. That is, statistical adequacy ensures the ascertainability of the relevant error probabilities, and hence the reliability of inference.

The second issue concerns the form and nature of the evidence x0 can provide for θ∈Θ0 or θ∈Θ1. Neither Fisher's p-value nor the N-P accept/reject rules can provide such an evidential interpretation, primarily because they are vulnerable to two serious fallacies:
(c) Fallacy of acceptance: no evidence against the null is misinterpreted as evidence for it.
(d) Fallacy of rejection: evidence against the null is misinterpreted as evidence for a specific alternative.


As discussed above, these fallacies can be circumvented by supplementing the accept/reject rules (or the p-value) with a post-data evaluation of inference based on severe testing, with a view to determining the discrepancy γ from the null warranted by data x0. This establishes the inferential claims warranted [and thus the unwarranted ones] in particular cases, so that it gives rise to learning from data.

The severity perspective sheds ample light on the various controversies relating to p-values vs. type I and II error probabilities, delineating that the p-value is a post-data error probability while the type I and II are pre-data error probabilities, and that, although inadequate for the task of furnishing an evidential interpretation, they were intended to fulfill complementary roles. Pre-data error probabilities, like the type I and II, are used to appraise the generic capacity of testing procedures; post-data error probabilities, like severity, are used to bridge the gap between the coarse 'accept/reject' results and the evidence provided by data x0 for the warranted discrepancy from the null, by custom-tailoring the pre-data capacity in light of d(x0).

This refinement/extension of the Fisher-Neyman-Pearson framework can be used to address not only the above mentioned fallacies, but several criticisms that have beleaguered frequentist testing since the 1930s, including:
(e) the arbitrariness of selecting the thresholds,
(f) the framing of the null and alternative hypotheses in the context of a statistical model Mθ(x), and
(g) the numerous misinterpretations of the p-value.


10 Appendix: Revisiting Confidence Intervals (CIs)

Mathematical Duality between Testing and CIs

From a purely mathematical perspective, a point estimator is a mapping g(·) of the form:

g(·): R^n_X → Θ.

Similarly, an interval estimator is also a mapping h(·) of the form:

h(·): R^n_X → Θ, such that h⁻¹(θ) = {x: θ∈h(x)} ⊂ F, for θ∈Θ.

Hence, one can assign a probability to h⁻¹(θ) = {x: θ∈h(x)} in the sense that:

P(x: θ∈h(x); θ=θ*) := P(θ∈h(X); θ=θ*)

denotes the probability that the random set h(X) contains θ*, the true value of θ. The expression 'θ* denotes the true value of θ' is a shorthand for saying that 'data x0 constitute a realization of the sample X with distribution f(x; θ*)'. One can define a (1−α) Confidence Interval (CI) for θ by attaching a lower bound to this probability:

P(θ∈h(X); θ=θ*) ≥ (1−α).

Neyman (1937) utilized a mathematical duality between N-P Hypothesis Testing (HT) and Confidence Intervals (CIs) to put forward an optimal theory for the latter. The α-level test of the hypotheses:

H0: θ=θ0 vs. H1: θ≠θ0

takes the form {d(X; θ0), C1(θ0; α)}, with d(X; θ0) the test statistic and the rejection region defined by:

C1(θ0; α) = {x: |d(x; θ0)| > cα/2} ⊂ R^n_X.

The complement of the latter with respect to the sample space (R^n_X) defines the acceptance region:

C0(θ0; α) = {x: |d(x; θ0)| ≤ cα/2} = R^n_X − C1(θ0; α).

The mathematical duality between HT and CIs stems from the fact that for each x∈R^n_X one can define a corresponding set A(x; α) on the parameter space by:

A(x; α) = {θ: x∈C0(θ; α)}, θ∈Θ, e.g. A(x; α) = {θ: |d(x; θ)| ≤ cα/2}.


Conversely, for each A(x; α) there is a CI corresponding to the acceptance region C0(θ0; α) such that:

C0(θ0; α) = {x: θ0∈A(x; α)}, x∈R^n_X, e.g. C0(θ0; α) = {x: |d(x; θ0)| ≤ cα/2}.

This equivalence is encapsulated by:

x∈C0(θ0; α) ⇔ θ0∈A(x; α), for each x∈R^n_X and each θ0∈Θ.

When the lower bound L(X) and upper bound U(X) of a CI, i.e.

P(L(X) ≤ θ ≤ U(X)) = 1−α,

are monotone increasing functions of X, the equivalence is:

θ∈[L(x), U(x)] ⇔ x∈[U⁻¹(θ), L⁻¹(θ)].
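This equivalence is easy to verify numerically. The following sketch (an added illustration with assumed numbers: known σ, n=100, x̄n=.13, two-sided α=.05) checks that μ0 is accepted exactly when it falls inside the observed CI:

```python
from math import sqrt

# Assumed illustration values: known sigma, two-sided alpha = .05 (c = 1.96)
n, sigma, xbar, c = 100, 1.0, 0.13, 1.96

def accepts(mu0):
    """x0 in C0(mu0; alpha): |sqrt(n)(xbar - mu0)/sigma| <= c."""
    return abs(sqrt(n) * (xbar - mu0) / sigma) <= c

def in_ci(mu0):
    """mu0 in A(x; alpha): the observed CI [xbar -/+ c*sigma/sqrt(n)]."""
    half = c * sigma / sqrt(n)
    return xbar - half <= mu0 <= xbar + half

# The duality: x in C0(mu0; alpha) <=> mu0 in A(x; alpha), for every mu0.
for mu0 in [-0.1, 0.0, 0.13, 0.3, 0.33]:
    assert accepts(mu0) == in_ci(mu0)
```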

Optimality of Confidence Intervals

The mathematical duality between HT and CIs is particularly useful in defining an optimal CI that is analogous to an optimal N-P test. The key result is that the acceptance region of an α-significance level test that is Uniformly Most Powerful (UMP) gives rise to a (1−α) CI that is Uniformly Most Accurate (UMA), i.e. the error coverage probability that A(x; α) contains a value θ ≠ θ0, where θ0 is assumed to be the true value, is minimized; Lehmann (1986). Similarly, an α-UMPU test gives rise to a UMAU Confidence Interval.

Moreover, it can be shown that such optimal CIs have the shortest length in the following sense: if [L(x), U(x)] is a (1−α)-UMAU CI, then its length [width] is less than or equal to that of any other (1−α) CI, say [L̃(x), Ũ(x)]:

E_θ*(U(x) − L(x)) ≤ E_θ*(Ũ(x) − L̃(x)).

Underlying reasoning: CI vs. HT

Does this mathematical duality render HT and CIs equivalent in terms of their respective inference results? No!


The primary source of the confusion on this issue is that they share a common pivotal function, and their error probabilities are evaluated using the same tail areas. But that's where the analogy ends.

To bring out how their inferential claims and respective error probabilities are fundamentally different, consider the simple (one parameter) Normal model (table 3), where the common pivot is:

d(X; μ) = √n(X̄n − μ)/σ ~ N(0, 1).   (53)

Since μ is unknown, (53) is meaningless as it stands, and it is crucial to understand how the CI and HT procedures render it meaningful using different types of reasoning, referred to as factual and hypothetical, respectively:

d(X; μ*) = √n(X̄n − μ*)/σ ~[TSN] N(0, 1),   d(X) = √n(X̄n − μ0)/σ ~[μ=μ0] N(0, 1).   (54)

Putting the (1−α) CI side-by-side with the (1−α) acceptance region:

P(X̄n − cα/2(σ/√n) ≤ μ ≤ X̄n + cα/2(σ/√n); μ=μ*) = 1−α,
P(μ0 − cα/2(σ/√n) ≤ X̄n ≤ μ0 + cα/2(σ/√n); μ=μ0) = 1−α,   (55)

it is clear that their inferential claims are crucially different:

[a] The CI claims that with probability (1−α) the random bounds [X̄n ± cα/2(σ/√n)] will cover (overlay) the true μ* [whatever that happens to be], under the TSN (True State of Nature).

[b] The test claims that with probability (1−α) test Tα would yield a result that accords equally well or better with H0 (x: |d(x; μ0)| ≤ cα/2), if H0 were true.

Hence, their claims pertain to the parameter space vs. the sample space, and the evaluation under the TSN (μ=μ*) vs. under the null (μ=μ0) renders the two statements inferentially very different. The idea that a given (known) μ0 and the true μ* are interchangeable is an illusion.

This can be seen most clearly in the simple Bernoulli case, where σ needs to be estimated using θ̂n = (1/n)Σ(k=1 to n) Xk := X̄n for the UMAU (1−α) CI:

P(X̄n − cα/2 √(X̄n(1−X̄n))/√n ≤ θ ≤ X̄n + cα/2 √(X̄n(1−X̄n))/√n; θ=θ*) = 1−α,


but the acceptance region of the α-UMPU test is defined in terms of the hypothesized value θ0:

P(θ0 − cα/2 √(θ0(1−θ0))/√n ≤ X̄n ≤ θ0 + cα/2 √(θ0(1−θ0))/√n; θ=θ0) = 1−α.

The fact is that there is only one θ* and an infinity of values θ0. As a result, the HT hypothetical reasoning is equally well-defined pre-data and post-data. In contrast, there is only one TSN, which has already played out post-data, rendering the coverage error probability degenerate, i.e. the observed CI:

[x̄n − cα/2(σ/√n), x̄n + cα/2(σ/√n)]   (56)

either includes or excludes the true value μ*.

Observed Confidence Intervals and Severity

In addition to addressing the fallacies of acceptance and rejection, the post-data severity evaluation can be used to address the issue of degenerate post-data error probabilities and the inability to distinguish between different values of μ within an observed CI; Mayo and Spanos (2006). This is accomplished by making two key changes:

(a) Replace the factual reasoning underlying CIs, which becomes degenerate post-data, with the hypothetical reasoning underlying the post-data severity evaluation.

(b) Replace any claims pertaining to overlaying the true μ* with severity-based inferential claims of the form:

μ ≤ μ1 = μ0 + γ or μ > μ1 = μ0 + γ, for some γ ≥ 0.   (57)

This can be achieved by relating the observed bounds:

x̄n ± cα/2(σ/√n)   (58)

to particular values of μ1 associated with (57), say μ1 = x̄n − cα/2(σ/√n), and evaluating the post-data severity of the claim:

μ > μ1 = x̄n − cα/2(σ/√n).
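To see the numerical coincidence at work, here is a worked step (added for illustration, under the Normal setup of (53)): placing μ1 on the observed lower bound, μ1 = x̄n − cα/2(σ/√n), gives d(x0; μ1) = √n(x̄n − μ1)/σ = cα/2, so evaluating under μ=μ1 yields

SEV(μ > μ1) = P(d(X; μ1) ≤ d(x0; μ1); μ=μ1) = Φ(cα/2) = 1 − α/2,

i.e. a number that looks like the 1-sided confidence level.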

A moment's reflection, however, suggests that the connection between establishing the warranted discrepancy from the null value μ0 and the observed CI (58) is more apparent than real, because the severity evaluation probability is attached to the inferential claim (57), and not to the particular value μ1. Hence, at best, one would be creating the illusion that the severity assessment on the boundary of a 1-sided observed CI is related to the confidence level; it is not! The truth of the matter is that the inferential claim (57) does not pertain directly to μ*, but to the warranted discrepancy γ in light of x0.

Fallacious arguments for using Confidence Intervals

A. Confidence Intervals are more reliable than p-values

It is often argued in the social science statistical literature that CIs are more reliable than p-values because the latter are vulnerable to the large n problem: as the sample size n→∞, the p-value becomes smaller and smaller, and hence one would reject any null hypothesis given enough observations. What these critics do not seem to realize is that a CI like:

P(X̄n − cα/2(σ/√n) ≤ μ ≤ X̄n + cα/2(σ/√n); μ=μ*) = 1−α

is equally vulnerable to the large n problem, because the length of the interval:

[X̄n + cα/2(σ/√n)] − [X̄n − cα/2(σ/√n)] = 2cα/2(σ/√n)

shrinks to zero as n→∞.
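A one-line numerical check (illustrative assumptions: cα/2=1.96, σ=1):

```python
from math import sqrt

# Width of the (1 - alpha) CI, 2 * c * sigma / sqrt(n), shrinks like 1/sqrt(n)
for n in [100, 10_000, 1_000_000]:
    print(n, round(2 * 1.96 / sqrt(n), 4))   # -> 0.392, 0.0392, 0.0039
```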

B. The middle of a Confidence Interval is more probable?

Despite the obvious fact that post-data the probability of coverage is either zero or one, there have been numerous recurring attempts in the literature to discriminate among the different values of μ within an observed interval:

"... the difference between population means is much more likely to be near the middle of the confidence interval than towards the extremes." (Gardner and Altman, 2000, p. 22)

This claim is clearly fallacious. A moment's reflection suggests that viewing all possible observed CIs (56) as realizations from the single distribution in (54) leads one to the conclusion that, unless the particular x̄n happens (by accident) to coincide with the true value μ*, values to the right or the left of x̄n will be more likely, depending on whether x̄n is to the left or the right of μ*. Given that μ* is unknown, however, such an evaluation cannot be made in practice with a single realization. This is illustrated below with a CI for μ in the case of the simple (one parameter) Normal model.


[Figure: twenty observed confidence intervals for μ from different sample realizations, illustrating that, with a single realization, one cannot say where within the observed interval the true μ* lies.]


C. Omnibus curves, consonance intervals and all that!

In addition to these obviously fallacious claims, there have been several more sophisticated attempts to relate the coverage error probability to the type I and II error probabilities, including the omnibus confidence curve (Birnbaum, 1961), the consonance interval curve (Kempthorne and Folks, 1971), and the p-value curve (Poole, 1987). However, all these attempts run into major conceptual befuddlements, primarily because they ignored the key difference in the underlying reasoning. All of them end up inadvertently switching from factual to hypothetical reasoning midstream, without realizing that this cannot be done while retaining the CI inferential claim.

As shown in Spanos (2005), in the case of a (1−2α) 2-sided CI there is a one-to-one mapping between α∈[0, 1] and cα∈R via:

([1 − Φ(cα)] = α) ⇒ (Φ⁻¹(1−α) = cα),   (59)

which gives rise to:

Φ⁻¹(1−α) = √n(x̄n − μ)/σ = d(x0; μ) ⇒ α(μ) = 1 − Φ(d(x0; μ)), for each μ∈R.   (60)

This mapping between μ∈R and α∈[0, 1] can be seen more clearly by:

(a) placing a particular value of μ, say μ1, on the boundary of the observed CI, i.e.

μ1 = x̄n − cα(σ/√n) ⇒ cα = √n(x̄n − μ1)/σ = d(x0; μ1),

(b) holding μ constant, and solving to show that for each μ there exists an α such that cα = d(x0; μ1). This confirms the relationship between α and the p-value, in the sense that the latter represents the smallest significance level at which the null would have been rejected.

That is, there is a one-to-one relationship between the different significance levels α [or cα, the associated rejection threshold] and the observed value of the test statistic for a corresponding value of μ.


11 Pioneers of statistical testing

Francis Edgeworth (1845-1926),  Karl Pearson (1857-1936),
William Gosset (1876-1937),  R. A. Fisher (1890-1962),
Jerzy Neyman (1894-1981),  Egon Pearson (1895-1980)


Relevant references:

I Mayo, D. G. and A. Spanos (2006), "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction," The British Journal for the Philosophy of Science, 57: 323-357.
I Mayo, D. G. and A. Spanos (2011), "Error Statistics," pp. 151-196 in the Handbook of Philosophy of Science, vol. 7: Philosophy of Statistics, D. Gabbay, P. Thagard, and J. Woods (editors), Elsevier.
I Spanos, A. (1999), Probability Theory and Statistical Inference: econometric modeling with observational data, Cambridge University Press, Cambridge.
I Spanos, A. (2006), "Where Do Statistical Models Come From? Revisiting the Problem of Specification," pp. 98-119 in Optimality: The Second Erich L. Lehmann Symposium, edited by J. Rojo, Lecture Notes-Monograph Series, vol. 49, Institute of Mathematical Statistics.
I Spanos, A. (2008), "Review of Stephen T. Ziliak and Deirdre N. McCloskey's The Cult of Statistical Significance," Erasmus Journal for Philosophy and Economics, 1: 154-164. http://ejpe.org/pdf/1-1-br-2.pdf.
I Spanos, A. (2011), "Misplaced Criticisms of Neyman-Pearson (N-P) Testing in the Case of Two Simple Hypotheses," Advances and Applications in Statistical Science, 6: 229-242.
I Spanos, A. (2012), "Revisiting the Berger Location Model: Fallacious Confidence Interval or a Rigged Example?" Statistical Methodology, 9: 555-561.
I Spanos, A. (2012), "A Frequentist Interpretation of Probability for Model-Based Inductive Inference," Synthese, 190: 1555-1585.
I Spanos, A. (2013), "Revisiting the Likelihoodist Evidential Account," Journal of Statistical Theory and Practice, 7: 187-195.
I Spanos, A. (2013), "Who Should Be Afraid of the Jeffreys-Lindley Paradox?" Philosophy of Science, 80: 73-93.
I Spanos, A. (2013), "The 'Mixed Experiment' Example Revisited: Fallacious Frequentist Inference or an Improper Statistical Model?" Advances and Applications in Statistical Sciences, 8: 29-47.
I Spanos, A. and A. McGuirk (2001), "The Model Specification Problem from a Probabilistic Reduction Perspective," Journal of the American Agricultural Association, 83: 1168-1176.
