Sociology 6Z03
Topic 14: Hypothesis Tests

John Fox

McMaster University

Fall 2016

John Fox (McMaster University) Soc 6Z03:Hypothesis Tests Fall 2016 1 / 44

Outline: Hypothesis Tests

Introduction

An Extended Example

The Procedure of Hypothesis Testing

One-Sided and Two-Sided Tests

“Statistical Significance”

Reporting P-Values

Hypothesis Tests and Confidence Intervals

Hypothesis Testing as a Decision Problem

The Power of the Test


Introduction

In constructing a confidence interval, we specify a range of values that plausibly might be thought to contain the parameter (such as a population mean) that we want to estimate.

In hypothesis testing, the second common procedure of classical statistical inference, we assess the strength of evidence against the proposition that a parameter is equal to some specific value.

As in the case of confidence intervals, the mechanics of statistical hypothesis testing are reasonably straightforward, but the rationale for the procedure, and the proper interpretation of the results, are more complex.


An Extended Example: An Imaginary Experiment

It is best to approach the logic of hypothesis testing by developing a simple example. Consider the following (made-up) experiment:

An educational researcher wants to know whether a new method of teaching statistics is superior to the old method.
Ten instructors who each teach two sections of an introductory statistics class are recruited into the study.
Each instructor has one of his or her sections assigned at random to the new teaching method; the other section is taught by the old method.
At the end of the study, the students in all sections of the course take a common exam.


An Extended Example

The average exam grade in each section and the difference between the new and old methods are as follows:

Instructor   New Method   Old Method   Difference
             Class Mean   Class Mean       xi
     1           94           71           23
     2           75           70            5
     3           75           58           17
     4           84           80            4
     5           79           70            9
     6           85           67           18
     7           73           66            7
     8           75           80           −5
     9           75           72            3
    10           83           70           13

The mean difference is x̄ = 9.4, and the standard deviation of the n = 10 differences is s = 8.38.
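The two summary statistics can be checked directly from the ten differences. The following is a minimal sketch in Python (the slides themselves use no software); the variable names are illustrative only.

```python
from statistics import mean, stdev

# Differences (new method minus old method) for the ten instructors
differences = [23, 5, 17, 4, 9, 18, 7, -5, 3, 13]

n = len(differences)
x_bar = mean(differences)   # sample mean of the differences
s = stdev(differences)      # sample standard deviation (divides by n - 1)

print(f"n = {n}, mean = {x_bar:.1f}, s = {s:.2f}")
```

Note that `stdev` uses the n − 1 denominator, which is what reproduces s = 8.38.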


An Extended Example: Cautions

Even a study this simple raises serious questions of interpretation: Suppose, as I show below, that there is strong evidence, beyond the expected level of chance fluctuations, that sections taught by the new method performed better.

It is possible that instructors were enthusiastic about the change and consequently taught better with the new method, even if that method has no intrinsic advantage.
Furthermore, although the new method has an advantage on average, some types of instructors might obtain better results with the old method; likewise, some students might learn better with the old method.
Finally, if we wish to generalize from these ten instructors (and their students) to some larger population, then the instructors in the study must be “representative” of the more general population, ideally by being drawn at random from the population.


An Extended Example: Establishing Whether the New Method is Better

We want to establish whether or not the data support the proposition that the new method is more effective on average than the old.

Let us call the population mean difference between the two methods µ.

If the new method is better than the old, then µ > 0.
If, on the other hand, the two methods are equally effective, then µ = 0.
Let us, for the moment, rule out the possibility that the old method is better, i.e., that µ < 0.


An Extended Example: The Null and Alternative Hypotheses

Hypothesis testing proceeds by assuming, for purposes of argument, that the two methods are equally effective. This is called the null hypothesis,

H0: µ = 0

Notice that the null hypothesis specifies a particular value for the parameter µ, namely the value zero.
The null hypothesis always specifies a particular value, but what that value is depends upon context.

The opposite of the null hypothesis is called the alternative hypothesis.

Here, the alternative hypothesis is that the new method is superior to the old one,

Ha: µ > 0


An Extended Example: Some Points About the Null and Alternative Hypotheses

Unlike the null hypothesis, the alternative hypothesis does not specify a particular value for the parameter µ.

This means that we cannot test the alternative hypothesis directly.

Instead, we will assess the strength of evidence against the null hypothesis.
If the null hypothesis is not supported by the data, then we will reject it in favour of the alternative hypothesis.
It is this “backwards” logic that makes hypothesis testing conceptually difficult.

Usually, our interest is in establishing the alternative hypothesis.

Assessing the strength of evidence against the null hypothesis is an indirect way of assessing the strength of evidence in favour of the alternative hypothesis.

You may encounter some variation in terminology and notation:

Sometimes the alternative hypothesis is called the research hypothesis, and is symbolized by H1.


An Extended Example: Testing the Null Hypothesis

We assess the evidence against the null hypothesis by tentatively assuming that it is true.

We then calculate the probability that data as discrepant as those we in fact obtained would arise purely by chance.

We do this by referring to the sampling distribution of the sample mean x̄, constructed assuming that the null hypothesis is true.
If this probability is sufficiently small (that is, if data as extreme as ours are rare assuming the truth of H0), then we conclude that the null hypothesis is probably wrong.

If the null hypothesis is true (that is, if µ = 0), then, with repeated sampling, sample means x̄ are approximately normally distributed with an average value of µ = 0 and a standard deviation of σ/√n = σ/√10.


An Extended Example: The Hypothetical Sampling Distribution of the Mean

There are two practical problems here which, for the moment, we will effectively ignore:

1 We don’t know the standard deviation σ of x in the population.

We will simply take σ = s = 8.38, but this is not a good solution when, as here, the sample size is small.

2 Because the sample size is small, the sampling distribution of the means x̄ from repeated samples may not be close enough to a normal distribution if the population distribution of x is sufficiently non-normal.

Disregarding these problems, x̄ ∼ N(0, 8.38/√10) = N(0, 2.65).

This hypothetical sampling distribution of x̄, calculated assuming the truth of the null hypothesis, is shown on the next slide.
The observed value of x̄ = 9.4 is also shown on the graph.


An Extended Example: The Hypothetical Sampling Distribution of the Mean

[Figure: Hypothetical sampling distribution of x̄ under H0: µ = 0. The horizontal axis is marked in units of x̄ (0.0 and 9.4) and of z (0 and 3.55). The observed value of x̄ = 9.4 is marked; the area above it is P = .0002.]

An Extended Example: The P-Value for the Test

Now we calculate the probability of getting a sample mean x̄ of 9.4 or larger, assuming that H0 is true.

This is simply a normal distribution calculation:

Converting x̄ = 9.4 to the standard normal value z,

z = (x̄ − µ0) / (σ/√n) = (9.4 − 0) / 2.65 = 3.55

In this formula, µ0 is not the true population value of µ, which is unknown, but rather the value of µ specified by the null hypothesis.
Then

P(x̄ ≥ 9.4) = P(z ≥ 3.55) = 1 − .9998 = .0002

This probability is called the P-value of the hypothesis test.
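The same calculation can be done in software instead of with the normal table. This is a rough sketch using Python's standard library; it takes σ = s = 8.38 as the text does, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu_0 = 9.4, 0.0   # observed mean difference and the null value
sigma, n = 8.38, 10      # sigma is taken to be s, as in the text

se = sigma / sqrt(n)                 # standard deviation of the sample mean
z = (x_bar - mu_0) / se              # test statistic
p_upper = 1 - NormalDist().cdf(z)    # P(Z >= z), the upper-tail P-value

print(f"se = {se:.2f}, z = {z:.2f}, P = {p_upper:.4f}")
```

Rounded to the precision used on the slide, this gives a standard error of 2.65, z = 3.55, and P = .0002.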


An Extended Example: Interpretation of the P-Value

Thought Question

Indicate whether each of the following statements is (A) true or (B) false:

If the null hypothesis is true, then the probability of getting a sample mean as large as or larger than the one obtained (x̄ = 9.4) is very small, about 2 chances in 10,000.

This is a very small probability, and so the null hypothesis is probably correct and the alternative hypothesis (that the new method is better on average than the old one) is probably incorrect.


The Procedure of Hypothesis Testing: The Null and Alternative Hypotheses

Specify the null hypothesis,

H0: µ = µ0

and the alternative hypothesis.

There are three alternative hypotheses that could accompany H0: µ = µ0:

1 Ha: µ > µ0
2 Ha: µ < µ0
3 Ha: µ ≠ µ0

The first two of these alternative hypotheses specify a direction of departure from H0; the third alternative hypothesis is nondirectional.
In any given application, only one alternative hypothesis would be used. More about this shortly.


The Procedure of Hypothesis Testing: The Null and Alternative Hypotheses

Notice that the null and alternative hypotheses are specified in terms of the parameter of interest, here µ.

It is wrong, for example, to write H0: x̄ = 0.


The Procedure of Hypothesis Testing: Hypothetical Sampling Distribution, Test Statistic, and P-Value

Tentatively assuming the truth of the null hypothesis, find the hypothetical sampling distribution of the sample mean x̄,

x̄ ∼ N(µ0, σ/√n)

Using the hypothetical sampling distribution, find the probability P of obtaining a result as extreme as or more extreme than the one observed.

This is done by calculating the test statistic

z = (x̄ − µ0) / (σ/√n)

and using the standard normal table.
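The procedure can be collected into one small function that handles each of the three alternative hypotheses. This is only a sketch: `z_test` and its `alternative` argument are hypothetical names chosen for illustration, and σ is assumed known.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu_0, sigma, n, alternative="greater"):
    """Return (z, P) for a z-test of H0: mu = mu_0 (hypothetical helper)."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if alternative == "greater":       # Ha: mu > mu_0, upper tail
        p = 1 - std_normal.cdf(z)
    elif alternative == "less":        # Ha: mu < mu_0, lower tail
        p = std_normal.cdf(z)
    elif alternative == "two-sided":   # Ha: mu != mu_0, both tails
        p = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p

# The teaching example, with sigma taken to be s = 8.38
for alt in ("greater", "less", "two-sided"):
    z, p = z_test(9.4, 0.0, 8.38, 10, alternative=alt)
    print(f"{alt:>9}: z = {z:.2f}, P = {p:.4f}")
```

Only one of the three alternatives would be used in a real application; the loop is there just to compare the tail areas.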


The Procedure of Hypothesis Testing: Interpreting the P-Value

If the P-value is sufficiently small, then the null hypothesis is probably wrong and the alternative hypothesis is probably right.

Important Point

It is not correct, however, to say that P is the probability that the null hypothesis is right.

The null hypothesis is either right or wrong (µ is either equal to µ0 or it is not), but we do not know which.
This is similar to the interpretation of a confidence interval, where the level of confidence is not the probability that the parameter µ is in our specific interval.



One-Sided and Two-Sided Tests: One-Sided Tests

Testing H0: µ = 0 against the alternative hypothesis Ha: µ > 0 led to a one-sided test:

To find the P-value for the test we looked in the upper tail of the hypothetical sampling distribution, beyond the observed value of the sample mean x̄.
In this case, the alternative hypothesis is also called one-sided or directional; the test is sometimes called one-tailed.

If we had instead expected the new method to be worse than the old one, we would have specified the directional alternative hypothesis Ha: µ < 0, and would have found the P-value by looking to the left of the sample mean x̄ rather than to the right.

For the observed value of x̄ = 9.4, the left-tail P-value is very big, P = .9998 (see the graph on the next slide).
This counts as evidence in favour of the null hypothesis and against the alternative hypothesis (that the old method is better).




[Figure: P-value for Ha: µ < 0. Hypothetical sampling distribution of x̄ under H0, with axes in units of x̄ (0.0 and 9.4) and z (0 and 3.55). The area to the left of the observed value of x̄ is P = .9998; the area to the right is .0002.]

One-Sided and Two-Sided Tests: One-Sided Tests

Thought Question

True or False: This example illustrates an important characteristic of one-sided tests: If you observe a departure from µ0 in the direction opposite to the expected one, then this counts as evidence in favour of H0 no matter how far x̄ is from µ0.

A True.

B False.

C I don’t know.


One-Sided and Two-Sided Tests: Two-Sided Test

If you are not confident in your expectation that the true value of µ departs from the null value µ0 in a particular direction, then you should use a two-sided or nondirectional alternative hypothesis, which leads to a two-sided or two-tailed test:

Ha: µ ≠ µ0

In this case, the alternative hypothesis simply states that µ is different from µ0, and we reject H0 in favour of Ha for values of x̄ that are sufficiently far from µ0 in either direction.


One-Sided and Two-Sided Tests: Two-Sided Test

To implement a two-sided test, we need to find the probability of getting a result as extreme as or more extreme than the one obtained in either direction.

This, in effect, requires us to double the one-sided P-value; if, for example, the obtained z is positive, then

P = P(Z ≤ −z or Z ≥ z) = 2 × P(Z ≥ z)

Here, Z is a standard normal variable.

For the example, z = 3.55, so the two-sided P-value is (see the graph on the next slide)

P = P(Z ≤ −3.55 or Z ≥ 3.55) = 2 × .0002 = .0004




[Figure: P-value for two-sided Ha: µ ≠ 0. Hypothetical sampling distribution of x̄ under H0, with axes in units of x̄ (0.0 and 9.4) and z (−3.55, 0, and 3.55). The observed value of x̄ is marked; each tail has area .0002.]


Because the two-sided P-value is always twice as large as the one-sided value if the departure of x̄ from µ0 is in the predicted direction, you might be tempted to select the direction of Ha after examining the data.

This is cheating: The P-value for a one-sided test is correct only if the direction of the alternative hypothesis is specified in advance of looking at the data.


“Statistical Significance”

We say that a result is statistically significant when the P-value is sufficiently small to warrant rejection of the null hypothesis.

How small is “sufficiently small”?

Conventionally, the criterion of statistical significance is a P-value smaller than .05 (one chance in 20).
Sometimes, other criteria are used, such as .1, .01, or even .001.

A pre-specified value for assessing statistical significance is called the α-level (“alpha-level”) of the test, e.g., α = .05.

Notice that the smaller the α-level of the test, the larger the departure of x̄ from µ0 needed to reject H0.
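To see how the α-level translates into a required departure of the sample mean from µ0, the sketch below computes the critical value of z for several α-levels and converts it back to the scale of the teaching example. It assumes a one-sided test with σ = 8.38 and n = 10; `required_departure` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def required_departure(alpha, sigma, n):
    """Smallest x_bar - mu_0 that rejects H0 in a one-sided z-test at level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # critical value of z
    return z_crit * sigma / sqrt(n)

# Values from the teaching example (sigma taken to be s)
for alpha in (0.1, 0.05, 0.01, 0.001):
    d = required_departure(alpha, sigma=8.38, n=10)
    print(f"alpha = {alpha:<5}  required departure = {d:.2f}")
```

The required departure grows as α shrinks, which is the point made in the last sentence above.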


“Statistical Significance”: Cautions

The term “statistical significance” is in very wide use, so you need to understand what it means, but the terminology is unfortunate:

To say that a result is statistically significant means that it is unlikely to have occurred by chance alone if the null hypothesis is true.
The null hypothesis is therefore probably wrong.
But this does not necessarily mean that the observed result is of any practical significance or importance.

For example, in a very large sample, even a very small departure of x̄ from µ0 can prove to be statistically significant.

Because the null hypothesis is unlikely to be exactly right, we are very likely to reject it given a sample that is sufficiently large, even if H0 is nearly correct.
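A small numerical illustration of this point: the sketch below holds a departure of x̄ from µ0 fixed at 0.5 grade points (a made-up value, chosen only for illustration) and shows how the two-sided P-value changes as n grows, with σ = 8.38 borrowed from the teaching example.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(departure, sigma, n):
    """Two-sided P-value for an observed departure x_bar - mu_0 (sigma assumed known)."""
    z = departure / (sigma / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

departure, sigma = 0.5, 8.38   # hypothetical small departure; sigma from the example
for n in (10, 100, 1000, 10000):
    print(f"n = {n:>5}: P = {two_sided_p(departure, sigma, n):.4f}")
```

The same half-point difference is nowhere near significant at n = 10 but is highly significant at n = 10,000, even though its practical importance has not changed.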



To say that a result is “statistically significant” means that it is unlikely to be the product of chance.

Performing a statistical test does not correct basic flaws in the design of a study.
For example, to find a statistically significant difference between the two teaching methods does not rule out the possibility that it was simply the novelty of the new method that produced the difference.

Beware of performing many statistical tests simultaneously.

Suppose that you perform 100 tests, each at the α = .05 level.
Although the probability of rejecting any individual null hypothesis just by chance, even if it is correct, is 5 percent, the probability of rejecting at least one hypothesis among the 100 is much larger than 5 percent.
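How much larger can be worked out under two added assumptions that the slide does not state at this point: that all 100 null hypotheses are true, and that the tests are independent. The sketch below is the arithmetic under those assumptions.

```python
alpha = 0.05   # level of each individual test
m = 100        # number of tests

# Assuming every H0 is true and the tests are independent,
# the chance of no rejections at all is (1 - alpha) ** m.
p_at_least_one = 1 - (1 - alpha) ** m

print(f"P(at least one rejection among {m} tests) = {p_at_least_one:.3f}")
```

Under these assumptions the probability is about .994, so at least one “significant” result is nearly certain.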



Thought Question

Suppose, for example, that we performed 100 independent hypothesis tests of true null hypotheses, each at the α = .05 level. How many null hypotheses would we expect to reject by chance alone?

A 0 of the 100 hypotheses.

B 5 of the 100 hypotheses.

C 10 of the 100 hypotheses.

D All 100 hypotheses.

E I don’t know.


Reporting P-Values

With the normally distributed test statistic z, we can find a nearly exact P-value as long as the obtained value of z is in the range of the standard normal table.

Alternatively, most statistical software reports the P-value for test statistics.

Sometimes, however, you may need to use a statistical table for a test statistic that only shows the “critical values” of the statistic corresponding to certain α-levels.

For the normal distribution, for example,

                            α-level
one-tail P     .1      .05     .025    .01     .005    .001    .0005
two-tail P     .2      .1      .05     .02     .01     .002    .001
critical z∗    1.282   1.645   1.960   2.326   2.576   3.091   3.291

Then, having obtained z = 3.55 for a two-sided test, for example, we would report P < .001.
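A table of this kind can be regenerated, and the lookup automated, with a few lines of code. The sketch below is illustrative only: `p_bound` is a hypothetical helper, and the computed critical values may differ from a printed table in the last decimal place.

```python
from statistics import NormalDist

one_tail = [0.1, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0005]
critical = [NormalDist().inv_cdf(1 - p) for p in one_tail]   # critical z for each level

def p_bound(z, two_sided=False):
    """Smallest tabled P whose critical value |z| reaches (hypothetical helper).

    Returns None when |z| is below every critical value in the table,
    i.e. when P is larger than the largest tabled level.
    """
    bound = None
    for p, z_star in zip(one_tail, critical):
        if abs(z) >= z_star:
            bound = 2 * p if two_sided else p
    return bound

for p, z_star in zip(one_tail, critical):
    print(f"one-tail {p:<7} two-tail {2 * p:<6} z* = {z_star:.3f}")

print("z = 3.55, two-sided: P <", p_bound(3.55, two_sided=True))
```

For z = 3.55 and a two-sided test the function returns .001, matching the report of P < .001 above.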


Reporting P-Values

Thought Question

To take another example, suppose that z = 2.45 for a one-sided test in which the alternative hypothesis is Ha: µ > 0. Using the table of critical values of z, what is the P-value for the test?

                            α-level
one-tail P     .1      .05     .025    .01     .005    .001    .0005
two-tail P     .2      .1      .05     .02     .01     .002    .001
critical z∗    1.282   1.645   1.960   2.326   2.576   3.091   3.291

A P = .01.

B .01 > P > .005.

C .01 < P < .005.

D I don’t know.


Hypothesis Tests and Confidence Intervals

Although I have developed them separately, there is a close relationship between hypothesis tests and confidence intervals:

The hypothesis H0: µ = µ0 is acceptable against a two-sided alternative at the level α if and only if the value µ0 lies in the confidence interval for µ constructed at the level of confidence 1 − α.
For example, a two-sided hypothesis test at the α = .05 level corresponds to a 95-percent confidence interval.
We can therefore think of the 95-percent confidence interval as testing all possible hypotheses about µ at the α = .05 level:
Any value of µ that lies within the confidence interval is acceptable at the .05 level; and any value of µ that is outside of the interval is unacceptable.



Thought Question

A two-sided hypothesis test at α = .01 corresponds to a confidence interval at what level of confidence?

A 1%.

B 5%.

C 95%.

D 99%.

E I don’t know.



To illustrate, recall the “educational experiment” in which ten instructors each taught two sections of an introductory statistics course, and the variable of interest gave the difference in average grades between a new and old method of instruction.

In this example, n = 10, x̄ = 9.4, and s = 8.38; we took σ = s.

For the example, the 95 percent confidence interval for µ is

x̄ ± 1.96 × σ/√n = 9.4 ± 1.96 × 8.38/√10
                = 9.4 ± 5.19
                = 4.21 to 14.59
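The interval, and the corresponding two-sided test, can be reproduced with a short calculation. This sketch again treats σ as known and equal to s; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 9.4, 8.38, 10   # sigma is taken to be s, as in the text
confidence = 0.95
alpha = 1 - confidence

z_star = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95 percent
margin = z_star * sigma / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

mu_0 = 0.0
reject_h0 = not (lower <= mu_0 <= upper)   # two-sided test at level alpha

print(f"{confidence:.0%} CI: {lower:.2f} to {upper:.2f}")
print(f"Reject H0: mu = {mu_0} at alpha = {alpha:.2f}? {reject_h0}")
```

Checking whether µ0 falls inside the interval gives the same decision as comparing the two-sided P-value with α.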


Thought Question

(A) True, (B) False, or (C) I don’t know?
Because µ = 0 is outside of this interval, the hypothesis H0: µ = 0 can be rejected at the α = .05 level for a two-sided test. (We already know that the two-sided P-value for this test, .0004, is much smaller than .05.)



[Figure: The relationship between the 95 percent confidence interval and a two-sided hypothesis test at α = .05. The 95 percent confidence interval around the observed value of x̄ = 9.4 runs from 4.21 to 14.59 (9.4 ± 5.19); on the z scale the limits are ±1.96, with area .025 in each tail.]



Hypothesis Testing as a Decision Problem

One way of thinking about hypothesis testing is as a decision-making problem:

We need to decide whether to accept the null hypothesis H0 or whether to reject it in favour of its alternative.

Scientific inference isn’t quite the same as decision-making, and there are few applications in which hypothesis testing and decision-making literally coincide.

Nevertheless, it can illuminate the essential nature of hypothesis testing to think about it in this way.

Suppose that we want to test the null hypothesis H0: µ = µ0, where, as before, µ0 is some value (like 0) specified in advance.

There are two possibilities: Either H0 is correct, or it is wrong.
We do not know, of course, which of these “states of nature” obtains, or we would not need to test the hypothesis.



Based on sample data, we decide either to accept the null hypothesis H0 (if the test statistic z fails to exceed a pre-specified critical value, say 1.96 for a two-sided test at the α = .05 level), or we decide to reject H0 (if the test statistic exceeds the critical value).

All possibilities are summarized in the following table:

                      State of nature
Decision       H0 true                H0 false
Reject H0      Type I error :(        Correct decision :)
Accept H0      Correct decision :)    Type II error :(



Thought Question

                      State of nature
Decision       H0 true                H0 false
Reject H0      Type I error :(        Correct decision :)
Accept H0      Correct decision :)    Type II error :(

(A) True, (B) False, or (C) I don’t know?

1 If the null hypothesis is true and we reject it, then we commit a mistake, called a Type II error.

2 If the null hypothesis is true and we accept it, then we have made a correct decision.

3 If the null hypothesis is false and we reject it, then we have made a Type I error.

4 If the null hypothesis is false and we accept it, then we have committed a Type II error.



The Power of the Test

Figuring out the probability of a Type I error is simple: It is just the significance level of the test, α.

The probability of a Type II error (failing to reject a false H0) is less straightforward.

The power of the test is the probability of correctly rejecting a false null hypothesis: power = 1 − P(Type II error).

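[Figure: The sampling distribution of x̄ under H0, centred at µ0, and the true sampling distribution of x̄, centred at µ. The area marked α lies under the H0 distribution; the area marked as the power of the test lies under the true distribution.]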
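Both error rates can be checked by simulation. The sketch below is a rough illustration, not part of the original slides: it assumes a normally distributed population with σ = 8.38, n = 10, and a one-sided test at α = .05, and it uses an arbitrary true mean of 5 to illustrate power. The seed and the number of trials are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(2016)   # arbitrary seed, for reproducibility only

mu_0, sigma, n, alpha = 0.0, 8.38, 10, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value

def rejection_rate(true_mu, trials=20000):
    """Share of simulated samples in which H0: mu = mu_0 is rejected."""
    rejections = 0
    for _ in range(trials):
        sample = [random.gauss(true_mu, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu_0) / (sigma / sqrt(n))
        if z >= z_crit:
            rejections += 1
    return rejections / trials

type_i_rate = rejection_rate(true_mu=mu_0)   # H0 true: rate should be near alpha
power_at_5 = rejection_rate(true_mu=5.0)     # H0 false: rate estimates the power

print(f"Type I error rate (H0 true): {type_i_rate:.3f}")
print(f"Power when mu = 5:           {power_at_5:.3f}")
print(f"Type II error rate:          {1 - power_at_5:.3f}")
```

When H0 is true the rejection rate should settle near α; when it is false, the rejection rate estimates the power, and one minus that rate estimates the probability of a Type II error.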



The probability of a Type II error and the power of the test depend upon three factors:

1. The α-level of the test: The larger we make the value of α, the easier it is to reject H0 (whether H0 is true or false).

Thus, increasing the probability of a Type I error (as on the right in the figure below) decreases the probability of a Type II error and increases the power of the test.

[Figure: Two panels showing the sampling distributions centred at µ0 and µ, with α and the power of the test marked; α is larger in the right-hand panel.]

Unfortunately, then, the two types of errors work at cross-purposes.



2. The true value of the parameter µ: The farther µ is from the hypothesized value µ0 (as on the right), the more likely it is that we will have evidence against H0: µ = µ0, and the less likely it is that we will commit a Type II error, increasing the power of the test.

[Figure: Two panels showing the sampling distributions centred at µ0 and µ, with α and the power of the test marked; µ is farther from µ0 in the right-hand panel.]



3. The sample size n: The larger the sample (as on the right), the easier it will be to detect a departure of a given size from the null hypothesis (because the standard deviation of x̄ is smaller in a large sample), so the probability of a Type II error goes down and the power of the test increases.

[Figure: Two panels showing the sampling distributions centred at µ0 and µ, with α and the power of the test marked; the sample is larger in the right-hand panel.]
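The three factors can be explored numerically. The sketch below computes the power of a one-sided z-test analytically; the function name `power`, its default values (borrowed from the teaching example), and the true means used are all illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def power(mu, mu_0=0.0, sigma=8.38, n=10, alpha=0.05):
    """Power of the one-sided z-test of H0: mu = mu_0 against Ha: mu > mu_0."""
    se = sigma / sqrt(n)
    # Reject H0 when the sample mean is at least x_crit
    x_crit = mu_0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability of that event under the true sampling distribution N(mu, se)
    return 1 - NormalDist(mu, se).cdf(x_crit)

print(f"baseline (mu = 5, n = 10, alpha = .05): {power(5):.3f}")
print(f"larger alpha (alpha = .10):             {power(5, alpha=0.10):.3f}")
print(f"mu farther from mu_0 (mu = 8):          {power(8):.3f}")
print(f"larger sample (n = 40):                 {power(5, n=40):.3f}")
```

Each change raises the power relative to the baseline, and setting mu equal to mu_0 returns α itself, the probability of a Type I error.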


The Power of the Test: Summary

Thought Question

(A) True, (B) False, or (C) I don’t know?
The probability of a Type II error goes down and the power of the test goes up:

as α gets smaller;

as µ gets farther from µ0;

as n gets smaller.

