Uploaded by: phoebe-jefferson
FPP 26-27
Significance Tests
Significance tests
Question:
Given the collected data, is there evidence against a specified hypothesis about the corresponding parameter?
In other words, are the data consistent or not with a specified hypothesis?
Logic of significance tests:
Proof by contradiction:
1. Assume some hypothesis is true.
2. Find a statistic (a quantity that depends on data) that takes on extreme values when the assumed hypothesis is false.
3. Calculate the value of this statistic in the collected data.
4. Calculate the probability of observing a value of the statistic as or more extreme than the observed value, under the assumed hypothesis.
5. When this probability is small, one of two things happened:
   A. the assumed hypothesis is correct and a rare event occurred, or
   B. the assumed hypothesis is incorrect.
Since rare events are by definition rare, we interpret small probabilities as evidence that the assumed hypothesis is false.
When the probability is not small, the data provide insufficient evidence to claim that the assumed hypothesis is false.
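The five steps can be traced on a toy problem. The sketch below is only an illustration: the coin-flip data (60 heads in 100 flips) and the function name `upper_tail_probability` are made up for this example and do not come from the slides. The assumed hypothesis is a fair coin, the statistic is the number of heads, and the probability in step 4 is computed exactly from the binomial distribution.

```python
from math import comb

# Hypothetical data, invented for illustration only.
n_flips = 100
observed_heads = 60
p_null = 0.5  # step 1: assume the coin is fair


def upper_tail_probability(n, k_observed, p):
    """P(X >= k_observed) when X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_observed, n + 1)
    )


# Steps 2-3: the statistic is the number of heads; its observed value is 60.
# Step 4: probability of a value as or more extreme, under the fair-coin assumption.
p_value = upper_tail_probability(n_flips, observed_heads, p_null)
print(f"P(heads >= {observed_heads} | fair coin) = {p_value:.4f}")
```

Step 5 is the judgment call: a small probability here is read as evidence against the fair-coin assumption.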
Significance test for a population percentage
Civil rights and the 1960s
In the court case Swain vs. Alabama (1965), the prosecution alleged there was discrimination against black people in grand jury selection. Census data from the time indicates that 25% of people eligible for grand jury service were black. A random sample of 1050 people called to appear for possible jury duty contained 177 black people. Is there evidence of discrimination?
Reference: Devore, J. Probability and Statistics for Engineering and the Sciences. Pacific Grove, CA: Duxbury, 2000, p. 339
Step 1: Formulate hypothesis
Claim: There is discrimination.
The opposite of this claim is called the null hypothesis. It usually can be translated as “there is nothing unusual going on.”
The claim is called the alternative hypothesis. It usually can be translated as “there is some unusual pattern in the data.”
H0: P = 0.25 vs HA: P < 0.25
Step 2: Find a relevant statistic
Values of the sample percentage of black jurors much smaller than 0.25 suggest the null hypothesis is not true.
Sample proportion = 177/1050 = 0.1686. Is this much smaller than 0.25?
A good way to determine this is by converting the difference between 0.1686 and 0.25 to standard units:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.1686 - 0.25}{\sqrt{0.25(1 - 0.25)/1050}}$$
Step 3: Calculate z in data
We get:
$$z = \frac{0.1686 - 0.25}{0.01336306} = -6.09$$
The sample percentage of black jurors is about six SEs below the hypothesized value of 0.25.
Step 4: Calculate the p-value
To find the probability we need the distribution of z. Do we know it?
When n (the sample size) is large enough, we can use a standard normal curve to calculate the probability of seeing a value of z less than or equal to (i.e. as or more extreme than) the observed value of -6.09.
Conclusion in Swain case
Because the p-value is approximately 0, we reject the null hypothesis. It is very unlikely that we would observe a sample percentage of 16.86% or smaller if the true percentage were 25%. The data suggest that black jurors were indeed selected less frequently than would have been expected. The data provide some evidence of discrimination.
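The Swain calculation can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are chosen here for illustration. It uses the counts from the slides (177 of 1050, null value 0.25) and the large-sample normal approximation described in Step 4, and it gives a sample proportion near 0.169 and a z value near -6.1.

```python
from math import sqrt
from statistics import NormalDist

n = 1050       # people called for possible jury duty
count = 177    # black people in the sample
p0 = 0.25      # proportion under the null hypothesis

p_hat = count / n
se_null = sqrt(p0 * (1 - p0) / n)   # SE computed under H0
z = (p_hat - p0) / se_null

# HA: P < 0.25, so "as or more extreme" is the area to the LEFT of z.
p_value = NormalDist().cdf(z)

print(f"sample proportion = {p_hat:.4f}")
print(f"SE under H0       = {se_null:.5f}")
print(f"z                 = {z:.2f}")
print(f"one-sided p-value = {p_value:.2e}")
```

The p-value is far below any of the usual cut-offs, which is what "approximately 0" means in the conclusion.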
Stating hypothesis
Null Hypothesis (H0)
The statement being tested in a test of significance is called the null hypothesis
Usually the null hypothesis
is a statement of “no effect” or “no difference”,
is a statement about a population,
is expressed in terms of a (some) parameter(s).
Example H0: μ = 0
Stating hypothesis
Alternative Hypothesis (Ha)
The name given to the statement we hope or suspect to be true instead of H0
Example Ha: μ ≠ 0
Hypotheses always refer to some population or model, not a particular outcome
We must decide whether the alternative hypothesis (Ha) should be one-sided or two-sided
Stating hypothesis
One-sided alternative hypotheses:
Example: Ha: μ < 0 or Ha: μ > 0
Two-sided alternative hypothesis:
Example: Ha: μ ≠ 0
Stating hypothesis
Choosing one-sided or two-sided Hypothesis
The alternative hypothesis should express the hopes or suspicions we had in mind when we decided to collect the data
It is cheating to first look at the data and then frame Ha to fit what the data show
If you do not have a specific direction in advance, use a two-sided alternative
Stating hypothesis
Example: Your company hopes to reduce the mean time (μ) required to process customer orders. At present, this mean is 3.8 days. You study the process and eliminate some unnecessary steps.
Q: Did you succeed in decreasing the average process time?
Target: to show that the mean is now less than 3.8 days. So the alternative hypothesis is one-sided.
The null hypothesis is the “no change” value.
H0: μ = 3.8 vs Ha: μ < 3.8
Stating hypothesis
The mean area of several thousand apartments in a new development is advertised to be 1250 sqft. A tenant group thinks that the apartments are smaller than advertised. They hire an engineer to measure a sample of apartments to test their suspicion.
H0: μ = 1250 vs. Ha: μ < 1250
Stating hypothesis
Experimenters on learning in animals sometimes measure how long it takes a mouse to find its way through a maze. The mean time is 18 seconds for one particular maze. A researcher thinks that a loud noise will cause the mice to complete the maze more slowly. She measures how long each of 10 mice takes with a noise as stimulus.
H0: μ = 18 vs. Ha: μ > 18
Stating hypothesis
Last year, your company’s service technicians took an average of 2.6 hours to respond to trouble calls from business customers who purchased service contracts. Do this year’s data show a different average response time?
H0: μ = 2.6 vs. Ha: μ ≠ 2.6
Test Statistic
After correctly formulating the null and alternative hypotheses, we make a comparison between the hypothesized value and the data by using a test statistic.
Many test statistics can be thought of as a standardized distance between a sample estimate of a parameter and the value of the parameter specified by the null hypothesis.
Most test statistics have the generic form:
$$\frac{\text{observed} - \text{expected}}{\text{SE}}$$
Test statistic for a proportion:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$
Test statistic for a mean:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
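As a rough illustration of the generic form (observed minus expected, divided by SE), the two statistics for a proportion and for a mean can be written as small Python functions. The function names and the example inputs are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def z_stat_proportion(p_hat, p0, n):
    """(observed - expected) / SE for a proportion, with SE computed under H0."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


def t_stat_mean(x_bar, mu0, s, n):
    """(observed - expected) / SE for a mean, with SE estimated by s / sqrt(n)."""
    return (x_bar - mu0) / (s / sqrt(n))


# Made-up inputs: both statistics come out at about 2 standard errors.
print(z_stat_proportion(p_hat=0.55, p0=0.50, n=400))
print(t_stat_mean(x_bar=10.5, mu0=10.0, s=2.0, n=64))
```

Both functions return a standardized distance, so their values can be read on the same scale: how many standard errors the estimate lies from the hypothesized value.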
P-values
A test of significance assesses the evidence against the null hypothesis and provides a numerical summary of this evidence in terms of a probability.
The idea is that “surprising” outcomes are evidence against H0.
A surprising outcome is one that is far from what we would expect if H0 were true.
P-values
A test of significance finds the probability of getting an outcome as extreme or more extreme than the actually observed outcome.
The direction or directions that count as “far from what we would expect” are determined by the alternative hypothesis.
Definition: The probability, assuming that H0 is true, that the test statistic would take a value as extreme or more extreme than that actually observed is called the P-value of the test.
The smaller the P-value, the stronger the evidence against H0 provided by the data.
P-values
What does “as or more extreme” really mean?
When the alternative has a > sign, “as or more extreme” means use the area to the right of the test statistic in the p-value calculation.
When the alternative has a < sign, “as or more extreme” means use the area to the left of the test statistic in the p-value calculation.
When the alternative has a ≠ sign, “as or more extreme” means values of the test statistic far from zero in both the positive and negative directions. For these types of alternative hypotheses, add the areas to the left of -|test statistic| and to the right of |test statistic|.
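The three cases can be sketched as one function. This is an illustrative helper, not something from the slides: the name `p_value_from_z` and the labels "greater", "less", and "two-sided" are assumptions made here, and the standard normal curve is used as the reference distribution (appropriate for a z statistic with a large sample).

```python
from statistics import NormalDist


def p_value_from_z(z, alternative):
    """Area under the standard normal curve that counts as 'as or more extreme'."""
    std_normal = NormalDist()
    if alternative == "greater":     # Ha has a > sign: area to the right
        return 1 - std_normal.cdf(z)
    if alternative == "less":        # Ha has a < sign: area to the left
        return std_normal.cdf(z)
    if alternative == "two-sided":   # Ha has a not-equal sign: both tails
        return 2 * std_normal.cdf(-abs(z))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


for alt in ("greater", "less", "two-sided"):
    print(alt, round(p_value_from_z(1.5, alt), 4))
```

The same test statistic gives three different p-values, which is why the alternative has to be fixed before looking at the data.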
Interpretation of a p-value
Common misinterpretations of p-values
The p-value is not the probability that the null hypothesis is true. (The null is either true or not.)
Also, (1 - p-value) is not the probability that the alternative hypothesis is true. (The alternative is either true or not true.)
Correct interpretation
The p-value is the probability of getting a value of a test statistic as or more extreme than the value of the statistic computed from the collected data, under the assumption that the null hypothesis is true.
Enough evidence?
Below are some guidelines for judging p-values. (Don’t treat these as “golden standards”.)
p-value                        Evidence against H0
< 0.01-ish                     very strong
> 0.01-ish and < 0.05-ish      moderate
> 0.05-ish and < 0.10-ish      weak
> 0.10-ish                     practically none
Etruscan example
“In the eighth century B.C., the Etruscan civilization was the most advanced in all of Italy. Its art forms and political innovations were destined to leave indelible marks on the entire Western world. Originally located in the region now known as Tuscany, it spread rapidly across the Apennines and eventually overran much of Italy. But as quickly as it came, it faded. Militarily it was no match for the burgeoning Roman legions, and by the dawn of Christianity it was all but gone.
No chronicles of the Etruscan empire have ever been found, and to this day its origin remains shrouded in mystery. Were the Etruscans native Italians or were they immigrants? And if they were immigrants, where did they come from? Much of our knowledge of the Etruscans derives from archaeological investigations and anthropometric studies… (for example) body measurements to determine… origins.” (Source: Larsen and Marx, Statistics, 2001, p. 513.)
A team of archaeologists collected 84 skulls of Etruscan men and measured their head breadth (in mm). Let’s assume that these 84 men are a random sample of Etruscan men. If the Etruscan men were native, it makes sense to think that the population average head breadth of Etruscans is comparable to the head breadth of modern Italians, 132.44 mm. This assumes evolution has not shifted average head size substantially over the last 2800 years, an assumption that is reasonably close to true.
Exploratory data analysis for Etruscans
[Figure: Distributions output for Head breadth (mm), with a normal quantile plot; horizontal axis from 125 to 160 mm]
Quantiles
100.0%  maximum   158.00
99.5%             158.00
97.5%             157.63
90.0%             151.00
75.0%   quartile  148.00
50.0%   median    143.50
25.0%   quartile  140.00
10.0%             136.50
2.5%              131.13
0.5%              126.00
0.0%    minimum   126.00
Moments
Mean            143.77381
Std Dev         5.9705123
Std Err Mean    0.6514363
upper 95% Mean  145.06949
lower 95% Mean  142.47813
N               84
Significance test
Step 1: Specify the null and alternative hypothesis
Claim: true average breadth of Etruscan heads differs from 132.44
H0: μ = 132.44 vs Ha: μ ≠ 132.44
Step 2: Compute a test statistic
$$\frac{\text{observed} - \text{expected}}{\text{SE}} = \frac{\bar{x} - \mu_0}{\text{SD}/\sqrt{n}} = \frac{143.77381 - 132.44}{0.6514363} = 17.4$$
The sample average is over 17 SEs away from the hypothesized average of 132.44.
Step 3: Calculate the p-value
For all intents and purposes this p-value is zero. Why?
Step 4: Make a conclusion
There is enough evidence in the data to conclude that modern Italians and the Etruscans have different average head sizes.
A more wordy conclusion
It’s practically impossible to observe a difference of 17 SEs by chance alone. Our initial assumption in the null hypothesis is very unlikely to be true. The data overwhelmingly suggest that modern Italians and the Etruscans have different average head sizes, indicating that Etruscans were not native to Italy.
For those interested, current theory is that Etruscans came from Asia. But it remains a mystery how they got to Italy.
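The Etruscan test statistic can be reproduced from the summary statistics alone. The sketch below uses the mean, SD, and N reported for the 84 skulls and the hypothesized value 132.44. One stated assumption: the tail area is computed from the standard normal curve (via `math.erfc`) rather than the t distribution with 83 degrees of freedom, because the standard library has no t distribution. The t-based p-value is larger than this normal approximation but still far below 0.0001.

```python
from math import erfc, sqrt

n = 84
x_bar = 143.77381   # sample mean head breadth (mm)
s = 5.9705123       # sample SD (mm)
mu0 = 132.44        # hypothesized mean under H0

se = s / sqrt(n)
t = (x_bar - mu0) / se

# Two-sided tail area under the standard normal curve:
# 2 * P(Z <= -|t|) equals erfc(|t| / sqrt(2)); erfc stays accurate far out in the tail.
p_two_sided_normal = erfc(abs(t) / sqrt(2))

print(f"SE = {se:.7f}")
print(f"t  = {t:.2f}")
print(f"two-sided p-value (normal approximation) = {p_two_sided_normal:.1e}")
```

A value this small is why the p-value is treated as zero for all practical purposes.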
Significance test using JMP
[Figure: JMP Distributions output for Head breadth (mm), repeating the normal quantile plot, Quantiles, and Moments from the exploratory data analysis, with the test output below added]
Test Mean=value
Hypothesized Value  132.4
Actual Estimate     143.774
df                  83
Std Dev             5.97051
t Test
Test Statistic      17.4596
Prob > |t|          <.0001
Prob > t            <.0001
Prob < t            1.0000
Example 1
A sample of 40 recovering alcoholics was given the State-Trait Inventory Test. The mean score of the 40 recovering alcoholics was 38 with a sample SD of 7. A psychologist suspected that recovering alcoholics in general had a higher mean score than the norm of 35. Does the sample justify the suspicion?
Example 2
There was concern among health officials in a community that an unusually large percentage of babies with abnormally low birth weight were being born. Abnormally low birth weight here is defined as less than 88 ounces. A sample of 180 births showed 14 babies with abnormally low birth weight. The proportion of births that the officials expect to be abnormally low is 5%. Do the data support the health officials’ claims?
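One possible way to set up the arithmetic for the two examples is sketched below. The hypotheses are my reading of the problem statements (H0: μ = 35 vs Ha: μ > 35 for Example 1; H0: p = 0.05 vs Ha: p > 0.05 for Example 2), and the variable names are illustrative. For Example 1 only the t statistic is computed, since its p-value needs a t distribution with 39 degrees of freedom (a t table or statistical software). For Example 2 the p-value uses the large-sample normal approximation.

```python
from math import sqrt
from statistics import NormalDist

# Example 1: mean score of recovering alcoholics, Ha: mu > 35.
n1, x_bar1, s1, mu0 = 40, 38.0, 7.0, 35.0
t1 = (x_bar1 - mu0) / (s1 / sqrt(n1))
print(f"Example 1: t = {t1:.2f} on {n1 - 1} degrees of freedom")

# Example 2: proportion of low-birth-weight babies, Ha: p > 0.05.
n2, low_weight, p0 = 180, 14, 0.05
p_hat2 = low_weight / n2
z2 = (p_hat2 - p0) / sqrt(p0 * (1 - p0) / n2)
p_value2 = 1 - NormalDist().cdf(z2)   # area to the right, because Ha has a > sign
print(f"Example 2: p_hat = {p_hat2:.4f}, z = {z2:.2f}, one-sided p-value = {p_value2:.4f}")
```

How much weight to give the resulting p-values is the judgment discussed in the slides that follow.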
Statistical significance
To formalize testing further, some researchers advocate strict p-value cutoffs when deciding whether or not to reject null hypotheses.
Example: reject the null hypothesis when the p-value is less than 0.05. Otherwise, do not reject it.
Statistical significance
These cut-offs are called “significance levels”.
They are typically labeled with the Greek letter α (alpha).
Example: for a statistical significance level of 0.05, we write α = 0.05
When the null hypothesis is rejected, the term used to describe the outcome of the test is “statistically significant”.
Made-up example with typical language:
“We got a p-value of 0.036 and used α = 0.05. The results are statistically significant at the 0.05 level.”
My opinion about statistical significance
DO NOT RELY BLINDLY ON A FIXED CUT-OFF
Consider two p-values: 0.050001 and 0.049999.
These two p-values provide the same amount of evidence against the null hypothesis.
But if we judge strictly by the 0.05 cut-off we don’t reject the null for 0.050001 and we do for 0.049999.
Ridiculous, no? Consider p-values on their own merits.
Type I and Type II errors
Possible errors from the decision to reject or not to reject the null hypothesis
Type I error = reject when H0 is true
Type II error = fail to reject when Ha is true
Hypothesis testing is not perfect. You never know if you are making one of these errors!
Important to replicate the study whenever possible to reduce these errors
The role of sample size
The chance of making a Type I error does not depend on sample size. (Sample sizes are incorporated into test statistics.)
The chance of making a Type II error decreases as sample size increases. (Be wary when using tests based on small sample sizes.)
The role of sample size
When the hypothesized value is NOT very different from the actual value of the parameter, you need a large sample size to reduce the chance of a Type II error.
In many grant proposals, you have to justify the study size by methods that attempt to minimize the chance of Type II errors.
These methods are called power analyses.
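A small sketch can show how the chance of a Type II error shrinks as the sample size grows. It makes several simplifying assumptions that are not in the slides: an upper-tailed z test (H0: μ = μ0 vs Ha: μ > μ0), a known population SD, and made-up values for the null mean, the true mean, and the SD. The function name `power_upper_tailed_z` is hypothetical. Power is the probability of rejecting H0 when the true mean is `mu_true`, so the chance of a Type II error is 1 minus the power.

```python
from math import sqrt
from statistics import NormalDist


def power_upper_tailed_z(mu0, mu_true, sigma, n, alpha=0.05):
    """P(reject H0: mu = mu0 in favour of mu > mu0) when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)           # rejection cut-off for z
    shift = (mu_true - mu0) / (sigma / sqrt(n))      # true difference in SE units
    return 1 - std_normal.cdf(z_crit - shift)


# Hypothetical scenario: null mean 35, true mean 37, SD 7.
for n in (10, 40, 160):
    power = power_upper_tailed_z(mu0=35, mu_true=37, sigma=7, n=n)
    print(f"n = {n:3d}: power = {power:.3f}, chance of Type II error = {1 - power:.3f}")
```

When `mu_true` equals `mu0` the function returns α for every n, which matches the earlier point that the chance of a Type I error does not depend on sample size.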
The role of sample size
Inferences are always improved by obtaining as much (accurate and relevant) data as possible.
With a large enough sample size, you can reject any false null hypothesis.
However,
Practical vs. statistical significance
When you get a statistically significant result, consider whether it is practically significant.
If your sample size is large enough you’ll be able to detect a difference between the hypothesized value of a parameter and its true value if H0 is wrong.
But is this difference of practical significance?
Example of weight lifting study
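To see how sample size alone can turn a tiny difference into a statistically significant one, the sketch below holds a made-up difference fixed and lets n grow. All numbers are hypothetical (a difference of 0.1 units against an SD of 10), and for simplicity the sample mean is assumed to sit exactly at the true value.

```python
from math import sqrt
from statistics import NormalDist

difference = 0.1   # hypothetical gap between sample mean and null value
sd = 10.0          # hypothetical SD of individual measurements


def z_for_sample_size(n):
    """Standardized distance from the null value for a fixed raw difference."""
    return difference / (sd / sqrt(n))


for n in (100, 10_000, 1_000_000):
    z = z_for_sample_size(n)
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    print(f"n = {n:>9,}: z = {z:5.2f}, two-sided p-value = {p_two_sided:.3g}")
```

The raw difference never changes, so whether it matters in practice has to be judged separately from the p-value.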
Dangers of excessive fishing
With enough hypothesis tests, you’ll find something statistically significant.
Some of these statistically significant results may really be Type I errors.
Try to avoid excessive fishing for statistical significance. If you perform many tests, be sure to report how many you do. And see if results are replicated in separate studies.
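A quick calculation shows why fishing is dangerous. Under the simplifying assumptions that every null hypothesis is true and the tests are independent (assumptions made here, not stated in the slides), the chance of at least one Type I error in k tests at level α is 1 - (1 - α)^k. The function name is illustrative.

```python
def chance_of_any_false_positive(k, alpha=0.05):
    """P(at least one rejection) across k independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 20, 100):
    print(f"{k:3d} tests: {chance_of_any_false_positive(k):.3f}")
```

With 20 such tests the chance of at least one "significant" result is already well over one half, which is why the number of tests performed should be reported.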
Non-significant results
Failing to reject a null hypothesis is not a failed study.
It is just as important to learn that a null hypothesis explains data well as it is to learn that it does not.
Relationship between CI and hypothesis tests
You can use CIs like a hypothesis test
Example: Say your null hypothesis is H0: p = 0.5.
If the 95% CI does not contain the null hypothesis value, e.g. (0.64, 0.70), then the two-sided test has p-value < 0.05
If the 95% CI contains the null hypothesis value, e.g. (0.47, 0.87), then the two-sided test has p-value > 0.05
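The correspondence can be checked numerically. The sketch below assumes a normal-based interval of the form estimate ± z* × SE and uses the same SE for the interval and for the test; that is a simplification, since for proportions the test SE is usually computed under H0 and the two can disagree slightly near the boundary. The estimate and SE values are back-calculated here from the two example intervals, and the function name is hypothetical.

```python
from statistics import NormalDist


def interval_and_p_value(estimate, se, null_value, level=0.95):
    """Normal-based CI and two-sided p-value built from the same SE."""
    std_normal = NormalDist()
    z_star = std_normal.inv_cdf(0.5 + level / 2)
    interval = (estimate - z_star * se, estimate + z_star * se)
    z = (estimate - null_value) / se
    p_value = 2 * std_normal.cdf(-abs(z))
    return interval, p_value


# Roughly the interval (0.64, 0.70): estimate 0.67, half-width 0.03.
narrow_ci, narrow_p = interval_and_p_value(0.67, 0.03 / 1.96, null_value=0.5)
# Roughly the interval (0.47, 0.87): estimate 0.67, half-width 0.20.
wide_ci, wide_p = interval_and_p_value(0.67, 0.20 / 1.96, null_value=0.5)

for ci, p in ((narrow_ci, narrow_p), (wide_ci, wide_p)):
    contains_null = ci[0] <= 0.5 <= ci[1]
    print(f"CI = ({ci[0]:.2f}, {ci[1]:.2f}), contains 0.5: {contains_null}, p-value = {p:.3g}")
```

In this setup the null value falls outside the 95% interval exactly when the two-sided p-value is below 0.05.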
CIs vs Hypothesis tests
Hypothesis tests can identify parameter values that are inconsistent with the data.
They do not specify parameter values that plausibly could have produced the data.
Confidence intervals do this. Hence, when given a choice, use CIs over hypothesis tests.
Important caveat
A hypothesis test will not remedy a poorly designed study.
Bad data yield unreliable p-values.