What's Significant? Hypothesis Testing, Effect Size, Confidence Intervals, & the p-Value Fallacy


Transcript of What's Significant? Hypothesis Testing, Effect Size, Confidence Intervals, & the p-Value Fallacy

Page 1

What’s Significant?

Hypothesis Testing, Effect Size, Confidence Intervals, & the p-Value Fallacy

Patrick B. Barlow, The University of Tennessee

Page 2

On the Agenda…

• Recap of causation

• The basics of hypothesis testing

– From research question to testable hypothesis

• Effect size
– What is it?
– What can impact effect size?

• Confidence intervals
– What are they?
– How do you interpret them?
– What are the implications for interpreting statistical findings?

• Statistical significance & p-values
– What counts as “statistically significant”?
– Weaknesses of the p-value
– The p-value fallacy

• Putting it all Together

Page 3

Recap: Bradford Hill Criteria

• Strength of causal inference is affected by a number of different factors:
– Strength of association
– Consistency
– Specificity
– Temporal relationship
– Biological gradient
– Plausibility
– Coherence
– Experiment (reversibility)
– Analogy (consideration of alternate explanations)

Page 4

THE BASICS OF HYPOTHESIS TESTING

From research question to testable hypothesis

Statistical significance & p-values

Page 5

The Basics of Hypothesis Testing

In statistics, hypothesis testing forms the basis for the majority of inferential statistical tests.

• Three basic components:
– Null hypothesis (H0)
– Alternative/research hypothesis (H1)
– Error
• Type I
• Type II

• Originally conceived as a way to minimize error over infinite trials rather than to specify the absolute “truth” in a single scenario.
– Goodman equated hypothesis testing to “a system of justice that is not concerned with which individual defendant is found guilty or innocent…but tries instead to control the overall number of incorrect verdicts.”

Page 6

The Basics of Hypothesis Testing

Null Hypothesis (H0)

• Almost always the statement that no difference or relationship exists between the variables of interest.

• Example: A study looking at deep vein thrombosis (DVT) & the risk of pulmonary embolism (PE). The null hypothesis would be…
– “Having DVT does not increase one’s risk for developing a PE.”

Alternative Hypothesis (H1)

• The statement that you will be trying to “prove” by conducting your inferential statistics.

• It is almost always the statement that a difference or relationship does exist between the variables of interest.

• What would be an alternative hypothesis for our example?
– “Having DVT increases the risk of developing a PE.”

Page 7

The Basics of Hypothesis Testing

The two most common errors we encounter in statistical testing are Type I & Type II error. Both of these errors pose serious risks to the integrity of your conclusions if ignored.

• Type I error: falsely concluding a statistically significant relationship does exist when in fact it does not.
– “Alpha,” “false positive,” “false alarm,” “red herring,” etc.
– The origin of “p < .05” as the threshold for statistical significance.

• Type II error: failing to detect a statistically significant relationship when in fact one does exist.
– “Beta,” “miss,” “false negative”
– Statistical power & Type II error

The probabilities of committing these two errors are interdependent, so the researcher/analyst must consider which error would be more costly to their study.
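To see what “controlling the overall number of incorrect verdicts” means in practice, here is a minimal simulation sketch (Python; not from the original slides). Both groups are drawn from the same population, so the null hypothesis is true by construction, and a test run at alpha = .05 should raise a false alarm in roughly 5% of trials.

```python
# Long-run Type I error control: with H0 true, ~5% of tests at
# alpha = .05 come out "significant" purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_trials = 10_000

false_positives = 0
for _ in range(n_trials):
    # Two groups drawn from the SAME population, so H0 is true.
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1  # a Type I error ("false alarm")

print(f"Long-run Type I error rate: {false_positives / n_trials:.3f}")  # ~0.05
```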

Page 8

Your Turn

Instructions

In groups of 2-3, work together to brainstorm at least two research questions/topics, & answer each of the following questions:

Questions (for each research topic)

1. What is your research question?

2. What would you propose to use as a research design?

3. What would be the null hypothesis?

4. What are two possible alternative/research hypotheses that could be tested?

5. Considering the relationship between Type I & II error, which would be more costly/serious to commit if conducting your particular study?

Be prepared to discuss your answers!

Page 9

EFFECT SIZE

What is it?

How do we interpret effect sizes?

How does effect size relate to issues of statistical power, sample size, and error?

Page 10

What is it?

Generally speaking, the effect size represents the magnitude or strength of the relationship between two variables.

Two types

1. Unstandardized effect sizes, e.g.:
• The difference in the mean on your DV among levels of your IV.
• The difference in proportion of patients with an outcome in the exposed vs. the unexposed groups of your IV.

2. Standardized effect sizes, e.g.:
• The proportion of variance in the DV explained by your IV.
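As an illustrative sketch (Python, with made-up numbers; not from the slides), here is one effect size of each type, matching the bullets above: a risk difference for the proportion example and R² for the variance example.

```python
import numpy as np

# Unstandardized: difference in the proportion of patients with the
# outcome, exposed vs. unexposed (a risk difference).
exposed_events, exposed_n = 30, 100
unexposed_events, unexposed_n = 12, 100
risk_diff = exposed_events / exposed_n - unexposed_events / unexposed_n
print(f"Risk difference: {risk_diff:.2f}")  # 0.18 = 18 percentage points

# Standardized: proportion of variance in the DV explained by the IV (R^2).
iv = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
dv = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.7])
r = np.corrcoef(iv, dv)[0, 1]
print(f"R^2: {r ** 2:.2f}")  # unit-free share of variance explained
```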

Page 11

How do we interpret unstandardized effect sizes?

Interpreted in the same metric as your variables.

Example:

In a fitness study looking at differences between the sexes, men (M=26.0, SD=3.0) reported significantly higher average BMI than women (M=23.0, SD=2.5), p = .02.

What is the unstandardized effect size?

[Figure: bar chart, “Average BMI Between Men & Women Following Physical Fitness Intervention,” plotting average BMI for men and women pre- and post-intervention.]

Mean difference = 3.0 kg/m²
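In code, the unstandardized effect here is nothing more than the subtraction below (means taken from the slide); the answer stays in the variables’ own metric.

```python
# Unstandardized effect size for the BMI example: a simple mean
# difference, reported in kg/m^2 (the variables' own units).
men_mean, women_mean = 26.0, 23.0
mean_diff = men_mean - women_mean
print(f"Unstandardized effect size: {mean_diff:.1f} kg/m^2")  # 3.0 kg/m^2
```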

Page 12

Your Turn

In pairs, calculate & interpret (in sentence format) the unstandardized effect size. Be ready to share your interpretations.

1. Patients admitted to “academic” hospital clinics (M=.50, SD=.40) had lower average 90-day readmissions than patients seen by non-academic clinics (M=1.5, SD=.75), p = .02.

2. A researcher looks at differences in the number of side effects patients had on three different drugs (A, B, and C). Comparison of Drug “A” to Drug “B” shows average side effects to be 4 (SD=2.5) and 7 (SD=4.8), respectively, p = .04.

3. An article shows a difference in average number of COPD-related readmissions before (M=1.5, SD=2.0) and after (M=.05, SD=.90) a patient education intervention, p=.08.

4. An article shows a difference in average number of COPD-related readmissions before (M=1.5, SD=2.0), after (M=.05, SD=.90), and six months following a patient education intervention (M=0.80, SD=3.0), p = .12.

Page 13

How do we interpret standardized effect sizes?

Two of the most common standardized effect sizes are risk/odds ratios and Pearson r / R².

Page 14

Interpreting ORs and RRs

• Odds/risk ratio ABOVE 1.0 = your exposure INCREASES the risk of the event occurring.
– For ORs/RRs between 1.00 and 1.99, the risk is increased by (OR − 1) × 100%.
– For ORs/RRs of 2.00 or higher, the risk is increased OR times, but you could also still use (OR − 1) × 100%.

• Examples:
– Smoking is found to increase your odds of breast cancer by OR = 1.25. What is the increase in odds? You are 25% more likely to have breast cancer if you are a smoker.
– Smoking is found to increase your risk of developing lung cancer by RR = 4.8. What is the increase in risk? You are 4.8 times more likely to develop lung cancer if you are a smoker vs. a non-smoker.

Page 15

Interpreting ORs and RRs

• Odds/risk ratio BELOW 1.0 = your exposure DECREASES the risk of the event occurring.
– The risk is decreased by (1 − OR) × 100%.
– Often called a PROTECTIVE effect.

• Example:
– Addition of the new guidelines for pacemaker/ICD interrogation produced an OR for device interrogation of .30 versus the old guidelines. What is the reduction in odds? (1 − OR) = (1 − .30) = .70, a 70% reduction in odds.
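The interpretation rules from these two slides fit in a short helper function. This is a hedged sketch, not from the slides; interpret_ratio is a hypothetical name.

```python
def interpret_ratio(ratio: float, label: str = "OR") -> str:
    """Apply the slides' rules: (OR - 1) x 100% increase above 1.0,
    (1 - OR) x 100% decrease (protective effect) below 1.0."""
    if ratio > 1.0:
        return f"{label} = {ratio:.2f}: risk/odds increased by {(ratio - 1) * 100:.0f}%"
    if ratio < 1.0:
        return f"{label} = {ratio:.2f}: risk/odds decreased by {(1 - ratio) * 100:.0f}%"
    return f"{label} = 1.00: no association"

print(interpret_ratio(1.25))       # 25% higher odds (breast cancer example)
print(interpret_ratio(4.8, "RR"))  # 380% increase, i.e. 4.8 times the risk
print(interpret_ratio(0.30))       # 70% reduction (pacemaker/ICD example)
```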

Page 16

Your Turn

Instructions

Feel free to make up your own examples or just use “odds/risk of having the disease if you have the exposure of interest.”

What does the OR/RR say about the strength of the relationship?

Practice

1. OR = 3.00

2. OR = .39

3. RR = 1.50

4. OR = 1.00

5. RR = .22

6. RR = 18.99

7. OR = .78

8. RR = 6.30

Page 17

Interpreting r / R2

Pearson r

• Provides the strength of a linear relationship between exactly two continuous, quantitative variables.

• Can range from -1 (perfect negative) to 1 (perfect positive)

• Most correlational studies only report r

R²

• Literally calculated as the square of the r statistic.

• Also known as the coefficient of determination

• Provides the proportion of shared variance between your IV and DV.
– What’s the range?
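A minimal sketch of both statistics with scipy (the data values are made up):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

r, p = stats.pearsonr(x, y)   # strength of the linear relationship
print(f"r = {r:.3f} (range: -1 to 1)")
print(f"R^2 = {r ** 2:.3f}")  # coefficient of determination, 0 to 1
```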

Page 18

How do we interpret effect sizes?

Page 19

How does effect size relate to issues of statistical power, sample size, and error?

Effect size vs. statistical power, sample size, and error:

• As effect size increases, statistical power also increases, which means that (1) you need a smaller sample size, and (2) you have a lower chance of making a Type II error (i.e., a “miss”); see the power sketch below.

So, when possible, measure for a large effect size!
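A concrete sketch of the trade-off, assuming statsmodels is available. The 0.2/0.5/0.8 values are the conventional “small/medium/large” benchmarks for a standardized mean difference (they are not given on the slide), and the resulting sample sizes are approximate.

```python
# Required sample size per group for a two-sample t-test at
# alpha = .05 and 80% power, as the effect size grows.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect in (0.2, 0.5, 0.8):
    n = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.8)
    print(f"effect size {effect}: ~{n:.0f} participants per group")
# Roughly 393, 64, and 26 per group: larger effects need far fewer
# subjects and leave less room for a Type II error at a given n.
```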

Page 20

CONFIDENCE INTERVALS

What are they?

How do you interpret?

How do they affect our conclusions?

An OR/RR is only as important as the confidence interval that comes with it!

Page 21

What are they?

• Confidence intervals provide, as the name suggests, a measure of the confidence we can place in a particular inferential statistic.

• Provide the range of values within which we are confident the true population parameter (e.g. mean, proportion, etc.) exists.

• Usually set at 95%

• They are calculated by using:

• Standard error of measurement (Sm or SE)

• Point estimate for your sample (e.g. t statistic)

• Degrees of freedom for the sample
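A minimal sketch assembling a 95% CI for a sample mean from those three ingredients (made-up data):

```python
import numpy as np
from scipy import stats

data = np.array([23.1, 24.5, 22.8, 25.2, 23.9, 24.1, 22.5, 25.0])

mean = data.mean()                             # point estimate
se = stats.sem(data)                           # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(data) - 1)  # uses the degrees of freedom

lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"Mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```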

Page 22

What are they? OR / RR example

95% confidence intervals are added to any OR/RR calculation to provide an estimate of the accuracy of the estimation.

• Size Matters!

– Wide CI = weaker inference

– Narrow CI = stronger inference

– CI crosses over 1.0 = non-significant

• Any 95% CI can instantly tell us:

1. Sample size

2. Accuracy of estimation

3. Statistical significance
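The slide does not show how the CI is computed, so here is a hedged sketch using the standard Wald method (the CI is built on the log-odds scale, then exponentiated) with made-up 2×2 counts:

```python
import math

a, b = 40, 60   # exposed:   events, non-events
c, d = 15, 85   # unexposed: events, non-events

or_est = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lower = math.exp(math.log(or_est) - 1.96 * se_log_or)
upper = math.exp(math.log(or_est) + 1.96 * se_log_or)

print(f"OR = {or_est:.2f} (95% CI {lower:.2f} - {upper:.2f})")
# A wider interval (e.g. from a smaller sample) means a weaker inference;
# an interval crossing 1.0 means the result is not significant at .05.
```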

Page 23

Interpreting 95% Confidence Intervals

95% CI of an Odds or Risk Ratio

• What you read…
– OR = 4.5 (95% CI = 2.8 – 6.1)

• What you interpret…
– Lower bound: OR = 2.8
– Upper bound: OR = 6.1

• How you interpret…
– “We are 95% confident that the true odds of disease for exposed vs. unexposed lies between 2.8 and 6.1.”

Your Turn

Interpret these 95% CIs

1. OR 2.4 (95% CI 1.7 - 3.3)

2. OR 6.7 (95% CI 1.4 - 107.2)

3. OR 1.2 (95% CI .147 - 1.97)

4. OR .37 (95% CI .22 - .56)

5. OR .57 (95% CI .12 - .99)

6. OR .78 (95% CI .36 – 1.65)

Page 24

STATISTICAL SIGNIFICANCE

What counts as “statistically significant”?

Weaknesses of the p-value

The p-value fallacy

Page 25

What counts as “statistically significant”?

• To be considered statistically significant, the probability of obtaining a value of the test statistic (e.g. t, z, F, or χ²) must be smaller than the probability of committing a Type I error.

• In other words, the probability (p) must be less than (<) what you have chosen for your alpha value (.05).
– So, in most cases we conclude that a relationship is statistically significant if the test returns p < .05.
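In code, that decision rule is a single comparison. A minimal sketch with made-up data:

```python
from scipy import stats

group_a = [5.1, 6.2, 5.8, 7.0, 6.5, 5.9]
group_b = [4.2, 4.8, 5.0, 4.5, 5.2, 4.1]

t_stat, p = stats.ttest_ind(group_a, group_b)
alpha = 0.05  # the chosen Type I error rate
verdict = "statistically significant" if p < alpha else "not significant"
print(f"t = {t_stat:.2f}, p = {p:.4f} -> {verdict} at alpha = {alpha}")
```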

Page 26

Interpretation & Practice

• If a statistically significant relationship is found, then we conclude that the observed relationship is too great to exist by chance alone.

• Which of the following are statistically significant results?
1. t(34) = 5.89, p = .002
2. F(3, 285) = 1.09, p = .101
3. χ²(4) = 18.78, p = .04
4. t(68) = 4.25, p = .05

Page 27

Weakness of p-values

• Not truly compatible with hypothesis testing
– Absence of evidence vs. evidence of absence

• Never meant to be the sole indicator of significance
– Average knowledge of statistical interpretation in evidence-based professions

• No consideration of effect size

• What influences p-values?
– Sample size (see the sketch below)
– Chance
– Effect size
– Statistical power
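The sample-size influence is easy to demonstrate. In the made-up simulation below, the true effect is held fixed at a small 0.1 SD while only n changes; p typically lands above .05 at n = 20 and below .05 at n = 2000.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (20, 2000):
    a = rng.normal(loc=0.0, scale=1.0, size=n)
    b = rng.normal(loc=0.1, scale=1.0, size=n)  # same small true effect
    _, p = stats.ttest_ind(a, b)
    print(f"n = {n:>4} per group: p = {p:.4f}")
# The effect size never changed; only the sample size (and hence the
# p-value) did -- one reason p alone is a poor measure of importance.
```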

Page 28

The “p-value fallacy”

P-values have become the “have your cake and eat it too” of the statistical world.

• You get the supposed accuracy of a single study (short term) while being able to simultaneously avoid errors in the long term.

• Comes from misinterpretation of p-values as absolute indicators of the strength of a relationship. That is, seeing p = .03 as more significant than p = .04.

Page 29

PUTTING IT ALL TOGETHER

How to use multiple sources to become a better consumer of Epidemiologic Evidence

Page 30

Going beyond the p-value

• Measures of effect size provide a far more vivid description of the magnitude of the relationship.
– An OR of 4.30 is stronger than an OR of 1.50.
– A mean difference of 35 pts is larger than a mean difference of 20 pts.
– 65% of the variance is more than 20% of the variance.

• The 95% CI provides far more information on the accuracy of the inference.
– Which is more accurate?
• OR = 2.5 (95% CI = 1.2 – 10.0) vs. OR = 2.5 (95% CI = 1.2 – 3.1)

Page 31

When reading an article…

Always consider:

1. What is the research question? Have the researchers used the correct null & alternative hypotheses?

2. How large is the…
− Sample? Subgroup? Etc.
− Effect size? (standardized or unstandardized)
− Confidence interval?

3. Finally, what is the p-value?

Page 32

Just because a finding is not significant does not mean that it is not meaningful.

You should always consider the effect size and context of the research when making a decision about whether or not any finding is clinically relevant.