Biostatistics in Practice Peter D. Christenson Biostatistician http://gcrc.LABioMed.org/ Biostat Session 3: Testing Hypotheses


Biostatistics in Practice

Peter D. Christenson, Biostatistician

http://gcrc.LABioMed.org/Biostat

Session 3: Testing Hypotheses


Readings for Session 3

Select sections from www.StatisticalPractice.com entitled:

• Significance test / hypothesis testing

• Significance tests simplified


Session 3 Outline

• Mechanics of statistical testing

• TBI study example

• t-Test

• p-values

• Conceptual understanding

• Key issues

• Comparison to diagnostic testing

• Not covered in detail; needed in session 4 on power.


Goal: Do Groups Differ By More than is Expected By Chance?

Cohan (2005) Crit Care Med;33:2358-66.


Goal: Do Groups Differ By More than is Expected By Chance?

First, need to:

• Specify experimental units (Persons? Blood draws?).

• Specify single outcome for each unit (e.g., Yes/No, mean or minimum of several measurements?).

• Examine raw data, e.g., histogram, for meeting test requirements.

• Specify group summary measure to be used (e.g., % or mean, median over units).

• Choose particular statistical test for the outcome.


Outcome Type → Statistical Test

Cohan (2005) Crit Care Med;33:2358-66.

Medians → Wilcoxon Test

%s → Chi-Square Test

Means → t Test


Minimal MAP: Group Distributions of Individual Units

AI Group (N=42)

Stem  Leaf          #
   7  6             1
   7  11334         5
   6  555           3
   6  01112344      8
   5  5566778       7
   5  01222234      8
   4  57788         5
   4  23            2
   3  6             1
   3  13            2
      ----+----+----+----+
Multiply Stem.Leaf by 10**+1

Non-AI Group (N=38)

Stem  Leaf          #
   7  79            2
   7  00111234      8
   6  5556777888   10
   6  00112234      8
   5  67999         5
   5  3             1
   4  79            2
   4  04            2
      ----+----+----+----+
Multiply Stem.Leaf by 10**+1

→ Approximately normally distributed

→ Use means to summarize groups.

→ Use t-test to compare means.


Goal: Do Groups Differ By More than is Expected By Chance?

Next, need to:

1. Calculate a standardized quantity for the particular test, a “test statistic”.

• Often: t=(observed - expected diff)/SE(obs diff)

2. Compare the test statistic to what it is expected to be if (populations represented by) groups do not differ.

• Often: t is approx’ly std normal bell curve.

3. Declare groups to differ if test statistic is too deviant.

• Often: absolute value of t >~2.


t-Test for Minimal MAP: Step 1

1. Calculate a standardized quantity for the particular test, a “test statistic”.

• Often: t=(observed - expected diff)/SE(obs diff)

Observed difference = diff in means = 63.4 - 56.2 = 7.2

Expected Difference = 0 if groups do not differ.

SE(Obs Diff) ≈ sqrt[SEM₁² + SEM₂²] = sqrt(1.66² + 1.41²) ≈ 2.2

Group    N    Mean          Std Dev       SE(Mean)
AI       42   56.1666667    10.7824634    1.66 = 10.78/√42
Non-AI   38   63.4122807    8.7141575     1.41 = 8.71/√38

→ Test Statistic = t = (7.2 - 0)/2.2 = 3.28
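The Step 1 arithmetic can be reproduced from the summary statistics on this slide. The Python sketch below is illustrative only: the variable names are ours, and the pooled-variance version is an assumption about how an equal-variance t-test would compute the SE, not something stated on the slide.

```python
import math

# Summary statistics copied from the slide.
n_ai, mean_ai, sd_ai = 42, 56.1666667, 10.7824634
n_non, mean_non, sd_non = 38, 63.4122807, 8.7141575

# Standard error of each group mean: SD / sqrt(N).
sem_ai = sd_ai / math.sqrt(n_ai)
sem_non = sd_non / math.sqrt(n_non)

# Slide formula: SE of the difference built from the two SEMs.
diff = mean_non - mean_ai
se_diff = math.sqrt(sem_ai**2 + sem_non**2)
t_slide_formula = (diff - 0) / se_diff

# Assumed alternative: pooled-variance (equal-variance) SE.
pooled_var = ((n_ai - 1) * sd_ai**2 + (n_non - 1) * sd_non**2) / (n_ai + n_non - 2)
se_pooled = math.sqrt(pooled_var * (1 / n_ai + 1 / n_non))
t_pooled = diff / se_pooled

print(round(sem_ai, 2), round(sem_non, 2))  # 1.66 1.41
print(round(t_slide_formula, 2), round(t_pooled, 2))
```

With unrounded inputs the slide formula gives t of about 3.3 and the pooled version gives about 3.28. Either way the statistic is well beyond 2.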


t-Test for Minimal MAP: Step 2

2. Compare the test statistic to what it is expected to be if (populations represented by) groups do not differ.

• Often: t is approx’ly std normal bell curve.

Figure: bell curve of the test statistic, with the central region marked “Expect” (0.95 chance) and the observed value 3.28 marked.

Expected values for test statistic if groups do not differ.

Area under sections of curve = probability of those values (1 for -∞ to ∞).

Prob (-2 to -1) is Area = 0.14
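The areas quoted for the bell curve can be checked with Python's standard library. This is a minimal sketch that assumes the test statistic follows a standard normal curve exactly; the helper name area is ours.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

def area(lo, hi):
    """Probability that the test statistic falls between lo and hi."""
    return z.cdf(hi) - z.cdf(lo)

area_minus2_to_minus1 = area(-2, -1)
area_central = area(-2, 2)
print(round(area_minus2_to_minus1, 2))  # 0.14
print(area_central)                     # just over 0.95
```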


t-Test for Minimal MAP: Step 3

Figure: bell curve with the central region marked “Expect” (95% chance), 2.5% in each tail, and the observed value 3.28 marked.

3. Declare groups to differ if test statistic is too deviant.

• Often: absolute value of t >~2.

Convention: “Too deviant” is < 5% chance → absolute value of t >~2.

“Two-tailed” = the 5% is allocated equally for either group to be superior.

Conclude: Groups differ, since a test statistic ≥3.28 has <5% chance if there is no difference in the entire populations.


t-Test for Minimal MAP: Step 3 - p value

Figure: bell curve with the central region marked “Expect” (95% chance) and the observed value 3.28 marked.

3. Declare groups to differ if test statistic is too deviant.

• Often: absolute value of t >~2.

p-value:

Probability of a test statistic at least as deviant as observed, if populations really do not differ.

Smaller values ↔ more evidence of group differences.

Area in each tail = 0.0007

p value = 2(0.0007) = 0.0014 << 0.05
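The two-tailed p-value can be sketched in Python under the "approximately standard normal" simplification from Step 2. That simplification is an assumption here: the slide's tail area of 0.0007 presumably comes from the t distribution, whose tails are slightly heavier, so the normal approximation gives a somewhat smaller p-value (about 0.001).

```python
from statistics import NormalDist

t_observed = 3.28  # test statistic from Step 1

# One tail: chance of a value at least this large if groups do not differ.
one_tail = 1 - NormalDist().cdf(abs(t_observed))

# Two-tailed p-value: either group could have been the higher one.
p_value = 2 * one_tail
print(p_value < 0.05)  # True
```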


Back to Paper: Minimal MAP

• Δ= 63.4-56.2= 7.2 is the best guess for the MAP diff between a randomly chosen AI and non-AI patient, w/o other patient info.

• Δ= 7.2 is the best guess for the MAP diff between the means of “all” AI and non-AI patients. We are 95% sure that diff is within ≈ 7.2±2SE(Diff) = 7.2±2(2.2) = 2.8 to 11.6.

• Δ= 7.2 is statistically significant (p=0.0014); i.e., only 14 of 1000 sets of 80 patients would differ so much, if AI and non-AI really don’t differ in MAP.

• Is Δ= 7.2 clinically significant? … significant for basic science?


Confidence Intervals ↔ Tests

95% Confidence Intervals

Non-overlapping 95% confidence intervals, as here, are sufficient for significant (p<0.05) group differences.

However, non-overlapping intervals are not necessary: the intervals can overlap and the groups can still differ significantly. If the single 95% CI for the difference (2.8 to 11.6 here) does not contain 0, then the groups differ with p<0.05.
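The relation between the two kinds of interval can be illustrated with a short Python sketch. The first part uses the rounded minimal MAP values from the t-test slides; the second part uses hypothetical groups (means 10.0 and 13.5, SEM 1.0 each) invented only to show overlapping group intervals alongside a significant difference. The helper names ci and overlap are ours.

```python
import math

def ci(estimate, se):
    """Approximate 95% confidence interval: estimate +/- 2 SE."""
    return (estimate - 2 * se, estimate + 2 * se)

def overlap(a, b):
    """True if intervals a and b share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# Minimal MAP example (means and SEMs from the t-test slides).
ci_ai, ci_non = ci(56.2, 1.66), ci(63.4, 1.41)
ci_diff = ci(7.2, 2.2)
print(overlap(ci_ai, ci_non))   # False: the group CIs are separate
print(ci_diff[0] > 0)           # True: the CI for the difference excludes 0

# Hypothetical groups: means 10.0 and 13.5, each with SEM 1.0.
h1, h2 = ci(10.0, 1.0), ci(13.5, 1.0)
h_diff = ci(13.5 - 10.0, math.sqrt(1.0**2 + 1.0**2))
print(overlap(h1, h2))          # True: the group CIs overlap
print(h_diff[0] > 0)            # True: yet the difference CI excludes 0
```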


Back to Paper: Experimental Units

Cannot use t-test for comparing lab data for multiple blood draws per subject.


Tests on Percentages

Is 26.3% vs. 61.9% statistically significant (p<0.05), i.e., a difference so large that it has a <5% chance of occurring if groups do not really differ?

Solution: same theme as for means. Find a test statistic and compare to its expected values if groups do not differ.

See next slide.


Tests on Percentages

Figure: Chi-Square Distribution, with the region marked “Expect” (95% chance) ending at 5.99, the expected value 1, and the observed value 10.2 with area = 0.002 beyond it.

Here, the test statistic is a ratio, expected to be 1, rather than a difference, expected to be 0.

Test statistic=10.2 >> 5.99, so p<0.05. In fact, p=0.002.
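The chi-square statistic can be reproduced with a short Python sketch. The counts are an assumption: 10 of 38 and 26 of 42 are inferred from the percentages and the group sizes on the earlier slides, and are not given here. The sketch uses the ordinary Pearson statistic without a continuity correction and treats the 2×2 table as having 1 degree of freedom.

```python
import math

# Assumed counts: 10 of 38 (26.3%) and 26 of 42 (61.9%).
table = [[10, 38 - 10],
         [26, 42 - 26]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected.
chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_square += (observed - expected) ** 2 / expected

# Upper-tail area for a chi-square with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_square / 2))
print(round(chi_square, 1))  # 10.2
```

Under these assumptions the p-value is a little over 0.001, the same order as the 0.002 quoted. For 1 degree of freedom the 95% cutoff is about 3.84 (5.99 is the cutoff for 2 degrees of freedom); 10.2 exceeds both.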


Example for Conceptual Approach

Consider a parallel study:

1. Randomize an equal number of subjects to treatment A or treatment B.

2. Follow all subjects for a specified period of time.

3. Measure X= post-pre change in an outcome, such as cholesterol.

Primary Aim: Do treatments A and B differ in mean effectiveness?

Restated aim: If μA and μB are the true, unknown, mean post-pre changes that would occur if all potential subjects received treatment A or treatment B, do we have evidence from our limited sample whether μA ≠ μB?


Extreme Outcome #1

Suppose results from the study are plotted as:

Figure: X plotted for groups A and B. Each point is a separate subject.

Obviously, B is more effective than A.


Extreme Outcome #2

Suppose results from the study are plotted as:

Figure: X plotted for groups A and B. Each point is a separate subject.

Obviously, A and B are equally effective.


More Realistic Possible Outcome I

Suppose results from the study are plotted as:

Figure: X plotted for groups A and B. Each point is a separate subject.

Is the overlap small enough to claim that B is more effective?


More Realistic Possible Outcome II

Suppose the ranges are narrower, with the same group mean difference:

Figure: X plotted for groups A and B. Each point is a separate subject.

Now, is this minor overlap sufficient to come to a conclusion?


More Realistic Possible Outcome III

Suppose the ranges are wider, but so is the group difference:

Figure: X plotted for groups A and B. Each point is a separate subject.

Is the overlap small enough to claim that B is more effective?


More Realistic Possible Outcome IV

Here, the ranges for X are the same as the last slide, but there are many more subjects:

Figure: X plotted for groups A and B. Each point is a separate subject.

So, just examining the overlap isn’t sufficient to come to a conclusion, since intuitively the larger N should affect the results.


Our Goal

Goal: We need a rule that can be consistently applied to most studies to make the decision whether or not μA ≠ μB.

From the previous 4 slides, relevant measures that will go into our decision rule are:

1. Number of subjects, N; could be different for the groups.

2. Difference between groups in observed means (X-bar for A and for B subjects).

3. Variability among subjects (SD for A and B subjects).
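How the three measures combine can be sketched with the same test statistic used for the minimal MAP example. The numbers below are hypothetical, chosen only to show that a larger N or a smaller SD gives a more extreme statistic for the same difference in means.

```python
import math

def t_statistic(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference in observed means divided by its standard error."""
    se_diff = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_b - mean_a) / se_diff

# Hypothetical study: mean changes of 10 and 13 in groups A and B.
t_small = t_statistic(10, 6, 10, 13, 6, 10)      # SD 6, 10 subjects per group
t_large = t_statistic(10, 6, 100, 13, 6, 100)    # SD 6, 100 subjects per group
t_tight = t_statistic(10, 3, 10, 13, 3, 10)      # SD 3, 10 subjects per group
print(round(t_small, 2), round(t_large, 2), round(t_tight, 2))
```

Only the first of the three hypothetical studies gives a statistic below 2, even though the difference in means is 3 in each.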


Goal, Continued

Goal: We need a rule that can be consistently applied to most studies to make the decision whether or not μA ≠ μB.

Other relevant issues:

1. Our conclusion could be wrong. We need to incorporate a mechanism for minimizing that possibility.

2. Small differences are probably unimportant. Can we incorporate that as well?


A Graphical Look at All of the Issues

The figure on the following slide shows most of the issues that are involved in testing hypotheses.

It is complicated, but we will go through each of the factors that it addresses, on slides after the figure:

1. Null hypothesis H0 vs. alternative hypothesis HA.

2. Decision rule: Choose HA if … [involves Ns, means and SDs].

3. α = Probability(Type I error) = Prob(choosing HA when H0 is true).

4. β = Probability(Type II error) = Prob(choosing H0 when HA is true).

5. What changes if N were larger?


Graphical Representation of Hypothesis Tests


1: Null hypothesis H0 vs. alternative hypothesis HA.

All statistical tests have two hypotheses to choose from:

The null hypothesis states a negative conclusion, that there is “no effect”, which could mean various specific outcomes in different studies. It always includes at least one mathematical expression that is 0.

Here, the null hypothesis is H0: μA- μB = 0. This states that the post-pre changes are, on the average, the same for A as for B. The left (red) curve has its peak at this 0.

The alternative hypothesis includes every possibility other than 0, i.e., HA: μA- μB ≠ 0. In the figure, we chose just one alternative for illustration, namely that μA- μB = 3. The right (blue) curve has its peak at this value of 3.

For each curve, the height represents the relative frequency of the observed difference in means over many repeated studies, so more studies give differences near the peak.


2: Decision Rule for Choosing H0 or HA. A poor, but reasonable rule.

First suppose that we only consider choosing between H0 and the particular HA: μA- μB = 3, as in the figure.

Common sense might say that we calculate x-bar (the mean of changes for A subjects minus the mean of changes for B subjects), and then choose H0 if x-bar is closer to 0, the hypothesized value under H0, or choose HA if it is closer to 3, the hypothesized value for HA.

The green line in the figure is on the x-bar from the sample, which is 1.128, and so H0 would be chosen with this rule, since 1.128 is closer to 0 than to 3.

A problem with this rule is that we cannot state how certain we are about our decision. It seems like the reasonable choice between the 2 possibilities, but if we used the rule in many studies, we could not say that most (90%?, 95%?) were correct.
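The "closest value" rule and its weakness can be sketched in Python. The function name is ours, and the SE of 1.4 is an assumption (half of the 2.8 quoted on the next slide), used only to show how often this rule would pick HA when H0 is true.

```python
from statistics import NormalDist

def closest_hypothesis(x_bar, h0_value=0.0, ha_value=3.0):
    """Pick whichever hypothesized value is nearer to the observed x-bar."""
    if abs(x_bar - h0_value) <= abs(x_bar - ha_value):
        return "H0"
    return "HA"

choice = closest_hypothesis(1.128)
print(choice)  # H0, since 1.128 is 1.128 from 0 but 1.872 from 3

# The rule picks HA whenever x-bar exceeds the midpoint, 1.5.
# Chance of that when H0 is true, assuming SE = 1.4:
false_positive_rate = 1 - NormalDist(0, 1.4).cdf(1.5)
print(round(false_positive_rate, 2))  # 0.14
```

Under that assumed SE, the rule would wrongly pick HA in roughly one of seven studies where H0 holds, and the rate would change with every study's SE.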


2: Decision Rule for Choosing H0 or HA. The correct rule.

To start to quantify the certainty of some conclusions we will make, recall the reasoning for confidence intervals.

If H0 is true, we expect that x-bar will not only be close to 0, but that with 95% probability, it will be within about* ±2SE of 0, i.e., between about -2.8 and +2.8. This is the region under the H0 (red) curve that is not hatched with \\\.

Thus, the decision rule is: Choose HA if x-bar is outside 0±2SE, i.e., in the critical region. The reason for using this rule is that if H0 is really true, then there is only a 5% chance we would get an x-bar in the critical region. Thus, when H0 is true, there is only a 5% chance that we wrongly decide on HA in any particular test. Roughly, if the rule is applied consistently, then only 5% of the tests in which H0 is true will give false positive conclusions, although which ones are wrong is unknown.

*See a textbook for exact calculations. The multiplier is slightly larger than 2.
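The decision rule and its long-run behavior can be sketched in Python. The SE of 1.4 is an assumption, taken as half of the 2.8 quoted above, and the simulation draws x-bar from a normal curve centered at 0, i.e., from studies in which H0 is true.

```python
import random

SE = 1.4             # assumed: the slide's 2.8 is about 2 * SE
CUTOFF = 2 * SE      # critical region is |x-bar| > 2.8

def decide(x_bar):
    """Choose HA only if x-bar falls in the critical region."""
    return "HA" if abs(x_bar) > CUTOFF else "H0"

decision = decide(1.128)
print(decision)  # H0

# Simulate many studies in which H0 is true (true difference 0).
rng = random.Random(0)
trials = 100_000
false_positives = sum(decide(rng.gauss(0, SE)) == "HA" for _ in range(trials))
false_positive_rate = false_positives / trials
print(false_positive_rate)  # a little under 0.05
```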


3: Probabilities of False Positive Conclusions

A false positive conclusion, i.e., choosing HA (positive conclusion) when H0 is really true (so the conclusion is false) is considered the more serious error, denoted “Type I”.

We have guaranteed (previous slide) that the rate for this error, denoted α=level of significance, is 0.05, or that there is a 5% chance of it occurring.

The 0.05 or 5% value is just the conventional level of risk for positive conclusions that scientists have decided is acceptable. The FDA also requires this level in most clinical studies.

The concept carries over for other levels of risk, though, and statistical tables can determine the critical region for other levels, e.g., approximately 0±1.65SE for α=0.10, where we would choose HA more often, and make twice as many mistakes in the long run in so doing.
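The link between the level of risk and the width of the critical region can be checked without tables. This sketch assumes the normal approximation; the two function names are ours.

```python
from statistics import NormalDist

def alpha_for_multiplier(k):
    """Two-tailed chance that x-bar lands outside 0 +/- k*SE when H0 is true."""
    return 2 * (1 - NormalDist().cdf(k))

def multiplier_for_alpha(alpha):
    """Multiplier k such that 0 +/- k*SE leaves alpha in the two tails."""
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(multiplier_for_alpha(0.05), 2))   # 1.96
print(round(multiplier_for_alpha(0.10), 3))   # 1.645
print(alpha_for_multiplier(2))                # slightly below 0.05
```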


4: Probabilities of False Negative Conclusions

In our figure example, we choose H0, i.e., no treatment difference, i.e., a negative conclusion, since x-bar=1.128 is between -2.8 and +2.8. If we had chosen HA, we would know that the rule makes that choice wrongly only 5% of the time when H0 is true.

Can we also quantify the chances of a false negative conclusion, which we might be making here?

Yes, but it will depend on what really constitutes “false negative”. I.e., we conclude μA- μB = 0, but if really μA- μB = 0.0001, are we wrong in a practical sense? Often, a value for a clinically relevant effect is specified, such as 3 in the figure example. Then, if HA: μA- μB=3 is really true, but we choose H0, we have made a Type II error. Its probability is the area under the correct (HA now, blue) curve in the region where H0 is chosen (///). The computer needs to calculate this, and it is 0.41 here.
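The area described above can be computed directly. This sketch assumes normal curves with SE = 1.4 (half of the rounded 2.8 cutoff); with those rounded inputs the result is about 0.44, close to but not the same as the 0.41 quoted, which presumably rests on unrounded values.

```python
from statistics import NormalDist

SE = 1.4          # assumed: the 2.8 cutoff is about 2 * SE
CUTOFF = 2.8      # H0 is chosen when x-bar is between -2.8 and +2.8
EFFECT = 3.0      # clinically relevant difference used for HA

# Distribution of x-bar if HA: muA - muB = 3 is really true.
under_ha = NormalDist(mu=EFFECT, sigma=SE)

# beta = area under the HA curve inside the region where H0 is chosen.
beta = under_ha.cdf(CUTOFF) - under_ha.cdf(-CUTOFF)
print(round(beta, 2))  # 0.44
```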


3 and 4: Tradeoffs Between Risks of Two Errors

In our figure example, if μA- μB=3 is the smallest difference that we care about (smaller differences are 0 in a practical sense), then we have an α=0.05 chance of wrongly declaring that treatments differ when in fact they are identical, and a β=0.41 chance of declaring them the same when they really differ by 3.

If we try to decrease the risk of one of the errors, the risk of the other error increases, i.e., α↑ as β↓. [This is the same as sensitivity and specificity of diagnostic tests.] To visualize it on our figure, imagine shifting the ///\\\ demarcation at 2.8 to the left, to say 2.7. That increases α. Then the /// area, i.e., β, decreases.
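The shift of the demarcation from 2.8 to 2.7 can be tried numerically. As before, SE = 1.4 is an assumption and the curves are taken to be normal; the function name is ours.

```python
from statistics import NormalDist

SE, EFFECT = 1.4, 3.0   # assumed SE; effect of 3 from the figure

def error_rates(cutoff):
    """Return (alpha, beta) when HA is chosen for |x-bar| > cutoff."""
    alpha = 2 * (1 - NormalDist(0, SE).cdf(cutoff))
    under_ha = NormalDist(EFFECT, SE)
    beta = under_ha.cdf(cutoff) - under_ha.cdf(-cutoff)
    return alpha, beta

alpha_28, beta_28 = error_rates(2.8)
alpha_27, beta_27 = error_rates(2.7)
print(alpha_27 > alpha_28, beta_27 < beta_28)  # True True
```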

Practical application: If A is a current treatment, and B is a potential new one, then smaller αs mean that we are more concerned with marketing a non-superior new drug. Smaller βs mean we are more concerned with missing a superior new drug.


5: Effect of Study Size on Risks of Error

In the previous slide, the FDA may want a small α, and the drug company might want a small β. To achieve both, a larger study could be performed. We can verify this with our graph.

In our figure example, suppose we had had a larger study, say twice as many subjects in each group. Then, both curves will be narrowed, since their widths depend on SE, which has √N in the denominator. If we maintain α=0.05, the ///\\\ demarcation will shift to the left due to the narrowed left curve, and β will be much smaller, due to both the narrower right curve, and the demarcation shift. The demarcation could then be shifted to the right to lower α, which increases the current β, but still keeps it small.

There are algorithms to choose the right N to achieve any desired α and β.
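The effect of doubling the study size can be sketched with the same assumed values (SE = 1.4 for the original study, normal curves, effect of 3). The stricter cutoff of 2.576 SE, which corresponds to α of about 0.01, is our illustrative choice and is not from the slides.

```python
import math
from statistics import NormalDist

EFFECT = 3.0
SE_ORIGINAL = 1.4   # assumed SE for the original study size

def beta_at(se, multiplier=2.0):
    """Type II error rate when HA is chosen for |x-bar| > multiplier * SE."""
    cutoff = multiplier * se
    under_ha = NormalDist(EFFECT, se)
    return under_ha.cdf(cutoff) - under_ha.cdf(-cutoff)

# Doubling N in each group divides SE by sqrt(2).
se_doubled = SE_ORIGINAL / math.sqrt(2)

beta_original = beta_at(SE_ORIGINAL)
beta_doubled = beta_at(se_doubled)
# Stricter alpha (about 0.01) with the doubled study: cutoff at 2.576 * SE.
beta_doubled_strict = beta_at(se_doubled, multiplier=2.576)
print(round(beta_original, 2), round(beta_doubled, 2), round(beta_doubled_strict, 2))
```

Under these assumptions β falls from about 0.44 to about 0.15 when N is doubled, and stays below the original value even after the demarcation is moved right to lower α.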


Power of a Study

Statistical power = 1 – β.

Power is thus the probability of correctly detecting an effect in a study. In our example, the drug company is really thinking not in terms of β, but in the ability of the study to detect that the new drug is actually more effective, if in fact it is.

Since the FDA requires α=0.05, a major component of designing a study is the determination of its size so that it has sufficient power.
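As a rough preview of that determination, power can be computed as a function of study size. Everything numeric here is an assumption: SE = 1.4 at the original size, normal curves, the ±2SE rule, and a hypothetical target power of 0.80.

```python
import math
from statistics import NormalDist

EFFECT, SE_ORIGINAL = 3.0, 1.4   # effect from the figure; SE is assumed

def power(n_multiplier):
    """Power = 1 - beta when each group's N is scaled by n_multiplier."""
    se = SE_ORIGINAL / math.sqrt(n_multiplier)
    cutoff = 2 * se                       # keeps alpha near 0.05
    under_ha = NormalDist(EFFECT, se)
    beta = under_ha.cdf(cutoff) - under_ha.cdf(-cutoff)
    return 1 - beta

# Smallest multiplier, in steps of 0.1, that reaches the target power.
target = 0.80   # hypothetical target, not from the slides
tenths = 10
while power(tenths / 10) < target:
    tenths += 1
needed_multiplier = tenths / 10
print(needed_multiplier)  # 1.8
```

Under these assumptions the original study has power of roughly 0.56, and about 1.8 times as many subjects per group would be needed to reach 0.80.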

This is the topic of the next session, #4.