Biostatistics in Practice Session 4: Study Size for Precision or Power Peter D. Christenson...

Post on 31-Dec-2015


Biostatistics in Practice

Session 4: Study Size for Precision or Power

Peter D. Christenson, Biostatistician

http://research.LABioMed.org/Biostat

Session 4 Issue

How many subjects?

Session 4 Preparation

We have been using a recent study on hyperactivity in children under diets with various amounts of food additives for the concepts in this course. The questions below based on this paper are intended to prepare you for session 4, which is on determining the size of a study.

1. How many children were deemed necessary to complete the entire study? Use the second column on the 4th page of the paper.

Session 4 Preparation #1

Session 4 Preparation #2

2. The authors allowed for some children to start but not complete the study. What percentage of "dropouts" did they build into their calculations?

The statistical requirements are for 80 “evaluable” subjects. They decided on a study size of 120, so they were allowing up to 40/120 = 33% of subjects to not complete.
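The dropout arithmetic above can be sketched in a few lines. This is only an illustration: the function names `dropout_allowance` and `enrollment_needed` are made up here, and exact fractions are used to avoid rounding surprises.

```python
from fractions import Fraction
from math import ceil


def dropout_allowance(enrolled, evaluable):
    """Fraction of enrolled subjects who may drop out while leaving `evaluable` completers."""
    return Fraction(enrolled - evaluable, enrolled)


def enrollment_needed(evaluable, dropout_fraction):
    """Smallest enrollment that still yields `evaluable` completers at the given dropout fraction."""
    return ceil(evaluable / (1 - dropout_fraction))


# Numbers from the hyperactivity study: 80 evaluable subjects, 120 enrolled.
allowance = dropout_allowance(enrolled=120, evaluable=80)
print(f"Allowed dropout: {float(allowance):.0%}")
print("Enrollment needed for 80 evaluable:", enrollment_needed(80, allowance))
```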

Session 4 Preparation #3

3. The authors will perform a test similar to the t-test we discussed last week, to conclude whether there is evidence that hyperactivity differs between Mix A and placebo. There are two mistakes that they may make in this decision. What are they?

I. Conclude Mix A ≠ Placebo, but Mix A = Placebo

II. Conclude Mix A = Placebo, but Mix A ≠ Placebo

Session 4 Preparation #4 and #5

4. How large a difference between Mix A and placebo do they want to detect?

5. Does the value of 0.32 in the study size description (second column on the 4th page) refer to a difference? They seem to imply it is a SD. Based on what we have said about tests comparing "signal" to "noise", do you think both a difference and SD are relevant for determining the study size?

Session 4 Preparation #4 and #5

They want to detect a difference Δ of 0.32 in GHA. [Smallest clinically relevant Δ?]

Both the Δ and SD need to be accounted for.

Effect size = Δ / SD = “# of SDs”.

Remember, reference range = 4 to 6 SDs.

For this study (unusually), GHA is scaled to have an SD of 1, so Δ = effect size = 0.32.

Session 4 Goals

Review estimating and testing

Δ, SD and N in estimating and testing

False positive and false negative conclusions from tests

What is needed to determine study size

Software for study size

Review Estimation

Typically:

1. Have sample of N representing “all”.

2. Find mean and SD from the N units.

3. Expect new unit to be within mean ± 2SD.

4. Confident (95%) that mean of all is in mean ± 2SD/√N.

May have this info for one or multiple groups.
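The four estimation steps above can be sketched as follows. The data values are hypothetical and the function name `summarize` is made up for this illustration; the factor 2 is the slide's rounding of the usual 1.96.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(sample):
    """Mean, SD, range expected for a new unit, and approximate 95% CI for the mean of all."""
    n = len(sample)
    m = mean(sample)
    sd = stdev(sample)  # sample SD (divides by n - 1)
    margin = 2 * sd / sqrt(n)
    return {
        "n": n,
        "mean": m,
        "sd": sd,
        "new_unit_range": (m - 2 * sd, m + 2 * sd),
        "ci_for_mean": (m - margin, m + margin),
    }


# Hypothetical measurements on N = 9 units.
sample = [98, 104, 91, 110, 102, 95, 99, 107, 101]
result = summarize(sample)
print(result)
```

The interval for the mean is narrower than the range for a new unit by a factor of √N, which is why larger studies estimate the mean more precisely.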

Study Size to Achieve Precision

Precision refers to how well a measure is estimated.

Margin of error = the ± value (half-width) of the 95% confidence interval.

Lower margin of error ↔ greater precision.

To achieve a specified margin of error, say d, solve the CI formula for N:

For a mean, d = 2SD/√N, so N = (2SD/d)².

For a proportion p, d = 2[p(1-p)/N]^(1/2) ≤ 1/√N.

Most polls use N ≈ 1000, so margin of error on % ≈ 3%
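A minimal sketch of the two precision formulas above. The function names and the example values SD = 10 and d = 2 are assumptions for illustration; only the poll size N = 1000 comes from the slide.

```python
from math import ceil, sqrt


def n_for_mean(sd, d):
    """Study size so that the 95% margin of error for a mean is about d: N = (2SD/d)^2."""
    return ceil((2 * sd / d) ** 2)


def margin_for_proportion(p, n):
    """Approximate 95% margin of error for a proportion p estimated from n subjects."""
    return 2 * sqrt(p * (1 - p) / n)


# Hypothetical: SD = 10, and we want the mean estimated to within +/- 2.
print("N for mean:", n_for_mean(sd=10, d=2))

# Poll with N = 1000; p = 0.5 gives the widest margin, which equals 1/sqrt(N).
poll_margin = margin_for_proportion(p=0.5, n=1000)
print(f"Poll margin of error: {poll_margin:.1%}")
```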

Review Statistical Tests

1. Calculate a standardized quantity for the particular test, a “test statistic”:

• Often: t = (Mean – Expected) / SE(Mean)

If 1 group, Mean may be a change score.

If 2 groups, Mean may be the difference between means for two groups.

Expected = 0 if no effect.

Looking for evidence to contradict “no effect”.

Review Statistical Tests

2. Compare the test statistic to the range of values it should be if expectations are correct.

Often: The range has an approximately normal bell curve.

3. Declare “effect” if test statistic is too extreme, relative to this range.

Often: |test statistic| >~2 → Declare effect.
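The three steps can be sketched for the one-group case, where Mean is a change score and Expected = 0. The change scores below are hypothetical, the function name `t_statistic` is made up, and the cutoff of 2 is the slide's rule of thumb rather than an exact t-table value.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(values, expected=0.0):
    """t = (Mean - Expected) / SE(Mean), with SE(Mean) = SD / sqrt(N)."""
    se = stdev(values) / sqrt(len(values))
    return (mean(values) - expected) / se


# Hypothetical change scores for N = 8 subjects.
changes = [1.2, 0.4, -0.3, 0.9, 1.5, 0.2, 0.8, 1.1]
t = t_statistic(changes)

# Rule of thumb from the slide: declare an effect if |t| is greater than about 2.
declare_effect = abs(t) > 2
print(f"t = {t:.2f}, declare effect: {declare_effect}")
```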

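t-Test

Declare effect if test statistic is “too extreme”.

How extreme?

Convention: “Too extreme” means < 5% chance of wrongly declaring an effect.

t = (mean – expected) / (SD/√N)

[Figure: bell curve of the values of t to expect if there is no effect. The central region (95% chance) is labeled “Declare: No Effect”; each tail (2.5%) is labeled “Declare: Effect”.]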

t-Test

But what are the chances of wrongly declaring no effect?

To answer, we need a similar curve for the range of values expected when there is an effect.

Two Possible Errors from t-test

Consider just one possible real effect, the value 3.

[Figure: two bell curves plotted against Δ = effect (difference between group means), one labeled “No Effect” and centered at no real effect (0), the other labeled “Real Effect” and centered at real effect = 3. The axis is just Δ, not t = Δ/SE(Δ). Marked on the axis: effect in study = 1.13, and the region labeled “Conclude effect”.]

\\\ = Probability: Conclude Effect, But no Real Effect (5%).

/// = Probability: Conclude No Effect, But Real Effect (41%).

Graphical Representation of t-test

[Figure: the same “No Effect” and “Real Effect” curves against Δ = effect (difference between group means), with shaded areas of 5% and 41%.]

Suppose we need stronger proof; i.e., shift cutoff to right.

Then, chance of false positive is reduced to ~1%, but false negative is increased to ~60%.
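The tradeoff can be sketched numerically with a normal approximation. The slides do not give SE(Δ), so the value 1.37 below is an assumption, chosen so that the false negative chance is about 41% at the 5% cutoff; the numbers read off the figure are approximate, so this sketch is not expected to match them exactly.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal curve


def false_negative_rate(real_effect, se, alpha):
    """Chance of concluding 'no effect' when the real effect is `real_effect` (two-sided test)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = real_effect / se  # real effect in SE units
    power = (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)
    return 1 - power


REAL_EFFECT = 3.0
ASSUMED_SE = 1.37  # assumption, not stated in the slides

fn_at_5pct = false_negative_rate(REAL_EFFECT, ASSUMED_SE, alpha=0.05)
fn_at_1pct = false_negative_rate(REAL_EFFECT, ASSUMED_SE, alpha=0.01)
print(f"False positive 5%: false negative {fn_at_5pct:.0%}")
print(f"False positive 1%: false negative {fn_at_1pct:.0%}")
```

Under these assumptions, demanding stronger proof lowers the false positive chance but raises the false negative chance, which is the direction shown on the slide.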

Power of a Study

Statistical power is the sensitivity of a study to detect real effects, if they exist.

It is 100-41=59% two slides back.

Two Possible Errors in a Diagnostic Test

                         Truth: No Disease    Truth: Disease
Diagnosis: No Disease    Correct              Error
Diagnosis: Disease       Error                Correct

Specificity (correct when the truth is no disease): need high in follow-up test.

Sensitivity (correct when the truth is disease): want high for a screening test.

Specificity ↓ as Sensitivity ↑
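As a small illustration of the two quantities, here is a sketch that computes them from a 2×2 table of counts. The counts are hypothetical and the function names are made up for this example.

```python
def sensitivity(true_positives, false_negatives):
    """Among those who truly have the disease, the fraction diagnosed with it."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Among those who truly do not have the disease, the fraction diagnosed without it."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts: 100 people with the disease, 200 without.
sens = sensitivity(true_positives=90, false_negatives=10)
spec = specificity(true_negatives=160, false_positives=40)
print(f"Sensitivity = {sens:.0%}, Specificity = {spec:.0%}")
```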

Analogy with Diagnostic Testing

                           Truth: No Effect    Truth: Effect
Study Claims: No Effect    Correct             Error (Type II)
Study Claims: Effect       Error (Type I)      Correct

Specificity: Set α=0.05, so specificity=95%.

Sensitivity = Power: Maximize. Choose N for 80%.

These are the typical choices.

Summary: Factors Related to Study Size

Five factors are inter-related. Fixing four of these specifies the fifth:

1. Study size, N.

2. Power (often 80% is desirable).

3. p-value cutoff (level of significance, e.g., 0.05).

4. Magnitude of the effect to be detected (Δ).

5. Heterogeneity among subjects (SD).

The next slide shows how these factors (except SD) are typically presented in a study protocol.
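To illustrate how fixing four factors specifies the fifth, here is a sketch that returns power given N, the cutoff, Δ and SD for a 1-sample or paired comparison. It uses a normal approximation rather than the exact t-based calculation that study size software performs, and the function name `power_paired` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal curve


def power_paired(n, delta, sd, alpha=0.05):
    """Approximate power of a two-sided 1-sample or paired test (normal approximation)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = delta / (sd / sqrt(n))  # true effect in units of SE(mean)
    return (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)


# Hyperactivity study values: delta = 0.32, SD = 1, p-value cutoff 0.05.
for n in (40, 80, 120):
    print(n, round(power_paired(n, delta=0.32, sd=1.0), 2))
```

With 80 evaluable subjects the approximate power comes out near the 80% target; power rises with N and Δ, and falls as SD grows or the p-value cutoff is tightened.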

Quote from Local Protocol Example

The following table presents detectable differences, with p=0.05 and 80% power, for different study sizes.

Total Number    Detectable Difference in         Detectable Difference in Change
of Subjects     Change in Mean MAP (mm Hg)(1)    in Mean Number of Vasopressors(2)
 20             10.9                             0.77
 40              7.4                             0.49
 60              6.0                             0.39
 80              5.2                             0.34
100              4.6                             0.30
120              4.2                             0.27

Thus, with a total of the planned 80 subjects, we are 80% sure to detect (p<0.05) group differences if treatments actually differ by at least 5.2 mm Hg in MAP change, or by a mean 0.34 change in number of vasopressors.

Comments on the Previous Table

• Typically power=80% and almost always p<0.05.

• SD was not mentioned. There may be several estimates from other studies (different populations, intervention characteristics such as dosage, time, etc). Here, a pilot study exactly like the trial was performed by the same investigators.

• Detectable difference refers to the unknown true difference for “all”, not the difference that will be seen eventually in the N study subjects.

• N ↑ as detectable difference ↓.

• So, the major consideration is usually a tradeoff between N and the detectable difference.

Free Study Size Software

www.stat.uiowa.edu/~rlenth/Power

Local Protocol Example: Calculations

Pilot data: SD=8.16 for ΔMAP in 36 subjects.

For p-value<0.05, power=80%, N=40/group, the detectable Δ of 5.2 in the previous table is found as:
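The calculation shown on the original slide is not in this transcript. As a stand-in, here is a rough sketch using a normal approximation for two equal groups; it is not the software's exact method, and the function name `detectable_difference` is made up. Exact calculations based on the t distribution give slightly larger values, so the results fall a little below the MAP column of the table, with the gap largest for the smallest study sizes.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal curve


def detectable_difference(total_n, sd, alpha=0.05, power=0.80):
    """Approximate detectable difference for a two-sample comparison with equal group sizes."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    n_per_group = total_n / 2
    return z_sum * sd * sqrt(2 / n_per_group)


PILOT_SD = 8.16  # SD of the change in MAP from the pilot data

for total_n in (20, 40, 60, 80, 100, 120):
    print(total_n, round(detectable_difference(total_n, PILOT_SD), 1))
```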

Hyperactivity Study Size

Study is 1-sample or paired (for each age group).

SD = 1, Δ = 0.32.

Use p-value<0.05. Want power=80%.

Solve for N in software to get N=79.

Study Size for Some Other Study Types

1. Phase I: Dose escalation. Safety, not efficacy. No power. Use N=3 low dose; if safe N=3 in higher dose, etc.

2. Phase II: Small, primarily safety; look for enough evidence of efficacy to go on to Phase III. Often staged: e.g., if 3/10 respond, test 10 more, etc.

3. Mortality studies: Patterns of deaths over time can be used in sample size calculations. Software not in the online package.

Approximate Formulas for Study Size

1. Two-sample t-test:

Total N ≈ 4 × 7.85 × (SD/Δ)²

MAP Example: 4 × 7.85 × (8.16/5.2)² = 77 ≈ 80

2. Paired t-test:

N ≈ 7.85 × (SD/Δ)²

Hyperactivity Example: 7.85 × (1/0.32)² = 77 ≈ 80
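The constant 7.85 in these formulas is what a normal approximation gives for a two-sided p-value cutoff of 0.05 and 80% power: (1.96 + 0.84)² ≈ 7.85. A short sketch under that assumption, with made-up function names, reproduces both examples.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal curve


def multiplier(alpha=0.05, power=0.80):
    """(z for the two-sided cutoff + z for the power)^2; about 7.85 for 0.05 and 80%."""
    return (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) ** 2


def total_n_two_sample(sd, delta, alpha=0.05, power=0.80):
    """Approximate total N (both groups combined) for a two-sample t-test."""
    return 4 * multiplier(alpha, power) * (sd / delta) ** 2


def n_paired(sd, delta, alpha=0.05, power=0.80):
    """Approximate N for a paired or 1-sample t-test."""
    return multiplier(alpha, power) * (sd / delta) ** 2


print("Multiplier:", round(multiplier(), 2))
print("MAP example, total N:", round(total_n_two_sample(sd=8.16, delta=5.2)))
print("Hyperactivity example, N:", round(n_paired(sd=1.0, delta=0.32)))
```

Other choices of cutoff or power change only the multiplier; for instance, asking for 90% power makes it larger, so N grows.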

Summary: Study Size and Power

1. Power analysis assures that effects of a specified magnitude can be detected.

2. Five factors including power are inter-related. Fixing four of these specifies the fifth.

3. For comparing means, need pilot or data from other studies to estimate SD for the outcome measure. Comparing %s does not require SD.

4. A power analysis helps support the believability of a study if its conclusions turn out to be negative.