Power and Sample Size (At study design stage before doing the study):

“How large a sample size do I need to have a good chance of statistically finding a difference if a difference (or effect) truly exists?”

Robert Boudreau, PhD

Co-Director of Methodology Core

PITT-Multidisciplinary Clinical Research Center for Rheumatic and Musculoskeletal Diseases

PHARYNX

• A Clinical Trial in the Treatment of Carcinoma of the Oropharynx

• SIZE: 195 observations

SEX Frequency Percent

Male 149 76.4

Female 46 23.6

Standard treatment: Radiation therapy alone (n=100)

Test treatment: Radiation + Chemotherapy (n=95)

Post Treatment: 1 Yr Mortality Signif Diffs By Gender (?)

% died < 1 yr | Standard      | Test           | P-value
Men (n=146)   | 42.1% (32/76) | 45.7% (32/70)  | 0.66
Women (n=46)  | 21.7% (5/23)  | 52.2% (12/23)  | 0.03

Frequency Missing = 3 (censored before 1 yr)

• Large difference in women detected (even with smaller N)

Is Stage of Cancer a Factor ?

T_STAGE
• 1=primary tumor measuring 2 cm or less in largest diameter
• 2=primary tumor measuring 2 cm to 4 cm in largest diameter with minimal infiltration in depth
• 3=primary tumor measuring more than 4 cm
• 4=massive invasive tumor

N_STAGE (see Cooper et al., NEJM: Stage 2+ => high mortality)
• 0=no clinical evidence of node metastases
• 1=single positive node 3 cm or less in diameter, not fixed
• 2=single positive node more than 3 cm in diameter, not fixed
• 3=multiple positive nodes or fixed positive nodes

Is Stage of Cancer a Factor ?

Cooper JS, et al. Postoperative Concurrent Radiotherapy and Chemotherapy for High-Risk Squamous-Cell Carcinoma of the Head and Neck. NEJM 350(19):1937-1944. May 6, 2004

• “Patients who have two or more regional lymph nodes involved, extracapsular spread of disease, or microscopically involved mucosal margins of resection have particularly high rates of local recurrence (27 to 61 percent) and distant metastases (18 to 21 percent) and a high risk of death (five-year survival rate, 27 to 34 percent).”

Males: Tumor Stage by Metastasized Nodes

-------------------------------- SEX=Male -----------------------

The FREQ Procedure

Table of T_STAGE by N_STAGE (cell entries are frequencies)

T_STAGE | N_STAGE=0 | N_STAGE=1 | N_STAGE=2 | N_STAGE=3 | Total
1       |     0     |     0     |     3     |     5     |    8
2       |     0     |     0     |     9     |    10     |   19
3       |    17     |    11     |    11     |    29     |   68
4       |    13     |     9     |     2     |    30     |   54
Total   |    30     |    20     |    25     |    74     |  149

Males: 1 Year Mortality(Among those with none or 1 small node)

Table of TX by died < 1 yr (frequency, with row percent in parentheses)

TX       | 0 (no)       | 1 (yes)      | Total
Standard | 20 (68.97%)  |  9 (31.03%)  | 29
Test     | 10 (50.00%)  | 10 (50.00%)  | 20
Total    | 30           | 19           | 49

Frequency Missing = 1

Statistics for Table of TX by died_lt_1yr

Statistic  | DF | Value  | Prob
Chi-Square | 1  | 1.7934 | 0.1805

• Not quite Statistically Significant

Males: 1 Year Mortality (Among those with none or 1 small node)

WHAT IF: Exact same rates, but 5 times as many in study (n=245 vs 49)

Table of TX by died < 1 yr (frequency, with row percent in parentheses)

TX       | 0 (no)        | 1 (yes)      | Total
Standard | 100 (68.97%)  | 45 (31.03%)  | 145
Test     |  50 (50.00%)  | 50 (50.00%)  | 100
Total    | 150           | 95           | 245

Frequency Missing = 5

Statistics for Table of TX by died_lt_1yr

Statistic  | DF | Value  | Prob
Chi-Square | 1  | 8.9670 | 0.0027
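The two chi-square results can be checked by hand. The following is a minimal Python sketch (not part of the original SAS output) that applies the standard Pearson chi-square shortcut for a 2x2 table, without continuity correction, to the original counts and to the counts multiplied by 5. The function name `pearson_chi2_2x2` is a hypothetical helper introduced only for this illustration.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P[chi-square > x] = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: Standard, Test. Columns: survived 1 yr, died < 1 yr.
chi2_small, p_small = pearson_chi2_2x2(20, 9, 10, 10)      # n = 49
chi2_large, p_large = pearson_chi2_2x2(100, 45, 50, 50)    # n = 245

print(f"n=49:  chi-square={chi2_small:.4f}, p={p_small:.4f}")
print(f"n=245: chi-square={chi2_large:.4f}, p={p_large:.4f}")
```

With identical row percentages, the chi-square statistic scales in proportion to the sample size, which is why the same observed rates move from non-significant to significant.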

Sampling Variability, Power and Sample Size

Standard Treatment

Case 1: n=29 (original study sample size)

p1= sample estimate of prob of death < 1 yr

= 9/29 = 0.3103

Stderr(p1) = sqrt ( p1*(1-p1) / n1 )

= sqrt ( 0.3103*0.6897/29) = 0.0859 (8.6%)

Case 2: n=145 (if 5 times larger sample size)

p1* = 45/145= 0.3103

Stderr(p1*) = sqrt(0.3103*0.6897/ 145)

= 0.0384 (3.8%)

= Stderr(p1) / sqrt(5) = Stderr(p1) / 2.236
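A short Python sketch of the same arithmetic, assuming only the counts given above (9/29 and 45/145). The helper name `stderr_proportion` is hypothetical. It shows that a 5-fold larger sample shrinks the standard error by a factor of sqrt(5).

```python
import math

def stderr_proportion(events, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    p = events / n
    return math.sqrt(p * (1 - p) / n)

se_case1 = stderr_proportion(9, 29)      # original study size
se_case2 = stderr_proportion(45, 145)    # 5 times larger, same rate

print(f"Case 1 stderr: {se_case1:.4f}")
print(f"Case 2 stderr: {se_case2:.4f}")
print(f"Ratio: {se_case1 / se_case2:.3f}")
```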

Sampling Variability, Power and Sample Size (cont’d)

Columns n1, p1, Stderr(p1) refer to Standard; n2, p2, Stderr(p2) refer to Test; the remaining columns refer to the Difference.

n1  | p1     | Stderr(p1) | n2  | p2   | Stderr(p2) | p2-p1  | Stderr(p2-p1) | Z (ratio)
29  | 0.3103 | 0.0859     | 20  | 0.50 | 0.1118     | 0.1897 | 0.1410        | 1.35
145 | 0.3103 | 0.0384     | 100 | 0.50 | 0.0500     | 0.1897 | 0.0631        | 3.01

In both cases:
• The null hypothesis is H0: True Diff=0
• P[ Type I error ] = P[ Reject H0 when H0 true ] = 0.05 (the “level of significance”, or “alpha-level”)

Case #1: Observed diff. explainable “by chance” (Z=1.35, p=0.179)

Case #2: Observed diff. not explainable “by chance” (Z=3.01, p=0.0026)

[Figure: distribution of possible observed p2-p1 for the two sample sizes (total n=49 and total n=245) under the hypothetical condition that the mortality rates are really the same. For the two-sided hypothesis test (2 treatments equal vs not equal), each shaded tail has area (P-value)/2: n=49, p=0.179; n=245, p=0.0026.]
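The Z ratios and two-sided p-values can be recomputed from the raw counts. The sketch below is illustrative only: it uses the unpooled standard error sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2) and no intermediate rounding, so small differences from hand-rounded values are to be expected. The function name `two_proportion_z` is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z(events1, n1, events2, n2):
    """Z ratio and two-sided p-value for a difference of two proportions (unpooled stderr)."""
    p1, p2 = events1 / n1, events2 / n2
    stderr_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p2 - p1) / stderr_diff
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z_small, p_small = two_proportion_z(9, 29, 10, 20)       # total n = 49
z_large, p_large = two_proportion_z(45, 145, 50, 100)    # total n = 245

print(f"n=49:  Z={z_small:.2f}, p={p_small:.4f}")
print(f"n=245: Z={z_large:.2f}, p={p_large:.4f}")
```

The second Z is exactly sqrt(5) times the first, because only the standard error changes when the rates are held fixed.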

Sampling Variability, Power and Sample Size (cont’d)

• Null Hypothesis: 1 yr mortality rates are same
• Alternate Hypothesis: 1 yr mortality rates differ by treatment

Natural Question: Is there actually a difference, but the small sample size study didn’t find it ?

• Type II Error: Accept null hypothesis when alternate hypothesis is true

• Prob[Type II Error] = β

• Power = Prob[ Reject Ho when alternate true] = 1 - β

Making Decisions Using Statistical Tests: Type I & Type II Errors

Q: “Is there actually a difference in 1 yr mortality rates, but the small sample size in the study didn’t find it ?”

The question is asking about the two cells highlighted in blue.

True State of Nature (actual relationship of 1 yr mortality rates between treatments)

Decision based on statistical test | Null True (1 yr mortality rates actually the same) | Alternate True (1 yr mortality rates actually differ)
Accept Null (1 yr mortality rates not signif diff) | Correct decision made by statistical test | Type II error. Prob: β = ? (depends on how different)
Reject Null (1 yr mortality rates are signif diff) | Type I error. Prob: α = 0.05 (preset level) | Correct decision made by statistical test. Prob: Power = 1 - β

Power & Sample Size

Cooper et al., NEJM:

• “On the basis of the previous trials of the RTOG, patients treated with postoperative radiation were expected to have a two-year rate of local or regional recurrence of 38 percent. The study required the randomization of 398 eligible patients to have the statistical power to detect an absolute improvement of 15 percent in this rate with the use of a two-sided test with 0.80 statistical power and a significance level of 0.05.”

Power & Sample Size Calculations

• Power & sample size calculations are typically made using estimated rates from prior or related studies

(1) A scientifically meaningful improvement, change, difference, odds-ratio (OR) or hazard-ratio (HR) is set, then a required sample size to achieve 80% power is computed.

(2) The budget may dictate the maximum available “N”. => Power is then calculated based on fixed “N” for a range of differences, ORs or HRs. Prior studies are used to estimate means, stdevs, rates, ORs … etc.

A. Power with sample size (N) fixed

Two Year Rate of Local or Regional Recurrence

Absolute Improvement | Radiation | Radiation + Chemo | Power with n=150 per group *
0                    | 0.38      | 0.38              | 0.050
0.05                 | 0.38      | 0.33              | 0.147
0.10                 | 0.38      | 0.28              | 0.453
0.15                 | 0.38      | 0.23              | 0.809

Power = Prob[ finding signif difference if recurrence rates differ by tabulated amounts]

* Using two-sample independent chi-square test
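The tabulated powers can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the SAS implementation: it uses the common normal approximation in which the critical value is based on the pooled proportion under the null hypothesis and the spread under the alternative uses the two separate proportions. The function name `power_two_proportions` is hypothetical.

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Approximate two-sided power for comparing two independent proportions."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    n_total = n1 + n2
    w1, w2 = n1 / n_total, n2 / n_total
    p_bar = w1 * p1 + w2 * p2                        # pooled proportion under H0
    sd_null = math.sqrt(p_bar * (1 - p_bar))
    sd_alt = math.sqrt(w2 * p1 * (1 - p1) + w1 * p2 * (1 - p2))
    shift = abs(p2 - p1) * math.sqrt(n_total * w1 * w2)
    upper_tail = norm.cdf((shift - z_crit * sd_null) / sd_alt)
    lower_tail = norm.cdf((-shift - z_crit * sd_null) / sd_alt)
    return upper_tail + lower_tail

# Radiation rate fixed at 0.38; n = 150 per group.
powers = {p2: power_two_proportions(0.38, p2, 150, 150) for p2 in (0.38, 0.33, 0.28, 0.23)}
for p2, pw in powers.items():
    print(f"Radiation + Chemo rate {p2:.2f}: power = {pw:.3f}")
```

When the two rates are equal the "power" is just the Type I error rate, 0.05, which is a useful sanity check on any power routine.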

A. Power with sample size (N) fixed

• Null Hypothesis: 1 yr mortality rates are same
• Alternate Hypothesis: 1 yr mortality rates differ by treatment

Test statistic: Z = (p1 – p2) / Stderr(p1-p2)

Stderr(p1-p2) = sqrt( var(p1-p2) ) = sqrt( p1*(1-p1)/n + p2*(1-p2)/n )

Z is approximately Normal (for any p1, p2), with mean (p1-p2)/stderr (=0 if no difference) and SD=1 (aka “standardized”)

A. Power with sample size fixed (n=150 each group)

[Figure: distribution of Z under the null hypothesis (red) and under Alt #3, with the two-sided rejection regions marked in both tails.]

Alt #3: recurrence rates p1=0.38 (radiation), p2=0.23 (rad + chemo). Power=0.809 = Prob [ in rejection region ]


Z = (p1 – p2) / stderr

Under Alt #3, distribution of Z has mean:

(0.38 – 0.23) / 0.0525 = 0.15 / 0.0525 = 2.86

→ Reject H0 if |Z| > 1.96

→ About 80.9% of the area under the Alt #3 distribution lies in the rejection region
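The mean of Z under Alt #3 can be recomputed directly. The sketch below assumes Z has SD=1 around that mean, as stated earlier in the slides. Under that simplified picture the area in the rejection region comes out near 0.816, close to but not identical with the 0.809 quoted; the gap presumably arises because the chi-square power approximation pools the two proportions under the null hypothesis, which is an assumption about how the quoted figure was obtained.

```python
import math
from statistics import NormalDist

p1, p2, n = 0.38, 0.23, 150                       # Alt #3, n per group
stderr = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
mean_z = (p1 - p2) / stderr                       # center of Z under Alt #3

norm = NormalDist()
z_crit = norm.inv_cdf(0.975)                      # about 1.96
# Probability that a Normal(mean_z, 1) variable lands in |Z| > z_crit.
area_in_rejection = (1 - norm.cdf(z_crit - mean_z)) + norm.cdf(-z_crit - mean_z)

print(f"stderr = {stderr:.4f}, mean of Z = {mean_z:.2f}")
print(f"area in rejection region = {area_in_rejection:.3f}")
```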


In SAS:

* Compute power with n=150 per group with alternate p2=0.23;
proc power;
  twosamplefreq test=pchi
    groupproportions = (0.38, 0.23)
    npergroup = 150
    power = .;
run;


The POWER Procedure

Pearson Chi-square Test for Two Proportions

Fixed Scenario Elements

Distribution                  Asymptotic normal
Method                        Normal approximation
Group 1 Proportion            0.38
Group 2 Proportion            0.23
Sample Size Per Group         150
Number of Sides               2
Null Proportion Difference    0
Alpha                         0.05

Computed Power

Power
0.809

B: Sample size (N) to achieve 80% power

* How many needed per group for exactly 80% power ?;
proc power;
  twosamplefreq test=pchi
    groupproportions = (0.38, 0.23)
    npergroup = .
    power = 0.8;
run;




Distribution                  Asymptotic normal
Method                        Normal approximation
Group 1 Proportion            0.38
Group 2 Proportion            0.23
Nominal Power                 0.8
Number of Sides               2
Null Proportion Difference    0
Alpha                         0.05

Computed N Per Group

Actual Power    N Per Group
0.801           147


Two Year Rate of Local or Regional Recurrence: N to achieve 80% Power *

Absolute Improvement | Radiation | Radiation + Chemo | N per group | Total N
0                    | 0.38      | 0.38              |             |
0.05                 | 0.38      | 0.33              | 1437        | 2874
0.10                 | 0.38      | 0.28              | 346         | 692
0.15                 | 0.38      | 0.23              | 147         | 294

* 80% Power = Prob[ finding signif difference if recurrence rates differ by tabulated amounts], using two-sample independent chi-square test

Rates of Local and Regional Control

Cooper JS et al. Postoperative Concurrent Radiotherapy and Chemotherapy for High-Risk Squamous-Cell Carcinoma of the Head and Neck. New Eng J Med. 350 (2004) 1937-1944.

Actual Results of the Cooper Study Using the Sample Sizes Based on Their Power Calculations (P = 0.01)


Sample Size: Two-sample Test of Proportions


* How many are needed per group for exactly 80% power ? (implements the formula);
data _null_;
  p1 = 0.38;
  p2 = 0.23;
  p  = (p1 + p2) / 2;
  n  = ( 1.96*sqrt( 2*p*(1-p) ) + 0.84*sqrt( p1*(1-p1) + p2*(1-p2) ) )**2 / (p2-p1)**2;
  put n=;
run;

n=146.5414874
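The same formula can be written in Python. This sketch differs from the data step only in using unrounded normal quantiles (about 1.95996 and 0.84162 instead of 1.96 and 0.84), so the raw value for the 0.38 vs 0.23 case is slightly higher than 146.54 but rounds up to the same 147 per group. The function name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample test of proportions."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)    # about 1.96
    z_beta = norm.inv_cdf(power)             # about 0.84
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p2 - p1) ** 2

for p2 in (0.33, 0.28, 0.23):
    n = n_per_group(0.38, p2)
    print(f"0.38 vs {p2:.2f}: n = {n:.1f} -> {math.ceil(n)} per group, {2 * math.ceil(n)} total")
```

Because n grows with 1/(p2-p1)**2, halving the detectable improvement roughly quadruples the required sample size.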

BARI 10-Year Survival Stratified by Diabetes Status

[Figure: survival (0% to 100%) versus years (0 to 10) for four groups. 10-year survival by group:]

• Diabetes CABG (n=180): 57.1%
• Diabetes PTCA (n=173): 44.1%
• No diabetes CABG (n=734): 78.2%
• No diabetes PTCA (n=742): 76.8%

No Treated Diabetes, CABG vs PTCA: p = 0.50

Treated Diabetes, CABG vs PTCA: p = 0.012

Logistic Regression: Sample size (N) to achieve 80% power

Goal of new study proposal:

Test survival for improved method of PTCA

BARI: Diabetics vs Non-Diabetics

PTCA 10 yrs survival: p1=0.441, p2=0.768

OR= ( p2/(1-p2) ) / (p1/(1-p1)) = 3.31 / 0.79 = 4.20
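A small Python sketch of the odds-ratio arithmetic, using only the two survival proportions above. It also converts an odds ratio back into the proportion it implies for the comparison group, which is the conversion a power routine must make when given an odds ratio and a reference proportion. The function names are hypothetical.

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

def odds_ratio(p_ref, p_other):
    """Odds ratio of p_other relative to the reference proportion p_ref."""
    return odds(p_other) / odds(p_ref)

def proportion_from_or(p_ref, ratio):
    """Proportion implied by applying an odds ratio to a reference proportion."""
    new_odds = odds(p_ref) * ratio
    return new_odds / (1 + new_odds)

bari_or = odds_ratio(0.441, 0.768)            # diabetics vs non-diabetics, PTCA
p2_for_or_1_8 = proportion_from_or(0.441, 1.8)

print(f"odds: {odds(0.441):.3f} (diabetic), {odds(0.768):.3f} (non-diabetic)")
print(f"OR = {bari_or:.2f}")
print(f"OR = 1.8 applied to 0.441 gives p2 = {p2_for_or_1_8:.3f}")
```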

Approx 20% of eligible patients are diabetic

(in general population)


* To detect OR=1.8 with 80% power;
* 20% diabetics (e.g. like cohort study);
proc power;
  twosamplefreq test=pchi
    oddsratio = 1.8
    refproportion = 0.441
    groupweights = (1 4)
    ntotal = .
    power = 0.80;
run;
* Note: Could assume higher than 0.441 for diabetics if new method does better;




Distribution                      Asymptotic normal
Method                            Normal approximation
Reference (Group 1) Proportion    0.441
Odds Ratio                        1.8
Group 1 Weight                    1
Group 2 Weight                    4
Nominal Power                     0.8
Number of Sides                   2
Null Odds Ratio                   1
Alpha                             0.05

Computed N Total

Actual Power    N Total
0.801           570


* Detect OR=1.8 with 80% power;
* With equal number of diabetics/non-diabetics recruited into study;
proc power;
  twosamplefreq test=pchi
    oddsratio = 1.8
    refproportion = 0.441
    npergroup = .
    power = 0.80;
run;

N Per Group = 184 ( Total N = 368 )

Note: Total N = 570 when 20% diabetics, 80% non-diabetics.

For a given total N, power is generally lower with unequal group sizes, so the unbalanced design needs more patients overall.
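The effect of allocation on total sample size can be illustrated in Python. This is a sketch under the same normal-approximation assumption as the chi-square power calculation (pooled proportion under the null, separate proportions under the alternative); it is not the SAS code path. The helper names `power_two_proportions` and `smallest_design` are hypothetical.

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Approximate two-sided power for comparing two independent proportions."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    n_total = n1 + n2
    w1, w2 = n1 / n_total, n2 / n_total
    p_bar = w1 * p1 + w2 * p2
    sd_null = math.sqrt(p_bar * (1 - p_bar))
    sd_alt = math.sqrt(w2 * p1 * (1 - p1) + w1 * p2 * (1 - p2))
    shift = abs(p2 - p1) * math.sqrt(n_total * w1 * w2)
    return (norm.cdf((shift - z_crit * sd_null) / sd_alt)
            + norm.cdf((-shift - z_crit * sd_null) / sd_alt))

def smallest_design(p1, p2, ratio, target=0.80):
    """Smallest (n1, n2) with n2 = ratio * n1 whose approximate power reaches target."""
    n1 = 2
    while power_two_proportions(p1, p2, n1, ratio * n1) < target:
        n1 += 1
    return n1, ratio * n1

p_ref = 0.441
ref_odds = p_ref / (1 - p_ref)
p_alt = 1.8 * ref_odds / (1 + 1.8 * ref_odds)    # proportion implied by OR = 1.8

unbalanced = smallest_design(p_ref, p_alt, ratio=4)   # 20% diabetics, 80% non-diabetics
balanced = smallest_design(p_ref, p_alt, ratio=1)     # equal group sizes

print(f"1:4 allocation: n1={unbalanced[0]}, n2={unbalanced[1]}, total={sum(unbalanced)}")
print(f"1:1 allocation: n1={balanced[0]}, n2={balanced[1]}, total={sum(balanced)}")
```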

Comparing Means of 2 Groups: Power and Sample Size

From Women’s Health Initiative Observational Study (WHI-OS)

~ 90,000 women longitudinal cohort study (8yrs and continuing)

Osteoporotic Fractures Ancillary Substudy

Funded case-control study: 1200 cases (fractures), 1200 controls
• 25(OH)2 Vitamin D3 (ng/ml)
• Inflammatory markers (e.g. IL-6)
• Hormones (estradiol), bone mineral density, …


25(OH)2 Vitamin D3 (ng/ml) mean (sd): 25.8 ± 10.7

With n=1200 in each group (fracture=case, no fracture=control)

What is difference in means of Vitamin D3 that can be detected with 80% power ?


proc power;

twosamplemeans test=diff

meandiff=.

stddev=10.7

npergroup=1200

power=0.80;

run;


The POWER Procedure

Two-sample t Test for Mean Difference

Distribution             Normal
Method                   Exact
Standard Deviation       10.7
Sample Size Per Group    1200
Power                    0.8
Number of Sides          2
Null Difference          0
Alpha                    0.05

Computed Mean Diff

Mean Diff
1.22
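The detectable difference can be approximated by hand with the normal-theory relation diff = (Z_alpha/2 + Z_beta) * SD * sqrt(2/n). The Python sketch below assumes that approximation rather than the exact t-based method reported in the SAS output; with n=1200 per group the two agree to two decimals. The function name `detectable_difference` is hypothetical.

```python
import math
from statistics import NormalDist

def detectable_difference(sd, n_per_group, alpha=0.05, power=0.80):
    """Smallest mean difference detectable with the given power (normal approximation)."""
    norm = NormalDist()
    z_sum = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    return z_sum * sd * math.sqrt(2 / n_per_group)

diff_1200 = detectable_difference(sd=10.7, n_per_group=1200)
print(f"Detectable difference with n=1200 per group: {diff_1200:.2f} ng/ml")
```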


Suppose a 1 ng/ml difference is considered scientifically/clinically meaningful

(or) You are designing a study to potentially detect

differences in Vitamin D3 that are this small.

How many are needed in each group to have 80% power to detect a difference of 1 ng/ml ?

25(OH)2 Vitamin D3 (ng/ml) mean (SD): 25.8 ± 10.7

Sample Size Formula for Comparing Means of 2 Groups

Usually: D0 = 0 (i.e. equality of the means)

Sample Size Formula for Comparing Means of 2 Groups

• How many fracture cases and non-fracture controls are needed to have 80% power to detect a difference of 1 ng/ml in Vitamin D3?

We know from a pilot study or other published results that:

25(OH)2 Vitamin D3 (ng/ml): mean (SD): 25.8 ± 10.7 (SD=10.7)

α = 0.05, α/2 = 0.025, Zα/2 = 1.96 (α/2 = 0.025 = area to the right on the normal curve)

Power=0.80 → β = 0.20, Zβ = 0.84 (β = 0.20 = area to the right on the normal curve)

σ ~ 10.7, Zα/2 = 1.96, Zβ = 0.84, Δ = 1

The sample size (approx) required in each group is:

n ~ 2 σ² (Zα/2 + Zβ)² / Δ² = 2 (10.7)² (1.96 + 0.84)² / 1² = 1795.2 → 1796


proc power;
  twosamplemeans test=diff
    meandiff = 1
    stddev = 10.7
    npergroup = .
    power = 0.80;
run;

Computed N Per Group

Actual Power    N Per Group
0.800           1799

(vs 1200 to detect 1.22 diff)
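The hand formula is easy to check in Python. This sketch evaluates n = 2 * SD^2 * (Z_alpha/2 + Z_beta)^2 / diff^2 twice: once with the rounded quantiles 1.96 and 0.84 used on the slide, and once with unrounded normal quantiles. Both fall a little below the 1799 from PROC POWER, which is consistent with the SAS run using the exact t-based method rather than the normal approximation. The function name `n_per_group_means` is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group_means(sd, delta, z_alpha=1.96, z_beta=0.84):
    """Approximate n per group to detect a mean difference delta (normal approximation)."""
    return 2 * sd ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2

norm = NormalDist()
n_rounded = n_per_group_means(10.7, 1.0)
n_unrounded = n_per_group_means(10.7, 1.0,
                                z_alpha=norm.inv_cdf(0.975),
                                z_beta=norm.inv_cdf(0.80))

print(f"Rounded quantiles:   n = {n_rounded:.1f} -> {math.ceil(n_rounded)}")
print(f"Unrounded quantiles: n = {n_unrounded:.1f} -> {math.ceil(n_unrounded)}")
```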

Comparing Means of 2 Groups: Related to Logistic Regression OR

Hosmer & Lemeshow, Applied Logistic Regression

• Relationship between 2-sample t-test

and logistic regression

For continuous predictor (e.g. Vitamin D3):

Let u2-u1 = detectable difference with 80% power

σ = standard deviation

An odds-ratio (OR) per SD ~ exp ( (u2-u1)/ σ )

is detectable with approx. 80% power

OR between 1st & 4th quartile ~ exp (3*(u2-u1)/ σ )

Comparing Means of 2 Groups: Related to Logistic Regression OR

25(OH)2 Vitamin D3 (ng/ml) mean (sd): 25.8 ± 10.7

Actual funded study:With n=1200 in each group (fracture, no fracture)

Diff in means = 1.22 is detectable with 80% power

=> OR per SD = exp(1.22/10.7) = 1.12

=> OR between 1st & 4th quartile ~ exp(3*1.22/10.7) = 1.4

are both detectable with 80% power
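The two conversions are simple exponentials and can be reproduced in Python. The sketch takes the rule of thumb above at face value, including its factor of 3 SD between the 1st and 4th quartiles; the function names are hypothetical.

```python
import math

def or_per_sd(mean_diff, sd):
    """Approximate detectable odds ratio per 1 SD of a continuous predictor."""
    return math.exp(mean_diff / sd)

def or_first_vs_fourth_quartile(mean_diff, sd, sd_gap=3):
    """Approximate detectable odds ratio across quartiles, assumed sd_gap SDs apart."""
    return math.exp(sd_gap * mean_diff / sd)

per_sd = or_per_sd(1.22, 10.7)
quartiles = or_first_vs_fourth_quartile(1.22, 10.7)

print(f"OR per SD: {per_sd:.2f}")
print(f"OR between 1st and 4th quartile: {quartiles:.2f}")
```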

Proc Power Capabilities

– MULTREG < options > ;
– ONECORR < options > ;
– ONESAMPLEFREQ < options > ;
– ONESAMPLEMEANS < options > ;
– ONEWAYANOVA < options > ;
– PAIREDFREQ < options > ;
– PAIREDMEANS < options > ;
– TWOSAMPLEFREQ < options > ;
– TWOSAMPLEMEANS < options > ;
– TWOSAMPLESURVIVAL < options > ;
– PLOT < plot-options > < / graph-options > ;

Thank you !

Any Questions?
