Power and Sample Size: Why Should We Care?
November 17, 2015
Allen Kunselman, MA
Division of Biostatistics and Bioinformatics
Department of Public Health Sciences
Quotes
- "An approximate answer to the right question is worth a good deal more than an exact answer to the wrong question." (John Tukey)
- "All models are wrong, but some are useful." (George Box)
- "Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners." (G.O. Ashley)
Why Worry About Sample Size and Power?
Scientifically Meaningful Effect
- A study must be "big enough" to statistically detect a scientifically (clinically) meaningful effect.
- A study must not be "too big," where an effect of little scientific importance is nevertheless statistically detectable.
Money, Money, Money
- An undersized study wastes resources and money by not having the capacity to produce useful (definitive) results.
- An oversized study uses more resources and money than necessary.
Ethical Issues
- An undersized study exposes subjects to potentially harmful treatments without advancing knowledge.
- An oversized study exposes more people than necessary to a potentially harmful treatment, or denies a potentially beneficial treatment.
Grant reviewers (and IRBs) are looking for sample size and power considerations!
Outline of Experiment
- Statement of hypothesis (affects sample size and power)
- Design of the experiment (affects sample size and power)
- Selection of measurement method (evaluation)
- Recognition of all variables and the effect of their operation
- Selection of sample (of population) to be studied
- Preliminary evaluation of the effect of chance on the experiment
- Execution of the experiment
- Analysis of the data
- Interpretation of the results
- Generalization of the results
Characteristics of a Good Hypothesis
- Simple: contains one predictor and one outcome.
  e.g., a sedentary lifestyle is associated with an increased risk of proteinuria in patients with diabetes
- Specific: leaves no ambiguity.
  e.g., alcohol consumption [measured in mg/day] is associated with an increased risk of proteinuria [>30 mg/dL] in patients with diabetes
- In advance (a priori): stated in writing at the onset of the study.
Types of Hypotheses
- Null Hypothesis (Ho): "nothing happening"
- Alternative Hypothesis (HA): some association; the "working hypothesis"
The null hypothesis is the formal basis for testing statistical significance. Assuming there is no association in the population, statistical tests help to estimate the probability that an observed association is due to chance.
We want to collect evidence against the null hypothesis and in support of the alternative hypothesis.
Types of Hypothesis Testing
Two-sided test
- Example: A serum sample is part of an assay kit sent by the manufacturer, who claims that the concentration of follicle-stimulating hormone (FSH) in the sample is 115 mIU/mL. We wish to test their claim, or (assuming their claim is true) to test the accuracy of our sampling procedure.
- Ho: μ = 115
- HA: μ ≠ 115
- Some clinical trial investigators advocate that all trials should be non-directional (two-sided), because one direction tests efficacy and the other tests harm.
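The two-sided test on this slide can be sketched as a one-sample t-test. The replicate measurements below are hypothetical, invented only to illustrate the mechanics of testing Ho: μ = 115.

```python
from scipy import stats

# Hypothetical replicate FSH measurements (mIU/mL) of the manufacturer's sample
fsh = [112.1, 118.3, 114.7, 116.2, 111.9, 117.5, 113.8, 115.4]

# Two-sided one-sample t-test of Ho: mu = 115 versus HA: mu != 115
t_stat, p_value = stats.ttest_1samp(fsh, popmean=115)

# Reject Ho at the alpha = 0.05 significance level only if p_value < 0.05
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
```

With these invented data the sample mean is very close to 115, so the test would not reject the manufacturer's claim.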
Types of Hypothesis Testing - continued
One-sided test
- Example: We wish to know whether the mean hemoglobin level of children who are exposed to high levels of lead is equal to that of the general population (12.3 g/100 ml). However, we believe that the hemoglobin level of the exposed children must, on average, be lower.
- Ho: μ ≥ 12.3
- HA: μ < 12.3
- "One-siders" advocate that studies have unique backgrounds and goals: if the research hypothesis is directional, then so should be the statistical hypothesis. This gains power ― if you pick the correct direction.
One-sided versus Two-sided Hypotheses
Many research hypotheses are often one-sided; however, the alternative hypothesis used when planning sample size is usually two-sided. Why?
Grant and manuscript reviewers are often critical if the hypothesis is one-sided. More importantly, in the past many well-supported one-sided hypotheses turned out to be wrong in randomized trials.
Therefore, statistical rigor requires the investigator to make the choice between one- and two-sided hypotheses before analyzing the data; switching to a one-sided alternative hypothesis to reduce the p-value (increase power) is cheating.
Bottom Line: Unless you are very confident that the effect is directional, choose a two-sided hypothesis.
P-value
The null hypothesis acts like a house of cards: we want to use our test statistic (whatever it is) as evidence to knock it down. How?
We need our test statistic to incorporate both the point estimate and its spread (interval estimate). We then attach a probability to this test statistic.
Specifically, the p-value is the probability that a test statistic will assume the observed value or something more extreme if Ho is true.
In order to claim any significant result, the p-value is compared to the “pre-determined” significance level α (typically 0.05) to decide whether Ho should be rejected. A “non-significant” result (p-value > α) does not mean no association; it only means that the result observed in the sample is small compared with what could have occurred by chance alone.
Cautions in P-value Interpretation
- The significance level cutoff (0.05 or otherwise) is not a hard-and-fast rule. Note: 0.05 is not a magic number!
- The p-value is not the probability that HA is true, given the data.
- Statistical significance does not always imply clinical significance.
  - Large studies can find statistical significance for very small effect sizes that may be clinically unimportant.
  - Small studies may not find statistical significance, but the effect may be clinically important.
- P-values make no reference to the magnitude of the effect of interest. Summary statistics (point estimates) and confidence intervals should be provided.
Types of Errors
In general, Type I errors are considered more serious. It turns out that as α decreases, β increases, and vice-versa. Therefore, we generally set α at a pre-specified value before the experiment. Then β will depend on the alternative hypothesis, the test statistic, and the sample size.
Errors: Analogy to Jury Decision
Jury Decision              | Statistical Test
Innocent                   | Null Hypothesis (no association)
Guilty                     | Alternative Hypothesis (an association)
Convict an innocent person | Claim an association when there is none (Type I Error)
Acquit a guilty person     | Claim no association when there is one (Type II Error)
Sample Size Approaches
- Specify the desired width of a confidence interval (estimation and precision)
- Bayesian approach to optimize a utility function (e.g., cost)
- Power of a test of hypothesis: this is the most common approach and the one we will focus upon.
Elements for Power Approach to Sample Size Estimation
- Specify Ho and either a one-sided or two-sided HA on a parameter (e.g., μ)
- Effect size: the deviation from the null that is deemed clinically significant
- Variability: inherent uncertainty
- Significance level (α): Type I error
- Power: complement of the Type II error; Power = 1 − β
- Sample size (n): how much evidence will be collected
Effect Size
- Effect size is the size of the deviation from the null that we wish to detect in our sample.
  - should be clinically (scientifically) important
  - most difficult aspect of sample size planning
  - make an informed guess from prior studies (e.g., pilot studies)
- Standardized (or relative) effect size is the effect size divided by the variability (e.g., standard deviation).
  - usually set between 0.1 and 0.5 for most studies
  - e.g., 0.5 means HA is 0.5 standard deviations away from Ho
  - unit free
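The standardized effect size defined above is a one-line calculation. The means and standard deviation below are hypothetical, chosen only to illustrate it.

```python
# Standardized (relative) effect size: raw effect divided by the variability.
# All numbers here are hypothetical planning inputs.
mean_null = 115.0   # mean under Ho
mean_alt = 120.0    # clinically meaningful mean under HA
sd = 10.0           # assumed standard deviation of the outcome

effect_size = mean_alt - mean_null      # raw effect, in outcome units
std_effect_size = effect_size / sd      # standardized effect, unit free

# 0.5 means HA sits half a standard deviation away from Ho
print(std_effect_size)  # -> 0.5
```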
Variability
- The greater the variability (or spread) in the outcome variable, the more likely it is that the values in the groups will overlap, and the more difficult it will be to demonstrate an overall difference between them.
- The investigator may need to obtain an estimate of the variability for the power calculation:
  - experience
  - historical data (literature)
  - pilot studies
- More complicated designs have multiple sources of variation. e.g., repeated measures designs require within-subject and between-subject variability.
Power
- Intuitive definition: the strength of the study.
- Textbook definition: the probability that a future test will reject Ho if HA is true.
- If power is high and the test is non-significant, this implies the effect, if any, is small (higher "negative predictive value").
- Studies with high power give researchers greater confidence that a statistically significant result reflects the truth (higher "positive predictive value").
- Power and effect size are directly related: all other things being equal, larger effect size → more power.
- Power and variability are inversely related: all other things being equal, smaller variability → more power.
- Power and sample size are directly related: all other things being equal, larger sample size → more power.
- Power and significance level (α) are directly related: all other things being equal, larger α → more power. However, increasing α increases the Type I error.
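The directional relationships on this slide can be checked numerically. The sketch below uses a normal-approximation power formula for a two-sided, two-sample comparison of means (it ignores the t-distribution correction); all inputs are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test of means.

    Normal-approximation sketch:
    power ~= Phi( delta / (sd * sqrt(2/n)) - z_{1-alpha/2} ).
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    ncp = delta / (sd * sqrt(2.0 / n_per_group))  # noncentrality parameter
    return norm.cdf(ncp - z_alpha)

# All other things equal (hypothetical numbers):
p_base       = power_two_sample(n_per_group=10, delta=5,  sd=10)  # baseline
p_large_n    = power_two_sample(n_per_group=40, delta=5,  sd=10)  # more subjects
p_big_effect = power_two_sample(n_per_group=10, delta=10, sd=10)  # bigger effect
p_big_sd     = power_two_sample(n_per_group=10, delta=5,  sd=20)  # more variability
```

Evaluating these confirms the slide: power rises with sample size and effect size, and falls as the standard deviation grows.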
Power versus Effect Size (continuous outcome example) [figure]
Power versus Standard Deviation (continuous outcome example) [figure]
Power versus Sample Size (continuous outcome example) [figure]
Steps for Sample Size Estimation
1. State Ho and either a one-sided or two-sided HA
2. Select the statistical test
3. Choose the effect size (and variability)
4. Set α and β
5. Calculate the sample size
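The steps above can be sketched for a two-sample comparison of means using the standard normal-approximation formula n = 2σ²(z₁₋α/₂ + z₁₋β)²/Δ² per group; the effect size and standard deviation below are hypothetical planning inputs, not values from the talk.

```python
from math import ceil, sqrt
from scipy.stats import norm

def n_per_group_two_means(delta, sd, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample comparison of means.

    Normal-approximation sketch: n = 2 * sd^2 * (z_{1-a/2} + z_{power})^2 / delta^2.
    A t-test would require slightly more subjects than this z-based formula gives.
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n = 2 * sd**2 * (z_alpha + z_beta)**2 / delta**2
    return ceil(n)

# Hypothetical inputs: detect a 5-unit difference with sd = 10,
# two-sided alpha = 0.05, 80% power
n = n_per_group_two_means(delta=5, sd=10, alpha=0.05, power=0.80)
print(n)  # -> 63 per group
```

Note that this z-based number is the textbook starting point; a statistician would refine it for the actual test and design.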
Typical Effect Size/Variability Required for Sample Size Estimation by Type of Outcome
Continuous Outcome
- Estimates of the means in each group
- Estimates of variability (e.g., standard deviation)
Categorical Outcome
- Estimates of the proportions in each group
Survival Outcome
- Estimates of the proportion of subjects surviving until time t in each group
Additional Issues to Consider that Affect Sample Size and Power
Hypothesis Goal
- Superiority
- Non-inferiority
- Equivalence
Study Design
- One-sample (e.g., test of means or proportions)
- Two-sample (e.g., test of means, proportions, survival)
- Paired (e.g., test of means or proportions, correlation)
- >2 samples (e.g., ANOVA)
- Longitudinal studies (e.g., repeated measures)
- Crossover studies
- Design effect (e.g., cluster sampling)
- Dose finding
Additional Issues to Consider that Affect Sample Size and Power - continued
Other Issues to Consider
- Multiple endpoints
- Unequal sample size
- Dropouts
- Noncompliance
- Interim analyses
- Multiple comparisons testing
- Adjustment for covariates (e.g., ANCOVA)
- Number of available subjects
- Sufficient power for an important secondary outcome
- Screening for eligibility (especially when invasive tests are needed to determine eligibility)
Dropouts
- Dropouts are not incorporated in sample size estimation formulas, so estimate the dropout rate during the design stage.
- Increase the size of the enrolled sample to account for dropouts.
- Example: to allow for a 20% dropout rate, inflate the sample size estimate by a factor of 1/(1 − 0.20) = 1.25.
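The inflation rule is a one-liner: divide the required completers by the expected completion rate and round up. The n of 32 below is hypothetical.

```python
from math import ceil

def inflate_for_dropout(n, dropout_rate):
    """Enrollment target so that roughly n subjects complete the study.

    Divides by (1 - dropout_rate); e.g., 20% dropout -> a factor of 1.25.
    """
    return ceil(n / (1.0 - dropout_rate))

# Hypothetical example: 32 completers needed, 20% dropout anticipated
enrolled = inflate_for_dropout(n=32, dropout_rate=0.20)
print(enrolled)  # -> 40 (32 / 0.8)
```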
Number of Available Subjects
- Example: The clinic sees approximately 20 patients per year with the condition of interest, of whom 10% may refuse to participate. Therefore, over the 2 years of the study, the sample size will be approximately 20 × 2 × 0.90 = 36 patients.
- Although most studies need to balance feasibility with study power, the sample size should not be determined by the number of available subjects alone.
- Perhaps consider a multicenter trial if the number of available subjects is an issue.
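The feasibility arithmetic from the slide's clinic example is trivial but worth writing down explicitly:

```python
# Feasibility check from the slide: ~20 eligible patients per year,
# 10% expected to refuse, over a 2-year accrual period.
patients_per_year = 20
refusal_rate = 0.10
years = 2

available = round(patients_per_year * years * (1 - refusal_rate))
print(available)  # -> 36
```

If the power calculation demands more than this ceiling, the design (or the number of sites) has to change, not the arithmetic.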
Sample Size Estimates are Approximations!
The assumptions that are made for the sample size calculation may not hold exactly for the study:
- The standard deviation of an outcome variable may have been poorly chosen.
- The proportion of subjects who improve on placebo therapy may have been under-estimated.
- The assumption that the distribution of the outcome of interest is normal may not be true.
Investigator and Statistician Relationship for Sample Size Estimation
Investigator: scientific knowledge of the disease, outcomes, device, etc.
Statistician: technical skills to incorporate investigator’s scientific knowledge into an appropriate sample size for the study design
In short, statisticians are not the enemy! Statisticians are there to contribute and support investigator research in order to obtain valid results.
Bad Example #1 for Grant or Manuscript Write-up
The sample size of 60 diabetic subjects has been chosen to provide adequate numbers of subjects for evaluating the treatments.
Bad Example #2 for Grant or Manuscript Write-up
A sample size of 50 subjects (25 per group) yielded 80% power to compare the placebo group to the active treatment group.
Bad Example #3 for Grant or Manuscript Write-up
A previous study in this area recruited 75 subjects and found a highly significant result (p=0.008), and therefore a similar sample size should be sufficient for our study.
Bad Example #4 for Grant or Manuscript Write-up
A formal power analysis was not done because there is no prior information on which to base the sample size estimate.
Good Example #1 for Grant or Manuscript Write-up
With 16 subjects per treatment group and an assumed within-group standard deviation of 25 mm, the study will have 90% statistical power for a two-sided, 0.05 significance level two-sample t-test to detect a difference of 30 mm in the visual analog scale (VAS) pain severity score over 3 months between a continuous and traditional 21/7 oral contraceptive pill (OCP). However, we anticipate subject dropout as high as 15%; therefore, we will recruit 19 subjects per treatment group.
Good Example #2 for Grant or Manuscript Write-up
With a sample size of 45 subjects per treatment group there is greater than 88% power to detect a difference between the proportion of ventricular tachyarrhythmia (VTA) inducible subjects in the placebo and active treatment groups when the proportion is 0.90 for placebo and 0.60 for active treatment.
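The claim in this example can be sanity-checked with a normal-approximation power calculation for comparing two proportions. This is a sketch of one common textbook formula, not necessarily the method the example's authors used.

```python
from math import sqrt
from scipy.stats import norm

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of proportions.

    Normal-approximation sketch: pooled standard error under Ho,
    unpooled standard error under HA (one common textbook form).
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se0 = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)          # SE under Ho
    se1 = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)  # SE under HA
    return norm.cdf((abs(p1 - p2) - z_alpha * se0) / se1)

# The example's scenario: VTA-inducible proportions of 0.90 (placebo)
# versus 0.60 (active treatment), with 45 subjects per group
power = power_two_proportions(p1=0.90, p2=0.60, n_per_group=45)
```

The result comes out above the quoted 88%, consistent with the write-up's "greater than 88% power."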
Good Example #3 for Grant or Manuscript Write-up
“The primary end point compared among the groups was the change in morning peak expiratory flow (PEF) from randomization to the end of the trial. Using the standard deviation for morning PEF of 36.6 liters per minute noted in a previous study, we calculated that a sample of 216 patients would provide a statistical power of 90 percent to detect the difference widely considered to be of clinical significance, 25 liters per minute, at a significance level of 4.8 percent, allowing a dropout rate of 15 percent. For the secondary end point — change in morning PEF from the first to the second period of intense combined therapy — we used the variability observed in the corticosteroid run-in period of a previous trial. We calculated that if 199 patients completed the study, the study would have a statistical power of 80 percent to detect a difference of 21 liters per minute in this morning PEF during the period of intense combined therapy between any two treatment groups. We further calculated that this sample would provide 80 percent power to detect a change of 13 liters per minute in morning PEF within groups.”
Boushey HA et al. Daily versus as-needed corticosteroids for mild persistent asthma. NEJM 2005; 352:1519-1528.
Good Example #4 for Grant or Manuscript Write-up
“To test the primary hypothesis that the mean time to the next suicide attempt during the follow-up period is different between treatment groups, a priori power calculations were based on the results of a previous randomized controlled trial with a similar protocol. The current sample size (N = 120) provided at least 80% power to detect a hazard ratio of 0.44 in terms of time to next suicide attempt between treatment groups using an assumed repeat attempt rate of 25.8% during the follow-up period and a 2-sided level of .05.”
Brown GK et al. Cognitive therapy for the prevention of suicide attempts: a randomized controlled trial. JAMA 2005; 294:563-570.
Summary
- Carefully consider and state the hypothesis to be tested a priori.
- Select an appropriate statistical test (e.g., two-sample t-test).
- Choose a clinically meaningful effect size, and provide an estimate of the variability (previous studies or literature).
- Select the Type I error (α) and Type II error (β) you are willing to accept.
- Calculate a sample size.
- Adjust the sample size estimate for dropout if necessary.
- Consult and collaborate with a statistician.
References
Lenth RV. Some practical guidelines for effective sample size determination. The American Statistician 2001; 55:187-193.
Florey C. Sample size for beginners. British Medical Journal 1993; 306:1181-1184.