Power and Sample Size: Why Should We Care?
November 17, 2015
Allen Kunselman, MA
Division of Biostatistics and Bioinformatics
Department of Public Health Sciences
Quotes
- "An approximate answer to the right question is worth a good deal more than an exact answer to the wrong question." (John Tukey)
- "All models are wrong, but some are useful." (George Box)
- "Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners." (G.O. Ashley)
Why Worry About Sample Size and Power?
Scientifically Meaningful Effect
- A study must be "big enough" to statistically detect a scientifically (clinically) meaningful effect.
- A study must not be "too big," where an effect of little scientific importance is nevertheless statistically detectable.
Money, Money, Money
- An undersized study wastes resources and money by not having the capacity to produce useful (definitive) results.
- An oversized study uses more resources and money than necessary.
Ethical Issues
- An undersized study exposes subjects to potentially harmful treatments without advancing knowledge.
- An oversized study exposes more people than necessary to a potentially harmful treatment, or denies a potentially beneficial treatment.
Grant reviewers (and IRBs) are looking for sample size and power considerations!
Outline of Experiment
- Statement of hypothesis (affects sample size and power)
- Design of the experiment (affects sample size and power)
- Selection of measurement method (evaluation)
- Recognition of all variables and the effect of their operation
- Selection of sample (of population) to be studied
- Preliminary evaluation of the effect of chance on the experiment
- Execution of the experiment
- Analysis of the data
- Interpretation of the results
- Generalization of the results
Characteristics of a Good Hypothesis
- Simple: contains one predictor and one outcome.
  e.g., a sedentary lifestyle is associated with an increased risk of proteinuria in patients with diabetes
- Specific: leaves no ambiguity.
  e.g., alcohol consumption [measured in mg/day] is associated with an increased risk of proteinuria [>30 mg/dL] in patients with diabetes
- In advance (a priori): stated in writing at the onset of the study.
Types of Hypotheses
- Null Hypothesis (Ho): "nothing happening"
- Alternative Hypothesis (HA): some association; the "working hypothesis"
The null hypothesis is the formal basis for testing statistical significance. Assuming there is no association in the population, statistical tests help to estimate the probability that an observed association is due to chance.
We want to collect evidence against the null hypothesis and in support of the alternative hypothesis.
Types of Hypothesis Testing
Two-sided test
- Example: A serum sample is part of an assay kit sent by the manufacturer, who claims that the concentration of follicle-stimulating hormone (FSH) in the sample is 115 mIU/mL. We wish to test their claim, or (assuming their claim is true) to test the accuracy of our sampling procedure.
- Ho: μ = 115
- HA: μ ≠ 115
- Some clinical trial investigators advocate that all trials should be non-directional (two-sided), because one direction tests efficacy and the other tests harm.
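The two-sided test on this slide can be sketched as a one-sample t-test. The replicate measurements below are hypothetical, invented only to illustrate the mechanics of testing Ho: μ = 115.

```python
from scipy import stats

# Hypothetical replicate FSH measurements (mIU/mL) of the manufacturer's sample
fsh = [112.1, 118.3, 114.7, 116.2, 111.9, 117.5, 113.8, 115.4]

# Two-sided one-sample t-test of Ho: mu = 115 versus HA: mu != 115
t_stat, p_value = stats.ttest_1samp(fsh, popmean=115)

# Reject Ho at the alpha = 0.05 significance level only if p_value < 0.05
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
```

With these invented data the sample mean is very close to 115, so the test would not reject the manufacturer's claim.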
Types of Hypothesis Testing - continued
One-sided test
- Example: We wish to know whether the mean hemoglobin level of children who are exposed to high levels of lead is equal to that of the general population (12.3 g/100 ml). However, we believe that the hemoglobin level of the exposed children must, on average, be lower.
- Ho: μ ≥ 12.3
- HA: μ < 12.3
- "One-siders" advocate that studies have unique backgrounds and goals: if the research hypothesis is directional, then so should be the statistical hypothesis. This gains power ― if you pick the correct direction.
One-sided versus Two-sided Hypotheses
Many research hypotheses are often one-sided; however, the alternative hypothesis used when planning sample size is usually two-sided. Why?
Grant and manuscript reviewers are often critical if the hypothesis is one-sided. More importantly, in the past many well-supported one-sided hypotheses turned out to be wrong in randomized trials.
Therefore, statistical rigor requires the investigator to make the choice between one- and two-sided hypotheses before analyzing the data; switching to a one-sided alternative hypothesis to reduce the p-value (increase power) is cheating.
Bottom Line: Unless you are very confident that the effect is directional, choose a two-sided hypothesis.
P-value
The null hypothesis acts like a house of cards: we want to use our test statistic (whatever it is) as evidence to knock it down. How?
We need our test statistic to incorporate both the point estimate and its spread (interval estimate). We then attach a probability to this test statistic.
Specifically, the p-value is the probability that a test statistic will assume the observed value or something more extreme if Ho is true.
In order to claim any significant result, the p-value is compared to the “pre-determined” significance level α (typically 0.05) to decide whether Ho should be rejected. A “non-significant” result (p-value > α) does not mean no association; it only means that the result observed in the sample is small compared with what could have occurred by chance alone.
Cautions in P-value Interpretation
- The significance level cutoff (0.05 or otherwise) is not a hard-and-fast rule. Note: 0.05 is not a magic number!
- The p-value is not the probability that HA is true, given the data.
- Statistical significance does not always imply clinical significance.
  - Large studies can find statistical significance for very small effect sizes that may be clinically unimportant.
  - Small studies may not find statistical significance, but the effect may be clinically important.
- P-values make no reference to the magnitude of the effect of interest. Summary statistics (point estimates) and confidence intervals should be provided.
Types of Errors
In general, Type I errors are considered more serious. It turns out that as α decreases, β increases, and vice-versa. Therefore, we generally set α at a pre-specified value before the experiment. Then β will depend on the alternative hypothesis, the test statistic, and the sample size.
Errors: Analogy to Jury Decision
Jury Decision              | Statistical Test
Innocent                   | Null Hypothesis (no association)
Guilty                     | Alternative Hypothesis (an association)
Convict an innocent person | Claim an association when there is none (Type I Error)
Acquit a guilty person     | Claim no association when there is one (Type II Error)
Sample Size Approaches
- Specify the desired width of a confidence interval (estimation and precision)
- Bayesian approach to optimize a utility function (e.g., cost)
- Power of a test of hypothesis: this is the most common approach and the one we will focus upon.
Elements for Power Approach to Sample Size Estimation
- Specify Ho and either a one-sided or two-sided HA on a parameter (e.g., μ)
- Effect size: the deviation from the null that is deemed clinically significant
- Variability: inherent uncertainty
- Significance level (α): Type I error
- Power: complement of the Type II error; Power = 1 − β
- Sample size (n): how much evidence will be collected
Effect Size
- Effect size is the size of the deviation from the null that we wish to detect in our sample.
  - should be clinically (scientifically) important
  - most difficult aspect of sample size planning
  - make an informed guess from prior studies (e.g., pilot studies)
- Standardized (or relative) effect size is the effect size divided by the variability (e.g., standard deviation).
  - usually set between 0.1 and 0.5 for most studies
  - e.g., 0.5 means HA is 0.5 standard deviations away from Ho
  - unit free
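The standardized effect size defined above is a one-line calculation. The means and standard deviation below are hypothetical, chosen only to illustrate it.

```python
# Standardized (relative) effect size: raw effect divided by the variability.
# All numbers here are hypothetical planning inputs.
mean_null = 115.0   # mean under Ho
mean_alt = 120.0    # clinically meaningful mean under HA
sd = 10.0           # assumed standard deviation of the outcome

effect_size = mean_alt - mean_null      # raw effect, in outcome units
std_effect_size = effect_size / sd      # standardized effect, unit free

# 0.5 means HA sits half a standard deviation away from Ho
print(std_effect_size)  # -> 0.5
```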
Variability
- The greater the variability (or spread) in the outcome variable, the more likely it is that the values in the groups will overlap, and the more difficult it will be to demonstrate an overall difference between them.
- The investigator may need to obtain an estimate of the variability for the power calculation:
  - experience
  - historical data (literature)
  - pilot studies
- More complicated designs have multiple sources of variation. e.g., repeated measures designs require within-subject and between-subject variability.
Power
- Intuitive definition: the strength of the study.
- Textbook definition: the probability that a future test will reject Ho if HA is true.
- If power is high and the test is non-significant, this implies the effect, if any, is small (higher "negative predictive value").
- Studies with high power give researchers greater confidence that a statistically significant result reflects the truth (higher "positive predictive value").
- Power and effect size are directly related: all other things being equal, larger effect size → more power.
- Power and variability are inversely related: all other things being equal, smaller variability → more power.
- Power and sample size are directly related: all other things being equal, larger sample size → more power.
- Power and significance level (α) are directly related: all other things being equal, larger α → more power. However, increasing α increases the Type I error.
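The directional relationships on this slide can be checked numerically. The sketch below uses a normal-approximation power formula for a two-sided, two-sample comparison of means (it ignores the t-distribution correction); all inputs are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test of means.

    Normal-approximation sketch:
    power ~= Phi( delta / (sd * sqrt(2/n)) - z_{1-alpha/2} ).
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    ncp = delta / (sd * sqrt(2.0 / n_per_group))  # noncentrality parameter
    return norm.cdf(ncp - z_alpha)

# All other things equal (hypothetical numbers):
p_base       = power_two_sample(n_per_group=10, delta=5,  sd=10)  # baseline
p_large_n    = power_two_sample(n_per_group=40, delta=5,  sd=10)  # more subjects
p_big_effect = power_two_sample(n_per_group=10, delta=10, sd=10)  # bigger effect
p_big_sd     = power_two_sample(n_per_group=10, delta=5,  sd=20)  # more variability
```

Evaluating these confirms the slide: power rises with sample size and effect size, and falls as the standard deviation grows.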
Power versus Effect Size (continuous outcome example) [figure]
Power versus Standard Deviation (continuous outcome example) [figure]
Power versus Sample Size (continuous outcome example) [figure]
Steps for Sample Size Estimation
1. State Ho and either a one-sided or two-sided HA
2. Select the statistical test
3. Choose the effect size (and variability)
4. Set α and β
5. Calculate the sample size
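The steps above can be sketched for a two-sample comparison of means using the standard normal-approximation formula n = 2σ²(z₁₋α/₂ + z₁₋β)²/Δ² per group; the effect size and standard deviation below are hypothetical planning inputs, not values from the talk.

```python
from math import ceil, sqrt
from scipy.stats import norm

def n_per_group_two_means(delta, sd, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample comparison of means.

    Normal-approximation sketch: n = 2 * sd^2 * (z_{1-a/2} + z_{power})^2 / delta^2.
    A t-test would require slightly more subjects than this z-based formula gives.
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n = 2 * sd**2 * (z_alpha + z_beta)**2 / delta**2
    return ceil(n)

# Hypothetical inputs: detect a 5-unit difference with sd = 10,
# two-sided alpha = 0.05, 80% power
n = n_per_group_two_means(delta=5, sd=10, alpha=0.05, power=0.80)
print(n)  # -> 63 per group
```

Note that this z-based number is the textbook starting point; a statistician would refine it for the actual test and design.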
Typical Effect Size/Variability Required for Sample Size Estimation by Type of Outcome
Continuous Outcome
- Estimates of the means in each group
- Estimates of variability (e.g., standard deviation)
Categorical Outcome
- Estimates of the proportions in each group
Survival Outcome
- Estimates of the proportion of subjects surviving until time t in each group
Additional Issues to Consider that Affect Sample Size and Power
Hypothesis Goal
- Superiority
- Non-inferiority
- Equivalence
Study Design
- One-sample (e.g., test of means or proportions)
- Two-sample (e.g., test of means, proportions, survival)
- Paired (e.g., test of means or proportions, correlation)
- >2 samples (e.g., ANOVA)
- Longitudinal studies (e.g., repeated measures)
- Crossover studies
- Design effect (e.g., cluster sampling)
- Dose finding
Additional Issues to Consider that Affect Sample Size and Power - continued
Other Issues to Consider
- Multiple endpoints
- Unequal sample size
- Dropouts
- Noncompliance
- Interim analyses
- Multiple comparisons testing
- Adjustment for covariates (e.g., ANCOVA)
- Number of available subjects
- Sufficient power for an important secondary outcome
- Screening for eligibility (especially when invasive tests are needed to determine eligibility)
Dropouts
- Dropouts are not incorporated in sample size estimation formulas, so estimate the dropout rate during the design stage.
- Increase the size of the enrolled sample to account for dropouts.
- Example: to allow for a 20% dropout rate, inflate the sample size estimate by a factor of 1/(1 − 0.20) = 1.25.
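The inflation rule is a one-liner: divide the required completers by the expected completion rate and round up. The n of 32 below is hypothetical.

```python
from math import ceil

def inflate_for_dropout(n, dropout_rate):
    """Enrollment target so that roughly n subjects complete the study.

    Divides by (1 - dropout_rate); e.g., 20% dropout -> a factor of 1.25.
    """
    return ceil(n / (1.0 - dropout_rate))

# Hypothetical example: 32 completers needed, 20% dropout anticipated
enrolled = inflate_for_dropout(n=32, dropout_rate=0.20)
print(enrolled)  # -> 40 (32 / 0.8)
```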
Number of Available Subjects
- Example: The clinic sees approximately 20 patients per year with the condition of interest, of whom 10% may refuse to participate. Therefore, over the 2 years of the study, the sample size will be approximately 20 × 2 × 0.90 = 36 patients.
- Although most studies need to balance feasibility with study power, the sample size should not be determined by the number of available subjects alone.
- Perhaps consider a multicenter trial if the number of available subjects is an issue.
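The feasibility arithmetic from the slide's clinic example is trivial but worth writing down explicitly:

```python
# Feasibility check from the slide: ~20 eligible patients per year,
# 10% expected to refuse, over a 2-year accrual period.
patients_per_year = 20
refusal_rate = 0.10
years = 2

available = round(patients_per_year * years * (1 - refusal_rate))
print(available)  # -> 36
```

If the power calculation demands more than this ceiling, the design (or the number of sites) has to change, not the arithmetic.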
Sample Size Estimates are Approximations!
The assumptions that are made for the sample size calculation may not hold exactly for the study:
- The standard deviation of an outcome variable may have been poorly chosen.
- The proportion of subjects who improve on placebo therapy may have been under-estimated.
- The assumption that the distribution of the outcome of interest is normal may not be true.
Investigator and Statistician Relationship for Sample Size Estimation
Investigator: scientific knowledge of the disease, outcomes, device, etc.
Statistician: technical skills to incorporate investigator’s scientific knowledge into an appropriate sample size for the study design
In short, statisticians are not the enemy! Statisticians are there to contribute and support investigator research in order to obtain valid results.
Bad Example #1 for Grant or Manuscript Write-up
The sample size of 60 diabetic subjects has been chosen to provide adequate numbers of subjects for evaluating the treatments.
Bad Example #2 for Grant or Manuscript Write-up
A sample size of 50 subjects (25 per group) yielded 80% power to compare the placebo group to the active treatment group.
Bad Example #3 for Grant or Manuscript Write-up
A previous study in this area recruited 75 subjects and found a highly significant result (p=0.008), and therefore a similar sample size should be sufficient for our study.
Bad Example #4 for Grant or Manuscript Write-up
A formal power analysis was not done because there is no prior information on which to base the sample size estimate.
Good Example #1 for Grant or Manuscript Write-up
With 16 subjects per treatment group and an assumed within-group standard deviation of 25 mm, the study will have 90% statistical power for a two-sided, 0.05 significance level two-sample t-test to detect a difference of 30 mm in the visual analog scale (VAS) pain severity score over 3 months between a continuous and traditional 21/7 oral contraceptive pill (OCP). However, we anticipate subject dropout as high as 15%; therefore, we will recruit 19 subjects per treatment group.
Good Example #2 for Grant or Manuscript Write-up
With a sample size of 45 subjects per treatment group there is greater than 88% power to detect a difference between the proportion of ventricular tachyarrhythmia (VTA) inducible subjects in the placebo and active treatment groups when the proportion is 0.90 for placebo and 0.60 for active treatment.
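The claim in this example can be sanity-checked with a normal-approximation power calculation for comparing two proportions. This is a sketch of one common textbook formula, not necessarily the method the example's authors used.

```python
from math import sqrt
from scipy.stats import norm

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of proportions.

    Normal-approximation sketch: pooled standard error under Ho,
    unpooled standard error under HA (one common textbook form).
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se0 = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)          # SE under Ho
    se1 = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)  # SE under HA
    return norm.cdf((abs(p1 - p2) - z_alpha * se0) / se1)

# The example's scenario: VTA-inducible proportions of 0.90 (placebo)
# versus 0.60 (active treatment), with 45 subjects per group
power = power_two_proportions(p1=0.90, p2=0.60, n_per_group=45)
```

The result comes out above the quoted 88%, consistent with the write-up's "greater than 88% power."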
Good Example #3 for Grant or Manuscript Write-up
“The primary end point compared among the groups was the change in morning peak expiratory flow (PEF) from randomization to the end of the trial. Using the standard deviation for morning PEF of 36.6 liters per minute noted in a previous study, we calculated that a sample of 216 patients would provide a statistical power of 90 percent to detect the difference widely considered to be of clinical significance, 25 liters per minute, at a significance level of 4.8 percent, allowing a dropout rate of 15 percent. For the secondary end point — change in morning PEF from the first to the second period of intense combined therapy — we used the variability observed in the corticosteroid run-in period of a previous trial. We calculated that if 199 patients completed the study, the study would have a statistical power of 80 percent to detect a difference of 21 liters per minute in this morning PEF during the period of intense combined therapy between any two treatment groups. We further calculated that this sample would provide 80 percent power to detect a change of 13 liters per minute in morning PEF within groups.”
Boushey HA et al. Daily versus as-needed corticosteroids for mild persistent asthma. NEJM 2005; 352:1519-1528.
Good Example #4 for Grant or Manuscript Write-up
“To test the primary hypothesis that the mean time to the next suicide attempt during the follow-up period is different between treatment groups, a priori power calculations were based on the results of a previous randomized controlled trial with a similar protocol. The current sample size (N = 120) provided at least 80% power to detect a hazard ratio of 0.44 in terms of time to next suicide attempt between treatment groups using an assumed repeat attempt rate of 25.8% during the follow-up period and a 2-sided level of .05.”
Brown GK et al. Cognitive therapy for the prevention of suicide attempts: a randomized controlled trial. JAMA 2005; 294:563-570.
Summary
- Carefully consider and state the hypothesis to be tested a priori.
- Select an appropriate statistical test (e.g., two-sample t-test).
- Choose a clinically meaningful effect size, and provide an estimate of the variability (previous studies or literature).
- Select the Type I error (α) and Type II error (β) you are willing to accept.
- Calculate a sample size.
- Adjust the sample size estimate for dropout if necessary.
- Consult and collaborate with a statistician.
References
Lenth RV. Some practical guidelines for effective sample size determination. The American Statistician 2001; 55:187-193.
Florey C. Sample size for beginners. British Medical Journal 1993; 306:1181-1184.