Critically assessing published clinical trials: The danger of bias and wrong interpretations
Saskia Litière – EORTC Biostatistician
2. Populations
Statistics do not exist at the level of the individual.
• CONFLICT:
  • The physician thinks in terms of specific patients.
  • The statistician thinks in terms of groups of patients (populations). Note that this can be a very specific population.
• A population treatment benefit does not necessarily translate into an individual treatment benefit.
• Statistics only make sense as a summary of a broader group:
  • Patient x dies after 1 year on the experimental therapy.
  • Patient y dies after 3 years on the standard therapy.
  • Neither observation, on its own, tells us which therapy is better.
3. Decision making: hypothesis testing
• Current belief or knowledge: H0, the null hypothesis
  • Trusted until it is invalidated by enough new evidence, following the "innocent until proven guilty" principle
• What we want to demonstrate: H1, the alternative hypothesis
  • Often expressed as a targeted treatment effect
  • The smaller the "difference" you want to detect, the larger the sample size
  • The target effect should be large enough to be clinically relevant, yet small enough to be realistic
"I'm not deploying my parachute until scientists get more certain about this gravity stuff."
Study data vs. reality (unknown):

                   Reject H0                   Do not reject H0
H0 true            WRONG: false positive,      CORRECT
                   or type I error (α)
H1 true            CORRECT                     WRONG: false negative,
                                               or type II error (β)

Power of the test, 1 – β: the probability that a test/trial will produce a statistically significant result, given a true difference of a certain magnitude between treatments.
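The notions of α and power above can be made concrete with a small simulation. The sketch below (Python, standard library only; the normal outcome model, unit SD, and the 0.5 SD target effect are illustrative assumptions, not from the lecture) estimates empirically how often a trial reaches p < 0.05:

```python
import math
import random

def two_sample_z_p(x, y):
    """Two-sided p-value of a two-sample z-test (SD known to be 1)."""
    z = (sum(x) / len(x) - sum(y) / len(y)) / math.sqrt(1 / len(x) + 1 / len(y))
    return math.erfc(abs(z) / math.sqrt(2))

def simulated_power(n_per_arm, true_diff, alpha=0.05, n_sim=2000, seed=42):
    """Fraction of simulated trials reaching p < alpha: the empirical power
    (or, when true_diff = 0, the empirical type I error rate)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        treated = [rng.gauss(true_diff, 1.0) for _ in range(n_per_arm)]
        hits += two_sample_z_p(treated, control) < alpha
    return hits / n_sim

# No true difference: ~5% of trials are still "significant" (that is alpha).
print(simulated_power(64, 0.0))
# A true 0.5 SD difference with 64 patients/arm: power of roughly 80%.
print(simulated_power(64, 0.5))
```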
4. The 'ingredients' of the sample size
The sample size depends on:
• Design
• Endpoint
• Power
• Target effect
• Type I error
• Patient availability
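How several of these ingredients combine can be sketched with the standard normal-approximation sample size formula for comparing two means (an illustrative simplification; real trial designs, e.g. for survival endpoints, are more involved):

```python
import math
from statistics import NormalDist

def n_per_arm(target_diff_in_sd, alpha=0.05, power=0.80):
    """Patients per arm for a two-arm comparison of means (two-sided test
    at level alpha), with the target effect expressed in SD units:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / delta)^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / target_diff_in_sd) ** 2)

# The smaller the difference you want to detect, the larger the trial:
print(n_per_arm(0.5))    # 63 patients per arm
print(n_per_arm(0.25))   # 252 per arm: halving the effect quadruples N
print(n_per_arm(0.5, power=0.95))  # demanding more power also costs patients
```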
Why does sample size matter?
Too many patients:
• An unnecessary number of patients is exposed to an experimental drug (risk of unexpected toxicities, detrimental effects)
Too few patients:
• The study ends up with inconclusive results
Either way, it is not a good use of resources.
What is bias?
• Wikipedia: In statistics, bias is systematic favoritism present in data collection, analysis or reporting of quantitative research.
• Bias does not need to be intentional (it usually is not) and can be well hidden.
• Increasing the sample size yields more precision, but does not help against bias!
• The data will look more convincing with a larger sample, so the bias problem becomes even more worrisome.
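A toy simulation makes the point. Below (Python, standard library only; the selection mechanism, in which above-average subjects are twice as likely to enter the sample, is a made-up assumption for illustration) the estimate stays biased at any sample size, while the confidence interval around the wrong value only narrows:

```python
import random
import statistics

def biased_estimate(n, seed=0):
    """Estimate the mean of a population whose true mean is 0, from a
    sample in which above-average subjects are twice as likely to be
    included (a crude model of selection bias)."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        x = rng.gauss(0.0, 1.0)
        if x > 0 or rng.random() < 0.5:  # biased inclusion rule
            sample.append(x)
    mean = statistics.fmean(sample)
    half_ci = 1.96 * statistics.stdev(sample) / n ** 0.5
    return mean, half_ci

# The estimate converges to about +0.27, not 0: more data only makes
# the wrong answer look more precise.
for n in (100, 10_000):
    mean, half_ci = biased_estimate(n)
    print(f"n={n:6d}  estimate={mean:+.3f} +/- {half_ci:.3f}  (truth: 0)")
```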
Bias may come in at any time:
• Onset: patient selection; treatment allocation
• Conduct: in the clinic; during data management
• Analysis: inappropriate methodology; patient selection, multiple testing
• Report: selection of reported results; interpretation of results
Selection bias
• Authors checked the % of patients with stage III/IV NSCLC discussed at the thoracic cancer MDT at Concord Hospital in Sydney who would meet the eligibility criteria of 6 trials testing CT+MTA: only 43% (range: 24%-69%) did!
• Major reasons for exclusion: comorbidities (40%), PS≥2 (39%), symptomatic brain metastases (8%), prior cancer (11%)
• However, 66% of the non-eligible patients were actually treated with CT+MTAs (median 5 cycles, median OS 10.3 months).
Patient selection
• Patients on trials are so highly selected that they are not representative of the majority of patients with advanced NSCLC!
• Despite all efforts to select the right population, there is still limited control over the selection of patients actually entering the trial:
  • Competing trials
  • Investigators preferentially enrolling the best-prognosis or worst-prognosis patients
• Consequently, the extrapolation of trial results to clinical practice may not always be correct/guaranteed: the generalizability of results is uncertain.
Treatment allocation
[Figure: overall survival under MarvelDrug® vs. WonderPill®, for older vs. younger patients, without and with randomization]
Solution: treatment is allocated at random, not chosen
Randomization allows causal reasoning
Randomization is highly desirable, but not always possible. Certain lifestyle choices (e.g. physical exercise, smoking) are difficult to randomize.
15
• In non-randomized studies (e.g. observational studies), check whether the analysis was adjusted for important potential confounders.
• Be aware that an adjusted analysis is not fool-proof (a statistician is not a magician!):
  o statistical models rely on assumptions
  o quantifying potential confounders is rarely unambiguous
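The MarvelDrug®/WonderPill® example above can be sketched as a confounding simulation (Python, standard library only; all numbers are invented for illustration). Neither drug has any effect, yet the naive comparison shows a large difference that vanishes within age strata:

```python
import random
import statistics

rng = random.Random(7)

# Simulated observational data: older patients tend to receive WonderPill,
# younger patients MarvelDrug; age, not the drug, drives survival.
patients = []
for _ in range(20_000):
    older = rng.random() < 0.5
    gets_wonder = rng.random() < (0.8 if older else 0.2)
    drug = "WonderPill" if gets_wonder else "MarvelDrug"
    survival = rng.gauss(2.0 if older else 5.0, 1.0)  # years; zero drug effect
    patients.append((older, drug, survival))

def mean_survival(drug, older=None):
    """Mean survival for a drug, optionally within one age stratum."""
    return statistics.fmean(s for o, d, s in patients
                            if d == drug and (older is None or o == older))

# Naive comparison: MarvelDrug looks ~2 years better...
print(mean_survival("MarvelDrug") - mean_survival("WonderPill"))
# ...but the difference disappears within each age stratum:
print(mean_survival("MarvelDrug", older=True) - mean_survival("WonderPill", older=True))
print(mean_survival("MarvelDrug", older=False) - mean_survival("WonderPill", older=False))
```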
Operational bias
A few common problems
• A patient is ineligible, but this is only discovered after randomization (protocol deviation at entry)
• A patient does not return for a visit during the allotted time (protocol deviation)
• Investigators do not perform all of the tests required by the protocol (missing data)
• A patient does not comply with the allocated treatment (protocol deviation)
• A patient does not return for any visits (missing data / lost to follow-up / dropout)
Accrual, conduct and data collection during a trial
Possible complications
• Example: some patients taking the new treatment are too sick to attend the next visit within the allotted time, due to side effects.
• These could be the weaker patients, so data go missing unevenly between the standard and the new treatment arm.
• Intention-to-treat (ITT): all randomized patients are included in the analysis in the arm they were allocated to. This includes:
  • ineligible patients
  • patients receiving a different treatment
  • poor compliers
  • patients with protocol violations
  • ...
• ITT is necessary to maintain the balance obtained by randomization.
• It is conservative, but also more reflective of clinical practice.
How do we deal with this?
ITT or ITT?
“Of the 403 articles published in 2002 in 10 medical journals, 249 (62%) reported the use of ITT. Among these, available patients were clearly analyzed as randomized in 192 (77%). Authors used a modified ITT in 9%; clearly violated a major component of ITT in 7%, and the approach used was unclear in 7%. […] This study emphasizes that authors use the label ‘intention-to-treat’ quite differently”
Gravel et al, Clin Trials 2009.
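The rationale for ITT can be illustrated with a simulation (Python, standard library only; the frailty and compliance numbers are invented). There is no true treatment effect, but non-compliance on the new arm is linked to prognosis, so a per-protocol analysis that drops non-compliers is biased while ITT is not:

```python
import random
import statistics

rng = random.Random(1)

def simulate_trial(n=20_000):
    """Two randomized arms, zero true treatment effect; frail patients
    on the new drug often cannot comply with treatment."""
    rows = []
    for _ in range(n):
        arm = "new" if rng.random() < 0.5 else "standard"
        frail = rng.random() < 0.3
        outcome = rng.gauss(1.0 if frail else 3.0, 0.5)  # years
        complied = not (arm == "new" and frail and rng.random() < 0.8)
        rows.append((arm, complied, outcome))
    return rows

def arm_mean(rows, arm, per_protocol=False):
    """Mean outcome per arm; the per-protocol analysis keeps only compliers."""
    return statistics.fmean(o for a, complied, o in rows
                            if a == arm and (complied or not per_protocol))

rows = simulate_trial()
# ITT: both arms look alike, correctly reflecting the absent effect:
print(arm_mean(rows, "new") - arm_mean(rows, "standard"))
# Per-protocol: excluding non-compliers removes mostly frail patients
# from the new arm only, so the new drug falsely appears better:
print(arm_mean(rows, "new", per_protocol=True)
      - arm_mean(rows, "standard", per_protocol=True))
```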
Quiz: true or false?
The progression assessment schedules of two treatment arms in a randomized clinical trial can differ, as long as the schedules are strictly followed.
Does it matter?
• Assessing survival: the date of death is known and documented.
• Assessing progression: the date of progression is only recorded and reported at the next visit.
Bias in the endpoint
• The schedule of assessments matters.
• Compliance with the schedule matters.
[Figure: timelines of randomization and visits 1-3 for Arm 1 and Arm 2, comparing scheduled vs. actual visit dates; when actual visits slip relative to schedule, recorded progression dates shift with them]
Impact of deviations in the assessments
• The true median PFS is 18 weeks in both groups (A and B).
• The PFS curves drop at each visit.
• Drug B appears better only because its visits are systematically delayed by a few weeks.
(Gignac GA et al. Cancer 2008; 113(5): 966-74)
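This phenomenon can be reproduced in a few lines (Python, standard library only; exponential progression times with an 18-week median, matching the example above, are an illustrative assumption). Progression is only recorded at the next scheduled visit, so the arm assessed less frequently appears to do better:

```python
import math
import random
import statistics

rng = random.Random(3)

def observed_mean_pfs(visit_interval_weeks, n=20_000, true_median=18.0):
    """True progression times are identical in both arms (exponential,
    median 18 weeks), but progression is only *recorded* at the next
    scheduled visit, i.e. rounded up to a multiple of the interval."""
    rate = math.log(2) / true_median
    recorded = [math.ceil(rng.expovariate(rate) / visit_interval_weeks)
                * visit_interval_weeks for _ in range(n)]
    return statistics.fmean(recorded)

# Same disease course, different assessment schedules; both recorded
# means exceed the true mean of ~26 weeks, and the less frequently
# assessed arm shows a longer apparent PFS:
print(observed_mean_pfs(6))  # assessed every 6 weeks
print(observed_mean_pfs(9))  # assessed every 9 weeks: looks better
```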
There is still more to think about
• Investigators are very optimistic about the effect of the new treatment.
• Investigators may monitor the adverse effects of the new treatment more carefully.
• A patient may be disappointed to receive the “old” standard treatment.
These are examples of systematic observer bias, which may dilute the actual treatment effect!
Blinding
Try to increase the objectivity of the persons involved in the trial by blinding: the study team, physicians and patients do not know the treatment allocation.
Of special interest when:
• The outcome is subjective: reduction in pain, ...
• Comparing with placebo: to correct for the placebo effect
Blinding can be done in more than one way:
• Full (double) blinding of (almost) all involved in the trial to the drug being administered
• Blinding of endpoint review: for example, an independent assessment of response/progression can be blinded to treatment arm
Blinding is not always feasible/ethical
"SCALPEL..."
When randomized treatments involve radiotherapy/surgery, or when certain drugs require a different route of administration or more intensive hospital visits, ...
Statistical bias
Predictive factors ...
"Patients born in winter have better survival": a strong effect in p53 wild type, not in mutated.
[Figure: duration of survival (years) for winter vs. summer births. p53 wild type: overall logrank test p=0.028, HR=0.66 (CI: 0.45-0.96). p53 mutated: overall logrank test p=0.54, HR=0.91 (CI: 0.66-1.24).]
Based on data from Bonnefoi et al. 2011 Lancet Oncology
Multiplicity / overtesting
• The interpretation of p-values is mathematically correct if you do 1 test.
• Under multiple testing (e.g. subgroup analyses, or many covariates), the interpretation changes.
• Perform K tests, each at the α = 0.05 significance level: the chance of a false positive is 5% ... for each test.
• The overall type I error rate (the risk of finding at least one spurious statistically significant result) is
  α_overall = 1 – (1 – α)^K
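A small computation of this formula shows how fast the overall error rate grows (Python, no external libraries):

```python
def overall_type_i_error(alpha, k):
    """Chance of at least one false positive among k independent tests,
    each performed at significance level alpha: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** k

# Prints 5%, 23%, 40% and 64%: with 20 looks at the data, finding at
# least one spurious "significant" result is more likely than not.
for k in (1, 5, 10, 20):
    risk = overall_type_i_error(0.05, k)
    print(f"{k:2d} tests at alpha=0.05 -> {risk:.0%} risk of a spurious finding")
```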
A difference is only a difference if it makes a difference
• As a result, even when there is no effect overall, one can always find at least one subset with some apparent effect ... it is just a matter of trying long enough.
• And then hindsight bias kicks in: the "I-knew-it-all-along" effect, the inclination to see past events as having been predictable.
• This is why subgroup analyses should be preplanned ... and seldom are.
• Even if preplanned, such analyses are still open to the criticism of multiplicity.
“Popes live longer than artists”
Carrieri MP, Serraino D. Longevity of popes and artists between the 13th and the 19th century. Int J Epidemiol 2005;34:1435–36.
Hanley JA, Carrieri MP, Serraino D. Statistical fallibility and the longevity of popes: William Farr meets Wilhelm Lexis. Int J Epidemiol 2006;35:802-805.
“Women with amenorrhea and ER- breast cancer had improved DFS”
Swain SM, Jeong JH, Geyer CE Jr, et al. (2010) Longer therapy, iatrogenic amenorrhea, and survival in early breast cancer. N Engl J Med 362:2053–2065.
• Randomized phase III clinical trial in premenopausal women with lymph node–positive breast cancer. (Data from IBCSG 13-93 trial)
• A woman was classified as amenorrheic if there was at least one report of no menses during the first five follow-up visits (15 to 18 months) after random assignment
[Figure: DFS curves by amenorrhea status, shown separately for ER+ and ER- patients]
Improper subgroups: subgroups of patients classified by an event measured after randomization and potentially affected by treatment.
This is a frequently made mistake!
• Examples: survival analysis by response, by compliance or amount of treatment received, by severity of side effects
• Lead time bias: "responders" are selected to be patients that survived long enough to achieve a response.
SOLUTION: landmark analysis. Subgroup classification is assessed at the end of a fixed time period, and survival after that time period is compared (patients with an event before the landmark are excluded). Other solutions besides a landmark analysis that you might encounter in journals: dynamic prediction, time-dependent covariates.
Statistical bias: Improper subgroups
Giobbie-Hurder A, Gelber RD, Regan MM (2013). Challenges of Guarantee-Time Bias. JCO 31:2963-2969.
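Guarantee-time bias and the landmark fix can be demonstrated directly (Python, standard library only; the exponential survival model and the response mechanism are invented for illustration). "Response" has no effect on survival, yet the naive responder vs. non-responder comparison shows a large advantage that a landmark analysis removes:

```python
import random
import statistics

rng = random.Random(11)

def simulate(n=20_000):
    """Survival is unrelated to 'response', but a response can only be
    observed if the patient survives to the (random) response time."""
    patients = []
    for _ in range(n):
        survival = rng.expovariate(1 / 12.0)   # months, mean 12
        would_respond = rng.random() < 0.4
        response_time = rng.uniform(0, 6)       # months
        responder = would_respond and survival > response_time
        patients.append((survival, responder))
    return patients

def mean_survival(patients, responder, landmark=0.0):
    """Mean survival per group, excluding events before the landmark."""
    return statistics.fmean(s for s, r in patients
                            if r == responder and s > landmark)

patients = simulate()
# Naive comparison: responders "live longer" (guarantee-time bias):
print(mean_survival(patients, True) - mean_survival(patients, False))
# Landmark analysis at 6 months: condition both groups on being alive
# at the landmark, and the spurious advantage essentially disappears:
print(mean_survival(patients, True, landmark=6)
      - mean_survival(patients, False, landmark=6))
```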
Landmark analysis
"Women with amenorrhea and ER- breast cancer had improved DFS"
[Figure: landmark DFS curves by amenorrhea status, shown separately for ER+ and ER- patients]
Reporting bias
The 'success reporting' chart
• Is your primary endpoint successful?
  • Yes: don't touch the data! Publish ASAP.
  • No: are any other endpoints successful?
    • Yes: declare no effect, but report the secondary finding; suggest a success on the secondary endpoint.
    • No: declare that no effect was found; write it up as 'inconclusive', find a subgroup, ...
Vera-Badillo et al. Ann Oncol 2013: 59% of 92 trials with a negative primary endpoint used secondary endpoints to suggest a benefit of the experimental therapy.
Towards eradication of publication bias: the evolving landscape of public disclosure
• 2000-1: FDA/NIH launches clinicaltrials.gov; CONSORT statement
• 2004-5: ICMJE mandates registration of CTs; WHO calls for prospective registration of CTs
• 2007-9: more legal obligation to register; FDAAA results posting (for drugs); ICMJE editorial decisions not based on results (P)
• 2010-11: EudraCT protocol disclosure; ICMJE clarifies "conflict of interest"
• 2012-14: EudraCT full protocol disclosure & results disclosure; EU CT regulation: lay public summary, EMA database
• 2015-16: EMA disclosure of results before product approval; FDA disclosure of results before product approval; ICMJE statement about sharing data
Blanket terms in clinical trials
• Due to space limitations, catch-all phrases or terms are used to describe (slightly) different processes.
• E.g.: neoadjuvant chemotherapy or primary surgery in advanced ovarian cancer. Vergote et al. NEJM 2010.
In the paper vs. in the protocol:
• "advanced ovarian cancer" = 1.5 pages of eligibility criteria
• "neo-adjuvant chemotherapy" = 2 pages of schedule, doses, modifications, guidelines
• "primary surgery" = 0.5 page + 3 pages of appendix guidelines
2 breast cancer trials
[Figures: Pagani et al. NEJM 2014; Bonnefoi et al. Lancet Oncology 2011]
Relative treatment effect summary statistics are useful ... but they don't tell the whole story.
Statistical significance ≠ clinical significance!
A statistical design should be adequately powered for a clinically significant effect.
No matter how small the true difference between treatments, one can design a trial with a high chance (e.g. power > 95%) of achieving p < 0.05, simply by increasing N.
An "overpowered" trial will give you a high chance of finding a statistically significant result for a difference that is not clinically meaningful.
Look at all of:
• The observed effect size (hazard ratio, difference in 5-year OS/PFS): is it clinically meaningful?
• The associated p-value: is it statistically meaningful?
• The confidence interval
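The "overpowered trial" point can be checked numerically. The sketch below (Python, standard library only; a z-test on a clinically trivial difference of 0.02 SD, with the observed difference set to its expected value for illustration) shows the p-value collapsing as N grows, with no change in clinical meaning:

```python
import math

def z_test_p(observed_diff, sd, n_per_arm):
    """Two-sided p-value of a two-sample z-test for a difference in means."""
    se = sd * math.sqrt(2 / n_per_arm)
    z = observed_diff / se
    return math.erfc(abs(z) / math.sqrt(2))

# A clinically trivial difference of 0.02 SD becomes "highly
# significant" once the trial is large enough:
for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,} per arm  p={z_test_p(0.02, 1.0, n):.2g}")
```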
Very important to be aware of potential biases
• Develop your own dose of critical thinking when reading study results.
• Don't always believe what is claimed!
  ... even if it is published in a major journal
  ... or told by a key opinion leader
• Reach a level of familiarity with the statistics frequently used in your field, to reduce your vulnerability to misinterpretation. Find access to a statistician when you need one.
Acknowledgment
Stats colleagues at the EORTC, in particular Laurence Collette
Thank you!