Critically assessing published clinical trials: The danger of bias and wrong interpretations
Saskia Litière – EORTC Biostatistician
2. Populations
Statistics do not exist at the level of the individual.
• CONFLICT:
  • The physician thinks in terms of specific patients.
  • The statistician thinks in terms of groups of patients (populations). Note that this can be a very specific population.
• A population treatment benefit does not necessarily translate into an individual treatment benefit.
• Statistics only make sense as a summary of a broader group:
  • Patient x dies after 1 year on the experimental therapy.
  • Patient y dies after 3 years on the standard therapy.
  • Neither observation, on its own, tells us which therapy is better.
3. Decision making: hypothesis testing
• Current belief or knowledge: H0, the null hypothesis
  • Trusted until it is invalidated by enough new evidence, following the "innocent until proven guilty" principle
• What we want to demonstrate: H1, the alternative hypothesis
  • Often expressed as a targeted treatment effect
  • The smaller the "difference" you want to detect, the larger the sample size
  • The target effect should be large enough to be clinically relevant, yet small enough to be realistic
"I'm not deploying my parachute until scientists get more certain about this gravity stuff."
Study data vs. reality (unknown):

                   Reject H0                   Do not reject H0
H0 true            WRONG: false positive,      CORRECT
                   or type I error (α)
H1 true            CORRECT                     WRONG: false negative,
                                               or type II error (β)

Power of the test, 1 – β: the probability that a test/trial will produce a statistically significant result, given a true difference of a certain magnitude between treatments.
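The notions of α and power above can be made concrete with a small simulation. The sketch below (Python, standard library only; the normal outcome model, unit SD, and the 0.5 SD target effect are illustrative assumptions, not from the lecture) estimates empirically how often a trial reaches p < 0.05:

```python
import math
import random

def two_sample_z_p(x, y):
    """Two-sided p-value of a two-sample z-test (SD known to be 1)."""
    z = (sum(x) / len(x) - sum(y) / len(y)) / math.sqrt(1 / len(x) + 1 / len(y))
    return math.erfc(abs(z) / math.sqrt(2))

def simulated_power(n_per_arm, true_diff, alpha=0.05, n_sim=2000, seed=42):
    """Fraction of simulated trials reaching p < alpha: the empirical power
    (or, when true_diff = 0, the empirical type I error rate)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        treated = [rng.gauss(true_diff, 1.0) for _ in range(n_per_arm)]
        hits += two_sample_z_p(treated, control) < alpha
    return hits / n_sim

# No true difference: ~5% of trials are still "significant" (that is alpha).
print(simulated_power(64, 0.0))
# A true 0.5 SD difference with 64 patients/arm: power of roughly 80%.
print(simulated_power(64, 0.5))
```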
4. The 'ingredients' of the sample size
The sample size depends on:
• Design
• Endpoint
• Power
• Target effect
• Type I error
• Patient availability
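How several of these ingredients combine can be sketched with the standard normal-approximation sample size formula for comparing two means (an illustrative simplification; real trial designs, e.g. for survival endpoints, are more involved):

```python
import math
from statistics import NormalDist

def n_per_arm(target_diff_in_sd, alpha=0.05, power=0.80):
    """Patients per arm for a two-arm comparison of means (two-sided test
    at level alpha), with the target effect expressed in SD units:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / delta)^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / target_diff_in_sd) ** 2)

# The smaller the difference you want to detect, the larger the trial:
print(n_per_arm(0.5))    # 63 patients per arm
print(n_per_arm(0.25))   # 252 per arm: halving the effect quadruples N
print(n_per_arm(0.5, power=0.95))  # demanding more power also costs patients
```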
Why does sample size matter?
Too many patients:
• An unnecessary number of patients is exposed to an experimental drug (risk of unexpected toxicities, detrimental effects)
Too few patients:
• The study ends up with inconclusive results
Either way, it is not a good use of resources.
What is bias?
• Wikipedia: In statistics, bias is systematic favoritism present in data collection, analysis or reporting of quantitative research.
• Bias does not need to be intentional (it usually is not) and can be well hidden.
• Increasing the sample size yields more precision, but does not help against bias!
• The data will look more convincing with a larger sample, so the bias problem becomes even more worrisome.
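A toy simulation makes the point. Below (Python, standard library only; the selection mechanism, in which above-average subjects are twice as likely to enter the sample, is a made-up assumption for illustration) the estimate stays biased at any sample size, while the confidence interval around the wrong value only narrows:

```python
import random
import statistics

def biased_estimate(n, seed=0):
    """Estimate the mean of a population whose true mean is 0, from a
    sample in which above-average subjects are twice as likely to be
    included (a crude model of selection bias)."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        x = rng.gauss(0.0, 1.0)
        if x > 0 or rng.random() < 0.5:  # biased inclusion rule
            sample.append(x)
    mean = statistics.fmean(sample)
    half_ci = 1.96 * statistics.stdev(sample) / n ** 0.5
    return mean, half_ci

# The estimate converges to about +0.27, not 0: more data only makes
# the wrong answer look more precise.
for n in (100, 10_000):
    mean, half_ci = biased_estimate(n)
    print(f"n={n:6d}  estimate={mean:+.3f} +/- {half_ci:.3f}  (truth: 0)")
```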
Bias may come in at any time:
• Onset: patient selection; treatment allocation
• Conduct: in the clinic; during data management
• Analysis: inappropriate methodology; patient selection, multiple testing
• Report: selection of reported results; interpretation of results
Selection bias
• Authors checked the % of patients with stage III/IV NSCLC discussed at the thoracic cancer MDT at Concord Hospital in Sydney who would meet the eligibility criteria of 6 trials testing CT+MTA: only 43% (range: 24%-69%) did!
• Major reasons for exclusion: comorbidities (40%), PS≥2 (39%), symptomatic brain metastases (8%), prior cancer (11%)
• However, 66% of the non-eligible patients were actually treated with CT+MTAs (median 5 cycles, median OS 10.3 months).
Patient selection
• Patients on trials are so highly selected that they are not representative of the majority of patients with advanced NSCLC!
• Despite all efforts to select the right population, there is still limited control over the selection of patients actually entering the trial:
  • Competing trials
  • Investigators preferentially enrolling the best-prognosis or worst-prognosis patients
• Consequently, the extrapolation of trial results to clinical practice may not always be correct/guaranteed: the generalizability of results is uncertain.
Treatment allocation
[Figure: overall survival under MarvelDrug® vs. WonderPill®, for older vs. younger patients, without and with randomization]
Solution: treatment is allocated at random, not chosen
Randomization allows causal reasoning
Randomization is highly desirable, but not always possible. Certain lifestyle choices (e.g. physical exercise, smoking) are difficult to randomize.
15
• In non-randomized studies (e.g. observational studies), check whether the analysis was adjusted for important potential confounders.
• Be aware that an adjusted analysis is not fool-proof (a statistician is not a magician!):
  o statistical models rely on assumptions
  o quantifying potential confounders is rarely unambiguous
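The MarvelDrug®/WonderPill® example above can be sketched as a confounding simulation (Python, standard library only; all numbers are invented for illustration). Neither drug has any effect, yet the naive comparison shows a large difference that vanishes within age strata:

```python
import random
import statistics

rng = random.Random(7)

# Simulated observational data: older patients tend to receive WonderPill,
# younger patients MarvelDrug; age, not the drug, drives survival.
patients = []
for _ in range(20_000):
    older = rng.random() < 0.5
    gets_wonder = rng.random() < (0.8 if older else 0.2)
    drug = "WonderPill" if gets_wonder else "MarvelDrug"
    survival = rng.gauss(2.0 if older else 5.0, 1.0)  # years; zero drug effect
    patients.append((older, drug, survival))

def mean_survival(drug, older=None):
    """Mean survival for a drug, optionally within one age stratum."""
    return statistics.fmean(s for o, d, s in patients
                            if d == drug and (older is None or o == older))

# Naive comparison: MarvelDrug looks ~2 years better...
print(mean_survival("MarvelDrug") - mean_survival("WonderPill"))
# ...but the difference disappears within each age stratum:
print(mean_survival("MarvelDrug", older=True) - mean_survival("WonderPill", older=True))
print(mean_survival("MarvelDrug", older=False) - mean_survival("WonderPill", older=False))
```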
Operational bias
A few common problems
• A patient is ineligible, but this is only discovered after randomization (protocol deviation at entry)
• A patient does not return for a visit during the allotted time (protocol deviation)
• Investigators do not perform all of the tests required by the protocol (missing data)
• A patient does not comply with the allocated treatment (protocol deviation)
• A patient does not return for any visits (missing data / lost to follow-up / dropout)
Accrual, conduct and data collection during a trial
Possible complications
• Example: some patients taking the new treatment are too sick to attend the next visit within the allotted time, due to side effects.
• These could be the weaker patients, so data go missing unevenly between the standard and the new treatment arm.
• Intention-to-treat (ITT): all randomized patients are included in the analysis in the arm they were allocated to. This includes:
  • ineligible patients
  • patients receiving a different treatment
  • poor compliers
  • patients with protocol violations
  • ...
• ITT is necessary to maintain the balance obtained by randomization.
• It is conservative, but also more reflective of clinical practice.
How do we deal with this?
ITT or ITT?
“Of the 403 articles published in 2002 in 10 medical journals, 249 (62%) reported the use of ITT. Among these, available patients were clearly analyzed as randomized in 192 (77%). Authors used a modified ITT in 9%; clearly violated a major component of ITT in 7%, and the approach used was unclear in 7%. […] This study emphasizes that authors use the label ‘intention-to-treat’ quite differently”
Gravel et al, Clin Trials 2009.
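The rationale for ITT can be illustrated with a simulation (Python, standard library only; the frailty and compliance numbers are invented). There is no true treatment effect, but non-compliance on the new arm is linked to prognosis, so a per-protocol analysis that drops non-compliers is biased while ITT is not:

```python
import random
import statistics

rng = random.Random(1)

def simulate_trial(n=20_000):
    """Two randomized arms, zero true treatment effect; frail patients
    on the new drug often cannot comply with treatment."""
    rows = []
    for _ in range(n):
        arm = "new" if rng.random() < 0.5 else "standard"
        frail = rng.random() < 0.3
        outcome = rng.gauss(1.0 if frail else 3.0, 0.5)  # years
        complied = not (arm == "new" and frail and rng.random() < 0.8)
        rows.append((arm, complied, outcome))
    return rows

def arm_mean(rows, arm, per_protocol=False):
    """Mean outcome per arm; the per-protocol analysis keeps only compliers."""
    return statistics.fmean(o for a, complied, o in rows
                            if a == arm and (complied or not per_protocol))

rows = simulate_trial()
# ITT: both arms look alike, correctly reflecting the absent effect:
print(arm_mean(rows, "new") - arm_mean(rows, "standard"))
# Per-protocol: excluding non-compliers removes mostly frail patients
# from the new arm only, so the new drug falsely appears better:
print(arm_mean(rows, "new", per_protocol=True)
      - arm_mean(rows, "standard", per_protocol=True))
```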
Quiz: true or false?
The progression assessment schedules of two treatment arms in a randomized clinical trial can differ, as long as the schedules are strictly followed.
Does it matter?
• Assessing survival: the date of death is known and documented.
• Assessing progression: the date of progression is only recorded and reported at the next visit.
Bias in the endpoint
• The schedule of assessments matters.
• Compliance with the schedule matters.
[Figure: timelines of randomization and visits 1-3 for Arm 1 and Arm 2, comparing scheduled vs. actual visit dates; when actual visits slip relative to schedule, recorded progression dates shift with them]
Impact of deviations in the assessments
• The true median PFS is 18 weeks in both groups (A and B).
• The PFS curves drop at each visit.
• Drug B appears better only because its visits are systematically delayed by a few weeks.
(Gignac GA et al. Cancer 2008; 113(5): 966-74)
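This phenomenon can be reproduced in a few lines (Python, standard library only; exponential progression times with an 18-week median, matching the example above, are an illustrative assumption). Progression is only recorded at the next scheduled visit, so the arm assessed less frequently appears to do better:

```python
import math
import random
import statistics

rng = random.Random(3)

def observed_mean_pfs(visit_interval_weeks, n=20_000, true_median=18.0):
    """True progression times are identical in both arms (exponential,
    median 18 weeks), but progression is only *recorded* at the next
    scheduled visit, i.e. rounded up to a multiple of the interval."""
    rate = math.log(2) / true_median
    recorded = [math.ceil(rng.expovariate(rate) / visit_interval_weeks)
                * visit_interval_weeks for _ in range(n)]
    return statistics.fmean(recorded)

# Same disease course, different assessment schedules; both recorded
# means exceed the true mean of ~26 weeks, and the less frequently
# assessed arm shows a longer apparent PFS:
print(observed_mean_pfs(6))  # assessed every 6 weeks
print(observed_mean_pfs(9))  # assessed every 9 weeks: looks better
```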
There is still more to think about
• Investigators are very optimistic about the effect of the new treatment.
• Investigators may monitor the adverse effects of the new treatment more carefully.
• A patient may be disappointed to receive the “old” standard treatment.
These are examples of systematic observer bias, which may dilute the actual treatment effect!
Blinding
Try to increase the objectivity of the persons involved in the trial by blinding: the study team, physicians and patients do not know the treatment allocation.
Of special interest when:
• The outcome is subjective: reduction in pain, ...
• Comparing with placebo: to correct for the placebo effect
Blinding can be done in more than one way:
• Full (double) blinding of (almost) all involved in the trial to the drug being administered
• Blinding of endpoint review: for example, an independent assessment of response/progression can be blinded to treatment arm
Blinding is not always feasible/ethical
"SCALPEL..."
When randomized treatments involve radiotherapy/surgery, or when certain drugs require a different route of administration or more intensive hospital visits, ...
Statistical bias
Predictive factors ...
"Patients born in winter have better survival": a strong effect in p53 wild type, not in mutated.
[Figure: duration of survival (years) for winter vs. summer births. p53 wild type: overall logrank test p=0.028, HR=0.66 (CI: 0.45-0.96). p53 mutated: overall logrank test p=0.54, HR=0.91 (CI: 0.66-1.24).]
Based on data from Bonnefoi et al. 2011 Lancet Oncology
Multiplicity / overtesting
• The interpretation of p-values is mathematically correct if you do 1 test.
• Under multiple testing (e.g. subgroup analyses, or many covariates), the interpretation changes.
• Perform K tests, each at the α = 0.05 significance level: the chance of a false positive is 5% ... for each test.
• The overall type I error rate (the risk of finding at least one spurious statistically significant result) is
  α_overall = 1 – (1 – α)^K
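A small computation of this formula shows how fast the overall error rate grows (Python, no external libraries):

```python
def overall_type_i_error(alpha, k):
    """Chance of at least one false positive among k independent tests,
    each performed at significance level alpha: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** k

# Prints 5%, 23%, 40% and 64%: with 20 looks at the data, finding at
# least one spurious "significant" result is more likely than not.
for k in (1, 5, 10, 20):
    risk = overall_type_i_error(0.05, k)
    print(f"{k:2d} tests at alpha=0.05 -> {risk:.0%} risk of a spurious finding")
```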
A difference is only a difference if it makes a difference
• As a result, even when there is no effect overall, one can always find at least one subset with some apparent effect ... it is just a matter of trying long enough.
• And then hindsight bias kicks in: the "I-knew-it-all-along" effect, the inclination to see past events as having been predictable.
• This is why subgroup analyses should be preplanned ... and seldom are.
• Even if preplanned, such analyses are still open to the criticism of multiplicity.
“Popes live longer than artists”
Carrieri MP, Serraino D. Longevity of popes and artists between the 13th and the 19th century. Int J Epidemiol 2005;34:1435–36.
Hanley JA, Carrieri MP, Serraino D. Statistical fallibility and the longevity of popes: William Farr meets Wilhelm Lexis. Int J Epidemiol 2006;35:802-805.
“Women with amenorrhea and ER- breast cancer had improved DFS”
Swain SM, Jeong JH, Geyer CE Jr, et al. (2010) Longer therapy, iatrogenic amenorrhea, and survival in early breast cancer. N Engl J Med 362:2053–2065.
• Randomized phase III clinical trial in premenopausal women with lymph node–positive breast cancer. (Data from IBCSG 13-93 trial)
• A woman was classified as amenorrheic if there was at least one report of no menses during the first five follow-up visits (15 to 18 months) after random assignment
[Figure: DFS curves by amenorrhea status, shown separately for ER+ and ER- patients]
Improper subgroups: subgroups of patients classified by an event measured after randomization and potentially affected by treatment.
This is a frequently made mistake!
• Examples: survival analysis by response, by compliance or amount of treatment received, by severity of side effects
• Lead time bias: "responders" are selected to be patients that survived long enough to achieve a response.
SOLUTION: landmark analysis. Subgroup classification is assessed at the end of a fixed time period, and survival after that time period is compared (patients with an event before the landmark are excluded). Other solutions besides a landmark analysis that you might encounter in journals: dynamic prediction, time-dependent covariates.
Statistical bias: Improper subgroups
Giobbie-Hurder A, Gelber RD, Regan MM (2013). Challenges of Guarantee-Time Bias. JCO 31:2963-2969.
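Guarantee-time bias and the landmark fix can be demonstrated directly (Python, standard library only; the exponential survival model and the response mechanism are invented for illustration). "Response" has no effect on survival, yet the naive responder vs. non-responder comparison shows a large advantage that a landmark analysis removes:

```python
import random
import statistics

rng = random.Random(11)

def simulate(n=20_000):
    """Survival is unrelated to 'response', but a response can only be
    observed if the patient survives to the (random) response time."""
    patients = []
    for _ in range(n):
        survival = rng.expovariate(1 / 12.0)   # months, mean 12
        would_respond = rng.random() < 0.4
        response_time = rng.uniform(0, 6)       # months
        responder = would_respond and survival > response_time
        patients.append((survival, responder))
    return patients

def mean_survival(patients, responder, landmark=0.0):
    """Mean survival per group, excluding events before the landmark."""
    return statistics.fmean(s for s, r in patients
                            if r == responder and s > landmark)

patients = simulate()
# Naive comparison: responders "live longer" (guarantee-time bias):
print(mean_survival(patients, True) - mean_survival(patients, False))
# Landmark analysis at 6 months: condition both groups on being alive
# at the landmark, and the spurious advantage essentially disappears:
print(mean_survival(patients, True, landmark=6)
      - mean_survival(patients, False, landmark=6))
```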
Landmark analysis
"Women with amenorrhea and ER- breast cancer had improved DFS"
[Figure: landmark DFS curves by amenorrhea status, shown separately for ER+ and ER- patients]
Reporting bias
The 'success reporting' chart
• Is your primary endpoint successful?
  • Yes: don't touch the data! Publish ASAP.
  • No: are any other endpoints successful?
    • Yes: declare no effect, but report the secondary finding; suggest a success on the secondary endpoint.
    • No: declare that no effect was found; write it up as 'inconclusive', find a subgroup, ...
Vera-Badillo et al. Ann Oncol 2013: 59% of 92 trials with a negative primary endpoint used secondary endpoints to suggest a benefit of the experimental therapy.
Towards eradication of publication bias: the evolving landscape of public disclosure
• 2000-1: FDA/NIH launches clinicaltrials.gov; CONSORT statement
• 2004-5: ICMJE mandates registration of CTs; WHO calls for prospective registration of CTs
• 2007-9: more legal obligation to register; FDAAA results posting (for drugs); ICMJE editorial decisions not based on results (P)
• 2010-11: EudraCT protocol disclosure; ICMJE clarifies "conflict of interest"
• 2012-14: EudraCT full protocol disclosure & results disclosure; EU CT regulation: lay public summary, EMA database
• 2015-16: EMA disclosure of results before product approval; FDA disclosure of results before product approval; ICMJE statement about sharing data
Blanket terms in clinical trials
• Due to space limitations, catch-all phrases or terms are used to describe (slightly) different processes.
• E.g.: neoadjuvant chemotherapy or primary surgery in advanced ovarian cancer. Vergote et al. NEJM 2010.
In the paper vs. in the protocol:
• "advanced ovarian cancer" = 1.5 pages of eligibility criteria
• "neo-adjuvant chemotherapy" = 2 pages of schedule, doses, modifications, guidelines
• "primary surgery" = 0.5 page + 3 pages of appendix guidelines
2 breast cancer trials
[Figures: Pagani et al. NEJM 2014; Bonnefoi et al. Lancet Oncology 2011]
Relative treatment effect summary statistics are useful ... but they don't tell the whole story.
Statistical significance ≠ clinical significance!
A statistical design should be adequately powered for a clinically significant effect.
No matter how small the true difference between treatments, one can design a trial with a high chance (e.g. power > 95%) of achieving p < 0.05, simply by increasing N.
An "overpowered" trial will give you a high chance of finding a statistically significant result for a difference that is not clinically meaningful.
Look at all of:
• The observed effect size (hazard ratio, difference in 5-year OS/PFS): is it clinically meaningful?
• The associated p-value: is it statistically meaningful?
• The confidence interval
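The "overpowered trial" point can be checked numerically. The sketch below (Python, standard library only; a z-test on a clinically trivial difference of 0.02 SD, with the observed difference set to its expected value for illustration) shows the p-value collapsing as N grows, with no change in clinical meaning:

```python
import math

def z_test_p(observed_diff, sd, n_per_arm):
    """Two-sided p-value of a two-sample z-test for a difference in means."""
    se = sd * math.sqrt(2 / n_per_arm)
    z = observed_diff / se
    return math.erfc(abs(z) / math.sqrt(2))

# A clinically trivial difference of 0.02 SD becomes "highly
# significant" once the trial is large enough:
for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,} per arm  p={z_test_p(0.02, 1.0, n):.2g}")
```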
Very important to be aware of potential biases
• Develop your own dose of critical thinking when reading study results.
• Don't always believe what is claimed!
  ... even if it is published in a major journal
  ... or told by a key opinion leader
• Reach a level of familiarity with the statistics frequently used in your field, to reduce your vulnerability to misinterpretation. Find access to a statistician when you need one.
Acknowledgment
Stats colleagues at the EORTC, in particular Laurence Collette
Thank you!