Design and analysis of sequential clinical trials using a Markov
chain transition rate model with conditional power
by
GREGORY RUSSELL POND
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy,
Department of Public Health Sciences,
University of Toronto
© Copyright by Gregory Russell Pond (2008)
Design and analysis of sequential clinical trials using a Markov chain transition rate model with
conditional power, Gregory Russell Pond, Department of Public Health Sciences, University of
Toronto, Doctor of Philosophy, 2008
Abstract
Background:
There is a plethora of potential statistical designs which can be used to evaluate the
efficacy of a novel cancer treatment in the phase II clinical trial setting. Unfortunately,
there is no consensus as to which design one should prefer, nor even which definition of
efficacy should be used, and the primary endpoint conclusion can vary depending on which
design is chosen. It would be useful if an all-encompassing methodology were available
which could evaluate all the different designs simultaneously and allow investigators an
understanding of the trial results under the varying scenarios.
Methods:
Finite Markov chain imbedding is a method which can be applied to phase II oncology
clinical trials but has never previously been evaluated in this setting. Simple variations
to the transition matrix or to the end-state probability definitions allow multiple designs
and endpoints to be evaluated for a single trial. A computer program is written in R
which computes p-values and conditional power, two common statistical measures used
for evaluating trial results. A simulation study is performed using data arising from an
actual phase II clinical trial, performed recently, in which the study conclusion regarding
the efficacy of the potential treatment was debatable.
Results:
Finite Markov chain imbedding is shown to be useful for evaluating phase II oncology
clinical trial results. The R code written for the simulation study is demonstrated to be
fast and useful for investigating different trial designs. Further details regarding the
clinical trial results are presented, including the potential prolongation of stable disease
by the treatment, which is a potentially useful marker of efficacy for this cytostatic agent.
Conclusions:
This novel methodology may prove to be a useful investigative technique for the
evaluation of phase II oncology clinical trial data. Future studies which have disputable
conclusions might become less controversial with the aid of finite Markov chain imbed-
ding and the possible multiple evaluations which are now viable. Better understanding
of activity for a given treatment might expedite the drug development process or help
distinguish active from inactive treatments.
Acknowledgement
Completion of any achievement is hollow without incentive. For me that incentive is my
family.
Beverly, you have stuck by me through every adversity, energised me when I was
tired, guided me when I was lost, supported my decisions and endured many tribulations
as I pursued my degree. I can never express my gratitude for your support or my infinite
love for you.
Connor and Carter, I cannot express the joy and pride you bring to my life. I
can already see the remarkably caring, thoughtful and outstanding young men you are
becoming.
I also wish to thank my parents for all your encouragement, love and guidance
throughout the years; my parents-in-law, Gloria and Wilson, for your kindness, gen-
erosity and willingness to help; Dr. Lillian Siu, a truly amazing oncologist and researcher
and wonderful colleague for the opportunities you have given me; and Dr. Wendy Lou,
for your guidance, suggestions and support.
Finally, to all my friends, family and colleagues over the years, without whom I would
not be where I am today, I thank you.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
1 Introduction 1
2 Statistical Issues 5
2.1 Phase II and III Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Methods for Adjusting the Type I Error (α) In the Situation of Multiple-
Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Bonferroni . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Early Group-Sequential Methods . . . . . . . . . . . . . . . . . . 8
2.2.3 Alpha-Spending Function . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Repeated Confidence Intervals . . . . . . . . . . . . . . . . . . . . 10
2.2.5 Bayesian Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.6 Using a Risk/Loss Function . . . . . . . . . . . . . . . . . . . . . 13
2.2.7 Stochastic Curtailment and Conditional Probability . . . . . . . . 13
2.3 Sample Size Re-adjustment [52] [53] [54] [55] . . . . . . . . . . . . . . . . 14
2.3.1 Variance Spending Approach . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Fisher Combination Test . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Conditional Power . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Combining Data from Different Analyses . . . . . . . . . . . . . . . . . . 20
2.4.1 Continuous and Categorical Outcomes . . . . . . . . . . . . . . . 20
2.4.2 Survival-Type Outcomes . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Properties of Markov Chains . . . . . . . . . . . . . . . . . . . . . 23
2.5.3 Markov Chains as a Model for Cancer Phase II Clinical Trials . . 25
3 Potential Trial Designs for Phase II Oncology Clinical Trials 29
3.1 Design Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Phase II Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Univariate Designs With Response as the Outcome . . . . . . . . 32
3.3 Multinomial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Zee Design [11] [82] . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.2 Trinomial Design [12] . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.3 Dual-Response Design [13] . . . . . . . . . . . . . . . . . . . . . . 51
3.3.4 Weighted Response Design [14] . . . . . . . . . . . . . . . . . . . 54
3.4 Using Finite Markov Chain Imbedding . . . . . . . . . . . . . . . . . . . 59
4 Examples and Simulation Set-up 61
4.1 Phase II Clinical Trial of CCI-779 (temsirolimus) in Neuroendocrine Car-
cinoma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.1 Trial Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.2 Trial Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.3 Note regarding response rates . . . . . . . . . . . . . . . . . . . . 64
4.2 Implementation of Markov Chain Methods . . . . . . . . . . . . . . . . . 65
4.2.1 RECIST criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.1 Estimating H0 and HA . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Models Investigated in Simulation . . . . . . . . . . . . . . . . . . . . . . 70
4.4.1 RECIST model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.2 RECIST model evaluating outcomes at different transition times . 71
4.4.3 Transition Matrices Based on Immediate Changes . . . . . . . . . 72
4.4.4 Transition Matrices with Different Positive Outcomes . . . . . . . 73
4.4.5 Multi-binomial transition matrices . . . . . . . . . . . . . . . . . 75
4.5 Calculation of p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Methods Used for Investigating Different Outcomes . . . . . . . . . . . . 77
5 Results 80
5.1 RECIST Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1.1 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1.3 Transition Time-Important RECIST model . . . . . . . . . . . . . 84
5.1.4 Varying away from the RECIST criteria . . . . . . . . . . . . . . 87
5.1.5 Immediate response . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.1.6 Consecutive states . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.1.7 Dual-Binomial Outcomes . . . . . . . . . . . . . . . . . . . . . . . 90
5.1.8 Theoretical Versus Simulated Calculations . . . . . . . . . . . . . 91
6 Discussion 93
A Data 98
B State Spaces 100
C Computer Code 102
D Results 108
List of Tables
1 Table of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
2.1 Repeated significance testing on accumulating data, taken from [20] . . . 6
3.1 Potential Phase II Designs Using Response . . . . . . . . . . . . . . . . . 31
3.2 Potential Phase II Designs Using Response & Stable Disease . . . . . . . 32
3.3 Acceptance Region for Hypothetical Trial using Lin and Chen Design [14],
comparing H0 : RR = 0.05 and SD = 0.25 versus HA : RR = 0.15 and
SD = 0.50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.1 Data, in mm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
B.1 Data, in State Spaces According to RECIST Criteria . . . . . . . . . . . 101
D.1 Data input for matrix (2.9) modelling the RECIST criteria . . . . . . . . 108
D.2 Endstate probabilities for (2.9) modelling the RECIST criteria . . . . . . 108
D.3 Outcomes for (2.9) modelling the RECIST criteria and n=36 patients . . 109
D.4 Outcomes for (2.9) modelling the RECIST criteria and n=54 patients . . 110
D.5 Data input for matrix (4.1) modelling the transition-time dependent RE-
CIST criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
D.6 Endstate probabilities for (4.1) modelling the transition-time dependent
RECIST criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
D.7 Outcomes for (4.1) modelling the transition-time dependent RECIST cri-
teria with n=36 patients . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
D.8 Outcomes for (4.1) modelling the transition-time dependent RECIST cri-
teria with n=54 patients . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
D.9 Modified data input (2), slightly better expectations under H0, for matrix
(4.1) modelling the transition-time dependent RECIST criteria . . . . . . 113
D.10 Endstate probabilities for modified data input (2), slightly better expecta-
tions under H0, for (4.1) modelling the transition-time dependent RECIST
criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
D.11 Outcomes for modified data input (2), slightly better expectations under
H0, for (4.1) modelling the transition-time dependent RECIST criteria
with n=36 patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
D.12 Outcomes for modified data input (2), slightly better expectations under
H0, for matrix (4.1) modelling the transition-time dependent RECIST
criteria with n=54 patients . . . . . . . . . . . . . . . . . . . . . . . . . . 115
D.13 Modified data input (3), extremely better expectations under H0, for ma-
trix (4.1) modelling the transition-time dependent RECIST criteria . . . 115
D.14 Endstate Probabilities for Modified Data Input (3), extremely better ex-
pectations under H0, for (4.1) modelling the transition-time dependent
RECIST criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
D.15 Outcomes for modified data input (3), extremely better expectations under
H0, for (4.1) modelling the transition-time dependent RECIST criteria
with n=36 patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
D.16 Outcomes for modified data input (3), extremely better expectations under
H0, for matrix (4.1) modelling the transition-time dependent RECIST
criteria with n=54 patients . . . . . . . . . . . . . . . . . . . . . . . . . . 117
D.17 Modified data input (4), hypothesising a cytotoxic treatment with im-
proved immediate response but no durability, for matrix (4.1) modelling
the transition-time dependent RECIST criteria . . . . . . . . . . . . . . . 118
D.18 Endstate probabilities for modified data input (4), hypothesising a cyto-
toxic treatment with improved immediate response but no durability, for
(4.1) modelling the transition-time dependent RECIST criteria . . . . . . 118
D.19 Outcomes for modified data input (4), hypothesising a cytotoxic treatment
with improved immediate response but no durability, for (4.1) modelling
the transition-time dependent RECIST criteria with n=36 patients . . . 119
D.20 Outcomes for modified data input (4), hypothesising a cytotoxic treatment
with improved immediate response but no durability, for matrix (4.1) mod-
elling the transition-time dependent RECIST criteria with n=54 patients 120
D.21 Modified data input (5), an extreme optimist, for matrix (4.1) modelling
the transition-time dependent RECIST criteria . . . . . . . . . . . . . . . 121
D.22 Endstate probabilities for modified data input (5), an extreme optimist,
for matrix (4.1) modelling the transition-time dependent RECIST criteria 121
D.23 Outcomes for modified data input (5), an extreme optimist, for matrix
(4.1) modelling the transition-time dependent RECIST criteria with n=36
patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
D.24 Outcomes for modified data input (5), an extreme optimist, for matrix
(4.1) modelling the transition-time dependent RECIST criteria with n=54
patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
D.25 Data input (6), an additional transition, for matrix (4.1) modelling the
transition-time dependent RECIST criteria . . . . . . . . . . . . . . . . . 124
D.26 Endstate probabilities (6), an additional transition, for matrix (4.1) mod-
elling the transition-time dependent RECIST criteria . . . . . . . . . . . 124
D.27 Outcomes (6), an additional transition, for matrix (4.1) modelling the
transition-time dependent RECIST criteria with n=36 patients . . . . . . 125
D.28 Outcomes (6), an additional transition, for matrix (4.1) modelling the
transition-time dependent RECIST criteria with n=54 patients . . . . . . 126
D.29 Endstate Probabilities for matrix (4.1) modelling the transition-time de-
pendent RECIST criteria with only 3 transitions . . . . . . . . . . . . . . 126
D.30 Outcomes for matrix (4.1) modelling the transition-time dependent RE-
CIST criteria with 3 transitions and n=36 patients . . . . . . . . . . . . 127
D.31 Outcomes for matrix (4.1) modelling the transition-time dependent RE-
CIST criteria with 3 transitions and n=54 patients . . . . . . . . . . . . 128
D.32 Data input for matrix (4.2) modelling the transition-time dependent RE-
CIST criteria with response not an absorbing state . . . . . . . . . . . . 129
D.33 Endstate probabilities for matrix (4.2) modelling the transition-time de-
pendent RECIST criteria with response not an absorbing state . . . . . . 130
D.34 Outcomes for matrix (4.2) modelling the transition-time dependent RE-
CIST criteria with response not an absorbing state and n=36 patients . . 131
D.35 Outcomes for matrix (4.2) modelling the transition-time dependent RE-
CIST criteria with response not an absorbing state and n=54 patients . . 132
D.36 Data input for matrix (4.3) modelling the change in response (10%) at
each transition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
D.37 Endstate probabilities for matrix (4.3) modelling the change in response
(10%) at each transition . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
D.38 Outcomes for matrix (4.3) modelling the change in response (10%) at each
transition and n=36 patients . . . . . . . . . . . . . . . . . . . . . . . . . 135
D.39 Outcomes for matrix (4.3) modelling the change in response (10%) at each
transition and n=54 patients . . . . . . . . . . . . . . . . . . . . . . . . . 136
D.40 Data input for matrix (4.3) modelling the change in response (5%) at each
transition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
D.41 Endstate probabilities for matrix (4.3) modelling the change in response
(5%) at each transition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
D.42 Outcomes for matrix (4.3) modelling the change in response (5%) at each
transition and n=36 patients . . . . . . . . . . . . . . . . . . . . . . . . . 139
D.43 Outcomes for matrix (4.3) modelling the change in response (5%) at each
transition and n=54 patients . . . . . . . . . . . . . . . . . . . . . . . . . 140
D.44 Data input for matrix (4.4) modelling the change in response, with no
stable disease, at each transition . . . . . . . . . . . . . . . . . . . . . . . 140
D.45 Endstate probabilities for matrix (4.4) modelling the change in response,
with no stable disease, at each transition . . . . . . . . . . . . . . . . . . 141
D.46 Outcomes for matrix (4.4) modelling the change in response, with no stable
disease, at each transition and n=36 patients . . . . . . . . . . . . . . . . 141
D.47 Outcomes for matrix (4.4) modelling the change in response, with no stable
disease, at each transition and n=54 patients . . . . . . . . . . . . . . . . 142
D.48 Data input for matrix (4.5) modelling response+3 consecutive stable dis-
ease observations as a good outcome . . . . . . . . . . . . . . . . . . . . 143
D.49 Endstate probabilities for matrix (4.5) modelling response+3 consecutive
stable disease observations as a good outcome . . . . . . . . . . . . . . . 144
D.50 Outcomes for matrix (4.5) modelling response+3 consecutive stable disease
observations as a good outcome and n=36 patients . . . . . . . . . . . . 144
D.51 Outcomes for matrix (4.5) modelling response+3 consecutive stable disease
observations as a good outcome and n=54 patients . . . . . . . . . . . . 145
D.52 Data input for matrix (4.6) modelling response+4 consecutive stable dis-
ease observations as a good outcome . . . . . . . . . . . . . . . . . . . . 146
D.53 Endstate probabilities for matrix (4.6) modelling response+4 consecutive
stable disease observations as a good outcome . . . . . . . . . . . . . . . 147
D.54 Outcomes for matrix (4.6) modelling response+4 consecutive stable disease
observations as a good outcome and n=36 patients . . . . . . . . . . . . 147
D.55 Outcomes for matrix (4.6) modelling response+4 consecutive stable disease
observations as a good outcome and n=54 patients . . . . . . . . . . . . 148
D.56 Data input for matrix (4.7) modelling response+consecutive minor re-
sponses as a good outcome . . . . . . . . . . . . . . . . . . . . . . . . . . 149
D.57 Endstate probabilities for matrix (4.7) modelling response+consecutive
minor responses as a good outcome . . . . . . . . . . . . . . . . . . . . . 150
D.58 Outcomes for matrix (4.7) modelling response+consecutive minor responses
as a good outcome and n=36 patients . . . . . . . . . . . . . . . . . . . . 151
D.59 Outcomes for matrix (4.7) modelling response+consecutive minor responses
as a good outcome and n=54 patients . . . . . . . . . . . . . . . . . . . . 152
D.60 Data input for matrix (4.8) modelling response & toxicity outcomes . . 152
D.61 Endstate probabilities for matrix (4.8) modelling response & toxicity out-
comes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
D.62 Outcomes for matrix (4.8) modelling response & toxicity outcomes and
n=36 patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
D.63 Outcomes for matrix (4.8) modelling response & toxicity outcomes and
n=54 patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
List of Figures
3.1 Potential Distribution for Standard Treatment Response Rate . . . . . . 43
3.2 Tumour Shrinkage and Growth for Three Hypothetical Patients Over Time 47
3.3 Decision Process for Zee [82] multinomial design. Figure A is the decision
rule after stage 1 and Figure B is the decision rule after stage 2 . . . . . 50
Abbreviation Meaning
C Censored
CI Confidence Interval
CR Complete Response
DSMB Data Safety Monitoring Board
MTA Molecularly Targeted Agent
PD Progressive Disease
PR Partial Response
RECIST Response Evaluation Criteria In Solid Tumours
RR Response Rate
SD Stable Disease
UR Unconfirmed Response
Table 1: Table of Abbreviations
Chapter 1
Introduction
The National Cancer Institute of Canada enrolled 6626 patients to clinical trials, at a
cost of $64 million, in 2004-05, which is a mean of over $10,000 per patient [1]. In the
United States, the National Cancer Institute is planning a budget of nearly $5.8 billion
for 2008 and supports over 1300 clinical trials a year, treating over 200,000 patients [2].
In addition to the high financial cost of cancer clinical trials, there is an even greater
human cost. Patients who are eligible for clinical trials are generally patients who have
no other options, having failed all standard therapies or having a disease for which no
standard therapy exists. These patients often enter clinical trials as their last hope and
are at the end of their life. Further, the types of treatments studied in cancer clinical
trials can be quite toxic with life-threatening adverse events.
Despite the large numbers of patients accrued to clinical trials, this is only a small
fraction of patients diagnosed with cancer, which numbers around 150,000 yearly in
Canada [3]. Canadians have a lifetime probability of developing cancer of around 44% and
38%, with a lifetime probability of dying of cancer of 28% and 23% in males and females,
respectively [3]. Accrual to clinical trials remains difficult [4], especially amongst minority
and disadvantaged groups [5]. It is necessary that data from patients accrued to cancer
clinical trials be optimally used and clinical trial designs need to be constructed such
that data is used as efficiently as possible. Efficient designs save valuable resources and
funds from being wasted, save patients from receiving toxic and possibly non-efficacious
treatments needlessly, and speed up the drug development process, allowing efficacious
treatments to be available for all cancer patients more promptly.
In the drug development paradigm, there are 4 main clinical trial stages [6]. Phase I
is dose-finding, with the ultimate goal being to determine the optimal dose, which is the
highest dose that has acceptable levels of toxicity. Phase II trials are preliminary inves-
tigations of treatment efficacy in which the ultimate goal is to weed out non-efficacious
treatments and allow treatments with some potential activity to continue on to more
definitive study. This occurs in phase III, when the experimental treatment is studied
in a randomized clinical trial compared with the present standard of care. Phase IV is
the post-marketing stage, which further evaluates the long-term safety and effectiveness
of a new treatment, including determining whether there are subgroups which benefit
from treatment, or alternate doses which are superior.
Phase II cancer clinical trials tend to be single-arm, open-label studies with a small
number of patients, usually no more than 50. As noted above, phase II trials serve
primarily to discriminate between treatments with some potential activity, and those
which are ineffective and do not prevent tumour growth. Since it is unethical to accrue
patients to a treatment which is ineffective, especially if it has significant toxicity, a
number of clinical trial designs have been proposed which attempt to optimize these
trials [7] [8] [9] [10] [11] [12] [13] [14] [15] [16].
There is no consensus as to which design should be used, and the final choice is often
subjective, based on the personal preference of the statistician or the principal
investigator. Yet the final statistical decision as to whether a treatment is worthy of
further study, or ineffective, may differ significantly depending on which design is used.
Further complicating this decision is the choice of which primary outcome measure to
use. The most
frequently used primary outcome is response rate, with definitions for solid tumours of-
ten following the response evaluation criteria in solid tumours, more frequently referred
to as the RECIST criteria [17]. Prior to RECIST, the World Health Organization [18]
criteria were commonly used. Traditionally, the treatments investigated were cytotoxic
and would be deemed effective only if they could shrink the tumour, hence causing
an objective response. Recently, however, molecularly targeted agents [MTA], which are
cytostatic, are more frequently studied. These agents may work by preventing tumour
growth, not necessarily by shrinking it. Non-progression may then be an indicator of
treatment efficacy. Other outcomes might also be indicators of treatment
efficacy, such as prolongation of overall survival, time to progression, or some
multivariate outcome which combines response, toxicity, survival or other outcomes.
Results of a trial may therefore differ from what was expected, and as a result designs
are often amended or disregarded during the trial, even by experienced investigators.
An additional problem with traditional trial designs is that only a single primary
outcome measurement is used for each patient, although there are often multiple mea-
surements of outcomes such as tumour response. Tumour size is often measured at the
end of each treatment cycle, or every second treatment cycle, although in the end, ac-
cording to most response definitions including RECIST, only the best observed response
is used. Despite the desire to use data efficiently, much data is discarded.
Instead of forcing investigators to choose a single design based on a single outcome,
which may in the end not effectively demonstrate treatment efficacy, an alternative idea
is to explore a range of possible designs and a range of possible outcomes. In this
dissertation, finite Markov chain imbedding models are used to explore data from a
single trial under a range of possible design questions and outcomes of interest. Using
Markov chain methodology allows incorporation of multiple outcome measures for each
patient, such as tumour response status at each evaluation.
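As a concrete illustration of this idea, consider a hypothetical three-state chain (response, stable disease, progression) with invented transition probabilities; the sketch below (written in Python for illustration, whereas the software developed in this dissertation is in R) propagates a starting distribution through several evaluation times to obtain end-state probabilities:

```python
# Hypothetical three-state Markov chain, for illustration only:
# state 0 = response, 1 = stable disease (SD), 2 = progression.
# Response and progression are treated here as absorbing states.
P = [
    [1.00, 0.00, 0.00],  # response stays response
    [0.05, 0.75, 0.20],  # SD -> response / SD / progression
    [0.00, 0.00, 1.00],  # progression stays progression
]

def step(dist, P):
    """One transition: left-multiply the state distribution by P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def end_state_distribution(start, P, n_transitions):
    """Distribution over states after a fixed number of evaluations."""
    dist = list(start)
    for _ in range(n_transitions):
        dist = step(dist, P)
    return dist

# A patient starting in stable disease, evaluated over four cycles:
dist = end_state_distribution([0.0, 1.0, 0.0], P, 4)
```

Varying the matrix P, or the definition of which end states count as a "good" outcome, then corresponds to evaluating different designs and endpoints for the same trial data, which is the approach pursued in later chapters.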
The rest of the thesis is organised as follows. Chapter 2 reviews statistical
methodologies, including finite Markov chain imbedding, which are common in phase II
and III clinical trials or pertinent to this dissertation; Chapter 3 reviews the most
frequently used phase II designs. At the end of Chapter 3, the use of finite Markov
chain imbedding for analysing phase II oncology clinical trials is proposed, along with
the rationale for this approach. An actual clinical trial example is presented in
Chapter 4, together with the implementation of the finite Markov chain imbedding
methods and a description of the simulation analysis performed. Results of the
simulation are presented in Chapter 5. Finally, Chapter 6 summarises the dissertation,
adds some conclusions and discusses areas of future work.
Chapter 2
Statistical Issues
One way researchers have attempted to improve clinical trial efficiency is by using
group-sequential designs, in which interim analyses are performed and a trial is stopped
as soon as the conclusion is definitive. In cancer, this has an increased
ethical importance as it limits the number of patients exposed to an inactive and possibly
toxic treatment, while ensuring personnel and financial resources are not wasted [19].
Additionally, if efficacy is found earlier, the drug development process can be sped up with
quicker approval by regulatory agencies, resulting in more patients being treated with an
active agent. Group-sequential designs are now implemented routinely in most clinical
trials; however, the nature of the design can vary substantially, particularly depending
on the type and phase of the trial. Phase II trials tend to have a single interim analysis,
while phase III trials tend to have many.
Statistically, the effect of performing interim analyses can be quite pronounced
[20] and can give spurious results [21]. Each additional look at the data increases the
probability of falsely rejecting the null hypothesis (H0). An example is shown in Table 2.1,
which shows the false-positive rate when performing repeated significance tests on
accumulating data, each at the α = 0.05 level of significance, using a two-sample t-test.
The false-positive probability increases to 1 as the number of looks at the data increases
to ∞ [20].
Thus, when performing a clinical trial which ethically requires interim analyses, one must
adjust the statistical error rates to account for these extra looks at the data.
No. of tests K    Overall null probability of rejecting H0
      1                           0.05
      2                           0.08
      3                           0.11
      4                           0.13
      5                           0.14
     10                           0.19
     20                           0.25
     50                           0.32
      ∞                           1.00
Table 2.1: Repeated significance testing on accumulating data, taken from [20]
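The inflation shown in Table 2.1 is easy to reproduce by simulation. The sketch below is illustrative only (Python rather than the R used in this dissertation; the group size is arbitrary and a known-variance z-test stands in for the t-test as a simplification): accumulating two-sample data are tested under the null after each accrued group, and the proportion of simulated trials in which any look rejects is recorded.

```python
import math
import random

def simulate_repeated_testing(n_looks, group_size=20, n_sims=10000, seed=1):
    """Estimate the overall null probability of rejecting H0 when a
    two-sided two-sample z-test (alpha = 0.05 per look, known unit
    variance) is repeated on accumulating data after each group."""
    z_crit = 1.96
    random.seed(seed)
    rejections = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        n_per_arm = 0
        for _ in range(n_looks):
            # accrue one more group of patients per arm, under H0
            for _ in range(group_size):
                sum_a += random.gauss(0.0, 1.0)
                sum_b += random.gauss(0.0, 1.0)
            n_per_arm += group_size
            z = (sum_a - sum_b) / n_per_arm / math.sqrt(2.0 / n_per_arm)
            if abs(z) > z_crit:
                rejections += 1
                break
    return rejections / n_sims
```

With five looks the estimated false-positive rate lands near the 0.14 of Table 2.1, compared with roughly 0.05 for a single analysis.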
It is important to review the statistical literature for existing group-sequential tech-
niques prior to investigating novel methodologies or modifications of existing methods.
Many methods have been proposed, both for phase II and phase III clinical trials and a
review of these methods is presented in the next section.
2.1 Phase II and III Clinical Trials
Before looking at the different group-sequential methods, it is important to note some
differences between phase II and phase III designs. In phase III clinical trials, there is a
comparison between two treatment arms, whereas in phase II trials, there is only a single
treatment arm. The second main difference is that in phase III trials, the experimental
treatment is generally hoped to be superior to, or in the case of non-inferior testing,
about as good as, standard, or the control arm, so interim analyses are often based
on the premise of stopping the trial early for superiority. Phase II trials are more
frequently stopped early for inferiority, owing to the expectation that most treatments
will not be better than the standard of care. Even in the rare situation that a treatment is
superior, one will generally want to accrue additional patients to increase familiarity with
the treatment prior to the expensive, phase III trial. This difference will be important
when deciding on which statistical method is appropriate for use in clinical trial design.
2.2 Methods for Adjusting the Type I Error (α) In
the Situation of Multiple-Testing
2.2.1 Bonferroni
The simplest, and probably best known, statistical adjustment for multiple testing is
the Bonferroni adjustment [22]. Here, the type I error (α) is divided by the number of
looks at the data which will be performed. At each analysis, only data from patients
accrued since the previous test are included, and a test is significant only if its p-value
is less than the Bonferroni-adjusted level of significance. Bonferroni adjustments are well known
to be overly-conservative [23]. Further, the Bonferroni adjustment is correct if each
individual group of data is independent but is inappropriate when the data is correlated,
as is generally the case in clinical trials where patients are accrued group-sequentially.
Thus, one possibility when many analyses are performed is that one might observe a
trend towards significance in the same direction at each analysis, but no test is by itself
sufficiently strong to reject the null hypothesis. However, if one combines each test
together into a single test, one might observe an overwhelmingly significant result, which
is missed when using Bonferroni. Fortunately, newer methods have been proposed which
are more efficient than using a Bonferroni adjustment and allow for one to reject or not
reject the null hypothesis at an interim analysis.
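The mechanics of the adjustment are a one-liner. The sketch below (Python; the look count and p-values are hypothetical) also illustrates the criticism above: a consistent trend across looks can be missed because no single group is extreme enough.

```python
# Bonferroni adjustment: with k planned looks, each look is judged
# against alpha / k rather than alpha.
alpha = 0.05
k = 5                          # number of planned looks (hypothetical)
bonferroni_level = alpha / k   # 0.01

# p-values from the k independent groups of patients (hypothetical):
# every look trends the same way, yet none falls below alpha / k.
p_values = [0.04, 0.03, 0.06, 0.02, 0.05]
significant_looks = [p < bonferroni_level for p in p_values]

print(bonferroni_level, any(significant_looks))
```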
2.2.2 Early Group-Sequential Methods
The first sequential methods, the sequential probability ratio test [24] and the triangular
test [25], were developed during the Second World War, but it was not until the late 1970s
that practical methods were developed. These more practical methods were developed
because of the ability to calculate conditional error rates by using numerical integration
[26]. Since the amount of α error spent at any one test is conditional upon previous
results (i.e. was a previous test significant, thus leading to a termination of the trial, or
not), one must account for previous results when determining total error spent at any
given test. This is possible through numerical integration and one adds up the total α
error spent at each test to obtain the final level of significance.
Two well-known approaches were first developed using numerical integration in the
late 1970’s, whereby patients were accrued in k equally-spaced groups and an interim
analysis performed after each group was accrued. The Pocock [27] and O’Brien-Fleming
[28] methods describe a ’family’ of boundaries which incorporate all previously observed
data at each interim analysis and used numerical integration to ensure the total α used
at the end of the trial is within pre-defined limits. For the first time, when the 2nd (or
later) interim analysis was performed, data from patients accrued between the 1st and
2nd analyses in addition to data from patients accrued prior to the 1st interim analysis
were combined and analysed together. The difference between the two methods was how
the α was used at each analysis. The O’Brien-Fleming design is based on boundaries
which are constant on the test statistic scale as a function of time, i.e. one would reject
at the ith analysis if Zi ≥ Z∗/√(ni/N), where ni is the number of patients at analysis i
of the total N patients in the trial, Z∗ is a constant chosen so that the overall test has
size α, and Zi is the test statistic of interest (based on the
normal distribution). In contrast, the Pocock design gives boundaries which are constant
in terms of the α as a function of time, i.e. one would reject a test Zi if Zi > c at any
analysis i = 1, 2, 3, . . . , k, where c is a constant such that the overall test has size α. The
specific boundary used for a particular analysis from within the ’family’ of boundaries,
for both the Pocock and O’Brien-Fleming methods, depends on the number of interim
looks at the data. The O’Brien-Fleming approach required much stricter requirements
for stopping early on in a trial compared to the Pocock method, however, the boundary
point at the end of the study is much closer to the unadjusted boundary point. In
contrast, the Pocock approach would stop a trial earlier [29].
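The contrast between the two boundary shapes can be sketched numerically. The constants below are approximate two-sided values for five equally spaced looks (the exact constants come from the numerical integration described above); the point is the shape, not the digits.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
k = 5                                          # equally spaced analyses
z_fixed = NormalDist().inv_cdf(1 - alpha / 2)  # unadjusted boundary, ~1.96

c_pocock = 2.41   # approximate Pocock constant for k = 5 (two-sided)
c_of = 2.04       # approximate O'Brien-Fleming constant for k = 5

for i in range(1, k + 1):
    t = i / k                  # information fraction n_i / N
    b_pocock = c_pocock        # constant on the z scale at every look
    b_of = c_of / sqrt(t)      # constant on the B-value scale: strict early
    print(f"t={t:.1f}  Pocock {b_pocock:.2f}  O'Brien-Fleming {b_of:.2f}")
```

At the first look the O'Brien-Fleming boundary is about 4.56 against Pocock's 2.41, while its final boundary (2.04) sits much nearer the unadjusted 1.96.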
2.2.3 Alpha-Spending Function
The major drawback to these methods is that they require interim analyses to occur
at exact time points (i.e. after every ni patients), which may be difficult to achieve in
practice. A major breakthrough occurred with the development of the alpha-spending
function [30]. This development allowed users to specify a function, or a curve, which
represents the amount of α, or type I error, spent at any information time point during
the trial. By only specifying the α-spending function at the start of the trial, the timing
of interim analyses becomes irrelevant, provided the timing of an interim analysis is
not chosen based on the data already observed. Here, information time is defined as
the information available at the interim analysis as a proportion of the total information
which will be available at the trial conclusion. With continuous or categorical endpoints,
this is simply the number of patients enrolled out of the total expected number. With
survival endpoints, this would be the number of deaths which have been observed out of
the total expected number.
The concept behind the α-spending function is based on assuming a Brownian motion
process, that is, a continuous-time stochastic process. If B(t), 0 ≤ t ≤ 1, is
a standard Brownian motion process, and some horizontal boundary point b(t) = z_{α/2}
is set (for a one-sided test), then by defining τ to be the first time that the process B(t)
crosses beyond the point b(t), a known function describes α∗(t) = pr(τ ≤ t), that being:

α∗(t) = pr(τ ≤ t) = 0 if t = 0, and α∗(t) = 2 − 2Φ(z_{α/2}/√t) if 0 < t ≤ 1,   (2.1)
where Φ is the standard normal cumulative distribution function. Lan and DeMets
conjecture that if a process (not necessarily a Brownian motion process) is discretised,
such that one evaluates a test statistic at times t1, t2, . . . and defines boundary points
b1, b2, . . . so that the cumulative probability of crossing the boundary by time ti is
approximately α∗(ti), then the same optimality properties should hold for the discretised
process. In other words, if one calculates the probability of exceeding a particular value
at a given time t, where time is measured as information time, and a rejection region is
constructed so that this probability approximates the value of α∗(t) then one can spend
α at a rate which enjoys the same optimality properties of the Brownian motion process.
At the first evaluation, the probability of being in the rejection region is simply
Pr(Z > z1) ≈ α∗(t1), where Z is the test statistic of interest; however, at any additional
evaluation, the numerical integration methods of Armitage et al [26] must be used. Any
increasing function with α∗(0) = 0 and α∗(1) = α is acceptable as long as one specifies the curve prior
to the first interim look at the data. It is notable that the function α∗(t), where t is the
proportion of information time, gives a rejection region that closely matches the O’Brien-
Fleming boundary region. For this function, one would reject at any analysis, when the
proportion of information time observed is t, if the test is significant at the α∗(t) level
of significance. A couple of curves are described in the Lan-DeMets paper which closely
resemble the Pocock and O'Brien-Fleming boundaries, demonstrating the utility of this approach.
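The two spending curves can be evaluated directly. A Python sketch of the O'Brien-Fleming-like curve α∗(t) = 2 − 2Φ(z_{α/2}/√t) and the Pocock-like curve α ln(1 + (e − 1)t) given by Lan and DeMets (two-sided α = 0.05):

```python
from math import sqrt, log, e
from statistics import NormalDist

nd = NormalDist()
alpha = 0.05
z = nd.inv_cdf(1 - alpha / 2)          # z_{alpha/2}

def alpha_star_of(t):
    """O'Brien-Fleming-like spending function."""
    return 0.0 if t <= 0 else 2.0 - 2.0 * nd.cdf(z / sqrt(t))

def alpha_star_pocock(t):
    """Pocock-like spending function."""
    return alpha * log(1.0 + (e - 1.0) * t)

# The O'Brien-Fleming-like curve spends almost nothing early on,
# while the Pocock-like curve spends alpha much more evenly.
for t in (0.25, 0.5, 0.75, 1.0):
    print(t, round(alpha_star_of(t), 5), round(alpha_star_pocock(t), 5))
```

Both curves spend the full α = 0.05 at t = 1, as required.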
2.2.4 Repeated Confidence Intervals
Other approaches to evaluating whether one should continue a study beyond an interim
analysis have been proposed. One approach is the use of repeated confidence intervals,
advocated by Jennison and Turnbull [31] [19]. Typically at the end of a study, a confidence
interval of size 1 − α can be computed for the mean using
(x̄ − t_{n−1,α/2} σ/√n, x̄ + t_{n−1,α/2} σ/√n). In the repeated confidence interval approach, the confidence interval can
be computed at each analysis using the same calculations, but with α(t) replacing α,
and α(t) calculated using the same α-spending function as defined by Lan-DeMets. The
repeated confidence interval approach has some useful advantages, such as demonstrating
a range of values which are compatible with the data. Knowing a range of compatible
values is often more informative than a simple yes/no decision. Another distinct advantage
occurs when the primary outcome falls outside a confidence interval, but instead of stop-
ping a trial the data safety monitoring committee or trial investigators decide to continue
to the next stage, overruling the statistical design. Subsequent confidence intervals are
not affected by the previous decision to continue, unlike a hypothesis test type I/II error.
This is important since there are often many reasons why one may not strictly follow
the statistical guidelines. Opponents of this method point out that a 95% confidence
interval at the trial conclusion depends on the number of previous looks, and there are
questions as to which confidence interval should be reported: the naive CI or the adjusted,
slightly conservative one. Additionally, the sample sizes calculated based on the repeated
confidence intervals approach can be prohibitively large.
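A minimal sketch of the idea in Python, with a normal quantile standing in for t_{n−1,α/2} and illustrative α(t) values in place of ones computed from a spending function:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

nd = NormalDist()

def repeated_ci(data, alpha_t):
    """(1 - alpha_t) confidence interval for the mean at an interim look;
    a normal quantile is used in place of t_{n-1, alpha/2} for brevity."""
    n = len(data)
    half = nd.inv_cdf(1 - alpha_t / 2) * stdev(data) / sqrt(n)
    return mean(data) - half, mean(data) + half

# Hypothetical data at an interim look; alpha(1) would return to 0.05.
look1 = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0]
adjusted = repeated_ci(look1, alpha_t=0.01)   # spending-adjusted, early look
naive = repeated_ci(look1, alpha_t=0.05)      # unadjusted 95% interval
print(adjusted, naive)
```

The adjusted interval is wider than the naive one, which is exactly the conservatism critics point to.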
2.2.5 Bayesian Approach
Another method used with some regularity is the use of Bayesian [32] [33], or likelihood
methods [34]. Bayesians look at the problem from a whole different context, believing
that when one is unsure of the true value of a parameter, one can consider the true value
to be random. A mathematical formula attributed to Rev. Thomas Bayes,

P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B),   (2.2)
shows how one can estimate a (posterior) distribution of this random value (the true
parameter value) based on prior belief and the accumulated data using the equation for
conditional probability shown in equation (2.2).
Many Bayesians consider this to be more in agreement with the approach to research
and the way individuals think than traditional methods [35]; that is, one takes one's
existing knowledge, adds to it the additional information gained from a trial, and adjusts
one's beliefs. Modern computing power has allowed users to apply Bayesian methods
(e.g. using programs such as WinBUGS [http://www.mrc-bsu.cam.ac.uk/bugs/]) and
get usable results, and an additional attribute of these methods is that the posterior
distribution is not affected by the number of previous looks at the data, since, according
to the likelihood principle "if two sample points result in the same likelihood function
then they contain the same information about θ" [36]. As a result, inference about the
parameter of interest is based solely on the data at hand, and is unaffected by the number
of previous looks or what one might have done in future "identical" trials or how the
trial itself is designed [37]. A good overview and arguments promoting this principle can
be found in Royall [34] and for Bayesian methods in medicine in Berry and Stangl [38].
Unfortunately, Bayesian decision theory is not straightforward. The most pressing
question is what prior distribution to use [39]. Some advocate using a range of prior
distributions [40], however, many argue that a single trial should not produce a range of
outcomes. Use of a non-informative [41] or a reference prior [42] [43] is another alternative,
unfortunately, every prior contains some information [44] [45]. One possibility is to make
a conclusion only when the posterior distribution shows evidence which are convincing
beyond a reasonable doubt. However, what one considers beyond a reasonable doubt is
arbitrary, may require unreasonably high sample sizes, and finally, as with frequentist
approaches, the probability of rejecting a hypothesis based on a fixed probability value
will increase with increased looks at the data [39]. Thus, as with Table 2.1, one
will reject a null hypothesis with probability one if one looks at the data infinitely many
times.
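The conjugate Beta-binomial update at the heart of many Bayesian phase II designs can be sketched in a few lines (Python; the prior, the interim data, and the 20% threshold are all hypothetical):

```python
import random

random.seed(1)

# Prior belief about the response rate: Beta(a, b); a = b = 1 is flat.
# The choice of prior is exactly the contested point discussed above.
a, b = 1.0, 1.0

# Hypothetical interim data: 9 responses among 30 patients.
responses, n = 9, 30

# Conjugate update: posterior is Beta(a + responses, b + non-responses).
post_a = a + responses
post_b = b + (n - responses)
post_mean = post_a / (post_a + post_b)

# Posterior probability that the true rate exceeds 20%, by Monte Carlo;
# this quantity is unaffected by how many times the data were examined.
draws = [random.betavariate(post_a, post_b) for _ in range(100_000)]
pr_above_20 = sum(d > 0.20 for d in draws) / len(draws)
print(round(post_mean, 3), round(pr_above_20, 3))
```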
2.2.6 Using a Risk/Loss Function
A final method which has been proposed is based on a risk, or loss function [46] [47].
Here, one must quantify the potential risk of continuing a trial and weigh it against the
potential loss of life by not treating patients with a potentially useful new drug. The
quantification is invariably difficult and arbitrary, and it is also extremely difficult to
attempt to quantify future costs, outcomes, and results. Thus, this method has been
severely criticized [48] and not widely implemented.
2.2.7 Stochastic Curtailment and Conditional Probability
A method called stochastic curtailment was proposed by Lan, Simon and Halperin in
1982 [49], however, by 2003, Sebille and Bellissant suggested this had yet to be widely
used in the medical literature. In stochastic curtailment a study is designed with a fixed
sample size. Interim analyses are then conducted and the probability of rejecting the null
hypothesis (assuming continuation of the study) at the end of the study is calculated using
P (Z > Zα|Z1), (2.3)
where Z1 is the standard normal test statistic at the first interim analysis. This probability
is based on the test statistic at the interim analysis and the amount of information
available out of the total amount expected. If the probability is sufficiently low,
then one would have sufficient evidence to terminate a study early for futility, as there
would be insufficient probability to obtain a statistically significant result. One of the
unique aspects to stochastic curtailment based on conditional power calculations is that
it looks at the futility of continuing, whereas most methods look only at whether clear
superiority has been observed. It is becoming common to design a clinical trial based
on group-sequential methods, perhaps based on an α-spending function resembling the
O’Brien-Fleming method, but using stochastic curtailment to stop a trial when there is
evidence that continuing the trial would be futile.
Unfortunately, the probability in equation (2.3) may be based on an incorrect assump-
tion, the assumption of normality under H0. One generally would not assume the null
to be correct for future data, since one already has observed some data (which typically
would not be equal to H0) and assuming that future data would be distributed as under
H0 would likely underestimate the true value. Alternatively, assuming future data
does not arise under H0 implies that the statistic in equation (2.3) may no longer be
normally distributed, which causes problems for the end-of-trial test statistic. Further, assuming future data occurs as specified
under the alternative hypothesis (HA), or that the data occurs as was observed in the
first part of the study, may be overly-optimistic. Methods have been proposed to account
for this, such as using a range of future data [50] [51], but there is no consensus at this
time.
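The dependence of equation (2.3) on what one assumes about the future data can be made concrete with the usual Brownian-motion formulation of conditional power (a Python sketch; the interim values z1 and t are hypothetical):

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def conditional_power(z1, t, z_alpha, theta):
    """P(final Z > z_alpha | interim Z = z1 at information time t),
    treating the test statistic as Brownian motion with drift theta."""
    b = z1 * sqrt(t)                      # B-value at information time t
    return 1 - nd.cdf((z_alpha - b - theta * (1 - t)) / sqrt(1 - t))

z_alpha = nd.inv_cdf(0.975)   # two-sided alpha = 0.05
z1, t = 0.8, 0.5              # hypothetical interim result, half information

cp_null = conditional_power(z1, t, z_alpha, theta=0.0)            # future data under H0
cp_trend = conditional_power(z1, t, z_alpha, theta=z1 / sqrt(t))  # current trend continues
print(round(cp_null, 3), round(cp_trend, 3))
```

Both assumptions give low conditional power here, but they differ several-fold, which is precisely the ambiguity described above.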
2.3 Sample Size Re-adjustment [52] [53] [54] [55]
As an alternative to group-sequential designs, another branch of statistical methodology which
investigates what to do at interim analyses is sample size re-adjustment [52] [53]
[54] [55]. Whereas group-sequential designs have a fixed maximum sample size, sample
size re-adjustment methods take into account the data observed up to the interim analysis
and adjust the sample size for the next stage of accrual based on the accumulated
data. This might be useful, for example, if one designed a trial based on overall survival
as the primary outcome, but survival in the control arm was underestimated under the
null hypothesis (i.e. fewer deaths than expected), then the power of the trial may not be
sufficient to detect the effect size in which one is interested. This is because there will be
fewer deaths than expected at trial completion. The trial would be underpowered, and if
the p-value is low, but does not attain statistical significance, then there would be many
questions about whether there is a true treatment effect or not.
For example, one may assume that 20% of patients will respond to standard treat-
ment. Using a two-sided Fisher’s exact test (α = 0.05), one might deem an experimental
treatment of interest if the response rate increased by 50%, i.e. to 30%. To attain 90%
power, one would need 412 patients in each treatment arm (calculations performed using
NCSS-PASS [56]). However, if instead of a 20% response rate, the true response rate
to standard treatment was only 15%, then a 22.5% response rate would be of interest
(increased by 50%) and one would require 594 patients per arm to attain sufficient sta-
tistical power. If the trial was stopped at 412 patients, then one would have only around
76% power and questions might abound if the end of trial p-value was reported as 0.07.
This overestimation of the standard treatment response rate might be observed early on
in the trial, and it might be desirable to the sponsoring agency or company to increase
the trial sample size to maintain 90% power rather than end with an inconclusive result.
However, using standard methods, this is not possible.
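The sample sizes above come from an exact-test calculation in NCSS-PASS; a rough Python sketch with the standard normal-approximation formula for comparing two proportions reproduces the same qualitative jump (the approximation gives somewhat smaller numbers than the exact test):

```python
from math import sqrt, ceil
from statistics import NormalDist

nd = NormalDist()

def n_per_arm(p1, p2, alpha=0.05, power=0.90):
    """Per-arm sample size for a two-sided comparison of two proportions,
    using the pooled normal approximation (not Fisher's exact test)."""
    za = nd.inv_cdf(1 - alpha / 2)
    zb = nd.inv_cdf(power)
    pbar = (p1 + p2) / 2
    num = za * sqrt(2 * pbar * (1 - pbar)) + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil((num / (p1 - p2)) ** 2)

n_planned = n_per_arm(0.20, 0.30)    # design assumption: 20% vs 30%
n_needed = n_per_arm(0.15, 0.225)    # true rates: 15% vs 22.5%
print(n_planned, n_needed)
```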
A real-life example where sample size re-estimation might have proved helpful is
discussed by Cui, Hung and Wang [57]. They discuss a phase III trial investigating
the effect of a new drug which aimed to prevent myocardial infarction [MI] amongst
patients undergoing coronary artery bypass graft surgery. The original sample size was
600 patients per treatment arm, which had sufficient power to detect a 50% reduction
in the incidence of MI, from 22% in placebo to 11% in the treatment group. A planned
interim analysis was performed after half the population was accrued and the conditional
probability of finding a statistically significant result was very low if the trial continued as
planned. This was because the MI incidence rate in the treatment group was around
16.5%, well above the planned incidence rate, but still below the observed placebo rate of 22%
and thus, still of clinical interest. Unfortunately, at the time of this trial, no valid testing
procedure was available in the statistical literature and the sponsor ultimately decided
not to increase the sample size. The trial eventually failed when the new treatment
did not achieve a large enough decrease in the MI incidence rate to attain statistical
significance. Although the ultimate fate of the new drug was not discussed, one might
presume that if the decrease in MI incidence rate was real, a subsequent clinical trial
would be necessary, at substantial cost to the sponsor and with a considerable delay to
the approval of a potentially beneficial drug.
Proponents of these methods indicate that these methods are flexible enough to ac-
count for situations where the initial hypotheses are misspecified [58], citing that if one
knew H0/HA well, then there would be no need for performing the trial. This may be
a common occurrence where initial estimates for designing a study are based on small
earlier studies, such as designing of a phase III trial based on efficacy estimates from a
small phase II study. Additionally, there may be times when investigators report a trend
towards significance if the p-value is just above the critical value for significance (say 0.05
< p-value < 0.10). It would be desirable if the trial could be extended slightly to attain
statistical significance. By allowing one to re-adjust sample sizes, one can improve the
likelihood that a valid study conclusion is reached [59]. However, if an interim analysis
is performed after n1 patients are accrued, and the total sample size is n = n1 + n2,
then as n2 → ∞, P(Z > Zα|Z1) → P(Z > Zα). It is therefore important to document how
the sample size will be adjusted prior to the start of the trial using one of the methods
discussed below.
Critics point out that the sample sizes arising from re-adjustment methods may be
exceedingly large, that they may be drastically different from what was originally proposed
and for which the trial was budgeted [60], and that the efficiency of a well-designed
group-sequential design far surpasses that of a well-designed trial using sample size
re-adjustment [61] [62].
2.3.1 Variance Spending Approach
Fisher [52] and Shen and Fisher [53] advocate a variance spending approach to sample
size readjustment. In this method, the difference between treatment outcomes at the end
of a trial is compared using a test statistic based on the standard normal distribution. At
an interim analysis, some portion of the total variance is spent and an adjusted normal
test statistic is calculated as a measure of the difference in outcomes between treatment
arms. The stage 2 data is similarly constructed. The two normal statistics are then
summed together, with the total variance equal to 1. Under the null hypothesis of no
difference, regardless of any difference in sample size, with appropriate adjustments, the
final test statistic is a standard normal, N(0,1).
Specifically, if one calculates the difference in outcomes as S1 = Σ_{i=1}^{rn} (XAi − XBi) for
treatment groups A = experimental and B = standard, where X is the statistic of interest, then
at an interim analysis S1 ≈ N(rnθ, rn), where 0 < r < 1. One can then transform this
value by dividing by the square root of the planned sample size, to attain W1 = S1/√n ≈ N(r√n θ, r).
One has then spent a fraction r, 0 < r < 1, of the total variance. Adjusting the total sample size to
n∗ = rn + γ(1 − r)n and calculating S2 = Σ_{i=rn+1}^{n∗} (XAi − XBi) ≈ N(γ(1 − r)nθ, γ(1 − r)n)
and W2 = γ^{−1/2} S2/√n ≈ N(√γ (1 − r)√n θ, 1 − r), one can then calculate at the end of the
trial Z = W1 + W2 = (S1 + γ^{−1/2} S2)/√n, which under the null hypothesis is
≈ N(0, 1), and one would reject if Z > Zα.
One of the major problems with the variance-spending approach is that it does not
allow stopping of the trial at the interim analysis, and one is required to continue on
to stage 2 (or later stages, as this can be easily generalised to >2 stages of accrual).
Additionally, if one performs an interim analysis too late in the trial, the second stage of
accrual might be prohibitively large to maintain the necessary power. However, if one
performs the analysis too early, then there is insufficient data to make an appropriate
estimate of the sample size for stage 2.
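The null behaviour of the combined statistic can be checked by simulation. A Python sketch assuming unit-variance differences XAi − XBi (the values of n, r and γ are hypothetical); whatever γ is chosen at the interim, the combined Z should remain standard normal under the null:

```python
import random
from math import sqrt

random.seed(2)

n = 100        # originally planned number of differences
r = 0.4        # fraction of the total variance spent at the interim
gamma = 2.0    # stage-2 inflation factor chosen at the interim

def final_z(stage1, stage2):
    """Z = (S1 + gamma^{-1/2} S2) / sqrt(n), the variance-spending combination."""
    return (sum(stage1) + sum(stage2) / sqrt(gamma)) / sqrt(n)

n1 = int(r * n)
n2 = int(gamma * (1 - r) * n)    # adjusted stage-2 size

# Monte Carlo under the null: each difference is N(0, 1).
zs = [final_z([random.gauss(0, 1) for _ in range(n1)],
              [random.gauss(0, 1) for _ in range(n2)])
      for _ in range(5_000)]

mean_z = sum(zs) / len(zs)
var_z = sum(z * z for z in zs) / len(zs) - mean_z ** 2
print(round(mean_z, 3), round(var_z, 3))   # close to 0 and 1
```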
2.3.2 Fisher Combination Test
The Fisher combination test is named after R.A. Fisher, who first proposed this methodology
for combining p-values, and not L.D. Fisher of the variance spending approach.
The methodology may be better attributed to Bauer and Kohne [54], who borrowed this
procedure from meta-analysis where it is commonly used.
Simply, under the null hypothesis of no treatment effect, a p-value follows a uniform,
U(0,1), distribution. If one defines pi to be the p-value obtained at the ith interim analysis,
then −2 ln(p1p2) follows a χ2 distribution with 4 degrees of freedom, and one rejects
overall if p1p2 ≤ exp(−χ2_{4,α}/2). So, regardless of the distribution of the data, one can combine
the results from 2, or more, interim analyses using this test statistic.
If the p-value is less than α at stage 1, one can stop the trial at the interim analysis.
Otherwise, one simply constructs the stage 2 data to be sufficiently large to have enough
power to attain a small enough p-value at stage 2 to declare significance in the trial
overall.
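The rule can be sketched directly. For 4 degrees of freedom the χ2 upper tail has the closed form e^{−q/2}(1 + q/2), so no statistical library is needed (the stage p-values below are hypothetical):

```python
from math import log, exp

def fisher_combined_p(p1, p2):
    """Combined p-value for two stagewise p-values: under H0 each p is
    U(0,1), so q = -2*ln(p1*p2) is chi-square with 4 df, whose upper
    tail probability is exp(-q/2) * (1 + q/2)."""
    q = -2.0 * (log(p1) + log(p2))
    return exp(-q / 2) * (1 + q / 2)

# Two individually non-significant stages can combine to significance:
print(round(fisher_combined_p(0.08, 0.08), 4))   # about 0.039
```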
Similar to the variance spending approach, the stage 2 sample size may be prohibitively
large if the p-value for the first interim analysis is not close to significance. An
additional criticism is that the distribution of the data must be correctly specified, else
the p-value itself is inappropriate.
2.3.3 Conditional Power
The last method of sample size re-adjustment is a modification of the group-sequential
stochastic curtailing method which also uses conditional power. Whereas stochastic cur-
tailing assumes a fixed sample size and investigates the conditional power (conditioning
on the data already observed) of getting a significant result if one was to continue, in this
method, the level of conditional power is fixed, and the sample size required to attain
this power is calculated. In this manner, one is able to ensure sufficient statistical power
to obtain a valid study conclusion.
The idea of adjusting the sample size based on conditional power was proposed by
Proschan and Hunsberger [55], who investigate conditional power when comparing the
difference in a continuous outcome between two groups, as would be common in phase III
clinical trials. In this paper, the maximum αmax for two-stages of accrual is calculated
if one indiscriminately selects the stage 2 sample size to maximise the probability of
getting a significant result. For example, if Z1 > Zα then one might choose a stage
2 sample of size 0, ensuring a significant result. Conversely, if Z1 is very small, then
one might choose a stage 2 sample size which approaches ∞, thus making the stage
1 data of almost no importance (i.e. as n2 → ∞, z = z1 + z2 → z2). The total error
from indiscriminately choosing the stage 2 sample size in this manner is increased to
αmax = α + exp(−z²_α/2)/4 and can be more than double the planned error rate.
A simplistic, but inefficient, way of performing a two-stage design is then to select
αmax to be the error rate of interest, and to select a sample size for stage 2 in any way one
chooses, knowing that the total α will remain below αmax. A simple change increases the
efficiency dramatically. For some k and p*, at any interim analysis, one stops the trial if
Z1 > k and rejects H0 or if Z1 < Zp∗ and does not reject H0 where Z1 is the Normally
distributed test statistic at the time of the interim analysis. Thus, one can stop a trial
after the interim analysis if extreme results are seen, and one will continue to stage 2
only if Zp∗ ≤ Z1 ≤ k. This lowers αmax considerably.
A statistically more desirable extension procedure is also proposed based on the calculation
of conditional power, that being

Pδ(Z2 ≥ zα|z1, n2) = 1 − Φ[(zα√(2(n1 + n2)) − z1√(2n1) − n2δ)/√(2n2)],

where Φ is the standard normal cumulative distribution function, n2 is the second stage
sample size, z1 is the observed test statistic after the first stage, Z2 is the test statistic
at the end of the second stage and δ = (µ1 − µ2)/σ. It is shown that the formula for
conditional power can be expressed as CPδ(n2, zα|z1) = 1 − Φ(zA − √(n2/2) δ), where
zA = (zα√(2(n1 + n2)) − z1√(2n1))/√(2n2). One can then plot different
conditional power estimates for different stage 2 sample sizes and choose the sample
size based on these plots. Conversely, one can set CPδ(n2, zα|z1) = 1 − β2 and solve for n2.
In other words, if one wants a defined amount of power at the end of stage 2, conditional
on the data at the interim analysis, one can set n2 = 2(zA + zβ2)²/δ², or
n2 = n1(zA + zβ2)²/z1² if using the empirical estimate δ̂ = (ȳ1 − x̄1)/σ, for which
δ̂ = z1√(2/n1). Since the observed estimate δ̂ may be overly optimistic,
an alternative estimate may be derived from HA, or a mixture between δ̂ and HA. This
can be generalised to more than 2 stages.
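Because the boundary term depends on n2 itself, the sample-size equation is implicit; in practice one can simply scan n2 until the target conditional power is reached. A Python sketch (the interim values n1 and z1 are hypothetical; one-sided rejection at the 0.025 level):

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def cond_power(n1, n2, z1, delta, z_alpha):
    """Conditional power in the Proschan-Hunsberger form: probability the
    end-of-trial statistic exceeds z_alpha given interim z1, with n1 and
    n2 patients per arm per stage and standardized effect delta."""
    num = z_alpha * sqrt(2 * (n1 + n2)) - z1 * sqrt(2 * n1) - n2 * delta
    return 1 - nd.cdf(num / sqrt(2 * n2))

n1, z1 = 50, 1.2                       # hypothetical interim look
z_alpha = nd.inv_cdf(0.975)
delta_hat = z1 * sqrt(2 / n1)          # empirical effect estimate

# Smallest stage-2 size whose conditional power under delta-hat is 80%.
n2 = next(n for n in range(1, 5000)
          if cond_power(n1, n, z1, delta_hat, z_alpha) >= 0.80)
print(n2)
```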
2.4 Combining Data from Different Analyses
2.4.1 Continuous and Categorical Outcomes
For any of the methods in the previous section to be valid, one must assume that future
data is independent of already observed data. If future data is correlated with already
observed data, then one needs to model this correlation to get valid test statistics or
decision rules, which is not possible without knowing the future data. Fortunately, the
statistical theory has already been provided which demonstrates this independence.
When one has continuous data, the significance test is based on assuming a normal
distribution under the null hypothesis, and one can easily perform multiple tests by
assuming the statistic from all data accumulated prior to the first evaluation follows a normal
distribution Z1, the statistic from all data after the first evaluation but prior to the second
evaluation follows an independent normal distribution Z2, and so on until the end of the
trial. Since the data underlying Z1 have no influence on the data underlying Z2, the
statistics are independent, and it is well known that the sum of two (or more) independent
normal random variables is again normal.
Thus, the accumulated data test statistic can be based on a normal distribution under
the null hypothesis.
Similarly, for categorical data, significance tests are often based on the χ2 test. Inde-
pendence between different test statistics can be assumed as in the continuous case, and
the sum of two (or more) χ2 variables is again a χ2. Thus, the accumulated data test statistic can
again be based on the χ2 distribution under the null hypothesis.
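The χ2 additivity property is easy to confirm by simulation; a small Python sketch with illustrative degrees of freedom:

```python
import random

random.seed(3)

def chisq(df):
    """One chi-square draw, as a sum of df squared standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

# The sum of independent chi-squares chi2(3) + chi2(5) should behave
# as chi2(8): mean 8 and variance 16.
draws = [chisq(3) + chisq(5) for _ in range(20_000)]
mean_d = sum(draws) / len(draws)
var_d = sum(d * d for d in draws) / len(draws) - mean_d ** 2
print(round(mean_d, 2), round(var_d, 2))
```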
2.4.2 Survival-Type Outcomes
Unfortunately, with survival outcomes, significance tests are not as easily constructed.
The probability of death for a patient at the second and later evaluations is correlated
with the probability of death at the first evaluation: if a patient is dead at the first
evaluation, they must necessarily be dead at all future analyses, while a patient censored at
the first evaluation may be dead or alive at a later analysis, but necessarily has a survival
time at least as long as that observed at the first evaluation. It might be assumed that the log-
rank test at evaluation 1 is correlated with a log-rank test at evaluation 2. Fortunately,
Tsiatis [63] showed that the joint distribution of the test statistics from repeated log-rank
tests evaluated at information time points t1, t2, . . . , tk converges asymptotically to
a multivariate normal with mean 0 and an estimable covariance matrix. From this it
is shown that as long as any weight function used in computing a survival test statistic
has asymptotically independent increments (as is the case in the log-rank test), then the
recursively calculated process used at interim analyses will have independent increments.
Further, this process will depend only on data collected at that time and on the previous
test results, not on future tests or data.
The weight function of the Wilcoxon test (using the approach specified by Gehan [64])
depends upon the number of patients accrued and alive at each time point and thus does
not have asymptotically independent increments. Slud and Wei [65] show how one can
estimate the correlation between repeated significance tests. They note however, that
this requires high dimensional integrals which may deter the practicality of repeated
significance tests based on the Wilcoxon statistic.
The generalization of the Wilcoxon test by Peto and Peto [66] and Prentice [67] does
not depend on the censoring rate and has been recommended as a preferred generalization
of the Wilcoxon test when there is censored data [68] [69]. The Peto-Peto-Prentice
generalization uses a modified Kaplan-Meier estimator in the derivation of the density
function of the survival times. Since the weight function does not depend on the censoring
rate, it has asymptotically independent increments. Thus, the methods
of Tsiatis [63] are applicable, as they were for the log-rank test. The recursive process
at the interim analysis, therefore, has independent increments and the process depends
only on data collected at that time and previously, not on future results.
Practically, the Wilcoxon statistic is rarely used in group-sequential clinical trials,
partly because classical group-sequential methods cannot be used for group-sequential
monitoring when the Gehan generalization is used, but also partly due to the emphasis on
early deaths when using the Wilcoxon test. If one saw many early deaths (by chance), one
would have more reason to stop a trial very early, and this would lead to questions later
regarding whether the trial should have continued. Most trials require sufficient time to
elapse before declaring superiority of one treatment over another to satisfy non-statistical
concerns of investigators, even when a purely statistical conclusion is evident.
2.5 Markov Chains
2.5.1 Definition
A Markov chain is a discrete-time stochastic process which has the Markov property.
Although frequently used in many research fields, including engineering, reliability and
quality control, Markov chains are less common in the biomedical literature. Simplistically, a system
of interest will change from one state to another (including potentially to the state it was
previously in) at discrete time points. Each change of state is called a transition. The
Markov property states that the conditional probability the system will be in a given
state after the next transition depends only on which state the system is in presently.
In other words, knowledge of what state the system was in previously does not give any
information about future states. Mathematically, this is defined as follows:
Let ω = {1, 2, . . . , m}, (m < ∞) be a state space and let {Yt} = Y0, Y1, . . . be a
sequence of random variables defined on ω. Then the sequence will be called a Markov
chain if, for any sequence of states i0, i1, . . . , it and any t = 1, 2, . . . , we have
P (Yt = it|Yt−1 = it−1, . . . , Y0 = i0) = P (Yt = it|Yt−1 = it−1) (2.4)
[70]
We will only be interested in Markov chains with a finite state space ω in this thesis.
The transitions can be described succinctly as P(Yt = j | Yt−1 = i) = pij(t), and over a
finite state space an m × m transition matrix, M, can be written as:
M = (pij(t)) =
p11(t) p12(t) · · · p1m(t)
p21(t) p22(t) · · · p2m(t)
· · · · · · · · · · · ·
pm1(t) pm2(t) · · · pmm(t)
(2.5)
A state is defined as an absorbing state if once the system enters that state, it will
never leave that state.
2.5.2 Properties of Markov Chains
A Markov chain is considered homogeneous if the transition probabilities are identical
at all transitions, and is considered non-homogeneous if the transition probabilities differ
at some time t. The transition from state i to state j at a single time point,
P(Yt = j | Yt−1 = i) = pij(t), is called a one-step transition; a chain whose transition
probabilities depend only on the present state, as in (2.4), is a first-order Markov chain.
The probability of moving from state i to state j over a series of n time periods is the
n-step transition probability p(n)ij(t), obtained by summing, over all intermediate states,
products of one-step probabilities. The n-step probabilities can thus be calculated from
the one-step probabilities as indicated in the Chapman-Kolmogorov
equations, which state that, for any 0 < k < n,
p(n)ij = ∑_{r∈ω} p(k)ir p(n−k)rj. (2.6)
Using the Chapman-Kolmogorov equations, it is possible to calculate many statistics,
although this calculation may still be quite complex.
In certain situations, an approach called finite Markov chain imbedding, which is
summarised by Fu and Koutras [71], can be used to simplify the calculations. They
stated that a nonnegative integer random variable Xn,k can be imbedded into a finite
Markov chain if:
a) there exists a finite Markov chain {Yt : t = 1, 2, . . . , n} defined on the finite
state space ω,
b) there exists a finite partition {Cx}, x = 0, 1, . . . , m, of the state space ω, and
c) for every x = 0, 1, . . . , m, we have P(Xn,k = x) = P(Yn ∈ Cx).
The subscript k was used to represent certain random variables of interest, such as the
number of consecutive positive states in a run, and would depend on the random
variable of interest. It is then shown that if the random variable Xn,k can be imbedded
into a finite Markov chain, then

P(Xn,k = x) = π0 (Λ1 Λ2 · · · Λn) U′(Cx) (2.7)

where π0 is the initial probability vector of the Markov chain, Λt is the transition
matrix at step t, and U′(Cx) denotes the transpose of U(Cx) = ∑_{r: ar∈Cx} Ur,
where Ur is a 1 × m unit vector having 1 at the rth coordinate and 0 elsewhere;
U(Cx) thus indicates the end states of interest.
Simplistically, this means that if one has an imbeddable random variable, then one can
calculate the distribution, moments and probability-generating function of interest. One
only needs to have a proper state space, a proper partition of the state space and the
transition matrix associated with the imbedded Markov chain [71].
Conceptually, if one has a system which can be modelled using a finite-state Markov
chain, and the state space can be partitioned finitely such that the distribution of a
random variable Xn,k matches the probability of the chain ending in the corresponding
partition, then the random variable is said to be imbedded in the finite Markov chain.
When one has a finite imbedded Markov chain, the distribution of the random variable
can be modelled concisely using only the initial probability vector (the starting point),
the transition matrices at each step (the path through the system) and the end-state
partitions (the ending states of interest). Probabilities of interest are thus calculable
exactly, or can be estimated using computer simulation.
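The computation in equation (2.7) is simply a product of matrices applied to the initial distribution. The thesis's software is written in R; as a minimal, language-agnostic illustration, the following Python sketch evaluates π0(Λ1 · · · Λn)U′(Cx) for a hypothetical three-state chain whose probabilities are illustrative only:

```python
def vec_mat(v, A):
    """Multiply a row vector v by a matrix A (list of rows)."""
    return [sum(x * a for x, a in zip(v, col)) for col in zip(*A)]

def fmci_probability(pi0, transition_matrices, partition):
    """Equation (2.7): P(Y_n in C_x) = pi0 * (prod of step matrices) * U'(C_x),
    with the partition given as a 0/1 indicator over the state space."""
    v = pi0
    for L in transition_matrices:
        v = vec_mat(v, L)
    return sum(p for p, in_part in zip(v, partition) if in_part)

# Hypothetical homogeneous 3-state chain; state 2 is absorbing.
L = [[0.5, 0.3, 0.2],
     [0.0, 0.6, 0.4],
     [0.0, 0.0, 1.0]]
pi0 = [1.0, 0.0, 0.0]        # chain starts in state 0
p_absorbed = fmci_probability(pi0, [L] * 4, [0, 0, 1])  # P(absorbed by step 4)
```

For a non-homogeneous chain one simply passes a different matrix for each step; the partition vector plays the role of U′(Cx).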
2.5.3 Markov Chains as a Model for Cancer Phase II Clinical
Trials
It is possible to model a patient receiving treatment in a phase II cancer clinical trial using
Markov chain methodology. At any given time, a patient will be objectively observed
as being in one of a finite number of states, and patients will transition from one state
to another at selected time points (i.e. when tumour measurements occur, which for
phase II clinical trials is generally after every second cycle of treatment). The transition
is independent of what occurred to that patient previously and depends only on which
state the patient is in presently. Thus, tumour response status is a random variable which
can be modelled as a Markov chain; one needs only to show that this random variable is
imbeddable, and one can then calculate its distribution and moments.
One scenario of interest might be to model the RECIST [17] criteria. One can easily
construct a proper state space ω by defining the states ∅, complete response [CR], partial
response [PR], unconfirmed response [UR], stable disease [SD], off-treatment but with a
previous best response of SD [SDoff], progressive disease [PD] and off-study / censored
/ failed treatment [C], which can be partitioned finitely depending on the random variable
of interest (e.g. let Xn,k be the random variable defined as 1 if a patient is in state
k = 2, 3, corresponding to states CR or PR, after n = 4 transitions, and 0 otherwise).
An appropriate transition matrix is given in 2.8. Frequently, complete responders and
partial responders are grouped for simplicity as responders [R], and this is shown in
matrix 2.9. States CR, PR, SDoff, PD and C are absorbing states, with SDoff included
since one takes the best confirmed response of patients while on-treatment.
Thus, the random variable Xn,k defined can be imbedded into a finite Markov chain
and the distribution, moments and probability-generating function can be estimated.
M =
∅ CR PR UR SD SDoff PD C
∅ 0 0 0 p∅−ur p∅−sd 0 p∅−pd p∅−c
CR 0 1 0 0 0 0 0 0
PR 0 0 1 0 0 0 0 0
UR 0 pur−cr pur−pr 0 0 pur−sdo 0 0
SD 0 0 0 psd−ur psd−sd psd−sdo 0 0
SDo 0 0 0 0 0 1 0 0
PD 0 0 0 0 0 0 1 0
C 0 0 0 0 0 0 0 1
(2.8)
The transition matrix 2.9 can be interpreted as follows. Patients enter the system
in state ∅, which is defined because the status of a patient's tumour prior to treatment
is often unknown: although frequently assumed to be progressing, radiological tumour
measurements are usually not taken prior to trial entry. At the first objective tumour
evaluation, a patient can transition to having an unconfirmed response (UR), stable
disease (SD), progressive disease (PD), or being censored (off-treatment due to a reason
other than PD). At the next transition, patients who have had a UR will either have
a confirmed response, with probability pur−r, or will come off-study without having an
observed response, with probability pur−sdo. The UR state is necessary because, according
to the RECIST criteria, one uses the best observed response, and a response occurs only
after confirmation at least 4 weeks after the first observation. If a patient does not have
a confirmation of their response, they are deemed as having a best response of SD.
Patients with SD will either transition to a UR with probability psd−ur, remain in SD
with probability psd−sd, or come off-study with probability psd−sdo. All other transitions
occur either with probability 1 (e.g. patients with PD will remain in PD, as that is their
best observed response) or with probability 0 (e.g. it is not possible to transition from
state CR to SD according to the RECIST criteria); see matrix 2.9.
M =
∅ R UR SD SDoff PD C
∅ 0 0 p∅−ur p∅−sd 0 p∅−pd p∅−c
R 0 1 0 0 0 0 0
UR 0 pur−r 0 0 pur−sdo 0 0
SD 0 0 psd−ur psd−sd psd−sdo 0 0
SDo 0 0 0 0 1 0 0
PD 0 0 0 0 0 1 0
C 0 0 0 0 0 0 1
(2.9)
To calculate the probabilities of interest, or the first moments, one then needs only
to define the random variable of interest in such a way as to properly partition the state
space. Traditionally, one would be interested only in the response rate, so the appropriate
partition is C′(t) = {0100000}. Alternatively, one could partition the state space as
C′(t) = {0100100}, which defines the random variable based on response plus stable
disease at the time patients go off-study. One might instead be interested in the response
+ stable disease rate while some patients are still being treated and on-study, for which
the partition at the time of analysis is C′(t) = {0111100}. Many alternative transition
matrices and outcomes are possible, and some are described in detail in the next chapter.
Although the exact distribution of the random variable can be calculated, one can
also use simulation to obtain probabilities of interest; simulation allows one to obtain
many different probabilities quickly and easily with only minor changes to the input.
Further, since the outcomes at different transitions are correlated, the covariance
structure may not be straightforward, and calculating p-values of interest is often
difficult using theoretical methods. As a result, when calculating probabilities
of interest, the results were simulated rather than strictly calculated.
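The simulation approach can be illustrated briefly. The thesis's actual program is written in R; the Python sketch below pushes patients through a chain with the structure of matrix 2.9 and estimates the probability of a best response of R. All transition probabilities here are hypothetical, chosen only for the example:

```python
import random

# States in the order of matrix (2.9); the probabilities are illustrative only.
EMPTY, R, UR, SD, SDOFF, PD, C = range(7)
M = [
    [0.0, 0.0, 0.15, 0.45, 0.0, 0.30, 0.10],  # from EMPTY
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],      # R is absorbing
    [0.0, 0.80, 0.0, 0.0, 0.20, 0.0, 0.0],    # from UR: confirm or off-study
    [0.0, 0.0, 0.10, 0.70, 0.20, 0.0, 0.0],   # from SD
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],      # SDoff is absorbing
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],      # PD is absorbing
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],      # C is absorbing
]

def simulate_patient(n_transitions, rng):
    """One patient's path through the chain, starting from the EMPTY state."""
    state = EMPTY
    for _ in range(n_transitions):
        state = rng.choices(range(7), weights=M[state])[0]
    return state

def response_rate(n_patients, n_transitions, seed=1):
    """Monte Carlo estimate of P(best response = R) after n transitions."""
    rng = random.Random(seed)
    hits = sum(simulate_patient(n_transitions, rng) == R
               for _ in range(n_patients))
    return hits / n_patients
```

For these hypothetical probabilities, the exact probability of ending in state R after four transitions, obtained from equation (2.7), is 0.1812, so the simulated estimate should lie close to this; changing the partition (e.g. counting SDoff as well) requires only a change to the final comparison.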
Chapter 3
Potential Trial Designs for Phase II
Oncology Clinical Trials
When designing a phase II oncology clinical trial, there are numerous potential designs
one can use, even after fixing the maximum allowable error rates and the null and
alternative hypotheses. To illustrate the plethora of design alternatives, this chapter will
describe the many options available to a trialist for a given hypothesis test. Some pros
and cons of each design option will be discussed.
As a basis for discussion, an example trial is presented, and designs are illustrated in
a context similar to this trial. Details of this trial are given more explicitly in the next
chapter.
3.1 Design Summary
Recently, a single-arm, open-label phase II study of CCI-779 (temsirolimus) was per-
formed in patients with neuroendocrine carcinoma [72]. Further details of this study will
be discussed in Chapter 4. The original design of the study was based on using response
rate [RR] as the primary outcome, with response defined as per the RECIST criteria [17].
Hypotheses were set at H0: RR = 0.05 versus HA: RR = 0.25, with α = 0.05 and β = 0.10,
and a modification of the Simon minimax design [9] was used, such that a minimum of 30
patients were to be accrued. This modification was added by the investigators to ensure
sufficient numbers of patients were accrued to fully evaluate the treatment clinically.
Accordingly, the design specified that 15 patients were to be accrued in the first stage.
If 2 or more patients had an objective response, 15 additional patients would be accrued
in stage II. At the end of stage II, one would reject H0 and deem the treatment worthy
of further study if 4 or more of 30 patients had an objective response; otherwise, one
would not reject H0 and would deem the treatment inactive if 3 or fewer of 30 patients
had an objective response. The true α = 0.045 and the true β = 0.096 for this design. The
probability of stopping after the first stage assuming H0 is 0.829 and the expected sample
size under H0 is 17.56.
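These operating characteristics can be reproduced from exact binomial calculations. The following Python sketch (helper names are illustrative; the thesis's own code is in R) computes the early-stopping probability, rejection probability and expected sample size of a generic two-stage rule:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(X <= k); returns 0 for k < 0."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def two_stage_characteristics(n1, n2, c1, r, p):
    """Two-stage design: continue past stage 1 only if responses >= c1;
    reject H0 at the end if total responses >= r.
    Returns (P(early stop), P(reject H0), expected sample size) at rate p."""
    pet = binom_cdf(c1 - 1, n1, p)                 # early termination prob.
    p_reject = sum(binom_pmf(i, n1, p) * (1 - binom_cdf(r - i - 1, n2, p))
                   for i in range(c1, n1 + 1))
    ess = pet * n1 + (1 - pet) * (n1 + n2)
    return pet, p_reject, ess
```

Evaluating the modified design (n1 = n2 = 15, continue if at least 2 responses, reject if at least 4 of 30) at p = 0.05 recovers the stated α, early-stopping probability and expected sample size, and at p = 0.25 the stated β.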
While describing potential trial designs, each design will be constructed using parameters
similar to those of the design described above. It is worth noting as well that many early
phase II designs were based on asymptotic results, which produced designs similar to those
used in phase III trials; however, modern computing power has allowed more recent designs
to be based on exact calculations, such as the frequently cited Simon design [9].
3.2 Phase II Designs
Even after defining α, β,H0 and HA, there are still many different designs one can choose
from when using a simple two-stage design. As an example, Table 3.1 gives a list of
possible, valid and commonly-used designs which can be chosen when one is investigating
H0 : RR = 0.05 versus HA : RR = 0.20, with α = 0.05 and β = 0.20. In the same clinical
scenario, a different investigator might be interested in the response + stable disease
rate [RR+SD] and not just the response rate [RR], setting H0 : RR + SD = 0.40
versus HA : RR + SD = 0.60, with α = 0.05 and β = 0.20. Table 3.2 gives the list
of possible decision rules for this scenario. Any one of the designs listed could be used
Design Primary Outcome Accept H0:stage 1 Accept H0:stage 2
Gehan [7] Response 0/14 not stated
Fleming [8] Response 0/15 ≤ 3/35
Simon optimal [9] Response 0/12 ≤ 3/37
Simon minimax [9] Response 0/18 ≤ 3/32
Jung design 1 [74] Response 0/15 ≤ 3/33
Jung design 2 [74] Response 0/13 ≤ 3/35
Jennison & Turnbull ∗ [31] Response 0/16 ≤ 2/35
Bayesian ∗ [16] Response ≤ 1/16
≤ 4/40 ≤ 6/57
Table 3.1: Potential Phase II Designs Using Response
∗ Other designs are possible
and the decision of which design to use often comes down to personal preference or
investigator familiarity with one of the designs. Uncertainty can develop when the trial
results are borderline. This uncertainty can be compounded if the individual stage targets
are not met exactly, if the trial design is not clearly stated, if there is some question
about the natural history of the disease, or if investigators have different beliefs about
the standard-of-care response rate. Unfortunately, some, if not all, of these uncertainties
are present in most phase II cancer clinical trials.
A review of these commonly used clinical trial designs is performed in this section.
In the following section, multinomial designs are reviewed - a further complication which
occurs when investigators count a patient who has a response differently from a patient
who has stable disease.
Design Primary Outcome Accept H0:stage 1 Accept H0:stage 2
Fleming [8] Response+SD ≤ 7/20 ≤ 22/45
Simon optimal [9] Response+SD ≤ 7/18 ≤ 22/46
Simon minimax [9] Response+SD ≤ 11/28 ≤ 20/41
Jung design 1 [74] Response+SD ≤ 11/27 ≤ 21/43
Jung design 2 [74] Response+SD ≤ 9/23 ≤ 22/45
Jennison & Turnbull ∗ [31] Response+SD ≤ 8/22 ≤ 19/45
Bayesian ∗ [16] Response+SD ≤ 10/23 ≤ 20/43
Table 3.2: Potential Phase II Designs Using Response & Stable Disease
∗ Other designs are possible
3.2.1 Univariate Designs With Response as the Outcome
Single Stage Design [73]
The simplest phase II clinical trial design is one in which all patients are accrued in a
single stage. Sample size calculations can be based on the binomial distribution; exact
calculations are preferred, such as those provided by A'Hern [73]. Based on H0: RR = 0.05
versus HA: RR = 0.25, α = 0.05 and β = 0.10, 25 patients are required. Rejection of H0
occurs if 4 or more patients have an objective response, and non-rejection of H0 occurs if
3 or fewer patients have an objective response. The true α = 0.034 and the true β = 0.096
for this design, and the expected sample size under H0 is 25.
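A single-stage design of this kind can be found by a direct exact search: take the smallest n for which some cut-off r satisfies both error constraints. The Python sketch below is illustrative only and is not A'Hern's published algorithm:

```python
from math import comb

def binom_tail(r, n, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

def single_stage_design(p0, p1, alpha, beta, n_max=200):
    """Smallest n (with cut-off r) such that P(X >= r | p0) <= alpha
    and P(X >= r | p1) >= 1 - beta; reject H0 if X >= r."""
    for n in range(1, n_max + 1):
        for r in range(n + 1):
            if binom_tail(r, n, p0) <= alpha:
                if binom_tail(r, n, p1) >= 1 - beta:
                    return n, r
                break  # a larger r at this n only lowers power further
    return None
```

For H0: RR = 0.05 versus HA: RR = 0.25 with α = 0.05 and β = 0.10 this returns the design quoted above (n = 25, reject with 4 or more responses), along with its exact error rates.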
Gehan Design [7]
The Gehan design was formulated to allow for early termination of trials conducted on
inactive agents, and is often thought of as the classical phase II design. It is the first
widely used 2-stage phase II design and it is occasionally still used today, primarily due
to familiarity with this design amongst many experienced trialists. The design was
formulated to reject the treatment as soon as possible when no responses are observed and the results are
are no longer consistent with the assumption that the ’beneficial’ response rate is true. If
one continues beyond the first stage, the desire is to improve estimation of the response
rates, thus, sample size is determined by looking at the precision of the estimates, and
ensuring the standard error is within certain limits.
The Gehan design was formulated at a time when there were very few useful treatments
for patients and the standard of care for most diseases had minimal if any efficacy
(response rates usually < 5-10%); for this design one would define RRH0 = 0.05 and
RRHA = 0.20. Fourteen patients are to be accrued in the first stage, since the probability
of having no responses among the first 14 patients if HA is true would be 0.8^14 = 0.044 < 0.05,
the defined level of significance. As a result, the study would be terminated and the
treatment deemed uninteresting if none of the first 14 patients have a response, since at this
time the results would be inconsistent with the assumption that the beneficial response
rate is true. If 1 or more patients had a response, the trial would continue to stage
2, where the number of patients in stage 2 would depend on the number of responses
observed in stage 1 and the desired level of precision. Assuming 1 patient had a response
and given the specified precision as having a standard error of < 0.10, then one would
need 11 additional patients in stage 2. This is calculated by noting that if 1 of 14 patients
had a response in stage 1, at most 12 of 25 patients could have a response at the end of
stage 2, which gives a standard error of 0.0999. Although different levels of significance
and standard error could be used, it is this basic design that is almost always performed.
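The stage 1 sample size and the standard-error check described above can be verified directly; a small Python sketch (function name is illustrative):

```python
from math import sqrt

def gehan_stage1_size(p_beneficial, alpha=0.05):
    """Smallest n1 with (1 - p)^n1 < alpha: if no responses are seen among
    n1 patients, a true response rate of p_beneficial is implausible."""
    n = 1
    while (1 - p_beneficial) ** n >= alpha:
        n += 1
    return n

n1 = gehan_stage1_size(0.20)     # stage 1 size for the classical parameters

# The stage-2 reasoning in the text: with 1/14 responses and 11 additional
# patients, at most 12 of 25 could respond, giving a standard error of
se = sqrt((12 / 25) * (1 - 12 / 25) / 25)   # just under the 0.10 limit
```

This reproduces the n1 = 14 first stage and the 0.0999 worst-case standard error quoted in the text.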
One of the major criticisms with this design is that the trial is based on estimation and
not hypothesis testing. While some argue that this is a better objective of phase II trials,
most trials are conducted under a hypothesis testing framework to exclude ambiguity in
reporting study results and to allow early termination of ineffective trials. Additional
criticisms of this design include the lack of flexibility if the stage 1 accrual target is not
hit exactly, the fact that the total sample size depends on what is observed at stage 1 (a
serious financial and ethical consideration), and that the actual standard error is likely
quite different from the bound.
Fleming Design [8]
The Fleming design is derived from methodology developed for phase III clinical trials
[28] in which one can reject or not reject the null hypothesis at any one of K interim anal-
yses. Each stage of accrual is identically sized and the total trial error rate is nominally
preserved. A trialist thus defines the response rate under the null hypothesis (RRH0),
the response rate under the alternative hypothesis (RRHA), the type I error rate, α, and
the type II error rate, β, and from this the design can be constructed. The total trial
sample size is calculated based on asymptotic estimation, using the null response rate of
interest and assuming a single stage of accrual, by the formula:
N = [(Z1−β √(RRHA(1 − RRHA)) + Z1−α √(RRH0(1 − RRH0))) / (RRHA − RRH0)]², (3.1)
where Z1−α is the 1 − α quantile of the standard normal distribution. Generally N is
rounded up to the nearest 5th patient (e.g. 5, 10, 15, 20, . . . ) for simplicity's sake. One would
reject H0 in a single stage trial if the number of responses is at least
S ≥ [N · RRH0 + Z1−α √(N · RRH0 (1 − RRH0))] + 1. (3.2)
Alternatively, one could design a trial using only N, α and RRH0, for the situation where
one has a fixed sample size due to practical concerns. In this case, the single-stage RRHA
varies and is equal to
RRHA = pA = [√(N · RRH0) + Z1−α √(1 − RRH0)]² / (N + Z²1−α). (3.3)
We will assume that there is no ceiling on N in designing the hypothetical trial.
For a multiple stage trial, an interim analysis is conducted after half the patients have
been accrued, rounded to the nearest 5th patient. Rejection and acceptance points are
defined using the methods of [28], however, in a phase II trial after stage 1, only the
acceptance point is of interest at the interim analysis. The alternative would be rejected
after stage 1 of accrual if the number of patients with a response is less than or equal to
the smallest integer greater than
a1 = n1 · RRHA − Z1−α √(N · RRHA(1 − RRHA)), (3.4)
where n1 is the sample size after the first stage. At the end of the trial, after n = n1 +n2
patients are accrued, the null hypothesis would be rejected if the number of observed
responses is at least
r2 = (n1 + n2) · RRH0 + Z1−α √(N · RRH0(1 − RRH0)) + 1. (3.5)
In our example, the parameters RRH0 = 0.05, RRHA= 0.20, α = 0.10 and β = 0.10
are defined by the investigators, thus, the total N is calculated as
N = [(Z1−β √(RRHA(1 − RRHA)) + Z1−α √(RRH0(1 − RRH0))) / (RRHA − RRH0)]²
= [(1.28 · √(0.20 · 0.80) + 1.645 · √(0.05 · 0.95)) / (0.20 − 0.05)]²
≈ 33.7.
One might then round up the total sample size to n = 35, with an interim analysis
occurring after, n1 = 15 patients are accrued. At stage 1, one would accept H0 and stop
the trial assuming treatment inactivity if
a1 = n1 · RRHA − Z1−α √(N · RRHA(1 − RRHA))
= 15 · 0.20 − 1.645 · √(35 · 0.20 · 0.80)
< 0,
thus, if 0 responses are observed at the end of stage 1. After the second stage, which is
at the trial conclusion, one would reject H0 if
r2 = (n1 + n2) · RRH0 + Z1−α √(N · RRH0(1 − RRH0)) + 1
= (15 + 20) · 0.05 + 1.645 · √(35 · 0.05 · 0.95) + 1 (3.6)
= 4.87, (3.7)
thus, if 5 or more responses are observed.
Extensions of this design can be made to account for k ≥ 2 interim analyses but, given
the short duration of phase II trials, practical concerns limit the number of interim
analyses to one. These designs are criticised because they are not optimal, and one might
use more patients than needed. Further, the designs are constructed using asymptotic
estimation procedures, yet with modern computing power analysis is conducted using
exact calculations, and accrual targets need to be met exactly.
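The calculations in equations (3.1), (3.4) and (3.5) can be collected into one short function. The Python sketch below uses the same normal quantiles as the worked example (1.645 and 1.28); function and variable names are illustrative:

```python
from math import sqrt, ceil

def fleming_two_stage(p0, pa, z_alpha, z_beta, n1, n_total):
    """Equations (3.1), (3.4), (3.5): asymptotic total N, the stage-1
    acceptance point a1 (accept H0 if responses <= a1), and the final
    rejection point r2 (reject H0 if responses >= r2)."""
    n_exact = ((z_beta * sqrt(pa * (1 - pa)) + z_alpha * sqrt(p0 * (1 - p0)))
               / (pa - p0)) ** 2
    a1 = n1 * pa - z_alpha * sqrt(n_total * pa * (1 - pa))
    r2 = n_total * p0 + z_alpha * sqrt(n_total * p0 * (1 - p0)) + 1
    return n_exact, a1, r2

# Worked example from the text: RR_H0 = 0.05, RR_HA = 0.20, n1 = 15, N = 35.
n_exact, a1, r2 = fleming_two_stage(0.05, 0.20, 1.645, 1.28, n1=15, n_total=35)
```

This reproduces the results above: N ≈ 33.7 (rounded up to 35), a1 < 0 (so one stops for inactivity only with 0 responses at stage 1), and r2 ≈ 4.87 (so H0 is rejected with 5 or more responses).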
Pocock Design [27]
When designing a phase III clinical trial, there are two designs which are frequently used
for setting up the α-spending function, those being the O’Brien-Fleming and Pocock
designs. While the Fleming design in the previous section is a simple modification of the
O’Brien-Fleming design to allow for use in a phase II setting, a similar modification can
be performed to allow the Pocock design to be used similarly.
The main difference between the two designs is the amount of α spent at each analysis,
with rejection of H0 being more difficult early on using the O'Brien-Fleming design.
The O'Brien-Fleming design specifies that one would reject H0 at test g whenever
√((n1 + n2 + · · · + ng)/N) · Yg(p0) > Z1−α, where Yg is the normal approximation test
statistic at test g. Conversely, the Pocock design specifies that one rejects H0 at any
analysis whenever Yg ≥ c, where c = Zαp for a nominal level αp chosen such that the
overall test has size α. For an analysis with two stages, with α = 0.10, the corresponding
value of c is approximately 1.53, corresponding to a nominal αp = 0.062. As a result,
one could propose the following design
for a similarly constructed phase II trial, with 15 patients accrued in stage 1 followed by
another 20 in stage 2:
Accept H0 at the end of stage 1 if P(X ≤ x | RRHA) ≤ 0.062, which corresponds
to accepting H0 if the number of responses is x = 0. Reject H0 at the end of stage 2 if
P(X ≥ x | RRH0) ≤ 0.062, which corresponds to rejecting H0 if the number of responses
is x ≥ 5.
Thus, for these particular design parameters, the Pocock and O'Brien-Fleming designs
are identical, but this is not always the case. (For example, if a trial testing RRH0 = 0.05
vs RRHA = 0.20 were conducted in two stages of 21 patients each, the Pocock trial
would accept H0 if 0 or 1 of 21 patients had a response in stage 1 and reject H0 if
4 or more of 42 had a response, whereas the O'Brien-Fleming design would accept H0 at
stage 1 with 0/21 responses and reject at stage 2 only if 5 or more responses were
seen.)
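The two decision rules above can be recovered from the nominal level 0.062 by exact binomial calculation; a Python sketch (helper names are illustrative):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0 for k < 0."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def pocock_rules(n1, n_total, p0, pa, nominal_alpha):
    """Largest x with P(X <= x | pa, n1) <= nominal_alpha (accept H0 at
    stage 1) and smallest x with P(X >= x | p0, n_total) <= nominal_alpha
    (reject H0 at the end of stage 2)."""
    accept = max((x for x in range(n1 + 1)
                  if binom_cdf(x, n1, pa) <= nominal_alpha), default=None)
    reject = min(x for x in range(n_total + 1)
                 if 1 - binom_cdf(x - 1, n_total, p0) <= nominal_alpha)
    return accept, reject
```

With n1 = 15, N = 35, RRH0 = 0.05, RRHA = 0.20 and nominal level 0.062, this yields the rules stated above: accept H0 with 0 responses at stage 1, and reject H0 with 5 or more responses at the end of the trial.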
Simon Design [9]
Simon used a computer program to calculate trial designs that satisfy RRH0 , RRHA, α,
and β and to identify optimal designs using exact calculations. Two designs were defined
as optimal, the two-stage design having the lowest expected sample size under the null
hypothesis, ESS(H0), was called the optimal design, and the two-stage design which had
the smallest total sample size, SSTOT , was called the minimax design. The computer
program started by selecting a starting N using
N = RR(1 − RR) [(Z1−α + Z1−β) / (RRHA − RRH0)]² (3.8)
where RR = (RRH0 + RRHA)/2. By starting at a total sample size just smaller than
N, a search was conducted over all possible stage 1 sample sizes n1 ∈ (0, N − 1) and
rejection regions r1 ∈ (0, n1) and r ∈ (r1, N). The design having the smallest SSTOT
was defined as the minimax design. The process was then repeated for each successively
larger SSTOT until the ESS(H0) was consistently increasing. The design which had the
smallest ESS(H0) was then declared the optimal design.
For RRH0 = 0.05, RRHA= 0.20, α = 0.10, β = 0.10 the optimal design is as follows:
accrue 12 patients in stage 1 and accept H0 (reject treatment as inactive) if 0 responses
are observed. If one or more patients have a response, accrue 25 additional patients.
Accept H0 and deem the treatment inactive if 3 or fewer of the total 37 patients have a
response, but reject H0 and deem the treatment as potentially of interest if 4 or
more of the 37 total patients have a response. The exact β for this design is calculated
by

β = Σ_{i=0}^{r1} C(n1, i) h^i (1 − h)^{n1−i}
  + Σ_{i=r1+1}^{r} C(n1, i) h^i (1 − h)^{n1−i} Σ_{j=0}^{r−i} C(n − n1, j) h^j (1 − h)^{n−n1−j},

where h = RRHA and C(n, i) denotes the binomial coefficient. Numerically,

β = C(12, 0) · 0.8^12
  + C(12, 1) · 0.2 · 0.8^11 · [C(25, 0) · 0.8^25 + C(25, 1) · 0.2 · 0.8^24 + C(25, 2) · 0.2^2 · 0.8^23]
  + C(12, 2) · 0.2^2 · 0.8^10 · [C(25, 0) · 0.8^25 + C(25, 1) · 0.2 · 0.8^24]
  + C(12, 3) · 0.2^3 · 0.8^9 · C(25, 0) · 0.8^25
  = 0.069 + 0.020 + 0.008 + 0.001
  ≈ 0.098.

The exact α for this design is calculated as
α = 1 − Σ_{i=0}^{r1} C(n1, i) p0^i (1 − p0)^{n1−i}
  − Σ_{i=r1+1}^{r} C(n1, i) p0^i (1 − p0)^{n1−i} Σ_{j=0}^{r−i} C(n − n1, j) p0^j (1 − p0)^{n−n1−j},

where p0 = RRH0. Numerically,

α = 1 − C(12, 0) · 0.95^12
  − C(12, 1) · 0.05 · 0.95^11 · [C(25, 0) · 0.95^25 + C(25, 1) · 0.05 · 0.95^24 + C(25, 2) · 0.05^2 · 0.95^23]
  − C(12, 2) · 0.05^2 · 0.95^10 · [C(25, 0) · 0.95^25 + C(25, 1) · 0.05 · 0.95^24]
  − C(12, 3) · 0.05^3 · 0.95^9 · C(25, 0) · 0.95^25
  = 1 − 0.540 − 0.298 − 0.063 − 0.005
  ≈ 0.094.
The minimax design states that one should accrue 18 patients in stage 1 and accept
H0 (reject the treatment as inactive) if 0 responses are observed. If one or more patients
have a response, accrue 14 additional patients. Accept H0 and deem the treatment
inactive if 3 or fewer of the total 32 patients have a response, but reject H0 and deem
the treatment as potentially of interest if 4 or more of the 32 total patients have a response.
The true α for the minimax design is 0.072 and the true β is 0.099.
Simon also gives design optimality characteristics for each design. The probability
of stopping after stage 1, assuming H0 is true, is 0.54 for the optimal design and 0.40
for the minimax design and ESS(H0) is 0.54*12+(1-0.54)*37=23.5 for the optimal design
and 0.40*18+(1-0.40)*32=26.4 for the minimax design. Although these designs are
optimal, some criticisms still remain, notably the requirement to meet accrual targets
exactly and the possibility that neither the optimal nor the minimax design is practically
useful. This latter criticism might occur if both designs have stage 1 sample sizes which
are too small (near 0) or too large (near N).
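Simon's search can be sketched as follows. For speed, the sketch restricts r1 and r to small values and searches a limited range of N, which suffices for this example (a full implementation would, as Simon did, search all values); all helper names are illustrative:

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def cdf(k, n, p):
    """P(X <= k); returns 0 for k < 0."""
    return sum(pmf(i, n, p) for i in range(min(k, n) + 1)) if k >= 0 else 0.0

def reject_prob(n1, n2, r1, r, p):
    """P(reject H0): more than r1 responses in stage 1, more than r in total."""
    return sum(pmf(i, n1, p) * (1 - cdf(r - i, n2, p))
               for i in range(r1 + 1, n1 + 1))

def simon_designs(p0, p1, alpha, beta, n_range, r1_max=3, r_max=6):
    """Enumerate feasible (r1/n1, r/n) designs; return (optimal, minimax)."""
    feasible = []
    for n in n_range:
        for n1 in range(1, n):
            for r1 in range(min(r1_max, n1) + 1):
                for r in range(r1, r_max + 1):
                    if reject_prob(n1, n - n1, r1, r, p0) > alpha:
                        continue
                    if reject_prob(n1, n - n1, r1, r, p1) < 1 - beta:
                        continue
                    pet = cdf(r1, n1, p0)            # early stop under H0
                    ess = n1 + (1 - pet) * (n - n1)  # expected sample size
                    feasible.append((r1, n1, r, n, ess))
    optimal = min(feasible, key=lambda d: d[4])          # smallest ESS(H0)
    minimax = min(feasible, key=lambda d: (d[3], d[4]))  # smallest N
    return optimal, minimax
```

For RRH0 = 0.05, RRHA = 0.20, α = 0.10 and β = 0.10 this recovers the designs quoted above: optimal 0/12 then 3/37 with ESS(H0) ≈ 23.5, and minimax 0/18 then 3/32 with ESS(H0) ≈ 26.4.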
Compromise Designs [10] [74]
Given that neither the optimal nor the minimax design of Simon may be practically useful,
Jung provided a way of selecting from a set of designs which are a compromise between
these two. In the original 2001 paper, Jung et al used graphical methods to select
a design and provided a downloadable JAVA program to assist in this selection. For each
N ∈ [SSTOTmin, SSTOTopt], where SSTOTmin is the total sample size for the minimax
design and SSTOTopt is the total sample size for the optimal design, the design with the
minimum ESS(H0) amongst all designs satisfying the error constraints is selected. The
values are then plotted for each design where the horizontal axis is the SSTOT and the
vertical axis is the ESS(H0). By exploring this plot, one can choose designs which may
have more practically useful design characteristics.
The method of selecting designs was formalised using Bayesian methods in the 2004
paper, and the JAVA program was updated to identify admissible designs. Once possible
designs are plotted, one can think of admissible designs as those which in some way
minimise the two optimality criteria. Graphically, this process was described by connecting
candidate designs between the optimal and minimax designs using a convex hull; any
design on this convex hull would be deemed admissible [75].
The more formal Bayesian framework is as follows. One can draw a straight line
q·SSTOT + (1 − q)·ESS(H0) = ρ on the (SSTOT, ESS(H0)) plane for any q ∈ [0, 1],
where SSTOT is the total sample size and ESS(H0) is the expected sample size under
the null hypothesis. This line has slope −q/(1 − q) and intercept ρ/(1 − q), where ρ is
the Bayes risk. By starting from a small ρ and moving the line upwards, the first design
touched is an admissible, Bayes design, with Bayes risk ρ∗, where ρ∗/(1 − q) is the
intercept of the line. One can weight the optimality criteria according to the relative
merit of each criterion through q ∈ [0, 1], and any design which is a Bayes design for
some q is considered admissible.
For the parameters outlined in this section, this procedure provides two additional
admissible designs. The first design accepts H0 if 0/15 patients have a response in stage
1 or ≤ 3/33 patients at the end of the trial have a response. The second admissible
design would accept H0 if 0/13 patients have a response after stage 1 or ≤ 3/35 patients
at the end of the trial have a response. However, which design to use is then based on
subjective opinion, so it is imperative to define the design prior to starting the trial.
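The Bayes-design selection can be sketched directly. The (SSTOT, ESS(H0)) pairs below are for the four designs discussed: the minimax and optimal values are those given in the text, while the values for the two compromise designs are computed here from their decision rules (0/15 then ≤3/33, and 0/13 then ≤3/35) under H0: RR = 0.05. The sweep over q is a simple grid approximation:

```python
def bayes_design(candidates, q):
    """Design minimising the Bayes risk q * SS_TOT + (1 - q) * ESS(H0)."""
    return min(candidates, key=lambda d: q * d[1] + (1 - q) * d[2])

def admissible_designs(candidates, steps=1000):
    """Sweep the weight q over a grid on [0, 1]; every design that is a
    Bayes design for some q is admissible."""
    return {bayes_design(candidates, i / steps)[0] for i in range(steps + 1)}

# (name, SS_TOT, ESS(H0)); the two compromise ESS values are computed from
# PET(H0) = 0.95^15 and 0.95^13 respectively.
designs = [("minimax", 32, 26.44),
           ("Jung 1",  33, 24.66),
           ("Jung 2",  35, 23.71),
           ("optimal", 37, 23.49)]
```

Sweeping q recovers all four designs as admissible: q near 1 selects the minimax design, q near 0 the optimal design, and intermediate q values select the two compromise designs, matching the convex-hull description above.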
Repeated Confidence Intervals [31]
Jennison and Turnbull suggest the use of confidence intervals in evaluating interim and
final trial results. One could stop a phase II trial early and reject a treatment as unin-
teresting if the upper bound of the associated confidence interval is less than the RRHA.
It is argued that since confidence intervals are an estimation technique, there is no ad-
justment necessary in terms of precision for confidence intervals performed after interim
analyses. While valid, it is also true that when confidence intervals are used for
decision-making purposes, including whether to continue accruing to a study,
later-constructed confidence intervals are affected by prior decisions. An α-spending
function is thus proposed [30], as is done for phase III trials, to adjust the width of
the confidence interval for prior decisions. In this manner, there is considerable
flexibility if the accrual targets are not met.
For the trial described, one might aim to accrue 35 patients with an interim analysis
after 15 patients are accrued. However, it is possible that the accrual target after stage I
was missed, and the interim analysis was performed after the 16th patient was accrued.
One might specify an α-spending function which mimics the Pocock design [27], the
Pocock design being more conducive to stopping early. Thus, at the interim analysis,
45.7% of the data is accrued, and confidence intervals would be constructed at the nominal
α = 0.028985 level of interest. The exact 94.2% confidence interval if 0 responses are
observed would have upper bound at 0.199, thus, one would not continue. If one continued
to stage 2, the upper bound of a confidence interval constructed when 35 patients are
accrued would be 0.145, 0.187 or 0.226 with 1, 2 or 3 responses observed, respectively, and
one would reject the treatment as ineffective if 0, 1 or 2 responses were observed.
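The upper bounds just described can be checked directly, assuming the exact intervals are of the Clopper-Pearson type. A short Python sketch (the thesis's own software is written in R; scipy and the function name here are mine, purely for illustration):

```python
from scipy.stats import beta

def exact_upper_bound(x, n, tail):
    """Exact (Clopper-Pearson) upper confidence bound for a binomial rate:
    the largest rate still consistent with observing <= x responses in n
    patients at one-sided level `tail`."""
    if x == n:
        return 1.0
    return beta.ppf(1.0 - tail, x + 1, n - x)

# Interim analysis: 0/16 responses at the nominal alpha = 0.028985 level
ub = exact_upper_bound(0, 16, 0.028985)
```

The computed bound falls just below the targeted rate of 0.20, matching the decision to stop described above.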
Criticisms of this technique include questions as to whether a 1- or 2-sided confidence
interval should be used. For phase II studies, in accordance with the 1-sided hypothesis
testing designs, 1-sided confidence intervals are generally used for decision making, but
final results are often reported as 2-sided confidence intervals. While adjustments to the
width of a confidence interval due to prior decisions (i.e. the amount of α spent) make sta-
tistical sense, these adjustments are not intuitive to non-statisticians. There are further
uncertainties regarding which confidence intervals to report at the end of a trial when an
interim decision is over-ruled due to practical concerns, or for secondary outcomes. Al-
ternatively, some argue that one should not make decisions based on confidence intervals
at all and that these intervals should be used for estimation purposes only, in which case
the same criticisms of the Gehan method would apply. These unresolved issues are often reasons why
the confidence interval approach is used less frequently than other methods.
Bayesian Designs [16] [76]
Although there are many Bayesian designs which could be used for statistical analysis
of a phase II clinical trial [77] [78] [79], the design as described by Thall and Simon
[16] remains one of the most user-friendly designs. Using this methodology, the outcome
of interest is again response, and the response distribution is defined as a Beta-binomial
with parameters α and β. To elicit priors, the investigators are asked to provide the mean
response rate of standard therapy (µs = αs/(αs + βs)), the width of a 90% probability
interval, W90, for the standard treatment response rate, and a targeted improvement δ.
In essence the investigators must provide their belief of the standard treatment response
rate and the strength of this belief. The statistician must formulate this belief into a
proper prior distribution, in discussion with the investigator.
The distribution of the experimental treatment is defined as πe ∼ β(αe, βe) and
guidelines are suggested for eliciting the prior distribution. Specifically, let ce = αe + βe
and 2 ≤ ce ≤ 10, where ce describes the strength of prior knowledge of the experimental treatment,
and the mean of πe is equal to µs + δ/2. This formulation leads to prior parameters for the
experimental distribution of αe = ce(µs + δ/2) and βe = ce[1 − (µs + δ/2)].
The posterior probability is

λ(x, n; πs, πe, δ0) = Pr(Θs + δ0 < Θe | Xn = x)   (3.9)

= ∫ from 0 to 1−δ0 of [1 − F(p + δ0; αe + x, βe + n − x)] f(p; αs, βs) dp,

where F and f denote the Beta distribution function and density with the indicated parameters.
Since Θe|Xn ∼ β(αe + Xn, βe + n − Xn), a decision rule is defined by the following:
1) If Xn ≥ Un, stop and declare the experimental treatment of interest for further
study, else
2) if Xn ≤ Ln, stop and declare the experimental treatment inactive, else
3) if Ln < Xn < Un treat another patient, where Un is the smallest integer such that
λ(x, n; πs, πe, 0) ≥ pu and Ln is the largest integer such that λ(x, n; πs, πe, δ) ≤ pl and
pu, pl are pre-defined probabilities. Again, if Xn ≥ Un one might practically desire to
continue treating additional patients as is custom.
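Equation (3.9) is a one-dimensional integral and can be evaluated numerically. A Python sketch (the thesis's own implementation is in R; scipy and the helper name are mine), using as defaults the priors derived in the worked example below, β(5, 95) for the standard treatment and β(0.25, 1.75) for the experimental treatment:

```python
from scipy import stats
from scipy.integrate import quad

def post_prob(x, n, a_e=0.25, b_e=1.75, a_s=5.0, b_s=95.0, delta0=0.15):
    """lambda(x, n; pi_s, pi_e, delta0) = Pr(Theta_s + delta0 < Theta_e | X_n = x),
    with Theta_e | X_n ~ Beta(a_e + x, b_e + n - x) and Theta_s ~ Beta(a_s, b_s)."""
    post_e = stats.beta(a_e + x, b_e + n - x)     # posterior for experimental rate
    prior_s = stats.beta(a_s, b_s)                # prior for standard rate
    val, _ = quad(lambda p: post_e.sf(p + delta0) * prior_s.pdf(p),
                  0.0, 1.0 - delta0)
    return val
```

Evaluating `post_prob` at candidate (x, n) pairs is all that is needed to check stopping boundaries such as those quoted for the worked example below.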
Thus, for this trial, the standard treatment may be believed to have a mean response
rate of 0.05 and W90 may extend from 0.02 to 0.10. This would give a distribution
similar to that shown in Figure 3.1 which can be easily generated using most statistical
packages. The figure was generated from a β(5, 95) distribution which gives a W90 of ≈
0.08, ranging from 0.020 to 0.099 and mean µs = α/(α + β) = 5/100 = 0.05.
Figure 3.1: Potential Distribution for Standard Treatment Response Rate
The investigator might also deem that the targeted improvement in response rate is
15% (consistent with the frequentist designs) and, for the first trial with this treatment
combination, that there is little to no prior knowledge of the experimental treatment effect.
Thus, one might reasonably set ce = 2. From this, the experimental treatment would
have prior distribution β(ce(µs + δ/2), ce[1 − (µs + δ/2)]) = β(2(0.05 + 0.15/2), 2[1 − (0.05 + 0.15/2)]) = β(0.25, 1.75).
According to [16], one might set SSTOT arbitrarily, however in [76], two
suggestions are made — to choose SSTOT such that the width of the posterior credible
interval is less than some value, or to use frequentist power-type calculations to make
sure the false-positive or false-negative rates are within certain limits.
Using the first method, one might set the posterior mean to equal the targeted value
since the posterior distribution is not known a priori. In our study, we are targeting
an improvement of 15% for a total targeted response rate of 20%. Then, if we want a
95% credible interval to have width of less than 0.20, we would get the percentiles of a
β(r + 0.25, n− r + 1.75) distribution, where r/n = 0.20. Here, the maximum sample size
would be 57, although one might round to 60 for simplicity.
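The first suggestion can be checked directly: fix the observed rate at the target r/n = 0.20, form the posterior β(r + 0.25, n − r + 1.75), and scan n until the exact equal-tailed 95% credible interval is narrower than 0.20. A Python/scipy sketch (the thesis's own software is R; the function name is mine):

```python
from scipy.stats import beta

def ci_width(n, target=0.20, a0=0.25, b0=1.75, level=0.95):
    """Width of the equal-tailed credible interval for the response rate,
    assuming the observed rate equals the targeted rate r/n = target."""
    r = target * n                     # treated as continuous, may be non-integer
    post = beta(r + a0, n - r + b0)
    lo = (1.0 - level) / 2.0
    return post.ppf(1.0 - lo) - post.ppf(lo)

# smallest n giving a 95% credible interval narrower than 0.20
n_max = next(n for n in range(10, 200) if ci_width(n) < 0.20)
```

The smallest such n is close to the 57 quoted above; the exact value depends on how the non-integer r = 0.20n is handled.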
Given the relatively large targeted improvement, one might postulate that one would
need strong evidence before stopping a trial early, and stop only when λ ≤ 0.04. In other
words, to stop the study early, we require the probability that the experimental treatment
response rate is superior to the standard treatment response rate by at least 0.15 to be
at most 0.04; equivalently, with probability 0.96 the improvement is less than 0.15.
Using simulation to calculate probability estimates, one would then stop the study and
conclude the experimental treatment is uninteresting if one observes 0 of 5 patients with
a response, 1 of 16, 2 of 25, 3 of 33, 4 of 40, 5 of 48 or 6 of 55.
Practically, one issue with this design is the potential for stopping after only 5 patients
are accrued, but this highlights one of the criticisms of phase II clinical trials, notably that
the perceived minimum beneficial response rate is usually overly optimistic and unrealistic.
One might specify a minimum number of patients needed for treatment prior to stopping
based on these practical concerns.
Looking at the design parameters specified, W90 for the standard treatment has an
upper bound on the response rate<0.10. With 57 patients, a 95% credible interval for
the posterior mean response rate for the experimental treatment has width <0.20. If
the targeted minimum beneficial response rate is 0.20, then the lower bound on the
95% credible interval will be >0.10. As a result, if these distributions are correct, then
there would be almost complete separation between the standard treatment response
rate distribution and the posterior experimental treatment response rate distribution
with as few as 57 patients. Since advancements in oncology treatments are generally
quite small, this is unrealistic, and one is generally more interested in smaller, more
difficult to find improvements. However, practical concerns limit the sample size of these
trials and require this improbable targeted improvement.
While Bayesian designs have many proponents, there are some criticisms of these
methods as well. Notably, there is no specific test at the end of the trial, which some
argue is necessary as investigator judgements are likely clouded by their personal obser-
vations of their own patients (usually a subset of the entire trial sample). There are
ways to define a test to decide whether a treatment is worthy of further study or not,
however, multiple testing issues are then the same as in frequentist designs. Addition-
ally, the subjectivity in defining priors, and the rejection probability, has many critics,
although it is noted that the hypotheses and error rates for a frequentist design are also
subjective, and more restrictive, than those of Bayesian designs. Practically, Bayesian designs
are often disliked by non-statisticians who are unfamiliar with the terminology and with the
fact that these designs usually require a larger sample size, so a statistician who favours
Bayesian designs must have a good working relationship with the primary investigator to
approve a design. The supposed main advantage to using Bayesian designs is the explicit
incorporation of prior information, however, this is also the main disadvantage as many
investigators would argue that one should only consider results directly from the trial.
3.3 Multinomial Designs
The previous designs are all univariate, in that they all use only best response as the
primary outcome. This can have major implications, especially with the recent empha-
sis in clinical oncology on molecularly-targeted agents (MTAs) as opposed to cytotoxic
agents. MTAs have a different mechanism of action and these agents may be effective in
preventing tumour growth as opposed to simply shrinking the tumour. The most notable
instance of this occurring is in the use of Sorafenib in renal cell cancer [80]. In this break-
through trial, the response rate of patients treated with Sorafenib was only 4%, which
would ordinarily be clinically uninteresting. However, 70% of patients had stable disease
which lasted 12 weeks or more. While the statistical trial design was based on response
alone, and would recommend accepting H0 and deeming the treatment uninteresting,
the extremely high stable disease rate was deemed noteworthy enough to advance the
agent to phase III confirmatory testing along with other agents showing promise [81],
and it was later approved by health regulatory bodies.
While it is possible to create a single univariate outcome, such as defining a good
outcome as either objective response OR stable disease, this is not always satisfactory.
One problem with using a single outcome measure, like response, is exemplified in Fig-
ure 3.2. This figure shows three hypothetical tumour responses to treatment as per the
RECIST criteria [17]. According to RECIST, the maximum diameters of all measur-
able tumours (≥2mm in diameter) are summed at each response evaluation time. After
treatment, each evaluation is compared with baseline. If the summed value is ≥ 30%
smaller compared to baseline, the patient is defined as having a partial response. A
complete response occurs only when all the lesions have completely disappeared, but due
to the small number of complete responses which occur, partial and complete response
outcomes are almost always combined when evaluating treatment efficacy. Patients who
have a growth of ≥ 20% as compared to the nadir, or the smallest sum of diameters, are
considered to have disease progression, and will almost always be taken off-study
at this point. Patients who are neither in response, nor progressing, are classified as
having stable disease. The best response at any time during the study is generally used
for determining treatment efficacy. Additionally, to have a best response in the study, a
response evaluation must be confirmed with a second measurement of response at least
4 weeks after the first evaluation.
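The rules just described can be summarised in a short classifier. This Python sketch is illustrative only: it ignores complete responses and the 4-week confirmation requirement, the function name is mine, and the cut-offs follow the simplified description above:

```python
def best_response(baseline, sums):
    """Simplified best-response classification from summed tumour diameters.
    PR: sum falls >= 30% below baseline; progression: sum grows >= 20% above
    the nadir (patient comes off study); otherwise SD."""
    nadir = baseline
    for s in sums:
        if s >= 1.2 * nadir:        # disease progression: patient off-study
            break
        if s <= 0.7 * baseline:     # partial response
            return "PR"
        nadir = min(nadir, s)
    return "SD"
```

Under these rules a trajectory like patient 1's in Figure 3.2 (sharp shrinkage followed by regrowth) still records PR as the best response, which is precisely the anomaly criticised below.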
[Figure: tumour shrinkage/growth compared with baseline (y-axis, 0.6–1.4) over evaluations 0–6 for Patient 1 (PR), Patient 2 (PR) and Patient 3 (SD)]
Figure 3.2: Tumour Shrinkage and Growth for Three Hypothetical Patients Over Time
In Figure 3.2, all three patients in this hypothetical example have tumour shrinkage,
but at different speeds. Patient 1 has an immediate shrinkage, followed by substantial
growth. Patient 2 has substantial shrinkage which continues for a lengthy period of time
before stabilising, and patient 3 has slow but steady shrinkage until the 3rd evaluation, at
which time they stabilise. Also note that censoring may occur due to patient withdrawal,
excessive toxicity, because the patient remains on treatment at the time of analysis, or
completion of treatment as per protocol. According to RECIST [17], both patients 1 and
2 have a partial response since they experienced ≥ 30% shrinkage, and both would be
considered superior to patient 3, who has stable disease as the best response. Clearly
this does not agree with clinical practice. The response of patients 2 and 3 to treatment
would generally be considered superior to the response of patient 1. Further, if censoring
occurred at, say, evaluation 3, only patient 1 would be counted as having a PR and
would be thought to have a better outcome than either of the other 2 patients. Thus,
the typical method of analysis, to use a single outcome measure based on best response,
is insufficient.
Alternative statistical designs which use both response and stable disease endpoints
simultaneously have been proposed and are described below. While these designs are an
improvement in situations where one might be interested in both outcomes, there still
remains work to make these designs correspond to the clinical thought process.
3.3.1 Zee Design [11] [82]
Prior to the design proposed by Zee et al., a number of designs were described [83]
[84] [85] [15] which have multiple outcomes, generally toxicity and response; however, they
did not use a multivariate design, but rather a dual-binomial outcome, i.e. a design having two
separate outcomes. Zee et al. proposed a multinomial design based on the belief that
an ineffective treatment would not only produce few responses, but would also produce
many early progressions. Thus, both the response rate and the early progression rate need to be
defined for decision criteria to be constructed. To mimic the design thought process of
the univariate designs as closely as possible, one might compare H0: response=0.05 AND
early progression=0.60 versus HA:response=0.20 AND early progressions=0.40. The use
of AND in both H0: and HA: is purposeful even though Zee et al. used OR in the
definition of HA. This is because the construction of boundaries for statistical testing,
and as a result the error rates, are calculated under the assumption of both response and
early progression hypotheses being true, not one or the other [86]. Error rates were set
at α = 0.10 and β = 0.10. For this design, one must use the program provided by the
authors.
Using this design, one might perform a trial with 30 patients in total, with an interim
analysis after 15 patients are accrued. One would accept H0 and stop the
trial if i) 0 patients respond and 8 or more early progressions are observed, or ii) 13 or
more early progressions are observed regardless of the number of responses at the interim
analysis. After stage 2, with 30 patients, one would reject H0 if i) 1 or 2 responses and
≤ 20 early progressions are observed, if ii) 3 responses and ≤ 21 early progressions are
observed or if iii) 4 responses and any number of early progressions are observed. The
trial-wide α = 0.1116, β = 0.0848 and the expected sample size is 20.118, with the
probability of stopping at stage 1 being 0.6588, assuming H0 is true.
This design is criticised in that if the treatment prevents tumour growth for all pa-
tients, but does not shrink the tumour, the trial design will accept H0 at stage 2 and the
treatment will be deemed inactive. As an example, the Sorafenib in renal cell carcinoma
trial previously discussed [80] had a response rate of 0.04 and a 12-week stable disease
rate of 0.70. With a response rate of 0.04, the probability of observing 0 responses in
30 patients is > 0.29. So, if 0 responses and 21 (70%) stable diseases were observed in
the trial, the Zee design would incorrectly conclude that the treatment was ineffective.
While proponents argue that this is an extreme case, it is precisely this type of extreme
case that one wishes to identify using a multinomial design.
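The probability quoted for the Sorafenib scenario is a one-line binomial calculation; sketched in Python (scipy used purely for illustration):

```python
from scipy.stats import binom

# chance of observing 0 responses among 30 patients when the
# true response rate is 4%, i.e. 0.96**30
p_zero = binom.pmf(0, 30, 0.04)
```

This evaluates to roughly 0.294, consistent with the "> 0.29" quoted above.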
Using a Zee multinomial design, one would need to observe at least a few responses in
addition to the large number of stable diseases to declare efficacy (see Figure 3.3). After
stage 1 (Figure A), one would reject H0, accept H0 or continue to stage 2. After stage 2
(Figure B), one would reject H0, thus accepting the hypothesis of some drug activity,
only if one observed at least some measure of response. Using this design would therefore
result in an incorrect conclusion of drug inactivity if a treatment is cytostatic. The calculation
of the α and β errors in the multinomial design of Zee is based on two exact scenarios,
i.e. the test is based on comparing a response rate of x1 AND an early progressive disease
rate of x2 versus a response rate of y1 AND an early progressive disease rate of y2. What
investigators are often interested in is the situation where the alternative is y1 OR y2.
Figure 3.3: Decision Process for Zee [82] multinomial design. Figure A is the decision
rule after stage 1 and Figure B is the decision rule after stage 2
3.3.2 Trinomial Design [12]
A design similar to the Zee design was proposed by Panageas et al. The difference is that
instead of using response and early progressions, the trial was based on complete response
and partial response rates, which can be easily extended to response rate and stable
disease rate. To replicate the null and alternative hypotheses tested for the Zee paper, one
would then test H0:response rate=0.05 AND stable disease rate=0.35 versus HA:response
rate=0.20 AND stable disease rate=0.40. Note the difference between the good stable
disease rate and poor stable disease rate is only 0.05, and as a result this would require
thousands of patients (my computer crashed when attempting to calculate the exact
sample size due to the large memory needed).
Re-setting H0: response rate=0.05 AND stable disease rate=0.20 versus HA: response
rate=0.15 AND stable disease rate=0.35 results in a trial design where 10 patients are
accrued in stage 1 and a further 17 patients in stage 2, for a total of 27. At stage 1, one
would accept H0 (and declare the treatment as uninteresting) if one observes i) 0 responses
and ≤3 stable diseases, ii) 1 response and ≤1 stable disease, or iii) 2 responses and 0
stable diseases. At the end of stage 2, one would reject H0 and declare the treatment
of interest if one observed i) 0 responses and ≥10 stable diseases, ii) 1 response and ≥8
stable diseases, iii) 2 responses and ≥7 stable diseases, iv) 3 responses and ≥5 stable
diseases, or v) ≥ 4 responses and any number of stable diseases. The expected sample
size and probability of stopping after stage 1 under H0 is 17.19 and 0.59 respectively.
The boundaries for this design are constructed in a manner appropriate for the ques-
tions of interest; however, only optimal designs are provided in the manuscript and the
accompanying computer program. The calculations are complex and not easily computable,
so if accrual targets are not met, there is no easy way to calculate an appropriate bound-
ary. Further, the α and β errors are valid only for the joint multinomial hypothesis, and
might not be accurate for a marginal hypothesis if one of the component alternatives
is true but the other is not. Finally, this design weights response and stable disease rates
equally; however, clinicians may put more emphasis on a patient who has a response.
3.3.3 Dual-Response Design [13]
Lu et al. note that clinicians place different emphasis on patients who have complete
response compared with patients who have partial response. Additionally, they note
that phase II oncology clinical trials have generally been performed using total response
as the outcome of interest, where the number of total responses is the number of par-
tial+complete responses. As a result, they have proposed a design, using exact calcu-
lations, which compares the rate of total responses and the rate of complete responses
simultaneously. They note, similar to Panageas et al, that while the design is proposed
based on total response and complete response outcomes, it can be easily revised to a
design based on total response and total response+stable disease outcomes; in fact, the
Chapter 3. Potential Trial Designs for Phase II Oncology Clinical Trials52
example provided in the paper is based on this revised dual outcomes.
It is important to note that the number of total responses is necessarily ≤ the
number of total responses+stable diseases. As such, the hypotheses to be tested
change from

H0 : RR_{H0} and SD_{H0} versus HA : RR_{HA} or SD_{HA}   (3.10)

to

H0 : RR_{H0} and (RR + SD)_{H0} versus HA : RR_{HA} or (RR + SD)_{HA}.   (3.11)
A rejection region R1 = (X_RR ≥ r_RR or X_{RR+SD} ≥ r_{RR+SD}) is constructed, which corre-
sponds to this dual response hypothesis. Given this rejection region, the type I error can
be calculated as Pr(R1 | RR_{H0} and (RR + SD)_{H0}), and the type II error is calculated
at the value of (RR_{HA}, (RR + SD)_{HA}) which maximises 1 − Pr(R1 | RR_{HA} ∪ (RR + SD)_{HA}).
Since

RR_{HA} ∪ (RR + SD)_{HA} = (RR_{HA} ∩ (RR + SD)_{HA}) + ((RR_{HA})^c ∩ (RR + SD)_{HA}) + (RR_{HA} ∩ ((RR + SD)_{HA})^c),   (3.12)

where H^c denotes the complement of the event H, to calculate the maximum β one can
simply calculate the maximum β over each of the three regions defined on the right side of
equation 3.12. Therefore,

min over (RR_{HA} ∩ ((RR + SD)_{HA})^c) of Pr(R1 | RR, RR + SD) ≥ Pr(X_RR ≥ r_RR | RR = RR_{HA}) = 1 − β_RR.   (3.13)

Similarly,

min over ((RR_{HA})^c ∩ (RR + SD)_{HA}) of Pr(R1 | RR, RR + SD) ≥ Pr(X_{RR+SD} ≥ r_{RR+SD} | RR + SD = (RR + SD)_{HA}) = 1 − β_{RR+SD},   (3.14)

and

min over (RR_{HA} ∩ (RR + SD)_{HA}) of Pr(R1 | RR, RR + SD) ≥ max(1 − β_RR, 1 − β_{RR+SD}).   (3.15)

As a result of equations 3.13–3.15, the minimum power over the alternative region
RR_{HA} ∪ (RR + SD)_{HA} is 1 − max(β_RR, β_{RR+SD}) and the maximum β error is max(β_RR, β_{RR+SD}).
In designing a two-stage study, an early stopping region is formed after n1 patients
are accrued with bounds XRR ≤ sRR and XRR+SD ≤ sRR+SD for two points sRR, sRR+SD.
Thus, for fixed α, β_RR and β_{RR+SD}, a computer program was used to construct acceptable
designs by examining all possible choices of n1, n, s_RR, s_{RR+SD}, r_RR, r_{RR+SD}, with minimax
and optimal designs chosen; the minimax design is the acceptable design with the smallest n
and the optimal design the one with the smallest ESS(H0).
Returning to our example, we wish to test H0: RR_{H0} = 0.05 AND (RR + SD)_{H0} = 0.25
versus HA: RR_{HA} = 0.15 OR (RR + SD)_{HA} = 0.50, and we will set maximum error rates
of α ≤ 0.10 and β = max(βRR, βRR+SD) ≤ 0.20. The β errors are inflated as there are
two possible errors (i.e. if response alone is sufficient, or if response+stable disease is
sufficient) and the sample size would become unreasonably large if one was too restrictive.
It is noted that one could put different β errors on each alternative rejection scenario.
Using the software provided by the authors, one can compute the minimax and optimal
designs under these constraints.
Specifically, the minimax design would be to accrue 29 patients in stage 1 and continue
to stage 2 if one observed either 2 or more responses or 12 or more response+stable
diseases. In stage 2, an additional 15 patients would be accrued for a total of 44 patients,
and one would reject the null hypothesis (i.e. deem the treatment of interest for further
study) if one observed 5 or more responses or 16 or more response + stable diseases. The
expected sample size using this design is 35.6 and the probability of stopping after stage
1 assuming the null is true is 0.56. By contrast, the optimal design would stop after stage
1 if one observed 0 or 1 responses and 8 or less response + stable diseases after accrual
of 22 patients. In stage 2, 27 additional patients would be accrued for a total of 49, and
one would reject the null hypothesis if one observed 5 or more responses or 18 or more
response + stable diseases. This design has an expected sample size under H0 of 31.0
and a probability of stopping after stage 1 of 0.67.
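The stage-1 operating characteristics can be checked by enumerating the trinomial outcomes (response, stable disease, progression). A Python sketch under one reading of the stated minimax rule (stop and accept H0 if at most 1 response and at most 11 responses+stable diseases among the first 29 patients), with H0 rates RR = 0.05 and SD = 0.20; the function name is mine:

```python
from math import comb

def stage1_stop_prob(n1, max_rr, max_tot, p_rr, p_sd):
    """P(X_RR <= max_rr and X_RR + X_SD <= max_tot) when each of n1 patients
    independently responds (p_rr), has stable disease (p_sd), or progresses."""
    p_pd = 1.0 - p_rr - p_sd
    total = 0.0
    for r in range(min(max_rr, n1) + 1):
        for s in range(n1 - r + 1):
            if r + s > max_tot:
                break                      # larger s only increases the total
            total += (comb(n1, r) * comb(n1 - r, s)
                      * p_rr**r * p_sd**s * p_pd**(n1 - r - s))
    return total

p_stop = stage1_stop_prob(29, 1, 11, 0.05, 0.20)
ess = 29 + 15 * (1 - p_stop)   # expected sample size under H0
```

This reading of the boundary reproduces, approximately, the stopping probability of 0.56 and the expected sample size of 35.6 quoted above.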
As can be seen, the total sample size (44 or 49) is quite a bit larger using this
design than in the previous designs. This, however, is largely due to the response rate
comparison (0.05 versus 0.15). Even a Simon minimax two-stage design based on this
comparsion with α = 0.10 and β = 0.20 requires 44 patients at the end of stage 2. If we
increased the response rate of interest (under HA) to be 0.20, then the minimax design
requires a much more feasible 38 total patients and the optimal design 46 patients at the
end of stage 2, even with α = β = 0.10.
3.3.4 Weighted Response Design [14]
An alternative approach to the multinomial design was proposed by Lin and Chen, who
use weighted likelihood methods to design trial parameters and then exact methods to
construct optimal designs. In this situation, patients can have one of 3 possible outcomes,
say response, stable disease or progressive disease. Setting p0i to be the probabilities of
having each outcome i = 1, 2, 3 under H0 and p1i be the probabilities of having each
outcome i = 1, 2, 3 under the alternative, then the trinomial likelihood is defined as in
equation (3.16) at the end of a trial when x1 patients had the first outcome (response),
x2 patients had the second outcome (stable disease) and n − x1 − x2 patients had the
third outcome (progressive disease).
Λ = (p01/p11)^x1 (p02/p12)^x2 [(1 − p01 − p02)/(1 − p11 − p12)]^(n−x1−x2).   (3.16)
With appropriate re-arrangement it can be shown that the trinomial log-likelihood is a mono-
tone function of the number of partial responses plus the number of complete responses
multiplied by some weight, as shown in equation (3.18). Writing p0 = p01 + p02 and
p1 = p11 + p12,

log(Λ) = x1 log(p01/p11) + x2 log(p02/p12) + (n − x1 − x2) log[(1 − p0)/(1 − p1)]

= x1[log(p01/p11) − log((1 − p0)/(1 − p1))] + x2[log(p02/p12) − log((1 − p0)/(1 − p1))] + n log[(1 − p0)/(1 − p1)]

= x1[log(p01 p1/(p11 p0)) − log((1 − p0)p1/((1 − p1)p0))] + x2[log(p02 p1/(p12 p0)) − log((1 − p0)p1/((1 − p1)p0))] + C.   (3.17)

By setting ω = (θ − µ)/(θ − ν), with θ = log[p1(1 − p0)/(p0(1 − p1))], µ = log[p1 p01/(p0 p11)],
ν = log[p1 p02/(p0 p12)] and C = n log[(1 − p0)/(1 − p1)], equation 3.17 becomes

log(Λ) = x1(µ − θ) + x2(ν − θ) + C

= −(θ − ν)[((θ − µ)/(θ − ν))x1 + x2] + C

= −(θ − ν)(ωx1 + x2) + C   (3.18)
and hence, the log-likelihood is just a monotone function of the weighted score ωx1 + x2.
The primary question becomes how to define ω. Note that the use of partial responses and
complete responses can easily be replaced by the number of (prolonged) stable diseases
and responses as needed depending on the tumour and trial requirements.
Interpreting equation (3.18) and defining ω can be further simplified by noting that the
above quantities are simply ratios and proportions of the values of interest. Specifically,
defining r as the odds ratio of having a response under HA relative to H0, r = p1(1 − p0)/(p0(1 − p1)),
and r0 and r1 as the proportion of responses under H0 and HA respectively, r0 = p01/p0 and
r1 = p11/p1, then ω depends on (p0, p1, p01, p02, p11, p12) only through (r, r0, r1). Thus

ω = (θ − µ)/(θ − ν) = [log(r) − log(r0/r1)] / [log(r) − log((1 − r0)/(1 − r1))] = log(r r1/r0) / log[r(1 − r1)/(1 − r0)].   (3.19)
Limiting ω > 1 restricts attention to the situation where a complete response is deemed
more important than a partial response (or equivalently, where a response is deemed more
important than a stable disease). By defining p0, p1, r0 and r1 there is then a unique ω,
which Lin and Chen call the likelihood ratio weight; it is the increased weight associated
with a patient having a complete response compared with a partial response.
To design a clinical trial, one would accrue n1 patients in the first stage and n2 in the
second stage for a total sample size of n = n1 + n2. A weighted score s is calculated
as ω times the number of complete responses plus the number of partial responses ob-
served (i.e. s = ω·x1 + x2). One simply needs to find critical values at the end of
the first stage, s1, and at trial termination, s, such that the error rates α and β are satis-
fied given (n1, n, s1, s, ω, p0, p1, r0, r1). Lin and Chen searched over all possible values of
(n1, n, s1, s, ω) for selected (p0, p1, r0, r1) to find optimal and minimax designs. However,
a more realistic approach may be to fix n1 and n prior to the start of a trial based on
practical concerns.
To use this design in our example, set p0 = 0.25, r0 = 1/5, p1 = 0.50, r1 = 3/10, where
r0 and r1 are chosen following the recommendation of Lin and Chen. Then
r = p1(1 − p0)/(p0(1 − p1)) = 3, and from equation (3.19) we have:

ω = log(r r1/r0) / log[r(1 − r1)/(1 − r0)] = log[3(3/10)/(1/5)] / log[3(7/10)/(4/5)] = log(4.5)/log(2.625) = 1.56.
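Equation (3.19) reduces the weight to the three summaries (r, r0, r1); a small Python sketch (the function name is mine):

```python
from math import log

def lr_weight(p0, p1, r0, r1):
    """Likelihood ratio weight omega of equation (3.19): p0, p1 are the
    good-outcome rates under H0 and HA; r0, r1 the proportions of those
    good outcomes that are responses."""
    r = p1 * (1.0 - p0) / (p0 * (1.0 - p1))   # odds ratio under HA vs H0
    return log(r * r1 / r0) / log(r * (1.0 - r1) / (1.0 - r0))
```

For the example above, `lr_weight(0.25, 0.50, 1/5, 3/10)` gives 1.56; `lr_weight(0.4, 0.6, 1/8, 1/6)` gives the 1.44 quoted later for the slow-growing tumour scenario, and setting r0 = r1 returns ω = 1, recovering the unweighted Simon-type design.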
If n1 = 15, n2 = 15 is fixed, there are still infinitely many valid s1, s, which can be calculated using
a program provided by the authors. One such design which satisfies the α and β errors
is to accept H0 at the end of stage 1, terminate the trial and declare the treatment as
uninteresting if s1 ≤ [5.12, 5.56). This corresponds to accepting H0 if there are 0 responses
and 5 or less stable diseases, 1 response and 3 or less stable disease or 2 responses and
0 stable diseases. At the end of stage 2, one would reject H0 and declare the treatment
of interest if one observes s ≥ (11.12, 11.24], which corresponds to observing 0 responses
and 12 or more stable diseases, 1 response and 10 or more stable diseases, 2 responses
and 9 or more stable diseases, 3 responses and 7 or more stable diseases, 4 responses and
5 or more stable diseases, 5 responses and 4 or more stable diseases, 6 responses and 2 or
more stable diseases, 7 responses and 1 or more stable disease, or 8 or more responses.
This is summarised in Table 3.3.
  Stage 1                           Stage 2
  s1            n1   Response \ SD   s               n    Response \ SD
  [5.12, 5.56)  15   0 \ 5           [11.12, 11.24)  30   0 \ 11
                     1 \ 3                                1 \ 9
                     2 \ 0                                2 \ 8
                                                          3 \ 6
                                                          4 \ 4
                                                          5 \ 3
                                                          6 \ 1
                                                          7 \ 0
Table 3.3: Acceptance Region for Hypothetical Trial using Lin and Chen Design [14],
comparing H0 : RR = 0.05 and SD = 0.25 versus HA : RR = 0.15 and SD = 0.50
While the value of ω is defined based on the values (p0, p1, r0, r1), one could arbitrarily
set ω to be any value identified by the investigators. For example, an investigator may
deem that a prolonged stable disease is equally important as a response, and thus set ω =
1. In this case, the optimal designs are just those as shown by Simon [9]. Alternatively, if
the tumour is a slow-growing tumour and the possibility of a response is extremely rare,
investigators might expect a lot of patients with stable disease. The investigators might
wish to design a study based on p0 = 0.4, r0 = 1/8, p1 = 0.60, r1 = 1/6, which would
result in ω = 1.44. However, the investigators might arbitrarily assign ω = 2, 3, 4 since
they subjectively value a response much more than 1.44 times a stable disease, or,
similarly, they may believe that 2 responses (score s = 2.88 under ω = 1.44) should be
more interesting than 3 stable diseases (score s = 3), which this ω does not reflect.
There are problems with this method. First, the design fails to capture the extreme
case where the experimental treatment slows progression of the tumour without increasing
the number of responses. Here the proportion of responses might be assumed to be less
than what would occur under the HA distribution when the treatment is active, resulting
in r0 > r1 and ω < 1, which is not defined in the design. Second, the treatment may
produce responses amongst a certain subset of the population and have no effect on
others, so that the number of responses changes but the number of stable diseases does
not. Referring again to our example and Table 3.3, one sees that 7 responses and 0
stable diseases would result in accepting H0, thus deeming the treatment uninteresting.
This is a response rate of 23%, which would be of considerable interest since the
hypotheses were based on comparing response rates of 5% with 15%. Thus, the design fails
in the extreme cases, which is precisely why investigators would be interested in a
multinomial design (where one or the other outcome happens, but not both). Third, only
optimal designs are described in the paper, and, although possible, it is not a
straightforward procedure to calculate design parameters if accrual targets are not met.
This is even more pronounced at the end of stage 1, where more than one design might be
possible, and continuation or stopping of the trial might depend on which design one
chooses.
Alternatively, one advantage of this design as compared with the other multinomial
designs is the ease of calculating and interpreting p-values and, to a lesser degree,
confidence intervals. A p-value P(S > s | H0) = 1 − P(S ≤ s | H0) can be calculated with
relative ease, and one would simply find bounds (LL, UL) such that
P(S > s | x) ≥ α ∀ x ∈ (LL, UL) to obtain a (1 − α)100% confidence interval. Due to the
discreteness of the trinomial distribution, this would be an approximate (1 − α)100%
confidence interval.
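The tail probability above can be computed by direct enumeration of the trinomial outcomes. The following is a minimal sketch (the thesis software is written in R; this Python version, and the function and argument names in it, are illustrative only), assuming a score of the form S = ω·(responses) + (stable diseases):

```python
from math import comb

def score_pvalue(n, p_resp, p_sd, omega, s_obs):
    """P(S > s_obs | H0) for the weighted score S = omega*X + Y, where
    (X, Y) = (responses, stable diseases) is trinomial(n; p_resp, p_sd)."""
    p_other = 1.0 - p_resp - p_sd
    total = 0.0
    for x in range(n + 1):               # number of responses
        for y in range(n - x + 1):       # number of stable diseases
            if omega * x + y > s_obs:
                total += (comb(n, x) * comb(n - x, y)
                          * p_resp**x * p_sd**y * p_other**(n - x - y))
    return total
```

For instance, with ω = 1 the score simply counts patients with either outcome, so the tail probability reduces to an ordinary binomial tail.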
3.4 Using Finite Markov Chain Imbedding
While classical group sequential methods can be used for many of these designs, particularly
when the primary outcome is simple and straightforward, Markov chains might be
beneficial for designs in which the primary outcome is more complex. For example,
when the primary outcome is binomial, such as response, exact calculations are fairly
straightforward. Thus, the Simon designs, or even the Fleming design, which is based on
asymptotic outcomes, are easily calculable. When the outcomes become more complicated,
it is more difficult to calculate probabilities; the Zee and Panageas designs
demonstrate this. Determining what is more extreme is not simple, and the available
computer programs only cover the optimal cases. When trials occur and accrual targets
are not met, or the extreme cases are observed, the required statistics cannot be
calculated by these programs. Thus, substantial additional work would be needed to
obtain the values of interest.
Further, in situations where one puts different emphasis on different outcomes, Markov
chain methods would be superior to classical methods. One could weight a subject's
outcomes such that a response is weighted at a certain level, say x; a stable disease is
assigned a weight of y; but stable disease for 3 or more consecutive observations is
assigned a weight of z. At an interim analysis, a subject may have only stable disease,
but later develop a response or stable disease for 3 or more consecutive observations.
Thus, a subject's status at a later analysis might change from their status at an
earlier analysis. This could be difficult for classical methods, but is less difficult
for Markov chain methodology: the probability that a subject will transition from one
state to another at a future evaluation can be estimated from the transitions of the
subjects who are further ahead in their treatment. In situations where one is unsure of
the weights to assign to different states, and where one wants to explore a variety of
weighting schemes, exact calculations could be quite tedious, whereas Markov chain
methods require only a simple modification.
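The weighting scheme just described can be sketched as a simple per-subject scoring rule. The weights and state labels below are illustrative placeholders (the thesis software is in R; this is a hedged Python sketch), with the x, y and z of the text mapped to w_resp, w_sd and w_prolonged_sd:

```python
def weighted_score(history, w_resp=2.0, w_sd=1.0, w_prolonged_sd=1.5):
    """Score one subject's evaluation history (a list of 'R'/'SD'/'PD'/'C').
    A response dominates; 3+ consecutive SD observations earn the
    prolonged-SD weight; any other SD earns the plain SD weight."""
    if "R" in history:
        return w_resp
    run = best = 0
    for state in history:
        run = run + 1 if state == "SD" else 0
        best = max(best, run)
    if best >= 3:
        return w_prolonged_sd
    return w_sd if best >= 1 else 0.0
```

Re-scoring the same histories under a different (w_resp, w_sd, w_prolonged_sd) triple is then a one-line change, which is the flexibility argued for above.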
Chapter 4
Examples and Simulation Set-up
An illustration of the methods proposed in this thesis is crucial for a thorough
understanding of the pros and cons of the use of Markov chains. For illustration purposes,
a previously performed trial is analysed using finite Markov chain imbedding methods.
The trial was chosen because of its controversial nature and the ambiguity of its
final results. Different investigators hold different beliefs about the future drug
development of the agent investigated in this trial, as discussed below, and it is in
this context that finite Markov chain imbedding is felt to be of most benefit.

To illustrate these methods, a simulation study was performed. Although the exact
probabilities can be calculated using distributional theory, the use of a simulation study
allowed for investigation of many different designs, outcomes and assumptions
simultaneously. The primary strength of finite Markov chain imbedding is the flexibility
it allows, in that one is able to investigate multiple designs, outcomes and assumptions
at the same time. This flexibility allows greater understanding of the data and promotes
agreement between investigators by showing the results that would be obtained under
different scenarios. The simulation is facilitated by flexible statistical code which
can be run on most computers in a relatively short time. This chapter describes the
clinical scenario of interest and the simulation that was performed.
4.1 Phase II Clinical Trial of CCI-779 (temsirolimus)
in Neuroendocrine Carcinoma
Thirty-seven patients were accrued to a multi-centre, single-arm phase II clinical trial of
patients with neuroendocrine carcinoma, including both pancreatic islet cell and carcinoid
histologies. Results have recently been published [72]. This study was chosen as an
example particularly because of the complexity of the disease, the controversy surrounding
the determination of efficacy in this trial, highlighted by two letters to the editor
following the trial publication [87] [88], and the failure of the statistical design to
adequately assess potential drug activity.

4.1.1 Trial Description

In this trial, patients were treated with a novel MTA (temsirolimus), which had
previously been studied in a number of phase I trials and in phase II trials in other
disease sites. The safety profile of the MTA was believed to be satisfactory, and there
was promising evidence of anti-tumour activity in neuroendocrine carcinoma. Patients
were treated in an outpatient setting, receiving once-weekly doses of the treatment via
a 30-minute infusion. A cycle of treatment was defined as 28 days; thus, a patient
received four treatments per cycle. Patients were to continue treatment until disease
progression, withdrawal of consent, severe adverse event, or removal from study at
physician discretion. At the time of study publication, 5 patients remained on study and
were still being treated. As is common with this disease, progression is relatively slow,
with 48% of patients progression-free at 6 months and over 70% alive at 1 year after
treatment start. Thus, time to progression and overall survival are generally considered
poor primary endpoints for phase II trials, given the necessity of keeping trials
relatively short (total trial duration from start of accrual to publication for this
study was around 2 years).
The primary efficacy analysis was based on best objective tumour (partial or complete)
response as defined by the RECIST criteria [17]. The study was designed with
the primary endpoint, response rate [RR], using H0: RR = 0.05 versus HA: RR = 0.25, error
limits of α = 0.05 and β = 0.10, and a modified version of the Simon minimax design
[9]. Accrual was conducted in two stages, with an interim analysis planned after 15
patients were accrued. The design specified that if 2 or more objective responses were
observed amongst the first 15 patients, accrual to stage 2 was to be conducted. If 1 or
fewer responses were observed, accrual was to be terminated and the treatment declared
inactive. In stage 2, a total of 30 patients were to be accrued, with non-rejection of H0
(declaring the treatment inactive) if 3 or fewer responses were observed, and rejection
of H0 (declaring the treatment of interest for further study) if 4 or more responses were
observed of the total 30 patients.
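The stated decision rules can be checked with exact binomial calculations. A Python sketch follows (the thesis software is in R; function names here are illustrative, while the design constants n1 = 15, r1 = 1, n = 30, r = 3 are those stated in the protocol):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_stage_reject_prob(p, n1=15, r1=1, n=30, r=3):
    """P(reject H0): continue only if stage-1 responses exceed r1,
    then reject H0 if total responses over n patients exceed r."""
    n2 = n - n1
    total = 0.0
    for x1 in range(r1 + 1, n1 + 1):                    # continue to stage 2
        p_stage2 = sum(binom_pmf(x2, n2, p)
                       for x2 in range(max(0, r + 1 - x1), n2 + 1))
        total += binom_pmf(x1, n1, p) * p_stage2
    return total
```

Evaluating at p = 0.05 gives the design's attained type I error and at p = 0.25 its power, which should be approximately the protocol's α = 0.05 and 1 − β = 0.90.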
4.1.2 Trial Results
After the first 15 patients were accrued, an interim analysis was conducted: 0 patients
had a PR, but 8 patients had SD, a number of whom had prolonged SD (defined
as a patient not having progressed after the completion of cycle 6). It is interesting to
note that the last patient accrued had SD at the time of the interim analysis but later
developed a PR in cycle 6. The number of patients having SD greatly exceeded
expectations, and some of the patients with SD had had severe worsening of disease prior
to trial entry, so the investigators overruled the statistical design and accrual
continued to stage 2. At study completion, 37 patients had been accrued, but one patient
was ineligible due to rapidly progressing disease and died prior to receiving any
treatment. Thus, a total of 36 patients were evaluable, 6 more than initially planned,
of whom 3 patients had PR, 20 patients had SD, 8 patients had PD, and 5 patients were
inevaluable (due to severe adverse events occurring prior to the first objective
post-treatment tumour evaluation).
The authors concluded that "temsirolimus appears to have only modest
activity..." and that "the results of this study do not warrant further investigation of
this drug as a single agent in this patient population." However, they state that
"evaluation of temsirolimus, in combination with other targeted agents ... should be
considered" [72]. In a letter to the editor, O'Donnell and Ratain argue that the results
"...suggest drug activity beyond the natural course of the disease" [87] and state that
it is not the single-agent drug which should be abandoned, but the trial design. In
response, the authors defend the use of single-arm trials [88], and there remains
considerable discussion about the usefulness of single-arm trials.
Statistically, the issue is as follows. The statistical design is based on a single
endpoint, response rate, and on hypotheses believed to be of interest. Using frequentist
methodology, one is not able to deviate from this trial design, which is based on a
single primary outcome; however, when evaluating a trial for potential efficacy,
clinicians and researchers evaluate all outcomes, including secondary outcomes. Thus,
although statistically one must reject the alternative hypothesis and deem the treatment
not of interest for further study, the apparent efficacy of the treatment based on
secondary outcomes may still be intriguing to researchers. That secondary outcomes are
important is demonstrated clearly by the experience with sorafenib in renal cell cancer
[89]: sorafenib was approved by the FDA, Health Canada and other agencies for treatment
of renal cell cancer on the basis of prolonged disease stabilisation even though the
primary outcome, response rate, was very low (< 5%).
4.1.3 Note regarding response rates
It is noted that the number of confirmed responders does not equal the number of
responses reported in the published manuscript [72]. This is because the data used for
this analysis were obtained after the data used in the manuscript, and in the interim
one patient with SD developed a PR. Thus, this PR is included in the thesis results but
not in the manuscript results. Further, in the manuscript, only 3 patients are listed as
censored and 10 patients as having PD, whereas in this analysis there are 5 patients
censored and 8 with PD. The reason is that 2 patients did not have an on-study objective
response evaluation and are thus inevaluable as per the RECIST criteria; however, they
did fail treatment, as one had symptomatic progression and the other died of disease
before their first objective evaluation. Thus, for simplicity, this analysis is
performed using their objective tumour measurements, whereas in the published
manuscript they are considered as having PD.
4.2 Implementation of Markov Chain Methods
One of the key components of phase II clinical trials in oncology is that patients have
their tumour burden measured at regular intervals — in this trial, this was to be
conducted after every 2 cycles of treatment, or approximately every 56 days. Response
was defined as per the RECIST criteria, briefly outlined in subsection 2.5.3. Briefly,
patients are classified based on the growth of the tumour as having either complete
response [CR], partial response [PR], stable disease [SD] or disease progression [PD].
Some patients may be removed from the study for other practical reasons, such as
withdrawal of consent, adverse events, or the discretion of the treating physician.
These patients are thus censored [C] in terms of their response status. According to the
RECIST criteria, one must measure a response, followed by a subsequent confirmation
measurement, to have a declared objective response. In terms of a Markov chain, this
means a patient must transition into an unconfirmed response [UR] state and then
transition into a confirmed response state. Finally, due to the small number of complete
responses observed in cancer clinical trials, the CR and PR states are generally
combined into a single response [R] category.
4.2.1 RECIST criteria
One can design a transition matrix to describe the potential transitions allowable
under the RECIST criteria; this was described earlier and shown in matrix 2.8. Since
complete and partial responders are combined for this study, the transition matrix
reduces to matrix 2.9. Here, all patients commence treatment in the ∅ state. This is
done for practical reasons: for the neuroendocrine cancer trial, pre-treatment response
status was not measured — it is generally assumed most patients will be progressing,
hence the need to enter a clinical trial. However, if one has the data, a generalisation
is possible for trials in which patients may enter in different states, not necessarily
the ∅ state, which would then reduce the transition matrix even further.
At the first tumour measurement (cycle 2), patients transition from the ∅ state to one
of the other states. Since this is the first measurement, patients cannot have a
confirmed response, but can only transition into the unconfirmed response state.
Similarly, patients cannot have SD and be off-treatment simultaneously; thus, the
transition from ∅ to this state has probability 0. At each subsequent tumour
measurement, further transitions are possible. A patient presently in the UR state can
only transition to the confirmed response state, or go off-study with a best response of
SD (a patient whose best objective response is an unconfirmed response is deemed as
having SD only). Patients in the SD state can transition into the UR state, remain in
the SD state, or be removed from treatment with a best response of SD (i.e. state SDoff).
The other states, R, SDoff, PD and C, are all absorbing states: once entered, a patient
can never leave this state since one evaluates the best objective response observed.
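These allowed transitions can be encoded directly. The sketch below simulates a single patient's path; the transition probabilities are illustrative placeholders, not estimates from the trial, and the thesis implementation itself is in R:

```python
import random

# Illustrative transition probabilities (placeholders, not trial estimates).
# On-study states: start (the ∅ state), UR, SD; absorbing: R, SDoff, PD, C.
P = {
    "start": {"UR": 0.05, "SD": 0.55, "PD": 0.25, "C": 0.15},
    "UR":    {"R": 0.70, "SDoff": 0.30},   # confirm, or off-study with SD
    "SD":    {"UR": 0.05, "SD": 0.60, "SDoff": 0.15, "PD": 0.15, "C": 0.05},
    "R":     {"R": 1.0},
    "SDoff": {"SDoff": 1.0},
    "PD":    {"PD": 1.0},
    "C":     {"C": 1.0},
}

def simulate_path(n_evals, rng=random):
    """One patient's sequence of states over n_evals tumour evaluations."""
    state, path = "start", []
    for _ in range(n_evals):
        nxt = rng.choices(list(P[state]), weights=list(P[state].values()))[0]
        path.append(nxt)
        state = nxt
    return path
```

Simulating many such paths and tabulating the final states gives the end-state counts used in the analyses below.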
Patient data for the neuroendocrine trial are given in Appendix A, and are listed by the
appropriate state space under the RECIST criteria in Appendix B. Patient 021-001 had
baseline tumour lesions summing to 457 mm, with subsequent measurements of 456 mm,
429 mm, 426 mm, 423 mm and 444 mm. At the last evaluation, new lesions were discovered,
so the patient was classified as having PD. Thus, this patient started in the ∅ state
and transitioned to state SD, where they remained until the final measurement (5th
evaluation), when they transitioned to state SDoff. Conversely, patient 021-015 had
baseline tumour lesions summing to 194 mm, followed by measurements of 161 mm, 146 mm,
136 mm, 135 mm, 125 mm and so on, until the 20th cycle, when they finally had disease
progression. In terms of Markov chains, this patient transitioned from the ∅ state to
state SD at the first evaluation, since they had only 1 − 161/194 = 17% shrinkage. They
remained in SD at the next transition, with 25% shrinkage, finally achieving an
unconfirmed response at evaluation 3 with 30% shrinkage, thus transitioning from SD to
UR. This response was confirmed at the next evaluation, so the patient transitioned from
UR to R. Having a confirmed response, this patient remains in this absorbing state.
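The per-evaluation arithmetic in these examples (percent change of the lesion sum, with a response at roughly 30% shrinkage from baseline and progression at roughly 20% growth from the nadir) can be sketched as follows. This is a simplified, RECIST-style rule only: confirmation, new lesions and the trial's exact rounding conventions are not handled:

```python
def classify(baseline, sums):
    """RECIST-style label for each follow-up sum of lesion diameters.
    PD: >= 20% growth from the nadir; PR: >= 30% shrinkage from baseline;
    otherwise SD. New lesions and response confirmation are not modelled."""
    states, nadir = [], baseline
    for s in sums:
        if s >= 1.2 * nadir:          # progression takes precedence
            states.append("PD")
        elif s <= 0.7 * baseline:
            states.append("PR")
        else:
            states.append("SD")
        nadir = min(nadir, s)
    return states
```

For patient 021-015's first two follow-up sums (161 mm and 146 mm against a 194 mm baseline), this rule yields SD at both evaluations, matching the description above.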
A summary of transitions for all patients over the first 5 evaluations is found in Table
D.1 in the Appendix. Of all 36 patients, 1 transitioned at the first evaluation from ∅ to
UR, 22 went from ∅ to SD, 8 had immediate PD (∅ to PD) and 5 were censored with no
objective measurement; this is described in the first row of Table D.1. At the next
evaluation, the patient with UR transitioned to a confirmed response, and 6 patients
with SD came off-study, thus transitioning to SDoff (2nd row). The end-state proportions
are described in Table D.2: three patients had a response, 7 patients remained on-study
after the 4th evaluation with stable disease, 12 patients were off-study with stable
disease as their best response, 8 patients had PD and 5 were censored.
Similar results for the first 15 patients, corresponding to the time of the interim
analysis, are found in the same two Tables D.1 and D.2 in the rows titled "Interim". Of
the 15 patients, 8 had an initial evaluation of SD, 6 had PD and 1 was censored. The
only patient with a response amongst the first 15 first had UR at evaluation 3, shown in
the row titled "Interim eval 3", and finally transitioned to a confirmed response at
evaluation 4, as shown in the row titled "Interim eval 4". The end-state probabilities
for states R, UR, SD, SDoff, PD and C are 1/15, 0, 3/15, 4/15, 6/15 and 1/15.
The tabular format of displaying results will be continued throughout the remainder
of this chapter and in the reporting of results.
4.3 Simulation
All statistical analyses in the simulation were performed using R version 2.1.1
(http://www.r-project.org) on a personal computer with an Intel Pentium 4 CPU running at
3.20 GHz (3192 MHz) under the Microsoft Windows XP Professional operating system,
version 5.1 (Microsoft Corporation, Redmond, WA). Each individual calculation (p-value
or conditional power) took less than 20 seconds.
P-values were calculated as if at the time of study completion, that is, assuming all 36
patients had been accrued. One thousand simulations were performed using transition
matrices defined under H0, and for each outcome, the number of simulations in which the
simulated number of patients in the end-states of interest was greater than or equal to
the actual trial number was counted. The p-value was simply this count divided by the
number of simulations. For example, using the RECIST criteria, the end-state of interest
would be confirmed objective response, and the p-value is the proportion of simulations
under H0 with 3 or more objective responses; the number 3 reflects that, at the end of
the actual trial, 3 patients had a confirmed response.
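As a sketch of this procedure, reduced to a single end-state probability under H0 (the thesis code is in R and works with full transition matrices; the Python below and its names are illustrative):

```python
import random

def sim_pvalue(n_patients, observed, p_endstate, n_sims=1000, rng=random):
    """Proportion of H0 simulations with at least `observed` patients in
    the end-state of interest, here reduced to one end-state probability."""
    hits = 0
    for _ in range(n_sims):
        count = sum(rng.random() < p_endstate for _ in range(n_patients))
        if count >= observed:
            hits += 1
    return hits / n_sims

random.seed(42)
pval = sim_pvalue(36, observed=3, p_endstate=0.05)  # e.g. P(R) = 0.05 under H0
```

With 1,000 simulations, the Monte Carlo standard error of a p-value near 0.25 is about 0.014, which motivates the comparison runs with 500 simulations mentioned below.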
Conditional power estimates were calculated as if analysed at the interim analysis, that
is, after 15 patients had been accrued and partially evaluated, with an expected total
of 36 patients. Thus, the conditional power is estimated based on the available data for
15 patients and the assumption that a further 21 patients would be accrued. One thousand
simulations of 36 patients were performed, in which the data comprised the actual
observed data for the first 15 patients and simulated data for the additional 21
patients. The assumed future data were generated in two ways: first, by assuming future
data followed some hypothesised distribution HA, and second, by assuming that future
data followed a distribution similar to the results of the 15 patients already observed.
Since the simulation allowed for different outcome definitions in different scenarios,
both H0 and HA had to be defined individually for each simulation scenario; defining
these distributions is described in the next subsection. For each simulation, a p-value
was calculated, resulting in 1000 p-values, and the conditional power was the proportion
of p-values ≤ 0.05. Additional simulations were performed using 500 simulations for
comparison, and with the sample size changed from 36 to 54.
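A reduced sketch of this conditional power calculation, using a single binomial end-state so that each completed trial's p-value is an exact binomial tail under H0 (illustrative Python; the thesis implementation is in R and simulates full state paths):

```python
import random
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

def conditional_power(obs_resp, n_obs, n_total, p_future, p_null,
                      alpha=0.05, n_sims=1000, rng=random):
    """Simulate future patients under an assumed response rate; the exact
    binomial tail under H0 supplies each completed trial's p-value."""
    n_future = n_total - n_obs
    hits = 0
    for _ in range(n_sims):
        future = sum(rng.random() < p_future for _ in range(n_future))
        pval = binom_tail(obs_resp + future, n_total, p_null)
        hits += pval <= alpha
    return hits / n_sims
```

Here conditional_power(1, 15, 36, p_future=0.25, p_null=0.05) corresponds to assuming future patients respond at the HA rate; replacing p_future with the observed interim rate gives the second variant described above.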
4.3.1 Estimating H0 and HA
As in any research study, the definition of H0 and HA is crucial; in practice, however,
these definitions are often based on the subjective beliefs of the investigators. It is
acknowledged that the investigators defining these hypotheses are experts in their
fields and have substantial experience treating the type of patient who will be enrolled
in these studies; nevertheless, the definition of these hypotheses remains subjective,
and different investigators may believe different hypotheses. For example, although a
new treatment might be of clinical interest if its response rate were 10%, where present
treatments have response rates of < 5%, in practice one might use H0: RR = 0.05 vs
HA: RR = 0.20 to keep the sample size reasonable. This reflects practical limitations, a
desire to perform trials within 1-2 years, and the view of some that it is ethical to
restrict the number of patients exposed. Thus, a range of hypotheses is needed to fully
understand the results of this simulation study, particularly since the methods used are
novel and the endpoints are sometimes defined differently than in previous methods (e.g.
response status is assessed at a specific time).
Initial hypotheses for the simulation study were defined based on the hypotheses in
the protocol, aiming to have end-state probabilities approximately equal to the initially
defined hypotheses, and in consultation with a clinical oncologist familiar with the
study. Afterwards, the hypotheses were subjectively varied based on clinical experience
and consultation with oncologists experienced in phase II clinical trial methodology,
with the aim of investigating the effect of minor changes in the transition matrices,
and hence minor changes to the competing hypotheses. For simplicity, models were most
often assumed to have time-independent transition matrices; however, the ease of using
time-dependent transition matrices is demonstrated in some cases.
4.4 Models Investigated in Simulation
4.4.1 RECIST model
The RECIST model, as described by transition matrix 2.9, was investigated. This
analysis corresponds to the best objective response observed while on treatment.
Transitioning between states is very irregular, as most patients do not improve after
the first transition. In fact, in this study only 3 patients (021-015, 021-022 and
022-029) had a transition after the first transition (excepting transitions from an
unconfirmed response to another state) under this model. Patients who have stable
disease thus cannot end up in a worse state than the stable disease state, even when
they progress. Consequently, there is no difference between a patient who has stable
disease and then immediately progresses and one who has prolonged stable disease. Since
molecularly targeted agents are often believed to be cytostatic, there is a major
clinical difference between a patient who progresses early after having stable disease
and one who progresses many months later. This is one area where present designs fail.
To demonstrate the flexibility of finite Markov chain imbedding, it is important to
model the standard RECIST criteria, and this is done here, but the power of finite
Markov chain imbedding can only be demonstrated in the other contexts described below.
4.4.2 RECIST model evaluating outcomes at different transition times
Since the best objective response using RECIST is determined primarily by the first
transition, a transition time-important RECIST transition matrix, shown in matrix 4.1,
was assumed to demonstrate the power of finite Markov chain imbedding. By
time-important, it is meant that the timing (number of transitions) of the evaluation is
important. With the RECIST criteria, the only change in results occurs if patients
transition from SD to R; thus, the time at which the analysis is conducted has little
effect. In other words, if no patient has a late response, say after evaluation 3, then
the results will be the same regardless of the timing of any analysis conducted after
evaluation 3. However, there might be vastly different interpretations of the results
comparing a trial in which all patients with SD had PD by the 4th evaluation with one in
which a number remained on-study with SD for 10 or more evaluations.
M =
          ∅    R        UR        SD        PD        C
    ∅     0    0        p∅−r      p∅−sd     p∅−pd     p∅−c
    R     0    1        0         0         0         0
    UR    0    pur−r    0         0         pur−pd    pur−c
    SD    0    0        psd−ur    psd−sd    psd−pd    psd−c
    PD    0    0        0         0         1         0
    C     0    0        0         0         0         1
                                                          (4.1)
This transition matrix was considered the primary transition matrix of interest;
subsequent evaluations were performed in which the null hypothesis was modified slightly
and the number of evaluations prior to analysis was increased and decreased. These
adjustments allow investigation of the robustness of the methods.

A further modification excludes response as an absorbing state, since it could be argued
that duration of response is also important; the corresponding transition matrix (4.2)
was also constructed and analysed.
M =
          ∅    R        UR        SD        PD        C
    ∅     0    0        p∅−r      p∅−sd     p∅−pd     p∅−c
    R     0    pr−r     0         0         pr−pd     pr−c
    UR    0    pur−r    0         0         pur−pd    pur−c
    SD    0    0        psd−ur    psd−sd    psd−pd    psd−c
    PD    0    0        0         0         1         0
    C     0    0        0         0         0         1
                                                          (4.2)
4.4.3 Transition Matrices Based on Immediate Changes
Given that Markov chain methodologies incorporate time in their evaluations, a transition
matrix was constructed based on immediate changes. This is potentially the most
beneficial use of Markov chains. Patients transitioned based on the change in tumour
size compared with the previous evaluation: a patient with shrinkage of 5% or more
relative to the previous evaluation was considered as having a response [R] at that
transition; a patient with growth of 5% or more was deemed as having tumour progression
[PD]; a patient with less than 5% growth and less than 5% shrinkage was considered as
having stable disease [SD]; and patients removed from the study for any reason entered
the off-study state [Off]. This transition matrix is shown in matrix (4.3). The timing
of the analysis (i.e. after which evaluation) then becomes a concern, and this is
investigated, as is the effect of modifying the definition of response/progression from
5% to, say, 10%. One could also eliminate the stable disease state and define response
as any shrinkage or no change, and progression as any growth whatsoever, as in matrix
(4.4). This type of transition matrix represents immediate changes and demonstrates
whether a treatment remains active at a given time.
M =
          ∅    R        SD        PD        Off
    ∅     0    p∅−r     p∅−sd     p∅−pd     p∅−o
    R     0    pr−r     pr−sd     pr−pd     pr−o
    SD    0    psd−r    psd−sd    psd−pd    psd−o
    PD    0    ppd−r    ppd−sd    ppd−pd    ppd−o
    Off   0    0        0         0         1
                                                          (4.3)

M =
          ∅    R        PD        Off
    ∅     0    p∅−r     p∅−pd     p∅−off
    R     0    pr−r     pr−pd     pr−off
    PD    0    ppd−r    ppd−pd    ppd−off
    Off   0    0        0         1
                                                          (4.4)
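The immediate-change rule is a one-line classification, and making the threshold a parameter gives the 10% variant mentioned above for free. An illustrative Python sketch (the thesis software is in R; names here are hypothetical):

```python
def immediate_state(prev_sum, curr_sum, threshold=0.05):
    """Classify change versus the previous evaluation: R (shrinkage of at
    least `threshold`), PD (growth of at least `threshold`), else SD."""
    change = (curr_sum - prev_sum) / prev_sum
    if change <= -threshold:
        return "R"
    if change >= threshold:
        return "PD"
    return "SD"
```

Setting threshold=0.0 collapses the SD state, corresponding to matrix (4.4).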
4.4.4 Transition Matrices with Different Positive Outcomes
Instead of counting any stable disease, one might define a positive result as prolonged
SD, where prolonged means some set duration of time, e.g. a patient remaining in SD for
3 (see matrix (4.5)) or 4 (see matrix (4.6)) consecutive transitions. Patients can
transition from PD to the first SD state, then to the second SD state, and then to the
third SD state. Patients can also transition into state R at any time upon having a
confirmed response.
M =
          ∅    R        SD3        SD2        SD1       PD
    ∅     0    0        0          0          p∅−sd1    p∅−pd
    R     0    1        0          0          0         0
    SD3   0    psd3−r   psd3−sd3   0          0         psd3−pd
    SD2   0    psd2−r   psd2−sd3   0          0         psd2−pd
    SD1   0    psd1−r   0          psd1−sd2   0         psd1−pd
    PD    0    0        0          0          ppd−sd1   ppd−pd
                                                          (4.5)

M =
          ∅    R        SD4        SD3        SD2        SD1       PD
    ∅     0    0        0          0          0          p∅−sd1    p∅−pd
    R     0    1        0          0          0          0         0
    SD4   0    psd4−r   psd4−sd4   0          0          0         psd4−pd
    SD3   0    psd3−r   psd3−sd4   0          0          0         psd3−pd
    SD2   0    psd2−r   0          psd2−sd3   0          0         psd2−pd
    SD1   0    psd1−r   0          0          psd1−sd2   0         psd1−pd
    PD    0    0        0          0          0          ppd−sd1   ppd−pd
                                                          (4.6)
Alternatively, one might require consecutive minor shrinkages of, say, 5% or more for a
positive result (see matrix (4.7)); patients with prolonged stable disease would then
not be of interest. This model is relevant if one assumes that drug activity corresponds
to some shrinkage, and it eliminates the vagaries which can occur for a very slow-growing
tumour that might not appear to be growing on consecutive evaluations. One might then
regard a response, or 2 consecutive minor shrinkages of, say, 5% or more, as an
indicator of activity.
M =
          ∅    R        MR2        MR1        SD        PD
    ∅     0    0        0          p∅−mr1     p∅−sd     p∅−pd
    R     0    1        0          0          0         0
    MR2   0    pmr2−r   pmr2−mr2   0          0         0
    MR1   0    pmr1−r   pmr1−mr2   0          pmr1−sd   pmr1−pd
    SD    0    0        0          psd−mr1    psd−sd    psd−pd
    PD    0    0        0          0          0         1
                                                          (4.7)
4.4.5 Multi-binomial transition matrices
While multinomial outcomes measuring the same quantity, such as response or stable
disease, are analysed using the previous matrices, one might also be interested in
multiple outcomes which are not measures of the same thing, i.e. multiple binomial
outcomes, such as toxicity and tumour size. For example, a treatment may be of interest
only if the number of responses is high and the number of adverse events is low.
Additional outcomes might include overall survival, time to progression, a molecular
marker (PSA or CA125), or even a quality-of-life indicator. A model which might
represent this is matrix (4.8). For simplicity, the unconfirmed response state is
dropped, and patients are classified as being in one of the following states: response
with no toxicity [R], stable disease with no toxicity [SD], off-study without having had
a response or toxicity [Off], off-study due to toxicity [Tox], or off-study due to
toxicity after a prior response [R & Tox]. The number of patients with response is then
R + (R & Tox), and the number of patients with toxicity is Tox + (R & Tox).
M =
             ∅    R        SD        Off       Tox       R & Tox
    ∅        0    0        p∅−sd     p∅−off    p∅−tox    0
    R        0    pr−r     0         0         0         pr−rtox
    SD       0    psd−r    psd−sd    psd−off   psd−tox   0
    Off      0    0        0         1         0         0
    Tox      0    0        0         0         1         0
    R & Tox  0    0        0         0         0         1
                                                          (4.8)
4.5 Calculation of p-values
The exact distribution of a random variable which can be imbedded into a finite Markov
chain is

    P(Xn,k = x) = π0 ( ∏_{t=1}^{n} Λt ) U′(Cx)                    (4.9)

where π0 is the initial probability vector of the Markov chain, Λt is the transition
matrix for the t-th step, and U′(Cx) defines a proper partition of the state space for
calculating the probability of interest (2.7). A p-value is the probability of
observing, under H0, results as extreme as or more extreme than the observed data. Thus,
in the simulation, a p-value is obtained by calculating ∑_{x=xd}^{∞} P(Xn,k = x | H0)
after n transitions, where xd is the observed number of patients in the states defined
by the partition U′(Cx).
When using a transition matrix as defined by the RECIST criteria (combining
complete + partial responses into a single response category), the state space is
ω = {∅, R, SD, SDoff, PD, C} and, for the classical definition of response, the proper
partition of interest is C(x) = 010000. The initial probability vector is π0 = 100000;
thus, the only thing that differs between the observed data and the data distributed under
H0 is the transition matrix Λt. In other words, at transition time t, if there were 3
observed patients in state R, then the p-value is Σ_{x=3}^{∞} P(X_{t,k} = x | H0). In
the simulation study, the H0 data are generated by the simulation and the p-value is thus
calculated as (Σ_{x=3}^{∞} (X_{t,k} = x | sim))/m, where m is the number of generated
data points in the simulation (500 or 1000).
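This simulation-based p-value calculation can be sketched in code. The thesis's program is written in R; below is an illustrative Python sketch in which the three-state chain, its transition probabilities and the observed counts are invented for illustration, not the trial's actual H0:

```python
import random

def simulate_counts(trans, start, targets, n_patients, n_steps, rng):
    """Simulate one trial under H0: walk each patient through the chain
    for n_steps transitions and count how many end in a target state."""
    states = list(trans)
    count = 0
    for _ in range(n_patients):
        s = start
        for _ in range(n_steps):
            s = rng.choices(states, weights=[trans[s][x] for x in states])[0]
        if s in targets:
            count += 1
    return count

def simulated_p_value(trans, start, targets, n_patients, n_steps,
                      observed, m=1000, seed=1):
    """p-value = fraction of H0-simulated trials whose count of patients
    in the target states is at least the observed count."""
    rng = random.Random(seed)
    hits = sum(simulate_counts(trans, start, targets, n_patients, n_steps, rng)
               >= observed for _ in range(m))
    return hits / m

# Hypothetical null chain: patients start in SD and may respond or progress.
H0 = {"SD": {"SD": 0.6, "R": 0.05, "PD": 0.35},
      "R":  {"SD": 0.0, "R": 1.0,  "PD": 0.0},
      "PD": {"SD": 0.0, "R": 0.0,  "PD": 1.0}}

p = simulated_p_value(H0, "SD", {"R"}, n_patients=36, n_steps=5, observed=3)
```

Varying the transition matrix or the target-state set reproduces the different designs discussed in Section 4.6 without changing this machinery.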
4.6 Methods Used for Investigating Different Outcomes
For each transition matrix of interest, one of 9 possible analytical methods was selected
to determine the primary outcome of interest, and a decision rule was constructed based on
the primary outcome. To evaluate the flexibility of finite Markov chain imbedding,
multiple decision rules were constructed for a single primary outcome.
The first decision rule, method 1, is based on defining the primary outcome using a
single response state. For example, one might define the primary outcome as objective
response as per some criteria, and would construct a decision rule based on whether the
number of observed objective responses is more extreme than expected. Thus, at a given
transition time t, one can calculate the p-value as (Σ_{x=q}^{∞} (X_{t,k} = x | sim))/m,
where m is the number of simulated data points and q is the observed number of patients
who were in state k at transition t. In matrix 2.9, state k = 2. Method 1 coincides with
the classical phase II design and is the most frequently used primary outcome definition
(as is found in the Simon [9], Fleming [8] or Jung [10] [74] designs).
Method 2 is based on defining the primary outcome as the sum of 2 or more response
states. The primary outcome might be the sum of objective responses + unconfirmed
responses, the sum of objective responses + unconfirmed responses + stable diseases, the
sum of patients remaining on-study at some point (for example, patients in state response,
unconfirmed response or stable disease, but not in state stable disease but off-study) and
so on. The difference between this method and method 1 is in the definition of the
end-state probability vector. For method 2 and matrix 2.9, this vector would be
C(x) = 011100, but for matrix 2.8 the vector would be C(x) = 0111100 if one was interested
in response + stable disease. The p-value is calculated as
(Σ_{x=q}^{∞} (X_{t,k} = x | sim))/m; however, in this instance k ∈ {2, 3, 4} for matrix
2.9 and k ∈ {2, 3, 4, 5} for matrix 2.8. Although not explicitly stated as such, these
designs are frequently used at present and would use a slight modification of the designs
as defined in Simon [9], Fleming [8] or Jung [10] [74].
Method 3 is based on defining the primary outcome as any of 2 or more response
states. For example, an investigator might be interested in a novel treatment if the
number of observed objective responses was greater than expected or the number of
stable diseases observed was greater than expected. For matrix 2.9, this means that one
can calculate the p-value as (Σ_{x=q}^{∞} Σ_{y=r}^{∞} (X_{t,k} = x ∪ Y_{t,l} = y | sim))/m,
where k = 2, l ∈ {3, 4}, q is the number of observed responses and r is the number of
observed stable diseases. This design has the same characteristics as the Panageas design
[12] [82].
Method 4 uses a definition of primary outcome based on superiority of one outcome,
or superiority of a second multinomial outcome where the second outcome includes the
first. This is similar to the Lu design [13], where one would be interested in a treatment
if the number of responses is greater than expected, or the number of responses and stable
diseases is greater than expected. The p-value is thus calculated by
(Σ_{x=q+1}^{∞} Σ_{x+y=r+1}^{∞} (X_{t,k} = x ∪ X_{t,k} + Y_{t,l} = x + y | sim))/m, where
r is the observed number of patients with response or stable disease and y is the number
of patients with stable disease (or the second outcome of interest).
A slight modification to method 4 is found in method 5, where the primary outcome
is based on superiority of the first outcome, or superiority of a second outcome while
the first outcome is equivalent. In other words, while one might be interested in a
treatment which has a greater response rate than expected, the treatment might still be
deemed of clinical interest if the response rate is equal to what is expected and there
is superiority of the stable disease rate. The p-value calculation for this method is
(Σ_{x=q+1}^{∞} Σ_{y=r+1−q}^{∞} (X_{t,k} = x ∪ (X_{t,k} = q ∩ Y_{t,l} = y) | sim))/m.
Method 6 uses a definition based on the equivalence or superiority of one outcome and
equivalence or inferiority of a second outcome. This is similar to the Zee design
[11] [82]. The p-value calculation is thus
(Σ_{x=q}^{∞} Σ_{y=0}^{r} (X_{t,k} = x ∩ Y_{t,l} = y | sim))/m. A
slight modification to this is in method 7, which defines the primary outcome to be strict
superiority of one outcome, or equality of the first outcome and strict inferiority of the
second. The p-value for method 7 will always be less than the p-value in method 6 and
is calculated by (Σ_{x=q}^{∞} Σ_{y=0}^{r−1} (X_{t,k} = x ∩ Y_{t,l} = y | sim))/m.
Method 8 uses a weighted response state definition similar to the Lin design [14].
As an example, when evaluating whether a treatment is active, clinicians might deem a
response to be 4 times as important as a patient with stable disease lasting greater than
6 cycles, which in turn is twice as important as a patient having stable disease of less
than 6 cycles. Thus, one might assign weights of 8, 2 and 1 to end states associated with
a patient response, stable disease greater than 6 cycles and stable disease less than 6
cycles, with other states having weight 0. A variety of weighting schemes is shown for
each transition matrix. To calculate the p-value, first one must define
v = w1 x1 + w2 x2 + · · · + wj xj, where wi is the weight associated with response state
xi. Then, the p-value is
(Σ_{v=s}^{∞} (V(w1, w2, · · ·, wj, x1, x2, · · ·, xj) = v | sim))/m, where
s = w1 x1o + w2 x2o + · · · + wj xjo and xio is the number of patients observed to be in
state i.
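Method 8's weighted score reduces to simple arithmetic on state counts. A minimal sketch (the function names, weights and counts below are illustrative, not the trial's):

```python
def weighted_score(counts, weights):
    """v = w1*x1 + ... + wj*xj for one trial's observed state counts."""
    return sum(weights[s] * counts.get(s, 0) for s in weights)

def weighted_p_value(sim_counts, obs_counts, weights):
    """Fraction of simulated trials whose weighted score reaches the
    observed score s = w1*x1o + ... + wj*xjo."""
    s = weighted_score(obs_counts, weights)
    return sum(weighted_score(c, weights) >= s for c in sim_counts) / len(sim_counts)

# Weights 8/2/1 for response, stable disease > 6 cycles, stable disease < 6 cycles:
w = {"R": 8, "SD6+": 2, "SD<6": 1}
s = weighted_score({"R": 3, "SD6+": 4}, w)   # 8*3 + 2*4 = 32
```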
Method 9 is based on superiority of each of multiple different end states. Clinicians
might be interested in a treatment only if the number of responses is greater than
expected and the number of stable diseases is greater than expected. This method is
primarily useful when looking at multiple binomial designs, where one might want the
number of responses to be greater than expected and the number of patients with no
toxicity to be greater than expected, similar to the Bryant-Day method [15]. The
calculation of this p-value is performed using
(Σ_{x=q+1}^{∞} Σ_{y=r+1}^{∞} (X_{t,k} = x ∩ Y_{t,l} = y | sim))/m.
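The method-9 rule is just a bivariate tail count over the simulated trials. A sketch, assuming each simulated trial has been summarised as a (responses, non-toxicities) pair; the pairs below are invented for illustration:

```python
def joint_tail_p_value(sim_pairs, q, r):
    """Method-9-style p-value: fraction of simulated trials in which the
    first count is strictly greater than q AND the second is strictly
    greater than r (superiority of both outcomes)."""
    hits = sum(1 for x, y in sim_pairs if x >= q + 1 and y >= r + 1)
    return hits / len(sim_pairs)

pairs = [(5, 10), (3, 12), (6, 2), (7, 11)]   # (responses, non-toxicities)
p = joint_tail_p_value(pairs, q=4, r=9)       # only (5, 10) and (7, 11) qualify
```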
Chapter 5
Results
Numerical results detailing the p-values and conditional power evaluated by the
simulation study are in Appendix D. A summary of these results is discussed in this
section, including discussion of the statistical issues and the results as they relate to
the original study.
5.1 RECIST Criteria
5.1.1 Interpretation
Table D.1 shows the input data associated with matrix (2.9), the transition matrix asso-
ciated with the RECIST criteria. Interpretation of Table D.1 is as follows. Of 36 patients,
1 transitioned from state ∅ to state unconfirmed response, 22 transitioned to state stable
disease, 8 to state progressive disease and 5 were censored at the first transition period.
This is shown in the first row of this table, titled Data ∅. The subsequent rows, titled
Data eval 2 - Data eval 4, show the observed transitions for the first 4 transition periods.
For example, the patient in state unconfirmed response had a confirmed response at tran-
sition Data eval 2, thus the transition probability is 1 (since no other patient transitioned
from unconfirmed response to any other state during that time period). Sixteen of the 22
patients in state stable disease remained in state stable disease, while 6 came off-study,
during the 2nd on-study transition. At the next transition, 2 patients had unconfirmed
response, where 1 later became a confirmed response and 1 came off-study without having
an observed response. At Data eval 4, an additional patient had a confirmed response,
transitioning to this state in the 5th on-study transition period (data eval 4).
The transition matrix under H0 is shown in matrix (5.1) and under a specified alternative
HA in matrix (5.2). This is summarised in Table D.1 in rows titled H0 and HA. Note
that the main difference between these matrices is at the initial transition, where under
H0 only 2% of patients transition into the unconfirmed response state, compared to 20% of
patients under HA, at the expense of the number of patients with progressive disease.
Also, the proportion of patients in state stable disease transitioning to the off-study
state decreases. This agrees with RECIST comparisons, which are driven primarily by the
first transition, since one takes the best observed response as the primary outcome. Since
a patient in state PD cannot improve, and it is infrequent for a patient to transition
from SD to R, it is usually the first transition which dictates the best observed
response.
The data observed at the interim analysis are shown in the final 5 rows, titled interim
eval.
M =
          ∅    R     UR    SD    SDoff   PD    C
   ∅      0    0     .02   .4    0       .4    .18
   R      0    1     0     0     0       0     0
   UR     0    .85   0     0     .15     0     0
   SD     0    0     .05   .6    .35     0     0
   SDoff  0    0     0     0     1       0     0
   PD     0    0     0     0     0       1     0
   C      0    0     0     0     0       0     1
                                                   (5.1)
M =
          ∅    R     UR    SD    SDoff   PD    C
   ∅      0    0     .2    .42   0       .2    .18
   R      0    1     0     0     0       0     0
   UR     0    .85   0     0     .15     0     0
   SD     0    0     .1    .7    .2      0     0
   SDoff  0    0     0     0     1       0     0
   PD     0    0     0     0     0       1     0
   C      0    0     0     0     0       0     1
                                                   (5.2)
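The endstate probabilities under H0 follow directly from matrix (5.1) by propagating the initial vector π0 through the chain; with five on-study evaluations this reproduces the endstate response probability of ≈ 0.0503 quoted for this model. A plain-Python sketch (the thesis's computations are done in R; the choice of five evaluations reflects the analysis point used in the study):

```python
# States: 0=∅, 1=R, 2=UR, 3=SD, 4=SDoff, 5=PD, 6=C  (matrix (5.1) under H0)
H0 = [
    [0, 0,   .02, .4, 0,   .4, .18],  # ∅
    [0, 1,   0,   0,  0,   0,  0  ],  # R
    [0, .85, 0,   0,  .15, 0,  0  ],  # UR
    [0, 0,   .05, .6, .35, 0,  0  ],  # SD
    [0, 0,   0,   0,  1,   0,  0  ],  # SDoff
    [0, 0,   0,   0,  0,   1,  0  ],  # PD
    [0, 0,   0,   0,  0,   0,  1  ],  # C
]

def step(pi, M):
    """One transition of the state-occupancy vector: pi' = pi * M."""
    return [sum(pi[i] * M[i][j] for i in range(len(M))) for j in range(len(M))]

pi = [1, 0, 0, 0, 0, 0, 0]   # every patient starts in state ∅
for _ in range(5):           # five on-study evaluations
    pi = step(pi, H0)
# pi[1] ≈ 0.0503 is the endstate probability of a confirmed response
```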
Endstate probabilities based on the data in Table D.1 are shown in Table D.2. The
endstate probabilities associated with H0 and HA are similar to the initial statistical
design, with a comparison of response rates of ≈ 0.05 versus 0.25, and non-progression
rates of ≈ 0.4 versus 0.6 (the sum of pR, pur, psd and psdo). P-values and conditional
power estimates based on sample sizes of 36 and 54 are in Tables D.3 and D.4 respectively. In
these tables, the first column indicates the method used and the second is the number
of iterations performed in the simulation. The last 3 columns are the calculated p-value,
conditional power (assuming future data appears similar to data as under HA), and
the conditional power (assuming future data occurs similarly to the data observed at
the interim analysis). The two columns titled "outv" and "outv2" define the states of
interest corresponding to the outcomes defined by each method. With method 1, for
instance, the primary outcome is based on a single response state. Thus, the state of
interest will be listed under "outv". For method 4, where the primary outcome is based
on superiority of a first outcome and superiority of a second outcome, then the first set
of states will be listed under "outv" and the second set of states is listed under "outv2".
5.1.2 Results
The first observation is as expected: results are similar regardless of the number of
iterations used for the simulated data, 1000 or 500. This is a fairly consistent
observation throughout, demonstrating the overall speed of convergence. Additionally, as
expected, the p-value generally decreases and the conditional power generally increases
as one increases the sample size from 36 to 54, with exceptions occurring for method 5.
Looking closely at method 5, one only gets significance if the number of responses is
greater than expected, or the number of responses is equal to what is expected and the
number of responses + stable diseases is greater than expected. There were 3 responses
observed of 36 patients, a rate of 0.0833. With 54 patients, it is impossible to get an
equivalent rate - i.e. 4/54 = .074 and 5/54 = .093. Thus, the only way one can get a
significant result is if the number of simulated patients with response is greater than
the observed number, and this p-value is ≈ equal to that of method 4. The p-value
increases since the probability of getting more extreme results is limited to having
strictly greater numbers when using 54 patients. The conditional power estimate is not
equal, however, since when using conditional power, the observed data are based on the
interim data observed plus generated future data.
The conditional power found using hypothesised data under HA and that found using
generated data based on observed results are strikingly different. Generally, the
conditional power of obtaining a significant result by continuing beyond the interim
analysis is quite high if one assumes future data follow HA, but quite low when using
observed data. This exemplifies the over-optimistic assumptions that are frequently made
when specifying HA in most phase II designs. As discussed earlier, the use of such an
optimistic alternative (i.e. HA: RR = 0.25) is partially done to keep the sample size
feasible.
P-values tended to be moderate to low (between 0.1 and 0.3), indicating a slight, but
only very minor, trend of improvement over what was expected. Significant results occur
only in situations where the sum of response + stable disease states, either unweighted
(method 2, or method 8 with all weights equal to 1) or weighted (method 8), is tested.
This leads to the belief that the improvement is in the number of patients with prolonged
stable disease; however, this is observed in better detail in some of the results to
follow. At this time, one could also hypothesise that the paucity of statistically
significant results could be partially explained by a lack of statistical power.
5.1.3 Transition Time-Important RECIST model
Matrices (4.1) and (4.2) describe modifications to the RECIST model which account
for time, with the primary change being the loss of the stable disease but off-treatment
state. This was performed to investigate the state a patient was in at a given transition,
not necessarily the best transition (which, as mentioned, is primarily the first transition
state). Data input for matrix (4.1) is in Table D.5, with accompanying endstate
probabilities in Table D.6 and results in Tables D.7 and D.35. Endstate probabilities for
this first iteration were designed to be identical to the previous results for matrix
(4.1) for the response, unconfirmed response and stable disease states. The disappearance
of the stable disease but off-treatment state causes an increase in patients in the
progressive disease and censored states. Since the analysis is no longer of best observed
response, the issue of when to analyse (i.e. after which transition) is raised.
The results using the transition time-important RECIST transition matrix tended
to be similar to those of the time-independent RECIST model, especially for methods which
only include response, unconfirmed response, or stable disease but on-study as outcomes
considered indicators of efficacy. However, for those methods which include stable
disease but off-study (see, for example, the difference between row 2 of Table D.3
and row 2 of Table D.7), the p-value tends to become more significant, and
the conditional power increases. Thus, the number of patients with stable disease but
still on-study at the 5th evaluation is less likely under H0 than the number of patients
with stable disease as their best observed response. One hypothesis arising from this
finding is that patients with stable disease do not progress as fast as expected under H0.
This needs further exploration.
Varying H0
One question of interest regarding the use of any new methodology is the robustness of
the models when one misspecifies H0. To demonstrate this, the simulation was re-run with
slight variations to the null hypothesis. One such variation, shown in Table D.9, gave a
patient at the first transition from ∅ a slightly increased chance of having an
unconfirmed response (.02 to .04) at the expense of being censored (.18 to .16). No other
alterations were made. The endstate
probabilities in Table D.10 change with pr going from .0503 to .0673, ppd increasing from
.6206 to .6216 and pc decreasing from .2730 to .2550. The other endstate probabilities
remain the same.
One would expect the p-values and conditional probabilities in Tables D.11 and D.12
to be very similar to those in Tables D.7 and D.8. This does occur, with p-values and
conditional probabilities generally changing by less than .05, which one might expect
from random variation alone. Thus, small changes in transition matrix probabilities do
not greatly affect outcomes, which is reassuring.
A more extreme variation is simulated in Table D.13, where the endstate probability
(see Table D.14) that a patient is in state SD is much higher (.0518 to .2458) and that a
patient is in state PD is much lower (.6206 to .4675). No method would give a
statistically significant result (see Tables D.15 and D.16), and a conditional probability
computation performed at an interim analysis would result in low-to-moderate belief that
statistical significance could be achieved. Large changes in hypotheses do result in large
changes in outcomes, as one would hope. Calculations for H0 must be carefully thought
out prior to any statistical analysis being performed, although this is often one of the
least thought out parts of the trial by some investigators, and often one of the most
disputed post-trial. Since an assumed H0 can often be disputed by different investigators,
one
could use a range of values for H0 and explore at what point the data tend to become
significant. This is not possible under classical statistical methods, but is a reasonable
suggestion given the uncertainty surrounding phase II oncology clinical trials. Even using
typical methods such as Simon’s optimal designs [9], exploring a range of H0 to explore
the results at trial termination might serve to create a better understanding of trial
results, and reduce the frequency of questionable scenarios.
A less severe but more reasonable alteration is shown in Table D.17, with endstate
probabilities in Table D.18. This scenario might be used to describe the belief that a
patient is likely to have a response only if they transition to the unconfirmed response
state immediately at the first transition, with the further belief that a late transition
to response is very unlikely (similar to what might be believed for cytotoxic agents).
The endstate probability of being in state R remains the same; however, the probability
of being in SD has decreased, with more patients being in state PD. As expected, the
level of significance is more extreme (see Tables D.19 and D.20), particularly for
methods which include stable disease as a positive outcome.
A final modification is shown in Table D.21 with endstate probabilities in Table
D.22. This H0 is concordant with a very optimistic view of the drug. The endstate
probability under H0 of being in state R is .1721, much higher than .0503 which was
used in the first model. Since response is considered a good outcome in all methods, the
level of significance is decreased throughout (i.e. higher p-values) and there is a decreased
conditional probability at an interim analysis which would lead investigators to be less
likely to continue to stage II of a study (see Tables D.23 and D.24).
Timing of Evaluations
One issue that arises is when to perform an end-study evaluation. Thus, using the same
H0 and HA as in Table D.5, an extra evaluation was included in Table D.25 and one was
removed in the following simulation, with endstate probabilities in Tables D.26 and D.29,
respectively. Results
are shown in Tables D.27 and D.28 for the extra transition simulation, and in Tables
D.30 and D.31 for the simulation with one less transition.
Under H0, there is a slightly increased probability of being in an absorbing endstate
(R, PD, C) with the extra-transition simulation, and a slightly decreased probability of
being in the same endstates for the simulation with one transition removed. During the
clinical trial, 1 patient transitioned from state UR to R at the 5th transition, 1 patient
went from SD to C at the time of the 5th evaluation, and 1 went from SD to PD at the 6th
evaluation. The outcomes appeared least significant when the endstates were defined after
the 4th evaluation, with an increase in the level of significance (primarily due to the
patient who transitioned to the response state) after the 5th evaluation, and then a
further decrease in significance when evaluations occurred after the 6th transition. The
change in the level of significance remained small, but noticeable.
One might tend to believe that the 5th evaluation, after which the analyses were
actually performed, is an 'optimal' selection point. It is also possible that the increase
in the number of patients with prolonged stable disease ceases around this time, which
clinically means a termination of treatment effect. Further to this, of the 6 patients
remaining in SD after the 6th transition, 1 patient progressed at the next transition, 2
patients progressed and 1 had an adverse event requiring study discontinuation at the
transition following, and 1 patient progressed at the transition following that. Only
one patient remained with stable disease substantially beyond this time point. Thus,
by performing analyses at different transition times, one observes that the duration of
treatment efficacy is around 5 evaluations (10 months). This could assist in understanding
the biological characteristics of temsirolimus or neuroendocrine carcinoma.
5.1.4 Varying away from the RECIST criteria
Although the RECIST criteria are commonly used, there is no consensus that they are
optimal [90] [91] [92]. Particularly with the increased flexibility of finite Markov
chain imbedding methods, there is no reason to limit the simulation to transition
matrices based on the RECIST criteria. A simple and straightforward modification is
described in matrix (4.2). The primary difference in this design is that one is not using
the best objective response status at any time, but is fully incorporating the present
status of each patient. Specifically, patients who transition into state R can
subsequently transition out when they come off study due to disease progression or
censoring.
Input data for this particular design is in Table D.32 and endstate probabilities are
in Table D.33. Of particular note is that one patient (021-025) had a partial response,
but came off treatment after the 3rd evaluation due to an unrelated adverse event, thus
was censored. The other two patients with partial responses remained on-study for a
lengthy period of time (9 evaluations and still receiving treatment, and 10 evaluations
prior to disease progression). Thus, instead of 3/36 patients in state R, only 2 patients
are still in this state. However, the corresponding endstate probabilities under H0 are
also decreased.
Results are in Tables D.34 and D.35 and generally indicate a greater level of
significance than the best-observed-response-at-any-time model. This increased
significance would again lend credence to the possibility of temsirolimus extending the
length of time to progression beyond what is expected, rather than having just an
immediate effect. The 2 patients who responded without censoring did not progress until
much later than even those with stable disease (note also the very long prolonged stable
disease duration of one patient), indicating that the treatment may remain active for
longer in a subset of patients, and that there is a particular subgroup of patients
(possibly based on some as yet unknown molecular characteristic) for which the treatment
is extremely effective. Unfortunately, correlative studies for this trial did not yield
impressive results, as frozen tissue samples were only available for 1 of the 3 patients
who had a partial response, and only 1 of the 3 had paired tumour biopsies with usable
data.
5.1.5 Immediate response
A natural extension when incorporating multiple evaluations would be to investigate how
each patient is doing at each individual transition, compared to the previous transition
time. This is represented by the transition matrix (4.3). For this analysis, tumour
shrinkage (state R) was defined as a reduction in tumour size of 5% or more, tumour
growth (state PD) was defined as an increase in tumour size of 5% or more, and disease
stabilisation (state SD) was defined as growth or shrinkage of less than 5% compared to
the measurement at the previous transition time. One could also be off-study. This is
represented in Tables D.36-D.39. A second definition was used where 10% growth and
shrinkage thresholds were used instead of 5%, shown in Tables D.40-D.43. Finally, one
might consider not having an SD state, considering only shrinkage or non-shrinkage; this
is in transition matrix (4.4) and Tables D.44-D.47.
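The per-transition classification described above is a one-line rule on consecutive tumour measurements. A sketch (the function name and the inclusive handling of the boundary are assumptions):

```python
def classify(prev_size, cur_size, threshold=0.05):
    """Classify one evaluation relative to the previous measurement:
    R  = shrinkage of `threshold` (5%) or more,
    PD = growth of `threshold` or more,
    SD = change smaller than the threshold in either direction."""
    change = (cur_size - prev_size) / prev_size
    if change <= -threshold:
        return "R"
    if change >= threshold:
        return "PD"
    return "SD"

# With a 10% threshold (the second definition), a 6% shrinkage is SD, not R:
classify(100, 94)                  # "R" under the 5% definition
classify(100, 94, threshold=0.10)  # "SD" under the 10% definition
```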
The level of statistical significance decreases as the definition of response goes from
any shrinkage, to > 5%, to > 10%. In other words, the number of patients having
impressive activity compared to what is expected is small. In addition, a number of
patients still show slight treatment-related activity, relative to H0, by the end of the
4th and 5th evaluations. This is consistent with the theory that treatment-related
activity is slowing down or stopping around the 4th to 5th evaluation, and also with the
slow-growing nature of neuroendocrine carcinoma.
5.1.6 Consecutive states
Since the primary purpose of phase II clinical trials in oncology is to determine whether
a treatment has any potential activity, and since with MTAs activity might be observed
as the prevention of future growth over a number of consecutive evaluations instead of
simply as tumour shrinkage, it might be of interest to explore models in which a good
outcome is defined as a patient having tumour shrinkage (treatment response) or
consecutive (2 or more) evaluations with no tumour growth. This can be modelled using a
transition matrix as in matrix (4.5), in which a patient having 3 consecutive evaluations
with no tumour growth would be considered as having a good outcome, or by transition
matrix (4.6), in which 4 consecutive evaluations are required. Alternatively, one might
define a good outcome as a major response, or consecutive evaluations with minor
response, such as in transition matrix (4.7). One additional twist, as shown in Table
D.56, is
that the transition matrix under H0 or HA does not necessarily have to remain constant
throughout the trial. If one has a reason to believe that treatment effects might change
during the course of the trial (possibly due to changes in dosage from adverse events, or
noting that certain treatments, like some chemotherapy and hormonal treatments, can
only be used for certain lengths of time and a patient may have completed that portion
of the treatment), then one can model this by modifying the input transition matrices
accordingly.
Results are similar to those seen in other analyses; however, requiring 3 or more
consecutive observations of stable disease tends to give slightly more significant
results. This accords with the belief that the number of stable disease patients
observed at cycle 2 might not be greater than expected; however, the long-term effect of
the treatment might persist, such that patients who do have stable disease tend not to
progress as quickly as expected.
5.1.7 Dual-Binomial Outcomes
Trial designs have been suggested which incorporate multiple endpoints, such as response
and toxicity, where a patient could have either endpoint, neither endpoint, or both. For
example, a particular agent might be deemed of interest only if the response rate is
sufficiently high and the level of toxicity is adequately low. One might also be willing
to accept higher risk if there was correspondingly higher efficacy or, conversely, if
lower toxicity was observed, one might deem acceptable a treatment with less efficacy.
This latter case
might also indicate a treatment which has the potential to be part of a multi-agent
combination therapy. A transition matrix to represent this type of analysis is defined in
matrix (4.8). The definition of a toxic event would need to be specified (e.g. any grade
3 adverse event of any attribution). A patient might have a toxic event and remain on
study, might come off study without having ever experienced the toxicity, might respond
and have toxicity or might have neither outcome.
Results for this analysis are in Tables D.60-D.63. The p-values for those analyses
which use both response + stable disease and toxicity outcomes show a great deal of
significance - see method 2, second analysis, and method 8, 2nd, 3rd and 4th analyses.
This is a result of the observed toxicity level being less than expected, so that when
combined with the increased rate of stable disease which was observed, a highly
significant result is obtained. Thus, although there was only a slight improvement in
efficacy alone, this treatment might be of considerably greater interest given the
relative lack of toxicity observed. Further, one might find temsirolimus suitable as
part of a combination therapy, since the toxicity is low and a synergistic relationship
might be possible without putting patients at excessive risk.
5.1.8 Theoretical Versus Simulated Calculations
The results observed using simulation were similar to those which would be observed
had one used theoretical calculations. As an example, consider matrix (2.9), based on
the RECIST criteria. The end-state probability of response under H0 is 0.0503, as
defined by the transition matrices. If we calculated the p-value in this instance using
theoretical calculations, we would calculate the probability of observing 3 or more
responses out of 36 patients, given that the probability of having a response is 0.0503.
This is 0.2709. Using the simulated data, the probability was 0.245. If the primary
outcome was determined to be the response + stable disease rate, then the probability is
0.0503 + 0.0043 + 0.0518 + 0.3135 = 0.4199, and the probability of observing 23 or more
patients in response or stable disease is 0.0066. This is not very different than the
simulated calculation of 0.029. Finally, if the primary outcome is response + stable
disease but on-study, the probability is 0.1034 and the theoretical p-value is 0.003,
compared to the simulated p-value of 0.005.
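The theoretical figures above are binomial tail probabilities, which can be checked directly (the 0.0503, 0.4199, 0.2709 and 0.0066 values are from the text; `comb` is Python's standard-library binomial coefficient):

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

# P(3 or more responses out of 36 | p = 0.0503):
p_resp = binom_tail(36, 0.0503, 3)        # ≈ 0.2709
# P(23 or more response/stable-disease patients out of 36 | p = 0.4199):
p_nonprog = binom_tail(36, 0.4199, 23)    # ≈ 0.0066
```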
The theoretical conditional probability of obtaining a statistically significant result,
were one to continue the trial after 15 patients had been accrued, is 0.048 if the primary
outcome is response and future data is assumed to occur similarly to the interim data.
This compares to the simulated value of 0.049. If future data was instead assumed to
follow the distribution under HA, the conditional probability is 0.803 using theoretical
calculations and 0.820 using simulated data. When the primary outcome is response or
stable disease, the conditional probability would be 0.599, compared to the simulated
value of 0.591, if future data was assumed to follow the distribution defined under HA,
that is, a probability of 0.62 that a future patient has a response or stable disease. If
future data was assumed to occur similarly to the data up to the interim analysis, the
conditional probability is 0.286 using theory and 0.278 using the simulation. When the
primary outcome is response + stable disease but on-study, the conditional probability
is 0.851 (theoretical) and 0.867 (simulated) when future data is assumed to occur the
same as the present data, and 0.975 (theoretical) and 0.971 (simulated) when future
data is assumed to follow HA. The simulated results are thus similar to the theoretical
results.
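The theoretical conditional power figures above reduce to a short binomial calculation: find the smallest total response count that would be significant at trial completion, then ask how likely the remaining patients are to supply the missing responses. The sketch below is illustrative (the variable names are mine, not the thesis code's); it assumes the one-sided 0.05 threshold used by cond.prob.fn in Appendix C.

```r
## Hedged sketch: theoretical conditional power for the binary response
## outcome at the interim analysis (1 response among 15 patients).
n.total   <- 36      # planned sample size
n.interim <- 15      # patients accrued at the interim analysis
r.interim <- 1       # responses observed by the interim analysis
p0 <- 0.0503         # end-state response probability under H0 (Table D.2)

## Smallest total response count with a one-sided p-value <= 0.05 under H0
r.crit <- qbinom(0.05, n.total, p0, lower.tail = FALSE) + 1

## Probability that the remaining patients supply the missing responses,
## for a given assumed future response probability
cond.power <- function(p.future) {
  pbinom(r.crit - r.interim - 1, n.total - n.interim, p.future,
         lower.tail = FALSE)
}

cond.power(0.2482)              # future data follows HA: approx 0.803
cond.power(r.interim/n.interim) # future data like interim data: approx 0.048
```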
Chapter 6
Discussion
There are a multitude of phase II clinical trial designs which can be used to investigate
whether an experimental cancer treatment has potential efficacy. The choice of design
for any individual trial is frequently subjective, often based on a statistician's personal
preferences. Even after a particular design is selected, the specification of hypotheses is
often driven by practical issues rather than solely by clinical efficacy rates. Different
investigators may therefore choose different hypotheses, and conflicting conclusions can
easily arise once a trial is complete. In the absence of consensus, the ability to evaluate
different designs and different hypotheses simultaneously would be advantageous.
In this dissertation, it is shown that finite Markov chain imbedding can be used to
evaluate multiple designs and hypotheses at one time. It is possible to test the many
hypotheses which different investigators might hold, including optimistic and pessimistic
hypotheses, as well as hypotheses using different outcomes of interest. In this manner,
one can reduce post-trial conflicts by extracting more information from the same data
and better understanding whether the experimental treatment is truly efficacious. To
do this, one needs only to set up the transition matrix appropriately and to arrange the
data to correspond with the transition matrix. Given that oncology clinical trials are
naturally divided into sections of time by the cycles of treatment, and that states are
already defined for the most commonly used efficacy outcome, namely response, these
trials fit easily within Markov chain methodology. These states can be modified easily.
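The core operation is the propagation of a start vector through an array of per-cycle transition matrices, exactly as the loops in the Appendix C programs do. The following sketch illustrates it with a hypothetical three-state chain (this toy chain is mine; it is not matrix (2.9)).

```r
## Hypothetical three-state illustration: propagating a start vector
## through per-cycle transition matrices, the core operation of the
## pval.fn and cond.prob.fn programs in Appendix C.
states <- c("SD", "PD", "R")
P1 <- matrix(c(0.6, 0.3, 0.1,   # from SD: stay stable, progress, or respond
               0.0, 1.0, 0.0,   # PD is absorbing in this toy example
               0.0, 0.0, 1.0),  # R is absorbing in this toy example
             nrow = 3, byrow = TRUE, dimnames = list(states, states))
H0array <- array(rep(P1, 2), dim = c(3, 3, 2))  # two identical cycles
start <- c(1, 0, 0)                # all patients begin in stable disease
update <- t(start)
for (i in 1:dim(H0array)[3]) update <- update %*% H0array[, , i]
as.numeric(update)                 # end-state probabilities: 0.36 0.48 0.16
```

Time-dependent behaviour is obtained simply by placing a different matrix in each slice of the array, as in the "Data eval" rows of the tables in Appendix D.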
Two computer programs, written in R, are provided, one which calculates p-values
at trial termination, and one which calculates conditional power at an interim analysis.
Both programs require only a few seconds to complete the necessary simulations and
can calculate outcomes from any of 9 different methods. Actual trial data is used to
demonstrate the utility of this method and the ease of use of these programs. It is also
shown how the same data can provide additional information about the treatment using
finite Markov chain imbedding, as evidenced by the ability to detect that the treatment
appears to be effective for around 5-6 evaluations (approximately 10 months) amongst
most patients with disease stabilization, but somewhat longer for those patients having
a tumour response. Additionally, a subset of patients appears to demonstrate activity,
which leads to the presumption that some unknown biological factor (e.g. a molecular
marker), present in only a proportion of patients or tumours, may be affected by the
treatment.
For binary primary outcomes, either simulated or theoretical calculations are possible;
for more complex primary outcomes, however, it is easier to work with the simulated
data. Particularly when investigating small modifications to the assumed distributions
H0 or HA, for example slight differences in the response rate, the work required to
compute the simulated outcome is minimal, whereas the theoretical calculations must
be performed anew each time and can be quite complicated. The program itself takes
only a few seconds to compute the statistics of interest; thus, a wide variety of results
can be computed with minimal work and time.
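As a concrete illustration, the binary response endpoint can be re-examined under a slightly modified null with a single changed argument. The sketch below is illustrative only (the seed and iteration count are arbitrary choices, not taken from the thesis).

```r
## Simulated p-value for the binary response outcome (method 1) under the
## original and a slightly modified H0 end-state response probability.
set.seed(1)                        # arbitrary seed, for reproducibility
iterations <- 10000
n <- 36; obs <- 3                  # sample size; responses observed in the data
sim.pval <- function(p0) mean(rbinom(iterations, n, p0) >= obs)
sim.pval(0.0503)   # original H0 (Table D.2); theoretical value is 0.2709
sim.pval(0.0673)   # slightly better expectations under H0 (Table D.10)
```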
Finite Markov chain imbedding is an additional, valuable tool of which statisticians
might avail themselves when analyzing studies that incorporate an outcome measured
repeatedly over time. Presently, in most situations, investigators choose a single mea-
surement evaluated at a single time point. The use of finite Markov chain imbedding
allows investigators to study results as a pattern of outcomes over time. It is this
transition from a single measurement to a pattern which could prove extremely important
to future researchers. By examining the pattern of results from observed data instead of
focusing on a single outcome measurement, one can more clearly understand the effect
of a treatment which could otherwise be obscured. This is particularly true when the
effect being measured is not well understood and is difficult to pinpoint prior to initiating
a study.
The computer code provided is, to my knowledge, the first which allows the user to
easily apply finite Markov chain imbedding for analysis in a common statistical software
package. The code is relatively simple to use, flexible, and efficient. It is intended
that this code will be published and made freely available to other statisticians and
investigators, such that they can implement finite Markov chain imbedding methods
with relative ease. The simplicity of the code allows users to investigate a multitude of
possibilities with only minor modifications to the input parameters. Conflicting
conclusions which could arise from subjective implementation of different possible designs
can be clarified by understanding the discrepancies between designs. An improved
understanding of the true treatment effect results, which should allow more effective use
of limited money and resources and enhance decision-making.
Nevertheless, despite these advantages, finite Markov chain imbedding is not a panacea.
Regardless of the results seen, a confirmatory randomized phase III trial would still need
to be conducted to compare a presumed effective treatment with the standard of care.
Phase II trials tend to be small, single-arm trials which aim to determine whether a
treatment has potential efficacy and should be studied further. In addition, even though
finite Markov chain imbedding provides the ability to investigate a range of hypotheses,
it does not guarantee that all hypotheses of interest will be investigated. An investigator,
or investigators, who are overly optimistic or pessimistic might remain so over the entire
range of hypotheses investigated. Finally, one must evaluate the trial results across all
simulations, not individually, and it takes time both to set up all the simulations and to
understand all the analyses undertaken.
Even though all frequentist hypothesis tests incorporate a priori information in the
framing of the hypotheses and the type I and II errors, the individual analyses are not
Bayesian and do not formally combine prior information with the trial data. P-values
are probabilities of observing data as extreme as, or more extreme than, the data actually
observed, under the null hypothesis. Each probability has to be interpreted in the context
of the null hypothesis and does not directly measure whether the treatment is effective.
Given that these results are based on simulations, one must fully understand how the
calculations are performed to ensure proper interpretation of the results. The user must
possess sufficient statistical knowledge to avoid incorrect conclusions. This is partially
enforced, as the user must have the statistical coding ability to construct arrays and use
R code; however, there is no guarantee of valid inference.
Further work is still needed on finite Markov chain imbedding. First, time-to-event
outcomes, such as survival or progression-free survival, were not discussed in this disser-
tation, although these are very important efficacy outcomes. While adding an additional
state to the state space would account for this partially, it does not wholly account for
time-to-event outcomes. Second, there is no accounting for known predictors of effi-
cacy. In breast cancer alone, Her2 status and nodal involvement are two significant
predictors of efficacy which are not always known prior to trial recruitment. Thus,
including known predictors is important, and will become more so as additional disease
markers become known. Third, finite Markov chain imbedding methods could be
valuable for understanding treatment effects in ways other than strict efficacy. There are
often questions about the treatment regimen to use: how long a cycle should be (e.g. 21,
28, or 35 days); how the dosing schedule should be set within a cycle (e.g. daily, 3 weeks
on-treatment followed by 1 week of rest, or twice daily); whether there is a maximum
acceptable number of cycles; whether there should be a lead-in period for one agent in
a multi-agent treatment; in which order consecutive treatments should be structured;
and so on. Fourth, there is a need to evaluate finite Markov chain imbedding as a
tool for study design and sample size calculation. Fifth, although 9 methods for calculat-
ing statistics are available, there are potentially other methods which could be used, and
these should be investigated. Additional methods for calculating statistics could become
apparent if this design were used in other therapeutic areas for which finite Markov chain
imbedding might prove useful, such as central nervous system or pain studies. These
therapeutic areas have outcomes which are measured on each subject repeatedly over
time, and the outcome (or state) in which a subject resides changes constantly
throughout the study period. Thus, they may prove to be natural therapeutic areas in
which finite Markov chain imbedding methods should be studied further.
In summary, this dissertation has applied finite Markov chain imbedding, a method not
previously used in this setting, to evaluate phase II oncology clinical trial data. The
ability to investigate a range of designs and hypotheses simultaneously, with relative
ease, has been demonstrated. This powerful tool has the ability to increase trial efficiency
and improve our statistical knowledge.
Appendix A
Data
ID Baseline 1 2 3 4 5 6 7 8 9 Off-Study Best Response
021-001 457 456 426 429 423 444 Progression SD
021-002 124 127 126 111 Physician Discretion SD
021-003 110 102 119 Progression SD
021-004 265 282 268 321 Progression SD
021-005 223 257 Progression PD
021-006 114 101 102 100 91 90 90 79 84 88 Progression SD
021-007 140 138 133 134 Adverse Event SD
021-008 104 Adverse Event IE
021-009 66 95 Adverse Event PD
021-010 25 50 Progression PD
021-011 240 340 Progression PD
021-012 212 312 Progression PD
021-013 187 268 Progression PD
021-014 208 190 180 185 172 174 183 182 196 169 Still On-Treatment SD
021-015 194 161 146 136 135 125 128 130 135 133 Still On-Treatment PR
021-016 207 213 226 229 223 242 244 248 250 Progression SD
021-017 107 Death IE
021-018 70 72 76 69 79 75 75 81 Adverse Event SD
021-019 33 Never Treated Ineligible
021-020 114 137 Progression PD
021-021 324 318 Symptomatic Progression PD
021-022 294 239 217 167 Completed 8 Cycles SD (uPR)
021-023 429 368 385 Symptomatic Progression SD
021-024 25 24 Withdrew Consent SD
021-025 268 188 154 Unrelated Disease Complications PR
021-026 225 263 Symptomatic Progression SD
021-027 298 319 Symptomatic Progression SD
021-028 77 Adverse Event IE
021-029 34 27 33 27 25 22 28 27 12 14 Still On-Treatment PR
021-030 227 Adverse Event IE
021-031 58 48 53 48 46 43 42 Still On-Treatment SD
021-032 179 161 160 157 166 164 157 136 138 Still On-Treatment SD
021-033 165 163 174 182 Progression SD
021-034 79 84 84 72 71 64 37 59 Still On-Treatment SD
021-035 158 240 Progression PD
021-036 314 328 339 371 367 Symptomatic Progression SD
021-037 189 167 Withdrew Consent SD
*PR=Partial Response, SD=Stable Disease, PD=Progressive Disease, IE=Inevaluable, uPR=Unconfirmed Partial Response
Table A.1: Data, in mm
Appendix B
State Spaces
ID Baseline 1 2 3 4 5 6 7 8 9 Off-Study Best Response
021-001 ∅ SD SD SD SD PD Progression SD
021-002 ∅ SD SD SD C Physician Discretion SD
021-003 ∅ SD PD Progression SD
021-004 ∅ SD SD PD Progression SD
021-005 ∅ PD Progression PD
021-006 ∅ SD SD SD SD SD SD SD SD PD Progression SD
021-007 ∅ SD SD SD Adverse Event SD
021-008 ∅ C Adverse Event IE
021-009 ∅ PD Adverse Event PD
021-010 ∅ PD Progression PD
021-011 ∅ PD Progression PD
021-012 ∅ PD Progression PD
021-013 ∅ PD Progression PD
021-014 ∅ SD SD SD SD SD SD SD SD SD Still On-Treatment SD
021-015 ∅ SD SD UR R R R R R R Still On-Treatment R
021-016 ∅ SD SD SD SD SD SD SD SD Progression SD
021-017 ∅ C Death IE
021-018 ∅ SD SD SD SD SD SD SD Adverse Event SD
021-019 ∅ Never Treated Ineligible
021-020 ∅ PD Progression PD
021-021 ∅ PD Symptomatic Progression PD
021-022 ∅ SD SD UR C Completed 8 Cycles SD (uPR)
021-023 ∅ SD PD Symptomatic Progression SD
021-024 ∅ SD Withdrew Consent SD
021-025 ∅ UR R C Unrelated Disease Complications R
021-026 ∅ SD PD Symptomatic Progression SD
021-027 ∅ SD PD Symptomatic Progression SD
021-028 ∅ C Adverse Event IE
021-029 ∅ SD SD SD SD UR R R R R Still On-Treatment R
021-030 ∅ C Adverse Event IE
021-031 ∅ SD SD SD SD SD SD Still On-Treatment SD
021-032 ∅ SD SD SD SD SD SD SD SD Still On-Treatment SD
021-033 ∅ SD SD SD Progression SD
021-034 ∅ SD SD SD SD SD SD SD Still On-Treatment SD
021-035 ∅ PD Progression PD
021-036 ∅ SD SD SD PD Symptomatic Progression SD
021-037 ∅ SD Withdrew Consent SD
*R=Partial Response, SD=Stable Disease, PD=Progressive Disease, IE=Inevaluable, UR=Unconfirmed Partial Response, C=Censored
Table B.1: Data, in State Spaces According to RECIST Criteria
Appendix C
Computer Code
pval.fn<-function(startvector,H0array,dataarray,sampsize,iterations,method,outv,outv2){
##### startvector - starting positions #####
##### H0array - array of transition matrices under H0 #####
##### dataarray - array of transition matrices as given by data ######
l<-dim(H0array)[2]
ntransitions<-dim(H0array)[3]
update<-t(startvector)
data.update<-t(startvector)
#### Create data and H0 matrices #####
for (i in 1:ntransitions){
endstate.mat<-update%*%H0array[,,i]
update<-endstate.mat
data.endstate.mat<-data.update%*%dataarray[,,i]
data.update<-data.endstate.mat
}
#### Generate random data under H0 #####
nulldata<- matrix(sample(1:l,sampsize*iterations,prob=endstate.mat,replace=T),nrow=iterations,byrow=T)
##### Fake data added to get identical number of outcomes using summary.factor ####
fakedat<-matrix(rep(1:l,iterations),nrow=iterations,byrow=T)
endstatesuse<-cbind(nulldata,fakedat)
endstate<-t(as.matrix(apply(endstatesuse,1,summary.factor))-rep(1,l))/sampsize
#### Calculate p-value depending on method #####
#### Method 1 - Superiority of one outcome (i.e. CR) #####
if (method==1) {pval<-sum(endstate[,outv]>=data.update[outv])}
#### Method 2 - Superiority of the sum of multiple outcomes (i.e. CR+PR) #####
if (method==2) {pval<-sum(apply(endstate[,outv],1,sum)>=sum(data.update[outv]))}
#### Method 3 - Superiority of any one of many multiple outcomes (CR or PR) #####
if (method==3) {
v<-matrix(NA,nrow=iterations,ncol=length(outv))
for (i in 1:length(outv)){
v[,i]<-(endstate[,outv[i]]>=data.update[outv[i]])
}
pval<-sum(apply(v,1,max))
}
#### Method 4 - Either A or sum of B (CR or CR+PR) #####
if (method==4){
temp<-rep(0,iterations)
for (i in 1:iterations){temp[i]<-(sum(endstate[i,outv]>=data.update[outv] ||
sum(endstate[i,outv2])>=sum(data.update[outv2])))}
pval<-sum(temp)
}
##### Method 5 - Either superiority of A or equivalence of A and superiority of B ####
if (method==5){
temp<-rep(0,iterations)
for (i in 1:iterations){temp[i]<-(sum(endstate[i,outv]>data.update[outv] ||
(endstate[i,outv]==data.update[outv] && sum(endstate[i,outv2])>=sum(data.update[outv2]))))
}
pval<-sum(temp)
}
##### Method 6 - Superiority of A and inferiority of B ####
if (method==6){
temp<-rep(0,iterations)
for (i in 1:iterations){temp[i]<-(sum(endstate[i,outv]>=data.update[outv] ||
sum(endstate[i,outv2])<=sum(data.update[outv2])))}
pval<-sum(temp)
}
##### Method 7 - Strict superiority of A and strict inferiority of B ####
if (method==7){
temp<-rep(0,iterations)
for (i in 1:iterations){temp[i]<-(sum(endstate[i,outv]>data.update[outv] ||
sum(endstate[i,outv2])<sum(data.update[outv2])))}
pval<-sum(temp)
}
###### Method 8 - Weighted model ######
if (method==8){
temp<-endstate%*%outv
temp1<-data.update%*%outv
temp2<-rep(0,iterations)
for (i in 1:iterations){
temp2[i]<-temp[i]>=temp1
}
pval<-sum(temp2)
}
#### Method 9 - Superiority of each one of many multiple outcomes (CR and PR) #####
if (method==9) {
v<-matrix(NA,nrow=iterations,ncol=length(outv))
for (i in 1:length(outv)){
v[,i]<-(endstate[,outv[i]]>=data.update[outv[i]])
}
pval<-sum(apply(v,1,prod))
}
print(pval/iterations)
}
cond.prob.fn<-function(startvector,H0array,dataarray,h1array,sampsize,interimss,iterations,method,outv,outv2){
##### startvector - starting positions #####
##### H0array - array of transition matrices under H0 #####
##### dataarray - array of transition matrices as given by data ######
l<-dim(H0array)[2]
ntransitions<-dim(H0array)[3]
update<-t(startvector)
data.update<-t(startvector)
alt.update<-t(startvector)
#### Create data, HA and H0 matrices #####
for (i in 1:ntransitions){
endstate.mat<-update%*%H0array[,,i]
update<-endstate.mat
data.endstate.mat<-data.update%*%dataarray[,,i]
data.update<-data.endstate.mat
alt.endstate.mat<-alt.update%*%h1array[,,i]
alt.update<-alt.endstate.mat
}
#### Generate random data under H0 #####
nulldata<- matrix(sample(1:l,sampsize*iterations,prob=endstate.mat,replace=T),nrow=iterations,byrow=T)
##### Fake data added to get identical number of outcomes using summary.factor ####
fakedat<-matrix(rep(1:l,iterations),nrow=iterations,byrow=T)
endstatesuse<-cbind(nulldata,fakedat)
endstate<-t(as.matrix(apply(endstatesuse,1,summary.factor))-rep(1,l))
##### Data at interim analysis ######
int.data<-matrix(rep(data.update*interimss,iterations),nrow=iterations,byrow=T)
##### Generate future data under HA ######
futdata<- matrix(sample(1:l,(sampsize-interimss)*iterations,prob=alt.endstate.mat,replace=T),nrow=iterations,byrow=T)
##### Fake data added to get identical number of outcomes using summary.factor ####
fakedath1<-matrix(rep(1:l,iterations),nrow=iterations,byrow=T)
alt.endstatesuse<-cbind(futdata,fakedath1)
alt.endstate<-t(as.matrix(apply(alt.endstatesuse,1,summary.factor))-rep(1,l))+int.data
cond.prob<-rep(NA,iterations)
#### Calculate conditional probability depending on method #####
#### Method 1 - Superiority of one outcome (i.e. CR) #####
if (method==1) {for (i in 1:iterations){
cond.prob[i]<-sum(endstate[,outv]>=alt.endstate[i,outv])/iterations
}
}
#### Method 2 - Superiority of the sum of multiple outcomes (i.e. CR+PR) #####
if (method==2) {for (i in 1:iterations){
cond.prob[i]<-sum(apply(endstate[,outv],1,sum)>=sum(alt.endstate[i,outv]))/iterations
}
}
#### Method 3 - Superiority of any one of many multiple outcomes (CR or PR) #####
if (method==3) {
for (j in 1:iterations){
v<-matrix(NA,nrow=iterations,ncol=length(outv))
for (i in 1:length(outv)){
v[,i]<-(endstate[,outv[i]]>=alt.endstate[j,outv[i]])
}
cond.prob[j]<-sum(apply(v,1,max))/iterations
}
}
#### Method 4 - Either A or sum of B (CR or CR+PR) #####
if (method==4){
for (j in 1:iterations){
temp<-rep(0,iterations)
for (i in 1:iterations){temp[i]<-(sum(endstate[i,outv]>=alt.endstate[j,outv] ||
sum(endstate[i,outv2])>=sum(alt.endstate[j,outv2])))}
cond.prob[j]<-sum(temp)/iterations
}
}
##### Method 5 - Either superiority of A or equivalence of A and superiority of B ####
if (method==5){
for (j in 1:iterations){
temp<-rep(0,iterations)
for (i in 1:iterations){temp[i]<-(sum(endstate[i,outv]>alt.endstate[j,outv] ||
(endstate[i,outv]==alt.endstate[j,outv] && sum(endstate[i,outv2])>=sum(alt.endstate[j,outv2]))))
}
cond.prob[j]<-sum(temp)/iterations
}
}
##### Method 6 - Superiority of A and inferiority of B ####
if (method==6){
for (j in 1:iterations){
temp<-rep(0,iterations)
for (i in 1:iterations){temp[i]<-(sum(endstate[i,outv]>=alt.endstate[j,outv] ||
sum(endstate[i,outv2])<=sum(alt.endstate[j,outv2])))}
cond.prob[j]<-sum(temp)/iterations
}}
##### Method 7 - Strict superiority of A and strict inferiority of B ####
if (method==7){
for (j in 1:iterations){
temp<-rep(0,iterations)
for (i in 1:iterations){temp[i]<-(sum(endstate[i,outv]>alt.endstate[j,outv] ||
sum(endstate[i,outv2])<sum(alt.endstate[j,outv2])))}
cond.prob[j]<-sum(temp)/iterations
}}
###### Method 8 - Weighted model ######
if (method==8){
for (j in 1:iterations){
temp<-endstate%*%outv
temp1<-alt.endstate[j,]%*%outv
temp2<-rep(0,iterations)
for (i in 1:iterations){
temp2[i]<-sum(temp[i])>=sum(temp1)
}
cond.prob[j]<-sum(temp2)/iterations
}}
#### Method 9 - Superiority of each one of many multiple outcomes (CR and PR) #####
if (method==9) {
for (j in 1:iterations){
v<-matrix(NA,nrow=iterations,ncol=length(outv))
for (i in 1:length(outv)){
v[,i]<-(endstate[,outv[i]]>=alt.endstate[j,outv[i]])
}
cond.prob[j]<-sum(apply(v,1,prod))/iterations
}
}
condprob<-sum(cond.prob<=.05)/iterations
print(condprob)
}
Appendix D
Results
Matrix p∅−ur p∅−sd p∅−pd p∅−c pur−cr pur−sdo psd−ur psd−sd psd−sdo
Data ∅ 1/36 22/36 8/36 5/36 0 0 0 0 0
Data eval 2 0 0 0 0 1 0 0 16/22 6/22
Data eval 3 0 0 0 0 0 0 2/16 12/16 2/16
Data eval 4 0 0 0 0 1/2 1/2 1/12 8/12 3/12
Data eval 5 0 0 0 0 1 0 0 7/8 1/8
H0 .02 .4 .4 .18 .85 .15 .05 .6 .35
Interim ∅ 0 8/15 6/15 1/15 0 0 0 0 0
Interim eval 2 0 0 0 0 0 0 0 7/8 1/8
Interim eval 3 0 0 0 0 0 0 1/7 5/7 1/7
Interim eval 4 0 0 0 0 1 0 0 3/5 2/5
Interim eval 5 0 0 0 0 0 0 0 3/3 0
HA .2 .42 .2 .18 .85 .15 .1 .7 .2
Table D.1: Data input for matrix (2.9) modelling the RECIST criteria
Endstates pR pur psd psdo ppd pc
Data 3/36 0 7/36 13/36 8/36 5/36
H0 .0503 .0043 .0518 .3135 .4000 .1800
Interim 1/15 0 3/15 4/15 6/15 1/15
HA .2482 .0144 .1008 .2566 .2000 .1800
Table D.2: Endstate probabilities for (2.9) modelling the RECIST criteria
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.245 0.820 0.049
2 1000 2,3,4,5 0.029 0.591 0.278
2 1000 2,3,4 0.005 0.971 0.867
3 1000 2,4 0.278 0.449 0.032
4 1000 2 3,4,5 0.372 0.002 0.002
4 1000 2 2,3,4,5 0.285 0.431 0.017
4 1000 2 2,3,4 0.277 0.826 0.056
5 1000 2 3,4,5 0.109 0.870 0.121
5 1000 2 2,3,4,5 0.107 0.902 0.137
6 1000 2 6 0.343 0.278 0.001
6 1000 2 6,7 0.299 0.486 0.023
7 1000 2 6 0.138 0.309 0.004
7 1000 2 6,7 0.112 0.714 0.074
8 1000 0,1,1,1,1,0,0 0.016 0.585 0.144
8 1000 0,4,1,1,1,0,0 0.050 0.868 0.131
8 1000 0,4,2,2,1,0,0 0.018 0.961 0.533
9 1000 2,5 0.143 0.813 0.073
1 500 2 0.256 0.798 0.056
2 500 2,3,4,5 0.028 0.586 0.306
2 500 2,3,4 0.002 0.974 0.870
3 500 2,4 0.272 0.468 0.038
4 500 2 3,4,5 0.328 0.008 0.000
4 500 2 2,3,4,5 0.280 0.476 0.008
4 500 2 2,3,4 0.306 0.798 0.012
5 500 2 3,4,5 0.094 0.862 0.150
5 500 2 2,3,4,5 0.128 0.868 0.130
6 500 2 6 0.294 0.304 0.000
6 500 2 6,7 0.296 0.466 0.022
7 500 2 6 0.152 0.544 0.004
7 500 2 6,7 0.116 0.730 0.050
8 500 0,1,1,1,1,0,0 0.028 0.596 0.290
8 500 0,4,1,1,1,0,0 0.046 0.934 0.194
8 500 0,4,2,2,1,0,0 0.012 0.962 0.502
9 500 2,5 0.152 0.848 0.090
Table D.3: Outcomes for (2.9) modelling the RECIST criteria and n=36 patients
n=54 iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.142 0.934 0.035
2 1000 2,3,4,5 0.011 0.819 0.415
2 1000 2,3,4 0.000 0.999 0.930
3 1000 2,4 0.139 0.543 0.093
4 1000 2 3,4,5 0.161 0.038 0.010
4 1000 2 2,3,4,5 0.149 0.791 0.026
4 1000 2 2,3,4 0.116 0.941 0.043
5 1000 2 3,4,5 0.148 0.974 0.186
5 1000 2 2,3,4,5 0.141 0.981 0.118
6 1000 2 6 0.167 0.720 0.001
6 1000 2 6,7 0.148 0.773 0.029
7 1000 2 6 0.160 0.853 0.007
7 1000 2 6,7 0.152 0.874 0.111
8 1000 0,1,1,1,1,0,0 0.006 0.874 0.405
8 1000 0,4,1,1,1,0,0 0.027 0.986 0.298
8 1000 0,4,2,2,1,0,0 0.005 0.992 0.700
9 1000 2,5 0.044 0.984 0.123
1 500 2 0.122 0.942 0.024
2 500 2,3,4,5 0.008 0.802 0.376
2 500 2,3,4 0.000 0.998 0.918
3 500 2,4 0.148 0.540 0.056
4 500 2 3,4,5 0.174 0.024 0.000
4 500 2 2,3,4,5 0.148 0.754 0.026
4 500 2 2,3,4 0.122 0.944 0.034
5 500 2 3,4,5 0.136 0.986 0.106
5 500 2 2,3,4,5 0.156 0.980 0.108
6 500 2 6 0.140 0.734 0.006
6 500 2 6,7 0.176 0.844 0.034
7 500 2 6 0.136 0.836 0.006
7 500 2 6,7 0.114 0.920 0.138
8 500 0,1,1,1,1,0,0 0.004 0.852 0.420
8 500 0,4,1,1,1,0,0 0.030 0.992 0.294
8 500 0,4,2,2,1,0,0 0.000 0.994 0.730
9 500 2,5 0.060 0.978 0.108
Table D.4: Outcomes for (2.9) modelling the RECIST criteria and n=54 patients
Matrix p∅−r p∅−sd p∅−pd p∅−c pur−cr pur−pd pur−c psd−ur psd−sd psd−pd psd−c
Data ∅ 1/36 20/36 10/36 5/36 0 0 0 0 0 0 0
Data eval 2 0 0 0 0 1 0 0 0 16/20 3/20 1/20
Data eval 3 0 0 0 0 0 0 0 2/16 12/16 2/16 0
Data eval 4 0 0 0 0 1/2 0 1/2 1/12 8/12 1/12 2/12
Data eval 5 0 0 0 0 1 0 0 0 7/8 0 1/8
H0 .02 .4 .4 .18 .85 .05 .1 .05 .6 .25 .1
Interim ∅ 0 8/15 6/15 1/15 0 0 0 0 0 0 0
Interim eval 2 0 0 0 0 0 0 0 0 7/8 1/8 0
Interim eval 3 0 0 0 0 0 0 0 1/7 5/7 1/7 0
Interim eval 4 0 0 0 0 1 0 0 0 3/5 0 2/5
Interim eval 5 0 0 0 0 0 0 0 0 3/3 0 0
HA .2 .42 .2 .18 .85 .05 .1 .1 .7 .15 .05
Table D.5: Data input for matrix (4.1) modelling the transition-time dependent RECIST
criteria
Endstates pR pur psd ppd pc
Data 3/36 0 7/36 16/36 10/36
H0 .0503 .0043 .0518 .6206 .2730
Interim 1/15 0 3/15 8/15 3/15
HA .2482 .0144 .1008 .3742 .2624
Table D.6: Endstate probabilities for (4.1) modelling the transition-time dependent RE-
CIST criteria
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.288 0.810 0.051
2 1000 2,3,4 0.008 0.983 0.847
2 1000 2,3 0.311 0.830 0.011
3 1000 2,4 0.279 0.439 0.044
4 1000 2 3,4 0.282 0.309 0.029
4 1000 2 2,3,4 0.283 0.814 0.052
4 1000 2,3 2,3,4 0.273 0.798 0.058
5 1000 2 3,4 0.273 0.915 0.159
5 1000 2 2,3,4 0.275 0.934 0.144
6 1000 2 5 0.278 0.560 0.010
6 1000 2 5,6 0.285 0.802 0.043
7 1000 2 5 0.270 0.824 0.042
7 1000 2 5,6 0.281 0.914 0.155
8 1000 0,1,1,1,0,0 0.002 0.967 0.855
8 1000 0,4,1,1,0,0 0.058 0.930 0.243
8 1000 0,8,1,1,0,0 0.096 0.943 0.165
9 1000 2,4 0.001 0.991 0.966
1 500 2 0.256 0.820 0.060
2 500 2,3,4 0.006 0.962 0.852
2 500 2,3 0.306 0.852 0.036
3 500 2,4 0.270 0.464 0.044
4 500 2 3,4 0.240 0.340 0.028
4 500 2 2,3,4 0.284 0.772 0.050
4 500 2,3 2,3,4 0.274 0.798 0.050
5 500 2 3,4 0.104 0.914 0.168
5 500 2 2,3,4 0.126 0.906 0.152
6 500 2 5 0.272 0.608 0.014
6 500 2 5,6 0.282 0.778 0.058
7 500 2 5 0.134 0.716 0.026
7 500 2 5,6 0.100 0.910 0.176
8 500 0,1,1,1,0,0 0.000 0.970 0.860
8 500 0,4,1,1,0,0 0.028 0.922 0.232
8 500 0,8,1,1,0,0 0.128 0.910 0.162
9 500 2,4 0.000 0.992 0.970
Table D.7: Outcomes for (4.1) modelling the transition-time dependent RECIST criteria
with n=36 patients
n=54 iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.141 0.982 0.112
2 1000 2,3,4 0.000 0.997 0.919
2 1000 2,3 0.191 0.956 0.038
3 1000 2,4 0.129 0.511 0.031
4 1000 2 3,4 0.126 0.631 0.037
4 1000 2 2,3,4 0.120 0.956 0.120
4 1000 2,3 2,3,4 0.147 0.977 0.028
5 1000 2 3,4 0.129 0.994 0.115
5 1000 2 2,3,4 0.130 0.992 0.257
6 1000 2 5 0.145 0.890 0.027
6 1000 2 5,6 0.119 0.977 0.045
7 1000 2 5 0.142 0.918 0.109
7 1000 2 5,6 0.124 0.982 0.237
8 1000 0,1,1,1,0,0 0.001 0.996 0.961
8 1000 0,4,1,1,0,0 0.029 0.989 0.453
8 1000 0,8,1,1,0,0 0.057 0.996 0.274
9 1000 2,4 0.000 0.999 0.998
1 500 2 0.120 0.934 0.098
2 500 2,3,4 0.000 0.998 0.976
2 500 2,3 0.192 0.966 0.044
3 500 2,4 0.132 0.542 0.038
4 500 2 3,4 0.146 0.622 0.096
4 500 2 2,3,4 0.122 0.960 0.094
4 500 2,3 2,3,4 0.144 0.954 0.034
5 500 2 3,4 0.144 0.986 0.108
5 500 2 2,3,4 0.166 0.990 0.260
6 500 2 5 0.150 0.900 0.032
6 500 2 5,6 0.128 0.960 0.042
7 500 2 5 0.148 0.938 0.076
7 500 2 5,6 0.118 0.994 0.276
8 500 0,1,1,1,0,0 0.000 0.990 0.926
8 500 0,4,1,1,0,0 0.026 0.984 0.514
8 500 0,8,1,1,0,0 0.048 0.990 0.234
9 500 2,4 0.000 0.998 0.998
Table D.8: Outcomes for (4.1) modelling the transition-time dependent RECIST criteria
with n=54 patients
Matrix p∅−ur p∅−sd p∅−pd p∅−c pur−cr pur−pd pur−c psd−ur psd−sd psd−pd psd−c
H0 .04 .4 .4 .16 .85 .05 .1 .05 .6 .25 .1
Table D.9: Modified data input (2), slightly better expectations under H0, for matrix
(4.1) modelling the transition-time dependent RECIST criteria
Endstates pR pur psd ppd pc
H0 .0673 .0043 .0518 .6216 .2550
Table D.10: Endstate probabilities for modified data input (2), slightly better expecta-
tions under H0, for (4.1) modelling the transition-time dependent RECIST criteria
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.243 0.801 0.053
2 1000 2,3,4 0.005 0.980 0.841
2 1000 2,3 0.321 0.695 0.047
3 1000 2,4 0.288 0.392 0.036
4 1000 2 3,4 0.273 0.428 0.034
4 1000 2 2,3,4 0.275 0.797 0.039
4 1000 2,3 2,3,4 0.261 0.816 0.036
5 1000 2 3,4 0.111 0.925 0.164
5 1000 2 2,3,4 0.119 0.917 0.148
6 1000 2 5 0.272 0.528 0.009
6 1000 2 5,6 0.279 0.791 0.039
7 1000 2 5 0.124 0.743 0.042
7 1000 2 5,6 0.117 0.909 0.162
8 1000 0,1,1,1,0,0 0.001 0.978 0.874
8 1000 0,4,1,1,0,0 0.064 0.919 0.260
8 1000 0,8,1,1,0,0 0.109 0.906 0.202
9 1000 2,4 0.000 0.991 0.970
1 500 2 0.280 0.820 0.044
2 500 2,3,4 0.000 0.966 0.872
2 500 2,3 0.318 0.832 0.012
3 500 2,4 0.270 0.416 0.030
4 500 2 3,4 0.284 0.316 0.030
4 500 2 2,3,4 0.248 0.810 0.044
4 500 2,3 2,3,4 0.266 0.806 0.050
5 500 2 3,4 0.104 0.946 0.134
5 500 2 2,3,4 0.094 0.922 0.156
6 500 2 5 0.294 0.556 0.010
6 500 2 5,6 0.258 0.814 0.064
7 500 2 5 0.112 0.696 0.074
7 500 2 5,6 0.122 0.916 0.118
8 500 0,1,1,1,0,0 0.006 0.956 0.856
8 500 0,4,1,1,0,0 0.048 0.920 0.324
8 500 0,8,1,1,0,0 0.102 0.926 0.162
9 500 2,4 0.000 0.992 0.966
Table D.11: Outcomes for modified data input (2), slightly better expectations under H0,
for (4.1) modelling the transition-time dependent RECIST criteria with n=36 patients
n=54 iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.139 0.952 0.053
2 1000 2,3,4 0.001 1.000 0.940
2 1000 2,3 0.165 0.969 0.037
3 1000 2,4 0.143 0.722 0.047
4 1000 2 3,4 0.117 0.613 0.044
4 1000 2 2,3,4 0.151 0.940 0.037
4 1000 2,3 2,3,4 0.147 0.983 0.110
5 1000 2 3,4 0.141 0.979 0.268
5 1000 2 2,3,4 0.139 0.988 0.241
6 1000 2 5 0.127 0.883 0.017
6 1000 2 5,6 0.132 0.947 0.096
7 1000 2 5 0.139 0.955 0.058
7 1000 2 5,6 0.140 0.989 0.134
8 1000 0,1,1,1,0,0 0.000 0.997 0.933
8 1000 0,4,1,1,0,0 0.032 0.994 0.327
8 1000 0,8,1,1,0,0 0.051 0.976 0.245
9 1000 2,4 0.000 0.999 0.997
1 500 2 0.146 0.954 0.050
2 500 2,3,4 0.002 1.000 0.966
2 500 2,3 0.176 0.968 0.130
3 500 2,4 0.132 0.728 0.092
4 500 2 3,4 0.124 0.644 0.106
4 500 2 2,3,4 0.124 0.978 0.138
4 500 2,3 2,3,4 0.154 0.964 0.106
5 500 2 3,4 0.142 0.994 0.132
5 500 2 2,3,4 0.142 0.982 0.276
6 500 2 5 0.122 0.898 0.036
6 500 2 5,6 0.128 0.972 0.108
7 500 2 5 0.148 0.940 0.108
7 500 2 5,6 0.154 0.994 0.110
8 500 0,1,1,1,0,0 0.002 0.994 0.952
8 500 0,4,1,1,0,0 0.038 0.998 0.476
8 500 0,8,1,1,0,0 0.050 0.984 0.174
9 500 2,4 0.000 1.000 0.996
Table D.12: Outcomes for modified data input (2), slightly better expectations under
H0, for matrix (4.1) modelling the transition-time dependent RECIST criteria with n=54
patients
Matrix p∅−r p∅−sd p∅−pd p∅−c pur−cr pur−pd pur−c psd−ur psd−sd psd−pd psd−c
H0 .06 .6 .2 .16 .95 .025 .025 .01 .8 .15 .04
Table D.13: Modified data input (3), extremely better expectations under H0, for matrix
(4.1) modelling the transition-time dependent RECIST criteria
Endstates pR pur psd ppd pc
H0 .0709 .0031 .2458 .4675 .2327
Table D.14: Endstate probabilities for modified data input (3), extremely better expectations
under H0, for matrix (4.1) modelling the transition-time dependent RECIST criteria
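The end-state probabilities of this kind of table follow mechanically from the corresponding transition rates by propagating the state distribution through the evaluation schedule. The sketch below is illustrative Python (the thesis's own program is written in R); the four post-baseline transitions and the reading of the states (ur = unconfirmed response, sd = stable disease, pd = progression, c = off-study, R = confirmed response) are assumptions taken from the surrounding tables.

```python
# Illustrative sketch: end-state probabilities for matrix (4.1) under H0,
# using the H0 transition rates listed in Table D.13.
# Assumed states: ur = unconfirmed response, sd = stable disease,
# pd = progressive disease, c = off-study, R = confirmed response (absorbing).
ur, sd, pd, c, R = 0.06, 0.60, 0.20, 0.16, 0.0   # distribution after evaluation 1

for _ in range(4):   # assumed evaluations 2 through 5
    # tuple assignment so the old values of ur and sd feed every update
    R, pd, c = (R + 0.95 * ur,
                pd + 0.025 * ur + 0.15 * sd,
                c + 0.025 * ur + 0.04 * sd)
    ur, sd = 0.01 * sd, 0.80 * sd

print(round(R, 4), round(ur, 4), round(sd, 4), round(pd, 4), round(c, 4))
# → 0.0709 0.0031 0.2458 0.4675 0.2327, the H0 row of Table D.14
```

The same loop, seeded with the HA or observed rates instead of the H0 row, yields the other end-state rows.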
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.465 0.630 0.007
2 1000 2,3,4 0.708 0.019 0.002
2 1000 2,3 0.511 0.497 0.002
3 1000 2,4 0.914 0.000 0.000
4 1000 2 3,4 0.919 0.000 0.000
4 1000 2 2,3,4 0.775 0.016 0.000
4 1000 2,3 2,3,4 0.814 0.017 0.000
5 1000 2 3,4 0.444 0.631 0.006
5 1000 2 2,3,4 0.417 0.635 0.014
6 1000 2 5 0.705 0.016 0.000
6 1000 2 5,6 0.791 0.017 0.000
7 1000 2 5 0.601 0.027 0.000
7 1000 2 5,6 0.716 0.039 0.002
8 1000 0,1,1,1,0,0 0.743 0.014 0.000
8 1000 0,4,1,1,0,0 0.493 0.480 0.006
8 1000 0,8,1,1,0,0 0.429 0.624 0.009
9 1000 2,4 0.386 0.638 0.033
1 500 2 0.438 0.646 0.010
2 500 2,3,4 0.734 0.010 0.000
2 500 2,3 0.482 0.672 0.008
3 500 2,4 0.878 0.000 0.000
4 500 2 3,4 0.908 0.000 0.000
4 500 2 2,3,4 0.798 0.016 0.000
4 500 2,3 2,3,4 0.796 0.014 0.000
5 500 2 3,4 0.414 0.610 0.014
5 500 2 2,3,4 0.424 0.624 0.012
6 500 2 5 0.672 0.026 0.000
6 500 2 5,6 0.794 0.004 0.000
7 500 2 5 0.546 0.022 0.000
7 500 2 5,6 0.752 0.038 0.002
8 500 0,1,1,1,0,0 0.752 0.020 0.002
8 500 0,4,1,1,0,0 0.500 0.522 0.008
8 500 0,8,1,1,0,0 0.396 0.608 0.014
9 500 2,4 0.320 0.666 0.012
Table D.15: Outcomes for modified data input (3), extremely better expectations under
H0, for matrix (4.1) modelling the transition-time dependent RECIST criteria with n=36
patients
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.341 0.899 0.012
2 1000 2,3,4 0.774 0.033 0.000
2 1000 2,3 0.338 0.922 0.016
3 1000 2,4 0.871 0.000 0.000
4 1000 2 3,4 0.876 0.000 0.000
4 1000 2 2,3,4 0.791 0.054 0.000
4 1000 2,3 2,3,4 0.797 0.048 0.000
5 1000 2 3,4 0.339 0.873 0.015
5 1000 2 2,3,4 0.352 0.886 0.024
6 1000 2 5 0.607 0.074 0.000
6 1000 2 5,6 0.781 0.036 0.000
7 1000 2 5 0.591 0.139 0.000
7 1000 2 5,6 0.790 0.144 0.002
8 1000 0,1,1,1,0,0 0.769 0.080 0.002
8 1000 0,4,1,1,0,0 0.453 0.778 0.013
8 1000 0,8,1,1,0,0 0.393 0.863 0.022
9 1000 2,4 0.238 0.888 0.026
1 500 2 0.344 0.778 0.018
2 500 2,3,4 0.746 0.044 0.000
2 500 2,3 0.366 0.944 0.016
3 500 2,4 0.860 0.000 0.000
4 500 2 3,4 0.886 0.000 0.000
4 500 2 2,3,4 0.818 0.072 0.000
4 500 2,3 2,3,4 0.768 0.086 0.000
5 500 2 3,4 0.340 0.880 0.012
5 500 2 2,3,4 0.336 0.872 0.020
6 500 2 5 0.608 0.102 0.000
6 500 2 5,6 0.796 0.044 0.000
7 500 2 5 0.612 0.138 0.000
7 500 2 5,6 0.758 0.064 0.000
8 500 0,1,1,1,0,0 0.718 0.038 0.002
8 500 0,4,1,1,0,0 0.470 0.746 0.010
8 500 0,8,1,1,0,0 0.382 0.896 0.012
9 500 2,4 0.212 0.880 0.034
Table D.16: Outcomes for modified data input (3), extremely better expectations under
H0, for matrix (4.1) modelling the transition-time dependent RECIST criteria with n=54
patients
Matrix p∅−ur p∅−sd p∅−pd p∅−c pur−cr pur−pd pur−c psd−ur psd−sd psd−pd psd−c
H0 .05 .4 .4 .15 .7 .2 .1 .03 .5 .4 .07
Table D.17: Modified data input (4), hypothesizing a cytotoxic treatment with improved
immediate response but no durability, for matrix (4.1) modelling the transition-time
dependent RECIST criteria
Endstates pR pur psd ppd pc
H0 .0497 .0015 .0250 .7142 .2096
Table D.18: Endstate probabilities for modified data input (4), hypothesizing a cytotoxic
treatment with improved immediate response but no durability, for matrix (4.1) modelling
the transition-time dependent RECIST criteria
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.248 0.787 0.036
2 1000 2,3,4 0.000 1.000 0.987
2 1000 2,3 0.308 0.824 0.043
3 1000 2,4 0.235 0.716 0.049
4 1000 2 3,4 0.253 0.731 0.038
4 1000 2 2,3,4 0.279 0.794 0.040
4 1000 2,3 2,3,4 0.237 0.804 0.039
5 1000 2 3,4 0.117 0.921 0.166
5 1000 2 2,3,4 0.103 0.921 0.143
6 1000 2 5 0.242 0.782 0.034
6 1000 2 5,6 0.261 0.818 0.044
7 1000 2 5 0.098 0.923 0.145
7 1000 2 5,6 0.105 0.923 0.145
8 1000 0,1,1,1,0,0 0.000 1.000 0.993
8 1000 0,4,1,1,0,0 0.037 0.944 0.346
8 1000 0,8,1,1,0,0 0.085 0.932 0.185
9 1000 2,4 0.000 1.000 1.000
1 500 2 0.298 0.786 0.036
2 500 2,3,4 0.000 0.994 0.954
2 500 2,3 0.244 0.800 0.046
3 500 2,4 0.290 0.720 0.060
4 500 2 3,4 0.260 0.688 0.062
4 500 2 2,3,4 0.262 0.808 0.052
4 500 2,3 2,3,4 0.232 0.566 0.056
5 500 2 3,4 0.088 0.916 0.162
5 500 2 2,3,4 0.118 0.906 0.152
6 500 2 5 0.252 0.760 0.036
6 500 2 5,6 0.248 0.762 0.056
7 500 2 5 0.106 0.918 0.144
7 500 2 5,6 0.112 0.912 0.152
8 500 0,1,1,1,0,0 0.000 0.998 0.988
8 500 0,4,1,1,0,0 0.024 0.964 0.460
8 500 0,8,1,1,0,0 0.118 0.916 0.190
9 500 2,4 0.000 1.000 0.998
Table D.19: Outcomes for modified data input (4), hypothesizing a cytotoxic treatment
with improved immediate response but no durability, for matrix (4.1) modelling the
transition-time dependent RECIST criteria with n=36 patients
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.119 0.944 0.057
2 1000 2,3,4 0.000 1.000 0.995
2 1000 2,3 0.131 0.958 0.036
3 1000 2,4 0.134 0.891 0.098
4 1000 2 3,4 0.133 0.902 0.037
4 1000 2 2,3,4 0.116 0.943 0.109
4 1000 2,3 2,3,4 0.150 0.983 0.040
5 1000 2 3,4 0.129 0.982 0.131
5 1000 2 2,3,4 0.139 0.994 0.117
6 1000 2 5 0.131 0.976 0.046
6 1000 2 5,6 0.129 0.935 0.139
7 1000 2 5 0.124 0.995 0.101
7 1000 2 5,6 0.135 0.978 0.130
8 1000 0,1,1,1,0,0 0.000 1.000 0.999
8 1000 0,4,1,1,0,0 0.019 0.994 0.552
8 1000 0,8,1,1,0,0 0.057 0.990 0.241
9 1000 2,4 0.000 1.000 1.000
1 500 2 0.122 0.938 0.056
2 500 2,3,4 0.000 1.000 0.988
2 500 2,3 0.138 0.962 0.036
3 500 2,4 0.130 0.944 0.034
4 500 2 3,4 0.130 0.876 0.054
4 500 2 2,3,4 0.138 0.978 0.028
4 500 2,3 2,3,4 0.142 0.982 0.114
5 500 2 3,4 0.104 0.994 0.114
5 500 2 2,3,4 0.138 0.990 0.264
6 500 2 5 0.130 0.918 0.102
6 500 2 5,6 0.126 0.958 0.040
7 500 2 5 0.092 0.982 0.276
7 500 2 5,6 0.118 0.998 0.252
8 500 0,1,1,1,0,0 0.000 1.000 0.990
8 500 0,4,1,1,0,0 0.010 0.998 0.468
8 500 0,8,1,1,0,0 0.044 0.996 0.280
9 500 2,4 0.000 1.000 1.000
Table D.20: Outcomes for modified data input (4), hypothesizing a cytotoxic treatment
with improved immediate response but no durability, for matrix (4.1) modelling the
transition-time dependent RECIST criteria with n=54 patients
Matrix p∅−ur p∅−sd p∅−pd p∅−c pur−cr pur−pd pur−c psd−ur psd−sd psd−pd psd−c
H0 .15 .6 .15 .1 .85 .05 .1 .05 .5 .3 .15
Table D.21: Modified data input (5), an extreme optimist, for matrix (4.1) modelling the
transition-time dependent RECIST criteria
Endstates pR pur psd ppd pc
H0 .1721 .0038 .0375 .4976 .2890
Table D.22: Endstate probabilities for modified data input (5), an extreme optimist, for
matrix (4.1) modelling the transition-time dependent RECIST criteria
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.952 0.021 0.000
2 1000 2,3,4 0.224 0.335 0.084
2 1000 2,3 0.966 0.042 0.000
3 1000 2,4 0.961 0.009 0.000
4 1000 2 3,4 0.960 0.013 0.000
4 1000 2 2,3,4 0.967 0.024 0.000
4 1000 2,3 2,3,4 0.954 0.022 0.000
5 1000 2 3,4 0.878 0.062 0.000
5 1000 2 2,3,4 0.884 0.046 0.000
6 1000 2 5 0.954 0.008 0.000
6 1000 2 5,6 0.960 0.025 0.000
7 1000 2 5 0.891 0.007 0.000
7 1000 2 5,6 0.895 0.058 0.000
8 1000 0,1,1,1,0,0 0.221 0.347 0.074
8 1000 0,4,1,1,0,0 0.799 0.085 0.000
8 1000 0,8,1,1,0,0 0.891 0.054 0.000
9 1000 2,4 0.000 0.939 0.988
1 500 2 0.954 0.024 0.000
2 500 2,3,4 0.204 0.520 0.092
2 500 2,3 0.980 0.042 0.000
3 500 2,4 0.956 0.012 0.000
4 500 2 3,4 0.944 0.000 0.000
4 500 2 2,3,4 0.966 0.010 0.000
4 500 2,3 2,3,4 0.950 0.022 0.000
5 500 2 3,4 0.922 0.048 0.000
5 500 2 2,3,4 0.876 0.058 0.000
6 500 2 5 0.972 0.010 0.000
6 500 2 5,6 0.966 0.016 0.000
7 500 2 5 0.920 0.000 0.000
7 500 2 5,6 0.886 0.034 0.000
8 500 0,1,1,1,0,0 0.224 0.346 0.088
8 500 0,4,1,1,0,0 0.796 0.080 0.000
8 500 0,8,1,1,0,0 0.880 0.054 0.000
9 500 2,4 0.000 0.960 0.990
Table D.23: Outcomes for modified data input (5), an extreme optimist, for matrix (4.1)
modelling the transition-time dependent RECIST criteria with n=36 patients
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.962 0.081 0.000
2 1000 2,3,4 0.163 0.569 0.134
2 1000 2,3 0.969 0.113 0.000
3 1000 2,4 0.967 0.048 0.000
4 1000 2 3,4 0.964 0.037 0.000
4 1000 2 2,3,4 0.964 0.085 0.000
4 1000 2,3 2,3,4 0.975 0.071 0.000
5 1000 2 3,4 0.966 0.153 0.000
5 1000 2 2,3,4 0.957 0.146 0.000
6 1000 2 5 0.968 0.045 0.000
6 1000 2 5,6 0.959 0.086 0.000
7 1000 2 5 0.967 0.080 0.000
7 1000 2 5,6 0.966 0.155 0.000
8 1000 0,1,1,1,0,0 0.175 0.593 0.136
8 1000 0,4,1,1,0,0 0.823 0.264 0.000
8 1000 0,8,1,1,0,0 0.919 0.157 0.000
9 1000 2,4 0.000 0.962 0.991
1 500 2 0.974 0.090 0.000
2 500 2,3,4 0.186 0.570 0.198
2 500 2,3 0.966 0.106 0.000
3 500 2,4 0.964 0.054 0.000
4 500 2 3,4 0.974 0.048 0.000
4 500 2 2,3,4 0.978 0.028 0.000
4 500 2,3 2,3,4 0.958 0.066 0.000
5 500 2 3,4 0.972 0.138 0.000
5 500 2 2,3,4 0.974 0.142 0.000
6 500 2 5 0.978 0.024 0.000
6 500 2 5,6 0.978 0.076 0.000
7 500 2 5 0.978 0.076 0.000
7 500 2 5,6 0.980 0.166 0.000
8 500 0,1,1,1,0,0 0.142 0.570 0.122
8 500 0,4,1,1,0,0 0.866 0.316 0.000
8 500 0,8,1,1,0,0 0.924 0.152 0.000
9 500 2,4 0.000 0.958 0.998
Table D.24: Outcomes for modified data input (5), an extreme optimist, for matrix (4.1)
modelling the transition-time dependent RECIST criteria with n=54 patients
Matrix p∅−ur p∅−sd p∅−pd p∅−c pur−cr pur−pd pur−c psd−ur psd−sd psd−pd psd−c
Data ∅ 1/36 20/36 10/36 5/36 0 0 0 0 0 0 0
Data eval 5 0 0 0 0 1 0 0 0 6/7 1/7 0
H0 .02 .4 .4 .18 .85 .05 .1 .05 .6 .25 .1
Interim eval 5 0 0 0 0 0 0 0 0 2/3 1/3 0
Table D.25: Data input (6), an additional transition, for matrix (4.1) modelling the
transition-time dependent RECIST criteria
Endstates pR pur psd ppd pc
H0 .0540 .0026 .0311 .6337 .2786
data .0833 .0000 .1667 .4722 .2778
Interim data .0667 .0000 .1333 .6000 .2000
HA .2604 .0101 .0706 .3900 .2689
Table D.26: Endstate probabilities for data input (6), an additional transition, for matrix
(4.1) modelling the transition-time dependent RECIST criteria
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.321 0.854 0.053
2 1000 2,3,4 0.002 0.964 0.623
2 1000 2,3 0.333 0.842 0.050
3 1000 2,4 0.321 0.275 0.006
4 1000 2 3,4 0.299 0.348 0.021
4 1000 2 2,3,4 0.309 0.683 0.043
4 1000 2,3 2,3,4 0.289 0.809 0.043
5 1000 2 3,4 0.126 0.813 0.047
5 1000 2 2,3,4 0.125 0.852 0.116
6 1000 2 5 0.318 0.457 0.001
6 1000 2 5,6 0.315 0.844 0.047
7 1000 2 5 0.143 0.662 0.015
7 1000 2 5,6 0.129 0.931 0.143
8 1000 0,1,1,1,0,0 0.002 0.960 0.651
8 1000 0,4,1,1,0,0 0.083 0.934 0.128
8 1000 0,8,1,1,0,0 0.116 0.927 0.112
9 1000 2,4 0.000 0.987 0.876
1 500 2 0.306 0.662 0.040
2 500 2,3,4 0.010 0.962 0.648
2 500 2,3 0.360 0.744 0.040
3 500 2,4 0.294 0.282 0.034
4 500 2 3,4 0.294 0.314 0.018
4 500 2 2,3,4 0.316 0.778 0.038
4 500 2,3 2,3,4 0.280 0.756 0.064
5 500 2 3,4 0.148 0.896 0.134
5 500 2 2,3,4 0.144 0.890 0.042
6 500 2 5 0.350 0.496 0.006
6 500 2 5,6 0.302 0.834 0.008
7 500 2 5 0.150 0.582 0.014
7 500 2 5,6 0.108 0.914 0.146
8 500 0,1,1,1,0,0 0.000 0.922 0.644
8 500 0,4,1,1,0,0 0.066 0.874 0.198
8 500 0,8,1,1,0,0 0.110 0.916 0.206
9 500 2,4 0.000 0.994 0.876
Table D.27: Outcomes for data input (6), an additional transition, for matrix (4.1)
modelling the transition-time dependent RECIST criteria with n=36 patients
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.163 0.965 0.049
2 1000 2,3,4 0.000 0.999 0.680
2 1000 2,3 0.188 0.982 0.036
3 1000 2,4 0.150 0.500 0.031
4 1000 2 3,4 0.183 0.585 0.032
4 1000 2 2,3,4 0.183 0.963 0.039
4 1000 2,3 2,3,4 0.169 0.966 0.048
5 1000 2 3,4 0.169 0.990 0.123
5 1000 2 2,3,4 0.166 0.986 0.135
6 1000 2 5 0.162 0.815 0.005
6 1000 2 5,6 0.165 0.961 0.032
7 1000 2 5 0.170 0.926 0.025
7 1000 2 5,6 0.149 0.990 0.105
8 1000 0,1,1,1,0,0 0.001 0.995 0.819
8 1000 0,4,1,1,0,0 0.037 0.981 0.269
8 1000 0,8,1,1,0,0 0.065 0.984 0.137
9 1000 2,4 0.000 0.999 0.955
1 500 2 0.620 0.964 0.030
2 500 2,3,4 0.000 0.988 0.826
2 500 2,3 0.172 0.968 0.048
3 500 2,4 0.170 0.490 0.014
4 500 2 3,4 0.172 0.374 0.038
4 500 2 2,3,4 0.160 0.976 0.030
4 500 2,3 2,3,4 0.142 0.964 0.042
5 500 2 3,4 0.178 0.984 0.110
5 500 2 2,3,4 0.122 0.976 0.130
6 500 2 5 0.176 0.892 0.010
6 500 2 5,6 0.172 0.972 0.052
7 500 2 5 0.158 0.924 0.018
7 500 2 5,6 0.172 0.980 0.124
8 500 0,1,1,1,0,0 0.000 1.000 0.820
8 500 0,4,1,1,0,0 0.054 0.984 0.252
8 500 0,8,1,1,0,0 0.084 0.988 0.168
9 500 2,4 0.000 1.000 0.970
Table D.28: Outcomes for data input (6), an additional transition, for matrix (4.1)
modelling the transition-time dependent RECIST criteria with n=54 patients
Endstates pR pur psd ppd pc
H0 .0442 .0072 .0864 .5986 .2636
data .0556 .0278 .2222 .4444 .2500
Interim data .0667 .0000 .2000 .5333 .2000
HA .2307 .0206 .1441 .3515 .2531
Table D.29: Endstate probabilities for matrix (4.1) modelling the transition-time
dependent RECIST criteria with only 3 transitions
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.493 0.748 0.049
2 1000 2,3,4 0.006 0.903 0.502
2 1000 2,3 0.280 0.831 0.036
3 1000 2,4 0.478 0.238 0.021
4 1000 2 3,4 0.471 0.246 0.005
4 1000 2 2,3,4 0.503 0.733 0.042
4 1000 2,3 2,3,4 0.474 0.714 0.039
5 1000 2 3,4 0.245 0.887 0.158
5 1000 2 2,3,4 0.230 0.888 0.159
6 1000 2 5 0.495 0.521 0.010
6 1000 2 5,6 0.465 0.702 0.040
7 1000 2 5 0.265 0.663 0.049
7 1000 2 5,6 0.234 0.881 0.154
8 1000 0,1,1,1,0,0 0.006 0.961 0.509
8 1000 0,4,1,1,0,0 0.097 0.914 0.216
8 1000 0,8,1,1,0,0 0.210 0.884 0.165
9 1000 2,4 0.000 0.984 0.872
1 500 2 0.470 0.750 0.064
2 500 2,3,4 0.002 0.908 0.480
2 500 2,3 0.308 0.836 0.056
3 500 2,4 0.478 0.206 0.030
4 500 2 3,4 0.484 0.292 0.010
4 500 2 2,3,4 0.464 0.738 0.026
4 500 2,3 2,3,4 0.494 0.732 0.058
5 500 2 3,4 0.208 0.890 0.156
5 500 2 2,3,4 0.190 0.882 0.162
6 500 2 5 0.514 0.536 0.012
6 500 2 5,6 0.450 0.732 0.050
7 500 2 5 0.232 0.694 0.048
7 500 2 5,6 0.176 0.860 0.144
8 500 0,1,1,1,0,0 0.004 0.912 0.682
8 500 0,4,1,1,0,0 0.110 0.940 0.332
8 500 0,8,1,1,0,0 0.202 0.908 0.168
9 500 2,4 0.000 0.986 0.856
Table D.30: Outcomes for matrix (4.1) modelling the transition-time dependent RECIST
criteria with 3 transitions and n=36 patients
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.423 0.965 0.117
2 1000 2,3,4 0.000 0.995 0.736
2 1000 2,3 0.139 0.946 0.038
3 1000 2,4 0.421 0.476 0.059
4 1000 2 3,4 0.437 0.397 0.058
4 1000 2 2,3,4 0.449 0.968 0.096
4 1000 2,3 2,3,4 0.422 0.959 0.109
5 1000 2 3,4 0.223 0.986 0.257
5 1000 2 2,3,4 0.216 0.991 0.269
6 1000 2 5 0.432 0.864 0.019
6 1000 2 5,6 0.411 0.960 0.093
7 1000 2 5 0.218 0.959 0.075
7 1000 2 5,6 0.233 0.981 0.232
8 1000 0,1,1,1,0,0 0.002 0.991 0.734
8 1000 0,4,1,1,0,0 0.059 0.992 0.450
8 1000 0,8,1,1,0,0 0.146 0.992 0.254
9 1000 2,4 0.001 0.999 0.934
1 500 2 0.456 0.968 0.094
2 500 2,3,4 0.002 0.990 0.736
2 500 2,3 0.124 0.948 0.122
3 500 2,4 0.392 0.458 0.068
4 500 2 3,4 0.446 0.448 0.064
4 500 2 2,3,4 0.412 0.960 0.108
4 500 2,3 2,3,4 0.420 0.958 0.094
5 500 2 3,4 0.240 0.988 0.278
5 500 2 2,3,4 0.218 0.994 0.250
6 500 2 5 0.446 0.892 0.014
6 500 2 5,6 0.400 0.956 0.102
7 500 2 5 0.240 0.930 0.068
7 500 2 5,6 0.192 0.976 0.230
8 500 0,1,1,1,0,0 0.000 0.990 0.734
8 500 0,4,1,1,0,0 0.044 0.998 0.358
8 500 0,8,1,1,0,0 0.112 0.992 0.296
9 500 2,4 0.000 0.998 0.928
Table D.31: Outcomes for matrix (4.1) modelling the transition-time dependent RECIST
criteria with 3 transitions and n=54 patients
Matrix pr−r pr−pd pr−c p∅−r p∅−sd p∅−pd p∅−c pur−r pur−pd pur−c psd−r psd−sd psd−pd psd−c
Data ∅ 0 0 0 1/36 20/36 10/36 5/36 0 0 0 0 0 0 0
Data eval 2 0 0 0 0 0 0 0 1/1 0 0 0 16/20 3/20 1/20
Data eval 3 0 0 1/1 0 0 0 0 0 0 0 2/16 12/16 2/16 0
Data eval 4 0 0 0 0 0 0 0 1/2 0 1/2 1/12 8/12 1/12 2/12
Data eval 5 1/1 0 0 0 0 0 0 1/1 0 0 0 7/8 0 1/8
H0 .8 .15 .05 .02 .4 .4 .18 .85 .05 .1 .05 .6 .25 .1
Interim ∅ 0 0 0 0 8/15 6/15 1/15 0 0 0 0 0 0 0
Interim eval 2 0 0 0 0 0 0 0 0 0 0 0 7/8 1/8 0
Interim eval 3 0 0 0 0 0 0 0 0 0 0 1/7 5/7 1/7 0
Interim eval 4 0 0 0 0 0 0 0 1 0 0 0 3/5 0 2/5
Interim eval 5 0 0 0 0 0 0 0 0 0 0 0 3/3 0 0
HA .85 .1 .05 .2 .42 .2 .18 .85 .05 .1 .1 .7 .15 .05
Table D.32: Data input for matrix (4.2) modelling the transition-time dependent RECIST criteria with response not an absorbing
state
Endstates pR pur psd ppd pc
Data 2/36 0 7/36 16/36 11/36
H0 .0338 .0043 .0518 .6329 .2771
Interim 1/15 0 3/15 8/15 3/15
HA .1689 .0144 .1008 .4270 .2888
Table D.33: Endstate probabilities for matrix (4.2) modelling the transition-time depen-
dent RECIST criteria with response not an absorbing state
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.133 0.705 0.175
2 1000 2,3,4 0.002 0.952 0.945
2 1000 2,3 0.159 0.772 0.159
3 1000 2,4 0.139 0.316 0.146
4 1000 2 3,4 0.134 0.291 0.124
4 1000 2 2,3,4 0.150 0.697 0.172
4 1000 2,3 2,3,4 0.109 0.695 0.166
5 1000 2 3,4 0.121 0.858 0.414
5 1000 2 2,3,4 0.107 0.810 0.415
6 1000 2 5 0.132 0.435 0.048
6 1000 2 5,6 0.146 0.697 0.159
7 1000 2 5 0.140 0.596 0.178
7 1000 2 5,6 0.135 0.857 0.420
8 1000 0,1,1,1,0,0 0.001 0.968 0.953
8 1000 0,4,1,1,0,0 0.040 0.888 0.552
8 1000 0,8,1,1,0,0 0.114 0.912 0.438
9 1000 2,4 0.000 0.986 0.991
1 500 2 0.110 0.730 0.146
2 500 2,3,4 0.004 0.890 0.940
2 500 2,3 0.130 0.536 0.190
3 500 2,4 0.134 0.414 0.128
4 500 2 3,4 0.126 0.402 0.098
4 500 2 2,3,4 0.104 0.732 0.168
4 500 2,3 2,3,4 0.088 0.702 0.176
5 500 2 3,4 0.126 0.906 0.398
5 500 2 2,3,4 0.128 0.874 0.374
6 500 2 5 0.096 0.452 0.020
6 500 2 5,6 0.128 0.696 0.186
7 500 2 5 0.118 0.666 0.152
7 500 2 5,6 0.146 0.914 0.428
8 500 0,1,1,1,0,0 0.004 0.954 0.962
8 500 0,4,1,1,0,0 0.046 0.896 0.468
8 500 0,8,1,1,0,0 0.116 0.782 0.438
9 500 2,4 0.000 0.990 0.984
Table D.34: Outcomes for matrix (4.2) modelling the transition-time dependent RECIST
criteria with response not an absorbing state and n=36 patients
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.109 0.907 0.266
2 1000 2,3,4 0.000 0.980 0.969
2 1000 2,3 0.143 0.959 0.119
3 1000 2,4 0.101 0.487 0.229
4 1000 2 3,4 0.100 0.596 0.236
4 1000 2 2,3,4 0.108 0.910 0.254
4 1000 2,3 2,3,4 0.093 0.914 0.249
5 1000 2 3,4 0.110 0.974 0.474
5 1000 2 2,3,4 0.116 0.977 0.506
6 1000 2 5 0.121 0.716 0.081
6 1000 2 5,6 0.128 0.911 0.264
7 1000 2 5 0.120 0.863 0.235
7 1000 2 5,6 0.111 0.969 0.504
8 1000 0,1,1,1,0,0 0.000 0.982 0.984
8 1000 0,4,1,1,0,0 0.029 0.972 0.766
8 1000 0,8,1,1,0,0 0.073 0.975 0.532
9 1000 2,4 0.000 0.994 0.999
1 500 2 0.124 0.930 0.256
2 500 2,3,4 0.000 0.988 0.996
2 500 2,3 0.124 0.934 0.104
3 500 2,4 0.118 0.482 0.246
4 500 2 3,4 0.096 0.610 0.238
4 500 2 2,3,4 0.104 0.902 0.272
4 500 2,3 2,3,4 0.102 0.896 0.280
5 500 2 3,4 0.098 0.956 0.544
5 500 2 2,3,4 0.124 0.972 0.478
6 500 2 5 0.106 0.778 0.122
6 500 2 5,6 0.114 0.914 0.264
7 500 2 5 0.124 0.876 0.230
7 500 2 5,6 0.106 0.952 0.500
8 500 0,1,1,1,0,0 0.002 0.980 0.974
8 500 0,4,1,1,0,0 0.044 0.978 0.748
8 500 0,8,1,1,0,0 0.084 0.966 0.508
9 500 2,4 0.000 0.996 0.994
Table D.35: Outcomes for matrix (4.2) modelling the transition-time dependent RECIST
criteria with response not an absorbing state and n=54 patients
Matrix p∅−r p∅−sd p∅−pd p∅−o pr−r pr−sd pr−pd pr−o psd−r psd−sd psd−pd psd−o ppd−r ppd−sd ppd−pd ppd−o
Data ∅ 11/36 9/36 12/36 4/36 0 0 0 0 0 0 0 0 0 0 0 0
Data eval 2 0 0 0 0 4/11 3/11 3/11 1/11 0/9 3/9 4/9 2/9 0/12 2/12 0/12 10/12
Data eval 3 0 0 0 0 2/4 1/4 0/4 1/4 2/8 3/8 2/8 1/8 3/7 3/7 0/7 1/7
Data eval 4 0 0 0 0 1/7 3/7 1/7 2/7 2/7 2/7 1/7 2/7 0/2 1/2 0/2 1/2
Data eval 5 0 0 0 0 1/3 2/3 0/3 0/3 3/6 1/6 1/6 1/6 0/2 2/2 0/2 0/2
H0 .2 .35 .35 .1 .5 .3 .1 .1 .1 .4 .3 .2 .05 .2 .1 .65
Interim ∅ 4/15 3/15 7/15 1/15 0 0 0 0 0 0 0 0 0 0 0 0
Interim eval 2 0 0 0 0 2/4 1/4 1/4 0/4 0/3 2/3 1/3 0/3 0/7 1/7 0/7 6/7
Interim eval 3 0 0 0 0 1/2 1/2 0/2 0/2 1/4 2/4 1/4 0/4 0/2 1/2 0/2 1/2
Interim eval 4 0 0 0 0 0/2 1/2 0/2 1/2 2/4 1/4 0/4 1/4 0/1 0/1 0/1 1/1
Interim eval 5 0 0 0 0 0/2 2/2 0/2 0/2 1/2 1/2 0/2 0/2 0 0 0 0
HA .4 .4 .1 .1 .7 .2 .05 .05 .1 .6 .2 .1 .1 .2 .3 .4
Table D.36: Data input for matrix (4.3) modelling the change in response (10%) at each transition
Endstates pR psd ppd poff
Data 4/36 5/36 1/36 26/36
H0 .0568 .0916 .0548 .7967
Interim 1/15 3/15 0/15 11/15
HA .2024 .2241 .0965 .4771
Table D.37: Endstate probabilities for matrix (4.3) modelling the change in response
(10%) at each transition
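Because every row of Table D.36 gives the full set of rates out of a current state, the end-state probabilities in Table D.37 can be recovered by treating those rates as a transition matrix and applying it over the remaining evaluations. Below is an illustrative Python sketch (the thesis's program is written in R); the four-step horizon, inferred from evaluations 2 through 5, is an assumption.

```python
import numpy as np

# H0 transition rates from Table D.36 over the states (r, sd, pd, off);
# off-study is treated as absorbing.
P = np.array([
    [0.50, 0.30, 0.10, 0.10],   # from r
    [0.10, 0.40, 0.30, 0.20],   # from sd
    [0.05, 0.20, 0.10, 0.65],   # from pd
    [0.00, 0.00, 0.00, 1.00],   # from off (absorbing)
])
start = np.array([0.20, 0.35, 0.35, 0.10])   # distribution after evaluation 1
end = start @ np.linalg.matrix_power(P, 4)   # assumed evaluations 2 through 5

print(end.round(4))   # → [0.0568 0.0916 0.0548 0.7967], the H0 row of Table D.37
```

Writing the update as a single matrix power makes the dependence on the number of transitions explicit: changing the exponent is all that is needed to examine shorter or longer evaluation schedules.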
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.121 0.632 0.016
2 1000 2,3,4 0.196 0.860 0.066
2 1000 2,3 0.072 0.955 0.500
3 1000 2,4 0.900 0.014 0.000
4 1000 2 3,4 0.549 0.161 0.000
4 1000 2 2,3,4 0.253 0.591 0.010
4 1000 2,3 2,3,4 0.265 0.411 0.005
5 1000 2 3,4 0.149 0.628 0.070
5 1000 2 2,3,4 0.149 0.671 0.044
6 1000 2 5 0.258 0.590 0.004
7 1000 2 5 0.190 0.781 0.031
8 1000 0,1,1,1,0 0.185 0.865 0.183
8 1000 0,4,1,1,0 0.084 0.822 0.064
8 1000 0,8,1,1,0 0.092 0.679 0.049
9 1000 2,4 0.130 0.743 0.017
1 500 2 0.106 0.442 0.054
2 500 2,3,4 0.156 0.864 0.150
2 500 2,3 0.092 0.946 0.504
3 500 2,4 0.896 0.004 0.000
4 500 2 3,4 0.518 0.182 0.000
4 500 2 2,3,4 0.256 0.504 0.010
4 500 2,3 2,3,4 0.232 0.434 0.004
5 500 2 3,4 0.146 0.790 0.048
5 500 2 2,3,4 0.158 0.648 0.060
6 500 2 5 0.262 0.386 0.004
7 500 2 5 0.206 0.780 0.042
8 500 0,1,1,1,0 0.168 0.866 0.174
8 500 0,4,1,1,0 0.084 0.762 0.100
8 500 0,8,1,1,0 0.062 0.800 0.048
9 500 2,4 0.100 0.724 0.046
Table D.38: Outcomes for matrix (4.3) modelling the change in response (10%) at each
transition and n=36 patients
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.071 0.839 0.034
2 1000 2,3,4 0.107 0.994 0.216
2 1000 2,3 0.027 0.995 0.620
3 1000 2,4 0.815 0.037 0.000
4 1000 2 3,4 0.460 0.670 0.001
4 1000 2 2,3,4 0.178 0.826 0.025
4 1000 2,3 2,3,4 0.154 0.800 0.025
5 1000 2 3,4 0.088 0.910 0.076
5 1000 2 2,3,4 0.093 0.922 0.113
6 1000 2 5 0.165 0.831 0.022
7 1000 2 5 0.107 0.922 0.048
8 1000 0,1,1,1,0 0.122 0.991 0.223
8 1000 0,4,1,1,0 0.043 0.983 0.135
8 1000 0,8,1,1,0 0.046 0.927 0.065
9 1000 2,4 0.061 0.922 0.044
1 500 2 0.086 0.834 0.048
2 500 2,3,4 0.158 0.998 0.222
2 500 2,3 0.022 0.998 0.740
3 500 2,4 0.818 0.052 0.000
4 500 2 3,4 0.460 0.634 0.004
4 500 2 2,3,4 0.176 0.830 0.020
4 500 2,3 2,3,4 0.168 0.790 0.022
5 500 2 3,4 0.078 0.910 0.038
5 500 2 2,3,4 0.082 0.934 0.092
6 500 2 5 0.146 0.842 0.024
7 500 2 5 0.120 0.940 0.048
8 500 0,1,1,1,0 0.126 0.994 0.214
8 500 0,4,1,1,0 0.048 0.980 0.114
8 500 0,8,1,1,0 0.048 0.950 0.090
9 500 2,4 0.070 0.934 0.062
Table D.39: Outcomes for matrix (4.3) modelling the change in response (10%) at each
transition and n=54 patients
Matrix p∅−r p∅−sd p∅−pd p∅−o pr−r pr−sd pr−pd pr−o psd−r psd−sd psd−pd psd−o ppd−r ppd−sd ppd−pd ppd−o
Data ∅ 9/36 14/36 9/36 4/36 0 0 0 0 0 0 0 0 0 0 0 0
Data eval 2 0 0 0 0 1/9 5/9 2/9 1/9 0/14 10/14 1/14 3/14 0/9 0/9 0/9 9/9
Data eval 3 0 0 0 0 0/1 0/1 0/1 1/1 3/15 10/15 1/15 1/15 1/3 1/3 0/3 1/3
Data eval 4 0 0 0 0 0/4 2/4 0/4 2/4 0/11 8/11 1/11 2/11 0/1 0/1 0/1 1/1
Data eval 5 0 0 0 0 0 0 0 0 1/10 8/10 0/10 1/10 0/1 1/1 0/1 0/1
H0 .25 .6 .05 .1 .25 .6 .05 .1 .05 .6 .15 .2 .05 .15 .05 .75
Interim ∅ 2/15 6/15 6/15 1/15 0 0 0 0 0 0 0 0 0 0 0 0
Interim eval 2 0 0 0 0 0/2 2/2 0/2 0/2 0/6 5/6 1/6 0/6 0/6 0/6 0/6 6/6
Interim eval 3 0 0 0 0 0 0 0 0 1/7 5/7 1/7 0/7 0/1 0/1 0/1 1/1
Interim eval 4 0 0 0 0 0/1 0/1 0/1 1/1 0/5 4/5 0/5 1/5 0/1 0/1 0/1 1/1
Interim eval 5 0 0 0 0 0 0 0 0 0/4 4/4 0/4 0/4 0 0 0 0
HA .25 .6 .05 .1 .5 .4 .05 .05 .05 .7 .15 .1 .05 .4 .15 .4
Table D.40: Data input for matrix (4.3) modelling the change in response (5%) at each transition
Endstates pR psd ppd poff
Data 1/36 9/36 0/36 26/36
H0 .02168 .1627 .0383 .7773
Interim 0/15 4/15 0/15 11/15
HA .06859 .3700 .0823 .4791
Table D.41: Endstate probabilities for matrix (4.3) modelling the change in response
(5%) at each transition
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.170 0.183 0.000
2 1000 2,3,4 0.255 0.861 0.086
2 1000 2,3 0.121 0.773 0.166
3 1000 2,4 1.000 0.002 0.000
4 1000 2 3,4 0.421 0.028 0.000
4 1000 2 2,3,4 0.369 0.123 0.000
4 1000 2,3 2,3,4 0.353 0.109 0.000
5 1000 2 3,4 0.186 0.366 0.000
5 1000 2 2,3,4 0.175 0.419 0.000
6 1000 2 5 0.391 0.115 0.000
7 1000 2 5 0.310 0.283 0.000
8 1000 0,1,1,1,0 0.269 0.863 0.081
8 1000 0,4,1,1,0 0.288 0.526 0.000
8 1000 0,8,1,1,0 0.274 0.343 0.000
9 1000 2,4 0.171 0.278 0.000
9 1000 2,3 0.025 0.917 0.312
1 500 2 0.214 0.182 0.000
2 500 2,3,4 0.264 0.838 0.036
2 500 2,3 0.132 0.762 0.312
3 500 2,4 1.000 0.002 0.000
4 500 2 3,4 0.450 0.032 0.000
4 500 2 2,3,4 0.376 0.180 0.000
4 500 2,3 2,3,4 0.382 0.164 0.000
5 500 2 3,4 0.194 0.410 0.000
5 500 2 2,3,4 0.186 0.418 0.000
6 500 2 5 0.404 0.046 0.000
7 500 2 5 0.300 0.418 0.000
8 500 0,1,1,1,0 0.256 0.854 0.090
8 500 0,4,1,1,0 0.282 0.462 0.000
8 500 0,8,1,1,0 0.290 0.378 0.000
9 500 2,4 0.194 0.262 0.000
9 500 2,3 0.014 0.906 0.338
Table D.42: Outcomes for matrix (4.3) modelling the change in response (5%) at each
transition and n=36 patients
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.322 0.290 0.000
2 1000 2,3,4 0.200 0.992 0.130
2 1000 2,3 0.049 0.961 0.326
3 1000 2,4 1.000 0.044 0.000
4 1000 2 3,4 0.462 0.226 0.000
4 1000 2 2,3,4 0.430 0.275 0.000
4 1000 2,3 2,3,4 0.447 0.255 0.000
5 1000 2 3,4 0.320 0.512 0.000
5 1000 2 2,3,4 0.299 0.524 0.000
6 1000 2 5 0.452 0.254 0.000
7 1000 2 5 0.391 0.493 0.000
8 1000 0,1,1,1,0 0.186 0.982 0.063
8 1000 0,4,1,1,0 0.203 0.914 0.000
8 1000 0,8,1,1,0 0.259 0.646 0.000
9 1000 2,4 0.323 0.566 0.000
9 1000 2,3 0.009 0.989 0.468
1 500 2 0.280 0.288 0.000
2 500 2,3,4 0.188 0.986 0.122
2 500 2,3 0.060 0.982 0.470
3 500 2,4 1.000 0.012 0.000
4 500 2 3,4 0.444 0.208 0.000
4 500 2 2,3,4 0.406 0.270 0.000
4 500 2,3 2,3,4 0.428 0.246 0.000
5 500 2 3,4 0.338 0.514 0.000
5 500 2 2,3,4 0.326 0.506 0.000
6 500 2 5 0.462 0.296 0.000
7 500 2 5 0.410 0.478 0.000
8 500 0,1,1,1,0 0.220 0.984 0.224
8 500 0,4,1,1,0 0.202 0.898 0.000
8 500 0,8,1,1,0 0.236 0.690 0.000
9 500 2,4 0.374 0.568 0.000
9 500 2,3 0.004 0.988 0.598
Table D.43: Outcomes for matrix (4.3) modelling the change in response (5%) at each
transition and n=54 patients
Matrix p∅−r p∅−pd p∅−off pr−r pr−pd pr−off ppd−r ppd−pd ppd−off
Data ∅ 16/36 16/36 4/36 0 0 0 0 0 0
Data eval 2 0 0 0 11/16 2/16 3/16 0/16 6/16 10/16
Data eval 3 0 0 0 9/11 0/11 2/11 3/8 4/8 1/8
Data eval 4 0 0 0 8/12 1/12 3/12 0/4 2/4 2/4
Data eval 5 0 0 0 8/8 0/8 0/8 0/3 2/3 1/3
H0 .4 .4 .2 .6 .2 .2 .2 .4 .4
Interim Data ∅ 6/15 8/15 1/15 0 0 0 0 0 0
Interim Data eval 2 0 0 0 5/6 1/6 0/6 0/8 2/8 6/8
Interim Data eval 3 0 0 0 5/5 0/5 0/5 1/3 2/3 0/3
Interim Data eval 4 0 0 0 4/6 0/6 2/6 0/1 0/1 1/1
Interim Data eval 5 0 0 0 4/4 0/4 0/4 0 0 0
HA .6 .2 .2 .65 .2 .15 .65 .2 .15
Table D.44: Data input for matrix (4.4) modelling the change in response, with no stable
disease, at each transition
Endstates pR ppd poff
Data 8/36 2/36 26/36
H0 .1280 .0800 .7920
Interim 4/15 0/15 11/15
HA .2090 .1739 .6172
Table D.45: Endstate probabilities for matrix (4.4) modelling the change in response,
with no stable disease, at each transition
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.073 0.470 0.714
2 1000 2,3 0.173 0.415 0.078
3 1000 2,3 0.813 0.011 0.000
4 1000 2 3 0.835 0.005 0.000
5 1000 2 3 0.071 0.567 0.693
6 1000 2 4 0.221 0.293 0.088
7 1000 2 4 0.116 0.492 0.190
8 1000 0,1,1,0 0.196 0.422 0.169
8 1000 0,4,1,0 0.074 0.583 0.537
8 1000 0,8,1,0 0.070 0.548 0.702
9 1000 2,3 0.054 0.773 0.710
1 500 2 0.096 0.482 0.716
2 500 2,3 0.242 0.618 0.162
3 500 2,3 0.806 0.006 0.000
4 500 2 3 0.792 0.008 0.000
5 500 2 3 0.058 0.482 0.702
6 500 2 4 0.210 0.344 0.182
7 500 2 4 0.114 0.512 0.174
8 500 0,1,1,0 0.204 0.624 0.070
8 500 0,4,1,0 0.070 0.580 0.464
8 500 0,8,1,0 0.056 0.442 0.702
9 500 2,3 0.052 0.802 0.696
Table D.46: Outcomes for matrix (4.4) modelling the change in response, with no stable
disease, at each transition and n=36 patients
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.031 0.614 0.864
2 1000 2,3 0.138 0.811 0.238
3 1000 2,3 0.802 0.074 0.000
4 1000 2 3 0.824 0.088 0.000
5 1000 2 3 0.026 0.697 0.868
6 1000 2 4 0.145 0.466 0.219
7 1000 2 4 0.084 0.671 0.229
8 1000 0,1,1,0 0.136 0.673 0.196
8 1000 0,4,1,0 0.039 0.757 0.754
8 1000 0,8,1,0 0.030 0.714 0.849
9 1000 2,3 0.019 0.918 0.851
1 500 2 0.026 0.574 0.840
2 500 2,3 0.134 0.774 0.136
3 500 2,3 0.840 0.088 0.000
4 500 2 3 0.828 0.090 0.000
5 500 2 3 0.026 0.734 0.844
6 500 2 4 0.178 0.396 0.220
7 500 2 4 0.072 0.666 0.244
8 500 0,1,1,0 0.118 0.808 0.108
8 500 0,4,1,0 0.036 0.772 0.848
8 500 0,8,1,0 0.034 0.694 0.852
9 500 2,3 0.032 0.940 0.838
Table D.47: Outcomes for matrix (4.4) modelling the change in response, with no stable
disease, at each transition and n=54 patients
Matrix p∅−sd1 p∅−pd psd1−r psd1−sd2 psd1−pd psd2−r psd2−sd3 psd2−pd psd3−r psd3−sd3 psd3−pd
Data ∅ 21/36 15/36 0 0 0 0 0 0 0 0 0
Data eval 2 0 0 1/21 16/21 4/21 0 0 0 0 0 0
Data eval 3 0 0 0 0 0 0/16 14/16 2/16 0 0 0
Data eval 4 0 0 0 0 0 0 0 0 1/14 9/14 4/14
H0 .4 .6 .05 .7 .25 .05 .7 .25 .05 .7 .25
Interim Data ∅ 8/15 7/15 0 0 0 0 0 0 0 0 0
Interim Data eval 2 0 0 0/8 7/8 1/8 0 0 0 0 0 0
Interim Data eval 3 0 0 0 0 0 0/7 6/7 1/7 0 0 0
Interim Data eval 4 0 0 0 0 0 0 0 0 1/6 3/6 2/6
HA .6 .4 .15 .65 .2 .15 .65 .2 .15 .65 .2
Table D.48: Data input for matrix (4.5) modelling response+3 consecutive stable disease observations as a good outcome
Endstates pR psd3 ppd
Data 2/36 9/36 25/36
H0 .0438 .1372 .8190
Interim 1/15 3/15 11/15
HA .1865 .1648 .6487
Table D.49: Endstate probabilities for matrix (4.5) modelling response+3 consecutive
stable disease observations as a good outcome
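Under matrix (4.5) a good outcome is either a response at some evaluation or remaining stable through the sd3 state, so the H0 end-state row of Table D.49 follows directly from the rates in Table D.48. An illustrative Python sketch (the thesis's program is written in R; the three stable-disease transitions after evaluation 1 are read off the data rows of Table D.48):

```python
# H0 rates from Table D.48: after evaluation 1 a patient is stable (0.4) or
# has progressed (0.6); at each later evaluation a stable patient responds
# (0.05), stays stable (0.7), or progresses (0.25).
pR, p_pd, p_sd = 0.0, 0.6, 0.4

for _ in range(3):   # transitions sd1 -> sd2, sd2 -> sd3, sd3 -> sd3
    pR, p_pd, p_sd = pR + 0.05 * p_sd, p_pd + 0.25 * p_sd, 0.7 * p_sd

print(round(pR, 4), round(p_sd, 4), round(p_pd, 4))
# → 0.0438 0.1372 0.819, the H0 row of Table D.49
```

The matrix (4.6) variant below (Table D.52) is the same calculation with one more pass through the loop, requiring a fourth consecutive stable-disease observation.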
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.217 0.585 0.045
2 1000 2,3 0.040 0.454 0.312
3 1000 2,3 0.262 0.036 0.000
4 1000 2 3 0.235 0.039 0.000
5 1000 2 3 0.198 0.754 0.154
6 1000 2 6 0.220 0.426 0.024
7 1000 2 6 0.222 0.633 0.121
8 1000 0,1,1,0,0,0 0.042 0.502 0.325
8 1000 0,4,1,0,0,0 0.152 0.779 0.211
8 1000 0,8,1,0,0,0 0.217 0.741 0.155
9 1000 2,3 0.004 0.883 0.511
1 500 2 0.224 0.570 0.050
2 500 2,3 0.024 0.476 0.166
3 500 2,3 0.224 0.022 0.010
4 500 2 3 0.242 0.042 0.002
5 500 2 3 0.184 0.706 0.178
6 500 2 6 0.242 0.410 0.022
7 500 2 6 0.212 0.722 0.108
8 500 0,1,1,0,0,0 0.048 0.464 0.322
8 500 0,4,1,0,0,0 0.144 0.806 0.148
8 500 0,8,1,0,0,0 0.226 0.716 0.140
9 500 2,3 0.002 0.902 0.510
Table D.50: Outcomes for matrix (4.5) modelling response+3 consecutive stable disease
observations as a good outcome and n=36 patients
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.199 0.883 0.109
2 1000 2,3 0.011 0.781 0.335
3 1000 2,3 0.213 0.079 0.014
4 1000 2 3 0.230 0.111 0.011
5 1000 2 3 0.208 0.926 0.217
6 1000 2 6 0.222 0.723 0.068
7 1000 2 6 0.211 0.841 0.161
8 1000 0,1,1,0,0,0 0.009 0.859 0.353
8 1000 0,4,1,0,0,0 0.091 0.962 0.275
8 1000 0,8,1,0,0,0 0.159 0.944 0.226
9 1000 2,3 0.001 0.971 0.597
1 500 2 0.180 0.878 0.128
2 500 2,3 0.012 0.784 0.352
3 500 2,3 0.218 0.094 0.012
4 500 2 3 0.198 0.142 0.008
5 500 2 3 0.216 0.930 0.178
6 500 2 6 0.194 0.710 0.064
7 500 2 6 0.244 0.866 0.206
8 500 0,1,1,0,0,0 0.010 0.852 0.460
8 500 0,4,1,0,0,0 0.088 0.938 0.294
8 500 0,8,1,0,0,0 0.180 0.920 0.238
9 500 2,3 0.002 0.970 0.636
Table D.51: Outcomes for matrix (4.5) modelling response+3 consecutive stable disease
observations as a good outcome and n=54 patients
Matrix p∅−sd1 p∅−pd psd1−r psd1−sd2 psd1−pd psd2−r psd2−sd3 psd2−pd psd3−r psd3−sd4 psd3−pd psd4−r psd4−sd4 psd4−pd
Data ∅ 21/36 15/36 0 0 0 0 0 0 0 0 0 0 0 0
Data eval 2 0 0 1/21 16/21 4/21 0 0 0 0 0 0 0 0 0
Data eval 3 0 0 0 0 0 0/16 14/16 2/16 0 0 0 0 0 0
Data eval 4 0 0 0 0 0 0 0 0 1/14 9/14 4/14 0 0 0
Data eval 5 0 0 0 0 0 0 0 0 0 0 0 1/9 8/9 0/9
H0 .4 .6 .05 .7 .25 .05 .7 .25 .05 .7 .25 .05 .7 .25
Interim Data ∅ 8/15 7/15 0 0 0 0 0 0 0 0 0 0 0 0
Interim Data eval 2 0 0 0/8 7/8 1/8 0 0 0 0 0 0 0 0 0
Interim Data eval 3 0 0 0 0 0 0/7 6/7 1/7 0 0 0 0 0 0
Interim Data eval 4 0 0 0 0 0 0 0 0 1/6 3/6 2/6 0 0 0
Interim Data eval 5 0 0 0 0 0 0 0 0 0 0 0 0/3 3/3 0/3
HA .6 .4 .15 .65 .2 .15 .65 .2 .15 .65 .2 .15 .65 .2
Table D.52: Data input for matrix (4.6) modelling response+4 consecutive stable disease observations as a good outcome
Endstates pR psd4 ppd
Data 3/36 8/36 25/36
H0 .0507 .0960 .8533
Interim 1/15 3/15 11/15
HA .2112 .1071 .6817
Table D.53: Endstate probabilities for matrix (4.6) modelling response+4 consecutive
stable disease observations as a good outcome
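The endstate probabilities in Table D.53 can be reproduced by propagating an initial state vector through the transition matrix, which is the core computation of the finite Markov chain imbedding approach. The thesis program is written in R; the sketch below is an illustrative Python translation using only the H0 rates of Table D.52 (five evaluation steps; the state indexing is mine):

```python
import numpy as np

# States: 0 = entry, 1-4 = sd1..sd4, 5 = response (absorbing), 6 = pd (absorbing).
# H0 transition rates taken from Table D.52 (matrix 4.6).
P = np.zeros((7, 7))
P[0, [1, 6]] = [0.40, 0.60]           # entry -> sd1 or pd
P[1, [5, 2, 6]] = [0.05, 0.70, 0.25]  # sd1 -> r, sd2, pd
P[2, [5, 3, 6]] = [0.05, 0.70, 0.25]  # sd2 -> r, sd3, pd
P[3, [5, 4, 6]] = [0.05, 0.70, 0.25]  # sd3 -> r, sd4, pd
P[4, [5, 4, 6]] = [0.05, 0.70, 0.25]  # sd4 -> r, sd4 (self-loop), pd
P[5, 5] = P[6, 6] = 1.0               # absorbing endstates

v = np.zeros(7)
v[0] = 1.0
for _ in range(5):                    # five evaluations per patient
    v = v @ P

pR, psd4, ppd = v[5], v[4], v[6]
print(round(pR, 4), round(psd4, 4), round(ppd, 4))  # 0.0507 0.096 0.8533
```

The three printed values match the H0 row of Table D.53, confirming that the endstate probabilities are simply the absorbing-state masses after the final evaluation.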
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.113 0.655 0.051
2 1000 2,3 0.012 0.697 0.496
3 1000 2,3 0.095 0.022 0.016
4 1000 2 3 0.137 0.035 0.008
5 1000 2 3 0.105 0.764 0.133
6 1000 2 7 0.116 0.576 0.051
7 1000 2 7 0.120 0.745 0.137
8 1000 0,1,1,0,0,0,0 0.013 0.686 0.521
8 1000 0,4,1,0,0,0,0 0.065 0.760 0.209
8 1000 0,8,1,0,0,0,0 0.112 0.784 0.144
9 1000 2,3 0.000 0.894 0.688
1 500 2 0.094 0.686 0.042
2 500 2,3 0.010 0.858 0.508
3 500 2,3 0.144 0.028 0.014
4 500 2 3 0.138 0.018 0.012
5 500 2 3 0.104 0.844 0.174
6 500 2 7 0.142 0.584 0.036
7 500 2 7 0.076 0.736 0.156
8 500 0,1,1,0,0,0,0 0.010 0.674 0.500
8 500 0,4,1,0,0,0,0 0.074 0.808 0.166
8 500 0,8,1,0,0,0,0 0.106 0.822 0.144
9 500 2,3 0.002 0.896 0.732
Table D.54: Outcomes for matrix (4.6) modelling response+4 consecutive stable disease
observations as a good outcome and n=36 patients
n=54 iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.126 0.865 0.111
2 1000 2,3 0.003 0.919 0.778
3 1000 2,3 0.130 0.086 0.026
4 1000 2 3 0.141 0.094 0.026
5 1000 2 3 0.155 0.938 0.117
6 1000 2 7 0.142 0.816 0.042
7 1000 2 7 0.151 0.905 0.109
8 1000 0,1,1,0,0,0,0 0.002 0.908 0.595
8 1000 0,4,1,0,0,0,0 0.036 0.940 0.298
8 1000 0,8,1,0,0,0,0 0.065 0.934 0.126
9 1000 2,3 0.000 0.974 0.900
1 500 2 0.152 0.868 0.118
2 500 2,3 0.002 0.840 0.766
3 500 2,3 0.166 0.076 0.034
4 500 2 3 0.124 0.096 0.038
5 500 2 3 0.122 0.968 0.112
6 500 2 7 0.132 0.816 0.090
7 500 2 7 0.152 0.914 0.258
8 500 0,1,1,0,0,0,0 0.004 0.930 0.714
8 500 0,4,1,0,0,0,0 0.034 0.948 0.272
8 500 0,8,1,0,0,0,0 0.072 0.938 0.152
9 500 2,3 0.000 0.986 0.866
Table D.55: Outcomes for matrix (4.6) modelling response+4 consecutive stable disease
observations as a good outcome and n=54 patients
Matrix p∅−mr1 p∅−sd p∅−pd pmr2−r pmr2−mr2 pmr1−r pmr1−mr2 pmr1−sd pmr1−pd psd−mr1 psd−sd psd−pd
Data ∅ 11/36 12/36 13/36 0 0 0 0 0 0 0 0 0
Data eval 2 0 0 0 0 0 1/11 3/11 5/11 2/11 1/12 8/12 3/12
Data eval 3 0 0 0 0/3 3/3 0/1 0/1 1/1 0/1 5/13 6/13 2/13
Data eval 4 0 0 0 1/3 2/3 0/5 1/5 3/5 1/5 1/7 4/7 2/7
Data eval 5 0 0 0 1/3 2/3 0/1 0/1 1/1 0/1 2/7 4/7 1/7
Data eval 6 0 0 0 0/2 2/2 0/2 1/2 1/2 0/2 1/5 3/5 1/5
H0 .15 .2 .55 .05 .95 .15 .2 .55 .1 .2 .7 .1
H0 transition 5 0 0 0 0 1 0 .1 .55 .35 .1 .7 .2
Interim Data ∅ 4/15 4/15 7/15 0 0 0 0 0 0 0 0 0
Interim Data eval 2 0 0 0 0 0 0/4 2/4 1/4 1/4 1/4 3/4 0/4
Interim Data eval 3 0 0 0 0/2 2/2 0/1 0/1 1/1 0/1 1/5 2/5 2/5
Interim Data eval 4 0 0 0 1/2 1/2 0/1 0/1 0/1 1/1 1/3 1/3 1/3
Interim Data eval 5 0 0 0 0/1 1/1 0/1 0/1 1/1 0/1 0/1 1/1 0/1
Interim Data eval 6 0 0 0 0/1 1/1 0/0 0/0 0/0 0/0 0/2 1/2 1/2
HA transition ∅ 1 .3 .3 .4 0 0 .4 .25 .25 .1 .3 .5 .2
HA transition 2 0 0 0 .1 .9 .35 .3 .25 .1 .2 .7 .1
HA transition 3 0 0 0 .1 .9 .2 .2 .45 .15 .2 .7 .1
HA transition 4 0 0 0 .05 .95 .1 .2 .5 .2 .2 .7 .1
HA transition 5 0 0 0 0 1 0 .2 .5 .3 .1 .7 .2
Table D.56: Data input for matrix (4.7) modelling response+consecutive minor responses as a good outcome
Endstates pR pMR2 pMR1 psd ppd
Data 3/36 3/36 1/36 4/36 25/36
H0 .0497 .0563 .0144 .1202 .7593
Interim 1/15 1/15 0/15 1/15 12/15
HA .1905 .0977 .0120 .0989 .6009
Table D.57: Endstate probabilities for matrix (4.7) modelling response+consecutive minor responses as a good outcome
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.271 0.587 0.053
2 1000 2,3,4,5 0.215 0.154 0.001
2 1000 2,3,4 0.064 0.432 0.015
2 1000 2,3 0.186 0.594 0.059
3 1000 2,3 0.508 0.045 0.000
4 1000 2 3,4,5 0.574 0.002 0.000
4 1000 2 2,3 0.321 0.501 0.008
4 1000 2,3 2,3,4,5 0.396 0.143 0.000
5 1000 2 3,4,5 0.146 0.638 0.045
5 1000 2 2,3 0.147 0.727 0.062
6 1000 2 6 0.382 0.141 0.000
7 1000 2 6 0.270 0.290 0.002
8 1000 0,1,1,1,1,0 0.228 0.160 0.001
8 1000 0,4,2,1,1,0 0.128 0.585 0.046
8 1000 0,8,6,1,1,0 0.111 0.671 0.060
9 1000 2,3 0.082 0.812 0.152
1 500 2 0.298 0.602 0.046
2 500 2,3,4,5 0.236 0.136 0.000
2 500 2,3,4 0.072 0.432 0.014
2 500 2,3 0.178 0.594 0.054
3 500 2,3 0.532 0.016 0.004
4 500 2 3,4,5 0.574 0.002 0.000
4 500 2 2,3 0.330 0.452 0.010
4 500 2,3 2,3,4,5 0.354 0.154 0.000
5 500 2 3,4,5 0.148 0.666 0.046
5 500 2 2,3 0.184 0.742 0.058
6 500 2 6 0.412 0.136 0.002
7 500 2 6 0.284 0.278 0.002
8 500 0,1,1,1,1,0 0.206 0.180 0.000
8 500 0,4,2,1,1,0 0.174 0.648 0.034
8 500 0,8,6,1,1,0 0.124 0.630 0.054
9 500 2,3 0.092 0.802 0.154
Table D.58: Outcomes for matrix (4.7) modelling response+consecutive minor responses
as a good outcome and n=36 patients
n=54 iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.149 0.785 0.124
2 1000 2,3,4,5 0.132 0.377 0.003
2 1000 2,3,4 0.047 0.767 0.063
2 1000 2,3 0.132 0.910 0.079
3 1000 2,3 0.297 0.118 0.000
4 1000 2 3,4,5 0.431 0.009 0.000
4 1000 2 2,3 0.174 0.794 0.024
4 1000 2,3 2,3,4,5 0.239 0.465 0.002
5 1000 2 3,4,5 0.138 0.897 0.116
5 1000 2 2,3 0.147 0.875 0.113
6 1000 2 6 0.233 0.500 0.001
7 1000 2 6 0.206 0.583 0.004
8 1000 0,1,1,1,1,0 0.142 0.514 0.000
8 1000 0,4,2,1,1,0 0.097 0.870 0.068
8 1000 0,8,6,1,1,0 0.073 0.878 0.125
9 1000 2,3 0.025 0.940 0.255
1 500 2 0.152 0.794 0.112
2 500 2,3,4,5 0.150 0.372 0.000
2 500 2,3,4 0.048 0.768 0.026
2 500 2,3 0.122 0.816 0.156
3 500 2,3 0.314 0.164 0.000
4 500 2 3,4,5 0.446 0.008 0.000
4 500 2 2,3 0.164 0.742 0.040
4 500 2,3 2,3,4,5 0.190 0.458 0.002
5 500 2 3,4,5 0.136 0.914 0.122
5 500 2 2,3 0.122 0.900 0.096
6 500 2 6 0.218 0.440 0.000
7 500 2 6 0.214 0.558 0.010
8 500 0,1,1,1,1,0 0.138 0.366 0.004
8 500 0,4,2,1,1,0 0.074 0.850 0.036
8 500 0,8,6,1,1,0 0.084 0.920 0.142
9 500 2,3 0.020 0.964 0.222
Table D.59: Outcomes for matrix (4.7) modelling response+consecutive minor responses
as a good outcome and n=54 patients
Matrix p∅−sd p∅−off p∅−tox pr−r pr−Rtox psd−r psd−sd psd−off psd−tox
Data ∅ 21/36 11/36 4/36 0 0 0 0 0 0
Data eval 2 0 0 0 0 0 1/21 16/21 4/21 0/21
Data eval 3 0 0 0 1/1 0/1 0/16 14/16 2/16 0/16
Data eval 4 0 0 0 1/1 0/1 1/14 9/14 3/14 1/14
Data eval 5 0 0 0 1/2 1/2 1/9 8/9 0/9 0/9
H0 .4 .4 .2 .9 .1 .05 .7 .2 .05
Interim Data ∅ 8/15 5/15 2/15 0 0 0 0 0 0
Interim Data eval 2 0 0 0 0 0 0/8 7/8 1/8 0/8
Interim Data eval 3 0 0 0 0 0 0/7 6/7 1/7 0/7
Interim Data eval 4 0 0 0 0 0 1/6 4/6 0/6 1/6
Interim Data eval 5 0 0 0 1/1 0/1 0/4 4/4 0/4 0/4
HA .6 .35 .05 .95 .05 .15 .7 .1 .05
Table D.60: Data input for matrix (4.8) modelling response & toxicity outcomes
Endstates pR psd poff pTox pR&Tox
Data 2/36 8/36 20/36 5/36 1/36
H0 .0416 .0960 .6026 .2507 .0091
Interim 1/15 4/15 7/15 3/15 0/15
HA .2068 .1441 .5020 .1260 .0212
Table D.61: Endstate probabilities for matrix (4.8) modelling response & toxicity outcomes
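The endstate probabilities in Table D.61 feed the conditional power columns of Tables D.62-D.63. If the "good" endstates (R and R&Tox) are pooled, each remaining patient reduces to a Bernoulli trial, and conditional power given the interim data becomes a binomial tail over the patients still to accrue. The sketch below illustrates that reduction only: the success rule (at least 6 good outcomes in n=36) is hypothetical, not the thesis outv/outv2 criteria, and pooling discards the within-patient run structure that the full method retains.

```python
from math import comb

def cond_power(n_total, n_interim, good_interim, p_good, k_success):
    """P(total good outcomes >= k_success | interim data), with each
    remaining patient independently 'good' with probability p_good."""
    m = n_total - n_interim                  # patients still to accrue
    need = max(0, k_success - good_interim)  # further good outcomes required
    return sum(comb(m, j) * p_good**j * (1 - p_good)**(m - j)
               for j in range(need, m + 1))

# Interim data from Table D.61: 1 R and 0 R&Tox among the first 15 patients.
# Pooled good-outcome probabilities: HA = .2068 + .0212, H0 = .0416 + .0091.
cp_ha = cond_power(36, 15, 1, 0.2068 + 0.0212, 6)
cp_h0 = cond_power(36, 15, 1, 0.0416 + 0.0091, 6)
print(cp_ha > cp_h0)  # True: the HA rates give strictly higher conditional power
```

Because the binomial upper tail is increasing in p_good, conditional power under the alternative always dominates conditional power under the null for the same interim data, mirroring the gap between the "cond. prob. (HA)" and "cond. prob. (data)" columns above.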
Method iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.428 0.650 0.048
2 1000 2,6 0.280 0.752 0.047
2 1000 2,3,6 0.015 0.941 0.869
2 1000 2,3 0.012 0.914 0.947
3 1000 2,3 0.471 0.310 0.039
4 1000 2 2,3 0.451 0.665 0.053
5 1000 2 2,3 0.178 0.842 0.162
6 1000 2,6 5,6 0.499 0.129 0.003
7 1000 2,6 5,6 0.275 0.383 0.031
8 1000 0,1,0,0,0,1 0.280 0.738 0.038
8 1000 0,2,1,0,0,2 0.016 0.948 0.625
8 1000 0,8,3,1,0,4 0.041 0.934 0.489
8 1000 0,8,1,0,0,2 0.180 0.866 0.341
1 500 2 0.432 0.828 0.154
2 500 2,6 0.266 0.762 0.062
2 500 2,3,6 0.008 0.948 0.882
2 500 2,3 0.034 0.872 0.898
3 500 2,3 0.454 0.410 0.034
4 500 2 2,3 0.428 0.658 0.042
5 500 2 2,3 0.204 0.850 0.180
6 500 2,6 5,6 0.520 0.126 0.004
7 500 2,6 5,6 0.276 0.326 0.022
8 500 0,1,0,0,0,1 0.320 0.744 0.052
8 500 0,2,1,0,0,2 0.020 0.952 0.736
8 500 0,8,3,1,0,4 0.034 0.936 0.642
8 500 0,8,1,0,0,2 0.164 0.848 0.380
Table D.62: Outcomes for matrix (4.8) modelling response & toxicity outcomes and n=36
patients
n=54 iterations outv outv2 p-value cond. prob. (HA) cond. prob. (data)
1 1000 2 0.390 0.916 0.117
2 1000 2,6 0.114 0.929 0.042
2 1000 2,3,6 0.001 0.991 0.975
2 1000 2,3 0.004 0.992 0.990
3 1000 2,3 0.410 0.448 0.113
4 1000 2 2,3 0.387 0.925 0.113
5 1000 2 2,3 0.177 0.982 0.275
6 1000 2,6 5,6 0.435 0.415 0.024
7 1000 2,6 5,6 0.254 0.635 0.077
8 1000 0,1,0,0,0,1 0.149 0.892 0.047
8 1000 0,2,1,0,0,2 0.005 0.992 0.849
8 1000 0,8,3,1,0,4 0.015 0.994 0.690
8 1000 0,8,1,0,0,2 0.115 0.983 0.429
1 500 2 0.418 0.914 0.110
2 500 2,6 0.124 0.972 0.044
2 500 2,3,6 0.000 0.996 0.982
2 500 2,3 0.008 0.990 0.960
3 500 2,3 0.394 0.414 0.110
4 500 2 2,3 0.384 0.928 0.108
5 500 2 2,3 0.210 0.976 0.254
6 500 2,6 5,6 0.398 0.428 0.016
7 500 2,6 5,6 0.248 0.622 0.064
8 500 0,1,0,0,0,1 0.142 0.910 0.050
8 500 0,2,1,0,0,2 0.006 0.988 0.864
8 500 0,8,3,1,0,4 0.014 0.996 0.726
8 500 0,8,1,0,0,2 0.110 0.984 0.396
Table D.63: Outcomes for matrix (4.8) modelling response & toxicity outcomes and n=54
patients