An Adaptive Hierarchical Test Procedure for Selecting Safe and Efficient Treatments


Franz König*, Peter Bauer, and Werner Brannath

Section of Medical Statistics, Medical University of Vienna, Spitalgasse 23, A-1090 Vienna, Austria

Received 5 December 2005, revised 20 February 2006, accepted 15 March 2006

Summary

We consider the situation where during a multiple treatment (dose) control comparison high doses are truncated because of lack of safety and low doses are truncated because of lack of efficacy, e.g., by decisions of a data safety monitoring committee in multiple interim looks. We investigate the properties of a hierarchical test procedure for the efficacy outcome in the set of doses carried on until the end of the trial, starting with the highest selected dose group to be compared with the placebo at the full level $\alpha$. Left truncation, i.e., dropping doses in a sequence starting with the lowest dose, does not inflate the type I error rate. It is shown that right truncation does not inflate the type I error if efficacy and toxicity are positively related and dose selection is based on monotone functions of the safety data. A positive relation is given e.g. in the case where the efficacy and toxicity data are normally distributed with a positive pairwise correlation. A positive relation also applies if the probability for an adverse event is increasing with a normally distributed efficacy outcome. The properties of such truncation procedures are investigated by simulations. There is a conflict between achieving a small number of unsafely treated patients and a high power to detect safe and efficient doses. We also investigated a procedure to increase power where a reallocation of the sample size to the truncated treatments and the control remaining at the following stages is performed.

Key words: Adaptive hierarchical test; Dropping treatments; Interim look; Many-one comparison; Reallocation of sample size; Safety monitoring rules.

1 Introduction

One of the most widely applied designs in clinical trials for drug development is the comparison of several treatments (doses) with a control (placebo) (Bauer et al., 1998). When a single efficacy endpoint is considered in the classical fixed sample size setting, the issue of multiple comparisons of several treatments with a control has been addressed by Dunnett (1955) for the comparison among homoscedastic normal distributions. If an order relation holds among the parameters to be tested, several multiple comparison procedures have been developed, e.g. Williams (1971, 1972); Ruberg (1995a, b); Bauer (1997). Simultaneous testing for efficacy (superiority) and safety (non-inferiority) has been considered in Tamhane et al. (2001); Bauer et al. (2001); Tamhane and Logan (2002). In our paper we do not consider simultaneous testing of efficacy and safety, but testing efficacy after selecting doses based on safety data.

In phase III clinical trials a data and safety monitoring board generally accompanies the course of the trial (occasionally this is also the case in phase II trials). Recently, methods for combining the different challenges of the dose finding phase and the pivotal confirmatory phase in drug development into a single trial have been discussed (Bauer and Kieser, 1999; Lehmacher et al., 2000; Stallard and Todd, 2003; Liu and Pledger, 2005; Bretz et al., 2006). In such an environment randomization is the rule and the data and safety monitoring board may take decisions to eliminate doses during the trial because of lack of safety (to protect the patients) or because of lack of efficacy (to avoid ineffective treatment of patients). For example, Zeymer et al. (2001) conducted an international, prospective, randomized, double-blind, placebo-controlled phase II dose finding study applying a two-stage adaptive design. The trial started with four groups with increasing doses (50 mg, 100 mg, 150 mg and 200 mg eniporide) and a placebo group. The interim look led to the dropping of the lowest dose group due to a lack of efficacy and of the highest dose group based on safety arguments.

* Corresponding author: e-mail: [email protected], Phone: +43 1 40400 7484, Fax: +43 1 40400 7477

Biometrical Journal 48 (2006) 4, 663–678 DOI: 10.1002/bimj.200510235

© 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

Data safety monitoring boards often tend to give themselves fixed rules as guidelines for safety monitoring to avoid frequent erroneous decisions against a sufficiently safe treatment. We will investigate a method of multiple treatment-control comparisons where dose selection can be done in multiple interim looks. Treatment selection is done by truncating high doses (right truncation based on a prefixed rule due to a lack of safety) and truncating low doses (left truncation due to a lack of efficacy, which need not be pre-specified) or by applying truncation on both sides (pruning doses). The multiple testing procedure applied for the efficacy endpoints among the doses carried over to the end will be a strictly hierarchical test starting with the highest selected dose. As shown in the appendix, the multiple type I error rate, i.e., the probability of falsely rejecting at least one true null hypothesis, is not inflated with these procedures if efficacy and safety endpoints are positively related and doses are dropped when pre-specified monotone statistics of the corresponding safety data exceed prefixed thresholds. Right truncation may, however, increase the type I error rate in case of a negative relation between efficacy and safety. Here we focus on the frequent case of a positive relation. In a simulation study we consider, in particular, normally distributed safety and efficacy endpoints that are pairwise positively correlated and compare the resulting procedures to the classical fixed sample hierarchical test with $k$ doses and a placebo without truncating doses.

Note that we do not consider methods of sequentially assigning doses to cohorts of patients based on the observed efficacy and toxicity outcomes for previous patients (O'Quigley et al., 2001; Braun, 2002; Ivanova, 2003; Thall and Cook, 2004), which have been suggested mainly for earlier phases of drug development to estimate an optimal dose. Our main goal is to restrict experimentation to the dose range with sufficient efficacy and safety. Our method of eliminating doses at frequent interim looks allows us to react quickly to unsafe doses. A key feature of the proposed approach is that the selection procedure combined with the adaptive hierarchical test controls the multiple level.

In Section 2 the notation and test statistics are introduced. Section 3 investigates right truncation based on the safety outcome and a sample size reassessment where the remaining sample size of the skipped doses is shifted to the groups to be carried on. Section 4 investigates left truncation based on the efficacy outcome and combines both truncation methods. Finally, in Section 5 simulations illustrate the statistical properties of such designs. Section 6 gives some closing remarks.

2 Notation and Assumptions

Consider a trial where $k$ doses $1, 2, \dots, k$ are compared with a control dose in parallel groups. Denote the set of increasing dose levels by $0, 1, 2, \dots, k$, where 0 denotes the zero dose control (placebo group) or an active control treatment. Let $X_{ij}$ be the efficacy variable of the $j$-th patient ($j = 1, 2, \dots, n$) treated with dose $i$ ($i = 0, 1, \dots, k$), assuming that the variables are independently normally distributed, $X_{ij} \sim N(\mu_i, \sigma_i^2)$. Large values of $\mu_i$ represent high efficacy. For simplicity we assume a known common variance $\sigma_i^2 = 1$. The assumption of known variability may at least apply asymptotically if a common variance is estimated from a large number $k$ of dose groups. The normality assumption can be considered as an asymptotic approximation which is commonly used for introducing sequential decision methods.


2.1 Hierarchical test of the hypotheses on efficacy

We consider one-sided many-to-one comparisons for the efficacy parameters with the null hypothesis $H_{0i}: \mu_i \le \mu_0$ against the alternative hypothesis $H_{1i}: \mu_i > \mu_0$ for dose $i$ at the end of the study. The null hypothesis for dose $i$ will only be rejected if the test statistic satisfies

$$Z_i = (\bar{X}_{i,n} - \bar{X}_{0,n}) \sqrt{n/2} \ge z_{1-\alpha},$$

where $\bar{X}_{i,n} = \frac{1}{n} \sum_{j=1}^{n} X_{ij}$ and $z_{1-\alpha}$ denotes the $(1-\alpha)$-quantile of the standard normal distribution. In the hierarchical procedure without dose selection the $k$ null hypotheses are ordered in the sequence $k, k-1, \dots, 1$. The procedure starts with testing the null hypothesis $H_{0k}$. Any null hypothesis $H_{0i}$ is rejected if it and all previous null hypotheses in the sequence $k$ to $i$ are rejected at level $\alpha$. This procedure controls the multiple type I error rate at level $\alpha$ (Sonnemann et al., 1986; Bauer, 1991).
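The strictly hierarchical procedure described above can be sketched in a few lines of Python. This is a minimal illustration under the stated assumption of a known unit variance, not the authors' implementation; the function name `hierarchical_test` and the layout of the group means (placebo at index 0, dose $i$ at index $i$) are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def hierarchical_test(xbar, n, alpha=0.025):
    """Fixed-sequence test from the highest dose down to dose 1.

    xbar[0] is the placebo mean and xbar[i] the mean of dose i
    (hypothetical layout); n is the per-group sample size and the
    variance is assumed known and equal to 1.
    Returns the doses whose null hypotheses are rejected, in test order.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejected = []
    for i in range(len(xbar) - 1, 0, -1):  # doses k, k-1, ..., 1
        z_i = (xbar[i] - xbar[0]) * sqrt(n / 2)
        if z_i < z_crit:
            break  # stop at the first non-rejection
        rejected.append(i)
    return rejected
```

With `xbar = [0.0, 1.0, 0.1, 1.0]` and `n = 22`, only dose 3 is rejected: the sequence stops at dose 2, so dose 1 is never tested despite its large observed mean.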

2.2 Assumption on safety and efficacy data – positive relation between efficacy and toxicity

For each dose group $i$ we summarize the efficacy data by the random vector $\mathbf{X}_i = (X_{i1}, \dots, X_{in})$, where by assumption the components are independently normally distributed. Accordingly, the toxicity data for dose $i$ are denoted by the random vector $\mathbf{Y}_i = (Y_{i1}, \dots, Y_{in})$, where $Y_{ij}$ could also be multivariate, i.e., consist of several different toxicity endpoints. In order to control the multiple type I error, we need not specify the distribution of the safety data. The following positive relation between the efficacy and safety data of dose groups $i \ge 1$ is assumed. No assumption on the relation between efficacy and safety in the placebo group is required.

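Definition 2.1 The efficacy and safety data $\mathbf{X}_i$ and $\mathbf{Y}_i$ of dose group $i$ are said to be positively related if

$$\int U(\mathbf{X}_i) V(\mathbf{Y}_i) \, dF(\mathbf{X}_i, \mathbf{Y}_i) \le \int U(\mathbf{X}_i) \, dF(\mathbf{X}_i) \cdot \int V(\mathbf{Y}_i) \, dF(\mathbf{Y}_i) \qquad (1)$$

for all non-negative bounded functions $U(\mathbf{X}_i)$ and $V(\mathbf{Y}_i)$ such that $U(\mathbf{X}_i)$ is componentwise non-decreasing in $\mathbf{X}_i$ and $V(\mathbf{Y}_i)$ is componentwise non-increasing in $\mathbf{Y}_i$.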

Two examples for this type of positive relation are given below. One example is the positively correlated bivariate normal data case, the other example is for binary safety data.

Remark: When preparing the manuscript we realized that there is little information in the literature on the common distribution of efficacy and toxicity variables. (There should be plenty of data available on this issue.) One simple explanation, at least, is that efficacy and safety analyses in pharmaceutical companies are generally performed by different groups of people. In most cases one would expect that average efficacy and toxicity in the population both increase with increasing dose. Applying the same dose to a heterogeneous population of patients, e.g. with different body weights, is likely to induce some positive relation between the efficacy and toxicity variables given a certain dose. Moreover, it is sketched below how to proceed if the a-priori assumption of a positive dependency does not seem to be justified in an application.

2.2.1 Normal safety data

Here “toxicity” is measured by a normally distributed variable $Y_{ij} \sim N(\theta_i, \tau_i^2)$. High values of $\theta_i$ indicate high toxicity. Adverse effects on kidney, liver or heart are often monitored by continuous parameters such as laboratory values or variables from the ECG, respectively. The joint distribution of $(X_{ij}, Y_{ij})$ is assumed bivariate normal with correlation $\rho_i \ge 0$ for $i \ge 1$. Positive correlation between efficacy and toxicity seems to be a realistic assumption in various scenarios.

Due to the parallel group design, $X_{ij}$ and $Y_{lm}$ are independent for $i \ne l$ and/or $j \ne m$ and hence $\mathbf{X}_i$ and $\mathbf{Y}_i$ are multivariate normal with non-negative pairwise covariances. As was shown by Pitt (1982), this implies that the random vector $\mathbf{V}_i = (\mathbf{X}_i, \mathbf{Y}_i)$ is positively associated (Esary et al., 1967), which implies that

$$\int f(\mathbf{V}_i) \cdot g(\mathbf{V}_i) \, dF(\mathbf{X}_i, \mathbf{Y}_i) \ge \int f(\mathbf{V}_i) \, dF(\mathbf{X}_i, \mathbf{Y}_i) \cdot \int g(\mathbf{V}_i) \, dF(\mathbf{X}_i, \mathbf{Y}_i)$$

for all bounded functions $f(\mathbf{V}_i)$ and $g(\mathbf{V}_i)$ which are non-decreasing in all components of $\mathbf{V}_i$. Applying this to functions $U(\mathbf{X}_i)$ and $V(\mathbf{Y}_i)$ satisfying the properties in our definition of a positive relation gives (1).

2.2.2 Binary safety data

Assume that the safety endpoint $Y_{ij}$ is binary with $Y_{ij} = 1$ indicating an adverse event for patient $j$ in dose group $i$. It is shown in the appendix that, if for all groups $i \ge 1$ and all patients $j$ the conditional probability $P(Y_{ij} = 1 \mid X_{ij})$ given the observed efficacy outcome $X_{ij}$ is non-decreasing in $X_{ij}$, then $X_{ij}$ and $Y_{ij}$ are positively related.

3 Dropping Unsafe Doses – Right Truncation

3.1 Selection

Now we consider the case where a total of $m$ interim looks are performed. At each look $t$ we select a sequence $1, 2, \dots, h_t$ of apparently safe doses to be carried on to the next stage. We assume that

$$h_t = \max \{ i = 1, \dots, h_{t-1} : S_{l,t} \le \lambda_{l,t} \text{ for all } 1 \le l \le i \} \qquad (2)$$

where $h_0 = k$, $\lambda_{i,t}$ are prefixed selection margins and $S_{i,t}$ is a statistic computed from the safety data of dose $i$, $Y_{ij}$, $j = 1, \dots, n_t$. Here $n_t$ is the prefixed size of the cumulated sample until the $t$-th interim analysis. Selection of safe doses could also be applied in the final analysis, leading to the sequence of doses $1, 2, \dots, h_{m+1}$ which are finally tested for efficacy, where $h_{m+1}$ is defined as in (2). The interpretation of (2) is that a dose $i$ is considered “safe” at the $t$-th interim analysis iff $S_{i,t} \le \lambda_{i,t}$, and is dropped even in this case if some dose $l < i$ appears to be unsafe. This seems to be a reasonable strategy since committees usually refuse to go on with a higher dose if a lower dose shows safety problems. If $S_{1,t} > \lambda_{1,t}$ then (2) is the maximum over an empty set, which is put to $-\infty$ in this manuscript. Hence, $h_t = -\infty$ means that the trial is terminated in the $t$-th interim look since dose 1 appears unsafe. Unless all active doses are dropped in the interim look, the placebo dose is always carried on.
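Selection rule (2) amounts to scanning the doses upwards and stopping at the first one whose safety statistic exceeds its margin. The following Python sketch illustrates this under the assumption that the statistics and margins of one look are passed as plain lists; the function name `select_highest_safe_dose` and the use of `None` to signal termination of the trial are choices made for the example, not part of the paper.

```python
def select_highest_safe_dose(stats, margins, h_prev):
    """Rule (2): largest i <= h_prev with S_l <= lambda_l for all l = 1..i.

    stats[l - 1] and margins[l - 1] hold the safety statistic and the
    prefixed margin of dose l at the current look (hypothetical layout).
    h_prev is the highest dose carried into this look, or None if the
    trial was already stopped.
    Returns None when dose 1 exceeds its margin (trial terminated).
    """
    if h_prev is None:
        return None
    h = None
    for i in range(1, h_prev + 1):
        if stats[i - 1] > margins[i - 1]:
            break  # dose i looks unsafe: drop it and all higher doses
        h = i
    return h
```

For example, with statistics `[0.1, 0.4, 1.2, 0.3]` and a common margin of 1.0, dose 3 exceeds the margin, so doses 3 and 4 are both dropped although dose 4 itself stays below the margin.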

3.2 Assumptions on the safety selection rules

The statistics $S_{i,t}$ and selection margins $\lambda_{i,t}$ in (2) must be fixed in advance. We assume that $S_{i,t}$ is non-decreasing in the safety data $Y_{ij}$ ($j \le n_t$) of dose group $i$, since large values of $Y_{ij}$ indicate toxicity. The function $S_{i,t}$ must not depend on the efficacy data or the safety data of the dose groups $l \ne i$ ($l \ge 1$). These assumptions are satisfied e.g. if $S_{i,t}$ is the mean, the sum, the maximum or any other quantile of the safety data $Y_{ij}$, $j \le n_t$. Note that $S_{i,t}$ can also be a function of the safety data $Y_{0j}$ ($j \le n_t$) of the placebo group. This is the case, e.g., if mean differences between single doses and the placebo group are used for selection.

Note that the number $m$ of inspections for ethical reasons can be large, even in the form of “continuous” monitoring with data inspections whenever a single unit for all retained dose groups and a placebo group has been investigated.

Remark Although not done here, we could formulate the selection rule based on the toxicity as a multiple sequential non-inferiority test problem. In our formulation no formal multiple error control for the selection procedure with regard to the toxicity outcome is intended (which is the usual procedure applied by DSMBs monitoring safety in efficacy studies). Instead, in Section 5 we will look at the probability of identifying at least one efficient and safe dose.

3.3 Adaptive hierarchical test procedure

In the final analysis (after $n$ observations per dose) the hierarchical test procedure is applied to the “right truncated” set $1, \dots, h$ of hypotheses, where $h = h_{m+1}$ if doses are also selected in the final analysis and $h = h_m$ if doses are selected only at interim looks. The hypotheses of the doses $1, \dots, h$ are tested in the strict order $H_{0h}, H_{0,h-1}, \dots, H_{02}, H_{01}$, each at level $\alpha$. Because the hierarchical procedure starts with the random highest dose $h$, the familywise error rate may be affected.

3.4 Error control

Note that the multiple level of the adaptive hierarchical test procedure is $\alpha$ if the safety and efficacy data are independent. The following theorem shows the conservatism of this procedure in the case of positively related efficacy and safety data in a balanced design.

Theorem 3.1 Assume that $\mathbf{X}_i$ and $\mathbf{Y}_i$ are positively related under $H_{0i}$ for all $i \ge 1$, and that the stage-wise sample sizes are equal for all doses $i \ge 1$ at each look and the final analysis. If doses are right truncated according to (2), then the final hierarchical test starting with the highest selected dose $h$ controls the multiple type I error at level $\alpha$, i.e., rejects at least one true $H_{0i}$ with a probability of at most $\alpha$.

The simple explanation for the theorem is that dropping unsafe doses in this situation would mean dropping doses with a rather promising (effective) outcome. A concise proof of Theorem 3.1 is given in the appendix.

3.4.1 Type I error inflation with non-positively related efficacy and toxicity data

We shall illustrate by a simulation study that without the assumption of a positive relation between efficacy and toxicity the multiple level of the right truncated hierarchical test procedure (which starts with dose $h$) can exceed the nominal level $\alpha$. In this example we assume bivariate normally distributed efficacy and safety data with a negative correlation $\rho_i$. The simulations presented here are performed for the comparison of $k = 2$ doses with a placebo. The sample size $n$ per group is fixed so that a single treatment-control comparison for the highest dose provides a power of $1 - \beta = 0.9$ for a one-sided z-test with $\alpha = 0.025$ at a particular alternative $\mu_k = \mu_\Delta$, so that $n = 2 (z_{1-\alpha} + z_{1-\beta})^2 / \mu_\Delta^2$. Later we take $\mu_\Delta = 1$.
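As a quick numerical check of the sample size formula, the following sketch evaluates $n = 2 (z_{1-\alpha} + z_{1-\beta})^2 / \mu_\Delta^2$ with the Python standard library. The function name `sample_size_per_group` is hypothetical, and rounding up to the next integer is an assumption of this example; the text does not state how a non-integer $n$ is handled.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(alpha, power, mu_delta):
    """Per-group n for a one-sided z-test with unit variance in both groups."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return 2 * (z(1 - alpha) + z(power)) ** 2 / mu_delta ** 2


n_exact = sample_size_per_group(alpha=0.025, power=0.9, mu_delta=1.0)
n = ceil(n_exact)  # rounding up is an assumption of this sketch
```

For $\alpha = 0.025$, $1 - \beta = 0.9$ and $\mu_\Delta = 1$ the formula gives roughly 21.01 observations per group, i.e. 22 after rounding up.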

We conjecture without formal proof that the worst case scenario with regard to the inflation of the type I error will be to select the safe doses once in the final analysis immediately before performing the hierarchical test for efficacy (for $k = 2$ we will see below that this is indeed the worst case). Figure 1 shows the maximum achievable type I error rate as a function of the correlation (assuming equal variances in all groups with $-1 \le \rho_i \le 0$) when applying the selection procedure to $k = 2$ doses based on the difference to the placebo group, $S_{i,m+1} = \bar{Y}_{i,n} - \bar{Y}_{0,n}$. We investigate the maximum achievable multiple level under the global null hypothesis ($\mu_0 = \mu_1 = \mu_2 = 0$, which is the worst case) and a toxicity pattern with $\theta_0 = \theta_1 = \theta_2 = 0$ and variances $\tau_i^2 = 1$. The maximum is independent of the pattern of the means for toxicity, because we can shift all distributions to have zero mean and then optimize over $\lambda_{i,m+1}$. To find the maximum per $\rho_i$ we search in a grid for $-0.62 < \lambda_{2,m+1} < 0.1$ in steps of 0.01. Runs of 1 000 000 replications were generated per point in the grid.

When $\rho = -1$ the worst case scenario is to choose $\lambda_{2,m+1} = -z_{1-\alpha} \sqrt{2/n} + \theta_2$ and $\lambda_{1,m+1} = \infty$. This means that dose 2 is only selected for a test of efficacy if it is certain that the test for efficacy will reject. Dose 1 will always be selected for the test of efficacy irrespective of being safe or not. So for $k = 2$ and $\rho = -1$ the maximum achievable type I error rate is the unadjusted (naïve) type I error of a many-one comparison of 2 doses with a single placebo group. When increasing $k$ for $\rho = -1$ the worst case scenario will lead to lower type I error rates than in the unadjusted many-one cases. This is due to the fact that the selection rule requires that all doses lower than the tested one are simultaneously safe. For moderate correlation the inflation of the type I error rate is moderate. One has to bear in mind that in practice nobody would apply this strategy to inflate the type I error rate: (i) performing the safety monitoring at the very end of the trial is against the spirit of monitoring clinical trials to protect the patients, and (ii) disregarding safety for the lowest dose is against the intention of the whole procedure.

If negative correlations cannot be excluded then the multiple level could be controlled by applying methods based on adaptive combination tests and the closed test principle as in Bauer and Kieser (1999) and Hommel (2001). Since we focus on the case of a positive relation between toxicity and efficacy we refer to the cited papers for details of these methods and give only a short outline here. Basically, we can use the adaptive test statistic (3) for the $k$-treatment-control comparison, defined in the next paragraph for the case of sample size reallocation, to combine stage-wise test statistics. Here we have the complication that for doses truncated at earlier interim analyses we have no data for the standardized stage-wise statistics later on. For these doses we can simply replace the stage-wise test statistic in (3) by the test statistic of the highest dose still used at the respective stage. A “weak” monotonicity assumption on efficacy is required to get a valid procedure for the case of negative correlation: if for a dose $i$ the null hypothesis $H_{0i}$ is true, it is also true for all lower doses ($\mu_0 = \mu_1 = \dots = \mu_i$). The procedure rejects $H_{0h}$ if the above defined adaptive test statistics for doses $k, k-1, \dots, h+1$ (replacing missing data by data from the highest available dose) and the test statistic for $h$ itself reject at the level $\alpha$. If $H_{0h}$ is rejected the procedure proceeds with a simple hierarchical test procedure starting with dose $h-1$. If no monotonicity assumption is made the procedure gets more complicated: also for $H_{0,h-1}$ we have to define adaptive test statistics including the doses $k, k-1, \dots, h+1$, and so on for $H_{0,h-2}, H_{0,h-3}, \dots, H_{01}$, to test all relevant intersection hypotheses.

The question remains of what to do in a real life application if during the trial the results seem not to be in line with the a-priori assumption of a positive relation between efficacy and toxicity. One pragmatic solution would be to define a reasonable data based rule for when to switch to the adaptive combination test and the closed testing procedure. This procedure should be conservative (although a strict proof for the control of the multiple level for such a data driven switching between different tests cannot be provided).

Figure 1 Maximum achievable type I error inflation in case of negative correlation for the comparison of $k = 2$ doses to placebo. The solid line shows the simulated maximum achievable type I error rate (left vertical axis). The dotted line shows the corresponding selection boundary $\lambda_{2,m+1}$ for dose 2 (right vertical axis). For $\lambda_{2,m+1}$ an interpolation was performed. For a detailed description see Section 3.4.1.

3.5 Sample size reallocation after right truncation

It should be noted that there is a systematic loss of power caused by dropping doses. To compensate for the loss of power the sample sizes saved for the dropped doses may be reallocated to dose groups that are carried on. In the following we describe a procedure with sample size reallocation after right truncation which controls the multiple type I error rate if toxicity and efficacy are positively related. Here the adaptive combination test method of Bauer and Köhne (1994) is applied for the efficacy tests using the inverse normal combination function (Lehmacher and Wassmer, 1999)

$$\tilde{Z}_i = \sum_{t=1}^{m+1} \sqrt{\frac{n_t - n_{t-1}}{n}} \, \tilde{Z}_{i,t}, \qquad (3)$$

where $\sqrt{(n_t - n_{t-1})/n}$ is the a-priori weight derived from the preplanned sample sizes, and $\tilde{Z}_{i,t}$ is the standardized test statistic calculated from the $\tilde{n}_t - \tilde{n}_{t-1}$ observations per dose group recruited between stages $t-1$ and $t$, with $\tilde{n}_t$ the reallocated cumulated sample size of the $t$-th stage. For sample size reallocation we adopt the simple strategy of distributing the unused number of observations per stage evenly over the selected doses (including the placebo group). In the $t$-th interim look the sample size of a selected dose is reallocated to $(n_{t+1} - n_t)(k+1)/(h_t + 1)$ for the next stage $t+1$, where $n_t$ is the preplanned cumulated sample size up to the $t$-th stage. Hence the total sample size after each stage is unchanged and the timing of the interim looks remains unchanged.
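The two ingredients of this section, the inverse normal combination (3) with preplanned weights and the even reallocation of the saved sample size, can be sketched as follows. This is an illustration only: the function names are hypothetical, $n_0 = 0$ is assumed, and the stage-wise statistics are taken as given rather than computed from data.

```python
from math import sqrt


def inverse_normal_statistic(stage_z, planned_cum_n):
    """Combine stage-wise z-statistics with the a-priori weights of (3).

    stage_z[t - 1] is the standardized statistic of stage t and
    planned_cum_n[t - 1] the preplanned cumulated size n_t; the last
    entry of planned_cum_n is the total preplanned size n.
    """
    n = planned_cum_n[-1]
    prev = 0  # n_0 = 0 (assumption of this sketch)
    total = 0.0
    for z_t, n_t in zip(stage_z, planned_cum_n):
        total += sqrt((n_t - prev) / n) * z_t
        prev = n_t
    return total


def reallocated_stage_size(n_next, n_curr, k, h_t):
    """Per-group size of stage t + 1: (n_{t+1} - n_t)(k + 1) / (h_t + 1)."""
    return (n_next - n_curr) * (k + 1) / (h_t + 1)
```

Because the squared weights $(n_t - n_{t-1})/n$ sum to one, the combined statistic of independent standard normal stage-wise statistics is again standard normal, whatever the reallocated stage sizes are. If no dose is dropped ($h_t = k$), `reallocated_stage_size` returns the preplanned stage size $n_{t+1} - n_t$.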

Using the inverse normal method for efficacy now allows the type I error to be controlled after reallocating sample sizes (see the appendix). It should be mentioned that we did not actually observe a type I error inflation when using likelihood tests. However, our proof in the appendix requires that the conditional error rate given the interim data is independent of the sample sizes, which is achieved by using the inverse normal method. The question of whether the multiple level is inflated without the inverse normal method remains an open research problem.

4 Pruning Doses

4.1 Selection of a sequence of promising doses – left truncation

We now consider the case where, during the trial, doses may be eliminated from the trial because of lack of efficacy. The only restriction required for a control of the multiple level is that eliminating a dose $i$ implies eliminating all lower doses.

In the final analysis we are testing only the selected “left truncated” set of hypotheses $H_{0l}, H_{0,l+1}, \dots, H_{0k}$, where $l$ is the lowest dose finally selected. Unless all doses are dropped in the interim look, the placebo dose is always carried on until the end of the trial. If we apply the conventional hierarchical test procedure as described in Section 2.1 to the left truncated set of doses then the multiple type I error is controlled, since the procedure rejects fewer hypotheses than the non-truncated conventional hierarchical procedure for the sequence $k, k-1, \dots, 1$ without sequential dose selection.

Remarks (i) Under the restriction that dropping dose $i$ implies dropping all lower doses, any type of selection rule based on all information from in- and outside the trial collected up to the interim look can be used for the selection. (ii) In contrast to right truncation, here the number and timing of the interim looks need not be prefixed. Even the times of interim looks can be chosen in a data dependent way by the DSMB. (iii) By dropping doses the test can become strictly conservative, because usually fewer hypotheses are rejected than in the non-truncated case.


4.2 Left and right truncation

We can combine the left and right truncation procedures to select a final sequence of safe and effective doses $l, l+1, \dots, h-1, h$ finally to be tested hierarchically in reverse order.

The multiple level and the power are essentially dominated by the right truncation procedure. Truncation from the left will have a conservative influence on the type I error and the power. Hence the arguments of the last section apply: pruning doses does not violate the type I error if hierarchical testing from $H_{0h}, H_{0,h-1}, \dots, H_{0l}$ is used as the final test procedure among the selected doses, given that efficacy and toxicity are positively related.
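The final test on the pruned dose range can be written as a small variation of the hierarchical test: the sequence starts at the highest selected dose $h$ and ends at the lowest selected dose $l$. The sketch below assumes that the final efficacy statistics are already available in a dictionary keyed by dose number; the function name `pruned_hierarchical_test` and the use of `None` for a terminated trial are hypothetical choices for this example.

```python
def pruned_hierarchical_test(z, l, h, z_crit):
    """Test H_{0h}, H_{0,h-1}, ..., H_{0l} in this order, each at the same level.

    z maps dose number to its final efficacy z-statistic (hypothetical
    layout); l and h are the lowest and highest selected doses, with
    h = None if all doses were dropped.
    Returns the rejected doses in test order.
    """
    rejected = []
    if h is None or l > h:
        return rejected  # nothing left to test
    for i in range(h, l - 1, -1):
        if z[i] < z_crit:
            break  # stop at the first non-rejection
        rejected.append(i)
    return rejected
```

Doses outside the range $l, \dots, h$ are never tested, regardless of how large their observed statistics are.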

5 Simulation Study

5.1 Right truncation

The simulations are performed for the comparison of $k = 4$ doses with a placebo. We investigate the probability of rejecting at least one null hypothesis of a safe and effective dose for different pruning strategies with and without reallocation of the saved sample size. A dose $i$ is defined as safe if $\theta_i \le 1$ and as effective if $\mu_i - \mu_0 > 0$.

A total sample size $n$ per dose is preplanned with $n = 2(z_{1-\alpha} + z_{1-\beta})^2 / \mu_\Delta^2$, where $\alpha = 0.025$, $1 - \beta = 0.9$ and $\mu_\Delta = 1$. We again assume $\mu_0 = \theta_0 = 0$ without loss of generality. For correlations $\rho_i \ge 0$ the hierarchical test can be done in the sequence $H_{0h}, H_{0,h-1}, \ldots, H_{0,l+1}, H_{0l}$. The simulations are done for a positive correlation $\rho_i = 0.3$ between efficacy and toxicity.

We first investigated the situation with $m = 10$ interim looks. Without sample size reallocation equally spaced interim looks are planned, so that the $t$-th interim look is performed after $n_t = t\,n/(m+1)$ observations per dose still under investigation. In a design without truncation a total sample of $(k+1) \cdot n$ observations is investigated. Using the sample size reallocation of Section 3.5 we distribute the unused number of observations per stage evenly over the selected doses (including the placebo group).
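The design quantities of this section can be reproduced numerically. The following Python sketch evaluates the sample size formula and the look schedule; the function names are hypothetical, unit variance is implied by the formula as stated, and rounding $n$ up to the next integer is an assumption of the sketch rather than a detail given in the text.

```python
from math import ceil
from statistics import NormalDist

def preplanned_n(alpha=0.025, power=0.9, mu_delta=1.0):
    """n = 2 (z_{1-alpha} + z_{1-beta})^2 / mu_delta^2 (unit variance implied)."""
    z = NormalDist().inv_cdf
    return 2 * (z(1 - alpha) + z(power)) ** 2 / mu_delta ** 2

def look_schedule(n, m):
    """Observations per retained dose at look t: n_t = t * n / (m + 1)."""
    return [t * n / (m + 1) for t in range(1, m + 1)]

n = ceil(preplanned_n())        # rounding up is an assumption of this sketch
print(n)                        # 22
print(look_schedule(n, m=10))   # 2.0, 4.0, ..., 20.0
print((4 + 1) * n)              # 110 observations without truncation (k = 4)
```

With these values each of the ten interim looks adds two observations per retained group, which shows how little safety information is available at the early looks.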

Here we present the simulation results for a right truncation based on the single mean, where $S_{i,t} = \bar{Y}_{i,n_t} = \frac{1}{n_t} \sum_{j=1}^{n_t} Y_{ij}$ (in the following called "RSM"). The general tendency will be similar for right truncation based on the mean difference to placebo (where $S_{i,t} = \bar{Y}_{i,n_t} - \bar{Y}_{0,n_t}$), which leads to a higher variability of the selection criterion, so that the selection boundaries would have to be modified in order to get the same statistical properties. Equal toxicity margins $\lambda_{i,t} = v + 2/\sqrt{n_t}$ for all doses are used for the selection per stage.

For toxicity we assume a set of linearly increasing toxicity means with $\theta_1 = 0$, $\theta_2 = \frac{1}{3}\theta_4$, $\theta_3 = \frac{2}{3}\theta_4$ and $0 \le \theta_4 \le 2$. Note that doses 1 and 2 are safe for all $0 \le \theta_4 \le 2$, dose 4 becomes unsafe for $\theta_4 > 1$, and dose 3 for $\theta_4 > 1.5$. The ratio of the average sample size (ASN) per dose group divided by the preplanned sample size $n$, without (left panels) and with (right panels) reallocation of the sample size, is shown in Figure 2. The x-axis gives the mean $\theta_4$ of the toxicity variable in the highest dose group. Figures 2a and 2b show the results if the parameter for the toxicity boundaries is set to $v = 0.5$. If $\theta_4 = 1$, on average only 50 percent of the preplanned patients are treated with the highest dose. In case of a sample size reallocation (Figure 2b), for increasing $\theta_4$ more patients than preplanned are treated with doses 1 and 2, which are still safe. Figures 2c and 2d give the corresponding results when the toxicity boundaries become narrower ($v = 0$). Now only about 20 percent of the patients are treated with the highest toxic dose for $\theta_4 = 1$ as compared to the fixed sample size design.
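The ASN ratio without reallocation can be approximated by a small Monte Carlo experiment. The Python sketch below is not the simulation code used for Figure 2; it assumes unit-variance normal toxicity data, $n = 22$, and the rule that a dose is dropped together with all higher doses as soon as its running mean $S_{i,t}$ exceeds the margin $\lambda_{i,t} = v + 2/\sqrt{n_t}$. The function name `asn_ratio` and all parameter names are hypothetical.

```python
import random
from math import sqrt

def asn_ratio(theta4, v=0.5, n=22, m=10, reps=2000, seed=1):
    """Monte Carlo estimate of ASN/n for doses 1..4 under RSM right truncation."""
    rng = random.Random(seed)
    thetas = [0.0, theta4 / 3, 2 * theta4 / 3, theta4]   # toxicity means, doses 1..4
    looks = [t * n // (m + 1) for t in range(1, m + 1)]  # n_t (exact for n=22, m=10)
    totals = [0.0] * 4
    for _ in range(reps):
        reached = [n] * 4      # sample size reached by each dose (n if never dropped)
        alive_upto = 4         # doses 1..alive_upto are still under investigation
        sums = [0.0] * 4
        prev = 0
        for n_t in looks:
            for i in range(alive_upto):
                sums[i] += sum(rng.gauss(thetas[i], 1.0) for _ in range(n_t - prev))
            prev = n_t
            margin = v + 2 / sqrt(n_t)
            for i in range(alive_upto):
                if sums[i] / n_t > margin:       # running mean exceeds the margin
                    for j in range(i, alive_upto):
                        reached[j] = n_t         # drop this dose and all higher ones
                    alive_upto = i
                    break
            if alive_upto == 0:
                break
        for i in range(4):
            totals[i] += reached[i]
    return [tot / (reps * n) for tot in totals]

print(asn_ratio(theta4=1.0))   # ratios for doses 1..4 at theta_4 = 1, v = 0.5
```

Because the sketch fixes details that are not spelled out in this section, its output should be read as indicating the qualitative pattern only: the ratio for the highest dose falls as $\theta_4$ grows, and lower doses never have a smaller ratio than higher ones.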

Figure 2 Ratio ASN/n as a function of the toxicity $\theta_4$ of the highest dose without (left column: a, c, e) and with reallocation (right column: b, d, f) of sample size. For toxicity we assume linearly increasing toxicity means with $\theta_1 = 0$, $\theta_2 = \frac{1}{3}\theta_4$, $\theta_3 = \frac{2}{3}\theta_4$ and $0 \le \theta_4 \le 2$. The toxicity margins $\lambda_{i,j} = v + 2/\sqrt{n_j}$ were set to $v = 0.5$ (a, b), $v = 0$ (c, d) and $v = 0.35$ (e, f). The number of interim looks is $m = 10$ for a, b, c, d and $m = 3$ for e, f. For a detailed description see Section 5.1.

Figure 3 The power (defined as the probability to reject the null hypothesis $H_{0i}$ for at least one efficient ($\mu_i > 0$) and safe ($\theta_i \le 1$) dose in the final analysis) depending on the mean $\theta_4$ (assuming linearly increasing toxicity means with $\theta_1 = 0$, $\theta_2 = \frac{1}{3}\theta_4$, $\theta_3 = \frac{2}{3}\theta_4$ and $0 \le \theta_4 \le 2$). The toxicity margins $\lambda_{i,j} = v + 2/\sqrt{n_j}$ were set to $v = 0.5$ (A, B), $v = 0$ (C, D) and $v = 0.35$ (E, F). The number of interim looks is $m = 10$ for A, B, C, D and $m = 3$ for E, F. The left panels (A, C, E) show the results of a linearly increasing efficacy pattern with $\mu_0 = 0$, $\mu_1 = 0.25$, $\mu_2 = 0.5$, $\mu_3 = 0.75$ and $\mu_4 = 1$, the right panels (B, D, F) show the results of a ramp type efficacy profile ($\mu_0 = 0$, $\mu_1 = 0.333$, $\mu_2 = 0.667$ and $\mu_3 = \mu_4 = 1$). We assume a positive correlation ($\rho = 0.3$) between efficacy and toxicity. Dashed lines denote the result for a hierarchical test after right truncation. Dotted lines denote the result if sample sizes saved for the truncated dose groups are reallocated to the remaining doses and placebo. The solid lines denote the results of the hierarchical test for a single stage design without pruning. For a detailed description see Section 5.1.

Making the truncation boundary narrower, we expect that the smaller safe doses will also be truncated too frequently. To investigate the impact of the choice of the boundary on power we assume two different patterns for the mean of the efficacy variable: a linearly increasing efficacy pattern ($\mu_1 = 0.25$, $\mu_2 = 0.5$, $\mu_3 = 0.75$ and $\mu_4 = 1$, Figure 3 left panels) and a ramp type efficacy profile ($\mu_1 = 0.333$, $\mu_2 = 0.667$ and $\mu_3 = \mu_4 = 1$, Figure 3 right panels). The power values to detect at least one efficient and sufficiently safe dose are shown again depending on the mean $\theta_4$ of the toxicity variable of the highest dose. The power function (solid line) of the single stage design (no truncation) is a step function, because for $\theta_4 \in (0, 1]$ the doses 1, 2, 3 and 4 are safe and efficient, for $\theta_4 \in (1, 1.5]$ doses 1, 2 and 3, and for $\theta_4 \in (1.5, 2]$ only doses 1 and 2 are safe and efficient. For low toxicity $\theta_4 \le 1$ the RSM strategy results in a loss of power for rejecting at least one efficient and safe dose for both efficacy patterns (A and B). Clearly, in this region all doses are safe enough, so that there must be a loss of power due to frequently dropping the highest (and most efficient) doses. The reallocation strategy compensates for this loss in power with increasing $\theta_4$. If $\theta_4 > 1.5$ we achieve for both efficacy patterns a higher power than with the design with no truncation. This can be explained as follows: by dropping the effective but unsafe (higher) doses, the chance for the effective and safe (lower) doses to be tested first in the sequence of selected hypotheses is increased. Moreover, if higher doses are truncated, the remaining groups receive larger sample sizes due to sample size reallocation. If all doses have equal efficacy $\mu_1 = \mu_2 = \mu_3 = \mu_4 = 1$ (not shown in the figure), the power of the truncation procedure with sample size reallocation comes close to 1 with increasing $\theta_4$. In this case the high probability of truncation is compensated by reallocating sample sizes to similarly effective doses.
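The step structure of the single stage power curve follows directly from which doses count as safe and efficient for a given $\theta_4$. A minimal Python sketch under the linear toxicity pattern of this section (the function name `safe_and_efficient` is hypothetical):

```python
def safe_and_efficient(theta4, mu=(0.25, 0.5, 0.75, 1.0)):
    """Doses that are safe (theta_i <= 1) and efficient (mu_i > 0)."""
    thetas = (0.0, theta4 / 3, 2 * theta4 / 3, theta4)   # linear toxicity pattern
    return [i + 1 for i in range(4) if thetas[i] <= 1 and mu[i] > 0]

for theta4 in (1.0, 1.2, 1.8):
    print(theta4, safe_and_efficient(theta4))
# 1.0 [1, 2, 3, 4]
# 1.2 [1, 2, 3]
# 1.8 [1, 2]
```

The set shrinks at $\theta_4 = 1$ and at $\theta_4 = 1.5$, which are exactly the points where the solid line in Figure 3 is described as jumping.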

Using the smaller truncation limits with $v = 0$ we observe a large decrease in power due to right truncation (Figures 3C and 3D). The price to be paid for substantially reducing the expected number of patients to be treated with a toxic dose as compared to Figures 3A and 3B is a large loss of power. Here even the reallocation strategy cannot compensate for this loss.

In the following we investigate the impact of decreasing the number of interim looks: instead of $m = 10$ we considered $m = 3$. In order to have a realistic comparison between the two designs we fixed the toxicity margin for right truncation to get approximately the same ASN ratio for the highest dose if $\theta_4 = 1$. This results in the parameter $v = 0.35$. Clearly, if the number of interim looks is smaller, the critical boundaries to achieve sufficiently frequent truncation of a toxic dose have to be narrower. Figures 2e and 2f show flatter ASN ratios, leading to a larger proportion of patients treated with a toxic dose for higher values of toxicity. Also the power for both efficacy patterns decreases with decreasing number of interim looks (Figures 3E and 3F), which is not unexpected from general properties of sequential procedures.

5.2 Pruning doses

We also investigated the situation where right truncation due to toxicity and left truncation due to a lack of efficacy were both applied. Without reallocation of the sample size, additional left truncation of ineffective doses has little impact on power unless the efficacy margin is chosen to be too large. If sample size reallocation is applied, additional left truncation may lead to a further increase of power if the inefficient doses are skipped and sample size is reallocated to the efficient doses. No inflation of the multiple type I error was observed in the simulations when reallocating the sample size. However, an exact proof of the multiple type I error control after a reallocation of the sample size in the situation of pruning doses remains an open research topic.

6 Discussion

We have shown how multiple many-one comparisons of $k$ doses with a placebo can be applied in a design with adaptive dose selection in multiple interim looks. Only applying right truncation of unsafe doses is related to a decision scenario in a DSMB in a clinical trial when performing periodic monitoring of safety data. A hierarchical testing procedure has been investigated where there is a pre-assigned order of testing, starting with the highest dose and proceeding to the next lower dose only if the higher dose has been proven to be effective (as compared to the placebo). No adjustment for multiplicity is required for the tests of the individual null hypotheses. Such a procedure avoids situations where lower doses are proven to be effective whereas the proof fails for higher doses (and there is no good argument why this should not be the case). Moreover, such a hierarchical procedure aims at a more specific controlled inference on the form of the dose response relationship than a general many-to-one comparison. Showing efficacy of a certain dose, conditional on the established efficacy of all higher doses, will in general make interpretation of the results of such a trial more plausible and persuasive.


Our main emphasis has been on the situation that high doses are truncated because of safety concerns in multiple interim looks. It has been shown that left truncation of low doses due to a lack of efficacy has no anti-conservative impact on the multiple level and may be applied in various ways. Pruning procedures have the appealing feature of restricting experimentation to a dose range that provides sufficient safety and efficacy. The key message is that in the final analysis the simple hierarchical many-one comparison procedure can be applied to the sequence of selected doses without inflating the multiple level under the following condition: the efficacy outcome variable used in the multiple comparison procedure has a non-negative association with the toxicity variable used for right truncation. It is striking that for non-negative correlations the multiple level $\alpha$ is controlled even if right truncation is based on more than one safety outcome measure and different selection boundaries are prefixed for the different doses, which may vary over the looks. The method works even if for ethical reasons a "continuous" safety monitoring is applied, which means looking at the data after each patient has been observed under each dose still under investigation and a placebo. It is worth mentioning that the procedure controls the multiple level even if the assumed order relation among the efficacy parameters of the increasing doses is in fact not true, e.g., if an inversion of the dose-response relationship applies. A nice feature of the procedure is that the type I error control does not rely on the normality assumption for the safety endpoint. We have shown that it can be applied to a binary safety endpoint if the probability of an adverse event increases with increasing efficacy. It has to be stressed, however, that right truncation can only be performed on the basis of toxicity and not on the basis of the efficacy variable. It is also worth mentioning that type I error control is still guaranteed if the safety board adopts monotone rules of right truncation not pre-specified in the protocol (see the remarks following the proof in the appendix). E.g., it is possible to include additional interim looks with right truncation after observing increased toxicity not yet severe enough to justify dropping the dose.

We have suggested switching to an adaptive combination test using the closed testing principle (Bauer and Kieser, 1999) if during the trial the a-priori assumption of a positive relation between toxicity and efficacy in the active dose groups has to be questioned seriously. (The validity of our hierarchical truncation procedure does not depend on the relation between efficacy and toxicity in the placebo group.) Without having an explicit proof we are convinced that an integrated procedure with properly switching to the adaptive closed test procedure will preserve the multiple level. The evaluation of the joint procedure will be a topic for further research.

Truncation may lead to a decrease of power when effective doses are dropped. Hence it is tempting to reallocate the saved sample sizes from the dropped dose groups to those carried over to the next stage. The principle of adaptive combination tests can be applied to deal with such data dependent adaptation of sample sizes. It has been shown that by distributing the saved sample size evenly over the maintained doses the multiple level is controlled and the power may be increased even compared to the design without dose selection. However, generally the latter is not the design one should compare with, because dropping of unsafe doses is usually driven by ethical and not by efficiency arguments.

An important message can be seen from our results. If narrow safety margins are chosen in multiple interim looks we can keep the number of patients treated with unsafe doses small. However, in this case we will also tend to truncate sufficiently safe doses, which will reduce the power to detect those doses which are sufficiently safe and effective. Obviously, using the same truncation margins but performing fewer interim looks will help to increase the power. But then, if unsafe doses are involved, the number of patients treated with an unsafe dose will increase. Performing frequent interim looks may lead to a reduction of the expected number of unsafely treated patients if doses with unsuspectedly high toxicity are involved. Ethical considerations may ask for more frequent looks in order to be able to react in a timely way to unexpected severe safety problems. Our simulation results indicate that designs with more frequent interim looks and moderate truncation margins could be more efficient in terms of power than designs with few interim looks and stricter truncation margins. Setting up optimal monitoring rules remains a difficult task and will depend on various aspects such as a tradeoff between the severity of the disease, the seriousness of the safety issues and the therapeutic effect. Therefore it is not surprising if DSMBs refuse to be bound to a preassigned rule. When setting up safety monitoring rules it seems advisable to perform extensive simulation work in advance to get an impression of their impact on the statistical properties of the applied test procedures. It seems to be difficult to give general recommendations accounting for all the various demands of a specific clinical trial.

Acknowledgements We thank Martin Posch and the referees for their useful comments. This research was supported by the Austrian FWF-Fund no. P15853.

Appendix – Type I error rate for positively related efficacy and toxicity interim data

The proof of the conservatism of the naive hierarchical test after right truncation of doses as defined in Section 3 relies on the normality assumption for the efficacy data and on the assumption of a positive relation between efficacy and toxicity as defined in (1).

We assume, without loss of generality, that the maximum number of patients in each dose group is bounded by some finite number $N$ (also in the case of sample size reallocations) and formally represent the efficacy and safety data by the random matrices $X = (X_0, X_1, X_2, \ldots, X_k)$ and $Y = (Y_0, Y_1, Y_2, \ldots, Y_k)$, respectively, where the vectors $X_i = (X_{i1}, \ldots, X_{iN})$ and $Y_i = (Y_{i1}, \ldots, Y_{iN})$ represent the efficacy and the toxicity data of the $i$-th dose group ($i = 0, 1, \ldots, k$). Note that only sub-vectors of $X_i$ and $Y_i$ are observed if the dose group $i$ is dropped before the last look.

In the proof of Theorem 3.1 we will apply definition (1) to functions $U(X)$ and $V(Y)$ which depend on all data (also of the groups $l \ne i$) and have the required monotonicity properties in the data $X_i$ and $Y_i$ of group $i$. Since the dose groups are independent we can fix the data of all groups $l \ne i$ and consider $U(X)$ and $V(Y)$ as functions of $X_i$ and $Y_i$. When integrating over the bivariate distribution of $(X_i, Y_i)$ we get, under the positive relation of $X_i$ and $Y_i$, that

$$\int U(X)\, V(Y)\, dF(X_i, Y_i) \le \int U(X)\, dF(X_i) \cdot \int V(Y)\, dF(Y_i).$$

Such inequalities will also play a role in the discussion of binary safety data below.
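The inequality can be checked numerically in a simple special case. The Python sketch below assumes a single standard bivariate normal pair with correlation $\rho = 0.3$, the non-decreasing function $U(x) = 1\{x \ge 0\}$ and the non-increasing function $V(y) = 1\{y \le 0\}$; these choices and the function name `joint_vs_product` are illustrative assumptions, not part of the proof. For this case the left side is known in closed form as $\frac{1}{4} - \arcsin(\rho)/(2\pi)$ and the right side equals $\frac{1}{4}$.

```python
import random
from math import asin, pi, sqrt

def joint_vs_product(rho=0.3, reps=20000, seed=7):
    """Compare E[U(X) V(Y)] with E[U(X)] * E[V(Y)] for correlated normals."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = rng.gauss(0, 1)
        y = rho * x + sqrt(1 - rho ** 2) * rng.gauss(0, 1)  # corr(x, y) = rho
        hits += (x >= 0) and (y <= 0)     # U(x) = 1{x >= 0}, V(y) = 1{y <= 0}
    exact = 0.25 - asin(rho) / (2 * pi)   # orthant probability P(X >= 0, Y <= 0)
    product = 0.5 * 0.5                   # E[U] * E[V]
    return hits / reps, exact, product

mc, exact, product = joint_vs_product()
print(mc, exact, product)
```

Both the simulated and the exact joint expectation fall below the product of the marginal expectations, as the positive relation requires.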

We first discuss the positive relation property with binary safety data, then verify Theorem 3.1 assuming prefixed sample sizes and extend the proof for the sample size reallocation rule of Section 3.5. A final remark on further extensions of Theorem 3.1 will close the Appendix.

Positive relation with binary safety data

Assume that the safety endpoint $Y_{ij}$ is binary with $Y_{ij} = 1$ indicating an adverse event for patient $j$ in dose group $i$. We will show that if, for all groups $i \ge 1$ and all patients $j$, the conditional probability $P(Y_{ij} = 1 \mid X_{ij})$ given the observed efficacy outcome $X_{ij}$ is non-decreasing in $X_{ij}$, then $X_{ij}$ and $Y_{ij}$ are positively related.

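To see this let $U(X_{ij})$ be bounded and non-decreasing in $X_{ij}$ and $V(y)$ such that $V(1) \le V(0)$. Then $E\{V(Y_{ij}) \mid X_{ij}\} = \{V(1) - V(0)\}\, P(Y_{ij} = 1 \mid X_{ij}) + V(0)$ is a non-increasing function of $X_{ij}$, because $V(1) - V(0) \le 0$ and $P(Y_{ij} = 1 \mid X_{ij})$ is non-decreasing. Since $X_{ij}$ is positively associated in the sense of Esary et al. (1967),

$$\int U(X_{ij})\, V(Y_{ij})\, dF(X_{ij}, Y_{ij}) = \int U(X_{ij}) \cdot E\{V(Y_{ij}) \mid X_{ij}\}\, dF(X_{ij}) \le E\,U(X_{ij}) \cdot E\,V(Y_{ij}).$$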

To see (1) let $U(X_{i1}, \ldots, X_{iN})$ and $V(Y_{i1}, \ldots, Y_{iN})$ be functions of the efficacy and safety data of all patients in group $i$ which are componentwise non-decreasing and non-increasing in $X_i$ and $Y_i$, respectively. Applying the inequality in the last equation inductively over all $N$ independent patients in group $i$ we end up with (1).
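For a single patient the binary case can be illustrated by numerical integration. The Python sketch below assumes a standard normal efficacy outcome and a logistic adverse event probability $P(Y = 1 \mid X = x) = 1/(1 + e^{-(a + bx)})$ with $b \ge 0$; the logistic form, the parameter values and the function name `binary_association_gap` are assumptions made for illustration only. It uses $U(x) = 1\{x \ge c\}$ and $V(y) = 1 - y$, so that $V(1) \le V(0)$.

```python
from math import exp
from statistics import NormalDist

def binary_association_gap(slope=1.0, intercept=-1.0, cutoff=0.5, grid=4001):
    """Return E[U V] - E[U] E[V] for U(x) = 1{x >= cutoff}, V(y) = 1 - y."""
    phi = NormalDist().pdf
    lo, hi = -8.0, 8.0
    step = (hi - lo) / (grid - 1)
    e_u = e_v = e_uv = 0.0
    for k in range(grid):
        x = lo + k * step
        w = phi(x) * step                                # Riemann weight, X ~ N(0, 1)
        p_ae = 1 / (1 + exp(-(intercept + slope * x)))   # P(Y = 1 | X = x)
        u = 1.0 if x >= cutoff else 0.0
        e_u += u * w
        e_v += (1 - p_ae) * w            # E{V(Y) | X = x} = 1 - P(Y = 1 | X = x)
        e_uv += u * (1 - p_ae) * w
    return e_uv - e_u * e_v

print(binary_association_gap(slope=1.0))   # negative: E[U V] < E[U] E[V]
print(binary_association_gap(slope=0.0))   # essentially zero: no dependence on X
```

A positive slope gives a negative gap, in line with the inequality above, while a zero slope removes the dependence and the gap vanishes up to numerical error.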

Type I error control after right truncation – Proof of Theorem 3.1

Proof. Denote by $r_1 \le \ldots \le r_s$ the doses for which the null hypothesis is true. The multiple level is the probability to reject $H_{\mathrm{true}} = \bigcap_{j=1}^{s} H_{r_j}$. If $r_i$ is the highest dose finally selected at the last interim analysis, then the naive hierarchical test rejects $H_{\mathrm{true}}$ if $Z_{r_i} \ge z_{1-\alpha}$, where $Z_{r_i}$ is the z-score for comparing dose $r_i$ with the placebo at the final analysis. Let $R_i = \{r_i \le h < r_{i+1}\}$ be the event where dose $r_i$ is the highest selected ineffective dose, and $A_i$ the event where at least one $H_{0l}$ for $r_i < l < r_{i+1}$ is selected and accepted. The rejection region of the naive hierarchical test can be partitioned into the disjoint sets $R_i \cap \{Z_{r_i} \ge z_{1-\alpha}\} \cap A_i^c$, where $A_i^c$ is the complement of $A_i$. Hence, the multiple level is bounded by

$$P(\text{reject } H_{\mathrm{true}}) \le \sum_{i=1}^{s} P[\{Z_{r_i} \ge z_{1-\alpha}\} \cap R_i] = \sum_{i=1}^{s} E(w_{r_i} 1_{R_i}) \tag{4}$$

where $w_{r_i} = 1\{Z_{r_i} \ge z_{1-\alpha}\}$. We will show that the left side of (4) is at most $\alpha$ by deriving suitable upper bounds for $E(w_{r_i} 1_{R_i})$. These upper bounds will be obtained from monotonicity properties of the test decision function $w_{r_i} = 1\{Z_{r_i} \ge z_{1-\alpha}\}$ and the selection function $1_{R_i}$, which we discuss next.

We first show that $1_{R_i}$ is componentwise non-increasing in $Y_{r_i}$. Note that

$$1_{R_i} = h_i \cdot (1 - g_i) \cdot \prod_{t=1}^{m} 1\{S_{r_i,t} \le \lambda_{r_i,t}\}$$

where $h_i = \prod_{l < r_i} \prod_{t=1}^{m} 1\{S_{l,t} \le \lambda_{l,t}\}$ and $g_i = \prod_{l = r_i + 1}^{r_{i+1}} \prod_{t=1}^{m} 1\{S_{l,t} \le \lambda_{l,t}\}$. If we consider a fixed dose $r_i$ and fix the data $X_l$ and $Y_l$ for $l = 0$ and $l \ne r_i$, then the indicator functions $h_i$ and $g_i$ are fixed constants (which are either 0 or 1) and $1_{R_i}$ is componentwise non-increasing in $Y_{r_i}$ because $S_{r_i,t}$ is non-decreasing in $Y_{r_i}$ by assumption. By the definition of the test statistic $Z_{r_i}$, the decision function $w_{r_i} = w_{r_i}(X_0, X_{r_i})$ is completely determined by $X_0$ and $X_{r_i}$ and is non-decreasing in each component of $X_{r_i}$, the efficacy data of group $r_i$.

In the following step we integrate $1_{R_i} w_{r_i}$ over $X_{r_i}$ and $Y_{r_i}$, conditioning on all other data (which is omitted for notational simplicity), and utilize the positive relation between $X_{r_i}$ and $Y_{r_i}$ and the monotonicity of $w_{r_i}(X_0, X_{r_i})$ in $X_{r_i}$ and of $1_{R_i}$ in $Y_{r_i}$ (recall the extension of property (1) noted at the beginning of the appendix):

$$\iint w_{r_i}(X_0, X_{r_i}) \cdot 1_{R_i}\, dF(X_{r_i}, Y_{r_i}) \le \int w_{r_i}(X_0, X_{r_i})\, dF(X_{r_i}) \cdot \int 1_{R_i}\, dF(Y_{r_i}) = E(w_{r_i} \mid X_0) \cdot \int 1_{R_i}\, dF(Y_{r_i}) \tag{5}$$

where $E(w_{r_i} \mid X_0) = \int w_{r_i}(X_0, X_{r_i})\, dF(X_{r_i})$. As a next step we integrate both sides of (5) over $Y_l$ and $X_l$, $l = 1, \ldots, k$, $l \ne r_i$, to compute the expectation conditional on the placebo data $X_0$ and $Y_0$:

$$E(w_{r_i} 1_{R_i} \mid X_0, Y_0) \le E(w_{r_i} \mid X_0) \cdot P(R_i \mid Y_0) \le E_0(w_{r_i} \mid X_0) \cdot P(R_i \mid Y_0) = E_0(w_{r_1} \mid X_0) \cdot P(R_i \mid Y_0) \tag{6}$$

where $P(R_i \mid Y_0) = \int \ldots \int 1_{R_i}\, dF(Y_1) \ldots dF(Y_k)$, since the truncation function $1_{R_i}$ only depends on the safety data. Here $E_0$ is the expectation under the point null hypothesis $\bigcap_{i \in (r_1, \ldots, r_s)} \{\mu_{r_i} = \mu_0\}$, and the second inequality follows from the property that $E(w_{r_i} \mid X_0)$ is increasing in $\mu_{r_i} - \mu_0$. The last equality follows from the fact that under balanced dose groups $E_0(w_{r_i} \mid X_0) = E_0(w_{r_1} \mid X_0)$ for all $i \in \{r_1, r_2, \ldots, r_s\}$. Finally, summing (6) over all $r_i$ and integrating the sum over $X_0$ and $Y_0$ we get

$$P(\text{reject } H_{\mathrm{true}}) \le \iint E_0(w_{r_1} \mid X_0) \cdot \Big\{ \sum_{i \in \{r_1, \ldots, r_s\}} P(R_i \mid Y_0) \Big\}\, dF(X_0, Y_0) \le E_0(w_{r_1}) = \alpha$$

since $\sum_{i \in \{r_1, \ldots, r_s\}} P(R_i \mid Y_0) \le 1$.

Type I error control with sample size reallocation

We show that when using the inverse normal method for the efficacy tests and the sample size reallocation rule of Section 3.5 the type I error of the right truncated hierarchical test is at most $\alpha$. With sample size reallocation it is more difficult to show that the test decision function $w_{r_i}$ is non-decreasing in $X_{r_i}$ and that $\prod_{t=1}^{m} 1\{S_{r_i,t} \le \lambda_{r_i,t}\}$ is non-increasing in $Y_{r_i}$. The rest of the proof is just as for the case of fixed sample sizes considered before.


The monotonicity of $w_i$ is achieved via the inverse normal method and by writing $w_i = 1\{\tilde{Z}_i \ge z_{1-\alpha}\}$ as a non-decreasing function of the stage wise standardized means of the placebo and the respective dose group. Note that the distribution of the stage wise standardized means is independent of the (reallocated) sample size. Recall that $\tilde{Z}_i$ was defined in (3) as an increasing function of the standardized mean differences and not as a function of individual stage wise means. However, fixing the sample size ratio between the groups, the standardized mean differences are fixed increasing functions of the individual stage wise means which do not involve the sample sizes. With the sample size reallocation rule of Section 3.5 sample size ratios are fixed at 1. In the proof it is now required to replace each $X_i$, $i = 0, \ldots, k$, by the individual stage wise standardized means in group $i$.

With fixed sample sizes (no sample size reallocation) the monotonicity of $\prod_{t=1}^{m} 1\{S_{r_i,t} \le \lambda_{r_i,t}\}$ was obvious since in this case each $S_{r_i,t}$ is non-decreasing in the safety data (by assumption). For the sample size reallocation rule considered in Section 3.5, we can show by induction in $s = 1, \ldots, m$ that $\prod_{t=1}^{s} 1\{S_{r_i,t} \le \lambda_{r_i,t}\}$ is non-increasing in $Y_{r_i}$ for fixed $X_l$ and $Y_l$, $l \ne r_i$. Since the sample sizes at the first interim analysis are fixed we get monotonicity of $1\{S_{r_i,1} \le \lambda_{r_i,1}\}$. Assume that we have verified the monotonicity of $\prod_{t=1}^{s-1} 1\{S_{r_i,t} \le \lambda_{r_i,t}\}$. Since $\prod_{t=1}^{s} 1\{S_{r_i,t} \le \lambda_{r_i,t}\} = \prod_{t=1}^{s-1} 1\{S_{r_i,t} \le \lambda_{r_i,t}\}$ whenever $\prod_{t=1}^{s-1} 1\{S_{r_i,t} \le \lambda_{r_i,t}\} = 0$, it remains to investigate $1\{S_{r_i,s} \le \lambda_{r_i,s}\}$ at sample points where $\prod_{t=1}^{s-1} 1\{S_{r_i,t} \le \lambda_{r_i,t}\} = 1$. Now observe that $\prod_{t=1}^{s-1} 1\{S_{l,t} \le \lambda_{l,t}\} = 1$ implies $1\{S_{l,t} \le \lambda_{l,t}\} = 1$ for all $t = 1, \ldots, s-1$, and given the safety data $Y_l$ of all groups $l \ne r_i$, this fixes the sample sizes for the $s$-th interim analysis. So, $1\{S_{r_i,s} \le \lambda_{r_i,s}\}$ is non-increasing in $Y_{r_i}$ exactly when the term $1\{S_{r_i,s} \le \lambda_{r_i,s}\}$ determines $\prod_{t=1}^{s} 1\{S_{r_i,t} \le \lambda_{r_i,t}\}$.

Remarks

It can be seen from the proof of Theorem 3.1 that the hierarchical test starting with the highest selected dose $h$ controls the multiple level if for all $i = 1, \ldots, k$ and $t = 1, \ldots, m$ the indicator function $1\{S_{i,t} \le \lambda_{i,t}\}$ is non-increasing in the safety data $Y_i$ (and the safety and efficacy endpoints are positively related). By the monotonicity of $S_{i,t}$, this is the case as long as the safety margins $\lambda_{i,t}$ are constants or are decreasing functions of the safety data. As a consequence, the safety board will not inflate the multiple level when sharpening (i.e., decreasing) the safety margins after observing increased toxicity values (not yet severe enough to justify dropping the dose) at this or previous interim looks. It would also be possible to include additional interim looks with right truncation after observing increased toxicity, since formally this means decreasing the safety margin from plus infinity (no right truncation) to some finite margin at this additional interim look. It should, however, be noted that without pre-specifying rules for such "adaptations" of the safety margins, monotonicity is hardly guaranteed.

Note further that when pruning doses (left truncation due to a lack of efficacy and right truncation due to a lack of safety) no problem with the type I error control arises if there is no sample size reallocation. With sample size reallocation as in Section 3.5 the crucial point that $1_{R_i}$ is a non-increasing function of $Y_{r_i}$ given the efficacy data cannot be verified. To show this monotonicity is still an open research topic.

Another open question is the possible impact of sample size imbalance between doses on the type I error rate.

References

Bauer, P. (1991). Multiple testing in clinical trials. Statistics in Medicine 10, 871–889.

Bauer, P. (1997). A note on multiple testing procedures in dose finding. Biometrics 53, 1125–1128.

Bauer, P., Brannath, W. and Posch, M. (2001). Multiple testing for identifying effective and safe treatments. Biometrical Journal 43, 606–616.

Bauer, P. and Kieser, M. (1999). Combining different phases in the development of medical treatments within a single trial. Statistics in Medicine 18, 1833–1848.

Bauer, P. and Köhne, K. (1994). Evaluation of experiments with adaptive interim analyses. Biometrics 50, 1029–1041.

Bauer, P., Röhmel, J., Maurer, W. and Hothorn, L. (1998). Testing strategies in multi-dose experiments including active control. Statistics in Medicine 17, 2133–2146.

Braun, T. M. (2002). The bivariate continual reassessment method: Extending the CRM to phase I trials of two competing outcomes. Controlled Clinical Trials 23, 240–256.

Bretz, F., Schmidli, H., König, F., Racine, A. and Maurer, W. (2006). Confirmatory seamless phase II/III clinical trials with hypotheses selection at interim: General concepts. Biometrical Journal 48, 623–634.

Dunnett, C. W. (1955). A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association 50, 1096–1121.

Esary, J. D., Proschan, F. and Walkup, D. W. (1967). Association of random variables, with applications. The Annals of Mathematical Statistics 38, 1466–1474.

Hommel, G. (2001). Adaptive modifications of hypotheses after an interim analysis. Biometrical Journal 43, 581–589.

Ivanova, A. (2003). A new dose-finding design for bivariate outcomes. Biometrics 59, 1001–1007.

Lehmacher, W., Kieser, M. and Hothorn, L. (2000). Sequential and multiple testing for dose-response analysis. Drug Information Journal 34, 591–597.

Lehmacher, W. and Wassmer, G. (1999). Adaptive sample size calculations in group sequential trials. Biometrics 55, 1286–1290.

Liu, Q. and Pledger, G. (2005). Phase 2 and 3 combination designs to accelerate drug development. Journal of the American Statistical Association 100, 493–502.

O'Quigley, J., Hughes, M. D. and Fenton, T. (2001). Dose-finding designs for HIV studies. Biometrics 57, 1018–1029.

Pitt, L. D. (1982). Positively correlated normal variables are associated. The Annals of Probability 10, 496–499.

Ruberg, S. J. (1995a). Dose-response studies. II. Analysis and interpretation. Journal of Biopharmaceutical Statistics 5, 15–42.

Ruberg, S. J. (1995b). Dose-response studies. I. Some design considerations. Journal of Biopharmaceutical Statistics 5, 1–14.

Sonnemann, E., Finner, H. and Kunert, J. (1986). Analyse von Verlaufskurven. Biometry course, Trier University.

Stallard, N. and Todd, S. (2003). Sequential designs for phase III clinical trials incorporating treatment selection. Statistics in Medicine 22, 689–703.

Tamhane, A. C., Dunnett, C. W., Green, J. W. and Wetherington, J. D. (2001). Multiple test procedures for identifying the maximum safe dose. Journal of the American Statistical Association 96, 835–843.

Tamhane, A. C. and Logan, B. R. (2002). Multiple test procedures for identifying the minimum effective and maximum safe doses of a drug. Journal of the American Statistical Association 97, 293–301.

Thall, P. F. and Cook, J. D. (2004). Dose-finding based on efficacy-toxicity trade-offs. Biometrics 60, 684–693.

Williams, D. A. (1971). A test for difference between treatment means when several dose levels are compared with a zero dose control. Biometrics 27, 103–117.

Williams, D. A. (1972). The comparison of several dose levels with a zero dose control. Biometrics 28, 519–531.

Zeymer, U., Suryapranata, H., Monassier, J. P., Opolski, G., Davies, J., Rasmanis, G., Linssen, G., Tebbe, U., Schroder, R., Tiemann, R., Maching, T. and Neuhaus, K. L. (2001). The Na+/H+ exchange inhibitor eniporide as an adjunct to early reperfusion therapy for acute myocardial infarction – results of the evaluation of the safety and cardioprotective effects of eniporide in acute myocardial infarction (ESCAMI) trial. Journal of the American College of Cardiology 38, 1644–1650.
