

Br. J. Cancer (1977) 35, 1

DESIGN AND ANALYSIS OF RANDOMIZED CLINICAL TRIALS REQUIRING PROLONGED OBSERVATION OF EACH PATIENT

II. ANALYSIS AND EXAMPLES

R. PETO,¹ M. C. PIKE,² P. ARMITAGE,¹ N. E. BRESLOW,³ D. R. COX,⁴ S. V. HOWARD,⁵ N. MANTEL,⁶ K. McPHERSON,¹ J. PETO¹ AND P. G. SMITH¹

From ¹Oxford University, ²University of Southern California, ³University of Seattle, ⁴Imperial College, London, ⁵U.C.H. Medical School, London and ⁶George Washington University

Report to the Medical Research Council's Leukaemia Steering Committee; Chairman, Professor Sir Richard Doll

Received 22 December 1975 Accepted 25 August 1976

Summary.-Part I of this report appeared in the previous issue (Br. J. Cancer (1976) 34, 585), and discussed the design of randomized clinical trials. Part II now describes efficient methods of analysis of randomized clinical trials in which we wish to compare the duration of survival (or the time until some other untoward event first occurs) among different groups of patients. It is intended to enable physicians without statistical training either to analyse such data themselves using life tables, the logrank test and retrospective stratification, or, when such analyses are presented, to appreciate them more critically, but the discussion may also be of interest to statisticians who have not yet specialized in clinical trial analyses.

CONTENTS

ANALYSIS

16.-General principles
17.-Definition of the "trial time" for each patient
18.-The life table
19.-The logrank test
20.-Logrank significance levels (including chi-square tabulation)
21.-Explanatory information (prognostic factors)
22.-Use of prognostic factors to refine the treatment comparison
23.-Bad methods of analysis
24.-How much data should be collected from each patient?
25.-Subdividing the follow-up period
26.-Arranging the manner in which the data will be collected
27.-Assessment by separate causes of death
28.-Other end points
29.-Remission duration
30.-Combining information from different trials

EXAMPLES

31.-Immunotherapy of acute leukaemia
32.-The MRC myelomatosis trials

Requests for reprints to R. Peto, Radcliffe Infirmary, Oxford, England; or to M. C. Pike, University of Southern California School of Medicine, Los Angeles, California 90033, U.S.A. Reprints of both parts will be sent to those who request reprints of either part. Bulk orders for teaching purposes cost £5 or $10 per 10; please inform us if any details are unclear, misleading or wrong.


REFERENCES FOR PART II

APPENDICES FOR PART II

3.-Worked example of a clinical trial analysis (hypothetical data)
4.-How to record data in such a way that it is easy to analyse by computer
5.-Testing for a trend in prognosis with respect to an explanatory variable

STATISTICAL NOTES FOR PART II

ANALYSIS

16.-General principles

Some of this introduction recapitulates text from Part I.

Many clinical trials compare survival duration among cancer patients randomly allocated to different treatments. There has been much investigation in the statistical literature of possible ways of interpreting the data from such trials, the surprising outcome of which has been the discovery that 2 techniques (life table graphs and logrank P-values), which are so simple that they are easily mastered by non-statisticians, are commonly more accurate and more sensitive than any of the elaborate alternatives that have been considered. Part II of this report now describes these 2 techniques in sufficient detail for them to be performed entirely without statistical guidance. Inessential notes on statistical details are relegated to the end of the text, and should be ignored by most readers.

If the course of the disease is very rapid (e.g. acute liver failure) and it is unimportant whether a dying patient lives a few days longer or not, a count of the numbers of deaths and survivors on each treatment is all that is required. However, if (as with most forms of neoplastic disease) an appreciable proportion of the patients do eventually die of the disease, but death may take some considerable time, it is possible to achieve a more sensitive assessment of the value of each treatment by looking not only at how many patients died but also at how long after entry they died.

You can best learn statistical methods by applying them to data which interest you. If you have some data to interpret, then we hope that, when you have read this paper and applied it to your data, you will feel that the methods it describes are straightforward and that the use of them has simplified your data and helped you understand them. However, if you do not have any such data to interpret, perhaps you should not study the technical parts of this paper carefully, as attempting to learn the details of statistical methods in the hope of applying them at some vague date in the future usually produces confusion.

In Part II of this article, the first few sections describe how time to death may be analysed, ignoring entirely all assessment of the quality of life. The methods described, however, are equally useful for analysing time to some other first event; for example, in clinical trials of solid tumour therapy, a separate analysis of time from entry to first local recurrence may be of interest, or perhaps a separate analysis of time from entry to first metastatic spread. In later sections, we describe the ways in which the statistical methods used to study death rates may also be used to study rates of some other particular type of event in a clinical trial. Analysis of survival duration in a randomized trial would usually involve obtaining:

(i) Descriptive graphs of the observed outcome in each treatment group ("life tables") which can be compared with each other visually.

(ii) A P-value to see if the observed differences between treatment groups could plausibly be just chance (using the "logrank test").

(iii) Information about how survival differs between groups of patients who differ with respect to an "explanatory variable", such as age or disease stage, recorded at the time of randomization (using life tables and logrank tests to compare these groups with each other).

(iv) Retrospective stratification, based on the findings in (iii), followed by recalculation of the P-value comparing treatment groups making proper allowance for which of these strata each patient is in.

17.-Definition of the "trial time" for each patient

"Time" is measured for each patient from that particular patient's date of randomization.

After deciding to analyse the results, a stopping date, perhaps the end of a particular month, is chosen and for each patient in the trial it is determined whether he was alive or dead on that date and, if dead, the date on which he died. (Deaths occurring shortly after the chosen stopping date are ignored in this analysis, even if they are known of when analysis occurs, since otherwise the risk of death would be slightly exaggerated.) If the collaborating centres have enough advance warning of the stopping date, they can arrange appointments for all their surviving patients for a few days after this date, and the data collection can then be completed within a matter of weeks of the stopping date. It is certainly not sufficient to rely on busy collaborating physicians to notify a trial centre whenever deaths occur; delays of several months would then be commonplace, and some deaths might be completely missed. Curiously, recall (in response to telephone enquiries) of how long ago deaths occurred or patients were last seen is very unreliable, many events being remembered as being considerably more recent than they actually were. Exact dates when patients died or were last seen must, unfortunately, be determined.

For each patient one now knows whether or not he died and the time for which he was at risk, which we call his trial time. This runs from the date of his randomization to the date of his death or, if he did not die, to the stopping date. The trial time of a lost or emigrated patient runs to the date of loss; if loss may have occurred because therapy was not being successful (or because it has been completely successful), there is no satisfactory way of allowing for this fact, so don't let it happen! Note that survival in a therapeutic trial is measured from the date of randomization, not the date of first symptoms, presentation or starting treatment. Early deaths, occurring before treatment has even started, are thus included in the actual treatment comparison.
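The definition of trial time can be sketched in a few lines of Python. This is only an illustration: the function name, its argument names and the convention of counting whole days between two calendar dates are our assumptions, not part of the report.

```python
from datetime import date
from typing import Optional, Tuple


def trial_time(randomized: date,
               stopping: date,
               died: Optional[date] = None,
               lost: Optional[date] = None) -> Tuple[int, bool]:
    """Return (trial time in days, True if a death is counted).

    Hypothetical helper: the argument names are illustrative only.
    """
    # A death counts only if it occurred on or before the stopping date;
    # later deaths are ignored, as the text prescribes.
    if died is not None and died <= stopping:
        return (died - randomized).days, True
    # A lost or emigrated patient is at risk only up to the date of loss.
    if lost is not None and lost <= stopping:
        return (lost - randomized).days, False
    # Otherwise the patient is at risk up to the stopping date.
    return (stopping - randomized).days, False


# Illustrative dates only (not data from any trial).
stop = date(1967, 1, 1)
print(trial_time(date(1966, 1, 1), stop, died=date(1966, 3, 2)))
print(trial_time(date(1966, 1, 1), stop, died=date(1967, 1, 5)))
```

The second call shows a death a few days after the stopping date being treated as a survivor whose trial time ends on the stopping date.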

18.-The Life Table

This is a graph or table giving an estimate of the proportion of a group of patients that will still be alive at different times after randomization, calculated with due allowance for incomplete follow-up.

If all the patients in a trial have died, it is easy to calculate the proportion of patients surviving to the end of a particular day from randomization, and a graph of this proportion against time from randomization would be a simple life table, for the special case where all have died.

Unfortunately, this simple and sensible graph of "proportion alive" against "time since randomization" can be fully plotted only if all the patients are already dead before analysis of the data is undertaken. For example, if some of the patients are still alive with trial times less than one year (because they were randomized only a few months ago), we cannot yet know if they will eventually survive a full year from randomization or not, and so there is no simple and obvious estimate of the proportion of all patients who will be alive at one year.


However, to survive a whole year, a patient has to survive each of the 365 days comprising it, and this apparently trivial observation is the key to efficient estimation of how many will live the full year out. We need first to look at the death rates observed on each individual day, and then to argue that, for example, the way to live 31 days is to live 30 days and then to live one more day.

Translated into the language of probabilities, this means that the probability of living 31 days from randomization is the probability of living 30 days multiplied by the chance of surviving Day 31 after living 30 days: they are multiplied together since this is how one combines such probabilities. It is essential that the previous sentence be clearly understood, for without it the remainder of this section will be obscure. (It is just analogous to the calculation that if 2 coins are tossed in succession, the probability of both being heads is one-quarter, this being the product of the probability that the first coin is heads, which is one-half, and the probability, after the first coin has come down heads, that the second coin will now do so, which is also one-half.)

This simple rule is all that underlies the calculation of "life table" graphs. It follows fairly straightforwardly from it that the chance of living a year from randomization is

C1 × C2 × C3 × C4 × ... × C364 × C365

where:

C1 denotes the chance of surviving at least one day from randomization

C2 denotes the chance of surviving a second day after you have survived one day from randomization

C3 denotes the chance of surviving a third day after you have survived 2 days from randomization

C4 denotes the chance of surviving a fourth day after you have survived 3 days from randomization

etc., and:

C365 denotes the chance of surviving Day 365 after you have survived 364 days from randomization.
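The multiplication rule can be tried out numerically. In this minimal sketch the function name is ours, and the constant daily chance of 0.999 is an invented figure chosen only to show the arithmetic.

```python
import math


def chance_of_surviving(conditional_chances):
    """Multiply the successive conditional chances C1, C2, ... together."""
    return math.prod(conditional_chances)


# The coin analogy from the text: one-half times one-half.
both_heads = chance_of_surviving([0.5, 0.5])

# Hypothetical illustration: if every one of the 365 daily chances were 0.999,
# the chance of living the full year out would be roughly 0.69.
one_year = chance_of_surviving([0.999] * 365)

print(both_heads, round(one_year, 2))
```

Even a daily chance of survival very close to 1 compounds, over a year, into an appreciable chance of death.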

Unfortunately, we do not know any of these individual C's. However, we could estimate any particular one of them (C365, for example) by looking to see what proportion of patients who are at risk on Day 365 actually survived it. Let us write this observed survival rate for Day 365 as p365. We could use p365, the observed survival rate on Day 365 among those alive after 364 days, as a very crude estimate of C365, the actual chance of surviving Day 365 after you have survived 364 days. To calculate p365, we study all patients with trial times greater than or equal to 365 days; in other words, all patients who entered the trial more than 364 days ago (so that we have a chance to see their fate on Day 365) and who were still alive 364 days after randomization. (Patients who entered only a few months ago tell us nothing about p365.) p365 is then simply the proportion of these patients who survive Day 365. If, as may well be the case, nobody happened to die exactly on Day 365, then p365 = 1. For every post-randomization day, p can be defined analogously.*

The "life table" estimate of the true probability (C1 × C2 × C3 × C4 × C5 × C6 × C7) of surviving 7 days from randomization is simply p1 × p2 × p3 × p4 × p5 × p6 × p7; likewise, the "life table" estimate of the chance of surviving a whole year from randomization would be:

p1 × p2 × p3 × p4 × ... × p364 × p365,

a product of 365 observed survival rates.

* Formally, on Day 76 after randomization, p76, the observed survival rate, equals (no. with trial time of at least 76 days who did not die on Day 76)/(no. with trial time of at least 76 days). The observed survival rate on Day 76 after randomization thus makes no use whatsoever of data from patients who died before Day 76, or who were randomized less than 76 days ago.
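The calculation just described can be sketched as follows. The function name and the 5 invented patients are ours; the rule for each day's observed survival rate follows the footnote above. Days with no deaths are skipped because p = 1 on such days and the product is unchanged.

```python
from collections import Counter


def life_table(trial_times, died):
    """Life-table estimate after each day on which at least one death occurred.

    trial_times: trial time in days for each patient.
    died: parallel list of booleans (True if the trial time ended in death).
    Returns a list of (day, estimated proportion still alive after that day).
    """
    deaths_on = Counter(t for t, d in zip(trial_times, died) if d)
    estimate = 1.0
    table = []
    for day in sorted(deaths_on):
        # At risk on this day: every patient with a trial time of at least `day`.
        at_risk = sum(1 for t in trial_times if t >= day)
        # Observed survival rate p for this day.
        p = (at_risk - deaths_on[day]) / at_risk
        estimate *= p
        table.append((day, estimate))
    return table


# Invented data: 5 patients, 2 of them still alive when analysed.
times = [2, 3, 3, 5, 6]
dead = [True, False, True, True, False]
for day, est in life_table(times, dead):
    print(day, round(est, 3))  # estimates are roughly 0.8, 0.6 and 0.3
```

If every patient has died, the same function returns the simple proportion of survivors: for trial times 1, 2, 3 and 4 it gives three-quarters, one-half, one-quarter and zero.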


Although each individual observed survival rate (p) is a very inaccurate estimate of the corresponding chance of survival (C), it is a surprising fact that the product of a lot of p's (the life table estimate of the chance of still being alive after a certain time) is quite an accurate estimate of the product of the corresponding C's (the actual chance of still being alive then).

It may be noticed that the life table estimate of the chance of surviving any particular number of days from randomization is thus the product of the life table estimate up to the previous day, and the observed survival rate for the particular day. The life table estimate is exactly the same as the simple proportion of survivors if all the patients die before the trial is analysed (readers can check this themselves for the first 6 days, for example). The "life table" (or "survival curve") for a set of data is a graph or table of this estimate against time from randomization. (The usage has emerged that either a table or a graph may be referred to as a "life table".)

A life table is the most accurate description of a set of data on the times to death of a group of patients, and physicians engaged in such clinical trials should be familiar with its construction and interpretation. In practice, one doesn't have to worry about the multitude of days on which nobody happened to die, for on these days p = 1 and a life table may stay constant for a whole run of such days.

Although the life table is much more reliable than the individual observed survival rates of which it is composed, spurious big jumps or long flat regions may sometimes occur in a plotted life table; this is discussed more fully below.

In an MRC trial in chronic granulocytic leukaemia (MRC, 1968), previously untreated patients were admitted at several centres from September 1959 to December 1964. Analysis was undertaken with the stopping date of 1 January 1967. Fig. 3 is the life table for the entire group of 102 patients admitted to the trial. It is slightly irregular, as are all life tables based on small numbers of patients, but its general shape gives the best information that can be derived from the collected data about the pattern of mortality of these patients, although its fine detail is not really informative.

Fig. 4 gives the separate life tables for the 2 treatment groups of patients: those treated by radiotherapy and those treated with busulphan. Evidently, the busulphan-treated patients have fared somewhat better. This shows how treatment differences can be illustrated by survival curves.

FIG. 3. Life table for all patients in the Medical Research Council's first CGL trial. The numbers of patients still alive and under observation at entry and annually thereafter were: 102, 84, 65, 50, 18, 11, 3.
[Plot. Panel title: Chronic granulocytic leukaemia, 102 patients. Horizontal axis: time from first treatment, weeks (0 to 400).]

FIG. 4. Life tables for the 2 separate treatment groups in the Medical Research Council's first CGL trial. The numbers of patients still alive and under observation at entry and annually thereafter were: busulphan 48, 40, 33, 30, 13, 9, 3; and radiotherapy 54, 44, 32, 20, 5, 2, 0.
[Plot. Panel title: Chronic granulocytic leukaemia. Curves: busulphan, 48 patients; radiotherapy, 54 patients. Vertical axis: 0 to 100. Horizontal axis: time from first treatment, weeks (0 to 400).]

FIG. 5. Life table derived from the hypothetical data used as a worked example in Appendix 3. The numbers of patients still alive and under observation at entry and every 6 months thereafter were: 25, 14, 11, 10, 8, 7, 7, 7, 4.
[Plot. Legend: times of death; trial times of survivors; correct life-table; incorrect life-table (see text). Annotation: 4 still at risk at 1400 days. Horizontal axis: time from first treatment, days (0 to 1250).]

If the vertical "% survivors" axis is given on a logarithmic scale, the slope of the survival curve at any given time estimates the death rate among the survivors at that time. This is sometimes useful as it shows how the death rate among the survivors depends on time from randomization, but since such data usually have to be presented to people who are not really familiar with logarithms, logarithmic axes should in general be avoided when presenting life tables, especially since they magnify the parts of the life table which are least accurate at the expense of the more accurate parts.
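The link between the slope on a logarithmic scale and the death rate can be written out briefly. The symbols here are ours, not the report's: S(t) is the proportion still alive at time t, λ(t) is the death rate among the survivors at time t, and natural logarithms are assumed.

```latex
\frac{d}{dt}\,\log S(t) \;=\; \frac{1}{S(t)}\,\frac{dS(t)}{dt} \;=\; -\lambda(t),
\qquad\text{where}\qquad
\lambda(t) \;=\; -\,\frac{dS(t)/dt}{S(t)} .
```

The curve slopes downwards, so it is the steepness of the slope (its size without the minus sign) that corresponds to the death rate among those still alive.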

Fig. 5 gives the life table for the hypothetical data which are presented and analysed as a worked example in Appendix 3. Here, a common feature of many real life tables is apparent: the data are so sparse that there are long periods (e.g. between Day 250 and Day 600) when nobody happens to have died. The life table consequently consists of flat regions separated by "steps", and again it must be emphasised that any conclusion based on the fine detail of such a graph is likely to be wrong.

Particularly, long flat regions at the right-hand end of a life table do not imply that the real risk of death among patients who are still alive then is negligible, unless a large number of patients have trial times well into or beyond the flat region.

In Figs. 3, 4 and 5, the trial times of all individual patients, whether they died or not, are marked, so it is a fairly simple matter when looking at the graphs to see how many patients were still at risk at any one time. This is done by counting the points to the right of this time, and is especially valuable in life tables such as Fig. 5 where the data are sparse. If your trial is so big that it would be impossible to show a distinct point for each patient, write at several different times along the foot of the life table the number of patients with trial times of at least those magnitudes. This has also been done in the legends to Figs. 3, 4 and 5, although it would really have been better if these numbers of patients remaining "at risk" had actually been written along the bottom of each graph.

The reason for wanting to know how many patients are alive and still being followed up at a particular time from randomization is that this information can be used for a quick (but quite accurate; see statistical note 6) estimate of how much your life table at that time might differ from the value it would have had in a vastly larger, and hence more accurate, study. However, these estimates should not usually be used for the calculation of P-values; for this, the logrank test, which is described below, is preferable since it takes into account the overall structure of the 2 curves being compared, not just their values at one time.

(Over periods as long as 5 or 10 years from entry, an appreciable proportion of a group of old patients would be expected to die from other causes, and so "adjusted" 5- or 10-year percentage survivals are sometimes cited. These are simply the life-table estimates of the proportion of the diseased patients still alive at 5 or 10 years as a percentage of the proportion that would have been alive had only the national age- and sex-specific death rates prevailed.)
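As a small arithmetic illustration of such an adjusted figure, the sketch below uses a function name and two proportions that are invented for the purpose; neither comes from any trial or national table.

```python
def adjusted_percentage_survival(life_table_estimate, expected_proportion):
    """Life-table estimate as a percentage of the proportion expected to be
    alive had only national age- and sex-specific death rates prevailed."""
    return 100.0 * life_table_estimate / expected_proportion


# Hypothetical figures: 30% of patients alive at 5 years by the life table,
# against 80% expected from national death rates alone.
adjusted = adjusted_percentage_survival(0.30, 0.80)
print(round(adjusted, 1))  # about 37.5
```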

19.-The Logrank test

This involves counting the number of deaths observed in each group, O, and comparing it with E, the extent of exposure to risk of death in that group; the method of calculating E is given below.

The basis of the logrank test is so straightforward that it seems surprising that it was first suggested only as recently as 1966 (Mantel, 1966). Breslow (1975) has recently written a unified statistical review paper on the logrank test and allied approaches to clinical trial data, which can be consulted for formal justification of its widespread use.

The principle is that if, for example, of the patients under observation on a particular day after randomization, two-thirds are in Treatment Group A and one-third are in Treatment Group B, then on average two-thirds of the deaths on that day should occur among A patients and only one-third among B patients, unless A is really a more, or a less, effective treatment than B. We may define the extent of exposure to risk of death of A patients on that day to be two-thirds of the number of deaths on that day, and that of B patients to be one-third of the number of deaths on that day.*

When considering a particular day after randomization, a little care is needed to work out the proportions of A patients and B patients, since patients who have died before this day must be ignored, as must patients randomized so recently that their fate on this particular day is not yet known. (In this respect, the calculation resembles the calculation of the life table.) For example, if in a clinical trial 100 patients are randomized to A and 100 to B, then on Day 1, 200 patients are at risk, half in Group A and half in Group B. If, however, due to death, loss or recent entry, only 60 of the A patients plus 40 of the B patients have trial times of 365 days or more, then on Day 365, 100 patients are at risk, with proportions 0.600 in Group A and 0.400 in Group B. If 2 deaths occurred among these 100 patients on the 365th day after randomization, the extent of exposure to risk of death suffered by the A patients on Day 365 would be 1.200 and that suffered by the B patients would be 0.800. In other words, risks are related to the proportions remaining, not to the proportions originally randomized. Note, of course, that no calculations need be performed for days on which no deaths occur in either group, since the quantity which we have chosen to call the extent of exposure to risk of death will necessarily be zero on all such days. (This illustrates that our definition of the extent of exposure to risk on a particular day depends on what actually happened on that day, not on what might have happened.)
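The Day 365 example can be reproduced with a few lines of Python. The function name and the dictionary layout are our own choices for illustration; the numbers are those of the example in the text.

```python
def extent_of_exposure(at_risk_by_group, deaths_that_day):
    """Share one day's deaths among the groups in proportion to the numbers
    of patients at risk in each group on that day."""
    total_at_risk = sum(at_risk_by_group.values())
    return {group: deaths_that_day * n / total_at_risk
            for group, n in at_risk_by_group.items()}


# Day 365 in the text: 60 A patients and 40 B patients at risk, 2 deaths.
day_365 = extent_of_exposure({"A": 60, "B": 40}, 2)
print(day_365)  # A: 1.2, B: 0.8
```

On a day with no deaths the function returns zero for every group, which is why such days need no calculation.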

The actual number of deaths observed on a certain day among the Treatment A patients will usually not exactly equal the extent of exposure to risk on that day, especially as this is likely not to be an exact whole number. The number of deaths actually observed among the A patients may be less than the extent of exposure of the A patients to risk on some days and more on other days. If A and B are equivalent treatments, however, then over a long period (comprising many individual days) the total number of deaths observed in A patients should on average equal the sum of all the separate extents of exposure of the A patients to risk of death on each separate day during this period.

* The general definition is that the extent of exposure to risk of death among a subgroup of patients on a particular day is the total number of deaths on that day in the whole study population, multiplied by the proportion of the patients at risk on the particular day who are in the subgroup of interest: see Appendix 3 for a worked example.

The logrank* test comparing Treatment A with Treatment B during a certain period involves:

(i) counting the total number of Group A deaths observed during that period, calling this OA;

(ii) counting the total number of Group B deaths observed during that period, calling this OB;

(iii) calculating the extents of exposure of the A patients to risk during each day of the period, adding them all up to get the total extent of exposure to risk of death suffered by the A patients during this period, calling this EA;

(iv) deriving similarly the total extent of exposure to risk of death suffered by the B patients during this period, calling this EB;

(v) comparing OA with EA and OB with EB, to see if there are any marked discrepancies. Table IV gives such a comparison for the data from the whole of the MRC chronic granulocytic leukaemia trial (MRC, 1968).

As an arithmetic check, OA + OB should equal EA + EB, except for slight rounding errors. (When calculating the extents of exposure to risk on individual days, it suffices to work to 3 decimal places.) It follows that if OA exceeds EA, indicating that Group A fared worse than the average of Groups A and B together, then OB will be less than EB, indicating that Group B fared better than average.

This method generalises instantly to the comparison of several groups of patients with each other: for each group, the extent of exposure to risk of death on a particular day is still the proportion on that day who are in that group times the number of deaths on that day; and again, the total exposure in one group over an extended period is the sum of the separate exposures in that group on the separate days comprising the period. In any one period, too, the sum of all the O's will equal the sum of all the E's. For example, if we were comparing 4 groups, A, B, C and D, we would finally check that: OA + OB + OC + OD equals EA + EB + EC + ED.
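Steps (i) to (v), for any number of groups, can be sketched as below. The function name, the tuple layout and the 6 invented patients are ours; only the rule for sharing each day's deaths among the groups is taken from the text. This sketch covers the whole period of observation rather than a chosen sub-period.

```python
def logrank_o_and_e(patients):
    """Observed deaths O and extent of exposure E for each group.

    patients: iterable of (group, trial_time_in_days, died) tuples.
    Returns two dictionaries keyed by group: O and E.
    """
    patients = list(patients)
    groups = sorted({g for g, _, _ in patients})
    observed = {g: 0 for g in groups}
    exposure = {g: 0.0 for g in groups}
    # Only days on which somebody died contribute anything.
    for day in sorted({t for _, t, d in patients if d}):
        at_risk = {g: 0 for g in groups}
        deaths = {g: 0 for g in groups}
        for g, t, d in patients:
            if t >= day:              # still under observation on this day
                at_risk[g] += 1
                if d and t == day:    # died on this day
                    deaths[g] += 1
        total_at_risk = sum(at_risk.values())
        total_deaths = sum(deaths.values())
        for g in groups:
            observed[g] += deaths[g]
            exposure[g] += total_deaths * at_risk[g] / total_at_risk
    return observed, exposure


# Invented data: (group, trial time, died).
data = [("A", 3, True), ("A", 5, False), ("A", 8, True),
        ("B", 2, True), ("B", 3, True), ("B", 6, True)]
O, E = logrank_o_and_e(data)
print(O, {g: round(e, 3) for g, e in E.items()})
# Arithmetic check from the text: the O's and the E's have the same total.
print(sum(O.values()), round(sum(E.values()), 3))
```

For these invented data OA = 2 against EA = 3.2, and OB = 3 against EB = 1.8, so the totals agree at 5.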

Two questions are unanswered: " WA'hatperiods should be examined? " and " Whatconstitutes a marked discrepancy betweenG and E? " The second question isdealt with in Section 20. The answer tothe first question depends on the diseasebeing studied; for acute myeloid leukae-mia, where there are many early deathsduring the first few months after random-ization, followed by partial stabilization, itmight be sensible to look separately atwhat happens in 2 periods, the first includ-ing all days in the first 6 months afterrandomization, and the second comprisingall subsequent davs. Alternatively, foranother disease, one might look separatelyat the apparent treatment differences in3 periods, the first year, the second vear,

TABLE IY.-MRC Chronic Granulocytic Leukaemia Trial

Treatmentgroup

BusulphanRadiotherapyAll patients

No. ofpatients in

group

4854102

0, observedno. ofdeaths405090

E, extent ofexposure torisk of death

51 9538 -0590-00

Relativedeath rate,

O/E0- 771-311-00

* The name " logrank " derives from obscure mathematical considerations (Peto and Pike, 1973) whichare not worth understanding; it's just a name. The test is also sometimes called, usually by Americanworkers who cite Mantel (1966) as the reference for it, the " Mantel-Haenszel test for survivorship data ".


and all subsequent years. Whatever periods the time from randomization is split into, the most important comparison is that for all periods together. This is the "logrank test" comparing the overall difference between the whole survival curve for A and that for B.

Tables (such as Table IV) of observed numbers of deaths, O, and extents of exposure to risk of death, E, for the whole period of observation, give a concise summary of the trial results.* The ratio O/E for a subgroup is called the relative death rate for that subgroup, because it approximates to the ratio of the daily death rate in that subgroup to the daily death rate among all groups combined. Therefore, the ratio of 2 O/E's from different subgroups can be used to describe the apparent ratio of the corresponding death rates. For example, in Table IV, the relative death rates on the 2 treatments are 0.77 and 1.31, suggesting that the true death rate ratio is about 0.77/1.31 = 0.6: i.e., very crudely, that busulphan prevents or delays about 40% of the deaths that would occur with radiotherapy.
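The arithmetic behind these relative death rates is easily checked. The short sketch below simply recomputes O/E from the O's and E's of Table IV; the variable names are ours and carry no special meaning.

```python
# O and E for each treatment, copied from Table IV.
table_iv = {
    "Busulphan": (40, 51.95),
    "Radiotherapy": (50, 38.05),
}

# Relative death rate in each group: O/E.
relative = {group: o / e for group, (o, e) in table_iv.items()}

# Apparent ratio of the two death rates (busulphan : radiotherapy).
ratio = relative["Busulphan"] / relative["Radiotherapy"]

for group, rate in relative.items():
    print(f"{group}: O/E = {rate:.2f}")
print(f"death rate ratio = {ratio:.2f}")
```

The ratio is a little under 0.6, which is the basis for the very crude statement that about 40% of the deaths are prevented or delayed.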

Actually, in one group the death rate will probably not be constant; it might, for example, be more rapid among patients who have only just been randomized than among patients who have been in the trial for over a year. If the death rate in one group is thus not constant, how can we talk meaningfully about the ratio of the death rates in 2 groups? If time from randomization is subdivided into periods which are short enough for the death rates not to vary much within one time period, then within each separate period it is meaningful to talk about the death rates in 2 subgroups of patients, and the ratio of these 2 death rates. Some sort of average of these death rate ratios in different time periods could be formed, and this "average death rate ratio" is what is really estimated by the ratio of 2 O/E's for 2 subgroups of patients.

A statistical test based on the differences between O's and E's is optimal, in the sense that if there really is a slight difference in the efficacy of the treatments (whereby the death rate in one group consistently exceeds that in the other group by a certain proportion: see statistical note 5 in Part I) no other valid statistical method is as likely to yield a significant difference. We shall now describe how P-values are calculated from the differences between O's and E's to help decide whether such differences could plausibly have occurred by chance alone.

20.-Logrank significance levels

P-values may be estimated by comparing the sum of $(O-E)^2/E$ with an appropriate chi-square distribution.

The approximate statistical significance of differences between observed numbers of deaths, O, and extents of exposure to risk of death, E, in different groups, can be calculated quite rapidly.

In each group we can calculate $(O-E)^2/E$. The more discrepant the value of O in a particular group is from the value of E in that group, the bigger $(O-E)^2/E$ in that group will tend to be. Suppose that we calculate $(O-E)^2/E$ in each group and that we then add these up, one term from each group. This is something which we shall want to discuss in many places in this paper. It is therefore convenient to have a brief name for the sum of all the $(O-E)^2/E$ values, and we choose to call it $X^2$. Likewise, we shall let the symbol k denote the

* There is obviously some sort of analogy between E, the total extent of exposure in a subgroup, and an expected number of deaths in that subgroup, and because of this, E is often referred to as the "expected number of deaths". Unfortunately, in a group of patients who stay alive longer than average, E may, as in Table IV, exceed the number of patients originally randomized into a group. Since it would seem paradoxical to "expect" more deaths than there are patients, the name "extent of exposure" for E is perhaps preferable. However, both names are now sanctioned by usage in published work and, whichever name is used, the statistical arguments are equally valid.


number of groups being compared with each other: in many clinical trial analyses, k = 2.

If there are k treatment groups and the prognosis of each group is, in fact, the same, then $X^2$ will usually be roughly equal to $(k-1)$.* If, on the other hand, the prognosis in different groups is really different, the observed numbers, O, in each group will be systematically different from the corresponding extents of exposure, E, and $X^2$ will tend to be greater than $k-1$. Large values for $X^2$, therefore, although they could arise by chance, constitute evidence for real differences between the prognoses in the k groups. It is possible to calculate the approximate probability that $X^2$ would, if the prognosis were the same in all k groups, exceed any particular given value; and, of course, the larger the given value the less probable this is.

This probability, which we call the "significance level" or "P-value", is estimated by an analogy between the behaviour of $X^2$ if the k treatments were identical and the behaviour of one of the standard distributions of statistics, the chi-square distribution. Actually, there are lots of different chi-square distributions, each with a different mean value; we can have a chi-square distribution with mean 1, a chi-square distribution with mean 2, and so on: the mean value of a particular chi-square distribution is called the "degrees of freedom" of that chi-square distribution, for reasons which are not essential here. Since, if the prognoses are the same in all k groups, the expected value of $X^2$ is approximately $(k-1)$, we shall use the analogy with the chi-square distribution with mean $(k-1)$. This comparison enables us to say that P, the significance level, is approximately the probability that an ordinary chi-square distribution with $k-1$ degrees of freedom shall equal or exceed the observed value of $X^2$. Some of these probabilities are listed in the footnote.†

As an example of the use of these methods, consider the data of Table IV. There are 2 groups, so k = 2 and $X^2$, the sum of $(O-E)^2/E$, is

$$X^2 = \frac{(-11.95)^2}{51.95} + \frac{(11.95)^2}{38.05} = 6.50.$$

(N.B. This sum has only one term for each group, based on the numbers of deaths: no contributions come from the numbers of survivors.) Comparison of this value with the tabulated behaviour of chi-square with one degree of freedom shows that the observed difference between the 2 treatments is more extreme than would commonly arise by chance alone: since 5.02 < 6.50 < 6.63, 0.025 > P > 0.01 and since 6.50 nearly equals 6.63, P ≈ 0.01. We might, in a publication, say "The difference is statistically significant ($X^2$ = 6.50, d.f. = 1, P ≈ 0.01)."
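For readers with access to a computer, the calculation for Table IV can be sketched as follows. The helper names `x_squared` and `chi2_upper_tail` are ours, not part of any published programme; the tail probability uses the standard closed-form series for a chi-square distribution with a whole number of degrees of freedom, and relies only on Python's standard `math` module.

```python
import math


def x_squared(o_and_e):
    """Sum of (O - E)^2 / E, one term per group."""
    return sum((o - e) ** 2 / e for o, e in o_and_e)


def chi2_upper_tail(x, df):
    """Probability that chi-square on `df` (a positive integer) degrees
    of freedom equals or exceeds x."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_j (x/2)^(j-1/2) / Gamma(j+1/2)
    term = math.sqrt(half) / math.gamma(1.5)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (j + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


# O and E for busulphan and radiotherapy, from Table IV.
table_iv = [(40, 51.95), (50, 38.05)]
x2 = x_squared(table_iv)
p = chi2_upper_tail(x2, df=len(table_iv) - 1)
print(f"X2 = {x2:.2f}, d.f. = {len(table_iv) - 1}, P = {p:.3f}")
```

The P-value obtained in this way lies between 0.01 and 0.025 and is close to 0.01, in agreement with the tabulated values. It remains the simple chi-square approximation, not the more precise calculation of Peto and Pike (1973).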

More precise significance levels can be

* Although the reasons for this rough equality will not be apparent to most non-statisticians, it must unfortunately be taken on trust, as its proof is beyond the scope of the present paper; the same is true of the chi-square analogy which follows.

† For any mean value (1, 2, 3, 4, ...), the chi-square distribution with that mean has a probability just under 0.05 of exceeding (mean + 3√mean). For comparisons of 2, 3, 4, 5 or 6 groups with each other, the minimal value of $X^2$ necessary to generate certain particular P-values is tabulated.

No. of groups   Mean value of X2                         Minimal X2
of patients     (i.e. degrees of
being           freedom of chi-
compared        square analogue)   P<0.1   P<0.05   P<0.025   P<0.01   P<0.005   P<0.001
    2                  1            2.71    3.84     5.02      6.63     7.88     10.83
    3                  2            4.61    5.99     7.38      9.21    10.60     13.81
    4                  3            6.25    7.81     9.35     11.34    12.84     16.27
    5                  4            7.78    9.49    11.14     13.28    14.86     18.47
    6                  5            9.24   11.07    12.83     15.09    16.75     20.52


calculated for publication purposes (Peto and Pike, 1973; see statistical note 7 on p. 38) if statistical assistance is available, and these will usually be slightly more extreme than the significance levels derived by this simple chi-square analogy. A worked example of the use of all the methods so far described on some hypothetical clinical trial data is given in Appendix 3, where life tables, O's and E's and P-values are calculated from first principles.

21.-Explanatory information (prognostic factors)

If patients are retrospectively divided into strata, life tables and logrank methods can compare the prognosis in different strata, testing for heterogeneity or, if possible, for trend.

Explanatory information is any data that can help to explain some of the differences between the survival times of different individuals. Broadly speaking, any facts collected from all the patients before their entry to a clinical trial can be used in this way without difficulty. Incomplete information is of much less use, and it may not be possible to deal with it in an unbiased way. Information collected once treatment is under way may be valuable, if it is collected from all the survivors at a particular time after their first treatment. This is, however, usually more difficult to arrange than is anticipated when the trial is designed, so data collected at the time of original randomization is usually of the greatest value.

With several items of explanatory information available from each patient, it is possible to determine (using life tables and the logrank test, but testing between groups defined by the explanatory variables instead of between different treatment groups) which (if any) of these items are correlated with prognosis. It is also possible to test whether an apparent influence on prognosis is merely due to an association with another, possibly more important, factor (Cox, 1972; Breslow, 1975). There is a worked example of this in Appendix 3.

For example, in analysing the MRC myelomatosis trial (MRC, 1971a), there was no apparent difference between the 2 treatment schedules tested, and interest turned to these explanatory variables. Table V gives the observed numbers of deaths, and the extents of exposure to risk of death in this trial, according to the blood urea of the patients at presentation. It can be seen from the column of values of O/E that the lower the initial blood urea, the better the prognosis.

We could, of course, check how easily heterogeneity as extreme as, or more extreme than, that actually seen between the O's and E's in Table V could arise simply by chance, by calculating:

$$X^2 = \frac{(79 - 122.06)^2}{122.06} + \frac{(81 - 74.60)^2}{74.60} + \frac{(53 - 16.34)^2}{16.34}$$

In this case $X^2$ = 97.99 and, since there are 3 groups, it is appropriate to compare $X^2$ with tables of chi-square on 2 degrees of freedom. The previous footnote shows that the probability of chi-square with 2 degrees of freedom exceeding 13.81 is 0.001. The probability is therefore much, much less than 0.001 that $X^2$ should attain a value as large as or larger than 97.99 by chance alone. In publishing such data, we might therefore write "$X^2$ = 97.99, d.f. = 2, P < 0.001" (or even, for emphasis, P << 0.001).

This value of $X^2$ is so extreme that it answers the question of chance beyond doubt. However, in less extreme cases, it is preferable to make use of the fact that the groups are ordered, and to test for the existence of a trend in prognosis as we go from the first group to the last group. The trend test seeks not just heterogeneity, but plausible heterogeneity, in which the middle group (or groups) tends to have a more average prognosis than the outer groups, and the outer


TABLE V.-First MRC Myelomatosis Trial

Initial urea     No. of       O, observed   E, extent of     Relative
(mg/100 ml       patients     no. of        exposure to      death rate,
blood)           in group     deaths        risk of death    O/E
0-39               113           79           122.06           0.65
40-79               92           81            74.60           1.09
80+                 53           53            16.34           3.24
All patients       258          213           213.00           1.00

groups tend to differ in opposite directions from the average prognosis. Whenever more than 2 groups are being compared, and they do have a natural ordering, it is likely to be more sensitive to test for trend than to test for heterogeneity. Technical details of how to test for trend are relegated to Appendix 5.
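The heterogeneity statistic quoted for Table V can be reproduced with the short sketch below. It is a check on the arithmetic only, not a trend test; the variable names are ours. With 3 groups there are 2 degrees of freedom, and for that particular chi-square distribution the probability of equalling or exceeding a value x is exactly exp(-x/2), so no tables are needed.

```python
import math

# (O, E) for the low-, medium- and high-urea groups, copied from Table V.
table_v = [(79, 122.06), (81, 74.60), (53, 16.34)]

# Heterogeneity statistic: one (O - E)^2 / E term per group.
x2 = sum((o - e) ** 2 / e for o, e in table_v)

# Upper tail of chi-square on 2 degrees of freedom.
p = math.exp(-x2 / 2)

print(f"X2 = {x2:.2f}, d.f. = 2, P = {p:.1e}")
```

The resulting P-value is many orders of magnitude below 0.001, which is why it is enough to report "P < 0.001".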

However, when a plausible contrast is as marked as that in Table V (the relative death rate in the high-urea group being about 5 times as big as that in the low-urea group), no sane reader will suppose that the differences between the O's and E's arose simply by chance. This particular P-value, therefore, answers an irrelevant question, and need hardly be cited; the most important thing with such data is to characterize the difference, not to test whether it could be due to chance or not. To describe the dependence of prognosis on initial blood urea, we might calculate separately:

(i) the life table for the low-urea patients,

(ii) that for the medium-urea patients,

(iii) that for the high-urea patients,

and plot all three of them on a single graph of "estimated % alive" versus "time since entry to study".
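Each of the three life tables listed above could be computed by a routine such as the following sketch. It assumes that the life table is the running product of the proportions surviving each day on which a death occurs, and that a patient last seen alive on a given day is still at risk on that day. The function name `life_table` and the five patients are hypothetical, not data from the myelomatosis trial.

```python
def life_table(patients):
    """Return [(day, estimated proportion alive after that day), ...].

    `patients` is a list of (day, died) pairs: the trial time of death
    (died=True) or of last news of a living patient (died=False).
    """
    curve = []
    alive = 1.0
    for day in sorted({d for d, died in patients if died}):
        at_risk = sum(1 for d, _ in patients if d >= day)
        deaths = sum(1 for d, died in patients if died and d == day)
        # Multiply by the estimated chance of surviving this day.
        alive *= 1 - deaths / at_risk
        curve.append((day, alive))
    return curve


# Hypothetical stratum of five patients.
stratum = [(30, True), (60, False), (90, True), (120, True), (150, False)]
curve = life_table(stratum)
for day, proportion in curve:
    print(f"day {day}: estimated {100 * proportion:.0f}% alive")
```

Applying the same routine to each urea stratum in turn would give the three step curves to be plotted on one graph.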

22.-Use of prognostic factors to refine the treatment comparison

If a treatment difference among patients in one stratum is calculated, the sum of all such differences, one per stratum, yields an overall test of whether treatment matters among otherwise similar patients.

In clinical trial analysis, we are interested in whether apparent differences between treatments might be due merely to random allocation of more of the good-prognosis patients to one treatment than to the other treatment. Obviously, anything we know about the major determinants of prognosis can help us to answer this question correctly, and help us to see whether, given the different numbers on each treatment in various prognostic categories, there is any residual relationship of treatment with survival.

In earlier reports of clinical trials, the first step in the analysis was often to examine the percentage of each favourable and unfavourable prognostic feature in each treatment group and, hopefully, to demonstrate that they were not too different. To ensure this, a policy of initial stratification was sometimes adopted, giving alternate patients in each particular prognostic stratum alternate treatments. With modern methods of analysis of survival data, it does not matter if there is some imbalance of prognostic features between treatments, and stratification on entry is usually unnecessary. The principle underlying these methods is very simple: when the trial is being analysed, find out which of the factors recorded at entry are relevant to prognosis (by the method of the previous section). In the light of this analysis, define a few "prognostic strata", so that within each stratum the patients all, as far as could have been told at entry to the trial, had a fairly similar prognosis.

This is straightforward, if only one of your explanatory variables is strongly related to prognosis. If there is a natural way of subdividing that one important variable (e.g. male/female, or Stage I/Stage II/Stage III/Stage IV), then use


these natural subdivisions to define your strata. If it is a continuous variable, for example haemoglobin, you might first try subdividing it fairly finely (e.g. into 6 to 10 subgroups) and calculate the observed numbers of deaths, O, and the extents of exposure to risk of death, E, in each such subgroup. Finally, calculate O/E in each subgroup and pool adjacent subgroups with roughly similar values of O/E, to give yourself a few larger strata.

If you have only two important explanatory variables to allow for, then first use this approach to each one separately, splitting each into as few categories as possible. If you can manage to split the patients into only 2 or 3 categories with respect to each of the 2 important variables, then your strata might well be the 4, 6 or 9 different combinations of categories of these 2 variables. Stratification with respect to as many as 3 variables is often not necessary, and stratification with respect to more than 3 variables is usually both unnecessary and unwise, unless you have thousands of patients in your study.

Let us suppose that you have now defined, on the basis of explanatory information recorded at entry into the trial, a few retrospective strata. Within the first of these prognostic strata calculate, as above, the observed numbers of deaths and the extents of exposure to risk of death on each treatment, entirely ignoring all the patients in all the other strata. Within the first stratum, the sum of the observed numbers on the various treatments will necessarily equal the sum of the various extents of exposure. If all the treatments being compared are equivalent, then for any one treatment group in this first stratum, the observed number and extent of exposure will differ from each other only by random fluctuation. If one treatment is better than the other(s), however, then for that treatment in this stratum, the observed number of deaths is likely to be less than the extent of exposure, although since we have only looked at a fraction of all the patients in the trial so far, this difference is unlikely to be significant. However, we next repeat this analysis for the patients in the second prognostic stratum, and then for the third prognostic stratum, and so on. For a particular treatment, we now have an observed number and an extent of exposure in every stratum, which differ from each other only randomly, unless treatment matters. These may be added, to obtain a grand observed number, O, and a grand extent of exposure, E, for that treatment. Even if there is no very significant treatment effect within any single stratum, differences in the same direction in several strata can reinforce each other so that the grand O's and E's in certain treatment groups eventually differ from each other significantly, if some treatments really are better than others.
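The procedure of the preceding paragraph (O and E within each stratum, then grand totals over strata) can be sketched as follows. The function names, the two strata and the eight patients are all hypothetical; the sketch assumes that a patient last seen alive on a given day is still at risk on that day, and it is not the computer programme mentioned elsewhere in this paper.

```python
def o_and_e(records):
    """{treatment: (O, E)} within one stratum, from (treatment, day, died)."""
    treatments = sorted({trt for trt, _, _ in records})
    observed = {trt: 0 for trt in treatments}
    exposure = {trt: 0.0 for trt in treatments}
    for day in sorted({d for _, d, died in records if died}):
        at_risk = {trt: sum(1 for t, d, _ in records if t == trt and d >= day)
                   for trt in treatments}
        deaths = {trt: sum(1 for t, d, died in records
                           if t == trt and died and d == day)
                  for trt in treatments}
        total_at_risk = sum(at_risk.values())
        total_deaths = sum(deaths.values())
        for trt in treatments:
            observed[trt] += deaths[trt]
            exposure[trt] += total_deaths * at_risk[trt] / total_at_risk
    return {trt: (observed[trt], exposure[trt]) for trt in treatments}


def grand_o_and_e(strata):
    """Add the within-stratum O's and E's for each treatment over all strata."""
    grand = {}
    for records in strata.values():
        for trt, (o, e) in o_and_e(records).items():
            grand_o, grand_e = grand.get(trt, (0, 0.0))
            grand[trt] = (grand_o + o, grand_e + e)
    return grand


# Hypothetical trial: two prognostic strata, treatments A and B.
strata = {
    "good prognosis": [("A", 200, True), ("A", 400, False),
                       ("B", 150, True), ("B", 300, True)],
    "poor prognosis": [("A", 40, True), ("A", 90, False),
                       ("B", 20, True), ("B", 60, True)],
}
grand = grand_o_and_e(strata)
x2 = sum((o - e) ** 2 / e for o, e in grand.values())
print(grand, round(x2, 2))
```

Because each patient is compared only with patients in the same stratum, the large difference in death rate between the good- and poor-prognosis strata does not enter into the grand O's and E's.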

Comparison of these grand observed numbers and extents of exposure (by calculating $X^2$, as previously, and comparing it with the standard chi-square distribution with mean one less than the number of treatments) is not biased in any way by chance correlations between particular prognostic strata and treatment, and statistical tests for significant differences between the grand O's and E's are therefore the best way to assess real treatment benefits. In the MRC myeloma trial, inspection of Table V led us to define 3 prognostic strata (low, medium, and high urea) and after this stratification we eventually found that there was no significant effect of treatment among patients with any given level of urea. Similar techniques can also, of course, be used to examine the relevance of one factor to prognosis, with other factors being constant. A computer programme capable of doing all such analyses and of plotting or printing life-tables is available on request (see p. 20), and a worked example of the use of explanatory information is given in Appendix 3. An instructive and interesting example of the use of these methods on real data is provided by the report of the Medical


Research Council's fourth and fifth therapeutic trials in acute myeloid leukaemia (MRC, 1974).

In multi-centre trials, the differences between the prognoses of patients entered at different centres can be substantial. To allow for this is simple: stratify with respect to centre, and within each centre calculate observed numbers of deaths and extents of exposure to risk of death as described above with respect to treatment (or some explanatory variable). Finally, add up all the observed numbers, and all the extents of exposure for one treatment (or explanatory variable category), obtaining a grand O and E for that treatment (or category). $X^2$, calculated from the grand O's and E's for all treatment groups, provides a valid test of whether any real differences between treatments exist. This is unbiased whether or not there is marked heterogeneity in the types of patients admitted or in the general standards of medical management at different centres.

One advantage of dividing the patients into retrospective strata is that if one treatment is better than the other, it is sometimes much more so among certain types of patients than among others. A trial where this might be the case is described in the Example of Section 31, where methotrexate appears to be of substantial benefit in acute lymphoblastic leukaemia remissions only if the white blood count is low. However, it is extremely important not to be misled into seeing effects like this (which are called "interactions") in every set of data analysed. If patients are divided into 3 or more strata, then since each stratum is smaller than the whole study, purely random differences between treatments will be more marked in each stratum. These differences may well point in opposite directions in different strata, giving the impression of an interaction, whether one is really there or not. The fundamental P-value to be reported is the overall comparison of treatments, adjusted by retrospective stratification. If this is not significant, it is unwise to conclude without expert statistical assistance that any treatment differences in individual strata are real.

A fuller discussion of statistical methods for the identification and use of prognostic factors may be found in Armitage and Gehan (1974).

23.-Bad methods of analysis

A list is given of some common methods of analysis of survival data which are either inefficient, misleading or actually wrong.

(1) The comparison of life tables at one point in time, ignoring their structure elsewhere, is in general inefficient (except for diseases which are very rapidly fatal or cured). Moreover, if the point at which the comparison is being made is chosen (e.g. Mathe et al., 1969) because the difference there is substantial, the comparison is invalid unless special statistical methods are used, and these are inefficient.

(2) If few patients are at risk for more than a certain time, and after that time none of these few happens to die, there will be an apparent "plateau" in the life table. Such plateaux at the ends of life tables are very common, and should never be taken as evidence that "after a certain time most patients are cured" unless there are large numbers of patients still at risk at the time of the plateau. (Likewise, a sudden and meaningless big drop can sometimes occur near the right-hand end of a life table.)

(3) "Median* survival times" are very unreliable unless the death rate around the time of the median survival is still high. Even in quite extensive data, median survival times can be very inaccurate. Although median survival times are widely cited, they should therefore be treated with great caution, except for diseases in which nearly everyone dies, the data are extensive, and the life table falls rapidly through the whole region between 70% and 30% alive (the region in which the life table is used to estimate the median). Average survival times can be far worse, and should almost never be cited.

* Definition: If half the patients will die within a certain time from their randomization and half will live longer than that time, that time is the "Median survival time" for these patients. It can be estimated by calculating the life table for these patients and seeing on which day the life table (which estimates probability of surviving) crosses 50%. If the life table has long flat regions near 50% then this estimated median will be very imprecise. For example, in Fig. 5 the median suggested by the graph is at 210 days but if 2 of the early deaths had been avoided the median would have been at 630 days!

(4) A simple count of the numbers dead in each group is inefficient (except for diseases which are rapidly fatal or cured), as it wastes the information as to exactly when each death occurred.

(5) The best estimate of the probability of living 4 years, say, from randomization, is given by the value of the life table at 4 years. This is because the life table makes proper use of partial data, from patients who have been studied for only part of the first 4 years of their disease. Estimates other than the life table should never be constructed without expert statistical guidance. A less accurate but valid estimate is given by the proportion of the people who were randomized 4 years or more before the stopping date, who were still alive 4 years after their randomization. The number of deaths by a certain time divided by the total number originally randomized (e.g. Mathe et al., 1969) systematically underestimates the risk of death if some patients only entered recently, since the recent patients have not yet had their full chance of dying. Conversely, the number of dead before year 4 divided by the number dead before year 4 plus the number surviving at year 4 (a common error) systematically overestimates the risk of death by year 4, since recent patients could not count among the living but inflate the number dead.

(6) Study of survival from first treatment rather than from randomization is undesirable, especially if the treatments being compared are such that the delays in initiating them in ill patients might differ. Study of survival among those who have lived long enough for a certain number of courses of treatment to have been given may misleadingly exaggerate the chances of survival. These are not absolute statistical prohibitions, of course, just warnings!

(7) The connection of the bottoms of the steps of life tables, such as that in Fig. 5, by sloping lines is improper, as it results in a graph which is a biased estimator of the proportion surviving at a given time. (Connection of the tops of the steps with each other would be oppositely biased.)

(8) Other significance tests could be used instead of the logrank test; for example, Gehan's (1965) modification of the Wilcoxon rank sum test is used in many American studies, and it is certainly a valid method to use. The advantage of the logrank test is that if there really is a slight difference between the groups being compared, whereby the death rate in one group consistently exceeds that in the other group by a given proportion, then this difference is more likely to be detected by the logrank than by any other valid assumption-free test (see statistical note 5 in Part I of this report).

(9) Believing that a treatment effect exists in one stratum of patients, even though no overall significant treatment effect exists, is a common error. Belief that a treatment difference exists should chiefly be based on the overall sum of all the within-stratum treatment comparisons. If this is clearly significant, serious consideration may then be directed to discovering whether the difference between the two treatments is more marked in some strata than in others. (This would be described by a statistician as an "interaction" between treatment and certain strata; the statistical use of this word resembles the medical use of the word "synergism".) However, marked


heterogeneity of the treatment comparison in different strata can arise by chance more easily than would intuitively be expected, and statistical assistance should usually be sought before accepting any apparent interactions between treatment differences and patient characteristics as real.

(10) Failure to check really carefully that, on your selected stopping date, all the patients you think are alive really are alive is unwise; many trial organizers underestimate the time it takes for news of death to reach them.

(11) A "one-sided" or "one-tailed" P-value may be cited in a clinical trial report; if so, you should usually double it to get the sort of ordinary P-value which you are used to, and which would emerge from the methods given in this paper. If, in a trial, Group A fares better than Group B, then the probability of A doing at least this much better than B just by chance is the one-sided P-value, while the probability of the difference between A and B being at least this big in one direction or the other (A better or B better) just by chance is the ordinary P-value. (For emphasis, the ordinary P-value is occasionally referred to as the two-sided or two-tailed P-value.)

(12) Published P-values are sometimes calculated after excluding protocol deviants, or any other category of withdrawn patients; if so, they should be mistrusted. Section 13 in Part I discussed how and why a reliable analysis should be done in such trials. It is, however, sometimes useful not only to do a rigorous analysis, treating withdrawals etc. properly, but also to do various informal analyses omitting certain such patients, assuming certain of them to have died soon after loss (or to have lived for ever), and so on. If all these informal analyses agree with the rigorous analysis in some conclusion, it will make that conclusion more acceptable to many readers. (No disagreement can arise if, due to good trial design, there are few, or no, exclusions or withdrawals.) If the analyses do not all agree, the investigator should make sure he understands why they do not, but should usually trust the rigorous analysis more than the others.

(13) In Part I, we argued against the use of historical controls when randomized controls could be used instead. Although we asserted that most claims based on historically controlled studies remain open to reasonable doubt, it should not be inferred that all historically controlled studies are the same. Certain investigators appear to feel that any comparison can be made moderately respectable simply by labelling one group of patients "historical controls", and so, when reading reports of historically controlled comparisons, one should always be aware of the possibility of gross errors of method. On the other hand, excellent investigators sometimes have to make cautious use of what are, effectively, historical controls, especially when the alternative would be no answer for months, for years, or for ever.
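The four estimates contrasted in point (5) of this list can be compared on a small invented data set. In the sketch below, everything is hypothetical: 4 patients were randomized 5 years before the stopping date and 4 were randomized only 2 years before it. In these particular data the two improper ratios fall on either side of the life-table estimate, in the directions described in point (5); a single tiny example illustrates the mechanism but proves nothing about bias in general.

```python
# (years to death or to last news, died, years of follow-up available)
patients = [
    (1.0, True, 5.0), (3.0, True, 5.0), (5.0, False, 5.0), (5.0, False, 5.0),
    (1.0, True, 2.0), (2.0, False, 2.0), (2.0, False, 2.0), (2.0, False, 2.0),
]
HORIZON = 4.0  # we want the risk of death within 4 years of randomization


def life_table_risk(patients, horizon):
    """1 minus the life-table estimate of the proportion alive at `horizon`."""
    alive = 1.0
    death_times = sorted({t for t, died, _ in patients if died and t <= horizon})
    for time in death_times:
        at_risk = sum(1 for t, _, _ in patients if t >= time)
        deaths = sum(1 for t, died, _ in patients if died and t == time)
        alive *= 1 - deaths / at_risk
    return 1 - alive


dead_by_horizon = sum(1 for t, died, _ in patients if died and t < HORIZON)
seen_alive_at_horizon = sum(1 for t, _, _ in patients if t >= HORIZON)

# Life table: uses the partial follow-up of the recent patients properly.
risk_life_table = life_table_risk(patients, HORIZON)

# Valid but less accurate: only patients randomized 4 or more years ago.
early = [p for p in patients if p[2] >= HORIZON]
risk_early_only = sum(1 for t, died, _ in early if died and t < HORIZON) / len(early)

# Improper: deaths divided by all patients randomized.
risk_over_all_randomized = dead_by_horizon / len(patients)

# Improper: deaths divided by deaths plus those seen alive at 4 years.
risk_dead_over_dead_plus_alive = dead_by_horizon / (dead_by_horizon + seen_alive_at_horizon)

print(risk_over_all_randomized, risk_life_table, risk_dead_over_dead_plus_alive)
```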

24.-How much data should be collected from each patient?

The general principle is: collect as much data as possible at first presentation, only data which are strictly necessary thereafter, and analyse the data you do collect very thoroughly.

Although not essential, the collection of extensive data on each patient at the time of randomization (including perhaps a serum sample, some biopsy material or some other biological matter*) which can be stored indefinitely in case analysis of it is required, can help check to what extent

* Stored samples from a large series of diseased patients can often be of great value when new hypotheses are devised in the future, especially since analytical results can then be immediately correlated with survival duration. Any measurement which is strongly correlated with survival for a reason which is not obvious is likely to have such a deep connection with the fundamental disease processes that elucidation of this is likely to prove really fruitful. This is much less likely to be true of a measurement which, although abnormal in diseased patients, is not strongly correlated with prognosis.


any treatment effects that do appear are due to the chance inclusion of an excess of good-prognosis patients on one protocol: in other words, such data can help our statistical analysis of the treatment effects to compare like with like. Such data may also enable us to identify particular subgroups in which one treatment is preferable, and anyway, the relationship of presenting features to each other or to prognosis in a uniformly treated series can sometimes be of more interest than the treatment comparison itself.

By contrast, apart from recording any significant side-effects of treatment, and recording the date and cause of death, plus perhaps the dates of a few other relevant events, data collected after randomization are of less value, and simplicity in what the participating physicians are asked to record during the follow-up of each patient should be sacrificed only for a very good reason.

Most routine data recorded during follow-up in many clinical trials are never used in any publication, and were collected partly because the trial designers had not thought out clearly what they would really need and what they would not. It is not easy to know in advance what will be needed, but the blunderbuss approach of demanding masses of details in case one or two items are eventually needed is wasteful, or worse. Excessive form-filling can be positively harmful, if it wastes so much time at the hospitals that gaps get left in some essential data, or if doctors become reluctant to enter patients into this, or some future trial, because of the administrative burden anticipated. It should be emphasized that, although extensive data will need to be recorded in the hospital notes of each patient during the follow-up, little of this need be sent to the trial organizer. Although the trial organizer need not try to understand the full clinical course of each patient, he must however know when whatever critical events (toxic manifestations, perhaps, relapse and death) he is interested in occur, and preferably he should know what circumstances immediately preceded or accompanied those events.

If the treatment should, when it is effective, cause some observable effect on the disease itself, such as leukaemia remission or solid tumour shrinkage, then an assessment of these effects should be made in each patient. Even if survival is not significantly different, there may be a highly significant immediate effect of one treatment which is of great interest.

Special studies of the progress of the disease can be added to the clinical trial at certain centres that wish to do so, and these may illuminate either the natural history of the disease or the mechanisms underlying certain treatment effects. However, in a multi-centre trial it may be wise to leave these extra data-recording tasks as optional activities which any participating centre can be free from, if it so wishes, without censure. In trials where the organizer feels he has to ask for some follow-up data, he should, soon after the trial has got properly under way, reconsider what data to request (or how emphatically to demand it) in the light of which requested items actually get very incompletely documented, how much work each item actually involves, and any other practical considerations that have arisen. If some changes are necessary or some work is unnecessary, the sooner this is recognized the better.

The causes (or circumstances) of death should definitely be recorded, if possible, and grouped into a few distinct categories. For instance, in myelomatosis we might try to separate deaths according to whether the patient died of:

(1) "myeloma kidney" during Year 1
(2) other causes during Year 1
(3) sudden onset of drug resistance after Year 1
(4) other, after Year 1.

It is obvious that the determinants of myeloma kidney and drug resistance may be completely independent of each other and that separate analyses of these


4 separate categories of death could be enlightening. Other subdivisions of mortality (e.g. whether neutropenia preceded death) could be useful for other purposes.

Two separate, but extremely important, ways in which retrospective enquiry about the events preceding each death may be of value are (i) that such enquiry may suggest new therapeutic strategies designed to prevent particular causes of death; and (ii) that such enquiry may strongly suggest that some preventable side-effect of one treatment is causing a few particular deaths, even though the number thus caused is far too small to be noticed in any comparison of overall mortality.

For example, in 1964, melphalan was only available in 2 mg tablets, the daily dose was difficult to adjust, and it was discovered that a few deaths of melphalan-treated patients were preceded by drug-induced neutropenia. Warnings, and the manufacture of 0.25 mg tablets, coincided with the cessation of such deaths. Likewise, in acute lymphoblastic leukaemia 10 years later, some deaths were found to be preceded by neutropenia. Review of the case histories of those who died showed recent radiotherapy in almost every case, and further studies then discovered that drugs and radiotherapy given in the particular time relationship which had unfortunately been used, caused far more neutropenia than their separate actions would suggest. Finally, the small and non-significant excess of early deaths in acute myeloid leukaemia patients treated with asparaginase was confidently attributed to the asparaginase, because only among asparaginase-treated patients had marked granulocyte depression apparently contributed to death.

25.-Subdividing the follow-up period

Mortality may have different correlates during different phases of the natural history of the disease, and these should be sought separately.

Splitting the time after follow-up into about 3 different periods, and comparing the logrank O with E in each period separately, as well as in all 3 added together, is a useful safeguard against misinterpreting situations such as those illustrated in Fig. 6, where one treatment is only better than another in certain periods. Fig. 6(i) is an ordinary situation, where A would be better than B in each

FIG. 6. Three hypothetical ways in which treatments might differ in their effects on survival: for each, % alive is plotted against number of years since randomization separately for treatments A and B. (Panels (i), (ii) and (iii); vertical axis up to 100%, horizontal axis marked at 1, 2 and 3 years.)


year. Fig. 6(ii) is a situation where A would be better than B in the first and second years, with no treatment comparison possible in the third year, since there are almost no B survivors to compare with the A survivors. Fig. 6(iii) depicts a situation which is often worried about, and occasionally encountered, where the treatment which is initially better is actually worse in the long run. The logrank test for data such as those in Fig. 6(iii) would show that the death-rate among A survivors was better in the first year, worse in the second year, and worse in the third year. In Figs. 6(i) and (ii) the overall logrank test would necessarily show A to be better than B overall, but in Fig. 6(iii) any result could emerge from the overall logrank test (A better overall, B better overall, or no difference overall).

Suppose that an intensive treatment, which might well kill a substantial number of patients early in the trial, is to be compared with a gentler treatment, which is probably not as fatal in the short run but which may be less curative in the long run, producing the pattern illustrated in Fig. 6(iii). Here, it would be sensible to subdivide time from randomization into an initial period (during which most of the deaths from the intensive treatment would be expected), and a later period, and to accumulate (O-E) for each treatment in the initial period only and (O-E) for each treatment in the later period only, examining these separately. The split between "early" and "late" should be chosen either soon after the intensive treatment becomes less intense, or from consideration of the overall survival curve for all patients together to find when the overall death rate changes. The split should not, of course, be chosen after examination of the 2 separate survival curves for the 2 treatments, for it is too easy to find a time when one treatment has a temporary advantage over the other. Subdividing time in this way is also good policy when the nature of the disease changes as time goes by, e.g. from initial crisis to remission maintenance, as the determinants or correlates of mortality in one period may differ from those in the other. Again, the overall survival curve could be used as an unbiased guide on where to split time to separate an early period of rapid death from a later period.
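
As an illustration only (it is not part of the original report), the subdivision of the follow-up period can be sketched in Python. The function name logrank_oe, the split at 1 year and the patient data are all hypothetical. O and E are accumulated as in Section 19, except that only deaths falling within the chosen period are counted; the numbers at risk on each day are unchanged.

```python
def logrank_oe(times, died, groups, start=0.0, end=float("inf")):
    """Return {group: (O, E)} using only deaths at times t with start < t <= end.

    times[i]  : trial time of patient i (here in years since randomization)
    died[i]   : True if the trial time ended in death, False if still alive or lost
    groups[i] : treatment label of patient i
    """
    labels = sorted(set(groups))
    observed = {g: 0 for g in labels}
    expected = {g: 0.0 for g in labels}
    death_times = sorted({t for t, d in zip(times, died) if d and start < t <= end})
    for t in death_times:
        at_risk = {g: 0 for g in labels}
        deaths = {g: 0 for g in labels}
        for ti, di, gi in zip(times, died, groups):
            if ti >= t:                      # still alive and under observation at t
                at_risk[gi] += 1
                if di and ti == t:
                    deaths[gi] += 1
        n_deaths = sum(deaths.values())
        n_at_risk = sum(at_risk.values())
        for g in labels:
            observed[g] += deaths[g]
            # extent of exposure: deaths at t, shared out by proportion at risk
            expected[g] += n_deaths * at_risk[g] / n_at_risk
    return {g: (observed[g], expected[g]) for g in labels}


# Hypothetical data: 6 patients on each of treatments A and B.
times = [0.2, 0.5, 0.8, 1.5, 2.5, 3.0, 0.9, 1.2, 1.4, 1.8, 2.2, 3.0]
died = [True, True, True, False, True, False,
        False, True, True, True, True, True]
groups = ["A"] * 6 + ["B"] * 6

overall = logrank_oe(times, died, groups)
early = logrank_oe(times, died, groups, start=0.0, end=1.0)   # first year only
late = logrank_oe(times, died, groups, start=1.0)             # after the first year

for name, result in [("early", early), ("late", late), ("overall", overall)]:
    for g, (o, e) in result.items():
        print(f"{name:8s} {g}: O = {o}, E = {e:.2f}, O-E = {o - e:+.2f}")
```

Because each death falls in exactly one period, the O's and E's for the separate periods add up to the overall O and E. The split point itself (1 year here) is an assumption of the sketch and would in practice be chosen as described above, without reference to the 2 separate survival curves.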

26.-Arranging the manner in which data will be collected

A controlled clinical trial is a substantial research undertaking, and sufficient time and money must be set aside to ensure that the records are always complete and up-to-date.

(1) Make sure that the (extensive) initial data from each patient are collected centrally and checked for completeness immediately the patient is entered, so that missing, obscure or unreadable items can be corrected quickly. Retrospective efforts to supply missing data or check implausible items can be very difficult, and can seriously delay the analysis of the trial. If some data are missing, however, it is (unfortunately!) worth making considerable efforts to collect them, if there is any possibility of doing so, as missing data make the statistical analysis much more difficult, and may make the trial report less easy for other workers to interpret without serious doubts about certain of its conclusions.

(2) If something is assessed by a number, record that number exactly, even though the margin of error may be very wide, and not some rounded or grouped version of it. (E.g. it is better to record haemoglobin exactly than to use it to categorize patients as anaemic/not anaemic, and it is better to record cigarette consumption exactly than to divide patients into ranges such as 1-4, 5-14, 15-25, etc.) It is easier to devise appropriate groupings of numerical data when analysing than when designing the trial. Record dates of birth exactly (not just age), as this is sometimes very useful


for tracing lost patients, especially if use may be made of official government records.*

(3) By contrast, if something can be assessed only subjectively (e.g. general condition of patient, nature of tumour, previous history, reasons for presentation, etc.), it is usually worth forcing, albeit slightly artificially, the physician recording the data to do so in certain pre-specified categories (e.g. good, fair, bad; well-differentiated, intermediate, undifferentiated). Although categorization is not helpful when describing an individual case history to a colleague, it is essential for describing a group of case histories, and categorization will either be done with some loss of precision by the physician examining the patient, or it will be done later with a greater loss of precision by somebody examining the medical records of that patient.

For example, the Eastern Co-operative Oncology Group simply record "performance status" as 0-4, where 0 = normal activity, 1 = symptoms but ambulatory, 2 = in bed less than half the time, 3 = in bed more than half the time, and 4 = completely bedridden. Zelen finds that if advanced cancer patients are divided with respect to performance status, the median survival times are very different. Although this may not be a very useful finding biologically, it does mean that if this simple question is asked of all patients on entry to a therapeutic trial in advanced cancer, the treatment comparison in that trial will be more accurate. (By retrospective stratification based on this information, we can get closer to the ideal of comparing like with like.) An alternative performance scale is described in Zelen (1973), where among about 1000 patients with inoperable lung cancer "performance status" was more important than histological type, disease extension or any of the usual information!

Usually, categories containing very few patients are not useful, and have to be merged with other categories, so do not create such categories unless there are strong medical reasons for doing so.

(4) If, in a clinical trial of over 100 patients, you choose to record a lot of presenting features, it is probably worth obtaining programming assistance and getting a convenient computer program set up before starting the statistical analysis, so that any particular tabulation of presenting features against each other or against prognosis can be obtained easily. Otherwise, the data will not receive the attention which, in view of the cost of such trials (Pike, 1973), they deserve. The "SPSS" system, which is available on most large computers, is of some value, but unfortunately it does not implement the logrank test. A FORTRAN program is available which is easy to use and capable of calculating and displaying life tables, performing logrank tests, devising strata, and analysing the effects of one factor (e.g. treatment) making full allowance for some others (e.g. the important explanatory variables). Copies of this will be supplied on request,† at a slight charge to cover duplication and postage.

(5) If the data may eventually be

* In the UK, a clinical trial organizer may write to the Registrar-General, Department of Medical Statistics, Office of Population Censuses and Surveys, St Catherine's House, 10 Kingsway, London W.C.2, giving his credentials as a bona fide research worker plus a very brief outline of the trial, asking the R-G for help in monitoring the dates of death of all trial patients who are normally resident in the UK. If the R-G agrees, and is later supplied with the full name and exact date of birth of each trial patient (either in one batch, after intake has ended, or in a few batches as it progresses), together with a fee of nearly £1 per patient, then the R-G will notify the trial organizer whenever any deaths occur, giving the date and certified cause of death, and the name of the physician who certified the death. (The trial organizer will be notified immediately if any of his batch of patients are already dead.) Adequate follow-up of survival is thus much easier in Britain than elsewhere, for the R-G will usually agree to help any reasonable project by bona fide medical research workers, especially if recent addresses or NHS numbers can also be supplied by the trial organizer for most patients.

† P. G. Smith, DHSS Cancer Unit, 9 Keble Road, Oxford, England.


transferred onto a computer, they should be recorded in such a way that they can be transferred with a minimum of trouble. Notes on how to do this constitute Appendix 4.

27.-Assessment by separate causes of death

Life tables and logrank tests may easily be used to study separately different causes of death.

If the effect of a presenting feature on the difference between 2 treatments is likely only to affect one (or a few) causes of death, we may choose to look separately at those causes of death which are considered relevant. The previous methods of analysis carry over exactly to this situation, the numbers at risk on each day being just as before, although the extent of exposure to risk of death from a relevant cause on a particular day is related, not to the total number of deaths on that day, but to the total number of relevant deaths on that day. Likewise, in calculating the life table, we calculate for each day the proportion who do not die of one of the relevant causes, and then combine these proportions, as in Section 18, to obtain the life table description of mortality from relevant causes only. (This is equivalent to treating deaths from other causes as if they were losses to follow-up, assuming that the different causes of death being studied are independent. Unfortunately, this assumption cannot be tested statistically, but must instead be judged reasonable or not on biological grounds.) For example, in the current MRC polycythaemia trial, which may last for 10 years, we may exclude from certain analyses deaths from causes that are unlikely to be related to polycythaemia or its treatment.

Actually, we shall calculate separately the observed numbers of relevant deaths and the extents of exposure to risk of relevant death by treatment group, and the observed number of non-relevant deaths and the extents of exposure to risk of non-relevant death in each treatment group, so that readers can look either at relevant mortality or at total mortality: observed numbers of deaths from relevant and non-relevant causes can be simply added together to give total numbers of deaths, and so can extents of exposure to risk of relevant and non-relevant death. Separating a few major causes of death, and calculating in each treatment group the observed numbers of deaths from each cause, and extent of exposure to risk of death from each cause, can help the interpretation of data from clinical trials of slowly progressive diseases. Again taking an example from the polycythaemia trial, we may also examine separately (a) the deaths due to onset of acute leukaemia, (b) the deaths due to marrow fibrosis, and (c) other deaths, to see whether active cytotoxic treatment is more leukaemogenic than simple venesection.
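
The additivity just described can be checked with a short Python sketch. This is offered as an illustration under stated assumptions rather than as the program used for the trial: the function name oe_for_causes, the group labels "C" (venesection plus cytotoxic treatment) and "V" (venesection alone), and all the patient data are hypothetical.

```python
def oe_for_causes(times, causes, groups, relevant):
    """Return {group: (O, E)} counting only deaths whose cause is in `relevant`.

    causes[i] is None if patient i was alive (or lost) at the end of his
    trial time; deaths from causes outside `relevant` contribute no events,
    i.e. they are treated as if they were losses to follow-up.
    """
    labels = sorted(set(groups))
    observed = {g: 0 for g in labels}
    expected = {g: 0.0 for g in labels}
    event_times = sorted({t for t, c in zip(times, causes) if c in relevant})
    for t in event_times:
        at_risk = {g: 0 for g in labels}
        deaths = {g: 0 for g in labels}
        for ti, ci, gi in zip(times, causes, groups):
            if ti >= t:
                at_risk[gi] += 1             # numbers at risk are just as before
                if ti == t and ci in relevant:
                    deaths[gi] += 1          # only relevant deaths are counted
        n_deaths = sum(deaths.values())
        n_at_risk = sum(at_risk.values())
        for g in labels:
            observed[g] += deaths[g]
            expected[g] += n_deaths * at_risk[g] / n_at_risk
    return {g: (observed[g], expected[g]) for g in labels}


# Hypothetical data: trial times in years, cause of death (None = not dead).
times = [1, 2, 3, 4, 5, 6, 7, 8]
causes = ["leukaemia", "other", None, "fibrosis",
          "other", "leukaemia", None, "other"]
groups = ["C", "V", "C", "V", "V", "C", "V", "C"]

relevant = oe_for_causes(times, causes, groups, {"leukaemia", "fibrosis"})
non_relevant = oe_for_causes(times, causes, groups, {"other"})
total = oe_for_causes(times, causes, groups, {"leukaemia", "fibrosis", "other"})

for g in sorted(total):
    print(g, "relevant:", relevant[g], "non-relevant:", non_relevant[g],
          "total:", total[g])
```

In this sketch the relevant and non-relevant O's, and likewise the E's, sum to the totals for each group, which is the property that lets readers look at either relevant or total mortality.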

28.-Other end points

Life tables and logrank methods can find whether, among survivors on a given day, any particular endpoint is more likely in one group than another; this is very useful, but can be misused.

It may be of use to note that these methods of analysis can as easily be applied to the influence of treatment on events other than death: time to first detection of leukaemia in the central nervous system; time to first cardiovascular event (fatal or not) and so on. In many clinical trials studying solid tumours, 2 separate analyses, one of time to local recurrence and one of time to metastatic spread, may be required in addition to a full analysis of survival duration.

Exactly analogous statistical techniques may be used for any such analyses, except that now we are interested in the time from randomization to the first such "event" instead of to death. Consequently, patients contribute no further information to such an analysis once their first "event" has occurred. The only


difficulty is to decide how to deal with patients who die without suffering the event of interest.

If the event is such (e.g. neoplastic recurrence) that few patients die without previously suffering it, the most satisfactory course, for reasons which will soon become clear, is undoubtedly to define our interest as being in "recurrence or death not preceded by recurrence". This means that our life tables will estimate the proportions of recurrence-free survivors at different times, and our O's and E's will test which treatment best prolongs disease-free survival.

If, however, the event is such that more patients die without suffering it than ever suffer it, study of it might be swamped if we mixed in all the patients who died without it. An example of this will arise in the current MRC trial comparing venesection with venesection plus cytotoxic treatment for polycythaemia: whether or not any significant difference in total mortality emerges, we shall want to know if, compared with simple venesection, the cytotoxic treatment is leukaemogenic. Counting the few patients who develop leukaemia (including leukaemias which are fatal before the analysis is undertaken and leukaemias which are not) in each group is easy. Calculating the extent of exposure to risk of leukaemia is also easy: we simply argue that on a given day after randomization, all patients who are still alive and free of leukaemia would, if cytotoxic treatment were not leukaemogenic, be equally likely to develop leukaemia. The only difference from our original analysis arises when counting the numbers of patients still alive and at risk on a particular day after randomization (in order to calculate the proportion of them that are in a particular group). Now, we must omit not only, as before, those patients who entered too recently for us yet to know their fate on this day after randomization, but also those who have already developed a leukaemia before this day. This is easy: for purposes of the statistical analysis of leukaemia incidence, re-define the trial time, for patients who eventually get leukaemia, as the time from randomization to leukaemia diagnosis. (For patients who did not get leukaemia, the trial time runs as before to death or, if not dead, to the stopping date or date of loss.) Now the extent of exposure to risk of leukaemia in Group A on a particular day after randomization is the total number of patients in the whole study who get leukaemia on that day, multiplied by the proportion of patients with trial times at least that long who are in Group A.

Once the underlying principle of asking "Among the event-free survivors on a particular day after randomization, are people in each group equally likely to suffer an event on this day?" is mastered, its applications can be very varied and very useful: the method automatically makes perfect adjustment for the effects of differences in mean duration of risk in each group (and for the effects of differences in risk at different times after randomization) on the expected number of events in each group.*
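
The re-definition of the trial time for an analysis of leukaemia incidence, and the resulting extent of exposure on one particular day, might be sketched in Python as follows. The function names, the group labels and the 5 patients are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def leukaemia_trial_time(leukaemia_time, end_time):
    """Return (trial_time, got_leukaemia) for one patient.

    leukaemia_time : time from randomization to leukaemia diagnosis, or None
    end_time       : time from randomization to death or, if not dead, to the
                     stopping date or date of loss
    """
    if leukaemia_time is not None:
        return leukaemia_time, True
    return end_time, False


def exposure_on_day(records, groups, day, group):
    """Extent of exposure to risk of leukaemia in `group` on `day`.

    This is the number of patients in the whole study who get leukaemia on
    that day, multiplied by the proportion of patients with trial times at
    least that long who are in `group`.
    """
    at_risk = [g for (t, _), g in zip(records, groups) if t >= day]
    cases = sum(1 for t, event in records if event and t == day)
    if not at_risk:
        return 0.0
    return cases * at_risk.count(group) / len(at_risk)


# Hypothetical patients: (time to leukaemia or None, time to death/stop/loss).
patients = [(None, 5.0), (2.0, 6.0), (None, 1.0), (2.0, 3.0), (None, 4.0)]
groups = ["A", "A", "A", "B", "B"]

records = [leukaemia_trial_time(lt, et) for lt, et in patients]
print(records)
print(exposure_on_day(records, groups, day=2.0, group="A"))
print(exposure_on_day(records, groups, day=2.0, group="B"))
```

Summing such exposures over every day on which a leukaemia is diagnosed gives E for each group, to be compared with the observed number O of leukaemias in that group.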

Another example of the use of these methods to search for a particular endpoint, was the proposed MRC trial of oral hypoglycaemic agents in diabetes, where the main question of interest would have

* There is, however, one possible source of error which may arise if some patients die without being observed to have suffered the unwanted "event". Those who die are never a random sample of those who remain, usually because their disease state was more severe, but sometimes (causing severe bias) because causes which would have culminated in the untoward event whose incidence we wish to study actually led to death, just before the relevant event was observed. In either case, delay of death by an effective treatment may just allow the relevant event to be observed and counted against that treatment! Use of the logrank test on such events would correctly indicate that, among the survivors on a given day after randomization, patients given a more effective treatment would be more likely to be observed to suffer such events, although this would be a very misleading fact.

For example, if leukaemic proliferation in the blood causes contamination of the central nervous system


been their effect on vascular disease. We would have counted the observed numbers of people in each group who suffered a first vascular event after randomization, and compared these with the calculated extents of exposure to risk of suffering one. (Analysis of the total number of vascular events in each group would be less satisfactory, as the chance inclusion of one patient who suffers several events might affect the totals unduly.)

If there are 2 or more definite events that might be adopted as "endpoints" (e.g. time to first rejection episode and time to complete graft failure, in the MRC trial of therapy following renal transplantation, or time to relapse and time to death in leukaemia), then it is usually worth making a completely separate analysis of the whole trial for each endpoint of interest.

29.-Remission duration

If time is measured from the date of remission, analogous analyses of remission duration are possible.

Exactly analogous techniques can be used to study the duration of remission, if the trial time is taken to start at remission and to end at "relapse or death without relapse" rather than at death. If the randomization is done before the state of remission is achieved, the time at which the physician declares that a patient is in remission may depend on the particular treatment regime that the patient will receive in remission, and this could bias the results. It is therefore preferable, if possible, not to randomize until remission has been achieved, and then to study time from randomization, as before. If this is not possible, a very strict definition of remission and relapse is needed in order to communicate the results to other workers and to avoid any possibility of biased assessment. It is a general rule that randomization should always be left to the last possible moment before the start of treatment, so that events between randomization and treatment do not bias or obscure the effects of the treatments.

Moreover, it should always be borne in mind in analyses of remission duration, that a treatment may prolong the first remission without prolonging survival (as in the immunotherapy trial described in the next section) if the delayed relapse is more likely to be refractory to treatment. (Against this, the observation that "first remission duration and survival duration are strongly correlated" is sometimes made, which, although true, is not relevant in this context.) It should also be borne in mind that a good treatment, which improves the chances of achieving remission but does little or nothing to prolong remission, may result on average in shorter remissions among those achieving remission, because more of the very ill patients are being got into remission and they might relapse more rapidly.

30.-Combining information from different trials

Different trials which each compare the same 2 treatments may best be pooled, to give an overall treatment comparison, by analysing the data as though the separate trials were each retrospective strata within one large trial.

Some therapeutic comparisons which are important, but which are nevertheless known not to involve major differences in survival, require larger resources than any of the investigators who have

(CNS) which is followed by proliferation in the CNS, partial control of the blood but not of the CNS would prevent death from leukaemia in the blood but would allow CNS relapse to be seen. In this case a logrank analysis of the event "CNS relapse or death without it" might be less misleading than a logrank analysis of "CNS relapse", but unfortunately this safer analysis would be less sensitive, if one treatment really did just prevent or exacerbate CNS relapse, and would moreover confuse control of CNS relapse with control of death from other causes. The same difficulties arise, of course, with all other statistical techniques, including the apparently more straightforward visual examination of life tables.


approached them have mustered; and consequently many different randomized trials of these treatment comparisons may have been undertaken, none of which is sufficiently accurate on its own. An example might be the possible utility of anti-coagulants following myocardial infarction, or the utility of a particular form of immunotherapy for a particular neoplasm. We might hope to find an overall tendency for patients given A to fare slightly better than patients given B, obscured or enhanced in each particular trial by random variation.

The most efficient unbiased way of determining whether this is so is to ignore in each particular trial all patients allocated to treatments other than A or B, and then to calculate, for each separate trial, O and E for Treatment A and O and E for Treatment B. Finally, we combine the trials by adding all the O's for Treatment A, to get an overall $O_A$, all the E's for Treatment A, to get an overall $E_A$, and similarly we obtain a combined $O_B$ and $E_B$: $O_A + O_B$ will equal $E_A + E_B$, of course. Comparison of $O_A$ with $E_A$ and $O_B$ with $E_B$ in the usual way, by computing $X^2 = (O_A - E_A)^2/E_A + (O_B - E_B)^2/E_B$ and using the analogy between $X^2$ and chi-square with one degree of freedom, will lead to a P-value testing whether A and B are statistically significantly different from each other. Calculation of $R = (O_A/E_A)/(O_B/E_B)$ will, moreover, yield a useful pooled estimate of the ratio of the death rate on A to that on B.

This method of pooling different trials treats each trial as though it were a retrospective stratum in a single large trial, and of course would be equally applicable if some of the separate trials were themselves actually subdivided, e.g. into young patients and old patients, with calculation of the O's and E's comparing treatments occurring entirely within separate strata in each separate trial. (A standard alternative is to pool different trials by mathematically combining the P-values from the separate trials, but this is less sensitive.) It is an advantage of reporting logrank O's and E's when publishing clinical trials, that the efficient combination of different studies is then straightforward.
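
The pooling arithmetic can be illustrated with a brief Python sketch. The 3 sets of O's and E's below are invented for the purpose and do not come from any real trial, and the function name pool_trials is hypothetical. The P-value uses the standard identity that, for chi-square with one degree of freedom, the upper tail probability of a value x equals erfc(sqrt(x/2)).

```python
import math


def pool_trials(trials):
    """Pool logrank results from several trials of the same 2 treatments.

    Each trial is a tuple (O_A, E_A, O_B, E_B). Returns the summed O's and
    E's, the statistic X2, its P-value on 1 degree of freedom, and the
    pooled ratio R of the death rate on A to that on B.
    """
    o_a = sum(t[0] for t in trials)
    e_a = sum(t[1] for t in trials)
    o_b = sum(t[2] for t in trials)
    e_b = sum(t[3] for t in trials)
    x2 = (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
    p_value = math.erfc(math.sqrt(x2 / 2))   # chi-square upper tail, 1 d.f.
    ratio = (o_a / e_a) / (o_b / e_b)
    return {"O_A": o_a, "E_A": e_a, "O_B": o_b, "E_B": e_b,
            "X2": x2, "P": p_value, "R": ratio}


# Hypothetical trials; within each, O_A + O_B equals E_A + E_B.
trials = [
    (10, 12.5, 15, 12.5),
    (8, 9.0, 11, 10.0),
    (20, 22.5, 25, 22.5),
]

pooled = pool_trials(trials)
for key, value in pooled.items():
    print(f"{key} = {value:.3f}")
```

With these invented figures no single trial would be at all convincing on its own, and the pooled comparison, although it points the same way in every trial, is not statistically significant either; the sketch shows only the mechanics of adding O's and E's across trials.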

EXAMPLES

31.-Example I. Immunotherapy of acute leukaemia

In 1968, Mathe and his co-workers reported that of 20 patients with acute lymphoblastic leukaemia in remission who were subsequently given immunotherapy, 8 achieved long-term remissions, whereas none of an untreated control group of 10 such patients did so (Mathe et al., 1969). Eight of the 20 patients treated by immunotherapy had received BCG alone, and it appeared that this was as effective as the other immunotherapy regimens of blast cells alone or BCG combined with blast cells. Animal work had shown that in mice BCG can, under certain circumstances, cure grafted isogeneic leukaemias, and it was hoped that this might be the first news of the breakthrough to a real leukaemia cure. However, approximate comparison with contemporary British experience suggested that, although the treated patients had fared fairly well, the controls had fared much worse than was normal, and that this accounted for much of the disparity between the 2 treatment groups. It was therefore decided that the Medical Research Council should organize a clinical trial to assess these claims made for BCG. If BCG has any effect, first remissions in which BCG is given regularly should on average last longer than first remissions in which no treatment is given; and the trial as originally conceived was to compare unmaintained and BCG-maintained first remissions.

The intake to the trial was to consist of children with acute lymphoblastic leukaemia who had undergone a standard 5-month course of intensive cytotoxic therapy from their time of diagnosis. After this course, maintenance therapy with methotrexate was expected to prolong the first remission, but possibly to make relapses, when they occurred, more refractory than if no maintenance treatment had been given. (Some leukaemic relapses occurring during unmaintained remission can be controlled by drugs, while others are refractory. If the effect of maintenance treatment during remission is chiefly to suppress those relapses which, had they occurred, could have been controlled anyway, then only the refractory relapses will break through. This effect must be allowed for in the comparison of drug maintenance with immunotherapy-only maintenance, before claims that "immunotherapy in remission facilitates the control of subsequent relapses" can be made.) It appeared, therefore, an open question whether a patient in remission should be left alone or given maintenance chemotherapy.

However, some physicians felt unhappy about not including a maintenance methotrexate group in the trial, and the trial as finally agreed, therefore, had 2 "control" groups, a no-treatment group and a methotrexate group. At each centre, 2-way randomization between immunotherapy and control was used, but which regimen to use for all their control patients was chosen by each centre before the trial began. Intake began in January 1969 and continued to August 1970.

An individual trial cannot exist in isolation. It will of necessity last some years, even if intake only lasts one year, and during this time reports from other workers may make the trial irrelevant, and may even make its continuance unethical. This is a major problem, and must be considered very carefully before starting a trial (Pike, 1973). About one year after the start of this trial, reports from the United States showed that the extent to which first remission was prolonged by maintenance methotrexate was far greater than had been supposed. The physicians from centres which had chosen the no-treatment control now felt that they should no longer put patients into this group. It was therefore decided that all control patients entered from that time onwards should receive maintenance methotrexate as their control treatment. This detracted from the uniformity of the series of patients, but was in the circumstances unavoidable.

One advantage of the fact that 5 months of standard cytotoxic therapy preceded the allocation to different treatment groups in this trial was that all the "difficult" patients (those who had "bad" veins, refused in-patient treatment, were unable to tolerate the intensive therapy, or were not in remission 5 months after first treatment) were eliminated from the trial before randomization. The group available for randomization was therefore more than usually homogeneous: this generally has the effect of increasing the chances of detecting any differences between the treatments. This uniformity was, however, marred to some extent by the dose of L-asparaginase given in the intensive treatment phase being reduced after the first 50 patients, as the original dose level was proving too toxic and too expensive. This could perhaps have been avoided, had the protocol been tried out more extensively on a pilot basis (MRC, 1971b) just to study its toxicity.

Because of the great interest generated by Mathé's group's results, the trial was analysed at several different times. As we have explained, this is in principle a bad practice, but in this case it was inevitable. These analyses were initially limited to the overall comparison of the 3 treatment groups. To our disappointment, it was soon apparent that the effect of BCG was, if positive at all, only slight. However, since our BCG therapy protocol differed from the French in possibly important ways, this finding is of less value than it might otherwise have been.

Ideally, one should use an identical protocol if one wants to test a published claim, but this was not possible, partly due to difficulty in obtaining the Pasteur Institute vaccine used by the French; partly due to unwillingness to scarify large areas of skin, as they had done, rather than use percutaneous inoculation by Heaf gun; and partly because 7 different protocols were used by the French, each on 2, 3 or 4 of the immunotherapy patients. In retrospect, differences between the MRC protocol and the French protocols meant that the negative result eventually found in our trial is not sufficient to demonstrate that their methods were not effective. However, another, still larger, trial (Heyn et al., 1975) based on Mathé's report has also produced a null result.

In the MRC trial, 191 cases presented, 122 went into remission and were randomized as intended at Month 5, and a further 10 were randomized later, during the next few months. When, in 1971, half the patients had relapsed (though few had died), the results among the 122 were reported (MRC, 1971b). People who have cited this publication have given less prominence to the fact that there was no statistically significant difference between the first remission durations of the group given BCG and those of the untreated control group, than to the fact that the observed median remission duration, an unreliable statistic, was 27 weeks on BCG and 17 weeks on untreated control. However, partly because this difference diminished as more data accumulated, the remission durations have never been properly reported since. The trial was not a critical test of immunotherapy, because prophylactic treatment of the central nervous system was so inadequate that meningeal relapses prematurely terminated many of the remissions, but since on 1 January 1976 only 3/55 children given immunotherapy were still in first remission (compared with 2/20 given no maintenance and 8/57 given methotrexate) BCG given in this way appears to be of little value (Table VI).

In Table VI, the fourth line confirms that there is little difference between BCG and untreated control, since the relative relapse rates are so similar (1.25 and 1.32).

If the data for relapses in the BCG and untreated control groups are pooled, the combined relative relapse rate is 1.27: (52 + 18)/(41.58 + 13.61). The relative relapse rate on methotrexate is, by comparison, only 0.77, confirming that maintenance methotrexate really did prolong first remissions ($X^2$ = 7.41, d.f. = 1, P < 0.01), a result which was reported in the 1971 paper. However, examination of the life tables (not given) and the last line of Table VI, where the relative death rates are all unity, shows that, in the long run, the group given methotrexate maintenance did not survive any longer. This result has so far not been reported, partly because the treatments given after relapse were so various as to defy summary, but also, unsatisfactorily, because it is null.

TABLE VI.-MRC Immunotherapy ("Concord") Trial: Follow-up on 1 January 1976

                                  BCG maintenance     Untreated control   Methotrexate maintenance   Total
                                  (55 patients)       (20 patients)       (57 patients)              (132 patients)
                                  O    E      O/E     O    E      O/E     O    E      O/E            O     E
First relapse or death in first remission
  (a) WBC* 0-5                    18   11.56  1.56    6    3.62   1.66    17   25.82  0.66           41    41.00
  (b) WBC* 6-20                   18   14.02  1.28    7    5.72   1.22    15   20.27  0.74           40    40.00
  (c) WBC* 21+                    16   16.01  1.00    5    4.28   1.17    17   17.71  0.96           38    38.00
  Total (a) + (b) + (c)           52   41.58  1.25    18   13.61  1.32    49   63.80  0.77           119   119.00
  (i.e. numbers of relapses and total extents of exposure to risk of relapse, stratified retrospectively for WBC)
Death
  Numbers of deaths and total     39   39.46  0.99    15   15.05  1.00    40   39.49  1.01           94    94.00
  extents of exposure to risk
  of death within WBC strata

* White blood count at presentation, in units of 10^9/l. Among people in full remission, the original WBC is still a good indicator of how long the remission might last, and among people diagnosed years ago, the original WBC is a useful predictor of future risks of death.
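The pooled relative relapse rates and the $X^2$ value quoted for methotrexate can be rechecked from the totals in Table VI. The following is only a minimal sketch: Python is our choice of language rather than anything used in the report, the function name `sum_o_minus_e_squared` is hypothetical, and the statistic is the simple sum of (O - E)^2/E described in this report.

```python
def sum_o_minus_e_squared(groups):
    """Sum (O - E)^2 / E over groups given as (observed, expected) pairs."""
    return sum((o - e) ** 2 / e for o, e in groups)

# First-relapse totals over the three WBC strata, copied from Table VI: (O, E)
bcg = (52, 41.58)
untreated = (18, 13.61)
methotrexate = (49, 63.80)

# Pool BCG with untreated control, since their relative rates are so similar
pooled = (bcg[0] + untreated[0], bcg[1] + untreated[1])

pooled_rate = pooled[0] / pooled[1]              # relative relapse rate, pooled
mtx_rate = methotrexate[0] / methotrexate[1]     # relative relapse rate, methotrexate
x2 = sum_o_minus_e_squared([pooled, methotrexate])

print(round(pooled_rate, 2), round(mtx_rate, 2), round(x2, 2))
```

The three printed figures should agree, to rounding, with the 1.27, 0.77 and 7.41 given in the text.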

An interesting feature of the effect of methotrexate on relapses is that the lower the initial WBC, the greater the importance of methotrexate. The 3 ratios of the relative relapse rate of the methotrexate patients to the combined relative relapse rate among BCG plus unmaintained control patients in Table VI were 0.42 (WBC 0-5), 0.58 (WBC 6-20), and 0.93 (WBC 21+). In other words, among patients presenting with lower WBCs, methotrexate can apparently roughly halve the relapse rate, while among patients presenting with high WBCs, methotrexate seems to have almost no delaying effect on relapses. When, as here, the relative merits of 2 treatments are very different in different strata, there is said to be an interaction between strata and treatments. This particular interaction has been noticed in other studies, and deserves investigation.
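The three stratum-specific ratios can be reproduced from the O and E entries of Table VI. This is a hedged sketch in Python (our choice, not the report's); the dictionary layout and the name `rate_ratio` are illustrative assumptions.

```python
# (O, E) for first relapse within each WBC stratum, copied from Table VI
table_vi = {
    "WBC 0-5":  {"bcg": (18, 11.56), "untreated": (6, 3.62), "mtx": (17, 25.82)},
    "WBC 6-20": {"bcg": (18, 14.02), "untreated": (7, 5.72), "mtx": (15, 20.27)},
    "WBC 21+":  {"bcg": (16, 16.01), "untreated": (5, 4.28), "mtx": (17, 17.71)},
}

def rate_ratio(stratum):
    """Relative relapse rate on methotrexate divided by the pooled rate
    for BCG plus untreated control, within one stratum."""
    pooled_o = stratum["bcg"][0] + stratum["untreated"][0]
    pooled_e = stratum["bcg"][1] + stratum["untreated"][1]
    mtx_o, mtx_e = stratum["mtx"]
    return (mtx_o / mtx_e) / (pooled_o / pooled_e)

ratios = {name: round(rate_ratio(s), 2) for name, s in table_vi.items()}
print(ratios)
```

A ratio near 1 in the highest stratum and well below 1 in the lowest is the pattern the text describes as an interaction between strata and treatments.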

32. Example II. The MRC myelomatosis trials

By 1964, it was generally agreed that both cyclophosphamide and melphalan were useful drugs for treating patients suffering from myelomatosis. No randomized comparison of them had, however, been made and a trial to compare daily oral administration of the 2 agents was begun in October 1964 by the Medical Research Council. Intake continued until August 1968, by which time 276 patients had been admitted. The 2 treatment groups fared almost exactly the same, and apart from this, the main scientific interest of the trial has, therefore, been the relationship between certain biochemical measurements made on the patients at admission and the survival times (MRC, 1973).

It was known that a high blood urea indicated renal failure, and was the major determinant of prognosis, but the statistical techniques discussed above enabled us to investigate a whole range of possible explanatory factors. We found that the apparent adverse effects of hypercalcaemia, osteolytic lesions, and the presence of Bence Jones protein in the urine, were wholly explained by their associations with uraemia. The degree of initial anaemia was strongly correlated with prognosis, and as expected, this correlation was by no means completely accounted for by the strong association between uraemia and anaemia. However, one entirely unsuspected factor, the serum albumin, was found to be of substantial prognostic significance, independently of the urea level (Table VII).

This table is based on 258 of the total of 276 patients entered (18 patients with unusual paraprotein types were excluded from this analysis). The difference between the relative death rates of 1.3 and 0.4 for the low and high albumin patients with no evidence of renal failure represents a difference between about 18 months and 5 years in median survival time, and is thus of considerable medical significance. The reason for this difference remains obscure. It has been suggested that it might be that the more dangerous myeloma cell populations actually catabolise albumin much more rapidly, but this suggestion has not been properly investigated. Likewise, the reason for the relevance of anaemia to prognosis requires investigation, to see whether the relevant anaemia is chiefly of renal origin or due to bone marrow failure.

TABLE VII.-First MRC Myelomatosis Trial: Relative death rates by initial serum levels of urea and albumin

                                                 Low       Medium albumin   High      All albumin
                                                 albumin   (30-39 g/l)      albumin   levels
Low urea (no evidence of renal failure)          1.3       0.8              0.4       0.6
Medium urea (40-79 mg/100 ml)                    1.8       1.0              0.9       1.1
High urea (evidence of severe renal failure)     3.8       4.4              2.1       3.2
All urea levels                                  1.8       1.1              0.7       1.0 (necessarily)

In the second MRC myelomatosis trial, daily low-dose cyclophosphamide was compared with intermittent high doses of melphalan, and with intermittent high-dose melphalan plus prednisone. Because 3 treatment alternatives were being compared, a large number of patients was required, and intake lasted from 1968 to 1975, which is an undesirably long time. Three hundred and seventy-three patients were followed up to 1 July 1976. No significant difference between the 3 treatments emerged, but the intermittent melphalan plus prednisone group did fare a little better. Since intermittent melphalan plus prednisone is as acceptable as intermittent melphalan alone, and more acceptable than continuous cyclophosphamide, it can be definitely recommended, even though its superiority was not statistically significant.

In the second trial, we again found a statistically significant relationship between albumin and prognosis, but it was a weaker relationship than had been observed in the first trial. This slight disappointment should have been anticipated; moderately strong relationships are often more extreme in the studies where they were first discovered to be interesting than in subsequent studies. This is because a relationship which is in expectation only moderate may, in various different studies, appear weaker than it should or stronger than it should, and it is the latter studies which generate interest. (For the same reason, promising new treatments should not be expected to live up to their early promise.)

We now have extensive initial data, and various numbers of years of follow-up, on a total of over 600 myeloma patients, all treated fairly equivalently. Some of the associations with prognosis (particularly that of anaemia) are proving easier to investigate in this larger series than in the first trial series alone. As far as prognostic correlates and their interpretation are concerned, though, the chief need now is not for still larger numbers, nor for other workers to check our findings on other, probably smaller series, but for new and preferably strikingly different therapeutic strategies to be proposed and tested, and for new factors to be measured in future patients. Since 1975, these have been the aims of the third MRC myelomatosis trial.

M. C. Pike is supported by Contract NO1-CP-53500 and Grant PO1 CA 17045-02 from the National Cancer Institute, and N. Mantel by PHS Grant CA 15686.

We are grateful to Gale Mead for typing the annual rewrites which this manuscript has suffered since 1971, and to the many colleagues who used and criticized previous versions.

The Statistics Department, University College, London, authorized our reproduction of Table III from Tracts for Computers, No. 24. R. Peto is supported by an Imperial Cancer Research Fund readership in cancer studies.

REFERENCES

ARMITAGE, P. & GEHAN, E. A. (1974) Statistical Methods for the Identification and Use of Prognostic Factors. Int. J. Cancer, 13, 16.

BRESLOW, N. E. (1975) Analysis of Survival Data under the Proportional Hazards Model. Int. statist. Rev., 43, 45.

COX, D. R. (1972) Regression Models and Life Tables (with discussion). J. R. statist. Soc., B, 34, 187.

GEHAN, E. A. (1965) A Generalized Wilcoxon Test for Comparing Arbitrarily Singly-censored Samples. Biometrika, 52, 203.

HEYN, R. M., JOO, P., KARON, M., NESBIT, M., SHORE, N., BRESLOW, N., WEINER, J., REED, A. & HAMMOND, D. (1975) BCG in the Treatment of Acute Lymphocytic Leukaemia. Blood, 46, 431.

KAPLAN, E. L. & MEIER, P. (1958) Nonparametric Estimation from Incomplete Observations. J. Am. statist. Ass., 53, 457.

MANTEL, N. (1966) Evaluation of Survival Data and Two New Rank Order Statistics Arising in its Consideration. Cancer Chemother. Rep., 50, 163.

MATHÉ, G., AMIEL, J., SCHWARZENBERG, L., SCHNEIDER, M., CATTAN, A., SCHLUMBERGER, J. R., HAYAT, M. & DE VASSAL, F. (1969) Active Immunotherapy for Acute Lymphoblastic Leukaemia. Lancet, i, 697.

MEDICAL RESEARCH COUNCIL (1968) Chronic Granulocytic Leukaemia: Comparison of Radiotherapy and Busulphan Therapy. Br. med. J., i, 201.

MEDICAL RESEARCH COUNCIL (1971a) Myelomatosis: Comparison of Melphalan and Cyclophosphamide Therapy. Br. med. J., i, 640.

MEDICAL RESEARCH COUNCIL (1971b) Treatment of Acute Lymphoblastic Leukaemia. Br. med. J., 4, 189.

MEDICAL RESEARCH COUNCIL (1973) Report on the First Myelomatosis Trial. Br. J. Haemat., 24, 123.

MEDICAL RESEARCH COUNCIL (1974) Treatment of Acute Myeloid Leukaemia with Daunorubicin, Cytosine Arabinoside, Mercaptopurine, L-asparaginase, Prednisone and Thioguanine: Results of Treatment with Five Multi-drug Schedules. Br. J. Haemat., 27, 373.

PETO, R. (1972) Rank Tests of Maximal Power against Lehmann-type Alternatives. Biometrika, 59, 472.

PETO, R. & PETO, J. (1972) Asymptotically Efficient Rank Invariant Test Procedures (with discussion). J. R. statist. Soc., A, 135, 185.

PETO, R. & PIKE, M. C. (1973) Conservatism of the Approximation Σ(O-E)²/E in the Logrank Test for Survival Data or Tumor Incidence Data. Biometrics, 29, 579.

PIKE, M. C. (1973) The Analysis of Clinical Trials in Leukaemia. In Recent Results in Cancer Research, 43, Ed. G. Mathé, P. Pouillart, and L. Schwarzenberg. New York: Springer-Verlag.

ZELEN, M. (1973) Keynote Address on Biostatistics and Data Retrieval. Cancer Chemother. Rep., 4, 31.

APPENDICES

APPENDIX 3.-Worked Example of a Clinical Trial Analysis (Hypothetical Data)

Suppose that for each patient, we know the date of randomization, the date of death if he died before the follow-up date of 31 May 1974, the randomized treatment (A or B), and the renal function (I = impaired, or N = normal) at the time of randomization. Some hypothetical data of this type are set out in Table VIII, along with the derivation from them of the trial time. We shall study the relevance of treatment to death from the disease or from causes that might well be correlated with the disease or its treatment, and so we shall not include the death due to a road accident in our analysis. If we wanted to study total mortality, of course, we could easily count this one death and modify the calculated extents of exposure to risk on Day 2240 after randomization accordingly.

Notes on the calculation of trial times in Table VIII

(a) If a patient dies after randomization, but before treatment could be started, that is irrelevant: include him as though he had been treated. (To avoid such occurrences, do not randomize until the last possible minute.)

(b) If a patient changes treatment for serious medical reasons, for social reasons, or even at a doctor's whim, that is also irrelevant: include him as though no deviation had occurred, and try to ensure that as few such changes as possible occur in the next clinical trial you design. The analysis will then try to answer: "Is it better to adopt a policy of Treatment A if possible, with deviations if necessary, or a policy of Treatment B if possible, with deviations if necessary, for patients who seem to have this disease?" This is a relevant question, sometimes even more relevant than "Is A or B better for this disease?"

(c) If, during the course of the treatment, it is realized that the original diagnosis was erroneous, it may be decided that such patients should stay in the analysis as if no mistake had occurred (see the discussion surrounding Fig. 2 on p. 602 in Part I of this report), even if treatment has to be completely altered to attack the real disease.

(d) If causes of death are available for all patients who die, it may be decided that deaths from causes which could not possibly be related to the disease or its treatment will be ignored, and that such patients will be analysed as though they had emigrated alive on the day of their death.

[TABLE VIII (hypothetical data for each patient and the derivation from them of the trial times): not legible in this copy.]

(e) Patients who emigrate or are lost to follow-up have trial times which run up to their date of disappearance.

(f) Ignore deaths after the chosen follow-up date, and be certain that no deaths before this date have escaped your notice.

(g) It is slightly preferable to calculate trial times accurately to the nearest day, especially if mortality shortly after entry is very rapid, although weeks or even months are often sufficiently accurate units of time if computing facilities are not available. (Leap years can be dealt with most conveniently, if using a computer or calculator to derive days from dates, by altering the month and year so that January and February become months 13 and 14 in the previous year. The number of leap-days to be counted for a date is then the number of times 4 can be divided into the corrected year, and allowance for the irregular lengths of months need make no special allowance for February.)
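The leap-year device in note (g) can be sketched as follows. This is an illustration in Python (our choice of language), with hypothetical names `day_number` and `trial_time`; it assumes dates between 1901 and 2099, where every fourth year is a leap year, and it prints the standard-library answer alongside for comparison.

```python
from datetime import date

# Days elapsed before the start of each month when the year is taken to run
# from March (month 3) to February (month 14), so that February comes last.
DAYS_BEFORE_MONTH = [0, 31, 61, 92, 122, 153, 184, 214, 245, 275, 306, 337]

def day_number(day, month, year):
    """Serial day count using the 'January and February become months 13
    and 14 of the previous year' device described in note (g)."""
    if month <= 2:
        month += 12
        year -= 1
    leap_days = year // 4  # times 4 divides into the corrected year
    return 365 * year + leap_days + DAYS_BEFORE_MONTH[month - 3] + day

def trial_time(start, end):
    """Days from start to end, each given as a (day, month, year) tuple."""
    return day_number(*end) - day_number(*start)

# Hypothetical patient randomized on 1 January 1970, followed to 31 May 1974
print(trial_time((1, 1, 1970), (31, 5, 1974)))
print((date(1974, 5, 31) - date(1970, 1, 1)).days)  # same interval via datetime
```

Because February is the last month of the corrected year, the leap-day never falls inside the table of month lengths, which is why no special allowance for February is needed.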

Now arrange the patients in order of increasing trial time. (If there is a tie, put patients who suffered a relevant event on that day just before patients who did not.) This is done in Table IX, where the life table and extents of exposure to risk of death are calculated. The life table calculated in Table IX appears as Fig. 5.

From Table IX, we obtain:

$O_A$ = observed no. of deaths in Group A = 6

$E_A$ = extent of exposure to risk of death in Group A = 8.34

$O_B$ = observed no. of deaths in Group B = 11

$E_B$ = extent of exposure to risk of death in Group B = 8.66

(At this point of the calculation, check for arithmetic accuracy by seeing that $O_A + O_B = E_A + E_B$.)

Calculate

$$X^2 = \frac{(O_A - E_A)^2}{E_A} + \frac{(O_B - E_B)^2}{E_B} = 1.29$$

Because we are comparing 2 groups with each other, it is appropriate to compare $X^2$ with the chi-square distribution with mean one. When we do so, we find that the probability of getting a value of chi-square with 1 degree of freedom as big as or bigger than 1.29 is quite substantial (about 1 in 4), which suggests that the apparent superiority of Treatment A could well have arisen by chance alone.
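For readers with a computer to hand, the arithmetic of this unstratified comparison can be sketched as below. Python and the function names `two_group_x2` and `chi2_1df_upper_tail` are our assumptions, not part of the report; the tail probability for 1 degree of freedom uses the standard identity P(chi-square >= x) = erfc(sqrt(x/2)) in place of the printed chi-square table.

```python
import math

def two_group_x2(o_a, e_a, o_b, e_b):
    """X^2 = (O_A - E_A)^2 / E_A + (O_B - E_B)^2 / E_B, after checking
    that the observed and expected totals agree."""
    assert abs((o_a + o_b) - (e_a + e_b)) < 0.01, "O and E totals should agree"
    return (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b

def chi2_1df_upper_tail(x2):
    """Probability that chi-square with 1 d.f. is as big as or bigger than x2."""
    return math.erfc(math.sqrt(x2 / 2.0))

# Values obtained from Table IX
x2 = two_group_x2(6, 8.34, 11, 8.66)
p = chi2_1df_upper_tail(x2)
print(round(x2, 2), round(p, 2))
```

The probability printed should be close to the "about 1 in 4" quoted in the text.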

However, before accepting this conclusion let us first study the relevance of renal function to prognosis. This requires an extra 4 columns of numbers to be added to the right of Table IX, and these are given as Table X.

From Table X:

$O_I$ = observed no. of deaths among those with renal impairment = 7

$E_I$ = extent of exposure to risk of death among those with renal impairment = 1.60

$O_N$ = observed no. of deaths among those with normal renal function = 10

$E_N$ = extent of exposure to risk of death among those with normal renal function = 15.40

and so

$$X^2 = \frac{(O_I - E_I)^2}{E_I} + \frac{(O_N - E_N)^2}{E_N} = 20.12.$$

Since the probability of getting a value of chi-square with 1 degree of freedom as big as or bigger than 20.12 is < 0.001, the tendency of those with renal impairment to die sooner cannot plausibly be merely due to chance. This indicates that we should examine the treatment differences separately among those with and without renal impairment, and this is done in Tables XI and XII.
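The extents of exposure taken from Table X follow the rule given in its column headings: at each death time with e deaths among r patients at risk, a subgroup with i of those patients accrues e.i/r. The sketch below (Python, our choice) recomputes the two sums. The r, i and n columns are as printed; the e column is partly illegible in the available copy, so the values used here are an assumption, namely those implied by the printed exposures.

```python
# Columns of Table X, one entry per distinct trial time
e = [2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # deaths (assumed, see above)
r = [25, 23, 22, 21, 20, 19] + list(range(17, 0, -1))  # total alive and at risk
i = [7, 6, 5, 4, 3, 2] + [0] * 17                      # impaired, at risk
n = [18] + [17] * 6 + list(range(16, 0, -1))           # normal, at risk

# Every row must satisfy r = i + n
assert all(rk == ik + nk for rk, ik, nk in zip(r, i, n))

e_impaired = sum(ek * ik / rk for ek, rk, ik in zip(e, r, i))
e_normal = sum(ek * nk / rk for ek, rk, nk in zip(e, r, n))

o_impaired, o_normal = 7, 10  # observed deaths in each subgroup
x2 = ((o_impaired - e_impaired) ** 2 / e_impaired
      + (o_normal - e_normal) ** 2 / e_normal)
print(round(e_impaired, 3), round(e_normal, 3), round(x2, 1))
```

Working from unrounded exposures gives an $X^2$ very slightly different from the 20.12 obtained with E values rounded to two decimal places; the conclusion is unaffected.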

From Table XI, among patients with impaired renal function, $O_A$ = 4, $E_A$ = 5.42, $O_B$ = 3, and $E_B$ = 1.58.

From Table XII, among patients with normal renal function, $O_A$ = 2, $E_A$ = 5.01, $O_B$ = 8, and $E_B$ = 4.99.

Combining the observed numbers and the extents of exposure to risk of death in both prognostic categories:

$O_A$ = 4 + 2 = 6

$E_A$ = 5.42 + 5.01 = 10.43

$O_B$ = 3 + 8 = 11

$E_B$ = 1.58 + 4.99 = 6.57
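The pooling step just shown, and the stratified comparison that this appendix goes on to complete, can be sketched as follows. This is an illustrative Python fragment (the language and the names `strata` and `pooled` are our assumptions); it simply sums O and E within treatment across the two renal strata and then forms the sum of (O - E)^2/E and the relative death rates.

```python
# (O, E) by treatment within each prognostic stratum, from Tables XI and XII
strata = {
    "impaired renal function": {"A": (4, 5.42), "B": (3, 1.58)},
    "normal renal function":   {"A": (2, 5.01), "B": (8, 4.99)},
}

def pooled(treatment):
    """Sum observed deaths and extents of exposure over the strata."""
    o = sum(s[treatment][0] for s in strata.values())
    e = sum(s[treatment][1] for s in strata.values())
    return o, e

o_a, e_a = pooled("A")
o_b, e_b = pooled("B")

x2 = (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
rate_a = o_a / e_a        # relative death rate on Treatment A
rate_b = o_b / e_b        # relative death rate on Treatment B
ratio = rate_a / rate_b   # death rate on A relative to that on B

print(round(e_a, 2), round(e_b, 2), round(x2, 2), round(ratio, 2))
```

Note that E is calculated separately inside each stratum before summing; recomputing E from the pooled patients would undo the stratification.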

Because we have now allowed for the overwhelmingly strong effect that renal condition has on prognosis, the effect of treatment on prognosis stands out more clearly. This is because, by retrospective stratification, our analysis now only compares like with like, instead of mixing all sorts of different patients blindly together as before. Using the above observed numbers and the extents of exposure calculated within prognostic strata, we now have:

$$X^2 = \frac{(6 - 10.43)^2}{10.43} + \frac{(11 - 6.57)^2}{6.57} = 4.87,$$

and referring to the chi-square distribution appropriate for a 2-group comparison (i.e. chi-square with 1 "degree of freedom"), we find that 0.025 < P < 0.05. The difference in prognosis between the 2 groups is therefore statistically significant ($X^2$ = 4.87, d.f. = 1, P < 0.05), although only marginally so.

After adjustment for the effects of renal condition on the extent of exposure to risk of death, the relative death rate in Group A is 0.58 (6/10.43), that in Group B is 1.67 (11/6.57), and so the ratio of the death rate on Treatment A to that on Treatment B is 0.34 (0.58/1.67). In other words, the death rate observed on A is about one-third of that on B, although the vast uncertainty attached to this as an estimate of medical fact is emphasized by the fact that this extreme ratio is only just significantly different from unity!

If you have a statistician analysing your data for you, he may use your O's and E's to calculate something slightly different from your $X^2$ to compare with the chi-square distribution (Peto and Pike, 1973; Breslow, 1975; see statistical notes 7 and 8 on p. 38). However, the answers you will obtain by the simple analogy between $X^2$ and the chi-square distribution will give an adequate medical understanding of the data. As was noted on p. 20, a convenient computer program is available on request which will perform all the analyses described in Appendix 3.

[TABLE IX (patients in order of increasing trial time, with the life table and the extents of exposure to risk of death in Groups A and B): not legible in this copy.]

TABLE X.-Extension of Table IX to Study the Relevance of Impaired Renal Function to Prognosis

                   Numbers alive and at risk                  Extent of exposure to risk of death
e, from      r, from      i = no.      n = no.          in impaired      in normal
Table IX     Table IX     impaired     normal           = e.i/r          = e.n/r
2            25           7            18               0.560            1.440
1            23           6            17               0.261            0.739
1            22           5            17               0.227            0.773
1            21           4            17               0.190            0.810
1            20           3            17               0.150            0.850
2            19           2            17               0.211            1.789
1            17           0            17               0                1
1            16           0            16               0                1
1            15           0            15               0                1
1            14           0            14               0                1
1            13           0            13               0                1
1            12           0            12               0                1
0            11           0            11               0                0
1            10           0            10               0                1
1            9            0            9                0                1
0            8            0            8                0                0
1            7            0            7                0                1
0            6            0            6                0                0
0            5            0            5                0                0
0            4            0            4                0                0
0            3            0            3                0                0
0            2            0            2                0                0
0            1            0            1                0                0
                                                        Sum = 1.599      Sum = 15.401

(The e column is partly illegible in this copy; the values shown are those implied by the printed exposures e.i/r and e.n/r.)

[TABLES XI and XII (extents of exposure to risk of death by treatment, among patients with impaired and with normal renal function respectively): not legible in this copy.]

APPENDIX 4.-How to Record Data in Such a Way that it is Easy to Analyse by Computer

(a) Computers like to read data one line at a time, each line containing up to, but not more than, 80 characters (letters, digits, blanks, dots, etc.). Computers like to tell what a number denotes, simply by which position it occupies in the line (e.g. the 43rd character), so omission of a blank can shift all subsequent numbers one space leftwards and cause chaos. It is not humanly possible to feed in data as carefully as this, unless you first write out the data on a coding form. Fig. 7 illustrates a coding form used in a recent MRC trial. Although the coding form illustrated was designed to be completed by the physician referring the patient, it is usually preferable to ask the physician to complete a form designed with his convenience and understanding chiefly in mind, and to copy this on to a computer coding form at the trial centre. This keeps him as co-operative as possible, and forces you to check that all requisite information has been supplied. (Immediate queries of missing or doubtful information are better than queries during the statistical analysis a year or two later.)

(b) On the coding form, there are a given number of boxes for each item of information, and the information is written in, one character per box. Although the boxes are all on different lines on the coding form, before they are fed into the computer, they will all be concatenated into a single line, giving all the data for that one patient (unless that makes more than 80 boxes, in which case the line will be cut up at certain points into 2 or more lines per patient, each 80 or fewer characters in length).

(c) Apart from the boxes reserved for the patient's name, in which you will write a mixture of letters and blanks, try to use nothing but digits and blanks: no letters, and no dots. This will make the programmer's task a little easier. For example, if you want to record a patient's sex, have a box for doing so, but adopt the coding convention 1 for male and 2 for female instead of entering M or F. Also, if you want to record a number with a decimal part, leave out the dot and just give the figures (115 for 11.5 and 110 for 11.0, for example). Everything that has a decimal part must be given to the same number of decimal places, even if the decimal part is zero.

(d) Computers do not like to distinguish between blank and zero. This has two consequences. Firstly, if nothing is written for a piece of information, the computer will read it as though zero were written there. (If, therefore, zero is a possible value, it is usually better to have an impossible number that can be entered to denote "no data", and when defining how replies about something are to be recorded as a number, it is better not to have zero as one of the codes.) Secondly, if someone has a pair of boxes in which to enter the number 9, he must write it in the right-hand box: if he wrote it in the left-hand box of the pair, the computer would take the number written in that pair of boxes as being 90!

(e) Never underestimate the number of characters required for recording something. For example, if a number might possibly, just for one patient in the whole trial, run into 3 digits (e.g. age), then allow 3 boxes for it. It does not matter if, in the event, the left-hand box is never used.

FIG. 7. Part of a coding form. [Items shown on the form: Surname (first 10 letters only, if long); Sex (1 M, 2 F); Date randomized (d, m, y); Hb at presentation (g/100 to one d.p., omitting the point); Leucocytes (000/µl: enter 999 if not known); etc. So far, 23 characters have been used; this form actually ran to 68 characters in all.]
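The coding conventions in (b) to (e) can be illustrated with a short sketch in Python, used here as a modern stand-in for the programs the text has in mind. The field layout is hypothetical: it assumes widths of 10, 1, 6, 3 and 3 characters for the five items of Fig. 7, which is consistent with the 23 characters mentioned in the caption but is not stated in the source. The function names and the patient record are invented for the example.

```python
from typing import Optional

# Hypothetical layout: (name, start column, width). The widths are assumptions.
LAYOUT = [
    ("surname", 0, 10),
    ("sex", 10, 1),
    ("date_randomized", 11, 6),
    ("hb", 17, 3),
    ("leucocytes", 20, 3),
]

def read_int(field: str, no_data: Optional[int] = None) -> Optional[int]:
    """Read a digit field the way point (d) describes: blanks count as zeros."""
    value = int(field.replace(" ", "0"))
    return None if value == no_data else value

def parse_record(line: str) -> dict:
    """Cut one concatenated line into its fields and decode each of them."""
    line = line.ljust(23)
    raw = {name: line[start:start + width] for name, start, width in LAYOUT}
    date = raw["date_randomized"]
    return {
        "surname": raw["surname"].rstrip(),
        "sex": {1: "M", 2: "F"}.get(read_int(raw["sex"])),
        "date_randomized": tuple(read_int(date[i:i + 2]) for i in (0, 2, 4)),
        # Implied decimal point: 115 on the form means 11.5 (point c).
        "hb": read_int(raw["hb"]) / 10,
        # 999 is the "impossible" code for "not known" (point d).
        "leucocytes": read_int(raw["leucocytes"], no_data=999),
    }

# Invented patient: randomized 3 June 1974, Hb 11.5, leucocytes not known.
line = "SMITH".ljust(10) + "1" + " 30674" + "115" + "999"
record = parse_record(line)
```

Because blanks are read as zeros, `read_int(" 9")` gives 9 but `read_int("9 ")` gives 90, which is the hazard described in (d).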

APPENDIX 5.-Testing for a Trend in Prognosis with Respect to an Explanatory Variable

Frequently, if a group of patients is divided into more than 2 subgroups with respect to an explanatory variable, these subgroups have a natural order (e.g. low, medium, high) and can be numbered 1, 2, 3 . . . in a non-arbitrary way. It is then usually more sensitive, if we need to know whether there is any relationship of that explanatory variable to prognosis, to ask whether there is a statistically significant tendency for a trend in prognosis to exist as we go from Group 1 to Group 2 to Group 3 (and on, if there are more than 3 groups), instead of asking if there is statistically significant heterogeneity (by calculation of $X^2$, as on page 31). There are a few situations, however, where there could be real heterogeneity but no real trend; for example, average patients might fare better than patients who are extreme in either direction, up or down. The best general policy, when examining the relevance to prognosis of an explanatory variable which is split into 3 or more naturally ordered subgroups, should therefore be to calculate both the test for trend (described below) and the $X^2$ test for heterogeneity. However, the P-value ultimately used to help infer whether this explanatory variable is at all related to prognosis should nearly always be based on the test for trend, rather than on that for heterogeneity, even if the statistical significance of the heterogeneity test is slightly more extreme than that of the test for trend.

Computational details. The test for trend involves these steps:

(a) Divide the patients into subgroups with respect to the explanatory variable. The choice of the number of subgroups is not usually critical, except that if it is small it is slightly preferable for it to be odd (e.g. 3 or 5) rather than even. It is usually best to aim to have roughly similar numbers (to within a doubling) in each subgroup, although this is not essential, and if medical considerations suggest particular natural groupings, especially of a non-continuous explanatory variable such as disease stage, these should be adopted.

(b) Give each subgroup a number, starting at 1 and working upwards in natural order (e.g. low urea = 1, medium urea = 2, high urea = 3).

(c) For each subgroup count O, the observed number of events, and calculate E, the extent of exposure to risk of such events, according to the methods described in Section 19. Add up the O's to obtain Osum and the E's to obtain Esum, and check that, apart from rounding errors, Osum = Esum. If not, there is an arithmetical error in the derivation of the O's or the E's. (Note that Esum is simply the total number of patients in the whole trial who have suffered an event.)

(d) Calculate $X^2$, the test statistic for heterogeneity, as the sum of $(O - E)^2/E$, as in Section 20.

(e) In each subgroup we now have n, the subgroup number, and the logrank O and E for that subgroup. Calculate, within each subgroup:

$$A = n(O - E), \qquad B = nE, \qquad C = n^2 E.$$

(f) Add up all the A's in the different subgroups to obtain "Asum". Analogously obtain "Bsum" and "Csum".

(g) Calculate V, where

$$V = \mathrm{Csum} - \frac{\mathrm{Bsum} \times \mathrm{Bsum}}{\mathrm{Esum}}$$

(see statistical note 9 on p. 38). Finally, calculate T, the test statistic for trend, where

$$T = \frac{\mathrm{Asum} \times \mathrm{Asum}}{V}.$$

(If T is negative or exceeds $X^2$ you have made an arithmetical error somewhere: if Osum and Esum are equal after step c, there must be an error in step d, e, f or g.)

(h) Obtain a P-value by using an analogy which exists between the behaviour of T if the explanatory variable is in fact irrelevant and the behaviour of one of the standard distributions in statistics, called "chi-square on 1 degree of freedom" (see page 10). This implies that if the explanatory variable is irrelevant (so there is no real trend), then

T is zero or positive, and would be expected to be around unity

T has an approximately 1/3 chance of exceeding unity

T has an approximately 10% chance of exceeding 2.71

T has an approximately 5% chance of exceeding 3.84

T has an approximately 2.5% chance of exceeding 5.02

T has an approximately 1% chance of exceeding 6.63

T has an approximately 0.5% chance of exceeding 7.88

T has an approximately 0.001 chance of exceeding 10.83.

For example, if we computed T = 4.32 in a particular case, we might insert, in our published account of the data, "Chi-square test for trend yielded 4.32; d.f. = 1; P < 0.05."

Example: data from Table V are displayed in Table XIII, with derivation from them of the requisite quantities.

Now $V = 567.52 - 320.28 \times 320.28/213.00 = 85.93$

and $T = 79.72 \times 79.72/85.93 = 73.96$.

(Check: Osum = Esum and zero < T < $X^2$.) If initial urea were irrelevant to prognosis, the probability that, by chance alone, T would exceed 10.83 is approximately 0.001, so the probability in this case that T should equal or exceed 73.96 is much less than 0.001. We would therefore write "Chi-square test for trend yielded 73.96; d.f. = 1; P < 0.001."

Incidentally, in a test for trend between just 2 groups of patients (so that the subgroup number n is only 1 or 2), T will necessarily just equal $X^2$.
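Steps (c) to (h) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' program: the function name and the rounding tolerance are invented, and the P-value uses the identity that the upper tail of chi-square on 1 degree of freedom is erfc(sqrt(T/2)). The input figures are the O and E columns of Table XIII.

```python
import math

def trend_test(observed, expected):
    """Return the heterogeneity X2, the variance V, the trend T and its P-value."""
    o_sum, e_sum = sum(observed), sum(expected)
    # Step (c): Osum and Esum must agree apart from rounding errors.
    if abs(o_sum - e_sum) > 0.05:
        raise ValueError("Osum and Esum differ: check the O's and E's")
    # Step (d): heterogeneity statistic.
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Steps (e) and (f): subgroups are numbered 1, 2, 3, ... in natural order.
    numbers = range(1, len(observed) + 1)
    a_sum = sum(n * (o - e) for n, o, e in zip(numbers, observed, expected))
    b_sum = sum(n * e for n, e in zip(numbers, expected))
    c_sum = sum(n * n * e for n, e in zip(numbers, expected))
    # Step (g): approximate variance of Asum, then the trend statistic.
    v = c_sum - b_sum * b_sum / e_sum
    t = a_sum * a_sum / v
    # Step (h): P(chi-square on 1 d.f. > t) = erfc(sqrt(t / 2)).
    p = math.erfc(math.sqrt(t / 2))
    return {"X2": x2, "V": v, "T": t, "P": p}

# O and E for initial urea 0-39, 40-79 and 80- (Table XIII).
result = trend_test([79, 81, 53], [122.06, 74.60, 16.34])
print({name: round(value, 2) for name, value in result.items() if name != "P"})
```

With the Table XIII figures this reproduces $X^2$ = 97.99, V = 85.93 and T = 73.96, and T lies between zero and $X^2$ as the check in step (g) requires.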

STATISTICAL NOTES

As in Part I, these are collected together so that they can be completely ignored by the non-statistical reader. The statistical methods recommended in this paper are developed in Kaplan and Meier (1958), Mantel (1966), Peto (1972), Cox (1972), and Peto and Pike (1973), and are reviewed by Breslow (1975).

STATISTICAL NOTE 6.-(From p. 6). If the life table at a particular time after randomization equals L, and there are then N patients still at risk, the standard error of L at that time is approximately $L\sqrt{(1 - L)/N}$. (Example: suppose that at one year after randomization there are 20 survivors still being observed, and the life table estimate of the chance of surviving one full year from randomization is 0.2. The estimated standard error of the life table at one year would then be $0.2 \times \sqrt{0.8/20}$, which is 0.04.)

The justification for this formula is that there must initially have been at least N/L patients for N to remain when the life table equals L. If originally there were exactly N/L, binomial theory yields the standard error estimate $L\sqrt{(1 - L)/N}$. If there were more than N/L originally, the life table will be somewhat more accurate than this, depending on the trial times of the surviving patients. This estimate is therefore usually conservative, although the actual degree of conservatism is often surprisingly slight, and is counterbalanced by the fact that the formula deals appropriately with the increasing uncertainty that should properly be expected as one goes along the long flat region with which many life tables finish. The Greenwood standard error estimate (Kaplan and Meier, 1958) is less immediately computed, and has the disadvantage that in such life table tails, which are often the regions of greatest medical interest (and the source of most mistakes), the standard error may be grossly underestimated. Whichever estimate is preferred, the trial times of the surviving patients should somehow be indicated with the plotted life table as in Figs. 3, 4 and 5.
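As a small check on the arithmetic, the formula of statistical note 6 and its binomial justification can be written out in Python. The function names are invented for this sketch; the second function simply evaluates the ordinary binomial standard error for N/L initial patients, so that the two can be compared.

```python
import math

def life_table_standard_error(L: float, N: int) -> float:
    """Approximate standard error L * sqrt((1 - L) / N) of the life table."""
    if not 0.0 <= L <= 1.0 or N <= 0:
        raise ValueError("need 0 <= L <= 1 and N > 0")
    return L * math.sqrt((1.0 - L) / N)

def binomial_standard_error(L: float, N: int) -> float:
    """Binomial standard error if exactly N / L patients were entered (L > 0)."""
    initial_patients = N / L
    return math.sqrt(L * (1.0 - L) / initial_patients)

# Worked example from the note: L = 0.2 with 20 patients still at risk.
se = life_table_standard_error(0.2, 20)
se_binomial = binomial_standard_error(0.2, 20)
print(round(se, 4), round(se_binomial, 4))
```

Both functions give 0.04 for the example in the note, since 20 survivors at L = 0.2 correspond to at least 100 patients initially.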

TABLE XIII.-Test for Trend in Prognosis with Respect to Initial Blood Urea Among Patients Entered into the First MRC Myelomatosis Trial (From Table V)

| Initial urea (mg/100 ml blood) | Group number, n | O, observed no. of deaths | E, extent of exposure to risk of death | $(O - E)^2/E$ | $A = n(O - E)$ | $B = nE$ | $C = n^2 E$ |
|---|---|---|---|---|---|---|---|
| 0-39 | 1 | 79 | 122.06 | 15.19 | -43.06 | 122.06 | 122.06 |
| 40-79 | 2 | 81 | 74.60 | 0.55 | 12.80 | 149.20 | 298.40 |
| 80- | 3 | 53 | 16.34 | 82.25 | 109.98 | 49.02 | 147.06 |
| Totals | | 213 | 213.00 | 97.99 | 79.72 | 320.28 | 567.52 |
| Names of totals | | Osum | Esum | $X^2$ | Asum | Bsum | Csum |


STATISTICAL NOTE 7.-(From p. 11). For a 2-group comparison, the variance, V, of $(O_A - E_A)$ can be computed. It is the same as the variance of $(O_B - E_B)$, of course, and is obtained by arguing that, for those alive and observed in the trial on each particular day, a 2 x 2 table giving group membership (A or B) and fate that day (died or not) can be constructed. The variance of the difference between observed and expected for the number of Group A deaths in each such table can be calculated, and since $O_A - E_A$ is the sum of all such differences V is the sum of all such variances. On a day when there are a A-patients and b B-patients at risk and the observed survival rate among both groups of patients combined is p, the contribution to the overall variance V will be $p(1 - p)ab/(a + b - 1)$.

Finally, $(O_A - E_A)^2/V$ (or, if a continuity correction is preferred,* $(|O_A - E_A| - \tfrac{1}{2})^2/V$) is referred to tables of chi-square with one degree of freedom. If the continuity correction of 1/2 is not used, this is necessarily greater than or equal to $X^2$. If more than 2 groups are to be compared a variance/covariance matrix from each day must be accumulated to give the overall variance/covariance matrix for the vector of the (O - E)'s in each group, but the principle is similar. Details are given by Peto and Pike (1973) and an example is given in Appendix 3.
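The day-by-day accumulation described in statistical note 7 can be sketched as follows. This is a minimal illustration, not the authors' program. The function name, the convention that a patient whose observation ends on a given day is still counted as at risk on that day, and the four-patient data set are all assumptions made for the example. E is accumulated as the usual logrank expected number (deaths that day multiplied by the fraction of those at risk who belong to Group A).

```python
def logrank_two_groups(times, died, in_group_a):
    """Return (O_A, E_A, V), summing one 2 x 2 table per day with a death."""
    o_a = e_a = v = 0.0
    death_days = sorted({t for t, d in zip(times, died) if d})
    for day in death_days:
        # (dies today?, in Group A?) for everyone alive and observed that day.
        at_risk = [(d and t == day, g)
                   for t, d, g in zip(times, died, in_group_a) if t >= day]
        a = sum(1 for _, g in at_risk if g)
        b = len(at_risk) - a
        deaths = sum(1 for dies, _ in at_risk if dies)
        o_a += sum(1 for dies, g in at_risk if dies and g)
        e_a += deaths * a / (a + b)
        if a + b > 1:
            p = 1 - deaths / (a + b)  # combined survival rate on this day
            v += p * (1 - p) * a * b / (a + b - 1)
    return o_a, e_a, v

# Invented data: Group A dies on days 1 and 4; Group B dies on day 2 and
# has one patient last observed alive on day 3.
times = [1, 4, 2, 3]
died = [True, True, True, False]
in_group_a = [True, True, False, False]
o_a, e_a, v = logrank_two_groups(times, died, in_group_a)
chi_square = (o_a - e_a) ** 2 / v
```

A day on which only one patient remains at risk contributes nothing to V, which is why the sketch skips the variance term when a + b = 1.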

There is a strong connection between these methods and those of Cox (1972) for comparing 2 groups. Cox uses $\beta$ to denote the log of the ratio of the hazard functions in the 2 treatment groups. He then derives a log-likelihood $L(\beta)$ and uses the statistic $L'(0)$ to test $\beta = 0$, noting from likelihood theory that its variance must be approximately $-L''(0)$. In the absence of tied ranks $L'(0)$ and $-L''(0)$ are the logrank (O - E) and its variance V, giving a deeper justification to logrank methods.

STATISTICAL NOTE 8.-(From p. 33). In this particular example, where $O_A = 6$ and $E_A = 10.43$, one might use the methods described in statistical note 7 to compute V = variance of $(O_A - E_A)$ = 3.39, leading to a chi-square of $4.43^2/3.39 = 5.79$ without a continuity correction, or, if a continuity correction is preferred, $(4.43 - \tfrac{1}{2})^2/3.39 = 4.56$.
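The arithmetic of statistical note 8 can be reproduced with a short Python sketch. The function names are invented, and the guard that stops the corrected deviation from going below zero is an assumption added for safety rather than something stated in the note. The P-value again relies on the chi-square (1 d.f.) upper tail being erfc(sqrt(x/2)).

```python
import math

def logrank_chi_square(o_a, e_a, v, continuity_correction=False):
    """Chi-square on 1 d.f. from O_A, E_A and V, optionally corrected by 1/2."""
    deviation = abs(o_a - e_a)
    if continuity_correction:
        # Assumed guard: never let the corrected deviation become negative.
        deviation = max(deviation - 0.5, 0.0)
    return deviation * deviation / v

def chi_square_1df_p_value(x):
    """Upper tail probability of chi-square on 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Figures from statistical note 8: O_A = 6, E_A = 10.43, V = 3.39.
uncorrected = logrank_chi_square(6, 10.43, 3.39)
corrected = logrank_chi_square(6, 10.43, 3.39, continuity_correction=True)
print(round(uncorrected, 2), round(corrected, 2))
```

Both values exceed 3.84, so either choice gives P < 0.05 in this example, although the corrected statistic is noticeably smaller.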

* Continuity correction. With more than 2 groups, a continuity correction is definitely unwanted, but there is no uniform practice of using or not using a continuity correction when comparing 2 groups by the logrank test, and the reasons why are quite interesting. Two fundamentally different methods may be available for deriving the P-value from the calculated values of O and E in each group, the permutational method (Peto, 1972) or the conditional method (Mantel, 1966). If there are initially a Group A subjects and b Group B subjects, then both methods calculate a P-value given the times from randomization at which deaths occur and given the durations of follow-up of those who do not die. The permutational P-value, $P_{\mathrm{perm}}$, is the probability that if a of the a + b patients were selected at random and taken as Group A, the remainder being taken as Group B, a value of (O - E) as extreme as or more extreme than that actually observed would then be generated. The conditional argument considers the information in the data to be equivalent to that in a hypothetical set of independent 2 x 2 tables, one for each post-randomization day with margins equal to the margins of the 2 x 2 tables actually observed relating death on that day to group membership among those still at risk on that day. The conditional probability, $P_{\mathrm{cond}}$, is then the probability that the value of (O - E) obtained by combining a set of such 2 x 2 tables in the usual way would be as extreme as or more extreme than the observed (O - E).

The permutational argument is usually applicable when two groups being compared have been separated by randomization, and is then preferable because $P_{\mathrm{perm}}$ will usually, although not always, be smaller than $P_{\mathrm{cond}}$. The variance of (O - E) under the permutational argument is approximately V, and since there are usually many permutations that would lead to values of (O - E) very close to, but not equal to, the value of (O - E) actually obtained, $P_{\mathrm{perm}}$ is usually better estimated if a continuity correction is not used.

The conditional argument lacks a little of the efficiency of the permutational argument, but leads more naturally into Cox's (1972) methods, and into the study of treatment effects in particular time-periods. However, since it leads us, albeit slightly artificially, to regard E as fixed and O as an integer-valued random variable with variance exactly V, $P_{\mathrm{cond}}$ is usually better estimated if a continuity correction is used.

The conditional argument is always valid, but the permutational argument is sometimes not (e.g. if we compare survival among 2 groups of patients which have on average been followed up for different lengths of time). Nevertheless, when comparing the efficiency of the logrank test with that of alternative statistical tests for detecting differences between randomized groups, the efficiency of the permutational argument should be used.

STATISTICAL NOTE 9.-(From p. 36). The approximation is being made that under the null hypothesis Asum has mean zero (which is exactly true) and variance V (which is not). The exact variance, $V_{\mathrm{exact}}$, of Asum may be obtained by noting that, if $x_i$ denotes the value of (O - E) in the ith subgroup, $\mathrm{Asum} = \sum_i i\,x_i$ and $V_{\mathrm{exact}} = \sum_i \sum_j i\,j\,c_{ij}$, where summation is over all groups from first to last inclusive and $c_{ij}$, the covariance of $x_i$ with $x_j$, is given by Peto and Pike (1973). A preferable test for trend, albeit one which is not accessible without statistical expertise, is then to take $\mathrm{Asum}^2/V_{\mathrm{exact}}$ (or, if a continuity correction is preferred, $(|\mathrm{Asum}| - 0.5)^2/V_{\mathrm{exact}}$) as chi-square with 1 d.f. There is a strong connection between this recommended test for trend and the methods developed by Cox (1972). These methods test for trend by attempting to relate the log hazard function linearly to the subgroup number by a parameter $\beta$ and then testing $\beta = 0$ by examining the log-likelihood function, $L(\beta)$. Since the quantity Asum equals Cox's $L'(0)$, and in the absence of tied ranks the quantity $V_{\mathrm{exact}}$ equals Cox's $-L''(0)$, it is therefore of full asymptotic efficiency in a reasonable class of models to base a test for trend on Asum.
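The exact-variance form of the trend test in statistical note 9 can be sketched in Python once a covariance matrix for the (O - E)'s is available. The matrix itself must come from the method of Peto and Pike (1973), which is not reproduced here, so the function below simply takes it as an input; the function name and the guard on the continuity correction are assumptions. The illustration uses the 2-group figures of statistical note 8, for which the two (O - E)'s are equal and opposite, so both variances are V and the covariance is -V.

```python
def exact_trend_chi_square(o_minus_e, covariance, continuity_correction=False):
    """Return (Asum, Vexact, chi-square) for subgroups numbered 1, 2, 3, ..."""
    k = len(o_minus_e)
    # Asum is the sum over subgroups of (subgroup number) * (O - E).
    a_sum = sum((i + 1) * x for i, x in enumerate(o_minus_e))
    # Vexact is the double sum of i * j * c_ij over all pairs of subgroups.
    v_exact = sum((i + 1) * (j + 1) * covariance[i][j]
                  for i in range(k) for j in range(k))
    deviation = abs(a_sum)
    if continuity_correction:
        # Assumed guard: never let the corrected deviation become negative.
        deviation = max(deviation - 0.5, 0.0)
    return a_sum, v_exact, deviation ** 2 / v_exact

# Two groups with O_A - E_A = -4.43 and V = 3.39 (statistical note 8).
v = 3.39
a_sum, v_exact, chi_square = exact_trend_chi_square(
    [-4.43, 4.43], [[v, -v], [-v, v]])
print(round(a_sum, 2), round(v_exact, 2), round(chi_square, 2))
```

For 2 groups the double sum collapses to V, so the exact trend statistic coincides with the 2-group logrank chi-square of 5.79 computed without a continuity correction.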
