A Method for Comparing the Areas under ROC curves Derived from the Same Cases


description

Statistical comparison of two ROC curves obtained from the same subjects


James A. Hanley, Ph.D.
Barbara J. McNeil, M.D., Ph.D.

Receiver operating characteristic (ROC) curves are used to describe and compare the performance of diagnostic technology and diagnostic algorithms. This paper refines the statistical comparison of the areas under two ROC curves derived from the same set of patients by taking into account the correlation between the areas that is induced by the paired nature of the data. The correspondence between the area under an ROC curve and the Wilcoxon statistic is used and underlying Gaussian distributions (binormal) are assumed to provide a table that converts the observed correlations in paired ratings of images into a correlation between the two ROC areas. This between-area correlation can be used to reduce the standard error (uncertainty) about the observed difference in areas. This correction for pairing, analogous to that used in the paired t-test, can produce a considerable increase in the statistical sensitivity (power) of the comparison. For studies involving multiple readers, this method provides a measure of a component of the sampling variation that is otherwise difficult to obtain.

Index term: Receiver operating characteristic curve (ROC)

Radiology 148: 839-843, September 1983

1 From the Department of Epidemiology and Health, McGill University, Montreal, Canada (J.A.H.) and the Department of Radiology, Harvard Medical School and Brigham and Women's Hospital, Boston, MA, USA (B.J.M.). Received June 3, 1981; revision requested July 21, 1981; final revision received and accepted Feb. 15, 1983.

Supported in part by the Hartford Foundation and the National Center for Health Care Technology.

[Reprinted from RADIOLOGY, Vol. 148, No. 3, Pages 839-843, September, 1983.] Copyright 1983 by the Radiological Society of North America, Incorporated

A Method of Comparing the Areas under Receiver Operating Characteristic Curves Derived from the Same Cases¹

Different questions dealing with comparative benefits for alternative diagnostic algorithms, diagnostic tests, or therapeutic regimens have recently emerged in medicine. For example, how do we know whether one diagnostic algorithm is better than another in sorting patients into diseased and nondiseased groups? Whether the addition of a new test or procedure to an established algorithm improves its performance? Whether it matters who of several available readers interprets a mammogram? Whether one type of hard-copy unit in radiology is better than another? Whether reading a CT scan in conjunction with the patient's history allows a more accurate diagnosis than reading it without the history? The analyses of such problems have started with construction of receiver operating characteristic (ROC) curves (1-3). Generally these analyses have used as cutoff points either different posterior probabilities on a continuous scale or different thresholds on a discrete rating scale. The latter approach has been particularly popular in radiology.

Major gaps in the understanding of statistical properties of ROC curves have limited their usefulness, especially for questions involving comparisons of curves based on the same sample of subjects or objects. These comparative situations contrast with those involving a single data set and a single ROC curve. In such cases, the investigator generally only needs to know that a single modality or diagnostic approach has "poor", "moderate", or "good" accuracy, and the location of the ROC curve gives a rough assessment. However, when a comparison of two algorithms or modalities is relevant, more formal statistical criteria are needed in order to judge whether observed differences in accuracy are more likely to be random than real. Thus far these criteria have not been fully developed for ROC curves.

In a recent paper (4) we dealt with one popular accuracy index that can be derived from and used as a summary of the ROC curve. We showed that the relationship of the area under the ROC curve to the Wilcoxon statistic could be used to derive its statistical properties, such as its standard error (SE) and the sample sizes required to measure the area with a prespecified degree of precision (reliability) and to provide a desired level of statistical power (low type II error) in comparative experiments. This paper extends our statistical analysis to another large class of situations, where the two or more ROC curves are generated using the same set of patients. In these situations, it is inappropriate to calculate the standard error of the difference between two areas (Area1 and Area2) as

$$ \mathrm{SE}(\mathrm{Area}_1 - \mathrm{Area}_2) = \sqrt{\mathrm{SE}^2(\mathrm{Area}_1) + \mathrm{SE}^2(\mathrm{Area}_2)} \tag{1} $$

since Area1 and Area2 are likely to be correlated. This correlation is likely to be positive; if the vagaries of random sampling of cases produce a higher/lower than expected accuracy index for one modality (e.g., if the sample consisted of a larger than usual number of easy/difficult cases), then the accuracy of the second modality will probably also be correspondingly higher/lower than one would expect. In other words, while the two indices may fluctuate independently by amounts SE1 and SE2 in separate samples, they will tend to fluctuate in tandem when derived from a single sample.

In this paper we have developed an approach to take account of this correlation. In brief, we indicate that the relevant standard error for such comparisons is not that shown in Equation 1 but rather

$$ \mathrm{SE}(\mathrm{Area}_1 - \mathrm{Area}_2) = \sqrt{\mathrm{SE}^2(\mathrm{Area}_1) + \mathrm{SE}^2(\mathrm{Area}_2) - 2\,r\,\mathrm{SE}(\mathrm{Area}_1)\,\mathrm{SE}(\mathrm{Area}_2)} \tag{2} $$

where r is a quantity representing the correlation introduced between the two areas by studying the same sample of patients. This paper reviews the calculations for comparing the ROC curves of two modalities and illustrates this new approach using data from a series of experiments involving phantoms.
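The extra term in Equation 2 is the usual covariance correction for a difference of two correlated estimates. A brief sketch of that step follows; Cov denotes the covariance between the two estimated areas, a symbol that is an assumption of this sketch and is not used in the paper itself.

```latex
\begin{aligned}
\mathrm{Var}(\mathrm{Area}_1 - \mathrm{Area}_2)
  &= \mathrm{Var}(\mathrm{Area}_1) + \mathrm{Var}(\mathrm{Area}_2)
     - 2\,\mathrm{Cov}(\mathrm{Area}_1, \mathrm{Area}_2) \\
  &= \mathrm{SE}^2(\mathrm{Area}_1) + \mathrm{SE}^2(\mathrm{Area}_2)
     - 2\,r\,\mathrm{SE}(\mathrm{Area}_1)\,\mathrm{SE}(\mathrm{Area}_2)
\end{aligned}
```

Taking the square root gives Equation 2, and setting r = 0 recovers Equation 1.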

METHODS

The general approach to assessing whether the difference in the areas under two ROC curves derived from the same set of patients is random or real is to calculate a critical ratio z, defined as

$$ z = \frac{A_1 - A_2}{\sqrt{SE_1^2 + SE_2^2 - 2\,r\,SE_1 SE_2}} \tag{3} $$

where A1 and SE1 refer to the observed area and estimated standard error of the ROC area associated with modality 1; where A2 and SE2 refer to corresponding quantities for modality 2; and where r represents the estimated correlation between A1 and A2.² This quantity z is then referred to tables of the normal distribution and values of z above some cutoff, e.g., z > 1.96, are taken as evidence that the "true" ROC areas are different. The importance of introducing the 2rSE1SE2 term in the above equation is obvious: failure to subtract out from the sampling variability those fluctuations that the paired design has already eliminated will leave the denominator of Equation 3 too large and z too small, thereby reducing the chance of detecting a difference between two modalities.
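The critical ratio is simple enough to compute directly. The following is a minimal sketch, not the authors' program: the function names are hypothetical, and the inputs in the usage lines are the CT phantom values reported under RESULTS (areas 0.8945 and 0.9382, standard errors 0.030 and 0.026, and r = 0.40 read from TABLE I), with the difference taken as modality 2 minus modality 1 as in that worked example.

```python
import math


def se_of_difference(se1, se2, r=0.0):
    """SE of the difference of two areas (Equation 2); r=0 gives Equation 1."""
    return math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2)


def critical_ratio(area_a, se_a, area_b, se_b, r=0.0):
    """Critical ratio z of Equation 3 for the difference area_a - area_b."""
    return (area_a - area_b) / se_of_difference(se_a, se_b, r)


def upper_tail_p(z):
    """One-tailed Gaussian probability P(Z >= z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# CT phantom values (assumed inputs, taken from the RESULTS section).
z_paired = critical_ratio(0.9382, 0.026, 0.8945, 0.030, r=0.40)
z_unpaired = critical_ratio(0.9382, 0.026, 0.8945, 0.030, r=0.0)
print(round(z_paired, 2), round(upper_tail_p(z_paired), 3))
print(round(z_unpaired, 2), round(upper_tail_p(z_unpaired), 3))
```

Because the published standard errors are rounded to three decimals, the paired value computed this way can differ from the printed 1.41 in the second decimal place.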

Calculating Areas

Areas under ROC curves can be obtained in three ways: (i) by the trapezoidal rule; (ii) as output from the Dorfman and Alf maximum likelihood estimation program (5); or (iii) from the slope and intercept of the original data when plotted on binormal graph paper (3). As indicated in our companion paper (4) the trapezoidal approach systematically underestimates areas. Because the Dorfman and Alf approach is becoming readily accessible to those interested in this area, we will calculate areas using this approach. (For those limited to graphical methods, the area can be derived from the slope and intercept according to the rule Area = percentage of Gaussian distribution to left of z1, where $z_1 = \mathrm{Intercept}/\sqrt{1 + \mathrm{slope}^2}$.)

2 As we will see later, the SE of an estimated area depends on the magnitude of the underlying or "true" area. When calculating z to test the null hypothesis that this underlying area is the same for both modalities, one should equate SE1 and SE2, calculating them both from a common estimate of the area. In this case the denominator becomes $\sqrt{2SE^2 - 2rSE^2}$ or $SE\sqrt{2(1 - r)}$.
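The graphical rule for the area (the Gaussian percentage to the left of z1) can be evaluated without tables. This is a small sketch under the assumption that the intercept and slope are those of the straight line fitted on binormal axes; the function names are illustrative, not taken from the paper.

```python
import math


def normal_cdf(x):
    """Fraction of the standard Gaussian distribution to the left of x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binormal_area(intercept, slope):
    """Area under a binormal ROC curve from its fitted intercept and slope."""
    z1 = intercept / math.sqrt(1.0 + slope ** 2)
    return normal_cdf(z1)


# An intercept of zero means the two distributions coincide: chance performance.
print(binormal_area(0.0, 1.0))
```

A larger intercept moves z1 to the right and the area toward 1, while a slope far from 1 shrinks z1 for the same intercept.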

Calculating Standard Errors

The standard errors associated with areas can be obtained in three ways: (i) as output directly from the Dorfman and Alf maximum likelihood estimation program; (ii) from the variance of the Wilcoxon statistic as illustrated in detail in Reference 4; or (iii) from an approximation to the Wilcoxon statistic by making an assumption, shown to be conservative (compared with assuming a Gaussian-based ROC curve), that the underlying signal (diseased) and noise (nondiseased) distributions are exponential in type (4). We will use the standard errors estimated from the Dorfman and Alf program.
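Option (iii) can be sketched in a few lines. The formula below is the exponential-based approximation as published in Reference 4, recalled here rather than stated in this paper, so treat it as an assumption of the sketch; the function names are hypothetical. The usage lines apply it to the common area of 0.9164 with 54 abnormal and 58 normal phantoms, the case described in the CT phantom example.

```python
import math


def area_se_exponential(area, n_abnormal, n_normal):
    """Approximate SE of an ROC area under the exponential assumption."""
    q1 = area / (2.0 - area)
    q2 = 2.0 * area ** 2 / (1.0 + area)
    variance = (
        area * (1.0 - area)
        + (n_abnormal - 1) * (q1 - area ** 2)
        + (n_normal - 1) * (q2 - area ** 2)
    ) / (n_abnormal * n_normal)
    return math.sqrt(variance)


def common_se_denominator(se, r):
    """Denominator of Equation 3 when SE1 = SE2 = se (footnote 2)."""
    return se * math.sqrt(2.0 * (1.0 - r))


se_common = area_se_exponential(0.9164, 54, 58)
print(round(se_common, 4), round(common_se_denominator(se_common, 0.40), 4))
```

The standard error shrinks as either group grows, and for a fixed sample it is largest for areas near 0.5.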

Calculating the Correlation Coefficient, r, Between Areas

Two intermediate correlation coefficients are required, which are then converted into a correlation between A1 and A2 via a table that we supply below. The first is rN, the correlation coefficient for the ratings given to images from nondiseased patients by the two modalities. The second is rA, the correlation coefficient for the ratings of diseased patients imaged by the two modalities. Each of these can be calculated in traditional ways using either the Pearson product-moment correlation method or the Kendall tau. The former approach is usually used for results derived from an interval scale whereas the latter is more appropriate for results obtained from an ordinal scale. ROC curves in radiology are derived from ordinal scale data and therefore we have used the Kendall tau for calculating rN and rA. Standard statistical packages (e.g., SPSS, SAS) provide tau; when the number of rating categories is small, however, say four or less, the calculation can also be performed manually.
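The manual calculation of tau amounts to counting concordant and discordant pairs of cases. The sketch below assumes the tie-corrected variant (tau-b), which suits a rating scale with few categories; the paper does not say which variant was used, and the function name and sample ratings are illustrative only.

```python
import math


def kendall_tau_b(x, y):
    """Kendall tau-b for two equal-length lists of ordinal ratings."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both ratings: contributes to neither factor
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_x = concordant + discordant + tied_y_only  # pairs not tied on x
    untied_y = concordant + discordant + tied_x_only  # pairs not tied on y
    denominator = math.sqrt(untied_x * untied_y)
    if denominator == 0:
        return float("nan")
    return (concordant - discordant) / denominator


# Hypothetical ratings of four images by two modalities.
print(kendall_tau_b([1, 1, 2, 3], [1, 2, 2, 3]))
```

Applied once to the nondiseased cases and once to the diseased cases, this yields rN and rA; their mean is the "average correlation between ratings" used to enter the table. Library routines such as SciPy's `scipy.stats.kendalltau` offer the same statistic for larger data sets.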

Once the correlations between the ratings (rN among the normals, rA among the abnormals) are obtained, it is necessary to calculate the correlation that they induce between the two areas A1 and A2; for ease of notation we have called this r (without any subscript). This is the coefficient present in Equations 2 and 3. Tabulation of r (TABLE I) is the fundamental contribution of this paper³; therefore, in our subsequent example we will illustrate its use.

Experimental Data for Illustrative Examples

We studied 112 phantoms that were specially constructed to evaluate the accuracy of two different computer algorithms used in image reconstruction for CT. Fifty-eight of these phantoms were of uniform density and were designated "normal"; the remaining 54 contained an area of reduced density to simulate a lesion and were designated "abnormal". Two images of each phantom were reconstructed using the two different algorithms, which we will refer to as modality 1 and modality 2. A single reader read each image and rated it on a 6-point scale: 1 = Definitely Normal; 2 = Probably Normal; 3 = Possibly Normal; 4 = Possibly Abnormal; 5 = Probably Abnormal; 6 = Definitely Abnormal. From the resulting data, we constructed two ROC curves. The data were submitted to the Dorfman and Alf maximum likelihood program to produce areas under the ROC curves and standard errors.

RESULTS

Our results will be divided into two parts. First, the analysis of the example involving CT phantoms will be illustrated. Then, in order to verify that the z statistic performs correctly, results of several simulations will be summarized.

CT Phantom Example

The basic data are presented in the Appendix, along with the calculations produced from them. The areas under the ROC curves were 89.45% (SE 3.0%) and 93.82% (SE 2.6%). The (Kendall tau) correlations between the paired ratings were rN = 0.39 (nondiseased patients) and rA = 0.60 (diseased patients), giving an "average" correlation between the ratings of 0.50. With this average correlation of 0.50 and with an average area of (89.45 + 93.82)/2 = 91.64, TABLE I indicates that the correlation r between areas is approximately 0.40.

3 Mathematical derivation available upon request.

Equation 3 was then used to calculate the critical ratio z in order to test the null hypothesis that the observed difference between observed areas was merely a result of random sampling. Using the above data

$$ z = \frac{0.9382 - 0.8945}{\sqrt{0.03^2 + 0.026^2 - 2(0.4)(0.03)(0.026)}} = \frac{0.0437}{0.0309} = 1.41 $$

As mentioned earlier, one might average the two areas to obtain a common area of 0.9164, and use the formula in Reference 4 to predict each of the standard errors as 0.0281; using $0.0281\sqrt{2(1 - 0.4)} = 0.0308$ in the denominator of Equation 3 yields an almost identical z value of 1.42.

If we have reason to believe a priori that modality 2 is likely to be better than modality 1 and are only interested in improvements, then a one-tailed test is appropriate. The Gaussian distribution indicates that a value of 1.41 or higher should occur roughly once in every 13 samples (p = 0.079); this evidence suggests that the observed difference may not be random. This contrasts with the weaker inference that would be drawn from a critical ratio of 1.10 (or a p value of 0.136 or 1 in 7) that would have been calculated had the correlation between areas been assumed to be equal to zero (in other words, had we failed to take into account the increased sensitivity induced by studying the same set of patients with both modalities). If we had no a priori interest in one particular direction, then a two-tailed test would have been appropriate.

TABLE I: Correlation Coefficients*

[Table body not legible in this transcript. Rows: average correlation between ratings†, 0.02 to 0.90 in steps of 0.02. Columns: average area‡, 0.700 to 0.975 in steps of 0.025.]

* Correlation coefficient r between two ROC areas A1 and A2 as a function of average correlation between ratings (rows) and average area (columns).
† (rN + rA)/2.
‡ (A1 + A2)/2.

General Performance of the Paired Test

A good statistical test should indicate a difference when one is really present, but it should minimize in a predictable way the number of instances in which a difference is said to exist when, in fact, none does exist (high sensitivity and specificity). To determine these characteristics for this new statistical test, we examined its diagnostic performance over a range of simulated situations, using methods analogous to those used by Pollack and Hsieh (6) and Metz and Kronman (7). In order to calculate the specificity, 400 simulated analyses were performed for each of several combinations of underlying ROC areas and correlations. The tabulated distributions of the test statistic z obtained from these various simulations were for all practical purposes indistinguishable from Gaussian ones and had standard deviations acceptably close to 1. The false-positive rates were low, and close to what one should expect. Specifically, among the 4,800 trials (12 combinations each run 400 times) the average proportion of z values above 2.0 or below -2.0 (values often taken as indicating a statistically significant difference) was 3.4%, i.e., a specificity of 96.6%. In a perfect Gaussian distribution, this would have been 4.6%, i.e., a specificity of 95.4%.

Evaluation of the sensitivity (power) for this statistic required comparison of two modalities with different ROC areas. For this purpose, sets of 200 experiments were simulated from varying correlation coefficients. The performance was evaluated by tabulating the percentage of paired and unpaired tests indicating a significant difference. Over four combinations of correlations and baseline accuracies, one could project that the paired test would raise a 60% sensitivity (expected from an unpaired analysis) to [illegible]%.

DISCUSSION

In this investigation we have described a method of comparing the areas under two ROC curves derived from the same sample of patients. Two immediate results are apparent. We have shown that the comparison can be made more sensitive if the investigator takes into account the smaller sampling variability of the difference in areas induced by studying each patient twice. Second, our data can be extrapolated to indicate the statistical economy that emerges from this kind of experimental design and analysis. We discuss these points in turn.

The larger the correlation between the areas, the more sensitive the paired z test will be. This observation may explain why a number of studies using an unpaired z test that assumed the two areas were statistically independent failed to find a significant difference between the modalities. The degree of correlation expected between ROC areas obtained with different modalities varies considerably depending upon the types of modalities involved. For example, if the two images are obtained from the same machine with two different settings or if a radiologist reads a CT scan with and without extensive clinical history, high correlation can be expected. In this study involving different reconstruction algorithms with CT, the correlation between the paired ratings of abnormal phantoms was 0.60 and between paired ratings of normal phantoms was 0.39. We have observed similar results in a study of ours (8) involving the interpretation of CT studies of the head with and without extensive clinical history. On the other hand, when the only common denominator in the comparison is the patient, the correlations are likely to be weaker. For example, a study by Alderson et al. (9) comparing CT, ultrasound, and nuclear medicine imaging in the diagnosis of liver metastases found considerably lower rating-pair correlations (0.36 in abnormal patients and 0.28 in normal patients). Obviously, in the latter situation the gains from using a paired rather than an unpaired analysis are smaller.

Two other points must be made about correlation coefficients. First, in general we have noted that whatever the modalities under study, the ratings tend to be less correlated in the nondiseased patients than in the diseased patients. This suggests that in diagnostic imaging agreement tends to be greater if there is in fact underlying disease, and less if there is not. Second, if an investigator knew a priori that the correlations between the modalities under study were small, then an experimental design that did not involve pairing could be used, provided that it was no more difficult to separate (diagnose) the patients studied by one modality than it was to diagnose those studied by the other modality.

The statistical economy resulting from this new statistical test is large. Statistical economy relates to the question of how many more patients are required in an unpaired design than in a paired design to achieve the same sensitivity or statistical power. A comparison of Equations 1 and 2 provides an answer to this question. Each of the standard errors is inversely proportional to the square root of the sample size n. Also, the equations can be simplified by assuming that the standard errors of the two areas are equal; in this case, Equation 2 differs from Equation 1 only in the presence of the factor (1 - r). When the sample sizes associated with the two techniques are arranged so that the paired and unpaired tests produce the same z value, then a simple algebraic identity emerges:

$$ n_u = \frac{n_p}{1 - r} \quad \text{or} \quad n_p = (1 - r)\, n_u $$

where n_u and n_p are the numbers of patients per modality in the respective unpaired and paired designs⁴. For example, if r is anticipated to be roughly 0.3 and an unpaired design called for 100 patients per modality, then a paired design should require only 70 per modality. Thus the total number of images read would be 140 rather than 200. This efficiency is even more important if the limiting factor is the number of available patients with a proved outcome (rather than the number of images a reader can be expected to read), since the total of 140 paired images is obtained from just 70 patients, rather than from 200 patients in the unpaired design. The investigator must weigh very carefully the practical and statistical issues, keeping in mind that if one uses an unpaired design, one must establish (through case matching and/or random allocation) that the method of constructing two independent samples of subjects does not give one modality an inbuilt advantage.
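The sample-size identity can be checked with the numbers in the example above. This is a minimal sketch with hypothetical function names; it assumes, as the text does, equal standard errors for the two areas.

```python
def paired_sample_size(n_unpaired, r):
    """Patients per modality in a paired design matching an unpaired design."""
    return (1.0 - r) * n_unpaired


def unpaired_sample_size(n_paired, r):
    """Patients per modality in an unpaired design matching a paired design."""
    if r >= 1.0:
        raise ValueError("r must be less than 1")
    return n_paired / (1.0 - r)


n_paired = paired_sample_size(100, 0.3)   # patients per modality, paired design
images_paired = 2 * n_paired              # each patient is imaged twice
images_unpaired = 2 * 100                 # 100 different patients per modality
print(round(n_paired), round(images_paired), images_unpaired)
```

The saving grows linearly with r, so even a modest between-area correlation translates into a proportionate reduction in patients.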

The discussion thus far has centered on a rather restricted design where just one reader read the images generated by the two modalities being compared. The statistical test simply asked the question: if this one reader read an infinite rather than a finite number of images, would his/her accuracy be comparable in both modalities?⁵ Clearly, a more general question is relevant: how do the modalities compare over many readers?

For the sake of completeness, we refer briefly to this problem of multiple readers and readings in each modality. This situation has been discussed extensively by Swets and Pickett (10); our main reasons for mentioning it here are to draw readers' attention to a very extensive treatment of the design and analysis of imaging experiments, and to point out that our method of obtaining r now allows the methods therein to be used with greater sensitivity. This is best appreciated by reproducing the formula that the authors give (Equation 2, Chapter 3) for the standard error of a difference between the value of an accuracy index (such as the area under an ROC curve) for one modality (averaged over l readers, each reading each image m times) and the value of the same accuracy index (again averaged over readers and readings) for a second modality. The expression involves three sources of variation: $S_c^2$, the variation in the index due to differences in mean difficulty of cases from case sample to case sample; $S_{br}^2$, between-reader variance due to differences in diagnostic capability from reader to reader; and $S_{wr}^2$, within-reader variance due to differences in an individual reader's diagnoses of the same case on repeated occasions. It also involves two correlation coefficients: $r_c$ to denote the correlations introduced by using similar (or even the same) cases with both modalities and $r_{br}$ to denote correlations between the accuracy index obtained by using matched (or possibly the same) readers. With this notation, the formula becomes

$$ \mathrm{SE}(\text{difference}) = \sqrt{2}\,\left[ S_c^2 (1 - r_c) + \frac{S_{br}^2 (1 - r_{br})}{l} + \frac{S_{wr}^2}{l\,m} \right]^{1/2} $$

The authors describe fully in several worked examples how to evaluate each of these terms. They point out, however, that the estimation of the two components $r_c$ and $S_c^2$ creates problems. First, if m = 1, i.e., if each image is read just once, then $S_c^2$ and $S_{wr}^2$ are not separable, and one is forced to overestimate the SE. The second, and more serious, problem is that if m = 1 and if one does not have a large number of cases, enough (for example) to split them into a number of subsamples and fit an ROC curve to each, one is unable to estimate $r_c$. In such cases, the authors explain that one has no alternative but to assume $r_c = 0$, thereby giving up any benefits attainable from case matching.

4 This simple relation allows the user to multiply the sample sizes in TABLE III of our first publication (4) by the appropriate (1 - r) and use them for paired designs.

5 One could also use the z test to compare two specific readers on one modality.

The method we have- presented heremeans that i f on€. uses the area underthe ROC curve as an index of accuracy,one is not forced to assume r,. = 0. Thequanti ty we have cal lecl r , which isobtainable r; la Taslr I from the areaand from the correlat ions betweenr a t i n g s , i s t h e s d m e q u . r n t i l v r . - , . ,ment ioned in Equat ion 5 , Chapter 4 o fSwets and Pickett (8)6. The interested

⁶ If $m > 1$, one can correct the quantity $r_{c-wr}$ (obtained from Table I) for the "attenuation" produced by $S_{wr}^2$ and estimate the "true" correlation $r_c$ introduced by using similar (or the same) cases.


reader can consult that reference for full details on how it is used.

In summary, then, we have provided a method for estimating the correlation between the areas under two ROC curves derived from the same sample of patients and have shown how to use this correlation to perform a more sensitive comparison of the areas. Moreover, this provides an item that was previously only guessed at, or underestimated, in studies of multiple readers.

Acknowledgments: We are indebted to Richard Swensson and Philip Judy for providing the phantoms and reading results. We are also indebted to Charles Metz for many helpful discussions in the course of this investigation. Irene McCammon typed the manuscript.

Department of Epidemiology and Health
McGill University
3775 University Street
Montreal, Quebec, Canada H3A 2B4

References

Green DM, Swets JA. Signal detection theory and psychophysics. New York: John Wiley and Sons, 1966.
Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978; 8:283-298.
Swets JA. ROC analysis applied to the evaluation of medical imaging techniques. Invest Radiol 1979; 14:109-121.
Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:29-36.
Dorfman DD, Alf E. Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals - rating-method data. J Math Psych 1969; 6:487-496.
Pollack I, Hsieh R. Sampling variability of the area under the ROC curve and of d'e. Psych Bull 1969; 71:161-173.
Metz CE, Kronman HB. Statistical significance tests for binormal ROC curves. J Math Psych 1980; 22:218-243.
McNeil BJ, Hanley JA, Funkenstein HH, Wallman J. Paired receiver operating characteristic curves and the effect of history on radiographic interpretation: CT of the head as a case study. Radiology 1983; in press.
Alderson PO, Adams DF, McNeil BJ, et al. Computed tomography, ultrasound, and scintigraphy of the liver in patients with colon or breast carcinoma: a prospective comparison. Radiology 1983; in press.
Swets JA, Pickett RM. Evaluation of diagnostic systems. New York: Academic Press, 1982.

APPENDIX

Ratings given to two images of each of 112 phantoms, together with calculations of z test to test whether one modality yielded a greater ROC area than the other.

(a) Raw data:

[Tables: Modality 1 rating* cross-tabulated with Modality 2 rating*, ratings 1 to 6 with row and column totals; cell counts not recoverable]

* Ratings: from 1 = definitely normal to 6 = definitely abnormal.

(b) Correlation between ratings (Kendall tau), modality 1 vs. modality 2: [values not recoverable]

(c) ROC analysis:

[Table: intercept, slope, area, and standard error (area) for each modality; recoverable entries: standard error (area) = 0.030 for modality 1 and 0.026 for modality 2]

Difference in areas, average area, and correlation between areas: [values not recoverable]

(d) Test statistic: $z = (\text{difference in areas}) \,/\, \sqrt{0.030^2 + 0.026^2 - 2r(0.030)(0.026)}$, where $r$ is the correlation between areas [numerical values of the difference, $r$, and $z$ not recoverable].
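The z test in part (d) of the appendix divides the difference in areas by a standard error that is reduced by the correlation between the areas. A minimal sketch of that calculation follows. The function name is hypothetical, the standard errors 0.030 and 0.026 are the ones shown in the appendix, and the two areas and the correlation of 0.40 are placeholder inputs, not the appendix values.

```python
import math

def paired_area_z(area1, area2, se1, se2, r):
    """z statistic for two ROC areas derived from the same cases.

    z = (A1 - A2) / sqrt(SE1^2 + SE2^2 - 2 r SE1 SE2), where r is the
    correlation between the two areas; r = 0 gives the unpaired test.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    variance = se1**2 + se2**2 - 2.0 * r * se1 * se2
    if variance <= 0.0:
        raise ValueError("non-positive variance of the difference")
    return (area1 - area2) / math.sqrt(variance)

# Placeholder areas and correlation (assumptions of this sketch).
area_a, area_b = 0.94, 0.89
# Standard errors as shown in the appendix; their pairing with the
# placeholder areas above is arbitrary.
se_a, se_b = 0.026, 0.030

z_unpaired = paired_area_z(area_a, area_b, se_a, se_b, r=0.0)
z_paired = paired_area_z(area_a, area_b, se_a, se_b, r=0.40)
print(f"z ignoring the pairing: {z_unpaired:.2f}")
print(f"z using r = 0.40:       {z_paired:.2f}")
```

With a positive correlation the denominator shrinks, so the same observed difference yields a larger z. This is the same gain in sensitivity that the paired t-test provides over the unpaired one.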