ROC Curves Tutorial

Transcript of ROC Curves Tutorial

Page 1: ROC Curves Tutorial

1

An Overview of Contemporary ROC Methodology

in Medical Imaging and Computer-Assist Modalities

Robert F. Wagner, Ph.D., OST, CDRH, FDA

Page 2: ROC Curves Tutorial

2

ROC

Receiver Operating Characteristic (historic name from radar studies)

Relative Operating Characteristic (psychology, psychophysics)

Operating Characteristic (preferred by some)

Page 3: ROC Curves Tutorial

3

OUTLINE:

- Efforts toward consensus development on present issues

- The ROC Paradigm

- The complication of reader variability

- The multiple-reader multiple-case (MRMC) ROC paradigm

- The measurement scales: categories; patient-management/action; probability scale

- Complications from location uncertainty, truth uncertainty, effective sample # uncertainty, and reader vigilance

- Summary

Page 4: ROC Curves Tutorial

4

EFFORTS TOWARD CONSENSUS DEVELOPMENT ON THE PRESENT ISSUES

- How to use classic concepts of Sensitivity, Specificity, and ROC analysis to assess performance of diagnostic imaging and computer-assist systems?

- Many new issues and levels of complexity coming to the fore as more complex technologies emerge

Page 5: ROC Curves Tutorial

5

EFFORTS TOWARD CONSENSUS DEVELOPMENT ON THE PRESENT ISSUES (II)

- RSNA/SPIE/MIPS various workshops & literature: an evolving work-in-progress

- FDA/CDRH use of multiple-reader multiple-case (MRMC) ROC: Digital Mammography PMAs; computer aid for lung nodule detection on CXR (film)

- NCI Lung Image Database Consortium (LIDC) & workshops: consensus seeking on many issues; two active CDRH members

- Communication of these resources with incoming sponsors

Page 6: ROC Curves Tutorial

6

Fundamentals of the ROC paradigm

Page 7: ROC Curves Tutorial

7

[Figure: distributions of the test result value, or subjective judgement of likelihood that the case is diseased, for non-diseased and diseased cases, separated by a decision threshold.]

Page 8: ROC Curves Tutorial

8

[Figure: more typically, the non-diseased and diseased distributions of the test result value (or subjective judgement of likelihood that the case is diseased) overlap.]

Page 9: ROC Curves Tutorial

9

[Figure: a less aggressive mindset (higher threshold) on the non-diseased and diseased distributions yields one operating point on axes of TPF (sensitivity) vs. FPF (1 - specificity).]

Page 10: ROC Curves Tutorial

10

[Figure: a moderate mindset (intermediate threshold) yields a different operating point on the same TPF (sensitivity) vs. FPF (1 - specificity) axes.]

Page 11: ROC Curves Tutorial

11

[Figure: a more aggressive mindset (lower threshold) yields yet another operating point, with higher TPF (sensitivity) and higher FPF (1 - specificity).]

Page 12: ROC Curves Tutorial

12

[Figure: sweeping the threshold across the non-diseased and diseased distributions traces out the entire ROC curve of TPF (sensitivity) vs. FPF (1 - specificity).]

Page 13: ROC Curves Tutorial

13

[Figure: the entire ROC curve plotted as TPF (sensitivity) vs. FPF (1 - specificity); curves lying farther above the chance line correspond to greater reader skill and/or level of technology.]
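To make the threshold-sweeping idea concrete, here is a minimal Python sketch using simulated scores; the normal score distributions and their parameters are assumptions for illustration only, not data from the slides.

```python
import numpy as np

# Simulated test-result values (assumed distributions, for illustration only).
rng = np.random.default_rng(0)
non_diseased = rng.normal(0.0, 1.0, 500)
diseased = rng.normal(1.5, 1.0, 500)

# Sweep the decision threshold from least aggressive (highest threshold) to most
# aggressive (lowest): each threshold gives one (FPF, TPF) operating point, and
# together the points trace out the empirical ROC curve.
thresholds = np.sort(np.concatenate([non_diseased, diseased]))[::-1]
tpf = np.array([(diseased >= t).mean() for t in thresholds])      # sensitivity
fpf = np.array([(non_diseased >= t).mean() for t in thresholds])  # 1 - specificity

# The empirical area under that curve equals the Mann-Whitney statistic:
# the probability that a diseased case scores higher than a non-diseased case.
auc = (diseased[:, None] > non_diseased[None, :]).mean()
print(f"{len(thresholds)} operating points traced; AUC = {auc:.3f}")
```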

Page 14: ROC Curves Tutorial

14

. . . at least that’s the idea . . .

. . . now to what happens in the real world . . .

The Complication of Reader Variability

Page 15: ROC Curves Tutorial

15

In the following example from mammography, readers were asked to set their “threshold for action” at their sense of the boundary between category 3 and category 4 of the BIRADS scale.

Page 16: ROC Curves Tutorial

16

[Figure: scatter of operating points, True Positive Fraction vs. False Positive Fraction (axes 0.0 to 1.0; equivalently False Negative Fraction vs. True Negative Fraction), showing TPF vs. FPF for 108 US radiologists in the study by Beam et al.]

Page 17: ROC Curves Tutorial

17

- There is no unique ROC operating point

i.e., no unique (TPF, FPF) point

- There is no unique ROC curve

i.e., there is a band or region of ROCs

Page 18: ROC Curves Tutorial

18

. . . dozens of examples of this phenomenon exist . . .

The following is an example from

plain film chest radiography (CXR)

Page 19: ROC Curves Tutorial

19

[Figure: reader-variability example from a chest film study by E. James Potchen, M.D., 1999.]

Page 20: ROC Curves Tutorial

20

The Multiple-Reader Multiple-Case (MRMC) paradigm

“Fully-Crossed Design”

* Cases matched across modalities (i.e., same cases read unaided vs aided)

* Readers matched across modalities (i.e., same readers read unaided vs aided)

* This design has the most statistical power for a given number of readers and a given number of cases with verified truth; thus, it is least demanding of these resources (“least burdensome”)
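For concreteness, a minimal sketch of what a fully crossed dataset can look like; the array shapes, the 5-point rating scale, and the random values are assumptions for illustration, not part of the slides.

```python
import numpy as np

# Hypothetical fully crossed MRMC dataset: the same readers read the same cases
# under both modalities (unaided vs. aided), so every rating fits into a single
# array indexed [modality, reader, case].
n_modalities, n_readers, n_cases = 2, 5, 100
rng = np.random.default_rng(1)

ratings = rng.integers(1, 6, size=(n_modalities, n_readers, n_cases))  # e.g., 5-category scale
truth = rng.integers(0, 2, size=n_cases)  # 1 = diseased, 0 = non-diseased (same cases in both arms)
```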

Page 21: ROC Curves Tutorial

21

The Multiple-Reader Multiple-Case (MRMC) paradigm

Enabled by “resampling strategies”

- Jackknife plus ANOVA (parametric) (Dorfman, Berbaum, Metz DBM 1992)

- Bootstrap the experiment of interest (nonpar): draw random readers, random cases; carry out the experiment of interest

Page 22: ROC Curves Tutorial

22

Some possible bootstrap samples of size 15 from a dataset with 15 elements

[14, 6, 3, 5, 12, 9, 11, 14, 4, 10, 7, 12, 3, 14, 2]

. . .

[9, 15, 11, 2, 13, 1, 6, 7, 12, 4, 8, 1, 12, 6, 14]
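Such samples can be generated by drawing indices with replacement; a minimal sketch (the library and seed are incidental choices):

```python
import numpy as np

# Draw one bootstrap sample of size 15 from the indices 1..15, with replacement:
# some indices repeat and others are left out, as in the examples above.
rng = np.random.default_rng(42)
bootstrap_sample = rng.choice(np.arange(1, 16), size=15, replace=True)
print(list(bootstrap_sample))
```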

Page 23: ROC Curves Tutorial

23

The Multiple-Reader Multiple-Case (MRMC) paradigm

Enabled by “resampling strategies”

- Jackknife plus ANOVA (parametric) (Dorfman, Berbaum, Metz DBM 1992)

- Bootstrap the experiment of interest (nonpar): draw random readers, random cases; carry out the experiment of interest

- Obtain mean performance over readers, cases

- Obtain error bars that account for variability of readers and cases
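A minimal nonparametric sketch of “bootstrap the experiment of interest,” assuming the ratings for one modality are stored as a [reader, case] array (a hypothetical layout); the function names are illustrative, and this is not the DBM jackknife/ANOVA software cited above.

```python
import numpy as np

def auc(pos_scores, neg_scores):
    """Empirical ROC area via the Mann-Whitney statistic (ties counted as 1/2)."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def mrmc_bootstrap(ratings, truth, n_boot=1000, seed=0):
    """ratings: [reader, case] array for one modality; truth: 1 = diseased, 0 = not.
    Each bootstrap replicate redraws random readers and random cases, recomputes the
    reader-averaged AUC, and the spread of replicates gives error bars that account
    for both reader and case variability."""
    rng = np.random.default_rng(seed)
    n_readers, _ = ratings.shape
    dis = np.flatnonzero(truth == 1)
    non = np.flatnonzero(truth == 0)
    replicates = []
    for _ in range(n_boot):
        readers = rng.integers(0, n_readers, n_readers)        # random readers, with replacement
        d = rng.choice(dis, size=dis.size, replace=True)       # random diseased cases
        n = rng.choice(non, size=non.size, replace=True)       # random non-diseased cases
        replicates.append(np.mean([auc(ratings[r, d], ratings[r, n]) for r in readers]))
    return np.mean(replicates), np.std(replicates, ddof=1)     # mean performance, standard error
```

Comparing two modalities would apply the same resampling to the difference in reader-averaged AUCs rather than to a single arm.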

Page 24: ROC Curves Tutorial

24

Scales used for reporting and measurements:

- Historic ordered categories (usu. 5 or 6) (almost definitely no . . . maybe . . . almost definitely yes)

- “Action item” or “patient management” scale (e.g., no action vs F/U . . . or F/U vs biopsy)

. . . BIRADS scale is classic example . . .

- “Continuous” probability rating scale (e.g., probability of disease or probability of cancer) . . . actually recommended in BIRADS doc . . .

Page 25: ROC Curves Tutorial

25

Scales used for reporting and measurements

Example of “Best of both worlds”:

Classification of benign vs malignant μcalc clusters (Jiang, Nishikawa, Schmidt, Metz, Giger, Doi)

Authors studied ROC curves, ROC areas . . . and (Sensitivity, Specificity) operating point

(means and uncertainties)

Page 26: ROC Curves Tutorial

26

Page 27: ROC Curves Tutorial

27

Possible reasons why we do not see more of “Best of both worlds”

- ROC total area is TPF (Se) averaged over FPF (Sp):

Var(ROC area) ~ (Binomial Var)/2

Var(Se) when Sp is known = Binomial Var

Var(Se) when Sp is estimated > Binomial Var

(see the numeric sketch after this list)

- Var(ROC area) is least burdensome

- “Both worlds” requires consistent conventions . . . plus training (little documentation so far)

- May require consensus bodies to promote the practice
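A back-of-the-envelope numeric illustration of the variance relations above; the values Se = 0.90 and n = 100 diseased cases are assumptions chosen only to put numbers on the slide's rule of thumb.

```python
# Binomial variance of a fraction p estimated from n cases is p * (1 - p) / n.
p, n = 0.90, 100                             # assumed sensitivity and number of diseased cases
var_se_known_sp = p * (1 - p) / n            # Var(Se) when Sp (the operating threshold) is known
var_auc_rule_of_thumb = var_se_known_sp / 2  # slide's rule of thumb: Var(ROC area) ~ (Binomial Var)/2
print(f"{var_se_known_sp:.6f} {var_auc_rule_of_thumb:.6f}")  # 0.000900 0.000450
```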

Page 28: ROC Curves Tutorial

28

The most famous slides in the ROC archives . . .

Page 29: ROC Curves Tutorial

29

Dilemma: Which modality is better?

[Figure: Modality A and Modality B compared on axes of True Positive Fraction (= Sensitivity) vs. False Positive Fraction (= 1 - Specificity), axes from 0.0 to 1.0.]

Page 30: ROC Curves Tutorial

30

[Figure: ROC curves for Modality A and Modality B on axes of True Positive Fraction (= Sensitivity) vs. False Positive Fraction (= 1 - Specificity).]

The dilemma is resolved after ROCs are determined (one scenario):

Conclusion: Modality B is better:

higher TPF at same FPF, or

lower FPF at same TPF

Page 31: ROC Curves Tutorial

31

A different scenario: Same ROC

[Figure: Modality A and Modality B lie on the same ROC curve; axes of True Positive Fraction (= Sensitivity) vs. False Positive Fraction (= 1 - Specificity).]

Page 32: ROC Curves Tutorial

32

. . . yet another scenario:

[Figure: ROC curves for Modality A and Modality B on axes of True Positive Fraction (= Sensitivity) vs. False Positive Fraction (= 1 - Specificity).]

Conclusion: Modality A is better:

higher TPF at same FPF, or

lower FPF at same TPF

Page 33: ROC Curves Tutorial

33

When ROC curves cross, the total area under the ROC curve is not a sufficient summary measure of performance; other summary measures may be necessary. When this is anticipated, the study protocol is expected to address this.
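One alternative summary that is often used in this situation is the partial area under the ROC curve over a clinically relevant FPF range (named here only as an example; the slides do not prescribe a specific measure). A minimal sketch, assuming the empirical curve is given as (FPF, TPF) points sorted by FPF and including (0, 0):

```python
import numpy as np

def partial_auc(fpf, tpf, fpf_max=0.10):
    """Area under the empirical ROC curve restricted to FPF <= fpf_max.
    fpf and tpf must be sorted by increasing FPF (e.g., from a threshold sweep)."""
    fpf = np.asarray(fpf, dtype=float)
    tpf = np.asarray(tpf, dtype=float)
    # Close the region exactly at fpf_max by interpolating TPF there.
    tpf_at_cut = np.interp(fpf_max, fpf, tpf)
    keep = fpf <= fpf_max
    x = np.concatenate([fpf[keep], [fpf_max]])
    y = np.concatenate([tpf[keep], [tpf_at_cut]])
    # Trapezoidal rule over the retained segment.
    return float(np.sum(np.diff(x) * (y[:-1] + y[1:]) / 2.0))
```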

Page 34: ROC Curves Tutorial

34

Location scoring:

- The basic ROC paradigm is an assessment of the decision making at the level of the patient.

- In complex imaging, assessment of decision making at a finer level is desired, i.e., assessment of localization is desired.

- Localization adds more information, more statistical power

Page 35: ROC Curves Tutorial

35

The problem of location-specific ROC or “LROC” analysis

- Measurement of a “hit” depends on localization criterion (thus, results are not unique)

- Monotonic relationship between ROC and LROC for special case of zero or one lesion

- More elaborate models require assumptions of independence among multiple lesions, regions

- Lack of validated software for analysis of experiments

Page 36: ROC Curves Tutorial

36

Region-of-interest (ROI) approach to location-specific ROC analysis . . .

. . . only requires localization to within a quadrant . . .

. . . or some other unit . . .

Page 37: ROC Curves Tutorial

37

Region-of-interest (ROI) approach to location-specific ROC analysis . . .

- Disadvantages: “Does not correspond to the clinical task”. . . etc. . .

- Advantages: Straightforward to account for correlations w/o additional assumptions

- The most straightforward method is simply to resample using the patient as the statistical unit (see the sketch below)
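A minimal sketch of that patient-level resampling; the data layout (a list with one entry of ROI results per patient) and the `statistic` callback are hypothetical names used only for illustration.

```python
import numpy as np

def patient_bootstrap(roi_results_by_patient, statistic, n_boot=1000, seed=0):
    """roi_results_by_patient: list with one entry per patient (e.g., that patient's
    ROI scores and truth labels). Patients, not individual ROIs, are resampled with
    replacement, so correlations among ROIs within a patient are preserved."""
    rng = np.random.default_rng(seed)
    n_patients = len(roi_results_by_patient)
    values = []
    for _ in range(n_boot):
        pick = rng.integers(0, n_patients, n_patients)     # patients, with replacement
        resample = [roi_results_by_patient[i] for i in pick]
        values.append(statistic(resample))                 # e.g., ROI-level ROC area
    return np.mean(values), np.std(values, ddof=1)
```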

Page 38: ROC Curves Tutorial

38

THE PROBLEM OF UNCERTAINTY OF TRUTH STATE

Classic paper: Revesz, Kundel, Bonitatibus (1983) included various ways of obtaining panel consensus “truth”

Authors compared three imaging methods

Any one of the three could outperform the others – depending on rule used for reducing panel to truth

HOWEVER, TODAY THE TARGET IS AN “ACTIONABLE NODULE” ACCORDING TO AN EXPERT PANEL

Classic ref. above indicates additional uncertainty present

=> Resample panel to assess additional uncertainty

Page 39: ROC Curves Tutorial

39

UNCERTAINTY OF TRUTH STATE => UNCERTAINTY IN EFFECTIVE SAMPLE SIZE

Uncertainty in TPF depends on the # of actually diseased cases; uncertainty in FPF depends on the # of actually nondiseased cases

Uncertainty in total area under the ROC curve depends on the “effective number of cases”

Effective number = harmonic mean of the numbers in the two classes . . . & is a function of the panel sample
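A small worked example of the harmonic-mean “effective number”; the splits of 100 cases below are illustrative and anticipate the figure on the next slide.

```python
def effective_n(n_diseased, n_normal):
    # Harmonic mean of the numbers of cases in the two truth classes.
    return 2 * n_diseased * n_normal / (n_diseased + n_normal)

print(effective_n(50, 50))  # 50.0 -> an even split of 100 patients gives the largest effective number
print(effective_n(20, 80))  # 32.0 -> an uneven split of the same 100 patients is less efficient
```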

Page 40: ROC Curves Tutorial

40

Given: 100 patients – What is the best “split” between “normals” and “abnormals” for purposes of estimating area under ROC?

[Figure: effective number of cases for ROC area estimation (vertical axis, 0 to 60) vs. the number in the "normal" class (horizontal axis, 0 to 100); the effective number peaks at an even 50/50 split.]

Page 41: ROC Curves Tutorial

41

. . . relaxing panel criterion from unanimous to majority

- allows resampling to assess variability

- may increase effective number of samples

. . . these effects may tend to cancel

Page 42: ROC Curves Tutorial

42

THE PROBLEM OF CONTROLLING FOR READER VIGILANCE

Any measurement setting has artificial conditions vis-à-vis actual practice:

“Are readers more vigilant in unaided reading when they’re subjects in a study?”

“Are readers less vigilant in unaided reading when they’re not subjects in a study?”

One early suggestion: Control the time available to readers to mimic the clinic

(Chan et al., Invest. Radiol. 1990)

Page 43: ROC Curves Tutorial

43

IN SUMMARY

These points reflect the current status of on-going interactions

between and among

FDA

Academia

Industry sponsors

NCI and the LIDC

on the topic and issues for submissions like the present one

Page 44: ROC Curves Tutorial

44

Selected References

Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine 1978; 8: 283-298.

Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986; 21: 720-733.

Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989; 24: 234-245.

Metz CE. Fundamentals of ROC Analysis. [In] Handbook of Medical Imaging. Vol. 1. Physics and Psychophysics. Beutel J, Kundel HL, and Van Metter RL, Eds. SPIE Press (Bellingham WA 2000), Chapter 15: 751-769.

Swets JA and Pickett RM. Evaluation of Diagnostic Systems. Academic Press, New York, 1982.

Wagner RF, Beiden SV, Campbell G, Metz CE, and Sacks WM. Assessment of medical imaging and computer-assist systems: Lessons from recent experience. Acad Radiol 2002; 9: 1264-1277.

Wagner RF, Beiden SV, Campbell G, Metz CE, and Sacks WM. Contemporary issues for experimental design in assessment of medical imaging and computer-assist systems. Proc. of the SPIE-Medical Imaging 2003; 5034: 213-224.

Dodd LE, Wagner RF, Armato SG, McNitt-Gray MF, et al. Assessment methodologies and statistical issues for computer-aided diagnosis of lung nodules in computed tomography: Contemporary research topics relevant to the Lung Image Database Consortium. Acad Radiol (in print, Apr. 2004).

Page 45: ROC Curves Tutorial

45

Toledano AY, Gatsonis C. Ordinal regression methodology for ROC curves derived from correlated data. Statistics in Medicine 1996, 15: 1807-1826.

Nishikawa RM and Yarusso LM. Variations in measured performance of CAD schemes due to database composition and scoring protocol. Proc. of the SPIE 1998; 3338: 840-844.

Giger ML. Current issues in CAD for mammography. In: Doi K, Giger ML, Nishikawa RM, and Schmidt RA, Eds. Digital Mammography ’96. Elsevier Science B.V. 1996, 53-59.

Clarke LP, Croft BY, Staab E, Baker H, Sullivan DC. National Cancer Institute initiative: Lung image database resource for imaging research. Acad Radiol 2001; 8: 447-450.

Wagner RF, Beiden SV, Metz CE. Continuous versus categorical data for ROC analysis: Some quantitative considerations. Acad Radiol 2001; 8: 328-334.

Revesz G, Kundel HL, and Bonitatibus M. The effect of verification on the assessment of imaging techniques. Invest. Radiol. 1983; 18: 194-198.

Beiden SV, Wagner RF, Campbell G. Components-of-variance models and multiple-bootstrap experiments: An alternative method for random-effects receiver operating characteristic analysis. Acad Radiol 2000; 7: 341-349.

Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: Hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Acad Radiol 1995; 2 (Supplement 1): S22-S29.

Chan HP, Doi K, Vyborny CJ et al. Improvement in radiologists’ detection of clustered microcalcifications on mammograms. Invest Radiol 1990; 25: 1102.

Page 46: ROC Curves Tutorial

46

Chakraborty DP and Winter L. Free-response methodology: Alternate analysis and a new observer-performance experiment. Radiology 1990; 174: 873-881.

Metz CE, Starr SJ, Lusted LB. Observer performance in detecting multiple radiographic signals: prediction and analysis using a generalized ROC approach. Radiology 1976; 121: 337-347.

Starr SJ, Metz CE, Lusted LB, Goodenough DJ. Visual detection and localization of radiographic images. Radiology 1975; 116: 533-538.

Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Medical Physics 1996; 23: 1709-1725.

Chakraborty DP. The FROC, AFROC and DROC variants of the ROC analysis. [In] Handbook of Medical Imaging. Vol. 1. Physics and Psychophysics. Beutel J, Kundel HL, and Van Metter RL, Eds. SPIE Press (Bellingham WA 2000), Chapter 16: 771-796.

Obuchowski NA. Multireader receiver operating characteristic studies: A comparison of study designs. Acad Radiol 1995; 2: 709-716.

Gatsonis CA, Begg CB, Wieand S. Advances in Statistical Methods for Diagnostic Radiology: A Symposium. Acad Radiol 1995; 2 (Supplement 1): S1-S84 (the entire supplement is the Proceedings of the Symposium).

Beiden SV, Wagner RF, Doi K, Nishikawa RM, Freedman M, Lo S-C B, and Xu X-W. Independent versus sequential reading in ROC studies of computer-assist modalities: Analysis of components of variance. Acad Radiol 2002; 9: 1036-1043.

Page 47: ROC Curves Tutorial

47

Metz CE. Evaluation of CAD Methods. In: Doi K, MacMahon H, Giger ML, and Hoffmann KR, eds. Computer-Aided Diagnosis in Medical Imaging. Amsterdam: Elsevier Science B.V. (Excerpta Medica International Congress Series, Vol. 1182), 1999, 543-554.

Chakraborty DP. Statistical power in observer performance studies: Comparison of the receiver operating characteristic and free-response methods in tasks involving localization. Acad Radiol 2002; 9: 147-156.

Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992; 27: 723-731.

Chakraborty DP and Berbaum KS. Comparing Inter-Modality Diagnostic Accuracies in Tasks Involving Lesion Localization: A Jackknife AFROC Approach. Supplement to Radiology, Volume 225 (P), 259, 2002.

Obuchowski NA, Lieber ML, Powell KA. Data analysis for detection and localization of multiple abnormalities with application to mammography. Acad Radiol 2000; 7: 516-525.

Rutter CM. Bootstrap estimation of diagnostic accuracy with patient-clustered data. Acad Radiol 2000; 7: 413-419.

Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman and Hall, New York, 1993.

Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Arch Intern Med 1996; 156: 209-213.

Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K. Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 1999; 6: 22-33.