ROC Curves Tutorial

Transcript of ROC Curves Tutorial

Page 1: ROC Curves Tutorial

1

An Overview of Contemporary ROC Methodology

in Medical Imaging and Computer-Assist Modalities

Robert F. Wagner, Ph.D., OST, CDRH, FDA

Page 2: ROC Curves Tutorial

2

ROC

Receiver Operating Characteristic (historic name from radar studies)

Relative Operating Characteristic (psychology, psychophysics)

Operating Characteristic (preferred by some)

Page 3: ROC Curves Tutorial

3

OUTLINE:

- Efforts toward consensus development on present issues

- The ROC Paradigm

- The complication of reader variability

- The multiple-reader multiple-case (MRMC) ROC paradigm

- The measurement scales: categories; patient-management/action; probability scale

- Complications from location uncertainty, truth uncertainty, effective sample # uncertainty, and reader vigilance

- Summary

Page 4: ROC Curves Tutorial

4

EFFORTS TOWARD CONSENSUS DEVELOPMENT ON THE PRESENT ISSUES

- How to use classic concepts of Sensitivity, Specificity, and ROC analysis to assess performance of diagnostic imaging and computer-assist systems?

- Many new issues and levels of complexity coming to the fore as more complex technologies emerge

Page 5: ROC Curves Tutorial

5

EFFORTS TOWARD CONSENSUS DEVELOPMENT ON THE PRESENT ISSUES (II)

- RSNA/SPIE/MIPS various workshops & literature: an evolving work-in-progress

- FDA/CDRH use of multiple-reader multiple-case (MRMC) ROC: Digital Mammography PMAs; computer aid for lung nodule detection on CXR (film)

- NCI Lung Image Database Consortium (LIDC) & workshops: consensus seeking on many issues; two active CDRH members

- Communication of these resources with incoming sponsors

Page 6: ROC Curves Tutorial

6

Fundamentals of the ROC paradigm

Page 7: ROC Curves Tutorial

7

[Figure: distributions of the test result value, or subjective judgement of likelihood that the case is diseased, for non-diseased and diseased cases, separated by a decision threshold.]

Page 8: ROC Curves Tutorial

8

[Figure: more typically, the non-diseased and diseased distributions of the test result value (or subjective judgement of likelihood that the case is diseased) overlap.]

Page 9: ROC Curves Tutorial

9

[Figure: a less aggressive mindset (higher threshold) on the non-diseased and diseased distributions yields one operating point on axes of TPF (sensitivity) vs. FPF (1 - specificity).]

Page 10: ROC Curves Tutorial

10

[Figure: a moderate mindset (intermediate threshold) yields a different operating point on the same TPF (sensitivity) vs. FPF (1 - specificity) axes.]

Page 11: ROC Curves Tutorial

11

[Figure: a more aggressive mindset (lower threshold) yields yet another operating point, with higher TPF (sensitivity) and higher FPF (1 - specificity).]

Page 12: ROC Curves Tutorial

12

[Figure: sweeping the threshold across the non-diseased and diseased distributions traces out the entire ROC curve of TPF (sensitivity) vs. FPF (1 - specificity).]

Page 13: ROC Curves Tutorial

13

[Figure: the entire ROC curve plotted as TPF (sensitivity) vs. FPF (1 - specificity); curves lying farther above the chance line correspond to greater reader skill and/or level of technology.]
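To make the threshold-sweeping idea concrete, here is a minimal Python sketch using simulated scores; the normal score distributions and their parameters are assumptions for illustration only, not data from the slides.

```python
import numpy as np

# Simulated test-result values (assumed distributions, for illustration only).
rng = np.random.default_rng(0)
non_diseased = rng.normal(0.0, 1.0, 500)
diseased = rng.normal(1.5, 1.0, 500)

# Sweep the decision threshold from least aggressive (highest threshold) to most
# aggressive (lowest): each threshold gives one (FPF, TPF) operating point, and
# together the points trace out the empirical ROC curve.
thresholds = np.sort(np.concatenate([non_diseased, diseased]))[::-1]
tpf = np.array([(diseased >= t).mean() for t in thresholds])      # sensitivity
fpf = np.array([(non_diseased >= t).mean() for t in thresholds])  # 1 - specificity

# The empirical area under that curve equals the Mann-Whitney statistic:
# the probability that a diseased case scores higher than a non-diseased case.
auc = (diseased[:, None] > non_diseased[None, :]).mean()
print(f"{len(thresholds)} operating points traced; AUC = {auc:.3f}")
```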

Page 14: ROC Curves Tutorial

14

. . . at least that’s the idea . . .

. . . now to what happens in the real world . . .

The Complication of Reader Variability

Page 15: ROC Curves Tutorial

15

In the following example from mammography, readers were asked to set their “threshold for action” at their sense of the boundary between category 3 and category 4 of the BIRADS scale.

Page 16: ROC Curves Tutorial

16

[Figure: scatter of operating points, True Positive Fraction vs. False Positive Fraction (axes 0.0 to 1.0; equivalently False Negative Fraction vs. True Negative Fraction), showing TPF vs. FPF for 108 US radiologists in the study by Beam et al.]

Page 17: ROC Curves Tutorial

17

- There is no unique ROC operating point

i.e., no unique (TPF, FPF) point

- There is no unique ROC curve

i.e., there is a band or region of ROCs

Page 18: ROC Curves Tutorial

18

. . . dozens of examples of this phenomenon exist . . .

The following is an example from

plain film chest radiography (CXR)

Page 19: ROC Curves Tutorial

19

[Figure: reader-variability example from a chest film study by E. James Potchen, M.D., 1999.]

Page 20: ROC Curves Tutorial

20

The Multiple-Reader Multiple-Case (MRMC) paradigm

“Fully-Crossed Design”

* Cases matched across modalities (i.e., same cases read unaided vs aided)

* Readers matched across modalities (i.e., same readers read unaided vs aided)

* This design has the most statistical power for a given number of readers and a given number of cases with verified truth; thus, it is least demanding of these resources (“least burdensome”)
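For concreteness, a minimal sketch of what a fully crossed dataset can look like; the array shapes, the 5-point rating scale, and the random values are assumptions for illustration, not part of the slides.

```python
import numpy as np

# Hypothetical fully crossed MRMC dataset: the same readers read the same cases
# under both modalities (unaided vs. aided), so every rating fits into a single
# array indexed [modality, reader, case].
n_modalities, n_readers, n_cases = 2, 5, 100
rng = np.random.default_rng(1)

ratings = rng.integers(1, 6, size=(n_modalities, n_readers, n_cases))  # e.g., 5-category scale
truth = rng.integers(0, 2, size=n_cases)  # 1 = diseased, 0 = non-diseased (same cases in both arms)
```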

Page 21: ROC Curves Tutorial

21

The Multiple-Reader Multiple-Case (MRMC) paradigm

Enabled by “resampling strategies”

- Jackknife plus ANOVA (parametric) (Dorfman, Berbaum, Metz DBM 1992)

- Bootstrap the experiment of interest (nonpar): draw random readers, random cases; carry out the experiment of interest

Page 22: ROC Curves Tutorial

22

Some possible bootstrap samples of size 15 from a dataset with 15 elements

[14, 6, 3, 5, 12, 9, 11, 14, 4, 10, 7, 12, 3, 14, 2]

. . .

[9, 15, 11, 2, 13, 1, 6, 7, 12, 4, 8, 1, 12, 6, 14]
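Such samples can be generated by drawing indices with replacement; a minimal sketch (the library and seed are incidental choices):

```python
import numpy as np

# Draw one bootstrap sample of size 15 from the indices 1..15, with replacement:
# some indices repeat and others are left out, as in the examples above.
rng = np.random.default_rng(42)
bootstrap_sample = rng.choice(np.arange(1, 16), size=15, replace=True)
print(list(bootstrap_sample))
```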

Page 23: ROC Curves Tutorial

23

The Multiple-Reader Multiple-Case (MRMC) paradigm

Enabled by “resampling strategies”

- Jackknife plus ANOVA (parametric) (Dorfman, Berbaum, Metz DBM 1992)

- Bootstrap the experiment of interest (nonpar): draw random readers, random cases; carry out the experiment of interest

- Obtain mean performance over readers, cases

- Obtain error bars that account for variability of readers and cases
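A minimal nonparametric sketch of “bootstrap the experiment of interest,” assuming the ratings for one modality are stored as a [reader, case] array (a hypothetical layout); the function names are illustrative, and this is not the DBM jackknife/ANOVA software cited above.

```python
import numpy as np

def auc(pos_scores, neg_scores):
    """Empirical ROC area via the Mann-Whitney statistic (ties counted as 1/2)."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def mrmc_bootstrap(ratings, truth, n_boot=1000, seed=0):
    """ratings: [reader, case] array for one modality; truth: 1 = diseased, 0 = not.
    Each bootstrap replicate redraws random readers and random cases, recomputes the
    reader-averaged AUC, and the spread of replicates gives error bars that account
    for both reader and case variability."""
    rng = np.random.default_rng(seed)
    n_readers, _ = ratings.shape
    dis = np.flatnonzero(truth == 1)
    non = np.flatnonzero(truth == 0)
    replicates = []
    for _ in range(n_boot):
        readers = rng.integers(0, n_readers, n_readers)        # random readers, with replacement
        d = rng.choice(dis, size=dis.size, replace=True)       # random diseased cases
        n = rng.choice(non, size=non.size, replace=True)       # random non-diseased cases
        replicates.append(np.mean([auc(ratings[r, d], ratings[r, n]) for r in readers]))
    return np.mean(replicates), np.std(replicates, ddof=1)     # mean performance, standard error
```

Comparing two modalities would apply the same resampling to the difference in reader-averaged AUCs rather than to a single arm.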

Page 24: ROC Curves Tutorial

24

Scales used for reporting and measurements:

- Historic ordered categories (usu. 5 or 6) (almost definitely no . . . maybe . . . almost definitely yes)

- “Action item” or “patient management” scale (e.g., no action vs F/U . . . or F/U vs biopsy)

. . . BIRADS scale is classic example . . .

- “Continuous” probability rating scale (e.g., probability of disease or probability of cancer) . . . actually recommended in BIRADS doc . . .

Page 25: ROC Curves Tutorial

25

Scales used for reporting and measurements

Example of “Best of both worlds”:

Classification of benign vs malignant μcalc clusters (Jiang, Nishikawa, Schmidt, Metz, Giger, Doi)

Authors studied ROC curves, ROC areas . . . and (Sensitivity, Specificity) operating point

(means and uncertainties)

Page 26: ROC Curves Tutorial

26

Page 27: ROC Curves Tutorial

27

Possible reasons why we do not see more of “Best of both worlds”

- ROC total area is TPF (Se) averaged over FPF (Sp):

Var(ROC area) ~ (Binomial Var)/2

Var(Se) when Sp is known = Binomial Var

Var(Se) when Sp is estimated > Binomial Var

(see the numeric sketch after this list)

- Var(ROC area) is least burdensome

- “Both worlds” requires consistent conventions . . . plus training (little documentation so far)

- May require consensus bodies to promote the practice
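A back-of-the-envelope numeric illustration of the variance relations above; the values Se = 0.90 and n = 100 diseased cases are assumptions chosen only to put numbers on the slide's rule of thumb.

```python
# Binomial variance of a fraction p estimated from n cases is p * (1 - p) / n.
p, n = 0.90, 100                             # assumed sensitivity and number of diseased cases
var_se_known_sp = p * (1 - p) / n            # Var(Se) when Sp (the operating threshold) is known
var_auc_rule_of_thumb = var_se_known_sp / 2  # slide's rule of thumb: Var(ROC area) ~ (Binomial Var)/2
print(f"{var_se_known_sp:.6f} {var_auc_rule_of_thumb:.6f}")  # 0.000900 0.000450
```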

Page 28: ROC Curves Tutorial

28

The most famous slides in the ROC archives . . .

Page 29: ROC Curves Tutorial

29

Dilemma: Which modality is better?

[Figure: Modality A and Modality B compared on axes of True Positive Fraction (= Sensitivity) vs. False Positive Fraction (= 1 - Specificity), axes from 0.0 to 1.0.]

Page 30: ROC Curves Tutorial

30

[Figure: ROC curves for Modality A and Modality B on axes of True Positive Fraction (= Sensitivity) vs. False Positive Fraction (= 1 - Specificity).]

The dilemma is resolved after ROCs are determined (one scenario):

Conclusion: Modality B is better:

higher TPF at same FPF, or

lower FPF at same TPF

Page 31: ROC Curves Tutorial

31

A different scenario: Same ROC

[Figure: Modality A and Modality B lie on the same ROC curve; axes of True Positive Fraction (= Sensitivity) vs. False Positive Fraction (= 1 - Specificity).]

Page 32: ROC Curves Tutorial

32

. . . yet another scenario:

[Figure: ROC curves for Modality A and Modality B on axes of True Positive Fraction (= Sensitivity) vs. False Positive Fraction (= 1 - Specificity).]

Conclusion: Modality A is better:

higher TPF at same FPF, or

lower FPF at same TPF

Page 33: ROC Curves Tutorial

33

When ROC curves cross, the total area under the ROC curve is not a sufficient summary measure of performance; other summary measures may be necessary. When this is anticipated, the study protocol is expected to address this.
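One alternative summary that is often used in this situation is the partial area under the ROC curve over a clinically relevant FPF range (named here only as an example; the slides do not prescribe a specific measure). A minimal sketch, assuming the empirical curve is given as (FPF, TPF) points sorted by FPF and including (0, 0):

```python
import numpy as np

def partial_auc(fpf, tpf, fpf_max=0.10):
    """Area under the empirical ROC curve restricted to FPF <= fpf_max.
    fpf and tpf must be sorted by increasing FPF (e.g., from a threshold sweep)."""
    fpf = np.asarray(fpf, dtype=float)
    tpf = np.asarray(tpf, dtype=float)
    # Close the region exactly at fpf_max by interpolating TPF there.
    tpf_at_cut = np.interp(fpf_max, fpf, tpf)
    keep = fpf <= fpf_max
    x = np.concatenate([fpf[keep], [fpf_max]])
    y = np.concatenate([tpf[keep], [tpf_at_cut]])
    # Trapezoidal rule over the retained segment.
    return float(np.sum(np.diff(x) * (y[:-1] + y[1:]) / 2.0))
```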

Page 34: ROC Curves Tutorial

34

Location scoring:

- The basic ROC paradigm is an assessment of the decision making at the level of the patient.

- In complex imaging, assessment of decision making at a finer level is desired, i.e., assessment of localization is desired.

- Localization adds more information, more statistical power

Page 35: ROC Curves Tutorial

35

The problem of location-specific ROC or “LROC” analysis

- Measurement of a “hit” depends on localization criterion (thus, results are not unique)

- Monotonic relationship between ROC and LROC for special case of zero or one lesion

- More elaborate models require assumptions of independence among multiple lesions, regions

- Lack of validated software for analysis of experiments

Page 36: ROC Curves Tutorial

36

Region-of-interest (ROI) approach to location-specific ROC analysis . . .

. . . only requires localization to within a quadrant . . .

. . . or some other unit . . .

Page 37: ROC Curves Tutorial

37

Region-of-interest (ROI) approach to location-specific ROC analysis . . .

- Disadvantages: “Does not correspond to the clinical task”. . . etc. . .

- Advantages: Straightforward to account for correlations w/o additional assumptions

- The most straightforward method is simply to resample using the patient as the statistical unit (see the sketch below)
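A minimal sketch of that patient-level resampling; the data layout (a list with one entry of ROI results per patient) and the `statistic` callback are hypothetical names used only for illustration.

```python
import numpy as np

def patient_bootstrap(roi_results_by_patient, statistic, n_boot=1000, seed=0):
    """roi_results_by_patient: list with one entry per patient (e.g., that patient's
    ROI scores and truth labels). Patients, not individual ROIs, are resampled with
    replacement, so correlations among ROIs within a patient are preserved."""
    rng = np.random.default_rng(seed)
    n_patients = len(roi_results_by_patient)
    values = []
    for _ in range(n_boot):
        pick = rng.integers(0, n_patients, n_patients)     # patients, with replacement
        resample = [roi_results_by_patient[i] for i in pick]
        values.append(statistic(resample))                 # e.g., ROI-level ROC area
    return np.mean(values), np.std(values, ddof=1)
```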

Page 38: ROC Curves Tutorial

38

THE PROBLEM OF UNCERTAINTY OF TRUTH STATE

Classic paper: Revesz, Kundel, Bonitatibus (1983) included various ways of obtaining panel consensus “truth”

Authors compared three imaging methods

Any one of the three could outperform the others – depending on rule used for reducing panel to truth

HOWEVER, TODAY THE TARGET IS AN “ACTIONABLE NODULE” ACCORDING TO AN EXPERT PANEL

Classic ref. above indicates additional uncertainty present

=> Resample panel to assess additional uncertainty

Page 39: ROC Curves Tutorial

39

UNCERTAINTY OF TRUTH STATE => UNCERTAINTY IN EFFECTIVE SAMPLE SIZE

Uncertainty in TPF depends on the # of actually diseased cases; uncertainty in FPF depends on the # of actually nondiseased cases

Uncertainty in total area under the ROC curve depends on the “effective number of cases”

Effective number = harmonic mean of the numbers in the two classes . . . & is a function of the panel sample
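A small worked example of the harmonic-mean “effective number”; the splits of 100 cases below are illustrative and anticipate the figure on the next slide.

```python
def effective_n(n_diseased, n_normal):
    # Harmonic mean of the numbers of cases in the two truth classes.
    return 2 * n_diseased * n_normal / (n_diseased + n_normal)

print(effective_n(50, 50))  # 50.0 -> an even split of 100 patients gives the largest effective number
print(effective_n(20, 80))  # 32.0 -> an uneven split of the same 100 patients is less efficient
```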

Page 40: ROC Curves Tutorial

40

Given: 100 patients – What is the best “split” between “normals” and “abnormals” for purposes of estimating area under ROC?

[Figure: effective number of cases for ROC area estimation (vertical axis, 0 to 60) vs. the number in the "normal" class (horizontal axis, 0 to 100); the effective number peaks at an even 50/50 split.]

Page 41: ROC Curves Tutorial

41

. . . relaxing panel criterion from unanimous to majority

- allows resampling to assess variability

- may increase effective number of samples

. . . these effects may tend to cancel

Page 42: ROC Curves Tutorial

42

THE PROBLEM OF CONTROLLING FOR READER VIGILANCE

Any measurement setting has artificial conditions vis-à-vis actual practice:

“Are readers more vigilant in unaided reading when they’re subjects in a study?”

“Are readers less vigilant in unaided reading when they’re not subjects in a study?”

One early suggestion: Control the time available to readers to mimic the clinic

(Chan et al., Invest. Radiol. 1990)

Page 43: ROC Curves Tutorial

43

IN SUMMARY

These points reflect the current status of on-going interactions

between and among

FDA

Academia

Industry sponsors

NCI and the LIDC

on the topic and issues for submissions like the present one

Page 44: ROC Curves Tutorial

44

Selected References

Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine 1978; 8: 283-298.

Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986; 21: 720-733.

Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989; 24: 234-245.

Metz CE. Fundamentals of ROC Analysis. [In] Handbook of Medical Imaging. Vol. 1. Physics and Psychophysics. Beutel J, Kundel HL, and Van Metter RL, Eds. SPIE Press (Bellingham WA 2000), Chapter 15: 751-769.

Swets JA and Pickett RM. Evaluation of Diagnostic Systems. Academic Press, New York, 1982.

Wagner RF, Beiden SV, Campbell G, Metz CE, and Sacks WM. Assessment of medical imaging and computer-assist systems: Lessons from recent experience. Acad Radiol 2002; 9: 1264-1277.

Wagner RF, Beiden SV, Campbell G, Metz CE, and Sacks WM. Contemporary issues for experimental design in assessment of medical imaging and computer-assist systems. Proc. of the SPIE-Medical Imaging 2003; 5034: 213-224.

Dodd LE, Wagner RF, Armato SG, McNitt-Gray MF, et al. Assessment methodologies and statistical issues for computer-aided diagnosis of lung nodules in computed tomography: Contemporary research topics relevant to the Lung Image Database Consortium. Acad Radiol (in print, Apr. 2004).

Page 45: ROC Curves Tutorial

45

Toledano AY, Gatsonis C. Ordinal regression methodology for ROC curves derived from correlated data. Statistics in Medicine 1996, 15: 1807-1826.

Nishikawa RM and Yarusso LM. Variations in measured performance of CAD schemes due to database composition and scoring protocol. Proc. of the SPIE 1998; 3338: 840-844.

Giger ML. Current issues in CAD for mammography. In: Doi K, Giger ML, Nishikawa RM, and Schmidt RA, Eds. Digital Mammography ’96. Elsevier Science B.V. 1996, 53-59.

Clarke LP, Croft BY, Staab E, Baker H, Sullivan DC. National Cancer Institute initiative: Lung image database resource for imaging research. Acad Radiol 2001; 8: 447-450.

Wagner RF, Beiden SV, Metz CE. Continuous versus categorical data for ROC analysis: Some quantitative considerations. Acad Radiol 2001; 8: 328-334.

Revesz G, Kundel HL, and Bonitatibus M. The effect of verification on the assessment of imaging techniques. Invest. Radiol. 1983; 18: 194-198.

Beiden SV, Wagner RF, Campbell G. Components-of-variance models and multiple-bootstrap experiments: An alternative method for random-effects receiver operating characteristic analysis. Acad Radiol 2000; 7: 341-349.

Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: Hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Acad Radiol 1995; 2 (Supplement 1): S22-S29.

Chan HP, Doi K, Vyborny CJ et al. Improvement in radiologists’ detection of clustered microcalcifications on mammograms. Invest Radiol 1990; 25: 1102.

Page 46: ROC Curves Tutorial

46

Chakraborty DP and Winter L. Free-response methodology: Alternate analysis and a new observer-performance experiment. Radiology 1990; 174: 873-881.

Metz CE, Starr SJ, Lusted LB. Observer performance in detecting multiple radiographic signals: prediction and analysis using a generalized ROC approach. Radiology 1976; 121: 337-347.

Starr SJ, Metz CE, Lusted LB, Goodenough DJ. Visual detection and localization of radiographic images. Radiology 1975; 116: 533-538.

Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Medical Physics 1996; 23: 1709-1725.

Chakraborty DP. The FROC, AFROC and DROC variants of the ROC analysis. [In] Handbook of Medical Imaging. Vol. 1. Physics and Psychophysics. Beutel J, Kundel HL, and Van Metter RL, Eds. SPIE Press (Bellingham WA 2000), Chapter 16: 771-796.

Obuchowski NA. Multireader receiver operating characteristic studies: A comparison of study designs. Acad Radiol 1995; 2: 709-716.

Gatsonis CA, Begg CB, Wieand S. Advances in Statistical Methods for Diagnostic Radiology: A Symposium. Acad Radiol 1995; 2 (Supplement 1): S1-S84 (the entire supplement is the Proceedings of the Symposium).

Beiden SV, Wagner RF, Doi K, Nishikawa RM, Freedman M, Lo S-C B, and Xu X-W. Independent versus sequential reading in ROC studies of computer-assist modalities: Analysis of components of variance. Acad Radiol 2002; 9: 1036-1043.

Page 47: ROC Curves Tutorial

47

Metz CE. Evaluation of CAD Methods. In: Doi K, MacMahon H, Giger ML, and Hoffmann KR, eds. Computer-Aided Diagnosis in Medical Imaging. Amsterdam: Elsevier Science B.V. (Excerpta Medica International Congress Series, Vol. 1182), 1999, 543-554.

Chakraborty DP. Statistical power in observer performance studies: Comparison of the receiver operating characteristic and free-response methods in tasks involving localization. Acad Radiol 2002; 9: 147-156.

Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992; 27: 723-731.

Chakraborty DP and Berbaum KS. Comparing Inter-Modality Diagnostic Accuracies in Tasks Involving Lesion Localization: A Jackknife AFROC Approach. Supplement to Radiology, Volume 225 (P), 259, 2002.

Obuchowski NA, Lieber ML, Powell KA. Data analysis for detection and localization of multiple abnormalities with application to mammography. Acad Radiol 2000; 7: 516-525.

Rutter CM. Bootstrap estimation of diagnostic accuracy with patient-clustered data. Acad Radiol 2000; 7: 413-419.

Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman and Hall, New York, 1993.

Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Arch Intern Med 1996; 156: 209-213.

Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K. Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 1999; 6: 22-33.