Lec14: Evaluation Framework for Medical Image Segmentation

MEDICAL IMAGE COMPUTING (CAP 5937) LECTURE 14: Evaluation Framework for Medical Image Segmentation Dr. Ulas Bagci HEC 221, Center for Research in Computer Vision (CRCV), University of Central Florida (UCF), Orlando, FL 32814. [email protected] or [email protected] 1 SPRING 2017

Page 2: Lec14: Evaluation Framework for Medical Image Segmentation

Outline
• How to evaluate accuracy of image segmentation?
  – Gold standard ~ surrogate of truth
  – Qualitative
    • Visual
    • Inter- and intra-observer agreement rates
  – Quantitative
    • Volumetric measurements (regression)
    • Region overlaps
    • Shape based measurements
    • Theoretical comparisons
    • STAPLE, uncertainty guidance, and evaluation w/o truths

Page 3: Lec14: Evaluation Framework for Medical Image Segmentation

Visual Assessment

Manual image segmentation from the full spectrum of IDEAL MRI data to delineate red: SAT, green: VAT, blue: liver, yellow: pancreas, purple: kidneys. Left to right: water-only, fat-only, in-phase, out-of-phase, fat fraction, and segmented labels from SliceOmatic.

Reference: Assessment of Abdominal Adiposity and Organ Fat with Magnetic Resonance Imaging (chp11).

Page 4: Lec14: Evaluation Framework for Medical Image Segmentation

Inherent Uncertainty


Comparison of glioblastoma multiforme (GBM) segmentation results on an axial slice: semi-automatic segmentation under Slicer (green, left image) and pure manual segmentation (blue, middle image). Egger et al., Nat Sci Rep., 2012.

Page 5: Lec14: Evaluation Framework for Medical Image Segmentation

Inherent Uncertainty

red: endocardium; green: epicardium; yellow: ground truth

Queiros et al., European Heart Journal, 2016.

Page 6: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation

Can be considered to consist of two components:

(1) Theoretical

Study mathematical equivalence among algorithms.

(2) Empirical

Study practical performance of algorithms in specific application domains.

6

Page 11: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Theoretical

Fundamental challenges in segmentation evaluation:

(Ch1) Are major pI (purely image-based) frameworks such as active contours, level sets, graph cuts, fuzzy connectedness, and watersheds truly distinct, or does some level of equivalence exist among them?

(Ch2) How to develop truly distinct methods constituting real advance?

(Ch3) How to choose a method for a given application domain?

(Ch4) How to set an algorithm optimally for an application domain?

Currently any method A can be shown empirically to be better than any method B, even when they are equivalent.

Page 12: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Theoretical

Attributes commonly used by segmentation methods:

(1) Connectedness
(2) Texture
(3) Smoothness of boundary
(4) Gradient / homogeneity
(5) Shape information about object
(6) Noise handling
(7) Optimization employed
(8) Orientedness of boundary

Page 14: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Theoretical

Attributes utilized by well-known delineation models

Method         Connected          Gradient           Texture            Smooth  Shape  Noise     Optimize
Fuzzy con      Yes                Gr = hom affinity  Obj feat affinity  No      No     Scale FC  In RFC
Chan-Vese      No                 No                 Yes                Yes     No     No        Yes
Mum-Shah       No                 No                 Yes                Yes     No     Yes       Yes
KWT snake      Boundary           Yes                No                 Yes     No     No        Yes
MSV LS         Fg when expanding  Yes                No                 No      No     No        No
Live wire      Boundary           Yes                Yes                Yes     User   No        Yes
Act. shape     Yes                No                 No                 No      Yes    No        Yes
Act. app       Yes                No                 Yes                No      Yes    No        Yes
Graph cut      Usually not        Yes                Possible           No      No     No        Yes
Clustering     No                 No                 Yes                No      No     No        Yes
Deep Learning  Yes                Yes                Yes                Yes     Yes    Yes       Yes

Page 15: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

Application domain: A particular triple $\langle T, B, P \rangle$.

T: A task. Example: Estimating the volume of brain.

B: A body region. Example: Head.

P: Imaging protocol. Example: T2 weighted MR imaging with a particular set of parameters.

Q: A set of scenes acquired for a particular application domain $\langle T, B, P \rangle$.

Page 16: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

The segmentation efficacy of a method M in an application domain $\langle T, B, P \rangle$ may be characterized by three groups of factors:

Precision (Reliability): Repeatability taking into account all subjective actions influencing the result.

Accuracy (Validity): Degree to which the result agrees with truth.

Efficiency (Viability): Practical viability of the method.

Page 17: Lec14: Evaluation Framework for Medical Image Segmentation

Validation of Image Segmentation
• Spectrum of accuracy versus realism in reference standard.
• Digital phantoms.
  – Ground truth known accurately.
  – Not so realistic.
• Acquisitions and careful segmentation.
  – Some uncertainty in ground truth.
  – More realistic.
• Autopsy/histopathology.
  – Addresses pathology directly; resolution.
• Clinical data?
  – Hard to know ground truth.
  – Most realistic model.

Slide Credit: N. Archip

Page 18: Lec14: Evaluation Framework for Medical Image Segmentation

Comparison To Higher Resolution

[Figure panels, left to right: MRI, Photograph, MRI]

Provided by Peter Ratiu and Florin Talos. Credit: N. Archip

Page 19: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

Precision

Repeatability taking into account all subjective actions that influence the segmentation result.

• Intra operator variations
• Inter operator variations
• Intra scanner variations
• Inter scanner variations

Inter scanner variations include variations due to the same brand and different brands.

Page 20: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

Precision

A measure of precision, $PR_M^{T_i}$, is defined for method M in a trial that produces $C_M^{O_1}$ and $C_M^{O_2}$ for situation $T_i$.

The situations $T_i$ are:
• Intra/inter operator
• Intra/inter scanner

$C_M^{O_1}$, $C_M^{O_2}$ may be binary or fuzzy segmentations.
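A minimal sketch of one way to quantify repeatability between two segmentations of the same object. It assumes an intersection-over-union overlap ratio, with voxel-wise min and max standing in for intersection and union so that fuzzy memberships are also handled. This is an illustrative choice and not necessarily the measure defined on the slide; the function name `precision_overlap` and the toy arrays are hypothetical.

```python
import numpy as np

def precision_overlap(seg1, seg2):
    """Overlap ratio |seg1 ∩ seg2| / |seg1 ∪ seg2| for binary or fuzzy masks.

    Intersection and union are taken as voxel-wise min and max, which reduce
    to logical AND and OR when both inputs are binary (assumption).
    """
    a = np.asarray(seg1, dtype=float)
    b = np.asarray(seg2, dtype=float)
    union = np.maximum(a, b).sum()
    if union == 0:
        return 1.0  # both empty: treated as perfect agreement (a convention)
    return float(np.minimum(a, b).sum() / union)

# Two repeated delineations of the same object by one operator (toy data).
trial1 = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0]])
trial2 = np.array([[0, 1, 1, 1],
                   [0, 1, 0, 0]])
pr_intra = precision_overlap(trial1, trial2)

# Fuzzy memberships in [0, 1] go through the same function.
pr_fuzzy = precision_overlap([0.5, 1.0], [1.0, 0.5])
```

The same function could be applied to repeated scans, provided the two segmentations are first brought into a common voxel grid.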

Page 21: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

Accuracy

The degree to which segmentations agree with true segmentation.

Surrogates of truth are needed.

For any image C acquired for application domain $\langle T, B, P \rangle$:

$C_M^O$ - segmentation of O in C by method M,

$C_{td}$ - surrogate of true delineation of O in C.

Page 22: Lec14: Evaluation Framework for Medical Image Segmentation

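[Figure: diagram of the reference super set $U_d$ containing the true segmentation $C_{td}$ and the segmentation $C_M^O$ by algorithm M, with the regions labeled TP, FP, FN, and TN.]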

Page 23: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

\[
FNVF_M^d = \frac{|C_{td} - C_M^O|}{|C_{td}|}, \qquad
TPVF_M^d = \frac{|C_{td} \cap C_M^O|}{|C_{td}|},
\]

\[
FPVF_M^d = \frac{|C_M^O - C_{td}|}{|U_d - C_{td}|}, \qquad
TNVF_M^d = \frac{|U_d - C_M^O - C_{td}|}{|U_d - C_{td}|}.
\]

$U_d$: A binary scene representing a reference super set (for example, this may be the body region that is imaged).

$FNVF_M^d$: Amount of tissue truly in O that is missed by M.

$FPVF_M^d$: Amount of tissue falsely delineated by M.
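The four volume fractions can be computed directly from binary masks. The sketch below assumes NumPy arrays of equal shape, a non-empty true delineation contained in the reference super set, and at least one background voxel; the function name `delineation_fractions` and the toy data are hypothetical.

```python
import numpy as np

def delineation_fractions(seg, truth, superset):
    """Return (TPVF, FNVF, FPVF, TNVF) for binary masks of equal shape."""
    superset = np.asarray(superset, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    # Restrict the segmentation to the reference super set U_d.
    seg = np.asarray(seg, dtype=bool) & superset
    n_truth = truth.sum()                      # |C_td|
    n_background = (superset & ~truth).sum()   # |U_d - C_td|
    tpvf = (truth & seg).sum() / n_truth
    fnvf = (truth & ~seg).sum() / n_truth
    fpvf = (seg & ~truth).sum() / n_background
    tnvf = (superset & ~seg & ~truth).sum() / n_background
    return tpvf, fnvf, fpvf, tnvf

# Toy 1-D scene with 10 voxels: truth covers voxels 0..3, M returns voxels 1..5.
superset = np.ones(10, dtype=bool)
truth = np.zeros(10, dtype=bool)
truth[0:4] = True
seg = np.zeros(10, dtype=bool)
seg[1:6] = True
tpvf, fnvf, fpvf, tnvf = delineation_fractions(seg, truth, superset)
```

Normalizing FP and TN by the background inside the super set, not by the object size, is what makes the two pairs of fractions each sum to one.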

Page 24: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

Requirements for accuracy metrics:

(1) Capture M’s behavior of trade-off between FP and FN.
(2) Satisfy laws of tissue conservation:

\[
FNVF_M^d = 1 - TPVF_M^d, \qquad FPVF_M^d = 1 - TNVF_M^d
\]

(3) Capable of characterizing the range of behavior of M.
(4) Any monotonic function g(FNVF, FPVF) is fine as a metric.
(5) Appropriate for $\langle T, B, P \rangle$.
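The tissue conservation laws follow from splitting each denominator into two disjoint parts. The short derivation below is a sketch that assumes the segmentation $C_M^O$ lies inside the reference super set $U_d$.

```latex
\begin{align*}
|C_{td}| &= |C_{td} \cap C_M^O| + |C_{td} - C_M^O|
  &&\Rightarrow\quad TPVF_M^d + FNVF_M^d = 1, \\
|U_d - C_{td}| &= |C_M^O - C_{td}| + |U_d - C_M^O - C_{td}|
  &&\Rightarrow\quad FPVF_M^d + TNVF_M^d = 1.
\end{align*}
```

Each implication is obtained by dividing the identity on the left by its left-hand side, so only two of the four fractions carry independent information.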

Page 25: Lec14: Evaluation Framework for Medical Image Segmentation

25

Segmentation Evaluation: Empirical

Page 26: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

Delineation Operating Characteristic

[Figure: DOC curve for brain WM segmentation in PD MR images; x-axis: FPVF, y-axis: 1-FNVF.]

Each value of parameter vector p of M gives a point on the DOC curve. The DOC curve characterizes the behavior of M over a range of parametric values of M.

$A_M$: Area under the DOC curve
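A DOC curve can be traced by sweeping a parameter and recording one (FPVF, 1-FNVF) point per setting. The sketch below assumes, for illustration only, that the parameter is a single threshold applied to a score map and that the area is estimated with the trapezoidal rule; `doc_curve`, `area_under_doc`, and the toy scores are hypothetical.

```python
import numpy as np

def doc_curve(score, truth, thresholds):
    """FPVF and TPVF (= 1 - FNVF) at each threshold of a score map."""
    truth = np.asarray(truth, dtype=bool)
    score = np.asarray(score, dtype=float)
    fpvf, tpvf = [], []
    for t in thresholds:
        seg = score >= t  # segmentation produced at this parameter value
        tpvf.append((seg & truth).sum() / truth.sum())
        fpvf.append((seg & ~truth).sum() / (~truth).sum())
    return np.array(fpvf), np.array(tpvf)

def area_under_doc(fpvf, tpvf):
    """Trapezoidal area under the curve after sorting points by FPVF."""
    order = np.lexsort((tpvf, fpvf))  # primary key FPVF, ties broken by TPVF
    x, y = fpvf[order], tpvf[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

# Toy data: 8 voxels, 4 of which truly belong to the object.
score = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
truth = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=bool)
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
fpvf, tpvf = doc_curve(score, truth, thresholds)
a_m = area_under_doc(fpvf, tpvf)
```

For a method with a multi-dimensional parameter vector, the loop would run over sampled vectors instead of scalar thresholds.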

Page 27: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

Optimally setting an algorithm for $\langle T, B, P \rangle$.

[Figure: DOC curve; x-axis: FPVF, y-axis: 1-FNVF, both ranging from 0 to 1.]

p - parameter vector for method M

$g_p(FPVF, FNVF)$ - monotonic fn

\[
p^* = \arg\min_p \left[ g_p(FPVF, FNVF) \right]
\]

Set M to operate at p*.
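Choosing the operating point amounts to a search over sampled parameter values. The sketch below assumes a grid of candidate parameters with precomputed FPVF and FNVF, and uses the Euclidean distance to the ideal point FPVF = FNVF = 0 as the monotonic function g. That cost is an assumption made for illustration; `distance_to_ideal` and `choose_operating_point` are hypothetical names.

```python
import numpy as np

def distance_to_ideal(fpvf, fnvf):
    """Assumed cost g: distance to the ideal point FPVF = FNVF = 0."""
    return np.hypot(fpvf, fnvf)

def choose_operating_point(params, fpvf, fnvf, g=distance_to_ideal):
    """Return (p*, g at p*) minimizing g(FPVF, FNVF) over sampled parameters."""
    cost = g(np.asarray(fpvf, dtype=float), np.asarray(fnvf, dtype=float))
    best = int(np.argmin(cost))
    return params[best], float(cost[best])

# One (FPVF, FNVF) pair per candidate threshold (toy values).
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
fpvf = [1.0, 0.5, 0.25, 0.0, 0.0]
fnvf = [0.0, 0.0, 0.25, 0.5, 1.0]
p_star, cost_star = choose_operating_point(thresholds, fpvf, fnvf)
```

A different g, for example one that penalizes false negatives more heavily, can be passed in when the application domain calls for it.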

Page 28: Lec14: Evaluation Framework for Medical Image Segmentation

Existing Segmentation Data

[Figure panels: Original Image, Expert 1, Expert 2, Expert 3, Expert 4]

• Manual segmentation performed by 4 independent experts
• Low grade glioma

Page 29: Lec14: Evaluation Framework for Medical Image Segmentation

Expert and Student Segmentations

29

Test image ? ?

? ?

Page 30: Lec14: Evaluation Framework for Medical Image Segmentation

Expert and Student Segmentations

30

Test image Expert consensus Student 1

Student 2 Student 3

Page 31: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

Efficiency

Describes practical viability of a method.

Four factors should be considered:

(1) Computational time for one-time training of M, $t_M^{c_1}$

(2) Computational time for segmenting each scene, $t_M^{c_2}$

(3) Human time for one-time training of M, $t_M^{h_1}$

(4) Human time for segmenting each scene, $t_M^{h_2}$

(2) and (4) are crucial. (4) determines the degree of automation of M.

Page 32: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: Empirical

Precision:
$PR_M^{T_1}$: intra operator
$PR_M^{T_2}$: inter operator
$PR_M^{T_3}$: intra scanner
$PR_M^{T_4}$: inter scanner

Accuracy:
$FNVF_M^d$: FN fraction for delineation
$FPVF_M^d$: FP fraction for delineation
$A_M$: Area under the DOC curve

Efficiency:
$t_M^{c_1}$: computational time for algorithm training
$t_M^{c_2}$: computational time for scene segmentation
$t_M^{h_1}$: operator time for algorithm training
$t_M^{h_2}$: operator time for scene segmentation

Page 33: Lec14: Evaluation Framework for Medical Image Segmentation

Remarks

(1) Precision, accuracy, efficiency are interdependent.

accuracy → efficiency.
precision and accuracy → difficult.

(2) “Automatic segmentation method” has no meaning unless the results are proven on a large number of data sets with acceptable precision, accuracy, efficiency, and with $t_M^{h_2} = 0$.

(3) A descriptive answer to “is method M1 better than M2 under $\langle T, B, P \rangle$?” in terms of the 11 parameters is more meaningful than a “yes” or “no” answer.

(4) DOC is essential to describe the range of behavior of M.

Page 34: Lec14: Evaluation Framework for Medical Image Segmentation

Velazquez et al., Scientific Reports, 2013.

Page 35: Lec14: Evaluation Framework for Medical Image Segmentation

Shape Based Metrics for Segmentation Evaluation

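[Figure: two segmentation examples, one with Sensitivity = 94.69% and Specificity = 94.19%, the other with Sensitivity = 72.99% and Specificity = 78.16%.]

If only the DSC (Dice similarity coefficient, an overlap measure) is used, the two examples receive similar DSC values, even though their sensitivity and specificity values differ.

Is an overlap measure alone sufficient?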
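The point that equal overlap scores can hide different error patterns is easy to reproduce on toy masks. The sketch below assumes binary NumPy masks; the two hypothetical segmentations are constructed so that both reach the same DSC while their sensitivity and specificity differ. The values are synthetic and are not those shown on the slide.

```python
import numpy as np

def overlap_metrics(seg, truth):
    """Dice similarity coefficient, sensitivity, and specificity for binary masks."""
    seg = np.asarray(seg, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = (seg & truth).sum()
    fp = (seg & ~truth).sum()
    fn = (~seg & truth).sum()
    tn = (~seg & ~truth).sum()
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# 100 voxels, the first 10 belong to the object.
truth = np.zeros(100, dtype=bool)
truth[:10] = True

over_seg = np.zeros(100, dtype=bool)   # covers the object plus 5 extra voxels
over_seg[:15] = True

shifted_seg = np.zeros(100, dtype=bool)  # same size as the object, shifted by 2
shifted_seg[2:12] = True

m_over = overlap_metrics(over_seg, truth)
m_shift = overlap_metrics(shifted_seg, truth)
```

Both masks score a DSC of 0.8, yet the first misses no object voxels while the second misses two, which is why a boundary-based or error-type-specific metric is a useful complement.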

Page 37: Lec14: Evaluation Framework for Medical Image Segmentation

Hausdorff Distance
• Can be used as a complementary evaluation metric to the overlap measure for measuring boundary mismatches!
• Lower Hausdorff Distance (HD), better segmentation accuracy!

\[
HD(A, B) = \max\left( \max_{a \in A} d_B(a),\; \max_{b \in B} d_A(b) \right)
\]

\[
d_B(a) = \min_{b \in B} d(a, b)
\]

is the distance of one point a on A from B.
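The definition translates almost line by line into array operations. The sketch below is a brute-force version for small boundary point sets, assuming Euclidean distance for d(a, b); the function name `hausdorff_distance` and the toy points are hypothetical, and the pairwise distance matrix makes it unsuitable for very large boundaries.

```python
import numpy as np

def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets (rows are points)."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances d(a, b), shape (len(a), len(b)).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    d_a_to_b = dist.min(axis=1)  # d_B(a) for every a in A
    d_b_to_a = dist.min(axis=0)  # d_A(b) for every b in B
    return float(max(d_a_to_b.max(), d_b_to_a.max()))

# Toy boundaries: B has one point far from every point of A.
boundary_a = [(0.0, 0.0), (1.0, 0.0)]
boundary_b = [(0.0, 0.0), (4.0, 0.0)]
hd = hausdorff_distance(boundary_a, boundary_b)
```

Because the outer operation is a maximum, a single outlying boundary point determines the result, which is why HD is sensitive to local boundary mismatches that overlap measures average away.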

Page 38: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: STAPLE

• STAPLE (Simultaneous Truth and Performance Level Estimation):
  – An algorithm for estimating performance and ground truth from a collection of independent segmentations.
  – Warfield, Zou, Wells, MICCAI 2002.
  – Warfield, Zou, Wells, IEEE TMI 2004.
  – Publicly available.
  – The STAPLE algorithm (Warfield et al., 2004) is a region formulation for producing consensus segmentations.
  – When foreground is small → weight w is small.
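To make the idea of jointly estimating a consensus and rater performance concrete, the sketch below implements a simplified binary expectation-maximization loop in the spirit of STAPLE. It is not the reference implementation by Warfield et al.: the global foreground prior taken from the mean of all decisions, the initial sensitivity and specificity of 0.9, the fixed iteration count, and the clipping constant are all assumptions, and `staple_binary` is a hypothetical name.

```python
import numpy as np

def staple_binary(decisions, n_iter=50, eps=1e-6):
    """Simplified binary STAPLE-style EM.

    decisions: (n_voxels, n_raters) array of 0/1 labels.
    Returns (w, p, q): per-voxel foreground probability, and per-rater
    sensitivity and specificity estimates.
    """
    d = np.asarray(decisions, dtype=float)
    prior = d.mean()                 # assumed global foreground prior
    p = np.full(d.shape[1], 0.9)     # initial sensitivities (assumption)
    q = np.full(d.shape[1], 0.9)     # initial specificities (assumption)
    for _ in range(n_iter):
        # E-step: posterior probability that each voxel is foreground.
        a = prior * np.prod(p ** d * (1 - p) ** (1 - d), axis=1)
        b = (1 - prior) * np.prod(q ** (1 - d) * (1 - q) ** d, axis=1)
        w = a / (a + b)
        # M-step: re-estimate rater performance from the soft consensus.
        p = (w[:, None] * d).sum(axis=0) / w.sum()
        q = ((1 - w)[:, None] * (1 - d)).sum(axis=0) / (1 - w).sum()
        p = np.clip(p, eps, 1 - eps)  # keep estimates away from 0 and 1
        q = np.clip(q, eps, 1 - eps)
    return w, p, q

# Toy data: 100 voxels, object in voxels 0..19; raters 1 and 2 are exact,
# rater 3 misses voxels 0..3 and adds voxels 20..25.
truth = np.zeros(100, dtype=int)
truth[:20] = 1
rater3 = truth.copy()
rater3[:4] = 0
rater3[20:26] = 1
decisions = np.stack([truth, truth, rater3], axis=1)

w, p, q = staple_binary(decisions)
consensus = (w > 0.5).astype(int)
```

In this toy case the consensus follows the two agreeing raters, and the third rater ends up with lower estimated sensitivity and specificity.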

Page 39: Lec14: Evaluation Framework for Medical Image Segmentation

Segmentation Evaluation: STAPLE
• Segmentations are generated by sampling independently at each voxel.
• However, the produced segmentations may not be realistic for two reasons.
  – First, the variability of the segmentation does not account for the intensity in the image, so borders with strong gradients are as variable as borders with weak gradients. This is counterintuitive, as the basic hypothesis of image segmentation is that changes of intensity are correlated with changes of labels.
  – Second, borders of the segmented structures are unrealistic, mainly due to their lack of geometric regularity.

Page 40: Lec14: Evaluation Framework for Medical Image Segmentation

Regression Analysis in Clinical Problems

• Linear regression between volume(s)
  – Automated segmentation’s volume vs. manual segmentation’s volume
  – Bland-Altman plot
• Linear regression between visual inspection (raters)
  – Kappa statistics
  – t-test / p-value
• Significantly different volumes? Score?
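A volume-agreement regression can be run in a few lines. The sketch below assumes paired volume measurements per subject, fits an ordinary least-squares line with `np.polyfit`, and reports the Pearson correlation; the function name `volume_regression` and the volumes (in mL) are hypothetical.

```python
import numpy as np

def volume_regression(manual, automated):
    """Least-squares line automated = slope * manual + intercept, plus Pearson r."""
    x = np.asarray(manual, dtype=float)
    y = np.asarray(automated, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(intercept), float(r)

manual_ml = [10.0, 20.0, 30.0, 40.0]      # hypothetical manual volumes
automated_ml = [12.0, 21.0, 33.0, 42.0]   # hypothetical automated volumes
slope, intercept, r = volume_regression(manual_ml, automated_ml)
```

A slope near 1 and an intercept near 0 indicate volumetric agreement; a high correlation alone does not, since a systematic over-estimate also correlates strongly.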

Page 41: Lec14: Evaluation Framework for Medical Image Segmentation

Regression Analysis in Clinical Problems

Manual segmentation

Vedentham et al., JCIS, 2014

Page 43: Lec14: Evaluation Framework for Medical Image Segmentation

What is Bland-Altman plot?
• A method of data plotting used in analyzing the agreement between two different assays.
• Claim: any two methods that are designed to measure the same parameter should have good correlation.
  – X-axis: mean of the two measurements
  – Y-axis: difference between the two values
• Good first step in analyzing the data!
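The quantities plotted in a Bland-Altman analysis can be computed as follows. The sketch assumes paired measurements from two methods and adds the conventional 95% limits of agreement (bias ± 1.96 standard deviations of the differences), which are standard for this plot but not stated on the slide; `bland_altman` and the volumes are hypothetical.

```python
import numpy as np

def bland_altman(method_a, method_b):
    """Return per-pair means and differences, the bias, and limits of agreement."""
    a = np.asarray(method_a, dtype=float)
    b = np.asarray(method_b, dtype=float)
    mean = (a + b) / 2.0    # x-axis: mean of the two measurements
    diff = a - b            # y-axis: difference between the two values
    bias = diff.mean()
    sd = diff.std(ddof=1)   # sample standard deviation of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return mean, diff, float(bias), limits

automated_ml = [12.0, 21.0, 33.0, 42.0]   # hypothetical automated volumes
manual_ml = [10.0, 20.0, 30.0, 40.0]      # hypothetical manual volumes
mean, diff, bias, limits = bland_altman(automated_ml, manual_ml)
```

Plotting `diff` against `mean` with horizontal lines at `bias` and the two limits gives the usual figure; a trend in the scatter suggests that disagreement depends on the size of the structure.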

Page 44: Lec14: Evaluation Framework for Medical Image Segmentation

Bland-Altman Plots (e.g., airway segmentation evaluation)

44

Xu, Bagci, et al. MedIA, 2015.

Page 46: Lec14: Evaluation Framework for Medical Image Segmentation

New Directions: Sampling Image Segmentations (Le et al, MedIA, 2016)

• Automatically produce plausible image segmentation samples from a single expert segmentation!

• A probability distribution of image segmentation boundaries is defined as a Gaussian process, which leads to segmentations that are spatially coherent and consistent with the presence of salient borders in the image.
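The general mechanism, sampling a level set function from a Gaussian process and thresholding it, can be illustrated in one dimension. The sketch below is not the method of Le et al.: it assumes a squared-exponential covariance that ignores image content, a signed distance mean that is positive inside the object, and a small diagonal jitter for numerical stability. All function names and parameter values are hypothetical.

```python
import numpy as np

def squared_exponential(x, length_scale, variance):
    """Squared-exponential covariance matrix on 1-D locations x (assumed kernel)."""
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

def sample_level_sets(mu, cov, n_samples, rng, jitter=1e-6):
    """Draw level set functions phi ~ N(mu, cov) via a Cholesky factor."""
    chol = np.linalg.cholesky(cov + jitter * np.eye(len(mu)))
    z = rng.standard_normal((len(mu), n_samples))
    return (mu[:, None] + chol @ z).T  # one sample per row

x = np.linspace(0.0, 10.0, 101)
# Signed distance to the interval [4, 6], positive inside (assumed convention).
mu = 1.0 - np.abs(x - 5.0)
cov = squared_exponential(x, length_scale=1.0, variance=0.25)
rng = np.random.default_rng(0)
phi = sample_level_sets(mu, cov, n_samples=5, rng=rng)
segmentations = phi > 0  # each row is one sampled 1-D segmentation
```

The correlation length of the covariance controls how smoothly the sampled boundaries vary, which is the property that makes such samples spatially coherent, unlike independent per-voxel sampling.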

Page 47: Lec14: Evaluation Framework for Medical Image Segmentation

The Gaussian Density

47

Page 48: Lec14: Evaluation Framework for Medical Image Segmentation

Remark: Gaussian Process (GP) ?

48

Credit: Ghahramani

Page 54: Lec14: Evaluation Framework for Medical Image Segmentation

Sample segmentation contours according to mean inter-sample dice coefficient!

(Top Left) Mean of the GP µ; (Top Middle) Sample of the level set function φ(a) drawn from 𝒢𝒫(µ, Σ); (Others) GPSSI samples. The ground truth is outlined in red, the GPSSI samples are outlined in orange.

Page 55: Lec14: Evaluation Framework for Medical Image Segmentation


New Directions: Sampling Image Segmentations (Le et al, MedIA, 2016)

(Left) Signed geodesic distance µ(a) of the ROI with isocontours –45, 0, 45, 100, 200. (Right) One can check that the samples most probably lie in the region delineated by the isocontours µ(a)=±45. The sampled contours are in orange.

Page 56: Lec14: Evaluation Framework for Medical Image Segmentation

New Directions: Sampling Image Segmentations (Le et al, MedIA, 2016)

56

Page 59: Lec14: Evaluation Framework for Medical Image Segmentation

Provocative Question?
• Can we evaluate segmentation error without the ground truth?
  – With machine learning support, can we design a classifier that LEARNS segmentation error and adapts itself for better delineation?

Page 60: Lec14: Evaluation Framework for Medical Image Segmentation

Summary
• Segmentation Evaluation
  – Theoretical vs. Empirical
  – Visual Assessment
  – Volumetric Agreement
  – Efficacy (efficiency, accuracy, …)
  – STAPLE
  – New Trends!
  – Segmentation Challenges (choose your project!)

Page 61: Lec14: Evaluation Framework for Medical Image Segmentation

Slide Credits and References
• Credits to: Jayaram K. Udupa of Univ. of Penn., MIPG
• Bagci’s CV Course 2015 Fall.
• K.D. Toennies, Guide to Medical Image Analysis.
• Handbook of Medical Imaging, Vol. 2. SPIE Press.
• Handbook of Biomedical Imaging, Paragios, Duncan, Ayache.
• Suetens, P., Medical Imaging, Cambridge Press.
• Neculai Archip, Ph.D.
• Simon K. Warfield, Ph.D. (See STAPLE Algorithm)