Evaluation of Image Segmentation · 2010-08-17 · Validation of Image Segmentation • Comparison...

Computational Radiology Laboratory Harvard Medical School www.crl.med.harvard.edu Children’s Hospital Department of Radiology Boston Massachusetts Evaluation of Image Segmentation Simon K. Warfield, Ph.D. Associate Professor of Radiology Harvard Medical School

ComputationalRadiologyLaboratory. Slide 2

Segmentation

•  Segmentation
   –  Identification of structure in images.
   –  Many different algorithms and a wide range of principles upon which they are based.
•  Segmentation is used for:
   –  Quantitative image analysis
   –  Image guided therapy
   –  Visualization
•  Evaluation: How to know when we have a good segmentation?

ComputationalRadiologyLaboratory. Slide 3

Validation of Image Segmentation

•  Spectrum of accuracy versus realism in reference standard.
•  Digital phantoms.
   –  Ground truth known accurately.
   –  Not so realistic.
•  Acquisitions and careful segmentation.
   –  Some uncertainty in ground truth.
   –  More realistic.
•  Autopsy/histopathology.
   –  Addresses pathology directly; resolution.
•  Clinical data?
   –  Hard to know ground truth.
   –  Most realistic model.

ComputationalRadiologyLaboratory. Slide 4

Validation of Image Segmentation

•  Comparison to digital and physical phantoms:
   –  Excellent for testing the anatomy, noise and artifact which is modeled.
   –  Typically lacks range of normal or pathological variability encountered in practice.

MRI of brain phantom from Styner et al. IEEE TMI 2000

ComputationalRadiologyLaboratory. Slide 5

Comparison To Higher Resolution

[Figure panels: MRI, Photograph, MRI]

Provided by Peter Ratiu and Florin Talos.

ComputationalRadiologyLaboratory. Slide 6

Comparison To Higher Resolution

[Figure panels: Photograph, MRI, Photograph, Microscopy]

Provided by Peter Ratiu and Florin Talos.

ComputationalRadiologyLaboratory. Slide 7

Comparison to Autopsy Data

•  Neonate gyrification index
   –  Ratio of length of cortical boundary to length of smooth contour enclosing brain surface
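The ratio defined above can be illustrated on a toy 2D contour. This is only a sketch: the function names `perimeter` and `gyrification_index` and the polygon coordinates are hypothetical, and a real measurement would use contours traced on MRI or autopsy sections rather than hand-written polygons.

```python
import math


def perimeter(points):
    """Length of a closed polygon given as a list of (x, y) vertices."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def gyrification_index(cortical_contour, outer_contour):
    """Cortical boundary length divided by the smooth enclosing contour length."""
    return perimeter(cortical_contour) / perimeter(outer_contour)


# Smooth enclosing contour: a 4 x 4 square (perimeter 16).
outer = [(0, 0), (4, 0), (4, 4), (0, 4)]
# "Folded" contour: the same square with one sulcus-like notch
# (width 1, depth 2) cut into its top edge (perimeter 20).
folded = [(0, 0), (4, 0), (4, 4), (2.5, 4), (2.5, 2), (1.5, 2), (1.5, 4), (0, 4)]

gi = gyrification_index(folded, outer)
print(gi)  # 1.25
```

A contour with no folding gives an index of 1; deeper or more numerous folds lengthen the cortical boundary and raise the index.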

ComputationalRadiologyLaboratory. Slide 8

Staging

[Figure panels: Stage 3, Stage 5, Stage 4, Stage 6]

Stage 3: at 28 weeks gestational age (w GA), shallow indentations of inf. frontal and sup. temp. gyrus (1 infant at 30.6 w GA; normal range: 28.6 ± 0.5 w GA)

Stage 4: at 30 w GA, 2 indentations divide front. lobe into 3 areas; sup. temp. gyrus clearly detectable (3 infants, 30.6 ± 0.4 w GA; normal range: 29.9 ± 0.3 w GA)

Stage 5: at 32 w GA, frontal lobe clearly divided into three parts: sup., middle and inf. frontal gyrus (4 infants, 32.1 ± 0.7 w GA; normal range: 31.6 ± 0.6 w GA)

Stage 6: at 34 w GA, temporal lobe clearly divided into 3 parts: sup., middle and inf. temporal gyrus (8 infants, 33.5 ± 0.5 w GA; normal range: 33.8 ± 0.7 w GA)

“Assessment of cortical gyrus and sulcus formation using MR images in normal fetuses”, Abe S. et al., Prenatal Diagn 2003

ComputationalRadiologyLaboratory. Slide 9

Neonate GI: MRI Vs Autopsy

ComputationalRadiologyLaboratory. Slide 10

GI Increase Is Proportional to Change in Age.

ComputationalRadiologyLaboratory. Slide 11

GI Versus Qualitative Staging

ComputationalRadiologyLaboratory. Slide 12

Neonate Gyrification

ComputationalRadiologyLaboratory. Slide 13

Validation of Image Segmentation

•  STAPLE (Simultaneous Truth and Performance Level Estimation):
   –  An algorithm for estimating performance and ground truth from a collection of independent segmentations.
   –  Warfield, Zou, Wells, IEEE TMI 2004.
   –  Warfield, Zou, Wells, PTRSA 2008.
   –  Commowick and Warfield, IEEE TMI 2010.

ComputationalRadiologyLaboratory. Slide 14

Validation of Image Segmentation

•  Comparison to expert performance; to other algorithms.
•  Why compare to experts?
   –  Experts are currently doing the segmentation tasks that we seek algorithms for:
      •  Surgical planning.
      •  Neuroscience research.
      •  Response to therapy assessment.
•  What is the appropriate measure for such comparisons?

ComputationalRadiologyLaboratory. Slide 15

Measures of Expert Performance

•  Repeated measures of volume
   –  Intra-class correlation coefficient
•  Spatial overlap
   –  Jaccard: Area of intersection over union.
   –  Dice: increased weight of intersection.
   –  Vote counting: majority rule, etc.
•  Boundary measures
   –  Hausdorff, 95% Hausdorff.
•  Bland-Altman methodology:
   –  Requires a reference standard.
•  Measures of correct classification rate:
   –  Sensitivity, specificity ( Pr(D=1|T=1), Pr(D=0|T=0) )
   –  Positive predictive value and negative predictive value (posterior probabilities Pr(T=1|D=1), Pr(T=0|D=0) )
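The overlap and classification-rate measures listed above all derive from the same four voxel counts. A minimal sketch, assuming each segmentation is a flat list of 0/1 labels and that a reference standard T is available; the function names and the toy data are hypothetical, and empty classes (zero denominators) are not handled.

```python
def confusion_counts(decision, truth):
    """Count true/false positives and negatives for 0/1 label lists."""
    tp = sum(1 for d, t in zip(decision, truth) if d == 1 and t == 1)
    fp = sum(1 for d, t in zip(decision, truth) if d == 1 and t == 0)
    fn = sum(1 for d, t in zip(decision, truth) if d == 0 and t == 1)
    tn = sum(1 for d, t in zip(decision, truth) if d == 0 and t == 0)
    return tp, fp, fn, tn


def overlap_and_rates(decision, truth):
    """Spatial overlap and classification-rate measures for one rater."""
    tp, fp, fn, tn = confusion_counts(decision, truth)
    return {
        "jaccard": tp / (tp + fp + fn),      # intersection over union
        "dice": 2 * tp / (2 * tp + fp + fn),  # intersection counted twice
        "sensitivity": tp / (tp + fn),        # Pr(D=1 | T=1)
        "specificity": tn / (tn + fp),        # Pr(D=0 | T=0)
        "ppv": tp / (tp + fp),                # Pr(T=1 | D=1)
        "npv": tn / (tn + fn),                # Pr(T=0 | D=0)
    }


truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
decision = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # one miss, one false positive
measures = overlap_and_rates(decision, truth)
print(measures["jaccard"], measures["dice"])  # 0.6 0.75
```

On the same pair of segmentations Dice is never smaller than Jaccard, which is the "increased weight of intersection" noted in the list.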

ComputationalRadiologyLaboratory. Slide 16

Measures of Expert Performance

•  Our new approach:
   •  Simultaneous estimation of hidden “ground truth” and expert performance.
   •  Enables comparison between and to experts.
   •  Can be easily applied to clinical data exhibiting range of normal and pathological variability.

ComputationalRadiologyLaboratory. Slide 17

How to judge segmentations of the peripheral zone?

[Figure panels: 1.5T MR of prostate; Peripheral zone and segmentations]

ComputationalRadiologyLaboratory. Slide 18

Estimation Problem

•  Complete data density: $f(D, T \mid p, q)$
•  Binary ground truth Ti for each voxel i.
•  Expert j makes segmentation decisions Dij.
•  Expert performance characterized by sensitivity p and specificity q.
   –  We observe expert decisions D. If we knew ground truth T, we could construct maximum likelihood estimates for each expert’s sensitivity (true positive fraction) and specificity (true negative fraction):
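With the ground truth known, these maximum likelihood estimates reduce to simple counting. A sketch in the slide's notation, treating Dij and Ti as 0/1 indicators (the hat notation and the indicator-sum form are choices made here for illustration):

```latex
\hat{p}_j = \frac{\sum_i D_{ij}\, T_i}{\sum_i T_i},
\qquad
\hat{q}_j = \frac{\sum_i (1 - D_{ij})(1 - T_i)}{\sum_i (1 - T_i)}
```

Here the numerator of the first ratio counts voxels that expert j labelled foreground and that truly are foreground, and the second ratio does the same for background.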

ComputationalRadiologyLaboratory. Slide 19

Expectation-Maximization

•  General procedure for estimation problems that would be simplified if some missing data was available.
•  Key requirements are specification of:
   –  The complete data.
   –  Conditional probability density of the hidden data given the observed data.
•  Observable data D
•  Hidden data T, prob. density $f(T \mid D, \hat{\theta})$
•  Complete data (D,T)

ComputationalRadiologyLaboratory. Slide 20

Expectation-Maximization

•  Solve the incomplete-data log likelihood maximization problem
•  E-step: estimate the conditional expectation of the complete-data log likelihood function.
   $Q(\theta \mid \hat{\theta}) = E\left[ \ln f(D, T \mid \theta) \mid D, \hat{\theta} \right]$
•  M-step: estimate parameter values
   $\arg\max_{\theta} Q(\theta \mid \hat{\theta})$

ComputationalRadiologyLaboratory. Slide 21

Expectation-Maximization

•  Since we don’t know ground truth T, treat T as a random variable, and solve for the expert performance parameters that maximize:
   $Q(\theta \mid \hat{\theta}) = E\left[ \ln f(D, T \mid \theta) \mid D, \hat{\theta} \right]$
•  Parameter values $\theta_j = [p_j \; q_j]^T$ that maximize the conditional expectation of the log-likelihood function are found by iterating two steps:
   –  E-step: Estimate probability of hidden ground truth T given a previous estimate of the expert quality parameters, and take the expectation.
   –  M-step: Estimate expert performance parameters by comparing D to the current estimate of T.

ComputationalRadiologyLaboratory. Slide 22

STAPLE

•  Consider binary labels:
   –  foreground.
   –  background.
•  Spatial correlation of the unknown true segmentation can be modelled with a Markov Random Field.

ComputationalRadiologyLaboratory. Slide 23

To Solve for Expert Parameters:

ComputationalRadiologyLaboratory. Slide 24

True Segmentation Estimate
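For the binary case, the E-step estimate of the hidden truth can be written in closed form. The following is a sketch consistent with the model described earlier (experts making conditionally independent decisions with sensitivity pj and specificity qj). The symbols W_i, a_i, b_i and the prior f(T_i = 1) are notation assumed here, and the hats denote the previous parameter estimates:

```latex
\begin{aligned}
W_i &= \Pr(T_i = 1 \mid D_i, \hat{p}, \hat{q}) = \frac{a_i}{a_i + b_i}, \\
a_i &= f(T_i = 1) \prod_{j : D_{ij} = 1} \hat{p}_j \prod_{j : D_{ij} = 0} (1 - \hat{p}_j), \\
b_i &= f(T_i = 0) \prod_{j : D_{ij} = 0} \hat{q}_j \prod_{j : D_{ij} = 1} (1 - \hat{q}_j)
\end{aligned}
```

Each W_i is a probability that voxel i is foreground, so the estimate of the true segmentation is probabilistic; thresholding W_i at 0.5 gives a binary estimate.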

ComputationalRadiologyLaboratory. Slide 25

Expert Performance Estimate

Now we seek an expression for the conditional expectation of the complete-data log likelihood function that we can maximize.

ComputationalRadiologyLaboratory. Slide 26

Expert Performance Estimate

Now, consider each expert separately:

Differentiate this with respect to pj and qj, set the derivatives to zero, and solve.

ComputationalRadiologyLaboratory. Slide 27

Expert Performance Estimate

p (sensitivity, true positive fraction): ratio of expert-identified class 1 to total class 1 in the image.

q (specificity, true negative fraction): ratio of expert-identified class 0 to total class 0 in the image.
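The two EM steps can be put together in a short binary sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `staple_binary`, the single global prior (defaulting to the mean foreground fraction across raters), the 0.99 initial values, and the stopping tolerance are all choices made here. No MRF prior is included. The M-step uses the ratios described above, with the unknown class membership replaced by the probabilities W from the E-step.

```python
def staple_binary(decisions, prior=None, max_iter=100, tol=1e-8):
    """EM sketch for binary STAPLE.

    decisions[j][i] is rater j's 0/1 label for voxel i.
    Returns (w, p, q): per-voxel foreground probabilities and
    per-rater sensitivity and specificity estimates.
    """
    n_raters, n_voxels = len(decisions), len(decisions[0])
    if prior is None:
        # Assumption: one global prior Pr(T_i = 1), the mean foreground fraction.
        prior = sum(map(sum, decisions)) / (n_raters * n_voxels)
    p = [0.99] * n_raters  # assumed initial sensitivities
    q = [0.99] * n_raters  # assumed initial specificities
    w = [prior] * n_voxels
    for _ in range(max_iter):
        # E-step: probability that each voxel is truly foreground.
        for i in range(n_voxels):
            a, b = prior, 1.0 - prior
            for j in range(n_raters):
                if decisions[j][i] == 1:
                    a *= p[j]
                    b *= 1.0 - q[j]
                else:
                    a *= 1.0 - p[j]
                    b *= q[j]
            w[i] = a / (a + b) if a + b > 0 else prior
        # M-step: weighted true positive and true negative fractions.
        total_fg = sum(w)
        total_bg = sum(1.0 - wi for wi in w)
        new_p, new_q = [], []
        for j in range(n_raters):
            tp = sum(w[i] for i in range(n_voxels) if decisions[j][i] == 1)
            tn = sum(1.0 - w[i] for i in range(n_voxels) if decisions[j][i] == 0)
            # min() guards against values just above 1 from rounding.
            new_p.append(min(1.0, tp / total_fg))
            new_q.append(min(1.0, tn / total_bg))
        change = max(abs(x - y) for x, y in zip(new_p + new_q, p + q))
        p, q = new_p, new_q
        if change < tol:
            break
    return w, p, q


# Toy data: 10 foreground voxels followed by 10 background voxels.
truth = [1] * 10 + [0] * 10
rater_a = list(truth)   # agrees with the toy truth everywhere
rater_b = list(truth)
rater_b[0] = 0          # misses one foreground voxel
rater_c = list(truth)
rater_c[19] = 1         # adds one false positive
w, p, q = staple_binary([rater_a, rater_b, rater_c])
estimate = [1 if wi >= 0.5 else 0 for wi in w]
```

On this toy input the thresholded estimate matches the constructed truth, and the rater who misses one of ten foreground voxels receives a sensitivity estimate close to 0.9.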

ComputationalRadiologyLaboratory. Slide 28

Extension to Several Tissue Labels

•  Complete data density: $f(D, T \mid \theta)$
•  True segmentation Ti for each voxel i
   –  May be binary
   –  May be categorical
•  Expert j makes segmentation decisions Dij
•  Expert performance $\theta_{s's}$ characterizes the probability of deciding label s’ when the true label is s.

ComputationalRadiologyLaboratory. Slide 29

Probability Estimate of True Labels

ComputationalRadiologyLaboratory. Slide 30

Expert Performance Estimate

Now, consider each expert separately:

Note constraint on sum of parameters. Solve for maximum.

ComputationalRadiologyLaboratory. Slide 31

Parameter Estimation

Noting that

We can formulate the constrained optimization problem:

ComputationalRadiologyLaboratory. Slide 32

Parameter Estimation

Therefore

And noting that

We find that
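As a hedged sketch of the multi-label M-step, written from the definitions above rather than from the slide's own equation: for one expert, the entry for deciding label s' when the true label is s can be estimated as the probability-weighted fraction of label-s voxels that the expert marked s'. The function name `estimate_confusion`, the nested-list layout, and the toy data are assumptions made for illustration. A label with zero total probability would need a guard that is omitted here.

```python
def estimate_confusion(decisions_j, w, n_labels):
    """M-step sketch for one expert with several labels.

    decisions_j[i] is the expert's label for voxel i.
    w[i][s] is the current probability that the true label of voxel i is s.
    Returns theta with theta[s_prime][s] estimating
    Pr(expert decides s_prime | true label is s).
    """
    theta = [[0.0] * n_labels for _ in range(n_labels)]
    for s in range(n_labels):
        total = sum(w_i[s] for w_i in w)  # expected number of voxels with label s
        for i, d in enumerate(decisions_j):
            theta[d][s] += w[i][s] / total
    return theta


# Toy example with three labels and a one-hot (fully certain) truth estimate.
true_labels = [0, 0, 1, 1, 2, 2]
w = [[1.0 if s == t else 0.0 for s in range(3)] for t in true_labels]
decisions_j = [0, 1, 1, 1, 2, 0]
theta = estimate_confusion(decisions_j, w, 3)
column_sums = [sum(theta[sp][s] for sp in range(3)) for s in range(3)]
```

Each column of `theta` sums to one, which is the constraint on the sum of parameters noted earlier: for a given true label, the probabilities over the expert's possible decisions form a distribution.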

ComputationalRadiologyLaboratory. Slide 33

Results: Synthetic Experts

•  Several experiments with known ground truth and known performance parameters.
•  Goal:
   –  Determine if STAPLE accurately identifies known ground truth.
   –  Determine if STAPLE accurately determines known expert performance parameters.
   –  Understand sensitivity of STAPLE with respect to changes in prior hyper-parameters; requirements for number of observations to enable good estimation; convergence characteristics.
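One way such synthetic experts might be generated is to flip each voxel of a known ground truth independently, keeping foreground with probability p and background with probability q. This is a sketch under that assumption; the slides do not state how the synthetic segmentations were produced, and the function names, image size, and random seed are hypothetical.

```python
import random


def simulate_rater(truth, p, q, rng):
    """Noisy copy of truth: foreground kept with prob. p, background with prob. q."""
    labels = []
    for t in truth:
        if t == 1:
            labels.append(1 if rng.random() < p else 0)
        else:
            labels.append(0 if rng.random() < q else 1)
    return labels


def empirical_rates(labels, truth):
    """Observed sensitivity and specificity of one rater against known truth."""
    fg = [d for d, t in zip(labels, truth) if t == 1]
    bg = [d for d, t in zip(labels, truth) if t == 0]
    sensitivity = sum(fg) / len(fg)
    specificity = (len(bg) - sum(bg)) / len(bg)
    return sensitivity, specificity


rng = random.Random(0)                 # fixed seed for repeatability
truth = [1] * 5000 + [0] * 5000        # known ground truth
raters = [simulate_rater(truth, 0.95, 0.90, rng) for _ in range(10)]
rates = [empirical_rates(r, truth) for r in raters]
```

Because the true p and q are known by construction, the observed rates give a direct check on any performance estimates computed from the simulated segmentations.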

ComputationalRadiologyLaboratory. Slide 34

Synthetic Experts

10 observations of segmentation by expert with p=q=0.99

Four segmentations of ten shown. STAPLE ground truth.

STAPLE p,q estimates:
      mean       std. dev
p     0.990237   0.000616
q     0.990121   0.00071

ComputationalRadiologyLaboratory. Slide 35

Synthetic Experts

10 segmentations by experts with p=0.95, q=0.90

Four segmentations of ten shown. STAPLE ground truth.

STAPLE p,q estimates:
      mean       std. dev
p     0.950104   0.001201
q     0.900035   0.001685

ComputationalRadiologyLaboratory. Slide 36

Expert and Student Segmentations

[Figure panels: Test image, Expert consensus, Student 1, Student 2, Student 3]

ComputationalRadiologyLaboratory. Slide 37

Phantom Segmentation

[Figure panels: Image, Expert, Students, Voting, STAPLE; Image, Expert segmentation, Student segmentations]

ComputationalRadiologyLaboratory. Slide 38

Prostate Peripheral Zone

[Figure panels: Frequency of selection by experts; STAPLE truth estimate]

Expert j   1      2      3      4      5
pj         .879   .991   .937   .918   .895
qj         .998   .994   .999   .999   .999
Dice       .913   .951   .967   .955   .944

ComputationalRadiologyLaboratory. Slide 39

A Binary MRF Model for Spatial Homogeneity

Include a prior probability for the neighborhood configuration:

ComputationalRadiologyLaboratory. Slide 40

MAP Estimation With MRF Prior

ComputationalRadiologyLaboratory. Slide 41

Synthetic Experts

Only three segmentations by different quality experts.

[Figure panels: p=0.95,q=0.95; p=0.95,q=0.90; p=0.90,q=0.90; STAPLE ground truth, with MRF prior]

STAPLE p,q estimates:
Expert   true p, q     estimated p, q
1        0.95, 0.95    0.9505, 0.9494
2        0.95, 0.90    0.9511, 0.8987
3        0.90, 0.90    0.9000, 0.8987

ComputationalRadiologyLaboratory. Slide 42

Cryoablation of Kidney Tumor

Segmentations before training session with radiologist:

[Figure panels: Rater frequency; STAPLE with MRF]

After training session:

Based on the STAPLE performance assessment, we found the training session created a statistically significant increase in performance of the raters.

ComputationalRadiologyLaboratory. Slide 43

Newborn MRI Segmentation

ComputationalRadiologyLaboratory. Slide 44

Newborn MRI Segmentation

Summary of segmentation quality (posterior probability Pr(T=t|D=t) ) for each tissue type for repeated manual segmentations.

ComputationalRadiologyLaboratory. Slide 45

STAPLE Summary

•  Key advantages of STAPLE:
   –  Estimates “true” segmentation.
   –  Assesses expert performance.
•  Principled mechanism which enables:
   –  Comparison of different experts.
   –  Comparison of algorithm and experts.
•  Extensions for the future:
   –  Can we learn image features that lead to different levels of expert performance?

ComputationalRadiologyLaboratory. Slide 46

Acknowledgements

Colleagues contributing to this work:
•  Neil Weisenfeld.
•  Andrea Mewes.
•  Petra Huppi.
•  Olivier Clatz.
•  William Wells.
•  Olivier Commowick.
•  Arne Hans.
•  Heidelise Als.
•  Lianne Woodward.
•  Frank Duffy.
•  Kelly Zou.

This study was supported by: Center for the Integration of Medicine and Innovative Technology, R01 RR021885, R01 GM074068 and R01 HD046855.