

Temporally Consistent Probabilistic Detection of New Multiple Sclerosis Lesions in Brain MRI

Colm Elliott, Douglas L. Arnold, D. Louis Collins, Member, IEEE, and Tal Arbel, Member, IEEE

Abstract—Detection of new Multiple Sclerosis (MS) lesions on MRI is important as a marker of disease activity and as a potential surrogate for relapses. We propose an approach where sequential scans are jointly segmented, to provide a temporally consistent tissue segmentation while remaining sensitive to newly appearing lesions. The method uses a two-stage classification process: 1) a Bayesian classifier provides a probabilistic brain tissue classification at each voxel of reference and follow-up scans, and 2) a random-forest based lesion-level classification provides a final identification of new lesions. Generative models are learned based on 364 scans from 95 subjects from a multi-center clinical trial. The method is evaluated on sequential brain MRI of 160 subjects from a separate multi-center clinical trial, and is compared to 1) semi-automatically generated ground truth segmentations and 2) fully manual identification of new lesions generated independently by 9 expert raters on a subset of 60 subjects. For new lesions greater than 0.15cc in size, the classifier has near perfect performance (99% sensitivity, 2% false detection rate), as compared to ground truth. The proposed method was also shown to exceed the performance of any one of the 9 expert manual identifications.

Index Terms—Multiple Sclerosis, New Lesion Segmentation, Change Detection, Bayesian Inference, Machine Learning, Subtraction Imaging

I. INTRODUCTION

Multiple Sclerosis (MS) is an inflammatory disease of the central nervous system (CNS) that mostly begins in young adulthood and is characterized by a wide range of symptoms. There is presently no known cure for MS. Magnetic Resonance Imaging (MRI) has been used both as a diagnostic and monitoring tool for MS, and longitudinal MRI sequences are important for assessing disease activity and evaluating treatment efficacy [1]–[7]. One of the hallmarks of MS on MRI is the appearance of hyperintense lesions on T2-weighted MRI. The appearance of new MS lesions on MRI is used as

C. Elliott and T. Arbel are with the Centre for Intelligent Machines, McGill University, Montreal, QC, H3A 0E9, Canada (e-mail: [email protected]; [email protected]).

D. L. Arnold is with NeuroRx Research, 3575 Avenue du Parc, Suite #5322, Montreal, QC H2X 4B3, Canada (e-mail: [email protected]).

D. L. Collins is with the Montreal Neurological Institute, 3801 University St., McGill University, Montreal, QC H3A 2B4, Canada (e-mail: [email protected]).

This work was supported by a Canadian National Science and Engineering Research Council Strategic Grant (STPGP 350547-07) and a Canadian National Science and Engineering Research Council collaborative Research and Development Grant (CRDPJ 411455-10).

The authors gratefully acknowledge Josefina Maranzano (NeuroRx Research) for expert segmentation and review of new MS lesions.

Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

(a) Ref. FLAIR (b) Fol.Up FLAIR (c) GT New Lesion

Fig. 1. Examples of new MS lesions. (a) and (b) show one supraventricular axial slice of a FLAIR image at reference and follow-up. (c) shows new MS lesions as per our ground truth (GT) segmentation (shown in orange). While one new lesion is very large and obvious, most others are relatively small. The blue arrows show lesions that are not obvious by looking solely at the follow-up image. The green arrow shows a new lesion that overlaps with a lesion already existing at reference.

an outcome measure in clinical trials and as a marker for relapses [8], [9]. Examples of new MS lesions are shown in Fig. 1. While some new lesions are large and obvious, many are quite small (as small as 3 voxels) and subtle. Some are very difficult to identify in the follow-up scan without comparing to the reference scan, while small enlargements or re-inflammation of existing lesions can mimic misregistration.

Manual segmentation of new MS lesions, as is often done in clinical practice, suffers from high intra-rater and inter-rater variability, due both to the ambiguity of lesion boundaries and of the exact time of transition from healthy tissue to lesion [4], [6], [7], [10], [11]. Manual segmentation is also impractical for large clinical data sets that may contain thousands of images, and can be less sensitive than automated methods, especially for smaller lesions [12].

Multiple approaches exist for automatic detection of MS lesions for a single MRI timepoint [10], [13]–[18]. While automatic segmentation of MS lesions can improve reproducibility as compared to manual segmentations, automated segmentation of lesions is generally imperfect and measurement variability introduced by different MRI acquisitions is still significant [5], [7]. This variability leads to temporally imprecise lesion segmentation when independently considering sequential scans of the same subject: the inference of new lesion may be the result of inconsistent measurements across longitudinal scans rather than due to anatomical and physiological change. Figure 2 illustrates the difficulty in identifying new lesions based on two independent segmentations of successive scans of the same patient. Single timepoint approaches also do not incorporate any temporal information,

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.The final version of record is available at http://dx.doi.org/10.1109/TMI.2013.2258403

Copyright (c) 2013 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].


(a) Ref. FLAIR (b) Fol.Up FLAIR (c) GT New Lesion

(d) Ref. Lesion (e) Fol.Up Lesion (f) New Lesion

Fig. 2. Imprecise new lesion identification based on segmentations done independently at each timepoint. (a)-(b) show reference and follow-up FLAIR images while (c) shows ground truth (GT) new lesion segmentation in orange (also shown with arrow). (d)-(e) show automatic lesion segmentations (in red) using a Bayesian approach, but where reference and follow-up are considered independently. (f) shows the resultant identification of new lesion (in blue) based on the difference of the two segmentations in (d) and (e). Although (d) and (e) can both be considered reasonable lesion segmentations, segmentation variability across timepoints leads to imprecise measurement of new lesions.

which can provide additional evidence of new lesion activity.

Several approaches have been designed specifically for automatic detection of MS lesions in longitudinal MRI [12], [19]–[24]. However, many of these use a time-series approach that requires several timepoints (in general 6 or more) at relatively high temporal frequency (monthly or less). Access to monthly scans may not always be practical and while the requirement for many timepoints may be suitable for a retrospective analysis, many contexts, such as clinical trials, require on-going analysis of treatment efficacy.

Other automated methods consider reference and follow-up scan pairs rather than a full time-series, and pose the problem of new lesion detection as one of change detection. One class of algorithms performs non-linear registration between successive scans and identifies lesion activity as contractions or dilations of the computed deformation field [19], [22]. Such methods provide nice models of lesion growth and contraction, but are not well suited to detection of small new lesions. In [12], a statistical change detection approach that looks at 3x3x3 voxel patches of non-linearly coregistered sequential scan pairs is used. This approach showed higher sensitivity to lesion evolution than a manual rater, but in addition to detecting new lesion also detected change attributable to ventricular expansion and contraction, and falsely detected change in sulcal cerebro-spinal fluid.

Subtraction imaging, where intensity differences between reference and follow-up scans are considered rather than native intensities, has been used in clinical settings to improve

(a) Ref. FLAIR (b) Fol.Up FLAIR (c) Diff. Image

(d) Ref. FLAIR (e) Fol.Up FLAIR (f) Diff. Image

Fig. 3. Two difference image examples. (a) and (d) show reference FLAIR, (b) and (e) show corresponding follow-up scans and (c) and (f) show difference images obtained by subtracting reference from follow-up. In (c) we can see intensity differences attributable to flow artifact and imperfect registration. In (f) we can see both artifactual intensity differences and intensity differences attributable to new lesion.

the precision of manual identification of new MS lesions as compared to identifying lesions on individual scans [1], [2], [8], [25]–[27]. The use of difference images in an automated context is complicated by the appearance of intensity differences in coregistered scans that do not correspond to anatomical change but rather to local misregistration or artifact (see Fig. 3).

In this paper, we present an automated probabilistic framework for identification of new MS lesions, given multi-modal reference and follow-up scans of the same subject. Our classification model borrows from both the change detection and subtraction imaging paradigms. We use a reference scan and a difference image to detect not only general change between timepoints, but also to model different sources of change in order to differentiate changes attributable to new MS lesion from change due to misregistration, artifact, or other sources. New MS lesion classification is done in two stages, as shown in Fig. 4: 1) A generative Bayesian model is used to jointly infer tissue classes at each voxel of reference and follow-up scans to provide a probabilistic voxel-wise classification and 2) Voxels identified as new lesion are grouped into contiguous new lesion candidates and assigned a confidence value using a random forest classifier. New lesion candidates meeting a user-defined confidence threshold are retained as the set of new lesions. The Bayesian model provides a tissue classification based on local observations and interactions between voxels in the same local spatial and temporal neighbourhood, while the random forest classification takes into account the larger context of new lesion candidates to refine our initial voxel-wise classification.




[Figure 4: pipeline diagram with inputs REFERENCE, FOLLOW-UP and DIFFERENCE; stages Voxel-wise Bayesian Classification and Lesion Level Random Forest Classification; outputs JOINT MAP ESTIMATE (REFERENCE, FOLLOW-UP), NEW LESION CANDIDATES and NEW LESIONS.]

Fig. 4. New MS lesion Classifier. Reference and difference images are input to the Joint Bayesian Classifier, while reference, follow-up and difference images are used for lesion-level classification using a random forest classifier. The image showing the final classification of new lesions shows candidates rejected by the lesion-level classification in blue (additionally shown with blue arrows) while those that are retained are shown in red.

In previous work, a Bayesian model was used for voxelwise detection of new and resolving lesion, but required a tissue classification of a reference scan as input [28]. The Bayesian model used here removes the requirement for a reference tissue classification and does a true joint (over both timepoints) inference of tissue classes at each voxel. Additionally, neighbourhood models used here consider both timepoints rather than only the follow-up scan, and tissue transition models vary spatially across the brain. We have added a second, lesion-level, classification stage using random forests to provide a more specific classification of new MS lesions. A large scale validation of our method is performed to characterize the sensitivity and specificity, and to assess the generalizability of our classifier models when trained and tested on independent data sets. Generative models for our Bayesian framework were learned from labeled multi-center clinical trial data consisting of 364 scans from 95 subjects with relapsing-remitting MS (RRMS). The method was then evaluated on a separate multi-center clinical trial data set consisting of 320 scans from 160 subjects, and was shown to detect new lesions as small as 3 voxels with a false detection rate of 8% (equivalent to 0.16 false detections per subject) at a sensitivity of 80% and a false detection rate of 23% (or 0.57 false detections per subject) at a sensitivity of 90%. The probabilistic nature of the output allows for a user-defined tradeoff between sensitivity and specificity and allows us to present results in the form of a curve. Further performance comparisons to fully manual identification of new lesions performed by expert raters showed the proposed method performed better than any individual manual identification of new lesions while having comparable performance to a consensus measure combining 9 manual segmentations.

II. METHOD

A. Joint Timepoint Bayesian Formulation

We define a random variable (RV) at each voxel in both reference and follow-up scans, where each RV can take on one of $m$ tissue class labels. We present the classification problem as one of jointly inferring tissue class labels, $c_i^{(r)}$ and $c_i^{(t)}$, at each voxel $i$, of reference and follow-up scans of a given patient, given multi-modal MR images at reference and follow-up. Instead of using MRI intensities of the follow-up scan directly, we consider intensity differences between coregistered follow-up and reference scans. Table I presents the nomenclature used in our derivations.

TABLE I
NOTATION

Symbol            Description
$c_i^{(r)}$       Tissue class label for voxel $i$ at reference
$c_i^{(t)}$       Tissue class label for voxel $i$ at follow-up
$\vec{I}_i$       Intensity for voxel $i$ at reference
$\vec{I}$         Intensity vectors at reference for entire volume
$\vec{D}_i$       Intensity difference (between timepoints) vector for voxel $i$
$\vec{D}$         Intensity difference vector for entire volume
$N_i$             Spatial neighbourhood of voxel $i$ at reference and follow-up
$\{c_{N_i}\}$     Configuration of tissue class labels in $N_i$

Intensity and intensity differences are represented as vectors, where each vector element corresponds to an intensity in one MRI modality. It should be noted that the term intensity difference refers to a difference in intensity between co-registered reference and follow-up scans at a given voxel.




B. Single-voxel Joint Timepoint Classification

We first present the problem as one of inferring tissue classes $c_i^{(r)}$ and $c_i^{(t)}$ based only on observations at the voxel, $i$, in question, at reference and follow-up scans, and on our voxel index, $i$, where each voxel index corresponds to a location in standard space. We can express our inference problem as a product of two likelihoods and a joint prior on reference and follow-up tissue classes, where our joint prior is implicitly conditioned on our location in standard space:

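\begin{align}
p(c_i^{(r)}, c_i^{(t)} \mid \vec{I}_i, \vec{D}_i)
&= \frac{p(\vec{I}_i \mid c_i^{(r)}, c_i^{(t)}, \vec{D}_i)\, p(c_i^{(r)}, c_i^{(t)} \mid \vec{D}_i)}{p(\vec{I}_i \mid \vec{D}_i)} \tag{1} \\
&= \frac{p(\vec{I}_i \mid c_i^{(r)})\, p(\vec{D}_i \mid c_i^{(r)}, c_i^{(t)})\, p(c_i^{(r)}, c_i^{(t)})}{p(\vec{I}_i, \vec{D}_i)}. \tag{2}
\end{align}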

We have made a simplifying assumption:
• $c_i^{(r)}$ is a sufficient statistic for $\vec{I}_i$: the intensity at reference, $\vec{I}_i$, is not dependent on the follow-up tissue class, $c_i^{(t)}$, or the intensity difference, $\vec{D}_i$, given the reference tissue class, $c_i^{(r)}$.

We can then further expand the last term in the numerator of eq. (2), the joint prior on reference and follow-up tissue classes, to give:

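\begin{align}
p(c_i^{(t)}, c_i^{(r)} \mid \vec{I}_i, \vec{D}_i)
&= \frac{1}{K}\, p(\vec{I}_i \mid c_i^{(r)})\, p(\vec{D}_i \mid c_i^{(r)}, c_i^{(t)})\, p(c_i^{(t)} \mid c_i^{(r)})\, p(c_i^{(r)}) \nonumber \\
&= p_0(c_i), \tag{3}
\end{align}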

where we have treated the denominator as a normalization constant, $K$. We have defined the formulation in eqn. (3) as the single voxel formulation (denoted by $p_0(c_i)$), as it considers only observations at the voxel in question (at both reference and follow-up timepoints) without considering spatially adjacent voxels.
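For readers who want the intermediate steps, the move from eqn. (1) to eqn. (3) can be spelled out as follows. This is only a restatement using Bayes' rule, the product rule and the sufficiency assumption stated above; no additional modelling assumption is introduced.

```latex
\begin{align*}
p(\vec{I}_i \mid c_i^{(r)}, c_i^{(t)}, \vec{D}_i) &= p(\vec{I}_i \mid c_i^{(r)})
  && \text{(sufficiency assumption)} \\
p(c_i^{(r)}, c_i^{(t)} \mid \vec{D}_i) &= \frac{p(\vec{D}_i \mid c_i^{(r)}, c_i^{(t)})\, p(c_i^{(r)}, c_i^{(t)})}{p(\vec{D}_i)}
  && \text{(Bayes' rule)} \\
p(\vec{I}_i \mid \vec{D}_i)\, p(\vec{D}_i) &= p(\vec{I}_i, \vec{D}_i)
  && \text{(product rule)} \\
p(c_i^{(r)}, c_i^{(t)}) &= p(c_i^{(t)} \mid c_i^{(r)})\, p(c_i^{(r)})
  && \text{(product rule)}
\end{align*}
```

Substituting the first three lines into eqn. (1) gives eqn. (2); the fourth line then gives eqn. (3), with $K = p(\vec{I}_i, \vec{D}_i)$.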

The single voxel formulation expresses our inference problem as a product of four terms. The intensity likelihood, $p(\vec{I}_i \mid c_i^{(r)})$, models the likelihood of tissue class $c_i^{(r)}$ for intensity $\vec{I}_i$. The difference likelihood, $p(\vec{D}_i \mid c_i^{(r)}, c_i^{(t)})$, models the joint likelihood of tissue classes $c_i^{(r)}$ and $c_i^{(t)}$, at reference and follow-up, for intensity difference, $\vec{D}_i$, between coregistered voxels of reference and follow-up scans. The tissue transition probability, $p(c_i^{(t)} \mid c_i^{(r)})$, determines the probability of a transition from tissue class $c_i^{(r)}$ at reference to $c_i^{(t)}$ at follow-up, while our prior on $c_i^{(r)}$, $p(c_i^{(r)})$, determines the prior probability of reference tissue class $c_i^{(r)}$. It should be noted that both the tissue transition probability and the prior on our reference tissue class are implicitly conditioned on the voxel index, $i$, and will vary spatially throughout the brain.
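As a rough illustration of eqn. (3), the following Python sketch evaluates the single-voxel joint posterior at one voxel by multiplying the four terms and normalizing. The function name, the reduced two-class setup (wm, les) and all numeric values are illustrative assumptions, not the models or parameters learned in this work.

```python
def single_voxel_posterior(lik_ref, lik_diff, transition, prior_ref):
    """Joint posterior over (reference class r, follow-up class t), as in eqn. (3).

    lik_ref[r]       : p(I_i | c_r = r), intensity likelihood
    lik_diff[r][t]   : p(D_i | c_r = r, c_t = t), difference likelihood
    transition[r][t] : p(c_t = t | c_r = r), tissue transition probability
    prior_ref[r]     : p(c_r = r), prior on the reference tissue class
    """
    m = len(prior_ref)
    unnorm = [[lik_ref[r] * lik_diff[r][t] * transition[r][t] * prior_ref[r]
               for t in range(m)] for r in range(m)]
    k = sum(sum(row) for row in unnorm)  # normalization constant K
    if k == 0.0:
        raise ValueError("all joint states have zero probability")
    return [[value / k for value in row] for row in unnorm]


# Hypothetical two-class example (values made up): index 0 = wm, 1 = les.
lik_ref = [0.8, 0.2]                       # reference intensity looks like wm
lik_diff = [[0.05, 0.9], [0.3, 0.1]]       # difference favours wm -> les
transition = [[0.999, 0.001], [0.1, 0.9]]  # wm -> les is rare a priori
prior_ref = [0.95, 0.05]

p0 = single_voxel_posterior(lik_ref, lik_diff, transition, prior_ref)

# Joint MAP estimate: the (r, t) pair with the highest posterior.
map_pair = max(((r, t) for r in range(2) for t in range(2)),
               key=lambda rt: p0[rt[0]][rt[1]])
```

With these made-up numbers the MAP pair is wm-wm even though the difference likelihood favours wm-les, because the small transition probability outweighs it. This is the kind of bias towards temporally consistent labels that the transition term is meant to provide.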

C. Neighbourhood Model

We incorporate modeled dependencies between neighbouring voxels to provide a more accurate and consistent classification. A common way to consider interactions between spatially adjacent voxels is by modelling the volume as a Markov Random Field (MRF) [10], [14]. An MRF formulation allows us to compactly express our joint posterior over all voxels as a Gibbs distribution, via the Hammersley-Clifford Theorem, providing an equivalence between maximizing the joint posterior distribution over all sites of our MRF and minimizing an energy function. While this provides an elegant formulation for our problem, in practice minimizing the energy function presents several challenges and disadvantages:

1) The parametrization of the energy function is not straightforward.

2) A true global minimization of the energy function is often intractable to compute.

3) Due to the intractability of computing the partition function, our solution typically consists of a set of labels as opposed to a set of posterior distributions.

Simplified models, such as pairwise Potts models, are often used in practice to overcome the complexities of MRF parametrization [10], but can only provide simplistic models of interaction between sites and their neighbourhood. Approximate optimization schemes, such as Iterated Conditional Modes (ICM), are also often used in order to make optimization more tractable. Such approaches generally do not find a true global optimum and depend heavily on the initialization of the solution.

Instead of formulating the joint posterior as a Gibbs distribution, we have chosen an alternate formulation that allows for a richer description of interactions between neighbouring voxels whose parameters can easily and accurately be learned from training data. We present the problem as one of maximizing marginal posterior distributions at each site, rather than one of maximizing the joint posterior over all sites. In essence, we are embedding assumptions made in approximate optimization schemes directly in our formulation, which provides greater freedom in parametrizing our problem. Such an approach also allows us to present a final solution in the form of posterior probabilities, rather than just as class labels.

We define $\{c_{N_i}\}$ as a configuration of tissue class labels in reference and follow-up scan of voxels in the neighbourhood of voxel $i$. Given an estimate of the posterior distribution of our neighbourhood configuration, $p(\{c_{N_i}\} \mid \vec{D}, \vec{I})$, we can re-express our joint inference problem as:

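\begin{align}
p(c_i^{(r)}, c_i^{(t)} \mid \vec{D}, \vec{I})
&= \sum_{\{c_{N_i}\}} p(c_i^{(r)}, c_i^{(t)}, \{c_{N_i}\} \mid \vec{D}, \vec{I}) \nonumber \\
&= \sum_{\{c_{N_i}\}} p(c_i^{(r)}, c_i^{(t)} \mid \{c_{N_i}\}, \vec{D}, \vec{I})\, p(\{c_{N_i}\} \mid \vec{D}, \vec{I}). \tag{4}
\end{align}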

As is typical in MRFs, we make the simplifying assumption that class labels at site $i$ are conditionally independent of observations other than at $i$, given observations at $i$ and class labels in the neighbourhood of $i$. Given this assumption, the first term after the summation in eqn. (4) can be expanded in a similar fashion as was done for our formulation in eqn. (3) to give:




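\begin{align}
p(c_i^{(r)}, c_i^{(t)} \mid \{c_{N_i}\}, \vec{D}, \vec{I})
&= \frac{p(\vec{I}_i \mid c_i^{(r)})\, p(\vec{D}_i \mid c_i^{(r)}, c_i^{(t)})\, p(c_i^{(r)}, c_i^{(t)} \mid \{c_{N_i}\})}{p(\vec{I}_i, \vec{D}_i \mid \{c_{N_i}\}, \vec{I}_{N_i}, \vec{D}_{N_i})} \nonumber \\
&= \frac{p(\vec{I}_i \mid c_i^{(r)})\, p(\vec{D}_i \mid c_i^{(r)}, c_i^{(t)})\, p(\{c_{N_i}\} \mid c_i^{(r)}, c_i^{(t)})\, p(c_i^{(r)}, c_i^{(t)})}{p(\vec{I}_i, \vec{D}_i \mid \{c_{N_i}\}, \vec{I}_{N_i}, \vec{D}_{N_i})\, p(\{c_{N_i}\})} \nonumber \\
&= \frac{K\, p_0(c_i)\, p(\{c_{N_i}\} \mid c_i^{(r)}, c_i^{(t)})}{K_{\{c_{N_i}\}}}, \tag{5}
\end{align}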

where we have substituted in our single-voxel formulation from eqn. (3). We treat the denominator as a normalization term dependent on $\{c_{N_i}\}$, denoted by $K_{\{c_{N_i}\}}$. We use $\vec{I}_{N_i}$ and $\vec{D}_{N_i}$ to denote intensities and intensity differences in the neighbourhood of voxel $i$. The formulation in eqn. (5) is equivalent to our single voxel formulation, with the additional term that describes the normalized likelihood of neighbourhood configuration $\{c_{N_i}\}$ given $c_i^{(r)}$ and $c_i^{(t)}$, analogous to an MRF smoothing term. By substituting eqn. (5) into eqn. (4) we get:

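\begin{equation}
p(c_i^{(r)}, c_i^{(t)} \mid \vec{D}, \vec{I})
= K\, p_0(c_i) \sum_{\{c_{N_i}\}} p(\{c_{N_i}\} \mid \vec{D}, \vec{I})\, \frac{p(\{c_{N_i}\} \mid c_i^{(r)}, c_i^{(t)})}{K_{\{c_{N_i}\}}}. \tag{6}
\end{equation}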

We generate an initial estimate, $p_0(\{c_{N_i}\})$, of the posterior distribution of our neighbourhood configuration, $p(\{c_{N_i}\} \mid \vec{D}, \vec{I})$, as a product of the individual posteriors at each voxel $j$ in $N_i$:

\begin{equation}
p_0(\{c_{N_i}\}) = \prod_{j \in N_i} p_0(c_j). \tag{7}
\end{equation}

An iterative procedure to update estimates of posteriors both for tissue classes and neighbourhood configurations can then be described as follows:

1) Compute $p_0(c_i) = p(c_i^{(r)}, c_i^{(t)} \mid \vec{I}_i, \vec{D}_i)$ as in eqn. (3) for every voxel $i$.

2) Compute an initial estimate of neighbourhood configuration posteriors, $p_0(\{c_{N_i}\})$, as in eqn. (7).

3) Compute updated posteriors, defined here as $p_1(c_i)$, at all voxels $i$, as in eqn. (6), using $p_0(\{c_{N_i}\})$ as an estimate for $p(\{c_{N_i}\} \mid \vec{D}, \vec{I})$.

4) Compute updated estimates of neighbourhood configuration posteriors, as in eqn. (7), but using $p_1(c_j)$ instead of $p_0(c_j)$.

5) Continue refining posterior estimates by repeating steps 3) and 4) for a fixed number of iterations or until some convergence criteria are met. Our solution is then the MAP estimate of our last iteration.
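The steps above can be sketched in Python on a toy problem. This is a minimal sketch under several stated assumptions, not the implementation used in this work: joint label pairs are flattened into a single state index, the neighbourhood likelihood is assumed to factorize over neighbours through a hypothetical table `q`, and the sum over neighbourhood configurations is done by brute force, which is only feasible for very small neighbourhoods. The per-configuration normalizer is obtained by requiring the conditional in eqn. (5) to sum to one over the states of voxel $i$.

```python
import itertools
import math


def update_posteriors(p0, p_est, neighbours, q):
    """One pass of eqn. (6) over all voxels.

    p0[i][c]      : single-voxel posterior of joint state c at voxel i (eqn. (3))
    p_est[j][s]   : current posterior estimate at voxel j, used in eqn. (7)
    neighbours[i] : indices of the voxels in the neighbourhood of voxel i
    q[c][s]       : ASSUMED p(neighbour in state s | centre voxel in state c)
    """
    n_states = len(p0[0])
    updated = []
    for i, nb in enumerate(neighbours):
        acc = [0.0] * n_states
        # Brute-force sum over every neighbourhood configuration {c_Ni}.
        for config in itertools.product(range(n_states), repeat=len(nb)):
            # Eqn. (7): configuration posterior as a product of voxel posteriors.
            weight = math.prod(p_est[j][s] for j, s in zip(nb, config))
            # p0(c_i) * p({c_Ni} | c_i), with the assumed factorized likelihood.
            cond = [p0[i][c] * math.prod(q[c][s] for s in config)
                    for c in range(n_states)]
            norm = sum(cond)  # plays the role of K_{c_Ni} / K
            if norm > 0.0:
                for c in range(n_states):
                    acc[c] += weight * cond[c] / norm
        updated.append(acc)
    return updated


def iterate(p0, neighbours, q, n_iter=3):
    """Steps 1) to 5): start from p0 and repeat the update n_iter times."""
    p = [row[:] for row in p0]
    for _ in range(n_iter):
        p = update_posteriors(p0, p, neighbours, q)
    return p


# Hypothetical 3-voxel chain, two joint states: 0 = "no change", 1 = "new lesion".
p0 = [[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]]
neighbours = [[1], [0, 2], [1]]
q = [[0.9, 0.1], [0.3, 0.7]]

p1 = update_posteriors(p0, p0, neighbours, q)
p_final = iterate(p0, neighbours, q)
```

In this toy example the new-lesion probability of the middle voxel drops after one update, since both of its neighbours favour "no change". In the method described here the neighbourhood model is learned from training data rather than set by hand.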

This approach is mathematically equivalent to a Gibbs distribution formulation that is optimized using a soft ICM, whose solution is initialized as in step 1) and where a single multi-voxel clique potential is equal to $p(\{c_{N_i}\} \mid c_i^{(r)}, c_i^{(t)}) / K_{\{c_{N_i}\}}$.

D. New Lesion Classification

The voxel-wise classification provides a probabilistic tissue classification of each voxel in reference and follow-up scans. New lesion voxels are identified as those voxels for which the joint MAP corresponds to lesion at follow-up and non-lesion at reference. Considering only a voxel-wise classification, a subset of voxels are misidentified as new lesion principally for the following reasons:

• The Bayesian classifier does not explicitly consider the intensity values in the follow-up scan. Artifactual intensity differences that are not well explained by other tissue transitions are thus sometimes classified as lesion. These are caused mainly by intensity differences in non-brain or edge of brain due to imperfect brain masks, or by flow artifact, particularly in the inferior temporal lobe.

• Large registration errors at the cortical gray matter / white matter boundary are sometimes misclassified as new lesion due to the intensity overlap of gray matter and lesion.

• Misregistration at the boundaries of lesion already existing at reference is sometimes misidentified as “new” lesion.

New lesion candidates are determined by considering contiguous (defined by 18-connectedness in 3 dimensions) sets of voxels labelled as new lesion. The subset of new lesion candidates that meet a minimum size requirement of 3 voxels (as is the clinical norm) form the input to a second, lesion-level classification stage using a random forest classifier. A random forest is a discriminative classifier that consists of an ensemble of decision tree classifiers, where the final classification is determined by summing the votes cast by each individual tree [29]. Using random subsets of training data and features for each individual tree helps prevent overfitting, to which traditional decision tree classifiers are prone. Random forests have been shown to be powerful classifiers for a number of different classification tasks, including MS lesion segmentation [18].

By considering lesions as a whole, we can extract higher level features for a lesion-based (as opposed to voxel-based) classification. The Random Forest classifier provides as output a confidence measure between 0 and 1 for each new lesion candidate, corresponding to the percentage of individual decision trees that classified the candidate as new lesion. Our final solution, consisting of a set of new lesions, is then the set of new lesion candidates that meet a user-defined confidence threshold. Specific features used for lesion-level classification will be discussed in Section III-B2.
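The candidate-generation and thresholding steps described in this section can be sketched as follows. This is an illustrative sketch only: voxels are represented as integer coordinate tuples, the function names are hypothetical, and the per-tree votes are supplied by hand rather than produced by a trained random forest.

```python
from collections import deque
from itertools import product

# 18-connectivity in 3-D: face neighbours (6) plus edge neighbours (12),
# i.e. offsets with one or two non-zero components.
OFFSETS_18 = [d for d in product((-1, 0, 1), repeat=3)
              if 0 < sum(abs(x) for x in d) <= 2]


def lesion_candidates(new_lesion_voxels, min_size=3):
    """Group new-lesion voxels into 18-connected components of >= min_size voxels."""
    remaining = set(new_lesion_voxels)
    candidates = []
    while remaining:
        seed = remaining.pop()
        component, queue = [seed], deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in OFFSETS_18:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.append(neighbour)
                    queue.append(neighbour)
        if len(component) >= min_size:
            candidates.append(sorted(component))
    return candidates


def confidence(tree_votes):
    """Fraction of trees that voted 'new lesion' for one candidate."""
    return sum(tree_votes) / len(tree_votes)


def retain(candidates, votes_per_candidate, threshold):
    """Keep candidates whose confidence meets the user-defined threshold."""
    return [c for c, votes in zip(candidates, votes_per_candidate)
            if confidence(votes) >= threshold]


# Toy input: one 3-voxel candidate, two corner-touching voxels, one isolated voxel.
voxels = {(0, 0, 0), (1, 1, 0), (2, 1, 0), (5, 5, 5), (6, 6, 6), (9, 9, 9)}
candidates = lesion_candidates(voxels)

# Hypothetical votes from a 4-tree forest for the single surviving candidate.
kept = retain(candidates, [[True, True, True, False]], threshold=0.5)
```

Note that (5, 5, 5) and (6, 6, 6) touch only at a corner, so they are not joined under 18-connectedness and both fall below the 3-voxel minimum.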

III. EXPERIMENTS

A. Data Sets

Two proprietary multi-center clinical data sets were used for model generation and testing, respectively. The first clinical data set, Data Set A, is comprised of 364 scans from 95 subjects with relapsing-remitting MS (RRMS), each with 2-7 timepoints taken at intervals ranging from 3 to 12 months. Each timepoint has T1-weighted (T1w), T2-weighted (T2w), Proton Density Weighted (PDw) and fluid attenuated inversion recovery (FLAIR) scans and some timepoints additionally had




T1-weighted scans with gadolinium contrast agent (T1c). As is typical of data used clinically in most centers, all scans were acquired axially with in-slice resolution of 1mm and slice thickness of 3mm. The second clinical data set, Data Set B, is comprised of 320 scans from 160 patients, each with 2 scans taken 8-28 weeks apart. The same type of MRI sequences were available as for Data Set A. These 320 scans were chosen for generation of a ground truth standard for new MS lesions to be used for validation. Different data sets were used for training and testing to demonstrate the generalizability of the generative models used in our voxelwise classification.

1) Training Data: Data Set A was used for training the generative models used in our Bayesian classifier. Tissue class labels used for training were generated prior to, and independently of, this study, using an existing semi-automatic processing pipeline with stringent quality control that has been used for several real clinical trials of MS treatments. The pipeline is described briefly as follows:

1) All volumes were automatically segmented into cerebro-spinal fluid (csf), gray matter (gm), white matter (wm), MS lesion (les), and a partial volume class (pv) that models mixtures of csf and gray matter, using a classifier based on [16].

2) Automatic segmentation of MS lesion was manually verified by a trained expert. Any falsely identified MS lesions were removed. No new lesions were added in the manual correction phase and lesion boundaries were not modified. Healthy tissue segmentations were not modified.

Additional processing of the data was done for the purpose of this study, in which existing segmentations of MS lesion were semi-manually modified to enforce greater temporal precision and thus provide cleaner longitudinal models for training our classifier.

2) Test Data: Data Set B was used for generation of a ground truth reference for new MS lesion and for evaluating the performance of our classifier. Tissue labels were first generated using the same pre-existing semi-automatic pipeline as was used for our training data. An initial set of new MS lesions was then generated, again using a pre-existing pipeline, by considering manually corrected lesion segmentations at each timepoint. This initial set of new MS lesions was then thoroughly manually verified by an expert neuroradiologist to remove any falsely identified new lesions and to add any new lesions not identified in the semi-automatic process. All MRI modalities (T1w, T2w, PDw, FLAIR, T1c) were consulted during the manual correction stage, as were timepoints from the same patient other than the two being considered (including intervening, prior and subsequent scans). The resultant semi-manually generated set of labels corresponding to new lesion represents a thorough identification of new lesions by a highly trained expert and served as a ground truth reference for validation of our method. Expert manual review of new lesions occurred at the lesion-level: lesions were accepted or rejected but lesion boundaries as determined by the automated portion of the pipeline were not modified. Only when the manual rater added new lesion not previously identified were the lesion boundaries determined manually.

3) Fully Manual Identification of New Lesion: A subset of the 320 scans in Data Set B, comprised of 120 scans from 60 patients, was additionally analyzed independently by 9 expert raters (5 neurologists, 4 neuroradiologists), each of whom provided identification of new lesions based on analysis of coregistered sequential FLAIR images (other sequences were not consulted by the experts). The subset of 60 subjects for which these manual lesion counts are available will be referred to as Data Set Bsub. Because full manual segmentation is very time-consuming, raters identified new lesions by painting a single voxel at the center of each identified new lesion. As such, each manual identification of new lesion was considered as a seed, and a new lesion growing algorithm was applied to each of the manual identifications to provide a synthetic segmentation of the whole new lesion. The growing algorithm considered follow-up intensities and intensity differences between scans, favoured a generous segmentation and enforced a minimum growth of 3mm in all directions in the axial plane, so as to be sure not to penalize the manual segmentations with our chosen evaluation metrics, which are described below.

B. Implementation

1) Generative Models for Bayesian Classifier: Classifier models are based on the same 5 tissue classes available in the training set: csf, gm, wm, les, and pv. All classifier models were generated based on Data Set A. The following models were learned from training data:

1) Intensity Likelihood models (corresponding to $p(\vec{I} \mid c)$)

2) Intensity Difference Likelihood models (corresponding to $p(\vec{D} \mid c^{(r)}, c^{(t)})$)

3) Tissue Transition models (corresponding to $p(c^{(t)} \mid c^{(r)})$)

4) Tissue Neighbourhood models (corresponding to $p(\{c_N\} \mid c^{(r)}, c^{(t)})$)

Tissue priors (corresponding to $p(c)$) for healthy tissue were taken from the icbm152 (MNI) average brain atlas [30]. Tissue priors for MS lesion were generated for this study by collecting scans from a series of clinical trials (independent of Data Set A and Data Set B) that we had access to, totalling 3714 subjects with relapsing-remitting MS. Expert manually corrected lesion segmentations for these scans were all non-linearly registered to icbm152 space to provide a probabilistic lesion atlas.

Intensity likelihood models for each of the 5 tissue types were generated as 3-dimensional Gaussian Mixture Models (GMMs), corresponding to three modalities (T1w, T2w, FLAIR). The number of Gaussian components used in our models ranged from 2-5 depending on the tissue class and was chosen heuristically based on non-parametric representations of the distributions. Figure 5a) shows example intensity likelihood models over all classes for FLAIR intensities, shown as a marginal distribution for a single modality for visualization purposes (in practice these distributions are multi-variate).
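To make the form of such a likelihood concrete, the sketch below evaluates a Gaussian mixture density at one intensity vector. The diagonal covariance structure, the function name and all parameter values are assumptions made for illustration; the covariance structure of the models used in this work is not specified here.

```python
import math


def gmm_likelihood(x, weights, means, variances):
    """Density of a Gaussian mixture with ASSUMED diagonal covariances.

    x         : intensity vector, one element per modality (e.g. T1w, T2w, FLAIR)
    weights   : mixing weight of each component (should sum to 1)
    means     : per-component mean vectors
    variances : per-component per-modality variances (diagonal covariance)
    """
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
        log_exp = -0.5 * sum((xi - m) ** 2 / v for xi, m, v in zip(x, mu, var))
        total += w * math.exp(log_norm + log_exp)
    return total


# Hypothetical 2-component model over (T1w, T2w, FLAIR); values are made up.
weights = [0.7, 0.3]
means = [[100.0, 80.0, 90.0], [110.0, 95.0, 120.0]]
variances = [[25.0, 36.0, 49.0], [64.0, 64.0, 100.0]]

likelihood = gmm_likelihood([102.0, 82.0, 95.0], weights, means, variances)

# Sanity check against the 1-D standard normal density at 0, which is 1/sqrt(2*pi).
standard_normal_at_zero = gmm_likelihood([0.0], [1.0], [[0.0]], [[1.0]])
```

In the Bayesian classifier, one such density per tissue class would supply the intensity likelihood term in eqn. (3).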

Intensity difference models, tissue transition models and neighbourhood models are generated from timepoint pairs, and are all based on tissue combinations at coregistered voxels of reference and follow-up scans in our training data. We refer to these tissue combinations as tissue transitions as they model a transition from one tissue type at reference to another (or the




[Figure 5: (a) $p(\vec{I}_i \mid c_i^{(r)})$ for FLAIR, with curves for csf, gm, wm, les and pv; (b) $p(\vec{D}_i \mid c_i^{(r)}, c_i^{(t)})$ for FLAIR, with curves for wm-wm, wm-les, les-les and les-wm.]

Fig. 5. Likelihood models for FLAIR. (a) shows intensity likelihood models for all tissue types (csf = cerebro-spinal fluid, gm = gray matter, wm = white matter, les = MS lesion, pv = partial volume). (b) shows intensity difference likelihood models for a subset of tissue transitions, involving wm and les (i.e. wm-les corresponds to transitions from white matter at reference to MS lesion at follow-up). These transitions make up 4 of the 25 total intensity difference models. Models shown in (a) and (b) are marginalized over FLAIR only for visualization purposes; in practice these are multivariate distributions over all MRI modalities.

same) tissue type at follow-up. Tissue transitions can modelvarious sources of change, such as new MS lesion (e.g. whitematter to MS lesion), or misregistration (e.g. some instancesof white matter to gray matter).

Intensity difference likelihood models were generated as 4-dimensional GMMs (corresponding to T1w, T2w, FLAIR, T1c) and were learned for every possible tissue transition (for a total of 25 models), each using from 2 to 7 Gaussian components. Figure 5b) shows example intensity difference likelihood models for selected tissue transitions (wm-wm, wm-les, les-les, les-wm) for FLAIR intensity differences, again shown as a marginal distribution for visualization purposes.

The tissue transition model can be thought of as a matrix of transition probabilities and determines the probability of going from tissue class $c^{(r)}$ at reference to tissue class $c^{(t)}$ at follow-up. This provides a bias towards consistent classification of coregistered voxels across timepoints; in the event that our observations suggest a change in tissue class, the model will bias towards more plausible transitions. We condition the transition model on anatomical location using our atlas-derived spatial tissue priors, as some transitions are more likely to occur in some areas of the brain than in others. The most commonly occurring tissue transitions are those involving stable healthy tissue, namely wm-wm, gm-gm, csf-csf and pv-pv. In our training data set, these transitions together make up over 97% of the transitions seen. Stable lesion (les-les) was seen in 0.65% of the voxels in our training data, while wm-les transitions (corresponding to most new lesion) occurred 0.02% of the time.

[Figure 6 panels: (a) Ref. FLAIR; (b) Fol. Up FLAIR; (c) FLAIR Diff Image; (d) $p(c^{(r)}=\mathrm{wm})$; (e) $p(\vec{I} \mid c^{(r)}=\mathrm{wm}) / p(\vec{I})$; (f) $p(\vec{D} \mid c^{(r)}=\mathrm{wm}, c^{(t)}=\mathrm{les}) / p(\vec{D})$; (g) $p(c^{(t)}=\mathrm{les} \mid c^{(r)}=\mathrm{wm})$*; (h) $p(c^{(r)}=\mathrm{wm}, c^{(t)}=\mathrm{les} \mid \vec{I}, \vec{D})$; (i) MAP estimate for $c^{(r)}=\mathrm{wm}$, $c^{(t)}=\mathrm{les}$.]

Fig. 6. Illustration of single-voxel joint Bayesian classification for $c^{(r)}=\mathrm{wm}$, $c^{(t)}=\mathrm{les}$. (d) shows the prior probability that $c^{(r)}=\mathrm{wm}$, (e) illustrates the normalized intensity likelihood for $c^{(r)}=\mathrm{wm}$, while (f) illustrates the normalized intensity difference likelihood for the tissue transition $c^{(r)}=\mathrm{wm}$, $c^{(t)}=\mathrm{les}$. (g) shows the prior probability of transitioning from $c^{(r)}=\mathrm{wm}$ to $c^{(t)}=\mathrm{les}$. The * denotes that the probability map is shown on a logarithmic scale, as all transition probabilities for wm to les are relatively small. (h) shows the joint posterior for transitions corresponding to $c^{(r)}=\mathrm{wm}$, $c^{(t)}=\mathrm{les}$, and (i) shows voxels for which the single-voxel joint MAP estimate suggests a wm to les transition (shown in red). In practice, we evaluate the posterior for each of the 25 possible tissue transitions.

Figure 6 illustrates the components of our single-voxel Bayesian framework as shown in eqn. (3), for a case of interest where we wish to determine the probability of the tissue class being white matter at reference and lesion at follow-up (i.e. $c^{(r)}=\mathrm{wm}$, $c^{(t)}=\mathrm{les}$).
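As a rough illustration of how the factors shown in Fig. 6 combine, the following Python sketch multiplies a reference prior, a transition prior, an intensity likelihood and an intensity difference likelihood for each candidate transition, and normalizes over all candidates. The factorization is an assumption inferred from the panels of Fig. 6, not a transcription of eqn. (3); the function name, the reduced two-class label set and all numeric values are hypothetical.

```python
from itertools import product

def joint_posterior(prior_ref, trans, lik_int, lik_diff):
    """Normalized posterior over (c_ref, c_fol) pairs for one voxel.

    prior_ref[c_ref]         : spatial prior at reference
    trans[(c_ref, c_fol)]    : transition prior p(c_fol | c_ref)
    lik_int[c_ref]           : intensity likelihood p(I | c_ref)
    lik_diff[(c_ref, c_fol)] : difference likelihood p(D | c_ref, c_fol)
    """
    classes = list(prior_ref)
    unnorm = {}
    for c_ref, c_fol in product(classes, repeat=2):
        unnorm[(c_ref, c_fol)] = (prior_ref[c_ref] * trans[(c_ref, c_fol)]
                                  * lik_int[c_ref] * lik_diff[(c_ref, c_fol)])
    total = sum(unnorm.values())
    return {pair: value / total for pair, value in unnorm.items()}

# Hypothetical values for a voxel that brightens on FLAIR at follow-up.
prior_ref = {"wm": 0.95, "les": 0.05}
trans = {("wm", "wm"): 0.999, ("wm", "les"): 0.001,
         ("les", "wm"): 0.05, ("les", "les"): 0.95}
lik_int = {"wm": 0.8, "les": 0.1}
lik_diff = {("wm", "wm"): 1e-5, ("wm", "les"): 0.6,
            ("les", "wm"): 1e-4, ("les", "les"): 0.01}

posterior = joint_posterior(prior_ref, trans, lik_int, lik_diff)
map_pair = max(posterior, key=posterior.get)  # single-voxel MAP transition
print(map_pair, round(posterior[map_pair], 3))
```

With these made-up numbers the small wm-les transition prior is outweighed by the intensity difference likelihood, which is the behaviour panels (f) to (i) of Fig. 6 are meant to convey.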

We define our neighbourhood, $N_i$, as the 4-voxel in-plane neighbourhood at both reference and follow-up scans (considering a total of 8 voxels). Out-of-plane neighbours were found to not be useful in our voxelwise neighbourhood model due to the 3mm slice thickness, but 3D connectivity was exploited in the lesion-level classification stage. We consider tissue transitions rather than individual tissue classes; as such, we can consider the size of our neighbourhood as being 4 voxels but with 25 possible class labels corresponding to all possible tissue transitions. A neighbourhood configuration, $\{c_{N_i}\}$, is defined by the counts of occurrences of each of the tissue transitions in our neighbourhood. The most commonly occurring neighbourhood configurations corresponded to homogeneous stable healthy tissue: 4 wm-wm, 4 gm-gm, 4 csf-csf and 4 pv-pv were the 4 most frequently encountered neighbourhood configurations overall. The most frequently occurring neighbourhood configurations surrounding voxels that were new lesion were 4 wm-les (34%), followed by 2 wm-les and 2 wm-wm (11%), followed by various count combinations of wm-les, wm-wm and gm-gm.

In practice, it is not tractable to consider all possible neighbourhood configurations when computing the estimate of the posterior distribution of neighbourhood configurations, as in eqn. (7). Most neighbourhood configurations never or very rarely occur and are thus ignored, allowing us to consider only a subset of 1000 neighbourhood configurations that occur most frequently in our training data (the 1000th most commonly occurring neighbourhood configuration occurred only 0.0000063% of the time). In addition, we generate a partial representation of $p(\{c_{N_i}\} \mid \vec{D}, \vec{I})$ by considering only the 3 most likely tissue transitions at each voxel $j$ in $N_i$.
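To make the notion of a neighbourhood configuration concrete, the Python sketch below represents a configuration as an order-independent count of tissue transitions and counts how many distinct configurations exist for 4 neighbours and 25 transition labels. The helper names are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import comb

def configuration(neighbour_transitions):
    """Order-independent configuration: sorted (transition, count) pairs."""
    return tuple(sorted(Counter(neighbour_transitions).items()))

def num_configurations(num_labels, num_neighbours):
    """Number of multisets of size num_neighbours drawn from num_labels labels."""
    return comb(num_labels + num_neighbours - 1, num_neighbours)

a = configuration(["wm-les", "wm-wm", "wm-les", "wm-wm"])
b = configuration(["wm-wm", "wm-wm", "wm-les", "wm-les"])
print(a == b)                     # True: same counts, different voxel order
print(num_configurations(25, 4))  # 20475
```

Under this counting, restricting each neighbour to its 3 most likely transitions leaves at most 3^4 = 81 ordered label assignments to score per voxel, which is one way to read the partial representation described above.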

2) Lesion Level Random Forest Classifier: Because reference segmentations available for new lesions in Data Set B were of higher quality than those in Data Set A, random forest classifiers were trained from Data Set B, using four-fold cross-validation stratified over subjects, with 120 subjects used for training and the remaining 40 subjects for testing in each fold. The high quality of our ground truth segmentations in Data Set B allowed us to automatically label new lesion candidates as true or false for training purposes. Features used for random forest classification can be summarized as follows:

• Intensity based features: average intensity and intensity difference likelihoods for all tissue classes, average raw intensity and intensity differences over all MRI modalities.

• Contextual features: tissue type at source, surrounding tissue type at follow-up, percent connectedness to existing lesion.

• Size based features: size of new lesion, number of axial slices over which new lesion is present.

• Other: estimate of registration quality between timepoints for each modality, shape descriptors based on eigenvalues, estimate of presence of flow artifact.

Intensity based features allow us to eliminate the most obvious false positives, where the intensity difference may be consistent with new lesion formation but the intensity at follow-up is not consistent with MS lesion. Contextual features help to eliminate new lesion candidates that are unlikely to be real based on anatomical context, and to identify stable lesions that are mistakenly identified as new due to misregistration. Presence of artifact or poorer quality registration will reduce our confidence in new lesion candidates.

A total of 63 features were used when using 4 modalities for classification. Default parameters were used for training random forests (500 trees, 8 features per tree, equivalent to the square root of the number of features). The most important feature was the mean probability of new lesion, as determined by the voxelwise classification prior to incorporation of neighbourhood information. This feature is equivalent to $p(c_i^{(t)} = \mathrm{les}, c_i^{(r)} \neq \mathrm{les} \mid \vec{I}_i, \vec{D}_i)$, averaged over all voxels in the new lesion candidate. Other important features include size, number of slices over which the new lesion candidate is present, and intensity based features for both FLAIR and T1c images. Table II shows the 10 most important features, as measured by the average mean decrease in Gini index over the four cross-validation folds.

TABLE II
FEATURE IMPORTANCE FOR RANDOM FOREST CLASSIFIER

Feature                                                    Mean Dec. in G.I.   Mean Rank
Mean SV prob. new lesion                                   62.7                1.0
Size                                                       40.1                2.75
Num. Slices                                                39.9                2.25
Mean Int. Diff. FLAIR                                      21.4                4.25
Mean Int. Follow Up FLAIR                                  18.4                4.75
Mean $p(\mathrm{les} \mid I_t)$                            12.8                7.25
Mean Int. Follow Up T1c                                    12.4                7.0
Mean Int. Diff. T1c                                        11.9                8.5
Std. Dev. Int. Follow Up FLAIR                             11.9                8.25
Mean $p(D \mid c_i^{(r)} \neq \mathrm{les}, c_i^{(t)} = \mathrm{les})$   10.9   9.0

Top ranked features for random forest classification using 4 modalities (T1w, T2w, FLAIR, T1c). Values are averages over 4 classification folds. Mean Dec. in G.I. = mean decrease in Gini Index. SV prob. new lesion = single voxel probability of new lesion based on eqn. (3).

A FLAIR-only version of our classifier was also considered, to allow for fairer comparison to expert manual identification of lesion done using only FLAIR images. In this case, intensity likelihood and intensity difference likelihood were univariate (based on FLAIR intensities only) and any intensity based features used in our random forest classifier were derived only from the FLAIR scans, resulting in a total of 45 features for lesion-level classification.

3) Preprocessing: Before analysis, several preprocessing steps are performed to remove non-brain portions of the image, mitigate non-uniformity effects and bring MR images into a common spatial and intensity space. Brain masking is done using the Brain Extraction Tool (BET) [31], while intensity non-uniformity (NU) correction is done using N3 [32]. All modalities and timepoints are first rigidly registered (6 parameter transformation) to a common space. Additional non-linear registration is done between timepoints using the follow-up scan as the target image and the reference scan as the moving image. Care is required when performing non-linear registration between timepoints, as we wish to ensure that we do not remove any change of interest, for example by warping new MS lesion into adjacent existing lesion. For this reason, we first apply an approximate and generous lesion mask to each timepoint prior to registration, generated using a Bayesian classifier with an artificially inflated lesion prior. Non-linear registration was done hierarchically, with a non-linear deformation grid resolution ranging from 32 mm to 4 mm [33]. Although most experiments used non-linear registration between timepoints, rigid and affine inter-timepoint registration were also considered.

Intensity normalization is done in a two-stage process. An intra-patient intensity normalization is first performed to bring all timepoints of the same patient into a common intensity space. An inter-patient intensity normalization is then performed to map each patient into a global intensity space. This allows us to use global intensity likelihood models learned across many subjects.

Intra-patient intensity normalization is based on a previously proposed Least Trimmed Squares (LTS) method [23], but with additional constraints that guarantee the monotonicity of the mapping of intensities and require that samples used to generate the mapping be equally distributed across intensity deciles. Inter-patient intensity normalization was done using the method of Nyúl et al. [34], using intensity deciles as control points. Common control points are used across timepoints of the same patient to ensure that any changes in image contrast introduced by the inter-patient normalization are consistent across all timepoints of a given patient.
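A minimal Python sketch of a decile-based piecewise-linear intensity mapping, in the spirit of the control-point approach described above, is given below. It is not the authors' implementation: the function names, the linear-interpolation percentile rule and the target landmarks are assumptions made for illustration, and the LTS-based intra-patient step is not shown.

```python
from bisect import bisect_right

def deciles(values):
    """Return 11 landmarks (0%, 10%, ..., 100%) using linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    marks = []
    for k in range(11):
        pos = k * (n - 1) / 10          # fractional index of the k-th decile
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        marks.append(ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo]))
    return marks

def piecewise_map(x, src_marks, dst_marks):
    """Map x through the piecewise-linear function sending src_marks to dst_marks."""
    i = bisect_right(src_marks, x) - 1
    i = max(0, min(i, len(src_marks) - 2))   # clamp to a valid segment
    x0, x1 = src_marks[i], src_marks[i + 1]
    y0, y1 = dst_marks[i], dst_marks[i + 1]
    if x1 == x0:                              # degenerate segment
        return y0
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# Hypothetical intensities and hypothetical "global" landmarks.
scan = list(range(101))
src = deciles(scan)
dst = [2.0 * m for m in src]
print(piecewise_map(25, src, dst))  # 50.0
```

Reusing the same `src` landmarks for every timepoint of a patient, as in this sketch, is one simple way to keep the mapping identical across timepoints; as long as the target landmarks are increasing, the mapping is monotonic.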

C. Experimental Validation

Classifier performance was evaluated by comparing to our ground truth reference for Data Set B. We first compare the proposed classifier to variants that 1) consider reference and follow-up timepoints independently, and 2) use the reference and follow-up image intensities directly, rather than using intensity differences. We next compare our iterative neighbourhood formulation to 1) a variant that does not consider spatial neighbourhood information and 2) a standard MRF with pairwise clique potentials, optimized using standard ICM. We also consider the use of different inter-timepoint registration schemes (rigid, affine, non-linear) and examine performance as a function of new lesion size. Further analysis was done to compare the performance of the proposed classifier with fully manual identification of new lesions done by 9 expert raters for the 60 subjects in Data Set Bsub.

Performance of lesion detection algorithms has generally been measured using Kappa or the related Dice coefficient [10], [14], [35]–[37]. Such an approach can be problematic, as exact lesion boundaries are often ambiguous and there is a strong bias toward larger lesions, since small lesions contribute little to an overall volumetric measure. As we are more interested in detection of individual lesions, we use lesion-wise metrics to evaluate our classifier, as standard voxel-wise metrics are not directly applicable to our problem. We consider a new lesion as detected (DET) if at least 3 voxels in that lesion are considered lesion by a classifier. If a lesion in our ground truth is not detected by a classifier then it is defined as a false negative (FNEG). Similarly, a lesion is considered a false detection (FD) if it is detected by a classifier and there are fewer than 3 voxels in that lesion that are considered lesion as per our ground truth. We use fairly liberal definitions of lesion overlap because of the inherent variability and ambiguity of lesion boundaries, and because our ground truth was optimized for new lesion identification rather than exact segmentation.

[Figure 7: sensitivity (y-axis) vs. false detection rate (x-axis), with curves for PC, STC and JCint.]

Fig. 7. Lesion Sensitivity vs. False Detection Rate for Different Bayesian Models. PC=Proposed Classifier, STC=Single Timepoint Classifier, JCint=Joint Classifier using Intensity Only. Individual points on the curve were generated by varying the random forest output threshold for acceptance as new lesion.

Overall sensitivity is evaluated as the percentage of new lesions in our reference standard that are detected by our classifier, $\frac{DET}{DET+FNEG}$, and false detection rate is evaluated as $\frac{FD}{FD+DET}$. We have chosen to use the number of detected lesions (DET) rather than the number of true positives (TP), as we do not want to artificially inflate classifier performance for classifiers that detect multiple new lesions overlapping with a single ground truth new lesion. Considering different acceptance thresholds on our lesion-level classification allows us to present our results in the form of a curve.
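The lesion-wise definitions above can be sketched in a few lines of Python. This is an illustrative reading of the DET, FNEG and FD rules with the 3-voxel overlap criterion, assuming each lesion is given as a set of voxel coordinates; the function name and the toy lesions are hypothetical.

```python
def lesion_metrics(truth_lesions, detected_lesions, min_overlap=3):
    """Lesion-wise sensitivity and false detection rate.

    Each lesion is a set of voxel coordinates. A ground-truth lesion counts
    as detected (DET) if at least min_overlap of its voxels are labelled
    lesion by the classifier; a classifier lesion is a false detection (FD)
    if fewer than min_overlap of its voxels are lesion in the ground truth.
    """
    classifier_voxels = set().union(*detected_lesions)
    truth_voxels = set().union(*truth_lesions)
    det = sum(len(les & classifier_voxels) >= min_overlap for les in truth_lesions)
    fneg = len(truth_lesions) - det
    fd = sum(len(les & truth_voxels) < min_overlap for les in detected_lesions)
    sensitivity = det / (det + fneg) if truth_lesions else float("nan")
    fdr = fd / (fd + det) if (fd + det) else float("nan")
    return det, fneg, fd, sensitivity, fdr

# Toy example: one detected lesion, one missed lesion, one false detection.
truth = [{(0, 0, 0), (0, 1, 0), (0, 2, 0), (0, 3, 0)},
         {(5, 5, 1), (5, 6, 1), (5, 7, 1)}]
detected = [{(0, 1, 0), (0, 2, 0), (0, 3, 0)},
            {(9, 9, 2), (9, 8, 2), (9, 7, 2)}]
print(lesion_metrics(truth, detected))  # (1, 1, 1, 0.5, 0.5)
```

Because DET counts ground-truth lesions rather than classifier lesions, several classifier fragments overlapping one ground-truth lesion still contribute a single detection, which matches the motivation given for using DET instead of TP.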

1) Bayesian Framework for voxelwise classification: Experiments were performed to compare our joint timepoint Bayesian formulation to 1) a single-timepoint variant (STC), where lesion is independently identified in reference and follow-up and new lesion voxels are determined by considering the difference between the two lesion segmentations, and 2) a variant (JCint) that jointly infers tissue classes over both timepoints, but where the follow-up intensities are used instead of the intensity difference between reference and follow-up scan (i.e. $p(\vec{D}_i \mid c^{(r)}, c^{(t)})$ becomes $p(I_i^{(t)} \mid c^{(t)})$ in our single-voxel formulation in eqn. (3)).

For all variants of the classifier, separate lesion-level random forest classifiers were generated based on the new lesion candidates as generated by the particular voxel-wise classifier variant, but the process for generating these lesion-level classifiers and the features used were the same across all variants. All classifier models used other than those specifically targeted by a given classifier variant (i.e. intensity likelihood models, tissue priors, neighbourhood models) were common across all variants being tested.

Figure 7 presents a plot of sensitivity vs. false detection rate for lesion-level classification for the 3 classifier variants discussed above. Different points on the curve were generated by varying the target sensitivity for the lesion-level classifier. The plots show the mean classification sensitivity as a function of false detection rate over 4 cross-validation folds. Ninety-five percent confidence intervals of the standard error of the mean classification sensitivity are shown based on classification variance over the 4 folds. It should be noted that while the curve shown in Figure 7 is much like a Receiver Operating Characteristic (ROC) curve, it does not represent a typical ROC curve, as it does not plot sensitivity vs. false positive rate and considers only the set of lesions identified by the ground truth or by the classifier, rather than all possible samples in our image. For comparison purposes, curves have been extrapolated (as a straight line at the maximum achievable sensitivity for that classifier) to cover the whole range of false detection rates, even if in practice these rates are not achievable. Sensitivities at operating points of interest are shown in tabular form in Table III.

The results show that the incorporation of a joint segmentation framework significantly increases the ability to detect new lesions as compared to independently classifying lesions in two timepoints and subtracting the resultant segmentations. When performing a joint segmentation over both timepoints, using the intensity difference between coregistered scans as opposed to the native intensity at follow-up provides a further significant performance improvement.

2) Neighbourhood Model: Experiments were performed to compare the proposed neighbourhood model to (1) a variant where spatial context is not considered (NoN), and (2) a variant using a more traditional MRF model (MRF). The first variant generates a voxel-wise classification based only on the single-voxel formulation, as in eqn. (3). The second variant employs an MRF model using pairwise cliques and ICM as the optimization scheme. All other components of the voxel-wise classification were kept constant. Pairwise clique potentials were learned as conditional probabilities from training data in the same manner as for the original neighbourhood models. It should be noted that although the model considers cliques that are spatially pairwise, models are built based on tissue transitions such that both timepoints are considered together in the model. For each of the classifier variants described above, variant-specific random forest classifiers for lesion-level classification were relearned based on new lesion candidates generated by each classifier variant.

Figure 8 presents a plot of sensitivity vs. false detection rate for lesion-level classification for the 3 variants of neighbourhood models. Plots and error bars show mean classification sensitivity over 4 folds and 95% confidence intervals of the standard error of the mean, respectively. Curves have been extrapolated to cover the full range of false detection rates. Our experiments show that incorporation of spatial neighbourhood information is beneficial, as both the proposed classifier and the variant using pairwise MRF outperform the variant that does not incorporate spatial context. The proposed neighbourhood model has similar performance to a basic MRF at low false detection rates but can achieve higher sensitivity at false detection rates above 10%.

[Figure 8: sensitivity (y-axis) vs. false detection rate (x-axis), with curves for PC, NoN and MRF.]

Fig. 8. Lesion Sensitivity vs. False Detection Rates for Different Neighbourhood Models. PC=Proposed Classifier, NoN=No Neighbourhood Model, MRF=Standard Pairwise MRF. Individual points on the curve were generated by varying the random forest output threshold for acceptance as new lesion.

TABLE III
CLASSIFIER PERFORMANCE AT SELECTED OPERATING POINTS OF INTEREST

              Sensitivity at FD rate
Classifier    FD rate = 0.1    FD rate = 0.2
PC            0.835±0.082      0.898±0.055
MRF           0.809±0.078      0.836±0.060
NoN           0.700±0.085      0.742±0.077
JCint         0.662±0.040      0.701±0.040
STC           0.565±0.028      0.612±0.048

PC=Proposed, STC=Single Timepoint, JCint=Joint Timepoint using Intensities, NoN=No Neighbourhood Model, MRF=Pairwise MRF using ICM. Values after ± refer to 95% confidence intervals of the standard error of the mean.

3) Inter-timepoint registration: Experiments were carried out to compare the effect of inter-timepoint registration on overall classification accuracy. All other components of the model were kept constant, but random forest classifiers were generated for each registration type so as to be tuned to new lesion candidates as generated by that choice of registration (rigid registration will, for example, result in more new lesion candidates due to local misregistration). Results are summarized in Figure 9. In general, inter-timepoint registration does not appear to make a sizeable difference, although slightly better performance can be observed using non-linear registration in the operating range with false detection rates of 10-30%. Using linear (affine and rigid) registration results in a higher maximum sensitivity, as there are a few new lesions that are registered away when using non-linear registration, but also results in more false new lesion candidates due to local misregistration.

[Figure 9: sensitivity (y-axis) vs. false detection rate (x-axis), with curves for NonLin, Lin and Affine.]

Fig. 9. Lesion Sensitivity vs. False Detection Rates for Different Inter-Timepoint Registration Schemes. NonLin = non-linear reg., Lin = linear reg., Affine = affine reg. Individual points on the curve were generated by varying the random forest output threshold for acceptance as new lesion.

4) Performance as function of lesion size: Because detection of a new lesion that spans multiple slices and several hundred voxels is much easier than detecting a 3 voxel lesion on a single slice, we have examined performance as a function of lesion size. Figure 10 shows plots of sensitivity vs. false detection rate for different ranges of lesion size, from small (3-10 voxels) to very large (101+ voxels). Table IV shows sensitivity and false detection rates for different lesion size ranges when operating at overall target sensitivities of 0.8 and 0.9. Operating at a target sensitivity of 80% (note that target sensitivity is determined based on overall sensitivity across lesions of all sizes), we can see that sensitivities for different lesion sizes range from 61% for small lesions to 100% for very large lesions. We can achieve close to perfect detection (99% sensitivity, 2% false detection rate) for lesions over 50 voxels (corresponding to 0.15cc) at an overall target sensitivity of 80%. At a target sensitivity of 90%, we can exceed 80% sensitivity for small lesions, but at the expense of increased false detection rates.

[Figure 10: sensitivity (y-axis) vs. false detection rate (x-axis), with curves for small (n=106), small-med (n=53), medium (n=73), large (n=47), very large (n=57) and all (n=336).]

Fig. 10. New lesion sensitivity as function of new lesion size. small = 3-10 voxels, small-med = 11-20 voxels, medium = 21-50 voxels, large = 51-100 voxels, very large = 101+ voxels. n refers to the number of lesions of that size as per the ground truth reference. Individual points on the curve were generated by varying the random forest output threshold for acceptance as new lesion.

5) Comparison to Manual Labels: The proposed classifier was compared to manual identification of new lesions as done by 9 expert raters who identified new lesions based on coregistered FLAIR scans (other modalities were not used) from Data Set Bsub.

In addition to comparing to individual raters, we construct various consensus segmentations by considering segmentation overlap between raters, using the synthetic segmentations generated as described in Section III-A. We construct a consensus curve by in turn considering lesions labelled by any of the 9 raters (1 of 9), those labelled by at least 2 of 9, and so on.
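The "at least k of 9" construction can be illustrated with a short Python sketch. It assumes that lesions have already been matched across raters (the overlap-based matching is not shown), and the function name, lesion identifiers and the 3-rater toy data are hypothetical.

```python
def consensus_sets(rater_labels, num_raters):
    """For k = 1..num_raters, return lesions labelled by at least k raters.

    rater_labels maps a lesion identifier to the set of raters who marked it.
    """
    return {k: {lesion for lesion, raters in rater_labels.items()
                if len(raters) >= k}
            for k in range(1, num_raters + 1)}

# Hypothetical labels from 3 raters (the study used 9).
labels = {"A": {1, 2, 3}, "B": {1}, "C": {2, 3}}
levels = consensus_sets(labels, num_raters=3)
for k, lesions in levels.items():
    print(k, sorted(lesions))
# 1 ['A', 'B', 'C']
# 2 ['A', 'C']
# 3 ['A']
```

Each consensus level can then be scored against the reference standard like any other classifier output, giving one point per level on the consensus curve.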

Because manual identification was done using only FLAIR, we additionally compare to a variant of our classifier that uses only FLAIR images as input. Figure 11 displays performance curves for the proposed classifier (PC), a variant that uses only FLAIR (PC-FLR), the consensus expert new lesion identification (MANCONS), as well as each of the individual manual new lesion identifications.

TABLE IV
PERFORMANCE BY SIZE AT TARGET SENSITIVITIES OF 0.8 AND 0.9

Target Sens.   Metric   Overall   3-10   11-20   21-50   51-100   101+
               N        336       106    53      73      47       57
0.8            DET      270       65     39      63      46       57
               SENS     0.80      0.61   0.74    0.86    0.98     1.00
               FD       25        6      8       9       1        1
               FDR      0.08      0.08   0.17    0.13    0.02     0.02
0.9            DET      301       86     44      68      46       57
               SENS     0.90      0.81   0.83    0.93    0.98     1.00
               FD       91        42     26      20      5        3
               FDR      0.23      0.33   0.32    0.23    0.10     0.05

N=# new lesions as per ground truth, DET=# detected, SENS=sensitivity, FD=# false detections, FDR=false detection rate. All sizes in voxels.

[Figure 11: sensitivity (y-axis) vs. false detection rate (x-axis), with curves for PC, PC-FLR and MANCONS, and individual symbols for each expert rater.]

Fig. 11. Lesion Sensitivity as Compared to Manual Labels. PC = Proposed Classifier (4 modalities), PC-FLR = Proposed Classifier (FLAIR only), MANCONS = Consensus of expert raters. Each symbol represents an individual expert rater, while the X represents the mean expert rater. All error bars correspond to 95% confidence intervals of the standard error of the mean. Individual points on the curve for PC and PC-FLR were generated by varying the random forest output threshold for acceptance as new lesion. Individual points on the MANCONS curve were generated by considering different definitions of consensus (ranging from 1 of 9 experts to 9 of 9 experts).

The version of the proposed classifier that uses only FLAIR images as input outperforms any single expert rater and has performance comparable to a consensus of the 9 expert raters. Use of only FLAIR images as input results in a significant decrease in performance as compared to the proposed classifier using 4 modalities.

IV. DISCUSSION

As shown by our experimental results, a joint timepoint segmentation approach that models a temporal relationship between successive timepoints provides better identification of new lesions than an equivalent approach that segments MS lesions independently in sequential scans. The modelling of temporal dependencies provides a more temporally precise classification across timepoints by favouring a tissue classification that is consistent over time. Enforcing temporal consistency in our segmentation makes it easier to identify genuine change between sequential scans by lowering the segmentation variability over time. The use of difference images as observations for segmentation purposes provides additional temporal consistency, as we are implicitly modelling an additional relationship between scans. Difference images also provide additional sensitivity to change, as compared to the case where we consider native intensities independently.

The modelling of spatial relationships improves identification of new MS lesions by leveraging contextual evidence. The neighbourhood model that we have introduced provides performance comparable to a standard pairwise MRF when operating at high specificities, but successfully operates through a larger range of sensitivities than is possible with an MRF using ICM as an optimization scheme. This is due to the proposed soft optimization scheme, which propagates underlying distributions rather than just labels, allowing for detection of more subtle lesions that may have sub-MAP but non-negligible probability of being new lesion based on initial estimates.

We have used a hierarchical approach to new lesion identification, combining a first-stage generative voxel-wise classification using a Bayesian framework and a second-stage discriminative lesion-level classification using Random Forests. Although effort was made to find lesion-level features able to discriminate between genuine new lesions and false new lesions, there may exist a better feature set and better parameters for random forest classification. Many of the features chosen are correlated, and as such some of the features used may be redundant. Default parameters were used for random forest training and may be suboptimal.

Comparisons of our proposed classifier to expert manual identification of new lesion using only FLAIR show better performance than any individual expert, even when using only FLAIR images for automatic classification. Use of additional modalities other than FLAIR significantly improved classification performance, especially for small lesions. As it is more difficult (and time-consuming) for a human rater to synthesize several MRI modalities across 2 timepoints when identifying new lesions, the advantages of an automated approach become even more apparent for multi-modal classification tasks. In addition, the use of an automated technique eliminates the variability associated with manual segmentation, leading to higher precision, which is an important issue when trying to detect change.

Meaningful characterization of performance of lesion segmentations is difficult due to the absence of a real gold standard segmentation. We have attempted to provide a reliable ground truth identification of new lesions by incorporating a thorough manual verification stage on top of an existing pipeline that has been used on multiple large multi-center clinical trials, and by incorporating additional information (in the form of additional timepoints other than reference and follow-up) in the manual review phase. Ideally, we could evaluate the performance of our method on a common publicly available data set, but unfortunately no common data set for detection or segmentation of new MS lesions currently exists.

Comparisons of fully manual new lesion identification with an automated approach done in this study and in [12] highlight the shortcomings of using manual segmentations as ground truth, as is done in many lesion segmentation validations [17]. We believe that the reference standard used here, while admittedly imperfect, is superior to a fully manual reference segmentation. Training the lesion-level random forest classifier based on overlap with the ground truth (using cross-validation) may introduce a positive bias when comparing to fully manual identification of new lesions (Figure 11), as the classifier is tuned to the process that generated the ground truth. This bias is unavoidable given the nature of the data sets and should be small given the quality of the ground truth. While overall sensitivities to new lesion in the range of 80-90% may not seem exceedingly high, our reference includes many small, subtle lesions that may not be included in other reference segmentations. More than 30% of new lesions in our ground truth are comprised of 10 or fewer voxels, corresponding to a volume of less than 0.03cc.
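The size-related figures quoted here and in Section III-C can be checked against the counts in Table IV. The short Python sketch below recomputes the share of new lesions in the smallest size bin, and the sensitivity and false detection rate for lesions over 50 voxels at a target sensitivity of 0.8. The counts are copied from Table IV; the variable names are illustrative.

```python
# Counts copied from Table IV (target sensitivity 0.8); names are illustrative.
n_truth = {"3-10": 106, "11-20": 53, "21-50": 73, "51-100": 47, "101+": 57}
detected = {"51-100": 46, "101+": 57}
false_det = {"51-100": 1, "101+": 1}

# Share of ground-truth new lesions in the smallest size bin.
small_share = n_truth["3-10"] / sum(n_truth.values())

# Pool the two bins covering lesions larger than 50 voxels.
large_bins = ["51-100", "101+"]
det = sum(detected[b] for b in large_bins)
n_large = sum(n_truth[b] for b in large_bins)
fd = sum(false_det[b] for b in large_bins)

sensitivity = det / n_large              # DET / (DET + FNEG)
false_detection_rate = fd / (fd + det)   # FD / (FD + DET)

print(f"share of lesions with 3-10 voxels: {small_share:.3f}")                 # 0.315
print(f"sensitivity (>50 voxels): {sensitivity:.2f}")                          # 0.99
print(f"false detection rate (>50 voxels): {false_detection_rate:.2f}")       # 0.02
```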

Another complicating factor in validation of lesion segmentation algorithms is the inherent ambiguity of what constitutes lesion and when exactly lesion should be considered as new, as the exact time of transition from healthy tissue to lesion is not always obvious. This ambiguity leads to several instances where our reference standard has considered something as new lesion, while our classifier has considered this same case as stable lesion (i.e. lesion at both reference and follow-up and therefore not new), and vice-versa. Operating at a target sensitivity of 90%, 15 of the 35 (43%) undetected new lesions (false negatives) are cases where the classifier has considered a reference standard new lesion as a stable lesion.

Upon further review, some of the “false detection” new lesions were not truly falsely detected lesions, but rather genuine new lesions that were not included in our ground truth. An example of such a case is shown in Fig. 12. A disproportionate number of false detections occur in subjects that are active (i.e. having 1 or more new lesions). At a target sensitivity of 90%, there are on average 0.31 false detections per subject for inactive subjects, compared to 0.98 false detections per subject for active subjects. The number of false detections also increases with the number of new lesions occurring in the subject. It is hypothesized that this trend is due to two factors: 1) some of the new lesions determined to be false detections are not actual false detections but rather new lesions not included by experts in our ground truth, and 2) increases in lesion activity result in greater shifts in surrounding brain tissue, thus creating more intensity differences corresponding to misregistration that are misidentified as new lesions.

Because our ground truth is imperfect, the true numbers of false negatives and false positives are probably lower than what has been reported in this paper. As such, the curves presented in Section III-C can be thought of as lower bounds on performance. Additional examples of new lesion classifications are shown in Fig. 13.

Fig. 12. Example of a false detection (shown in red) that upon further review was determined to be a veritable lesion. Panels: (a) reference FLAIR, (b) follow-up FLAIR, (c) classification.

Fig. 13. Example new lesion classifications. True positive new lesions are shown in green. Panels: (a) and (d) reference FLAIR, (b) and (e) follow-up FLAIR, (c) and (f) classification.

The method presented here is fully automated. It provides a mechanism for trading off sensitivity and specificity, which is useful for tailoring the algorithm to a specific context. In many practical scenarios, such as clinical trials, there is generally some form of manual review of automated segmentations. In such a case, a lower specificity may be tolerable if it provides additional sensitivity. In a context where no manual review is done, or where the goal is only to determine whether a subject is active or inactive (i.e. we are interested only in the presence or absence of new lesions rather than in identifying all new lesions), a higher specificity and lower sensitivity may be more appropriate.
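The trade-off described here can be illustrated by choosing a confidence threshold that meets a target sensitivity on labelled candidates. The following is a minimal sketch, not the procedure used in this work: it assumes each candidate lesion carries a hypothetical confidence score and a reference label, and that a candidate is reported as new when its score is at or above the threshold.

```python
import math


def threshold_for_target_sensitivity(true_lesion_scores, target):
    """Largest threshold that still detects at least `target` of the true lesions."""
    if not true_lesion_scores:
        raise ValueError("need at least one true lesion score")
    if not 0 < target <= 1:
        raise ValueError("target sensitivity must be in (0, 1]")
    ranked = sorted(true_lesion_scores, reverse=True)
    # Number of true lesions that must be detected; round() guards against
    # floating-point noise in target * len(ranked).
    needed = math.ceil(round(target * len(ranked), 9))
    return ranked[needed - 1]


def operating_point(scores, labels, threshold):
    """Count (TP, FP, FN) when candidates with score >= threshold are reported."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    return tp, fp, fn


# Toy candidates for illustration only: label 1 = new lesion in the reference.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]
true_scores = [s for s, y in zip(scores, labels) if y]

thr80 = threshold_for_target_sensitivity(true_scores, 0.8)
thr100 = threshold_for_target_sensitivity(true_scores, 1.0)
print(thr80, operating_point(scores, labels, thr80))
print(thr100, operating_point(scores, labels, thr100))
```

In this toy example, raising the target sensitivity lowers the threshold and admits more false detections, which is the behaviour a manual-review workflow may accept and an unreviewed workflow may not.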

V. CONCLUSION

We have presented a framework for automated detection of new MS lesions using a two-stage classifier that first performs a joint Bayesian classification of tissue classes at each voxel of reference and follow-up scans using intensities and intensity differences, and then performs a lesion-level classification using a random forest classifier. The proposed new lesion classifier allows for a trade-off between sensitivity and specificity through the use of a user-specified confidence threshold (or target sensitivity). Sample points of operation show that our classifier is able to detect new lesions as small as 3 voxels with a sensitivity of 80% at a false detection rate of 7%, and with a sensitivity of 90% at a false detection rate of 23%, as compared to a reference standard. Comparisons to manual identification of new lesions using only sequential FLAIR scans showed better performance than any individual expert rater and performance comparable to a consensus segmentation combining manual identifications of new lesions from 9 independent raters.

REFERENCES

[1] Y. Duan, P. Hildenbrand, M. Sampat, D. Tate, I. Csapo, B. Moraal, R. Bakshi, F. Barkhof, D. Meier, and C. Guttmann, “Segmentation of subtraction images for the measurement of lesion change in multiple sclerosis,” American Journal of Neuroradiology, vol. 29, no. 2, pp. 340–346, 2008.

[2] B. Moraal, D. Meier, P. Poppe, J. Geurts, H. Vrenken, W. Jonker, D. Knol, R. Van Schijndel, P. Pouwels, C. Pohl et al., “Subtraction MR images in a multiple sclerosis multicenter clinical trial setting,” Radiology, vol. 250, no. 2, pp. 506–514, 2009.

[3] A. Evans, J. Frank, J. Antel, and D. Miller, “The role of MRI in clinical trials of multiple sclerosis: comparison of image processing techniques,” Annals of Neurology, vol. 41, no. 1, pp. 125–132, 1997.

[4] P. Molyneux, P. Tofts, A. Fletcher, B. Gunn, P. Robinson, H. Gallagher, I. Moseley, G. Barker, and D. Miller, “Precision and reliability for measurement of change in MRI lesion volume in multiple sclerosis: a comparison of two computer assisted techniques,” Journal of Neurology, Neurosurgery & Psychiatry, vol. 65, no. 1, pp. 42–47, 1998.

[5] C. Guttmann, R. Kikinis, M. Anderson, M. Jakab, S. Warfield, R. Killiany, H. Weiner, F. Jolesz et al., “Quantitative follow-up of patients with multiple sclerosis using MRI: reproducibility,” Journal of Magnetic Resonance Imaging, vol. 9, no. 4, pp. 509–518, 1999.

[6] P. Molyneux, D. Miller, M. Filippi, T. Yousry, E. Radu, H. Ader, and F. Barkhof, “Visual analysis of serial T2-weighted MRI in multiple sclerosis: intra- and interobserver reproducibility,” Neuroradiology, vol. 41, no. 12, pp. 882–888, 1999.

[7] E. Ashton, C. Takahashi, M. Berg, A. Goodman, S. Totterman, and S. Ekholm, “Accuracy and reproducibility of manual and semiautomated quantification of MS lesions by MRI,” Journal of Magnetic Resonance Imaging, vol. 17, no. 3, pp. 300–308, 2003.

[8] B. Moraal, I. van den Elskamp, D. Knol, B. Uitdehaag, J. Geurts, H. Vrenken, P. Pouwels, R. van Schijndel, D. Meier, C. Guttmann et al., “Long-interval T2-weighted subtraction magnetic resonance imaging: A powerful new outcome measure in multiple sclerosis trials,” Annals of Neurology, vol. 67, no. 5, pp. 667–675, 2010.

[9] M. P. Sormani, B. Stubinski, P. Cornelisse, S. Rocak, D. Li, and N. D. Stefano, “Magnetic resonance active lesions as individual-level surrogate for relapses in multiple sclerosis,” Multiple Sclerosis Journal, vol. 17, no. 5, pp. 541–549, 2011.

[10] K. Van Leemput, F. Maes, D. Vandermeulen, A. Colchester, and P. Suetens, “Automated segmentation of multiple sclerosis lesions by model outlier detection,” Medical Imaging, IEEE Transactions on, vol. 20, no. 8, pp. 677–688, Aug. 2001.

[11] X. Wei, S. K. Warfield, K. H. Zou, Y. Wu, X. Li, A. Guimond, J. P. Mugler, R. R. Benson, L. Wolfson, H. L. Weiner, and C. R. Guttmann, “Quantitative analysis of MRI signal abnormalities of brain white matter with high reproducibility and accuracy,” Journal of Magnetic Resonance Imaging, vol. 15, no. 2, pp. 203–209, 2002.

[12] M. Bosc, F. Heitz, J.-P. Armspach, I. Namer, D. Gounot, and L. Rumbach, “Automatic change detection in multimodal serial MRI: application to multiple sclerosis lesion evolution,” NeuroImage, vol. 20, no. 2, pp. 643–656, 2003.

[13] A. Zijdenbos, R. Forghani, and A. Evans, “Automatic ‘pipeline’ analysis of 3-D MRI data for clinical trials: application to multiple sclerosis,” Medical Imaging, IEEE Transactions on, vol. 21, no. 10, pp. 1280–1291, Oct. 2002.

[14] R. Harmouche, L. Collins, D. Arnold, S. Francis, and T. Arbel, “Bayesian MS lesion classification modeling regional and local spatial information,” in Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, vol. 3, 2006, pp. 984–987.

[15] D. Garcia-Lorenzo, S. Prima, D. Arnold, D. Collins, and C. Barillot, “Trimmed-likelihood estimation for focal lesions and tissue segmentation in multisequence MRI for multiple sclerosis,” Medical Imaging, IEEE Transactions on, vol. 30, no. 8, pp. 1455–1467, Aug. 2011.

[16] S. Francis, “Automatic lesion identification in MRI of multiple sclerosis patients,” Ph.D. dissertation, McGill University, 2004.




[17] X. Llado, A. Oliver, M. Cabezas, J. Freixenet, J. Vilanova, A. Quiles, L. Valls, L. Ramio-Torrenta, and A. Rovira, “Segmentation of multiple sclerosis lesions in brain MRI: A review of automated approaches,” Information Sciences, 2011.

[18] E. Geremia, O. Clatz, B. Menze, E. Konukoglu, A. Criminisi, and N. Ayache, “Spatial decision forests for MS lesion segmentation in multi-channel magnetic resonance images,” NeuroImage, vol. 57, no. 2, pp. 378–390, 2011.

[19] J. Thirion and G. Calmon, “Deformation analysis to detect and quantify active lesions in three-dimensional medical image sequences,” Medical Imaging, IEEE Transactions on, vol. 18, no. 5, pp. 429–441, 1999.

[20] G. Gerig, D. Welti, C. Guttmann, A. Colchester, and G. Szekely, “Exploring the discrimination power of the time domain for segmentation and characterization of active lesions in serial MR data,” Medical Image Analysis, vol. 4, no. 1, pp. 31–42, 2000.

[21] G. Szekely, D. Welti, G. Gerig, E. Radu, and L. Kappos, “Spatio-temporal segmentation of active multiple sclerosis lesions in serial MRI data,” in Information Processing in Medical Imaging. Springer, 2001, pp. 438–445.

[22] D. Rey, G. Subsol, H. Delingette, and N. Ayache, “Automatic detection and segmentation of evolving processes in 3D medical images: Application to multiple sclerosis,” Medical Image Analysis, vol. 6, no. 2, pp. 163–179, 2002.

[23] S. Prima, N. Ayache, A. Janke, S. Francis, D. Arnold, and D. Collins, “Statistical analysis of longitudinal MRI data: Applications for detection of disease activity in MS,” Medical Image Computing and Computer-Assisted Intervention–MICCAI 2002, pp. 363–371, 2002.

[24] L. Aït-Ali, S. Prima, P. Hellier, B. Carsin, G. Edan, and C. Barillot, “STREM: a robust multidimensional parametric method to segment MS lesions in MRI,” Medical Image Computing and Computer-Assisted Intervention–MICCAI 2005, pp. 409–416, 2005.

[25] M. Lee, S. Smith, J. Palace, and P. Matthews, “Defining multiple sclerosis disease activity using MRI T2-weighted difference imaging,” Brain, vol. 121, no. 11, pp. 2095–2102, 1998.

[26] I. Tan, R. Van Schijndel, F. Fazekas, M. Filippi, P. Freitag, D. Miller, T. Yousry, P. Pouwels, H. Ader, and F. Barkhof, “Image registration and subtraction to detect active T2 lesions in MS: an interobserver study,” Journal of Neurology, vol. 249, no. 6, pp. 767–773, 2002.

[27] M. Liguori, D. Meier, P. Hildenbrand, B. Healy, T. Chitnis, N. Baruch, S. Khoury, H. Weiner, R. Bakshi, F. Barkhof et al., “One year activity on subtraction MRI predicts subsequent 4 year activity and progression in multiple sclerosis,” Journal of Neurology, Neurosurgery & Psychiatry, vol. 82, no. 10, pp. 1125–1131, 2011.

[28] C. Elliott, S. Francis, D. Arnold, D. Collins, and T. Arbel, “Bayesian classification of multiple sclerosis lesions in longitudinal MRI using subtraction images,” Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010, pp. 290–297, 2010.

[29] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[30] V. Fonov, A. Evans, K. Botteron, C. Almli, R. McKinstry, D. Collins et al., “Unbiased average age-appropriate atlases for pediatric studies,” NeuroImage, vol. 54, no. 1, pp. 313–327, 2011.

[31] S. Smith, “Fast robust automated brain extraction,” Human Brain Mapping, vol. 17, no. 3, pp. 143–155, 2002.

[32] J. Sled, A. Zijdenbos, and A. Evans, “A nonparametric method for automatic correction of intensity nonuniformity in MRI data,” Medical Imaging, IEEE Transactions on, vol. 17, no. 1, pp. 87–97, 1998.

[33] D. Collins, C. Holmes, T. Peters, and A. Evans, “Automatic 3-D model-based neuroanatomical segmentation,” Human Brain Mapping, vol. 3, no. 3, pp. 190–208, 1995.

[34] L. Nyul, J. Udupa, and X. Zhang, “New variants of a method of MRI scale standardization,” Medical Imaging, IEEE Transactions on, vol. 19, no. 2, pp. 143–150, 2000.

[35] P. Anbeek, K. L. Vincken, M. J. van Osch, R. H. Bisschops, and J. van der Grond, “Probabilistic segmentation of white matter lesions in MR imaging,” NeuroImage, vol. 21, no. 3, pp. 1037–1044, 2004.

[36] A. Akselrod-Ballin, M. Galun, R. Basri, A. Brandt, M. Gomori, M. Filippi, and P. Valsasina, “An integrated segmentation and classification approach applied to multiple sclerosis analysis,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 1, June 2006, pp. 1122–1129.

[37] N. Shiee, P.-L. Bazin, A. Ozturk, D. S. Reich, P. A. Calabresi, and D. L. Pham, “A topology-preserving approach to the segmentation of brain images with multiple sclerosis lesions,” NeuroImage, vol. 49, no. 2, pp. 1524–1535, 2010.


