
Chapter 6
Applications of the ESPNet architecture in medical imaging

Sachin Mehta 1, Nicholas Nuechterlein 1, Ezgi Mercan 2, Beibin Li 1, Shima Nofallah 1, Wenjun Wu 1, Ximing Lu 1, Anat Caspi 1, Mohammad Rastegari 1, Joann Elmore 3, Hannaneh Hajishirzi 1, Linda Shapiro 1

1 University of Washington, Seattle, WA, United States
2 Seattle Children's Hospital, Seattle, WA, United States
3 University of California, Los Angeles, CA, United States

Abstract
Medical imaging is a fundamental part of clinical care that creates informative, noninvasive, visual representations of the structure and function of the interior of the body. With advancements in technology and the availability of massive amounts of imaging data, data-driven methods, such as machine learning and data mining, have become popular in medical image analysis. In particular, deep learning-based methods, such as convolutional neural networks, now have the requisite volume of data and computational power to be considered practical clinical tools. We describe the architecture of the ESPNet network and provide experimental results for the task of semantic segmentation on two different types of medical images: (1) tissue-level segmentation of breast biopsy whole slide images and (2) 3D tumor segmentation in brain magnetic resonance images. Our results show that the ESPNet architecture is efficient and learns meaningful representations for different types of medical images, which allows ESPNet to perform well on these images.

Keywords: ESPNet; whole slide images; magnetic resonance images; medical imaging; tumor segmentation

6.1 Introduction
Medical imaging creates visual representations of the human body, including organs and tissues, to aid in diagnosis and treatment. These visual representations are effective in a variety of medical settings and have become integral to clinical decision making. A common example is X-ray-based radiography examinations, which are used to capture visual representations of the interior structure of the body. Other examples include Papanicolaou (PAP) smear analysis for cervical cancer screening and whole slide biopsy analysis for multiple kinds of cancer, including breast cancer and melanoma. Today's physician is no longer able to interpret unaided the vast amount of information now available in medical imaging; for example, some whole slide images (WSIs) of a biopsy contain hundreds of thousands of cells.

Machine learning is an effective tool that enables machines (or computers) to learn meaningful patterns from medical imaging data, which can be used to build computer-aided diagnostic systems [1,2]. With the recent advancements in hardware technology and the availability of large amounts of medical imaging data, deep learning-based methods, especially convolutional neural networks (CNNs) [3], are gaining attention in medical image analysis [4,5]. Researchers have applied deep learning-based methods to a variety of medical data, such as WSIs [6], magnetic resonance (MR) tumor scans [7], and electron microscopic recordings [8], and to tasks such as cancer diagnosis [2,9] and cellular-level entity detection [8,10].

In this chapter, we describe the ESPNet architecture [11], which has been successfully applied across a variety of visual recognition tasks, including image classification, object detection, semantic segmentation, and medical image analysis [6,12,13]. To demonstrate the modeling power of the ESPNet architecture, we study the application of ESPNet to two different medical imaging tasks: (1) tissue-level segmentation of breast biopsy WSIs and (2) tumor segmentation in 3D brain MR images. Our results show that the ESPNet architecture efficiently learns meaningful representations that allow it to deliver good accuracy on different tasks.

The rest of the chapter is organized as follows. Section 6.2 reviews the different types of convolutions that are used in the ESPNet architecture, which is described in Section 6.3. Experimental results on two different medical imaging datasets are provided in Section 6.4. Section 6.5 concludes the chapter.

6.2 Background
The convolution operation lies at the heart of many computer vision algorithms, including CNNs. In this section, we briefly review the two types of convolutions, standard and dilated convolutions, that are used in the ESPNet architecture.

6.2.1 Standard convolution
For a given 2D input image X, the standard convolution moves a kernel K of size n × n (see Footnote 1) over every spatial location of the input X to produce the output Y. Mathematically, it can be defined as:

Y = X * K,

where * denotes the convolution operation.
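The following is a minimal NumPy sketch, ours rather than the chapter's, that makes this definition concrete; it implements the cross-correlation form of the convolution commonly used in CNNs, with zero padding so that the output has the same size as the input, as in Fig. 6.1.

import numpy as np

def conv2d(X, K):
    """Standard 2D convolution (cross-correlation form) of a 2D input X
    with an n x n kernel K; X is zero-padded so that the output Y has
    the same spatial size as X."""
    n = K.shape[0]
    pad = n // 2
    Xp = np.pad(X, pad)                     # zero padding (white cells in Fig. 6.1)
    Y = np.zeros(X.shape)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i, j] = np.sum(Xp[i:i + n, j:j + n] * K)
    return Y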

6.2.2 Dilated convolution
Dilated convolutions (see Footnote 2) are a special form of standard convolutions in which holes are inserted between kernel elements to increase the effective receptive field. These convolutions have been widely used in semantic segmentation networks [14,15]. For a given 2D input image X, the dilated convolution moves a kernel K of size n × n with a dilation rate of r over every spatial location of the input X to produce the output Y. Mathematically, it can be defined as:

Y = X *_r K,

where *_r denotes the dilated convolution operation with a dilation rate of r. The effective receptive field of a dilated convolutional kernel K of size n × n with a dilation rate of r is [(n − 1)r + 1] × [(n − 1)r + 1]. Note that dilated convolutions are the same as standard convolutions when the dilation rate r is 1. An example comparing dilated and standard convolutions is shown in Fig. 6.1.
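To illustrate, here is a minimal NumPy sketch (again ours, not from the chapter) of a dilated convolution; the kernel taps are sampled r pixels apart, and with r = 1 it reduces to the standard convolution above.

import numpy as np

def dilated_conv2d(X, K, r=1):
    """Dilated 2D convolution of a 2D input X with an n x n kernel K and
    dilation rate r; kernel taps are spaced r pixels apart."""
    n = K.shape[0]
    erf = (n - 1) * r + 1                   # effective receptive field side
    pad = erf // 2
    Xp = np.pad(X, pad)                     # zero padding for same-size output
    Y = np.zeros(X.shape)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # strided window: n x n taps with holes of size r - 1 between them
            Y[i, j] = np.sum(Xp[i:i + erf:r, j:j + erf:r] * K)
    return Y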

Figure 6.1 A comparison between standard and dilated convolution operations. Input X is padded with zeros (white cells) so that the resulting output after the convolution operation is of the same size as that of the input.

6.3 The ESPNet architecture
In this section, we elaborate on the details of the ESPNet architecture. We first describe the ESP module, the core building block of the ESPNet architecture, and then describe different variants of the ESPNet architecture for analyzing different types of medical images, such as whole slide biopsy images and 3D brain tumor images.

6.3.1 Efficient spatial pyramid unit
The core building block of the ESPNet architecture is the efficient spatial pyramid (ESP) unit. To be efficient, the ESP unit decomposes the standard convolution into a point-wise convolution and a spatial pyramid of dilated convolutions using the RSTM (reduce, split and transform, and merge) principle:

• Reduce: The ESP unit applies d point-wise (or 1 × 1) convolutions (see Footnote 3) to an input tensor X of size W × H × N to produce an output tensor of size W × H × d, where W and H represent the width and height of the tensor, whereas N and d represent the number of channels in the input and output tensors.

• Split and Transform: To learn spatial representations from a large receptive field efficiently, the ESP unit applies K n × n dilated convolutions simultaneously to this reduced tensor to produce outputs Y_k, k ∈ {1, ..., K}, using a different dilation rate, 2^(k−1), in each branch. This allows the ESP unit to learn spatial representations from an effective receptive field of [(n − 1)2^(K−1) + 1] × [(n − 1)2^(K−1) + 1], where n denotes the kernel size.

• Merge: The ESP unit concatenates these d-dimensional feature maps to produce an M-dimensional feature map Y = concat(Y_1, ..., Y_K), where concat represents the concatenation operation and M = Kd.

Fig. 6.2 compares the ESP unit with the standard convolution. The ESP unit learns Nd + Kn²d² parameters (with d = M/K) and performs W·H·(Nd + Kn²d²) multiply-accumulate operations. Compared to the n²NM parameters and W·H·n²NM operations of the standard convolution, the RSTM principle reduces the total number of parameters and operations by a factor of Kn²N/(N + n²M), while simultaneously increasing the effective receptive field by approximately [2^(K−1)]².
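A minimal PyTorch sketch of an ESP-style unit is given below; it follows the RSTM steps described above, with illustrative class and argument names (ESP, n_in, m_out) that are ours, not the authors'. The published ESPNet implementation differs in details such as strides, normalization, and activation layers.

import torch
import torch.nn as nn

class ESP(nn.Module):
    """Sketch of an ESP-style unit: reduce (point-wise convolution),
    split and transform (K parallel dilated convolutions with dilation
    rates 1, 2, 4, ...), and merge (channel-wise concatenation)."""
    def __init__(self, n_in, m_out, K=4, n=3):
        super().__init__()
        d = m_out // K                                   # d = M / K
        self.reduce = nn.Conv2d(n_in, d, kernel_size=1)
        self.branches = nn.ModuleList([
            nn.Conv2d(d, d, kernel_size=n, dilation=2 ** k,
                      padding=(n - 1) * 2 ** k // 2)     # same-size output
            for k in range(K)
        ])

    def forward(self, x):
        y = self.reduce(x)                               # Reduce
        outs = [branch(y) for branch in self.branches]   # Split and Transform
        return torch.cat(outs, dim=1)                    # Merge: M = K * d channels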

6.3.1.1 Hierarchical feature fusion for degridding in the efficient spatial pyramid unit
Dilated convolutions insert zeros, controlled by the dilation rate, between kernel elements to increase the receptive field of the kernel. However, this introduces unwanted checkerboard or gridding artifacts in the output, as shown in Fig. 6.3. The ESP unit introduces a hierarchical feature fusion (HFF) method to remove these artifacts in a computationally efficient manner (see Footnote 4). HFF hierarchically adds the branch outputs, Y_k = Y_k ⊕ Y_(k−1) for k = 2, ..., K, and then concatenates them to produce a high-dimensional feature map, where ⊕ represents the element-wise addition operation. Fig. 6.4 visualizes the ESP unit with and without HFF.
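As a sketch, the HFF merge step can be expressed in a few lines of PyTorch; the function below (hff_merge, our illustrative name) would replace the plain concatenation in the final line of the ESP sketch above.

import torch

def hff_merge(branch_outputs):
    """Hierarchical feature fusion (sketch): each dilated-convolution
    branch output is added element-wise to the previous (already fused)
    branch output before all of them are concatenated channel-wise."""
    fused = [branch_outputs[0]]            # branch with the smallest dilation
    for y in branch_outputs[1:]:
        fused.append(y + fused[-1])        # element-wise addition
    return torch.cat(fused, dim=1)         # high-dimensional feature map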

Figure 6.2 A comparison between the standard convolution and the ESP unit. In the ESP unit, the standard convolution is decomposed into a point-wise convolution and a spatial pyramid of dilated convolutions using the RSTM principle. Note that in dilated convolutions, zeros (represented in white) are inserted between the kernel elements to increase the receptive field.

Figure 6.3 This figure illustrates a gridding artifact example, where a 5 × 5 input with a single active pixel is convolved with a 3 × 3 dilated convolution kernel to produce an output with a gridding artifact. Active input, output, and kernel elements are shown in color.


6.3.2 Segmentation architecture
Most image segmentation architectures follow an encoder–decoder structure [16]. The encoder network is a stack of encoding units, such as the bottleneck unit in ResNet [17], and downsampling units that help the network learn multiscale representations. Spatial information is lost during the filtering and downsampling operations in the encoder; the decoder tries to recover this lost information. The decoder can be viewed as an inverse of the encoder that stacks upsampling and decoding units to learn representations that can help produce either binary or multiclass segmentation masks. It is important to note that a vanilla encoder–decoder network does not share information between encoding and decoding units at each spatial level. To share such information, U-Net [8] introduces a skip connection between the encoding and decoding units at each spatial level. This connection establishes a direct link between the encoding and decoding units and improves gradient flow, thus improving segmentation performance. Fig. 6.5 compares the vanilla encoder–decoder and U-Net-style encoder–decoder architectures.

The ESPNet architecture for semantic segmentation extends U-Net. In contrast to the computationally expensive VGG-style [18] blocks used for encoding and decoding the information, the ESPNet architecture uses computationally efficient ESP units.
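The two ways a skip connection can fuse encoder and decoder features, studied later in Table 6.2, can be sketched in a few lines of PyTorch; the tensors below are illustrative stand-ins for an encoder feature map and the upsampled decoder feature map at the same spatial level.

import torch

enc = torch.randn(1, 64, 96, 96)   # encoder features at some spatial level
dec = torch.randn(1, 64, 96, 96)   # upsampled decoder features, same size

# (1) element-wise addition: no extra channels, so no extra parameters downstream
fused_add = enc + dec

# (2) concatenation: channel count doubles, so the next layer learns more weights
fused_cat = torch.cat([enc, dec], dim=1)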

6.4 Experimental results
In this section, we provide results of the ESPNet architecture on two different medical imaging datasets: (1) breast biopsy whole slide imaging and (2) brain tumor segmentation.

6.4.1 Breast biopsy whole slide image dataset
6.4.1.1 Dataset
The breast biopsy dataset consists of 240 WSIs with haematoxylin and eosin (H&E) staining [19,20]. Three expert pathologists independently interpreted each of the 240 cases and then met to review each case together and provide a consensus reference diagnosis label, which we use as the gold standard ground truth for each case. Additionally, the expert pathologists marked 428 regions of interest (ROIs) on these 240 WSIs that were representative of their diagnoses. Out of these 428 ROIs, 58 were manually segmented by an expert into eight different tissue-level segmentation labels: background, benign epithelium, malignant epithelium, normal stroma, desmoplastic stroma, secretion, blood, and necrosis. Fig. 6.6 visualizes some ROIs along with their tissue-level segmentation labels. Furthermore, we split these 58 ROIs into two roughly equally sized subsets, a training set (30 ROIs) and a test set (28 ROIs), while keeping the distribution of diagnostic categories similar (see Table 6.1). A total of 87 pathologists who were actively interpreting breast biopsies in their own clinical practices participated in the study and provided one of the four diagnostic labels (benign, atypia, ductal carcinoma in situ, and invasive cancer) per WSI of a case. Each pathologist provided diagnostic labels for a randomly assigned subset of 60 patients' WSIs, producing an average of 22 diagnostic labels per case.

Figure 6.4 Block-level visualization of the ESP unit without and with hierarchical feature fusion (HFF).

Figure 6.5 Encoder–decoder architectures for tissue-level semantic segmentation. Colors represent different types of units in the encoder–decoder architecture: encoding unit in green, downsampling unit in red, upsampling unit in orange, and decoding unit in blue.

Table 6.1 Diagnostic category-wise distribution of the 58 regions of interest (ROIs) in the segmentation subset.

Expert consensus reference diagnostic category | #ROI: Training | #ROI: Test | #ROI: Total | Average ROI size (in pixels)
Benign | 4 | 5 | 9 | –
Atypia | 11 | 11 | 22 | –
DCIS | 12 | 10 | 22 | –
Invasive | 3 | 2 | 5 | –
Total | 30 | 28 | 58 | –

6.4.1.2 Training
The breast biopsy ROIs (or WSIs) have huge spatial dimensions, often of the order of gigapixels. Due to the high spatial dimensionality of these images and the limited computational capability of existing hardware resources, it is difficult to train conventional CNNs directly on them. Therefore we follow a patch-based approach that splits an ROI (or WSI) into small patches of size 384 × 384 with an overlap of 56 pixels between consecutive patches. We use standard augmentation strategies, such as random flipping, cropping, and resizing, to prevent overfitting. For training, we split the training set of 30 ROIs into training and validation subsets with a 90:10 ratio. We train our network for 100 epochs using stochastic gradient descent (SGD) with an initial learning rate of 0.0001, which is decayed by a factor of 0.5 after every 30 epochs. To evaluate the tissue-level segmentation performance, we use mean intersection over union (mIOU), a widely used metric for evaluating segmentation performance.
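A minimal sketch of the mIOU computation over integer label maps is shown below (our code; it assumes the eight tissue labels of this dataset and averages IOU over the classes that actually appear).

import numpy as np

def mean_iou(pred, target, num_classes=8):
    """Mean intersection over union between two integer label maps of the
    same shape; classes absent from both maps are skipped."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))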

The objects of interest in a breast biopsy vary in shape, structure, and size, and splitting such structures into fixed-size patches limits the contextual information available to the network; as a result, CNNs tend to make segmentation errors, especially at the borders of patches (see Fig. 6.7). To reduce such errors, we centrally crop each 384 × 384 prediction to a size of 256 × 256, as shown in Fig. 6.8.
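The patching and center-cropping scheme can be sketched as follows (our code, with the patch size, overlap, and crop size taken from the text):

def patch_starts(length, patch=384, overlap=56):
    """Top-left coordinates of overlapping patches along one image axis."""
    stride = patch - overlap                 # 328-pixel step between patches
    starts = list(range(0, max(length - patch, 0) + 1, stride))
    if starts[-1] + patch < length:          # cover the image border too
        starts.append(length - patch)
    return starts

def center_crop(pred, out=256):
    """Centrally crop a 384 x 384 prediction (H, W, ...) to 256 x 256 to
    discard the error-prone patch borders (Fig. 6.8)."""
    h, w = pred.shape[0], pred.shape[1]
    top, left = (h - out) // 2, (w - out) // 2
    return pred[top:top + out, left:left + out]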

Figure 6.6 A set of ROIs (top row) along with their tissue-level segmentation labels (bottom row) from the dataset.

Figure 6.7 Segmentation errors along the border of the patch.


6.4.1.3 Segmentation results
Tables 6.2 and 6.3 summarize the performance of different segmentation architectures.

Table 6.2 Impact of skip connections and different decoding units.

Encoding–decoding units | Skip connection: Add | Skip connection: Concat | Network parameters | mIOU
ESP–ESP | | | 1.95 M | 35.23
 | ✓ | | 1.95 M | 36.19
 | | ✓ | 2.25 M | 38.03
ESP–PSP | | ✓ | 2.75 M | 44.03

Table 6.3 Comparison with state-of-the-art methods.

Method | Network parameters | mIOU
Superpixel + SVM | NA | 25.8
SegNet | 12.80 M | 37.6
SegNet + additive skip connections | 12.80 M | 38.1
Multiresolution | 26.03 M | 44.20
Y-Net (ESP–PSP) | 2.75 M | 44.03

Notes: The results in the first four rows are taken from Refs [6,11,21].

6.4.1.4 Skip connections
Skip connections between the encoder and the decoder in Fig. 6.5 can be constructed using two operations: (1) element-wise addition and (2) concatenation. The impact of these connections is studied in Table 6.2. We can see that encoder–decoder networks with skip connections constructed using concatenation operations improve the accuracy of the vanilla encoder–decoder by about 4%, and that encoder–decoder networks with skip connections constructed using element-wise addition operations improve it by about 2%.

6.4.1.5 Pyramidal spatial pooling as a decoding unit
Traditionally, the decoding unit in encoder–decoder architectures, such as SegNet [16] and U-Net [8], is the same as the encoding unit. In our recent work [6,11,21], we introduced a general encoder–decoder network, sketched in Fig. 6.5, that allows the use of different encoding and decoding units. When we replaced the ESP unit with the pyramidal spatial pooling (PSP) unit [22] in the decoder, the performance of our network improved by about 6%. This is likely because the pooling operations in the PSP unit allow the capture of better global contextual representations (Fig. 6.9).

Figure 6.8 Center cropping to minimize segmentation errors along patch borders.


6.4.1.6 Comparison with state-of-the-art methods
Table 6.3 compares the performance of different methods. We can see that the asymmetric encoder–decoder structure, with the ESP as the encoding unit and the PSP as the decoding unit, allows us to build a light-weight network that has fewer parameters than the networks of Refs [6,11,21] while delivering similar segmentation performance.

6.4.1.7 Tissue-level segmentation masks for computer-aided diagnosis
Tissue-level segmentation masks provide a powerful abstraction for diagnostic classification. To demonstrate the descriptive power of tissue-level segmentation masks, we extract and study the impact of two features, tissue-distribution and structural features [2], on a four-class breast cancer diagnostic task. For this analysis, we use the 428 ROIs. We use a support-vector machine classifier with a polynomial kernel in a leave-one-out cross-validation strategy. We measure the performance in terms of sensitivity, specificity, and accuracy. Table 6.4 summarizes the diagnostic classification results. We can see that simple features extracted from tissue-level segmentation masks are powerful, and we are able to attain accuracies similar to those of pathologists.
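As a sketch of this evaluation protocol (our code, with random stand-ins for the real features and labels), scikit-learn makes the leave-one-out setup a one-liner:

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVC

# Illustrative stand-ins: one feature vector per ROI and one of four
# diagnostic labels per ROI (benign, atypia, DCIS, invasive).
X = np.random.rand(428, 16)
y = np.random.randint(0, 4, size=428)

clf = SVC(kernel="poly")                       # SVM with a polynomial kernel
pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())
accuracy = float(np.mean(pred == y))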

Table 6.4 Impact of different features on diagnostic classification from Ref. [2].

Diagnostic features | Accuracy | Sensitivity | Specificity
Invasive versus noninvasive
Tissue-distribution feature | 0.94 | 0.70 | 0.95
Structural feature | 0.91 | 0.49 | 0.96
Pathologists | 0.98 | 0.84 | 0.99
Atypia and DCIS versus benign
Tissue-distribution feature | 0.70 | 0.79 | 0.41
Structural feature | 0.70 | 0.85 | 0.45
Pathologists | 0.81 | 0.72 | 0.62
DCIS versus atypia
Tissue-distribution feature | 0.83 | 0.88 | 0.78
Structural feature | 0.85 | 0.89 | 0.80
Pathologists | 0.80 | 0.70 | 0.82

Notes: The best numbers are indicated in bold.

Figure 6.9 ROI-wise predictions. The first row shows different breast biopsy ROIs; the second row shows tissue-level ground-truth labels; the third row shows predictions. ROI, region of interest.

6.4.2 Brain tumor segmentation
6.4.2.1 Dataset
The Multimodal Brain Tumor Segmentation Challenge (BraTS) 2018 training set consists of 285 multi-institutional preoperative multimodal MR tumor scans, each consisting of T1, postcontrast T1-weighted (T1ce), T2, and fluid-attenuated inversion recovery (FLAIR) volumes [23]. Each volume is annotated with voxel labels corresponding to the following tumor compartments: enhancing tumor, peritumoral edema, background, and necrotic core and nonenhancing tumor. Necrotic core and nonenhancing tumor share a single label. These data are coregistered to the standard Montreal Neurological Institute (MNI) anatomical template, interpolated to the same resolution, and skull-stripped. Ground-truth segmentations are manually drawn by radiologists. Fig. 6.10 visualizes some MR modalities along with their voxel-level segmentation labels.

6.4.2.2 Training
MR scans are high-dimensional: each of the four MR sequence volumes has a dimension of 240 × 240 × 155. We use standard augmentation strategies, such as random flipping, cropping, and resizing, to prevent overfitting. For training, we split the training set of 285 MR scans into train and validation subsets with an 80:20 ratio (228:57). We train our network for 300 epochs using SGD with an initial learning rate of 0.0001, decreased to 0.00001 after 200 epochs. We evaluate the segmentation performance using an online server that measures performance in terms of the Dice score. Furthermore, we adapt the ESP module for volume-wise segmentation by replacing the spatial dilated convolutions in Fig. 6.4 with volume-wise dilated convolutions. For more details, please see Ref. [13].
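A minimal PyTorch sketch of the two ingredients mentioned above is given below: a volume-wise (3D) dilated convolution that would replace the spatial (2D) ones in the ESP unit, and the Dice score for a pair of binary masks. Layer widths and tensor shapes are illustrative, not taken from Ref. [13].

import torch
import torch.nn as nn

# Volume-wise dilated convolution: padding = dilation keeps the output the
# same size as the input for a 3 x 3 x 3 kernel.
dilated_3d = nn.Conv3d(16, 16, kernel_size=3, dilation=2, padding=2)
x = torch.randn(1, 16, 64, 96, 96)          # (batch, channels, D, H, W)
y = dilated_3d(x)                           # same spatial size as x

def dice(pred, target, eps=1e-6):
    """Dice score between two binary masks (the BraTS evaluation metric)."""
    inter = (pred * target).sum()
    return float((2 * inter + eps) / (pred.sum() + target.sum() + eps))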

6.4.2.3 Results
Table 6.5 summarizes the results on the BraTS 2018 online test and validation sets. Unlike the winning entries of the competition, which often use specific normalization techniques, model ensembling, and postprocessing methods such as conditional random fields (CRFs) to boost performance [7,24], our network was able to attain good segmentation performance while learning merely 3.8 million parameters, an order of magnitude fewer than existing methods. Furthermore, our visual inspection reveals convincing performance; we display segmentation results overlaid on different modalities in Fig. 6.11. It is important to note that the predictions made by our network are smooth and lack some of the granularity present in the ground-truth segmentation. This is likely because our model is not able to learn such granular details with a limited amount of training data. We believe that postprocessing techniques, such as CRFs, would help our network capture such granular details, and we will study such methods in the future.

Table 6.5 Results obtained by our method [13] on the BraTS 2018 online test and validation sets.

Networks | Whole tumor | Enhancing tumor | Tumor core
Ours (validation) | 0.883 | 0.737 | 0.814
Ours (test) | 0.850 | 0.665 | 0.782
Myronenko [24] | 0.884 | 0.766 | 0.815

Notes: We also compare with the winning entry on the test set [24], which uses an ensemble of 10 models and special normalization techniques.

Figure 6.10 The set of MR modalities (left) along with their voxel-level segmentation labels (right). MR, magnetic resonance.

6.4.3 Other applications
The ESPNet architecture is a general-purpose architecture and can be used across a wide variety of applications, ranging from image classification to language modeling. In our work, we have used ESPNets for image classification [12], object detection [12], semantic segmentation [6,11,21], language modeling [12], and autism spectrum disorder prediction [25].

6.5 Conclusion
This chapter describes the ESPNet architecture, which is built using a novel convolutional unit, the ESP unit, introduced in Refs [6,11,21]. To demonstrate the modeling power of the ESPNet architecture on medical imaging, we study it on two different datasets: (1) a breast biopsy WSI dataset and (2) a brain tumor segmentation MR imaging dataset. Our analysis shows that ESPNet can efficiently learn meaningful representations for different medical imaging data.

Acknowledgment
Research reported in this chapter was supported by Washington State Department of Transportation research grant T1461-47, NSF III (1703166), National Cancer Institute awards (R01 CA172343, R01 CA140560, R01 CA200690, and U01 CA231782), the NSF IGERT Program (award no. 1258485), the Allen Distinguished Investigator Award, a Samsung GRO award, and gifts from Google, Amazon, and Bloomberg. The authors would also like to thank Dr. Jamen Bartlett and Dr. Donald L. Weaver for helpful discussions and feedback. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing endorsements, either expressed or implied, of the National Cancer Institute, the National Institutes of Health, or the US Government.

Figure 6.11 This figure visualizes the segmentation results produced by our network overlaid on each of the four MR modalities in the BraTS dataset. MR, magnetic resonance.

References
[1] C.J. Lynch and C. Liston, New machine-learning technologies for computer-aided diagnosis, Nat. Med. 24, 2018, 1304–1305.
[2] E. Mercan, et al., Assessment of machine learning of breast pathology structures for automated differentiation of breast cancer and high-risk proliferative lesions, JAMA Netw. Open 2, 2019, e198777.


[3] Y. LeCun and Y. Bengio, Convolutional networks for images, speech, and time series, in: The Handbook of Brain Theory and Neural Networks, 1995.
[4] G. Litjens, et al., A survey on deep learning in medical image analysis, Med. Image Anal. 42, 2017, 60–88.
[5] D. Shen, G. Wu and H.-I. Suk, Deep learning in medical image analysis, Annu. Rev. Biomed. Eng., 2017, 221–248.
[6] S. Mehta, et al., Y-Net: joint segmentation and classification for diagnosis of breast biopsy images, International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018, pp. 893–901.
[7] K. Kamnitsas, et al., Ensembles of multiple models and architectures for robust brain tumour segmentation, International MICCAI Brainlesion Workshop, 2017.
[8] O. Ronneberger, P. Fischer and T. Brox, U-Net: convolutional networks for biomedical image segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241.
[9] A. Esteva, et al., Dermatologist-level classification of skin cancer with deep neural networks, Nature 542, 2017, 115–118.
[10] D.C. Cireşan, A. Giusti, L.M. Gambardella and J. Schmidhuber, Mitosis detection in breast cancer histology images with deep neural networks, International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2013.
[11] S. Mehta, et al., ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation, Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[12] S. Mehta, M. Rastegari, L. Shapiro and H. Hajishirzi, ESPNetv2: a light-weight, power efficient, and general purpose convolutional neural network, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[13] N. Nuechterlein and S. Mehta, 3D-ESPNet with pyramidal refinement for volumetric brain tumor image segmentation, International MICCAI Brainlesion Workshop, 2018.
[14] L.-C. Chen, et al., DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell., 2017, 834–848.
[15] F. Yu and V. Koltun, Multi-scale context aggregation by dilated convolutions, International Conference on Learning Representations (ICLR), 2016.
[16] V. Badrinarayanan, A. Kendall and R. Cipolla, SegNet: a deep convolutional encoder–decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 2017, 2481–2495.
[17] K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[18] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (ICLR), 2015.
[19] J.G. Elmore, et al., Diagnostic concordance among pathologists interpreting breast biopsy specimens, JAMA 313, 2015, 1122–1132.
[20] J.G. Elmore, et al., A randomized study comparing digital imaging to traditional glass slide microscopy for breast biopsy and cancer diagnosis, J. Pathol. Inform. 8, 2017, 12.
[21] S. Mehta, et al., Learning to segment breast biopsy whole slide images, IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 663–672.
[22] H. Zhao, et al., Pyramid scene parsing network, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2881–2890.
[23] B.H. Menze, et al., The multimodal brain tumor image segmentation benchmark (BRATS), IEEE Trans. Med. Imaging 34, 2015, 1993–2024.
[24] A. Myronenko, 3D MRI brain tumor segmentation using autoencoder regularization, International MICCAI Brainlesion Workshop, 2018.
[25] B. Li, et al., A facial affect analysis system for autism spectrum disorder, IEEE International Conference on Image Processing (ICIP), 2019.

Footnotes
1. For simplicity, we assume that n is an odd number, such as 3, 5, 7, and so on.
2. Dilated convolutions are sometimes also referred to as atrous convolutions, wherein "trous" means holes in French.


3. Point-wise convolutions are a special case of standard convolutions in which the kernel size n is 1. These convolutions produce an output plane by learning linear combinations between input channels. They are widely used for either channel expansion or channel reduction operations.
4. Gridding artifacts can be removed by adding another convolution layer with no dilation after the concatenation operation in Fig. 6.4 (left); however, this will increase the computational complexity of the block.
