
Automatic lung segmentation in routine imaging is a data diversity problem, not a methodology problem

Johannes Hofmanninger, Florian Prayer, Jeanny Pan, Sebastian Röhrich, Helmut Prosch, Georg Langs

Abstract—Automated segmentation of anatomical structures is a crucial step in many medical image analysis tasks. For lung segmentation, a variety of approaches exist, involving sophisticated pipelines trained and validated on a range of different data sets. However, during translation to clinical routine, the applicability of these approaches across diseases remains limited. Here, we show that the accuracy and reliability of lung segmentation algorithms on demanding cases primarily does not depend on methodology, but on the diversity of the training data. We compare 4 generic deep learning approaches and 2 published lung segmentation algorithms on routine imaging data with more than 6 different disease patterns and 3 published data sets. We show that a basic approach, the U-net, performs either better than, or competitively with, other approaches on both routine data and published data sets, and outperforms published approaches once trained on a diverse data set covering multiple diseases. Training data composition consistently has a bigger impact than algorithm choice on accuracy across test data sets. We carefully analyse the impact of data diversity, and of the specification of annotations, on both training and validation sets to provide a reference for algorithms, training data, and annotation. Results on the seemingly well-understood task of lung segmentation suggest the critical importance of training data diversity compared to model choice.

Index Terms—pathological lung segmentation, semantic segmentation, routine radiology data

I. INTRODUCTION

The translation of machine-learning approaches developed on specific data sets to the variety of routine clinical data is of increasing importance. As methodology matures across different fields, means to render algorithms robust for the transition from the lab to the clinic become critical. Here, we investigate the impact of model choice and training data diversity in the context of segmentation.

The detection and accurate segmentation of organs is a crucial step for a wide range of medical image processing applications in research and computer-aided diagnosis (CAD) [1]. Segmentation enables the quantitative description of anatomical structures with metrics such as volume, location, or statistics of intensity values, and is paramount for the detection and quantification of pathologies. For machine learning, segmentation is highly relevant for discarding confounders outside the relevant organ. At the same time, the segmentation algorithm itself can act as a source of bias. For example, a segmentation algorithm can hamper subsequent analysis by systematically excluding dense areas in the lung field [2].

All authors are with the Department of Biomedical Imaging and Image-guided Therapy, Computational Imaging Research Lab, Medical University of Vienna. (correspondence: [email protected], [email protected])

While lung segmentation is regarded as a mostly solved problem in the methodology community, experiments on routine imaging data demonstrate that algorithms performing well on publicly available data sets do not yield accurate and reliable segmentations in routine chest CTs of patients who suffer from severe disease. Thus, the methodological advance does not translate to applicability in the clinical routine. Here, we show that this is primarily an issue of training data diversity, and not of further methodological advances. Furthermore, annotation of training data in public data sets is often biased by specific target applications, and thus leads to evaluation results that cannot be compared.

Automated lung segmentation algorithms are typically developed and tested on limited data sets, covering a limited spectrum of visual variability by predominantly containing cases without severe pathology [3] or cases with a single class of disease [4]. Such specific cohort datasets are highly relevant in their respective domain, but lead to specialized methods and machine-learning models that struggle to generalize to unseen cohorts when utilized for the task of segmentation. Due to these limitations, in practice, medical image processing studies, especially when dealing with routine data, still rely on semi-automatic segmentations or human inspection of automated organ masks [5], [6]. However, for large-scale data analysis based on thousands of cases, human inspection, or any human interaction with individual data items at all, is not an option.

A. Related Work

A diverse range of lung segmentation techniques has been proposed. They can be categorized into rule-based, atlas-based, machine-learning-based, and hybrid approaches. Rule-based methods rely on thresholding, edge detection, region-growing, morphological operations, and other "classical" image processing techniques [7], [8], [9], [10]. Atlas-based methods rely on non-rigid image registration between the unlabeled image and one or more labeled atlases [11], [12], [13]. Machine-learning-based approaches rely on large datasets to learn active shape models [14], [15], [16], locations of landmarks [17], hand-crafted features, or Convolutional Neural Networks (CNN) for end-to-end learning [18]. Hybrid approaches combine various techniques, such as thresholding and texture classification [19], [20], [21], landmarks and shape models [17], and other combinations [22], [23], [24].

The lung appears as a high-contrast region in X-ray-based imaging modalities, such as CT, so that thresholding and atlas segmentation methods lead to good results in many cases [7], [8], [9].



However, disease-associated lung patterns, such as effusion, atelectasis, consolidation, fibrosis, or pneumonia, lead to dense areas in the lung field that impede such approaches. Multi-atlas registration aims to deal with these high-density abnormalities by incorporating additional atlases, shape models, and other post-processing steps [22]. However, such highly complex pipelines are not reproducible without extensive effort, especially if the source code and the underlying set of atlases are not shared. An additional drawback is that these algorithms are usually optimized for chest CT scans, neglecting scans with a larger or smaller field of view. Furthermore, run-time does not scale well when incorporating additional atlases, as registrations tend to be computationally expensive. With respect to these drawbacks, trained machine-learning models have the advantage that they can be easily shared without giving access to the training data, they are fast at inference time, and they scale well when additional training data are available. Harrison et al. [18] showed that deep-learning-based segmentation outperforms a specialized approach in cases with interstitial lung diseases and provide trained models, inference code, and model specifications for non-commercial use upon request. However, with some exceptions, trained models for lung segmentation are rarely shared publicly, hampering advances in medical imaging research. At the same time, machine-learning methods are limited by the training data available, their number, and the quality of the ground-truth annotations.

Benchmark datasets for training and evaluation are paramount to establish comparability between different methods. However, publicly available datasets with manually annotated organs for development and testing of lung segmentation algorithms are scarce. The VISCERAL Anatomy3 dataset [3] provides a total of 40 segmented lungs (left and right separated) in scans that show the whole body or the whole trunk. The Lung CT Segmentation Challenge 2017 (LCTSC) [4] provides 36 training and 24 test scans with segmented lungs (left and right separated) from cancer patients of three different institutions. The VESsel SEgmentation in the Lung 2012 (VESSEL12) [25] challenge provides 20 lungs with segmentations. In VESSEL12, the left and right lung are not separated and have a single label. The Lung Tissue Research Consortium (LTRC)¹ provides an extensive database of cases with interstitial fibrotic lung disease and chronic obstructive pulmonary disease (COPD). The dataset provides lung masks with a left-right split and the individual lung lobes. The LObe and Lung Analysis 2011 (LOLA11) challenge² published 55 test cases from 55 subjects. These images include normal scans and scans with severe pathologies acquired with different scanners. The ground-truth labels (left and right lung) are known only to the challenge organizers.

In semantic segmentation of natural images, large publicly available datasets have fostered the development and sharing of high-quality segmentation algorithms for downstream applications [26], [27]. In contrast, the aforementioned publicly available datasets that are frequently used for training of lung segmentation models were not created for this purpose and are strongly biased to either inconspicuous cases (Anatomy3) or very specific diseases (LCTSC, VESSEL12, LTRC).

¹ https://ltrcpublic.com
² https://lola11.grand-challenge.org

The creation of the lung masks in these datasets involved automated algorithms with a varying and uncertain degree of human inspection and manual correction. While the Anatomy3 dataset underwent a thorough quality assessment, the organizers of the VESSEL12 challenge provide the lung masks as-is, without guarantees about quality. In fact, the segmentations are merely provided as a courtesy supplement for the task of vessel segmentation. None of these datasets consistently includes high-density areas, such as tumor mass or pleural effusion, as part of the pathological lung. Within the LCTSC dataset, "tumor is excluded in most data" and "collapsed lung may be excluded in some scans" [4]. At the same time, pleural fluids and other consolidations have texture and intensity properties similar to surrounding tissue, rendering their annotation quite tedious. Furthermore, anatomical sub-division is not consistent across datasets [full lung only (VESSEL12), left/right split (Anatomy3, LCTSC), lung lobes (LTRC)]. To the best of our knowledge, there is no publicly available dataset with severe pathologies, such as consolidations, effusions, air-pockets, and tumors, as part of the annotated lung field, nor is there a publicly available segmentation model trained on such a dataset.

B. Contribution

We investigated to what extent the accuracy and reliability of lung segmentation in CT scans, in the presence of severe abnormalities in the routine population, is driven by the diversity of the training data or by methodological advances. Specifically, we addressed the following questions: (1) what is the influence of training data diversity on lung segmentation performance, (2) how do inconsistencies in ground-truth annotations across data contribute to the bias in automatic segmentation or its evaluation in severely diseased cases, and (3) can a generic deep learning algorithm perform competitively with specialized approaches on a wide range of data, once diverse training data are available? To this end, we collected and annotated a diverse data set of 266 CT scans from routine data without restriction on disease or disease pattern. We trained four different segmentation models on different training data (public and routine) and evaluated their accuracy on public data sets, and on a diverse routine data set with more than six different disease patterns. Furthermore, we compared the performance between two publicly available (trained) lung segmentation models and our segmentation model trained on our diverse data set, and submitted results to the LOLA11 challenge for independent evaluation.

II. MATERIALS AND METHODS

Both the training dataset and the architecture used for a machine-learning task affect the performance of the trained model. In order to study these effects in the context of lung segmentation, we trained four generic semantic segmentation models (Sec. II-A) from scratch on three different public training sets (Anatomy3, LCTSC, LTRC) and one training set collected from the clinical routine (Sec. II-C).



Fig. 1. Segmentation results for selected cases from routine data: Each column shows a different case. Row 1 shows a slice without lung masks, row 2 shows the ground truth, and rows 3 to 5 show automatically generated lung masks (U-net(R231), CIP, P-HNN). (a) Effusion, chest tube, and consolidations, (b) small effusions, ground-glass and consolidation, (c) over-inflated (right) and poorly ventilated (left), atelectasis, (d) irregular reticulation and traction bronchiectasis, fibrosis, (e) pneumothorax, and (f) effusions and compression atelectasis (trauma)

We evaluated these models on public test sets (LCTSC, LTRC, and VESSEL12) and routine data, including cases showing severe pathologies. Furthermore, we compared models trained on a diverse routine training set to two published automatic lung segmentation systems, which we did not train, but used as provided.

A. Models

We refrained from developing specialized methodology and instead utilized generic "vanilla" state-of-the-art deep-learning semantic segmentation architectures, without modifications, which we trained from scratch. We considered the following four models:

U-net: Ronneberger et al. proposed the U-net for the segmentation of anatomic structures in microscopy images [28]. Since then, the U-net architecture has been used for a wide range of applications, and various modified versions have been studied [29]. We utilized the U-net as proposed by Ronneberger et al., with the only adaptation being batch normalization [30] after each layer.
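For illustration, a minimal sketch of this adaptation in PyTorch (the framework used in our implementation) is given below; the block structure and names are illustrative and not our exact implementation.

import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions; batch normalization follows each convolution,
    corresponding to the adaptation described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Example: one encoder step of the U-net on a single 256x256 slice
features = DoubleConv(1, 64)(torch.randn(1, 1, 256, 256))  # -> (1, 64, 256, 256)
pooled = nn.MaxPool2d(2)(features)                         # -> (1, 64, 128, 128)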

ResU-net: Residual connections have been proposed to facilitate the learning of deeper networks [31], [32] and are widely used in deep learning architectures. Here, the ResU-net model includes residual connections at every down- and up-sampling block as a second adaptation to the U-net, in addition to batch normalization.
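The corresponding residual block can be sketched by wrapping the DoubleConv block from the previous sketch with an additive skip connection; the 1x1 projection used to match channel counts is an assumption for illustration.

class ResidualDoubleConv(nn.Module):
    """DoubleConv with an additive skip connection around the block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = DoubleConv(in_ch, out_ch)
        # 1x1 convolution so that the skip path matches the output channels
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.conv(x) + self.skip(x)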

Dilated Residual Network (DRN): Yu et al. proposed dilated convolutions for semantic image segmentation [33]. They showed that dilated convolutions can extend the receptive field in higher layers, thereby facilitating context aggregation. Later, they adapted deep residual networks [32] with dilated convolutions (DRN) to perform high-quality semantic segmentation on natural images. Here, we utilized the DRN-D-22 model, as proposed by Yu et al. [34].
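The core operation can be illustrated in a single line, continuing the PyTorch sketches above (the dilation rate is chosen for illustration): a dilated convolution keeps the 3x3 kernel but samples the input with gaps, enlarging the receptive field without pooling.

# A 3x3 convolution with dilation 2 covers a 5x5 neighborhood with 9 weights;
# padding=2 keeps the spatial resolution unchanged.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)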

Deeplab v3+: Chen et al. proposed a series of semantic segmentation algorithms (Deeplab v1, Deeplab v2, and Deeplab v3(+)), with a particular focus on speed.


Deeplab v3 combines dilated convolutions, multi-scale image representations, and fully-connected conditional random fields as a post-processing step. Deeplab v3+ includes an additional decoder module to refine the segmentation. Here, we utilized the 'Deeplab v3+' model as proposed by Chen et al. [34].

B. Implementation Details

We aimed to achieve maximum flexibility with respect to the field of view (from partially visible organ to whole body) and to enable lung segmentation without prior localization of the organ. To this end, we performed segmentation on the slice level (2D). That is, for volumetric scans, each slice was processed individually. We segmented the left and right lung (individually labelled), excluded the trachea, and specifically included high-density anomalies such as tumor and pleural effusions. During training and inference, the images were cropped to the body region using thresholding and morphological operations, and rescaled to a resolution of 256×256 pixels. Prior to processing, Hounsfield units were mapped to the intensity window [−1024, 600] and normalized to the 0-1 range. During training, the images were augmented by random rotation, non-linear deformation, and Gaussian noise. We used stratified mini-batches of size 14 holding 7 slices showing the lung and 7 slices that did not show the lung. For optimization, we used Stochastic Gradient Descent (SGD) with momentum [35].
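The following sketch illustrates the per-slice preprocessing described above; the body-mask threshold and the morphological parameters are assumptions for illustration, as the exact values are not specified here.

import numpy as np
from scipy import ndimage
from skimage.transform import resize

def preprocess_slice(hu_slice):
    """Crop a CT slice (in Hounsfield units) to the body region and rescale to 256x256."""
    # Body mask via thresholding and morphological operations (illustrative threshold)
    body = hu_slice > -500
    body = ndimage.binary_closing(body, iterations=3)
    body = ndimage.binary_fill_holes(body)
    rows, cols = np.where(body)
    if rows.size == 0:
        cropped = hu_slice  # empty slice: keep the full field of view
    else:
        cropped = hu_slice[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    # Map HU to the intensity window [-1024, 600] and normalize to the 0-1 range
    windowed = np.clip(cropped, -1024, 600)
    normalized = (windowed + 1024.0) / (600.0 + 1024.0)
    # Rescale to the 256x256 network input resolution
    return resize(normalized, (256, 256), preserve_range=True)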

C. Routine data collection and sampling

We collected representative training and evaluation datasets from the picture archiving and communication system (PACS) of a university hospital radiology department. We included in- and outpatients who underwent a chest CT examination during a period of 2.5 years, with no restriction on age, sex, or indication. However, we applied minimal inclusion criteria with regard to imaging parameters, such as axial slicing of the reconstruction, a number of slices in a series ≥ 100, and a series description that included one of the terms lung, chest, or thorax. If multiple series of a study fulfilled these criteria, the series with the highest number of slices was used. In total, we collected scans of more than 5300 patients (examined during the 2.5-year period), each represented by a single CT series. From this database, we carefully selected a representative training dataset using three sampling strategies: (1) random sampling of cases (N=57); (2) sampling from image phenotypes [36] (N=71) (the exact methodology for phenotype identification was not in the scope of this work); and (3) manual selection of edge cases with severe pathologies, such as fibrosis (N=28), trauma (N=20), and other pathologies (N=55). In total, we collected 231 cases from routine data for training (hereafter referred to as R231). While the dataset collected from the clinical routine showed a high variability in lung appearance, cases that showed the head or the abdominal area are scarce. To mitigate this bias toward slices that showed the lung, we augmented the number of non-lung slices in R231 by including all slices that did not show the lung from the Anatomy3 dataset.

For testing, we randomly sampled 20 cases from the database that were not part of the training set (hereafter referred to as RRT) and 15 cases with specific anomalies [atelectasis (2), emphysema (2), fibrosis (4), mass (2), pneumothorax (2), and trauma (3)]. Ground-truth labeling was bootstrapped by training a lung segmentation algorithm (U-net) on the Anatomy3 dataset. The preliminary masks were iteratively corrected in an active-learning fashion. Specifically, the model for the intermediate masks was iteratively retrained after 20-30 new manual corrections had been performed with the ITK-Snap software [37]. In total, we performed ground-truth annotation on 266 chest CT scans from 266 individual patients.

D. Evaluation metrics

Automatic segmentations were compared to the ground truth for all test datasets using the following evaluation metrics, as implemented by the DeepMind surface-distance Python module³. While segmentation was performed on 2D slices, evaluation was performed on the 3D volumes. If not reported differently, the metrics were calculated for the right and left lung separately and then averaged.

1) Dice coefficient (DSC): The Dice coefficient or Dice score is a measure of overlap:

D(X, Y) = \frac{2 |X \cap Y|}{|X| + |Y|}    (1)

where X and Y are two alternative labelings, such as predicted and ground-truth lung masks.

2) Robust Hausdorff distance (HD95): The directed Hausdorff distance is the maximum distance over all distances from points in surface X_s to their closest point in surface Y_s; in its robust form, the maximum is replaced by the 95th percentile. In mathematical terms, the directed robust Hausdorff distance is given as:

\vec{H}(X_s, Y_s) = \underset{x \in X_s}{P_{95}} \left( \min_{y \in Y_s} d(x, y) \right)    (2)

where P_{95} denotes the 95th percentile of the distances. Here, we used the symmetric adaptation:

H(X_s, Y_s) = \max\left( \vec{H}(X_s, Y_s), \vec{H}(Y_s, X_s) \right)    (3)

3) Mean surface distance (MSD): The mean surface distance is the average distance of all points in surface X_s to their closest corresponding point in surface Y_s:

\overrightarrow{MSD}(X_s, Y_s) = \frac{1}{|X_s|} \sum_{x \in X_s} \min_{y \in Y_s} d(x, y)    (4)

Here, we used the symmetric adaptation:

MSD(X_s, Y_s) = \max\left( \overrightarrow{MSD}(X_s, Y_s), \overrightarrow{MSD}(Y_s, X_s) \right)    (5)

³ https://github.com/deepmind/surface-distance
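As an illustration, all three metrics can be computed with this module as follows (a sketch with synthetic masks; the symmetric MSD of Eq. (5) is obtained by taking the maximum of the two directed averages returned by the module):

import numpy as np
import surface_distance

mask_gt = np.zeros((64, 256, 256), dtype=bool)    # ground-truth lung mask
mask_pred = np.zeros((64, 256, 256), dtype=bool)  # predicted lung mask
mask_gt[20:40, 60:200, 40:220] = True
mask_pred[21:41, 62:198, 42:218] = True
spacing_mm = (1.0, 0.8, 0.8)                      # voxel spacing in mm (z, y, x)

dsc = surface_distance.compute_dice_coefficient(mask_gt, mask_pred)  # Eq. (1)

distances = surface_distance.compute_surface_distances(mask_gt, mask_pred, spacing_mm)
hd95 = surface_distance.compute_robust_hausdorff(distances, 95)      # Eqs. (2)-(3)

gt_to_pred, pred_to_gt = surface_distance.compute_average_surface_distance(distances)
msd = max(gt_to_pred, pred_to_gt)                                    # Eqs. (4)-(5)

print(f"DSC={dsc:.3f}  HD95={hd95:.2f} mm  MSD={msd:.2f} mm")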


Abbr.     Name               #Vol.  #SlicesL  #Slices
RR36      Routine Random     36     3393      3393
VISC36    VISCERAL           36     3393      3393
LTRC36    LTRC               36     3393      3393
LCTSC36   LCTSC              36     3393      3393
R231      Routine 231 Cases  231    62224     108248

TABLE I. Datasets used to train the semantic segmentation models: the number of volumes (#Vol.), the number of slices that showed the lung (#SlicesL), and the total number of slices (#Slices). VISCERAL, LTRC, and LCTSC are public datasets. Routine Random and Routine 231 are images from the routine database of a radiology department.

E. Experiments

Study the impact of training data variability: We determined the influence of training data variability (especially public datasets vs. routine data) on the generalizability to other public test datasets and, specifically, to cases with a variety of pathologies. To this end, we trained the four different models on an equal number of patients (N=36) and slices (N=3393) from each training dataset [Routine Random (RR36), VISCERAL Anatomy3 (VISC36), LTRC (LTRC36), and LCTSC (LCTSC36)] individually. The RR36 dataset is a subset of the 57 randomly sampled cases that we collected for training and, therefore, is also a subset of the full routine dataset R231. Table I gives an overview of the training datasets created. The number of volumes and slices was limited to match the smallest dataset (LCTSC), with 36 volumes and 3393 slices. During this experiment, we considered only slices that showed the lung (during training and testing) to prevent a bias induced by the field of view. For example, images in VISCERAL Anatomy3 showed either the whole body or the trunk, including the abdomen, while other datasets, such as LTRC, LCTSC, or VESSEL12, contained only images limited to the chest. Also, trauma patients are scanned with a larger field of view compared to patients in whom only the lung is examined.

Comparison between generic models trained on a diverse dataset and publicly available lung segmentation systems: We compared the models trained on our dataset (R231) to publicly available lung segmentation algorithms on the full volumes. As a post-processing step, we removed connected components smaller than 25 cm³ from the segmentations. Note that the reference methods perform comparable post-processing. We compared our trained models to the reference systems provided by the Chest Imaging Platform (CIP)⁴ and the Progressive Holistically Nested Networks (P-HNN) model provided by Harrison et al. [18]. The CIP algorithm features a threshold-based approach, while P-HNN is a CNN architecture specifically developed for lung segmentation that was trained on cases from the LTRC dataset and other cases with interstitial lung diseases. The CIP algorithm was shown to be sensitive to image noise. Thus, if the CIP algorithm failed, we pre-processed the volumes with a Gaussian filter kernel. If the algorithm still failed, the case was excluded from the comparison.

⁴ https://chestimagingplatform.org
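A sketch of the connected-component post-processing step described above (function and parameter names are illustrative):

import numpy as np
from scipy import ndimage

def remove_small_components(mask, spacing_mm, min_volume_cm3=25.0):
    """Remove connected components smaller than min_volume_cm3 from a binary mask."""
    voxel_cm3 = np.prod(spacing_mm) / 1000.0  # mm^3 per voxel -> cm^3
    labeled, n_components = ndimage.label(mask)
    if n_components == 0:
        return mask
    # Number of voxels in each component (component labels start at 1)
    sizes = ndimage.sum(mask, labeled, index=range(1, n_components + 1))
    keep = [i + 1 for i, size in enumerate(sizes) if size * voxel_cm3 >= min_volume_cm3]
    return np.isin(labeled, keep)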

Abbr.     Description                    #Vol.  #SlicesL  #Slices
RRT       Routine Random Test            20     5788      7969
LTRC      LTRC                           105    44784     51211
LCTSC     LCTSC                          24     2063      3675
VESS12    VESSEL12                       20     7251      8593
Ath.      Atelectasis                    2      395       534
Emph.     Emphysema                      2      688       765
Fibr.     Severe fibrosis                4      1192      1470
Mass*     Mass                           2      220       273
PnTh      Pneumothorax                   2      814       937
Trauma    Trauma/Effusions               3      911       2225
Normal**  Normal (large field of view)   7      1180      5301
Total                                    191    65286     82953

TABLE II. Test datasets used to evaluate the performance of lung segmentation algorithms. *Two cases from the publicly available Lung1 dataset. **Includes four cases from the publicly available VISCERAL Anatomy3 dataset.

The trained P-HNN model does not distinguish between the left and right lung. Thus, evaluation metrics were computed on the full lung for masks created by P-HNN. In addition to the evaluation on publicly available datasets and methods, we performed an independent evaluation of our lung segmentation model by submitting solutions to the LOLA11 challenge, for which 55 CT scans are published but ground-truth masks are available only to the challenge organizers.

Quantitative assessment of the models' ability to cover tumor areas: Studies on lung segmentation usually use overlap and surface metrics to assess the automatically generated lung mask against the ground truth. However, segmentation metrics on the full lung can only marginally quantify the capability of a method to cover pathological areas in the lung, as pathologies may be relatively small compared to the lung volume. Carcinomas are an example of high-density areas that are at risk of being excluded by threshold- or registration-based methods when they are close to the lung border. We utilized the publicly available, previously published Lung1 dataset [38] to quantify the models' ability to cover tumor areas within the lung. The collection contains scans of 318 non-small-cell lung cancer patients before treatment, with a manual delineation of the tumors. However, no lung masks were available. In this experiment, we evaluated the proportion of the tumor volume covered by the lung mask:

TO(T, X) = \frac{|T \cap X|}{|T|}    (6)

where T is the annotated tumor volume and X the lung mask.
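With boolean masks, Eq. (6) reduces to a few lines (a sketch):

import numpy as np

def tumor_overlap(tumor_mask, lung_mask):
    """Fraction of the annotated tumor volume covered by the lung mask, Eq. (6)."""
    return np.logical_and(tumor_mask, lung_mask).sum() / tumor_mask.sum()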

III. RESULTS

Table II gives an overview of the test datasets used. LTRC, LCTSC, and VESS12 are publicly available. RRT is a set of 20 randomly sampled lung CT scans from the routine database. The category mass held two cases from the Lung1 dataset, for which ground-truth masks were created by us, and the category normal held four cases with a large field of view from the VISCERAL Anatomy3 dataset and two cases from the routine database. In total, we collected 191 test cases not used for training.



Fig. 2. Ground-truth annotations in public datasets lack coverage of pathologic areas: Segmentation results for cases in public datasets (VESS12, LCTSC, LTRC) where the masks generated by our U-net(R231) yielded low Dice scores when compared to the ground truth. Note that public datasets often do not include high-density areas in the segmentations. Tumors in the lung area should be included in the segmentation, while the liver should not.

A. Models trained on routine data outperformed models trained on publicly available study data

U-net, ResU-net, and Deeplab v3+ models, when trained on routine data (RR36 models), yielded the best evaluation scores over all test cases. The largest differences in performance were observed on routine test data (RRT), where a U-net trained on VISCERAL data (VISC36) yielded an average DSC of 0.84 and, when trained on RR36, of 0.92. This advantage of routine data for training is also reflected in the overall results and other evaluation metrics, with the U-net yielding DSC, HD95, and MSD scores of 0.96±0.08, 9.19±18.15, and 1.43±2.26 when trained on RR36, and 0.92±0.14, 13.04±19.04, and 2.05±3.08 when trained on VISC36. Table III lists the evaluation results in detail. We report the averaged DSC for the individual test sets and for all test cases combined (All(L), N=191). In addition, we report all test cases combined without the LTRC and LCTSC data considered (All, N=62). The rationale is that the LTRC test dataset contains 105 volumes and dominates the averaged scores, and the LCTSC dataset contains multiple cases with tumors and effusions that are not included in the ground-truth masks.

Thus, an automated segmentation that includes these areas yields a lower score, distorting and misrepresenting the averaged results. Models trained on routine data yielded the highest DSC in all subcategories apart from LCTSC and VESS12. In fact, the models trained on RR36 yielded the lowest performance in terms of DSC on the LCTSC test dataset. However, the lower scores for these models on LCTSC and VESS12 can be attributed to the lack of very dense pathologies in the ground-truth masks, as mentioned above (see qualitative results in Fig. 2).

B. Various deep-learning-based semantic segmentation architectures perform comparably for lung segmentation

We determined that the influence of model architecture is marginal compared to the influence of the training data. Considering the models trained on the datasets comprised of 36 scans (RR36, LTRC36, LCTSC36, and VISC36), we observed that the different architectures perform comparably, with the DSC not varying by more than 0.02 when the same combination of training and test set was applied (Table III).


Test datasets (DSC) - Lung slices only

Architecture  Trainingset  LTRC  LCTSC  VESS12  RRT   Ath.  Emph.  Fibr.  Mass  PnTh  Trau  Norm  DSC All(L)*  DSC All    All HD95(mm)  All MSD(mm)

U-net         RR36         0.99  0.93   0.98    0.92  0.95  0.99   0.96   0.98  0.99  0.93  0.97  0.97±0.05    0.96±0.08  9.19±18.15    1.43±2.26
U-net         LTRC36       0.99  0.96   0.99    0.86  0.93  0.99   0.95   0.98  0.98  0.90  0.97  0.97±0.08    0.94±0.13  11.9±22.9     2.42±5.99
U-net         LCTSC36      0.98  0.97   0.98    0.85  0.91  0.98   0.92   0.98  0.98  0.89  0.97  0.96±0.09    0.92±0.14  10.96±14.85   1.96±2.87
U-net         VISC36       0.98  0.95   0.98    0.84  0.91  0.98   0.90   0.98  0.98  0.89  0.97  0.96±0.09    0.92±0.15  13.04±19.04   2.05±3.08

ResU-net      RR36         0.99  0.93   0.98    0.91  0.95  0.99   0.96   0.98  0.98  0.93  0.97  0.97±0.06    0.95±0.09  8.66±15.06    1.5±2.34
ResU-net      LTRC36       0.99  0.96   0.99    0.86  0.94  0.99   0.95   0.98  0.98  0.89  0.97  0.97±0.08    0.94±0.13  11.58±21.16   2.48±6.24
ResU-net      LCTSC36      0.98  0.97   0.98    0.85  0.92  0.98   0.95   0.97  0.98  0.89  0.97  0.96±0.09    0.93±0.14  12.15±19.42   2.36±4.68
ResU-net      VISC36       0.97  0.96   0.98    0.84  0.91  0.98   0.89   0.98  0.98  0.89  0.97  0.95±0.09    0.92±0.15  9.41±15.0     1.83±2.92

DRN           RR36         0.98  0.93   0.97    0.88  0.94  0.98   0.95   0.97  0.98  0.92  0.96  0.96±0.07    0.94±0.12  8.96±17.67    1.96±3.97
DRN           LTRC36       0.98  0.95   0.98    0.85  0.93  0.98   0.94   0.98  0.98  0.89  0.97  0.96±0.08    0.93±0.14  10.94±20.93   2.66±6.66
DRN           LCTSC36      0.97  0.96   0.97    0.83  0.90  0.98   0.90   0.97  0.97  0.89  0.96  0.95±0.09    0.91±0.15  8.98±13.3     1.92±2.73
DRN           VISC36       0.96  0.95   0.97    0.83  0.90  0.97   0.92   0.97  0.97  0.87  0.97  0.94±0.1     0.91±0.15  8.96±13.62    1.92±2.83

Deeplab v3+   RR36         0.98  0.92   0.98    0.90  0.93  0.99   0.95   0.98  0.98  0.92  0.97  0.96±0.06    0.95±0.09  8.99±14.32    1.71±2.68
Deeplab v3+   LTRC36       0.99  0.94   0.99    0.85  0.93  0.98   0.94   0.98  0.98  0.89  0.97  0.96±0.09    0.93±0.14  11.9±21.8     2.51±6.07
Deeplab v3+   LCTSC36      0.98  0.96   0.98    0.85  0.92  0.98   0.93   0.98  0.98  0.89  0.96  0.96±0.08    0.93±0.14  10.47±19.14   2.21±4.67
Deeplab v3+   VISC36       0.98  0.96   0.98    0.85  0.93  0.98   0.95   0.98  0.98  0.89  0.97  0.96±0.08    0.93±0.14  10.16±21.21   2.15±4.99

TABLE III. Evaluation results after training the semantic segmentation architectures on different training sets. Columns LTRC to Norm report the DSC per test set (lung slices only); the last four columns report DSC±STD, HD95(mm)±STD, and MSD(mm)±STD over all test cases. The sets RR36, LTRC36, LCTSC36, and VISC36 contained the same number of volumes and slices. The best evaluation scores for models trained on these datasets are marked in bold (highest for DSC and lowest for HD95 and MSD). Although the different architectures performed comparably, training on routine data outperformed training on public cohort datasets. *The LCTSC ground-truth masks do not include high-density areas and the high number of LTRC test cases dominates the averaged results. Thus, "All(L)" (N=167) is the average over all cases including LCTSC and LTRC, while "All" (N=62) does not include the LCTSC or the LTRC cases.

Test datasets (DSC) - Full volumes

Architecture     LTRC    LCTSC  VESS12  RRT    Ath.  Emph.  Fibr.  Mass  PnTh  Trau  Norm  DSC All(L)*  DSC All    All HD95(mm)  All MSD(mm)

unet(R231)       0.99    0.94   0.98    0.97   0.97  0.99   0.97   0.98  0.99  0.97  0.97  0.98±0.03    0.98±0.03  3.14±7.4      0.62±0.93
resunet(R231)    0.99    0.94   0.98    0.97   0.97  0.99   0.97   0.98  0.99  0.97  0.97  0.98±0.03    0.98±0.03  3.19±7.35     0.64±0.88
drn(R231)        0.98    0.94   0.98    0.95   0.96  0.99   0.97   0.98  0.98  0.96  0.97  0.97±0.04    0.97±0.06  6.22±18.95    1.1±2.54
deeplab(R231)    0.99    0.94   0.98    0.97   0.97  0.99   0.97   0.98  0.99  0.97  0.97  0.98±0.03    0.98±0.03  3.28±7.52     0.65±0.91
P-HNN            0.98    0.94   0.99    0.88   0.95  0.98   0.95   0.98  0.96  0.88  0.97  0.96±0.09    0.94±0.12  16.8±36.57    2.59±5.96

unet(R231)**     0.99    0.95   0.99    0.99   0.97  0.99   0.97   0.98  0.99  0.97  0.98  0.98±0.01    0.98±0.01  1.44±1.09     0.35±0.19
CIP**            0.99    0.94   0.99    0.96   0.90  0.99   0.92   0.98  0.99  0.86  0.99  0.98±0.03    0.96±0.05  4.65±6.45     0.91±1.09
CIP #cases**     96/105  19/24  17/20   13/20  2/2   2/2    4/4    2/2   2/2   1/3   1/7

TABLE IV. Evaluation results for models trained on the full R231 dataset and evaluated on the full volumes, together with a comparison to the segmentation algorithm of the Chest Imaging Platform (CIP) and the trained P-HNN model. The results are expressed as mean and mean ± standard deviation for the Dice score (DSC), robust Hausdorff distance (HD95), and mean surface distance (MSD). *The LCTSC ground-truth masks do not include high-density diseases and the high number of LTRC test cases dominates the averaged results. Thus, "All(L)" (N=167) is the average over all cases including LCTSC and LTRC, while "All" (N=62) does not include the LCTSC and LTRC cases. **For these rows, only cases on which the CIP algorithm did not fail and where the DSC was larger than 0 were considered (#cases).

The same conclusion holds when the models were trained on the full dataset R231 and evaluated on the full volumes, with the unet(R231), resunet(R231), and deeplab(R231) achieving an identical DSC of 0.98±0.03 over all test cases. The drn(R231) model yielded a slightly lower DSC over all cases of 0.97±0.06. Detailed results are listed in Table IV.

C. All training sets generalized well to cases without severe pathologies

The results show that moderately sized training sets generalize well to the test cases of the same dataset, but also to different datasets without severe pathologies. For example, training the U-net on 36 cases of the LTRC dataset yielded a DSC of 0.99 on the LTRC test set of 105 cases, while still generalizing well to the VESS12 dataset (DSC of 0.99) and our selected routine cases with emphysema (DSC of 0.99), mass (DSC of 0.98), or pneumothorax (DSC of 0.98).

In general, we observed that, independent of training set and architecture, cases without severe pathologies were accurately segmented by the models. Specifically, test cases in the public LTRC and VESS12 datasets received an averaged DSC of at least 0.96 (up to 0.99), depending on architecture and training data. The same conclusion can be drawn for the emphysema, mass, pneumothorax, and normal test cases, for which the results vary only little (∆DSC ≤ 0.01) between the different models. Detailed results are listed in Table III.

D. Generic models trained on routine data outperformed specialized publicly available systems

Compared to P-HNN, the U-net(R231) yielded an average DSC of 0.98±0.03 vs. 0.96±0.09 over all 191 test cases. The averaged results without the LTRC and LCTSC datasets yielded DSC, HD95, and MSD scores of 0.98±0.03, 3.14±7.4, and 0.62±0.93 vs. 0.94±0.12, 16.8±36.57, and 2.59±5.96. Detailed results are given in Table IV. Note that the P-HNN results were calculated on the full lung, compared to the left and right lung for the U-net, giving P-HNN an advantage in achieving better scores.



Fig. 3. Qualitative results of automatically generated lung masks for tumor cases. Yellow: tumor area covered by the lung mask. Red: tumor area not covered by the lung mask. (a) Lung masks generated by our U-net(R231). (b) Lung masks generated by P-HNN. (c) Corrupted tumor segmentations in the Lung1 dataset. (d) Cases with poor tumor overlap of the lung masks generated by U-net(R231).

For the comparison with the CIP algorithm, only volumes for which the algorithm did not fail were considered. The CIP algorithm tends to fail on challenging cases with dense pathologies and on volumes with a large field of view. In total, the CIP algorithm was able to process 160 of the 191 test volumes. Both algorithms yielded comparable DSC when averaged over all test cases. Without LTRC and LCTSC considered, the algorithms yielded average DSC, HD95, and MSD scores of 0.98±0.01, 1.44±1.09, and 0.35±0.19 for the U-net(R231), compared to 0.96±0.05, 4.65±6.45, and 0.91±1.09 for CIP. Figure 1 shows qualitative results for cases from the routine test sets. In addition to the publicly available datasets and our routine data, we created segmentations for the 55 cases of the LOLA11 challenge with the U-net(R231) model.

Fig. 4. U-net trained on routine data covered more tumor area compared to reference methods. Box- and swarm-plots showing the percentage of tumor volume (318 cases) covered by the lung masks generated by different methods.

While masks for evaluation are only available to the challenge organizers, prior research and earlier submissions suggest inconsistencies in the creation of the ground truth, especially with respect to pleural effusions [24]. We specifically included effusions in our lung masks of the R231 training dataset. To account for this discrepancy and improve comparability, we submitted two solutions: first, the masks as yielded by the U-net(R231), and alternatively, masks with dense areas subsequently removed. The automatic exclusion of dense areas was performed by simple thresholding of values between −50 and 70 HU and morphological operations. The unaltered masks yielded an average overlap score of 0.968; with dense areas removed, the score was 0.977, which was the second-highest among all competitors at the time of submission. In comparison, the first-ranked method [22] achieved a score of 0.980 and a human reference segmentation achieved 0.984⁵.
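A sketch of this exclusion step is given below; the specific morphological operations are an assumption, as only thresholding and morphological operations are stated above.

import numpy as np
from scipy import ndimage

def remove_dense_areas(lung_mask, hu_volume):
    """Exclude soft-tissue-dense voxels (e.g., effusions) from a binary lung mask."""
    dense = (hu_volume > -50) & (hu_volume < 70) & lung_mask
    dense = ndimage.binary_opening(dense, iterations=2)  # suppress speckle
    dense = ndimage.binary_closing(dense, iterations=2)  # fill small gaps
    return lung_mask & ~dense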

E. A generic model trained on routine data covers more tumor area than specialized publicly available systems

Table V and Figure 4 show the results for the average tumor overlap on the 318 volumes of the Lung1 dataset. U-net(R231) covered more tumor volume (mean/median of 60%/69%) compared to P-HNN (50%/44%) and CIP (34%/13%). Figures 3a and 3b show qualitative results for tumor cases for U-net(R231) and P-HNN. We found that 23 cases of the Lung1 dataset had corrupted ground-truth annotations of the tumors (Figure 3c). Figure 3d shows cases with little or no tumor overlap achieved by U-net(R231).

F. Runtime

Batch-wise inference with the U-net model requires 12.5 ms per slice with a batch size of 20 (~4.4 seconds for a volume of 350 slices) on a GeForce GTX 1080Ti GPU with Python 3.6, PyTorch 1.2, and NumPy 1.17.2, and an Intel(R) Xeon(R) W-2125 CPU. Inference on the CPU takes 0.8 s per slice, or 281 seconds for a volume of 350 slices. In addition, creating the body mask and cropping the slices takes ~22 ms per slice, or 7.7 s for a whole volume of 350 slices.

⁵ https://lola11.grand-challenge.org/evaluation/results/


Tumor overlap
Method        Mean  Median  <5%  >95%
CIP           34%   13%     113  56
P-HNN         50%   44%     48   78
U-net(RR36)   53%   54%     46   79
U-net(R231)   60%   69%     37   90

TABLE V. Overlap between lung masks and manually annotated tumor volumes in the Lung1 dataset. Given are the mean, the median, and the number of cases with less than 5% and with more than 95% overlap.

IV. DISCUSSION

Lung segmentation is a pivotal pre-processing step for many image analysis tasks, such as classification, detection, and quantification of lung pathologies. Despite its fundamental importance, publicly available algorithms for pathological lung segmentation are scarce, which impedes research on lung pathologies with medical imaging. This is not only attributable to authors being reluctant to make their implementations publicly available, but also to the fact that many algorithms are complex and hard to reproduce. For example, the method with the highest score in the LOLA11 challenge consists of a sophisticated pipeline with multiple processing steps and relies on a private database of reference atlases [22], rendering it impossible to reproduce the results without access to the data. In contrast to complex processing pipelines, trained machine-learning models are easy to share and use. However, we showed that public datasets do not hold sufficient variability to generalize to a wide spectrum of routine pathologies. This situation is aggravated by the fact that many publicly available datasets do not include dense pathologies, such as tumors or effusions, in the lung mask.

The inclusion or exclusion of lung pathologies such as effusions in lung segmentations is a matter of definition and application. While pleural effusions (and pneumothorax) are technically outside the lung, they are assessed as part of lung assessment and have a substantial impact on lung parenchyma appearance through compression artefacts. Neglecting such abnormalities would hamper automated lung assessment, as they are closely linked to lung function. In addition, lung masks that include pleural effusions greatly alleviate the task of effusion detection and quantification, thus making it possible to remove effusions from the lung segmentation as a post-processing step.

A large number of segmentation methods are proposed every year, often based on architectural modifications [29] of established models. Isensee et al. showed that such modified design concepts do not improve, and occasionally even worsen, the performance of a well-designed baseline [29]. They achieved state-of-the-art performance in multiple publicly available segmentation challenges relying only on U-nets. This corresponds to our finding that architectural choice had a subordinate effect on performance and that, given a diverse and large set of training data, a standard semantic segmentation architecture (U-net) can generate high-quality lung segmentations.

Volume-based segmentation metrics are valid tools with which to assess the segmentation quality at the organ level. However, they are a superficial means by which to assess the quality of pathological lung segmentation. Area-based metrics, such as the Dice score, but also surface-based metrics such as HD95 and MSD, tend to worsen only marginally in the presence of a missed abnormality if the abnormality is small compared to the whole organ. We showed that datasets without lung masks, but with annotated pathologies, such as Lung1, can be a means with which to quantify algorithms with respect to their ability to cover abnormal lung areas. We showed that the U-net trained on routine data covered more tumor area than the reference methods.

V. CONCLUSION

We showed that accurate and fast lung segmentation does not require complex methodology and that a proven deep-learning-based segmentation architecture can outperform state-of-the-art methods once diverse (but not necessarily larger) training data are available. By comparing various datasets for training of the models, we illustrated the importance of diverse training data and showed that data from the clinical routine generalize well to unseen cohorts, achieving the second-highest score among all submissions to the LOLA11 challenge. Given these findings, we draw the following conclusions: (1) translating machine-learning approaches from the lab to routine data can require the collection of diverse training data rather than methodological modifications; (2) current, publicly available study datasets do not meet these diversity requirements; and (3) generic semantic segmentation algorithms are adequate for the task of lung segmentation. A reliable, universal tool for lung segmentation is fundamentally important to foster research on severe lung diseases and to study routine clinical datasets. Thus, the trained model and inference code are made publicly available under the GPL-3 license, to serve as an open science tool for research and development and as a publicly available baseline for lung segmentation, at https://github.com/JoHof/lungmask.
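For reference, applying the published model could look as follows (a hedged sketch; the entry points follow the repository's documented Python interface and may have changed since publication):

import SimpleITK as sitk
from lungmask import mask  # assumed package and module names; see the repository

input_image = sitk.ReadImage("chest_ct.nii.gz")  # any ITK-readable CT volume
segmentation = mask.apply(input_image)           # numpy array with one label per lung
                                                 # (see the repository for the label convention)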

ACKNOWLEDGMENT

This research was supported by the Austrian Science Fund FWF (I2714-B31), Siemens Healthineers Digital Health (https://www.siemens-healthineers.com/digital-health-solutions), and an unrestricted grant from Boehringer Ingelheim.

REFERENCES

[1] A. Mansoor, U. Bagci, B. Foster, Z. Xu, G. Z. Papadakis, L. R. Folio, J. K. Udupa, and D. J. Mollura, "Segmentation and Image Analysis of Abnormal Lungs at CT: Current Approaches, Challenges, and Future Trends," RadioGraphics, vol. 35, no. 4, pp. 1056-1076, 7 2015.

[2] S. G. Armato and W. F. Sensakovic, "Automated lung segmentation for thoracic CT: Impact on computer-aided diagnosis," Academic Radiology, vol. 11, no. 9, pp. 1011-1021, 9 2004.

[3] O. Goksel, O. A. Jimenez-del Toro, A. Foncubierta-Rodríguez, and H. Müller, "Overview of the VISCERAL Challenge at ISBI," in Proceedings of the VISCERAL Challenge at ISBI, New York, NY, 5 2015.

[4] J. Yang, H. Veeraraghavan, S. G. Armato, K. Farahani, J. S. Kirby, J. Kalpathy-Kramer, W. van Elmpt, A. Dekker, X. Han, X. Feng, P. Aljabar, B. Oliveira, B. van der Heyden, L. Zamdborg, D. Lam, M. Gooding, and G. C. Sharp, "Autosegmentation for thoracic radiation treatment planning: A grand challenge at AAPM 2017," Medical Physics, vol. 45, no. 10, pp. 4568-4581, 10 2018.


[5] L. Oakden-Rayner, G. Carneiro, T. Bessen, J. C. Nascimento, A. P. Bradley, and L. J. Palmer, "Precision Radiology: Predicting longevity using feature engineering and deep learning methods in a radiomics framework," Scientific Reports, vol. 7, no. 1, p. 1648, 2017.

[6] J. M. Stein, L. L. Walkup, A. S. Brody, R. J. Fleck, and J. C. Woods, "Quantitative CT characterization of pediatric lung development using routine clinical imaging," Pediatric Radiology, vol. 46, no. 13, pp. 1804-1812, 12 2016.

[7] P. Korfiatis, S. Skiadopoulos, P. Sakellaropoulos, C. Kalogeropoulou, and L. Costaridou, "Combining 2D wavelet edge highlighting and 3D thresholding for lung segmentation in thin-slice CT," The British Journal of Radiology, vol. 80, no. 960, pp. 996-1004, 12 2007.

[8] S. Hu, E. Hoffman, and J. Reinhardt, "Automatic lung segmentation for accurate quantitation of volumetric X-ray CT images," IEEE Transactions on Medical Imaging, vol. 20, no. 6, pp. 490-498, 6 2001.

[9] H. Chen and A. Butler, "Automatic Lung Segmentation in HRCT Images," in International Conference on Image and Vision Computing, 2011, pp. 293-298.

[10] A. R. Pulagam, G. B. Kande, V. K. R. Ede, and R. B. Inampudi, "Automated Lung Segmentation from HRCT Scans with Diffuse Parenchymal Lung Diseases," Journal of Digital Imaging, vol. 29, no. 4, pp. 507-519, 8 2016.

[11] I. Sluimer, M. Prokop, and B. van Ginneken, "Toward automated segmentation of the pathological lung in CT," IEEE Transactions on Medical Imaging, vol. 24, no. 8, pp. 1025-1038, 8 2005.

[12] J. E. Iglesias and M. R. Sabuncu, "Multi-atlas segmentation of biomedical images: A survey," Medical Image Analysis, vol. 24, no. 1, pp. 205-219, 8 2015.

[13] Z. Li, E. A. Hoffman, and J. M. Reinhardt, "Atlas-driven lung lobe segmentation in volumetric X-ray CT images," IEEE Transactions on Medical Imaging, vol. 25, no. 1, pp. 1-16, 2006.

[14] S. Sun, C. Bauer, and R. Beichel, "Automated 3-D Segmentation of Lungs With Lung Cancer in CT Data Using a Novel Robust Active Shape Model Approach," IEEE Transactions on Medical Imaging, vol. 31, no. 2, pp. 449-460, 2 2012.

[15] S. Agarwala, D. Nandi, A. Kumar, A. K. Dhara, S. B. T. A. Sadhu, and A. K. Bhadra, "Automated segmentation of lung field in HRCT images using active shape model," in IEEE Region 10 Annual International Conference (TENCON). IEEE, 11 2017, pp. 2516-2520.

[16] G. Chen, D. Xiang, B. Zhang, H. Tian, X. Yang, F. Shi, W. Zhu, B. Tian, and X. Chen, "Automatic Pathological Lung Segmentation in Low-Dose CT Image Using Eigenspace Sparse Shape Composition," IEEE Transactions on Medical Imaging, vol. 38, no. 7, pp. 1736-1749, 7 2019.

[17] M. Sofka, J. Wetzl, N. Birkbeck, J. Zhang, T. Kohlberger, J. Kaftan, J. Declerck, and S. K. Zhou, "Multi-stage learning for robust lung segmentation in challenging CT volumes," in International Conference on Medical Image Computing and Computer-Assisted Intervention, vol. 6893 LNCS, part 3. Springer, Berlin, Heidelberg, 2011, pp. 667-674.

[18] A. P. Harrison, Z. Xu, K. George, L. Lu, R. M. Summers, and D. J. Mollura, "Progressive and multi-path holistically nested neural networks for pathological lung segmentation from CT images," in International Conference on Medical Image Computing and Computer-Assisted Intervention, vol. 10435 LNCS. Springer, Cham, 9 2017, pp. 621-629.

[19] P. Korfiatis, C. Kalogeropoulou, A. Karahaliou, A. Kazantzi, S. Skiadopoulos, and L. Costaridou, "Texture classification-based segmentation of lung affected by interstitial pneumonia in high-resolution CT," Medical Physics, vol. 35, no. 12, pp. 5290-5302, 11 2008.

[20] J. Wang, F. Li, and Q. Li, "Automated segmentation of lungs with severe interstitial lung disease in CT," Medical Physics, vol. 36, no. 10, pp. 4592-4599, 9 2009.

[21] T. T. Kockelkorn, E. M. van Rikxoort, J. C. Grutters, and B. van Ginneken, "Interactive lung segmentation in CT scans with severe abnormalities," in IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE, 2010, pp. 564-567.

[22] A. Soliman, F. Khalifa, A. Elnakib, M. Abou El-Ghar, N. Dunlap, B. Wang, G. Gimel'farb, R. Keynton, and A. El-Baz, "Accurate Lungs Segmentation on CT Chest Images by Adaptive Appearance-Guided Shape Modeling," IEEE Transactions on Medical Imaging, vol. 36, no. 1, pp. 263-276, 1 2017.

[23] E. M. van Rikxoort, B. de Hoop, M. A. Viergever, M. Prokop, and B. van Ginneken, "Automatic lung segmentation from thoracic computed tomography scans using a hybrid approach with error detection," Medical Physics, vol. 36, no. 7, pp. 2934-2947, 6 2009.

[24] A. Mansoor, U. Bagci, Z. Xu, B. Foster, K. N. Olivier, J. M. Elinoff, A. F. Suffredini, J. K. Udupa, and D. J. Mollura, "A Generic Approach to Pathological Lung Segmentation," IEEE Transactions on Medical Imaging, vol. 33, no. 12, p. 2293, 12 2014.

[25] R. D. Rudyanto, S. Kerkstra, E. M. van Rikxoort, C. Fetita, P.-Y. Brillet, C. Lefevre, W. Xue, X. Zhu, J. Liang, I. Oksuz, D. Unay, K. Kadipasaoglu, R. S. J. Estepar, J. C. Ross, G. R. Washko, J.-C. Prieto, M. H. Hoyos, M. Orkisz, H. Meine, M. Hullebrand, C. Stocker, F. L. Mir, V. Naranjo, E. Villanueva, M. Staring, C. Xiao, B. C. Stoel, A. Fabijanska, E. Smistad, A. C. Elster, F. Lindseth, A. H. Foruzan, R. Kiros, K. Popuri, D. Cobzas, D. Jimenez-Carretero, A. Santos, M. J. Ledesma-Carbayo, M. Helmberger, M. Urschler, M. Pienn, D. G. Bosboom, A. Campo, M. Prokop, P. A. de Jong, C. Ortiz-de Solorzano, A. Munoz-Barrutia, and B. van Ginneken, "Comparing algorithms for automated vessel segmentation in computed tomography scans of the lung: the VESSEL12 study," Medical Image Analysis, vol. 18, no. 7, pp. 1217-1232, 10 2014.

[26] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes Dataset for Semantic Urban Scene Understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213-3223.

[27] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei, "Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 82-92.

[28] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, vol. 9351, 2015, pp. 234-241.

[29] F. Isensee, J. Petersen, S. A. A. Kohl, P. F. Jäger, and K. H. Maier-Hein, "nnU-Net: Breaking the Spell on Successful Medical Image Segmentation," arXiv preprint arXiv:1809.10486, 4 2019.

[30] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in International Conference on Machine Learning, 6 2015, pp. 448-456.

[31] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training Very Deep Networks," in Advances in Neural Information Processing Systems, 2015, pp. 2377-2385.

[32] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 12 2016.

[33] F. Yu and V. Koltun, "Multi-Scale Context Aggregation by Dilated Convolutions," arXiv preprint arXiv:1511.07122, 11 2015.

[34] F. Yu, V. Koltun, and T. Funkhouser, "Dilated Residual Networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 472-480.

[35] B. Polyak, "Some methods of speeding up the convergence of iteration methods," USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1-17, 1 1964.

[36] J. Hofmanninger, M. Krenn, M. Holzer, T. Schlegl, H. Prosch, and G. Langs, "Unsupervised identification of clinically relevant clusters in routine imaging data," in International Conference on Medical Image Computing and Computer-Assisted Intervention, vol. 9900 LNCS, 2016, pp. 192-200.

[37] P. A. Yushkevich, J. Piven, H. C. Hazlett, R. G. Smith, S. Ho, J. C. Gee, and G. Gerig, "User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability," NeuroImage, vol. 31, no. 3, pp. 1116-1128, 7 2006.

[38] H. J. Aerts, E. R. Velazquez, R. T. Leijenaar, C. Parmar, P. Grossmann, S. Carvalho, J. Bussink, R. Monshouwer, B. Haibe-Kains, D. Rietveld, F. Hoebers, M. M. Rietbergen, C. R. Leemans, A. Dekker, J. Quackenbush, R. J. Gillies, and P. Lambin, "Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach," Nature Communications, vol. 5, p. 4006, 2014.