
Synergistic Image and Feature Adaptation: Towards Cross-Modality Domain Adaptation for Medical Image Segmentation

Cheng Chen1, Qi Dou1, Hao Chen1,2, Jing Qin3, and Pheng-Ann Heng1,4

1 Department of Computer Science and Engineering, The Chinese University of Hong Kong
2 Imsight Medical Technology Co., Ltd., China
3 Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University
4 Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, SIAT, China

{cchen, qdou, hchen, pheng}@cse.cuhk.edu.hk, [email protected]

Abstract

This paper presents a novel unsupervised domain adaptation framework, called Synergistic Image and Feature Adaptation (SIFA), to effectively tackle the problem of domain shift. Domain adaptation has become an important and hot topic in recent studies on deep learning, aiming to recover performance degradation when applying the neural networks to new testing domains. Our proposed SIFA is an elegant learning diagram which presents synergistic fusion of adaptations from both image and feature perspectives. In particular, we simultaneously transform the appearance of images across domains and enhance domain-invariance of the extracted features towards the segmentation task. The feature encoder layers are shared by both perspectives to grasp their mutual benefits during the end-to-end learning procedure. Without using any annotation from the target domain, the learning of our unified model is guided by adversarial losses, with multiple discriminators employed from various aspects. We have extensively validated our method with a challenging application of cross-modality medical image segmentation of cardiac structures. Experimental results demonstrate that our SIFA model recovers the degraded performance from 17.2% to 73.0%, and outperforms the state-of-the-art methods by a significant margin.

Introduction

Deep convolutional neural networks (DCNNs) have made great breakthroughs in various challenging while crucial vision tasks (Long et al. 2015a; He et al. 2016). As investigations of DCNNs move on, recent studies have frequently pointed out the problem of performance degradation when encountering domain shift, i.e., attempting to apply the learned models on testing data (target domain) that have different distributions from the training data (source domain). In medical image computing, which is an important area to apply AI for healthcare, the situation of heterogeneous domain shift is even more natural and severe, given the various imaging modalities with different physical principles.

For example, as shown in Fig. 1, the cardiac areas present significantly different visual appearance when viewed from different modalities of medical images, such as magnetic resonance (MR) imaging and computed tomography (CT). Unsurprisingly, the DCNNs trained on MR data completely fail when being tested on CT images.

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Illustration of addressing the severe cross-modality domain shift of medical images from different perspectives. The segmentation results of the CT images with the DCNN trained on MR data are shown in the bottom: (a) without any adaptation; (b) with pure image adaptation; (c) with pure feature adaptation; (d) our proposed synergistic image and feature adaptations; (e) the ground truth.

To recover model performance, an easy way is to re-train or fine-tune models with additional labeled data from the target domain (Van Opbroek et al. 2015; Ghafoorian et al. 2017). However, annotating data for every new domain is obviously and prohibitively expensive, especially in the medical area, which requires expertise.

To tackle this problem, unsupervised domain adaptation has been intensively studied to enable DCNNs to achieve competitive performance on unlabeled target data, only with annotations from the source domain. Prior works have treated domain shift mainly from two directions. One stream is the image adaptation, by aligning the image appearance between domains with the pixel-to-pixel transformation. In this way, the domain shift is addressed at the input level to DCNNs. To preserve pixel-level contents in original images, the adaptation process is usually guided by a cycle-consistency constraint (Zhu et al. 2017; Hoffman et al. 2018). Typically, the transformed source-like images can be directly tested by pre-trained source models (Russo et al. 2017; Zhang et al. 2018b); alternatively, the generated target-like images can be used to train models in the target domain (Bousmalis et al. 2017; Zhao et al. 2018).


Although the synthesis images still cannot perfectly mimic the appearance of real images, the image adaptation process brings accurate pixel-wise predictions on target images, as shown in Fig. 1.

The other stream for unsupervised domain adaptation follows the feature adaptation, which aims to extract domain-invariant features with DCNNs, regardless of the appearance difference between input domains. Most methods within this stream discriminate feature distributions of source/target domains in an adversarial learning scenario (Ganin et al. 2016; Tzeng et al. 2017; Dou et al. 2018). Furthermore, considering the high dimensions of plain feature spaces, some recent works connected the discriminator to more compact spaces. For example, Tsai et al. input segmentation masks to the discriminator, so that the supervision arises from a semantic prediction space (Tsai et al. 2018). Sankaranarayanan et al. reconstruct the features into images and put a discriminator in the reconstructed image space (Sankaranarayanan et al. 2018). Although the adversarial discriminators implicitly enhance domain invariance of features extracted by DCNNs, the adaptation process can output results with proper and smooth shape geometry.

Being aware that the image adaptation and feature adaptation address domain shift from complementary perspectives, we recognize that the two adaptation procedures can be performed together within one unified framework. With image transformation, the source images are transformed towards the appearance of the target domain; afterwards, the remaining gap between the synthesis target-like images and real target images can be further addressed using the feature adaptation. Sharing this spirit, several very recent works have presented promising attempts using image and feature adaptations altogether (Hoffman et al. 2018; Zhang et al. 2018a). However, these existing methods conduct the two perspectives of adaptations sequentially, without leveraging mutual interactions and benefits. Surely, there still remains extensive space for synergistic merging of image and feature adaptations, to elegantly overcome the hurdle of domain shift when generalizing DCNNs to new domains with zero extra annotation cost.

In this paper, we propose a novel unsupervised domain adaptation framework, called Synergistic Image and Feature Adaptation (SIFA), and successfully apply it to adaptation of cross-modality medical image segmentation under severe domain shift. Our designed SIFA presents an elegant learning diagram which enables synergistic fusion of adaptations from both image and feature perspectives. More specifically, we transform the labeled source images into the appearance of images drawn from the target domain, by using generative adversarial networks with a cycle-consistency constraint. When using the synthesis target-like images to train a segmentation model, we further integrate feature adaptation to combat the remaining domain shift. Here, we use two discriminators, respectively connecting the semantic segmentation predictions and generated source-like images, to differentiate whether they are obtained from synthesis or real target images. Most importantly, in our designed SIFA framework, we share the feature encoder, such that it can simultaneously transform image appearance and extract domain-invariant representations for the segmentation task. The entire domain adaptation framework is unified, and both image and feature adaptations are seamlessly integrated into an end-to-end learning diagram. The major contributions of this paper are as follows:

• We present SIFA, a novel unsupervised domain adaptation framework, which exploits synergistic image and feature adaptations to tackle domain shift via complementary perspectives.

• We enhance feature adaptation by using discriminators in two aspects, i.e., the semantic prediction space and the generated image space. Both compact spaces help to further enhance domain-invariance of the extracted features.

• We validate the effectiveness of our SIFA on the challenging task of cross-modality cardiac structure segmentation. Our approach recovers the performance degradation from 17.2% to 73.0%, and outperforms the state-of-the-art methods by a significant margin. The code is available at https://github.com/cchen-cc/SIFA.

Related Work

Addressing performance degradation of DCNNs under domain shift has been a highly active and fruitful research field in recent investigations of deep learning. Plenty of adaptive methods have been proposed from different perspectives, including the image-level adaptation, feature-level adaptation and their mixtures. In this section, we overview the progress and state-of-the-art approaches along these streams, with a particular focus on unsupervised domain adaptation in the image processing field. Studies on both natural and medical images are covered.

Owing to the generative adversarial network (Goodfellow et al. 2014), image-level adaptation methods have been developed to tackle domain shift at the input level to DCNNs. Some methods first trained a DCNN in the source domain, and then transformed the target images into source-like ones, such that they can be tested using the pre-trained source model (Russo et al. 2017; Zhang et al. 2018b; Chen et al. 2018). Inversely, other methods tried to transform the source images into the appearance of target images (Bousmalis et al. 2017; Shrivastava et al. 2017; Hoffman et al. 2018). The transformed target-like images are then used to train a task model which could perform well in the target domain. This has also been used in medical retinal fundus image analysis (Zhao et al. 2018). With the wide success of CycleGAN in unpaired image-to-image transformation, many previous image adaptation works were based on modified CycleGAN, with applications in both natural datasets (Russo et al. 2017; Hoffman et al. 2018) and medical image segmentation (Huo et al. 2018; Zhang et al. 2018b; Chen et al. 2018).

Meanwhile, approaches for feature-level adaptation have also been investigated, aiming to reduce domain shift by extracting domain-invariant features in the DCNNs. Pioneer works tried to minimize the distance between domain statistics, such as the maximum mean discrepancy (Long et al. 2015b) and the layer activation correlation (Sun and Saenko 2016). Later, representative methods of DANN (Ganin et al. 2016) and ADDA (Tzeng et al. 2017) advanced feature adaptation via adversarial learning, by using a discriminator to differentiate the feature space across domains.

Figure 2: Overview of our unsupervised domain adaptation framework. The generator Gt serves the source-to-target image transformation. The encoder E and decoder U form the reverse transformation, where the encoder E is also connected with a classifier C for image segmentation. The discriminators {Dt, Ds, Dp} differentiate their inputs accordingly to derive adversarial losses. The blue and red arrows indicate the data flows for the image adaptation and feature adaptation respectively. The reverse cycle-consistency is omitted in this figure for ease of illustration.

Effectiveness of this strategy has also been validated in medical applications of segmenting brain lesions (Kamnitsas et al. 2017) and cardiac structures (Dou et al. 2018; Joyce et al. 2018). Recent studies proposed to project the high-dimensional feature space to other compact spaces, such as the semantic prediction space (Tsai et al. 2018) or the image space (Sankaranarayanan et al. 2018), with a discriminator operating in the compact spaces to derive adversarial losses for more effective feature alignment.

The image and feature adaptations address domain shift from different perspectives to the DCNNs, which are in fact complementary to each other. Combining these two adaptive strategies to achieve a stronger domain adaptation technique is still under exploration. As the state-of-the-art methods for semantic segmentation adaptation, CyCADA (Hoffman et al. 2018) and Zhang et al. (Zhang et al. 2018a) achieved leading performance in adaptation between synthetic and real-world driving scene domains. However, their image and feature adaptations are sequentially connected and trained in stages without interactions.

Considering the severe domain shift in cross-modality medical images, feature adaptation or image adaptation alone may not be sufficient for this challenging task, while the simultaneous adaptations from the two perspectives have not been fully explored yet. To tackle the challenging cross-modality adaptation for the segmentation task, we propose to synergistically merge the two adaptive processes in a unified network to fully exploit their mutual benefits towards unsupervised domain adaptation.

Methods

An overview of our proposed method for unsupervised domain adaptation in medical image segmentation is shown in Fig. 2. We propose synergistic image and feature adaptations with a novel learning diagram to effectively narrow the performance gap due to domain shift. The two perspectives of adaptations are seamlessly integrated into a unified model, and hence, both aspects can mutually benefit each other during the end-to-end training procedure.

Image Adaptation for Appearance Alignment

First, with a set of labeled samples $\{x_i^s, y_i^s\}_{i=1}^{N}$ from the source domain $X^s$, as well as unlabeled samples $\{x_j^t\}_{j=1}^{M}$ from the target domain $X^t$, we aim to transform the source images $x^s$ towards the appearance of target ones $x^t$, which hold different visual appearance due to domain shift. The obtained transformed image looks as if drawn from the target domain, while the original contents with structural semantics remain unaffected. Briefly speaking, this module narrows the domain shift between the source and target domains by aligning image appearance.

In practice, we use generative adversarial networks, which have made a wide success for pixel-to-pixel image transformation, by building a generator $G_t$ and a discriminator $D_t$. The generator aims to transform the source images to target-like ones $G_t(x^s) = x^{s \to t}$. The discriminator competes with the generator to correctly differentiate the fake transformed image $x^{s \to t}$ and the real target image $x^t$. Therefore, in the target domain, the $G_t$ and $D_t$ form a minimax two-player game and are optimized via the adversarial learning:

$$\mathcal{L}_{adv}^{t}(G_t, D_t) = \mathbb{E}_{x^t \sim X^t}[\log D_t(x^t)] + \mathbb{E}_{x^s \sim X^s}[\log(1 - D_t(G_t(x^s)))], \quad (1)$$

where the discriminator tries to maximize this objective to distinguish between $G_t(x^s) = x^{s \to t}$ and $x^t$, and meanwhile, the generator needs to minimize this objective to transform $x^s$ into realistic target-like images.
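To make the adversarial objective in Eq. (1) concrete, the following is a minimal PyTorch-style sketch of how such a loss can be computed, using the non-saturating form commonly used in practice; the module names (`gen_t`, `disc_t`) and tensor conventions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def d_loss_target(disc_t, gen_t, x_s, x_t):
    """Discriminator side of Eq. (1): push D_t(x^t) towards 1 and D_t(G_t(x^s)) towards 0."""
    x_s2t = gen_t(x_s).detach()            # fake target-like image; no gradient to G_t here
    real_logits = disc_t(x_t)              # PatchGAN-style output: one logit per patch
    fake_logits = disc_t(x_s2t)
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def g_loss_target(disc_t, gen_t, x_s):
    """Generator side: fool D_t so that G_t(x^s) is classified as real (non-saturating trick)."""
    fake_logits = disc_t(gen_t(x_s))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```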

To preserve original contents in the transformed images, a reverse generator is usually used to impose the cycle consistency (Zhu et al. 2017). As shown in Fig. 2, the encoder $E$ and upsampling decoder $U$ form the reverse target-to-source generator $G_s = E \circ U$ to reconstruct the $x^{s \to t}$ back to the source domain, and a discriminator $D_s$ operates in the source domain. This pair of source modules $\{G_s, D_s\}$ are trained in the same manner as $\{G_t, D_t\}$ with the adversarial loss $\mathcal{L}_{adv}^{s}$. Then the pixel-wise cycle-consistency loss $\mathcal{L}_{cyc}$ is used to encourage $U(E(G_t(x^s))) \approx x^s$ and $G_t(U(E(x^t))) \approx x^t$ for recovering the original image:

$$\mathcal{L}_{cyc}(G_t, E, U) = \mathbb{E}_{x^s \sim X^s}\|U(E(G_t(x^s))) - x^s\|_1 + \mathbb{E}_{x^t \sim X^t}\|G_t(U(E(x^t))) - x^t\|_1. \quad (2)$$
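As a rough illustration of Eq. (2), the cycle-consistency term is an L1 penalty on the two reconstruction paths; this sketch reuses the hypothetical `gen_t` generator from the previous snippet together with assumed `encoder`/`decoder` modules standing in for E and U.

```python
import torch

def cycle_consistency_loss(gen_t, encoder, decoder, x_s, x_t):
    """Eq. (2): L1 reconstruction error for both cycles s -> t -> s and t -> s -> t."""
    x_s2t = gen_t(x_s)                    # source stylized as target
    x_s_rec = decoder(encoder(x_s2t))     # back to source appearance via G_s = U(E(.))
    x_t2s = decoder(encoder(x_t))         # target stylized as source
    x_t_rec = gen_t(x_t2s)                # back to target appearance
    return torch.mean(torch.abs(x_s_rec - x_s)) + torch.mean(torch.abs(x_t_rec - x_t))
```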

With the adversarial loss and cycle-consistency loss, the image adaptation transforms the source images $x^s$ into target-like images $x^{s \to t}$ with semantic contents preserved. Ideally, this pixel-to-pixel transformation could bring $x^{s \to t}$ into the data distribution of the target domain, such that these synthesis images can be used to train a segmentation network for the target domain.

Specifically, after extracting features from the adapted image $x^{s \to t}$, the feature maps $E(x^{s \to t})$ are forwarded to a classifier $C$ for predicting segmentation masks. In other words, the composition of $E \circ C$ serves as the segmentation network for the target domain. This part is trained using the sample pairs of $\{x^{s \to t}, y^s\}$ by minimizing a hybrid loss $\mathcal{L}_{seg}$. Formally, denoting the segmentation prediction for $x^{s \to t}$ by $y^{s \to t} = C(E(x^{s \to t}))$, the segmentation loss is defined as:

$$\mathcal{L}_{seg}(E, C) = H(y^s, y^{s \to t}) + \alpha \cdot \mathrm{Dice}(y^s, y^{s \to t}), \quad (3)$$

where the first term represents the cross-entropy loss, the second term is the Dice loss, and $\alpha$ is the trade-off hyper-parameter balancing them. The hybrid loss function is designed to address the class imbalance in medical image segmentation.
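For readers who want to reproduce the hybrid objective in Eq. (3), the sketch below shows one common way to combine cross-entropy with a soft multi-class Dice term; the exact Dice formulation and the value of the weight `alpha` are assumptions, since the paper does not spell them out.

```python
import torch
import torch.nn.functional as F

def hybrid_seg_loss(logits, labels, num_classes, alpha=1.0, eps=1e-6):
    """Eq. (3) sketch: cross-entropy plus a soft Dice loss averaged over classes.

    logits: (B, C, H, W) raw classifier outputs; labels: (B, H, W) integer masks.
    """
    ce = F.cross_entropy(logits, labels)

    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()
    intersect = torch.sum(probs * onehot, dim=(0, 2, 3))
    denom = torch.sum(probs, dim=(0, 2, 3)) + torch.sum(onehot, dim=(0, 2, 3))
    dice_loss = 1.0 - torch.mean((2.0 * intersect + eps) / (denom + eps))

    return ce + alpha * dice_loss
```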

Feature Adaptation for Domain Invariance

With the above image adaptation, training a segmentation network with the transformed target-like images can already yield appealing performance on target data. Unfortunately, when domain shift is severe, such as for cross-modality medical images, it is still insufficient to achieve the desired domain adaptation results. To this end, we further impose additional discriminators to contribute from the perspective of feature adaptation, attempting to bridge the remaining domain gap between the synthesis target images and real target images.

To make the extracted features domain-invariant, the most common way is to use adversarial learning directly in the feature space, such that a discriminator fails to differentiate which features come from which domain. However, a plain feature space is high-dimensional and hence difficult to align directly. Instead, we choose to enhance the domain-invariance of feature distributions by using adversarial learning via two compact lower-dimensional spaces. Specifically, we inject adversarial losses via the semantic prediction space and the generated image space.

As shown in Fig. 2, for the prediction of segmentation masks from $\{E, C\}$, we construct the discriminator $D_p$ to classify whether the outputs correspond to $x^{s \to t}$ or $x^t$. The semantic prediction space represents the information of human-body anatomical structures, which should be consistent across different imaging modalities. If the features extracted from $x^{s \to t}$ are aligned with those from $x^t$, the discriminator $D_p$ would fail in differentiating their corresponding segmentation masks, as the anatomical shapes are consistent. Otherwise, the adversarial gradients are back-propagated to the feature extractor $E$, so as to minimize the distance between the feature distributions from $x^{s \to t}$ and $x^t$. The adversarial loss from semantic-level supervision for the feature adaptation is:

$$\mathcal{L}_{adv}^{p}(E, C, D_p) = \mathbb{E}_{x^{s \to t} \sim X^{s \to t}}[\log D_p(C(E(x^{s \to t})))] + \mathbb{E}_{x^t \sim X^t}[\log(1 - D_p(C(E(x^t))))]. \quad (4)$$

For the generated source-like images from $\{E, U\}$, we add an auxiliary task to the source discriminator $D_s$ to differentiate whether the generated images are transformed from real target images $x^t$ or reconstructed from $x^{s \to t}$. If the discriminator $D_s$ succeeds in classifying the domain of the generated images, it means that the extracted features still contain domain characteristics. To make the features domain-invariant, the following adversarial loss is employed to supervise the feature extraction process:

$$\mathcal{L}_{adv}^{s}(E, D_s) = \mathbb{E}_{x^{s \to t} \sim X^{s \to t}}[\log D_s(U(E(x^{s \to t})))] + \mathbb{E}_{x^t \sim X^t}[\log(1 - D_s(U(E(x^t))))]. \quad (5)$$

It is noted that $E$ is encouraged to extract domain-invariant features by connecting discriminators from two aspects, i.e., the segmentation predictions (high-level semantics) and the generated source-like images (low-level appearance). By adversarial learning in these lower-dimensional compact spaces, the domain gap between the synthesis target images $x^{s \to t}$ and real target images $x^t$ can be effectively addressed.
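The two feature-adaptation losses in Eqs. (4) and (5) follow the same pattern: the encoder is updated so that a discriminator placed on a compact space (the prediction masks or the source-like reconstructions) cannot tell real target inputs from the synthesis branch. The sketch below illustrates the encoder-side update for assumed modules `encoder`, `classifier`, `decoder`, `disc_p`, and `disc_s`; it is a schematic rendering rather than the released SIFA code.

```python
import torch
import torch.nn.functional as F

def feature_adaptation_losses(encoder, classifier, decoder, disc_p, disc_s, x_t):
    """Encoder-side sketch of Eqs. (4)-(5): adversarial alignment in two compact spaces.

    For the real target image x_t, the encoder is pushed so that the discriminators
    classify its prediction / reconstruction as coming from the synthesis branch x^{s->t}
    (label 1 in the convention of Eqs. (4)-(5)). The corresponding discriminator updates,
    not shown here, use both the x^{s->t} and x^t branches with opposite labels.
    """
    feat_t = encoder(x_t)

    # Semantic prediction space (Eq. 4): fool D_p on the target prediction.
    logits_p = disc_p(classifier(feat_t))
    adv_p = F.binary_cross_entropy_with_logits(logits_p, torch.ones_like(logits_p))

    # Generated image space (Eq. 5): fool D_s on the target-to-source reconstruction.
    logits_s = disc_s(decoder(feat_t))
    adv_s = F.binary_cross_entropy_with_logits(logits_s, torch.ones_like(logits_s))

    return adv_p, adv_s
```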

Synergistic Learning Diagram

Importantly, a key characteristic in our proposed synergistic learning diagram is to share the feature encoder $E$ between both image and feature adaptations. More specifically, the $E$ is optimized with the adversarial loss $\mathcal{L}_{adv}^{s}$ and cycle-consistency loss $\mathcal{L}_{cyc}$ via the image adaptation perspective. It also collects gradients back-propagated from the discriminators $\{D_p, D_s\}$ towards feature adaptation. In these regards, the feature encoder is fitted inside a multi-task learning scenario, such that it is able to present generic and robust representations useful for multiple purposes. In turn, the different tasks bring complementary inductive bias to the encoder parameters, i.e., either emphasizing pixel-wise cyclic reconstruction or focusing on structural semantics. This can also contribute to alleviating the over-fitting problem with limited medical datasets when training such a complicated model.

With the encoder enabling seamless integration of the image and feature adaptations, we can train the unified framework in an end-to-end manner. At each training iteration, all the modules are sequentially updated in the following order: $G_t \to D_t \to E \to C \to U \to D_s \to D_p$. Specifically, the generator $G_t$ is updated first to obtain the transformed target-like images. Then the discriminator $D_t$ is updated to differentiate the target-like images $x^{s \to t}$ and the real target images $x^t$. Next, the encoder $E$ is updated for feature extraction from $x^{s \to t}$ and $x^t$, followed by the updating of the classifier $C$ and decoder $U$ to map the extracted features to the segmentation predictions and generated source-like images. Finally, the discriminators $D_s$ and $D_p$ are updated to classify the domain of their inputs to enhance feature-invariance. The overall objective for our framework is as follows:

$$\mathcal{L} = \mathcal{L}_{adv}^{t}(G_t, D_t) + \lambda_{adv}^{s}\mathcal{L}_{adv}^{s}(E, U, D_s) + \lambda_{cyc}\mathcal{L}_{cyc}(G_t, E, U) + \lambda_{seg}\mathcal{L}_{seg}(E, C) + \lambda_{adv}^{p}\mathcal{L}_{adv}^{p}(E, C, D_p) + \lambda_{adv}^{s}\mathcal{L}_{adv}^{s}(E, D_s), \quad (6)$$

where $\lambda_{adv}^{s}$, $\lambda_{cyc}$, $\lambda_{seg}$, and $\lambda_{adv}^{p}$ are trade-off parameters adjusting the importance of each component.

For training practice, when updating with the adversarial learning losses, we used the Adam optimizer with a learning rate of $2 \times 10^{-4}$. For the segmentation task, the Adam optimizer was parameterized with an initial learning rate of $1 \times 10^{-3}$ and a stepped decay rate of 0.9 every 2 epochs.

During the testing procedure, when an image from the target domain arrives, this $x^t$ is forwarded into the encoder $E$, followed by applying the classifier $C$. In this way, the semantic segmentation result is obtained by $C(E(x^t))$, using the domain adaptation framework which is learned without the need of any target domain annotations.
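To tie the pieces together, the following simplified sketch shows one plausible way to organize a single training iteration along the update order $G_t \to D_t \to E \to C \to U \to D_s \to D_p$ described above; the optimizer grouping, loss weights, and helper functions (e.g. `d_loss_target`, `cycle_consistency_loss`) are assumptions built on the earlier sketches, not the authors' exact schedule.

```python
def train_step(modules, optimizers, losses, x_s, y_s, x_t, weights):
    """One SIFA-style iteration (sketch): image adaptation first, then feature adaptation."""
    gen_t, disc_t, encoder, classifier, decoder, disc_s, disc_p = modules

    # 1) Update G_t with its adversarial and cycle-consistency terms.
    optimizers["gen_t"].zero_grad()
    loss_gt = losses["g_adv_t"](disc_t, gen_t, x_s) \
        + weights["cyc"] * losses["cyc"](gen_t, encoder, decoder, x_s, x_t)
    loss_gt.backward()
    optimizers["gen_t"].step()

    # 2) Update D_t to separate real target images from G_t(x_s).
    optimizers["disc_t"].zero_grad()
    losses["d_adv_t"](disc_t, gen_t, x_s, x_t).backward()
    optimizers["disc_t"].step()

    # 3) Update E, C, U with segmentation and feature-adaptation terms
    #    (the cycle term also reaches E and U; it is folded into step 1 here for brevity).
    optimizers["enc_cls_dec"].zero_grad()
    x_s2t = gen_t(x_s).detach()
    seg = losses["seg"](classifier(encoder(x_s2t)), y_s)
    adv_p, adv_s = losses["feat_adv"](encoder, classifier, decoder, disc_p, disc_s, x_t)
    (weights["seg"] * seg + weights["adv_p"] * adv_p + weights["adv_s"] * adv_s).backward()
    optimizers["enc_cls_dec"].step()

    # 4) Update D_s and D_p to classify which branch their inputs came from.
    optimizers["disc_s_p"].zero_grad()
    losses["d_compact"](encoder, classifier, decoder, disc_p, disc_s, x_s2t, x_t).backward()
    optimizers["disc_s_p"].step()
```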

Network Configurations of the Modules

In this section, we describe the detailed network configurations of every module in the proposed framework. Residual connections are widely used to ease the gradient flow inside our complicated model. We also actively borrow the previous successful experiences of training generative adversarial networks, as reported in the references.

The layer configuration of the target generator $G_t$ follows the practice of CycleGAN (Zhu et al. 2017). It consists of 3 convolutional layers, 9 residual blocks, and 2 deconvolutional layers, finally using one convolutional layer to get the generated images. For the source decoder $U$, we construct it with 1 convolutional layer, 4 residual blocks, and 3 deconvolutional layers, also followed by one convolutional output layer. For all the three discriminators $\{D_t, D_s, D_p\}$, we follow the configuration of PatchGAN (Isola et al. 2017), by differentiating 70×70 patches. The networks consist of 5 convolutional layers with kernels of size 4×4 and stride of 2, except for the last two layers, which use a convolution stride of 1. The numbers of feature maps are {64, 128, 256, 512, 1} for each layer, respectively. At the first four layers, each convolutional layer is followed by an instance normalization and a leaky ReLU parameterized with 0.2.
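As a concrete reading of the discriminator description above, here is a minimal PyTorch sketch of a 70×70 PatchGAN-style discriminator with the stated channel widths, kernel size, strides, instance normalization, and leaky ReLU slope; the padding and input channel count are assumptions not specified in the text.

```python
import torch.nn as nn

def patch_discriminator(in_channels=1):
    """PatchGAN-style discriminator sketch: 5 conv layers, 4x4 kernels,
    stride 2 except the last two layers (stride 1), channels {64, 128, 256, 512, 1}."""
    layers = []
    channels = [in_channels, 64, 128, 256, 512]
    strides = [2, 2, 2, 1]
    for i in range(4):
        layers += [
            nn.Conv2d(channels[i], channels[i + 1], kernel_size=4, stride=strides[i], padding=1),
            nn.InstanceNorm2d(channels[i + 1]),
            nn.LeakyReLU(0.2, inplace=True),
        ]
    layers.append(nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1))  # one logit per patch
    return nn.Sequential(*layers)
```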

The encoder $E$ uses residual connections and dilated convolutions (dilation rate = 2) to enlarge the size of the receptive field while preserving the spatial resolution for dense predictions (Yu et al. 2017). Let {Ck, Rk, Dk} denote a convolutional layer, a residual block, and a dilated residual block with k channels, respectively. The M represents a max-pooling layer with a stride of 2. Our encoder module is deep by stacking layers of {C16, R16, M, R32, M, 2×R64, M, 2×R128, 4×R256, 2×R512, 2×D512, 2×C512}. Each convolution operation is connected to a batch normalization layer and ReLU activation. The classifier $C$ is a 1×1 convolutional layer followed by an upsampling layer to recover the resolution of segmentation predictions to the original image size.

Experimental Results

Dataset and Evaluation Metrics

We validated our proposed unsupervised domain adaptation method on the Multi-Modality Whole Heart Segmentation Challenge 2017 dataset for cardiac segmentation in MR and CT images (Zhuang and Shen 2016). The dataset consists of unpaired 20 MR and 20 CT volumes collected at different clinical sites. The ground truth masks of cardiac structures are provided, including the ascending aorta (AA), the left atrium blood cavity (LAC), the left ventricle blood cavity (LVC), and the myocardium of the left ventricle (MYO). We aim to adapt the segmentation network in the setting of cross-modality learning.

We employed the MR images as the source domain, and the CT images as the target domain. Each modality was randomly split with 80% of the cases for training and 20% for testing. The ground truth of the CT images was used for evaluation only, without being presented to the network during the training phase. All the data were normalized to zero mean and unit variance. To train our model, we used the coronal view image slices, which were cropped to the size of 256×256 and augmented with rotation, scaling, and affine transformations to reduce over-fitting.
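A minimal sketch of this preprocessing, assuming slices are available as NumPy arrays and using SciPy for the geometric augmentation; the rotation and zoom ranges are illustrative choices, not values reported in the paper.

```python
import numpy as np
from scipy import ndimage

def preprocess_slice(img, size=256):
    """Normalize to zero mean / unit variance and center-crop to size x size."""
    img = (img - img.mean()) / (img.std() + 1e-8)
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def augment_slice(img, label):
    """Apply the same random rotation and scaling to an image slice and its label mask."""
    angle = np.random.uniform(-15, 15)       # degrees, illustrative range
    zoom = np.random.uniform(0.9, 1.1)
    img = ndimage.rotate(img, angle, reshape=False, order=1)
    label = ndimage.rotate(label, angle, reshape=False, order=0)   # nearest for labels
    img = ndimage.zoom(img, zoom, order=1)
    label = ndimage.zoom(label, zoom, order=0)
    return img, label
```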

For evaluation, we employed two commonly-used metrics to quantitatively evaluate the segmentation performance, which have also been used in previous cross-modality domain adaptation works (Dou et al. 2018; Joyce et al. 2018). One measurement is the Dice coefficient ([%]), which calculates the volume overlap between the prediction mask and the ground truth. The other is the average surface distance ASD ([voxel]), which assesses the model performance at the boundaries; a lower ASD indicates better segmentation results.
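The two metrics can be computed per structure roughly as follows; this sketch uses binary NumPy masks and SciPy's Euclidean distance transform for the surface distances, and is one reasonable implementation rather than the challenge's official evaluation code.

```python
import numpy as np
from scipy import ndimage

def dice_coefficient(pred, gt):
    """Dice [%]: volume overlap between binary prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 100.0 * 2.0 * intersection / (pred.sum() + gt.sum() + 1e-8)

def average_surface_distance(pred, gt):
    """ASD [voxel]: mean symmetric distance between the two mask surfaces."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pred_surface = pred ^ ndimage.binary_erosion(pred)
    gt_surface = gt ^ ndimage.binary_erosion(gt)
    # Distance of every voxel to the nearest surface voxel of the other mask.
    dist_to_gt = ndimage.distance_transform_edt(~gt_surface)
    dist_to_pred = ndimage.distance_transform_edt(~pred_surface)
    d_pred_to_gt = dist_to_gt[pred_surface]
    d_gt_to_pred = dist_to_pred[gt_surface]
    return (d_pred_to_gt.mean() + d_gt_to_pred.mean()) / 2.0
```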

Comparison with the State-of-the-art Methods

We compare our framework with six recent popular unsupervised domain adaptation methods, including DANN (Ganin et al. 2016), ADDA (Tzeng et al. 2017), CycleGAN (Zhu et al. 2017), CyCADA (Hoffman et al. 2018), Dou et al. (Dou et al. 2018), and Joyce et al. (Joyce et al. 2018). Among them, the first four were proposed for natural datasets, and we either used publicly available code or re-implemented them for our cardiac segmentation dataset. The DANN and ADDA employ only feature adaptation, the CycleGAN adapts image appearance, and the CyCADA conducts both image and feature adaptations. The last two methods are dedicated to adapting MR/CT cardiac segmentation networks at the feature level using the same cross-modality dataset as ours, and therefore we directly reference the results from their papers. We also obtain the "W/o adaptation" lower bound by directly applying the model learned in the MR source domain to test the target CT images without using any domain adaptation method.

Table 1 reports the comparison results, where we can see that our method significantly increased the segmentation performance over the "W/o adaptation" lower bound and outperformed previous methods by a large margin in terms of both Dice and ASD.

Figure 3: Visual comparison of segmentation results produced by different methods. From left to right are the raw CT images (1st column), the "W/o adaptation" lower bound (2nd column), results of other unsupervised domain adaptation methods (3rd-6th columns), results of our SIFA network (7th column), and the ground truth (last column). The cardiac structures of AA, LAC, LVC, and MYO are indicated in blue, red, purple, and yellow respectively. Each row corresponds to one example.

Table 1: Performance comparison between our method and other state-of-the-art unsupervised domain adaptation methods for the task of cardiac cross-modality segmentation. We report the Dice and ASD values for each cardiac structure and the average over the four structures. (Note: "-" means that the results are not reported by that method, and "N/A" means that the ASD value cannot be calculated due to no prediction for that cardiac structure.)

| Methods | Image adapt. | Feature adapt. | Dice AA | Dice LAC | Dice LVC | Dice MYO | Dice Average | ASD AA | ASD LAC | ASD LVC | ASD MYO | ASD Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| W/o adaptation | | | 28.4 | 27.7 | 4.0 | 8.7 | 17.2 | 20.6 | 16.2 | N/A | 48.4 | N/A |
| DANN (Ganin et al. 2016) | | X | 39.0 | 45.1 | 28.3 | 25.7 | 34.5 | 16.2 | 9.2 | 12.1 | 10.1 | 11.9 |
| ADDA (Tzeng et al. 2017) | | X | 47.6 | 60.9 | 11.2 | 29.2 | 37.2 | 13.8 | 10.2 | N/A | 13.4 | N/A |
| CycleGAN (Zhu et al. 2017) | X | | 73.8 | 75.7 | 52.3 | 28.7 | 57.6 | 11.5 | 13.6 | 9.2 | 8.8 | 10.8 |
| CyCADA (Hoffman et al. 2018) | X | X | 72.9 | 77.0 | 62.4 | 45.3 | 64.4 | 9.6 | 8.0 | 9.6 | 10.5 | 9.4 |
| Dou et al. (Dou et al. 2018) | | X | 74.8 | 51.1 | 57.2 | 47.8 | 57.7 | 27.5 | 20.1 | 29.5 | 31.2 | 27.1 |
| Joyce et al. (Joyce et al. 2018) | | X | - | - | 66 | 44 | - | - | - | - | - | - |
| SIFA (Ours) | X | X | 81.1 | 76.4 | 75.7 | 58.7 | 73.0 | 10.6 | 7.4 | 6.7 | 7.8 | 8.1 |

Without domain adaptation, the model only obtained an average Dice of 17.2% over the four cardiac structures, demonstrating the severe domain shift between MR and CT images. Remarkably, with our SIFA network, the average Dice was recovered to 73.0% and the average ASD was reduced to 8.1. We achieved over 80% Dice score for the AA structure and over 70% Dice score for the LAC and LVC. Notably, compared with CyCADA, which also conducts both image and feature adaptations, our method achieved superior performance especially for the LVC and MYO structures, which have limited contrast in CT images. This demonstrates the effectiveness of our synergistic learning diagram, which unleashes the benefits from mutual conduction of image and feature alignments.

Visual comparison results are further provided in Fig. 3. We can see that without adaptation, the network hardly outputs any correct prediction for the cardiac structures. By using feature adaptation (3rd and 4th columns) or image adaptation (5th column) alone, appreciable recovery in the segmentation prediction masks can be obtained, but the shape of the predicted cardiac structures is quite cluttered and noisy.

Figure 4: Examples of image transformation between MR and CT images.

Table 2: Effectiveness of each key component in SIFA. "IA" denotes image adaptation; "FA-P" and "FA-I" respectively denote the feature adaptation in the semantic prediction space and the generated image space.

| Methods | IA | L^p_adv | L^s_adv | Average Dice |
|---|---|---|---|---|
| W/o adaptation | | | | 17.2 |
| + Image adaptation | X | | | 58.0 |
| + FA-P | X | X | | 65.7 |
| + FA-I | X | X | X | 73.0 |

Only the two methods, CyCADA and our SIFA, which leverage both the feature and image adaptations, can generate semantically meaningful predictions for the four cardiac structures. Particularly, our SIFA network outperforms CyCADA especially for the segmentation of LVC and MYO. As can be seen in the last row of Fig. 3, the LVC and MYO structures have very limited intensity contrast with their surrounding tissues, but our method can make good predictions while all the other methods fail in this challenging case.

Effectiveness of Key Components

We conduct ablation experiments to evaluate the effectiveness of each key component in our proposed synergistic learning framework of image and feature adaptations. The results are presented in Table 2. Our baseline network uses image adaptation only, which is constructed by removing the feature adaptation adversarial losses $\mathcal{L}_{adv}^{p}$ and $\mathcal{L}_{adv}^{s}$ when training the network, i.e., removing the data flow of the red arrows in Fig. 2. Compared with the "W/o adaptation" lower bound, our baseline network with pure image adaptation already achieved an inspiring increase in segmentation accuracy, with the average Dice increased to 58.0%. This reflects that with image transformation, the source images have been successfully brought closer to the target domain. Fig. 4 shows four examples of image transformation from the source to the target domain and vice versa. As illustrated in the figure, the appearance of images is successfully adapted across domains while the semantic contents in the original images are well-preserved.

Next, we combine the baseline image adaptation with one aspect of feature adaptation, i.e., adding the adversarial learning in the semantic prediction space, which corresponds to adding the discriminator guided by $\mathcal{L}_{adv}^{p}$.

Figure 5: Illustration of the effectiveness of each key component in our method: "IA" denotes our network with image adaptation only; "IA with FA-P" denotes the combination of image adaptation and the feature adaptation in the semantic prediction space; "SIFA" is our overall framework.

The increased performance over the image adaptation baseline, from 58.0% to 65.7%, demonstrates that the image and feature adaptations are complementary to each other and can be jointly conducted to achieve better domain adaptation. Finally, further adding the feature adaptation by aligning the generated source-like images with $\mathcal{L}_{adv}^{s}$ completes our full SIFA network. This leads to a further obvious improvement in the average Dice accuracy of the segmentation results, indicating that the feature adaptation in these two compact spaces injects effects from integral aspects to encourage feature invariance.

Fig. 5 shows the visual comparison results of our network with different components. We can see that the segmentation results become increasingly accurate as more adaptation components are included. Our baseline network with image adaptation alone can correctly identify the cardiac structures, but the predicted shape is irregular and noisy. Adding the feature adaptation in the two lower-dimensional spaces further encourages the network to capture the proper shape of cardiac structures and produce clear predictions. Overall, our SIFA network synergistically merges different adaptation strategies to exploit their complementary contributions to unsupervised domain adaptation.

Conclusion

This paper proposes a novel approach, SIFA, for unsupervised domain adaptation of cross-modality medical image segmentation. Our SIFA network synergistically combines the image and feature adaptations to conduct image appearance transformation and domain-invariant feature learning simultaneously. The two adaptive perspectives are guided by adversarial learning with partial parameter sharing to exploit their mutual benefits for reducing domain shift during the end-to-end training. We validate our method on unpaired MR to CT adaptation for cardiac segmentation by comparing it with various state-of-the-art methods. Experimental results demonstrate the superiority of our network over the others in terms of both the Dice and ASD values. Our method is general and can be easily extended to other segmentation applications of unsupervised domain adaptation.


Acknowledgments

This work was supported by a grant from the 973 Program (Project No. 2015CB351706), a grant from the Shenzhen Science and Technology Program (JCYJ20170413162256793), a grant from the Hong Kong Research Grants Council under the General Research Fund (Project No. 14225616), a grant from the Hong Kong Innovation and Technology Commission under the ITSP Tier 2 Fund (Project No. ITS/426/17FP), and a grant from the Hong Kong Research Grants Council (Project No. PolyU 152035/17E).

References

[Bousmalis et al. 2017] Bousmalis, K.; Silberman, N.; Dohan, D.; et al. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 95–104.

[Chen et al. 2018] Chen, C.; Dou, Q.; Chen, H.; and Heng, P.-A. 2018. Semantic-aware generative adversarial nets for unsupervised domain adaptation in chest x-ray segmentation. arXiv preprint arXiv:1806.00600.

[Dou et al. 2018] Dou, Q.; Ouyang, C.; Chen, C.; Chen, H.; and Heng, P.-A. 2018. Unsupervised cross-modality domain adaptation of convnets for biomedical image segmentations with adversarial loss. arXiv preprint arXiv:1804.10916.

[Ganin et al. 2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; et al. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1):2096–2030.

[Ghafoorian et al. 2017] Ghafoorian, M.; Mehrtash, A.; Kapur, T.; Karssemeijer, N.; Marchiori, E.; et al. 2017. Transfer learning for domain adaptation in MRI: Application in brain lesion segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 516–524.

[Goodfellow et al. 2014] Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; et al. 2014. Generative adversarial nets. In Conference on Neural Information Processing Systems (NIPS), 2672–2680.

[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.

[Hoffman et al. 2018] Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.; Isola, P.; Saenko, K.; Efros, A. A.; and Darrell, T. 2018. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 1994–2003.

[Huo et al. 2018] Huo, Y.; Xu, Z.; Bao, S.; Assad, A.; Abramson, R. G.; et al. 2018. Adversarial synthesis learning enables segmentation without target modality ground truth. In IEEE International Symposium on Biomedical Imaging (ISBI), 1217–1220.

[Isola et al. 2017] Isola, P.; Zhu, J.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5967–5976.

[Joyce et al. 2018] Joyce, T.; Chartsias, A.; Tsaftaris, S. A.; et al. 2018. Deep multi-class segmentation without ground-truth labels. In International Conference on Medical Imaging with Deep Learning (MIDL).

[Kamnitsas et al. 2017] Kamnitsas, K.; Baumgartner, C. F.; Ledig, C.; Newcombe, V. F. J.; Simpson, J. P.; et al. 2017. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In International Conference on Information Processing in Medical Imaging (IPMI), 597–609.

[Long et al. 2015a] Long, J.; Shelhamer, E.; Darrell, T.; et al. 2015a. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3431–3440.

[Long et al. 2015b] Long, M.; Cao, Y.; Wang, J.; et al. 2015b. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), 97–105.

[Russo et al. 2017] Russo, P.; Carlucci, F. M.; Tommasi, T.; and Caputo, B. 2017. From source to target and back: Symmetric bi-directional adaptive GAN. arXiv preprint arXiv:1705.08824.

[Sankaranarayanan et al. 2018] Sankaranarayanan, S.; Balaji, Y.; Jain, A.; et al. 2018. Learning from synthetic data: Addressing domain shift for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3752–3761.

[Shrivastava et al. 2017] Shrivastava, A.; Pfister, T.; Tuzel, O.; Susskind, J.; et al. 2017. Learning from simulated and unsupervised images through adversarial training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2242–2251.

[Sun and Saenko 2016] Sun, B., and Saenko, K. 2016. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision (ECCV) Workshops, 443–450.

[Tsai et al. 2018] Tsai, Y.-H.; Hung, W.-C.; Schulter, S.; Sohn, K.; Yang, M.-H.; et al. 2018. Learning to adapt structured output space for semantic segmentation. arXiv preprint arXiv:1802.10349.

[Tzeng et al. 2017] Tzeng, E.; Hoffman, J.; Saenko, K.; et al. 2017. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2962–2971.

[Van Opbroek et al. 2015] Van Opbroek, A.; Ikram, M. A.; Vernooij, M. W.; and De Bruijne, M. 2015. Transfer learning improves supervised image segmentation across imaging protocols. IEEE Transactions on Medical Imaging 34(5):1018–1030.

[Yu et al. 2017] Yu, F.; Koltun, V.; Funkhouser, T. A.; et al. 2017. Dilated residual networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 636–644.

[Zhang et al. 2018a] Zhang, Y.; Qiu, Z.; Yao, T.; Liu, D.; and Mei, T. 2018a. Fully convolutional adaptation networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6810–6818.

[Zhang et al. 2018b] Zhang, Y.; Miao, S.; Mansi, T.; and Liao, R. 2018b. Task driven generative modeling for unsupervised domain adaptation: Application to x-ray image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 599–607.

[Zhao et al. 2018] Zhao, H.; Li, H.; Maurer-Stroh, S.; Guo, Y.; et al. 2018. Supervised segmentation of un-annotated retinal fundus images by synthesis. IEEE Transactions on Medical Imaging.

[Zhu et al. 2017] Zhu, J.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV), 2242–2251.

[Zhuang and Shen 2016] Zhuang, X., and Shen, J. 2016. Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Medical Image Analysis 31:77–87.