JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015
arXiv:2012.10715v3 [eess.IV] 12 May 2021

CCML: A Novel Collaborative Learning Model for Classification of Remote Sensing Images with Noisy Multi-Labels

Ahmet Kerem Aksoy, Student Member, IEEE, Mahdyar Ravanbakhsh, Member, IEEE, Tristan Kreuziger, Student Member, IEEE, and Begüm Demir, Senior Member, IEEE

A. K. Aksoy, M. Ravanbakhsh, T. Kreuziger, and B. Demir are with the Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin, 10623 Berlin, Germany (emails: [email protected], [email protected], [email protected], demir@tu-berlin.de).

Abstract: The development of accurate methods for multi-label classification (MLC) of remote sensing (RS) images is one of the most important research topics in RS. Methods based on deep convolutional neural networks (CNNs) have triggered substantial performance gains in RS MLC problems, but they require a large number of reliable training images annotated with multiple land-cover class labels. Collecting such data is time-consuming and costly. To address this problem, publicly available thematic products, which can include noisy labels, can be used for annotating RS images with zero labeling cost. However, multi-label noise (which can be associated with wrong as well as missing label annotations) can distort the learning process of the MLC algorithm, resulting in inaccurate predictions. The detection and correction of label noise are challenging tasks, especially in a multi-label scenario, where each image can be associated with more than one label. To address this problem, we propose a novel Consensual Collaborative Multi-Label Learning (CCML) method to alleviate the adverse effects of multi-label noise during the training phase of the CNN model. CCML identifies, ranks, and corrects noisy multi-labels in RS images based on four main modules: 1) group lasso module; 2) discrepancy module; 3) flipping module; and 4) swap module. The group lasso module detects the potentially noisy labels assigned to the multi-labeled training images, and the discrepancy module ensures that the two collaborative networks learn diverse features while obtaining the same predictions. The flipping module corrects the identified noisy multi-labels, while the swap module exchanges the ranking information between the two networks. The proposed CCML method detects noisy multi-labels in the training set and corrects them through a relabeling mechanism. Unlike existing methods, the proposed CCML does not make prior assumptions about the noise distribution in the training set. The experiments conducted on two multi-label RS image archives confirm the robustness of the proposed CCML under extreme multi-label noise rates. Our code is publicly available online: http://www.noisy-labels-in-rs.org

Index Terms: Multi-label noise, collaborative learning, multi-label image classification, deep learning, remote sensing

I. INTRODUCTION

Remote sensing (RS) images acquired by satellite-borne and airborne sensors are a rich source of information for monitoring the Earth's surface, e.g., urban area studies, forestry applications, and crop monitoring [1]. As a result of recent advances in RS technology, huge amounts of RS images have been acquired and stored in massive archives, from which the mining of useful information is an important and challenging issue. In view of that, the development of multi-label RS image scene classification methods that aim to automatically assign multiple land-cover class labels (i.e., multi-labels) to each RS image scene in an archive is a growing research interest in RS. This can be achieved by direct supervised classification of each image in the archive. In recent years, deep learning (DL) based approaches have attracted great attention in multi-label classification (MLC) of RS images due to their high capability to describe the complex spatial and spectral content of RS images. As an example, in [2], convolutional neural networks (CNNs) that use the softmax function as the activation of the last layer are presented. In [3], a radial basis function neural network with a multi-label classification layer is proposed. In [4], an attention-based long short-term memory (LSTM) network is used to sequentially predict classes one after another. An encoder-decoder neural network that includes: i) a squeeze excitation layer (which characterizes the channel-wise interdependencies of the image feature maps); and ii) an adaptive spatial attention mechanism (which models the informative image regions) is proposed in [5]. A multi-attention driven approach that contains: i) spatial resolution specific CNNs in a branch-wise architecture; and ii) a bidirectional LSTM network is presented in [6]. In [7], a study that analyzes and compares different DL loss functions in the framework of MLC of RS images is presented.

Most of the above-mentioned DL based approaches for the MLC of RS images require a sufficient number of high-quality (i.e., reliable) training images annotated with multi-labels. This is crucial for accurately characterizing the complex content of images with discriminative and descriptive features and thus for achieving accurate multi-label predictions. However, collecting a sufficient number of reliable multi-labeled images is time-consuming, complex, and costly in operational scenarios, which can significantly affect the final accuracy of the MLC methods [8]. To overcome this problem, a common approach is to employ DL models pre-trained on publicly available Computer Vision (CV) datasets (e.g., ImageNet [9]). Then, the pre-trained models are fine-tuned on a small set of multi-labeled RS images for the final classification task. This approach is not fully adequate for RS images due to the differences between the CV and RS image characteristics (e.g., Sentinel-2 multispectral images
contain 13 spectral bands associated with varying and lower spatial resolutions compared to CV images). In addition, the semantic content (and thus the considered semantic classes) present in RS images is significantly different from that of CV images (see Fig. 1). Thus, DL models trained from scratch on a large RS training set annotated with multi-labels are required. An effective approach for constructing a large training set with zero annotation effort is to exploit the publicly available thematic products (e.g., the Corine Land Cover [CLC] map, the GLC2000, and the GlobCover) in RS as labeling sources [10,11]. As an example, Sumbul et al. developed the BigEarthNet benchmark archive, made up of Sentinel-2 multispectral images annotated by using the CLC map, to drive the DL studies in RS MLC [8]. Constructing such large RS training sets with zero labeling cost is highly valuable. However, the set of land-use and land-cover class labels available in a given area through the thematic products can be incomplete or wrong (i.e., noisy). As an example, according to the validation report of the CLC, the accuracy is around 85% [12]. Using training images with noisy labels may result in uncertainty in the MLC model and thus may lead to reduced performance on multi-label prediction. Accordingly, methods that reduce the negative impact of noisy annotations are needed in the framework of RS MLC. In detail, when training images are annotated by multi-labels, two types of noise can exist for a given RS image:

1) Noise on missing label: This type of noise appears when a land-cover class label is not assigned to the image under consideration although that class is present in the given image (i.e., the label is missing from the label set of that image) [13].

2) Noise on wrong label: This type of noise appears when a land-cover class label is assigned to the image under consideration although that class is not present in the given image (i.e., the label is wrong in the label set of that image) [14].
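The two noise types listed above can be illustrated on a multi-hot label vector. The following is a minimal sketch, not the noise injection procedure of the paper: the function names `inject_multilabel_noise` and `noise_types`, the two rate parameters, and the independent per-class sampling are all assumptions made for illustration.

```python
import random


def inject_multilabel_noise(labels, missing_rate, wrong_rate, seed=0):
    """Return a noisy copy of a multi-hot label vector (hypothetical helper).

    missing_rate: probability that a present class (1) is dropped to 0.
    wrong_rate:   probability that an absent class (0) is set to 1.
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if y == 1 and rng.random() < missing_rate:
            noisy.append(0)  # noise on missing label
        elif y == 0 and rng.random() < wrong_rate:
            noisy.append(1)  # noise on wrong label
        else:
            noisy.append(y)
    return noisy


def noise_types(clean, noisy):
    """Return (missing, wrong): class indices affected by each noise type."""
    pairs = list(enumerate(zip(clean, noisy)))
    missing = [k for k, (c, n) in pairs if c == 1 and n == 0]
    wrong = [k for k, (c, n) in pairs if c == 0 and n == 1]
    return missing, wrong
```

With both rates set to zero the label vector is returned unchanged, and a single image can carry both noise types at once, which is the situation the text describes for different spatial portions of an image.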

The number of missing or wrong class labels can vary depending on the labeling source. It is worth noting that the two types of noise can appear simultaneously in different spatial areas of a given RS image (e.g., one land-cover class label can be missed, while another land-cover class label is wrongly assigned in a different spatial portion of the image). Since DL models can easily overfit to noisy labeled data [16,17], dealing with label noise can significantly improve the MLC performance. Recently, a couple of studies in RS have addressed learning from noisy labels. As an example, in [18], a semantic segmentation method that identifies label noise is presented to generate accurate land-cover maps by classifying RS images. This is achieved by simply evaluating the loss values, since the noisy image labels are associated with the highest values of the loss. However, this method can only identify the wrong label noise and ignores the missing label noise problem. Hua et al. propose a regularization method to improve the MLC performance in RS under label noise [19]. The regularization is defined on the basis of a label correlation matrix constructed from the semantic word embeddings of the labels, considering that semantically similar classes are more likely to appear together. Constructing a reliable label correlation matrix for different RS applications is a complex task due to the difficulty of collecting text descriptions of class labels for properly modeling the correlation between all possible combinations of classes present in RS images. The performance of these two methods depends on the accurate estimation of the noise distribution in the considered data. Thus, there is a need to make prior assumptions about the noise structure, which restricts the applicability and generalization capability of the methods for different MLC applications with different noise distributions. This is a critical limitation for complex RS MLC problems. In contrast to RS, the development of noise-robust DL models has been studied much more extensively in the CV and multimedia communities (see Sec. II-A for the literature survey). However, most of the existing methods assume that each image is annotated by a single label associated with the most significant content of the considered image [20,21]. Adapting single-label noise tolerance methods to multi-labeled images is a challenging task due to the complexity of modeling the above-mentioned two types of noise in multi-labeled images. This becomes more critical when the number of land-cover classes (and thus class combinations) increases.

Fig. 1: An example of multi-labeled images (a) in RS [8]; and (b) in CV [15]. Example label sets in RS: "Discontinuous urban fabric, Coniferous forest, Mixed forest, Industrial or commercial units" and "Discontinuous urban fabric, Industrial or commercial units, Non-irrigated arable land". Example label sets in CV: "Person, Bicycle, Bus, Umbrella, Clock, Backpack" and "Person, Bicycle, Car, Truck, Dog".

To address this problem, we propose a novel Consensual Collaborative Multi-label Learning (CCML) method. Unlike the existing methods in the literature, the proposed method identifies, ranks, and corrects noisy multi-labels in RS images without making any prior assumption about the type of noise in the training set. To this end, the proposed CCML method contains four different modules: 1) group lasso module; 2) discrepancy module; 3) flipping module; and 4) swap module. To automatically identify the noise type, we propose a group lasso module that computes a sample-wise ranking loss and penalizes noisy labeled images. The discrepancy module ensures that the two collaborative networks learn different features while obtaining consistent predictions. The flipping module flips the identified noisy labels, while the swap module exchanges the ranking information between the two collaborative networks and dynamically excludes extremely noisy labeled images from back-propagation. The main contributions of this work are summarized as follows:

• To the best of our knowledge, CCML is the first method that automatically identifies the two types of noise in multi-labeled RS images without any prior assumption about the noise distribution.

• CCML can learn from highly noisy training sets by re-weighting or correcting the noisy labels based on the noise type rather than just excluding them from the training process.

• We adapt the collaborative paradigm to multi-labeled images by ranking training images according to their noise levels and exchanging the ranking information.

• The proposed CCML is evaluated on two multi-labeled benchmark RS image datasets: i) the BigEarthNet dataset, which consists of Sentinel-2 multispectral images [8]; and ii) the UC Merced Land Use dataset, which contains images selected from aerial orthoimagery [22].

The rest of the paper is organized as follows. In Sec. II, we survey the DL based methods presented in the CV and multimedia communities for learning from noisy single-labeled and multi-labeled training images. In Sec. III, we introduce the proposed CCML method in detail. Sec. IV describes the considered datasets and the experimental setup, while Sec. V presents the experimental results. Finally, Sec. VI concludes the paper with a short discussion of the results and observations and indicates future work.

II. RELATED WORK

In this section, we review the methods that are robust to label noise and are presented in the CV literature in the context of image scene classification. We categorize the considered methods into two sub-sections: 1) methods robust to noise on single-labeled images; and 2) methods robust to noise on multi-labeled images.

A. Methods Robust to Noise on Single-Labeled Images

When an image is annotated by a single label, there is only one type of noise, which is associated with the wrong class label assignment. The methods in this category aim to model the noise distribution present in the data, considering only this type of noise, to reduce the influence of noisy labels on the result.

Patrini et al. propose a loss correction method by estimating the noise transition matrix of the data [23]. The proposed method claims to fix the class-dependent label noise in the data, given the probability of each class being corrupted into another. However, as the number of classes increases, it becomes harder to estimate the noise transition matrix, making the method difficult to scale. In [24], the correct labels are considered latent variables in the presence of noisy labels, and the expectation-maximization [25] algorithm is applied to iteratively calculate the correct labels as well as the network parameters. This scheme is also extended to scenarios where noisy labels are dependent on the features, by modeling noise via a softmax layer that connects correct labels to noisy labels. In [26], it is shown that overfitting avoidance techniques such as regularization and dropout can be partially effective when dealing with label noise. However, in these cases, label noise may still affect the quality of the classifier, resulting in an accuracy drop. Moreover, since noise injection is already an overfitting prevention technique that attempts to improve generalization performance [27], regularization should be carefully applied to avoid underfitting.
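The transition-matrix idea behind loss correction can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation of [23]: the matrix `T` is assumed to be known, with `T[i][j]` the probability that true class `i` is observed as class `j`, and the function name `forward_corrected_loss` is hypothetical.

```python
import math


def forward_corrected_loss(probs, noisy_label, T):
    """Cross-entropy against the noisy label after mapping predictions through T.

    probs:       predicted probabilities over the true classes (sums to 1).
    noisy_label: index of the observed (possibly corrupted) label.
    T:           assumed noise transition matrix, T[i][j] = P(noisy=j | true=i).
    """
    n_true = len(probs)
    n_noisy = len(T[0])
    # Predicted distribution over noisy labels: q_j = sum_i probs_i * T[i][j]
    q = [sum(probs[i] * T[i][j] for i in range(n_true)) for j in range(n_noisy)]
    return -math.log(q[noisy_label])
```

For c classes the matrix has c × c entries, which gives a sense of why its estimation becomes harder as the number of classes grows.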

Sukhbaatar et al. explore the performance of CNNs trained on noisy labels. The CNNs are modified by including an additional fully connected layer at the end of the network architecture to adapt the predictions to the noisy label distribution of the data [28]. In [29], the data is partitioned into multiple subsets that are used to train different classifiers. If all of the classifiers agree on a label from the original training set, the label is updated with the agreed-upon prediction. This process is repeated over several stages, which gradually improve the overall performance. Nonetheless, this process depends very much on the quality of the classifiers, and it can take a long time when large and complex training sets are used. Dehghani et al. use a student model that is trained on noisy data along with a teacher model that exploits the noise structure that is extracted by the student model [30]. However, this approach needs training samples with both clean and noisy labels, which are not always available. On the other hand, [31] proposes a self-error-correcting CNN that can work on completely noisy data. The network swaps potentially noisy labels with the most probable prediction of the network, while simultaneously optimizing the model parameters. Failing to distinguish hard samples from noisy ones is a common problem when fixing labels, which may result in an undesired situation where the model swaps labels by mistake. To address this issue, additional policies are often needed. Wu et al. use semantic bootstrapping to detect noisy samples and remove them from training [32]. Removing as few samples as possible is an important aspect of the noisy sample removal process to avoid unnecessary information loss. On the other hand, [33] trains a DNN on a noisy dataset and filters out the noisy labels while keeping the samples for training in an unsupervised fashion. This prevents losing training samples, and thus allows DNNs to learn from more examples. MentorNet [34] imposes a data-driven sample-weighting curriculum on the student network to select samples that are probably clean. Han et al. propose a learning paradigm called Co-Teaching, on which we base our collaborative model [35]. Under Co-Teaching, two networks are trained simultaneously and choose batches including only clean training samples to feed each other. Each network back-propagates the mini-batch that is chosen by its peer network to update itself. Han et al. demonstrate that Co-Teaching is an effective method when dealing with label noise.

In [36], the authors propose a Co-Teaching method with an additional discrepancy measurement to make the used networks diverge. Two peer networks that learn different features of the same probability distribution are used. Then, both networks select clean samples for each other based on their predictions to mutually improve their performance. Ren et al. propose a meta-learning algorithm, in which a weighting factor is assigned to every training sample based on its gradient direction [37]. Although this method achieves impressive results, it needs a clean validation set, which may not always be available. One of the advantages of the Co-Teaching methods for modeling the label noise structure is that they can be decoupled from the training process and the application domain. Such a decoupled label noise estimation process allows Co-Teaching approaches to be used and tested with different network architectures. However, the success of the above-mentioned works mostly depends on the accurate estimation of the noise distribution in the data, which restricts the applicability and generalization capability of the methods.
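The sample exchange at the core of Co-Teaching, as described above, can be sketched in a few lines. This is a minimal illustration under assumptions: per-sample losses of the two networks are taken as given, and the names `small_loss_indices`, `co_teaching_exchange`, and the fixed `keep_ratio` are hypothetical and not taken from [35] or [36].

```python
def small_loss_indices(losses, keep_ratio):
    """Indices of the keep_ratio fraction of samples with the smallest loss."""
    n_keep = max(1, int(len(losses) * keep_ratio))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


def co_teaching_exchange(losses_f, losses_g, keep_ratio):
    """Each network is updated on the samples its peer considers clean."""
    train_f_on = small_loss_indices(losses_g, keep_ratio)  # selected by g, used by f
    train_g_on = small_loss_indices(losses_f, keep_ratio)  # selected by f, used by g
    return train_f_on, train_g_on


# Example: per-sample losses of two networks on a mini-batch of four images.
f_idx, g_idx = co_teaching_exchange([0.1, 2.0, 0.3, 0.2], [1.5, 0.2, 0.1, 0.4], 0.5)
```

The selection relies on the small-loss assumption, i.e., that samples with low loss values are more likely to carry clean labels.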

B. Methods Robust to Noise on Multi-Labeled Images

Most of the above-mentioned works are designed to address only one type of noise (i.e., wrong class label assignment), and thus it is not easy to extend or apply these methods in the framework of the MLC of images without large modifications. There are also a few works presented in the literature that address multi-label noise problems. As an example, Ghosh et al. study different loss functions such as categorical cross-entropy (CCE), mean square error (MSE), and mean absolute error (MAE) for noisy multi-label classification, and argue that MAE is more robust to label noise compared to the others [38]. In [39], it is stated that MAE shows poor performance with more complex training sets, and a set of robust loss functions that combine CCE and MAE is proposed. This is known as a noise-robust alternative to CCE in the literature [38]. Meta-learning and ensemble methods are other approaches proposed to address the label noise in MLC problems. Li et al. aim to find noise-tolerant model parameters using a teacher model along with a student model to make accurate predictions by optimizing a meta-objective, which encourages the student model to give results consistent with the teacher model after introducing synthetic labels [40].
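The three loss functions compared above can be written out for a single sample to see one reason why MAE is less sensitive to a wrong label: it stays bounded while CCE grows without limit. The sketch below is an illustration under assumptions: it uses a one-hot target and a probability vector that sums to one, and it computes MAE and MSE as sums over classes rather than means.

```python
import math


def cce(probs, target):
    """Categorical cross-entropy for a one-hot target."""
    return -math.log(probs[target])


def mae(probs, target):
    """Absolute error summed over classes (L1 distance to the one-hot target)."""
    return sum(abs((1.0 if k == target else 0.0) - p) for k, p in enumerate(probs))


def mse(probs, target):
    """Squared error summed over classes."""
    return sum(((1.0 if k == target else 0.0) - p) ** 2 for k, p in enumerate(probs))


# A confident prediction for class 1 scored against a (possibly wrong) label 0.
probs = [0.001, 0.999]
losses = {"cce": cce(probs, 0), "mae": mae(probs, 0), "mse": mse(probs, 0)}
```

Under these assumptions the summed MAE and MSE can never exceed 2, whereas CCE increases as the predicted probability of the labeled class approaches zero.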

Bucak et al. present a ranking-based multi-label learning method that exploits the group lasso to enhance the accuracy with incomplete class assignments [41]. The method first computes two error values for each sample: one associated with the predicted classes in the multi-label set and one for the unpredicted classes within the same multi-label set. The two errors are then combined to define the ranking error for each unpredicted class. This error value indicates the possible missing class for the related sample. Finally, all the ranking errors associated with that sample are summed to define a final ranking error. The group lasso is introduced by [42] as an extension to the regular lasso for grouping points of interest together for accurate prediction in regression. It is effective because it groups variables together to be included or excluded completely, as opposed to the lasso, which only selects variables individually. The group lasso is used to regularize the network in [41], giving empirically robust results against missing class labels. Durand et al. propose a modified binary cross-entropy loss, which reduces the negative effects of missing class labels on training [14]. Jain et al. address the problem of missing class labels by defining a novel loss function called propensity scored loss (PSL) [13]. PSL can not only prioritize the few most relevant labels over the irrelevant majority, but also provide an unbiased estimation of the true labels of a sample without omitting the missing labels completely.
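The difference between the lasso and the group lasso penalties described above can be sketched as follows. This is a generic illustration of the two regularizers, not the ranking loss of [41] or of CCML; the function names and the square-root group-size weighting (a common convention for the group lasso) are assumptions.

```python
import math


def lasso_penalty(weights):
    """Sum of absolute values: shrinks and selects variables individually."""
    return sum(abs(w) for w in weights)


def group_lasso_penalty(weights, groups):
    """Sum of per-group Euclidean norms: a group is kept or dropped as a whole.

    groups: list of index lists that partition the weights.
    """
    total = 0.0
    for g in groups:
        norm = math.sqrt(sum(weights[i] ** 2 for i in g))
        total += math.sqrt(len(g)) * norm  # group-size weighting (assumed)
    return total
```

A group whose weights are all zero contributes nothing to the penalty, and with groups of size one the group lasso penalty reduces to the lasso penalty.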

As mentioned before, two types of noise (which are associated with wrong and missing class label assignments) may exist when training images annotated with multi-labels are considered. However, all the aforementioned methods are designed to overcome only one type of label noise in the training data (either missing or wrong class label assignments). Thus, these methods are not capable of identifying and correcting the two noise types simultaneously, which is an important limitation in MLC. To address this issue, in this paper we propose a novel collaborative learning method that aims at training a DL model robust to the two types of label noise based on a group lasso module without any prior assumption in the framework of MLC of RS images.

III. PROPOSED CCML

The proposed CCML method consists of four main modules: 1) discrepancy module; 2) group lasso module; 3) flipping module; and 4) swap module. The discrepancy module allows the two networks that are used collaboratively in the method to learn different feature sets, while ensuring consistent predictions. The group lasso module aims at identifying potentially noisy class labels by computing a sample-wise ranking loss and penalizing noisy labeled images. The flipping module flips the noisy labels approved by the two collaborative networks. Finally, the swap module aims at exchanging the ranking information between the networks by taking the Binary Cross Entropy (BCE) and the ranking loss functions into consideration and excluding the detected noisy samples from back-propagation. The pseudo-code of the proposed method is presented in Algorithm 1.

CCML follows the principle of a collaborative framework called Co-Training to exchange loss information between networks and to select samples associated with small loss values from the training set. Co-Training is a semi-supervised learning technique that was first proposed in [43] to overcome the insufficiency of labeled data. Inspired by [36], we use two CNNs with the same architecture, which are enhanced with discrepancy modules to make them learn independent sets of features while attaining the same class distribution. Therefore, it becomes easier for the networks to find their potential faults. This greatly improves the ability to select training images with clean labels, since the two networks are forced to learn different features and correct each other by exchanging their loss information. It is worth noting that our proposed collaborative model is architecture-independent, since it does not rely on any specific network architecture. However, each deep neural network has different characteristics, and therefore the position of the discrepancy module may differ from network to network. Another important property of the proposed algorithm is its modularity. The flipping and the swap modules can be completely decoupled from the selected classification algorithm and from each other. Thus, these modules are applicable in the framework of any classification algorithm.

Algorithm 1: Consensual Collaborative Multi-label Learning

for epoch = 1, ..., E do
    for mini-batch = 1, ..., B do
        Calculate the ranking loss;
        if flipping = True then
            Flip the noisy labels;
            Recalculate the ranking loss;
        end
        Calculate the BCE loss;
        Combine the BCE loss and the ranking loss;
        Get only samples associated with low loss values;
        Swap the ranking information;
        Update $\theta$ and $\hat{\theta}$;
    end
end

A. Problem Formulation

We consider a multi-label image classification problem witha training set D = {(x1, y1), ..., (xn, yn)}, where xi denotesthe i th training image labeled by a multi hot vector yi =(y1i , ..., y

mi ) ∈ {0, 1}m representing the corresponding label

over m classes, where yki = 1 if k ∈ yi, and yki = 0 ifk /∈ yi. We use two identical CNNs that are represented as fand g with parameters θ and θ, respectively. We employ BCEas our classification loss function for each network, which aredenoted by Lf and Lg as shown below:

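$$
\begin{aligned}
L_f(x_i) &= -\sum_{k=1}^{m} \Big[ y_i^k \cdot \log\big(f_{\theta}(x_i^k)\big) + (1 - y_i^k) \cdot \log\big(1 - f_{\theta}(x_i^k)\big) \Big], \\
L_g(x_i) &= -\sum_{k=1}^{m} \Big[ y_i^k \cdot \log\big(g_{\hat{\theta}}(x_i^k)\big) + (1 - y_i^k) \cdot \log\big(1 - g_{\hat{\theta}}(x_i^k)\big) \Big],
\end{aligned}
\tag{1}
$$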

where $f_{\theta}(x_i^k)$ and $g_{\hat{\theta}}(x_i^k)$ are the predictions of the corresponding networks for class $k$ in $x_i$.

In a collaborative learning framework, the networks must learn independent features while predicting the same class distribution. The networks are required to be capable of fixing each other's mistakes during training by exchanging information. If they do not learn diverse features on the same mini-batch, they cannot mutually improve each other's predictions. To achieve the desired diversity, we embed a discrepancy module in our method, which includes two loss functions: the disparity loss and the consistency loss. The details of the discrepancy module are described in the following section.
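As an illustration of the per-image BCE loss in (1), the following is a minimal sketch in plain Python. The function name `bce_per_sample` and the clipping constant `eps` are assumptions introduced here for numerical safety; they are not part of the described method.

```python
import math

def bce_per_sample(y, p, eps=1e-7):
    """Multi-label BCE of Eq. (1) for a single image.

    y: multi-hot list of 0/1 labels over the m classes.
    p: predicted probabilities, one per class.
    eps: assumed clipping constant that keeps log() finite.
    """
    total = 0.0
    for y_k, p_k in zip(y, p):
        p_k = min(max(p_k, eps), 1.0 - eps)  # avoid log(0)
        total += y_k * math.log(p_k) + (1 - y_k) * math.log(1.0 - p_k)
    return -total

# A confident, correct prediction yields a smaller loss than a wrong one.
good = bce_per_sample([1, 0, 1], [0.9, 0.1, 0.8])
bad = bce_per_sample([1, 0, 1], [0.1, 0.9, 0.2])
```

The same function would be evaluated once with the predictions of $f$ and once with those of $g$ to obtain $L_f$ and $L_g$.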

B. Discrepancy Module

The discrepancy module consists of two loss functions: i) the disparity loss ($L_D$); and ii) the consistency loss ($L_C$). The former ensures that the networks learn distinct features, and it is inserted between the chosen layers of each network. The layers are chosen depending on the network architecture, and the disparity loss function can be placed anywhere in the network. The consistency loss, on the other hand, ensures that the final predictions of the networks are similar. The networks should learn diverse and distinct features while predicting the same class distribution. Hence, we place the consistency loss component at the end of the networks to reduce the discrepancy and achieve comparable predictions. The final loss functions that are minimized by $f$ and $g$ are therefore defined as:

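$$
\begin{aligned}
Loss_f^b &= \frac{\sum_{i=1}^{R} L_f(x_i)}{R} + \lambda_1 \cdot L_C - \lambda_2 \cdot L_D, \\
Loss_g^b &= \frac{\sum_{i=1}^{R} L_g(x_i)}{R} + \lambda_1 \cdot L_C - \lambda_2 \cdot L_D,
\end{aligned}
\tag{2}
$$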

where $b$ represents the mini-batch, and $R$ represents the number of selected samples associated with low loss values in the mini-batch $b$. $\lambda_1$ and $\lambda_2$ are the weights of $L_C$ and $L_D$, respectively. Because $L_D$ enters with a negative sign, minimizing the loss maximizes the disparity. The disparity component takes the logits of the layers that precede it as input and calculates the disparity loss $L_D$. It is worth noting that the discrepancy module is placed between the two networks; thus, its input consists of the logits of both networks. The calculated loss is defined as:

$$
L_D = M(F_i, G_i), \qquad F_i = f_{\theta(1:l)}(X_i), \quad G_i = g_{\hat{\theta}(1:l)}(X_i),
\tag{3}
$$

where $M$ is the chosen discrepancy module, and $F_i$ and $G_i$ represent the logits of the last layers before the component. $l$ denotes the layer where the disparity component is placed, and $\theta(1:l)$ and $\hat{\theta}(1:l)$ stand for the parameters of the networks up to layer $l$. $X_i$ represents a batch of training images. The consistency component is placed after the last layer of the networks, and it takes the logits of the last layer as input:

$$
L_C = M(\hat{F}_i, \hat{G}_i),
\tag{4}
$$

where $\hat{F}_i$ and $\hat{G}_i$ denote the logits of the last layer of the networks, analogous to (3). In general, any statistical distance function that measures the difference between two probability distributions can be used as the discrepancy module $M$. Distance measures have a wide range of application areas. In machine learning, however, they are commonly used as loss functions that measure the distance between the model predictions and the labels [44,45]. Measures such as the Maximum Mean Discrepancy (MMD) [46] and the Wasserstein metric [47] are commonly used to drive two distributions apart from one another.

Fig. 2: Block diagram of the proposed discrepancy module. $f$ and $g$ represent the networks. $F_i$ and $G_i$ stand for the intermediate logits of the networks. $\hat{F}_i$ and $\hat{G}_i$ stand for the logits of the last layers. $M$ is the discrepancy component.

Since it has been shown that MMD can be used to disentangle probability distributions [36], we exploit the same principles for MMD in our discrepancy modules. In detail, MMD is a distance measure between two distributions with a quadratic computational cost, and it is defined as the distance between the mean embeddings of the distributions in a Reproducing Kernel Hilbert Space (RKHS) [48]. In other words, if the mean values of the functional mappings of the distributions to a high-dimensional space are close to each other, the distance between the distributions is small. The MMD between two distributions $p$ and $q$ can be defined as:

$$
\mathrm{MMD}(p, q) = \|\mu_p - \mu_q\|_{H},
\tag{5}
$$

where $\mu_p$ and $\mu_q$ denote the mean embeddings (expected feature mappings) of the distributions $p$ and $q$, respectively, $H$ represents the RKHS, and $\|\cdot\|_{H}$ represents the norm of $H$. Using the kernel trick to avoid the explicit mapping of the distributions, an empirical estimate of the MMD between $p$ and $q$ can be written as:

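$$
\mathrm{MMD}(p, q) = \frac{1}{m^2} \left( \sum_{i=1}^{m} \sum_{t=1}^{m} k(x_i^p, x_t^p) - 2 \cdot \sum_{i=1}^{m} \sum_{t=1}^{m} k(x_i^p, x_t^q) + \sum_{i=1}^{m} \sum_{t=1}^{m} k(x_i^q, x_t^q) \right),
\tag{6}
$$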

where $x_i^p$ and $x_i^q$ are samples from the respective distributions. Inspired by [36], we define the radial basis function kernel [49] as:

$$
k(s_1, s_2) = \exp\left( -\frac{\|s_1 - s_2\|^2}{\sigma} \right).
\tag{7}
$$
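The empirical estimate in (6) with the kernel in (7) can be sketched as follows. This is a minimal, unoptimized illustration in plain Python that assumes both sample sets have the same size $m$ and that each sample is a flat list of numbers; the function names are hypothetical. The expression in (6) is commonly read as an estimate of the squared distance in (5).

```python
import math

def rbf_kernel(s1, s2, sigma):
    """Radial basis function kernel of Eq. (7)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(s1, s2))
    return math.exp(-sq_dist / sigma)

def mmd(xp, xq, sigma=1.0):
    """Empirical MMD estimate of Eq. (6) for two equally sized sample lists."""
    m = len(xp)
    if m == 0 or m != len(xq):
        raise ValueError("both sample sets must have the same non-zero size")
    # Three double sums over all sample pairs (quadratic cost in m).
    k_pp = sum(rbf_kernel(a, b, sigma) for a in xp for b in xp)
    k_pq = sum(rbf_kernel(a, b, sigma) for a in xp for b in xq)
    k_qq = sum(rbf_kernel(a, b, sigma) for a in xq for b in xq)
    return (k_pp - 2.0 * k_pq + k_qq) / (m * m)

near = mmd([[0.0, 0.0], [1.0, 0.0]], [[0.1, 0.0], [1.1, 0.0]])
far = mmd([[0.0, 0.0], [1.0, 0.0]], [[5.0, 5.0], [6.0, 5.0]])
```

In this sketch, `near` is smaller than `far`, reflecting that sample sets with similar embeddings yield a small discrepancy.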

C. Group Lasso Module

The BCE loss function is a widely used objective function in multi-label learning. However, when the training set contains label noise, the networks may become biased toward the noise in the training set and perform poorly. Thus, an additional mechanism is necessary to prevent the model from being misguided by noisy training sets. This additional mechanism can take many forms, such as regularization, exclusion of images with noisy labels, or noise correction. Inspired by [41], we introduce a ranking error function capable of dealing with different types of noise in a multi-label scenario (missing and wrong class label assignments) without any prior assumption. To this end, the ranking error approach proposed for missing class labels in [41] is extended to identify wrong class label assignments as well. In addition, we do not use our ranking error function as a regularizer. Instead, we use it along with the BCE loss function to detect noisy labels within a mini-batch and exclude them from back-propagation. Excluding training images with noisy labels from back-propagation prevents overfitting to an erroneous training set. The motivation behind using group lasso is to identify potentially noisy labels in the training set, which provides the opportunity to correct them. Furthermore, it provides information about the type of label noise. Let $E_{k,l}$ denote the ranking error function:

$$
E_{k,l}(x_i) = \max\big(0,\; 2 \cdot (f_l(x_i) - f_k(x_i)) + 1\big),
\tag{8}
$$

where $k$ represents a class index labeled by 1 and $l$ represents a class index labeled by 0 for image $x_i$ in the given batch. $f_k(x_i)$ and $f_l(x_i)$ denote the predictions for the classes $k$ and $l$, respectively. The ranking error function gives a measure of the potential noise in a class combination, as can be observed in Table I. In the case of a clean prediction, the ranking error is equal to 0; otherwise, the ranking error function returns a positive value indicating label noise.

The key idea behind the ranking error function is that if $k \in y_i$ and $l \notin y_i$, then $f_k(x_i) > f_l(x_i)$ should hold. However, the trustworthiness of the ranking error function depends strongly on the predictions of the model. Hence, using $E_{k,l}(x_i)$ at the beginning of the training process, when the networks are unstable, may lead to misleading results.


TABLE I: Ranking error function values

| Noise | $f_l$ | $f_k$ | $E$ |
|---|---|---|---|
| Clean label | 0 | 1 | 0 |
| Noise on missing label | 0 | 0 | +1 |
| Noise on wrong label | 1 | 1 | +1 |
| Missing label and wrong label | 1 | 0 | +3 |
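The values in Table I follow directly from (8). The short sketch below (the function name `ranking_error` is introduced here for illustration) evaluates the ranking error for the four idealized prediction pairs listed in the table.

```python
def ranking_error(f_k, f_l):
    """Eq. (8): f_k is the prediction for a class labeled 1,
    f_l is the prediction for a class labeled 0."""
    return max(0.0, 2.0 * (f_l - f_k) + 1.0)

# The four idealized cases of Table I, given as (f_l, f_k) pairs.
table_one = {
    "clean label": (0.0, 1.0),
    "noise on missing label": (0.0, 0.0),
    "noise on wrong label": (1.0, 1.0),
    "missing label and wrong label": (1.0, 0.0),
}
errors = {name: ranking_error(f_k, f_l) for name, (f_l, f_k) in table_one.items()}
```

Only the clean case, in which the assigned class is ranked above the unassigned one with a sufficient margin, produces a zero error.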

The ranking errors of the class combinations are aggregated into two loss terms using the group lasso to identify the potential label noise. The first loss term calculates an aggregated loss that considers missing class labels, and the second term calculates an aggregated loss for wrong class label assignments. This approach allows our algorithm to rank noisy training samples according to their noise rate and noise type by adjusting the importance factors ($\alpha$ and $\beta$) of the loss terms. The combined ranking loss per sample is defined as:

$$
Lasso_f(x_i) = \alpha \cdot \sum_{l=a+1}^{m} \sqrt{\sum_{k=1}^{a} E_{k,l}^2(x_i)} + \beta \cdot \sum_{k=b+1}^{m} \sqrt{\sum_{l=1}^{b} E_{k,l}^2(x_i)},
\tag{9}
$$

where $a$ and $b$ denote the number of assigned labels and unassigned ones in an image label, respectively. $Lasso_g(x_i)$ is computed analogously to $Lasso_f(x_i)$.

Ranking errors per class are stored before they are summed up into the ranking loss per sample, which allows noisy labels to be identified. The identified labels are afterward passed to the flipping module for correction. There may be more than one noisy label in an image, whether it is a missing or a wrong label assignment. However, locating each of them correctly is a difficult task, and locating them wrongly may have detrimental effects on training. Thus, we select only one label per image, namely the one with the highest ranking error, for the flipping process. Finally, the group lasso module returns a ranking loss and a potential noisy class for each sample in a mini-batch. The returned ranking losses are added to the BCE loss, weighted by their importance factors, to determine the noisy labels. Then the potential noisy labels are sent to the flipping module.
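A minimal sketch of the ranking loss in (9) for a single image is given below. It assumes that the first term forms one group per unassigned class (flagging possibly missing labels) and the second term one group per assigned class (flagging possibly wrong labels), and that the suspicious class is the one with the largest weighted group error. The selection rule and the function names are assumptions made for this illustration.

```python
import math

def ranking_error(f_k, f_l):
    """Eq. (8) for an assigned class k and an unassigned class l."""
    return max(0.0, 2.0 * (f_l - f_k) + 1.0)

def group_lasso(y, p, alpha=0.5, beta=1.0):
    """Return (ranking loss, index of the most suspicious class)."""
    assigned = [k for k, y_k in enumerate(y) if y_k == 1]
    unassigned = [l for l, y_l in enumerate(y) if y_l == 0]
    per_class = {}
    # One group per unassigned class: large if it outranks assigned classes.
    for l in unassigned:
        group = sum(ranking_error(p[k], p[l]) ** 2 for k in assigned)
        per_class[l] = alpha * math.sqrt(group)
    # One group per assigned class: large if unassigned classes outrank it.
    for k in assigned:
        group = sum(ranking_error(p[k], p[l]) ** 2 for l in unassigned)
        per_class[k] = beta * math.sqrt(group)
    loss = sum(per_class.values())
    suspect = max(per_class, key=per_class.get) if per_class else None
    return loss, suspect

loss, suspect = group_lasso([1, 0, 0], [0.9, 0.1, 0.8])
```

In this example, the unassigned third class receives a prediction almost as high as the assigned first class, so both groups involving this pair contribute to the loss.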

D. Flipping Module

The flipping module consists of two components: i) the noisy class selector (NCS); and ii) the noisy class flipper (NCF). The NCS takes the previously calculated ranking losses and potential noisy labels as input. When relabeling potentially noisy labels, it is challenging to distinguish hard samples from noisy samples. To tackle this issue, the NCS component first compares the noisy labels predicted by the two networks and eliminates the predictions that are not common to both, which prevents relabeling false positives (i.e., labels that are not noisy). Once both networks agree upon the noisy labels, the ranking losses of these noisy labels from the two networks are combined. The NCS component then selects the samples with the largest ranking losses for flipping.

Let $l_i = (l_i^1, l_i^2, \ldots, l_i^n)$ and $\hat{l}_i = (\hat{l}_i^1, \hat{l}_i^2, \ldots, \hat{l}_i^n)$ be the ranking losses, and $c_i = (c_i^1, c_i^2, \ldots, c_i^n)$ and $\hat{c}_i = (\hat{c}_i^1, \hat{c}_i^2, \ldots, \hat{c}_i^n)$ be the chosen potential noisy labels of the two networks for every image in the $i$-th mini-batch of size $n$. The noisy labels $C_i$ that are passed on to the NCF are chosen as follows:

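$$
\begin{aligned}
I_i &= \left\{ \frac{l_i^k + \hat{l}_i^k}{2} \;\middle|\; k \in \{1, \ldots, n\},\; c_i^k = \hat{c}_i^k \right\}, \\
C_i &= \left\{ c_i^a = \hat{c}_i^a \;\middle|\; a \in D_i \right\},
\end{aligned}
\tag{10}
$$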

where $D_i$ represents the set of indices of the $k$ largest elements in $I_i$.

The aim of the NCF component is to flip the noisy labels identified by the previous component. Flipping means that if a noisy class is labeled with 0, the NCF turns the label into 1, and if the noisy class is labeled with 1, the NCF turns the label into 0. After the flipping is applied, an additional group lasso component (see Sec. III-C) recalculates the ranking losses for the mini-batch with the flipped labels. The flipped labels are also used for the BCE loss calculation. An example of the mechanism of the flipping module is illustrated in Fig. 3.

As mentioned before, the success of the algorithm in finding and fixing the noisy labels depends on the accuracy of the network predictions, which are in general erroneous at the beginning of the training phase. Thus, if the flipping module starts flipping noisy labels while the networks are unstable, it may flip the labels of hard but clean samples. To avoid that, the flipping module is initiated only after the networks are relatively stable. The point at which flipping starts is a parameter that can be set by analysing the learning curves of the networks. It is also important to point out that the number of flipped labels is significant for the success of the module. As the networks learn from each other's mistakes, the number of consented labels may increase, and a high flipping percentage may have negative effects.
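The two components of the flipping module can be sketched as follows. The sketch assumes that each network provides one candidate class and one ranking loss per sample, that agreement means identical candidate classes, and that the number of flips is obtained by rounding the flipping rate times the number of agreed samples. The rounding rule, the function names, and the example values are assumptions made for illustration.

```python
def select_noisy(cand_f, cand_g, loss_f, loss_g, flip_rate):
    """NCS sketch: keep samples on whose candidate class both networks
    agree, then return those with the largest averaged ranking loss."""
    agreed = [
        (0.5 * (lf + lg), idx, cf)
        for idx, (cf, cg, lf, lg) in enumerate(zip(cand_f, cand_g, loss_f, loss_g))
        if cf == cg
    ]
    agreed.sort(key=lambda item: item[0], reverse=True)
    n_flip = int(round(flip_rate * len(agreed)))
    return [(idx, cls) for _, idx, cls in agreed[:n_flip]]

def flip_labels(labels, selected):
    """NCF sketch: invert the selected class of each selected sample."""
    flipped = [row[:] for row in labels]  # leave the input untouched
    for idx, cls in selected:
        flipped[idx][cls] = 1 - flipped[idx][cls]
    return flipped

# Five samples; the networks agree on the candidates of samples 1, 2 and 4.
cand_f = [16, 10, 4, 7, 13]
cand_g = [2, 10, 4, 11, 13]
loss_f = [0.2, 0.7, 0.7, 0.3, 0.9]
loss_g = [0.1, 0.7, 0.9, 0.2, 0.6]
selected = select_noisy(cand_f, cand_g, loss_f, loss_g, flip_rate=2 / 3)
labels = [[0] * 17 for _ in range(5)]
flipped = flip_labels(labels, selected)
```

With a flipping rate of 2/3, two of the three agreed candidates are flipped, namely those with the largest averaged ranking losses.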

Fig. 3: A qualitative example of flipping noisy labels in the flipping module: $l_i$ and $\hat{l}_i$ represent the ranking losses. $c_i$ and $\hat{c}_i$ denote the potential noisy class indexes. NCS outputs the consented noisy labels $C_i$ and combines the ranking losses from both networks. NCF gets the labels $y_i$ as input and flips a percentage of the consented noisy labels with the largest ranking losses. In this example, the percentage of flipping is set to 2/3.


E. Swap Module

The swap module adds the ranking losses, which are calculated by (9), to the BCE loss, and selects the $R$ samples that are associated with the lowest loss values for each network. The ranking information of the samples associated with the $R$ lowest loss values is exchanged between the two networks: the network $f$ uses the $R$ samples associated with the lowest loss values identified by the network $g$ to update its weights, and vice versa. Swapping ranking losses can be denoted as follows:

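$$
\begin{aligned}
B_i^f &= \frac{\sum_{i=1}^{R} \big[ L_f(x_i) + \gamma \cdot Lasso_f(x_i) \big]}{R}, \quad x_i \in m_g^R, \\
B_i^g &= \frac{\sum_{i=1}^{R} \big[ L_g(x_i) + \gamma \cdot Lasso_g(x_i) \big]}{R}, \quad x_i \in m_f^R,
\end{aligned}
\tag{11}
$$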

where $B_i^f$ and $B_i^g$ are the compound losses for each network, and $m_f^R$ and $m_g^R$ are the samples associated with the lowest loss values according to the networks $f$ and $g$, respectively. $\gamma$ is the trade-off parameter representing the strength of $Lasso_f$ and $Lasso_g$ for selecting samples with low loss values. A visualization of the process is given in Fig. 4.
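The exchange in (11) can be sketched for one mini-batch as follows. The per-sample BCE and ranking losses are assumed to be given as plain lists, and the number of kept samples is derived from a keep ratio; the function names and the example values are hypothetical.

```python
def small_loss_indices(losses, keep):
    """Indices of the `keep` samples with the lowest loss values."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:keep])

def swapped_batch_losses(bce_f, lasso_f, bce_g, lasso_g, gamma=0.2, keep_ratio=0.75):
    """Eq. (11) sketch: each network averages its own compound loss over
    the low-loss samples selected by the other network."""
    keep = max(1, int(keep_ratio * len(bce_f)))
    total_f = [b + gamma * r for b, r in zip(bce_f, lasso_f)]
    total_g = [b + gamma * r for b, r in zip(bce_g, lasso_g)]
    picked_by_f = small_loss_indices(total_f, keep)
    picked_by_g = small_loss_indices(total_g, keep)
    # The swap: f trains on the samples picked by g, and vice versa.
    loss_f = sum(total_f[i] for i in picked_by_g) / keep
    loss_g = sum(total_g[i] for i in picked_by_f) / keep
    return loss_f, loss_g, picked_by_f, picked_by_g

loss_f, loss_g, picked_by_f, picked_by_g = swapped_batch_losses(
    bce_f=[0.1, 0.2, 0.3, 5.0], lasso_f=[0.0, 0.0, 0.0, 0.0],
    bce_g=[4.0, 0.2, 0.1, 0.3], lasso_g=[0.0, 0.0, 0.0, 0.0],
)
```

Note that a sample that one network regards as high-loss can still enter its update if the other network ranks it among the low-loss samples, which is the intended cross-check between the networks.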

Fig. 4: A qualitative example to describe the swap module. The two networks exchange the ranking information. Then, each network updates its weights using the samples with smaller loss values that are found by the opposite network.

IV. DATASET DESCRIPTION AND EXPERIMENTAL SETUP

In this section, we describe the datasets that are used in the experiments and present the experimental setup, including the network architecture, the selected values of the hyperparameters, and the considered synthetic label noise injection approach.

A. Dataset Description

We conducted experiments on two different benchmark RS datasets. The first dataset is the Ireland subset of the BigEarthNet benchmark archive [8] (denoted as IR-BigEarthNet), which consists of 15894 Sentinel-2 multispectral images acquired between June 2017 and May 2018 over Ireland. Each image was annotated with multiple land-cover classes provided by the 2018 CORINE Land Cover Map (CLC) inventory. Sentinel-2 images contain 13 spectral bands with varying spatial resolutions. Each image in IR-BigEarthNet is a section of: i) 120×120 pixels for the bands that have a spatial resolution of 10m; ii) 60×60 pixels for the bands that have a spatial resolution of 20m; and iii) 20×20 pixels for the bands that have a spatial resolution of 60m. In the experiments, we used the bands with 10m and 20m spatial resolutions (and thus 10 bands per image in total), excluding the two 60m bands due to their very low spatial resolution and small image size in pixels. Cubic interpolation was applied to the 20m bands of each image so that all bands have the same pixel size. In the experiments, we exploited the BigEarthNet-19 land-cover class nomenclature proposed in [50] and eliminated 5 classes that are represented by a very small number of images in the dataset, leading to 12 classes in total. The number of labels associated with each image varies between 1 and 7, while 97.4% of the images contain fewer than 5 labels. IR-BigEarthNet was divided into a validation set of 3839 images, a test set of 3856 images, and a training set of 8192 images. In the experiments, the images that are fully covered by seasonal snow, cloud and cloud shadow were not used, as suggested in [8]. An example of images from IR-BigEarthNet together with their multi-labels is given in Fig. 5. Table II shows the number of training, validation and test samples associated with each considered class in IR-BigEarthNet.

Fig. 5: An example of images with their multi-labels from the IR-BigEarthNet dataset. Multi-labels of the depicted images: "Pastures, Mixed forest"; "Pastures, Intertidal flats, Sea and ocean"; "Peatbogs"; "Sea and ocean"; "Road and rail networks and associated land, Pastures"; "Pastures, Moors and heathland, Sea and ocean"; "Pastures"; "Pastures, Mixed forest".

TABLE II: Number of samples per class in training, validation and test sets of the IR-BigEarthNet dataset.

| Class | Train | Val | Test |
|---|---|---|---|
| Urban fabric | 818 | 369 | 371 |
| Arable lands | 2860 | 1404 | 1397 |
| Pastures | 5723 | 2725 | 2739 |
| Complex cultivation patterns | 510 | 240 | 226 |
| Land principally occupied by agriculture, with significant areas of natural vegetation | 896 | 420 | 414 |
| Broad-leaved forest | 410 | 205 | 199 |
| Coniferous forest | 1156 | 545 | 524 |
| Mixed forest | 588 | 273 | 296 |
| Moors, heathland and sclerophyllous vegetation | 467 | 217 | 219 |
| Transitional woodland, shrub | 708 | 345 | 369 |
| Inland wetlands | 811 | 400 | 415 |
| Marine waters | 2087 | 894 | 900 |

The second dataset is the UC Merced Land Use archive (denoted as UCMERCED), which consists of 2100 images selected from aerial orthoimagery and downloaded from the USGS National Map of the following US regions: Birmingham, Boston, Buffalo, Columbus, Dallas, Harrisburg, Houston, Jacksonville, Las Vegas, Los Angeles, Miami, Napa, New York, Reno, San Diego, Santa Barbara, Seattle, Tampa, Tucson, and Ventura [22]. Each image has a size of 256×256 pixels and a spatial resolution of 30 cm. In the experiments, we used the multi-label annotations of the UCMERCED images that were obtained based on visual inspection [51]. The total number of class labels is 17, while the number of labels associated with each image varies between 1 and 7. The number of training, validation and test samples in the multi-label UCMERCED dataset is given in Table III. Fig. 6 shows an example of images together with their associated multi-labels from the UCMERCED dataset.

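TABLE III: Number of samples per class in training, validation and test sets of the UCMERCED dataset.

| Class | Train | Val | Test |
|---|---|---|---|
| Airplane | 70 | 15 | 15 |
| Bare Soil | 506 | 103 | 109 |
| Buildings | 482 | 102 | 107 |
| Cars | 631 | 125 | 130 |
| Chaparral | 77 | 21 | 17 |
| Court | 73 | 16 | 16 |
| Dock | 70 | 15 | 15 |
| Field | 73 | 15 | 15 |
| Grass | 683 | 145 | 147 |
| Mobile home | 70 | 17 | 15 |
| Pavement | 917 | 189 | 194 |
| Sand | 210 | 43 | 41 |
| Sea | 70 | 15 | 15 |
| Ship | 70 | 16 | 16 |
| Tanks | 70 | 15 | 15 |
| Trees | 702 | 155 | 152 |
| Water | 142 | 31 | 30 |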

B. Experimental Setup

In the experiments, we used ResNet [52], which is a well-established network architecture. Among the different versions of ResNet, each of which includes a different number of layers, ResNet50 with 50 layers was chosen. Moreover, we used improved residual units [53] in our network, which enhance the generalization performance of the network and make training easier by introducing additional non-linearities. We utilized the Adam optimizer [54] with a learning rate of $10^{-3}$. The batch sizes for IR-BigEarthNet and UCMERCED were set to 256 and 32, respectively. Using large batch sizes reduces the training time and makes convergence more stable by reducing the influence of noise. We report the results obtained after 100 epochs of training. The hyperparameter tuning and the initial tests were conducted on an NVIDIA Tesla P100 GPU with 16 GB of RAM, while the model training and further experiments were conducted on a Tesla V100 GPU with 32 GB of RAM.

Fig. 6: An example of images with their multi-labels from the UCMERCED dataset. Multi-labels of the depicted images: "Airplane, Grass, Pavement"; "Bare soil, Buildings, Grass"; "Cars, Grass, Pavement"; "Dock, Ship, Water"; "Sand, Sea"; "Trees".

Within the swap module of the proposed CCML method, the networks swap the calculated low-loss sample information with each other. According to this information, each network chooses a certain number of low-loss samples of the opposite network and updates its weights by using them. The goal of this process is to remove the noisy samples from back-propagation. Without knowing the exact amount and distribution of the noise in the labels, optimizing this parameter is very difficult. On the one hand, excluding a very high number of samples from back-propagation degrades the accuracy of the predictions, since the networks may not be able to learn diverse features from an insufficient number of samples. On the other hand, excluding only a very small number of samples also affects the accuracy adversely, because the networks may overfit to the noise present in the data. According to our observations, updating the weights using the 75% of the samples associated with small loss values at each iteration provides the best trade-off.

The discrepancy modules have an essential role in the success of the proposed CCML method. The ability of the networks to learn diverse features from the data depends on the discrepancy modules. In our proposed CCML, we used MMD for the discrepancy modules. The optimal value of $\sigma$ in the RBF kernel was determined as 10.000. We observed that low $\sigma$ values were ineffective for measuring the distance between the logits of the two networks used in the proposed method. Using the RBF kernel, the discrepancy module first maximizes the diversity to learn diverse features, and then minimizes it to learn the same class distributions. The module achieves this result by adding the respective losses, namely $L_C$ and $L_D$, to the final loss terms of the networks, weighted by $\lambda_1$ and $\lambda_2$, respectively. The value of $\lambda_1$ was set to 0.25, and the value of $\lambda_2$ was set to 0.5. The diversity component of the module was placed approximately within the second quarter of the convolutional layers. Forcing the networks to diverge at an early stage and to converge thereafter teaches them distinct feature sets while keeping the predictions consistent.
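As a small worked example of how the weights enter the final loss in (2), the sketch below combines the averaged BCE loss of the selected samples with the consistency and disparity losses, using the reported values of 0.25 and 0.5 as defaults. The function name and the input values are hypothetical.

```python
def total_loss(selected_bce, l_c, l_d, lambda1=0.25, lambda2=0.5):
    """Eq. (2) sketch: mean BCE over the R selected samples, plus the
    weighted consistency loss, minus the weighted disparity loss."""
    r = len(selected_bce)
    return sum(selected_bce) / r + lambda1 * l_c - lambda2 * l_d

base = total_loss([1.0, 3.0], l_c=0.4, l_d=0.2)
more_disparity = total_loss([1.0, 3.0], l_c=0.4, l_d=0.6)
```

A larger disparity loss lowers the total loss, so minimizing the total loss pushes the intermediate features of the two networks apart while the consistency term pulls their final predictions together.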


An additional mechanism along with the BCE loss is necessary to identify the potentially noisy samples. For this purpose, a ranking loss was calculated for every sample and added to the BCE loss with a weighting factor that was set to 0.2 as a result of our experiments. The parameters $\alpha$ and $\beta$ control the effects of the two different noise types on the calculation of the ranking loss. A group lasso module with a higher $\alpha$ concentrates more on finding the missing class labels, whereas a higher $\beta$ gives a heavier weight to detecting the extra class label assignments. An extra class label assignment was more harmful than a missing class label in our setting, where every sample was annotated with only a minority of the classes. Thus, we set $\alpha$ to 0.5 and $\beta$ to 1.0.

As mentioned in Section III, it is crucial to choose the right time to start flipping potentially noisy labels, because the networks are not stable at a very early stage of the training, and thus the predictions may not be accurate. Therefore, an early initiation of the flipping module may cause erroneous flips. To prevent this, we started the flipping process after 90% of the epochs had been completed. Furthermore, the flipping module flips labels only when the two networks agree that the label is noisy. As the networks learn from each other and become more stable, the number of agreed labels may increase. However, the networks may agree on the labels of hard classes instead of noisy ones, and flipping too many of them would decrease the performance of the model. To avoid this problem, the flipping module flips only 5% of the agreed classes in every iteration after it is initiated.

Fig. 7: The considered label noise injection approach: Random Noise per Sample (RNS) with a sampling rate of 0.5 and a class rate of 0.5. The colored cells represent the introduced artificial label noise. Cells in blue represent "0" values in the ground truth labels that are flipped to "1", while those in red represent "1" values in the ground truth labels that are flipped to "0".

To verify the effectiveness of the proposed method, we added synthetic noise to the labels of the IR-BigEarthNet and UCMERCED datasets. To ensure that both types of label noise (missing label and wrong label) are introduced to the multi-label training set, we designed a label noise injection approach called Random Noise per Sample (RNS). RNS randomly chooses samples from each mini-batch according to a predefined percentage called the sampling rate. Afterward, to apply noise, RNS randomly flips a certain number of the labels of each selected sample. The number of flipped labels is governed by a parameter that we call the class rate. An illustration of the considered label noise injection approach is shown in Fig. 7. The figure shows an example scenario for RNS with a sampling rate of 0.5 and a class rate of 0.5: three samples out of six are selected, and label noise is randomly applied to half of their labels. In our experiments, the ratio of the introduced noise was varied from 0% to 50%.
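The RNS procedure described above can be sketched as follows for a single mini-batch of multi-hot labels. The rounding of the two rates to whole numbers of samples and classes, the `seed` argument, and the function name are assumptions made for this illustration.

```python
import random

def random_noise_per_sample(labels, sampling_rate, class_rate, seed=0):
    """RNS sketch: pick a fraction of the samples at random and flip a
    fraction of the class labels of each picked sample."""
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    n, m = len(labels), len(labels[0])
    noisy = [row[:] for row in labels]  # keep the clean labels intact
    n_samples = int(round(sampling_rate * n))
    n_classes = int(round(class_rate * m))
    for i in rng.sample(range(n), n_samples):
        for k in rng.sample(range(m), n_classes):
            noisy[i][k] = 1 - noisy[i][k]  # 0 -> 1 (wrong) or 1 -> 0 (missing)
    return noisy

# Six samples with ten classes each, as in the scenario of Fig. 7.
clean = [[0] * 10 for _ in range(6)]
noisy = random_noise_per_sample(clean, sampling_rate=0.5, class_rate=0.5)
changed = [sum(a != b for a, b in zip(r, s)) for r, s in zip(clean, noisy)]
```

With both rates set to 0.5, three of the six samples are altered and each altered sample has five flipped labels.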

The results of the experiments are reported in terms of three performance metrics: 1) precision ($P$); 2) recall ($R$); and 3) $F_1$ score, all with the micro averaging strategy. For an explanation of these metrics, the reader is referred to [6]. Since the proposed CCML includes two networks that run simultaneously, the network with the best $F_1$ score on the validation set was chosen for evaluation. To further study the performance and the behavior of the proposed CCML, the class-based $F_1$ scores are reported in comparative plots.
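For reference, micro averaging pools the true positives, false positives and false negatives over all classes and images before computing the metrics. A minimal sketch, with hypothetical names and toy inputs, is:

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for multi-hot labels."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

p, r, f1 = micro_prf([[1, 0, 1], [0, 1, 0]], [[1, 1, 0], [0, 1, 0]])
```

In this toy example there are two true positives, one false positive and one false negative, so all three metrics equal 2/3.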

We compared our proposed method with ResNet50 as the baseline. Since our proposed CCML is architecture-independent and can be used with any network architecture, we selected ResNet50 as the base model due to its proven effectiveness. For a fair comparison, we realized the CCML modules within the ResNet50 architecture. We trained CCML with the same number of training epochs and the same hyperparameters as used for the baseline model.

V. EXPERIMENTAL RESULTS

This section reports comparative results for the baseline model (ResNet50) and the proposed CCML method. We report the experimental results on the IR-BigEarthNet and multi-label UCMERCED datasets, followed by a class-wise performance analysis.

TABLE IV: Precision, Recall and F1 scores obtained by the proposed CCML and the baseline [52] using the IR-BigEarthNet dataset under different noise rates.

| Noise Rate | Precision (%) Baseline | Precision (%) CCML | Recall (%) Baseline | Recall (%) CCML | F1 (%) Baseline | F1 (%) CCML |
|---|---|---|---|---|---|---|
| 0% | 79.4 | 88.4 | 73.1 | 69.9 | 76.2 | 78.0 |
| 10% | 87.6 | 89.1 | 69.0 | 69.3 | 77.2 | 78.0 |
| 20% | 87.8 | 90.2 | 68.7 | 68.7 | 77.1 | 78.0 |
| 30% | 84.0 | 88.2 | 67.2 | 68.9 | 74.7 | 77.4 |
| 40% | 76.4 | 88.4 | 65.1 | 69.3 | 70.3 | 77.7 |
| 50% | 62.5 | 87.5 | 57.6 | 62.1 | 60.0 | 72.6 |

A. Results on Multi-Label IR-BigEarthNet Dataset

The results on IR-BigEarthNet are presented in Table IV. The results show that the proposed method outperforms the base model in terms of $F_1$ score at every noise rate. Especially under extreme noise rates such as 40% and 50%, CCML achieves on average 10 percentage points higher $F_1$ scores. The performance of both models does not decrease significantly with increasing noise rates up to 30%. The reason is that the number of training images for each class is still sufficient for the networks to learn and predict them correctly, despite the introduced label noise.

For a better understanding, we select six representative classes from IR-BigEarthNet and report their class-based $F_1$ scores in Fig. 8. The class-based $F_1$ scores in Fig. 8 reveal that both the baseline model and the proposed CCML learn the classes represented by sufficient training images better than the classes that do not include a sufficient number of training images. The classes with a sufficient number of training images (e.g., "Marine waters") obtain the highest $F_1$ scores under different noise rates. Even high noise rates, such as 40%, do not significantly reduce the $F_1$ scores of these classes. The classes with the highest number of training images still have many images to learn from, despite the high noise rates applied. On the other hand, the classes with insufficient training images (e.g., "Complex cultivation patterns", and "Moors, heathland and sclerophyllous vegetation") receive low $F_1$ scores over the different noise rates.

Fig. 8: Different noise rates versus class-based F1 scores obtained by the proposed CCML and the baseline for the classes: (a) Pastures; (b) Urban fabric; (c) Marine waters; (d) Inland wetlands; (e) Moors, heathland and sclerophyllous vegetation; and (f) Complex cultivation patterns in the IR-BigEarthNet dataset. (x-axis: injected noise rate (%); y-axis: F1 score.)

The proposed CCML outperforms the baseline model in the majority of classes, especially under high noise rates, and obtains comparable results in the remaining cases. Both models perform poorly in the "Complex cultivation patterns" class, which is among the classes without sufficient training images, with an average $F_1$ score of about 0.03. In some cases (e.g., "Inland wetlands"), the performances of the baseline and CCML are comparable at low label noise rates. However, as the noise rate increases, CCML keeps the $F_1$ scores relatively high compared to the baseline, which demonstrates the stability of CCML under extreme noise rates.

TABLE V: Precision, Recall and F1 scores obtained by the proposed CCML and the baseline [52] using the UCMERCED dataset under different noise rates.

| Noise Rate | Precision (%) Baseline | Precision (%) CCML | Recall (%) Baseline | Recall (%) CCML | F1 (%) Baseline | F1 (%) CCML |
|---|---|---|---|---|---|---|
| 0% | 74.2 | 69.6 | 69.8 | 64.1 | 71.9 | 66.7 |
| 10% | 71.9 | 67.7 | 70.8 | 66.3 | 71.4 | 67.0 |
| 20% | 69.9 | 68.5 | 62.3 | 70.1 | 65.7 | 69.3 |
| 30% | 70.5 | 72.2 | 65.8 | 70.6 | 68.0 | 71.4 |
| 40% | 58.7 | 64.1 | 61.4 | 71.9 | 59.7 | 67.7 |
| 50% | 32.3 | 61.2 | 44.9 | 68.2 | 37.5 | 64.5 |

B. Results on Multi-Label UCMERCED Dataset

The comparative evaluation results for the multi-label UCMERCED dataset are presented in Table V. In the experiments on the multi-label UCMERCED dataset, increasing noise rates severely deteriorate the learning process of the baseline model. As shown in Table V, noise makes the baseline model unstable. On the other hand, the proposed CCML provides relatively stable performance and achieves a certain degree of robustness against high label noise rates. Although the base model performs better than the proposed CCML under lower noise rates, CCML achieves considerably higher $F_1$ scores at noise rates of 40% and 50%. We should note that ResNet50 is a powerful model, capable of tolerating even a relatively high noise rate, but it is not stable under extreme noise rates. However, employing ResNet50 as a classifier within our proposed collaborative framework makes it more stable and robust against higher noise rates.

Fig. 9: Different noise rates versus class-based F1 scores obtained by the proposed CCML and the baseline for the classes: (a) Cars; (b) Pavement; (c) Airplane in the UCMERCED dataset. (x-axis: injected noise rate (%); y-axis: F1 score.)

A comparison between Table IV and Table V reveals that lower $F_1$ scores are obtained on the multi-label UCMERCED dataset than on IR-BigEarthNet under the same experimental setup. The reason for this is the number of training images. The UCMERCED dataset includes 1470 training images, while IR-BigEarthNet includes 8192 training images. The availability of a large number of training images in IR-BigEarthNet makes it easier for the classifiers to learn the underlying class distribution from the training set.

In Fig. 9, three representative classes from the UCMERCED dataset are chosen to analyze the obtained class-based $F_1$ scores. The class-based $F_1$ scores show a similar pattern for the classes with a sufficient number of training images and the classes without enough training images. For example, the "Airplane" class contains the lowest number of training images among all classes (70 images), which is not considered sufficient. The "Pavement" class contains the highest number of training samples in the training set (more than 900 images). As shown in Fig. 9, the performance of the baseline model is not stable under different noise rates. For example, in the case of the "Airplane" class, the $F_1$ scores of the baseline model are unstable, while those of the proposed CCML are relatively stable.

Similar to the IR-BigEarthNet results in Fig. 8, the F1 scores of the classes with a sufficient number of training images are the highest among all classes. The F1 scores of the baseline model for the lower noise rates are higher than those of CCML since the baseline model uses all the training samples, while CCML excludes 25% of the training samples. This becomes more critical in the case of the UCMERCED dataset, which has a small number of training samples.

VI. CONCLUSION AND DISCUSSION

In this work, we have proposed a novel Consensual Collaborative Multi-label Learning (CCML) method to overcome the adverse effects of multi-label noise in the context of scene classification of RS images. The proposed method includes four main modules: 1) group lasso module; 2) discrepancy module; 3) flipping module; and 4) swap module. The group lasso module detects the individual noisy labels assigned to the training image, while the discrepancy module is devoted to ensuring that two collaborative networks learn different features while obtaining the same predictions. The proposed CCML method exploits the flipping module to correct the identified noisy labels and the swap module to exchange the ranking information between two networks and exclude noisy samples from back-propagation dynamically. The CCML method automatically identifies the different multi-label noise types that are associated with missing and wrong class label annotations. To the best of our knowledge, CCML is the first method to simultaneously tackle the adverse effects of the two types of multi-label noise without making any prior assumption.
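The exchange of ranking information can be illustrated with a minimal sketch. It assumes that samples are ranked by their per-sample loss within a mini-batch and that each network back-propagates only on the samples that the other network ranks as low-loss. The helper names `keep_indices` and `swap_selection`, the loss values, and the use of a fixed exclusion ratio are illustrative assumptions rather than the exact CCML implementation.

```python
def keep_indices(losses, exclude_ratio=0.25):
    """Indices of the lowest-loss samples after dropping the top exclude_ratio."""
    num_keep = len(losses) - int(len(losses) * exclude_ratio)
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:num_keep])


def swap_selection(losses_a, losses_b, exclude_ratio=0.25):
    """Swap the rankings: each network trains on the other network's selection."""
    train_a = keep_indices(losses_b, exclude_ratio)  # network A uses B's ranking
    train_b = keep_indices(losses_a, exclude_ratio)  # network B uses A's ranking
    return train_a, train_b


# Hypothetical per-sample losses of the two networks on a mini-batch of four images.
losses_a = [0.2, 1.5, 0.1, 0.4]
losses_b = [0.3, 0.2, 0.1, 2.0]
train_a, train_b = swap_selection(losses_a, losses_b)
print(train_a)  # [0, 1, 2]
print(train_b)  # [0, 2, 3]
```

With an exclusion ratio of 25%, one of the four samples is excluded for each network. Network A skips the sample that network B finds hardest (index 3), and network B skips the one that network A finds hardest (index 1).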

The performance of the proposed method was evaluated under different noise rates on two publicly available multi-label benchmark RS image archives: IR-BigEarthNet and UCMERCED. The experimental results confirm the effectiveness of the proposed method in specific settings where deterministic label noise is introduced to the multi-label training sets. Furthermore, CCML is more robust than the baseline model under high-noise regimes (e.g., at label noise rates of 30% and more).

We would like to point out that developing efficient techniques for handling label noise in multi-label training sets is becoming more and more important. On the one hand, due to the increased volume of RS image archives, manual large-scale image labeling is time-demanding and costly (and thus not fully feasible). On the other hand, zero-cost labeling through available thematic products can introduce label noise to the training set. In this context, the proposed CCML is very promising as it allows the identification of potential multi-label noise within the training set without any prior assumption. We emphasize that this is a very important advantage because one of the main objectives of noise-robust methods is to be applicable and generalizable to any label noise distribution. Furthermore, the proposed method is intrinsically classifier-independent. Although in our framework it is implemented in the context of CNNs (because of their efficiency for RS image classification), it can be adapted easily to other classifiers or network architectures. This can be done by a proper adjustment of each module together with its input parameters.

As a final remark, it is worth noting that the proposed CCML provides much more accurate results when an adequate number of training images for each class is available. This may not always be possible, and the datasets can be imbalanced in operational RS MLC applications. As future work, to address this issue we plan to develop an adaptive class-weighted loss function. Another option is to consider data augmentation techniques.
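To make the idea of class weighting concrete, the following minimal sketch shows a binary cross-entropy loss with static inverse-frequency class weights. It is only one possible starting point under stated assumptions. It is not the adaptive loss planned as future work, and the function names `class_weights` and `weighted_bce` as well as the toy labels are hypothetical.

```python
import math


def class_weights(y_true):
    """Inverse-frequency weights: rare classes receive larger weights."""
    n = len(y_true)
    num_classes = len(y_true[0])
    counts = [sum(row[c] for row in y_true) for c in range(num_classes)]
    # max(count, 1) avoids division by zero for classes without positives.
    return [n / (num_classes * max(count, 1)) for count in counts]


def weighted_bce(y_true, probs, weights, eps=1e-7):
    """Class-weighted binary cross-entropy, averaged over the images."""
    total = 0.0
    for t_row, p_row in zip(y_true, probs):
        for t, p, w in zip(t_row, p_row, weights):
            p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
            total += -w * (t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Toy labels: class 0 is frequent (4 images), class 1 is rare (1 image).
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
weights = class_weights(y_true)
print(weights)  # [0.5, 2.0]

probs = [[0.5, 0.5]] * 4  # an uninformative classifier
loss = weighted_bce(y_true, probs, weights)
```

Here the rare class contributes four times as much to the loss per label as the frequent class. An adaptive variant could update the weights during training instead of fixing them from the label counts.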

REFERENCES

[1] L. Bruzzone and B. Demir, “A review of modern approaches to classification of remote sensing data,” in Land Use and Land Cover Mapping in Europe: Practices & Trends, 1st ed., I. Manakos and M. Braun, Eds. Springer, Dordrecht, 2014, ch. 9, pp. 127–143.

[2] I. Shendryk, Y. Rist, R. Lucas, P. Thorburn, and C. Ticehurst, “Deep learning - a new approach for multi-label scene classification in planetscope and sentinel-2 imagery,” in IEEE International Geoscience and Remote Sensing Symposium, 2018, pp. 1116–1119.

[3] A. Zeggada, F. Melgani, and Y. Bazi, “A deep learning approach to UAV image multilabeling,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5, pp. 694–698, 2017.

[4] Y. Hua, L. Mou, and X. X. Zhu, “Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional lstm network for multi-label aerial image classification,” Journal of Photogrammetry and Remote Sensing, vol. 149, pp. 188–199, 2019.

[5] A. Alshehri, Y. Bazi, N. Ammour, H. Almubarak, and N. Alajlan, “Deep attention neural network for multi-label classification in unmanned aerial vehicle imagery,” IEEE Access, vol. 7, pp. 119 873–119 880, 2019.

[6] G. Sumbul and B. Demir, “A deep multi-attention driven approach for multi-label remote sensing image classification,” IEEE Access, vol. 8, pp. 95 934–95 946, 2020.

[7] H. Yessou, G. Sumbul, and B. Demir, “A comparative study of deep learning loss functions for multi-label remote sensing image classification,” in IEEE International Geoscience and Remote Sensing Symposium, 2020.

[8] G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “Bigearthnet: A large-scale benchmark archive for remote sensing image understanding,” IEEE International Geoscience and Remote Sensing Symposium, pp. 5901–5904, 2019.

[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.

[10] G. Buttner, J. Feranec, G. Jaffrain, L. Mari, G. Maucha, and T. Soukup, “The corine land cover 2000 project,” EARSeL eProceedings, vol. 3, no. 3, pp. 331–346, 2004.

[11] C. Paris and L. Bruzzone, “A novel approach to the unsupervised extraction of reliable training samples from thematic products,” IEEE Transactions on Geoscience and Remote Sensing, pp. 1–19, 2020.

[12] G. Jaffrain, C. Sannier, A. Pennec, and H. Dufourmont, “Corine land cover 2012 - final validation report,” European Environment Agency, Tech. Rep., 2017. [Online]. Available: https://land.copernicus.eu/user-corner/technical-library/clc-2012-validation-report-1

[13] H. Jain, Y. Prabhu, and M. Varma, “Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 935–944.

[14] T. Durand, N. Mehrasa, and G. Mori, “Learning a deep convnet for multi-label classification with partial labels,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 647–657.

[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” European Conference on Computer Vision, pp. 740–755, 2014.

[16] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.

[17] K. Yi and J. Wu, “Probabilistic end-to-end noise correction for learning with noisy labels,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7017–7025.

[18] P. Ulmas and I. Liiv, “Segmentation of satellite imagery using u-net models for land cover classification,” arXiv preprint arXiv:2003.02899, 2020.

[19] Y. Hua, S. Lobry, L. Mou, D. Tuia, and X. X. Zhu, “Learning multi-label aerial image classification under label noise: A regularization approach using word embeddings,” in IEEE International Geoscience and Remote Sensing Symposium, 2020.

[20] H. Song, M. Kim, D. Park, and J.-G. Lee, “Learning from noisy labels with deep neural networks: A survey,” arXiv preprint arXiv:2007.08199, 2020.

[21] B. Frenay and M. Verleysen, “Classification in the presence of label noise: A survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, 2014.

[22] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010, pp. 270–279.

[23] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, “Making deep neural networks robust to label noise: A loss correction approach,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1944–1952.

[24] J. Goldberger and E. Ben-Reuven, “Training deep neural-networks using a noise adaptation layer,” 2016.

[25] A. P. Dawid and A. M. Skene, “Maximum likelihood estimation of observer error-rates using the em algorithm,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 28, no. 1, pp. 20–28, 1979.

[26] C. M. Teng, “Evaluating noise correction,” in PRICAI 2000 Topics in Artificial Intelligence, R. Mizoguchi and J. Slaney, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2000, pp. 188–198.

[27] H. Noh, T. You, J. Mun, and B. Han, “Regularizing deep neural networks by noise: Its interpretation and optimization,” in Advances in Neural Information Processing Systems, 2017, pp. 5109–5118.

[28] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus, “Training convolutional networks with noisy labels,” arXiv preprint arXiv:1406.2080, 2014.

[29] B. Yuan, J. Chen, W. Zhang, H.-S. Tai, and S. McMains, “Iterative cross learning on noisy labels,” in IEEE Winter Conference on Applications of Computer Vision, 2018, pp. 757–765.

[30] M. Dehghani, A. Mehrjou, S. Gouws, J. Kamps, and B. Scholkopf, “Fidelity-weighted learning,” arXiv preprint arXiv:1711.02799, 2017.

[31] X. Liu, S. Li, M. Kan, S. Shan, and X. Chen, “Self-error-correcting convolutional neural network for learning with noisy labels,” in IEEE International Conference on Automatic Face & Gesture Recognition, 2017, pp. 111–117.

[32] X. Wu, R. He, Z. Sun, and T. Tan, “A light cnn for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.

[33] D. T. Nguyen, T.-P.-N. Ngo, Z. Lou, M. Klar, L. Beggel, and T. Brox, “Robust learning under label noise with iterative noise-filtering,” arXiv preprint arXiv:1906.00216, 2019.

[34] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, “Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels,” in International Conference on Machine Learning, 2018, pp. 2304–2313.

[35] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” in Advances in Neural Information Processing Systems, 2018, pp. 8527–8537.

[36] Y. Han, S. Roy, L. Petersson, and M. Harandi, “Learning from noisy labels via discrepant collaborative training,” in Winter Conference on Applications of Computer Vision, 2020, pp. 3169–3178.

[37] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” arXiv preprint arXiv:1803.09050, 2018.

[38] A. Ghosh, H. Kumar, and P. Sastry, “Robust loss functions under label noise for deep neural networks,” arXiv preprint arXiv:1712.09482, 2017.

[39] Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Advances in Neural Information Processing Systems, 2018, pp. 8778–8788.

[40] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli, “Learning to learn from noisy labeled data,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5051–5059.

[41] S. S. Bucak, R. Jin, and A. K. Jain, “Multi-label learning with incomplete class assignments,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2801–2808.

[42] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.

[43] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Conference on Computational Learning Theory, 1998, pp. 92–100.

[44] C. Liu and H.-Y. Shum, “Kullback-leibler boosting,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2003, pp. I–I.

[45] B. Bigi, “Using kullback-leibler distance for text categorization,” in European Conference on Information Retrieval. Springer, 2003, pp. 305–319.

[46] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Scholkopf, and A. Smola, “A kernel two-sample test,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 723–773, 2012.

[47] S. Vallender, “Calculation of the wasserstein distance between probability distributions on the line,” Theory of Probability & Its Applications, vol. 18, no. 4, pp. 784–786, 1974.

[48] M. A. Alvarez, L. Rosasco, and N. D. Lawrence, “Kernels for vector-valued functions: A review,” arXiv preprint arXiv:1106.6251, 2011.

[49] J.-P. Vert, K. Tsuda, and B. Scholkopf, “A primer on kernel methods,” Kernel Methods in Computational Biology, vol. 47, pp. 35–70, 2004.

[50] G. Sumbul, J. Kang, T. Kreuziger, F. Marcelino, H. Costa, P. Benevides, M. Caetano, and B. Demir, “Bigearthnet dataset with a new class-nomenclature for remote sensing image understanding,” arXiv, pp. arXiv–2001, 2020.

[51] B. Chaudhuri, B. Demir, S. Chaudhuri, and L. Bruzzone, “Multilabel remote sensing image retrieval using a semisupervised graph-theoretic method,” Transactions on Geoscience and Remote Sensing, vol. 56, no. 2, pp. 1144–1158, 2018.

[52] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[53] K. He, X. Zhang, S. Ren, and S. Jian, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.

[54] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.