URBAN LAND COVER CLASSIFICATION WITH MISSING DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS

Michael Kampffmeyer∗, Arnt-Børre Salberg†, Robert Jenssen∗†

∗Machine Learning Group, UiT–The Arctic University of Norway

†Norwegian Computing Center

ABSTRACT

Fusing different sensors with different data modalities is a common technique to improve land cover classification performance in remote sensing. However, all modalities are rarely available for all test data, and this missing data problem poses severe challenges for multi-modal learning. Inspired by recent successes in deep learning, we propose as a remedy a convolutional neural network architecture for urban remote sensing image segmentation trained on data modalities which are not all available at test time. We train our architecture with a cost function particularly suited for imbalanced classes, as this is a frequent problem in remote sensing. We demonstrate the method using a benchmark dataset containing RGB and DSM images. Assuming that the DSM images are missing during testing, our method outperforms both a CNN trained on RGB images as well as an ensemble of two CNNs trained on the RGB images, by exploiting the training-time information of the missing modality.

Index Terms— Deep learning, convolutional neural networks, remote sensing, missing data

1. INTRODUCTION

More than half of the world population now lives in cities, and 2.5 billion more people are expected to move into cities by 2050 [2]. Although constituting only a small percentage of global land cover, urban areas significantly alter climate, biogeochemistry, and hydrology at local, regional, and global scales. Thus, in order to support sustainable urban development, accurate information on urban land cover is needed.

A frequently studied topic in land cover classification, since it often leads to improved accuracy, is data fusion. Data fusion aims to integrate the information acquired with different spatial resolutions, spectral bands and imaging modes from sensors mounted on satellites, aircraft and ground platforms to produce fused data that contains more detailed information than each of the individual sources [3, 4]. However, often one of the data sources is missing due to e.g. cloud coverage, or simply lack of image data. To handle such missing data scenarios, several classification strategies have been investigated [5, 6, 7]. However, none of these classifiers are able to exploit the knowledge provided by all data sources to improve the accuracy when one or more data sources are missing.

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research and the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) for providing the Vaihingen data set [1]. This work was partially funded by the Norwegian Research Council FRIPRO grant no. 239844 on developing the Next Generation Learning Machines.

Another challenge that is often encountered when designing classifiers for land cover mapping is class imbalance. Land covers within the area of interest are often highly imbalanced: some land cover types are frequent, whereas others are rare. Moreover, many objects of interest in remote sensing are small compared to the overall image. A solution based on optimizing the overall classification accuracy is often not satisfactory, since "small" classes will often be suppressed [8]. In a recent paper [9], the influence of imbalanced classes was reduced by introducing weights to the cost function. Classes with few samples were given high weights, whereas classes with many samples were given small weights.

Due to the successes of deep learning architectures, convolutional neural networks (CNNs) have found increased use also in the field of remote sensing [9, 10], outperforming more traditional approaches [11]. The strength of CNNs is the ability to learn features that exploit the spatial context, and thereby provide land cover maps with high accuracy.

In this paper we propose a CNN for urban land cover mapping of small objects and imbalanced classes that, during the test phase, makes use of data modalities which are only available during training. The proposed system builds upon the hallucination network strategy proposed by Hoffman et al. [12] for training CNNs for object detection when one of the modalities is missing during testing.

2. DATASET

The performance has been evaluated using the ISPRS Vaihingen 2D semantic labeling benchmark dataset [13]. The dataset consists of 33 high-resolution true orthophoto images with a ground sampling distance of 9 cm and of varying size (ranging from approximately 3 million to 10 million pixels), with ground truth being available for 16 images. Additionally, the normalized DSM for each of the images was produced by Gerke [14]. The dataset contains six classes: impervious surface, buildings, low vegetation, trees, cars and background/clutter. To evaluate the approach and its use for missing data modalities in remote sensing, the normalized DSM was included only during the training phase.

We evaluate our results according to the ISPRS specification [13]. The F1-score is measured per class, F1 = 2 · precision · recall / (precision + recall), and the overall accuracy is the percentage of correctly labeled pixels. Following the specifications, the class boundaries were eroded with a disk of radius 3 and ignored in the evaluation to reduce boundary effects. For evaluation, the labeled part of the dataset was divided into a training set, validation set and test set containing 11, 2 and 3 images, respectively.
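
As a concrete illustration, this evaluation protocol could be sketched in Python as follows (a minimal sketch with our own function names, not the official ISPRS evaluation scripts; the disk structuring element assumes scikit-image is available):

```python
import numpy as np
from scipy.ndimage import binary_erosion
from skimage.morphology import disk

def evaluate(pred, gt, num_classes, boundary_radius=3):
    """Per-class F1 and overall accuracy, ignoring eroded class boundaries.

    pred, gt: 2D integer label maps of equal shape.
    """
    # Keep only pixels away from class boundaries: erode each class region
    # with a disk of the given radius and take the union of the results.
    valid = np.zeros(gt.shape, dtype=bool)
    structure = disk(boundary_radius)
    for c in range(num_classes):
        valid |= binary_erosion(gt == c, structure=structure)

    p, g = pred[valid], gt[valid]
    f1_scores = []
    for c in range(num_classes):
        tp = np.sum((p == c) & (g == c))
        fp = np.sum((p == c) & (g != c))
        fn = np.sum((p != c) & (g == c))
        precision = tp / max(tp + fp, 1)
        recall = tp / max(tp + fn, 1)
        f1_scores.append(2 * precision * recall / max(precision + recall, 1e-12))

    overall_acc = np.mean(p == g)
    return f1_scores, overall_acc
```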

3. APPROACH

In this section we describe the implementation and the training details of the proposed architecture.

3.1. Fully convolutional neural networks

The most common approaches for image segmentation using CNNs are currently based on the idea of fully convolutional neural networks (FCNs) [15], where all layers in the network are based on convolutions and do not make use of fully connected layers, as previous approaches did. These architectures consist of an encoder-decoder structure, where the encoder maps the input to a low-resolution representation and the decoder is responsible for mapping the representation to a pixel-wise prediction. As proposed by Long et al. [15], the layers in the decoder consist of fractionally strided convolutions (also referred to as deconvolutions).
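
As a minimal PyTorch illustration of this encoder-decoder idea (the layer widths and sizes here are arbitrary, not those of any particular FCN):

```python
import torch
import torch.nn as nn

# Toy fully convolutional encoder-decoder: the encoder downsamples via
# pooling, the decoder upsamples back to full resolution with a
# fractionally strided (transposed) convolution.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                             # 1/2 resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                             # 1/4 resolution
)
decoder = nn.ConvTranspose2d(32, 6, kernel_size=4, stride=4)  # back to full size

x = torch.randn(1, 3, 64, 64)     # dummy RGB patch
logits = decoder(encoder(x))      # (1, 6, 64, 64): per-pixel class scores
```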

3.2. Hallucination networks

Hallucination networks [12] are a recent attempt to use data modalities that are available solely during the training phase to improve object detection performance. This is done by adding an additional network, the hallucination network, alongside the networks for the existing modalities, and adding a loss that encourages the mid-level features of the hallucination network to mimic those of the data modalities that are missing during the test phase.

3.3. Median frequency balancing

Median frequency balancing is a weighted cross-entropy loss function that has been shown to yield good performance for imbalanced classes [16, 17, 9]. Each class in the loss function is weighted by the ratio of the median class frequency to the class frequency (computed over the training dataset), such that

L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c \in C} l_c^n \log(p_c^n) \, w_c ,   (1)

[Fig. 1 diagram: three networks, DepthNet, HallucinationNet and RgbNet, connected to the losses L_hallucinate, L_depth, L_hal, L_rgb, L_rgb+depth and L_rgb+hal.]

Fig. 1. Network architecture of the proposed method.

where

w_c = \frac{\mathrm{median}(\{f_c \mid c \in C\})}{f_c}   (2)

denotes the weight for class c, f_c the frequency of pixels in class c, p_c^n is the softmax probability of sample n being in class c, l_c^n corresponds to the label of sample n for class c when the label is given in one-hot encoding, C is the set of all classes, and N is the number of samples in the mini-batch.
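
A sketch of Eqs. (1)–(2) in PyTorch (our own illustration, not the authors' code; note that nn.CrossEntropyLoss normalizes by the sum of the weights of the contributing samples rather than by 1/N):

```python
import numpy as np
import torch
import torch.nn as nn

def median_frequency_weights(label_maps, num_classes):
    """Eq. (2): weight for class c is median class frequency / frequency of c."""
    counts = np.zeros(num_classes)
    for labels in label_maps:
        counts += np.bincount(labels.ravel(), minlength=num_classes)
    freq = counts / counts.sum()
    return np.median(freq) / np.maximum(freq, 1e-12)

# Dummy training label maps; in practice these are the ground-truth images.
train_labels = [np.random.randint(0, 6, size=(256, 256)) for _ in range(4)]
weights = median_frequency_weights(train_labels, num_classes=6)

# Cross-entropy with per-class weights implements the weighted loss of Eq. (1).
criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
```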

3.4. Small object segmentation for imbalanced classes with missing data

Our architecture for the individual networks follows the architecture of Kampffmeyer et al. [9] and consists of four sets of two 3 × 3 convolutional layers, each set followed by a 2 × 2 max-pooling layer and each individual convolution layer followed by a ReLU nonlinearity and a batch normalization layer [18]. The first convolutional layer has stride 2 due to memory restrictions during the test phase when considering the large images, whereas all other convolutions have stride 1. Three of these networks are trained jointly: one for the RGB image, one for the depth image, and the hallucination network. Figure 1 illustrates the complete architecture. Training is performed on image patches of size 256 × 256, which have been extracted from the original images with 50% overlap and have been flipped and rotated at 90-degree intervals as part of the data augmentation step.
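
A best-effort PyTorch sketch of one such per-modality network follows; the channel widths are our assumption, and the decoder that maps the features back to pixel-wise predictions is omitted for brevity:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    # 3x3 convolution followed by ReLU and batch normalization, as described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.ReLU(),
        nn.BatchNorm2d(out_ch),
    )

class ModalityNet(nn.Module):
    """Encoder used for each modality (RGB, depth, hallucination)."""

    def __init__(self, in_channels, widths=(64, 128, 256, 512)):  # assumed widths
        super().__init__()
        sets, prev = [], in_channels
        for i, w in enumerate(widths):
            sets.append(nn.Sequential(
                # The very first convolution has stride 2 (memory restrictions).
                conv_block(prev, w, stride=2 if i == 0 else 1),
                conv_block(w, w),
                nn.MaxPool2d(2),
            ))
            prev = w
        self.sets = nn.ModuleList(sets)

    def forward(self, x, return_activations=False):
        activations = []
        for s in self.sets:
            x = s(x)
            activations.append(x)   # mid-level features, used by Eq. (4)
        return (x, activations) if return_activations else x

rgb_net = ModalityNet(in_channels=3)
depth_net = ModalityNet(in_channels=1)
hal_net = ModalityNet(in_channels=3)   # the hallucination net also takes the RGB image
```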

The total loss consists of six individual losses and is

L = \gamma L_{hallucinate} + L_{depth} + L_{rgb} + L_{hal} + L_{rgb+depth} + L_{rgb+hal} ,   (3)

which is optimized using backpropagation. L_hallucinate, L_rgb, L_depth and L_hal are the hallucination loss, the loss of the RGB network, the loss of the depth network and the loss of the hallucination network, respectively, and L_rgb+depth and L_rgb+hal are the joint losses. γ is the weight parameter for the hallucination loss.

Method         | Imp Surf      | Building      | Low veg       | Tree          | Car           | Overall
               | F1     Acc    | F1     Acc    | F1     Acc    | F1     Acc    | F1     Acc    | Avg F1  Acc
RGB            | 90.38  87.26  | 91.23  86.43  | 81.29  72.27  | 89.86  86.95  | 90.93  83.82  | 88.74   83.50
RGB-ensemble   | 90.72  87.83  | 91.17  86.14  | 82.07  73.52  | 90.01  87.01  | 91.77  85.19  | 89.15   83.86
Hallucination  | 92.02  89.58  | 92.65  88.30  | 82.74  74.33  | 90.33  87.47  | 91.31  84.30  | 89.81   85.20

Table 1. Performance of the different models. The F1 scores and accuracies are shown as percentages.

Following Hoffman et al. [12], the hallucination loss is

L_{hallucinate} = \lVert \sigma(A_{depth}) - \sigma(A_{hal}) \rVert_2^2 ,   (4)

where σ(·) is the sigmoid function and A_(·) refers to the activations of the networks at a certain depth D. To prevent the depth features from adapting during the end-to-end training procedure, the learning rate for the layers before depth D is set to zero. The hallucination loss in our experiments is based on the activations after the third pooling layer, and median frequency balancing was used for all terms in the loss function except the hallucination loss.
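
Continuing from the ModalityNet sketch above, the hallucination loss of Eq. (4) could look as follows in PyTorch; detaching the depth activations is a simple stand-in for the paper's choice of zeroing the learning rate of the preceding depth layers:

```python
import torch

rgb = torch.randn(2, 3, 256, 256)    # dummy RGB patches
depth = torch.randn(2, 1, 256, 256)  # dummy depth patches

_, depth_acts = depth_net(depth, return_activations=True)
_, hal_acts = hal_net(rgb, return_activations=True)

# Eq. (4): squared L2 distance between sigmoids of the activations after the
# third pooling layer (index 2). The depth activations are detached so that
# this loss does not update the depth features.
a_depth = torch.sigmoid(depth_acts[2]).detach()
a_hal = torch.sigmoid(hal_acts[2])
loss_hallucinate = ((a_depth - a_hal) ** 2).sum()
```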

Training details. Training is performed by first training the RGB and depth networks separately and then fine-tuning the whole architecture end-to-end. Following the example of Hoffman et al. [12], the depth network was used to initialize the hallucination network, and the weight of the hallucination loss, γ, was set such that the hallucination loss is roughly 10 times the largest of the remaining loss terms in Eq. (3). To avoid large variations in the magnitude of the gradients, gradient clipping [19] is performed to clip outlier gradients to an acceptable range, which is determined by monitoring the gradients during training. Training is performed using Adam [20].
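
A schematic of one fine-tuning step with Adam and gradient clipping, continuing from the sketches above (the loss stand-in, learning rate, and clipping threshold are placeholders for illustration only):

```python
import torch

def total_loss_stub():
    # Placeholder for the full Eq. (3) objective; for brevity, only the
    # hallucination term from the previous sketch is computed here.
    rgb = torch.randn(2, 3, 256, 256)
    depth = torch.randn(2, 1, 256, 256)
    _, d_acts = depth_net(depth, return_activations=True)
    _, h_acts = hal_net(rgb, return_activations=True)
    return ((torch.sigmoid(d_acts[2]).detach() - torch.sigmoid(h_acts[2])) ** 2).sum()

params = [p for net in (rgb_net, depth_net, hal_net) for p in net.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)             # placeholder learning rate

optimizer.zero_grad()
loss = total_loss_stub()
loss.backward()
# Clip outlier gradients to a range determined by monitoring training.
torch.nn.utils.clip_grad_norm_(params, max_norm=10.0)     # placeholder threshold
optimizer.step()
```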

4. EXPERIMENTS AND RESULTS

To perform a fair comparison, the approach was compared to a CNN trained on the RGB image only, as well as an ensemble of two CNNs trained on the RGB image. For the ensemble, the softmax outputs of the two CNNs were averaged during the test phase.
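
The ensemble prediction amounts to averaging softmax outputs, e.g.:

```python
import torch

logits_a = torch.randn(1, 6, 64, 64)   # dummy outputs of the two RGB CNNs
logits_b = torch.randn(1, 6, 64, 64)

probs = (torch.softmax(logits_a, dim=1) + torch.softmax(logits_b, dim=1)) / 2
prediction = probs.argmax(dim=1)       # per-pixel class labels
```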

Table 1 shows that the hallucination network outperforms both the single RGB model and the ensemble when considering overall accuracy, and increases in accuracy can be observed in most classes, with large increases for the impervious surface class and the building class. This indicates that some of the additional information contained in the depth data benefits the test phase even though the depth data is missing during testing.

[Fig. 2 panels: (a) RGB image, (b) ground truth, (c) segmentation by the ensemble, (d) segmentation by the hallucination method.]

Fig. 2. Segmentation results for an image in the test dataset.

[Fig. 3 panels: (a) RGB image, (b) ensemble, (c) hallucination.]

Fig. 3. Closeup of the segmentation results for the bottom left corner of the image.

To illustrate some of the differences between the results achieved by the ensemble and the hallucination method, Figure 2 shows one of the RGB images from the test set, the ground truth, and the segmentations achieved by the proposed method with the median frequency balancing cost function as well as by the RGB ensemble. Both models perform well; however, comparing the results of the two models, it can be observed that the ensemble assigns more impervious surface pixels to the building class. This is as expected, as the colour and shape of some roof areas can be considered similar to impervious surfaces. Including the additional depth information, however, as done in our proposed method, allows a better separation between these classes, as the normalized depth measurements indicate differences between buildings and impervious surfaces. Figure 3, a close-up of the bottom right corner of the image in Figure 2, illustrates this difference clearly: large parts of the building get classified as impervious surface by the RGB ensemble. By making use of the depth information during the training phase, the proposed model is able to capture the building class more accurately.

For comparison, we also investigate the performance of a model where the RGB and depth data are available both during training and testing. The overall accuracy for this model is 86.44%, with the difference being most notable in the building class (92.33%). This corresponds to our intuition that the depth data is most useful for the building class, but also illustrates that the hallucination model, with regard to overall accuracy, is able to capture a significant part of the information contained in the depth data.

5. CONCLUSIONS

In this paper we have proposed a method for image segmentation in urban remote sensing that makes use of data modalities that are only available during the training phase. Our experiments show that the method performs better than both a single model using only the available data and an ensemble of two models. Additionally, by making use of the median frequency balancing cost function, we achieve good performance on small classes. We therefore consider the method an attractive choice for handling missing data in urban remote sensing.

6. REFERENCES

[1] M. Cramer, “The DGPF-test on digital airborne camera evaluation – overview and test design,” Photogrammetrie-Fernerkundung-Geoinformation, no. 2, pp. 73–82, 2010.

[2] United Nations, “World urbanization prospects: The 2014 revision,” 2015.

[3] J. Zhang, “Multi-source remote sensing data fusion: status and trends,” Int. J. Image Data Fusion, vol. 1, no. 1, pp. 5–24, 2010.

[4] F. Bovolo and L. Bruzzone, “The time variable in data fusion: a change detection perspective,” IEEE Geosci. Remote Sens. Mag., vol. 3, no. 3, pp. 8–26, 2015.

[5] A.-B. Salberg, “Land cover classification of cloud contaminated multitemporal high-resolution images,” IEEE Trans. Geosci. Remote Sensing, vol. 49, no. 1, pp. 377–387, 2011.

[6] A.-B. Salberg and R. Jenssen, “Land-cover classification of partly missing data using support vector machines,” Int. J. Remote Sensing, vol. 33, no. 14, pp. 4471–4481, 2012.

[7] B. A. Latif and G. Mercier, Self-Organizing Maps for Processing of Data with Missing Values and Outliers: Application to Remote Sensing Images, INTECH, 2010.

[8] A. Estabrooks, T. Jo, and N. Japkowicz, “A multiple resampling method for learning from imbalanced data sets,” Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.

[9] M. Kampffmeyer, A.-B. Salberg, and R. Jenssen, “Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks,” in Proc. IEEE Conf. Computer Vision Pattern Recognition Workshops, 2016, pp. 1–9.

[10] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, “Fully convolutional neural networks for remote sensing image classification,” in 2016 IEEE Int. Geosci. Remote Sensing Symp. (IGARSS), 2016, pp. 5071–5074.

[11] A. Lagrange, B. Le Saux, A. Beaupere, A. Boulch, A. Chan-Hon-Tong, S. Herbin, H. Randrianarivo, and M. Ferecatu, “Benchmarking classification of earth-observation data: from learning explicit features to convolutional networks,” in 2015 IEEE Int. Geosci. Remote Sensing Symp. (IGARSS), 2015, pp. 4173–4176.

[12] J. Hoffman, S. Gupta, and T. Darrell, “Learning with side information through modality hallucination,” in Proc. IEEE Conf. Computer Vision Pattern Recognition (CVPR), June 2016.

[13] “ISPRS 2D semantic labeling contest,” http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html.

[14] M. Gerke, “Use of the Stair Vision Library within the ISPRS 2D semantic labeling benchmark (Vaihingen),” Tech. Rep., University of Twente, 2015.

[15] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2015, pp. 3431–3440.

[16] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 2650–2658.

[17] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.

[18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.

[19] R. Pascanu, T. Mikolov, and Y. Bengio, “Understanding the exploding gradient problem,” Computing Research Repository (CoRR), abs/1211.5063, 2012.

[20] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.