2868 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 9, NO. 7, JULY 2016

Semantic Labeling of Aerial and Satellite Imagery

Sakrapee Paisitkriangkrai, Jamie Sherrah, Pranam Janney, and Anton van den Hengel

Abstract—Inspired by the recent success of deep convolutional neural networks (CNNs) and feature aggregation in the fields of computer vision and machine learning, we propose an effective approach to semantic pixel labeling of aerial and satellite imagery using both CNN features and hand-crafted features. Both CNN and hand-crafted features are applied to dense image patches to produce per-pixel class probabilities. Conditional random fields (CRFs) are applied as a postprocessing step. The CRF infers a labeling that smooths regions while respecting the edges present in the imagery. The combination of these factors leads to a semantic labeling framework which outperforms all existing algorithms on the International Society for Photogrammetry and Remote Sensing (ISPRS) two-dimensional Semantic Labeling Challenge dataset. We advance the state of the art by improving the overall accuracy to 88% on the ISPRS Semantic Labeling Contest. In this paper, we also explore the possibility of applying the proposed framework to other types of data. Our experimental results demonstrate the generalization capability of our approach and its ability to produce accurate results.

Index Terms—Aerial imagery, conditional random fields, convolutional neural networks, deep learning, satellite imagery and remote sensing, semantic labeling.

I. INTRODUCTION

AUTOMATED annotation of urban areas from overhead imagery plays an essential role in many photogrammetry and remote sensing applications, e.g., environmental modeling and monitoring, building and updating geographical databases, gathering of military intelligence, infrastructure planning, and land cover and change detection. Pixel labeling of aerial photography is one of the most challenging and important problems in remote sensing. The objective of pixel labeling is to assign an object class to each pixel in a given image. The task is challenging due to the heterogeneous appearance and high intraclass variance of objects such as buildings, streets, trees, and cars. Although many different algorithms have been proposed in the past [1]–[3], the pixel labeling task cannot be considered a solved problem. In this paper, we present a framework for semantic pixel labeling and discuss its performance on the ISPRS two-dimensional (2-D) semantic labeling challenge dataset [4], color infrared (CIR) imagery, and Red Green Blue (RGB) satellite imagery.

Manuscript received September 30, 2015; revised February 18, 2016, April 14, 2016, June 10, 2016, and June 15, 2016; accepted June 16, 2016. Date of publication July 18, 2016; date of current version August 12, 2016. This work was supported in part by the Australian Research Council Linkage Project LP130100156. (Corresponding author: Pranam Janney.)

S. Paisitkriangkrai and A. van den Hengel are with the Australian Centre for Visual Technology, The University of Adelaide, Adelaide, SA 5000, Australia (e-mail: [email protected]; [email protected]).

J. Sherrah and P. Janney are with the Defence Science and Technology Group, Department of Defence, Edinburgh, SA 5111, Australia (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSTARS.2016.2582921

Numerous photogrammetry and remote sensing applications, which make use of high-resolution geospatial images, have been developed as a result of hardware improvements and faster imaging methods [5]–[14]. Some of these applications include land use and land cover classification [15], [16], scene classification [5], coarse-grained classification [5], [14], building and tree detection [12], object-class detection [11], oil tank detection [13], object tracking [17], crop classification [8], identification of water-body types [6], visualization of bridges [10], and anomaly detection [9], [18]. In this paper, we address the problem of semantic pixel labeling of aerial and satellite imagery with a ground sampling distance (GSD) of less than 10 cm.

Semantic labeling is typically applied to multimedia images and involves dense classification followed by smoothing, for example, with a probabilistic graphical model. The traditional visual bag-of-words approach [19] extracts hand-crafted features which are clustered to form visual words, and boosting is used for classification. The success of this method relies on the initial choice of features. More recently, deep convolutional neural networks (CNNs) have been used to learn discriminative image features that are more effective than hand-crafted ones. CNNs have been used for semantic labeling of street scenes in [20].

In this paper, we apply CNNs to overhead imagery. We choose CNNs for the following reasons. First, CNN features can be extracted efficiently. We design our framework such that the entire test image is forward propagated only once. Since overhead images are typically very large, computational efficiency is a priority. Second, by augmenting the training data with various transformations, the CNN representation can be made robust to both translation and rotation. Since imagery could be captured at an arbitrary azimuth, the feature representation needs to be rotation-invariant. Complementary to the learned visual features, we make use of simple hand-crafted features proposed in a previous submission to the labeling challenge [21]. Combining CNN features with these simple hand-crafted features further boosts the labeling accuracy of our proposed approach slightly.

As a postprocessing step, a CRF is applied to the label probabilities. The CRF infers a globally consistent labeling that is locally smooth except at edges in the imagery, and can improve fragmented and marginal regions. In previous work on the ISPRS challenge [21], super-pixel CRFs were found not to increase accuracy. In this paper, we use a pixel-level CRF to avoid oversegmentation errors, and find that both the accuracy and the visual appeal of the labeling improve somewhat.

We further explore the possibilities of using a combined approach for semantic labeling of overhead imagery: learned features complemented by hand-crafted features. Subsequent sections detail the two approaches, experiments, and our observations. We also demonstrate the utility of combining both approaches.



This paper expands on our previous work presented at the EarthVision workshop [22]. The main developments since [22] are: 1) more technical insights on the learned features, 2) experimental evaluation of hard negative mining, 3) trialling the proposed approach on CIR data from other geographical locations, and 4) evaluation of our proposed approach on low-resolution RGB satellite imagery.

II. RELATED WORK

Several researchers have applied machine learning in order to annotate overhead imagery. However, due to the lack of high computing power in the past, most of these techniques were predominantly used for terrain classification, e.g., classifying an overhead image into forest, water, agricultural land, etc. Over recent years, advances in computing hardware and sensor technologies have made processing large amounts of high-resolution aerial imagery possible [2], [3], [5]–[14], [23]. Lately in computer vision, CNN features have been shown to outperform traditional hand-crafted features in visual recognition challenges [24], image classification [25], and object detection [26]. CNNs roughly mimic the nature of the mammalian visual cortex and are among the most promising architectures for vision applications. The CNN architecture exploits the strong spatial correlation present in natural images by enforcing a local connectivity pattern between neurons of adjacent layers. A deep CNN consists of multiple layers of small neuron collections which offer an alternative approach to learning visual patterns directly from raw image pixels.

Zou et al. propose a deep learning-based framework for scene classification using deep belief networks (DBNs) [14]. Using active learning, they iteratively perform feature selection on DBN features for scene classification. Recently, deep learning architectures have also been employed in the hyperspectral domain [27]–[29].

Vakalopoulou et al. propose a deep learning framework for building detection in high-resolution multispectral (RGB and near-IR) aerial imagery [30]. A deep CNN pretrained on ImageNet images [24] was used to extract feature descriptors, and an SVM classifier was trained on these descriptors to distinguish between buildings and nonbuildings. The pixel-level classification was further refined using a Markov random field model. A classification benchmark on semantic labeling was presented in [31] using the IEEE GRSS DFC Zeebrugge dataset [32], where deep convolutional networks produced superior performance compared to other baselines; however, transfer learning from large everyday image datasets was used. Firat et al. propose an end-to-end object and region detection framework comprising convolutional sparse autoencoders to extract features and an SVM to detect the target objects/regions [33]. The target objects were airplanes and dry docks, and the target regions were dispersal areas, taxi routes, etc.

The combination of CNNs and CRFs has previously been applied to semantic labeling in several computer vision problems. Farabet et al. combine multiscale CNNs with super-pixel CRFs for street scene labeling [20]. Mnih and Hinton learn discriminative image features using deep neural networks to detect roads and buildings from noisy labels [1], [34]. A postprocessing procedure based on dependencies present in nearby map pixels is then applied to improve the predictions of their neural network. Several authors have also applied similar techniques to aerial imagery. Kluckner and Bischof apply super-pixel features and CRFs to building detection in aerial imagery [35]. Gerke extracts several image-based features and trains an AdaBoost-based classifier [21]. A graph-based segmentation approach is applied to efficiently group similar pixels based on their perceptual appearance within a local neighborhood.

III. APPROACH

In this section, we introduce the framework for automated pixel classification in high-resolution aerial images. We first introduce the neural networks adopted for dense feature extraction. We then discuss how we complement CNN features with hand-crafted features to further improve the classification accuracy. Finally, we briefly introduce the concept of conditional random fields (CRFs) used to smooth the final pixel labeling results. An overview of the proposed semantic pixel labeling framework is illustrated in Fig. 1.

A. Pixel Classification With Convolutional Network

The convolutional feature classifier is applied densely over the input image. The classifier has two components: a CNN consisting only of convolutional layers, and a logistic regression classifier that takes convolutional features as input and outputs class probabilities. The convolutional features are learned by supervised training of a CNN classifier (described next) and then discarding the fully connected (fc) layers of the network, leaving only the convolutional layers. In this section, we first introduce the concept of deep CNNs. We then discuss our approach and its implementation in more detail.

1) Background on Deep CNNs: Deep CNNs consist of one or several convolutional layers followed by one or more fc layers [24]. Each layer consists of neural weights which are learned jointly to minimize a specific objective function. Each CNN layer receives some input and performs a dot product followed by either linear or nonlinear operations. The output of the last layer represents the class scores for our semantic labeling problem. A typical CNN consists of the following building blocks: convolutional layers, activation functions, pooling layers, fc layers, and the loss layer.

The convolutional layer is the core building block of CNNs. It consists of a set of learned filters which are convolved across the input image. The size of each filter is much smaller than the size of the input image. An activation function is applied after each convolutional layer to transform the activation level of a neuron into an output signal. In this paper, we use the simplest nonlinear activation function, known as the rectified linear unit (ReLU).

The pooling layer combines several feature values obtained at nearby locations into statistics that better summarize the features over some region of interest (the pooling region). The new feature representation preserves visual information over a local neighborhood while discarding irrelevant details and noise [24].


Fig. 1. Overview of the proposed pixel labeling framework.

Spatial pooling has been proven to be invariant to various image transformations and to demonstrate better robustness to noise. It also reduces the number of parameters and the computation of the CNN model by reducing the spatial size of the representation. The last few layers of a CNN are the fc layers and the loss layer. Unlike the convolutional layers, in which the outputs of neurons are connected to a local region in the input, the fc layers capture the global properties of the image by connecting every neuron from the previous layer to every neuron in the next layer. They are often used before the final loss layer, which is carefully designed for different learning tasks. In this paper, we use a five-way soft-max layer which predicts a single class from five different classes.

In summary, CNNs transform the image from the original pixel values to the final class scores. The activation function and pooling operation implement fixed functions with no neural weights, while the convolutional and fc layers perform transformations that are a function of neuron weights. These neuron weights are trained with gradient descent so that the class scores that the CNN predicts are consistent with the labels in the training set.

2) CNN Training for Feature Learning: In this paper, we train a CNN by adopting the approach of [24], i.e., the CNN consists of several convolutional layers, placed alternately between activation functions and max-pooling layers. Each convolutional layer computes the convolutions between its input and a set of filters. The activation function (ReLU) performs a nonlinear transformation while the max-pooling layer subsamples the output of the convolutional layer. These two operations improve the robustness of the network to distortions and small translations [24]. The output of the last fc layer is fed to a k-way soft-max layer which produces a distribution over k class labels. In other words, the kth output of the soft-max layer (associated with the kth class) will be close to zero if the probability that the input belongs to this class is very small. The sum of the outputs of the soft-max layer is equal to one. All network parameters are learned in a supervised manner.
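To make the role of the k-way soft-max concrete, the following minimal NumPy sketch (not the authors' MatConvNet implementation) turns a vector of raw class scores into a distribution over the five classes that sums to one; the score values are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """k-way soft-max: turns raw class scores into a probability
    distribution that sums to one (shift by the max for stability)."""
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Example: hypothetical raw scores from the last fc layer for the five classes
# (Impervious surfaces, Building, Low vegetation, Tree, Car).
scores = np.array([2.1, 0.3, -1.0, 0.8, -0.5])
probs = softmax(scores)
print(probs, probs.sum())  # per-class probabilities; the sum is 1.0
```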

3) Dense Neural Pattern Training: We follow the work of Razavian et al. by applying a multiclass classifier to the CNN feature representation [25]. We consider the multiclass classification problem as a set of binary classification problems. We adopt simple logistic regression with a "one-versus-all" scheme, as it has been shown to be as accurate as other multiclass algorithms despite its simplicity [36]. The logistic regression solves the following optimization problem:

$$\min_{\mathbf{w}_r} \sum_{i=1}^{m} \log\left(1 + \exp(-y_i \mathbf{w}_r^{\top} \mathbf{x}_i)\right) \tag{1}$$

where $\mathbf{w}_r$ is the weight vector we are trying to learn. Here we assume that $\{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ is the set of training data, $\mathbf{x}_i \in \mathbb{R}^d$ represents vectorized CNN features, and $y_i = 1$ if the class label of $\mathbf{x}_i$ is the same as $r$ and $y_i = -1$ otherwise; $m$ is the number of training samples, $d$ is the dimension of the feature vector $\mathbf{x}_i$, and $r \in \{1, 2, \ldots, 5\}$, which corresponds to the class labels Impervious surfaces, Building, Low vegetation, Tree, and Car, respectively. In order to avoid overfitting, we introduce $\ell_2$-norm regularization on the weight vector $\mathbf{w}_r$. Given a test sample $\mathbf{x}_t$ and the learned weight vectors $\{\mathbf{w}_1, \ldots, \mathbf{w}_5\}$, the probability that $\mathbf{x}_t$ belongs to class $r$ is given by

$$P(y_t = r) = \frac{1}{Z}\left(\frac{1}{1 + \exp(-\mathbf{w}_r^{\top} \mathbf{x}_t)}\right) \tag{2}$$

where $Z = \sum_{r=1}^{5} \frac{1}{1 + \exp(-\mathbf{w}_r^{\top} \mathbf{x}_t)}$. The purpose of $Z$ is to ensure that the result is a valid probability distribution.

In the case of the CNN, the input features are not necessarily linearly separable, but the output features are discriminatively trained to be linearly separable. Moreover, the CNN features are high-dimensional, so the computational efficiency of logistic regression is advantageous since the classifier is applied densely over the image. A further advantage of using one-versus-all logistic regression is that we can exploit multicore processors during training; in other words, we can solve each $\mathbf{w}_r$, $\forall r$, independently.
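As an illustration of (1)–(2), the sketch below computes one-versus-all class probabilities from already-learned weight vectors using NumPy. The weight matrix, feature dimension, and random values are hypothetical; in the paper the weights are trained with $\ell_2$-regularized logistic regression (LIBLINEAR) on CNN features.

```python
import numpy as np

def one_vs_all_probabilities(W, x):
    """Per-class probabilities from one-versus-all logistic regression,
    following (2): a sigmoid score per class, normalized by Z so that
    the five values sum to one. W has shape (5, d), x has shape (d,)."""
    sigmoid = 1.0 / (1.0 + np.exp(-W @ x))   # one score per class
    return sigmoid / sigmoid.sum()           # divide by Z

# Toy example with random weights (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 16))   # 5 classes, 16-D feature vector
x = rng.normal(size=16)
print(one_vs_all_probabilities(W, x))
```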

4) Multiresolution CNN: In order to correctly classify both coarse-scale and fine-scale details in the image, we train several CNN models with different input image resolutions. Each CNN model encodes patches of increasing size, covering a larger context surrounding the center pixel. The output is a series of feature vectors generated from patches of multiple sizes centred at each pixel (see Fig. 2). A similar concept has also been applied in [20], [37], in which the authors demonstrate that a multiscale ConvNet outperforms a single-scale ConvNet for scene parsing.

Fig. 2. Overview of the multiresolution CNN.
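The following sketch illustrates the multiresolution idea: patches of 16 × 16, 32 × 32, and 64 × 64 pixels are cropped around the same centre pixel, one per CNN model. The reflect padding at image borders is an assumption made for illustration; the paper does not specify its border handling.

```python
import numpy as np

def multiresolution_patches(image, row, col, sizes=(16, 32, 64)):
    """Extract square patches of increasing size centred on one pixel;
    the CNN trained at each resolution would encode the matching patch.
    `image` is H x W x C; borders are handled with a simple reflect pad."""
    half_max = max(sizes) // 2
    padded = np.pad(image,
                    ((half_max, half_max), (half_max, half_max), (0, 0)),
                    mode="reflect")
    r, c = row + half_max, col + half_max
    patches = []
    for s in sizes:
        h = s // 2
        patches.append(padded[r - h:r + h, c - h:c + h, :])
    return patches  # list of (16,16,C), (32,32,C), (64,64,C) arrays

# Example on a synthetic 5-channel tile (CIR + DSM + nDSM).
tile = np.zeros((200, 200, 5), dtype=np.float32)
p16, p32, p64 = multiresolution_patches(tile, row=10, col=190)
print(p16.shape, p32.shape, p64.shape)
```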

5) Mining Hard Examples: Despite the fact that a large amount of training data is available (the ground truth consists of more than 10^8 labeled pixels), most learning algorithms are not designed to handle such a vast amount of training data. In object detection [38]–[40], this problem has been tackled by mining for hard training examples. In this iterative process, an initial classification model is trained using all positive examples and a randomly selected subset of negative examples. The initial training set is incrementally appended with false positive examples produced while scanning the images with the classification model learned so far. This procedure is known as bootstrapping or hard negative mining [39]–[41]. Dalal and Triggs [39] show that two rounds of bootstrapping are sufficient and that additional rounds make little difference. Following the technique applied in object detection, we augment our training data by iteratively bootstrapping the initial training set with hard-to-classify examples to improve the multiclass decision boundary. Since incorrectly classified pixels can occur in close proximity, we suppress nearby misclassified pixels that are more likely to be correctly classified. To achieve this, we select the misclassified 64 × 64 patches which have the lowest probabilities for the correct class. This procedure is known as nonmaximum suppression [38], [40]. To reduce the impact of noise in the training data, e.g., some pixels along building and road boundaries can be labeled as either of these two classes, we only choose hard training examples which are not near the boundaries of objects.
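A minimal sketch of the hard-example selection step is given below: misclassified patches are ranked by the probability assigned to the correct class, patches near object boundaries are skipped, and nearby selections are suppressed. The helper names, the distance-based suppression rule, and the toy inputs are illustrative assumptions rather than the paper's exact procedure.

```python
def mine_hard_examples(prob_true_class, is_boundary, num_hard, min_sep=32):
    """Pick the hardest misclassified patches: lowest probability for
    the correct class, skipping boundary patches and suppressing patch
    centres closer than `min_sep` pixels to an already selected one
    (a simple stand-in for nonmaximum suppression).
    `prob_true_class` maps patch centre (row, col) -> probability."""
    candidates = sorted(
        ((p, rc) for rc, p in prob_true_class.items() if not is_boundary(rc)),
        key=lambda t: t[0])
    selected = []
    for _, (r, c) in candidates:
        if all(abs(r - sr) > min_sep or abs(c - sc) > min_sep
               for sr, sc in selected):
            selected.append((r, c))
        if len(selected) == num_hard:
            break
    return selected  # centres of hard patches to append to the training set

# Toy usage: three candidate patch centres with their correct-class probability.
probs = {(100, 100): 0.05, (110, 104): 0.07, (400, 250): 0.10}
print(mine_hard_examples(probs, is_boundary=lambda rc: False, num_hard=2))
```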

6) Implementation: We train the CNN with a combination of input data: the orthophoto, the digital surface model (DSM) image, and the normalized DSM image. The orthophotos are CIR, consisting of near infrared (NIR), red, and green bands. Since the data are not RGB, pretrained weights from other datasets such as ImageNet could not be used; the network was trained from scratch with random initialization. We train three CNN models with three different input image resolutions: 16 × 16, 32 × 32, and 64 × 64 pixels. The parameter settings used in our CNN model for 64 × 64 pixel input images are as follows. The first convolutional layer filters the 64 × 64 × 5 input image, which consists of the orthophoto, DSM image, and normalized DSM image, with 32 kernels of size 5 × 5 × 5 and a stride of 1 pixel. The second convolutional layer takes as input the output of the first convolutional layer and filters it with 64 kernels of size 5 × 5 × 32. The third convolutional layer has 96 kernels of size 5 × 5 × 64 connected to the output of the second convolutional layer. The fourth convolutional layer has 128 kernels of size 3 × 3 × 96. The fc layers have 128 neurons each. We apply dropout in both fc layers; dropout sets the output of each hidden neuron to zero with probability 0.5. We train the CNN with stochastic gradient descent at a learning rate of 0.001, and the learning rate is reduced by a factor of 10 every 20 epochs. We apply simple linear interpolation on the probabilities to obtain a per-pixel classification. Each convolutional layer has a stride of 1 pixel while each pooling layer has a stride of 2 pixels. The momentum and weight decay parameters were chosen by cross-validation and set to 0.9 and 0.0005, respectively. To improve the robustness of our CNN model, we augment the training data by rotating each training patch in steps of 45°. An illustration of the CNN structure is shown in Fig. 3.
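The 45° rotation augmentation can be sketched as follows; the use of scipy.ndimage.rotate and the reflect border mode are assumptions made for this illustration (the paper's pipeline is MATLAB/MatConvNet based).

```python
import numpy as np
from scipy.ndimage import rotate

def rotation_augment(patch):
    """Return the patch rotated in 45-degree steps (0..315 degrees).
    `reshape=False` keeps the 64 x 64 x 5 shape; `mode='reflect'` fills
    the corners introduced by non-axis-aligned rotations (an assumption;
    the paper does not state its border handling)."""
    return [rotate(patch, angle, axes=(0, 1), reshape=False, mode="reflect")
            for angle in range(0, 360, 45)]

patch = np.random.rand(64, 64, 5).astype(np.float32)
augmented = rotation_augment(patch)
print(len(augmented), augmented[0].shape)  # 8 rotated copies, (64, 64, 5)
```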

To train the CNN models for 16 × 16 and 32 × 32 pixel input images, we simply change the input to the first convolutional layer from 64 × 64 × 5 to 16 × 16 × 5 and 32 × 32 × 5, respectively. All other parameter settings are kept the same as before. In this paper, we implement the network training with the MatConvNet CNN toolbox.1 Training is done on a standard desktop with an NVIDIA GTX 780 GPU with 6 GB of memory. We randomly extract 7500 patches from each class. The time to train the CNN model is under 2 h. For ℓ2-regularized logistic regression, we use LIBLINEAR [42]. We train LIBLINEAR on a cluster consisting of 32 Intel Xeon 2.70 GHz CPUs.

7) Dense Neural Pattern Classification: The output of each convolution kernel is a dense feature map of neural patterns over the entire image. By carefully designing the CNN structure, we can compute the location of each neural pattern from a patch and map it back to coordinates on the original image [43]. Given the test image and the trained CNN model, we extract the dense CNN feature map from the output of the last convolutional layer. We vectorize the CNN features within each input image patch, concatenate them into a single feature vector, and apply the logistic regression weights to classify objects of the different classes. To evaluate the entire test image, we adopt a scanning-window approach with a step size of 4 pixels. Since we forward propagate the entire test image through the CNN only once, we significantly speed up CNN feature extraction during evaluation. In addition, since the only class-specific computation is the dot product between the extracted CNN features and the logistic regression weights, our approach can scale to hundreds of object classes.
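The sketch below illustrates the scanning-window classification of a dense feature map: the convolutional features inside each window are vectorized and scored with the one-versus-all logistic regression weights. The window size, feature dimensionality, and step used here are illustrative, not the values from the paper.

```python
import numpy as np

def dense_classify(feature_map, W, window=4, step=1):
    """Scan a dense CNN feature map (H x W x D) with a square window,
    vectorize the features inside each window, and score them with
    one-versus-all logistic regression weights W of shape
    (n_classes, window*window*D). Returns class probabilities on the
    coarse scanning grid; the paper then interpolates such outputs back
    to a per-pixel labeling."""
    H, Wf, D = feature_map.shape
    out_rows = (H - window) // step + 1
    out_cols = (Wf - window) // step + 1
    probs = np.zeros((out_rows, out_cols, W.shape[0]))
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * step, j * step
            x = feature_map[r:r + window, c:c + window, :].ravel()
            s = 1.0 / (1.0 + np.exp(-W @ x))
            probs[i, j] = s / s.sum()
    return probs

fmap = np.random.rand(32, 32, 8).astype(np.float32)   # toy feature map
Wlr = np.random.rand(5, 4 * 4 * 8)                     # 5 classes
print(dense_classify(fmap, Wlr).shape)                 # (29, 29, 5)
```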

1http://www.vlfeat.org/matconvnet


Fig. 3. Illustration of the CNN architecture. In this figure, the network's input is the 64 × 64 × 3-pixel orthophoto, the 64 × 64-pixel DSM image, and the 64 × 64-pixel normalized DSM image. The network consists of six layers (four convolutional layers and two fully connected layers) with a final five-way soft-max layer.

B. Classification Using Hand-Crafted Features

In [21], several pixel-level features were found to be effective for discriminating the classes in the labeling contest. Since these are complementary to the texture-based features extracted by the CNN, a separate random forest (RF) classifier is trained on the hand-crafted features and its output probabilities are combined with those generated by the CNN. The hand-crafted features are raw features. It was found that the nonlinearity of the RF gave higher accuracy than a simple linear classifier. For each pixel in the imagery, a vector of seven features is generated: NDVI, saturation, normalized DSM (see [21]), (NIR + R + G)/3, the three-channel maximum (to indicate shadows), and the entropy and kurtosis of an ℓ2-normalized histogram of normals gathered over a 16 × 16 neighborhood from the DSM and normalized DSM.
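A sketch of the single-pixel features is shown below. It covers NDVI, a saturation-like measure, the normalized DSM, the mean brightness (NIR + R + G)/3, and the three-channel maximum; the exact saturation definition here follows a common HSV-style formula and is an assumption, since [21] defines the variant actually used. The histogram-of-normals statistics are sketched separately after the next paragraph.

```python
import numpy as np

def handcrafted_features(nir, red, green, ndsm):
    """Per-pixel hand-crafted features in the spirit of Sec. III-B:
    NDVI, a saturation-like measure, normalized DSM, mean brightness
    (NIR+R+G)/3, and the three-channel maximum. Entropy/kurtosis of the
    histogram of normals is omitted here (see the next sketch)."""
    eps = 1e-6
    ndvi = (nir - red) / (nir + red + eps)
    stacked = np.stack([nir, red, green], axis=-1)
    cmax = stacked.max(axis=-1)
    cmin = stacked.min(axis=-1)
    saturation = (cmax - cmin) / (cmax + eps)   # assumed HSV-style definition
    mean_brightness = (nir + red + green) / 3.0
    return np.stack([ndvi, saturation, ndsm, mean_brightness, cmax], axis=-1)

# Toy 4x4 tile with values in [0, 1].
rng = np.random.default_rng(1)
nir, red, green, ndsm = (rng.random((4, 4)) for _ in range(4))
print(handcrafted_features(nir, red, green, ndsm).shape)  # (4, 4, 5)
```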

The histogram of normals is a collation of the angles of point normals into 2-D histogram bins. Each three-dimensional (3-D) point in a DSM can be represented by the angles of its point normal with two degrees of freedom, i.e., elevation and azimuth. The normal vector at a given 3-D point is calculated by fitting a RANSAC plane to the point's k nearest neighbors. Considering the normal vector at a point as $\vec{p} = [l, m, n]$ and a surface plane as $S: Ax + By + Cz + D = 0$, where $A, B, C,$ and $D$ are the coefficients of the fitted plane, we have

$$\sin\theta = \frac{|A\,l + B\,m + C\,n|}{\sqrt{A^2 + B^2 + C^2}\,\sqrt{l^2 + m^2 + n^2}} \tag{3}$$

where $\theta$ is the angle between the normal vector $\vec{p}$ and the surface plane $S$. We calculate the elevation angle $\theta_e$ by setting the surface plane $S$ to be the $x$–$y$ plane, and the azimuth angle $\theta_\alpha$ by setting the surface plane $S$ to be the $y$–$z$ plane.

For a given neighborhood, an $\ell_2$-normalized 2-D histogram over elevation and azimuth can be used as a feature descriptor. The distribution of the histogram of normals captures features that are useful for segregating ground and above-ground classes, with the ground class having a skewed distribution compared to the above-ground classes.
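The following sketch computes an ℓ2-normalized 2-D histogram of elevation and azimuth angles for a set of point normals, following (3) with the x–y and y–z reference planes. The bin count, angle ranges, and toy normals are illustrative assumptions.

```python
import numpy as np

def angle_to_plane(normal, plane_normal):
    """sin(theta) between a point normal and a plane, following (3):
    |A*l + B*m + C*n| / (||(A,B,C)|| * ||(l,m,n)||)."""
    num = abs(np.dot(plane_normal, normal))
    den = np.linalg.norm(plane_normal) * np.linalg.norm(normal) + 1e-12
    return num / den

def histogram_of_normals(normals, bins=8):
    """l2-normalized 2-D histogram of elevation/azimuth angles for the
    point normals in one neighborhood. Plane choices follow the text:
    elevation w.r.t. the x-y plane, azimuth w.r.t. the y-z plane."""
    elev = [np.arcsin(angle_to_plane(n, np.array([0.0, 0.0, 1.0]))) for n in normals]
    azim = [np.arcsin(angle_to_plane(n, np.array([1.0, 0.0, 0.0]))) for n in normals]
    hist, _, _ = np.histogram2d(elev, azim, bins=bins, range=[[0, np.pi / 2]] * 2)
    return hist.ravel() / (np.linalg.norm(hist) + 1e-12)

# Toy normals (mostly upward-facing, as on flat ground).
normals = np.array([[0.0, 0.05, 1.0], [0.1, 0.0, 1.0], [0.0, 0.0, 1.0]])
print(histogram_of_normals(normals).shape)  # (64,)
```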

A total of 537,000 training examples are chosen at random and used along with the corresponding ground-truth pixel labels to train an RF classifier with 100 trees. The Car class is excluded from this classifier since the features are not discriminative for cars.

Since the CNN and RF are such different approaches, we assume they are independent given the data and multiply their class probabilities to obtain the combined probability for each class

$$p_i^{\text{combo}} = \frac{p_i^{\text{cnn}}\, p_i^{\text{rf}}}{\sum_{j=1}^{C} p_j^{\text{cnn}}\, p_j^{\text{rf}}} \tag{4}$$

where $p^{\text{combo}}$, $p^{\text{cnn}}$, and $p^{\text{rf}}$ are the combined, CNN, and RF probabilities per class. For the Car class, the combined probabilities come from the CNN only.
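A minimal sketch of the combination rule (4) is given below; how the missing RF Car probability is handled (treating the RF as uninformative for Car) is an assumption, since the paper only states that the Car probability comes from the CNN.

```python
import numpy as np

def combine_probabilities(p_cnn, p_rf, car_index=4):
    """Combine CNN and RF class probabilities as in (4): multiply the two
    distributions and renormalize. The RF does not model the Car class,
    so here its Car entry is treated as uninformative (set to 1) so the
    Car probability comes from the CNN alone up to renormalization;
    this handling is an assumption."""
    p_rf = np.asarray(p_rf, dtype=float).copy()
    p_rf[car_index] = 1.0
    combined = np.asarray(p_cnn, dtype=float) * p_rf
    return combined / combined.sum()

p_cnn = np.array([0.10, 0.60, 0.05, 0.05, 0.20])
p_rf  = np.array([0.20, 0.55, 0.15, 0.10, 0.00])
print(combine_probabilities(p_cnn, p_rf))
```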

C. CRF Labeling

A CRF is a probabilistic graphical model that has been used extensively for semantic labeling of images, for example, see [19], [20]. CRFs are often defined at the super-pixel level rather than the pixel level to improve computational efficiency and robustness [35]. As pointed out in [21], this places an upper limit on the achievable accuracy due to oversegmentation errors (i.e., super-pixels that cover multiple objects). Therefore, we use a pixel-level CRF: a four-connected grid in which each node corresponds to the class label of an image pixel.

Following the standard definition of image labeling CRFs, the energy function consists of unary and pairwise cost terms

$$E = \sum_{i \in \mathcal{V}} \Phi(c_i, \mathbf{x}) + \sum_{(i,j) \in \mathcal{E}} \Psi(c_i, c_j, \mathbf{x}) \tag{5}$$

where $\mathcal{V}$ and $\mathcal{E}$ are the nodes and edges of the CRF graph, $c_i$ is the class label of node $i$, and $\mathbf{x}$ represents the given data. The unary cost is based on the class probability from the combined CNN and RF classifiers,

$$\Phi(c_i, \mathbf{x}) = -\log p^{\text{combo}}_{c_i}. \tag{6}$$

The pairwise costs use a contrast-sensitive Potts model to penalize class boundaries with low contrast. However, the traditional method of comparing neighboring pixel intensities is problematic due to low-contrast edges in the image. Instead, pairwise costs are based on a binary edge image, such that class boundaries are encouraged to line up with the edges. Edge detection techniques have threshold parameters and can fail to detect an edge. To address this, the edge detector uses hysteresis thresholding to perform boundary continuation across weak parts of an edge. This is a kind of locally adaptive thresholding that is somewhat robust to lighting variation. In contrast, using gray values directly in a contrast-sensitive binary potential does not have this locally adaptive property, and the term is diminished in weak parts of the edge. This provides the CRF with stronger information about class boundaries than the weak image contrast information.

Fig. 4. Example of edges used in the CRF pairwise cost.

Suppose we have a binary edge image $G(i)$ that is true if pixel $i$ is an edge pixel and false otherwise. Define $B(i, j)$ as an indicator that pixel $i$ and its four-connected neighbor $j$ straddle an edge

$$B(i, j) = \begin{cases} 1 & \text{if } G(i) \text{ and } \neg G(j) \text{ and } (x(j) > x(i) \text{ or } y(j) > y(i)) \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

where $x(i)$ and $y(i)$ are the image column and row of pixel $i$. This asymmetric definition stops edge pixels from being segmented as separate regions.

The pairwise cost is defined as

$$\Psi(c_i, c_j, \mathbf{x}) = \begin{cases} K\,(1 - B(i, j)) & \text{if } c_i \neq c_j \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

where $K$ is a constant penalty term. In our experiments, the value $K = 25$ is chosen to minimize the validation set error. The Canny edge detector is used to form the robust edge image $G(i)$; hysteresis thresholding fills in weak edges connecting strong ones and discards isolated weak edges. Since the Canny image edges are sometimes absent or incomplete, a combined edge map is used. Canny edges are extracted from each of the single-band images; in the case of RGB/CIR input, the image is first converted to greyscale. The three binary edge masks are then combined into one mask with a Boolean OR operation to form $G(i)$.
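The construction of the combined edge mask and the pairwise penalty of (7)–(8) can be sketched as follows. OpenCV's Canny detector and the threshold values used here are assumptions for illustration; the paper does not state which implementation or thresholds were used.

```python
import numpy as np
import cv2  # OpenCV; thresholds below are illustrative, not the paper's

def combined_edge_mask(gray_image, dsm, ndvi, low=50, high=150):
    """Boolean OR of Canny edges from the greyscale image, the DSM, and
    the NDVI band, forming the binary edge image G used in (7)."""
    def canny(band):
        band8 = cv2.normalize(band, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        return cv2.Canny(band8, low, high) > 0
    return canny(gray_image) | canny(dsm) | canny(ndvi)

def pairwise_cost(ci, cj, gi, gj, xi, xj, yi, yj, K=25.0):
    """Potts-style pairwise cost of (8) with the asymmetric edge
    indicator B(i, j) of (7)."""
    if ci == cj:
        return 0.0
    straddles = gi and (not gj) and (xj > xi or yj > yi)
    return K * (1.0 - float(straddles))

# Toy usage on random bands.
g = np.random.rand(64, 64).astype(np.float32)
d = np.random.rand(64, 64).astype(np.float32)
v = np.random.rand(64, 64).astype(np.float32)
G = combined_edge_mask(g, d, v)
print(G.shape, pairwise_cost(1, 2, True, False, 10, 11, 5, 5))
```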

Fig. 4 shows the complementary nature of the three sets of edges: image edges are accurate but unreliable, DSM edges are more reliable but less accurate, and NDVI edges can delineate vegetation well regardless of elevation.

An approximation to the MAP labeling is inferred using α-β swaps [44]. This method solves the n-labeling problem by iteratively solving the simpler binary case. The max-flow algorithm of [45] is used for the binary labeling, which performs exact inference in polynomial time. The CRF was implemented in Python and C++.2

2 http://github.com/RockStarCoders/alienMarkovNetworks
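For reference, the sketch below evaluates the total energy (5) of a candidate labeling on a four-connected grid using the unary term (6) and the edge-aware Potts pairwise term (8). It is only a helper for checking labelings; the MAP inference itself is performed with α-β swaps and max-flow, which are not reimplemented here.

```python
import numpy as np

def crf_energy(labels, log_probs, edge_mask, K=25.0):
    """Total energy of (5) for a given labeling on a four-connected grid:
    unary terms -log p_combo (6) plus the edge-aware Potts pairwise
    terms (8). `labels` is H x W int, `log_probs` is H x W x C log
    probabilities, `edge_mask` is the boolean edge image G."""
    H, W = labels.shape
    # Unary: negative log probability of the chosen class at each pixel.
    unary = -log_probs[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    pairwise = 0.0
    for (dr, dc) in ((0, 1), (1, 0)):           # right and down neighbors
        a = labels[:H - dr, :W - dc]
        b = labels[dr:, dc:]
        # B(i, j) = 1 when pixel i is an edge pixel and its forward
        # neighbor j is not (the asymmetric indicator of (7)).
        straddle = edge_mask[:H - dr, :W - dc] & ~edge_mask[dr:, dc:]
        pairwise += (K * (1.0 - straddle) * (a != b)).sum()
    return unary + pairwise

# Toy check: uniform probabilities, constant labeling, no edges.
H, W, C = 8, 8, 5
logp = np.log(np.full((H, W, C), 1.0 / C))
print(crf_energy(np.zeros((H, W), dtype=int), logp, np.zeros((H, W), dtype=bool)))
```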

IV. EXPERIMENTS

The proposed method was applied to the ISPRS labeling contest dataset [4]. The dataset consists of 33 large image patches of different sizes, each being a true orthophoto (TOP) extracted from a larger TOP image captured over Vaihingen, Germany. In total, there are over 168 million pixels. The dataset also contains a corresponding DSM for each patch. The patches have a GSD of 9 cm, and the DSMs were generated via dense image matching. Labeled ground truth was provided for 16 of the areas and is made up of six categories: Impervious surfaces, Building, Low vegetation, Tree, Car, and Clutter/background. Normalized DSMs were provided to us at a later date and were generated using the lasground tool,3 where the normalized height is computed based on the off-ground pixels. The effect of terrain is thus nullified in the normalized DSM compared to the regular DSM. Guidelines for the evaluation procedure and metrics are defined by the ISPRS [4].

We now describe the experimental design of our approach. We split the labeled training images into training and validation sets. The training set consists of 11 areas (1, 3, 5, 7, 13, 17, 21, 23, 26, 32, 37) and the validation set consists of 5 areas (11, 15, 28, 30, and 40). The evaluation is based on the computation of pixel-based confusion matrices. For each class, we report the harmonic mean of precision and recall (F1-score). We also report the overall accuracy (Overall Acc.), which is the normalized trace of the confusion matrix (i.e., the percentage of pixels correctly labeled). As per the ISPRS metrics, pixels near ground-truth class boundaries are excluded by eroding the labels with a 5 × 5 diamond. Pixels from the "unknown" class are not included in the overall accuracy metric, and none of our classifiers generate the "unknown" label.
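A minimal sketch of the evaluation metrics is given below: per-class F1 as the harmonic mean of precision and recall, and overall accuracy as the normalized trace of the confusion matrix. The toy confusion matrix is illustrative; boundary erosion with the 5 × 5 diamond is assumed to have been applied before the matrix is accumulated.

```python
import numpy as np

def per_class_f1_and_overall(conf):
    """Per-class F1 (harmonic mean of precision and recall) and overall
    accuracy (normalized trace) from a pixel-based confusion matrix
    `conf`, where conf[t, p] counts pixels of true class t predicted as
    class p."""
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)
    recall = tp / np.maximum(conf.sum(axis=1), 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    overall_acc = tp.sum() / conf.sum()
    return f1, overall_acc

# Toy 3-class confusion matrix.
conf = np.array([[90, 5, 5],
                 [10, 80, 10],
                 [0, 10, 90]])
f1, acc = per_class_f1_and_overall(conf)
print(np.round(f1, 3), round(acc, 3))
```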

A. Input Data

In this experiment, we compare the CNN performance with and without the DSM. All experimental settings are kept identical except the number of channels of the convolutional kernels in the first layer. For orthophotos, the filter size in the first layer is set to 5 × 5 × 3 × 32. For orthophotos + DSM, the filter size in the first layer is set to 5 × 5 × 4 × 32. We conduct experiments with both the raw DSM and the normalized DSM. Table I compares the average F1-score and overall accuracy of the CNN given different input data. We observe that it is beneficial to use the normalized height information, as it improves the overall accuracy by 3.3%. Not surprisingly, we observe that the normalized height feature has less impact on the detection rate of car pixels (53.5% versus 54.6%). A similar finding has also been reported in [21]. In our experiment, we achieve the highest accuracy when we combine the orthophoto with both the raw DSM and the normalized DSM (an improvement of 5.2% in overall accuracy).

3http://rapidlasso.com


TABLE I
PERFORMANCE COMPARISON OF THE CNN WITH DIFFERENT INPUT DATA SOURCES (ORTHO—ORTHOPHOTO, DSM—RAW DIGITAL SURFACE MODEL, AND NDSM—NORMALIZED DSM)

Input               Imp. surf.  Building  Low veg.  Tree    Car     Average F1  Overall Acc.
Ortho               82.91%      88.13%    67.14%    81.77%  53.50%  74.69%      80.10%
Ortho + DSM         82.93%      88.75%    68.40%    82.44%  51.15%  74.74%      80.71%
Ortho + NDSM        86.07%      92.79%    72.85%    82.85%  54.63%  77.84%      83.46%
Ortho + DSM + NDSM  87.35%      93.34%    74.96%    84.97%  63.32%  80.79%      85.18%

Experiments are evaluated on the validation set (areas 11, 15, 28, 30, and 40). All quality measures except "Overall Acc." are F1-scores using the ground truth with eroded boundaries. The best overall accuracy is shown in boldface.

Fig. 5. First-layer filters learned from the three-channel orthophoto, DSM, and normalized-DSM data. The five channels are shown via three images as labeled. (a) NIR-R-G channels. (b) DSM channel. (c) Normalized DSM channel.

TABLE II
PERFORMANCE COMPARISON WITH DIFFERENT POOLING OPERATIONS

Pooling       Imp. surf.  Building  Low veg.  Tree    Car     Overall F1  Overall Acc.
No pooling    86.34%      92.52%    74.12%    84.21%  50.97%  77.63%      84.04%
Avg. pooling  86.32%      92.73%    75.79%    84.82%  54.58%  78.85%      84.62%
Max pooling   87.35%      93.34%    74.96%    84.97%  63.32%  80.79%      85.18%

TABLE III
PERFORMANCE COMPARISON BETWEEN (A) CNN PROBABILITY OUTPUT AND (B) THE PROBABILITY OUTPUT OF ℓ2-REGULARIZED LOGISTIC REGRESSION (LR) TRAINED ON THE CNN FEATURE REPRESENTATION

Method                  Imp. surf.  Building  Low veg.  Tree    Car     Average F1  Overall Acc.
CNN (six layers)        86.69%      93.06%    76.16%    85.39%  55.62%  79.38%      84.93%
CNN (four layers) + LR  87.35%      93.34%    74.96%    84.97%  63.32%  80.79%      85.18%

Fig. 5 shows feature visualizations from the first layer of our model once training is complete (32 filters learned from the orthophoto, DSM, and normalized DSM). We observe oriented edge filters in the first layer of our model. In addition, the first three filters (learned from the orthophoto) and the last two filters (learned from the DSM) look visually similar. Similar edge-like patterns have also been observed in natural images, as reported in the literature [46].

TABLE IV
PERFORMANCE COMPARISON BETWEEN SINGLE-RESOLUTION AND MULTIRESOLUTION CNNS

Input image resolution (pixels)  Imp. surf.  Building  Low veg.  Tree    Car     Average F1  Overall Acc.
16 × 16                          84.70%      92.15%    72.54%    83.51%  42.54%  75.09%      82.78%
32 × 32                          85.96%      92.42%    74.09%    84.68%  61.06%  79.64%      84.20%
64 × 64                          87.35%      93.34%    74.96%    84.97%  63.32%  80.79%      85.18%
ALL                              87.72%      93.27%    75.53%    85.29%  66.89%  81.74%      85.56%

ALL—CNN features extracted from all three resolutions: 16 × 16, 32 × 32, and 64 × 64 pixels.

TABLE V
AVERAGE EVALUATION TIME PER IMAGE (ON THE VALIDATION SET: AREAS 11, 15, 28, 30, AND 40) FOR SINGLE-RESOLUTION AND MULTIRESOLUTION CNNS

Input            CNN feature extraction (s)  Window scanning (s)  Total time (s)
single-res.      9.8                         57.3                 67.1
2 × single-res.  18.8                        115.4                134.2
3 × single-res.  28.0                        171.8                199.8

TABLE VI
PERFORMANCE COMPARISON OF DIFFERENT BOOTSTRAPPING ITERATIONS

                  Imp. surf.  Building  Low veg.  Tree    Car     Average F1  Overall Acc.
No bootstrapping  87.35%      93.34%    74.96%    84.97%  63.32%  80.79%      85.18%
One iteration     87.45%      92.70%    73.13%    84.25%  68.70%  81.25%      84.69%
Two iterations    87.17%      91.95%    74.20%    85.00%  70.12%  81.69%      84.89%
Three iterations  87.34%      92.22%    74.78%    85.14%  70.25%  81.95%      85.14%
Four iterations   87.33%      91.79%    74.03%    84.57%  70.53%  81.65%      84.71%

B. Spatial Pooling Layer

Spatial pooling has been shown to increase the robustness of the network to small translations [47]. However, it might be suboptimal to apply spatial pooling when we want to precisely predict the label of a single pixel. In this experiment, we compare the performance of the proposed CNN architecture with and without the spatial pooling layers. We also investigate the effect of average pooling. Experimental results are shown in Table II.


Fig. 6. Training images and their ground truth (best viewed in color). Note that there exist a few ambiguities in the provided ground truth. (a) The vehicle trailer should be Clutter/bg. (b) Umbrellas should be Clutter/bg. (c) Low vegetation on the rooftop is ignored. (d) Trees are classified as imp. surfaces. (e) Low veg. is classified as imp. surfaces. (f) Low veg. is classified as imp. surfaces. (g) Low vegetation is classified as tree. (h) Extended low vegetation. (i) Extended low vegetation. (j) Extended low vegetation. (k) Railway is classified as low vegetation. (l) Buildings are classified as trees.

Based on our results, we observe that applying spatial pooling (which increases the robustness of the CNN features against small deformations and translations) tends to improve the pixel classification accuracy. For the "car" class, we observe a significant improvement in F1-score when we apply max pooling.

C. Multiresolution CNN Feature Extraction

We employ a multiresolution deep network that predicts an output based on 16 × 16, 32 × 32, and 64 × 64 pixel image areas. Experimental results are reported in Table IV. For our baseline, we trained a single-resolution network. Table IV shows an improvement of the multiresolution CNN over single-resolution CNN-based classification in both the average F1-score and the overall accuracy. The accuracy improvement was most significant for the car class. This is expected because CNN features are extracted from patches of 16 × 16 to 64 × 64 pixels and cars are relatively small, detailed objects. We also report the evaluation time in Table V. All computations were performed using a single core of an Intel Xeon E5-2680 at 2.70 GHz.

TABLE VII
ACCURACY OF THE RF CLASSIFIER AND CRF LABELING ON THE VALIDATION SET

Method                      Imp. surf.  Building  Low veg.  Tree    Car     Overall F1  Overall Acc.
Multi-res CNN               87.72%      93.27%    75.53%    85.29%  66.89%  81.74%      85.56%
Hand-crafted features (RF)  85.83%      92.79%    70.88%    83.98%  0.0%    66.69%      83.47%
CNN + RF                    88.58%      94.23%    76.58%    86.29%  67.58%  82.65%      86.52%
CNN + RF + CRF              89.10%      94.30%    77.36%    86.25%  71.91%  83.78%      86.89%

Both the CNN feature extraction and the scanning-window code are implemented in MATLAB as standard MATLAB external calls (MEX-files). Based on Tables IV and V, we conclude that the performance gain of the multiresolution CNN comes at the cost of increased computational complexity.

D. Mining Hard Examples

Next, we compare the effect of data augmentation at various bootstrapping stages. We initially select 8800 random patches of each class as an initial training set.


Fig. 7. Examples of CRF smoothing. From left to right: input image, ground truth labeling, combined classifier labeling, and CRF labeling.

Table VI shows the influence of repeated retraining (bootstrapping). Based on our experimental results, we observe that bootstrapping improves the average F1-score but worsens the overall accuracy. These results contradict those reported in [38]–[40]. Upon closer inspection, we observe that the provided ground truth contains a few ambiguities, e.g., tree and low vegetation can sometimes be mislabeled. Fig. 6 illustrates such mislabeled ground truth. The hard mining process focuses on these noisy labels and systematically adds them to the training set. Interestingly, we observe that bootstrapping significantly improves the F1-score on the "car" class (from 63% to 70%). On the other hand, it appears to degrade performance on the "building" class.

E. Hand-Crafted Features and RF Classifier

The accuracy of the RF classifier on the validation set is shown in Table VII. The best CNN-only result is included there for comparison. The accuracy is surprisingly high considering the relative simplicity of the hand-crafted features, each using input values from only a single pixel. Table VII also shows the accuracy of the combined probabilities when used to label the pixels. Adding the hand-crafted feature result to the CNN improved the accuracy evenly across all classes, indicating that the hand-crafted features do indeed contain information that is independent of the CNN features. The type of classifier used in each case is different, which could also have contributed to this improvement.

F. Conditional Random Field

The CRF is applied to the combined CNN and RF probabilities. Inference takes about 1 min per image on a single CPU. The validation set accuracy is shown in Table VII. Whereas the average pixel accuracy increased by only a fraction of a percent, the average F1-score is about 1% higher. The CRF gave a particular improvement for the car class because the CNN outputs have lower resolution than the input images and the fine detail of the cars is lost; the CRF exploits the image edges to restore this detail in the labeling of the cars. The accuracy is not greatly improved by the CRF, but the aesthetic appeal of the labeling arguably makes it worthwhile. Examples of the effect of CRF smoothing are shown in Fig. 7.

TABLE VIII
COMPARISON OF LABELING ACCURACY FOR DIFFERENT EDGE TYPES USED IN THE PAIRWISE COST TERM OF THE CRF

Edge type           Imp. surf.  Building  Low veg.  Tree    Car     Average F1  Overall Acc.
Ortho               88.68%      94.43%    76.76%    85.67%  71.78%  83.46%      86.53%
DSM                 88.37%      93.75%    76.94%    85.55%  70.87%  83.10%      86.28%
NDVI                89.15%      94.05%    77.00%    86.35%  71.49%  83.61%      86.81%
Ortho + DSM         88.68%      94.22%    77.06%    85.71%  71.55%  83.44%      86.53%
Ortho + DSM + NDVI  89.08%      94.29%    77.32%    86.28%  72.29%  83.85%      86.88%

TABLE IX
ISPRS 2-D SEMANTIC LABELING CONTEST BENCHMARK RESULTS ON THE HOLD-OUT TEST SET

Method          Imp. surf.  Building  Low veg.  Tree   Car    Overall Acc.
CNN             88.1%       92.0%     79.0%     86.5%  59.0%  86.1%
CNN + RF        89.0%       93.0%     81.0%     87.8%  59.5%  87.3%
CNN + RF + CRF  89.5%       93.2%     82.3%     88.2%  63.3%  88.0%

Taken from the challenge web page: http://www2.isprs.org/vaihingen-2d-semantic-labeling-contest.html

The main improvements from the CRF are to change the labels of regions with ambiguous probabilities and to remove small mislabeled regions. On the downside, CRF smoothing can sometimes remove small or thin regions.

To demonstrate the benefit of combining the orthophoto, DSM, and NDVI images in the pairwise term of the CRF, the results using different edge maps are compared in Table VIII. The overall accuracy is highest when all three edge maps are combined. It is also evident that the DSM edges do not help to improve the accuracy, probably due to the imprecise nature of the DSM at object boundaries. In future work, the DSM edges could be omitted from the CRF pairwise term.

G. ISPRS Challenge Test Results

The results on the unlabeled test images were submitted to the ISPRS for evaluation. Our results for the 2-D labeling challenge are shown in Table IX for the CNN only, the combined CNN and RF probabilities, and the combined probabilities postprocessed with the CRF. In comparison to Table VII, the accuracy is higher than on the validation set, particularly for low vegetation and trees.


Fig. 8. Examples of our semantic labeling results on ISPRS Vaihingen dataset. From left to right: original CIR image, ground truth, computed result.

Fig. 9. Sample of the Casterton CIR imagery and our classification results.

Fig. 8 shows examples of our semantic labeling results on the ISPRS Vaihingen dataset.

H. Casterton CIR Data

To test the generalization performance of the framework, the system trained on the ISPRS data was tested on an entirely new dataset. CIR imagery was captured using a Vexcel UltraCam-D airborne sensor over the town of Casterton, Australia, by Aerometrex [48] at 15 cm GSD. The imagery consists of three-channel pseudo-orthophotos4 with near-IR, red, and green channels, similar in nature to the training data. The imagery has neither DSM data nor ground-truth labels. Using the system pretrained on the ISPRS Vaihingen CIR data without the DSM channels, semantic labels were generated for the new Casterton data. The postprocessing CRF step was excluded and only the CNN features were used; fusion with the hand-crafted features and the RF was omitted in this experiment. The test data were preprocessed by resampling to the GSD of the training data and histogram matching. A sample of the Casterton CIR imagery and its classification using the ISPRS-pretrained framework is presented in Fig. 9. At first glance, the classification does not appear to be good; however, on closer inspection we found that impervious surface, low vegetation, tree, and car were classified very well. The problem class was building, which is often classified as impervious surface. Fig. 10 shows samples of zoomed-in imagery with their classification.

4Not a true orthophoto, i.e., no orthographic correction has been applied.


Fig. 10. Generalization capability of our proposed approach. The model was trained on the ISPRS CIR data (without the DSM) and evaluated on Casterton CIR data and ISPRS test data. (a) Examples of Casterton CIR data and the classification results of our approach. (b) Examples of ISPRS test data and the classification results of our approach.

Fig. 11. Differences between the ISPRS-provided data and easily accessible RGB satellite images. (a) Temporal changes (circled in red). (b) Reduced resolution and compression artifacts.

Buildings that tend to have brown roofs appear to have been correctly classified, like the buildings in the ISPRS data. This strongly suggests that the ConvNet is relying on spectral properties to extract distinguishing features. Perhaps texture or structural properties could be emphasized by training on greyscale imagery. Cars are segmented quite well, considering that the Aerometrex imagery has a 15 cm GSD whereas the ISPRS data used for training the framework has a 9 cm GSD. The pretrained system performs considerably well even though the test data have a coarser GSD than the training data (about one-third as many pixels per square meter).

I. RGB Imagery

RGB-channel satellite imagery can be easily accessed through Google Maps, Bing Maps, the U.S. Geological Survey, or other similar service providers. In this section, we apply the proposed approach to this easily accessible RGB satellite imagery. We retrained our system using these RGB satellite images with the same experimental settings described in the previous experiments.

We extract RGB imagery from Google Maps/Bing Maps of the same location as the ISPRS data (Vaihingen, Germany). These images were captured by Digital Globe's WorldView-2 satellite at 46 cm GSD at nadir. We apply the scale-invariant feature transform (SIFT) to compute point correspondences between the Google/Bing images and the ISPRS orthophotos, and the random sample consensus (RANSAC) algorithm to estimate the homography between them. The images are then transformed to the ISPRS data's coordinate system using the estimated homography. Since the two image sources differ greatly (in image resolution, type of input channels, and the time at which the images were captured), we observe several differences between the ISPRS-provided data and the RGB satellite imagery.


Fig. 12. Mapped labels on RGB satellite imagery appear noisy. (a) Imp. Surface. (b) Building. (c) Low veg. (d) Tree. (e) Car.

Fig. 13. Classification results on RGB satellite imagery. Left: Satellite imagery, Middle: Ground-truth, Right: Our results.

Because the two image sources differ greatly in image resolution, input channels, and capture time, we observe several differences between the ISPRS-provided data and the RGB satellite imagery. As shown in Fig. 11, the RGB satellite imagery contains compression artifacts and is of considerably lower resolution than the ISPRS data. In addition, the ground truth provided by ISPRS may not map correctly onto the RGB satellite imagery because of temporal changes. Although these changes in content amount to less than 2% of all pixels, the mapped labels do not always represent the image content, as shown in Fig. 12. Using these RGB satellite images, we applied our approach with the same experimental settings as in the previous experiments, excluding the DSM images and the postprocessing CRF step. Fig. 13 shows some experimental results of our approach on the satellite imagery, and the overall accuracy is reported in Table X.

TABLE X
ACCURACY OF OUR PROPOSED FRAMEWORK ON RGB SATELLITE IMAGERY

                          Imp. surf.  Building  Low veg.  Tree    Car     Average F1  Overall acc.
ISPRS data (ortho only)   82.91%      88.13%    67.14%    81.77%  53.20%  74.69%      80.10%
RGB satellite imagery     72.99%      77.89%    53.52%    64.76%   8.44%  55.52%      67.73%

We observe that the performance of our approach on Google/Bing Maps (RGB satellite imagery) is worse than its performance on the ISPRS data (orthophoto and DSM images). This drop in performance is due to the lack of DSM images and the infrared band, the noisy labels (shown in Fig. 12), the compression artifacts, and the low resolution of the RGB satellite imagery.


In our opinion and experience, a CRF should only be used when the labeling needs to be refined. Since the labeling of the RGB imagery was so inaccurate and noisy, there was little point in trying to redeem it by applying the CRF. Extending our approach to low-resolution imagery will be a topic of our future research.
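For reference, the per-class F1 scores, average F1, and overall accuracy reported in Table X can be computed from dense label maps roughly as in the sketch below. This is a generic implementation; the official ISPRS benchmark evaluation may differ in detail, for example in how object boundaries are handled.

import numpy as np

CLASSES = ["imp_surf", "building", "low_veg", "tree", "car"]

def evaluate(pred, gt, num_classes=len(CLASSES)):
    """pred and gt are integer label maps of identical shape."""
    # Pixel-level confusion matrix: rows = ground truth, columns = prediction.
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)

    tp = np.diag(conf).astype(np.float64)
    precision = tp / np.maximum(conf.sum(axis=0), 1)
    recall = tp / np.maximum(conf.sum(axis=1), 1)
    f1 = 2.0 * precision * recall / np.maximum(precision + recall, 1e-12)

    return {"per_class_f1": dict(zip(CLASSES, f1)),
            "average_f1": float(f1.mean()),
            "overall_accuracy": float(tp.sum() / conf.sum())}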

V. DISCUSSION

As was also highlighted in [20] for street scenes, this paper demonstrates that CNNs can effectively perform dense semantic labeling of aerial imagery. The accuracy is sufficiently high to enable automatic generation of vector data for a variety of tasks, such as change detection and building segmentation. The features are learned directly from the data rather than being hand-crafted. In contrast to the sophistication and computational cost of the CNN approach, simple pixel-level hand-crafted features achieved almost the same accuracy. Perhaps this is not surprising because the input data are designed to discriminate the target classes: the DSM highlights houses and trees, and infrared highlights vegetation. In single-channel panchromatic images, these phenomenologies cannot be relied upon, and the CNN's texture-based approach would be much more accurate.

In [21], it was found that CRF smoothing had a negative effect on accuracy, whereas in this paper the accuracy improved. This is most likely because our CRF is defined at the pixel level rather than on superpixels as in [21]. In agreement with [21], we conclude that the CRF improves the labeling visually, for example, by removing speckle from the classifier output labels. Since the CNN is applied with a sliding window, it does not have access to object-level context when classifying objects much larger than the CNN input size. CRFs could provide object-level constraints using higher-order cliques or a hierarchical approach, for example, to constrain the edges of buildings to be straight lines. Labeling of cars could be improved by a rotation-invariant car detector [49]. CRFs provide a probabilistic framework for combining such detections with the classifier labeling [50].
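For concreteness, the pixel-level CRF used for postprocessing can be viewed as approximately minimizing a standard pairwise energy; the form below is a generic sketch in our own notation rather than the exact potentials of our model.

E(\mathbf{x}) = \sum_{i \in \mathcal{V}} \psi_i(x_i) + \lambda \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(x_i, x_j),

where the unary term \psi_i(x_i) is derived from the per-pixel class probabilities (e.g., \psi_i(x_i) = -\log p(x_i)), the pairwise term \psi_{ij} penalizes label disagreement between neighboring pixels and is typically weighted by the local image contrast so that smoothing respects edges, and \lambda balances the two terms. Energies of this form can be minimized approximately with graph-cut algorithms [44], [45].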

We have shown that a pretrained model can generalize to new datasets captured with a sufficiently similar sensor, even when captured on the other side of the world. However, generalization fails on new types of building and landscape. To construct a generally useful labeling system, it would have to be trained on a very wide variety of scenes with a greater number of classes. This presents a major challenge for semantic labeling due to the high cost of manual labeling. Our future work will focus on automatic and semisupervised methods for obtaining ground-truth labels, for example, exploiting OpenStreetMap vector data.

The hand-crafted features were carefully selected to complement the CNN features. Adding a greater variety of hand-crafted features may or may not help. The scope of this paper is to present a combined framework using CNN and hand-crafted features for semantic labeling; investigating other combinations of hand-crafted features or fine-tuning the pretrained CNN model is out of scope for this manuscript. We leave these open issues for future work [51].

We would like to mention here that the provided ground truth contains several ambiguities; for example, tree and low vegetation can sometimes be mislabeled. We illustrate some of these mislabeled ground-truth regions in Fig. 6. These errors make up a relatively small part of the dataset.

ACKNOWLEDGMENT

The authors would like to thank the ISPRS for providing the labeled benchmark dataset and M. Gerke [21] for providing access to the normalized DSM. The labeled dataset is of high quality, and the research community is fortunate to have such a fantastic resource. The authors would also like to thank T. Cooke for discussions on CRFs and A. Burke for collaborating on the CRF code.

REFERENCES

[1] V. Mnih, “Machine learning for aerial image labeling,” Ph.D. dissertation, Univ. Toronto, Toronto, ON, Canada, 2013.

[2] J. Niemeyer, J. Wegner, C. Mallet, F. Rottensteiner, and U. Soergel, “Conditional random fields for urban scene classification with full waveform lidar data,” in Photogrammetric Image Analysis. Berlin, Germany: Springer, 2011, pp. 233–244. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-24393-6_20

[3] J. Porway, K. Wang, B. Yao, and S.-C. Zhu, “A hierarchical and contextual model for aerial image understanding,” in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Jun. 2008, pp. 1–8.

[4] ISPRS WG III/4, “ISPRS 2D Semantic Labeling Contest,” 2015. [Online]. Available: http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html

[5] L. Bruzzone and B. Demir, “A review of modern approaches to classification of remote sensing data,” in Land Use and Land Cover Mapping in Europe: Practices & Trends. Dordrecht, The Netherlands: Springer, 2014, pp. 127–143. [Online]. Available: http://dx.doi.org/10.1007/978-94-007-7969-3_9

[6] X. Huang, C. Xie, X. Fang, and L. Zhang, “Combining pixel- and object-based machine learning for identification of water-body types from urban high-resolution remote-sensing imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2097–2110, May 2015.

[7] C. Iovan, D. Boldo, and M. Cord, “Detection, characterization, and modeling vegetation in urban areas from high-resolution aerial imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 1, no. 3, pp. 206–213, Sep. 2008.

[8] N. Kussul, G. Lemoine, J. Gallego, S. Skakun, and M. Lavreniuk, “Parcel based classification for agricultural mapping and monitoring using multi-temporal satellite image sequences,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Milan, Italy, 2015, pp. 165–168.

[9] A. Y. M. Lin, A. Novo, S. Har-Noy, N. D. Ricklin, and K. Stamatiou, “Combining Geoeye-1 satellite remote sensing, UAV aerial imaging, and geophysical surveys in anomaly detection applied to archaeology,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 4, no. 4, pp. 870–876, Dec. 2011.

[10] U. Soergel, E. Cadario, A. Thiele, and U. Thoennessen, “Feature extraction and visualization of bridges over water from high-resolution InSAR data and one orthophoto,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 1, no. 2, pp. 147–153, Jun. 2008.

[11] N. Yokoya and A. Iwasaki, “Object detection based on sparse representation and Hough voting for optical remote sensing imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2053–2062, May 2015.

[12] A. Zarea and A. Mohammadzadeh, “A novel building and tree detection method from LiDAR data and aerial images,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 9, no. 5, pp. 1864–1875, May 2016.

[13] L. Zhang, Z. Shi, and J. Wu, “A hierarchical oil tank detector with deep surrounding features for high-resolution optical satellite imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 10, pp. 4895–4909, Oct. 2015.

[14] Q. Zou, L. Ni, T. Zhang, and Q. Wang, “Deep learning based feature selection for remote sensing scene classification,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 11, pp. 2321–2325, Nov. 2015.

[15] G. M. Foody and D. S. Boyd, “Using volunteered data in land cover map validation: Mapping West African forests,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 6, no. 3, pp. 1305–1312, Jun. 2013.


[16] K. Karantzalos, D. Bliziotis, and A. Karmas, “A scalable geospatial web service for near real-time, high-resolution land cover mapping,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 10, pp. 4665–4674, Oct. 2015.

[17] L. Meng and J. P. Kerekes, “Object tracking using high resolution satellite imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 5, no. 1, pp. 146–152, Feb. 2012.

[18] R. Zhao, B. Du, and L. Zhang, “A robust nonlinear hyperspectral anomaly detection approach,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 4, pp. 1227–1234, Apr. 2014.

[19] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context,” Int. J. Comput. Vision, vol. 81, no. 1, pp. 2–23, 2009. [Online]. Available: http://dx.doi.org/10.1007/s11263-007-0109-1

[20] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, Aug. 2013.

[21] M. Gerke, “Use of the stair vision library within the ISPRS 2D semantic labeling benchmark (Vaihingen),” Univ. Twente, Enschede, The Netherlands, Tech. Rep. TR-P2015, 2015.

[22] S. Paisitkriangkrai, J. Sherrah, P. Janney, and A. van den Hengel, “Effective semantic pixel labelling with convolutional networks and conditional random fields,” in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Jun. 2015, pp. 36–43.

[23] T. T. Nguyen, H. Grabner, H. Bischof, and B. Gruber, “On-line boosting for car detection from aerial images,” in Proc. IEEE Int. Conf. Res., Innovation Vision Future, Mar. 2007, pp. 87–95.

[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1106–1114.

[25] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recogn., 2014, pp. 512–519.

[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vision Pattern Recogn., 2014, pp. 1–8.

[27] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2094–2107, Jun. 2014.

[28] D. Tuia, R. Flamary, and N. Courty, “Multiclass feature learning for hyperspectral image classification: Sparse and hierarchical solutions,” ISPRS J. Photogramm. Remote Sens., vol. 105, pp. 272–285, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0924271615000234

[29] K. Makantasis, K. Karantzalos, A. Doulamis, and N. Doulamis, “Deep supervised learning for hyperspectral data classification through convolutional neural networks,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2015, pp. 4959–4962.

[30] M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, “Building detection in very high resolution multispectral data with deep learning features,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2015, pp. 1873–1876.

[31] A. Lagrange et al., “Benchmarking classification of earth-observation data: From learning explicit features to convolutional networks,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2015, pp. 4173–4176.

[32] IEEE GRSS Data Fusion Contest. (2015). [Online]. Available: http://www.grss-ieee.org/community/technical-committees/data-fusion

[33] O. Firat, G. Can, and F. Y. Vural, “Representation learning for contextual object and region detection in remote sensing,” in Proc. 22nd Int. Conf. Pattern Recog., Aug. 2014, pp. 3708–3713.

[34] V. Mnih and G. Hinton, “Learning to detect roads in high-resolution aerial images,” in Proc. Eur. Conf. Comput. Vision, 2010, pp. 210–223.

[35] S. Kluckner and H. Bischof, “Image-based building classification and 3D modeling with super-pixels,” in Proc. Int. Soc. Photogrammetry Remote Sens., Photogrammetric Comput. Vision Image Anal., 2010, pp. 233–238.

[36] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” J. Mach. Learn. Res., vol. 5, pp. 101–141, 2004.

[37] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Scene parsing with multiscale feature learning, purity trees, and optimal covers,” in Proc. Int. Conf. Mach. Learn., 2012, pp. 319–327.

[38] P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J. Comput. Vision, vol. 57, no. 2, pp. 137–154, 2004.

[39] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vision Pattern Recogn., 2005, vol. 1, pp. 886–893.

[40] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.

[41] S. Walk, N. Majer, K. Schindler, and B. Schiele, “New features and insights for pedestrian detection,” in Proc. IEEE Conf. Comput. Vision Pattern Recogn., San Francisco, CA, USA, 2010, pp. 1030–1037.

[42] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, 2008.

[43] W. Y. Zou, X. Wang, M. Sun, and Y. Lin, “Generic object detection with dense neural patterns and regionlets,” CoRR, vol. abs/1404.4316, 2014.

[44] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222–1239, Nov. 2001.

[45] Y. Boykov and V. Kolmogorov, “An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1124–1137, Sep. 2004.

[46] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Eur. Conf. Comput. Vision, 2014, pp. 818–833.

[47] S. Paisitkriangkrai, C. Shen, and A. van den Hengel, “Pedestrian detection with spatially pooled features and structured ensemble learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 6, pp. 1243–1257, 2016.

[48] Aerometrex. (2015). [Online]. Available: http://www.aerometrex.com.au

[49] K. Liu et al., “Rotation-invariant HOG descriptors using Fourier analysis in polar and spherical coordinates,” Int. J. Comput. Vision, vol. 106, no. 3, pp. 342–364, 2014. [Online]. Available: http://dx.doi.org/10.1007/s11263-013-0634-z

[50] L. Ladicky, P. Sturgess, K. Alahari, C. Russell, and P. Torr, “What, where & how many? Combining object detectors and CRFs,” presented at the 11th European Conf. Computer Vision, Crete, Greece, 2010.

[51] J. Sherrah, “Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery,” CoRR, vol. abs/1606.02585, 2016.

Sakrapee Paisitkriangkrai received the Bachelor’s degree in computer engineering, the Master’s degree in biomedical engineering, and the Ph.D. degree from the University of New South Wales, Sydney, Australia, in 2002, 2003, and 2010, respectively.

He is currently a Postdoctoral Researcher at the Australian Centre for Visual Technologies, The University of Adelaide, Adelaide, SA, Australia. His research interests include pattern recognition, image processing, and machine learning.

Jamie Sherrah received the Bachelor’s degree in engineering and the Ph.D. degree in machine learning from the University of Adelaide, Adelaide, SA, Australia, in 1995 and 1999, respectively.

He is currently a Senior Computer Vision Scientist in the Advanced Geospatial-Intelligence Exploitation Group at the Defence Science and Technology Group, Edinburgh, SA, Australia. Previously, he worked as a Postdoc at Queen Mary, University of London, on video analytics for visual surveillance and human–computer interaction. As Chief Scientist at the startup Clarity Visual Intelligence (2001–2007), he developed commercial software applying computer vision to video surveillance analytics.

Dr. Sherrah was the General Chair of the 2015 International Conference on Digital Image Computing: Techniques and Applications.

Pranam Janney received the B.Eng. degree from Visveswaraiah Technological University, India, the M.Eng. degree in image processing from La Trobe University, Melbourne, Australia, and the Ph.D. degree in computer vision and machine learning from The University of New South Wales, Sydney, Australia, in 2002, 2004, and 2010, respectively.

He previously worked as a Developer at Analog Devices Inc. and in research capacities at Mahindra Satyam, Canon Inc., and National ICT Australia. Since May 2011, he has been a Scientist with the Advanced Geospatial Exploitation branch at the Defence Science and Technology Group. His research interests include applied machine learning, scene/content analysis, and pattern recognition. He is currently serving as a Grants Assessor for the Australian Research Council.

Anton van den Hengel received the Bachelor of mathematical science degree, the Bachelor of laws degree, the Master’s degree in computer science, and the Ph.D. degree in computer vision from The University of Adelaide, Adelaide, SA, Australia, in 1991, 1993, 1994, and 2000, respectively.

He is a Professor and the Founding Director of the Australian Centre for Visual Technologies at the University of Adelaide, focusing on innovation in the production and analysis of visual digital media.

