Building and road detection from large aerial imagery
Shunta SAITO*, Yoshimitsu AOKI*
*Graduate School of Science and Technology, Keio University, Japan
Motivation
• Understanding aerial images is in high demand for generating maps, analyzing disaster scale, detecting changes to manage real estate, etc.
• However, this has usually been done by human experts, which makes it both slow and costly.
• The remote sensing community has long focused on this task, but it is still difficult to detect terrestrial objects automatically from aerial images with high accuracy.
Goal
Input: aerial image (RGB)
Output: 3-channel map (R: road, G: building, B: others)
The trade-off between different objects at the same pixel is handled jointly.
Previous Works: Senaras et al., Building detection with decision fusion, 2013
Process flow: the input aerial image (infrared, RGB) is segmented by mean shift; 15 different features are extracted from derived images (Infrared-Red, Infrared-Red-Green, Hue-Saturation-Intensity (HSI), and Normalized Difference Vegetation Index (NDVI) images, plus vegetation and shadow masks); each feature feeds a separate classifier, and the multiple classification results are combined (decision fusion) to produce predicted labels, which are compared against ground truth.
Previous Works: Volodymyr Mnih, Machine Learning for Aerial Image Labeling, 2013
• They take a patch-based approach, which is well suited to Convolutional Neural Networks (CNNs).
• They formulate the problem as learning a mapping from an aerial image patch to a label image patch.
• However, they train two CNNs, one for buildings and another for roads, even though there may be trade-offs between the two.
Process flow: the aerial imagery is split into patches, each patch is fed to a CNN, and a noise model refines the predicted label.
Dataset
Aerial images with building labels: ×151; with road labels: ×1109
Convolutional Neural Network
64 × 64 × 3 (RGB) patches are fed to the CNN, which outputs 16 × 16 × 3 (building, road, other) predictions; the loss against the correct answers is calculated and minimized by backpropagation.
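As a concrete sketch of this patch pipeline, the input/label shapes from the slide can be reproduced with NumPy slicing (the helper name and the toy arrays are ours, not from the paper):

```python
import numpy as np

def extract_patch_pair(aerial, label, y, x, in_size=64, out_size=16):
    """Cut a 64x64 RGB input patch and the 16x16 3-channel label
    patch at its center; sizes follow the slide."""
    hi, ho = in_size // 2, out_size // 2
    in_patch = aerial[y - hi:y + hi, x - hi:x + hi]
    out_patch = label[y - ho:y + ho, x - ho:x + ho]
    return in_patch, out_patch

# toy 1500x1500 tile, matching the dataset's image size
aerial = np.zeros((1500, 1500, 3), dtype=np.uint8)
label = np.zeros((1500, 1500, 3), dtype=np.uint8)
x_patch, y_patch = extract_patch_pair(aerial, label, 750, 750)
print(x_patch.shape, y_patch.shape)  # (64, 64, 3) (16, 16, 3)
```

The label patch is deliberately smaller than the input patch, so the network sees surrounding context when predicting the center.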
Our Approach
We train a Convolutional Neural Network (CNN) as a mapping from an input aerial image patch to a 3-channel label image patch, using stochastic gradient descent.
Input aerial image patch (RGB) → C(64, 9×9/2) → P(2/1) → C(128, 7×7/1) → C(128, 5×5/1) → FC(4096) → FC(768) → predicted map patch
• We train a CNN with the above architecture.
• The CNN takes a small RGB image patch as input and outputs a predicted 3-channel label patch.
• The predicted label patch consists of a Road channel, a Building channel, and an Others channel.
• No pre-processing such as segmentation is needed, and we don't need to design any image features: the CNN obtains good feature extractors automatically.
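The sizes in the architecture string can be sanity-checked by walking a 64×64 patch through the layers; this sketch assumes unpadded ("valid") convolutions and pooling, which the slide does not state explicitly:

```python
def conv_out(size, kernel, stride=1):
    """Spatial output size of an unpadded conv/pool layer."""
    return (size - kernel) // stride + 1

size = 64                      # input patch is 64x64x3
size = conv_out(size, 9, 2)    # C(64, 9x9/2)  -> 28
size = conv_out(size, 2, 1)    # P(2/1)        -> 27
size = conv_out(size, 7, 1)    # C(128, 7x7/1) -> 21
size = conv_out(size, 5, 1)    # C(128, 5x5/1) -> 17
flattened = 128 * size * size  # feature vector entering FC(4096)
print(size, flattened)         # 17 36992
# FC(768) matches the output patch: 768 = 16 * 16 * 3
```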
Patch-based framework
Using context: the framework allows the network to use surrounding pixels to predict labels in the center patch.
Given an aerial image patch s (width w_s), the predicted label patch m̃ (width w_m) has one pixel m̃_i per location, each 1-of-3 coded over (Road, Building, Otherwise), e.g. m̃_i = (m̃_i1, m̃_i2, m̃_i3) = (1, 0, 0). Assuming the pixels are conditionally independent given s,

p(m̃|s) = ∏_{i=1}^{w_m²} p(m̃_i|s)

We learn this mapping with the CNN.
Loss function
• Each pixel in the predicted label image:
  - is independent of the others (assumption)
  - always belongs to exactly one of the 3 labels (building, road, others)
The raw CNN output for pixel i, m̂_i = (1.56, 4.37, 3.11), is passed through a softmax to give the predicted label distribution (0.05, 0.74, 0.21); the correct label m̃_i is 1-of-3 coded, e.g. (1, 0, 0).
c: channel (Building, Road, Others); P: correct label distribution (1-of-3 coding); Q: predicted label distribution
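The softmax step above can be reproduced with the slide's example numbers (a minimal NumPy sketch):

```python
import numpy as np

# Per-pixel softmax turning the raw CNN output m̂_i into a
# predicted label distribution Q over the 3 channels.
raw = np.array([1.56, 4.37, 3.11])      # example scores from the slide
q = np.exp(raw) / np.exp(raw).sum()     # softmax
print(q)  # ≈ (0.05, 0.74, 0.21), matching the slide
```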
Asymmetric cross entropy

L = −∑_c w_c ∑ P(c) log Q(c)

where w_c is the weight for each channel's loss. We set w_building = 1.5, w_road = 1.5, w_others = 0, because prediction loss in the others channel is not important, and simply minimize this cross entropy by Stochastic Gradient Descent.
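A minimal NumPy sketch of this weighted loss for a single pixel, using the slide's example distributions (the variable names are ours):

```python
import numpy as np

# Asymmetric cross entropy over the 3 channels:
# L = -sum_c w_c * P(c) * log Q(c), with w_others = 0 so that
# mistakes in the "others" channel are not penalized.
w = np.array([1.5, 1.5, 0.0])     # (building, road, others) weights
P = np.array([1.0, 0.0, 0.0])     # correct label, 1-of-3 coding
Q = np.array([0.05, 0.74, 0.21])  # predicted distribution (softmax output)

eps = 1e-12                        # avoid log(0)
loss = -(w * P * np.log(Q + eps)).sum()
print(round(loss, 4))  # ≈ 4.4936, i.e. -1.5 * log(0.05)
```

Because w_others = 0, any probability mass the network assigns to "others" pixels is ignored by the loss, which is exactly the asymmetry the slide describes.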
Dataset
Building label + Road label → 3-channel label, paired with the aerial image.
• We combine Volodymyr Mnih's road and building detection datasets* to create our 3-channel map dataset.
• Our dataset contains 147 sets of aerial images and 3-channel label images: 137 sets for training and 10 sets for testing.
• Each image is 1500 × 1500 pixels at 1 m²/pixel resolution.
• The entire dataset covers roughly 340 km² of mainly urban regions in Massachusetts, the United States.
* http://www.cs.toronto.edu/~vmnih/data/
Experiment
• Training with 137 images and labels
• Testing with 10 images and labels
• We test several variants of the basic architecture.

Basic architecture: input aerial image patch (RGB) → C(64, 9×9/2) → P(2/1) → C(128, 7×7/1) → C(128, 5×5/1) → FC(4096) → FC(768) → predicted map patch

Tested architectures:
Architecture     Activation  Dropout rate  Filter size
S-ReLU (Basic)   ReLU        N/A           9-7-5
S-ReLU-Dropout   ReLU        0.5           9-7-5
S-Maxout         Maxout      0.5           9-7-5
ReLU             ReLU        N/A           16-4-3
ReLU-Dropout     ReLU        0.5           16-4-3
Maxout           Maxout      0.5           16-4-3
Example of Test Results
Input aerial image and the predicted 3-channel label image from the basic architecture.
Precision at breakeven point:

                 Road    Building
S-ReLU (Basic)   0.8905  0.9241
S-ReLU-Dropout   0.8889  0.9220
S-Maxout         0.8842  0.9185
ReLU             0.8657  0.8984
ReLU-Dropout     0.8650  0.8973
Maxout           0.8548  0.8940
Volodymyr Mnih   0.8873  0.9150

• Our basic architecture achieved the best results.
• Using Maxout, Dropout, or both does not seem to improve performance.
• Architectures with smaller filter sizes perform better than those with bigger filters.
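Precision at breakeven point is the precision at the confidence threshold where precision equals recall. A small sketch on toy scores (not the paper's data; the function name is ours):

```python
import numpy as np

def breakeven_precision(scores, labels):
    """Precision at the cutoff where precision is closest to recall."""
    order = np.argsort(-scores)                # descending confidence
    labels = labels[order]
    tp = np.cumsum(labels)                     # true positives per cutoff
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    i = np.argmin(np.abs(precision - recall))  # closest to breakeven
    return precision[i]

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])  # pixel confidences
labels = np.array([1, 1, 0, 1, 0, 0])              # ground truth
print(breakeven_precision(scores, labels))
```

Reporting precision at this single operating point gives one threshold-free number per class, which is why the table above has one column each for Road and Building.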
Conclusion
• We propose a CNN-based building and road extraction method for aerial imagery.
• Our method doesn't need hand-designed image features, because good feature extractors are automatically constructed by training the CNN.
• Our CNN predicts building and road regions simultaneously at state-of-the-art accuracy.
Thank you for your kind attention.
All code to generate our dataset, train the CNN, and test the resulting models is available on GitHub:
https://github.com/mitmul/ssai