Scalable Remotely Sensed Image Mining
Using Supervised Learning and Content-based
Retrieval
Ritendra Datta, Jia Li, Ashish Parulekar, and James Z. Wang
Abstract
Automated satellite image analysis systems have traditionally been designed to accurately analyze
small land areas, typically covered by a single satellite image. It is often highly desirable to have a
real-time system which can analyze land cover spanning large regions. In this paper, we approach the
problem of large-scale satellite image mining from a content-based retrieval and semantic categorization
perspective. A two-stage architecture for automatic retrieval of satellite image patches is proposed. The
semantic categories of query patches are determined and patches from that category are ranked based
on an image similarity measure. Semantic categorization is done by a learning approach involving the
two-dimensional multi-resolution hidden Markov model (2-D MHMM). Patches that do not belong to
any trained category are handled using a support vector machine (SVM) based classifier. One issue is
the image variations due to changing sun elevation angles, which poses a hindrance to robust image
mining. We tackle this problem using histogram transformations. Experiments yield promising results
in modeling semantic categories within satellite images using 2-D MHMM, producing accurate and
convenient browsing. We also show that prior semantic categorization improves retrieval performance.
A system prototype has been created for demonstration purposes.
I. INTRODUCTION
Since 1972, when the first remote sensing satellite was launched, there have been significant
technological advances in optical sensing systems. While remote sensing data acquisition
The authors are with The Pennsylvania State University, University Park, PA 16802, USA.
The system prototype can be accessed at http://riemann.ist.psu.edu/remote/ .
technology has reached new heights, automated analysis of the collected data has not evolved
at the same pace to meet the current requirements. Every day, there is a massive amount of
remotely sensed data being collected and sent by terrestrial satellites. Prohibitive costs involved
in the manual analysis of these large volumes have made it imperative to develop automated
image analysis and mining tools. An automated, real-time, content-based satellite image retrieval
system capable of handling large volumes is therefore invaluable for information mining.
In this paper, we propose a system to aid in large-scale mining of remotely sensed images. In
essence, a content-based image retrieval (CBIR) system designed specifically for satellite imagery
is proposed. We represent collections of satellite images by a database of non-overlapping
fixed-size rectangular regions, henceforth referred to as patches. The proposed system involves
supervised learning of land cover categories of interest, and using the category information to
aid querying and browsing in a CBIR framework. It is shown through experiments that (1)
two-dimensional multi-resolution hidden Markov models (2-D MHMM) are an effective way
to jointly model the spectral and spatial structure of different land cover categories in satellite
imagery, and hence are valuable tools for automatic learning from this type of imagery, (2) CBIR
is a practical approach for real-time browsing and analysis of remotely sensed imagery, and (3)
performing categorization prior to applying image retrieval techniques significantly increases
speed and precision of retrieval from large collections of satellite imagery. The novelty and
key contributions include the proposal and implementation of a flexible system architecture
that allows for easy modifications, an intuitive user interface to aid in practical analysis, and
the demonstration that prior categorization leads to a better remotely sensed image retrieval
performance. No attempt is made to compare the categorization process in our architecture with
existing classification approaches.
For a general overview, consider the interface shown in Fig. 1. With the aim of mining the
image database for crop fields of a certain kind, an image analyst can use an example patch and
our system to retrieve all patches which resemble the particular pattern. A generative classifier
screens the retrieval results to ensure that patches retrieved are from the same semantic category
as the query, in this case crop fields. In other cases, land cover of a certain type within a particular
geographic region may be of interest. The proposed system is designed to aid in the retrieval
of such patches in real-time for further analysis. Other applications of our system could be as
follows. If an analyst specializing in residential areas wishes to scrutinize all images containing
Fig. 1. Screenshot of the system interface. A crop patch query (top-left) is used to retrieve other visually similar crop regions
within the database. Prior categorization of patches helps produce high-precision results.
significant portions of residential land cover, manually browsing through all available imagery
is typically impractical when large volumes are involved. Instead, our proposed system can be
employed to quickly and automatically find such images. The analyst can then proceed with the
investigation on the selected set of relevant images, possibly a fraction of the original volume.
The paper is arranged as follows. We survey work closely related to ours in Sec. II. In
Sec. III, the salient characteristics of the remotely sensed data used in our experiments and system
prototype are discussed. In Sec. IV, we elaborate on the proposed system architecture, including
the user interface for browsing and retrieval. In Sec. V, the learning-based categorization methods
and the CBIR-based image mining technique are discussed, along with methods for image
normalization to handle sun elevation variance and other atmospheric effects. We then
discuss the experimental setup and the obtained results in Sec. VII, and conclude in Sec. VIII.
II. RELATED WORK
Thus far, there have been numerous attempts at automated classification of remotely sensed
imagery. In this section, we review research that is most closely related to our work.
One of the earliest attempts at remotely sensed image classification was by Haralick et
al. [14]. In more recent years, many different techniques have been proposed, some of which
include texture analysis [24], [20], Markov random fields [31], genetic algorithms [32], fuzzy set
theory [12], [38], [34], Bayesian classifiers [28], [10], decision trees [19] and neural networks
[4], [2]. Generalized orthogonal subspace projection has been found to improve classification
performance on Landsat TM images in [27]. The use of multi-source as well as multi-temporal
imagery for a data fusion approach to Landsat TM image classification has been explored in [5].
Since the growth in popularity of Support Vector Machines (SVMs) as a supervised learning
technique, their use in the classification of Landsat and other types of imagery has been explored,
for example, in [15], [25]. Some others have approached remotely sensed image classification
as an application of CBIR, for example, [24], [29], [11]. Content-based image retrieval for
high-resolution aerial photographs using Gabor filters has been implemented in [26],
[23]. Evaluation of remotely sensed image mining systems has been explored [9]. Assessment
of Landsat image classification performance has been discussed in [8]. Furthermore, in [30],
classification evaluation methods have been reviewed, and the Kappa statistic has been found
most suited for comparing classifiers. Clearly, there has been extensive work on satellite image
classification, by the remote-sensing community as well as the image analysis community.
In a recent survey [37] of satellite image classification results published in the Photogrammetric
Engineering and Remote Sensing journal, it has been reported that over the last fifteen
years classification accuracies have not increased significantly. The paper reports the mean
overall pixel-wise classification accuracy across these experimental results, along with a
considerable standard deviation. However, there are a number of parameters that have changed
over the years or across different experimental settings with respect to the remotely sensed data,
which make it difficult to generalize easily. Some of the parameters that vary include the spatial
and spectral information in the imagery, the sensor type, the geographic location and terrain
conditions of the data sets, and the ground-truth labels of the regions on the maps based on
which the accuracies are calculated (due to their subjective and often ambiguous nature). Owing
to advances in data collection technology, the quality of available image data has improved
as well, with reduced distortions using techniques such as orthorectification. Nonetheless, this
report is indicative of the possible existence of an upper bound on the classification accuracy
achievable in this type of imagery, given the current technology. Hence, the focus of our work
is to build a convenient CBIR-based tool for mining semantically relevant images from large
collections, instead of attempting to improve upon classification accuracy.
III. REMOTELY SENSED IMAGE DATA
Before elaborating on the system design, we discuss here the salient features of the data used
in our experiments and prototype. Prior knowledge about the data helps in choosing an appropriate
feature extraction process and design philosophy. It also helps foresee difficulties that may arise
from the intrinsic nature of the data, so that the approach can be suitably geared toward solving
them.
Fig. 2. Bands in the satellite imagery captured by the ETM+ device on Landsat 7. (a) Band 1 (Blue). (b) Band 2 (Green). (c)
Band 3 (Red). (d) Band 4 (NIR). (e) Band 5 (SWIR 1). (f) Band 7 (SWIR 2). (Courtesy: GLCF).
The Landsat satellites have been orbiting the Earth for over 30 years, and the data that they
collect has been used to study the land cover, environmental resources, and man-made structures
on the Earth’s surface. The Enhanced Thematic Mapper Plus (ETM+) instrument on Landsat
7 is a multi-spectral remote sensing device which captures 8 bands, each sensitive to a
different wavelength range of solar energy. Bands 1, 2 and 3 of the ETM+ instrument record
reflected light in the visible range, and are known as the blue, green, and red bands
respectively. The ETM+ sensor also records energy beyond the visible spectrum. Band 4 is
known as near infrared (NIR), and bands 5 and 7 are known as short wavelength infrared
(SWIR). All these bands are recorded at a spatial resolution of approximately 30 meters. A plot of these 6 bands on a sample land cover can be seen in Fig. 2.
Each wavelength band has the potential to capture different features of the earth’s terrain.
While band 1 (blue) typically illuminates objects under clear water better than other colors,
plants do not show up brightly in this band. On the other hand, band 2 (green) reflects vegetation
brightly and hence can be used to determine the health of vegetation. Band 2 also gives excellent
contrast between clear and turbid water. Band 3 (red) reflects well from dead foliage and also
highlights urban features. Band 4 (NIR) can potentially help differentiate between various types
of vegetation. While band 5 (SWIR 1) is useful in distinguishing between various types of
vegetation, it has limited cloud penetration. This in turn helps differentiate between snow and
clouds. Band 7 (SWIR 2) is useful in detecting moisture in soil and vegetation [33]. Clearly,
the spectral dimensions have a lot of information content that can be utilized for classification.
The spatial patterns in local neighborhoods can further help improve upon classification.
Fig. 3. Tri-band Landsat 7 ETM+ image formats. (a) True color format. (b) NIR format. (c) SWIR format. (Courtesy: GLCF)
In typical color image retrieval, tri-band images (usually the RGB spectrum) are used for
classification and indexing. In this work, three bands are selected from the possible 6 bands
available in Landsat ETM+ imagery. This choice was based on a combination of two factors,
(1) calculation of the Optimal Index Factor (OIF) [18] over the set of all possible subset of 3
bands, and (2) consultation with an experienced satellite image analyst working in a government
research lab, who provided intuitions about the choice of spectral bands, based on the land cover
categories used.
A true color format (RGB) consists of bands 1, 2, and 3 representing their corresponding
color bands. As shown in Fig. 3 (a), it produces realistic representations of the land cover. The
problems, though, are that images produced are of low contrast, and that band 1 (blue) is usually
noisy due to its high dispersion in the atmosphere. In the SWIR format, blue color is displayed
by band 2 (green), green color by band 4 (NIR) and red color by band 5 (SWIR 1) (see Fig. 3 (c)).
While this format is useful in analyzing vegetation patterns, it is very sensitive to cloud cover.
A widely used format is the NIR format, in which blue color is displayed by band 2 (green),
green by band 3 (red) and red by band 4 (NIR) (see Fig. 3 (b)). With this format, vegetation
appears bright red and urban areas appear greenish blue.
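The band-to-channel mappings above can be sketched as follows. This is an illustrative composite routine, not code from the paper; it assumes each band is available as a 2-D array and rescales each channel independently for display.

```python
import numpy as np

def nir_composite(band2, band3, band4):
    """Stack Landsat ETM+ bands into the NIR false-color format described
    above: blue channel <- band 2 (green), green channel <- band 3 (red),
    red channel <- band 4 (NIR). Inputs are 2-D arrays of equal shape."""
    rgb = np.dstack([band4, band3, band2]).astype(np.float64)
    # Scale each channel to [0, 255] for display.
    for c in range(3):
        ch = rgb[:, :, c]
        lo, hi = ch.min(), ch.max()
        rgb[:, :, c] = 0 if hi == lo else (ch - lo) * 255.0 / (hi - lo)
    return rgb.astype(np.uint8)
```

With this mapping, high band-4 (NIR) response such as healthy vegetation dominates the red channel, which is why vegetation appears bright red in this format.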
To further provide insight into the choice of spectral bands, we computed the OIF for all 20
combinations of 3 bands out of the 6 bands (1,2,3,4,5, and 7) over a representative subset of the
imagery used, consisting of all land cover categories of interest (specified later). The OIF for a
set of 3 bands is computed as

    OIF = ( σ_1 + σ_2 + σ_3 ) / ( |ρ_12| + |ρ_13| + |ρ_23| )    (1)

where σ_i are the standard deviations of the 3 bands, and ρ_jk are the three correlation coefficients
formed out of the 3 bands, taken two at a time. The measure tends to be larger when the
information content for a set of 3 bands is higher. Because the training samples consist of
distinct sets for each land cover category in question, computing the OIF over them ensures that the
measure helps rank discriminative power mainly over the categories of interest. The OIF values
for all 20 triplets are shown in Table I. We find that the top three ranked combinations are (3, 4, 5),
(1, 4, 5), and (2, 3, 4), in that order. Band 4 appears in all of the top 10 combinations in the OIF
table. Intuitively, band 3 (red) helps highlight urban features, and band 2 (green) helps reflect
crop fields better, distinguishing them from other land cover classes, as discussed previously.
Thus, based on both intuition and the OIF table, we find (2, 3, 4) (the NIR format) to be
most suitable, being the third most discriminative combination according to OIF, and hence use
these bands for all our experiments.
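The band ranking described above can be sketched directly from the definition of the OIF: the sum of the three band standard deviations divided by the sum of the absolute pairwise correlations. The function names and the dict-of-arrays layout below are illustrative, not from the paper.

```python
from itertools import combinations

import numpy as np

def oif(bands):
    """Optimal Index Factor for a dict {band_id: 2-D array} of exactly 3
    bands: sum of band standard deviations over sum of absolute pairwise
    correlation coefficients, as in Eq. (1)."""
    ids = list(bands)
    assert len(ids) == 3
    flat = [np.asarray(bands[i], float).ravel() for i in ids]
    num = sum(f.std() for f in flat)
    den = sum(abs(np.corrcoef(a, b)[0, 1])
              for a, b in combinations(flat, 2))
    return num / den

def rank_triplets(all_bands):
    """Score every 3-band subset by OIF and return them highest first."""
    trips = combinations(sorted(all_bands), 3)
    scored = [(t, oif({i: all_bands[i] for i in t})) for t in trips]
    return sorted(scored, key=lambda x: -x[1])
```

For the six ETM+ bands (1, 2, 3, 4, 5, 7) this yields the 20 ranked triplets of Table I when run over a representative sample of the imagery.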
TABLE I
OPTIMUM INDEX FACTOR (OIF) FOR THE LANDSAT ETM+ BANDS OVER REGIONS OF INTEREST.
Rank Combination OIF
1 3, 4, 5 41.1730
2 1, 4, 5 40.4513
3 2, 3, 4 39.2354
4 3, 4, 7 37.9621
5 1, 3, 4 37.7571
6 1, 4, 7 37.7465
7 2, 4, 5 34.7594
8 4, 5, 7 33.4958
9 2, 4, 7 33.3145
10 1, 2, 4 30.3658
11 3, 5, 7 19.6789
12 1, 5, 7 18.8772
13 1, 3, 5 18.5256
14 2, 5, 7 18.3707
15 2, 3, 5 18.2690
16 1, 2, 5 16.7659
17 2, 3, 7 14.0359
18 1, 3, 7 13.8118
19 1, 2, 7 12.4120
20 1, 2, 3 10.6813
Publicly available1 Landsat-7 ETM+ images [33] are used, which have a spatial resolution of approximately 30 m. Four land cover categories of interest are considered, namely urban, residential, crop and
mountain regions, although our architecture supports seamless addition of more classes through a
modular training process. The choice of these particular land cover categories, which correspond
roughly with some of the Level I categories in the USGS classification [1], were due to a specific
application that was kept in mind at the inception of the project. Note that the specific formats
and resolutions used in our experiments do not restrict the use of our approach for other formats.
1 Source of data for experiments: http://glcf.umiacs.umd.edu/data/landsat/.
In particular, our 2-D MHMM and IRM based modeling and retrieval processes are not restricted
to three-band imagery, and hence the whole setup can be easily modified to support a higher
number of spectral bands. Our architecture is designed to allow for such additions to the system
without much effort.
IV. ARCHITECTURE OF THE PROPOSED RETRIEVAL SYSTEM
Fig. 4. System architecture. Left side: The database building process. Right side: Real-time user interaction.
The generic framework of the proposed system can be divided into two parts. The off-line
processing part consists of initial data acquisition, normalization, ground-truth labeling, and
model building, while the on-line part consists of querying, browsing and retrieval. A schematic
diagram of the architecture is given in Fig. 4, which follows along the lines of a learning-based CBIR
system for more generic image data. Below we discuss each of these components, followed by
a section on implementation details and issues, which covers the practical problems
we faced while implementing this architecture and how we tackled them.
A. Off-line Indexing of Images
Although we have introduced specifics on the data we experimented on, in this section we
elaborate on the details in a fairly general manner. Consider a set of N satellite images
I_1, ..., I_N of an arbitrary type (e.g., Landsat, ASTER, SRTM). A subset of the available spectra
(here, bands 2, 3, and 4) for the given type of imagery (here, Landsat 7 ETM+) is used to
create composite images. A raw NIR image typically has low brightness and contrast. In order
to improve the visual clarity of the image and to reduce the differences in image contrasts due to
sun elevation angle, we attempt normalization of the images by two different methods, namely
a standard digital-number (DN) to at-surface reflectance conversion procedure commonly used
in the remote sensing community, and a generic normalization procedure practised in the image
processing community. These procedures are elaborated upon in Sec. V-A.
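As a rough illustration of the generic image-processing style of normalization, a percentile-based linear contrast stretch per band is sketched below. This is a stand-in under our own assumptions, not the exact procedure of Sec. V-A or the DN-to-reflectance conversion.

```python
import numpy as np

def stretch(band, lo_pct=2.0, hi_pct=98.0):
    """Generic contrast normalization: clip a band to its 2nd-98th
    percentile range and rescale linearly to [0, 255]. The percentile
    cutoffs are illustrative choices, not values from the paper."""
    b = np.asarray(band, float)
    lo, hi = np.percentile(b, [lo_pct, hi_pct])
    if hi <= lo:
        return np.zeros_like(b, dtype=np.uint8)
    out = np.clip((b - lo) / (hi - lo), 0.0, 1.0) * 255.0
    return out.astype(np.uint8)
```

Applying the same stretch to every image reduces brightness and contrast differences between scenes taken at different sun elevation angles, which is the motivation stated above.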
The normalized images are divided into equal-sized non-overlapping rectangular patches
of size W × H, padding the right and bottom with zeros appropriately, to get a total of
T patches {p_1, ..., p_T}. Traditionally, satellite image classification has been done at the pixel
level [31], [4]. For a typical Landsat image at roughly 30 m resolution, class boundaries are at pixel
level. Even at this level there may sometimes be class ambiguity, due to which some authors have
proposed fuzzy classification [12], [38], [3] as opposed to hard classification. Even though pixel-
level classification may give an overall segmented view of the land cover, formulating a CBIR
framework becomes challenging. While a system working at the patch level may have inferior
land cover classification accuracies, there are distinct advantages to doing so: (1) formulating
the problem in an image retrieval framework becomes convenient; (2) within a patch
category such as urban regions or forests, it helps to find regions with comparable density; and
(3) salient features such as specific patterns of deforestation or terrace farming within
crop fields can be tracked using patch-level classification followed by heuristic search techniques. However,
classification of patches instead of pixels can lead to greater ambiguity, because ground-truth
inter-class boundaries are at the pixel-level, or at an even finer level, especially for some of the
USGS Level II, III, and IV categories [1].
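The tiling step described above (non-overlapping fixed-size patches, with zero padding on the right and bottom) can be sketched as follows; the function and variable names are illustrative.

```python
import numpy as np

def tile_patches(image, ph, pw):
    """Divide an image (H x W [x C]) into non-overlapping ph x pw patches,
    zero-padding the right and bottom edges as described in the text.
    Returns the patches together with each patch's top-left (row, col),
    which the system stores as metadata."""
    h, w = image.shape[:2]
    H = -(-h // ph) * ph          # round up to a multiple of the patch size
    W = -(-w // pw) * pw
    pad = np.zeros((H, W) + image.shape[2:], dtype=image.dtype)
    pad[:h, :w] = image
    patches, coords = [], []
    for r in range(0, H, ph):
        for c in range(0, W, pw):
            patches.append(pad[r:r + ph, c:c + pw])
            coords.append((r, c))
    return patches, coords
```

The (row, col) coordinates plus a parent-image identifier are exactly the patch metadata the off-line indexing stage keeps alongside each patch.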
Fig. 5. Examples of ambiguous patches. Left: Urban and Residential. Right: Residential and Crop.
The two image patch sizes used in our experiments each cover on the order of several square
kilometers on the ground. While these areas are too large to represent precise ground segmentation,
the focus is on convenient visualization of image data.
However, a strategy is still required to resolve the ambiguity that this approach inherently leads
to. As shown in Fig. 5, some patches have large coverage of different categories. In our system,
fuzziness is not incorporated. Instead, we consider a patch p_i to belong to a category c_j if a
dominant share of p_i's area is covered by land of type c_j (dominant category). Patches which do not belong to
any of the categories {c_1, ..., c_K}, or those that do not have a dominant category, are given class
label 0 (category unknown). Dividing the image into rectangular patches makes it convenient
for training as well as browsing. Moreover, for semantic classification, a more global view of
an area is helpful. For example, a few trees in a city may occupy a pixel in the image. This
pixel is still ideally classified as part of an urban area rather than a forest. Thus, the granularity
of patches involves a trade-off between global view and ambiguity. Zooming and panning form
part of the interface to allow users the flexibility in result visualization.
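The dominant-category labeling rule above can be sketched as follows. The exact coverage threshold is an assumption here (the text specifies only a fixed dominance fraction), and the flat-list ground-truth representation is illustrative.

```python
from collections import Counter

UNKNOWN = 0  # class label 0: category unknown / no dominant class

def patch_label(pixel_labels, known, threshold=0.5):
    """Assign a patch the label of its dominant pixel category.
    `pixel_labels` is a flat list of per-pixel ground-truth labels and
    `known` is the set of trained categories {1, ..., K}. The 50%
    threshold is an assumed stand-in for the paper's dominance fraction."""
    counts = Counter(pixel_labels)
    label, freq = counts.most_common(1)[0]
    if label in known and freq / len(pixel_labels) > threshold:
        return label
    return UNKNOWN
```

Patches split roughly evenly between categories, or dominated by an untrained category, fall through to label 0, matching the "category unknown" rule above.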
For each patch p_i, information about the relative coordinates of its top-left corner and its
parent image is stored as metadata. Suppose that there is some way to manually identify K
semantically non-overlapping classes or categories {c_1, ..., c_K} relevant to a specific application.
classifier to help automate the categorization process. This is needed since it is impractical to
manually label the large number of patches representing the remotely sensed image collection.
For this purpose, a small number n of patches from each semantic category c_j are chosen for
training 2-D MHMMs, the generative models used for the classification. Details on 2-D MHMMs
are presented in Sec. V-B. Here we give a brief overview. For each semantic category, a separate
2-D MHMM is trained using visual features of the corresponding n training patches, resulting
in K different models. Now, for each image patch in {p_1, ..., p_T} in the database, the likelihood l_j
of the patch belonging to class j is computed using the trained models. Since the system deals
with a non-exhaustive set of categories, those patches that do not belong to any of the trained
classes are required to be labeled as class 0, as mentioned before. It makes little sense to train
another 2-D MHMM for them, since there may not exist any spatial or textural motif among
them. Instead, another supervised classification is performed using Support Vector Machines
(SVMs). Two sets of randomly chosen training patches are taken: C_1 with manual class label
0 (unknown category), and C_2 with any of the labels {1, ..., K} (known category). The K 2-D
MHMM likelihood estimates for each of the samples of the two classes are used as feature
vectors for training an SVM. A biased SVM classifier is used for this purpose, such that C_2
is predicted with high accuracy at the cost of C_1 being predicted with moderate accuracy. The
reasons for doing so will be discussed in Sec. V-C. If an unknown patch p_i, with 2-D MHMM
likelihood vector (l_1, ..., l_K), is classified as C_1 by this biased SVM, it is labeled as
g_i = 0. Otherwise, its class label is assigned by the rule

    g_i = argmax_j (l_j)    (2)

In summary, the overall classification process is as follows:
- For a given patch p_i, its likelihood l_j, 1 <= j <= K, under each trained model is computed.
- The trained SVM is used to classify the vector (l_1, ..., l_K) as C_1 or C_2.
- If the result is C_1, then g_i = 0; else g_i = argmax_j (l_j).
The class label g_i is stored as metadata for p_i. This set of tasks can be performed entirely
off-line over the collection of satellite imagery in order to generate a large annotated database
of patches. We remind the reader that the metadata for each patch consists of an identifier for
its parent image, its location relative to the top-left corner of the parent, and the predicted class
label in {0, 1, ..., K} assigned to it.
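The classification rule summarized above can be sketched as follows. Here `svm_is_known` is a stand-in for the trained biased SVM's decision function, and the class labels are numbered 1..K with 0 reserved for the unknown category.

```python
def categorize(likelihoods, svm_is_known):
    """Overall patch categorization: a biased SVM gate on the K-vector of
    2-D MHMM likelihoods decides known vs. unknown; known patches then get
    the maximum-likelihood label, mirroring Eq. (2)."""
    if not svm_is_known(likelihoods):
        return 0                                   # class 0: unknown
    # g = argmax_j l_j, with class labels numbered 1..K
    return 1 + max(range(len(likelihoods)), key=lambda j: likelihoods[j])
```

Because the gate runs before the argmax, a patch with uniformly low likelihoods under all K models never receives a spurious trained-category label.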
B. On-line Query Processing
Assume that there is an efficient indexing strategy for handling the database of patches and
associated metadata. The simplest way to represent the query is as follows. Given a query patch
the user seeks to find patches within the same semantic category, sorted by their visual similarity.
Rather than searching through the entire database, it suffices for the system to search through
only those patches that belong to the same category as the query. This helps in improving
both accuracy and retrieval speed. The underlying assumption here is that class predictions
are acceptable. As shall be seen in Sec. VII, categorization using 2-D MHMM and SVM is
fairly accurate. One problem is that patches may not always be homogeneous, i.e., a patch may
contain a mixture of different, possibly unknown categories. Our experiments revealed that the
improvement in quality of mining due to categorization is more evident in these ambiguous cases.
This is especially evident for patches partially covered by known and unknown categories.
Fig. 6. Demonstrating retrieval improvement with semantic categorization. Left: Ambiguous urban query. Center: Unwanted
retrieval result (without categorization). Right: Desirable retrieval (with categorization).
Consider the example in Fig. 6. The left patch contains urban area and water, the middle one
water and forest, and the right one urban area and forest. Suppose urban areas are a known/trained
class, while water is not. If the left patch is queried in our system without prior categorization, the
middle patch ranks higher in visual resemblance than the right patch. This happens because (1)
water is present in the first two patches and absent in the third, and (2) the shapes of the coastlines
in the first two patches are similar. This problem is avoided by using prior categorization, which
eliminates the middle patch from consideration for ranking. In some sense, the action performs
semantic filtering to improve mining quality.
The user has two different means of formulating a query:
Within-database query: In the case that an existing patch is used as the query, its semantic
category is already stored and hence known. Patches in the database whose semantic categories
differ from the query's are eliminated from consideration.
External query: In the case of an external query, the patch is re-sized or trimmed to fit
the standard dimensions W × H and adjusted for spectral encoding, if needed. The semantic
category of this patch is predicted using the 2-D MHMM likelihoods and the biased SVM.
Again, all but the patches sharing this predicted category are eliminated from consideration.
The remaining patches are now ranked according to their visual similarity with the query.
Visual similarity is computed using the Integrated Region Matching (IRM) measure, which is fast
and robust, and can handle large image volumes [35]. The top k matched patches
are then displayed for perusal. The choice of k is contingent upon the specific application.
Experimentation on choosing k, and how the precision of retrieval varies with it, is discussed in
Sec. VII. Note that for the purpose of retrieval, query patches determined to be of unknown category
(class 0) are matched only against the class-0 patches in the database.
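The filter-then-rank query flow above can be sketched as follows. The `distance` callable plays the role of the IRM measure, `predict_category` stands in for the 2-D MHMM + biased SVM categorizer, and the dict layout of the patch records is illustrative.

```python
def retrieve(query, db, k, distance, predict_category):
    """On-line query sketch: restrict the search to patches sharing the
    query's semantic category, then rank by visual distance and return
    the top k. Within-database queries carry a stored label; external
    queries have theirs predicted first."""
    cat = query.get("label")
    if cat is None:                        # external query: predict it first
        cat = predict_category(query["features"])
    pool = [p for p in db if p["label"] == cat]
    pool.sort(key=lambda p: distance(query["features"], p["features"]))
    return pool[:k]
```

Because the category filter shrinks the candidate pool before any distance is computed, both retrieval speed and precision benefit, which is the claim made above.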
C. Implementation Details and Issues
Here we provide more insight into the practical aspects of implementation, and the problems
associated with it. The system architecture is particularly designed keeping in mind scalability
concerns, that is, the ability to handle large land cover, imagery with many spectral dimensions,
and widely varying patch sizes. For this purpose, all implementation has been carried out using a
combination of the C programming language and the freely available MySQL database system.
The scalability of the image retrieval component, IRM, stems from the fact that the module first
builds a fixed width feature set out of all the images in the database. This consists of color
features, texture features, and scale/rotation invariant shape features. This is stored in a MySQL
database, associated with each image patch, along with their location and parent image details.
The use of MySQL allows random access to the set of features over all the files, and it helps
index as many images as the database system allows, a very large number in the case of MySQL.
The index set is rebuilt at regular time intervals, depending upon the frequency of database
updates. Thus the most time-consuming portion of IRM based retrieval is performed only as
an intermittent background process. The real-time image ranking process involves computations
with database columns themselves, and hence is primarily performed as database actions, making
it fast and avoiding the necessity for the images to be read in repeatedly.
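A minimal sketch of this storage pattern is shown below, using Python's built-in SQLite in place of the MySQL system described above; the table and column names are illustrative, not the system's actual schema.

```python
import sqlite3

def build_db():
    """Toy patch index: per-patch parent image, location, predicted
    category label, and a (here scalar) visual feature."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE patches (
        id INTEGER PRIMARY KEY, parent TEXT,
        prow INTEGER, pcol INTEGER,      -- location within the parent image
        label INTEGER, feature REAL)""")
    rows = [(1, "img_a", 0, 0, 1, 0.2), (2, "img_a", 0, 64, 2, 0.9),
            (3, "img_b", 64, 0, 1, 0.5)]
    conn.executemany("INSERT INTO patches VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn

def same_category(conn, label):
    """Prior categorization expressed as a cheap database filter, applied
    before any distance computation as described above."""
    cur = conn.execute("SELECT id FROM patches WHERE label = ?", (label,))
    return [r[0] for r in cur.fetchall()]
```

Keeping the label as an indexed column means the category filter is a plain WHERE clause, so the expensive similarity ranking only ever touches same-category rows.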
The 2D-MHMM and SVM training processes are time consuming, but can be performed
entirely as static processes, which is what is done in our system. These are performed on
relatively small but representative sets of images, and hence are fixed costs, making the training
process independent of the size of the database. Testing with both 2D-MHMM as well as SVM
are rapid, and hence scale well to large databases. Moreover, it is not necessary to perform these
in real-time, since their classifications do not change unless the training process is
re-done. Therefore, once the training of these models is completed, all image patches are
categorized once. This category label is stored in the MySQL database along with the IRM
features and meta-data, indexed by the image. This way, the process of prior categorization
followed by retrieval is achieved simply by a database filtering operation prior the IRM distance
computation. Zooming and panning in the user interface has been speeded up by the use of Haar
wavelet transforms, as explained in Sec. VI.
Issues arise when an external patch is given as the query. In this case, the feature extraction program
needs to be executed on the patch, followed by the 2D-MHMM based categorization, and finally
the SVM based categorization. These steps can together take approximately 3 seconds per image.
Since these programs are executed only once, this is the maximum wait time for the user for
external queries, which is typically tolerable. Secondly, even for the off-line processes,
when there are over a hundred thousand images, the 2D-MHMM categorization can take a
significant amount of time. We note that with the images divided into patches, categorization
can be entirely parallelized, with no dependencies across patches. In our case, the images are
categorized using cluster computers. Each set of patches associated with one Landsat image is
assigned to a separate node in the cluster. This way, we completed the categorization of all 12
Landsat images in one-twelfth the time. We note that categorization is essentially a static process
if the patch database is not being regularly updated. Hence this process can be run once, without
the need for it to run as a background process at regular intervals. Finally, we note that when
an external image query is provided which is significantly smaller in size than those used in the
existing database, the required re-sizing has negative effects on the feature extraction process.
In particular, visual features get distorted when scaled to a larger size, and lose detail. With the
features extracted on this scaled version, the retrieval results are not always favorable. This is
the reason external query patches of very similar or same size are preferred.
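As a concrete illustration of this patch-level parallelism, the sketch below fans a stand-in classifier out over a worker pool. Here `classify_patch` is a hypothetical placeholder, since the actual 2-D MHMM/SVM categorizer is not reproduced.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_patch(patch):
    # Hypothetical stand-in for the 2-D MHMM + SVM categorizer:
    # label a patch by a trivial mean-intensity rule.
    mean = sum(patch) / len(patch)
    return 1 if mean > 128 else 0

def categorize_image(patches, workers=4):
    # Patches are independent, so categorization is embarrassingly
    # parallel: each worker handles patches with no cross-dependencies.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classify_patch, patches))
```

In the actual system the unit of distribution is coarser: all patches of one Landsat image go to one cluster node.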
V. CATEGORIZATION AND RETRIEVAL
We now proceed to elaborate on the specific techniques used in our system. These include the
image normalization process, the 2-D MHMM based categorization process, biased SVM based
categorization, and the IRM distance for computation of visual relevance. Note that scalability
is one of the key advantages of our generative modeling and image retrieval approach. The 2-D
MHMM based categorization has been applied to a 600-category image classification problem in [22]. The
IRM based image retrieval system [35] has been applied to over 1 million images for real-time
retrieval. The use of multi-spectral imagery with higher dimensionality also poses no
restriction to either the generative modeling or the retrieval component. Hence our system has the
potential to handle a large number of land cover categories, a very large number of Landsat
images, and higher-dimensional spectral bands than those used here, without any major
scalability issues.
A. Normalization
The need for image normalization arises from inherent sensing variations due to the sun
elevation angle and other atmospheric effects. As Landsat 7 revolves around the Earth, it
passes over a particular region at the same local time of day. Each column in the Universal
Transverse Mercator (UTM) grid is known as a path. Images within each UTM zone are captured
separately. Images of zones that lie on the same path are fairly consistent because the sun
elevation angle does not vary much while the satellite traces one path. However, as the
satellite traverses different paths, the sun elevation angle keeps changing. With this change,
the amount of incident light on the Earth's surface changes and hence the reflectance changes
with it. As a result, images lying on different paths appear considerably different.
Fig. 7. Satellite images from different UTM paths within the US. Left: Raw NIR image from the East Coast. Right: Raw NIR image from the West Coast (path 46). (Courtesy: GLCF)
Fig. 7 shows raw NIR images from two different paths. The difference in color between
the two images is apparent. Color features extracted from patches of the same semantic category
but from the two different paths can therefore be expected to differ considerably. Without
normalization, most results for such a query patch will be drawn from the same path as that
of the query. This is likely to be disadvantageous, given the typical applications
of satellite image mining. It is further noted that the NIR images in their raw form have low
brightness and contrast.
To compensate for poor contrast of the NIR images and lighting variations across UTM
paths due to sun elevation and other atmospheric effects, we explore two different normalization
procedures. These include a standard digital number (DN) to at-satellite reflectance conversion
procedure commonly used in the remote sensing community, and a generic histogram
normalization procedure practiced in the image processing community.
Fig. 8. Left: Enhanced West Coast images. Center: Enhanced East Coast images. Right: Normalized West Coast images (Reference Path = East Coast). (Courtesy: GLCF)
The Landsat 7 imagery data consists of digital numbers (DN) representing intensity at each
pixel over each spectral band. To compensate for the effects mentioned above, header information
about the particular Landsat images can be utilized. One such compensation method is DN to
at-satellite reflectance conversion [16]. For each band of the multi-spectral imagery, the DN can
be converted to a reflectance value $\rho$ using the following equation:

    $\rho = \dfrac{\pi \, L \, d^2}{E_{sol} \, \sin\theta}, \qquad L = \mathrm{gain} \times DN + \mathrm{bias}$    (3)

where $L$ is the at-satellite radiance for the pixel; $\theta$, gain, and bias are the sun elevation,
gain, and bias for the particular spectrum/image obtained from its header; $E_{sol}$ is the solar
irradiance obtained from [33]; and $d$ is the normalized Sun-Earth distance over the year. It has
been shown that this conversion reduces the variations due to elevation and other effects across
different Landsat 7 images. We apply this conversion to all bands of the training and test data,
and then observe its effect on retrieval over a subset of the test data.
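A minimal sketch of this per-band conversion, under the reconstruction above; the variable names are ours, and the cosine of the solar zenith angle is expressed as the sine of the sun elevation angle.

```python
import math

def dn_to_reflectance(dn, gain, bias, esun, sun_elev_deg, d):
    # At-satellite radiance L from the band's header calibration.
    radiance = gain * dn + bias
    # cos(solar zenith) = sin(sun elevation).
    theta = math.radians(sun_elev_deg)
    # At-satellite reflectance, with d the normalized Sun-Earth distance
    # and esun the band's exo-atmospheric solar irradiance.
    return (math.pi * radiance * d ** 2) / (esun * math.sin(theta))
```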
The second approach is simpler and more ad hoc, and follows a procedure common
in the image analysis community. Prior to dividing an image into rectangular patches, the central
portion of the histogram for each band (leaving out a small percentage on each side) is linearly
stretched over the entire available intensity range, i.e., $[0, 255]$. We note that although this
improves the contrast of the NIR images, it potentially increases the difference between images
originating from different paths. To compensate for this, histogram centralization is performed
before the linear stretching. More specifically, using all images from a chosen reference path,
the means $\mu_R$, $\mu_G$, and $\mu_B$ of the red, green, and blue spectral bands respectively
are computed. Then, for each image in the database, the histograms of its R, G, and B bands are
shifted to be centered at $\mu_R$, $\mu_G$, and $\mu_B$ correspondingly. After the shifting
operation, the linear stretching discussed above is applied. The results of normalization on
sample images are shown in Fig. 8. We apply this normalization first on a subset of the test images.
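The two steps can be sketched as follows; the percentile cut-offs are illustrative placeholders, not the paper's exact stretch limits.

```python
def centralize_and_stretch(band, ref_mean, lo_cut=0.01, hi_cut=0.99):
    # Histogram centralization: shift so the band mean matches the
    # reference path's mean for that spectral band.
    shift = ref_mean - sum(band) / len(band)
    shifted = [v + shift for v in band]
    # Linear stretch: map the central (lo_cut..hi_cut) portion of the
    # histogram onto the full intensity range [0, 255], clipping the rest.
    ordered = sorted(shifted)
    lo = ordered[int(lo_cut * (len(ordered) - 1))]
    hi = ordered[int(hi_cut * (len(ordered) - 1))]
    scale = 255.0 / max(hi - lo, 1e-9)
    return [min(255.0, max(0.0, (v - lo) * scale)) for v in shifted]
```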
Having applied both normalization techniques separately on a set of two Landsat images (one
each taken from the East Coast and the West Coast), retrieval is performed over the entire pooled
set of test patches in both cases. It is observed that the average precision and recall over all
categories are better with the ad hoc normalization than with the standard
DN to at-satellite reflectance conversion. While this may be due to the specific visual features
extracted in our system, we are unaware of a precise explanation. Nonetheless, given
the improved performance, we use the ad hoc normalization procedure for the rest of our
experiments. Adopting a more standard normalization procedure into our system is part of future
work.
B. Categorization Using 2-D MHMMs
The 2-D multiresolution hidden Markov model (2-D MHMM) has been used for generic
image categorization. This section presents a brief overview of the model and its application to
categorization. For a more detailed discussion on 2-D MHMMs, please refer to [21].
Under 2-D MHMM, each image is characterized by several layers, i.e., resolutions, of feature
vectors. The feature vectors within a resolution reside on a 2-D grid. The nodes in the grid
correspond to local areas in the image at that resolution. A node can be a pixel or a block of
pixels. The feature vector extracted at a node summarizes local characteristics around a pixel or
Fig. 9. A conceptual diagram of the 2-D MHMM based modeling process. Arrows indicate the intra-scale and inter-scale
dependencies among visual features.
within a block. The 2-D MHMM specifies the distribution of all the feature vectors across all the
resolutions by a spatial stochastic process. Both inter-scale and intra-scale statistical dependence
among the feature vectors are taken into account in this model. These dependencies are critical
for judging the semantic content of satellite image patches because texture or spatial structure
in these patches can be captured at a larger scale than at a block or pixel level. The inter- and
intra-scale dependencies of the feature vectors are captured by assuming hidden layers of states
at all the resolutions. The feature vectors, which are actually observed, are assumed to follow
Gaussian distributions conditioned on given states at the corresponding resolution and position.
The statistical dependence among states across scales is modeled by a Markov chain, and that
within each scale is modeled by a Markov mesh.
For the experiments, a three-level pyramidal structure was used in the model. A schematic
diagram of this process can be found in Fig. 9. The number of states at the lowest resolution is 1,
and the number of states at each of the two higher resolutions is 4. For feature extraction, 4x4
blocks are taken and the visual features are characterized by a six-dimensional feature vector.
This vector consists of three moments of the wavelet coefficients in the high-frequency bands
(representing texture) and the three average color components in the LUV space. As discussed
earlier, instead of taking RGB bands, the near-IR, red, and green bands are taken from the satellite
image spectra. This is motivated by the fact that traditionally these bands have been visualized
on screen for manual classification as if they were RGB bands. It is thus reasonable to convert
these bands to LUV in the same way as we would convert from RGB. For more details on
the feature extraction process, readers are referred to [35]. The likelihood for an image given a
trained 2-D MHMM is computed as explained in [21]. The computed scores for an image over
all trained models are then used in the SVM classification and the eventual category prediction.
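As a sketch of how those scores are consumed, assuming a hypothetical vector of per-model log-likelihoods:

```python
def predict_category(loglikes):
    # loglikes[i] is the patch's log-likelihood under trained model M_{i+1}.
    # The argmax gives the maximum-likelihood category (1-based label);
    # the full score vector is what the SVM stage sees as a feature vector.
    best = max(range(len(loglikes)), key=lambda i: loglikes[i])
    return best + 1, list(loglikes)
```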
C. Separating Known (C2) and Unknown (C1) Classes Using SVM
The training of the 2-D MHMMs is performed on a non-exhaustive set of categories {y_1, ..., y_M}.
Generating a training set covering all possible land-cover categories is time-consuming and
expensive, if possible at all. Hence it is preferable to limit the scope to only those classes that
are of interest. As a result, among the image patches there exist many that represent categories
outside of {y_1, ..., y_M}. Also, there are patches that are a blend of multiple categories, with none
dominating. In both cases, these patches should ideally be labeled 0 (C1). As mentioned, all
patches labeled {1, ..., M} are considered part of C2.
Using the maximum-likelihood approach, we can always assign a category label between 1
and M to every patch, even if it actually belongs to C1 (unknown). This is not desirable, and
as explained in Sec. IV-A, neither can we train another 2-D MHMM to model patches in C1.
A naive approach to this problem rests on the following assumption: given a patch
which does not resemble any trained category, the likelihood estimates from all the models
tend to be low. Therefore, if all likelihood scores are below a certain threshold, the patch can be
assigned class 0 (C1). However, not surprisingly, it is found that for a given patch the likelihood
estimates are not independent of each other. This may be because the 2-D MHMMs
are trained on samples that have some degree of visual resemblance across categories.
To solve this problem, we employ a formal classification approach. Let the set of likelihood
estimates for a given patch P_i be its feature vector f_i = {l_1, ..., l_M}. In the experiments, 4 classes
were considered. We can plot the 4-D feature vectors of the patches manually labeled as C1 or
C2. The plots, taken two dimensions at a time, are shown in Fig. 10. Clearly, a non-linear method
can model the class separation better than thresholding or other linear methods. Classification was
attempted using Quadratic Discriminant Analysis (QDA) and Logistic Regression, of which
Logistic Regression gave the better accuracy, both overall and in classifying C2 alone.
Classification using SVM was then performed on the data using the LibSVM software package [6],
with the RBF kernel $K(x_i, x_j) = e^{-\gamma \lVert x_i - x_j \rVert^2}$. The
results were further improved, both in overall accuracy and in the accuracy of classifying C2.
Fig. 10. Plots of the 4-D likelihood feature vectors for C1 (black circles) and C2 (red crosses). The six unordered pairs of dimensions are shown.
When a patch is classified as C1, it is removed from further consideration for retrieval. To
be on the safe side, we would rather have some C1 patches classified as C2 than have valid
C2 patches mistakenly classified as C1 and eliminated from consideration. Therefore, the goal
is to achieve higher accuracy in detecting C2. This increases the search space to some extent
while still eliminating a significant portion of unwanted patches. Hence we desire a biased
classifier. One way to introduce weights into the SVM learning process is to sample the
training classes accordingly. The bias is introduced by sampling C1 and C2, with repetition,
in a skewed ratio when constructing the SVM training set. In this manner, we achieve high
accuracy in classifying C2, while for C1 the accuracy is moderate. Hence only a small fraction
of the patches within categories {y_1, ..., y_M} will be mistakenly eliminated from consideration.
We presume this is not a problem, since patches of one category in a satellite image are usually
spread over a large region, and it is highly unlikely that all patches in one region will be
eliminated.
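A sketch of such biased construction of the training set by resampling with repetition; the ratio and total below are illustrative, since the paper's exact values are not reproduced here.

```python
import random

def biased_sample(c1, c2, c1_fraction=0.25, total=2000, seed=0):
    # Draw, with repetition, a training set in which C2 is deliberately
    # over-represented, biasing the SVM toward high recall on C2.
    rng = random.Random(seed)
    n1 = int(total * c1_fraction)
    sample = [(x, 0) for x in rng.choices(c1, k=n1)]           # label 0 = C1
    sample += [(x, 1) for x in rng.choices(c2, k=total - n1)]  # label 1 = C2
    rng.shuffle(sample)
    return sample
```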
D. Retrieval using IRM
The Integrated Region Matching (IRM) measure [35] used in the SIMPLIcity system is
employed for image patch mining. IRM is a scalable and robust region-based image similarity
measure. IRM attempts to integrate visual features of each segmented region in the images to
provide robust region-based image matching, with low dependence on reliable segmentation. The
scheme allows multiple regions of one image to be matched with several regions of the other
image. The overall similarity measure is computed as a weighted sum of the similarity between
region pairs, with weights determined by the significance of regions.
Each image is segmented using k-means clustering [17] on the color component vectors, and
each generated segment is represented by a nine-dimensional feature vector summarizing
color and wavelet-based texture properties. The feature vectors include the same six texture
and color features used in the 2-D MHMM, and three additional features characterizing the shape of
the segment. The matching is performed by a soft similarity measure in the following manner.
For two images A_1 and A_2, suppose they are segmented into n_1 and n_2 regions respectively. The
IRM distance between images A_1 and A_2 is then given by

    $d(A_1, A_2) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} s_{i,j} \, d_{i,j}$    (4)

subject to $\sum_{j=1}^{n_2} s_{i,j} = p_i$ for $1 \le i \le n_1$, and $\sum_{i=1}^{n_1} s_{i,j} = p'_j$ for $1 \le j \le n_2$,
where d_{i,j} is the distance between the feature vectors characterizing region i of image A_1 and
region j of image A_2, and s_{i,j} is the significance credit for that region pair. More specifically,
denoting the two regions as r_i and r'_j, the values of d_{i,j} are computed as follows:

    $d_{i,j} = d_c(r_i, r'_j) \cdot g(d_s(r_i, r'_j))$    (5)

where $d_c(r_i, r'_j)$ is the color and texture distance given by

    $d_c(r_i, r'_j) = \sum_{k=1}^{6} \left( f_k(r_i) - f_k(r'_j) \right)^2$    (6)

with f_1, f_2, and f_3 being the mean L, U, and V color components of the region, and f_4, f_5, and f_6
being the square roots of the second-order moments of wavelet coefficients in the HL, LH, and
HH bands respectively. The shape distance $d_s(r_i, r'_j)$ is given by

    $d_s(r_i, r'_j) = \frac{1}{3} \sum_{\gamma=1}^{3} \left( \frac{l(r_i, \gamma)}{\lambda_\gamma} - \frac{l(r'_j, \gamma)}{\lambda_\gamma} \right)^2$    (7)

where $l(R, \gamma)$ is the normalized inertia [13], invariant to scaling and rotation, defined by

    $l(R, \gamma) = \frac{\sum_{p \in R} \lVert p - \bar{p} \rVert^{\gamma}}{\lvert R \rvert^{1 + \gamma/2}}$    (8)

where |R| denotes the size in pixels of the region R, $\bar{p}$ denotes the centroid of R, and
$\lambda_\gamma$ denotes the $\gamma$-th order normalized inertia of spheres. The function $g(\cdot)$ in Eq. 5
is used to make the shape distance coherent with the color-texture distance in the overall d_{i,j}
computation; a simple piecewise-defined form of g was found to be empirically appropriate, as
detailed in [35].
The significance credits s_{i,j} determine how important a role each pair of regions plays in the
calculation, constrained by p_i and p'_j, the significances of regions i and j within A_1
and A_2 respectively. To assign the significance credits between pairs of segments, the most similar
highest priority (MSHP) principle [35] is used.
In region-based image similarity measures, segmentation quality typically plays a major role
in the similarity scores. In the case of IRM, however, there is a high degree of robustness to poor
segmentation. Difficulty in segmentation can be particularly acute in the case of noisy remotely
sensed images, and hence robustness to unsatisfactory segmentation is important. The use of
the IRM distance for ranking patches by visual similarity helps generate high precision retrieval
in our system, as observed in our experiments. Another possible reason for the demonstrated
success of IRM in the remote sensing domain is the importance given to texture and localized
shape features in the similarity computation.
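The matching step can be sketched as a greedy transport problem: under our reading of the MSHP principle, the most similar region pair is matched first and given as much significance credit as remains. This is an illustrative sketch, not the exact implementation of [35].

```python
def irm_distance(sig1, sig2, dist):
    # sig1, sig2: region significances of the two images (each sums to 1).
    # dist[i][j]: precomputed distance d_ij between region i and region j.
    p, q = list(sig1), list(sig2)
    pairs = sorted((dist[i][j], i, j)
                   for i in range(len(p)) for j in range(len(q)))
    total = 0.0
    for d, i, j in pairs:
        credit = min(p[i], q[j])   # significance credit s_ij
        if credit > 0.0:
            total += credit * d    # weighted contribution to Eq. 4
            p[i] -= credit
            q[j] -= credit
    return total
```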
VI. USER INTERFACE
A screenshot of the user interface is shown in Fig. 11. Readers are encouraged to try
the demonstration at the aforementioned location. Initially, a random set of patches
from the database collection is shown. The user can then choose to mine the imagery
in three different ways:
(1) The user can click on a patch to retrieve other visually similar patches. Along with the
retrieved patches the user also has access to their metadata. This process of browsing can continue
as a chain of clicks to arrive at the required set of patches and/or parent images of interest.
Fig. 11. Interface with two possible options to query the patch database, (a) by explicit specification, or (b) by clicking on a
retrieved patch.
Fig. 12. Interface for zooming and panning, showing urban (blue), residential (yellow), and the retrieved (green) patch. Left: original resolution. Right: zoomed out. (Courtesy: GLCF)
(2) The user has the option of clicking the random button to display a new set of random
patches retrieved from the database. This is one way to explore the database, understand the
distribution of land cover within the database, and get an overview of the variations within the
images captured.
(3) If the user wishes to enter a query patch not in the database, this can be done as our
system supports external queries.
For added convenience in browsing, six levels of zooming are supported by our system.
A user can enter the zooming mode by clicking on the location coordinates specified below
a retrieved patch. This is an effective way to explore geographic regions in the proximity of
patches of interest; in other words, the zooming capability allows users to get a more global view
of regions of interest. Panning is implicitly supported by the interface as well: clicking on
any portion of the image moves the gaze to that region at the next level of zoom. A screenshot
of the zooming/panning interface can be viewed in Fig. 12. The best way to experience this
interface is through usage of our prototype.
Zooming and panning operations are optimized for real-time response in the following manner.
Haar wavelet transforms are used to achieve zooming, since they preserve localization
of data [36]. These transforms decompose the images into sums and differences of neighborhood
pixels. On a given query, the system only needs to retrieve the quantized coefficients of
the queried region for reconstruction. Since the processing for categorization and zooming is
done only once during setup, and only localized parameters are required, the response time is
low. Using the metadata associated with the patches, such as precise geographic
locations or semantic categories (either manually provided or automatically predicted), more
advanced querying capabilities can be incorporated. For example, useful extensions could include
the ability to formulate queries such as "Find the closest urban area near this location" or
"Find the water source nearest to this residential area". When accurate orthorectified imagery is
available, interface extensions to support such complex querying within our framework should
be fairly straightforward.
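A one-level 1-D Haar step illustrates the sum/difference idea (here scaled to averages and half-differences); the actual system applies the 2-D analogue to image tiles and stores quantized coefficients.

```python
def haar_1d(signal):
    # Averages (low-pass) and half-differences (high-pass) of neighbors;
    # each coefficient depends only on a local pixel pair, which is why
    # a zoomed view needs only the coefficients of the queried region.
    avgs = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    dets = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avgs, dets

def haar_1d_inverse(avgs, dets):
    # Exact reconstruction from the averages and details.
    out = []
    for a, d in zip(avgs, dets):
        out += [a + d, a - d]
    return out
```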
VII. EXPERIMENTAL RESULTS
For all our experiments, N = 12 Landsat 7 ETM+ multi-spectral images are used: six images
from the East Coast and six from the West Coast (path 46) of the US. Since there is considerable
difference in the sensed data from each path, normalization is employed to homogenize the
quality, as explained previously. The selection of images in this manner is aimed at demonstrating
the effectiveness of our system under the varying sun elevation challenge. Our system supports
four semantic categories (i.e., M = 4), namely mountain, crop field, urban area, and residential
area. As described in Sec. III, the NIR bands are chosen for image representation. The pixel
dimensions of each image I_i are several thousand on each side, with geographic dimensions of
approximately 180 km x 170 km.
Ground-truth categorization is not readily available for patches. It is required for training and
testing the 2-D MHMM based categorization process, as well as for measuring the precision of
retrieval using the IRM measure. In order to build a manual categorization of the patches into the
specified classes, an expert working on satellite image analysis in a government research lab gave
two arbitrarily chosen subjects a tutorial on how to distinguish between the 4 semantic categories.
The satellite images are divided into square patches. The subjects then independently label each
test patch as one of {1, 2, 3, 4}, or 0 in case it belongs to none of the classes or has no
dominant coverage, keeping in mind the coverage policy (Sec. IV). The final category labels
are determined by taking the agreed labels as they are and, in case of disagreement, randomly
choosing one of the two. With the high quality of the ETM+ images, it is not hard to visually
identify the four categories used. The overlap between the two sets of labels from the independent
subjects is substantial. In the absence of a "gold standard", this serves as a "silver
standard".
The choice of patch size is critical. A patch should be large
enough to encapsulate the visual features of a semantic category. At the same time, it should
be small enough to include only one semantic category in most cases. Instead of arbitrarily
selecting a patch size, we explore the effectiveness of 4 different patch sizes, each covering a
different level of granularity, in our system. In particular, we explore patch sizes of 16x16,
32x32, 64x64, and 128x128. For training, we initially experiment with sample
sizes from as low as 28 to as high as 90 per category, by testing the built models on a small
validation set of 64x64 image patches. We observe a clear trend: beyond 50 training patches,
the classification accuracy shows no noticeable improvement, despite a significant increase
in computational cost, while at 50 training samples the categorization results are already satisfactory.
Hence n = 50 samples of each of the four categories, for each of the 4 patch sizes, are
used for training the 2-D MHMMs to yield models M_1, M_2, M_3, and M_4 in each case. A biased
SVM is trained using the procedure described in Sec. V-C and used in the likelihood-based class
prediction process. In order to test the effectiveness of categorization for each patch size, and
to eventually choose an appropriate patch size for the system, the test patches in each case are
classified using the built models, and the results compared with the generated "silver standard".
sizes, are shown in Table II. Note that the accuracy of classifying C1 patches reflects the
model accuracy of both the 2-D MHMMs and the biased SVM. A measure of accuracy often used
in the remote-sensing community to evaluate multi-class classification performance is Cohen's
Kappa Coefficient [7], approximated by $\hat{\kappa}$ [8], defined as

    $\hat{\kappa} = \dfrac{N \sum_{i=1}^{m} x_{ii} - \sum_{i=1}^{m} x_{i+} x_{+i}}{N^2 - \sum_{i=1}^{m} x_{i+} x_{+i}}$

where m is the number of classes, x denotes the confusion matrix obtained for this m-class
classification problem, N is the total number of test samples, x_{ij} indicates the observation in row
i and column j, x_{i+} is the total of row i, and x_{+i} is the total of column i. More specifically,

    $x_{i+} = \sum_{k=1}^{m} x_{ik} \quad \text{and} \quad x_{+i} = \sum_{k=1}^{m} x_{ki},$    (10)

which denote the row sums and column sums of the confusion matrix respectively. We use this
measure to select a patch size that maximizes $\hat{\kappa}$, i.e., produces the best classification for the
categories in question at the corresponding level of granularity.
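The approximation above translates directly into code; a minimal sketch taking the confusion matrix as a list of rows:

```python
def kappa_hat(x):
    # N: total number of test samples; diag: sum of correct counts;
    # chance: sum over classes of (row total) * (column total).
    n = sum(sum(row) for row in x)
    diag = sum(x[i][i] for i in range(len(x)))
    chance = sum(sum(x[i]) * sum(row[i] for row in x) for i in range(len(x)))
    return (n * diag - chance) / (n * n - chance)
```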
When taking only classes 1 to 4, the $\hat{\kappa}$ coefficients are computed for patch sizes 128x128,
64x64, 32x32, and 16x16 respectively, and likewise when class 0 (C1) is also included. While
these results are overall very encouraging, and the differences across patch sizes are not large,
the size 128x128 tends to produce the best categorization among them by this measure. Moreover,
this patch size was preferred by the analyst we consulted, for visualization purposes. Hence for
building the system prototype, and for the remainder of our experiments, patch sizes of 128x128
are used.
Sample results obtained when querying our system using an urban patch (Fig. 13) and a
mountain patch (Fig. 14) are shown. To analyze the improvement in retrieval speed due to
prior categorization with 2-D MHMMs, a batch of queries was made for patches from each
category, and the retrieval times were noted. In the first run, 2-D MHMM categorization was
not used; thus, for each query patch, the entire database was searched to find similar patches.
In the second run, 2-D MHMM categorization was used to limit the search to only those patches
within the same predicted class. Table III shows the average time per query for patches of each
category for the two runs. As expected, the average time of retrieval for each category is fairly
similar because the size of the database to be searched for each patch is the same. Table III
TABLE II
CLASSIFICATION RESULTS (CONFUSION MATRICES) USING 2-D MHMM WITH 4 DIFFERENT PATCH SIZES (128x128, 64x64, 32x32, AND 16x16). Each matrix lists, for the classes Mountain (1), Crop (2), Urban (3), Residential (4), and Others (0), the counts of predicted labels and the per-class accuracy.
shows that the average times per query across categories are fairly consistent for the first run. There
is considerable improvement in retrieval speed for each category in the second run, due to the
reduced search space induced by categorization. The speed difference among categories in the
second run is no longer consistent, since the number of patches varies across categories. It is
important to note that the largest number of patches belong to the C1 category (uncategorized).
Fig. 13. Ordered retrieval results on an Urban query patch. Patch labels consist of (1) parent image, (2) local coordinates, and (3) IRM distance.
This shows the significance of separating patches into uncategorized (C1) and categorized
(C2) using the SVM. Without this, the number of patches to be searched for each query of class
C2 would increase, because patches of C1 would get distributed among the trained categories.
As mentioned in Section V-A, normalization is required for effective querying across different
paths. To establish the need for normalization and the effectiveness of the scheme used, we perform
the following experiments. In the first trial, images are used without any normalization. The
retrieved patches can be grouped as (1) those that belong to the very same image as the query
patch, (2) those that belong to other images from the same path, and (3) those that belong to
images from a different path. A batch of queries is run for each category. The average percentages
of these groups among the top-ranked patches for each query are shown in Table IV. It can be
observed that for each category, the large majority of the results belong to the same parent image
as the query, and nearly all of the results belong to the same path as the query. Thus, without
normalization, the results tend to show significant bias. In the second
Fig. 14. Ordered retrieval results on a Mountain query patch. Patch labels consist of (1) Parent image, (2) Local Coordinates,
and (3) IRM distance.
trial, the normalization procedure described earlier is used on the images before retrieval, and the
experiment repeated. The new results are reported in Table V. We note significant improvement
in the retrieval, with a clear reduction in bias toward the query image and path. Exploring more
robust normalization procedures to counter the bias further, is a possible future direction.
In order to assess the impact of prior categorization using 2-D MHMMs and SVM on retrieval
effectiveness, we perform the following experiment. Of the Q patches displayed in
response to each query, one measure of retrieval effectiveness is the percentage of
relevant patches among them, i.e., the precision. It is measured as follows. For each category, we
use the system to retrieve from 5 to 30 patches per query (in intervals of 5) and measure the
percentage of retrieved patches that have the same manual category label as the query patch.
This is repeated for a number of queries per category, and the average precision is plotted against
Q, as shown in Fig. 15. The most vital observation is that semantic categorization using
2-D MHMM results in a consistent improvement in retrieval relevance. For specific
TABLE III
COMPARISON OF RETRIEVAL TIMES WITH AND WITHOUT PRIOR 2-D MHMM CLASSIFICATION

Search            Entire database (s)   Semantically relevant patches (s)   No. of patches
Mountain (1)      0.139                 0.058                               9386
Crop (2)          0.125                 0.041                               13445
Urban (3)         0.127                 0.01                                1861
Residential (4)   0.138                 0.015                               5383
Others (0)        0.128                 0.047                               51723

TABLE IV
AVERAGE DISTRIBUTION OF PATCH RETRIEVAL WITHOUT NORMALIZATION

Category          Same image   Same path   Different path
Mountain (1)      90%          93%         7%
Crop (2)          100%         100%        0%
Urban (3)         48%          91%         9%
Residential (4)   85%          97%         3%

TABLE V
AVERAGE DISTRIBUTION OF PATCH RETRIEVAL AFTER NORMALIZATION

Category          Same image   Same path   Different path
Mountain (1)      75%          83%         17%
Crop (2)          48%          89%         11%
Urban (3)         25%          46%         54%
Residential (4)   24%          41%         59%
requirements, these plots may be used to choose suitable values of Q. We have thus established the effectiveness of prior categorization using 2-D MHMMs as a tool for satellite image mining, both in terms of retrieval precision and retrieval speed.
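The precision measure used in this experiment can be sketched as follows; the function names and the label representation are hypothetical, for illustration only.

```python
def precision_at_q(retrieved_labels, query_label, q):
    """Precision: fraction of the top-q retrieved patches whose
    manual category label matches the query patch's label."""
    top = retrieved_labels[:q]
    return sum(1 for label in top if label == query_label) / q

def average_precision_at_q(queries, q):
    """Mean precision over (query_label, retrieved_labels) pairs,
    as plotted against Q in Fig. 15."""
    return sum(precision_at_q(r, lbl, q) for lbl, r in queries) / len(queries)
```

Sweeping q over 5, 10, ..., 30 and averaging over queries per category reproduces the curves of Fig. 15.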
It is worth noting that patches in untrained categories can also be retrieved effectively, as shown in Fig. 16, albeit with less precision. Retrieval for untrained categories is also efficient, because the system searches for similar patches only among the patches labeled c₀ rather than
[Fig. 15 comprises four plots, one per category (Mountain, Crop field, Urban area, Residential area), each plotting Percentage Relevance (50 to 100) against the number of patches retrieved, Q (5 to 30), for "IRM Only" versus "IRM + 2-D MHMM".]
Fig. 15. Average precision of IRM based retrieval for each category, with and without 2-D MHMM categorization.
the entire database of patches. Finally, we comment on the 2-D MHMM training times. About 20 minutes are required to train each 2-D MHMM on a 1.7 GHz Intel Xeon machine, so approximately 80 minutes are needed to build the four category models required by the system. However, this process can be run off-line and is non-recurring. The subsequent indexing process is done only once
for each image added to the database. The system performs retrieval in real-time using the fast
and robust IRM measure.
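The overall two-stage retrieval loop described in this section, category filtering followed by distance ranking, can be sketched as below. The `Patch` structure is hypothetical, and a plain Euclidean distance stands in for the IRM measure used by the actual system.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Patch:
    patch_id: str
    category: int            # 0 = "others" (untrained), 1-4 = trained categories
    feature: Tuple[float, ...]

def euclidean(a, b):
    # Placeholder for the IRM distance used in the paper.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def retrieve(query: Patch, database: List[Patch], q: int, distance=euclidean):
    # Stage 1: restrict the search to patches sharing the query's
    # semantic category (assigned by 2-D MHMM, or c0 via SVM).
    candidates = [p for p in database if p.category == query.category]
    # Stage 2: rank the reduced candidate set by image distance.
    candidates.sort(key=lambda p: distance(query.feature, p.feature))
    return candidates[:q]
```

Because stage 1 shrinks the candidate set to one category, stage 2 ranks only a fraction of the database, which is the source of the speedup reported in Table III.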
VIII. DISCUSSION AND FUTURE WORK
The proposed system uses a convenient learning based approach for large-scale browsing and
retrieval of satellite image patches. It has been shown that automatic semantic categorization of
patches using 2-D MHMM prior to retrieval improves performance in terms of speed as well
as precision of retrieval. Prior categorization reduces the search space to fewer, more relevant
patches, thereby reducing search time. Searching through only the semantically relevant patches
leads to an improvement in the quality of retrieval. An SVM has been used effectively to handle patches from categories that have not been trained. Performing classification at the patch level instead of the pixel
Fig. 16. Demonstrating the effectiveness of our system on untrained categories. Shown here are ordered retrieval results on a
coastline query patch (coastline - untrained).
level in satellite images helps in building a convenient interface for browsing. Adding new
satellite images to the system is fairly straightforward and does not require re-training of the
existing models. To add a new category of interest, only a new 2-D MHMM needs to be trained for that category, while the existing models can be re-used. The SVM classifier must,
however, be retrained with samples from all classes. Two different normalization approaches have
been attempted, and for this specific case, histogram transformations have been found effective
for giving the system robustness to sun elevation and other atmospheric variations. There are,
however, still some issues which have not been tackled; these form part of our ongoing research.
• Square patches are used in the proposed system because of their computational convenience, but users may desire more flexible shapes for querying. It will also be interesting to see how retrieval accuracy varies with the size of the patches.
• The impact of standard normalization procedures, such as conversion of digital numbers (DN) to at-satellite reflectance, used in place of histogram transformations on our particular system has been experimented with but not explored extensively, and hence remains future work.
• Only four low-level land cover categories have been considered in this work. Although classification over a small number of categories has often been experimented with and found useful in applications, e.g., [15], [5], [38], extending our system to support more categories would be beneficial. Adapting the system to finer categories would require higher resolution imagery such as IKONOS. Exploring such imagery together with more precise land cover categories is an interesting future direction, and we note that it can be attempted without any major changes to the proposed architecture.
• An important issue related to the categories used in the experiments is the presence or absence of noise in the form of snow and cloud cover. We faced this problem while developing the system, especially when training on the ‘mountain’ patches, which were a mix of snow-capped mountains and low-altitude hills. To resolve it, we took advantage of the fact that 2-D MHMM based models have been found capable of learning significantly diverse image categories [22]: we hand-picked a mix of all observed variants within a land cover category into the training set. This approach appears to work, since the accuracy for the mountain category was as high as 94.5%, reinforcing the belief that 2-D MHMM can learn fairly diverse land cover categories. As for cloud cover, the Landsat images used in our experiments did not contain significant cloud-covered areas; in general, however, clouds are often present and can undermine the performance of the system. This problem can potentially be tackled either by building multiple 2-D MHMM based models for each category, with and without cloud cover, or by using multi-temporal image stacks for improved categorization and retrieval. The use of multi-temporal imagery for improved classification and retrieval of patches has not been explored in this work, but we believe it has potential: different models of the same land cover categories could be trained over different periods of time, and pooling all such patches into one database for retrieval would be a straightforward way of accounting for annual weather changes (snow, clouds) in the retrieval process.
REFERENCES
[1] J. R. Anderson, E. H. Harvey, J. T. Roach, R. E. Whitman, “A Land Use and Land Cover Classification System for Use
with Remote Sensor Data,” Geological Survey Professional Paper 964, U.S. Government Printing Office, 1976.
[2] P. M. Atkinson and A. R. L. Tatnall, “Neural Networks in Remote Sensing,” International Journal of Remote Sensing,
18(4):699–709, 1997.
[3] L. Bastin, “Comparison of Fuzzy C-means Classification, Linear Mixture Modeling and MLC Probabilities as Tools for
Unmixing Coarse Pixels,” International Journal of Remote Sensing, 18(17):3629-3648, 1997.
[4] H. Bischof, W. Schneider, and A. J. Pinz, “Multispectral Classification of Landsat-images using Neural Networks,” IEEE
Trans. on Geoscience and Remote Sensing, 30(3):482–490, 1992.
[5] L. Bruzzone, D. F. Prieto, and S. B. Serpico, “A Neural-Statistical Approach to Multitemporal and Multisource Remote-
Sensing Image Classification,” IEEE Trans. on Geoscience and Remote Sensing, 37(3):1350–1359, 1999.
[6] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” 2001. Software available from:
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[7] J. Cohen, “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20:37-46, 1960.
[8] R. G. Congalton, “A Review of Assessing the Accuracy of Classifications of Remotely Sensed Data,” Remote Sensing of
Environment, 37:35–46, 1991.
[9] H. Daschiel and M. Datcu, “Information Mining in Remote Sensing Image Archives: System Evaluation”, IEEE Trans. on
Geoscience and Remote Sensing, 43(1):188–199, 2005.
[10] M. Datcu, H. Daschiel, A. Pelizzari, M. Quartulli, A. Galoppo, A. Colapicchioni, M. Pastori, K. Seidel, P. G. Marchetti,
and S. D’Elia, “Information Mining in Remote Sensing Image Archives: System Concepts”, IEEE Trans. on Geoscience
and Remote Sensing, 41(12):2923–2936, 2003.
[11] R. Datta, J. Li, and J. Z. Wang, “Content-Based Image Retrieval - A Survey on the Approaches and Trends of the New
Age,” Proc. ACM International Workshop on Multimedia Information Retrieval, ACM Multimedia, 2005.
[12] G. M. Foody, “Approaches for the Production and Evaluation of Fuzzy Land Cover Classifications from Remotely-sensed
Data,” International Journal of Remote Sensing, 17(7):1317–1340, 1996.
[13] A. Gersho, “Asymptotically Optimum Block Quantization,” IEEE Trans. on Information Theory, 25(4):373–380, 1979.
[14] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Texture Features for Image Classification,” IEEE Trans. on Systems, Man
and Cybernetics, 3(6):610–621, 1973.
[15] L. Hermes, D. Frieauff, J. Puzicha, and J.M. Buhmann, “Support Vector Machines for Land Usage Classification in Landsat
TM Imagery,” Proc. Geoscience and Remote Sensing Symposium, 1999.
[16] C. Huang, L. Yang, C. Homer, B. Wylie, J. Vogelman, and T. DeFelice, “At-satellite Reflectance: A First-order
Normalization of Landsat 7 ETM+ images,” USGS Technical Report, 2001.
[17] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[18] J. R. Jensen, Introductory Digital Image Processing: A Remote Sensing Perspective, Prentice Hall, 1995.
[19] A. S. Kumar and K. L. Majumder, “Information Fusion in Tree Classifiers,” International Journal of Remote Sensing,
22(5):861–869, 2001.
[20] C.-S. Li and V. Castelli, “Deriving Texture Feature Set for Content-based Retrieval of Satellite Image Database,” IEEE
ICIP, 1997.
[21] J. Li, R. M. Gray, and R. A. Olshen, “Multiresolution Image Classification by Hierarchical Modeling with Two Dimensional
Hidden Markov Models,” IEEE Trans. on Information Theory, 46(5):1826–1841, 2000.
[22] J. Li and J. Z. Wang, “Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 25(9):1075–1088, 2003.
[23] W.-Y. Ma and B. S. Manjunath, “A Texture Thesaurus for Browsing Large Aerial Photographs,” Journal of the American
Society for Information Science, 49(7):633–648, 1998.
[24] B. S. Manjunath and W.-Y. Ma, “Texture Features for Browsing and Retrieval of Image Data,” IEEE Trans. on Pattern
Analysis and Machine Intelligence, 18(8):837–842, 1996.
[25] F. Melgani and L. Bruzzone, “Classification of Hyperspectral Remote Sensing Images with Support Vector Machines,”
IEEE Trans. on Geoscience and Remote Sensing, 42(8):1778–1790, 2004.
[26] S. Newsam, L. Wang, S. Bhagavathy, and B. S. Manjunath, “Using Texture to Analyze and Manage Large Collections of
Remote Sensed Image and Video Data,” Journal of Applied Optics: Information Processing, 43(2):210–217, 2004.
[27] H. Ren and C.-I. Chang, “A Generalized Orthogonal Subspace Projection Approach to Unsupervised Multispectral Image
Classification,” IEEE Trans. on Geoscience and Remote Sensing, 38(6):2515–2528, 2000.
[28] M. Schröder, H. Rehrauer, K. Seidel, and M. Datcu, “Interactive Learning and Probabilistic Retrieval in Remote Sensing
Image Archives,” IEEE Trans. on Geoscience and Remote Sensing, 38(5):2288–2298, 2000.
[29] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-Based Image Retrieval at the End of the
Early Years,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000.
[30] P. C. Smits, S. G. Dellepiane, R. A. Schowengerdt, “Quality assessment of image classification algorithms for land-cover
mapping: a review and a proposal for a cost-based approach”, International Journal of Remote Sensing, 20(8):1461–1486,
1999.
[31] A. H. S. Solberg, T. Taxt, and A. K. Jain, “A Markov Random Field Model for Classification of Multisource Satellite
Imagery,” IEEE Trans. on Geoscience and Remote Sensing, 34(1):100–113, 1996.
[32] B. C. K. Tso and P. M. Mather, “Classification of Multisource Remote Sensing Imagery using a Genetic Algorithm and
Markov Random Fields,” IEEE Trans. on Geoscience and Remote Sensing, 37(3):1255–1260, 1999.
[33] U.S. Geological Survey, “Landsat (Sensor: ETM+),” EROS Data Center, Sioux Falls, SD. Available from:
http://glcf.umiacs.umd.edu/data/landsat/.
[34] F. Wang, “Fuzzy supervised classification of remote sensing images,” IEEE Transactions on Geoscience and Remote
Sensing, 28(2):194–201, 1990.
[35] J. Z. Wang, J. Li, and G. Wiederhold, “SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries,” IEEE
Trans. on Pattern Analysis and Machine Intelligence, 23(9):947–963, 2001.
[36] J. Z. Wang, J. Nguyen, K. Lo, C. Law, and D. Regula, “Multiresolution Browsing of Pathology Images Using Wavelets,”
Journal of the American Medical Informatics Association, Proc. of AMIA Annual Symposium, 1999:340–344, 1999.
[37] G. G. Wilkinson, “Results and Implications of a Study of Fifteen Years of Satellite Image Classification Experiments,”
IEEE Trans. on Geoscience and Remote Sensing, 43(3):433–440, 2005.
[38] J. Zhang and G. M. Foody, “A Fuzzy Classification of Sub-urban Land Cover from Remotely Sensed Imagery,” International
Journal of Remote Sensing, 19(14):2721–2738, 1998.