Structure Features for Content-Based
Image Retrieval and Classification
Problems
Dissertation
for the attainment of the doctoral degree
of the Faculty of Applied Sciences
of the Albert-Ludwigs-Universität Freiburg im Breisgau, Germany
by
Dipl. Phys. Gerd Brunner
2006
Dean: Prof. Dr. Bernhard Nebel
Examination committee: Prof. Dr. Rolf Backofen (chair)
Prof. Dr. Stéphane Marchand-Maillet (examiner)
Prof. Dr. Hans Burkhardt (advisor)
Prof. Dr. Thomas Ottmann (assessor)
Date of the disputation: 30 January 2007
Acknowledgements
This dissertation is the result of my work at the Chair of Pattern Recognition and Image
Processing, Institute for Computer Science of the Albert-Ludwigs-Universität Freiburg,
Germany. I am most grateful to my advisor Prof. Dr.-Ing. Hans Burkhardt, who has always
supported my research and gave me the opportunity to work at his chair. I would also
like to thank Prof. Stéphane Marchand-Maillet from the University of Geneva for being
the co-referee of my thesis. In addition, I would like to thank all my colleagues at the
chair for fruitful discussions and support. I am most grateful to Cynthia Findlay for her
constant encouragement and advice. I would like to express my thanks to Stefan Teister
and his administration team for keeping my computer running. I am greatly indebted to
my wife Kasia for her love and understanding, as well as for her contributions to this work.
Freiburg, September 2006
Gerd Brunner
Summary
During the past decades we have observed a steady growth of image material, leading
to gigantic archives. Content-based image retrieval algorithms have attempted to
simplify the access to image data. Today there are numerous feature extraction
methods that have contributed decisively to improving the quality of content-based
image retrieval and image classification systems.
Structure is one of the most important features in image analysis. This is founded on
the human perception of objects and scenes, which is to a large extent based on particular
spatial configurations and changes in intensity. In this dissertation we introduce a
structure-based feature extraction method. The method can represent global structures
as well as local perceptual groups and their relations. The advantage of the method is
its broad applicability and its invariance against similarity transformations and against
changes in illumination. First, the creation of edge maps with several edge detectors is
discussed. To this end, an algorithm is presented that automatically computes the best
parameters for the Canny edge detector. Second, we apply a line grouping method based
on agglomerative hierarchical clustering; for this purpose, the line segments are extracted
with an edge tracking method. The clustering procedure evaluates the best linkage method
and automatically cuts the dendrogram. Once the final cluster hierarchy has been built,
the less significant clusters are discarded on the basis of a compactness measure. Third,
we compute structure-based features on global and local scales. The global scale ensures
a holistic scene analysis of the image; in contrast, the local features encode perceptual
groups and their relations. Finally, the structure-based features are applied to binary,
color, object class and texture image retrieval and/or classification. The first application
is the classification and content-based retrieval of historical watermarks. The second
application is image retrieval on two color image databases with 1,000 and 10,000 images
from the Corel collection. The results are presented together with an invariance analysis,
in which the proposed features achieve an invariance of more than 96%. The third
application is the recognition and retrieval of object classes from the Caltech database,
where recognition rates of 92.45% and 95.45% were achieved for the five and three class
problems, respectively. The fourth and final application is the classification of textures
from the Brodatz collection. A support vector machine with an intersection kernel
achieved a classification rate of 98% under leave-one-out testing.
Abstract
During the past decades we have observed a steady increase in image data, leading
to huge repositories. Content-based image retrieval methods seek to ease the
access to image data. To date, numerous feature extraction methods have been proposed
to improve the quality of content-based image retrieval and image classification
systems.
Structure is one of the most important features for image analysis as shown by the fact
that the human perception of objects and scenes is to a large extent based on particular
spatial configurations and changes in intensity.
In this thesis we introduce a structure-based feature extraction technique. The method
is capable of representing the global structure of an image, as well as local perceptual groups
and their connectivity. The advantage of the method is its broad range of applications and
its invariance against changes in illumination and similarity transformations.
We first discuss the creation of edge maps, accompanied by an evaluation of various
edge detectors. To this end, we present a method that automatically computes the best set
of parameters for the Canny edge detector.
Secondly, we apply a line segment grouping method based on agglomerative hierarchical
clustering, where the segments are extracted with an edge point tracking algorithm. The
procedure automatically evaluates the best linkage method and prunes the dendrogram
based on a subgraph distance ratio. Once the final clusters are obtained, an intra-class
compactness measure is used to discard less significant segment groups.
Thirdly, the structure-based features are computed on a global and local scale. The
global scale ensures a holistic scene analysis of an image, whereas the local features account
for perceptual groups and their connectivity.
Finally, we apply the structure-based features to tasks as broad as binary, color, ob-
ject class and texture image retrieval and/or classification. The first application is the
classification and content-based image retrieval of ancient watermark images. The second
application is a retrieval task on two color image databases from the Corel collection with
1,000 and 10,000 images. The results are accompanied by an invariance analysis, in which
our features achieve a score of more than 96%. The third application is object class
recognition and retrieval for the Caltech database, where we achieve classification rates
of 92.45% and 95.45% for the five and three class problems, respectively. The fourth and
final application is the classification of textures obtained from the well-known Brodatz
collection. A support vector machine with an intersection kernel and a leave-one-out test
achieved a classification rate of 98%.
Contents
1 Introduction 1
1.1 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Edge Detection 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Canny Edge Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Performance Evaluation and Parameter Selection . . . . . . . . . . . . . . 14
2.3.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Line-Segment Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Hough Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Edge Point Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Clustering 27
3.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 The Cophenetic Matrix . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Cluster Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Cluster Tendency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Clustering Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.3 Cutting a Dendrogram . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Structure-Based Features 43
4.1 Euclidean Distance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Feature Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.1 Global Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.2 Local Perceptual Features . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Similarity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Feature Space Representation . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5 Applications 67
5.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.1 Ancient Watermark Images . . . . . . . . . . . . . . . . . . . . . . 67
5.1.2 Corel Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1.3 Caltech Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1.4 Brodatz Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.1 Measures for Classification . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.2 Measures for Content-based Image Retrieval . . . . . . . . . . . . . 77
5.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Filigrees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.2 Retrieval Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.3 Partial Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4.4 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5 Color Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5.1 Retrieval Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5.2 Invariance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.6 Object Class Recognition and Retrieval . . . . . . . . . . . . . . . . . . . . 111
5.6.1 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.6.2 Retrieval Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.6.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.7 Texture Class Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6 Conclusions and Perspectives 127
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Outlook and Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
A Appendix 131
A.1 Mercer's Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Bibliography 132
List of Figures
1.1 Content-based image retrieval system chart. . . . . . . . . . . . . . . . . . 5
2.1 Line profiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Comparison of five edge detectors, with default parameters. Visually, the
Canny method performs best. . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Typical examples of edge profiles and their derivatives. . . . . . . . . . . . 13
2.4 Edge maps for the Canny edge detector with various sets of parameters. The
leftmost column shows gray scale images. The second column shows the
ground-truth images. The third column displays the best edge maps we
could obtain automatically. The parameters are listed in the first
column of Table 2.3. The last column depicts sample edge maps obtained
with poorly performing Canny parameters, typically in the range
σ ≥ 3.5 with [θl, θu] approaching 1. . . . . . . . . . . . . . . . . . . . . . . 17
2.5 The left panel shows a straight line in Cartesian space; in addition, r and
θ are plotted to show the parameterization of the line. The right panel shows
the Hough transformation of the single line from the left. Note that the pattern
originates from the quantization of the Hough space. . . . . . . . . . . . . 22
2.6 Comparison of line segments obtained from the Hough-transform and the
edge point tracking algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Flow chart of the agglomerative hierarchical clustering algorithm. . . . . . 29
3.2 Various dendrograms obtained from six different linkage methods. . . . . . 32
3.3 The two graphs show dendrograms, where the ellipse in the left one encloses
a typical subgraph. SD12 and SD23 are the distances between nodes in the
enclosed subgraph. The right dendrogram shows the actual pruning for
the segments of the temple image (see Figure 3.4). . . . . . . . . . . . . . . 39
3.4 The first row shows a color image and all line segments. The second row
presents line segment clusters: the left graph illustrates noise-like
segments (non-compact clusters), whereas the right graph shows a salient
subset of line segments according to the compactness measure (see
Equation 3.20). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1 A graphical illustration of a set of points and three corresponding transfor-
mations (translation, rotation and scaling). Note that we show only the
EDM elements (e12 to e15) for the point P1. . . . . . . . . . . . . . . . . . 45
4.2 Illustration of line segment properties. . . . . . . . . . . . . . . . . . . . . 49
4.3 Sample groups of line segments that follow certain constraints (see Equa-
tions 4.20 to 4.24 and Equations 4.31 to 4.35). . . . . . . . . . . . . . . . . 54
4.4 Local perceptual groups of line segments based on Equations 4.31 to 4.35
and Equations 4.36 to 4.39. . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5 Class-wise averaged recall versus number of retrieved images plot with eight
different similarity measures for two classes of the ancient Watermark database
(see Section 5.1.1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1 Samples of scanned ancient watermark images (courtesy of the Swiss Paper
Museum, Basel). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Sample filigrees of each class from the Watermark database after enhance-
ment and binarization (see [112]). The classes are according to Table 5.1
(starting from top left): Eagle, Anchor1, Anchor2, Coat of Arms, Circle, Bell,
Heart, Column, Hand, Sun, Bull Head, Flower, Cup and Other objects. . . 70
5.3 Sample filigrees from the Watermark database after enhancement and bina-
rization (see [112]). Each of the four rows shows watermarks from the same
class, namely Heart, Hand, Eagle and Column. The samples show the large
intra-class variability within the Watermark database. . . . . . . . . . . . . 71
5.4 Sample images from the two Corel databases, where the images in the first
row are randomly taken from the 1,000 image set and the images in the
second row from the 10,000 image database. . . . . . . . . . . . . . . . . . 73
5.5 Two sample images of each class from the Caltech database. . . . . . . . . 74
5.6 Sample images from the 13 classes of the Brodatz image database. . . . . . 75
5.7 Sample retrieval result obtained with our structure-based features (see Sec-
tion 4.2) from the class Anchor1 of the Watermark image database. . . . . 84
5.8 Sample retrieval result of the class Circle from the Watermark database,
using global and local structural features (see Section 4.2). . . . . . . . . . 86
5.9 Sample retrieval result obtained with our structure-based features (see Sec-
tion 4.2) from the class Column of the Watermark database. . . . . . . . . 87
5.10 Sample retrieval result of the class Flower from the Watermark database,
using our structural features (see Section 4.2). . . . . . . . . . . . . . . . . 88
5.11 Sample retrieval result obtained with our structure-based features (see Sec-
tion 4.2) from the class Eagle of the Watermark database. . . . . . . . . . 89
5.12 Sample retrieval result obtained with our structure-based features (see Sec-
tion 4.2) from the class Eagle of the Watermark database. . . . . . . . . . 90
5.13 Sample retrieval result of the class Coat of Arms from the Watermark
database. As features we have incorporated our global and local structure-
based features (see Section 4.2). . . . . . . . . . . . . . . . . . . . . . . . . 91
5.14 Class-wise recall vs. number of retrieved images graphs for the Watermarks. 93
5.15 Partial matching result obtained from the Watermark database with our
structural features (see Section 4.2), using the intersection similarity
measure. The query image shows a cut-out of a filigree from the
class Coat of Arms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.16 Partial matching result obtained from the Watermark database with our
structural features (see Section 4.2), using the intersection similarity
measure. Note that the query shows the head of an eagle that
belongs to the class Eagle. . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.17 The seven images show examples of watermarks with ambiguities with respect
to their ground-truth class membership and their real content. The labels
below each image show the watermark class. Image 5.17(a) belongs to the
class Heart, although there is just a tiny heart in the center of the watermark.
In fact, it looks more like a Coat of Arms. A similar argument holds for
image 5.17(c). Note the embedded eagle. Specifically, 5.17(c) and 5.17(d)
were classified as Eagle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.18 Sample image retrieval results obtained from the 1,000 image collection. As
features we have used block-based features (see Section 5.5.1) and the global
structure-based features (see Section 4.2.1). The first image represents the
query and the images are arranged in order of decreasing similarity from left
to right, indicated by the numbers above the images. Note that 1 denotes an
identical match with the query image. . . . . . . . . . . . . . . . . . . . . . 103
5.19 Sample retrieval results obtained from the 10,000 image collection. As fea-
tures we have used block-based features (see Section 5.5.1) and the global
structure-based features (see Section 4.2.1). The first image represents the
query and the images are arranged in order of decreasing similarity from left
to right, indicated by the numbers above the images, where 1 denotes an
identical match with the query image. . . . . . . . . . . . . . . . . . . . . . 104
5.20 Average class-wise recall versus number of retrieved images graph for the
1,000 image database. All features are plotted for class-wise comparisons.
Note that the curves represent averaged quantities, i.e. each class member
was taken as a query image and the resulting graphs were averaged. . . . . 106
5.21 Precision-recall graph for the 1,000 image dataset, where the graph is aver-
aged over all images and classes, representing an overall performance measure. 107
5.22 Precision-recall graph for the 10,000 image database, where the graph is
averaged over all images and classes, representing an overall performance
measure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.23 Robustness analysis of the 1,000 image database for different brightness and
rotation conditions (brightness 30 % increased, saturation 10 % decreased).
The ordinate shows the degree of averaged invariance for each feature set,
with 1 being 100 % invariant. . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.24 Result obtained with structure features (see Section 4.2) from the Caltech
database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.25 Result obtained with structure features (see Section 4.2) from the Caltech
database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.26 Class-wise averaged precision-recall graphs for the Caltech image database. 117
5.27 Result obtained with structural features (see Section 4.2) for some transfor-
mations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.28 Result obtained with structural features (see Section 4.2) for some transfor-
mations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.29 Precision-recall graph for several image transformations for the query image
of Figures 5.27 and 5.28 that belong to the class Motorbikes. . . . . . . . . 121
List of Tables
2.1 Number of ground truth pixels for sample images. . . . . . . . . . . . . . . 16
2.2 Falsely detected edge pixels for the ten best sets of parameters [in %]. The
two rightmost columns indicate the mean error for the best ten and best five
sets, respectively. The second part of the table lists the absolute numbers
of falsely detected edge pixels. . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 The parameters [θl, θu, σ], for the ten best edge maps. At the end of the
table we list the averaged values. . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Averaged cophenetic correlation coefficients for six linkage methods. . . . . 37
5.1 The classes of the Watermark database. . . . . . . . . . . . . . . . . . . . . 72
5.2 Sample confusion matrix for a two class problem. . . . . . . . . . . . . . . 76
5.3 Class-wise performance measures for the Watermark database. A detailed
description can be found in Section 5.2.2 and in the text. . . . . . . . . . . 94
5.4 Confusion matrix for the Watermark database. . . . . . . . . . . . . . . . . 98
5.5 Class-wise true positive (TP) and false positive (FP) rates for the Watermark
database, where the first column indicates the correctly classified images and
the total number of class members. The second column shows the TP rate
in [%]. Column three represents all FP obtained and column four gives the
FP rate in [%]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.6 Detailed performance measures for the Watermark database. The measures
are explained in 5.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.7 Averaged feature invariance representation. First column: Invariance under
different brightness conditions (brightness 30 % increased, saturation 10 %
decreased). Second column: Invariance under different brightness and rota-
tion. Third column: Invariance between: different brightness conditions and
different brightness plus rotation. LS: Line segment feat., BBF: Block-based
feat., IFH: invariant feature histogram. . . . . . . . . . . . . . . . . . . . . 110
5.8 Confusion matrix for the Caltech database. . . . . . . . . . . . . . . . . . . 112
5.9 Class-wise true and false positive rates for the Caltech database, where the
first column indicates the correctly classified images and the total number
of class members. The second column shows the TP rate in [%]. Column
three represents all FP obtained and column four gives the FP rate in [%]. 112
5.10 Detailed performance measures for the Caltech database (see Section 5.2). 113
5.11 Comparison of our structure-based method with others from the literature. 113
5.12 Confusion matrix for the Brodatz database. . . . . . . . . . . . . . . . . . 122
5.13 True positive and false positive rates for the Brodatz database, where the
first column indicates the correctly classified images and the total number
of class members. The second column shows the TP rate in [%]. Column
three represents all FP obtained and column four gives the FP rate in [%]. 123
5.14 Detailed performance measures (see Section 5.2.1), for the Brodatz database. 124
5.15 Comparison of the structure-based method with others from the literature
for the Brodatz database. . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Chapter 1
Introduction
Digital images play an important role in our everyday life. In many areas of science,
commerce and government, images are acquired and used daily. During the past decades
we have observed a steady increase in image data, leading to huge repositories.
The national geographic imagery archive of the United States currently has a size in the
range of petabytes (PB) and grows by several terabytes (TB) every day. The commercial
Internet search engine Yahoo claims to have indexed about 1.6 billion images. This rapid
evolution drives the demand for qualitative and quantitative image retrieval systems. To
date, various commercial systems and Internet search engines feature keyword-based image
retrieval. Usually, the results are not satisfactory due to the high complexity of images,
which cannot easily be described by words. Therefore, content-based image retrieval has
gained in importance during the past decade. The areas of application include, but are
not limited to:
• E-commerce solutions (support client online product search, e.g. retrieve all similar
pictures of a specific couch. Possible interest groups are online-stores)
• Authors, editors, lecturers or journalists, in order to improve their working efficiency
(retrieval by subject or categorical search, e.g. an author writing an article about
Rome wants to find all images of St. Peter's Cathedral)
• Private users (retrieve personal images acquired with their own digital camera)
• Surveillance systems (search for a certain person, a car or other items; possible areas
of application are casinos, traffic surveillance, access controls, crime prevention and
others)
• Aviation, space and military applications, e.g. navigation systems and satellite
imagery (object and structure recognition, e.g. buildings, road networks)
• Medicine (analysis of X-rays, Magnetic Resonance Imaging (MRI) - might help for
diagnostics, tissue classification)
• Biological image processing (e.g. pollen recognition, cell detection and recognition)
• Art collections, museums (e.g. retrieval of painting styles)
• Watermarks and filigrees (retrieval of ancient watermarks)
• Autonomous robot navigation (e.g. obstacle detection)
The field of image retrieval has also triggered advancements in the related area of image
classification. One of the main differences between image classification and retrieval is
that the latter imposes a similarity ranking, whereas the former distinguishes
between classes rather than between single images. The broad range of image retrieval and
classification applications demands general as well as highly specialized systems equipped
with image features such as color, texture or structure-based ones. In this
thesis we focus on image classification and content-based image retrieval with structure-
based features.
1.1 Contributions of this Thesis
The aim of this thesis is the investigation of discriminative structure-based features for
image classification and content-based image retrieval problems. Moreover, the method
developed in this work contributes to the state of the art in (structure-based) image
retrieval. In the following we list the main contributions of this thesis:
• Edge detection and verification: We show that the Canny edge detector is most
suitable for deriving structure-based image features. We present a method that auto-
matically computes a set of Canny parameters for real-world images. To this end, we
evaluate 550 different parameter sets and determine the best one by comparison
with a manually generated ground truth. The best parameter set produces an error
rate of only 0.1 to 1.3%. Finally, we define a range for the three parameters, based
upon the ten best sets.
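The parameter search just described can be sketched as an exhaustive grid evaluation against a manually generated ground-truth edge map. The helper names below are illustrative, and the stand-in detector (a bare gradient-magnitude threshold that ignores σ and the lower threshold) is a deliberately crude substitute for a real Canny implementation, not the thesis' code:

```python
import numpy as np
from itertools import product

def select_canny_params(detect, image, ground_truth, sigmas, thetas):
    """Exhaustively try (theta_l, theta_u, sigma) combinations and keep the
    set whose edge map disagrees least with the ground-truth edge map."""
    best_params, best_error = None, np.inf
    for sigma, lo, hi in product(sigmas, thetas, thetas):
        if lo >= hi:  # thresholds must satisfy theta_l < theta_u
            continue
        edge_map = detect(image, sigma, lo, hi)
        error = np.mean(edge_map != ground_truth)  # fraction of wrong pixels
        if error < best_error:
            best_params, best_error = (lo, hi, sigma), error
    return best_params, best_error

def toy_detect(image, sigma, lo, hi):
    """Crude stand-in detector: threshold the raw gradient magnitude.
    (sigma and lo are accepted for interface compatibility but unused.)"""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    return magnitude > hi * magnitude.max()
```

With the 550 parameter sets of the thesis the same loop applies unchanged; only the detector and the ground-truth maps differ.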
• Hierarchical Clustering: We present a straight line segment clustering method.
Firstly, we determine the existence of an underlying clustering structure in the feature
space with the Hopkins test. A result of h=0.81 suggests the adequacy of our
clustering method with a confidence level of more than 90%. Secondly, we evaluate
the most suitable linkage method for the agglomerative hierarchical clustering by
computing the cophenetic correlation coefficient for 15,972 different hierarchies. With
an average score of 0.9262, the average linkage method produces the best result.
Thirdly, we introduce a subgraph distance ratio that is used to prune or cut
a dendrogram. The resulting groups of straight line segments are divided into salient
and less important clusters on the basis of their intra-class compactness.
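The linkage selection by cophenetic correlation can be sketched with SciPy's hierarchical-clustering utilities. This is a minimal sketch on a synthetic point set, not the 15,972 hierarchies evaluated in the thesis, and `best_linkage_method` is a hypothetical helper name:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

def best_linkage_method(points, methods=("single", "complete", "average", "weighted")):
    """Build one hierarchy per linkage method and score each by its cophenetic
    correlation coefficient, i.e. how faithfully the dendrogram distances
    reproduce the original pairwise distances; return the best method."""
    d = pdist(points)                 # condensed pairwise distance vector
    scores = {}
    for method in methods:
        Z = linkage(d, method=method)
        c, _ = cophenet(Z, d)         # cophenetic correlation coefficient
        scores[method] = c
    winner = max(scores, key=scores.get)
    return winner, scores
```

On well-separated data most linkages score highly; the thesis' finding that average linkage wins on its line-segment features is an empirical result that this toy example does not reproduce.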
• Structure-Based Features: We have developed a method that represents the
global structure of an image, as well as the local structure of perceptual groups
and their connectivity. The relations between perceptual groups are computed using
Euclidean distance matrices. The features are invariant against similarity
transformations and robust against changes in illumination. The method is
applicable to a broad range of tasks.
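The role of Euclidean distance matrices (EDMs) can be illustrated in a few lines of numpy. Normalizing by the largest entry is one simple way to remove the global scale factor; it is an assumption of this sketch, not necessarily the exact scheme of Chapter 4:

```python
import numpy as np

def edm(points):
    """Euclidean distance matrix: D[i, j] = ||p_i - p_j||."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def normalized_edm(points):
    """Dividing by the largest distance removes the global scale factor,
    making the descriptor invariant to similarity transformations."""
    D = edm(points)
    return D / D.max()

# rotation + scaling + translation leave the normalized EDM unchanged
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
Q = 3.0 * P @ R.T + np.array([2.0, -1.0])
assert np.allclose(normalized_edm(P), normalized_edm(Q))
```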
• Classification and Content-Based Image Retrieval of Ancient Watermark
Images: The aim of this task is the retrieval and classification of ancient watermarks
(14 classes). The retrieval results are presented with averaged class-wise precision-
recall graphs that show the discriminative ability of the structure-based feature set.
For the classification we use a support vector machine with an intersection kernel.
The results are obtained by a leave-one-out evaluation and show that for the 14 class
problem an average true positive rate of 87.41% was achieved.
• Object Class Recognition and Retrieval: The next contribution is object class
recognition on the five classes of the Caltech database, consisting of more than
2800 images. We apply a set of structure-based features and compare the results
with scores from the literature. The results show a correct classification rate of more
than 95%.
• Color Image Retrieval: Structure-based features are applied to two Corel image
datasets of 1,000 and 10,000 images. The results are compared with two state-of-the-
art methods and presented with averaged precision-recall graphs. The analysis is
completed by an investigation of the feature robustness under several transforma-
tions for all presented methods. The results show that the proposed features are
invariant (96.56%) against rotation and brightness transformations.
• Texture Class Recognition: We perform a classification task on texture samples
from the Brodatz benchmark database. The structure-based features obtain an aver-
age classification rate of 98%, which is in the same range as the best published scores.
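Several of the experiments above combine a histogram intersection kernel with leave-one-out evaluation. The sketch below defines that kernel, but deliberately substitutes a kernel nearest-neighbour rule for the support vector machine so the example stays self-contained; it is an illustration of the evaluation protocol, not the thesis' classifier:

```python
import numpy as np

def intersection_kernel(X, Y):
    """Histogram intersection kernel: K(x, y) = sum_i min(x_i, y_i)."""
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=-1)

def leave_one_out_accuracy(X, y):
    """Leave-one-out evaluation: each sample is held out once and labeled
    by its most similar remaining sample (a nearest-neighbour stand-in
    for the SVM used in the thesis)."""
    K = intersection_kernel(X, X)
    correct = 0
    for i in range(len(y)):
        sims = K[i].copy()
        sims[i] = -np.inf            # hold out sample i
        correct += (y[sims.argmax()] == y[i])
    return correct / len(y)
```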
1.2 Content-Based Image Retrieval
There have been great undertakings by several companies, e.g. Google, Yahoo and
Microsoft, to create text-based image search engines. Although pure text-based search
already performs quite well, image search results are often not satisfactory, since text
cannot sufficiently describe an image's content; imagine, for example, an expressionistic
painting. Hence, image retrieval systems solely based on textual search often perform
poorly. Content-based image retrieval (CBIR), on the other hand, enables us to search
an image by its real content. Figure 1.1 depicts a chart of a CBIR system.
A wide variety of models have been considered for image retrieval [19] from the very
simple early vision techniques, like color histograms [133], to very sophisticated methods,
applicable for specific and restricted environments. The various kinds of CBIR features
can be grouped into the following categories:1 color with/without spatial constraints [141]
[130], texture [43] [26] [53] [84]2 [128] [127], shape and structure [61] [91] [4] [13] [14].
In the past, the scientific community has pursued three types of content-based image
retrieval approaches: semi-automatic extraction of attributes, automatic feature extraction
and object recognition.
Semi-automatic systems [27] provide tools to accelerate the annotation process; how-
ever, they require manual interaction for database generation. The user might manually
(probably computer assisted) segment objects in the image, followed by an automatic
analysis and annotation by the computer. Such systems considerably accelerate the image
annotation. Semi-automatic systems can only work offline, i.e., a database has to be
generated before queries can be issued. Hence, they are not well suited for heavily fluctuating
data such as, for example, a collection of Internet sites. In addition, these systems are
limited in size, since new image data may be acquired faster than manual
annotations can be added.
Fully automatic systems [131][129] overcome these problems since the analysis can be
done at query time if the data set has not been analyzed before. Nonetheless, for unrestricted
image retrieval problems there is a large gap between the objective image features
that can be extracted and the semantics of an image.
A necessary prerequisite for features is invariance, no matter whether they are local,
global or semantic. Invariant features remain unchanged in case the image content is
transformed according to a group action, i.e. the features obtained for an unaltered or
from a transformed image are mapped to the same point in feature space. A simple
example is the color histogram of an image that remains identical under any permutation
of the image pixels. However, a slight change in illumination may significantly change a
simple color histogram.
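The histogram example just given can be made concrete in a few lines. The sketch below (using NumPy, which the thesis itself does not prescribe) demonstrates both the permutation invariance and the sensitivity to an illumination change:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small synthetic 8-bit gray-scale "image".
image = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
hist_original, _ = np.histogram(image, bins=256, range=(0, 256))

# Permuting all pixels changes the spatial layout completely ...
permuted = rng.permutation(image.ravel()).reshape(image.shape)
hist_permuted, _ = np.histogram(permuted, bins=256, range=(0, 256))
# ... yet both images map to the same point in feature space.

# A global brightness shift, in contrast, moves the histogram.
brighter = np.clip(image.astype(int) + 20, 0, 255)
hist_brighter, _ = np.histogram(brighter, bins=256, range=(0, 256))
```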
The invariance property simplifies the comparison of images. At the same time, invariance
necessarily averages an object's content in terms of its feature-space representation.
Therefore, each invariant feature should be able to describe various transformations of
1Note that we do not intend to give a comprehensive survey of CBIR techniques and systems; however, Veltkamp et al. [142] give a good overview of CBIR systems.
2The authors combine texture and color features for image classification.
Figure 1.1: Content-based image retrieval system chart.
an object, while remaining a unique descriptor. Hence, we will evaluate the invariance
properties of the proposed structure-based features and the consequences for retrieving
similar images.
Many state-of-the-art algorithms base their features on low-level properties. Though
these features often give good results, they clearly lack higher-level interpretation and
knowledge of an image's content. Recent research initiatives have begun to focus on semantic
meanings and conceptual representations of image features by incorporating problem-specific
techniques [120], structure-based knowledge at different levels of granularity,
or scene content understanding [29]. For further progress of CBIR algorithms
it is essential to integrate semantics or conceptual descriptors on various levels to sufficiently
represent an image's content.
The structure-based algorithm investigated in this thesis focuses on image semantics in
order to contribute to the state of the art in image retrieval [49]. Achieving progress in the
area of semantic image retrieval requires solving the so-called semantic gap [29] in parts
or as a whole. There are various definitions for the semantic gap in CBIR. However, the
following definition gives a good abstraction of the problem:
The rift between meaningful descriptors that users expect CBIR applications
to identify with image content, and the features current state of the art CBIR
systems are able to compute.
Indeed, there is a huge "gap" between the human perception and recognition of complex
scenes and today's limited features for describing them. Although today's research is far from
solving the semantic gap in general, good results have been reached for special semantics
such as equivalence or object [50] classes, e.g. the classification of buildings [13].
However, in the remainder of this thesis we will show that structure-based knowledge clearly
advances image classification and retrieval tasks. Structure-based information can take on
many forms, from fine-granular to rough and coarse scales. Structure usually
manifests itself in areas of high contrast, leading to visually well-distinguished image areas. The
aim of this work is to extract this information and describe it with proper features on
local and global scales.
1.3 Related Work
In this section we give a review of related structure-based feature extraction methods in
the area of image classification and content-based image retrieval. Structure-based features
have been frequently used for image representation, classification and content-based image
retrieval tasks [58]. Moreover, the authors in [57] have successfully combined texture
with structure and color features. Their findings are not unexpected, since the features
are complementary to each other: they focus on different aspects
of an image's content. In the work of [58], the authors use perceptual grouping features
to represent the structural image content. Their structure-based method extracts a feature
vector containing the number of lines in so-called "L"- and "U"-junctions as well as in significant
parallel groups and polygons. The authors argue that their representation of structure
information serves well for man-made objects. However, their quantitative statistical analysis of
line segments lacks relative spatial knowledge of distant segments, which would be suitable for
describing a complete image scene. Our approach differs in that we encode relative
arrangements of line segments obtained by a hierarchical clustering. Our structure-based
method encodes the relative position, orientation and distance from a set of line segments
to all others in a histogram representation.
The authors in [61] utilized the edge direction histogram to represent the shape of
images. In order to achieve scale invariance, each histogram was normalized with respect
to the number of edge points in the image. In [83], a high-level image feature called
consistent line clusters was introduced, which is a set of lines that is homogeneous in
terms of its characteristics. The algorithm was applied to a building recognition task
with satisfactory results. Another building classification algorithm based on weighted line
segment arrangements was published in [13]. The authors adopted the connectivity of
parallel and perpendicular segments in order to classify buildings. The work of [14] shows
that the clustering of line segments can be used for CBIR applications. The method
described incorporates a hierarchical clustering algorithm with single linkage, where the
authors created a 128-bin feature histogram from the clustering result.
The work of [56] presents a comparison between structure-based and texture-based
CBIR systems, with respect to man-made object retrieval. Perceptual grouping3 features
were used to represent an image's structure. The results reveal that using a-priori
knowledge of man-made buildings yielded better performance than gray-scale histograms
and Gabor texture features.
In [54] a shape histogram is introduced, which is constructed from geometric attributes
and structural information. Line patterns based on line segments are extracted and represented
by an N-nearest-neighbor graph. The set of edges extracted from each N-nearest-neighbor
graph is used to construct a histogram. The actual matching was performed
with the Bhattacharyya distance on a trademark and logo database. The results
revealed that the method is robust against missing lines and line splitting, but additional
line segments drastically reduce the performance. In addition, the authors presented a
graph-matching approach which performed better than the histogram method. However,
due to its extremely long matching times it is not applicable to real-time retrieval tasks.
The authors of [91] describe an edge-based feature for shape recognition, where a model
3A detailed description of the perceptual grouping features may be found in [58].
of an object is learnt and used for recognition of the same object in different images. In [152]
another edge-map based feature was proposed for CBIR applications. The actual feature
is computed by applying a waterfilling algorithm to the initial edge-map, which results
in a structure representation of an image. The results show that edge-based features are
useful and can improve CBIR results. An edge-based shape feature was presented in [4]
[75], where contour edge point relations are encoded to represent the entire object shape.
The feature performed well on recognition tasks with object databases such as COIL-100
(household objects photographed on a turntable in 5-degree increments with constant
background) and MNIST (a dataset of handwritten digits).
The article in [121] presents a structure-based feature named image context vector. The
features are computed from line segments, where short segments were omitted. In fact, only
four orientations were used for the angular segment representation. The authors observed
robustness with respect to scaling, translation and noise. However, the approach is not
invariant against rotation and illumination changes.
In [33] the authors present a model-based approach to 3D building extraction from
aerial images. The model consists of a hierarchical set of parts, representing subparts of
a building. The work of [37] approached the problem of building extraction from aerial
imagery by incorporating geometric moments. The method extracts regions containing
buildings and fits a polygon to the area of interest. However, the actual features are
geometric moments computed from the closed contour of the fitted polygon.
The review of the literature shows that an image’s structure can be used to extract
salient and powerful features for CBIR applications.
Chapter 2
Edge Detection
2.1 Introduction
In this chapter we will discuss the problem of edge detection, which is regarded as a
subject of fundamental importance in image processing. During the last decades the image
processing community has spent huge effort on the task of edge finding. The
usage of edge finding methods is mainly driven by the need to reveal structural information.
Usually, edge detection is one of the initial steps of an image processing pipeline.
That is also the case for our problem, the structure-based feature extraction. The
edge maps serve as a kind of ground truth data for the subsequent feature extraction procedure.
Therefore, we have to take great care in the choice and usage of the proper edge
extraction technique.
The question of "what can be regarded as an edge" must be answered before we can
derive edge detection methods. In fact, one can regard edges as locations of great importance
or saliency within an image. This also makes sense from a human physiological
point of view: the human perception of objects and scenes is to a large extent based on
particular spatial configurations and changes in intensity [150].
We can define an edge as a strong intensity contrast or a jump in an image’s intensity
within a limited spatial range, i.e. intensity changes between one image pixel to the next.
Figure 2.1 shows the three most common edge profiles. Formally, we can define a step edge
profile function as follows:
    V_step(x) = { b        if x_0 − f/2 < x < x_0,
                { b + m    if x_0 ≤ x < x_0 + f/2,                    (2.1)
with b as the background gray value and m as the change of the gray value. The edge is
localized at x_0 and f is the size of the window. In Figure 2.1(a) we can see an ideal step
edge. Although the ideal case of an edge is easy to describe it almost never will occur in a
Figure 2.1: Line profiles: (a) ideal step edge, (b) ramp edge, (c) line profile.
real image. Usually, objects do not have real sharp boundaries as Equation 2.1 suggests.
Additionally, it is almost impossible that an image scene is sampled in such a way that the
edges are exactly located at the margins of pixels. Physical aspects such as temperature,
motion, thermal noise from the camera and others, superimpose a noise component on the
image. Therefore, edges are often represented by a ramp as shown in Figure 2.1(b).
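The step edge of Equation 2.1 can be generated directly from its definition. In the sketch below, the values of x0, b, m and f are arbitrary illustration choices:

```python
import numpy as np

def step_edge(x, x0=0.0, b=10.0, m=50.0, f=4.0):
    """Ideal step edge profile V_step of Equation 2.1 on the window around x0."""
    x = np.asarray(x, dtype=float)
    v = np.full_like(x, np.nan)                 # undefined outside the window
    v[(x0 - f / 2 < x) & (x < x0)] = b          # background side
    v[(x0 <= x) & (x < x0 + f / 2)] = b + m     # stepped side
    return v

x = np.linspace(-1.9, 1.9, 9)
profile = step_edge(x)   # b left of x0, b + m right of x0
```

Adding a noise term or convolving this profile with a small smoothing kernel produces the ramp-like edges that occur in real images.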
Once the basic properties of edges are defined, various edge extraction methods
can be derived. To date, numerous edge detection algorithms have been proposed;
it is well beyond the scope of this thesis to give a complete review of them all.
Some of the best-known methods are Roberts, Sobel, Robinson [116], Kirsch
[65], Canny [18], Prewitt [107], Marr-Hildreth [89] or the Laplacian of Gaussian (LoG)
operator. Good overviews, reviews and evaluations of various techniques can be
found in [153], [148] and [8]. The selection of the proper edge operator is not obvious.
In practice, the choice of an edge detector is not always driven by accurate performance
evaluation, but rather by intuitive or empirical knowledge. Yet the question remains:
"which edge detection algorithm performs best?" Figure 2.2 shows, as an example, a comparison
of five different edge detectors (LoG, Prewitt, Roberts, Sobel and Canny), where
the Canny algorithm visually produces the best result1. In fact, the Canny detector is
widely used for various structure- or shape-based feature extraction methods. A useful
edge extraction and evaluation procedure was presented in [47]. In the experimental part
five well-performing edge detection algorithms (Canny, Nalwa, Iverson, Bergholm, Rothwell)
were applied. The authors took great care in the evaluation of all methods under different
parameter settings. Independent judges visually verified the quality of edge maps obtained
from various parameter sets with all algorithms. It turned out that with adapted
parameters the Canny edge detector performed best. The authors of [126] compared
seven edge detectors on the task of structure from motion. Their experiments have shown
that the Canny detector performs best and, additionally, is one of the fastest algorithms.
A similar conclusion was drawn by [8]. Therefore we use the Canny method as the first
1Note that we will show later in this chapter how to objectively determine the quality of an edge map.
step in our structure-based feature extraction procedure. Moreover we will show how to
automatically determine the best set of parameters for the Canny edge detector. In the
next section we review the basics of the Canny edge detector in order to better understand
the parameter determination procedure.
2.2 Canny Edge Detector
The Canny edge detector is known to be a robust, well-performing algorithm [18]. More
precisely, the Canny edge detector is optimal for step edges which are corrupted by a
Gaussian noise process.
The aims of the Canny algorithm were clearly stated in the original work [18] in the form
of optimality criteria:
• Good detection criterion: important edges should be detected and spurious responses
should be omitted.
• Good localization criterion: the distance between the real and the detected edge
position should be minimal.
• Clear response criterion: multiple responses to a single edge should be avoided.
The first step of the Canny edge detector is to smooth the actual input image. This
has the effect of slightly blurring the image - depending on the size of the Gaussian. The
smoothing is accomplished by convolving the raw 2-D image g(x, y) ∈ R2, with a 2-D
Gaussian G(r) in polar coordinates:
    G(r) = 1/√(2πσ²) · exp(−r² / (2σ²)),                              (2.2)
with r = √(x² + y²) representing the radial distance from the origin. In order to extract
edges we need to form the first and second derivatives of the Gaussian, which are

    G′(r) = − r/√(2πσ⁴) · exp(−r² / (2σ²)),                           (2.3)

and

    G″(r) = − 1/√(2πσ⁴) · [ 1 − r²/σ² ] · exp(−r² / (2σ²)).           (2.4)
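Since G′(r) = −(r/σ²) G(r) and G″(r) = −(1/σ²)(1 − r²/σ²) G(r) follow from Equation 2.2 by the chain rule, the closed forms can be cross-checked numerically. The sketch below works with the chain-rule factors applied to the Gaussian of Equation 2.2 (normalization prefactors vary between sources):

```python
import numpy as np

def gaussian(r, sigma):
    """1-D Gaussian of Equation 2.2."""
    return np.exp(-r**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def gaussian_d1(r, sigma):
    """First derivative via the chain rule: G'(r) = -(r / sigma^2) G(r)."""
    return -(r / sigma**2) * gaussian(r, sigma)

def gaussian_d2(r, sigma):
    """Second derivative: G''(r) = -(1/sigma^2) (1 - r^2/sigma^2) G(r)."""
    return -(1 / sigma**2) * (1 - r**2 / sigma**2) * gaussian(r, sigma)

# Cross-check against central finite differences.
sigma, h = 1.5, 1e-5
r = np.linspace(-4.0, 4.0, 41)
num_d1 = (gaussian(r + h, sigma) - gaussian(r - h, sigma)) / (2 * h)
num_d2 = (gaussian_d1(r + h, sigma) - gaussian_d1(r - h, sigma)) / (2 * h)
```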
For the second step we use the smoothed version of the image g and convolve it with
the operator G_n, which is the first derivative of G(r) in the direction n. Edge points are
then located at the zero crossings

    ∂²/∂n² (G ∗ g) = (∂²G/∂n²) ∗ g = 0,                               (2.5)
Figure 2.2: Comparison of five edge detectors (Laplacian of Gaussian, Prewitt, Roberts, Sobel and Canny) with default parameters. Visually, the Canny method performs best.
Figure 2.3: Typical examples of edge profiles and their derivatives: (a) smoothed step edge, (b) smoothed step edge with noise, (c) first derivative of a smoothed step edge, (d) first derivative of a smoothed step edge with noise, (e) second derivative of a smoothed step edge, (f) second derivative of a smoothed step edge with noise.
where g denotes the image and G_n is defined as

    G_n = ∂G/∂n = n · ∇G,                                             (2.6)

with the edge normal (gradient direction) n = ∇(G ∗ g) / |∇(G ∗ g)|. In words, Equation 2.6
defines how to find local maxima in the direction perpendicular to an edge. In the literature
this operation is known as non-maximal suppression, i.e. local maxima can be found where
peaks in the gradient function occur. The non-maxima perpendicular to the edge direction
are suppressed, since the edge strength along the edge contour is mostly continuous. Thus
the method ensures a high signal-to-noise ratio (SNR) and a maximal localization of
the edge operator. This works well for most kinds of edges but fails at
corner locations. Therefore, the Canny method is not suitable for corner detection tasks.
Setting the final threshold is a common problem in edge detection: significant edges
should be kept while others are omitted. A manifestation of this problem is so-called
streaking, which refers to the appearance of broken lines due to edge pixels lying below and above
a fixed threshold. For our investigation of structure-based features, describing local and
global arrangements, streaking can be a serious problem. As we will explain in Chapter 4,
the edge map is of great importance for the computation of meaningful structure-based descriptors.
The hysteresis thresholding of the Canny algorithm, with lower and upper thresholds
(θl, θu), tolerates variations of the edge strength between the two thresholds and thereby
counteracts streaking, so that its likelihood is greatly reduced. In
the next section we will discuss how to automatically select the three parameters for the
Canny algorithm based upon the evaluation of ground truth data.
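The hysteresis idea can be sketched in isolation. The following is a simplified illustration, not the full Canny pipeline (which additionally performs smoothing and non-maximal suppression); the toy strength map is our own construction:

```python
import numpy as np
from collections import deque

def hysteresis(strength, theta_l, theta_u):
    """Keep pixels above theta_u, plus pixels above theta_l that are
    8-connected (transitively) to a pixel above theta_u."""
    strong = strength >= theta_u
    weak = strength >= theta_l
    keep = strong.copy()
    queue = deque(zip(*np.nonzero(strong)))
    h, w = strength.shape
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and weak[ny, nx] and not keep[ny, nx]:
                    keep[ny, nx] = True
                    queue.append((ny, nx))
    return keep

# An edge whose strength dips between the thresholds: a single fixed
# threshold at 0.5 would break ("streak") the line; hysteresis keeps it.
strength = np.zeros((3, 7))
strength[1] = [0.9, 0.4, 0.4, 0.9, 0.4, 0.4, 0.9]
edge = hysteresis(strength, theta_l=0.3, theta_u=0.5)
```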
2.3 Performance Evaluation and Parameter Selection
Though the Canny detector produces good results in general, it is not obvious how to select
the parameters. In fact, an automatic determination or selection is most desirable. Various
researchers tried to come up with evaluation procedures that can be roughly classified
into evaluation methods based on ground truth and evaluation methods without ground
truth. Evaluation methods without ground truth rely solely on human judgement (visual
inspection) of an edge detector's result. Yet even for human beings it is difficult to
evaluate the output of edge detectors, because the human perception of coarser or finer
structures varies strongly in a complex scene [47]. The authors have shown for a set of real
world images how human individuals rate the output of various edge detectors visually.
The test persons were asked to find the best edge images for each detector from a pool of
different parameter settings. Although the analysis revealed that many of the test persons
agreed on one particular result as being the best for some images, this was not true in
general. The subjectiveness of rating which edge map is the ideal one is evident, since it
depends to a large extent on an image's content. Further, the results have shown that the
clearer a scene appeared in the image, the lower the variance in the ratings was. Therefore,
we conclude that a highly structured background, e.g. in images of nature scenes
such as forests, natural grass areas or stony grounds, leads to very different ground truths
or evaluation results, respectively. In fact, it is nearly impossible to accurately create a
ground truth or rely on a human evaluation of such images. Even if a visual inspection of
edge images results in the selection of the best parameter set it is not applicable to larger
amounts of images or automatic processing methodologies.
Evaluation methods based on ground truth data are capable of selecting parameters
automatically. We agree with the authors in [144] that an evaluation of edge detection
methods with ground truth data is inevitable and can help to determine optimal parameter
sets for any kind of detector. The authors in [8] have evaluated the performance of edge
detectors by the usage of ground truth data. In detail, an edge is counted as a true positive
if it was detected within a specified region and as a false positive if it wasn't. Various tests
have shown that the Canny and Heitger edge detectors performed best. Other automatic
and semi-automatic evaluation procedures based on ground truth data are presented in
[68] and [149].
However, the subject of automatic parameter selection remains highly subjective. The
experiments we have conducted largely agree with observations of other groups (e.g. [47]):
even manually generated ground truths strongly depend on the individual person.
Despite this evident evaluation problem it is necessary to develop algorithms for the
of automatically generated parameters only hold for the currently used image database.
Other sets might need different parameters in order to obtain good visual results.
Next we will present an automatic parameter selection procedure and apply it to the
Canny edge detector. To this end, we define a simple yet effective measure of an edge image's
quality. For this purpose we use sample images from the University of South Florida (USF)
image database [47], which is provided together with a manually created ground truth. In
practice the labelling process turns out to be extremely tedious work, and for very complex
scenes such as forests or mountains it is nearly impossible to correctly label all true edges.
Although it is not possible to exactly determine the perfect result for an arbitrary image
one can try to limit the detector’s parameter space to a meaningful range.
As shown earlier in this section the Canny operator can perform reasonably well in
most cases, but the results may still heavily vary depending on the parameters as shown
in Figure 2.4. So we want to determine the best possible combination of the three Canny
parameters θl, θu and σ. Hence, we densely sample the parameter range of 0 ≤ θl, θu < 1
and 0 < σ ≤ 5, by increments of 0.1, where θl is multiplied by factor 0.4 in order to obtain
Table 2.1: Number of ground truth pixels for the sample images.

    Image                   # of GT pixels (N_p^GT)
    207 (Car)                8748
    43 (Telephone)           8833
    101 (Electric Iron)     17870
    132 (Kitchen Tools)      6701
a gradual interval. Subsequently, we form combinations of θl, θu and σ, resulting in 550
different parameter sets s_m, m = 1, 2, ..., 550. In the next step we compute the edge
map for every image and every set of parameters. Once an edge image is obtained, it is
compared with the ground truth as follows. Let p_i, i = 1, 2, ..., N, be a pixel value, where
N is the number of pixels of the edge image, with p_i ∈ {0, 1} for a binary and
p_i ∈ {0, ..., 255} for a gray-scale edge map2. The best set of parameters for an image is
then the one that minimizes

    s* = argmin_{m=1,2,...,550} || N_p^GT − N_p^(s_m) ||,   with   N_p^(s_m) = |{ i : p_i ≠ 0 }|,    (2.7)

where N_p^GT is the number of edge pixels different from zero for the ground truth image (see
Table 2.1). Figure 2.4 shows, from left to right, the ground truth and the best and worst
sets of parameters for several images. A visual comparison of the samples shows that the
automatically selected parameters result in edge maps of similar quality to the
ground truths. Moreover, the bad examples are clearly identifiable.
2Similarly we can define pi for color images.
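Equation 2.7 amounts to counting non-zero pixels per parameter set and keeping the set whose count is closest to the ground truth. A compact sketch (the detector argument is a stand-in for any edge detector; the grid below approximates, but does not exactly reproduce, the 550 sets described above):

```python
import itertools
import numpy as np

def count_edge_pixels(edge_map):
    """N_p: the number of non-zero pixels of a (binary or gray-scale) edge map."""
    return int(np.count_nonzero(edge_map))

def best_parameters(image, ground_truth, edge_detector):
    """Pick the (theta_l, theta_u, sigma) minimizing |N_p^GT - N_p^(s_m)| (Eq. 2.7)."""
    n_gt = count_edge_pixels(ground_truth)
    best, best_err = None, float("inf")
    # theta_u sampled in [0, 1), sigma in (0, 5], in 0.1 increments;
    # theta_l = 0.4 * theta_u mirrors the "factor 0.4" coupling in the text.
    for theta_u, sigma in itertools.product(np.arange(0.0, 1.0, 0.1),
                                            np.arange(0.1, 5.1, 0.1)):
        theta_l = 0.4 * theta_u
        edge_map = edge_detector(image, theta_l, theta_u, sigma)
        err = abs(n_gt - count_edge_pixels(edge_map))
        if err < best_err:
            best, best_err = (theta_l, theta_u, sigma), err
    return best, best_err

# Toy check with a thresholding "detector" standing in for Canny:
image = np.linspace(0, 1, 100).reshape(10, 10)
detector = lambda img, tl, tu, s: img > tu
best, err = best_parameters(image, image > 0.35, detector)
```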
Figure 2.4: Edge maps for the Canny edge detector with various sets of parameters. The leftmost column shows gray-scale images. The second column shows the ground truth images. The third column displays the best edge maps obtained automatically; the parameters are listed in the first column of Table 2.3. The last column depicts sample edge maps obtained with poorly performing Canny parameters, typically of the order σ ≥ 3.5 and [θl, θu] approaching 1.
Table 2.2: Falsely detected edge pixels for the ten best sets of parameters [in %]. The two rightmost columns give x̄10 and x̄5 for the best ten and best five sets, respectively. The second part of the table lists the absolute numbers of falsely detected edge pixels.

    Measures [%]
    Image      1       2       3       4       5        6        7        8        9       10     x̄10    x̄5
    207     1.3832  1.5318  4.0924  5.5670  5.9671   5.9785   9.3621   9.8537  10.4138  11.2140   5.97   4.09
    43      0.7359  1.6076  7.9928  9.1475  9.7702  12.1137  17.5818  19.9819  23.7632  24.2839  10.94   7.99
    101     0.0056  0.0839  0.2294  0.3302  0.5484   0.5484   0.5932   0.8226   1.0632   1.2815   0.55   0.23
    132     0.0149  1.3282  1.4923  1.8803  2.5220   2.8205   3.1786   3.7308   4.2382   4.6560   2.67   1.49

    Measures [px]
    Image      1     2     3     4     5      6      7      8      9     10   x̄10   x̄5
    207      121   134   358   487   522    523    819    862    911    981   522   358
    43        65   142   706   808   863   1070   1553   1765   2099   2145   966   706
    101        1    15    41    59    98     98    106    147    190    229    98    41
    132        1    89   100   126   169    189    213    250    284    312   179   100
Table 2.3: The parameters [θl, θu, σ] for the ten best edge maps. At the end of the table we list the averaged values.

    Image         1     2     3     4     5     6     7     8     9    10
    207    θl   0.04  0.04  0.04  0.04  0.04  0.08  0.04  0.04  0.12  0.08
           θu   0.1   0.1   0.1   0.1   0.1   0.2   0.1   0.1   0.3   0.2
           σ    2.5   2.6   2.7   2.3   2.4   0.9   2.2   2.8   0.6   1.0

    101    θl   0     0.04  0.04  0     0     0.04  0     0.04  0     0.04
           θu   0.01  0.1   0.1   0.01  0.01  0.1   0.01  0.1   0.01  0.1
           σ    2.2   1.1   1.0   2.1   2.3   1.2   2.4   0.9   2.5   1.3

    43     θl   0     0.12  0.08  0.12  0.04  0     0.12  0     0.12  0.04
           θu   0.01  0.3   0.2   0.3   0.1   0.01  0.3   0.01  0.3   0.1
           σ    4.8   0.7   1.1   0.4   2.1   4.9   0.6   4.7   0.5   2.2

    132    θl   0.08  0.08  0.08  0.08  0.08  0.08  0.08  0.08  0.08  0.12
           θu   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.3
           σ    0.4   4.0   3.9   3.8   4.1   3.7   3.5   3.6   4.2   1.8

    Avg.   θl   0.03  0.07  0.06  0.06  0.04  0.05  0.06  0.04  0.08  0.07
           θu   0.08  0.18  0.15  0.15  0.10  0.13  0.15  0.1   0.2   0.18
           σ    2.48  2.1   2.18  2.15  2.73  2.68  2.18  3.0   1.95  1.58
Table 2.2 lists the percentage and the absolute number of falsely detected edge pixels
for the ten best sets of parameters. The average, x̄10, of falsely detected edge pixels for the
ten best parameter sets is of the order of a few percent, and significantly lower when we only
consider the best five results, x̄5. The best result is obtained for image 101 (Electric Iron)
with an x̄10 of about 0.5%. Moreover, the table shows that there is no correlation between
the absolute number of ground truth edge pixels and the false detections. This suggests that
not only the number of edge pixels but also the complexity, i.e. the semantic content of an
image, heavily influences an edge detector's performance. Table 2.3 lists the corresponding
ten parameter sets.
The best results are obtained for similar parameters. Thus, it is possible to define a
range of good parameters. Since the best parameter sets have a small variance we compute
the average of the five best results for all images. Note that we only consider the best five
sets in order to ensure a compact range for the three parameters.
    Θl = (1/N) Σ_{i=1}^{N=5} θ_i^l,    Θu = (1/N) Σ_{i=1}^{N=5} θ_i^u,    Σ = (1/N) Σ_{i=1}^{N=5} σ_i,    (2.8)

with θ_i^l, θ_i^u and σ_i being the parameter values of the i-th best set, averaged over all
images (see the last rows of Table 2.3). Finally we arrive at the best set
of parameters for the Canny detector: Θl = 0.052, Θu = 0.132 and Σ = 2.33. In
fact, we suggest using the following ranges: 0.045 ≤ Θl ≤ 0.065, 0.13 ≤ Θu ≤ 0.155
and 2.2 ≤ Σ ≤ 2.4.
2.3.1 Summary
In this section we have shown that the Canny edge detector produces high quality edge
maps if the proper parameters can be determined. For that purpose we have presented a
method that automatically identifies the best set of parameters for real world image scenes.
By evaluating 550 different combinations of parameters per image we could designate the
best set through comparison with a manually created ground truth. The results show
that the best sets of parameters only produce an error (between an edge map and the ground
truth image) of 0.1 to 1.3%. The visual appearance of the resulting edge maps differs
little from the ground truth, which confirms the accuracy of the parameter
determination. Finally, we have given a range for the three Canny parameters that should
lead to high-quality edge maps.
As a future work, it would be of interest to investigate how parameters may be affected
for the detection of salient edge points, i.e. to distinguish between important and less
important edge pixels.
2.4 Line-Segment Extraction
Line segment detection is a long standing problem in computer vision with numerous
applications. Although there have been advancements during the past decades [42] the
problem remains unsolved in general. Line segments can be understood as primitive geo-
metric objects that may be used for high level feature computations. Applications can be
found in various domains such as object detection/recognition [13], [17], stereo imaging,
face recognition [36], content-based image retrieval [83], segmentation, object contour en-
coding, image descriptors, object tracking, sketch-, logo-, and watermark-matching. For
imagery applications such as the localization of buildings [85] or road networks it is of
essential importance to accurately detect line segments, too. In [106] an automated road
segment extraction system is presented, where intrinsic properties of roads are incorporated.
The method assumes each road segment to be part of a sequence of connected
road objects. These road objects must obey a set of rules, which reflect empirical knowledge
of the appearance of road networks in aerial images. Subsequently, the road objects
are grouped into road segments. The algorithm was tested with two high-resolution images,
where about 80% of the road network was detected.
In [137] a method named sequential fuzzy cluster extraction is presented. The algorithm
extracts single clusters by maximizing the sum of all similarities among members. Thus,
weak collinear line segments are clustered as one single segment. The approach was applied
to a hand-drawn sketch image in order to obtain a more compact representation. Though
the results are interesting, more comprehensive experiments (e.g. with more sketch images)
are missing.
In [99] a method is presented in order to detect and estimate straight patterns in videos.
The method involves the Radon transform and a whitening filter as a likelihood processor
which in turn determines a statistic for a global signal detection. The authors applied
the proposed algorithm to videos of vehicles obtained from highway gates, where they
estimated the motion. Thus, this approach detects straight patterns in a spatio-temporal
context rather then in single images.
The authors in [76] applied line detection for scene modelling by focusing on horizontal
and vertical lines. In detail, the method is block-based, where for every block a local
accumulator is created which is checked for the presence of line segments. In order to
apply the method five parameters have to be adjusted, where the authors do not fully
describe how they are determined. Although the authors claim a lower computational
time then for the Hough transform, no run-time comparison is presented. The method
is applied to few test images but the analysis lacks of a comprehensive evaluation and
parameter discussion.
In the sequel we will briefly review the standard Hough transform and present an edge
tracking algorithm for the task of straight line segment extraction.
Figure 2.5: The left panel shows a straight line in Cartesian space; r and θ are plotted to show the parameterization of the line. The right panel shows the Hough transformation of the single line from the left. Note that the pattern originates from the quantization of the Hough space.
2.5 Hough Transformation
The Hough transformation [52] is an image analysis method which is commonly used
to detect straight lines, circles, polynomials, or other parametric curves in images. The Hough transform
for detecting straight lines in an image is briefly reviewed in the following. A line
segment is parameterized by two variables r and θ as follows:
r = xi cos(θ) + yi sin(θ), (2.9)
with i = 1, 2, ..., N, where N is the number of collinear points, r is the perpendicular
distance of the line from the origin of the coordinate system, and θ is the angle of r with respect
to the X-axis. The Hough transform maps collinear points from the Cartesian
space onto sinusoidal curves in the Hough space. In detail, each collinear point (xi, yi) is
transformed into a sinusoidal curve (parameterized according to Eq. 2.9) in the Hough
space, where all curves intersect in the point (r, θ). In the Hough space the intersections
can be visualized easily, as shown in Figure 2.5. Usually, the Hough space is quantized
into accumulator cells (finite intervals), where each point (xi, yi) votes in the (r, θ) space
for all possible lines passing through it. In other words, the more votes a position in the
accumulator matrix obtains, the more feature points lie exactly on this line.
However, in order to extract straight line segments in the original image, the (r, θ) intersection
peaks have to be extracted. A common way is to locate local maxima by introducing
a threshold in the accumulator array. The final step is the so-called de-Houghing, that is,
the mapping from the Hough space back into the Cartesian space. Thus, we finally obtain
a set of points (xi, yi) describing straight line segments in the original image.
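As an illustration of the voting scheme described above, the following is a minimal Python sketch of the accumulator construction of Eq. 2.9. It is not the Matlab implementation used for the experiments; the quantization parameters n_theta and r_res are illustrative choices.

```python
import math
from collections import Counter

def hough_lines(points, n_theta=180, r_res=1.0):
    """Accumulate votes in the quantized (r, theta) space of Eq. (2.9).

    Every point votes once per discretized theta; collinear points
    share a common accumulator cell, which shows up as a vote peak.
    """
    acc = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta_deg = t - 90                      # theta in [-90, 90)
            theta = math.radians(theta_deg)
            r = x * math.cos(theta) + y * math.sin(theta)
            acc[(round(r / r_res), theta_deg)] += 1  # quantized cell vote
    return acc

# 20 collinear points on the line y = x; its normal form is r = 0, theta = -45
pts = [(i, i) for i in range(20)]
(r_bin, theta_deg), votes = max(hough_lines(pts).items(), key=lambda kv: kv[1])
```

Note that with this coarse quantization the peak spreads over a few neighboring θ cells, which is exactly the binning effect discussed in the evaluation below.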
2.6 Edge Point Tracking
In order to extract straight line segments we have adopted the algorithms of [105] and
[69]. In the first step we create an edge map with the Canny detector according to the
findings of Chapter 2. Next, the actual algorithm begins with an initial scan through the
binary edge map, where the neighborhood of every edge pixel is investigated. We have
decided on a binary representation for reasons of complexity; intensity edge maps would
strongly increase the computational time per image. Let p_i ∈ {0, 1} be a binary edge
pixel, with i = 1, 2, ..., N, where N is the number of image pixels. Each detected
edge pixel p_j = 1, with j = 1, 2, ..., N_e, where N_e denotes the number of edge pixels,
serves as a starting point for tracking neighbor pixels that are eight-connected. The
tracking proceeds as long as further neighbor pixels can be detected. Then the procedure
starts again with the same initial pixel p_j, but in the opposite direction. Thus, we obtain
a list of edge pixels list_j, where all members are assigned a label. This iterative tracking
is performed for all edge map pixels p_j, resulting in j lists. Subsequently, for every list_j
the beginning pixel p_j^b and end pixel p_j^e are determined in order to form the line segment
l_j. Next, a second parameter is introduced which controls the maximum
allowed line tolerance, i.e. how far pixels may deviate from the line segment l_j (the line connecting the
first and the last position in the edge pixel list list_j). This step is repeated until the
actual distance deviation falls below the tolerance threshold. Once all edge pixel lists are
tested against the tolerance parameter, we obtain a set of line segments L = {l_j | j = 1, 2, ..., N_s},
where N_s denotes the number of segments, with l_j = [p_j^b, p_j^e], and p_j^b, p_j^e the start and end points of a segment. Occasionally it
happens that some segments are falsely tracked. In order to compensate for this effect we
perform a final procedure that (re-)connects broken segments.
Basically, this procedure performs quite similarly to the initial pixel tracking algorithm.
We take the list of line segments L and check for segments that lie within
a specified distance and angle tolerance. Thus, two segments l_k and l_n must simultaneously
fulfill the following constraints:
θ(l_k, l_n) ≤ T_ang, (2.10)
dist(l_k, l_n) ≤ T_dist, (2.11)
where θ(l_k, l_n) is the angle between the two segments and dist(l_k, l_n) is the Euclidean
distance between the two lines. T_ang denotes the angle tolerance and T_dist the distance threshold.
If both conditions are met at the same time, the two segments are merged into one.
Practical experience shows that angle deviations between two segments of less than
five degrees are meaningful. For the distance tolerance a few pixels (one to eight depending
on an image’s resolution and size) are sufficient in order to obtain good results.
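The merging test of Eqs. (2.10)-(2.11) can be sketched as follows. This is an illustrative Python fragment, not the exact implementation: the closest-endpoint distance is one plausible choice for dist(l_k, l_n), and the default tolerances (five degrees, eight pixels) are taken from the values reported above.

```python
import math

def seg_angle(seg):
    """Undirected orientation of a segment in degrees, in [0, 180)."""
    (x1, y1), (x2, y2) = seg
    return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0

def angle_between(s1, s2):
    """Angle theta(l_k, l_n) between two segments (Eq. 2.10)."""
    d = abs(seg_angle(s1) - seg_angle(s2))
    return min(d, 180.0 - d)

def distance_between(s1, s2):
    """Closest-endpoint distance dist(l_k, l_n) (Eq. 2.11)."""
    return min(math.dist(p, q) for p in s1 for q in s2)

def should_merge(s1, s2, t_ang=5.0, t_dist=8.0):
    """Merge two segments only if both tolerances hold simultaneously."""
    return angle_between(s1, s2) <= t_ang and distance_between(s1, s2) <= t_dist
```

For example, two nearly collinear segments with a two-pixel gap pass both tests, while a perpendicular or a distant segment fails one of them.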
Evaluation: Next, we apply the Hough-transform and the edge point tracking
algorithm to a real-world image. In Figure 2.6 four images are displayed on which we have
overplotted the obtained line segments. The left images show segments obtained with
the Hough-transform and the lines in the right images were detected with the edge point
tracking algorithm. We can see that for the building image the Hough-transform performs
poorly in comparison to the edge tracking algorithm. The Hough-transform3 recognized
only about one third of the segments detected by the edge point tracking method. In fact,
it is known that the Hough-transform can produce misleading or false results in the case
of objects that happen to be aligned. The Hough-transform result for the second image
looks at first sight more promising. A more detailed look reveals that some of the most
salient contour lines of the puma (back and tail) are not detected.
The poor results of the Hough-transform may be due to quantization issues. It is common
to describe the mapping from the Cartesian to the Hough space as a two-dimensional
histogram, where the number of bins turns out to be a crucial factor. Too coarse a resolution,
i.e. too few bins, will most likely map spatially close and almost parallel lines to
the same bin. On the other hand, a very fine resolution will map collinear
points into separate bins rather than into one accumulation point, as theoretically assumed.
In the literature one can find various enhanced Hough-transformation algorithms that
might work better than the standard method, such as for example [35] or [96].
However, our observations clearly show the superior performance of the edge tracking
algorithm over the Hough-transform. Therefore, it is our preferred choice for the extraction
of straight line segments. In the next chapter we will use the edge point tracking in order
to extract straight line segments that will serve as input for the hierarchical clustering
algorithm.
3The presented results were obtained with the Matlab implementation of the Hough-transform with default parameters.
Figure 2.6: Comparison of line segments obtained from the Hough-transform (left panels) and the edge point tracking algorithm (right panels).
Chapter 3
Clustering
Cluster analysis is a method of multivariate statistics to reveal homogeneous groups of
objects or training patterns. Since no a priori knowledge, such as labels of data patterns, is
available, clustering techniques belong to the unsupervised classification methods. One could
roughly describe clustering as the process of organizing patterns into meaningful entities
whose members are somewhat similar to each other. Clustering algorithms try to discover
similarities and differences among data patterns and to derive meaningful conclusions about
them. Indeed, the paradigm of clustering is met in many, if not most, scientific disciplines
where any kind of data patterns occur. The most important applications reach from
medical sciences, social sciences, physical sciences, and life sciences (biology, zoology) to engineering
and computer sciences. It would go beyond the scope of this thesis to describe all
clustering techniques in detail. The interested reader is referred to the following literature
on clustering techniques: [5], [147], [63] and [60].
In this work we use hierarchical clustering methods, which are described in this chapter.
In order to place hierarchical clustering within the zoo of clustering techniques, note that the
major methods are sometimes divided into sequential algorithms, hierarchical algorithms,
and algorithms based on cost function optimization [136]. From the algorithmic
point of view we can distinguish the following methods1:
• Deterministic clustering methods: Each data object is assigned to exactly one cluster,
such that the clustering defines a partition of the data.
• Function-based clustering methods: These decide the data class membership
by an assignment function that has to be optimized. Thus, function-based
clustering methods are optimization problems.
• Possibilistic clustering methods: Each data object is assigned a possibility value
that determines whether the object belongs to a certain cluster; the sum of all possibilities
for one data object belonging to a cluster does not have to equal one. Possibilistic
clustering methods are often referred to as pure fuzzy clustering.
1Note that this taxonomy of clustering algorithms is methodological and is not the only possible one. Moreover, [31] and [63] give a more detailed overview of possible taxonomies of clustering methods.
• Probabilistic clustering methods: For each data object a probability distribution is
determined, which defines the probability of the object belonging to a certain
cluster.
• Hierarchical clustering methods: The data to cluster are subdivided in several steps
into finer and finer, or coarser and coarser, groups. The merging of groups is
done by so-called linkage.
3.1 Hierarchical Clustering
Hierarchical clustering algorithms represent a given data set as a sequence of partitions,
where each partition is nested into the next higher partition in the hierarchy. Formally,
there are two types of hierarchical algorithms, namely agglomerative and divisive. The
agglomerative algorithm is a bottom-up technique, where initially all data patterns belong
to disjoint clusters, which are subsequently merged into larger and larger clusters until only
one cluster remains. The divisive algorithm, on the other hand, is a top-down technique
that starts with all data patterns in one cluster, which is successively subdivided into
smaller clusters. Both algorithms end up with a hierarchical representation of the data
- the so-called dendrogram (tree structure), where each step in the clustering process is
illustrated by a join of the tree. Figure 3.1 shows a flow chart of the hierarchical clustering
algorithm.
For our problem, the grouping of line segments, we only consider the agglomerative
hierarchical clustering algorithm (AHC). The reason is the higher complexity of
the divisive method, which makes it practically unusable for the present application. In
addition, AHC is our preferred choice over other clustering techniques, such as the Linde-Buzo-
Gray (LBG) or k-means algorithms, where the number of clusters has to be known at the
outset.
Usually, the number of clusters is unknown in advance and may additionally vary for
each image. Hence, we have opted for hierarchical clustering. Moreover, AHC avoids
the cluster-center initialization problem, which has a crucial impact on the performance of
the k-means algorithm.
Figure 3.1: Flow chart of the agglomerative hierarchical clustering algorithm.

Clustering or grouping of line segments is a long-standing problem in computer vision
and was already studied in [40], where the authors used the segment lengths for the clustering
process. A more recent work was presented in [62], where the grouping is done
according to ratios of segment lengths and their distances. In an iterative process a hierarchy
of line segment clusters is produced. Unfortunately, the authors have applied their
method to just one image. A different approach is presented in [108]: the line segments are
detected by a Hough-transform in subwindows. Subsequently, the segments are merged
across different windows, where the process is highly sensitive to the window size. The
authors in [117] performed a grouping of line segments based on Eigenclustering. More
related work is mentioned in Section 1.3.
In detail, agglomerative hierarchical clustering is a special clustering technique that
groups data over a variety of scales. The clusters are represented in a multi-level hierarchy,
or dendrogram, where clusters at one level are joined to form clusters at the next higher
level. AHC produces a number of partitions {C_n}, n = 1, ..., N, with N denoting the number of
clusters. AHC initially consists of N clusters, each filled with a single element of the input
vector X = {x_n; n = 1, ..., N}. At each of the N − 1 subsequent clustering steps, the two
clusters that are closest to each other are joined together.
The following algorithm summarizes the steps of the general agglomerative hierarchical
clustering schema (sometimes called generalized agglomerative scheme (GAS)).
Algorithm: Agglomerative Hierarchical Clustering
1. p = 0 (partition level)
2. Initialize N singleton clusters:
3. Select the initial clustering: T_{p=0} = {C_n = {x_n}, n = 1, 2, ..., N}
4. p = p + 1
5. R = N_p(N_p − 1)/2 (all possible pairs of clusters for the current clustering T_p)
6. for all R pairs:
7.     find the pair (C_i, C_j) with D(C_i, C_j) = min_{1 ≤ k,l ≤ R} D(C_k, C_l)
8. Create a new cluster from C_i and C_j:
9.     C_s = C_i ∪ C_j
10. Update the clustering T_p
11. Repeat steps 4 to 10 until all vectors lie in a single cluster
It is easy to notice that agglomerative hierarchical clustering forms a nested hierarchy
T_{p=0} ⊂ T_1 ⊂ ... ⊂ T_{N−1}, where p denotes the partition level of each clustering step.
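The generalized agglomerative scheme can be sketched in a few lines of Python. This is an illustrative single-linkage version on scalar data, not the implementation used in the experiments; the O(N^3) pair search is kept naive for clarity.

```python
def ahc(points, dist=lambda a, b: abs(a - b)):
    """Generalized agglomerative scheme: start from N singleton clusters and
    repeatedly merge the closest pair (single linkage) until one remains."""
    clusters = [[p] for p in points]
    merges = []                                   # (members, merge distance)
    while len(clusters) > 1:
        best = None                               # (distance, i, j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((sorted(merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# two well-separated groups on the real line
merges = ahc([0.0, 1.0, 10.0, 11.0])
```

The recorded merge distances are exactly the proximity levels that later define the cophenetic distances of Section 3.2.1.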
The next section describes how we actually compute the distance between
cluster pairs in order to form the hierarchy.
3.2 Linkage
The formation of a cluster hierarchy can be seen as the iterative application of a dissimilarity
function to the set of possible pairs of clusters of a data matrix X. The step of merging
clusters into larger and larger clusters until only one cluster remains is called linkage. More
specifically, linkage is the criterion by which the clustering algorithm determines the actual
distance between two clusters by defining single points that are associated with the
clusters in question. Before we define several linkage methods it is necessary to introduce
the dissimilarity matrix in order to compute distances between clusters.
Dissimilarity Matrix It is essential for hierarchical cluster analysis to measure the
closeness of two objects or a pair of clusters, which is done via the so-called dissimilarity
matrix (or proximity matrix). It is a symmetric N × N matrix D with elements d_ij.
Each element represents a measure of distinction between the i-th and j-th object or vector.
The matrix has zero diagonal values, stating that the self-distance is zero. We can define
a dissimilarity measure d for a data set X as a function d : X × X → R+, where R+ denotes
the set of non-negative real numbers. Note that d has the same properties as the cophenetic matrix
(see Equation 3.10).
Linkage Methods Subsequently, we define the linkage methods used in our experiments:
single, complete, average, centroid, median, and Ward linkage.
Single linkage defines the distance between any two clusters as the minimum distance
between them, i.e. the distance between the two closest points:
d(k, l) = min_{i,j} dist(x_ki, y_lj), i ∈ {1, ..., n_i}, j ∈ {1, ..., n_j}, (3.1)
where n_i and n_j are the number of objects in clusters k and l, respectively, x_ki denotes the
i-th object in cluster k, and y_lj the j-th object in cluster l. This method tends to produce
elongated clusters, which is known as the chaining effect.
Complete linkage is the opposite of single linkage in that it defines the distance between
any two clusters as the maximum distance between them:
d(k, l) = max_{i,j} dist(x_ki, y_lj), i ∈ {1, ..., n_i}, j ∈ {1, ..., n_j}, (3.2)
where n_i, n_j, x_ki and y_lj have the same meaning as in the single linkage method. In
comparison to single linkage, the complete method tends to form tightly bound clusters
[151].
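As a small illustration of Eqs. (3.1) and (3.2), assuming clusters are given as lists of 2-D points:

```python
import math

def single_link(A, B):
    """Single linkage d(k, l) of Eq. (3.1): the two closest members."""
    return min(math.dist(a, b) for a in A for b in B)

def complete_link(A, B):
    """Complete linkage d(k, l) of Eq. (3.2): the two farthest members."""
    return max(math.dist(a, b) for a in A for b in B)

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
```

By construction the single-linkage distance never exceeds the complete-linkage distance for the same pair of clusters.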
Average linkage, also known as UPGMA (Unweighted Pair-Group Method using Arithmetic
averages), takes the mean distance between all possible pairs of entities of the two
clusters in question. Therefore, it is computationally more expensive than the previous
methods:
d(k, l) = (1/(n_k n_l)) ∑_{i=1}^{n_k} ∑_{j=1}^{n_l} dist(x_ki, y_lj). (3.3)
Figure 3.2: Various dendrograms obtained from six different linkage methods.

Centroid linkage, sometimes called Unweighted Pair-Group Method using Centroids
(UPGMC) takes distances between the centroids of two groups.
d(k, l) = dist(x̄_k, x̄_l), with (3.4)
x̄_k = (1/n_k) ∑_{i=1}^{n_k} x_ki, (3.5)
x̄_l = (1/n_l) ∑_{j=1}^{n_l} x_lj. (3.6)
Ward’s linkage [145] attempts to minimize the sum of squares of the two clusters that are
formed at each step and usually tends to create clusters of small size:
d(k, l) = n_k n_l d_c(k, l)² / (n_k + n_l), (3.7)
where d_c(k, l) is the centroid distance between clusters k and l, as defined in the centroid
linkage.
The last linkage method we consider is the so-called median linkage, which uses the
distance between weighted centroids of two clusters:
d(k, l) = dist(x̃_k, x̃_l), with (3.8)
x̃_k = (x̃_p + x̃_q)/2, (3.9)
where x̃_k was formed by combining clusters p and q; x̃_l is defined similarly.
In Figure 3.2 we can see the six linkage methods applied to a hierarchical clustering
problem, i.e. the clustering of a set of straight line segments obtained from a color image.
The different dendrograms show the final hierarchy; for the case of single linkage the
chaining effect can be easily observed.
In the sequel, we present measures for the validation of the obtained clustering.
First, we introduce the cophenetic matrix, which is a tool for cluster validation.
Second, we evaluate the clustering result for every linkage method in order to
determine the proper one. Third, we discuss how to cut or prune a dendrogram in
order to obtain the final clustering, and we introduce a pruning method that is based
on subgraph distance ratios.
3.2.1 The Cophenetic Matrix
An important quantity of hierarchical clustering algorithms is the cophenetic matrix. Based
on a clustering hierarchy diagram we can define the cophenetic matrix, which consists of a
set of cophenetic distances d_c defined as follows. First, assume that T_{p_ij} represents
the clustering in which x_i and x_j are merged into the same cluster for the first time. Further,
L_{p_ij} is the proximity level at which the clustering T_{p_ij} has been formed. Then the distance d_c
can be written as
d_c(x_i, x_j) = L_{p_ij}. (3.10)
The cophenetic matrix is subsequently defined as:
D_c = [d_c(x_i, x_j)], (3.11)
where i, j = 1, ..., N, with N being the number of elements of X. The cophenetic
matrix fulfills the properties of a metric, such that the following conditions are met:
Dc(i, j) ≥ 0, i ≠ j Non-negativity (3.12)
Dc(i, j) = 0, i = j Zero selfdistance (3.13)
Dc(i, j) = Dc(j, i) Symmetry (3.14)
Dc(i, j) ≤ Dc(i, k) + Dc(j, k), ∀ i, j, k Triangle inequality. (3.15)
The first condition clearly holds since no negative distances between clusters are
possible. The self-distance must be zero, since an element is found in the same cluster
with itself at the zero-level clustering. The symmetry condition obviously holds, too. In
fact, even the ultrametric inequality holds, which is a stronger condition than the triangle
inequality. The ultrametric inequality states in this case that for every triplet of distances,
the two largest of the three are equal [46].
Dc(i, j) ≤ max[Dc(i, k),Dc(j, k)] ∀ i, j, k. (3.16)
The cophenetic matrix Dc(i, j) is a special case of a dissimilarity matrix [136].
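To make Eq. (3.10) concrete, here is a small Python sketch that reads cophenetic distances off a merge history (cluster members together with the proximity level of each merge) and checks the ultrametric inequality (3.16). The merge history below is an invented toy example, not data from the experiments.

```python
# toy merge history: (members of the cluster formed, proximity level)
merges = [({0, 1}, 1.0), ({2, 3}, 1.0), ({0, 1, 2, 3}, 9.0)]

def cophenetic(i, j, merges):
    """d_c(x_i, x_j): proximity level of the first clustering in which
    x_i and x_j appear in the same cluster (Eq. 3.10)."""
    if i == j:
        return 0.0                      # zero self-distance (Eq. 3.13)
    for members, level in merges:       # merges are in order of formation
        if i in members and j in members:
            return level
    raise ValueError("objects never merged")

def ultrametric(objs, merges):
    """Check Eq. (3.16): d_c(i,j) <= max(d_c(i,k), d_c(j,k)) for all triples."""
    return all(
        cophenetic(i, j, merges) <= max(cophenetic(i, k, merges),
                                        cophenetic(j, k, merges))
        for i in objs for j in objs for k in objs)
```

Here objects 0 and 1 first meet at level 1.0, while 0 and 2 only meet in the final merge at level 9.0, and every triple satisfies the ultrametric inequality.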
3.3 Cluster Validity
Cluster validity quantitatively evaluates the results of a clustering algorithm. In the literature
several methods have been proposed, which are reviewed in [45] and [39].
Most clustering algorithms impose a kind of cluster structure on a data set X that is a
priori not evident. Moreover, different clustering techniques will create different clusters,
which might be correct or not. Thus, before we think of a quantity to measure the quality
of the obtained clustering, we need to answer a question that naturally arises: "Is there a
natural structure in our data at all?"
3.3.1 Cluster Tendency
Methods for identifying the presence of a clustering structure are known as clustering tendency
methods. The basic approach is to test the points of a dataset for randomness. Various tests for
randomness in datasets have been suggested in the literature, such as [7], where a
Poisson model is proposed in which objects are represented as uniformly distributed points in a region
R of the l-dimensional data space. In [104] the authors compared the classical Hopkins
test [51] to an approach based on fractal dimension theory (FDT) and fuzzy approximate
reasoning (FAR) to analyze the clustering tendency. FDT investigates the real dimension
of a dataset and FAR is a procedure that deduces (imprecise) conclusions from fuzzy rules.
Their results showed that the Hopkins test is able to robustly determine the clustering tendency
in an unknown dataset. Hence, due to its thoroughly investigated
performance we chose the Hopkins test to determine the cluster tendency of
our data.
Hopkins Test In the following we briefly review the Hopkins test, which is based
on nearest neighbor distances, that is, the distances between randomly sampled points
and points from the actual distribution. In detail, let X = {x_i, i = 1, ..., N}, where N
is the number of elements of X. Further, let X_s ⊂ X be a set of M vectors randomly selected
from X, X_s = {x_i, i = 1, ..., M}, with M ≈ N/10. Also, let X_r = {y_i, i = 1, ..., M}
be a set of vectors randomly distributed according to the uniform distribution. Now we
define d_j as the distance from y_j ∈ X_r to its nearest vector in X_s. Moreover, let d'_j
be the distance from x_j ∈ X_s to its nearest vector in X_s − {x_j}. Now we can write down the
complete Hopkins statistic with the l-th powers of d_j and d'_j, according to [59] and [136],
as:
h = ∑_{j=1}^{M} d_j^l / ( ∑_{j=1}^{M} d_j^l + ∑_{j=1}^{M} d'_j^l ), (3.17)
with 0 ≤ h ≤ 1. In words, the Hopkins statistic for clustering tendency examines
whether objects in a dataset differ significantly from the assumption that they are
uniformly distributed in the multidimensional space. The statistical test compares the distances
between the real data and their nearest neighbors. In case the data are uniformly
distributed, d_j and d'_j will be similar, resulting in a Hopkins statistic of h = 0.5. On the other
hand, if clusters are present, the distances for the artificial random data will be larger than
for the real ones, because the random objects are evenly distributed while the real ones are
grouped together. In this case the Hopkins statistic will take values larger than 0.5.
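A minimal Python sketch of the statistic for 2-D points (Eq. 3.17 with l = 1) may look as follows. The bounding-box sampling and the sample size m are illustrative choices, not the exact setup used for the Caltech experiments.

```python
import math
import random

def hopkins(X, m, rng=random.Random(0)):
    """Hopkins statistic h (Eq. 3.17) for a list of 2-D points, with l = 1.

    d_j:  distance from a uniform random point in the bounding box of X
          to its nearest point in X;
    d'_j: distance from a sampled data point to its nearest other data point.
    """
    xs, ys = [p[0] for p in X], [p[1] for p in X]
    d_sum = dprime_sum = 0.0
    for _ in range(m):
        y = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        d_sum += min(math.dist(y, x) for x in X)
        xj = rng.choice(X)
        dprime_sum += min(math.dist(xj, x) for x in X if x != xj)
    return d_sum / (d_sum + dprime_sum)

# two tight, well-separated clusters -> h should clearly exceed 0.5
data = [(0, 0), (0.1, 0), (0, 0.1), (10, 10), (10.1, 10), (10, 10.1)]
h = hopkins(data, m=6)
```

For uniformly spread data the same computation would yield values near 0.5, as discussed above.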
In our experiment we test whether we can reject the null hypothesis of a randomly
or regularly distributed feature space. In fact, randomness of a data distribution would
indicate that an alternative method should be used for the data analysis. Therefore, we
create a random data set according to the uniform distribution, which is tested against
our data. The actual data to be tested are the coordinates of all line segments of each
image from the Caltech database (see Section 5.1.3). Thus, for every database image
we evaluate the Hopkins test, resulting in values h_i, with i = 1, 2, ..., 2662. Finally, we
average over all h_i, ending up with a Hopkins statistic of h = 0.81, which indicates a
strong clustering structure, such that we can reject the null (randomness) hypothesis.
Hence, we can safely assume a clustering tendency in our data. A strong verification
of our conclusion is given in the work of [73], where the authors introduced a modified
Hopkins statistic in order to measure how strongly the data are clustered. Their findings
clearly showed that a Hopkins value of h = 0.75 or higher is evidence for a clustering
tendency at the 90% confidence level.
Now that we have established the adequacy of clustering techniques for our data, we
can proceed with the description of our clustering strategy. Next, we present the
hierarchical clustering method we have successfully applied to our data. Subsequently, we
motivate the clustering method used, as well as the choice of the best suited
linkage algorithm and the final dendrogram cutting and pruning technique, respectively.
Again, we will provide a proof of the final clustering quality via the cophenetic correlation
coefficient introduced in Equation 3.18.
3.3.2 Clustering Validation
An important issue is the validation of a clustering result, which is, for hierarchical clustering
algorithms, represented as a dendrogram. In Section 3.2 we have shown how to
represent a hierarchical clustering structure by a cophenetic matrix. Now we can define
a coefficient that measures the degree of similarity between the cophenetic matrix D_c and
the dissimilarity matrix D obtained from the dataset X. As stated in Paragraph 3.2, the
cophenetic and dissimilarity matrices are symmetric with zero diagonals. Hence,
we only need to take the upper triangle of each matrix into consideration, with
O ≡ N(N − 1)/2 elements. Then the cophenetic correlation coefficient (CCC) is defined
as follows:
CCC = [ (1/O) ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} d^c_{ij} d_{ij} − μ_{D_c} μ_D ] / { [ (1/O) ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} (d^c_{ij})² − μ²_{D_c} ] [ (1/O) ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} d²_{ij} − μ²_D ] }^{1/2}, (3.18)
where μ_{D_c} and μ_D are the respective means, μ_{D_c} = (1/O) ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} d^c_{ij} and μ_D = (1/O) ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} d_{ij}. The values of the CCC are in the range of [−1, 1], where values closer to
1 indicate a better agreement between the cophenetic and the proximity matrix. Hence,
the CCC is a measure of how accurately the hierarchical tree represents the dissimilarities
of the original input data.
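Under the assumption that D and D_c are given as full symmetric matrices (nested lists), Eq. (3.18) can be sketched as:

```python
import math

def ccc(D, Dc):
    """Cophenetic correlation coefficient of Eq. (3.18): Pearson correlation
    between the upper triangles of the dissimilarity matrix D and the
    cophenetic matrix Dc."""
    n = len(D)
    d  = [D[i][j]  for i in range(n) for j in range(i + 1, n)]
    dc = [Dc[i][j] for i in range(n) for j in range(i + 1, n)]
    o = len(d)                                    # O = N(N - 1)/2
    mu_d, mu_dc = sum(d) / o, sum(dc) / o
    num = sum(a * b for a, b in zip(dc, d)) / o - mu_dc * mu_d
    den = math.sqrt((sum(a * a for a in dc) / o - mu_dc ** 2)
                    * (sum(b * b for b in d) / o - mu_d ** 2))
    return num / den

# toy matrices in perfect (linear) agreement -> CCC = 1
D  = [[0, 1, 4], [1, 0, 5], [4, 5, 0]]
Dc = [[0, 2, 8], [2, 0, 10], [8, 10, 0]]
```

Since Dc here is just a rescaling of D, the coefficient reaches its maximum of one; a hierarchy that distorts the original dissimilarities would yield a smaller value.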
Table 3.1: Averaged cophenetic correlation coefficients for six linkage methods.

Data pattern   Linkage Method   Average CCC
Angles         Single           0.7350
               Centroid         0.7808
               Ward             0.7415
               Average          0.7808
               Complete         0.7228
               Median           0.8035
Lengths        Single           0.8632
               Centroid         0.9262
               Ward             0.7208
               Average          0.9262
               Complete         0.8907
               Median           0.8920
In order to verify the quality of our clustering we check to which extent it fits the actual
data. Thus, we compute the cophenetic correlation coefficient for the clustering results
obtained with the AHC using the six linkage algorithms defined in Section 3.2.
For this experiment we have taken six different clusterings obtained for every image of
the Caltech database presented in Section 5.1.3. In detail, we take an image_i, with i =
1, 2, ..., 2662, and extract a set of line segments that is the ground truth data of every
image. Then, we compute the Euclidean distance matrix (EDM, see Chapter 4.1), where
we use the angles between any segment and the ordinate2. In addition, we compute an EDM
from the relative segment lengths. The relative length of each segment is computed as the
fraction of the longest possible segment of an image, its diagonal, according
to Equation 4.9. Once the EDMs are computed, the AHC can construct all dendrograms
using the different linkage methods (see Section 3.2).
In Table 3.1 the cophenetic correlation coefficient is listed for the various linkage
methods3. We can see that most of the CCC values are close to one, indicating a high quality
clustering. The average linkage method gives the best result for the length data and the
median linkage the best for the angle data; the centroid method is of similar quality.
The single linkage gives rather poor results in comparison to the others. In [9] the author
confirms a similar observation. Now that we have determined the quality of the various
linkage methods, it is time to think about obtaining the final clusters. That is done by
cutting or pruning the hierarchy.
2Note that during the actual EDM computation only differences are considered. That makes the result invariant against similarity transformations.
3Note that the CCC is averaged over all images from the Caltech database.
3.3.3 Cutting a Dendrogram
The result of hierarchical clustering is presented in the form of a dendrogram, as
shown in Figure 3.2. In order to determine the final clusters we have to cut or partition
the hierarchy. Two groups of approaches exist. The first is the cutting of the
dendrogram at a given height, that is, at a given distance between the nodes in the graph [31]. The
second is to prune the dendrogram by a manual or automatic selection of clusters at
various distances, as applied in [90] or [132], where a nearest neighbor purity estimator
was introduced in order to determine the pruning point. The authors in [90] concluded
that their proposed method tends to create small clusters. In our experiments we explored
a different approach: we propose an automatic partitioning algorithm that relies on the
distances between nodes taken from a dendrogram. Note that a dendrogram is a binary
rooted tree, in which every node either has two children (child nodes to the left and to the
right) or is a leaf.
We define the subsequent node distance SD^{node}_{i,j}, with i = 1, 2, ..., N_{node} and j = i+1_l, i+1_r, where N_{node} is the number of nodes in the dendrogram and i_l, i_r denote the subsequent (child) nodes to the left and to the right, respectively. Specifically, SD^{node}_{i,j} is the dendrogram's ordinate value between any two subsequent nodes within a subtree. Then, we define a distance ratio between nodes in any subtree as:

DR^{sub}_m = \frac{\min_{j = i+1_l,\, i+1_r} SD^{node}_{i,j}}{\max_{k = i+2_l,\, i+2_r} SD^{node}_{j,k}},  (3.19)

where m = 1, 2, ..., M − 1, with M as the number of subtree nodes. The dendrogram is pruned wherever DR^{sub}_m ≤ 1/4, a threshold that turned out to work best with respect to well separated clusters. An example of a dendrogram pruning result is shown in Figure 3.3.
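The pruning rule can be sketched in a few lines. The snippet below is a simplified illustration, not the thesis implementation: the dendrogram is represented as a nested tuple (height, left, right) with string leaves, and a subtree is cut off as one cluster when the ratio of its merge height to its parent's merge height drops to 1/4 or below, a rough analogue of Equation 3.19. All names are illustrative.

```python
def leaves(node):
    """Collect the leaf labels of a (height, left, right) subtree."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def prune(node, threshold=0.25):
    """Cut the dendrogram wherever the ratio between a child's merge
    height and its parent's merge height is <= threshold (a simplified
    analogue of the DR <= 1/4 rule)."""
    if isinstance(node, str):
        return [[node]]
    height, left, right = node
    clusters = []
    for child in (left, right):
        child_height = 0.0 if isinstance(child, str) else child[0]
        if child_height / height <= threshold:
            # Large jump in merge distance: keep the subtree as one cluster.
            clusters.append(leaves(child))
        else:
            clusters.extend(prune(child, threshold))
    return clusters

# Two tight groups merged at a much larger height are split apart:
tree = (10.0, (1.5, "a", "b"), (2.0, "c", "d"))
print(prune(tree))  # [['a', 'b'], ['c', 'd']]
```

Each cut corresponds to a large gap between successive merge distances, which is exactly the situation in which the ratio of Equation 3.19 becomes small.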
As discussed at the beginning of this chapter, we are interested in grouping straight line segments. Now we want to select the final clusters that will be used for the further feature computations and the subsequent content-based image retrieval and classification tasks. From the final set of clusters⁴ we will only select the most compact ones for further consideration, since all their members are more similar to each other than to any other subgroup; such clusters show a high mutual similarity. As the measure of compactness (intra-cluster distance) we use

\sigma_j = \left( \frac{1}{n_c} \sum_{x \in C_j} \|x - w_j\|^2 \right)^{\frac{1}{2}},  (3.20)
where C_j denotes an individual cluster with data members x, and w_j is the j-th cluster representative. Finally, we select all clusters with σ_j ≤ std_σ, where std_σ is the standard deviation of all σ_j.

4 Note that the number of actual clusters may vary from image to image.

Figure 3.3: The two graphs show dendrograms, where the ellipse in the left one encloses a typical subgraph. SD12 and SD23 are the distances between nodes in the enclosed subgraph. The right dendrogram pictures the actual pruning for the segments of the temple image (see Figure 3.4).
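As a sketch of this selection rule (assuming the representative w_j is the cluster centroid, which the text leaves open here; function names are illustrative):

```python
import numpy as np

def compactness(cluster, rep):
    """sigma_j of Equation 3.20: root-mean-square distance of the
    members x of a cluster to its representative w_j."""
    diffs = np.asarray(cluster, dtype=float) - np.asarray(rep, dtype=float)
    return float(np.sqrt(np.mean(np.sum(diffs ** 2, axis=1))))

def select_compact(clusters):
    """Keep the clusters whose sigma_j is at most the standard deviation
    of all sigma values. Centroids serve as representatives (an
    assumption of this sketch)."""
    reps = [np.mean(np.asarray(c, dtype=float), axis=0) for c in clusters]
    sigmas = [compactness(c, r) for c, r in zip(clusters, reps)]
    bound = float(np.std(sigmas))
    return [c for c, s in zip(clusters, sigmas) if s <= bound]
```

With two tight clusters and one very loose cluster, only the tight ones survive the σ_j ≤ std_σ test, mirroring the selection of compact line segment clusters above.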
Figure 3.4 and Figure 3.3 illustrate the last two steps: the pruning of the dendrogram and the final selection of clusters based on the compactness measure σ_j. The actual clustering was performed with the average linkage method, incorporating the combination of segment angle and segment length data patterns, as previously described.

The first row in Figure 3.4 shows a color image and its straight line segments. The second row shows two sets of line segments, where the left image represents members of less compact clusters and the right image denotes clusters of high compactness, according to Equation 3.20. The left set of segments can be interpreted as less important and, thus, will be ignored for the structure-based feature computation presented in Chapter 4. Similar results are obtained for other images. As a final remark it has to be said that it is not possible to precisely determine all salient line segments of an image. However, the salient segments clearly show a tendency of being visually more significant than the discarded ones. The further results presented in this thesis support this observation.
3.4 Summary and Conclusion
In this chapter we have described the important unsupervised learning problem of agglomerative hierarchical clustering, which was applied to the task of grouping straight line segments. Firstly, we have described the algorithm and six linkage methods. Secondly, we have verified the a priori assumption of an underlying cluster structure in our data with the Hopkins test. The result of h = 0.81 indicates a strongly structured dataset and supports the usage of clustering methods for the grouping of line segments. Thirdly, we have evaluated the quality of 15972 hierarchies⁵ obtained from line segment patterns of the Caltech database. The cophenetic correlation coefficient was used as a quality measure. With an averaged CCC score of 0.9262, the average linkage method produces the best result for our dataset. Fourthly, we have presented a subgraph distance ratio method that is used for pruning or cutting the dendrogram, respectively. The results show that the ratio is well suited for measuring the distance between clusters. Finally, we base the selection of meaningful or salient clusters on the intra-class compactness, resulting in a subset of straight line segments that is used for further computations.

5 Six times the number of images (2662) from the Caltech database.
Figure 3.4: The first row shows a color image and all line segments. The second row presents line segment clusters, where the left graph illustrates noise-like segments (non-compact clusters), whereas the right graph shows a salient subset of line segments according to the compactness measure (see Equation 3.20).
Chapter 4
Structure-Based Features
The structure of an image is a fundamental source of salient information. In fact, it has long been known that perceptual organization plays a fundamental role in the human early vision system. These laws of organization date back to the Gestalt psychologists [66], who defined various heuristics such as proximity, symmetry, similarity and continuation. The work of [93] incorporated perceptual grouping into computer vision tasks. Specifically, the authors concentrated on geometrical relationships in images, where tests were run on about 20 images.
In [30] a method is presented that groups line segments into perceptually salient contours. The algorithm results in a hierarchical representation of the contour. According to the authors, the method is robust against texture, clutter and repetitive image structures. The results look promising; however, the experimental analysis is restricted to a few images only. This method cannot easily be applied to CBIR, since objects in an image rarely have a clear contour. Complex and cluttered scenes are rather the normal case. More advancements in the area of perceptual grouping can be found in [58], described in Chapter 1.3.
In this chapter we derive features that are computed from the geometric structure of
images. In the previous chapters we have shown how to robustly compute edge maps
and form line segments. A hierarchical clustering algorithm was used in order to select a
salient subset of line segments, that will be used for the following computations. As we
have motivated in Section 1.3, structure is of importance on several scales. Therefore, we
compute a hierarchy of structural features, namely global and local ones. The former ones
depict a holistic scene representation and the latter ones take local perceptual groups and
their connectivity into account.
In the sequel, we present Euclidean distance matrices, the extraction of global features
and the definition of local perceptual groups.
4.1 Euclidean Distance Matrix
An Euclidean Distance Matrix (EDM) is a two-dimensional array consisting of distances
taken from a set of entities, that can be coordinates or points from a feature space. Thus,
and EDM incorporates distance knowledge between points.
Revealing information solely obtained from point distances is of major importance in
various areas of research such as genetics, pattern recognition, molecular configurations,
protein folding, geodesy, economics, statistics, psychology, chemistry, engineering and many
others.
A set of points in a Euclidean space is characterized by their coordinates. Specifically, a Euclidean Distance Matrix is a real-valued n × n matrix E containing the pairwise Euclidean distances of n points (X_k, Y_k), with k = 1, 2, ..., n. An EDM is defined as

E = \begin{pmatrix}
e_{11} & e_{12} & \cdots & e_{1n} \\
e_{21} & e_{22} & \cdots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{n1} & e_{n2} & \cdots & e_{nn}
\end{pmatrix},

where

e_{ij} = \sqrt{(X_i - X_j)^2 + (Y_i - Y_j)^2}, with i, j = 1, 2, ..., n,  (4.1)

describes the Euclidean distance between the feature points i and j.
An EDM E inherits the following properties from the norm ‖·‖₂:

1. e_{ij} ≥ 0, i ≠ j (non-negativity)
2. e_{ij} = 0, i = j (self-distance)
3. e_{ij} = e_{ji} (symmetry)
4. e_{ij} ≤ e_{ik} + e_{kj}, i ≠ k ≠ j (triangle inequality)

The dimension of E is n², but as E is symmetric with zeros along the diagonal, there are only n(n−1)/2 unique elements.
Figure 4.1 shows a set of five points and their corresponding EDM distances. In addi-
tion, we can see the same set of points under rotation, translation and scaling, respectively.
As we will show, an EDM remains constant under similarity transformations.
A common translation of all points will not affect an EDM E, since the change of the point coordinates cancels out, as shown in Equation 4.1. Assume we want to shift all points
Figure 4.1: A graphical illustration of a set of points and three corresponding transformations: (a) a set of five points, (b) translated set of points, (c) rotated set of points, (d) scaled set of points. Note that we only show the EDM elements (e12 to e15) for the point P1.
P = {p_k | p_k ∈ R^n, k = 1, 2, ..., n} by the translation vector¹ t = (t_1, t_2)^T, t_1, t_2 ∈ R, that is, in two dimensions, (X_k, Y_k) ↦ (X_k + t_1, Y_k + t_2). The resulting EDM can be written as

E(X_k, Y_k) = E(X_k + t_1, Y_k + t_2).  (4.2)

In general, a Euclidean distance matrix remains unaltered under an arbitrary translation² t ∈ R^n.
Similarly, an EDM is invariant against rotation. Let us consider a rotation in R², with R_θ being a rotation matrix³ defined as

R_θ = \begin{pmatrix} \cos(θ) & \sin(θ) \\ -\sin(θ) & \cos(θ) \end{pmatrix}.  (4.3)

Note that a rotation matrix is always orthogonal, such that R_θ^T = R_θ^{-1} and R_θ R_θ^T = R_θ^T R_θ = I. Hence, any rotation matrix preserves inner products:

X^T R_θ^T R_θ X = X^T X,  (4.4)

with X = (X_k, Y_k)^T. Recalling the definition of an EDM (Equation 4.1), together with Equation 4.4 we arrive at

E(R_θ X) = E(X).  (4.5)

In words, Equation 4.5 states that an EDM E remains unchanged when the underlying point set is rotated.
Finally, an EDM is invariant against scaling if the matrix is normalized to the range [0, 1]; otherwise it is scale invariant only up to a factor S. Next, a simple toy example illustrates the invariance properties of an EDM.
Example
Let us show the invariance properties of an EDM by considering the following set of points

X = \begin{pmatrix} 3 & 8 \\ 1 & 2 \\ 12 & 9 \\ 6 & 9 \\ 4 & 17 \end{pmatrix},
1 For the two-dimensional case; in general, t can also be of higher dimension.
2 Note that if t is known in advance, the translation can be removed by E((X_k, Y_k)^T − t^T) = E(X_k, Y_k).
3 The determinant of R_θ is equal to 1: det(R_θ) = cos²(θ) + sin²(θ) = 1.
then its EDM can be written as:

E = \begin{pmatrix}
0 & 6.3246 & 9.0554 & 3.1623 & 9.0554 \\
6.3246 & 0 & 13.0384 & 8.6023 & 15.2971 \\
9.0554 & 13.0384 & 0 & 6.0000 & 11.3137 \\
3.1623 & 8.6023 & 6.0000 & 0 & 8.2462 \\
9.0554 & 15.2971 & 11.3137 & 8.2462 & 0
\end{pmatrix}.
Next, consider the case of a simple common translation in X:

E(X_k, Y_k) = E(X_k + t_1, Y_k),  (4.6)

with t_1 = 10. In addition, we apply the following rotation to the translated point set:

R_θ = \begin{pmatrix} \cos(π/3) & \sin(π/3) \\ -\sin(π/3) & \cos(π/3) \end{pmatrix}.  (4.7)
Then the resulting EDM can be written as follows:

E_transrot = \begin{pmatrix}
0 & 6.3246 & 9.0554 & 3.1623 & 9.0554 \\
6.3246 & 0 & 13.0384 & 8.6023 & 15.2971 \\
9.0554 & 13.0384 & 0 & 6.0000 & 11.3137 \\
3.1623 & 8.6023 & 6.0000 & 0 & 8.2462 \\
9.0554 & 15.2971 & 11.3137 & 8.2462 & 0
\end{pmatrix}.
E_transrot denotes the EDM of the rotated and translated point set. We show the invariance property by subtracting the two EDMs: the largest absolute element of E_transrot − E is 0.44409 · 10⁻¹⁶. As we can see, the values of both EDMs are identical up to floating point precision.
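The worked example can be reproduced numerically. The sketch below (using NumPy; the function name is illustrative) builds the EDM of Equation 4.1 for the point set X above, applies the translation t_1 = 10 and the rotation by π/3, and checks that the two matrices agree up to floating point error:

```python
import numpy as np

def edm(points):
    """Euclidean distance matrix of an n x 2 point set (Equation 4.1)."""
    p = np.asarray(points, dtype=float)
    diff = p[:, None, :] - p[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# The point set of the worked example above.
X = np.array([[3, 8], [1, 2], [12, 9], [6, 9], [4, 17]], dtype=float)

theta = np.pi / 3
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

# Translate by t1 = 10 in x, then rotate by pi/3.
X_transrot = (X + np.array([10.0, 0.0])) @ R.T

print(np.round(edm(X)[0], 4))                 # 0, 6.3246, 9.0554, 3.1623, 9.0554
print(np.abs(edm(X_transrot) - edm(X)).max())  # on the order of 1e-15
```

The residual is pure floating point noise, confirming the invariance derived in Equations 4.2 to 4.5.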
4.2 Feature Computation
Euclidean distance matrices are based on the relations of Euclidean point coordinates. In
an image, many of these points represent noise and are not meaningful enough for an
accurate image representation. In order to overcome this problem, we base EDMs on
the information of meaningful geometric primitives such as line segments (see Chapter 3).
Note, that a line segment is more discriminative than just a single point, since it features an
orientation and a length. Subsequently, we will derive global and local structural features
of line segments.
4.2.1 Global Features
The global features we define in this section are able to capture a holistic representation
for the structural content of an image. Euclidean distance matrices help to encode angular
and spatial relations of line segments. In the end, the global feature vectors are represented
by histograms.
Let L = {l_i | i = 1, 2, ..., N} be a set of line segments obtained from one image, according to the method presented in Chapter 3. Each line segment l_i is described by two points in the Euclidean space, l_i(p_i^b(x_1, x_2), p_i^e(x_1, x_2)), where p_i^b and p_i^e are the begin and end points of a segment with coordinates (x_1, x_2) ∈ R². Then we can compute geometric properties of L such as the angles of all segments between each other, the relative length of every segment, and the relative Euclidean distance between all segment mid-points. In detail, the angle between two segments l_i and l_j is defined as:

\cos(θ_{ij}) = \frac{l_i · l_j}{\|l_i\|_2 \|l_j\|_2},  (4.8)

with ‖·‖₂ being the L²-norm. The angle is in the range [−π, π]. The relative length
of a segment l_i is the distance from p_i^b to p_i^e and can be written as:

len(l_i) = \frac{\sqrt{(x_i^e − x_i^b)^2 + (y_i^e − y_i^b)^2}}{\sqrt{(x_{max} − x_0)^2 + (y_{max} − y_0)^2}},  (4.9)

where x_i^b, x_i^e, y_i^b and y_i^e denote the coordinates of the segment's begin and end points. The denominator is a scaling factor with respect to the longest possible line segment⁴, with (x_0, y_0) and (x_{max}, y_{max}) as the begin and end point coordinates.
The Euclidean distance between the mid-points p_i^c and p_j^c of the segments l_i and l_j is defined as

dist^c(l_i, l_j) = \frac{\sqrt{(x_j^c − x_i^c)^2 + (y_j^c − y_i^c)^2}}{\sqrt{(x_{max} − x_0)^2 + (y_{max} − y_0)^2}},  (4.10)

with x_i^c, x_j^c, y_i^c and y_j^c as the coordinates of the segment mid-points. The denominator fulfills the same scaling purpose as the one in Equation 4.9. Thus, the relative length of a segment and the relative distance between two segments are limited to the range [0, 1]. The relative representation ensures invariance under isotropic scaling. Figure 4.2 illustrates the geometric properties for a set of line segments.
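The three basic properties of Equations 4.8 to 4.10 can be sketched as follows (the tuple representation of a segment and the function name are illustrative, not from the thesis):

```python
import numpy as np

def segment_props(seg_i, seg_j, diag):
    """Angle between two segments (Eq. 4.8), their relative lengths
    (Eq. 4.9) and the relative mid-point distance (Eq. 4.10).
    A segment is ((xb, yb), (xe, ye)); `diag` is the image diagonal
    used as the normalizing denominator."""
    vi = np.subtract(seg_i[1], seg_i[0]).astype(float)
    vj = np.subtract(seg_j[1], seg_j[0]).astype(float)
    cos_t = vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj))
    theta = float(np.arccos(np.clip(cos_t, -1.0, 1.0)))
    len_i = float(np.linalg.norm(vi)) / diag
    len_j = float(np.linalg.norm(vj)) / diag
    mid_i = np.mean(np.asarray(seg_i, dtype=float), axis=0)
    mid_j = np.mean(np.asarray(seg_j, dtype=float), axis=0)
    dist_c = float(np.linalg.norm(mid_j - mid_i)) / diag
    return theta, len_i, len_j, dist_c
```

For a horizontal and a vertical unit segment in an image with diagonal 5, the angle is π/2 and both relative lengths are 0.2, as expected from the normalization.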
Now, that the three basic properties of a set of line segments are computed, we can
incorporate this information into Euclidean distance matrices. Thus, the EDMs will rep-
resent the relative geometric connectivity of a set of straight line segments. Specifically,
4 The longest possible line segment is as long as the diagonal of the image.
Figure 4.2: Illustration of line segment properties: (a) angles θ12, θ13, θ23 between a set of three segments; (b) distances d^c_12, d^c_13, d^c_23 between the mid-points of a set of three segments.
we define three EDMs: one based on segment angles, E^ang (see Equation 4.8), a second one based on relative segment lengths, E^len (see Equation 4.9), and a third one based on relative distances between segments, E^dist (see Equation 4.10). The matrix E^ang can be written as:

E^{ang} = \begin{pmatrix}
e^{ang}_{11} & e^{ang}_{12} & \cdots & e^{ang}_{1n} \\
e^{ang}_{21} & e^{ang}_{22} & \cdots & e^{ang}_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e^{ang}_{n1} & e^{ang}_{n2} & \cdots & e^{ang}_{nn}
\end{pmatrix},  (4.11)

and is computed according to

e^{ang}_{ij} = ‖θ_i − θ_j‖,  (4.12)

where the values of θ_i and θ_j are in the range [−π, π]. The angles are taken between the line segments i and j. Similarly, E^len is given by:

E^{len} = \begin{pmatrix}
e^{len}_{11} & e^{len}_{12} & \cdots & e^{len}_{1n} \\
e^{len}_{21} & e^{len}_{22} & \cdots & e^{len}_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e^{len}_{n1} & e^{len}_{n2} & \cdots & e^{len}_{nn}
\end{pmatrix},  (4.13)

and calculated by

e^{len}_{ij} = [(len(l_i) − len(l_j))^T (len(l_i) − len(l_j))]^{1/2} = ‖len(l_i) − len(l_j)‖,  (4.14)

with len(l_i) and len(l_j) being the lengths of the Euclidean line segments i and j, defined in Equation 4.9.
The third EDM⁵, E^dist, encodes the relative Euclidean distances of all line segment mid-points:

e^{dist}_{ij} = [(p_i^c − p_j^c)^T (p_i^c − p_j^c)]^{1/2} = ‖p_i^c − p_j^c‖,  (4.15)

where p_i^c and p_j^c represent the mid-points of segments i and j.
Next, we form histograms for the three EDMs. Since every EDM is symmetric (see Section 4.1), we can extract the upper triangle matrix EDM^{utri}_{ij} ∀ i > j, such that we obtain three upper triangle matrices: E^{ang,utri}_{ij}, E^{len,utri}_{ij} and E^{dist,utri}_{ij}. Then we create three histograms with different resolutions:

H^{ang} = {h^{ang}_{b_a}}, b_a = 1, 2, ..., B_a,  (4.16)
H^{len} = {h^{len}_{b_l}}, b_l = 1, 2, ..., B_l,  (4.17)
H^{dist} = {h^{dist}_{b_d}}, b_d = 1, 2, ..., B_d,  (4.18)

where B_a, B_l and B_d denote the number of bins of the three histograms. The b-th bin h_b of every histogram is defined as:

h_b = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \begin{cases} 1 & \text{if } e_{ij} \text{ is quantized into the } b\text{-th bin} \\ 0 & \text{otherwise,} \end{cases}  (4.19)

where e_{ij} is used as a synonym for an element of any of the three upper triangle EDMs E^{ang,utri}_{ij}, E^{len,utri}_{ij} or E^{dist,utri}_{ij}. The three histograms can be understood as a holistic geometric representation of a set of segments. The histogram features are invariant against similarity transformations and changes in illumination (as shown in Section 5.5.2).
In our experiments we have empirically determined the best resolution for the histograms. For H^ang it turned out that 72 or 36 bins, corresponding to a 5° or 10° angular resolution, produced the best results. The resolution for H^len and H^dist depends more on the application data than that of H^ang. However, we have found that 15 bins result in a robust and compact histogram feature. In the sequel of this thesis we denote the global structure features as H^{global} = {H^{ang}, H^{len}, H^{dist}}. For the experiments described in Chapter 5 we use a resolution of 36 bins for H^ang and 15 bins for H^len and H^dist, if not indicated otherwise.
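One such global histogram can be sketched as follows for scalar per-segment values such as angles or relative lengths (the function and parameter names are illustrative, not the thesis implementation):

```python
import numpy as np

def edm_hist(values, bins, value_range):
    """One global structure histogram: pairwise absolute differences
    |v_i - v_j| (cf. Equations 4.12/4.14), taken over the upper
    triangle of the EDM and normalized by N^2 as in Equation 4.19."""
    v = np.asarray(values, dtype=float)
    pairwise = np.abs(v[:, None] - v[None, :])
    iu = np.triu_indices(len(v), k=1)            # upper triangle, i < j
    hist, _ = np.histogram(pairwise[iu], bins=bins, range=value_range)
    return hist / len(v) ** 2

# H^ang with 36 bins over angle differences, H^len with 15 bins:
h_ang = edm_hist([0.1, 1.2, 3.0], bins=36, value_range=(0.0, np.pi))
h_len = edm_hist([0.2, 0.5, 0.9], bins=15, value_range=(0.0, 1.0))
```

Because only differences enter the histogram, the feature inherits the similarity invariance of the underlying EDM.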
4.2.2 Local Perceptual Features
The global features described in Section 4.2.1 encode a complete scene. The advantage
of the global approach is its general applicability. However, for certain tasks, such as object recognition, local features play an important role, too. Man-made
5 The matrix is defined similarly to Equation 4.13.
objects such as buildings or cars exhibit many perpendicular and/or parallel line segments.
Therefore, we will introduce local features based on perceptual groups of line segments.
First, we define perceptual groups, which are unique, eminent structural entities of line segments with well defined relations:

• Parallelity
• Perpendicularity
• Diagonality (π/4, 3π/4)

These groups are formed according to angular relations between segments and will be used to compute geometric relations between their members. Subsequently, each group will be divided into subsets that are formed according to spatial relations between their members. In the sequel, we will show that this hierarchical approach of grouping segments according to various geometric, angular and spatial relations yields a highly discriminative local feature, as shown in Chapter 5.
Now we form the groups listed above. Specifically, let L = {l_i | i = 1, 2, ..., N} be a set of line segments extracted from an image according to Section 4.2.1. Then we obtain several subsets of segments that fulfill specific relations. The first relation is parallelity. Let L^‖ = {l^‖_k | k = 1, 2, ..., N^‖} be a set of N^‖ parallel line segments. Two segments l_i and l_j are parallel if they fulfill the following constraints:

(π − ε_‖) ≤ |θ_{ij}(l_i, l_j)| ≤ π  for i ≠ j,  (4.20)
0 ≤ |θ_{ij}(l_i, l_j)| ≤ ε_‖  for i ≠ j,  (4.21)

with ε_‖ ≤ 5°. The parameter ε_‖ accounts for robustness and quantization errors arising from the actual line segment formation process (see Section 2.4).

The second subset is formed by a set of perpendicular segments L^⊥ = {l^⊥_l | l = 1, 2, ..., N^⊥} that follow:

(π/2 − ε_⊥) ≤ |θ_{ij}(l_i, l_j)| ≤ (π/2 + ε_⊥)  for i ≠ j,  (4.22)

with ε_⊥ ≤ 5°; ε_⊥ plays a similar role as ε_‖ in Equation 4.20.

The third subset contains diagonal (π/4) segments L^{diag45} = {l^{diag45}_m | m = 1, 2, ..., N^{diag45}} under the condition:

(π/4 − ε_{diag45}) ≤ |θ_{ij}(l_i, l_j)| ≤ (π/4 + ε_{diag45})  for i ≠ j,  (4.23)

with ε_{diag45} ≤ 5°.
The fourth subset is created with diagonal (3π/4) segments L^{diag135} = {l^{diag135}_n | n = 1, 2, ..., N^{diag135}} following:

(3π/4 − ε_{diag135}) ≤ |θ_{ij}(l_i, l_j)| ≤ (3π/4 + ε_{diag135})  for i ≠ j,  (4.24)

with ε_{diag135} ≤ 5°.
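The four angular constraints of Equations 4.20 to 4.24 amount to a simple classification of the absolute inter-segment angle. A sketch, assuming a common ε = 5° for all groups (function name illustrative):

```python
import numpy as np

def angular_group(theta, eps=np.deg2rad(5.0)):
    """Map an absolute inter-segment angle |theta_ij| (in radians) to
    one of the four perceptual relations, or None if no constraint of
    Equations 4.20-4.24 is met."""
    t = abs(theta)
    if t <= eps or t >= np.pi - eps:
        return "parallel"          # Equations 4.20/4.21
    if abs(t - np.pi / 2) <= eps:
        return "perpendicular"     # Equation 4.22
    if abs(t - np.pi / 4) <= eps:
        return "diag45"            # Equation 4.23
    if abs(t - 3 * np.pi / 4) <= eps:
        return "diag135"           # Equation 4.24
    return None
```

Angles that fall outside all four tolerance bands are simply not assigned to any perceptual group, which matches the selective nature of the grouping described above.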
The four sets of segments L^‖, L^⊥, L^{diag45} and L^{diag135} account for well defined structures. The subsets reflect line segments with certain relations. In fact, we can extract similar features as we did in Section 4.2.1. Following that methodology, we can compute three EDMs E^{ang}_∗, E^{len}_∗ and E^{dist}_∗ for each of the four extracted sets of segments. Note that the ∗ is a placeholder for the four sets. Specifically, we define the angles between two segments, the relative segment lengths and the relative distance between two segments according to Equations 4.8, 4.9 and 4.10 for every subset of line segments. The resulting EDMs are defined as follows:

E^{ang}_∗: e^{ang∗}_{ij} = ‖θ_i − θ_j‖,  (4.25)
E^{len}_∗: e^{len∗}_{ij} = ‖len(l_i) − len(l_j)‖,  (4.26)
E^{dist}_∗: e^{dist∗}_{ij} = ‖p_i^c − p_j^c‖,  (4.27)

where θ_i and θ_j are in the range [−π, π], len(l_i) and len(l_j) are the lengths of the line segments i and j (see Equation 4.9), and p_i^c and p_j^c represent the mid-points of segments i and j. Note that in Equations 4.25, 4.26 and 4.27 the indices i and j differ for every subset L^‖, L^⊥, L^{diag45} and L^{diag135}.
Then we create three histograms for every subset of line segments (denoted by the ∗), similarly to Equations 4.16 to 4.18:

H^{ang}_∗ = {h^{ang∗}_{b_a}}, b_a = 1, 2, ..., B_a,  (4.28)
H^{len}_∗ = {h^{len∗}_{b_l}}, b_l = 1, 2, ..., B_l,  (4.29)
H^{dist}_∗ = {h^{dist∗}_{b_d}}, b_d = 1, 2, ..., B_d,  (4.30)

where each bin is computed similarly to Equation 4.19. The histograms represent geometric relations of salient segment subsets. Since three histograms are formed for every set, we obtain 12 histograms in total.
We can recognize that geometric relations play an important role on various scales. Remember that in Section 4.2.1 we solely focused on the global scale. To this point, we have encoded relations of line segment subsets. In the next step we will further
reduce the members of the four sets by additional constraints. Specifically, we extract local perceptual groups by forming spatially close sets. Let l^⊥_l be a straight line from the set of perpendicular segments, and let len(l^⊥_l) denote the Euclidean length of the l-th segment. Then we search for all segments in the set L^⊥ that fulfill the following criterion:

dist(l^⊥_l, l^⊥_o) ≤ S · len(l^⊥_l), with  (4.31)

dist(l^⊥_l, l^⊥_o) = min{dist(l^⊥_l(p^b), l^⊥_o(p^b)), dist(l^⊥_l(p^b), l^⊥_o(p^e)), dist(l^⊥_l(p^e), l^⊥_o(p^b)), dist(l^⊥_l(p^e), l^⊥_o(p^e))},  (4.32)

with l, o = 1, 2, ..., N^⊥, where S is a scaling parameter. Thus we obtain a subset of perpendicular line segments L^{⊥s} that are spatially close to each other. In fact, L^{⊥s} = {L^{⊥s}_o | o = 1, 2, ..., N^{⊥s}} may consist of several local sets that fulfill the proximity criterion of Equation 4.31, i.e. there might be clearly separated groups (with several members) of perpendicular line segments. In words, Equation 4.31 searches for all segments that lie within a radial distance of S times the length of the l-th segment. The radial distance is computed for all four possible distances between the end points of two segments, and the minimal one is selected. Similarly, we define spatially close subgroups for the other three sets of segments L^‖, L^{diag45} and L^{diag135} as:

dist(l^‖_k, l^‖_p) ≤ len(l^‖_k),  (4.33)
dist(l^{diag45}_m, l^{diag45}_r) ≤ len(l^{diag45}_m),  (4.34)
dist(l^{diag135}_n, l^{diag135}_s) ≤ len(l^{diag135}_n),  (4.35)

where k, p denote any two segments from L^‖, m, r stand for any two segments from L^{diag45}, and n, s are segment indices of L^{diag135}. The distances between segment points are defined as in Equation 4.31. Thus, we arrive at four subsets of local perceptual segment groups: L^{⊥s}, L^{‖s}, L^{diag45,s} and L^{diag135,s}.
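The proximity test of Equations 4.31 and 4.32 can be sketched as follows (the segment representation and function names are illustrative):

```python
import numpy as np

def seg_dist(seg_a, seg_b):
    """Minimal end-point distance of Equation 4.32: the smallest of the
    four begin/end point distances between two segments."""
    return min(float(np.linalg.norm(np.subtract(p, q)))
               for p in seg_a for q in seg_b)

def spatially_close_pairs(segments, S=1.0):
    """Index pairs (l, o) whose minimal end-point distance lies within
    S times the length of segment l (Equation 4.31)."""
    pairs = []
    for l, a in enumerate(segments):
        length = float(np.linalg.norm(np.subtract(a[1], a[0])))
        for o, b in enumerate(segments):
            if l != o and seg_dist(a, b) <= S * length:
                pairs.append((l, o))
    return pairs
```

Note that the criterion is asymmetric in general, since the radius is tied to the length of the l-th segment; connected components of the resulting pairs would give the separated local groups mentioned above.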
We show the formation of segment subsets in Figure 4.3, where we can see 12 arbitrary line segments. In order to illustrate the formation of the segment subsets we group perpendicular segments. The subset L^⊥ in Figure 4.3 consists (see Equation 4.22) of the segments l2, l5, l6, l9, l10, l11. Next we consider L^{⊥s} ⊂ L^⊥. According to Equation 4.31, the segments l6, l10, l2, l5 belong to the set L^{⊥s}.
Any of the four sets⁶ L^{∗s} consists of segments that are spatially close. Although the segments are close to each other, their Euclidean lengths may differ significantly. Thus, we select additional perceptual segment groups from L^{∗s} that are of a similar Euclidean length.

6 The ∗ is a placeholder for any of the sets L^{⊥s}, L^{‖s}, L^{diag45,s} or L^{diag135,s}.

Figure 4.3: Sample groups of line segments that follow certain constraints (see Equations 4.20 to 4.24 and Equations 4.31 to 4.35). The groups shown are L^{⊥s}_1 (l6, l10), L^{⊥s}_2 (l2, l5), L^{‖s}_1 (l7, l8) and L^{diag45}_1 (l9, l12).

Let r^{len} be a segment length ratio. Then we can define r^{len} for each of the four sets as:
r^{len}_⊥ = \frac{len(l^{⊥s}_l)}{len(l^{⊥s}_o)},  (4.36)
r^{len}_‖ = \frac{len(l^{‖s}_k)}{len(l^{‖s}_p)},  (4.37)
r^{len}_{diag45} = \frac{len(l^{diag45,s}_m)}{len(l^{diag45,s}_r)},  (4.38)
r^{len}_{diag135} = \frac{len(l^{diag135,s}_n)}{len(l^{diag135,s}_s)},  (4.39)

with len(l^∗_l) ≤ len(l^∗_o) and 0 ≤ r^{len}_∗ ≤ 1, where ∗ denotes any of the four segment sets. r^{len}_∗ is a variable threshold for the selection of segment length ratios, e.g. r^{len}_∗ = 1 means that only lines of equal length will be selected. Thus, r^{len}_∗ is used to select another four sets of segments: L^{⊥r}, L^{‖r}, L^{diag45,r} and L^{diag135,r}.
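A sketch of this length-ratio filter (Equations 4.36 to 4.39), with the ratio always taken as shorter over longer so that it stays in [0, 1]; the threshold value 0.8 below is purely illustrative:

```python
import numpy as np

def length_ratio_pairs(segments, r_min=0.8):
    """Index pairs whose length ratio r = len(shorter)/len(longer)
    reaches at least r_min (cf. Equations 4.36-4.39; r_min = 1 would
    keep only equally long segments)."""
    lengths = [float(np.linalg.norm(np.subtract(s[1], s[0])))
               for s in segments]
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            r = min(lengths[i], lengths[j]) / max(lengths[i], lengths[j])
            if r >= r_min:
                pairs.append((i, j))
    return pairs
```

Intersecting the pairs selected here with the spatially close pairs of Equation 4.31 yields segments that are both nearby and of similar length.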
Now we have obtained two times four sets of perceptual groups of line segments. The first four sets were extracted according to the spatial distance of segments, and the second four sets have been derived on the basis of similarly long segments. However, we can form a final set L^{∗rs} = L^{∗r} ∩ L^{∗s}, that is, the intersection of the two sets. In words, L^{∗rs} contains all segments that simultaneously fulfill the constraints of Equations 4.31 to 4.35 and Equations 4.36 to 4.39. Thus, this set represents a further local perceptual group that follows certain relations. Now we form EDMs and histograms, as we have already done before. The EDMs and histograms are formed similarly to Equations 4.25 to 4.27 and Equations 4.28 to 4.30. Thus, we obtain three sets of four histograms: H^{ang,rs}_∗, H^{len,rs}_∗ and H^{dist,rs}_∗.

To this point we have derived a complete hierarchy of sets of segments. We started with the grouping of segments with respect to their angles into four main groups. Subsequently, we have extracted subsets restricted to spatially close lines, to segments of similar lengths, and to sets that fulfill both criteria at the same time, respectively. This highly selective process of grouping segments into perceptual entities assigns a large fraction of the initially computed lines (L) to perceptual groups. Thus, depending on the size of the original set, some perceptual groups can be very sparse.
At the beginning of this chapter we mentioned the extraction of local perceptual groups and their connectivity. The extraction of various local perceptual groups is important for the determination of highly salient image locations such as windows or doors. Symmetry and repetitive structures are common for man-made objects; e.g., the facade of a skyscraper exhibits numerous parallel and perpendicular structures (see Figure 2.2). The various local perceptual groups we have defined so far are capable of encoding such symmetries, as shown by the results in Chapter 5.
Next we add connectivity to all perceptual groups. Connectivity adds relations between the local perceptual groups, as shown in Figure 4.4. In the graph we see a set of line segments. Specifically, we can see two kinds of local perceptual groups with two members each (L^{⊥r} and L^{‖r}). For better readability we only show subsets consisting of two lines, such as L^{⊥r}_2. Note that subsets can consist of more than two segments, if all constraints are met. The dotted connection line linking the two paired segments L^{⊥r}_2 and L^{⊥r}_1 can be interpreted as a center line of gravity.
Each set L^{∗r} consists of subsets L^{∗r}_o, with o = 1, 2, ..., N^{∗r} (compare Figure 4.4). Then we take the two closest segments within every subset L^{∗r}_o, where the distance between segments l_l and l_o is computed as in Equation 4.31.
Once the two closest segments of a perceptual group are found, we can repeat that for
all other groups. These paired segments can be understood as line segments that exhibit
highly relevant geometric relations between each other.
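Finding the two closest segments within one group can be sketched as follows (segment representation as before, minimal end-point distance as in Equation 4.32; names are illustrative):

```python
import numpy as np

def closest_pair(group):
    """Return (distance, (i, j)) for the two closest segments of one
    perceptual group, using the minimal end-point distance."""
    def seg_dist(a, b):
        return min(float(np.linalg.norm(np.subtract(p, q)))
                   for p in a for q in b)
    best_d, best_ij = None, None
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            d = seg_dist(group[i], group[j])
            if best_d is None or d < best_d:
                best_d, best_ij = d, (i, j)
    return best_d, best_ij
```

Applying this to every group yields the paired segments whose connection lines d_kp carry the connectivity information described next.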
Now we introduce connectivity between the local perceptual groups. To this point we have computed relations between single segments of sets that fulfill certain criteria; next we calculate connectivity between paired groups of segments. Let d_{kp} denote the shortest distance between any two points of the paired segments k and p, computed
Figure 4.4: Local perceptual groups of line segments based on Equations 4.31 to 4.35 and Equations 4.36 to 4.39. The paired groups shown are L^{⊥r}_1 (l6, l7), L^{⊥r}_2 (l4, l5), L^{‖r}_1 (l8, l9) and L^{‖r}_2 (l2, l3), with connection lines d67, d45, d89 and d23.
according to Equation 4.32. Further, we introduce the mid-point p^c_{kp} of the connection line d_{kp}. Specifically, connectivity is defined as the Euclidean distance matrix based on the set of points p^c_{kp}, where the indices k and p denote a pair of two segments. In Figure 4.4 we can see d_{kp} for paired segments, for example d_{45}.

In order to control the scale of connectivity we introduce a factor C_∗, the number of groups to be taken into consideration, with ∗ denoting all sets of local perceptual groups. The parameter C_∗ can take any value in {1, 2, ..., N^{∗r}}, with N^{∗r} as the number of subgroups for every L^{∗r}, and can be understood as a granularity factor. A C_∗ of 1 means no connectivity, since only one pair of segments is taken into consideration, whereas a C_∗ equal to the number of subgroups N^{∗r} indicates full connectivity. Then, we construct connectivity EDMs according to:
E^{ang,pg}_∗: e^{ang,pg∗}_{ku} = ‖θ_{kp} − θ_{uv}‖,  (4.40)
E^{len,pg}_∗: e^{len,pg∗}_{ku} = ‖len(d_{kp}) − len(d_{uv})‖,  (4.41)
E^{dist,pg}_∗: e^{dist,pg∗}_{ku} = ‖p^c_{kp} − p^c_{uv}‖,  (4.42)

with θ_{kp} and θ_{uv} in the range [−π, π] denoting the angles between the connection lines d_{kp} and d_{uv} of pairs of segments from the four local perceptual groups. These groups were extracted according to Equations 4.36 to 4.39 (L^{⊥r}, L^{‖r}, L^{diag45,r} and L^{diag135,r}), summarized with the ∗ notation as L^{∗r}. The EDMs defined in Equations 4.40 to 4.42 stand for 12 EDMs, due to the ∗ notation. The 12 EDMs are denoted as follows:
• E^{ang,pg}_⊥, E^{ang,pg}_‖, E^{ang,pg}_{diag45}, E^{ang,pg}_{diag135},
• E^{len,pg}_⊥, E^{len,pg}_‖, E^{len,pg}_{diag45}, E^{len,pg}_{diag135},
• E^{dist,pg}_⊥, E^{dist,pg}_‖, E^{dist,pg}_{diag45}, E^{dist,pg}_{diag135}.
The EDMs contain the connectivity knowledge between paired segments of local perceptual groups. The factor C_∗ ensures that all resulting EDMs are of the same size. Note that for the global features (see Section 4.2.1) the number of elements per EDM may vary, due to differences in the number of detected line segments. Since each connectivity EDM has a constant number of elements, we take every EDM directly as a feature vector. For C_∗ = 3 we connect 3 groups of paired segments, resulting in EDMs of size 3 × 3 with 3 unique elements; these numbers form the feature vectors. At the end of this section we will list all histograms and feature vectors, including their resolutions.
To this point we have computed a large number of feature from sets of line segments.
In order to complete the description of sets of line segments we derive some statistical
properties. In addition to the EDM based features we compute other properties from each
set of line segments that we have extracted so far. We have shown that spatial arrangements
58 Structure-Based Features
of line segments can be encoded by Euclidean distance matrices. The hierarchical approach
of forming subsets with respect to different properties has proven reasonable. We can
complete the set of features describing local structure by counting the members of each
line segment (sub-)set. Therefore we define a fraction that relates the number of segments
N^∗ of a subset to the total number of segments N detected in one image:

    F^∗ = N^∗ / N,    (4.43)

where F^∗ is in the range [0, 1]. F^∗ = 1 indicates that the subset contains all detected
segments, whereas F^∗ = 0.5 means that half of all segments belong to the subset.
Thus we obtain a single number for each subset of segments. The ensemble of all fractions
forms a vector vfrac that is added to the features. The complete vector can be written
as vfrac = (F^∗_s, F^∗_r, F^∗_rs), with ∗ denoting ⊥, ‖, diag45 and diag135. Hence, vfrac consists of
3 × 4 numbers. Thus, in addition to the spatial relations of segments, we have obtained a
quantitative measure.
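The fraction feature of Equation 4.43 can be sketched as follows; the subset names and segment counts are hypothetical:

```python
def subset_fractions(subset_sizes, n_total):
    # F* = N*/N (Equation 4.43); values lie in [0, 1] and F* = 1 means
    # the subset contains every detected segment of the image
    return {name: n / n_total for name, n in subset_sizes.items()}

# hypothetical segment counts of the four perceptual groups of one image
counts = {"perp": 40, "para": 30, "diag45": 20, "diag135": 10}
vfrac = subset_fractions(counts, n_total=100)
```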
Finally, we form a last histogram feature for our local structure description. The
distribution of the relative segment lengths can account for an image's structural content.
Our experiments show that, on the basis of the segment length distribution, there is a
strong distinction between images of man-made objects and landscape or animal images.
Man-made objects exhibit, in general, longer segments that vary less in their lengths.
Hence, we form histograms for each (sub-)set of segments obtained so far:

    H^slen_∗ = h^slen∗_b, b = 1, 2, ..., B_slen,    (4.44)

where B_slen denotes the number of bins of the histogram. The b-th bin h_b is defined
similarly to Equation 4.19. The histogram represents the distribution of the relative
lengths of a set of line segments.
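A sketch of such a relative-length histogram, assuming the 20-bin resolution used later in this section and hypothetical segment lengths:

```python
import numpy as np

def relative_length_histogram(lengths, bins=20):
    # normalize by the longest segment so the distribution is invariant
    # to isotropic scaling, then histogram the relative lengths
    lengths = np.asarray(lengths, dtype=float)
    rel = lengths / lengths.max()
    hist, _ = np.histogram(rel, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()   # normalize the histogram to sum to one

# hypothetical segment lengths of one (sub-)set
h = relative_length_histogram([12.0, 30.0, 60.0, 58.0, 5.0])
```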
Eventually we form the final set of local feature vectors by concatenating all histograms
and feature vectors obtained so far. The final set of local features comprises histograms
of local perceptual groups of line segments, which encode the relations between single
segments and the connectivity of paired segments of perceptual groups. In addition, the
set contains quantitative knowledge of the members of every subset of segments and
histograms that represent the distributions of the relative segment lengths.
Feature Representation In our experiments we have empirically determined the best
resolution for the histograms. Note that the histogram resolutions are similar to the
findings for the global features. For H^ang_∗ and H^angrs_∗ it turned out that 72 or 36 bins,
corresponding to an angular resolution of 5° or 10°, produced the best results. The
resolutions for H^len_∗, H^dist_∗, H^lenrs_∗ and H^distrs_∗ depend more on the application data than
H^ang_∗. However, we have found that 15 bins result in robust and compact histogram
features. For H^slen_∗ we use a 20-bin resolution; vfrac was described in Equation 4.43.
The complete feature vector is a concatenation of all histograms and vectors:

    H^local = ( H^ang_∗, H^len_∗, H^dist_∗, H^angrs_∗, H^lenrs_∗, H^distrs_∗, E^distpg_∗, E^lenpg_∗, E^angpg_∗, H^slen_∗, vfrac ).
Note that ∗ denotes ⊥, ‖, diag45 and diag135. For the experiments described in
Chapter 5 we use a resolution of 36 bins for H^ang_∗ and H^angrs_∗, and 15 bins for H^len_∗, H^dist_∗,
H^lenrs_∗ and H^distrs_∗, if not indicated otherwise. In the sequel of this thesis, H^local is referred
to as local features or local structure-based features. The combination of H^local and H^global
will be denoted as structure-based or simply structural features.
4.2.3 Summary
In this section we have developed a structure-based feature extraction method that encodes
relative spatial arrangements of line segments. The method determines relations on global
and local scales. Specifically, the global features capture a holistic representation for the
structural content of an image. Therefore, Euclidean distance matrices are computed.
The EDMs encode geometric relations for a set of line segments such as the angles of
all segments between each other, the relative lengths of every segment and the relative
Euclidean distance between all segment mid-points. The feature vector is obtained from
histograms that are computed from the EDMs.
The local features are computed from perceptual groups of line segments that are formed
according to angular relations between segments. Subsequently, each group is divided into
subsets according to the spatial relations among their members. Then, local perceptual
groups are extracted that consist of paired segments. In a subsequent step we compute
the connectivity between these groups. Euclidean distance matrices encode the geometric,
angular and spatial relations for all sets of segments. In the end a feature vector is formed
that contains histograms of the EDMs and histograms representing the distributions of the
relative segment lengths. In addition, the set contains a vector that describes the fraction
of all lines from an image that belong to the various subsets.
4.3 Similarity Measures
In this section we investigate the performance of several similarity measures for the global
and local structure features derived in the previous section. The concept of similarity is
one of the most difficult aspects within content-based image retrieval. The basic question
- ”What is similarity on the image level?” - remains unanswered to a certain degree. One
reason is the individuality of human perception. Imagine the following example: someone
wants to find all images similar to a query image showing a red car in an urban scene. The
first person just wants to find red cars. Another person might want to find the very same
car type, no matter which color. A third person might not be interested in the car at all,
but rather in the background building, and so on. Hence, an apparently simple query image
can lead to several different, equally correct retrieval results - depending on the specific
user's intention.
To date, this subjective perception cannot be modelled by a general mathematical
measure of image similarity. Nonetheless, there exists a large choice of image similarity
measures. Good reviews and overview articles can be found in [135], [122], [109], [141], [1],
[146], [74], [125], [98], and [34]. We do not intend to describe all measures in detail; for
our further investigation, we will rather select a set of the most important ones.
In the sequel, we will describe eight relevant measures and test their performances
on a real-world set of images. In general, each image in a database is represented by
an n-dimensional vector v = (v1, v2, ..., vn). Then, the similarity between a query image
feature vector q = (q1, q2, ..., qn) and the feature vectors v in the database is given by a metric
measure7. The similarity measures chosen for this investigation are:
• L1-norm: The L1-norm is sometimes also called Manhattan or city block distance.
The histogram intersection is identical to the L1-norm in the case of normalized
histograms.
• L2-norm: A similarity measure between two vectors, based on the Euclidean dis-
tance.
• Matusita: The Matusita distance is a vector metric, i.e. it fulfills the triangle
inequality.
• Chi-square measure: The Chi-square similarity measure is based on the well known
chi-square test of equality.
• Correlation coefficient: The correlation coefficient can be interpreted as the dot-
product of two standardized vectors divided by the product of their norms.
7Although some measures do not fulfill all properties of a metric.
• Bhattacharyya: The Bhattacharyya similarity measure [6] is often interpreted as a
geometric similarity measure.
• Kullback-Leibler divergence: The Kullback-Leibler divergence [72] is a natural
distance measure from a true probability distribution to an arbitrary probability
distribution.
• Jensen-Shannon divergence: The Jensen-Shannon divergence [23] measures the
similarity between two probability distributions.
The eight similarity measures are computed as follows:

L1-norm:

    L1(q, v) = ∑_{i=1}^{n} |q_i − v_i|,    (4.45)

L2-norm:

    L2(q, v) = ( ∑_{i=1}^{n} (q_i − v_i)² )^{1/2},    (4.46)

Matusita:

    M(q, v) = √( ∑_{i=1}^{n} (√q_i − √v_i)² ),    (4.47)

Chi-square:

    χ²(q, v) = ∑_{i=1}^{n} (q_i − v_i)² / (q_i + v_i),    (4.48)

Bhattacharyya:

    B(q, v) = −ln ∑_{i=1}^{n} √(q_i v_i),    (4.49)

Correlation coefficient:

    r(q, v) = ∑_{i=1}^{n} (q_i − q̄)(v_i − v̄) / ( √(∑_{i=1}^{n} (q_i − q̄)²) · √(∑_{i=1}^{n} (v_i − v̄)²) ),    (4.50)

Kullback-Leibler divergence:

    K(p1, p2) = ∑_{i=1}^{n} p1(x_i) log2 (p1(x_i) / p2(x_i))
              = ∑_{i=1}^{n} p1(x_i) log2 p1(x_i) − ∑_{i=1}^{n} p1(x_i) log2 p2(x_i),    (4.51)

Jensen-Shannon divergence:

    JS(p1, p2) = H(π1 p1 + π2 p2) − (π1 H(p1) + π2 H(p2)),    (4.52)

with π1 and π2 as weights of the distributions, satisfying π1 + π2 = 1, and

    H(p) = −∑_{i=1}^{n} p_i log2 p_i,    (4.53)
denotes the Shannon entropy of the probability distribution p = (p1, p2, ..., pn). In contrast
to the Kullback-Leibler divergence, the Jensen-Shannon (JS) divergence is symmetric,
always well defined and bounded.
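The eight measures above can be sketched in Python. This is an illustrative sketch, not the thesis implementation; the example histograms q and v are hypothetical and assumed to be normalized with strictly positive entries where logarithms are taken:

```python
import numpy as np

def l1(q, v):            return np.abs(q - v).sum()                              # (4.45)
def l2(q, v):            return np.sqrt(((q - v) ** 2).sum())                    # (4.46)
def matusita(q, v):      return np.sqrt(((np.sqrt(q) - np.sqrt(v)) ** 2).sum())  # (4.47)
def chi_square(q, v):    return (((q - v) ** 2) / (q + v)).sum()                 # (4.48)
def bhattacharyya(q, v): return -np.log(np.sqrt(q * v).sum())                    # (4.49)

def correlation(q, v):                                                           # (4.50)
    qc, vc = q - q.mean(), v - v.mean()
    return (qc * vc).sum() / (np.sqrt((qc ** 2).sum()) * np.sqrt((vc ** 2).sum()))

def kullback_leibler(p1, p2):                                                    # (4.51)
    return (p1 * np.log2(p1 / p2)).sum()

def shannon_entropy(p):                                                          # (4.53)
    return -(p * np.log2(p)).sum()

def jensen_shannon(p1, p2, pi1=0.5, pi2=0.5):                                    # (4.52)
    return shannon_entropy(pi1 * p1 + pi2 * p2) - (pi1 * shannon_entropy(p1) + pi2 * shannon_entropy(p2))

# hypothetical normalized 4-bin histograms standing in for image features
q = np.array([0.1, 0.2, 0.3, 0.4])
v = np.array([0.25, 0.25, 0.25, 0.25])
```

Note that the divergences require probability distributions, whereas the norms and the correlation coefficient apply to arbitrary feature vectors.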
Next we apply the eight measures to an image retrieval problem of ancient watermarks,
in order to determine the best one. As features we use the global and local sets Hglobal
and Hlocal derived in Section 4.2.

In Figure 4.5 we can see the comparison between all previously defined measures. The
eight measures have been applied to the two classes Cup and Bull Head of the ancient
Watermark database (see Section 5.1.1). The graph shows two class-wise averaged recall
versus number of retrieved images plots. We can see that on average the Intersection
measure performs best for both classes. The Bhattacharyya and the Correlation Coefficient
measures are second best, followed by the other measures.
4.4 Data Normalization
Normalization is a process that changes the range of data. For image processing
applications, including image representation, matching, CBIR and classification methods,
data are usually normalized to the range [−1, +1] or [0, 1]. This compact feature
representation improves subsequent data processing such as training a classifier or the
matching of similar images in CBIR applications. Moreover, several authors reported
significant improvements of classification or image retrieval results when proper
normalization methods are used [3], [2] and [41]. For the data represented in this thesis
we use the following normalization methods:
• Linear scaling to unit range [0, 1]: Set a lower bound l and an upper bound u for a
feature vector:

    v_n = (v − l) / (u − l).    (4.54)
• Zero mean and unit variance normalization to the range [−1, 1]: Compute the mean
µ and standard deviation σ of the feature vector as in [59]:

    v_n = (v − µ) / σ.    (4.55)
Figure 4.5: Class-wise averaged recall versus number of retrieved images plots with eight different similarity measures for two classes (Cup and Bull Head) of the ancient Watermark database (see Section 5.1.1). The compared measures are Intersection, L2-norm, Chi-square, Matusita, Bhattacharyya, Correlation coefficient, Kullback-Leibler and Jensen-Shannon.
Note the different range for the second normalization method. It should be mentioned that
the data can be mapped with a probability of 99% to the unit range [0, 1] by a simple shift
and rescaling operation [2]:

    v_n = ( (v − µ) / (3σ) + 1 ) / 2.    (4.56)
• Rank normalization uniformly maps a feature vector to the range [0, 1]:

    v_n = (rank_{v_i}(v) − 1) / (R − 1),    (4.57)

where R is the maximum number of ranks for each feature vector v; R = max(r), with
r = 1, 2, ..., n and n being the number of elements of v. Note that rank_{v_i}(v) performs a
ranking, i.e. it sorts the feature vector elements in ascending order. Moreover, we could
observe that the choice of the normalization method can have a significant impact on the
results.
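The three normalization methods can be sketched as follows. This is a sketch under the assumption l = min(v) and u = max(v) for the linear scaling, and of tie-free data for the ranking; the helper names are hypothetical:

```python
import numpy as np

def unit_range(v):
    # linear scaling to [0, 1] (Equation 4.54) with l = min(v), u = max(v)
    l, u = v.min(), v.max()
    return (v - l) / (u - l)

def zero_mean_unit_var(v):
    # zero-mean, unit-variance normalization (Equation 4.55)
    return (v - v.mean()) / v.std()

def rank_normalize(v):
    # rank normalization (Equation 4.57): ranks 1..R in ascending order,
    # mapped by (rank - 1) / (R - 1); assumes tie-free data
    ranks = v.argsort().argsort() + 1
    return (ranks - 1) / (len(v) - 1)

v = np.array([3.0, 1.0, 4.0, 1.5])   # hypothetical feature vector
```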
4.5 Feature Space Representation
It is a common practice to assume that a feature space follows the normal distribution.
This assumption is rarely verified, nor can it be said how often it holds.
The authors in [110] observed that a uniform data distribution of feature sets results in
a better retrieval for the Euclidean distance measure. Further, they found that if the common
assumption of a Gaussian feature space distribution does not hold, then Euclidean
distance methods fail to work effectively. Similarly, [102] argued that assuming a Gaussian
feature distribution is an effective means to normalize features for a similarity search.
Hence, it is easy to recognize the importance of a Gaussian-like feature distribution in
order to ensure the best possible results for Euclidean distance-based similarity measures.
Therefore, we have set up an experiment that checks the real feature space distribution
with the help of goodness-of-fit tests. Various goodness-of-fit tests are reviewed in
[12], [94] and [24].

In order to test the real distribution of our features we use the Kolmogorov-Smirnov
(KS) test [67], which makes no assumption about the data distribution; thus it is non-
parametric and can be applied to any kind of distribution. This is a clear advantage
over other tests such as the χ2-test, which assumes that the sampling error
between the two empirical density functions has a normal (Gaussian) distribution. Each
goodness-of-fit test involves the examination of a random sample from a dataset against a
known underlying distribution in order to test the null hypothesis. The null hypothesis
states that the unknown distribution function is of the same kind as a known, specified one.
In the following paragraph we will shortly summarize the most important aspects of the
KS-test.
Kolmogorov-Smirnov Test Let F1(x) be the theoretical cumulative distribution8 to
be tested, which must be fully specified. Further, F2(x) is the empirical cumulative
distribution describing the distribution of the data to be tested. For the actual test we
have to state the null hypothesis H0 and the alternative hypothesis H1:

    H0 : f1(x) = f2(x),  −∞ < x < ∞,    (4.58)
    H1 : f1(x) ≠ f2(x),  −∞ < x < ∞,    (4.59)
with f1(x) as the sample probability density function. Then the probability distribution
functions F1(x) and F2(x) are given by:

    F1(x) = ∑_{x=1}^{N1} f1(x),    (4.60)
    F2(x) = ∑_{x=1}^{N2} f2(x),    (4.61)
where N1 and N2 are the number of elements for F1(x) and F2(x), respectively. Moreover,
the intervals for both probability distribution functions are normalized by the number of
samples to the range [0, 1]. Then the test statistic Dn is given by:

    D_n = sup_x |F2(x) − F1(x)|,    (4.62)

where Dn denotes the maximum difference between the observed empirical distribution
F2(x) and the theoretical cumulative distribution F1(x). Thus, we obtain a measure of
difference between the theoretical model and the experimental data. Finally, the KS-test
determines for a specified level of significance α whether the null hypothesis is rejected or
fails to be rejected. For Dn < D^α_n the null hypothesis fails to be rejected and for Dn ≥ D^α_n
it is rejected. The critical values D^α_n can be found in statistical tables.
Experiment The null hypothesis H0 for our experiment is that our feature space X
has a standard normal distribution, that is, a normal distribution with zero mean and
a variance equal to one. The alternative hypothesis H1 is that X does not follow that
distribution. We chose a level of significance of α = 0.05. Then, the KS-test results in
a decision value of one if we can reject the null hypothesis that X follows a standard normal
distribution, or in a decision value of zero if we cannot reject it.

8Cumulative distribution functions (CDF) have the property of being monotone increasing and continuous from the right. Additionally, all CDFs satisfy lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
We decided to test the feature space distribution of the histograms derived from the
angle-based Euclidean distance matrix, defined in Equation 4.12. The 36-bin histogram
Hang was defined in Section 4.2.1. The feature space X consists of 2662 feature vectors,
obtained from the Caltech image database (see Section 5.1.3). The distribution of the
feature space X corresponds to F2(x) in Equation 4.62 and F1(x) is a standard normal
distribution. We obtain a KS-test decision value of zero. Hence, we cannot reject the null
hypothesis H0 for our feature space X. That means that the tested feature space X follows
a standard normal distribution.
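The computation of the KS statistic of Equation 4.62 against a standard normal distribution can be sketched as follows. This numpy-based sketch is not the tool used in the thesis, and the sample data below are synthetic:

```python
import numpy as np
from math import erf, sqrt

def ks_statistic_normal(x):
    # KS statistic D_n of a sample against the standard normal
    # distribution (Equation 4.62): maximum deviation between the
    # empirical CDF and the theoretical CDF
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cdf = np.array([0.5 * (1.0 + erf(t / sqrt(2.0))) for t in x])
    ecdf_hi = np.arange(1, n + 1) / n   # empirical CDF at each sample point
    ecdf_lo = np.arange(0, n) / n       # empirical CDF just below it
    return max(np.abs(ecdf_hi - cdf).max(), np.abs(ecdf_lo - cdf).max())

rng = np.random.default_rng(0)
d = ks_statistic_normal(rng.standard_normal(1000))   # small for normal data
```

The statistic d is then compared against the critical value D^α_n for the chosen level of significance.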
4.6 Summary
In this chapter we have derived a structure-based feature extraction method. The method
determines relations on a global and local scale. The method is capable of representing
the global structure of an image, as well as the local structure of perceptual groups and
their connectivity. The advantage of the method is the invariance towards changes in
illumination, similarity transformations (rotation, translation and isotropic scaling) and
slight changes of the viewing angle, as shown in Chapter 5.
Next, we have compared eight similarity measures using the global and local features.
The measures have been applied to images of the ancient Watermark database (see
Section 5.1.1). The retrieval results reveal that on average the intersection measure
performs best for our features. We have also discussed the normalization of our features
in Section 4.4. The chapter was concluded with a discussion of the feature space
representation, for which we have used the Kolmogorov-Smirnov test in order to test the
real distribution of our features.
In the next chapter we present classification and image retrieval results for several
applications.
Chapter 5
Applications
In this chapter we present several image classification and content-based image retrieval
applications. First we will describe the datasets. Then we will discuss the performance
measures we use for the evaluation of our results. Since support vector machines
(SVM) are used for the classification tasks, we will shortly review their fundamental
properties. In the sequel, we will apply our previously introduced features to the following
problems:
• Ancient watermark retrieval and classification (Filigrees)
• Color image retrieval (Corel database)
• Object class recognition and retrieval (Caltech database)
• Texture class recognition (Brodatz collection)
For each application we present a comprehensive evaluation of the results with the
performance measures defined in Section 5.2, followed by a detailed discussion of the
results. Various sample retrieval results give the reader a visual understanding of our
results. The presentation of the results is completed by an invariance analysis of our
features under various image transformations.
5.1 Data Description
5.1.1 Ancient Watermark Images
Watermarks in paper have been in use in medieval Europe since 1282. Watermarks
can be understood as an ancient form of a copyright signature. In fact, watermarks served
as a mark of the paper mill that made the sheet. Thus, a watermark served as a unique
identifier and as a quality label1. Nowadays, scientists from the International Association
of Paper Historians (IPH) try to identify unique watermarks in order to trace the evolution
of commercial and cultural exchanges between cities in the Middle Ages [55].
The physical creation of watermarks2 is described in detail in [28] as follows:
A watermark is formed by the attachment of a wire design to the mesh surface
of the papermakers’ mould. During paper production the paper pulp is scooped
from a vat onto the surface of the mould and the excess water is allowed to
drain away through the mesh. The wire watermark, which sits proud of the
mesh, reduces the density of fibers deposited on that area of the mould, and
when the finished sheet is viewed with transmitted light, the area where the wire
had been present is thinner and appears lighter than the remainder of the sheet.
The interest of the IPH lies specifically in the determination of the origin and date of
creation of an unknown ancient paper by comparison with known documents bearing a
similar watermark signature [114]. It is then possible to determine whether this unknown
paper comes from the same region and approximately the same period as the reference
watermark. The internationally known Swiss Paper Museum in Basel houses thousands of
images of historical papers as well as ancient watermarks. There are approximately 600,000
known watermarks and their number is steadily growing. Note that the variety of their
shapes and forms is abundantly large; early watermarks often exhibit sanctuary
symbols, geometric figures, animals, crescents, etc. The Swiss Paper Museum in Basel has
provided us with a subset of their digital Watermark database. The database used in the
subsequent experiments consists of about 1800 images (see Table 5.1). Figure 5.1 shows
scanned sample watermark images. A detailed description of the scanning setup can be
found in [112]. In fact, the watermarks are digitized from the original sources. Specifically,
each ancient document has to be scanned three times (front, back and by transparency)
in order to obtain a high-quality digital copy, where the last scan contains all necessary
information [112]. A semi-automatic method, described in [112], delivers the final
images. The method incorporates global contrast and contour enhancement and grey-level
inversion. Figure 5.3 shows sample images after the method was applied. Subsequently,
these images are used to extract descriptors and features (described in Chapter 4)
for retrieval or classification problems.
1Modern watermarks serve as security identifiers, such as on banknotes or important documents.
2The actual shape of the watermarks can be seen when the paper is held up to the light.
Figure 5.1: Samples of scanned ancient watermark images (courtesy Swiss Paper Museum, Basel).
Figure 5.2: Sample filigrees of each class from the Watermark database after enhancement and binarization (see [112]). The classes are according to Table 5.1 (starting from top left): Eagle, Anchor1, Anchor2, Coat of Arms, Circle, Bell, Heart, Column, Hand, Sun, Bull Head, Flower, Cup and Other objects.
Figure 5.3: Sample filigrees from the Watermark database after enhancement and binarization (see [112]). Each of the four rows shows watermarks from the same class, namely Heart, Hand, Eagle and Column. The samples show the large intra-class variability within the Watermark database.
Table 5.1: The classes of the Watermark database.

Class Nr.   Class-Name            # of Class-Members
1           Eagle                 322
2           Anchor1               115
3           Anchor2               139
4           Coat of Arms          71
5           Circle                91
6           Bell                  44
7           Heart                 197
8           Column                126
9           Hand                  99
10          Sun (various types)   33
11          Bull Head             14
12          Flower                31
13          Cup                   17
14          Other objects         416
Total                             1715
5.1.2 Corel Databases
We used two databases from the Corel image collection. The first database is a widely
used subset of the Corel collection. We utilize this image collection since several
results have been published for it [82], [143] and [25]. The dataset consists of 1,000 images
in 10 classes with 100 color images each and can be downloaded at [22]. The 10 image classes
are Africa, Beach, Buildings, Buses, Dinosaurs, Flowers, Elephants, Horses, Food and
Mountains. Although the database size is quite small by today's standards, it has become
the de facto benchmark for CBIR applications. Note that the class labels were assigned by
Corel and were not altered by us. The second dataset contains 10,000 images and was
used for a further evaluation of our results. The collection was created by [87] with images
from different publicly available image databases, i.e. the dataset has a large variation of
image formats and image acquisition settings. The set is made up of 25 different
classes with 400 color and gray-scale images each. For a better understanding of the
databases used in the experiments, Fig. 5.4 shows some sample images taken from the two
image databases.
5.1.3 Caltech Database
We decided to use the Caltech database [32] since several authors have reported results for
this image collection or for subsets of it. In addition, the Caltech image set exhibits various
Figure 5.4: Sample images from the two Corel databases, where the images in the first row are randomly taken from the 1,000-image set and the images in the second row from the 10,000-image database.
Figure 5.5: Two sample images of each class from the Caltech database.
different classes and, thus, is suitable for classification and retrieval tasks. In detail, we
use five classes for our experiments: airplanes (class 1), cars (class 2), faces (class
3), leaves (class 4) and motorbikes (class 5). In total the dataset consists of 2662 images,
where each class contains a different number of members. Figure 5.5 shows two sample
images of each class. It can be seen that the background varies heavily from image to
image. In addition, many images differ significantly in their sizes. In order to assure
comparable results we did not alter, remove or modify any images.
5.1.4 Brodatz Database
The Brodatz database [11] consists of 13 textures, shown in Figure 5.6. Each 512×512
image is digitized under six different rotations (0°, 30°, 60°, 90°, 120° and 150°). Moreover,
every image is subdivided into 16 non-overlapping images of size 128×128. Thus, the
whole database consists of 1248 texture samples belonging to 13 classes with 96 members
each.

The 13 classes contain the following textures: Bark (class 1), Brick (class 2), Bubbles
(class 3), Grass (class 4), Leather (class 5), Pigskin (class 6), Raffia (class 7), Sand (class
8), Straw (class 9), Water (class 10), Weave (class 11), Wood (class 12) and Wool (class 13).
5.2 Performance Evaluation
For the results presented in this thesis we use several performance measures. In the follow-
ing two subsections we will describe measures for classification and content-based image
retrieval tasks.
Figure 5.6: Sample images from the 13 classes of the Brodatz image database.
Table 5.2: Sample confusion matrix for a two-class problem.

                            Predicted Class
                         Negative   Positive
Actual Class  Negative     CM1        CM2
              Positive     CM3        CM4
5.2.1 Measures for Classification
It is common to represent classification results by a so-called confusion matrix. Such a
matrix typically is of size N × N, where N specifies the number of classes, and displays the
actual and predicted class labels obtained by a classifier. In order to describe the properties
of a confusion matrix we assume N = 2, which means we consider a two-class problem.
Table 5.2 shows a schematic representation of a two-class confusion matrix, where each
column and row shows the predicted and actual class instances, respectively.

More precisely, CM1 and CM4 display the number of correct predictions for negative
and positive cases, respectively. Similarly, CM2 and CM3 display the number of incorrect
predictions for negative and positive cases, respectively. In order to evaluate the content
of a confusion matrix the following measures can be derived:
• True Positive rate (TP)3: Fraction of positive cases correctly classified as positive.
• False Positive rate (FP): Fraction of negative cases incorrectly classified as positive.
• Accuracy (AC): Fraction of all classifications that were correct.
• True Negative rate (TN): Fraction of negative cases correctly classified as negative.
• False Negative rate (FN): Fraction of positive cases incorrectly classified as negative.
• Precision (P): Fraction of positively classified cases that are actually positive.
• Geometric Mean1 (G-Mean1) and Geometric Mean2 (G-Mean2) [71]: Geometric
mean of TP with P and of TP with TN4.
• F-Measure (F) [79]: Weighted combination of true positive rate and precision:

    F = ((β² + 1) · P · TP) / (β² · P + TP).

3Sometimes referred to as Recall.
4Note that the G-Mean is high when TP and TN are large and their difference is small.
5.2 Performance Evaluation 77
With respect to the two-class confusion matrix shown in Table 5.2 we can define the above
measures as follows:

    TP = CM4 / (CM3 + CM4),    (5.1)
    FP = CM2 / (CM1 + CM2),    (5.2)
    AC = (CM1 + CM4) / (CM1 + CM2 + CM3 + CM4),    (5.3)
    TN = CM1 / (CM1 + CM2),    (5.4)
    FN = CM3 / (CM3 + CM4),    (5.5)
    P = CM4 / (CM2 + CM4),    (5.6)
    G-Mean1 = √(TP · P),    (5.7)
    G-Mean2 = √(TP · TN),    (5.8)
    F = ((β² + 1) · P · TP) / (β² · P + TP),    (5.9)
with β ∈ [0, 1], where β = 1 means that TP and P are weighted equally strongly. Thus, the
F-measure can be seen as the combination of precision and recall in a single expression,
where β indicates the relative importance of recall and precision. All measures can also
be computed for multi-class problems. In general, for classes with different numbers of
members a simple averaging might distort the performance measures. Therefore, it is
advisable to take the class sizes into account. This can be accomplished by a weighted
averaging with the class sizes as weights:
    M_w = ( ∑_{i=1}^{N} w_i · M_i ) / ( ∑_{i=1}^{N} w_i ),    (5.10)

where M_i may be any measure from Equations 5.1 - 5.9.
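The measures of Equations 5.1 - 5.9 can be computed directly from the four confusion-matrix entries; a sketch with hypothetical counts:

```python
def confusion_measures(cm1, cm2, cm3, cm4, beta=1.0):
    # measures of Equations 5.1 - 5.9 from the entries CM1..CM4
    # of the two-class confusion matrix in Table 5.2
    tp = cm4 / (cm3 + cm4)
    fp = cm2 / (cm1 + cm2)
    tn = cm1 / (cm1 + cm2)
    fn = cm3 / (cm3 + cm4)
    ac = (cm1 + cm4) / (cm1 + cm2 + cm3 + cm4)
    p = cm4 / (cm2 + cm4)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "AC": ac, "P": p,
            "G-Mean1": (tp * p) ** 0.5, "G-Mean2": (tp * tn) ** 0.5,
            "F": (beta ** 2 + 1) * p * tp / (beta ** 2 * p + tp)}

m = confusion_measures(50, 10, 5, 35)   # hypothetical confusion matrix
```

Note that TP and FN (and likewise TN and FP) sum to one, since they partition the positive and negative cases, respectively.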
5.2.2 Measures for Content-based Image Retrieval
In order to evaluate CBIR results we use precision and recall graphs. Precision, the
likelihood that a retrieved image is relevant, and recall, the likelihood that a relevant
image has been retrieved, are defined as follows:

    Recall = (retrieved and relevant images) / (all relevant images in the database),    (5.11)
    Precision = (retrieved and relevant images) / (number of retrieved images).    (5.12)
In most situations, the number of class members will vary. Then it is difficult to compare
precision and recall numbers between different classes or databases. In these cases the
average rank5 is not meaningful enough, since the database size N and the number of
relevant images Nr may differ. Thus, a quantity that normalizes by N and Nr is necessary.
That is the normalized average rank introduced in [95] and defined as follows:

    Rank = (1 / (N · Nr)) · ( ∑_{i=1}^{Nr} R_i − Nr(Nr − 1)/2 ),    (5.13)
where R_i is the (class-wise) averaged rank. In contrast to many CBIR measures, Rank = 0
indicates a perfect performance. The closer Rank approaches 1, the worse the performance
gets.
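Equation 5.13 can be sketched as follows; the rank lists are hypothetical, and ranks are assumed to start at 0, which makes the best case evaluate exactly to zero:

```python
def normalized_average_rank(ranks, n_database):
    # Equation 5.13: 0 indicates perfect retrieval, values near 1 the
    # worst case; `ranks` are the ranks of all relevant images
    n_r = len(ranks)
    return (sum(ranks) - n_r * (n_r - 1) / 2) / (n_database * n_r)

# best case: all 4 relevant images are retrieved first (ranks 0..3)
best = normalized_average_rank([0, 1, 2, 3], n_database=1000)
```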
5.3 Support Vector Machines
During the last years support vector machines (SVM) have become a very popular technique
for classification problems. The basic concept was introduced in [139] and triggered a large
amount of fundamental findings and practical applications, such as in the field of image
classification [48], [21]. For a detailed discussion we refer the interested reader to [140],
[16] or [123]. To date, the distribution of various easy-to-use software packages [20], [118]
has helped to spread the SVM technique to fields as different as computer science, biological
and medical research, engineering and economics.
In contrast to clustering algorithms, which are unsupervised learning methods (see
Chapter 3), SVMs belong to the supervised learning techniques. That implies the usage of
training and testing data, where the former serves as input for the learning function and
the latter is used to test the generalization ability of the classifier towards unseen data
samples. In fact, the learning algorithm generates a model that should be able to
accurately map input objects to desired outputs6. Next, we will shortly summarize the
basics of SVMs.
5Simple average over the ranks of all relevant images.
6For most classification tasks the output is a class label.
In detail, a support vector machine is based on separating hyperplanes
(w · x) + b = 0, (5.14)
with w ∈ R^n and b ∈ R, where the optimal hyperplane can be found by margin maximization.
Assume a set of labeled feature vectors (x_i, y_i), i = 1, 2, ..., k, with x_i ∈ R^n and
y_i ∈ {−1, +1}. The SVM calculates the optimal hyperplane that separates the
feature vectors x_i with label y_i = +1 from the feature vectors x_j with label y_j = −1. The
resulting hyperplane is optimal in the sense of a maximal margin with respect to both
classes. In fact, a hyperplane can be described by just a few feature vectors, the so-called
support vectors. Moreover, the hyperplane is parameterized by a vector w, normal to the
hyperplane, and an offset value b. The actual classification is performed by:
y = \mathrm{sgn}(w^T x + b). (5.15)
Note that the decision function depends only on dot products between patterns. Next, we
will briefly show the calculation of the hyperplane for a simple linearly separable two-class
problem.
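The decision rule of Equation 5.15 is a one-liner; the hyperplane parameters below are illustrative values, not learned ones:

```python
import numpy as np

def svm_decision(w, b, x):
    """Linear SVM decision rule y = sgn(w^T x + b) (Equation 5.15)."""
    return int(np.sign(np.dot(w, x) + b))

# Hypothetical hyperplane x1 + x2 = 0, separating the plane along its diagonal.
w, b = np.array([1.0, 1.0]), 0.0
print(svm_decision(w, b, np.array([2.0, 1.0])))    # -> 1
print(svm_decision(w, b, np.array([-1.0, -2.0])))  # -> -1
```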
Calculating the hyperplane: In order to compute the hyperplane we normalize the
parameters w and b such that

\min_i |w^T x_i + b| = 1. (5.16)

Thus, the distance from any feature vector x to the hyperplane can be computed as

z = \frac{|w^T x + b|}{\|w\|}. (5.17)

Specifically, the margin measured perpendicularly to the hyperplane is given by
\frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}. Finally, the following optimization problem has to be solved:

Minimize \frac{1}{2}\|w\|^2, (5.18)

subject to y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, k. (5.19)
This quadratic optimization problem can be solved by the Lagrangian formalism
\max_{\lambda \ge 0} \min_{w,b} L(w, b, \lambda), with

L(w, b, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{k} \lambda_i \left[ y_i (w^T x_i + b) - 1 \right], (5.20)
with λ_i ≥ 0. In words, L(w, b, λ) has to be minimized with respect to w and b, and at the same
time the derivatives of L(w, b, λ) with respect to λ_i have to vanish, under the constraints
λ_i ≥ 0. Note that Equation 5.20 states a convex quadratic programming problem with
a convex objective function. Thus, it is permitted to equivalently solve the so-called dual^7
problem, that is, to maximize L(w, b, λ)^8. The solution is characterized by a subset of all
training patterns: all patterns with non-zero λ_i, i.e. the support vectors, form
the optimal solution. Thus, only the closest patterns contribute to the determination of
the optimal hyperplane; all other training patterns do not contribute to the
solution.
The non-linearly separable case
The next step is to generalize the SVM method to non-linear decision functions. Observe
that the training patterns are only represented as dot products in the dual picture of
Equation (5.20). The idea is to first map all data patterns into a Hilbert space H, such that
there exists a mapping

Φ : x ∈ R^n ↦ z ∈ R^m,

from the input feature space to an m-dimensional space.
In the Hilbert space the input patterns appear in the training algorithm only in the form of
dot products. Assume a function K(x, z) = Φ(x) · Φ(z) exists and call it a kernel
function. Then we only need to use K in the learning algorithm. In fact, Mercer's
theorem (see Appendix 1) states that positive definite kernels can be represented by a set
of basis functions, which makes it possible to learn in the feature space without explicit
knowledge of Φ. Any Mercer kernel^9 describes a scalar product in a high-dimensional space.
Thus, we simply exchange the scalar product for kernel functions K(x, z). Commonly used
kernels K(x, z) are:
^7 Often referred to as the Wolfe dual problem.
^8 The final dual optimization problem can be written as: \max_{\lambda} \sum_{i=1}^{k} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j, subject to \sum_{i=1}^{k} \lambda_i y_i = 0, \; \lambda \ge 0.
^9 Each kernel function with the properties of Theorem 1.
• Linear
K(x, z) = xTz. (5.21)
• Polynomials
K(x, z) = (xTz + 1)q, q > 0. (5.22)
• Radial Basis Functions
K(x, z) = \exp\left( -\frac{\|x - z\|^2}{\sigma^2} \right). (5.23)
• Histogram Intersection
K(x, z) = \sum_{i=1}^{n} \min(x_i, z_i). (5.24)
• Hyperbolic Tangent
K(x, z) = \tanh(\beta x^T z + \gamma), for \beta, \gamma such that Mercer's conditions (Theorem 1) are satisfied. (5.25)
Thus, under consideration of Mercer's theorem, the whole formalism of the linearly
separable case carries over to the non-linearly separable problem.
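The kernels of Equations 5.21-5.24 translate directly into code; the two vectors below are arbitrary example histograms, not features from the text:

```python
import numpy as np

def k_linear(x, z):                      # Equation 5.21
    return float(x @ z)

def k_poly(x, z, q=2):                   # Equation 5.22
    return float((x @ z + 1) ** q)

def k_rbf(x, z, sigma=1.0):              # Equation 5.23
    return float(np.exp(-np.sum((x - z) ** 2) / sigma ** 2))

def k_hist_intersection(x, z):           # Equation 5.24
    return float(np.sum(np.minimum(x, z)))

x = np.array([0.2, 0.5, 0.3])
z = np.array([0.1, 0.6, 0.3])
for k in (k_linear, k_poly, k_rbf, k_hist_intersection):
    print(k.__name__, round(k(x, z), 4))
```

As quick sanity checks, all four are symmetric in their arguments, and K(x, x) = 1 for the RBF kernel.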
The non-separable case
In general, a data set may consist of non-separable patterns. Then applying the above
algorithm will not yield a correct solution. Outliers may be caused by noise and
should not have a large influence on the resulting hyperplane. A possibility to overcome
these influences is to "soften" the margin, i.e. to allow misclassification of training samples.
For this purpose, so-called positive slack variables ξ_i, i = 1, 2, ..., k, can be introduced in the constraints, which are then relaxed as:
w^T x_i + b ≤ −1 + ξ_i, ∀ y_i = −1, (5.26)
w^T x_i + b ≥ +1 − ξ_i, ∀ y_i = +1, (5.27)
ξ_i ≥ 0, ∀ i. (5.28)
Thus, we allow misclassification, which should in turn be penalized. Therefore, a penalty
parameter C is introduced in the convex optimization problem.
The final optimization problem can be written in the Wolfe dual representation according
to [16] as:

Maximize: L_D(\lambda) = \sum_{i=1}^{k} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j; (5.29)

subject to 0 \le \lambda_i \le C, (5.30)

\sum_{i=1}^{k} \lambda_i y_i = 0, \; \lambda \ge 0. (5.31)
C is a positive parameter that controls the penalization of outliers; the lower C, the more
outliers are allowed. The final solution turns out to be

w = \sum_{i=1}^{N_s} \lambda_i y_i x_i, (5.32)

where N_s denotes the number of support vectors. Note that λ_i now has an upper bound of
C^10.
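To make the dual formulation concrete, the following sketch maximizes L_D of Equation (5.29) by projected gradient ascent under the box constraint (5.30). As a simplification that is our own and not from the text, the bias b is dropped, which removes the equality constraint (5.31); the toy data are hypothetical:

```python
import numpy as np

def train_dual_svm(X, y, C=1.0, lr=0.01, n_iter=2000):
    """Projected gradient ascent on the dual objective of Eq. (5.29).

    Simplification (our assumption): no bias term, so only the box
    constraint 0 <= lambda_i <= C of Eq. (5.30) is enforced, by clipping.
    """
    K = X @ X.T                              # linear kernel matrix
    lam = np.zeros(len(y))
    for _ in range(n_iter):
        grad = 1.0 - y * (K @ (lam * y))     # dL_D / dlambda_i
        lam = np.clip(lam + lr * grad, 0.0, C)
    return lam

def predict(X_train, y, lam, x):
    """Decision value sum_i lambda_i y_i K(x_i, x), cf. Eq. (5.32)."""
    return int(np.sign(np.sum(lam * y * (X_train @ x))))

# Hypothetical toy data, linearly separable through the origin.
X = np.array([[2., 2.], [1., 2.], [2., 1.], [-2., -2.], [-1., -2.], [-2., -1.]])
y = np.array([1., 1., 1., -1., -1., -1.])
lam = train_dual_svm(X, y, C=10.0)
print([predict(X, y, lam, x) for x in X])   # -> [1, 1, 1, -1, -1, -1]
```

Only the patterns closest to the hyperplane end up with non-zero λ_i, matching the support-vector interpretation above.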
In the remainder of this chapter we will use SVMs for the classification of images.
5.4 Filigrees
In this section we discuss the application of the structure-based features (see Chapter 4)
to the ancient Watermark database (see Section 5.1.1). First, we review related work.
Then we present the retrieval results, followed by the classification of the 14 watermark
classes.
5.4.1 Related Work
To date, there have been attempts to classify and retrieve watermark images by both
textual and content-based approaches. Del Marmol [88] classified ancient watermarks by
their date of creation. Briquet [10] manually classified about 16,000 watermarks into about
80 textual classes. Purely textual classification systems can be error prone: watermark
labels and/or textual descriptions may be very old, erroneous or simply not detailed enough.
Moreover, many watermarks are not labeled and thus cannot be used for textual queries.
Therefore, more recent attempts have been undertaken that focus on the actual content
of watermark images. In [113] the authors describe the set-up of a system capable of adding,
editing and removing watermarks, with the possibility of textual and content-based retrieval.
As described in the paper, the authors used a 16-bin circular histogram computed
^10 The upper bound is the only difference from the optimal-hyperplane case.
around the center of gravity of each watermark image. In addition, eight directional filters
were applied to each image and used as a feature vector. The algorithms were tested on
a watermark database consisting of 120 images, split up into 12 different classes. The
system achieved a probability of 86% that the first retrieved image belongs to the same
class as the query image. A different approach was taken by the authors in [115] and [78]
who used three sets of various global moment features and three sets of component-based
features. The latter feature set consists of several shape descriptors which are extracted
from various image regions. For testing purposes, the authors selected a database consisting
of 806 tracings of watermarks from the Churchill collection. The authors set up a retrieval
system with the city-block distance as discrimination measure. The system was evaluated
with normalized and averaged precision (Pn) and recall (Rn) scores. For this purpose, 15 images
and their ground truths were manually selected. The authors report a precision of 0.53
and a recall of 0.81 for their best features, obtained from the original grey-level images. By
introducing a threshold^11 the results could be improved. Finally, the authors conclude that
for their database, global features work better than the component-based (local) features
extracted from various regions.
5.4.2 Retrieval Results
In this subsection we show the results obtained from the medium size set of filigree images.
The usage of line segments and the computation of global and local features of their
arrangements succeeded in a highly performant method, as the results show.
Features
We use the global and local features as described in Chapter 4. The feature set
{H_global, H_local} is normalized according to Equation 4.54. The features are computed offline
for the complete database; at retrieval time, only the feature vector of the query image has
to be computed. The retrieval results are obtained with the intersection similarity measure
(see Section 4.3).
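The retrieval step described above can be sketched as follows; the 4-bin histograms are hypothetical stand-ins for the normalized feature set {H_global, H_local}:

```python
import numpy as np

def intersection_similarity(h_q, h_d):
    """Histogram intersection (see Section 4.3); equals 1 for identical
    L1-normalized histograms."""
    return float(np.sum(np.minimum(h_q, h_d)))

def retrieve(h_query, database):
    """Rank the (precomputed) database histograms by decreasing similarity."""
    sims = [(name, intersection_similarity(h_query, h)) for name, h in database.items()]
    return sorted(sims, key=lambda t: -t[1])

# Hypothetical precomputed feature histograms for three database images.
db = {
    "anchor_a": np.array([0.4, 0.3, 0.2, 0.1]),
    "anchor_b": np.array([0.35, 0.35, 0.2, 0.1]),
    "circle_c": np.array([0.1, 0.1, 0.4, 0.4]),
}
query = np.array([0.4, 0.3, 0.2, 0.1])   # identical to anchor_a
print(retrieve(query, db))
```

The identical match scores exactly 1, mirroring the behavior visible in the retrieval figures of this section.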
First, we show several retrieval results in order to give a better impression of the
watermark images. The results show that most classes exhibit a high intra-class variability.
The retrieval results are then followed by a thorough performance analysis (see
Section 5.2).
Figure 5.7 shows a set of 20 watermark images. The first image is the query, the second
one is the identical match, indicated by the 1 above the image. The subsequent images are
sorted in decreasing similarity, as indicated by the numbers above each image. As the
numbers indicate, most images are very similar. It is interesting to observe that most of
^11 The authors do not explain how the threshold value was obtained.
[Figure 5.7 panel: the query image and 19 retrieved images with similarity scores from 1 (identical match) down to 0.9325.]
Figure 5.7: Sample retrieval result obtained with our structure-based features (see Section 4.2) from the class Anchor1 of the Watermark image database.
the retrieved anchors show the same orientation, except for the last image in the second-to-last
row. A closer look at the query image reveals that it features a tiny cross atop and
cusp-like structures at the outer endings^12. The retrieved images clearly show that
both of these small-scale structures are present in all but one of the displayed
images.
In Figure 5.8 we can see another retrieval result. The similarity values decrease faster
than in Figure 5.7. A closer look at the retrieved images reveals that
the actual watermarks contain slightly different substructures, such as the segment of a
circle in the query image that is hardly found in the other filigrees. The last image in the
second-to-last row is a mismatch from the class Anchor1. We argue that the anchor's
basic structure is quite similar to the query image in terms of the geometric arrangement
of the line segments.
Figure 5.9 displays a sample query from the class Column. We can see that all images
belong to the query image's class, except the second-to-last image, which is from the class
Heart. Visually, it is not clear to us why the heart image is retrieved before another
column image. However, it is not always possible to translate visual appearance into the
high-dimensional feature space, where the actual image similarity is computed.
In Figure 5.10 we can see a retrieval result from the class Flower. Again, the first
retrieved image is identical with the query image. Although the class Flower consists of
only 31 images^13, all displayed images belong to the same class. The good retrieval result
might stem from the smaller intra-class variation in comparison to the other classes. Note
that for the class Flower the classifier^14 obtained a true positive rate of 100%.
Figures 5.11, 5.12 and 5.13 are of special interest. Each figure displays sample retrieval
results, where the query images of Figures 5.11 and 5.12 belong to the class Eagle and
the query watermark of Figure 5.13 is from the class Coat of Arms. It came to our
attention that images from the class Coat of Arms did not perform very well in terms of
the precision-recall graphs displayed in Figure 5.14. A visual inspection of all members of the
classes Coat of Arms and Eagle explained the drop in performance. The first and more
important reason is that eagle motifs are very common in heraldry, i.e. about half of
the members of the class Coat of Arms have some kind of eagle embedded on a shield or
armorial bearings. As a result, eagles and embedded eagles are retrieved at the same time,
as shown in Figure 5.12, where the query filigree belongs to the class Eagle. Among the
retrieved watermarks we can find several images from the class Coat of Arms. The sample
retrieval in Figure 5.13 shows the same phenomenon, just reversed. Here, the query image
^12 Note that the class Anchor1 possesses a large intra-class variation of shapes, i.e. many anchors have no crosses or show very different endings.
^13 The class Flower is the third smallest class of the database. The sizes of the other classes are displayed in Table 5.1.
^14 See Table 5.5.
[Figure 5.8 panel: the query image and 19 retrieved images with similarity scores from 1 (identical match) down to 0.8617.]
Figure 5.8: Sample retrieval result of the class Circle from the Watermark database, using global and local structural features (see Section 4.2).
[Figure 5.9 panel: the query image and 19 retrieved images with similarity scores from 1 (identical match) down to 0.9120.]
Figure 5.9: Sample retrieval result obtained with our structure-based features (see Section 4.2) from the class Column of the Watermark database.
[Figure 5.10 panel: the query image and 19 retrieved images with similarity scores from 1 (identical match) down to 0.9310.]
Figure 5.10: Sample retrieval result of the class Flower from the Watermark database, using our structural features (see Section 4.2).
[Figure 5.11 panel: the query image and 19 retrieved images with similarity scores from 1 (identical match) down to 0.9597.]
Figure 5.11: Sample retrieval result obtained with our structure-based features (see Section 4.2) from the class Eagle of the Watermark database.
[Figure 5.12 panel: the query image and 19 retrieved images with similarity scores from 1 (identical match) down to 0.9533.]
Figure 5.12: Sample retrieval result obtained with our structure-based features (see Section 4.2) from the class Eagle of the Watermark database.
[Figure 5.13 panel: the query image and 19 retrieved images with similarity scores from 1 (identical match) down to 0.9648.]
Figure 5.13: Sample retrieval result of the class Coat of Arms from the Watermark database. As features we have incorporated our global and local structure-based features (see Section 4.2).
belongs to the class Coat of Arms. The second reason we could identify is the difference
in size of the two classes: they vary by approximately a factor of five. Thus, it is more
probable to observe eagles retrieved by queries from the class Coat of Arms (see Figures
5.12 and 5.13) than the other way round (see Figure 5.11^15). A quantitative proof of our
observations can be found in the confusion matrix in Table 5.4, where, for example, 31
images from class four (Coat of Arms) are classified as class one (Eagle). In fact, we
observed similar occurrences for some of the other classes.
^15 The figure shows a sample retrieval of eagles. Note that for this query watermark only filigrees from the same class have been retrieved.
[Figure 5.14: fourteen panels (Eagle, Anchor1, Anchor2, Coat of Arms, Circle, Bell, Heart, Column, Hand, Sun, Bull Head, Flower, Cup, Other objects), each plotting recall (0 to 1) against the number of retrieved images (up to 1500).]
Figure 5.14: Class-wise recall vs. number of images retrieved graphs for the Watermarks.
Table 5.3: Class-wise performance measures for the Watermark database. A detailed description can be found in Section 5.2.2 and in the text.
Measures\Class 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Nr 322 115 139 71 91 44 197 126 99 33 14 31 17 416
Rank .1360 .3215 .1976 .1772 .4508 .2333 .4444 .3639 .1326 .2229 .1261 .0835 .092 .3621
Rank1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
P(10) .7435 .5515 .5414 .2241 .3184 .4194 .4597 .5000 .6886 .1058 .0853 .9167 .4759 .6195
P(30) .6593 .4298 .3565 .1532 .1643 .0968 .3165 .2256 .5744 .0342 .0194 .1391 .0517 .5042
P(N/2) .4916 .2431 .2138 .1435 .1089 .2441 .1730 .0973 .4417 .0684 .1895 .8024 .5560 .2830
P(N) .2228 .0744 .1187 .0643 .0540 .0292 .1157 .0755 .0822 .0319 .0194 .0241 .0517 .2427
P(Best, 10) 1 1 1 .4762 1 1 1 1 1 .2778 .2857 1 .9091 1
P(Best, 30) 1 1 .9091 .2632 .4839 .2206 .9091 .6977 1 .0433 .0253 .4545 .1269 1
P(Best5, 10) 1 1 1 .4015 .9848 .9268 1 1 1 .2250 .1479 1 .8460 1
P(Best5, 30) .9731 1 .8395 .2378 .4309 .1893 .8913 .6478 1 .0406 .0214 .3645 .0842 1
R(N) .4783 .1217 .2590 .1690 .0659 .0909 .1320 .1587 .2828 .0606 .0714 .4839 .235 .3197
R(N/2) .5280 .1391 .3022 .1972 .0879 .1818 .1523 .1905 .2626 .0606 .1429 .7097 .352 .3606
R(Best, N) .6118 .5565 .4245 .3099 .3736 .5227 .4467 .3651 .6667 .2727 .4286 .7742 .647 .5312
R(Best, N/2) .7764 .8596 .6377 .3714 .5778 .8182 .6531 .5556 .9388 .4375 .7143 1 1 .7019
Table 5.3 shows a variety of performance measures (see Section 5.2) for every class.
Note that all measures are class-wise averaged. We can see that the first match is always
identical with the query, indicated by Rank1 being equal to 1. In addition, we list precision
values at various positions; i.e. the value of P(10) for class 1 means that on average
almost 75% of the first 10 retrieved images belong to the first class. Moreover, we present
the precision after 10 and 30 retrieved images for the best query, P(Best, 10) and
P(Best, 30), for every class in order to see the differences between the class-wise average
and the single best query image. We see that for several classes P(Best, 30) is equal to
1, i.e. the first 30 images belong to the query's class and are ranked in decreasing
similarity. P(Best5, 10) and P(Best5, 30) present the average precision of the best five
queries after the first 10 and 30 retrieved images, respectively. The numbers show that we
obtain the highest possible score for several classes. Additionally, we report the averaged
recall R(N) and precision P(N) after N retrieved images, where N is the number of
class members. Finally, R(Best, N) and R(Best, N/2) depict the averaged recall numbers
after N and N/2 retrieved images. We can interpret the values of all measures given in
Table 5.3 as the dynamics of our results. Thus, we agree with the arguments listed in [95]
that only a broad set of performance measures can ensure a thorough evaluation of CBIR
results^16. However, we do observe some classes with worse performance. That is to a large
extent due to the high intra-class variation of the database. Figure 5.3 shows the strong
intra-class variation of several classes. Since CBIR performs a similarity ranking, some
class members may be less similar to a certain query (from the same class) than images
from other classes, as for example images (d) and (e) in Figure 5.17. Both images show eagles,
but do not belong to the same class. Similar observations hold for many other cases.
In Section 5.4.4 we will discuss the class discrimination ability of our structure-based
features for the same image database. The results will not show a similarity ranking based
on the features of one image, but rather the membership of images to a certain class
(based on SVM learning). Before proceeding with the classification, we want to discuss
the problem of partial matching for watermark images.
5.4.3 Partial Matching
Partial matching is the retrieval of image substructures from an image database. The
image retrieval results for the ancient watermark images have shown that various classes
contain similar filigree or sub-filigree, such as for example the Eagle and Coat of Arms
classes. For these classes it makes sense to investigate partial matching. In Section 4.3 we
have motivated the usage of the intersection measure for determining the image similarity.
In addition to its good performance it supports partial matching. Therefore, we cut out an
^16 Note that the analysis of our results largely agrees with the proposed evaluation strategy presented in [95].
arbitrary subregion of a watermark image and compute the same features as for the CBIR
application. Figure 5.15 shows the partial matching result obtained with the intersection
measure. The query image shows a part of an eagle that was cut out from an image of the
Coat of Arms class. The retrieved filigrees mainly contain eagles or parts of eagles that
are similar to the query image.
A second partial matching result is depicted in Figure 5.16, where the query is an eagle's
head wearing a crown. Several of the retrieved images represent eagles, including heads,
and some are wearing a crown. The fourth retrieved image contains only parts of an eagle,
a wing-like structure. It is not obvious why this image is retrieved at rank four. We
might argue that the bins of the wing match the bins of the feather-like structure in
the query image.
The biggest advantage of intersection-based partial matching is its linear complexity,
resulting in fast matching times that allow online applications^17. However, although
the histogram intersection measure produces good partial matching results for watermark
images, there may be cases of worse performance (e.g. strongly cluttered scenes in color
images). Such cases require other approaches, for example a sliding-window based
histogram matching, where the window is of the query image's size. Then it is possible to
extract the features from each window for every database image. The matching complexity
is at least O(M^2), since a matching step has to be performed for every window. In case
of overlapping windows the increase in complexity is even higher. Different implementations
of sliding-window algorithms can be found in [64] and [97].
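A minimal sketch of the sliding-window alternative described above, using plain gray-level histograms instead of the structural features; the image, window size and step are hypothetical:

```python
import numpy as np

def gray_hist(patch, n_bins=8):
    """L1-normalized gray-level histogram of an image patch."""
    h, _ = np.histogram(patch, bins=n_bins, range=(0, 256))
    return h / h.sum()

def best_window_match(img, query, n_bins=8):
    """Slide a query-sized window over the image and return the best
    intersection score with its top-left corner. Non-overlapping steps
    are used here; overlapping steps raise the complexity further."""
    qh, qw = query.shape
    q_hist = gray_hist(query, n_bins)
    best_score, best_pos = -1.0, (0, 0)
    for r in range(0, img.shape[0] - qh + 1, qh):
        for c in range(0, img.shape[1] - qw + 1, qw):
            w_hist = gray_hist(img[r:r + qh, c:c + qw], n_bins)
            score = float(np.sum(np.minimum(w_hist, q_hist)))
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos

# Hypothetical image: mostly dark, with one bright 8x8 block at (8, 16).
img = np.zeros((24, 24))
img[8:16, 16:24] = 200
score, pos = best_window_match(img, img[8:16, 16:24])
print(score, pos)   # -> 1.0 (8, 16)
```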
5.4.4 Classification Results
Previously, we have retrieved similar watermark images. Now we want to learn the
feature distribution of every class in the feature space. Therefore, the classification of the
watermark images is treated as a learning problem. The features of every class are learned
with a support vector machine (see Section 5.3). The classification results are obtained
with leave-one-out tests and SVMs using different kernels. Specifically, we
obtained the best results with the intersection kernel and a cost parameter C = 2^20.
We used the same features as for the retrieval task. The feature vectors have been
normalized to zero mean and unit variance (see Section 4.4). In the following
we present a thorough evaluation of the results with a confusion matrix and various
performance measures.
Table 5.4 shows the confusion matrix obtained from the leave-one-out classification of
the ancient Watermark database. A closer look at the confusion matrix reveals
inter-class misclassifications (false positives), such as between classes one and four. Addi-
[Figure 5.15 panel: the query cut-out and 9 retrieved images with similarity scores from 0.9767 down to 0.9700.]
Figure 5.15: Partial matching result obtained from the Watermark database with our structural features (see Section 4.2), using the intersection similarity measure. The query image resembles a cut-out of a filigree from the class Coat of Arms.
[Figure 5.16 panel: the query cut-out and 9 retrieved images with similarity scores from 0.9739 down to 0.9682.]
Figure 5.16: Partial matching result obtained from the Watermark database with our structural features (see Section 4.2), using the intersection similarity measure. Note that the query resembles the head of an eagle belonging to the class Eagle.
Table 5.4: Confusion matrix for the Watermark database.
Class 1 2 3 4 5 6 7 8 9 10 11 12 13 14 # Members
1 296 0 5 16 1 1 2 0 0 0 0 0 0 1 322
2 3 100 5 0 2 1 2 1 0 0 0 0 1 0 115
3 5 3 121 3 0 1 3 0 0 2 1 0 0 0 139
4 31 0 2 33 0 0 1 1 0 3 0 0 0 0 71
5 0 4 3 1 69 0 6 3 1 0 0 0 0 4 91
6 0 1 0 0 2 34 5 1 1 0 0 0 0 0 44
7 3 2 6 1 9 1 161 4 1 2 1 0 0 6 197
8 2 1 0 0 3 1 10 109 0 0 0 0 0 0 126
9 1 1 1 0 1 1 2 1 91 0 0 0 0 0 99
10 6 0 5 0 0 0 3 1 0 18 0 0 0 0 33
11 1 1 3 0 0 0 1 0 0 0 8 0 0 0 14
12 0 0 0 0 0 0 0 0 0 0 0 31 0 0 31
13 0 1 0 0 0 0 2 0 0 0 0 0 14 0 17
14 0 0 0 0 1 0 1 0 0 0 0 0 0 414 416
tionally, classes seven (Heart) and five (Circle) show a clear tendency of being confused
with each other. We manually identified the misclassified watermarks and found
that some of the circle filigrees contain largely identical substructures.
Table 5.5 shows the class-wise true and false positive rates obtained with a leave-one-out
test. We can see that a high recognition rate is achieved for most of the classes;
in total, a true positive rate of 87.41% was achieved. However, classes four, ten and
eleven show lower recognition rates. A possible explanation was discussed earlier in Subsection
5.4.2. A detailed visual inspection shows that many members of various classes possess
a high inter-class similarity in terms of their real content^18. All classes exhibit a large
intra-class variability. If we reassigned all eagles from the Coat of Arms class to the Eagle
class^19, the true positive rate would exceed 90%.
A single measure is not sufficient to validate the content of a confusion matrix [77].
Influences of an imbalanced class distribution may be reflected in just a single quantity,
whereas multiple measures tend to be more robust and significant, guaranteeing
a concise evaluation of the results [71]. Therefore, in order to assess the classification
results in detail and without bias, Table 5.6 displays various performance measures. It is easy
^18 In Figure 5.17 we show a few representative examples of watermarks with an ambiguous class membership.
^19 A visual check of all 31 misclassified Coat of Arms watermarks revealed that all of them feature a kind of eagle.
Table 5.5: Class-wise true positive (TP) and false positive (FP) rates for the Watermark database. The first column indicates the correctly classified images over the total number of class members, the second column shows the TP rate in [%], column three represents all FPs obtained, and column four gives the FP rate in [%].
Class True Positive False Positive
1 296/322 91.93 52/1393 3.733
2 100/115 86.96 14/1600 0.875
3 121/139 87.05 30/1576 1.904
4 33/71 46.48 21/1644 1.277
5 69/91 75.82 19/1624 1.17
6 34/44 77.27 6/1671 0.3591
7 161/197 81.73 38/1518 2.503
8 109/126 86.51 12/1589 0.7552
9 91/99 91.92 3/1616 0.1856
10 18/33 54.55 7/1682 0.4162
11 8/14 57.14 2/1701 0.1176
12 31/31 100 0/1684 0
13 14/17 82.35 1/1698 0.0589
14 414/416 99.52 11/1299 0.8468
Total: 1499/1715 87.41 216/1715 12.59
(a) Class Heart (b) Class Heart (c) Class Coat of Arms
(d) Class Coat of Arms (e) Class Coat of Arms (f) Class Circle
(g) Class Heart
Figure 5.17: The seven images show examples of watermarks with ambiguities with respect to their ground-truth class membership and their real content. The labels below each image show the watermark class. Image 5.17(a) belongs to the class Heart, although there is just a tiny heart in the center of the watermark; in fact, it looks more like a Coat of Arms. A similar argumentation holds for image 5.17(c); note the embedded eagle. Specifically, 5.17(c) and 5.17(d) were classified as Eagle.
Table 5.6: Detailed performance measures for the Watermark database. The measures are explained in Section 5.2.2.
Class Accuracy TN FN P G-Mean1 G-Mean2 F   (all measures in [%])
1 95.45 96.27 8.07 85.06 88.42 94.07 88.36
2 98.31 99.12 13.04 87.72 87.34 92.84 87.34
3 97.20 98.10 12.95 80.13 83.52 92.41 83.45
4 96.56 98.72 53.52 61.11 53.30 67.74 52.80
5 97.61 98.83 24.18 78.41 77.11 86.57 77.09
6 99.07 99.64 22.73 85.00 81.04 87.75 80.95
7 95.69 97.50 18.27 80.90 81.31 89.26 81.31
8 98.31 99.24 13.49 90.08 88.28 92.66 88.26
9 99.36 99.81 8.08 96.81 94.33 95.79 94.30
10 98.72 99.58 45.45 72.00 62.67 73.70 62.07
11 99.53 99.88 42.86 80.00 67.61 75.55 66.67
12 100.00 100.00 0 100.00 100.00 100.00 100.00
13 99.77 99.94 17.65 93.33 87.67 90.72 87.50
14 99.24 99.15 0.48 97.41 98.46 99.34 98.45
W. Average 97.64 98.39 12.59 87.12 87.19 92.45 87.13
to recognize that the weighted average accuracy^20 for the presented 14-class problem is
higher than 97%, and that the combined measures G-Mean1 and G-Mean2 score on
average 87.19% and 92.45%, respectively. It is worth noting that the F-measure is as
high as 87.13%, with β = 1.
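The measures of Table 5.6 can be re-derived from the entries of the confusion matrix; below we check them against the class 1 (Eagle) counts of Tables 5.4 and 5.5 (296 true positives, 52 false positives, 26 false negatives, 1341 true negatives):

```python
import math

def class_measures(tp, fp, fn, tn):
    """Per-class measures of Table 5.6. G-Mean1 = sqrt(recall * precision),
    G-Mean2 = sqrt(recall * TN rate), F is the F-measure with beta = 1."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    tn_rate = tn / (tn + fp)
    return {
        "Accuracy": (tp + tn) / (tp + fp + fn + tn),
        "TN": tn_rate,
        "FN": fn / (tp + fn),
        "P": precision,
        "G-Mean1": math.sqrt(recall * precision),
        "G-Mean2": math.sqrt(recall * tn_rate),
        "F": 2 * precision * recall / (precision + recall),
    }

m = class_measures(tp=296, fp=52, fn=26, tn=1341)
print({k: round(100 * v, 2) for k, v in m.items()})
# Matches row 1 of Table 5.6: 95.45, 96.27, 8.07, 85.06, 88.42, 94.07, 88.36
```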
5.4.5 Conclusion
The retrieval and classification of watermark images is of great importance for paper
historians. The results show that structure is a powerful feature for these tasks. The retrieval
results have shown that the presented features work quite well. However, for some classes
the recall versus number of retrieved images plots performed worse. As we have shown, that
is due to two factors. First, the large visual intra-class variability: filigrees of the same class
often look very different, which causes a low precision in the evaluation. Second, some
classes exhibit prominent substructures of others, such as the classes Eagle
and Coat of Arms. Next, we performed a classification of the watermark images. A
support vector machine with an intersection kernel was able to successfully learn the
characteristics of every class. A classification rate (true positive rate) of more than 87% is
an indicator of good performance. The presentation of several performance measures,
^20 Defined in Equation 5.10.
derived from the confusion matrix, gives a complete evaluation of the results. In future
work, we would like to apply the structural features to a larger database of watermarks.
In addition, we would like to reduce the confusion between the classes Eagle and Coat of
Arms.
5.5 Color Image Retrieval
For the current experiments, we use two different real-world color image databases. The
task is to find similar images within the two datasets. The evaluation of the results
obtained from the first dataset consists of a detailed class-wise analysis. Due to limitations
in computational resources and time, we provide a less comprehensive presentation of the
experiments performed with the second dataset, which contains 10,000 images.
The results are evaluated with precision recall graphs and compared with two well
known methods, namely global color histograms and the local invariant feature histogram
algorithm [129]. The color histogram is widely known since the work of [133]. More recent
approaches towards color histograms can be found in [38], [138] and in the survey of [142].
Since color histograms are commonly used in CBIR, they serve well as a comparative
feature. However, in our experiments we have obtained a 32-bin color-histogram from
the YCbCr chrominance channel of each image. The invariant feature histogram method
is adopted due to its good performance as a local texture-based feature. In [25], the
method was compared along with nine other algorithms and performed best for one of
two image databases used in their experiments. The invariant feature histogram method
based on Haar integrals introduced by [124], has shown good image retrieval results [129].
The method constructs invariant features by applying nonlinear kernel functions to color
images and integration over all possible rotations and translations.
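The Haar-integral construction just described can be illustrated with a small sketch. For a single-channel image, a nonlinear kernel (here f(a, b) = sqrt(a · b), one of the kernels commonly used with this method) is averaged over sampled rotations of an offset vector; evaluating the resulting map at every pixel corresponds to the integration over translations. The function name, sampling density and cyclic boundary handling are simplifying assumptions, not the exact implementation of [124] or [129]:

```python
import numpy as np

def invariant_feature_histogram(img, radius=3, n_angles=8, n_bins=32):
    """Simplified sketch of an invariant feature histogram.

    For every pixel x the kernel f(a, b) = sqrt(a * b) is averaged over
    sampled rotations of the offset vector (radius, 0); integrating over
    all translations amounts to evaluating the kernel map at every pixel.
    The histogram of the resulting map is translation-invariant (for
    cyclic shifts) and approximately rotation-invariant.
    """
    img = img.astype(np.float64)
    acc = np.zeros_like(img)
    for k in range(n_angles):
        phi = 2.0 * np.pi * k / n_angles
        dy = int(round(radius * np.sin(phi)))
        dx = int(round(radius * np.cos(phi)))
        shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)  # cyclic translation
        acc += np.sqrt(img * shifted)          # nonlinear kernel f(a, b) = sqrt(a * b)
    feat_map = acc / n_angles                  # group average over sampled rotations
    hist, _ = np.histogram(feat_map, bins=n_bins, range=(0.0, 255.0))
    return hist / hist.sum()                   # normalized feature histogram
```

Because both the kernel map and the histogram ignore pixel positions, a (cyclic) translation of the input leaves the feature histogram unchanged.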
5.5.1 Retrieval Results
For the retrieval task we use the global structure-based features Hglobal as described in
Chapter 4. For the current experiment we are interested in the performance of the global
structure features and whether they are suitable to be combined with local descriptors.
Therefore, we combine the global structure features with block-based features (BBF; [15],
[86] and [87]; the block-based features have resulted from joint work with Prof. Z.M. Lu,
Harbin Institute of Technology, China). The local method (BBF) computes pixel value
distributions of equally sized blocks. Specifically, the method computes three block-based
features, where the first one consists of the higher and lower mean histograms of the
block-based image representation, containing the pixel value distributions above and
below the mean value of the gray-level channel in the YCbCr
Figure 5.18: Sample image retrieval results obtained from the 1,000 image collection, for
(a) an image out of the class Mountain and (b) an image out of the class Beach. As
features we have used the block-based features (see Section 5.5.1) and the global
structure-based features (see Section 4.2.1). The first image represents the query and the
images are arranged in decreasing similarity from left to right, indicated by the numbers
above the images. Note that 1 denotes an identical match with the query image.
Figure 5.19: Sample retrieval results obtained from the 10,000 image collection, for (a) an
image out of the class Buses and (b) an image out of the class Elephants. As features we
have used the block-based features (see Section 5.5.1) and the global structure-based
features (see Section 4.2.1). The first image represents the query and the images are
arranged in decreasing similarity from left to right, indicated by the numbers above the
images, where 1 denotes an identical match with the query image.
color space. The second block-based feature is the binary pattern histogram, where binary
blocks of the same size are matched with a pre-generated codebook of binary patterns.
The final block-based feature is a histogram based on the two chrominance channels Cb
and Cr. Note that here we only use the global structure-based features: in contrast to the
watermark application, we exchange the local structure features H local for the
block-based features. The synthesis of the two feature types results in a robust semantic
image descriptor.
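Our reading of the first block-based feature can be sketched as follows. The 4 × 4 block size follows the conclusion in Section 5.5.3; the bin count, the function name and the use of the global mean of the block representation as threshold are assumptions for illustration, not the thesis' exact choices:

```python
import numpy as np

def higher_lower_mean_histograms(gray, block=4, n_bins=16):
    """Sketch of the first block-based feature (our reading of the text):
    the gray-level channel is tiled into non-overlapping block x block
    cells, each cell is replaced by its mean, and two histograms are built
    from the cell means lying above and below the global mean,
    respectively."""
    h, w = gray.shape
    gray = gray[: h - h % block, : w - w % block].astype(np.float64)
    # block-based representation: mean value of every non-overlapping cell
    means = gray.reshape(gray.shape[0] // block, block,
                         gray.shape[1] // block, block).mean(axis=(1, 3))
    m = means.mean()
    hi, _ = np.histogram(means[means >= m], bins=n_bins, range=(0, 255))
    lo, _ = np.histogram(means[means < m], bins=n_bins, range=(0, 255))
    return hi, lo
```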
After extracting and normalizing the image feature vectors (see Equation 4.54), we need
to determine which images are the most similar to any given query image. Image
matching is performed by comparing feature histograms from a feature database, where
the feature histogram of each image is computed offline. At query time, we only need to
compute the query image's feature histogram and perform a histogram comparison over
the whole database. For the current application we used the intersection similarity
measure (see Section 4.3).
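The matching step just described can be sketched as follows; the function names are illustrative, and the database histograms are assumed to be precomputed offline and L1-normalized:

```python
import numpy as np

def intersection(h, q):
    """Histogram intersection similarity; for L1-normalized histograms
    the value lies in [0, 1], with 1 for identical histograms."""
    return np.minimum(h, q).sum()

def rank_database(query_hist, db_hists):
    """Return database indices sorted by decreasing similarity to the
    query.  db_hists: precomputed feature histograms, one row per image."""
    sims = np.array([intersection(query_hist, h) for h in db_hists])
    order = np.argsort(-sims)          # best match first
    return order, sims[order]
```

Only the query histogram is computed online; the ranking is a single pass over the stored histograms.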
In our evaluation strategy, we use each image in both databases as the query image to
compute the precision and recall. By averaging the results for each class we obtain a
quantitative performance analysis for all image classes. Before we present a detailed
class-wise analysis of the experiments, we show sample results.
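The per-query evaluation above can be sketched as follows: given the class labels of a ranked retrieval list, precision and recall after each retrieved image are computed and then averaged per class (names are illustrative):

```python
import numpy as np

def precision_recall_curve(ranked_labels, query_label):
    """Precision and recall after each retrieved image, given the class
    labels of the ranked retrieval list and the query's class label."""
    ranked_labels = np.asarray(ranked_labels)
    relevant = (ranked_labels == query_label).astype(float)
    hits = np.cumsum(relevant)                 # relevant images in the top k
    k = np.arange(1, len(ranked_labels) + 1)
    precision = hits / k
    recall = hits / relevant.sum()
    return precision, recall
```

Averaging these curves over all queries of a class yields the class-wise graphs shown below.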
We selected some sample queries from both datasets and display their results in
Figure 5.18 and Figure 5.19, respectively. The images in Figure 5.18 were obtained from
the 1,000 image data-set and the results of Figure 5.19 were retrieved from the 10,000
image collection. Each panel displays sample results, where the top-left image is the
query image. All other images are ranked in decreasing similarity, where the number
above each image indicates the similarity, with values in the range [0, 1] and 1 denoting
an identical match with the query image. The results in Figures 5.18 and 5.19 reveal that
the first eight retrieved images belong to the query image class.
Figure 5.20 shows the average recall versus the number of retrieved images for all ten
classes of the 1,000 image database. Each panel in the figure shows five different curves,
representing the performance of the global color histogram, the local invariant feature
histogram, the global line segment features, the set of block-based features and the
combination of the latter two. It can be seen that the combination of the structure and
block-based features performs best for most of the classes within the first 100 retrieved
images. For the class Buses the global line structure features perform better than the
combination with the block-based features (BBF). This can be explained by the very
prominent geometric structure of the buses, which is best encoded by the global line
structure features; the texture information is of less importance than the geometric
arrangements.
We believe that a sophisticated feature selection method might improve the results of the
combined features by applying larger weights to the line segment features. However, for
the current experiments we did not implement feature selection, feature weighting or any
further relevance feedback methods to improve the results. We rather intended to give a
clear presentation of our features and postpone such improvements to later work.

Figure 5.20: Average class-wise recall versus the number of retrieved images for the 1,000
image database, with one panel per class (Africa, Beach, Buildings, Buses, Dinosaurs,
Flowers, Elephants, Horses, Food, Mountains) and five curves per panel (line segment
features, block-based features, line segment + block-based features, invariant feature
histogram, color histogram). Note that the curves represent averaged quantities, i.e. each
class member was taken as a query image and the resulting graphs were averaged.

Figure 5.21: Precision-recall graph for the 1,000 image data-set, where the graph is
averaged over all images and classes, representing an overall performance measure.

Figure 5.22: Precision-recall graph for the 10,000 image database, where the graph is
averaged over all images and classes, representing an overall performance measure.
In the class Horses our features perform worse than the global color histogram. A
possible reason might be found in the color distribution of the image class: about 80% of
the horses in that class are standing or running on green meadows, which introduces a
significant background. In terms of pixels, color clearly dominates the information. The
fraction of pixels belonging to horses is much smaller than the background area and thus
color is a stronger feature than structure.
Though the curves of some classes look quite similar to each other, the overall
precision-recall graph in Figure 5.21 demonstrates the superior performance of our
features. The graph represents the precision-recall averaged over all ten classes. The
figure shows that the combination of the local and global features maintains a precision
of 100% until a recall of about 20% is reached. An interesting observation can be made in
the part between a recall of about 0.25 and 0.35, where the color histogram performs
better than the combined features. Our investigations revealed that the higher
performance of the color histogram for the class Horses results in a superior recall for this
part of the curve.

Figure 5.22 depicts the precision-recall graph averaged over all 25 image categories of the
larger dataset (10,000 images). As for the first data-set, we compute the precision-recall
curves for all other features. The combination of the global structure and block-based
features performs better than the other methods.
5.5.2 Invariance Analysis
Though the results shown so far give a good idea of the proposed features' performance,
we have not yet measured the robustness or invariance of the methods. It is of great
importance for CBIR applications to be invariant or robust under common image
transformations; very common ones are changes of the illumination, rotation, or both
simultaneously. Hence we conduct an invariance analysis, where we compute all features
for each image of two newly created data-sets derived from our 1,000 image database. In
the first data-set, we increase the brightness of all images by 30% and decrease the
saturation by 10%. The second data-set undergoes the same transformation as the first,
but with an additional rotation of all images by 90 degrees counter-clockwise.
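The two derived data-sets can be sketched as follows. The exact tool used for the transformations is not stated in the text; here, as an assumption, saturation is reduced by blending towards the per-pixel gray value (which mimics common image-editing "color" controls), and np.rot90 performs the 90-degree counter-clockwise rotation:

```python
import numpy as np

def transform(rgb, brightness=1.3, saturation=0.9, rotate=False):
    """Sketch of the data-set transformations: brightness scaled by 30%,
    saturation reduced by 10% (blend towards gray), and an optional
    rotation by 90 degrees counter-clockwise."""
    img = rgb.astype(np.float64)
    gray = img.mean(axis=2, keepdims=True)          # desaturation target
    img = gray + saturation * (img - gray)          # saturation * 0.9
    img = np.clip(img * brightness, 0, 255)         # brightness * 1.3
    if rotate:
        img = np.rot90(img)                         # 90 deg counter-clockwise
    return img
```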
For the actual invariance test we determine the similarity between corresponding features
across the databases. We take the histogram of each image from the original database
and compute its similarity to the histogram obtained from the transformed image
data-set, where the brightness and saturation of every image were changed. In a second
step, we compute the similarity of each image feature with respect to the histograms
obtained from the data-set with changed brightness, saturation and rotation. As measure
to compare the two histograms of an original and a transformed image we use the
L1-norm, mapped to a value in [0, 1] with 1 indicating 100% invariance. The results are
displayed in Figure 5.23, where we can see five similarity versus image number plots. The
curves show the variation between the features taken from the original image data and
those of the data-set with different brightness, saturation and rotation. A constant line at
similarity equal to 1 would indicate 100% invariance. From the figure we can conclude
that the line segment features perform best, the combination of block-based and line
segment features second best, and the color histogram worst.

Figure 5.23: Robustness analysis of the 1,000 image database for different brightness and
rotation conditions (brightness 30% increased, saturation 10% decreased). The ordinate
shows the degree of averaged invariance for each feature set, with 1 being 100% invariant.
Curves: line segment features, block-based features, line segment + block-based features,
invariant feature histogram, color histogram.
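The L1-based similarity can be sketched as follows. The mapping 1 − L1/2 is one common choice for L1-normalized histograms and is an assumption here, since the text only states that the value lies in [0, 1]:

```python
import numpy as np

def l1_similarity(h1, h2):
    """One way to turn the L1 distance of two L1-normalized histograms
    into a similarity in [0, 1]: the distance of such histograms lies in
    [0, 2], so half of it is subtracted from 1; identical histograms
    give 1, disjoint histograms give 0."""
    return 1.0 - 0.5 * np.abs(np.asarray(h1) - np.asarray(h2)).sum()
```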
For a quantitative representation of the feature invariance, Table 5.7 presents the feature
similarity values averaged over all images, for each feature and all databases used, i.e. we
provide a single number as invariance descriptor. One can see that the global line
structure features perform best on average. The combination of our block-based and
global line structure features is slightly worse than the invariant feature histograms. The
comparison of the original image features with the features of merely rotated images is
omitted, since this information can be extracted from Table 5.7; in fact, all features are
equally robust to rotation.

Table 5.7: Averaged feature invariance representation, in [%]. First column: invariance
under different brightness conditions (brightness 30% increased, saturation 10%
decreased). Second column: invariance under different brightness and rotation. Third
column: invariance between different brightness conditions and different brightness plus
rotation. LS: line segment features, BBF: block-based features, IFH: invariant feature
histogram.

Features   Bright.   Rot.-Bright.   Bright. vs. Rot.-Bright.   Average
LS         97.14     96.12          96.48                      96.58
LS+BBF     90.99     89.67          93.92                      91.53
BBF        86.25     86.25          86.25                      86.25
IFH        91.05     90.83          99.49                      93.79
Color      65.65     65.62          99.62                      76.96
5.5.3 Conclusion
We have presented retrieval results for two color image databases. The global
structure-based features have clear invariance properties that make them useful for CBIR
applications. Moreover, we have combined the global structure-based features with
block-based features. The block-based features are derived from equally sized,
non-overlapping 4 × 4 image blocks. The global structure-based features contain
information about the relative spatial arrangements of straight line segments.

Although the global structure features alone produce good results, the combination of the
two methods improves the performance. The precision-recall graphs, averaged over the
whole image database, reveal for both image data-sets that the combined features
perform better than the invariant feature histogram method and also outperform the
color histogram. It has to be said that for some classes the features perform slightly
worse, but averaged over all classes the precision-recall graphs document a better
performance.

The feature robustness analysis revealed that the global line structure features perform
best on average, both under changes of the illumination and under changes of the
illumination plus rotation, with a mean invariance of the feature histograms of 96.58%.
The invariance of the combination of the global structure features and the local
block-based method is, at 91.53%, slightly worse than that of the very robust invariant
feature histogram method with 93.79%.

Although the block-based features, according to the precision-recall graphs, more often
perform better than the structure-based features, their combination gains in performance
and in robustness towards image transformations.

The obtained results encourage further enhancements, i.e. feature selection and/or
sophisticated weighting schemes that might improve the feature combination process,
leading to better results.
5.6 Object Class Recognition and Retrieval
In this section we present the results obtained for the Caltech database (see
Section 5.1.3). Specifically, we performed two experiments: the classification of the five
image classes using the original class labels, and content-based image retrieval for all
images of all classes. For the first experiment we report the category confusion matrix
and several performance measures, whereas for the second we present averaged
precision-recall graphs and the same measures as in Section 5.4. A critical review of the
literature reveals that occasionally results from subsets are reported: a common practice
is the usage of three (sometimes four) classes, namely airplanes, faces, (cars,) motorbikes
[101]. However, for our evaluation we use the complete dataset. In order to compare our
results with others, we have additionally performed experiments with just the three
classes.
Features
For the subsequent results we have used the global and local features as described in
Chapter 4. The feature set Hglobal, H local is normalized (see Equation 4.54) and
computed offline for the complete database. Note that we used the same features for the
classification and the image retrieval task.
5.6.1 Classification Results
In this subsection we report the classification results for the Caltech database, obtained
with a multi-class support vector machine and a leave-one-out test, i.e. n-fold cross
validation with n equal to the number of images in the database. The leave-one-out test
was used in order to overcome selection effects of splitting the dataset into training and
test sets. For the classification we use a support vector machine with a histogram
intersection kernel as defined in Equation 5.24. The best result was obtained for a
penalty cost parameter C = 2^21. Note that we have also tried other kernels such as
RBF (with different γ), polynomial (of various degrees), sigmoid and linear kernels; since
their classification results are a few percent lower than with the histogram intersection
kernel, we do not list their outputs. The features were normalized to zero mean and unit
variance as described in Section 4.4.

Table 5.8 contains the class-wise confusion matrix, where the integers represent absolute
numbers of classified instances and the last column shows the number of members of each
class. The advantage of a confusion matrix is the easy identification of true and false
positives; the exact true and false positive rates are listed in Table 5.9.

Table 5.8: Confusion matrix for the Caltech database.

Class   1      2     3     4     5     # Members
1       1036   0     4     3     31    1074
2       2      115   4     4     1     126
3       6      5     411   19    9     450
4       10     3     36    127   10    186
5       30     1     16    7     772   826

Table 5.9: Class-wise true and false positive rates for the Caltech database, where the
first column indicates the correctly classified images over the total number of class
members and the second column shows the TP rate in [%]. Column three lists all FP
obtained and column four gives the FP rate in [%].

Class    True Positive          False Positive
1        1036/1074   96.46      48/1588    3.02
2        115/126     91.27      9/2536     0.355
3        411/450     91.33      60/2212    2.71
4        127/186     68.28      33/2476    1.33
5        772/826     93.46      51/1836    2.76
Total:   2461/2662   92.45      201/2662   7.55

One can observe that the class airplanes shows the best classification rate of 96.46%. For
the class cars we could identify only 9 false positives, resulting in a classification rate of
more than 91%. For the class leaves we observe 33 false positives and the lowest true
positive rate (68.28%). A detailed look at the confusion matrix in Table 5.8 reveals that
leaves were mainly confused with faces: almost 60% of the misclassified leaves are labeled
as faces, and similarly for the reversed case. Apparently, the features do not discriminate
these two classes very well. A possible explanation can be given by a detailed visual
inspection of both image classes: leaves exhibit only few structures, and the used features
are not powerful enough to perform as well as for the other four classes. The classification
rate for motorbikes is as high as 93.46%; the classification rates for faces and cars are
comparable.
The class-wise averaged classification rate is 92.45%, with a false positive rate of 7.55%.
Occasionally, the literature reports other performance measures such as, for example,
precision, F-measure or accuracy. Table 5.10 lists the various measures for all five classes.
The classifier achieved an accuracy of 96.57%, a precision of more than 92% and an
F-measure of 92.35%, which is comparable with the true positive rate. Note that the
averaged values are weighted according to the class sizes. For the F-measure we used
β = 1, i.e. recall and precision are equally weighted; the F-measure can then be
interpreted as a harmonic mean.

Table 5.10: Detailed performance measures for the Caltech database (see Section 5.2),
in [%].

Class        Accuracy   TN      FN      P       G-Mean1   G-Mean2   F
1            96.77      96.98   3.54    95.57   96.02     96.72     96.01
2            99.25      99.65   8.73    92.74   92.00     95.37     92.00
3            96.28      97.29   8.67    87.26   89.27     94.26     89.25
4            96.54      98.67   31.72   79.38   73.62     82.08     73.41
5            96.06      97.22   6.54    93.80   93.63     95.32     93.63
W. Average   96.57      97.35   7.55    92.35   92.3      94.78     92.35
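The true and false positive rates of Table 5.9 follow directly from the confusion matrix of Table 5.8; for example, for class 1 (airplanes), TP rate = 1036/1074 ≈ 96.46% and FP rate = 48/1588 ≈ 3.02%. A sketch of the computation (function name is illustrative):

```python
import numpy as np

# Confusion matrix of Table 5.8 (rows: true class, columns: predicted class).
C = np.array([[1036, 0,   4,   3,   31],
              [   2, 115, 4,   4,   1],
              [   6, 5,   411, 19,  9],
              [  10, 3,   36,  127, 10],
              [  30, 1,   16,  7,   772]])

def rates(C):
    """Class-wise true positive rate and false positive rate, as reported
    in Table 5.9, derived directly from the confusion matrix."""
    tp = np.diag(C).astype(float)
    members = C.sum(axis=1)            # images per true class
    predicted = C.sum(axis=0)          # images assigned to each class
    fp = predicted - tp                # wrongly assigned to the class
    tp_rate = 100.0 * tp / members
    fp_rate = 100.0 * fp / (C.sum() - members)
    return tp_rate, fp_rate
```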
In order to compare our results with state-of-the-art algorithms, Table 5.11 lists the
classification rates of several authors. For a fair comparison we only list results obtained
from the full set of images. Our classification rates are given for multi-class experiments
obtained from complete leave-one-out tests. Moreover, most results have only been
reported for class-wise classification methodologies, i.e. a classifier was trained to
discriminate a single class among the ones listed in Table 5.11 from a background class
consisting of arbitrary images. Only in [80] did the authors additionally report the class
separation performance for all four classes in the form of a confusion table. It is obvious
that the latter approach is more challenging than a pure one-class problem. The results
of [80] confirm that, for the class motorbikes, the classification rate dropped by about 3%
between the one-class and the multi-class problem. This result suggests that the
inter-category separation is more difficult, but it also gives a further insight into the
discrimination ability of a feature. However, Table 5.11 shows that our approach
outperforms most of the other methods.

Table 5.11: Comparison of our structure-based method with others from the literature.
'n.r.': single-class results were not reported. The average weighted classification
performance is computed with respect to the same classes in order to guarantee a fair
comparison; the authors of [80] report an overall classification rate of 93.46% for four
classes, so we only consider the three classes listed here.

Classes        Our     [32]    [80] one-vs-rest   [80]    [101]
Airplanes      96.55   90.2    93.7               95.4    n.r.
Faces          96.22   96.4    94.4               93.4    n.r.
Motorbikes     93.58   92.5    96.1               93.1    n.r.
Aver. Class.   95.45   93.0    94.68              94.21   94.4

Notice that we repeated the classification for the very same image classes in order to
guarantee a fair comparison. For the class faces our approach is slightly less performant
than the one in [32]. For the class motorbikes, [80] reported a higher classification rate
for the one-class approach; for the class separation task, however, their performance
drops below ours. The overall classification rate of our method is the highest, with more
than 95%. Note that the classification rate of our features increased by 3% in comparison
to the five-class problem (see Table 5.9).
5.6.2 Retrieval Results
In this subsection we present the CBIR results for the Caltech database. Now we find a
ranking of similar images without the usage of a training set; note that we take the same
features as for the classification task. In fact, the similarity measure has to determine the
degree of resemblance of each feature vector with the query vector in the
high-dimensional feature space Rn. For the current experiment we used the histogram
intersection measure, as motivated in Section 4.3.
Figure 5.24 shows a sample retrieval result for the class airplanes. The first image
retrieved is the query image and the other images displayed belong to the same class.
The numbers above each image denote the degree of similarity with the query, where 1
indicates an identical match. A more detailed look reveals that not only images from the
same class are retrieved but all airplanes appear in a quite similar environment, which is
due to the structure-based features encoding the whole scene (e.g. there are no airplanes
in the sky among the first retrieved samples). In Figure 5.25 we can see a sample retrieval
for the class of motorbikes. The query image is found as the first match. The images are
sorted in decreasing similarity with respect to the query denoted by the numbers above
each image. The structure-based features capture the geometric appearance such that the
retrieved motorbikes exhibit a similar placement and a fairly similar type.
Sample retrievals are a good means for visual inspection but lack overall objectivity.
Therefore, it is necessary to report performance measures such as precision-recall graphs
for an objective evaluation. Figure 5.26 displays five class-wise averaged precision-recall
graphs. The class airplanes performs very well: on average, slightly less than 40% of all
relevant images are retrieved at precision 1. That means that, on average, for an airplane
query each of the first 380 retrieved images belongs to the same class. The graph for the
class cars looks a bit different: the precision drops faster from the value one, but
decreases more smoothly with increasing recall than for airplanes. The worst performance
is achieved for the class leaves, a class that is almost never used in the literature. The
class leaves poses high difficulties: the leaves fill a rather small area of each image, so
background information is strengthened. In addition, leaves do not contain a lot of
structural information, and our features are not as discriminative as for the other four
classes. The combination with a color histogram might improve the performance for the
class leaves. The precision-recall graphs for faces and motorbikes evidence a good
retrieval performance.

Figure 5.24: Sample retrieval result for the class airplanes, obtained with structure
features (see Section 4.2) from the Caltech database.

Figure 5.25: Sample retrieval result for the class motorbikes, obtained with structure
features (see Section 4.2) from the Caltech database.

Figure 5.26: Class-wise averaged precision-recall graphs for the Caltech image database
(panels: Airplane, Cars (rear), Faces, Leaves, Motorbikes).
Invariance
In the previous sections we have shown the appropriateness of our structural features for
object class retrieval. The precision-recall graphs have shown a good performance for the
five classes. Now we want to investigate the invariance properties of the features. In
Section 5.5.2 we have already shown the invariance properties of our features, and the
comparison with other methods has proven a high degree of invariance for the
structure-based features. Here, we investigate the invariance, or robustness, against seven
non-linear image transformations. Therefore, we take an image out of the class
Motorbikes and apply seven transformations to it: in detail, we perform a Gaussian
blurring, add noise, add sparkle light effects, flip the image, and perform one affine and
two projective transformations. The newly created images serve as queries for the
complete Caltech database. We are interested in the retrieval performance and in the
retrieval rank of the original, unchanged image.

Figures 5.27 and 5.28 show the eight retrieval results. The figures show that the
transformed query images have strongly changed their visual appearance. The result in
Figure 5.27(a) is obtained from the original image; the first retrieved image is identical
with the query, as indicated by the intersection similarity measure of 1. The unaltered
motorbike is retrieved first for the case of Gaussian blurring (see Figure 5.27(b)),
although the similarity measure has decreased from 1 to 0.823. Similarly, the original
image exhibits rank 1 for the sparkle light effect (see Figure 5.27(d)), the flipped image
(see Figure 5.28(a)), the affine transformation (see Figure 5.28(b)) and the second
projective transformation (see Figure 5.28(d)). For the motorbike image with added noise
the original image is retrieved at position two, and for the first projective transformation
at rank three. Note that the second projective transformation changes the original image
extremely.
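A few of the transformations above can be sketched with plain NumPy. This is an illustrative stand-in (the thesis does not specify its tooling), and the box filter is only a crude approximation of the Gaussian blurring:

```python
import numpy as np

rng = np.random.default_rng(42)

def box_blur(img, k=5):
    """Crude stand-in for Gaussian blurring: a k x k box filter built from
    cyclic shifts (edge handling differs from a real Gaussian filter)."""
    acc = np.zeros_like(img, dtype=np.float64)
    for dy in range(-(k // 2), k // 2 + 1):
        for dx in range(-(k // 2), k // 2 + 1):
            acc += np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return acc / (k * k)

def add_noise(img, sigma=10.0):
    """Additive Gaussian noise, clipped to the valid gray-value range."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0, 255)

def flip(img):
    """Horizontal image flip."""
    return img[:, ::-1]
```

The affine and projective transformations additionally require pixel resampling and are therefore left to an image-processing library.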
The results show that the structure-based features are largely robust against the seven
described transformations. Figure 5.29 shows seven precision-recall graphs, one for each
of the transformed query images. The seven graphs are quite similar, except the graph
for the first projective transformation, which is significantly worse than the others. The
results show that the similarity score decreases for all transformations; nonetheless, the
retrieval rank of the original image remains 1 for most of the transformed image queries.
Although the results are very promising, further investigations are necessary in order to
make a general statement about the features' robustness under the seven discussed
transformations.
5.6.3 Conclusion
In this section we have presented results obtained from classification and retrieval
experiments performed with the Caltech image collection. For the classification we have
used a multi-class SVM with a histogram intersection kernel. The results are very
competitive, as shown by a comparison with state-of-the-art algorithms. In fact, we have
obtained a classification rate of 92.45% for the five-class problem and a 95.45% rate for
the three-class problem. For the second experiment we have demonstrated averaged
precision-recall graphs and various performance measures. The graphs show that our
structure-based features perform well for object categorization and retrieval tasks. We
have completed the experiment with an investigation of the robustness/invariance under
seven non-linear image transformations. The results show that for most transformations
the retrieval rank of the unaltered image remains 1. In future work we are interested in
performing a thorough
Figure 5.27: Results obtained with structural features (see Section 4.2) for some
transformations: (a) original, (b) Gaussian blur, (c) random noise, (d) sparkle light
effects.
Figure 5.28: Results obtained with structural features (see Section 4.2) for some
transformations: (a) image flip, (b) affine, (c) projective 1, (d) projective 2.
Figure 5.29: Precision-recall graphs for several image transformations of the query image
of Figures 5.27 and 5.28, which belongs to the class Motorbikes (curves: original, blur,
noise, sparkle, flip, affine, projective 1, projective 2).
Table 5.12: Confusion matrix for the Brodatz database.

Class    1   2   3   4   5   6   7   8   9  10  11  12  13   # Members
  1     94   0   1   1   0   0   0   0   0   0   0   0   0       96
  2      0  96   0   0   0   0   0   0   0   0   0   0   0       96
  3      1   0  95   0   0   0   0   0   0   0   0   0   0       96
  4      0   0   0  95   1   0   0   0   0   0   0   0   0       96
  5      0   0   0   2  94   0   0   0   0   0   0   0   0       96
  6      0   0   0   0   0  85   0   1   0   0   0   0  10       96
  7      0   0   1   0   0   0  95   0   0   0   0   0   0       96
  8      0   0   0   0   0   0   0  95   0   0   0   0   1       96
  9      0   0   0   0   1   0   0   0  95   0   0   0   0       96
 10      0   0   0   0   0   0   0   0   0  96   0   0   0       96
 11      0   0   0   0   0   0   0   0   0   0  96   0   0       96
 12      0   0   0   0   0   0   0   0   0   1   0  95   0       96
 13      0   0   0   0   0   4   0   0   0   0   0   0  92       96
invariance analysis for the seven transformations, i.e., computing averaged precision-recall
graphs for every image in the Caltech database under the seven image transformations.
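Such averaged precision-recall graphs follow directly from the ranked result lists. A minimal sketch (a hypothetical illustration assuming binary relevance judgements per query, not the evaluation code of the thesis):

```python
import numpy as np

def precision_recall(ranked_relevance):
    """Precision and recall after each rank position.
    ranked_relevance: sequence of 1/0 flags, 1 = relevant result."""
    rel = np.asarray(ranked_relevance, dtype=float)
    hits = np.cumsum(rel)                          # relevant items found so far
    precision = hits / np.arange(1, len(rel) + 1)
    recall = hits / max(rel.sum(), 1.0)
    return precision, recall

def averaged_pr(per_query_relevance, levels=np.linspace(0.0, 1.0, 11)):
    """Average interpolated precision over all queries at fixed recall
    levels -- the usual way precision-recall graphs are averaged."""
    curves = []
    for rr in per_query_relevance:
        p, r = precision_recall(rr)
        # interpolated precision: best precision at recall >= level
        curves.append([p[r >= lv].max() if np.any(r >= lv) else 0.0
                       for lv in levels])
    return levels, np.mean(curves, axis=0)

# Two toy queries with ranked binary relevance judgements
levels, avg_p = averaged_pr([[1, 0, 1, 0], [1, 1, 0, 0]])
```

Averaging the interpolated curves over all queries of a class gives exactly one graph per curve of Figure 5.29.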
5.7 Texture Class Recognition
In this section we demonstrate the performance of our structure-based features for a tex-
ture classification problem. Texture analysis and classification is a widely studied field
of research and a tremendous number of different approaches have been developed so far.
However, in the following we restrict the literature review to edge-based methods. Though
edge-based approaches have not been as widely used as statistical, Gabor-filter or wavelet-based
methods, several studies [103], [70], [134], [81] have demonstrated their good classification
performance.
We use the global and local features H_global and H_local described in Chapter 4. The
feature vectors are normalized to zero mean and unit variance. In the current experiment
we apply a support vector machine with an intersection kernel, a cost parameter of C = 220,
and leave-one-out cross-validation: each of the n patterns, with n being the number of patterns,
is in turn held out for testing while the classifier is trained on the remaining n − 1, and the
average classification score is computed over all n runs.
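The histogram intersection kernel and the leave-one-out protocol described above can be sketched as follows (a simplified numpy illustration; the actual experiments would use an SVM implementation such as LIBSVM [20], and the toy histograms below merely stand in for the structure features):

```python
import numpy as np

def intersection_kernel(X, Z):
    """Histogram intersection kernel: K[i, j] = sum_d min(X[i, d], Z[j, d])."""
    # broadcast to (n, m, d), take the elementwise minimum, sum over d
    return np.minimum(X[:, None, :], Z[None, :, :]).sum(axis=2)

def leave_one_out_splits(n):
    """Each of the n patterns is held out once for testing while the
    classifier is trained on the remaining n - 1 patterns."""
    for i in range(n):
        train = np.concatenate([np.arange(i), np.arange(i + 1, n)])
        yield train, np.array([i])

# Toy non-negative "histograms" standing in for the structure features
X = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.6, 0.3],
              [0.7, 0.2, 0.1]])
K = intersection_kernel(X, X)        # symmetric Gram matrix
```

The Gram matrix K can be passed to any SVM that accepts precomputed kernels; note the intersection kernel is defined for non-negative (histogram-like) entries.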
Table 5.12 shows the confusion matrix for the 13 classes of the Brodatz database, and
Table 5.13 lists the corresponding true positive and false positive rates. Note that
each class consists of 96 members. For classes 2, 10 and 11 (Brick, Water and Weave) a
100% classification rate is obtained. For classes 3, 4, 7, 8, 9 and 12 (Bubbles, Grass, Raffia,
Sand, Straw and Wood) only one image was falsely classified, which corresponds to a true positive
Table 5.13: True positive and false positive rates for the Brodatz database. The first column indicates the correctly classified images and the total number of class members, the second column shows the TP rate in [%], column three represents all FP obtained, and column four gives the FP rate in [%].

Class   True Positive    [%]      False Positive    [%]
  1        94/96        97.92        1/1152       0.08681
  2        96/96       100.00        0/1152       0
  3        95/96        98.96        2/1152       0.1736
  4        95/96        98.96        3/1152       0.2604
  5        94/96        97.92        2/1152       0.1736
  6        85/96        88.54        4/1152       0.3472
  7        95/96        98.96        0/1152       0
  8        95/96        98.96        1/1152       0.08681
  9        95/96        98.96        0/1152       0
 10        96/96       100.00        1/1152       0.08681
 11        96/96       100.00        0/1152       0
 12        95/96        98.96        0/1152       0
 13        92/96        95.83       11/1152       0.9549

Total:   1223/1248      98.00       25/1248       2.003
rate of 98.96%. Classes 6 and 13 (Pigskin and Wool) performed the worst. A closer look at
the confusion matrix reveals that both classes are confused with each other. The textures
appear to be quite similar in terms of their edge and line-segment representation. These
two classes account for almost 60% of all false positives of the data set. Nonetheless, the
overall averaged classification rate of our method is 98.0%, which can be regarded as very
competitive. In Table 5.14 we list additional performance measures, such as accuracy,
precision and the F-measure.
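All of these measures follow from the per-class confusion counts. As a check, the following sketch (a hypothetical helper, not the thesis code) reproduces the tabulated values for class 6 of Table 5.12, which has 85 true positives, 11 false negatives and 4 false positives among 1248 patterns:

```python
import math

def class_measures(tp, fn, fp, total):
    """Per-class performance measures (cf. Section 5.2.1) computed
    from the confusion counts of one class."""
    tn = total - tp - fn - fp
    tp_rate = tp / (tp + fn)                 # recall / sensitivity
    tn_rate = tn / (tn + fp)                 # specificity
    precision = tp / (tp + fp)
    return {
        "tp_rate": tp_rate,
        "tn_rate": tn_rate,
        "fn_rate": fn / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "accuracy": (tp + tn) / total,
        "f_measure": 2 * precision * tp_rate / (precision + tp_rate),
        "g_mean1": math.sqrt(tp_rate * precision),
        "g_mean2": math.sqrt(tp_rate * tn_rate),
    }

# Class 6 (Pigskin) of Table 5.12: 85 correct, 11 missed, 4 false positives
m = class_measures(tp=85, fn=11, fp=4, total=1248)
```

This yields a TP rate of 88.54%, precision of 95.51%, accuracy of 98.80%, F-measure of 91.89% and G-Mean2 of 93.93%, matching the entries of Tables 5.13 and 5.14.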
In order to provide a qualitative comparison with other approaches, we list in Table 5.15
the class-wise classification rates, where available, for the work of [44] and [100]. The table
reveals that although the method of [44] scores 100% for five classes, its overall
performance is lower than ours. The authors in [100] have reported the overall
performance for two sets of features. The first number of 97.52% is the best result
for their rotation-invariant variance (VAR) operator (128 bins), which describes
the contrast of local image texture. The second feature, called the local binary pattern
(LBP) operator, was introduced in order to extract a rotation and gray-scale invariant
feature. In detail, the local neighborhood is binary-thresholded at the gray level of
the center pixel. The best LBP run resulted in a correct classification rate of 96.88%. Finally,
the combination of both features improved their classification rate to 99.52%.
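The thresholding idea behind the LBP operator can be sketched in a few lines (a simplified 3 × 3 version for illustration only; the operator of [100] additionally uses circular, interpolated sampling and a finer rotation-invariance mapping):

```python
import numpy as np

def lbp_ri(image):
    """Minimal rotation-invariant LBP sketch (8 neighbours, radius 1).
    Illustrates the idea described above, not the exact operator of [100]."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    # 8-neighbourhood offsets in circular order
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    out = np.zeros((h - 2, w - 2), dtype=int)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = img[y, x]
            # threshold the neighbourhood at the centre pixel's gray level
            bits = [int(img[y + dy, x + dx] >= c) for dy, dx in offs]
            # rotation invariance: minimum code over all bit rotations
            out[y - 1, x - 1] = min(
                sum(b << i for i, b in enumerate(bits[r:] + bits[:r]))
                for r in range(8))
    return out
```

A histogram of the resulting codes over the image then serves as the texture feature.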
Table 5.14: Detailed performance measures (see Section 5.2.1) for the Brodatz database. All measures in [%].

Class     Accuracy     TN      FN       P     G-Mean1  G-Mean2     F
  1         99.76    99.91    2.08    98.95    98.43    98.91    98.43
  2        100.00   100.00    0      100.00   100.00   100.00   100.00
  3         99.76    99.83    1.04    97.94    98.45    99.39    98.45
  4         99.68    99.74    1.04    96.94    97.94    99.35    97.94
  5         99.68    99.83    2.08    97.92    97.92    98.87    97.92
  6         98.80    99.65   11.46    95.51    91.96    93.93    91.89
  7         99.92   100.00    1.04   100.00    99.48    99.48    99.48
  8         99.84    99.91    1.04    98.96    98.96    99.43    98.96
  9         99.92   100.00    1.04   100.00    99.48    99.48    99.48
 10         99.92    99.91    0       98.97    99.48    99.96    99.48
 11        100.00   100.00    0      100.00   100.00   100.00   100.00
 12         99.92   100.00    1.04   100.00    99.48    99.48    99.48
 13         98.80    99.05    4.17    89.32    92.52    97.43    92.46

W. Average  99.69    99.83    2.00    98.04    98.01    98.90    98.00
5.7.1 Conclusion
In the previous section we have presented results for the classification of Brodatz textures.
The structure-based method has obtained very good results that are on par with state-of-the-art
methods: an average classification rate of 98% is reached. The result indicates
that straight line segments and their connectivity are useful for texture classification. We
believe that our method is a good alternative to Gabor- and wavelet-based methods for
texture recognition tasks. In future work, we are interested in combining Gabor features with
our structural features for similar applications.
Table 5.15: Comparison of the structure-based method with others from the literature for the Brodatz database. Class-wise rates are not reported (n.r.) in [100]; the two averages of [100] correspond to the VAR operator alone and the VAR + LBP combination.

Class      Our      [44]      [100]
  1       97.92     87.5       n.r.
  2      100       100         n.r.
  3       98.96    100         n.r.
  4       98.96     95.8       n.r.
  5       97.92     93.8       n.r.
  6       88.54     95.8       n.r.
  7       98.96    100         n.r.
  8       98.96     97.9       n.r.
  9       98.96    100         n.r.
 10      100        97.9       n.r.
 11      100       100         n.r.
 12       98.96     97.9       n.r.
 13       95.83     91.7       n.r.

Average:  98.00     96.8       97.52 / 99.52
Chapter 6
Conclusions and Perspectives
6.1 Conclusions
In this thesis we have investigated content-based image retrieval and classification prob-
lems. For this purpose we have developed a structure-based feature extraction method that
encodes relative spatial arrangements of line segments. The method is capable of repre-
senting the global structure of an image, as well as the local structure of perceptual groups
and their connectivity. The advantage of the method is its invariance to changes
in illumination, to similarity transformations (rotation, translation and isotropic scaling), and
to slight changes of the viewing angle. The results show that structure is a prominent and
highly discriminative feature.
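The invariance to similarity transformations stems from encoding only relative arrangements of line segments. A toy illustration (a hypothetical two-segment descriptor, much simpler than the features of Chapter 4):

```python
import numpy as np

def pair_descriptor(seg_a, seg_b):
    """Relative angle (degrees) and length ratio of two line segments,
    each given as (x1, y1, x2, y2).  Both quantities are unchanged by
    rotation, translation and isotropic scaling."""
    va = np.array([seg_a[2] - seg_a[0], seg_a[3] - seg_a[1]], dtype=float)
    vb = np.array([seg_b[2] - seg_b[0], seg_b[3] - seg_b[1]], dtype=float)
    la, lb = np.linalg.norm(va), np.linalg.norm(vb)
    cos_rel = np.dot(va, vb) / (la * lb)
    angle = np.degrees(np.arccos(np.clip(cos_rel, -1.0, 1.0)))
    ratio = min(la, lb) / max(la, lb)
    return angle, ratio

def transform(seg, theta, scale):
    """Apply a similarity transformation (rotation + isotropic scaling)."""
    c, s = np.cos(theta), np.sin(theta)
    R = scale * np.array([[c, -s], [s, c]])
    p1, p2 = R @ np.array(seg[:2], float), R @ np.array(seg[2:], float)
    return (*p1, *p2)

a, b = (0, 0, 2, 0), (0, 0, 0, 1)          # perpendicular, lengths 2 and 1
a2, b2 = transform(a, 0.7, 3.0), transform(b, 0.7, 3.0)
```

Rotating and isotropically scaling both segments leaves the descriptor unchanged, which is exactly the property exploited by the structure features.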
In order to verify the correct extraction of structural information we have evaluated
various edge detectors. Next, we have presented a method that automatically computes
the best set of Canny parameters for real-world images. We have evaluated 550 different
parameter sets and determined the best one by comparison with a manually generated
ground truth. The best parameter set produces an error rate of only 0.1 to 1.3%. The
result shows that it is possible to restrict the range of the three Canny parameters to a
meaningful subrange.
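The parameter search itself reduces to scoring each candidate edge map against the ground truth. A sketch of such a grid search (the edge detector is abstracted as a callable, since no concrete Canny implementation is fixed here):

```python
import numpy as np
from itertools import product

def edge_error_rate(edge_map, ground_truth):
    """Fraction of pixels on which the detected edge map disagrees
    with the manually generated ground truth (both binary arrays)."""
    e = np.asarray(edge_map, dtype=bool)
    g = np.asarray(ground_truth, dtype=bool)
    return np.mean(e != g)

def best_parameter_set(detector, image, ground_truth, sigmas, lows, highs):
    """Grid search over the three Canny parameters (smoothing sigma and
    the low/high hysteresis thresholds).  `detector` is any callable
    (image, sigma, low, high) -> binary edge map standing in for a
    concrete Canny implementation."""
    best = None
    for sigma, lo, hi in product(sigmas, lows, highs):
        if lo >= hi:                    # thresholds must satisfy lo < hi
            continue
        err = edge_error_rate(detector(image, sigma, lo, hi), ground_truth)
        if best is None or err < best[0]:
            best = (err, (sigma, lo, hi))
    return best

# Toy demonstration with a placeholder thresholding "detector"
img = np.array([[0.0, 1.0], [2.0, 3.0]])
toy = lambda image, sigma, lo, hi: image > hi
best = best_parameter_set(toy, img, img > 1.5, [1.0], [0.5], [1.5, 2.5])
```

With a real Canny implementation substituted for the toy detector, enumerating the restricted subrange of the three parameters reproduces the search described above.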
In addition, we have used the edge maps to extract straight line segments with an edge
point tracking algorithm that was compared with the standard Hough transformation. The
result has shown that the standard Hough transformation could not robustly produce line
segments for our set of images. Though the Hough transform generally gives good results,
it is prone to misleading or false results for aligned objects. On the other hand, the edge
point tracking algorithm produces high quality maps of line segments that are very robust
to changes in illumination.
Next we have applied a straight line segment grouping method based on agglomera-
tive hierarchical clustering that automatically discards less important segments. We have
firstly proven the existence of an underlying clustering structure in the feature space with
the Hopkins test. A result of h = 0.81 suggests the adequacy of our clustering method
at a confidence level of more than 90%. Secondly, we have evaluated the best linkage
method for the agglomerative hierarchical clustering by computing the cophenetic correla-
tion coefficient for 15972 different hierarchies. With an average score of 0.9262 the average
linkage method produces the best result. Thirdly, we have introduced a subgraph distance
ratio that is used in order to prune or cut a dendrogram. The resulting groups of straight
line segments are divided into salient and less important clusters on the basis of the intra-
class compactness. Subsequently, we have used the salient subset of line segments for the
computation of the invariant structural features.
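The Hopkins test can be sketched as follows (a simplified numpy version of the statistic [51]; the exact sampling scheme used in the thesis experiments may differ):

```python
import numpy as np

def hopkins(X, m=None, rng=None):
    """Hopkins statistic [51]: values near 0.5 indicate spatially random
    data, values approaching 1 indicate a clustering tendency."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = m if m is not None else max(1, n // 10)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # u: nearest-neighbour distances from m uniform points (sampled in
    # the bounding box of the data) to the data set
    U = rng.uniform(lo, hi, size=(m, d))
    u = [np.linalg.norm(X - p, axis=1).min() for p in U]
    # w: nearest-neighbour distances from m sampled data points
    # to the remaining data points
    idx = rng.choice(n, size=m, replace=False)
    w = [np.linalg.norm(np.delete(X, i, axis=0) - X[i], axis=1).min()
         for i in idx]
    return sum(u) / (sum(u) + sum(w))
```

For two well-separated point clouds the statistic approaches 1, while spatially random data yields values near 0.5.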
The features have been applied to various content-based image retrieval and classifica-
tion problems. The first application was the classification and content-based image retrieval
of watermark images, where the features have proven to be powerful descriptors. We
achieved a classification rate of more than 87% for the 14-class watermark problem,
although the database features high intra-class variability. The precision-recall graphs
have shown good results for various classes. Some classes performed worse due to the high
intra-class variation. Therefore, we have presented several performance measures that show
the dynamics of the retrieval results. In addition, we have performed partial matching for
various sample filigrees, which produced promising results. The partial matching turned out
to be computationally very expensive; further investigation and algorithmic acceleration
are needed in order to make it more applicable.
The second application was a retrieval task on two color image databases from the Corel
collection, with 1,000 and 10,000 images. We have shown that the global structure features
produce quite good results, which could be improved further with local block-based features.
The combination of local and global information performed better for most classes, except
for the class Buses. That is because buses consist to a large extent of linear structures, which
favor the line-segment features. In addition, the combination performed better than two
standard methods on both data sets. We observed only one class (Horses), from the
1,000-image set, where the color histogram clearly outperformed our features. A visual
inspection of that class revealed a clear domination of the green color in the Horses images.
Since invariance plays a very important role for CBIR, we have evaluated the perfor-
mance of all methods under several image transformations. The analysis has shown that
the structure-based features are the most invariant against changes in illumination and
rotation, with an average score of more than 96%.
The third application was object class recognition and retrieval for the Caltech database.
We have used support vector machines to learn the feature space distribution of our
structure-based features for several images classes. For the classification we have used
a multi-class SVM with a histogram intersection kernel. The results have been obtained
with a leave-one-out test. The results are competitive, as shown by a comparison with
state-of-the-art algorithms. In fact, we have obtained a 92.45% classification rate for the
five-class problem and a 95.45% rate for the three-class problem. The invariance analysis
showed a good robustness towards several image transformations.
In addition, we have performed an image retrieval task on the same database. The
results were presented with averaged precision-recall graphs and various performance
measures. The results show that our structure-based features can be used for learning as well
as for image similarity tasks.
The fourth and final application was the classification of textures obtained from the
well-known Brodatz collection. We have applied our line-segment features to the 13-class
problem, which was evaluated with an SVM and a leave-one-out test. The intersection kernel
produced a very competitive classification rate of 98%.
We have shown the broad applicability of structure-based image features for classifi-
cation and content-based image retrieval tasks. The four presented applications comprise
tasks as broad as binary, color, object class and texture image retrieval and/or classifi-
cation. The features have proven to be invariant and of high quality in comparison with
state-of-the-art methods.
6.2 Outlook and Perspectives
In this thesis we have shown that structure can effectively be used in order to solve content-
based image retrieval and classification tasks. Although we have presented four very
different applications, structure features can also be applied to various other tasks, such
as satellite imagery, medical image analysis and retrieval, or astronomical
data repositories (e.g. the virtual observatory).
We are also interested in pursuing our results on the automatic estimation of the best
parameters for edge detectors. We will enhance the method with respect to salient ground
truth maps. Currently, we consider each correctly or falsely detected pixel equally important,
though it is easy to imagine that some edges are more salient than others.
Hierarchical clustering is another point of interest to us. In this thesis we have applied
AHC to a set of line segments, although in general the method can be applied to arbitrary
feature vectors. This generality motivates us to conduct further experiments on clustering
features. The pruning method and cluster selection strategy presented in this thesis have
been applied to a set of line segments; however, we are interested in their performance on
other data patterns, such as general feature vectors. For arbitrary features, the cluster
selection process might be understood as the selection of important or salient features.
Thus, the method might contribute to the problem of feature selection.
The structure-based features have shown good discrimination abilities so far. We are
interested in whether the formation of larger perceptual groups would improve the results.
Currently, we consider local groups of line segments that fulfill certain relative spatial
arrangements, such as proximity, parallelism, perpendicularity, or similar lengths and angles. In
future work we might develop an even more structured hierarchy of perceptual groups.
Moreover, we have shown partial matching results. Unfortunately, partial matching is
computationally very expensive. Therefore, we are interested in accelerating the matching
process and/or investigating other, faster matching methods (e.g. the Earth mover's
distance [119] or the Hausdorff distance).
The experiments on the evaluation of the best similarity measure inspired us to conduct
further research. Specifically, we are interested in whether our observation that the
intersection measure performs best also holds for arbitrary feature sets. We believe that
there is too little literature on the behavior of similarity measures for different feature
space distributions.
The good classification results presented in this thesis encourage us to use our structure-
based features for relevance feedback applications. We expect a significant improvement
of the performance.
Appendix A
A.1 Mercer Theorem
Theorem 1 (Mercer's Theorem). Let Ω ⊂ R^n. Assume K : Ω × Ω → R is a continuous
and symmetric function, and that there exists a mapping Φ from R^n to H, where H is a
Hilbert space,

    Φ : x ↦ Φ(x) ∈ H.

Then there is an equivalent representation of the inner product operation,

    ∑_r Φ_r(x) Φ_r(z) = K(x, z),                                    (1)

where K(x, z) is the so-called kernel function, satisfying

    ∫_{Ω×Ω} K(x, z) g(x) g(z) dx dz ≥ 0,                            (2)

for any g(x), x ∈ Ω, such that

    ∫ g(x)^2 dx < ∞.                                                (3)
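Condition (2) implies that the Gram matrix of a valid kernel on any finite sample is symmetric positive semi-definite, which can be checked numerically. A small sketch for the histogram intersection kernel used in this thesis (an empirical check, not a proof):

```python
import numpy as np

def gram_psd(kernel, X, tol=1e-10):
    """Empirical Mercer check: on any finite sample, a valid kernel must
    produce a symmetric positive semi-definite Gram matrix."""
    K = np.array([[kernel(x, z) for z in X] for x in X])
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

# histogram intersection kernel on non-negative feature vectors
intersection = lambda x, z: np.minimum(x, z).sum()
H = np.random.default_rng(0).random((20, 8))   # random "histograms"
ok = gram_psd(intersection, H)                 # True: (2) holds empirically
```

Such a check can only falsify a kernel on the sampled data; the positive semi-definiteness of the intersection kernel for non-negative inputs is, of course, established analytically.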
Bibliography
[1] S. Aksoy and R. M. Haralick. Probabilistic vs. geometric similarity measures for
image retrieval. In Conference on Computer Vision and Pattern Recognition (CVPR
2000), 13-15 June 2000, SC, USA, pages 2357–2362, 2000.
[2] S. Aksoy and R. M. Haralick. Feature normalization and likelihood-based similarity
measures for image retrieval. Pattern Recognition Letters, 22(5):563–582, April 2001.
[3] Y. Avrithis, Y. Xirouxakis, and S. Kollias. Affine-invariant curve normalization
for object shape representation, classification and retrieval. Machine Vision and
Applications, 13, Issue 2:80–94, 2001.
[4] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape
matching and object recognition. In NIPS, pages 831–837, 2000.
[5] P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue
Software, 2002.
[6] A. Bhattacharyya. On a measure of divergence between two statistical populations
defined by their probability distributions. Bull. Calcutta Math. Soc., 35:99–110, 1943.
[7] H. H. Bock. On some significance tests in cluster analysis. Journal of Classification,
2:77–108, 1985.
[8] K.W. Bowyer, C. Kranenburg, and S. Dougherty. Edge detector evaluation using
empirical roc curves. Computer Vision and Image Processing (CVIU), 84(1):77–103,
October 2001.
[9] J. N. Breckenridge. Validating cluster analysis: Consistent replication and symmetry.
Multivariate Behavioral Research, 35:261–285, 2000.
[10] C. M. Briquet. Les filigranes: Dictionnaire historique des marques de papier dès leur
apparition vers 1282 jusqu'en 1600, Tome I à IV, Deuxième édition. Verlag von
Karl W. Hiersemann, Leipzig, 1923.
[11] P. Brodatz. Textures – A Photographic Album for Artists and Designers. Dover
Publications, New York, 1966.
[12] R. Brunelli. Histogram analysis for image retrieval. Pattern Recognition, 34, 2001.
[13] G. Brunner and H. Burkhardt. Building classification of terrestrial images by generic
geometric hierarchical cluster analysis features. In IAPR Workshop on Machine
Vision Applications (MVA2005), pages 136–139, Tsukuba Science City, Japan, Apr.
2005.
[14] G. Brunner and H. Burkhardt. Structure features for content-based image retrieval.
In Pattern Recognition - Proc. of the 27th DAGM Symposium, Vienna, Austria, pages
425–433. Springer, Berlin, Aug. 2005.
[15] G. Brunner and Z.-M. Lu. Block-based and structure-based features for content-based
image retrieval. In preparation, 2006.
[16] C. Burges. A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 2(2):121–167, 1998.
[17] J. B. Burns, A. R. Hanson, and E. M. Riseman. Extracting straight lines. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 1986.
[18] J. Canny. A computational approach to edge detection. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 8:679–698, 1986.
[19] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik. Blobworld: A
system for region-based image indexing and retrieval. In Third International Con-
ference on Visual Information Systems. Springer, 1999.
[20] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[21] O. Chapelle, P. Haffner, and V. Vapnik. SVMs for histogram-based image classifica-
tion. IEEE Transaction on Neural Networks, 10(5):1055–1064, 1999.
[22] Corel Inc. Corel’s 1000 images database.
[23] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons,
1991.
[24] M. H. DeGroot. Probability and Statistics 3rd ed. Reading, MA: Addison-Wesley,
1991.
[25] T. Deselaers, D. Keysers, and H. Ney. Features for image retrieval: A quantitative
comparison. In C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, and B. Schölkopf,
editors, Pattern Recognition - Proc. of the 26th DAGM Symposium, LNCS 3175,
Tübingen, Germany, pages 228–236. Springer, Berlin, Aug. 2004.
[26] M. Do and M. Vetterli. Wavelet-based texture retrieval using generalized gaussian
density and Kullback-Leibler distance. IEEE Trans. on Image Processing, pages 146–158, 2002.
[27] A. Dorado and E. Izquierdo. Semi-automatic image annotation using frequent key-
word mining. In Proceedings of the Seventh International Conference on Information
Visualization, volume 00, pages 532–535, Los Alamitos, CA, USA, 2003. IEEE Com-
puter Society.
[28] J. Eakins, A. Jean, E. Brown, J. Riley, and R. Mulholland. Evaluating a shape
retrieval system for watermark images. In A. Bentkowska, T. Cashen, and J. Sun-
derland, editors, CHArt Conference Proceedings, volume four; Digital Art History -
A Subject in Transition: Opportunities and Problems, British Academy in London
on 28th and 29th November 2001, 2002.
[29] P. Enser and C. Sandom. Towards a comprehensive survey of the semantic gap in
visual image retrieval. In CIVR 2003 - International Conference on Image. and Video
Retrieval, Urbana, IL, USA, 24-25 July, LNCS 2728, pages 291–299, 2003.
[30] F. J. Estrada and A. D. Jepson. Perceptual grouping for contour extraction. In 17th
International Conference on Pattern Recognition (ICPR), volume 02, pages 32–35,
Los Alamitos, CA, USA, 2004. IEEE Computer Society.
[31] B. S. Everitt. Cluster Analysis 2nd Edition. Heineman Educational Books Ltd., 2nd
Edition, London, 1980.
[32] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised
scale-invariant learning. In IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR), volume 02, page 264, Los Alamitos, CA, USA,
2003. IEEE Computer Society.
[33] A. Fischer, T. H. Kolbe, F. Lang, A. B. Cremers, W. Förstner, L. Plümer, and
V. Steinhage. Extracting buildings from aerial images using hierarchical aggregation
in 2D and 3D. Computer Vision and Image Understanding: CVIU, 72(2):185–203,
1998.
[34] T. Frese, C. Bouman, and J. Allebach. A methodology for designing image similarity
metrics based on human visual system models. Technical report, Tech. Rep. TR-ECE
97-2, Purdue University, West Lafayette, 1997.
[35] C. Galambos, J. Kittler, and J. Matas. Progressive probabilistic hough transform
for line detection. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’99), volume 01, page 1554, Los Alamitos, CA, USA,
1999. IEEE Computer Society.
[36] Y. Gao and M. K. H. Leung. Line segment hausdorff distance on face matching.
Pattern Recognition, 35(2):361–371, 2002.
[37] M. Gerke, C. Heipke, and B.-M. Straub. Building extraction from aerial imagery
using a generic scene model and invariant geometric moments. In Proceedings of
the IEEE/ISPRS joint Workshop on Remote Sensing and Data Fusion over Urban
Areas, University of Pavia, Rome (Italy), pages 85–89, Nov. 2001.
[38] T. Gevers and A. Smeulders. Color based object recognition. In ICIAP (1), pages
319–326, 1997.
[39] A. D. Gordon. Clustering algorithms and cluster validity. In P. Dischdedt, R. Os-
termann (Eds.), Computational Statistics: Papers Collected on the Occasion of the
25th Conference on Statistical Computing, pages 497–512. Physica-Verlag, Heidel-
berg, 1994.
[40] K. C. Gowda. Cluster detection in a collection of collinear line segments. Pattern
Recognition, 17(2):221–237, 1984.
[41] A. Graf, A. Smola, and S. Borer. Classification in a normalized feature space using
support vector machines. IEEE Transactions on Neural Networks, 14(3):597–605,
2003.
[42] D. S. Guru, B. H. Shekar, and P. Nagabhushan. A simple and robust line detection
algorithm based on small eigenvalue analysis. Pattern Recognition Letters, 25(1):1–
13, January 2004.
[43] A. Halawani and H. Burkhardt. Image Retrieval by Local Evaluation of Nonlinear
Kernel Functions around Salient Points. In Proceedings of the 17th International
Conference on Pattern Recognition (ICPR-2004), Cambridge, United Kingdom, Aug.
2004.
[44] G. M. Haley and B. S. Manjunath. Rotation-invariant texture classification us-
ing a complete space-frequency model. IEEE Transactions on Image Processing, 8,
no.2:255–69, Feb. 1999.
[45] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering algorithms and validity
measures. In Proceedings of the 13th International Conference on Scientific and Sta-
tistical Database Management, July 18-20, 2001, George Mason University, Fairfax,
Virginia, USA, pages 3–22, 2001.
[46] J. A. Hartigan. Representation of similarity matrices by trees. Journal of the Amer-
ican Statistical Association, 62:1140–1158, 1967.
[47] M. D. Heath, S. Sarkar, T. Sanocki, and K. W. Bowyer. A robust visual method for
assessing the relative performance of edge detection algorithms. IEEE Trans. Pattern
Analysis and Machine Intelligence (PAMI), 19(12):1338–1359, December 1997.
[48] W. Hl and C. Mu. Image semantic classification by using SVM. Journal of Software,
14(11):1891–1899, 2003.
[49] P. Hobson and Y. Kompatsiaris. Advances in semantic multimedia analysis for per-
sonalized content access. In Special Session on Advances in Semantic Multimedia
Analysis for Personalized Content Access, 2006 IEEE International Symposium on
Circuits and Systems (ISCAS 2006), Kos Island, Greece, 21 - 24 May, 2006.
[50] D. Hoiem, R. Sukthankar, H. Schneiderman, and L. Huston. Object-based image
retrieval using the statistical structure of images. In IEEE Conference on Computer
Vision and Pattern Recognition, June 2004.
[51] B. Hopkins. A new method for determining the type of distribution of plan-
individuals. Annals of Botany, 18:213–226, 1954.
[52] P. V. C. Hough. Method and Means for Recognizing Complex Patterns. U.S. Patent
3,069,654, Dec 1962.
[53] P. Howarth, A. Yavlinsky, D. Heesch, and S. M. Rüger. Medical image retrieval
using texture, locality and colour. In C. Peters, P. Clough, J. Gonzalo, G. J. F. Jones,
M. Kluck, and B. Magnini, editors, Multilingual Information Access for Text, Speech
and Images, 5th Workshop of the Cross-Language Evaluation Forum, CLEF 2004,
Bath, UK, pages 740–749, 2004.
[54] B. Huet and E. R. Hancock. Line pattern retrieval using relational histograms. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 21(12):1363–1370, 1999.
[55] IHP. International Standard for the Registration of Watermarks. International As-
sociation of Paper Historians (IHP), 1998.
[56] Q. Iqbal and J. K. Aggarwal. Using structure in content-based image retrieval. In
IASTED International Conference on Signal and Image Processing (SIP 99), pages
129–133, 1999.
[57] Q. Iqbal and J. K. Aggarwal. Combining structure, color and texture for image
retrieval: A performance evaluation. In Proceedings of the International Conference
on Pattern Recognition, volume 3, Quebec, Canada, pages 438–443, 2002.
[58] Q. Iqbal and J. K. Aggarwal. Retrieval by classification of images containing large
man-made objects using perceptual grouping. Pattern Recognition, 35:1463–1479,
July 2002.
[59] A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc.,
Upper Saddle River, NJ, USA, 1988.
[60] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing
Surveys, 31(3):264–323, 1999.
[61] A. K. Jain and A. Vailaya. Image retrieval using color and shape. Pattern Recognition,
29(8):1233–1244, August 1996.
[62] A. Jonk and A. Smeulders. An axiomatic approach to clustering line-segments. In
Proceedings of the Third International Conference on Document Analysis and Recog-
nition, volume 1, pages 386–389, Aug. 1995.
[63] L. Kaufmann and P. J. Rousseeuw. Finding Groups in Data. John Wiley & Sons,
Inc.,New York, 2005.
[64] A. Kimura, T. Kawanishi, and K. Kashino. Similarity-based partial image retrieval
guaranteeing same accuracy as exhaustive matching. In IEEE International Confer-
ence on Multimedia and Expo, (ICME), pages 1895–1898, 2004.
[65] R. A. Kirsch. Computer determination of the constituent structure of biological
images. Comp. Biomed. Res., 4(3):315–328, June 1971.
[66] K. Koffka. Principles of Gestalt Psychology. Harcourt, Brace and World, Inc., New
York, 1935.
[67] A. N. Kolmogorov. On the empirical determination of a distribution function (Italian).
Giornale dell'Istituto Italiano degli Attuari, 4:83–91, 1933.
[68] S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu. Statistical edge detection:
learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI), 25(1):57–74, January 2003.
[69] P. D. Kovesi. Edges are not just steps. In Proceedings of the Fifth Asian Conference
on Computer Vision, pages 822–827, January 2002. Melbourne.
[70] J. K. P. Kuan and P. H. Lewis. Complex textures classification with edge information.
In In Proceedings of Proceedings on Second International Conf. on Visual Information
System. San Diego, 1997.
[71] M. Kubat, R. Holte, and S. Matwin. Learning when negative examples abound. In
Proc. of the European Conference on Machine Learning, ECML97, Prague, pages
146–153, 1997.
[72] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathemat-
ical Statistics, 22:79–86, 1951.
[73] R. G. Lawson and P. J. Jurs. New index for clustering tendency and its application
to chemical problems. J. Chem. Inf. Comput. Sci., 30:36–41, 1990.
[74] L. Lee. Measures of distributional similarity. In 37th Annual Meeting of the Associ-
ation for Computational Linguistics, pages 25–32, 1999.
[75] T. K. Leen, T. G. Dietterich, and V. Tresp, editors. Advances in Neural Information
Processing Systems 13, Papers from Neural Information Processing Systems (NIPS)
2000, Denver, CO, USA. MIT Press, 2001.
[76] S. Lefevre, C. Dixon, C. Jeusse, and N. Vincent. A local approach for fast line
detection. In Digital Signal Processing, 2002. DSP 2002. 2002 14th International
Conference on, volume 2, pages 1109–1112, 2002.
[77] S. Lessmann. Solving imbalanced classification problems with support vector ma-
chines. In H. R. Arabnia, editor, IC-AI, pages 214–220. CSREA Press, 2004.
[78] M. S. Lew, N. Sebe, and J. P. Eakins, editors. Image and Video Retrieval, Interna-
tional Conference, CIVR 2002, London, UK, July 18-19, 2002, Proceedings, volume
2383 of Lecture Notes in Computer Science. Springer, 2002.
[79] D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In SIGIR
’94: Proceedings of the 17th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 3–12, New York, NY, USA, 1994.
Springer-Verlag New York, Inc.
[80] F. Li, J. Kosecka, and H. Wechsler. Strangeness based feature selection for part based
recognition. In Computer Vision and Pattern Recognition, 2006 Conference on, June
2006.
[81] H. Li, S.-C. Yan, and L.-Z. Peng. Robust non-frontal face alignment with edge based
texture. J. Comput. Sci. Technol., 20(6):849–854, 2005.
[82] J. Li and J. Wang. Automatic linguistic indexing of pictures by a statistical modeling
approach. IEEE Trans. Pattern Anal. Mach. Intell., 25(9):1075–1088, 2003.
[83] Y. Li and L. G. Shapiro. Consistent line cluster for building recognition in cbir.
In Proceedings of the International Conference on Pattern Recognition, volume 3,
Quebec, Canada, pages 952–956, 2002.
[84] S. Liapis and G. Tziritas. Image retrieval by colour and texture using chromaticity
histograms and wavelet frames. In Visual Information and Information Systems,
pages 397–406, 2000.
[85] Z. J. Liu, J. Wang, and W. P. Liu. Building extraction from high resolution - imagery
based on multi-scale object oriented classification and probabilistic hough transform.
In Proceedings of the IGARSS 2005 Symposium. Seoul, Korea. July 25-29, 2005.
[86] Z.-M. Lu, C. Liu, and S. Sun. Digital image watermarking technique based on block
truncation coding with vector quantization. Chinese Journal of Electronics (English
Version), 11(2):152–157, 2002.
[87] Z.-M. Lu, H. Skibbe, and H. Burkhardt. Image retrieval based on a multipurpose
watermarking scheme. In International Workshop on Intelligent Information Hiding
and Multimedia Signal Processing, Melbourne, Australia, September 2005.
[88] F. Del Marmol. Dictionnaire des filigranes classes en groupes alphabetique et
chronologiques. Namur: J. Godenne, 1900 (reprinted 1987).
[89] D. Marr and E. C. Hildreth. Theory of edge detection. Proceedings of the Royal Society
of London, B-207:187–217, 1980.
[90] S. Meignier, J. Bonastre, and I. Magrin-Chagnolleau. Speaker utterances tying among
speaker segmented audio documents using hierarchical classification: towards speaker
indexing of audio databases. In Proceedings of the International Conference on Spoken
Language Processing (ICSLP), 2002.
[91] K. Mikolajczyk, A. Zisserman, and C. Schmid. Shape recognition with edge-based
features. In British Machine Vision Conference, volume 2, pages 779–788, September
2003.
[92] M. Mirmehdi and B. T. Thomas, editors. Proceedings of the British Machine Vision
Conference 2000, BMVC 2000, Bristol, UK, 11-14 September 2000. British Machine
Vision Association, 2000.
[93] R. Mohan and R. Nevatia. Perceptual organization for scene segmentation and
description. IEEE Transactions on Pattern Analysis and Machine Intelligence,
14(6):616–635, 1992.
[94] N. Mukhopadhyay. Probability and Statistical Inference. New York: Dekker, 2000.
[95] H. Muller, W. Muller, D. Squire, S. Marchand-Maillet, and T. Pun. Performance
evaluation in content-based image retrieval: overview and proposals. Pattern Recogn.
Lett., 22(5):593–601, 2001.
[96] K. Murakami and T. Naruse. High speed line detection by Hough transform in local
area. In 15th International Conference on Pattern Recognition (ICPR’00), volume 3,
page 3471, 2000.
[97] A. Natsev, R. Rastogi, and K. Shim. Walrus: A similarity retrieval algorithm for im-
age databases. IEEE Transactions on Knowledge and Data Engineering, 16(3):301–
316, 2004.
[98] H. Neemuchwala, A. O. Hero, and P. L. Carson. Image matching using alpha-entropy
measures and entropic graphs. European Journal of Signal Processing (Special Issue
on Content-based Visual Information Retrieval), Mar. 2004.
[99] A. Neri. Optimal detection and estimation of straight patterns. IEEE Trans. Image
Processing, 5(5):787–792, May 1996.
[100] T. Ojala, M. Pietikainen, and T. Maenpaa. Gray scale and rotation invariant texture
classification with local binary patterns. In D. Vernon, editor, ECCV (1), volume
1842 of Lecture Notes in Computer Science, pages 404–420. Springer, 2000.
[101] B. Ommer and J. Buhmann. Object categorization by compositional graphical mod-
els. In Rangarajan et al. [111], pages 235–250.
[102] M. Ortega, Y. Rui, K. Chakrabarti, S. Mehrotra, and T. S. Huang. Supporting
similarity queries in MARS. In ACM Multimedia, pages 403–413, 1997.
[103] D. Patel and T.J. Stonham. Texture image classification and segmentation using
rank-order clustering. In Proceedings of the 11th International Conference on Pattern
Recognition, The Hague, The Netherlands, pages 92–95, 1992.
[104] S. M. Peres and M. L. de Andrade-Netto. A fractal fuzzy approach to clustering
tendency analysis. In Lecture Notes in Computer Science, volume 3171, pages 395–
404, Jan. 2004.
[105] A. R. Pope and D. G. Lowe. Vista: A software environment for computer vision
research. In CVPR94, pages 768–772, 1994.
[106] A. P. D. Poz, G. M. D. Vale, and I. R. B. Zanin. Automated road segment extraction
by grouping road objects. In XXth ISPRS Congress ’Geo-Imagery Bridging
Continents’, 12–23 July 2004, Istanbul, Turkey, 2004.
[107] J. M. S. Prewitt. Object enhancement and extraction. New York: Academic, 1970.
[108] J. Princen, J. Illingworth, and J. Kittler. A hierarchical approach to line extraction
based on the Hough transform. Computer Vision, Graphics, and Image Processing
(CVGIP), 52(1):57–77, October 1990.
[109] J. Puzicha. Distribution-based image similarity. In State-of-the-Art in Content-Based
Image and Video Retrieval [141], pages 143–164.
[110] G. Qian, S. Sural, Y. Gu, and S. Pramanik. Similarity between euclidean and cosine
angle distance for nearest neighbor queries. In SAC ’04: Proceedings of the 2004
ACM symposium on Applied computing, pages 1232–1237, New York, NY, USA,
2004. ACM Press.
[111] A. Rangarajan, B. C. Vemuri, and A. L. Yuille, editors. Energy Minimization Methods
in Computer Vision and Pattern Recognition, 5th International Workshop, EMM-
CVPR 2005, St. Augustine, FL, USA, November 9-11, 2005, Proceedings, volume
3757 of Lecture Notes in Computer Science. Springer, 2005.
[112] C. Rauber. Acquisition, archivage et recherche de documents accessibles par le con-
tenu: Application a la gestion d’une base de donnees d’images de filigranes. Ph.D.
Dissertation No. 2988, University of Geneva, Switzerland, March 1998.
[113] C. Rauber, T. Pun, and P. Tschudin. Retrieval of images from a library of water-
marks for ancient paper identification. In EVA 97, Elektronische Bildverarbeitung
und Kunst, Kultur, Historie, Berlin, Germany, Nov. 12 - 14, 1997., 1997.
[114] C. Rauber, J. Ruanaidh, and T. Pun. Secure distribution of watermarked images
for a digital library of ancient papers. In DL ’97: Proceedings of the second ACM
international conference on Digital libraries, pages 123–130, New York, NY, USA,
1997. ACM Press.
[115] K. J. Riley and J. P. Eakins. Content-based retrieval of historical watermark images:
I-tracings. In Lew et al. [78], pages 253–261.
[116] G. S. Robinson. Edge detection by compass gradient masks. Comput. Graphics Image
Processing, 6, 1977.
[117] A. Robles-Kelly and E. R. Hancock. Grouping line-segments using eigenclustering.
In Mirmehdi and Thomas [92].
[118] O. Ronneberger and F. Pigorsch. LIBSVMTL: A support vector machine
template library, 2004. Software available at http://lmb.informatik.uni-freiburg.de/lmbsoft/libsvmtl/.
[119] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for
image retrieval. International Journal of Computer Vision, 40(2):99–121, Nov. 2000.
[120] Y. Rui, T. Huang, and S. Chang. Image retrieval: current techniques, promising
directions and open issues. Journal of Visual Communication and Image Represen-
tation, 10:39–62, April 1999.
[121] S. Gallant and M. Johnston. Image retrieval using image context vectors: First results.
In Storage and Retrieval for Image and Video Databases (SPIE), pages 82–94, 1995.
[122] S. Santini and R. Jain. Similarity measures. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 21(9):871–883, 1999.
[123] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization and Beyond. MIT Press, 2002.
[124] H. Schulz-Mirbach. Invariant features for gray scale images. In G. Sagerer, S. Posch,
and F. Kummert, editors, 17. DAGM - Symposium “Mustererkennung”, pages 1–14,
Bielefeld, 1995. Springer.
[125] N. Sebe, M. Lew, and N. Huijsmans. Towards optimal ranking metrics. IEEE Trans.
on Pattern Analysis and Machine Intel. (PAMI), pages 1132–1143, Oct. 2000.
[126] M. C. Shin, D. B. Goldgof, K. W. Bowyer, and S. Nikiforou. Comparison of edge
detection algorithms using a structure from motion task. IEEE Trans. Systems, Man
and Cybernetics Part B, SMC-B, 31(4):589–601, August 2001.
[127] S. Siggelkow. Feature Histograms for Content-Based Image Retrieval. PhD thesis,
Albert-Ludwigs-Universitat Freiburg, December 2002.
[128] S. Siggelkow and H. Burkhardt. Image retrieval based on local invariant features. In
Proceedings of the IASTED International Conference on Signal and Image Processing
(SIP) 1998, pages 369–373, Las Vegas, Nevada, USA, October 1998. IASTED.
[129] S. Siggelkow, M. Schael, and H. Burkhardt. SIMBA — Search IMages By Appear-
ance. Lecture Notes in Computer Science, 2191, 2001.
[130] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based
image retrieval at the end of the early years. IEEE Trans. Pattern Analysis and
Machine Intelligence, 22:1349–1380, Dec. 2000.
[131] J. R. Smith and S.-F. Chang. VisualSEEk: A fully automated content-based image
query system. In ACM Multimedia, pages 87–98, 1996.
[132] A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish. Clustering speakers by their
voices. In IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), volume 2, pages 757–760, 1998.
[133] M. J. Swain and D. H. Ballard. Color indexing. Int. J. Comput. Vision, 7(1):11–32,
1991.
[134] T. Ojala, M. Pietikainen, and O. Silven. Edge-based texture measures for surface
inspection. In Proc. 11th International Conference on Pattern Recognition, The Hague,
The Netherlands, volume 2, pages 594–598, 1992.
[135] J.-P. Tarel and S. Boughorbel. On the choice of similarity measures for image retrieval
by example. In Proceedings of ACM MultiMedia Conference, pages 446–455, Juan-
Les-Pins, France, 2002.
[136] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1998.
[137] K. Tsuda, M. Minoh, and K. Ikeda. Extracting straight-lines by sequential fuzzy
clustering. Pattern Recognition Letters, 17(6):643–649, May 1996.
[138] I. Valova and B. Rachev. Retrieval by color features in image databases. In ADBIS
(Local Proceedings), 2004.
[139] V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.
[140] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
[141] R. Veltkamp, H. Burkhardt, and H.-P. Kriegel. State-of-the-Art in Content-Based
Image and Video Retrieval. Kluwer Academic Publishers, 2001.
[142] R. Veltkamp and M. Tanase. Content-based image retrieval systems: A survey. In
O. Marques and B. Furht, editors, Content-Based Image and Video Retrieval, pages
47–101. Kluwer Academic Publishers, 2002.
[143] J. Wang, J. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive integrated
matching for picture LIbraries. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 23(9):947–963, 2001.
[144] S. Wang, F. Ge, and T. Liu. Evaluating edge detection through boundary detection.
EURASIP Journal on Applied Signal Processing, 2006:1–15, 2006.
[145] J. H. Ward. Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58:234–244, 1963.
[146] R. W. White and J. M. Jose. A study of topic similarity measures. In SIGIR ’04:
Proceedings of the 27th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 520–521, New York, NY, USA, 2004.
ACM Press.
[147] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural
Networks, 16:645–678, May 2005.
[148] T.-J. Yen. A Qualitative Profile-based Approach to Edge Detection. PhD thesis,
Department of Computer Science, New York University, 2003.
[149] Y. Yitzhaky and E. Peli. A method for objective edge detection evaluation and
detector parameter selection. IEEE Transactions on Pattern Analysis and Machine
Intelligence (PAMI), 25(8):1027–1033, August 2003.
[150] J. M. Zacks and B. Tversky. Event structure in perception and conception. Psycho-
logical Bulletin, 127(1):3–21, 2001.
[151] O. Zamir, O. Etzioni, O. Madani, and R. M. Karp. Fast and intuitive clustering of
web documents. In Proceedings of the 3rd International Conference on Knowledge
Discovery and Data Mining, pages 287–290, 1997.
[152] X. Zhou and T. S. Huang. Edge-based structural features for content-based image
retrieval. Pattern Recognition Letters, 22(5):457–468, 2001.
[153] D. Ziou and S. Tabbone. Edge detection techniques - an overview. International
Journal of Pattern Recognition and Image Analysis, 8:537–559, 1998.