
Otto-von-Guericke-University Magdeburg

School of Computer Science

Institute for Knowledge and Language Engineering

Internship Report

Design and Implementation of an Algorithm and Data Structure for Matching of Geometric Primitives in Visual Object Classification

Author:

Sebastian Stober

April 28, 2005

Supervisors:

Prof. Dr. Rudolf Kruse
Otto-von-Guericke-University Magdeburg
School of Computer Science
P.O. Box 4120, D–39016 Magdeburg
Germany

A/Prof. Saman Halgamuge
The University of Melbourne
Mechanical & Manufacturing Engineering
3010 Victoria
Australia

Stober, Sebastian: Design and Implementation of an Algorithm and Data Structure for Matching of Geometric Primitives in Visual Object Classification. Internship Report, Otto-von-Guericke-University Magdeburg, 2005.


Abstract

This report refers to work completed during my internship with the Mechatronics Research Group at the Department of Mechanical and Manufacturing Engineering at the University of Melbourne, Australia, from September 5th, 2003 until March 5th, 2004.

Recognition of three-dimensional objects in two-dimensional images is a key area of research in computer vision. One approach is to save multiple 2D views instead of a 3D object representation, thus reducing the problem to a 2D to 2D matching problem. The Mechatronics Research Group is developing a novel system that focuses on artificial objects and further reduces the 2D views to symbolic descriptions. These descriptions are based on shape-primitives: ellipses, rectangles and isosceles triangles. Evidence in support of a hypothesis for a certain object classification is collected through an active vision approach.

This work deals with the design and implementation of a data structure that is capable of holding such a symbolic representation and an algorithm for comparison and matching. The chosen symbolic representation of an object view is rotation-, scaling- and translation-invariant. For the comparison and matching of two object views a branch & bound algorithm based on problem specific heuristics is used. Furthermore, a GA-based generalization operator is proposed to reduce the number of object views in the system database.

Experiments show that the query performance scales linearly with the size of the database. For a database containing 10000 entries, a response time of less than a second is expected on an average system.

Acknowledgments

This research internship was made possible by funding from the German National Academic Foundation. I would like to thank my supervisors, Prof. Rudolf Kruse and A/Prof. Saman Halgamuge, and further Jun/Prof. Andreas Nuernberger, Prof. Horst Hollatz, and my family for encouragement, support, and help. Special thanks are extended to Ruby Law, who never seemed to tire of writing comments and remarks, for discussions, feedback, ideas, and for help on the literature review. I would also like to thank Christian Borgelt, who read this report prior to publication and was kind enough to offer his comments. My grateful thanks go to all the people at the Mechatronics Research Group and their friends for the pleasant co-operation and hospitality they offered during my stay, particularly Genevieve and Karl, Salim, Kenneth, Guru, and Asanga. Finally, I want to thank Matthias Steinbrecher, especially for help on C++ but also for superb culinary art and bearing my company for almost seven months.


Contents

List of Figures
List of Tables
List of Algorithms
List of Abbreviations

1 Introduction
1.1 Motivation
1.2 Related work in object recognition
1.2.1 Ideas from cognitive science
1.2.2 Non-cognitive approaches
1.3 Framework
1.4 Task
1.5 Outline of this report

2 Database
2.1 Object representation syntax
2.2 Building and storing the database

3 Classification matching component
3.1 Comparison of two object views
3.2 Processing a query
3.2.1 Branch & bound algorithm
3.2.2 Underlying data structure
3.2.3 Lower bound computation
3.2.4 Upper bound computation
3.2.5 An error-overestimating heuristic
3.2.6 Further extensions of the branch & bound algorithm

4 Forming generalizations in the database
4.1 About genetic algorithms
4.2 Representation of an individual and fitness function
4.3 Initial population
4.4 Selection
4.5 Genetic operators
4.6 Termination criterion

5 Graphical user interface

6 Test Result
6.1 Test data sets
6.1.1 Real test data sets
6.1.2 Artificially generated test sets
6.2 Test results

7 Conclusion & future work
7.1 Discussion
7.2 Ideas for further improvement
7.2.1 Introduction of an error threshold
7.2.2 Optimization of the parameters for the shape error functions
7.2.3 Incorporation of shape confidences
7.3 Conclusion

A Definition of the shape-primitive property errors
B Overview of the source files
C Number of possible matchings between two object views
D Data structure for a (partial) matching

Bibliography


List of Figures

1.1 Overall structure of the object recognition module
2.1 Six orthogonal 2D views of a mug
2.2 Object description syntax scheme
2.3 Shape-primitives and properties
2.4 Shape-primitives detected by the detection module
2.5 Normalized object view representation of the input shown in figure 2.4
3.1 One-to-one mapping of shape-primitives of two object views α and β
3.2 Matching example
3.3 Comparison of two entries - flowchart (part 1)
3.4 Comparison of two entries - flowchart (part 2)
3.5 Optimal search-tree for the example used in section 3.1
3.6 Cases of “heavyness” of an AVL tree
4.1 Flowchart of mutateWeak and mutateWeakMod-operator
4.2 Flowchart of nPointCrossover-operator
5.1 Screenshot of GUI with display areas in shape-match mode
5.2 Screenshot of GUI displays in unlinked mode and with position labels
5.3 Screenshot of GUI displays in size-linked mode
5.4 Screenshot of GUI displays in match mode
6.1 Benchmark results for large scale databases
6.2 Changes in the lower and upper bounds for the matching error
A.1 Triangular membership function


List of Tables

2.1 Conversion of shape-primitive properties
2.2 Value ranges of shape-primitive properties
6.1 Partitioning of the first test set
6.2 Number of expanded nodes


List of Algorithms

1 Generic branch & bound algorithm
2 Computation of the estimated error
3 Computation of the unmatched error
4 Computation of an upper bound for the matching error
5 Generic genetic algorithm


List of Abbreviations

α: object view (the query object view in the context of a query)
β: object view (an object view in the database in the context of a query)
F_α: set of free shape-primitives of object view α
F_{α,t}: set of free shape-primitives of object view α with type t
α_i: the i-th shape-primitive of object view α
|α|: the number of shape-primitives of α
m(α_i): the shape-primitive matched with the i-th shape-primitive of α (may be “unmatched”, indicating that α_i could not be matched)

GUI: Graphical User Interface
API: Application Program Interface
GA: Genetic Algorithm
STL: Standard Template Library
MFC: Microsoft Foundation Classes
GDI: Graphics Device Interface


Chapter 1

Introduction

This chapter starts with an explanation of the motivation and the context of this work. Then a brief overview of the problem of object recognition is given. In section 1.3 the system framework is described. Afterwards, the task is defined. The chapter closes with an outline of the remaining chapters of this work.

1.1 Motivation

Recognition of three-dimensional objects in two-dimensional images is a key area of research in computer vision. One approach is to save multiple 2D views instead of a 3D object representation, thus reducing the problem to a 2D to 2D matching problem. The Mechatronics Research Group is currently developing a novel system that focuses on 2D views and incorporates the idea of “active vision”. In active vision a feedback loop is closed between the image generating process and an actuator module: based on hypotheses about a perceived object, commands to the actuator module are derived to change the camera position in an attempt to refine the hypothesis. In order to be useful, an active vision system must return its results within a fixed delay, which depends on the application domain and requires limitation of the data. The system limits the amount of data to process by working on geometric information extracted from images instead of the raw image data. The symbolic descriptions derived in this way are based on shape-primitives: ellipses, rectangles and isosceles triangles. This decompositional approach may yield advantages in terms of computational costs and real-time capability. On the other hand, the fundamental restriction on the shape complexity will limit the differences that can be captured between objects. System performance therefore has to be examined in terms of processing time as well as classification robustness.

The question of how to represent objects and the idea of decompositional approaches have been subjects of wide discussion amongst cognitive scientists. This has led to many different approaches. A brief overview of related ideas is given in the following section.


1.2 Related work in object recognition

As a key task of computational vision, object recognition deals with the retrieval (classification or identification) and localization of objects of interest from an image (scene) based on object models which are known (or have to be learned) a priori. The complexity of this task is dependent on [Mai]:

• the number of objects that can occur within a single image (scene),

• whether the objects may partially occlude others,

• the size of the model database,

• whether images are acquired under similar conditions to those of the models, e.g. illumination, background, camera parameters and viewpoint, and

• the choice of the internal representation for the objects.

Whilst the first four factors are mainly domain specific, the last factor is a matter of the approach chosen and may have a big impact on the performance of the system.

1.2.1 Ideas from cognitive science

David Marr, a cognitive scientist, divided the process of human object recognition into four different stages from low-level to high-level visual processes [MN78, Mar82]:

1. Grey level description of the intensity of light at each point in the retinal image.

2. Primal sketch - a 2 dimensional, viewpoint-dependent description.

3. 2.5 Dimensional sketch - a “viewer-centered”¹ representation (still viewpoint-dependent).

4. 3 Dimensional description - In this stage, the viewpoint-dependent (or “viewer-centered”¹) sketch is remapped into a viewpoint-independent (or “object-centered”¹) representation.

Eventually, the constructed 3D model is matched with a 3D object model stored in the long term memory (the equivalent to the model database).

In 1987 Irving Biederman developed Marr’s approach into the “Recognition by components theory” [Bie87], identifying 36 different “geons”, i.e., fundamental shapes² from which all real world objects can be composed. The geons have co-occurring patterns of lines and edges that are non-accidental and can be detected independent of the viewpoint. However, viewpoint-independent, object-centered representations have been questioned in light of evidence that human object recognition is rather viewpoint-dependent, as shown by Bülthoff et al. [BET95] and Tarr et al. [TWHG98]. In fact, it seems as if the representation of objects in the human brain is neither purely viewer-centered nor purely object-centered [TB95]. Possibly the kind of information stored depends on the complexity of the object and the context in which it appears (e.g. everyday use) [Kos96].

¹ Usually, “viewer-centered” is used equivalent to 2D and “object-centered” equivalent to 3D.
² Arcs, wedges, spheres, cylinders, blocks etc. Examples can be found e.g. in the TarrLab Stimuli collection [Tar].

As this discussion moves on at the level of cognition and neuroscience, knowledge about the processes and structures in the human brain grows and more sophisticated ideas are introduced, e.g. the “chorus of fragments” approach by Edelman et al. [EIJ02, EI02, EI03]. It picks up ideas of the “Recognition by components theory” [Bie87], but instead of the restriction to a fixed set of components, a network of “what and where”-neurons learns the most frequent “fragments” occurring in the database. The scheme of “what and where”-neurons is justified by analogies to the receptive fields of the retina [EN98].

In viewer-centered representation systems, entries in the database describe 2D views of objects from different viewpoints. The database entry which best resembles the description derived from a 2D input image is then found. This is an advantage of a viewer-centered system since object-centered approaches contain 3D models of objects in the database. Generally, 3D models are much more complex than the 2D view representations and thus harder to acquire. They work well with 3D inputs as a 3D-3D matching function is relatively straightforward to implement (even though it is slower than a 2D-2D matching function), but the generation of 3D input images requires special hardware such as a laser scanner or at least a stereo (or triple) vision camera setup with additional preprocessing steps (as e.g. in [DY95]). In the case of a 2D only input either

• a matching function between 2D input and 3D model has to be developed, or

• each 3D model in the database has to be projected to a viewpoint before a 2D-2D matching function is applied as e.g. in Lowe’s “viewpoint consistency constraint” [Low85] or Ullman’s “recognition by alignment” approach [Ull89, Ull96], or

• a 3D description from the 2D input has to be derived (as pointed out by Marr [Mar82]) and a 3D-3D matching function is applied.

These additional computations for object-centered approaches make the viewer-centered ones appear less computationally expensive. On the other hand, viewer-centered approaches require a bigger database because, for each object, multiple views are required. To reduce the number of views, input and models can be normalized, or functions that interpolate or extrapolate objects of the model database can be used.

Tarr and Bülthoff [TB98] provide a more detailed overview of viewer-centered recognition, pointing out parallels between man, monkey and machine. For a survey on model-based recognition refer to [Pop94]. An overview of computational theories of object recognition with discussion of pros and cons of the different approaches is given by Edelman [Ede97].

1.2.2 Non-cognitive approaches

Non-cognitive approaches to the object recognition problem are solutions based on feature spaces and mathematical concepts. Comparison of query and database object representations can be facilitated by mapping into a different space. Methods such as the Minimum Description Length principle and Principal Component Analysis concentrate on significant differences between known objects. They have the added benefit of reducing the dimension of the feature space. Another common way is to use transformations like the Discrete Fourier Transform. For example, Funkhouser et al. [FMK+03] implemented a search engine for 3D models using spherical harmonics. An approach based on the computation of the shape distribution as the signature of an object is presented by Osada [OFCD01], and there are many other approaches that concentrate on the shape of the object, as shown e.g. in [OMT03]. Veltkamp [Vel01] provides an overview of shape matching similarity measures and algorithms.

1.3 Framework

The complete object recognition module that is currently being developed at the Mechatronics Research Group involves three components as shown in figure 1.1.

Figure 1.1: Overall structure of the object recognition module.

The database stores the module’s knowledge of objects in the world. This knowledge is encoded using a symbolic description based on shape-primitives. Apart from the object representation (which is a viewer-centered³ one) the following assumptions have been made⁴:

• There is only one object in each single image (scene).

• As there is only one object in each image, the subject of (partial) object occlusion will not be addressed.

• The size of the model database has not been taken into consideration.

• Slightly different lighting conditions are addressed by generalization of the models in the database. Interpolation of views can be implemented, but this is not part of this work.

The visual shape detection component⁵ provides the sensory input to the matching component. The matching component generates a set of hypotheses about the identity of the sensed object view. Based on these hypotheses, commands to the actuator module can be derived to change the camera position in an attempt to refine the hypotheses.

1.4 Task

In the context of this project, this work covers the database and the classification matching component. The original task was to design and implement:

1. a data structure capable of importing and holding all information that is extracted from an image by the detection module. This information includes:

• type (ellipse, triangle or rectangle),

• size (width and height),

• rotation from the main-axis and

• position

of shape-primitives detected in the image,

2. a database structure that holds information about all object views known to the system, and

3. an algorithm that finds the most similar object within the database for any givenquery.

³ Refer to section 1.2.1 for a discussion of viewer-centered and object-centered representation.
⁴ Regarding the complexity criteria stated at the beginning of section 1.2.
⁵ Implemented by Ruby Law at the Mechatronics Research Group.


The implementation is to be in C/C++ and should be capable of running in real-time to be applicable in the active vision domain.

During the process of development the following task was added:

4. Provide an algorithm to reduce the size of the database to one entry for each of the six orthogonal 2D views of an object, i.e. a set of entries representing the same object from the same view (e.g. under slightly varying lighting conditions) has to be reduced to a single entry.

1.5 Outline of this report

The remaining chapters are organized as follows:

• Chapter 2 introduces the syntax used to encode the system’s knowledge of objects in the world and covers tasks 1 and 2.

• Task 3 is addressed in chapter 3, where the classification matching algorithm is described.

• For the additional task of forming generalizations in the database, a genetic algorithm approach is presented in chapter 4.

• Chapter 5 deals with the graphical user interface (GUI) that has been developed for debugging and visualization purposes.

• Test results are presented and discussed in chapter 6.

• The last chapter concludes this work and proposes ideas for further improvement of the system.

• The appendix contains a detailed description of the implemented error functions for the shape properties, an overview of the source files, an analysis of the total number of possible matchings between two object views, and a detailed description of the data structure representing a (partial) matching.


Chapter 2

Database

Knowledge about the individual visual appearances of a collection of objects is stored in the database. Each object is represented by a maximum of six orthogonal 2D views as exemplified in figure 2.1. (For identical views only one representation with multiple labels is stored.)

Figure 2.1: Six orthogonal 2D views of a mug.

Each of these views is termed an “entry”. Rather than storing these entries as images, geometric information is stored symbolically using the syntax described in the following section, whose schematic is shown in figure 2.2. Symbolic representation reduces the amount of information stored in the database when compared to images. Furthermore, the representation developed is size-, translation- and rotation-invariant, which benefits matching between query and database object views.


Figure 2.2: Object description syntax scheme.

2.1 Object representation syntax

Every entry is described as a geometric formation of shape-primitives. Each shape-primitive has a set of geometric properties prescribing its shape type, size, aspect ratio, rotation and relative coordinates to a reference point. Shape-primitives used in this application belong to 3 shape types: ellipses, isosceles triangles and rectangles. This set of shape types was chosen as they cover a wide range of shapes found in artificial objects and each can be described by two parameters: their height and width, see figure 2.3.

Figure 2.3: Shape-primitives ELLIPSE, (isosceles) TRIANGLE and RECTANGLE and properties: width (a), height (b), diameter of circumcircle (d) and rotation angle (ϕ)

Rather than storing the height and width of each shape-primitive, a generic definition of size coupled with the aspect ratio is used for all shape types. The diameter of the circumcircle, i.e. the smallest circle encompassing the shape, is used as the size. For ellipses and rectangles the ratio of the shorter dimension to the longer dimension is used as the aspect ratio, whilst for triangles the aspect ratio is the height divided by the width. The relationship for converting between height/width and size/aspect ratio for each shape type is given in table 2.1.

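         ellipse                           triangle                                                                          rectangle

ratio    $\mathit{ratio} = b/a,\ a \ge b$   $\mathit{ratio} = b/a$                                                            $\mathit{ratio} = b/a,\ a \ge b$

size     $\mathit{size} = a$                $\mathit{size} = a \cdot \dfrac{1 + 4 \cdot \mathit{ratio}^2}{4 \cdot \mathit{ratio}}$   $\mathit{size} = a \cdot \sqrt{1 + \mathit{ratio}^2}$

Table 2.1: Conversion of the properties of shape-primitives: computation of size (diameter of circumcircle) and aspect ratio from width, a, and height, b.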
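The conversion in table 2.1 can be sketched in a few lines. The following Python fragment is only an illustration and not the C/C++ implementation of this work; the function name `to_size_and_ratio` and the string labels for the shape types are assumptions made for the example.

```python
import math


def to_size_and_ratio(kind, a, b):
    """Convert width a and height b into (size, ratio) following table 2.1.

    kind is one of "ellipse", "triangle", "rectangle" (hypothetical labels).
    """
    if a <= 0 or b <= 0:
        raise ValueError("width and height must be positive")
    if kind in ("ellipse", "rectangle"):
        # Table 2.1 requires a >= b for these two shape types.
        if a < b:
            raise ValueError("expected a >= b for ellipses and rectangles")
        ratio = b / a
        if kind == "ellipse":
            size = a  # the circumcircle diameter equals the longer axis
        else:
            size = a * math.sqrt(1.0 + ratio ** 2)  # length of the diagonal
    elif kind == "triangle":
        ratio = b / a  # height divided by width, may exceed 1
        size = a * (1.0 + 4.0 * ratio ** 2) / (4.0 * ratio)
    else:
        raise ValueError(f"unknown shape type: {kind}")
    return size, ratio
```

For instance, a rectangle with a = 4 and b = 3 has the aspect ratio 0.75 and a diagonal of length 5, and an isosceles triangle with a = 2 and b = 1 is right-angled at its apex, so its circumcircle diameter equals its base of 2.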

To combine the shape-primitives into a description for individual entries, a Cartesian coordinate system is defined for the formation and normalized as follows:

• The center of the largest shape-primitive is used as the origin (0, 0).

• The sizes of all shape-primitives are normalized by the size of the largest shape-primitive in the entry, thus yielding size-invariant entries.

• The x-axis is defined by the line joining the largest shape-primitive and the shape-primitive farthest from it. The center of the farthest shape-primitive has coordinates (δ, 0), where δ is the distance between the largest shape-primitive and the farthest shape-primitive. Where the shape-primitives are concentric, i.e. δ is 0, the axis is defined such that the rotation of the largest shape-primitive is 0. (Rotation of each shape-primitive is defined as the angle between the width “a” in figure 2.3 and the positive x-axis.)

• The y-axis is defined in the usual manner of 90° counter clockwise from the positive x-axis.

• The remaining shape-primitives are assigned coordinates relative to the origin as defined above.

Shape-primitives are ordered within each entry with the largest shape-primitive first, followed by the other shape-primitives sorted by non-ascending Euclidean distance of their circumcenters from the circumcenter of the first shape-primitive.

Given the consistent detection of the largest shape-primitive and the farthest shape-primitive, this representation syntax provides a size, translation and rotation invariant representation for each entry. These definitions impose natural bounds for the ranges of size, aspect ratio and rotation as presented in table 2.2.
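The normalization and ordering rules above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of this work: the `Shape` record and the function `normalize_entry` are hypothetical names, and positions are assumed to be scaled by the same factor as the sizes, which the text does not state explicitly.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Shape:
    kind: str     # "ellipse", "triangle" or "rectangle" (hypothetical labels)
    x: float      # circumcenter in raw image coordinates
    y: float
    size: float   # diameter of the circumcircle
    ratio: float  # aspect ratio as in table 2.1
    phi: float    # rotation angle in radians


def normalize_entry(shapes):
    """Return the shapes in the normalized, ordered form described above."""
    idx = max(range(len(shapes)), key=lambda i: shapes[i].size)
    ref = shapes[idx]
    k = 1.0 / ref.size
    # Origin at the largest shape; sizes and (by assumption) positions scaled.
    moved = [replace(s, x=(s.x - ref.x) * k, y=(s.y - ref.y) * k,
                     size=s.size * k) for s in shapes]
    far = max(moved, key=lambda s: math.hypot(s.x, s.y))
    delta = math.hypot(far.x, far.y)
    # The x-axis points at the farthest shape. In the concentric case the
    # rotation of the largest shape becomes 0 instead.
    theta = math.atan2(far.y, far.x) if delta > 1e-12 else ref.phi
    cos_t, sin_t = math.cos(-theta), math.sin(-theta)
    turned = []
    for s in moved:
        # Ellipses and rectangles are symmetric under a rotation by pi.
        period = 2.0 * math.pi if s.kind == "triangle" else math.pi
        turned.append(replace(
            s,
            x=cos_t * s.x - sin_t * s.y,
            y=sin_t * s.x + cos_t * s.y,
            phi=(s.phi - theta) % period,
        ))
    # Largest shape first, the rest by non-ascending distance from it.
    rest = [s for i, s in enumerate(turned) if i != idx]
    rest.sort(key=lambda s: math.hypot(s.x, s.y), reverse=True)
    return [turned[idx]] + rest
```

Applied to a rectangle of size 4 at (10, 10), an ellipse of size 2 at (10, 14) and a triangle of size 1 at (12, 10), the sketch places the rectangle at the origin with size 1, the ellipse at (1, 0) and the triangle at (0, -0.5).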

2.2 Building and storing the database

The database is built by importing descriptions from the detection component: Raw information about detected shape-primitives (type, position, aspect ratio and rotation angle) is read from a text file and normalized as described in the preceding section. The imported data is stored in two flat text files: one holds the description of the entries based on the object representation syntax and the other contains the corresponding text descriptions (or labels), e.g. “mug front 04”. For an object view there may be multiple text descriptions as different objects may look the same from certain points of view.

         ellipse    triangle    rectangle

size     (0, 1]     (0, 1]      (0, 1]

ratio    (0, 1]     (0, ∞)      (0, 1]

ϕ        [0, π)     [0, 2π)     [0, π)

Table 2.2: Value ranges of shape-primitive properties: Due to symmetries only rotation angles up to π for ellipses and rectangles have to be considered. For ellipses and rectangles, ratio is defined as (smaller dimension) / (bigger dimension) and thus its range is (0, 1]. For triangles, however, ratio can take any positive value and the rotation angle any value in [0, 2π).
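The value ranges of table 2.2 lend themselves to a simple consistency check when entries are imported. The Python sketch below is an assumption-laden illustration: the function names `in_value_range` and `wrap_rotation` are hypothetical, and the upper bound of the triangle ratio is treated as open (any finite positive value).

```python
import math


def wrap_rotation(kind, phi):
    """Map a rotation angle into the range table 2.2 gives for its type."""
    # Ellipses and rectangles look the same after a rotation by pi.
    period = 2.0 * math.pi if kind == "triangle" else math.pi
    return phi % period


def in_value_range(kind, size, ratio, phi):
    """Check normalized properties against the ranges of table 2.2."""
    if kind not in ("ellipse", "triangle", "rectangle"):
        raise ValueError(f"unknown shape type: {kind}")
    if not 0.0 < size <= 1.0:
        return False
    if kind == "triangle":
        return (ratio > 0.0 and math.isfinite(ratio)
                and 0.0 <= phi < 2.0 * math.pi)
    return 0.0 < ratio <= 1.0 and 0.0 <= phi < math.pi
```

A rectangle with a rotation angle of 4.0 radians, for example, is outside the range until `wrap_rotation` maps the angle to 4.0 - π.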

Figure 2.5 shows the result of a normalized object view representation of the input shown in figure 2.4. The object view in figure 2.4 will henceforth be represented by the collection of shape-primitives, shape 0 to 5, as an entry in the database.


Figure 2.4: Shape-primitives detected by the detection module.

Figure 2.5: Normalized object view representation of the input shown in figure 2.4.


Chapter 3

Classification matching component

After the detection results from a query image have been converted into the object representation syntax, they are passed as a query to the classification matching component. The following section describes the comparison of two object views. In section 3.2 an algorithm is presented that extends this basic operation to an operation on the whole database, which finds the entry in the database that is most similar to the query.

3.1 Comparison of two object views

In order to compare two arbitrary entries, α and β, a similarity measure is defined based on one-to-one mappings between shape-primitives in entries α and β, see figure 3.1.

Figure 3.1: One-to-one mapping of shape-primitives of two object views α and β.

Such a one-to-one mapping, called a “matching”, consists of several elementary mappings that map one shape-primitive to another one. Each result of such an elementary one-to-one mapping of shape-primitives, termed a “shape match”, may differ in the properties of the shape-primitives except the shape type, as it is assumed that there is no accidental cross shape type detection. For each shape match, differences in size, aspect ratio, rotation angle and position are accumulated in a shape match error. The only exceptions are matches with the virtual shape-primitive “unmatched”. A match of a shape-primitive of α with “unmatched” indicates that the specific shape-primitive has not been matched with any shape-primitive of β. In this case a special shape match error (e_unmatched) is computed as a penalty, because there are no properties that could be compared. For details on the computation of the specific errors refer to appendix A. Accumulating all elementary shape match errors finally results in an error for the whole matching.
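The accumulation of shape match errors into a matching error can be illustrated with a short sketch. The actual error functions are defined in appendix A and are not reproduced here; the absolute differences and the constant penalty `E_UNMATCHED` below are placeholders chosen purely for illustration, and position and rotation are left out because they require the alignment described later in this section. Shape-primitives of β that remain unmatched are not penalized in this sketch.

```python
from dataclasses import dataclass

E_UNMATCHED = 1.0  # placeholder penalty, not the value used in this work


@dataclass(frozen=True)
class Shape:
    kind: str     # shape type, e.g. "rectangle" (hypothetical labels)
    size: float
    ratio: float


def shape_match_error(a, b):
    """Error of one shape match; b is None for the virtual "unmatched"."""
    if b is None:
        return E_UNMATCHED
    if a.kind != b.kind:
        raise ValueError("shape matches across shape types are not allowed")
    # Placeholder error terms standing in for the functions of appendix A.
    return abs(a.size - b.size) + abs(a.ratio - b.ratio)


def matching_error(alpha, beta, m):
    """Sum of all shape match errors.

    m[i] is the index in beta matched with alpha[i], or None for "unmatched".
    """
    if len(m) != len(alpha):
        raise ValueError("m needs one entry per shape-primitive of alpha")
    used = [j for j in m if j is not None]
    if len(used) != len(set(used)):
        raise ValueError("the matching must be one-to-one")
    return sum(
        shape_match_error(alpha[i], None if j is None else beta[j])
        for i, j in enumerate(m)
    )
```

Representing a matching as a list indexed by the shape-primitives of α mirrors the notation m(α_i) from the list of abbreviations.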

Only the error for the aspect ratio is translation-, rotation- and scaling-invariant. For comparison of sizes, rotation angles and positions, some alignment is necessary to achieve scaling-, translation- and rotation-invariance, which is demonstrated in the following considering β to be the object view shown in figure 2.5 and α to be an artificially derived object view from β. Figure 3.2a shows α (left) and β (right) normalized and at the same scale (the diameter of the biggest circumcircle in each object view is 1.0). α has been constructed from β, assuming for demonstration purposes that for some reason the biggest shape-primitive (the rectangle that covers most of the main body of the mug) has not been detected (e.g. due to different lighting conditions), although that seems to be rather unlikely. Furthermore, one additional small rectangle has been detected but not the triangle (which represents a shadow region), and the properties of the remaining shape-primitives have been altered slightly. This example resembles the worst case scenario for a comparison because the complete set of transformations shown in the following has to be applied.

Figure 3.3 shows a flowchart of the first part of the comparison algorithm. The first transformations are applied after the first time a shape-primitive of α is not matched with “unmatched”. In the example, the first (i.e. biggest) shape-primitive of α has been matched with a corresponding shape-primitive of β as shown in figure 3.2b. For the position and rotation comparison the following transformation has to be made: Both object views are shifted so that the circumcenters of both shape-primitives of the matched pair are (0, 0). To attain scaling invariance, β is scaled so that the relative sizes of both shape-primitives of the corresponding pair are the same. As a result, the errors for size and position of this shape match are zero. At this stage, the coordinate systems of the two object views use the same scale and correspond at least at the point of the origin. Obviously, the latter computations are only necessary if the first shape-primitive of α has not been matched with the first shape-primitive of β as in the case of the example.

Figure 3.2: Matching example: In this illustrative (worst) case the complete set of transformations has to be applied for the comparison.

Figure 3.3: Comparison of two entries - flowchart (part 1). (For details on the computation of the specific shape property errors refer to appendix A.)

Figure 3.4: Comparison of two entries - flowchart (part 2). (For details on the computation of the specific shape property errors refer to appendix A.)

Figure 3.4 shows a flowchart of the remaining computations. For the following shape matches the error computation is limited to the size and aspect ratio error, because the rotation angles cannot be compared, and the position error is substituted by a distance error as long as the rotation of β has not been aligned. To do this final step of alignment, a second shape match has to be found with the additional constraint that the position of both shape-primitives must be different from (0, 0), see figure 3.2c. This shape match is very likely to be found very early because of the order of the shape-primitives within an object view, as was explained in section 2.1.

After the rotation, all positions and rotation angles refer to the same coordinate system. At this stage, all four errors mentioned above can be computed and the errors for all previous shape matches are updated. Continuing the example, the next two rectangles of α are matched with rectangles of β as shown in figure 3.2d. The last remaining rectangle of α is matched with “unmatched”, leaving two unmatched shape-primitives of β because there are no more shape-primitives of α they could be matched with (figure 3.2e). Theoretically, the last shape-primitive of α could be matched with the big rectangle of β, which might result in a matching with a smaller error. In fact, this example covers only one out of an exponential number of possible matchings. A formula for the computation of the number of possible matchings between two object views is given in appendix C.

The smallest error that a matching between two object views produces defines the similarity of the two views.
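The alignment steps described above (translate, scale, rotate) can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Primitive` fields, the function names and the representation of an object view as a plain list are hypothetical and not taken from the implementation, and rotation angles are assumed to be given in radians.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Primitive:
    """Hypothetical shape-primitive: circumcenter, size and rotation angle."""
    x: float
    y: float
    size: float
    angle: float  # radians (assumption)


def translate(view, dx, dy):
    return [replace(p, x=p.x + dx, y=p.y + dy) for p in view]


def scale(view, factor):
    return [replace(p, x=p.x * factor, y=p.y * factor, size=p.size * factor)
            for p in view]


def rotate(view, phi):
    c, s = math.cos(phi), math.sin(phi)
    return [replace(p, x=c * p.x - s * p.y, y=s * p.x + c * p.y,
                    angle=p.angle + phi) for p in view]


def align(alpha, beta, first, second):
    """Align beta to alpha using two shape matches.

    first, second: (index in alpha, index in beta) of the matched pairs.
    The second pair must not lie at the origin after the translation.
    """
    ia, ib = first
    # Step 1: shift both views so the first matched pair sits at (0, 0).
    alpha = translate(alpha, -alpha[ia].x, -alpha[ia].y)
    beta = translate(beta, -beta[ib].x, -beta[ib].y)
    # Step 2: scale beta so the first matched pair has the same size.
    beta = scale(beta, alpha[ia].size / beta[ib].size)
    # Step 3: rotate beta so the second matched pair points in the same direction.
    ja, jb = second
    phi = (math.atan2(alpha[ja].y, alpha[ja].x)
           - math.atan2(beta[jb].y, beta[jb].x))
    beta = rotate(beta, phi)
    return alpha, beta
```

After `align`, positions, sizes and rotation angles of both views refer to one coordinate system, so all four error terms can be evaluated on the returned lists.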

3.2 Processing a query

For a given query, the most similar entry within the database has to be found based on the similarity measure introduced in the preceding section. Thus the original task (of finding the most similar entry) can be redefined to: find the optimal matching (i.e. the one with the minimum error) out of all possible matchings of the query with an entry of the database.

In the following it is important to differentiate between complete and partial matchings. Recall the comparison example from section 3.1: a (partial) matching describes the shape matches and alignments at an intermediate stage of the comparison. A matching describing the end stage is a complete matching because all shape-primitives of both object views have been matched (shape-primitives mapped to “unmatched” are regarded as matched). In contrast, partial matchings describe stages of the comparison where there are still “free” shape-primitives, i.e. shape-primitives that are matched neither with other shape-primitives nor with “unmatched”. A detailed description of the data structure representing a (partial) matching is given in appendix D.

The search space for the matching problem contains all possible partial and complete matchings of the query object view with all object views in the database. Obviously, a solution of the problem is a complete matching. Hence, the solution space is the subspace of the search space that comprises only the complete matchings. This solution space is exponential in size.¹ The search space can be structured as a tree with the empty matching² as root. The root has n children, where n is the number of object views in the database. Let this be the 0-th level of the search tree. To each partial matching at this level, a corresponding database entry has been assigned, but apart from that these nodes contain no matching data. All nodes in the subtree rooted at a node at level 0 correspond to matchings with the same database entry. The tree has |α| more levels, where α is the query object view and |α| denotes the number of shape-primitives of α. The nodes at levels 1 ≤ k ≤ |α| are all possible extensions of the partial matchings of level k − 1 that can be constructed by matching the k-th shape-primitive of the query. The leaves of the search tree (i.e. the nodes at level |α|) are complete matchings. Figure 3.5 shows the optimal search tree that leads to the matching constructed in the example used in section 3.1.

¹ Refer to appendix C for a formula to compute the number of possible matchings between two object views. The size of the solution space is the sum of the number of possible matchings of the query with each entry of the database.

² The empty matching is a partial matching that contains no matching data, i.e. no database entry has been assigned and no shape-primitive has been matched.

Figure 3.5: Optimal search-tree for the example used in section 3.1. (Matchings with database entries other than β are not shown.)

A naive exhaustive search algorithm would solve the optimal matching problem by simply enumerating all possible matchings and picking the solution with the minimum error. An enumeration of all possible matchings can be obtained e.g. by breadth-first or depth-first traversal of the tree described above. Obviously, this is a highly inefficient algorithm. It can be observed that extending a matching cannot decrease the matching error. Thus, the error of an internal node in the search tree is a lower bound for the matching error of all complete matchings (leaves) in the subtree rooted at this node. The whole subtree can be pruned if a complete matching with a smaller error has already been found. Pruning significantly improves the algorithm’s efficiency but depends very much on the quality of the complete matching that is used for pruning. Additionally, pruning cannot be applied as long as no complete solution has been found. Using a heuristic to compute a start solution before traversing the search tree solves these problems. In addition, the algorithm’s efficiency can be further improved by finding a tighter lower bound for the matching error.

An algorithm that incorporates these ideas is called a “branch & bound algorithm” [LW66]. A generic branch & bound algorithm is described in section 3.2.1. The data structure holding all (partial) solutions created by the branch & bound algorithm is explained in section 3.2.2. Section 3.2.3 gives a detailed description of the lower bound that is used for pruning. The heuristic to compute the start solution is presented in section 3.2.4. As an optional replacement for the lower bound, an error-overestimating heuristic is proposed in section 3.2.5. Finally, section 3.2.6 discusses further extensions of the branch & bound algorithm.

3.2.1 Branch & bound algorithm

The main structure of the generic branch & bound algorithm is presented in algorithm 1. At the beginning, an initial solution is generated using a (greedy) heuristic and stored as the best complete solution known so far. This best known solution is used for pruning the search tree (“bound”) and is updated every time a better (complete) solution is found (line 8). The search begins with the empty solution as root. In each iteration the node that represents the “best” partial solution amongst those created so far is expanded (“branch”). Generally, the branch & bound algorithm places no constraints on the choice of the partial solution to be extended. However, for this application, the algorithm converges to the best solution faster if only those nodes that lead to good solutions are expanded. In the expansion step (line 5) the next³ free shape-primitive of the query is matched with all free shape-primitives (of the same shape type) of the database entry the partial matching refers to and with “unmatched”. From the resulting (partial) solutions, only those that could lead to a better solution than the one known so far are kept, and the algorithm stops when no better solutions can be constructed. Performance of the search is highly dependent on the quality of the lower bound of the error, as this lower bound is used, in line 6, to prune the search tree. Obviously, the greater the underestimate of the minimum error of a partial solution, the more branches are created in the search tree, which degrades performance. On the other hand, the error should not be overestimated, as a branch leading to the best solution could be cut off and the algorithm could no longer be guaranteed to return the best solution. However, as a trade-off between the quality of the solution and speed, the lower bound may be replaced by a heuristic that may overestimate the minimum error.

³ The order of shape-primitives has been introduced in section 2.1.

Algorithm 1 Generic branch & bound algorithm.

Require: problem: min {f(x) | x ∈ B, B ≠ ∅, |B| < ∞}
Ensure: optimal solution: best and f(best)
 1: best ← initial solution
 2: list ← {the empty solution}
 3: while list is not empty do
 4:   x ← best partial solution from list (x is removed from list)
 5:   for all possible extensions c of x do
 6:     if f_lower_bound(c) < f(best) then
 7:       if c is a complete solution then
 8:         best ← c
 9:         for all e ∈ list do
10:           if f(e) ≥ f(best) then
11:             remove e from list
12:           end if
13:         end for
14:       else
15:         insert c into list
16:       end if
17:     end if
18:   end for
19: end while
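To make the control flow of algorithm 1 concrete, the following Python sketch applies it to a deliberately simplified toy problem: shape-primitives are reduced to single numbers, the shape match error is the absolute difference, and `PENALTY` is a made-up cost for “unmatched”. None of these names or values come from the implementation. The sketch also swaps in a binary heap for the list and prunes lazily when a node is popped instead of performing the batch deletion of lines 9-13.

```python
import heapq
import itertools

UNMATCHED = None
PENALTY = 1.0  # hypothetical cost of matching a query item with "unmatched"


def cost_of(q, entry, j):
    """Toy shape match error between query value q and entry[j]."""
    return PENALTY if j is UNMATCHED else abs(q - entry[j])


def greedy_solution(query, entry):
    """Initial solution (line 1): pick the cheapest free partner for each item."""
    used, assignment, total = set(), [], 0.0
    for q in query:
        options = [j for j in range(len(entry)) if j not in used] + [UNMATCHED]
        j = min(options, key=lambda j: cost_of(q, entry, j))
        if j is not UNMATCHED:
            used.add(j)
        assignment.append(j)
        total += cost_of(q, entry, j)
    return total, tuple(assignment)


def branch_and_bound(query, entry):
    best_cost, best = greedy_solution(query, entry)        # line 1
    tie = itertools.count()  # tie-breaker so the heap never compares assignments
    open_list = [(0.0, next(tie), ())]                     # line 2
    while open_list:                                       # line 3
        bound, _, partial = heapq.heappop(open_list)       # line 4
        if bound >= best_cost:
            continue  # lazy pruning instead of the batch deletion (lines 9-13)
        k = len(partial)
        options = [j for j in range(len(entry)) if j not in partial] + [UNMATCHED]
        for j in options:                                  # line 5
            # Extending a matching cannot decrease its error, so the error
            # accumulated so far serves as a (weak) lower bound.
            c = bound + cost_of(query[k], entry, j)
            if c < best_cost:                              # line 6
                child = partial + (j,)
                if len(child) == len(query):               # line 7
                    best_cost, best = c, child             # line 8
                else:
                    heapq.heappush(open_list, (c, next(tie), child))  # line 15
    return best_cost, best
```

In this toy setting the greedy start solution is not always optimal, which is exactly the situation in which the search over the open list pays off; a tighter lower bound than the accumulated error would prune more nodes.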

3.2.2 Underlying data structure

The data structure that holds all partial solutions created by the branch & bound algorithm (referred to as “list” in algorithm 1) is implemented as an “AVL tree”.⁴ Named after its inventors, Adelson-Velskii and Landis [AVL62], an AVL tree is a height-balanced binary search tree, i.e. each node within the tree has at most two child subtrees, which may differ in height by at most one. All nodes in the left subtree have smaller values whereas those in the right subtree have bigger values. Look-up, insertion and deletion are all O(log(n)) in both the average and worst cases, where n is the number of nodes. These operations are used very frequently by the branch & bound algorithm: In each iteration there is one look-up (line 4) and there are multiple insertions (line 15), depending on the number of good extensions of the partial solution chosen in this iteration. Every time a complete solution is found, a batch deletion is called to prune the search tree (line 11). Inserting or deleting a node may result in an unbalanced subtree, but rebalancing is done in only a few operations: At most one rotation is required after an insert operation, whereas O(log(n)) rotations may be required after a delete operation, because it might be necessary to continue rebalancing back up the tree after a rotation. Figure 3.6 shows the four rebalancing operations:

• Left-left-heaviness can occur in a subtree of an AVL tree after a node has been inserted in subtree 1 or deleted from subtree 3 (in this case subtree 2 may have height h + 1 as well). The operation labeled “LL” is a single right rotation in node B.

• Left-right-heaviness can result from deleting a node from subtree 4 or from inserting a node into subtree 2 or 3 (in this case both subtrees of node C would have had height h − 1 before the insertion and only one of them - it does not matter which one - would have height h after the insertion). The operation labeled “LR” is a left rotation in node A (to reduce the problem to the case of left-left-heaviness) followed by a right rotation in B. The whole operation is called a “double rotation”.

• Right-right-heaviness and right-left-heaviness are basically the symmetric cases of left-left-heaviness and left-right-heaviness.

⁴ In the current implementation only the node with the smallest value has to be returned. Thus, the full functionality of the look-up operation is not needed and the AVL tree may be replaced by a simpler data structure such as a heap. However, this may result in only a slight improvement of the performance of the branch & bound algorithm. During development the additional functionality of the AVL tree seemed to be required, and the look-up operation has been used extensively during debugging.


Figure 3.6: The four cases of “heaviness” that can occur in an AVL (sub)tree and the rotations required to rebalance it.

All of these rebalancing operations can be performed in O(1), as it is only necessary to update a fixed number of pointers.
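The “LL” and “LR” cases can be illustrated with a minimal Python sketch. The `Node` class and all function names are assumptions made for this illustration; node B denotes the root of the unbalanced subtree and node A its left child, as in the description above. Each rotation only rewires a fixed number of references, which is why it runs in constant time.

```python
def height(node):
    """Height of a (sub)tree; an empty subtree has height 0."""
    return node.height if node else 0


class Node:
    """Hypothetical AVL node that caches its own height."""

    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.height = 1 + max(height(left), height(right))


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_right(b):
    """Single right rotation in b; b's left child becomes the new root."""
    a = b.left
    b.left, a.right = a.right, b
    update(b)
    update(a)
    return a


def rotate_left(a):
    """Single left rotation in a; a's right child becomes the new root."""
    c = a.right
    a.right, c.left = c.left, a
    update(a)
    update(c)
    return c


def rebalance_left_heavy(b):
    """Repair a subtree whose left side is two levels higher than its right."""
    a = b.left
    if height(a.right) > height(a.left):
        # LR case: a left rotation in A reduces it to the LL case.
        b.left = rotate_left(a)
    # LL case: single right rotation in B.
    return rotate_right(b)
```

The right-heavy cases follow by swapping `left` and `right` throughout.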

The nodes of the AVL tree are implemented as linked lists to be able to hold multiple partial matchings with the same lower bound for the matching error. (The lower bound is used as the value of the node and is introduced later on in this section.) This is necessary because using the matchings directly as tree nodes would result in multiple nodes having the same value. Note that multiple entries for the same partial matching cannot occur because each partial solution can be created only once (see the description of how a partial matching is extended in section 3.1). Two operations can be performed on a node: A matching can be inserted (push) or retrieved and deleted (pop). Theoretically, the pop operation may return any arbitrary element of the list. Here, a stack-like LIFO (last in, first out) behavior has been chosen, because returning the head of the list involves the least computational cost (O(1)). (Besides, it slightly biases the branch & bound algorithm towards a depth-first search.)
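The idea of one node per distinct lower bound, each holding a LIFO stack of matchings, can be sketched as follows. Note the substitution: instead of an AVL tree, this sketch keeps the distinct bounds in a binary heap (the simpler alternative mentioned in the footnote above), so it only supports retrieving the smallest bound. The class and method names are hypothetical.

```python
import heapq


class BucketQueue:
    """Priority structure with one LIFO bucket per distinct lower bound."""

    def __init__(self):
        self._keys = []      # min-heap of the distinct lower bounds
        self._buckets = {}   # lower bound -> stack of partial matchings

    def push(self, bound, matching):
        if bound not in self._buckets:
            self._buckets[bound] = []
            heapq.heappush(self._keys, bound)
        self._buckets[bound].append(matching)

    def pop(self):
        """Return (bound, matching) with the smallest bound, newest first."""
        bound = self._keys[0]
        bucket = self._buckets[bound]
        matching = bucket.pop()          # LIFO: head of the list, O(1)
        if not bucket:
            heapq.heappop(self._keys)
            del self._buckets[bound]
        return bound, matching

    def prune(self, threshold):
        """Batch deletion: drop every bucket whose bound is >= threshold."""
        keep = [b for b in self._keys if b < threshold]
        for b in self._keys:
            if b >= threshold:
                del self._buckets[b]
        heapq.heapify(keep)
        self._keys = keep

    def __len__(self):
        return sum(len(bucket) for bucket in self._buckets.values())
```

Because matchings with equal bounds are returned newest first, recently created (deeper) partial matchings are expanded before older ones, which produces the slight depth-first bias mentioned above.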


3.2.3 Lower bound computation

Pruning the branch & bound search tree requires the computation of a lower bound of the matching error for partial matchings. (For complete solutions the matching error can be computed as presented in section 3.1.) The lower bound of the matching error can be divided into three parts:

1. The initial error is the sum of the individual shape match errors for all shape matches of the partial matching. This error is exact, as these shape matches are fixed, and it cannot decrease during further extensions of the partial solution.

2. The estimated error is the lower bound for the increase of the matching error during further extension of the partial solution. The algorithm is shown in algorithm 2. For each shape type, t, the maximum number of shape-primitives that can be matched in further extensions, n_t, is calculated (lines 3-5). Then, for each free shape-primitive of the query, the minimum shape match error for all possible shape matches with shape-primitives of the database entry or “unmatched” is computed (lines 6-10). Finally, the estimated error is accumulated from the n_t smallest of these shape match errors for each shape type t (lines 11-15). This procedure underestimates the real error, as the estimate permits multiple shape-primitives of the query to match a single database shape-primitive.

3. The unmatched error is a lower bound for the matching error resulting from shape matches with “unmatched”. The computation is very similar to the one for the estimated error and is shown in algorithm 3: Firstly, for each shape type, t, the minimum number of shape-primitives that have to be matched with “unmatched”, u_t, is determined (lines 3-5). Then the unmatched error is accumulated from the u_t smallest shape match errors for matching free shape-primitives of this shape type with “unmatched” (lines 11-15). (Depending on where there are more shape-primitives of this type, shape-primitives of the query or the database entry are chosen (lines 6-10).)
When all shape-primitives of the query are matched, i.e. it is a complete solution, this part of the matching error holds the penalty for shape-primitives of the database entry that were matched with “unmatched”.
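The two estimated parts of the lower bound can be sketched in Python as follows. The sketch follows the textual description (the surplus shape-primitives are taken from whichever side has more of a given type). Shape-primitives are represented as hypothetical `(type, size)` tuples and the shape match error is passed in as a function `err`, since its definition (appendix A) is not reproduced here.

```python
UNMATCHED = None
SHAPE_TYPES = ("ELLIPSE", "TRIANGLE", "RECTANGLE")


def of_type(shapes, t):
    return [s for s in shapes if s[0] == t]


def estimated_error(free_query, free_entry, err):
    """Lower bound for the error added by future shape matches."""
    total = 0.0
    for t in SHAPE_TYPES:
        fq, fe = of_type(free_query, t), of_type(free_entry, t)
        n_t = min(len(fq), len(fe))  # at most n_t real matches are possible
        # Cheapest partner for every free query primitive; partners may be
        # reused, which is why this only underestimates the real error.
        cheapest = sorted(min(err(s, s2) for s2 in fe + [UNMATCHED]) for s in fq)
        total += sum(cheapest[:n_t])
    return total


def unmatched_error(free_query, free_entry, err):
    """Lower bound for the error caused by unavoidable 'unmatched' matches."""
    total = 0.0
    for t in SHAPE_TYPES:
        fq, fe = of_type(free_query, t), of_type(free_entry, t)
        u_t = len(fq) - len(fe)
        surplus = fq if u_t > 0 else fe  # the side with more primitives
        penalties = sorted(err(s, UNMATCHED) for s in surplus)
        total += sum(penalties[:abs(u_t)])
    return total


def lower_bound(initial_error, free_query, free_entry, err):
    return (initial_error
            + estimated_error(free_query, free_entry, err)
            + unmatched_error(free_query, free_entry, err))
```

For a quick experiment, `err` could for instance return the size of a primitive when its partner is `UNMATCHED` and the absolute size difference otherwise; this toy error is an assumption and not the measure used in the report.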

3.2.4 Upper bound computation

In line 1 of the branch & bound algorithm (see algorithm 1) an initial solution has to be generated. Algorithm 4 shows the basic structure of the algorithm used. It is a greedy algorithm that extends a partial matching, M, to a complete matching, M′, by matching each free shape-primitive, s, of the query object view, α, with the free shape-primitive, s′, that has the same shape type as s and produces the minimum shape match error (line 5). The matching error for such an initial solution is used as an initial upper bound for the matching error. All (partial) solutions that exceed this threshold can be discarded, as a better solution is already known.

Algorithm 2 Computation of the estimated error.

Require: Fα - set of free shape-primitives of α (the query)
Require: Fβ - set of free shape-primitives of β (the database entry)
Ensure: error_estimated
 1: error_estimated ← 0
 2: for all t ∈ {ELLIPSE, TRIANGLE, RECTANGLE} do
 3:   Fα,t ← {s ∈ Fα | type(s) = t}
 4:   Fβ,t ← {s ∈ Fβ | type(s) = t}
 5:   n_t ← min{|Fα,t|, |Fβ,t|}
 6:   E_t ← ∅
 7:   for all s ∈ Fα,t do
 8:     e ← min{shape_match_error(s, s′) | s′ ∈ Fβ,t ∪ {“unmatched”}}
 9:     E_t ← E_t ∪ {e}
10:   end for
11:   for 0 ≤ i < n_t do
12:     e ← min{E_t}
13:     E_t ← E_t \ {e}
14:     error_estimated ← error_estimated + e
15:   end for
16: end for

Algorithm 3 Computation of the unmatched error.

Require: Fα - set of free shape-primitives of α (the query)
Require: Fβ - set of free shape-primitives of β (the database entry)
Ensure: error_unmatched
 1: error_unmatched ← 0
 2: for all t ∈ {ELLIPSE, TRIANGLE, RECTANGLE} do
 3:   Fα,t ← {s ∈ Fα | type(s) = t}
 4:   Fβ,t ← {s ∈ Fβ | type(s) = t}
 5:   u_t ← |Fα,t| − |Fβ,t|
 6:   if u_t > 0 then
 7:     E_t ← {shape_match_error(s, “unmatched”) | s ∈ Fα,t}
 8:   else if u_t < 0 then
 9:     E_t ← {shape_match_error(s, “unmatched”) | s ∈ Fβ,t}
10:   end if
11:   for 0 ≤ i < |u_t| do
12:     e ← min{E_t}
13:     E_t ← E_t \ {e}
14:     error_unmatched ← error_unmatched + e
15:   end for
16: end for

Algorithm 4 Computation of an upper bound for the matching error.

Require: partial matching M of object views α (query) and β (database entry)
Ensure: complete matching M′ with error_UB
 1: M′ ← M
 2: Fα ← {s ∈ α | free(s)}
 3: Fβ ← {s ∈ β | free(s)}
 4: for all s ∈ Fα do
 5:   m(s) ← argmin {shape_match_error(s, s′) | s′ ∈ Fβ ∪ {“unmatched”}, type(s′) = type(s)}
 6:   M′ ← M′ ∪ {(s, m(s))}
 7:   if m(s) ≠ “unmatched” then
 8:     Fβ ← Fβ \ {m(s)}
 9:   end if
10: end for

Originally only intended for the initial solution, this algorithm can be applied to any partial solution, M, yielding an upper bound for the minimum matching error of all solutions in the subtree of the branch & bound search tree that is rooted at M.
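A compact Python rendering of the greedy completion in algorithm 4 might look like this. As before, shape-primitives are hypothetical `(type, size)` tuples, the shape match error is supplied by the caller as `err`, and the function name is made up for the sketch. Like algorithm 4, it only assigns partners to the free query primitives and returns the error added by these assignments.

```python
UNMATCHED = None


def greedy_completion(matching, free_query, free_entry, err):
    """Extend a partial matching to a complete one (cf. algorithm 4).

    matching:   list of (query primitive, partner) pairs fixed so far
    free_query: free shape-primitives of the query
    free_entry: free shape-primitives of the database entry
    err:        shape match error function (assumed to be given)
    Returns the completed matching and the error added by the completion.
    """
    complete = list(matching)            # line 1
    remaining = list(free_entry)         # line 3
    added = 0.0
    for s in free_query:                 # line 4
        candidates = [c for c in remaining if c[0] == s[0]] + [UNMATCHED]
        m = min(candidates, key=lambda c: err(s, c))   # line 5
        complete.append((s, m))          # line 6
        added += err(s, m)
        if m is not UNMATCHED:           # lines 7-9
            remaining.remove(m)
    return complete, added
```

Adding the returned value to the exact error of the already fixed shape matches gives an upper bound for the best complete matching below the given partial matching.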

3.2.5 An error-overestimating heuristic

As a trade-off between the quality of the solution and speed, the lower bound used to prune the branch & bound search tree may be replaced by a heuristic that may overestimate the minimum error. Using such a heuristic, it is no longer guaranteed that the best solution will be found, but on the other hand the algorithm can be sped up. The implemented heuristic is just a weighted mean of the lower bound and upper bound introduced in the preceding paragraphs (see algorithms 2, 3 and 4):

$$\mathit{error}_{sum} = \mathit{error}_{init} + w_{UB}\,\mathit{error}_{UB} + (1.0 - w_{UB})\,(\mathit{error}_{estimated} + \mathit{error}_{unmatched})$$

To use the heuristic the module has to be compiled with the switch USE_HEURISTIC defined in config.h. The weight w_UB is defined by HEURISTIC_WEIGHT_UB in config.h and may have any value between 0 and 1. The higher the weight, the bigger the influence of the upper bound and the higher the probability that the branch of the search tree that leads to the best solution is cut off. During the test period the heuristic has only been used to speed up the genetic algorithm that is presented in the following section (the switch GA_USE_ONLY_ESTIMATION has to be defined in config.h). But as the genetic algorithm is supposed to run offline and there is usually no need for real-time capability, the heuristic is disabled by default.
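The weighted mean is small enough to state directly in code. The following Python function is a sketch with made-up parameter names; it only restates the formula above and checks the admissible range of the weight.

```python
def heuristic_error(error_init, error_ub, error_estimated, error_unmatched, w_ub):
    """Weighted mean of upper bound and lower bound on top of the exact part.

    w_ub = 0.0 reproduces the pure lower bound, w_ub = 1.0 the pure upper bound.
    """
    if not 0.0 <= w_ub <= 1.0:
        raise ValueError("w_ub must lie between 0 and 1")
    lower = error_estimated + error_unmatched
    return error_init + w_ub * error_ub + (1.0 - w_ub) * lower
```

With a weight of 0 the value coincides with the lower bound of section 3.2.3, so optimality is preserved; any larger weight may overestimate the error and can therefore prune the branch that contains the best solution.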


3.2.6 Further extensions of the branch & bound algorithm

During the evaluation process and for the application in an active vision process it has become useful to have information not only on the best solution but on the k best ones. Unfortunately, branch & bound algorithms are not supposed to return more than the best solution. To illustrate that, assume that the initial solution is already the best one. As a result, no leaf node of the search tree (complete solution) with a bigger error would be reached. This behavior can be circumvented by either:

• running the branch & bound algorithm k times, excluding the database entry referring to the best solution of each run from the database, or

• not pruning (and letting the tree grow exponentially), or

• allowing the error of (partial) solutions to exceed the upper bound by a certain amount.

Obviously, the first approach is very inefficient and would take too much time. For the second approach, an exponential amount of space would be necessary. Thus, a variant of the third approach has been implemented, imposing only the constraint that at least k object views have to be stored in the database. The variable best in algorithm 1 is extended to an array storing k solutions, which is consequently initialized with k initial solutions. For the pruning step (line 6) the worst of all solutions in best is used. For line 8 a more complex update logic has been implemented. Additionally, the switch DONT_ALLOW_MULTIPLE_MAPPINGS can be set in config.h to prevent the algorithm from returning multiple matchings with the same database entry (such matchings are very likely, but usually the question is which database entry is the next most similar one). Obviously, more nodes will be created because of the weakened bounding criterion, which will have a negative impact on the performance of the algorithm.
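One possible way to organize the array of the k best solutions is sketched below in Python. The class is hypothetical and much simpler than the update logic mentioned above (for instance, it does not handle the DONT_ALLOW_MULTIPLE_MAPPINGS case); it merely shows how the worst stored solution can serve as the pruning threshold.

```python
import bisect


class KBest:
    """Keeps the k best (error, label) pairs seen so far, sorted by error."""

    def __init__(self, initial):
        # initial: k start solutions, e.g. produced by a greedy heuristic
        self._items = sorted(initial)

    def threshold(self):
        """Error of the worst stored solution; used for pruning (line 6)."""
        return self._items[-1][0]

    def offer(self, error, label):
        """Store a new complete solution if it beats the current worst one."""
        if error >= self.threshold():
            return False
        bisect.insort(self._items, (error, label))
        self._items.pop()  # drop the former worst solution
        return True

    def results(self):
        return list(self._items)
```

A short usage example: starting from `KBest([(0.9, "a"), (0.5, "b"), (0.7, "c")])`, the threshold is 0.9; after `offer(0.6, "d")` the entry `"a"` is dropped and the threshold tightens to 0.7.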

In every iteration the branch & bound algorithm accesses the leftmost node in the AVL tree. Access cost for this node can theoretically be improved by an additional pointer to this node. This reduces the complexity of this access operation from O(log(n)) (where n is the number of nodes in the AVL tree) to O(1), i.e. constant time. This optional extension can be enabled by defining CAVLTREE_FIRSTPTR in CAVLTree.h. However, in practical use it does not improve performance but seems to slightly increase the running time (for a small database as well as a large database with 5000 random entries). This unexpected behavior is likely to be caused by the additional logic that is needed to maintain the pointer to the leftmost node.


Chapter 4

Forming generalizations in the database

To reduce the size of the database to one entry for each of the six orthogonal 2D views, a set of entries representing the same object from the same view (e.g. under slightly varying lighting conditions) has to be reduced to a single entry that should in some way resemble all entries in the set. As there is no operation that computes an “average” object view (the varying description length makes it even more complicated), a genetic algorithm has been applied to find such an “average” object view.

4.1 About genetic algorithms

Genetic algorithms, like neural networks, fuzzy systems and probabilistic reasoning, belong to the “soft computing” techniques. Soft computing provides tolerance of imprecision, uncertainty and partial truth as well as low solution cost, which makes it very attractive for problems of high computational complexity or incomplete/inaccurate input data. On the other hand, solutions are only approximated. There is no guarantee that an optimal solution will be found.

Generally, genetic algorithms (also referred to as evolutionary algorithms) can be applied to any kind of optimization problem such as parameter optimization, path-finding problems or strategy-finding problems. They can easily be parallelized and can search spaces of hypotheses containing complex interacting parts [Mit97]. Their basic underlying idea is to simulate the process of biological evolution, which has proven to be a robust method for adaptation within biological systems: Starting from an initial population of individuals, following generations are generated by random variations (mutation) and combination (crossover). During this recombination process new features can evolve. Individuals with advantageous features are favored in the selection for the next generation. They benefit from their better “fitness” and thus have higher probabilities to have offspring.

Algorithm 5 shows the structure of a generic genetic algorithm.

Algorithm 5 Generic genetic algorithm as e.g. in [GKK04].

Ensure: best hypothesis in pop_t
 1: t ← 0
 2: initialize pop_t
 3: evaluate pop_t
 4: while termination criterion is not met do
 5:   t ← t + 1
 6:   select pop_t from pop_{t−1}
 7:   alter pop_t
 8:   evaluate pop_t
 9: end while

Implementation of such an algorithm comprises:

• Representation of an individual (hypothesis, solution candidate)
The representation of the individuals defines the search space of the GA. According to Goldberg’s “principle of the minimal alphabet” [Gol89], the smallest representation that permits a natural expression of the problem should be selected. Choosing an oversized representation might result in wasting time by searching irrelevant regions of the search space; conversely, some (possibly good) hypotheses might not be representable if an undersized representation is selected. A very common way to code hypotheses (e.g. sets of if-then-else rules) is by bit strings, which can easily be manipulated by genetic operators. Symbolic representations, as widely used e.g. in the domain of genetic programming (e.g. in [Mit97]), require more sophisticated implementations of the genetic operators.

• Generation of an initial population
In line 2 of algorithm 5 the population is initialized, which usually means that it is generated at random. Thus a function is needed that randomly generates individuals.

• Definition of a fitness function
The fitness function is used to compute the quality of each hypothesis (lines 3 and 8 in algorithm 5). The definition of this function together with the choice of the representation are the most important and challenging tasks when implementing a GA. The fitness function is so important because the GA actually optimizes this function and not the original problem. Thus mistakes in the definition of the fitness function will have a great impact on the quality of the result of the GA.

• Selection
Based on their fitness, individuals are selected for the next generation (line 6 of algorithm 5). There are several methods to do this, e.g.:

– fitness proportionate selection (also known as roulette wheel selection or Monte Carlo selection): The probability that an individual is selected for the next generation is defined by the proportion of its fitness to the fitness of the whole population (sum of the fitness values of all individuals of the population).

– rank selection: The probability that an individual is selected for the next generation is proportional to its rank (considering all individuals of the population to be sorted by their fitness).

– tournament selection: The “winner” of a tournament of k ≥ 2 randomly chosen individuals is selected. Chances for an individual to win a tournament depend on its fitness.

• Genetic operators
Genetic operators are needed to alter the population (line 7 of algorithm 5) to create individuals with new features. These operators usually correspond to those found in biological evolution. Common genetic operators are:

– mutation: An individual is randomly altered.

– crossover: Taking two individuals as input, one or two offspring are generated by recombination of the features of the parents.

• Termination criterion
GAs approximate solutions. Therefore it cannot be expected that a perfect solution is found. To guarantee that the algorithm terminates, some criterion has to be defined that is checked after each iteration (line 4 of algorithm 5). Some possible termination criteria are:

– The maximum number of iterations has been reached.

– The fitness of the best individual is larger than a certain value.

– The fitness of the best individual or the average fitness did not improve during the last k iterations by a certain amount.

Each implementation detail will be addressed in the following sections. For a more detailed overview on GAs refer to [Mit97] or [GKK04].
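The components listed above fit together as in the following Python sketch of algorithm 5. It is a generic illustration and not the GA of this report: individuals are plain floats in the usage example, the termination criterion is a fixed number of generations, selection is fitness proportionate with the worst individual shifted to fitness 0, and all function names are chosen for this sketch only.

```python
import random


def roulette_select(population, weights, rng):
    """Fitness-proportionate selection; weights must be non-negative."""
    if sum(weights) <= 0:  # all individuals equally fit: pick uniformly
        return [rng.choice(population) for _ in population]
    return rng.choices(population, weights=weights, k=len(population))


def genetic_algorithm(population, fitness, crossover, mutate, generations, rng):
    """Generic GA loop; returns the best individual seen in any generation."""
    best = max(population, key=fitness)                    # lines 1-3
    for _ in range(generations):                           # line 4
        raw = [fitness(h) for h in population]
        worst = min(raw)
        shifted = [f - worst for f in raw]                 # worst gets weight 0
        parents = roulette_select(population, shifted, rng)            # line 6
        population = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)), rng)
            for _ in parents
        ]                                                  # line 7
        candidate = max(population, key=fitness)           # line 8
        if fitness(candidate) > fitness(best):
            best = candidate
    return best


if __name__ == "__main__":
    # Toy problem (assumption): find the float closest to 3.0.
    rng = random.Random(0)
    start = [rng.uniform(-10.0, 10.0) for _ in range(20)]
    result = genetic_algorithm(
        start,
        fitness=lambda x: -(x - 3.0) ** 2,
        crossover=lambda a, b: (a + b) / 2.0,
        mutate=lambda x, r: x + r.gauss(0.0, 0.1),
        generations=50,
        rng=rng,
    )
    print(result)
```

Keeping track of the best individual across generations is a design choice of the sketch; without it, a good hypothesis found early could be lost again through mutation or selection.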

4.2 Representation of an individual and fitness function

For the representation of the individuals, the same coding as presented in section 2.1 is used. Some functions have been added to apply the genetic operators (see class CGAIndividual in appendix B). Using this representation, it is ensured that every potential hypothesis can be represented and that every individual in turn is a valid hypothesis (i.e. an object view as described in section 2.1). However, this choice of representation has a disadvantage as well: Such a complex object description requires complex genetic operators, in contrast to the simple bit-manipulating operators that would have been applicable for simple bit-string representations. But this drawback is compensated by another big advantage: A fitness function that evaluates the quality of a hypothesis (i.e. the fitness of an individual) can easily be derived from the function that compares two object views (see section 3.1). As the GA has to find an “average” object view for a set of object views, O, the best hypothesis minimizes the sum of the matching errors with all object views, β, in O:

$$h_{best} = \operatorname*{argmin}_{h \in H} \Big\{ \sum_{\beta \in O} \mathit{matching\_error}(h, \beta) \Big\} \tag{4.1}$$

where H is considered to be the hypotheses space. An appropriate fitness function that satisfies

$$h_{best} = \operatorname*{argmax}_{h \in H} \{ \mathit{fitness}(h) \} \tag{4.2}$$

is obviously:

$$\mathit{fitness}_1(h) = -\sum_{\beta \in O} \mathit{matching\_error}(h, \beta) \tag{4.3}$$

This function has a negative co-domain, but to be able to apply fitness proportionate selection the fitness values need to be in $\mathbb{R}^+$. Thus, the values need to be shifted, i.e. a positive constant has to be added:

$$\mathit{fitness}_2(h) = \mathit{fitness}_1(h) + a, \qquad a \in \mathbb{R}^+ \tag{4.4}$$

With

$$a = -\min_{h \in pop_t} \{ \mathit{fitness}_1(h) \} \tag{4.5}$$

this resembles the linear dynamic scaling proposition by Grefenstette and Baker [GB89]. (pop_t is considered to be the population at time t as in algorithm 5.) Using this fitness function, the fitness of the “worst” individuals is exactly 0 and all other individuals have positive fitness values at any time t. The fitness function resulting from the combination of equations 4.3, 4.4 and 4.5 is:

$$\mathit{fitness}(h) = -\sum_{\beta \in O} \mathit{matching\_error}(h, \beta) + \max_{h' \in pop_t} \Big\{ \sum_{\beta \in O} \mathit{matching\_error}(h', \beta) \Big\} \tag{4.6}$$

Defining the switch GA_SIGMA_SCALING in config.h enables a further transformation of the fitness function (equation 4.6) as follows:

$$\mathit{fitness}'(h) = \big( \max(0,\ \mathit{fitness}(h) - (\mu_t - b \cdot \sigma_t)) \big)^k \tag{4.7}$$

The parameters b and k can be set in config.h using the defines GA_SIGMA_SCALING_A and GA_SIGMA_SCALING_POW. The default value of the former is 2.0; the latter can be a constant value close to 1.0 (default is 1.005) or a dynamic value as e.g. proposed by Michalewicz [Mic96]. µ_t is the mean and σ_t is the standard deviation of the distribution of fitness values in the population at time t (for the standard deviation the unbiased estimator has been used):

$$\mu_t = \frac{1}{|pop_t|} \sum_{h \in pop_t} \mathit{fitness}(h) \tag{4.8}$$

$$\sigma_t = \sqrt{ \frac{1}{|pop_t| - 1} \sum_{h \in pop_t} \big( \mu_t - \mathit{fitness}(h) \big)^2 } \tag{4.9}$$

Equation 4.7 is a combination of the σ-scaling as defined by Goldberg [Gol89] and exponential scaling as defined by Goldberg [Gol89] and Grefenstette and Baker [GB89]. In case of k = 1 it is pure σ-scaling, which aims to lower the selection pressure when the deviation is high (in the early stage of the GA) and to increase it when the deviation is low (when convergence starts to take place). Usual values of parameter b are in [1, 2]. For small b the pressure is higher than for bigger values. Parameter k controls the impact of the exponential scaling. Values smaller than 1 decrease the selection pressure, whereas individuals with high fitness benefit from values bigger than 1.
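Equations 4.6 to 4.9 translate almost directly into code. The following Python sketch assumes that the summed matching errors of all individuals of the current population are already available as a list; the function names are made up, and the default values of `b` and `k` are the defaults quoted above.

```python
import statistics


def shifted_fitness(total_errors):
    """Equation 4.6: the worst individual of the population gets fitness 0.

    total_errors[i] is the sum of the matching errors of individual i with
    all object views in O.
    """
    worst = max(total_errors)
    return [worst - e for e in total_errors]


def sigma_scaled(fitnesses, b=2.0, k=1.005):
    """Equation 4.7: sigma scaling combined with exponential scaling.

    Needs at least two individuals, because the unbiased estimator of the
    standard deviation divides by (population size - 1).
    """
    mu = statistics.mean(fitnesses)
    sigma = statistics.stdev(fitnesses)  # sample standard deviation (n - 1)
    return [max(0.0, f - (mu - b * sigma)) ** k for f in fitnesses]
```

For example, `shifted_fitness([3.0, 1.0, 2.0])` yields `[0.0, 2.0, 1.0]`: the hypothesis with the largest summed error receives fitness 0 and the others receive positive values, as required for fitness proportionate selection.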

4.3 Initial population

Several possibilities for the generation of the initial population have been implemented. By using the parameter initMode of the GA the initialization method can be chosen. The following values are supported:

• GA_INIT_USE_RANDOM

This is the straightforward approach of generating individuals at random. A function for random initialization of object views has been implemented. It takes as parameters value ranges for the number, size, aspect ratio and position of the shape-primitives. These values are derived from the object views that have to be approximated. Generated hypotheses will have only small fitness but cover the whole search space.

• GA_INIT_COPY_ONLY

Instead of starting with a completely random set of hypotheses, the GA is initializedwith randomly chosen elements from the set of object views from which the averagehas to be computed for. In most cases the size of this set is small in comparisonto the size of the population. Thus there can be multiple copies of an element inthe initial population. (Usually, this is not desired! Therefore this initializationmode is not recommended.) Generated hypotheses will have relatively high fitnessbut cover only a tiny fraction of the search space which holds the danger of gettingstuck in a local optimum. However, the global optimum is very likely to be foundin this small region.


• GA_INIT_ALTER_WEAK

The motivation for this mode is the same as for GA_INIT_COPY_ONLY and they are nearly identical. The difference is that each copy is slightly altered by using the mutateWeak-operator that will be explained in section 4.5. This reduces the undesired extreme homogeneity of the initial population that is present in the second mode. The generated hypotheses will still have relatively high fitness but cover only a tiny fraction of the search space.

• GA_INIT_ALTER_STRONG

This mode is the default and identical to GA_INIT_ALTER_WEAK except for the alteration-operator that is applied. Here the mutateStrong-operator is used (as explained in section 4.5), which leads to higher diversity in the initial population. The GA is still confined to a region of the search space but the risk of suboptimal results is significantly reduced.

The default size of the population can be set by the GA_POPSIZE parameter in config.h.

4.4 Selection

A fitness-proportionate approach has been chosen as the selection method: On a “roulette wheel” with a fixed number of equal sectors (the parameter GA_ROULETTE_WHEEL_SIZE is defined in config.h and has a default value of 10000) each hypothesis in the population gets a certain fraction. The number of assigned sectors corresponds to the proportion of its fitness to the fitness of the whole population:

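$$\mathit{sectors}(h) = \left\lfloor \mathtt{GA\_ROULETTE\_WHEEL\_SIZE} \cdot \frac{\mathit{fitness}(h)}{\sum_{h' \in pop_t} \mathit{fitness}(h')} \right\rfloor \tag{4.10}$$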

Due to rounding, the total number of assigned sectors may vary (but not exceed the maximum). Individuals of the next population are then selected by “turning” the “roulette wheel”, i.e. a random integer between 1 and the total number of assigned sectors is generated and the individual that corresponds to the selected sector is copied to the next population.

Optionally, an extension called “elitism” can be enabled (define GA_ELITISM in config.h, enabled by default). Elitism ensures that the best hypothesis known so far will remain in the population. A copy of this hypothesis is preserved and does not undergo the process of alteration.
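The selection step described in this section can be sketched as follows. The sketch is in Python and only illustrates the idea; the function names, the use of plain lists and the handling of elitism are assumptions made for this example. In particular, elitism is simplified here to keeping the best hypothesis of the current population, whereas the implementation preserves the best hypothesis known so far.

```python
import math
import random


def assign_sectors(fitness_values, wheel_size=10000):
    """Number of wheel sectors per hypothesis, rounded down (cf. equation 4.10)."""
    total = sum(fitness_values)  # assumed to be positive
    return [math.floor(wheel_size * f / total) for f in fitness_values]


def spin(sectors, rng):
    """Draw a sector number in 1..sum(sectors) and return the owner's index."""
    ticket = rng.randint(1, sum(sectors))
    cumulative = 0
    for index, count in enumerate(sectors):
        cumulative += count
        if ticket <= cumulative:
            return index
    raise AssertionError("unreachable: ticket never exceeds the sector total")


def select_next_population(population, fitness_values, rng, elitism=True):
    """Fitness-proportionate selection of a population of unchanged size."""
    sectors = assign_sectors(fitness_values)
    next_population = []
    if elitism:
        # Simplification: the best hypothesis of the current population is kept.
        best = max(range(len(population)), key=lambda i: fitness_values[i])
        next_population.append(population[best])
    while len(next_population) < len(population):
        next_population.append(population[spin(sectors, rng)])
    return next_population
```

Because every sector count is rounded down, the sum of all counts can fall slightly below the wheel size, which is why the random integer is drawn from the number of sectors actually assigned.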

4.5 Genetic operators

The following genetic operators have been implemented:


• mutateWeak

This operator slightly alters a single individual, i.e. it alters only one feature of one shape-primitive. Figure 4.1 shows the flowchart of the operator. At first, the shape-primitive and the feature to be altered are selected at random.¹ If the shape type has been selected for alteration, it is simply overwritten by a new randomly chosen value. For size, position (center), ratio or rotation, a random value is added to the old value. Bounds for these random values can be set in config.h. The object view needs to be normalized afterwards.²

• mutateStrong

The strong mutation operator overwrites a whole shape-primitive or adds a new one to an individual. This operator may change the description length of an individual. The input and output individuals may differ significantly in their fitness values.

• mutateWeakMod

This operator is an extension of the mutateWeak-operator. The mutateWeak-operator maintains the description length of the original individual, whereas, in the extension, shape-primitives with a size below a certain threshold are removed and there is an additional case to create new shape-primitives. (See the shaded regions in figure 4.1.) Thus mutateWeakMod may change the description length of an individual and may (partially or fully) replace mutateStrong.

• nPointCrossover

This operator overwrites an individual by a recombination of two other individuals. The flowchart is shown in figure 4.2. Figuratively speaking, the input individuals are broken into pieces. Each piece has the size of one shape-primitive. Their order is preserved. The “child” is then assembled from the pieces as follows: for each index in the order of the shape-primitives, it is decided at random from which parent the piece is copied. If the selected parent does not have a shape-primitive with this index, the copy-process is skipped. Obviously, the resulting individual will have a description length, l, which corresponds to the number of shape-primitives:

min{|α|, |β|} ≤ l ≤ max{|α|, |β|}

The crossover-operator is applied before the mutation operators. The default impact of the specific operators can be controlled by the parameters GA_MUTATE_WEAK, GA_MUTATE_STRONG and GA_CROSSOVER in config.h. Different parameter values can be set when the GA is called.
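The assembly rule of the nPointCrossover-operator can be sketched in a few lines of Python. This is an illustration only: individuals are modeled as plain lists whose elements stand for shape-primitives, and an equal probability for both parents is an assumption, since the text only states that the parent is chosen at random.

```python
import random


def n_point_crossover(parent_a, parent_b, rng):
    """Assemble a child piece by piece from two parents (order preserved)."""
    child = []
    for index in range(max(len(parent_a), len(parent_b))):
        # Choose the donor for this index (assumed: both equally likely).
        donor = parent_a if rng.random() < 0.5 else parent_b
        if index < len(donor):
            child.append(donor[index])
        # Otherwise the donor has no piece at this index and the copy is skipped.
    return child
```

For every index below the shorter parent's length a piece is always copied, and beyond it a piece is copied only if the longer parent is chosen, which yields the bounds on the description length l stated above.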

4.6 Termination criterion

The GA terminates in either of the following two cases:

¹ For a description of the shape properties refer to section 2.1.
² The normalization is described in section 2.1.


Figure 4.1: Flowchart of mutateWeak and mutateWeakMod-operator (extensions in mutateWeakMod are shaded).


Figure 4.2: Flowchart of nPointCrossover-operator.

1. The maximum number of iterations has been reached. The default threshold is defined in parameter GA_MAX_EPOCHS in config.h.

2. The fitness of the best individual has not improved for the last GA_MAX_EPOCHS_WITH_NO_FITNESS_CHANGE iterations. If elitism (see section 4.4) is disabled, the current best hypothesis is compared to the best hypothesis of the preceding population instead of the all-time best. (Otherwise, this would lead to early termination if GA_MAX_EPOCHS_WITH_NO_FITNESS_CHANGE has a small value.)
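The two termination cases can be expressed as a simple check. The following Python sketch is a simplified reading of the criterion, not the implemented code: it assumes a list `best_fitness_history` (a hypothetical name) that stores, for every completed iteration, the fitness value that serves as reference, and it treats "not improved" as "the best value of the last iterations does not exceed the value recorded before them".

```python
def should_terminate(best_fitness_history, max_epochs, max_stagnant_epochs):
    """Return True if either termination case of section 4.6 applies."""
    epochs = len(best_fitness_history)
    # Case 1: the maximum number of iterations has been reached.
    if epochs >= max_epochs:
        return True
    # Not enough iterations yet to judge stagnation.
    if epochs <= max_stagnant_epochs:
        return False
    # Case 2: no improvement during the last max_stagnant_epochs iterations.
    recent = best_fitness_history[-max_stagnant_epochs:]
    baseline = best_fitness_history[-max_stagnant_epochs - 1]
    return max(recent) <= baseline
```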


Chapter 5

Graphical user interface

The graphical user interface (GUI) has been designed as a tool to visualize object views and matching results in the debugging process. It runs only on Microsoft Windows systems because it is MFC-based and uses GDI+. MFC [Micb] stands for “Microsoft Foundation Classes”. This is a library of C++ classes developed by Microsoft for Windows-based applications. GDI+ is an extension of the Windows Graphics Device Interface (GDI). This API enables applications to use graphics and formatted text on both the video display and the printer without the need to access hardware directly [Mica]. Both MFC and GDI+ require a Windows system to run. This implementation has been chosen because it appeared to be less time-consuming than a platform-independent one (which was not required in the case of the GUI). Figure 5.1 shows a screenshot of the main window. The menu allows access to the main functions of the core components:

• Database - submenu

– Load the database from the hard disc - file names for the files to be loaded are defined by DB_FILE, which contains the data, and DB_DESC_FILE, which contains the string descriptions, in config.h. (Refer to section 2.2.)

– Save the database to the hard disc - refers to the same files as load.

– Import the database from the detection module. The import file is specified by the define DB_IMPORT_FILE in config.h. (See section 2.1 for a description of the import algorithm.)

– Generate a database at random.

• Match - submenu

– Find k Best Matches within the loaded database for the loaded query object view. Parameter k is defined by NUM_BEST_MATCH in config.h. (For the description of the matching algorithm refer to section 3.2.6.)

– Match Query With Selected dbEntry - The loaded query object view is matched with the currently selected object view from the database. Again, the NUM_BEST_MATCH best matches will be retrieved. If no object view of the database has been selected, the result is the same as for Find k Best Matches.

Figure 5.1: Screenshot of the graphical user interface (GUI). The display areas are in shape-match mode.

– Evaluation-submenu - Provides access to several functions needed to determine the performance of the matching algorithm (refer to chapter 6).

• Genetic Algorithm - submenu

– Filter Database - This has to be done as a preparation step for a manually initiated run of the GA. Usually, the database contains representations of objects from different views but for the GA all database entries must refer to the same object and the same view.

– Run GA - Runs the GA on the currently loaded database.

– Built New Database - For each set of database entries referring to the same object from the same view an “average” object view is computed (see chapter 4) and stored in a new database. File names are defined by GA_DB_FILE (data) and GA_DESC_FILE (descriptions) in config.h.

• Active Vision provides control of the active vision process. These methods are still under development and are not within the scope of this work.

• About displays information about the GUI.

Furthermore, the main window is divided into three parts:

• The upper part of the main window contains an edit field that is used to load query object views and a list control that displays the content of the database and allows selection of single database entries. In this list control the field “id” refers to the index of the entry within the database, “description” contains the descriptions from DB_DESC_FILE and “raw data” the data from DB_FILE in the object representation syntax which is used to save the database in the file. (The same format is used for the query.)

• The middle of the main window is again divided into three parts: The query display area (left), the database entry display area (center) and two list controls (right). The upper list control displays the results of Find k Best Matches (submenu Match). Fields are “#” for the rank of the matching result, “id” for the index of the corresponding database entry, “error” for the total error, “init” for the initial error and “unmatched” for the unmatched error. (Refer to section 3.2.3 for an explanation of the different errors.) Each element (row) of this list control refers to a matching result and can be selected.

The lower list control shows the details of the selected matching result, i.e. the different shape matches. Fields are “#1” for the index of the shape-primitive of the query, “#2” for the index of the shape-primitive of the database entry, “error” for the shape match error and “size”, “position”, “ratio” and “rotation” for the errors on the specific shape properties. (Refer to section 3.1 for details on the shape match error.) A value starting with “u” in column “#1” or “#2” stands for “unmatched” and indicates that the other corresponding shape-primitive has not been matched.

The query display area shows the currently loaded query object view and the database entry display area shows the currently selected database entry. The two display areas are linked with each other and support four different display modes:

1. Unlinked mode (figure 5.2): This display mode is used if either no query object view has been loaded or no database entry has been selected, i.e. one display area is empty. In this mode the displayed object view is scaled to fit exactly into the display area (maximal scaling). All shape-primitives are colored blue.

2. Size-linked mode (figure 5.3): As soon as an object view is assigned to the other (empty) display area, both displayed object views are scaled in the following way: The bigger object view fits exactly into its display area and the smaller one uses the same scaling factor, i.e. both object views are displayed at the same scale. Again, all shape-primitives are colored blue.

Figure 5.2: Screenshot of GUI object view displays in unlinked mode and with position labels.

Figure 5.3: Screenshot of GUI object view displays in size-linked mode.

3. Match mode (figure 5.4): This mode is enabled by selecting a matching result in the upper list control. The scaling is the same as in the size-linked mode. Additionally, the rotation of the database entry has been aligned in the manner described in section 3.1. Corresponding shape-primitives have the same color and each corresponding pair has a different color. (Because of the low opacity, which has been chosen to visualize occlusions, colors might mix.)

4. Shape match mode (figure 5.1): This mode is enabled by selecting a shape match in the lower list control. The shape-primitives of the selected shape match are highlighted red, all remaining shape-primitives are colored blue. Scaling and rotation alignment are the same as for match mode.


Figure 5.4: Screenshot of GUI object view displays in match mode.

Additionally, the coordinates of the circumcenters of the shape-primitives can be displayed in any of the modes as demonstrated for the unlinked mode (figure 5.2). A click into the display toggles between showing and hiding position labels.

• The lower part is used to display log messages from the core components. By default, these messages are redirected to stdout (console). However, if USE_MFC_GUI is defined in config.h, log messages are displayed in the log window and status messages are displayed in both the status bar and the log window. The log window can be cleared (Clear-button) and opened in a separate window (View Extra-button). The logging behavior can be controlled in config.h by various loglevel-defines and a log threshold. Only log messages with levels that are higher than the log threshold are displayed. Status messages are always displayed.


Chapter 6

Test Result

This chapter contains a performance analysis of the classification matching component based on empirical tests. Firstly, the construction of the test data sets that have been used for evaluation is described. The second part is the presentation and discussion of the test results.

6.1 Test data sets

For a performance analysis on real images, a large collection of images that can be processed by the detection component is required. The detection component works on greyscale images with a maximum size of 100x100 pixels. Additionally, the images should contain single objects from different orthogonal, axis-parallel views. Unfortunately, a collection containing such images could not be found. Although there exist several free image collections on the web [SoCS], none of them was suitable for this application. For example, the “Columbia University Image Library” (COIL) [COI] consists of images of single objects but the views are not axis-parallel. For the evaluation of this work, but more importantly for development and evaluation of the visual shape detection component, two small image data sets have been generated. They are described in the following. However, to be able to measure the performance on larger-scale data sets, an artificial test set has been generated as explained in section 6.1.2.

6.1.1 Real test data sets

First test set

The first test set was built as part of the internship on November 10th, 2003. The data set consists of 360 manually labeled images of 10 single objects. Each image contains only one object centered in the image and each object is represented by a maximum of six orthogonal views (the number of views depends on symmetries of the objects). There are 10 images with varying lighting conditions for each view. All images were captured by a Canon IXUS V2 digital camera (2.0 megapixel CCD) that was statically mounted on a tripod. The distance from the object was about 50 cm. The objects were centered in front of a white background and neither object nor camera were moved during capturing one particular view. The flash was disabled and a halogen lamp was used as light source. The lamp was moved for each view in a repeating sequence to produce 10 different lighting conditions for each view. The images were captured as JPEGs at a resolution of 640x480 pixels with quality level set to “superfine” (highest JPEG quality to minimize the presence of artifacts). Afterwards they were converted to greyscale, cropped, resized to 100x64 pixels and stored as uncompressed greyscale TIFFs. This final image format resembles the required input format of the detection component.

The captured objects were chosen from a kitchen environment, preferring objects with simple geometry. Table 6.1 shows the partitioning of the whole set of images.

Object description    Number of images    Number of views (descriptions)

“cascade” bottle      30                  3 (bottom, side, top)
cordial bottle        50                  5 (back, bottom, front, side, top)
cup                   50                  5 (back, bottom, front, side, top)
can of fish           30                  3 (front, side, top)
styrofoam cup         30                  3 (bottom, side, top)
box for glasses       30                  3 (front, side, top)
mug                   50                  5 (back, bottom, front, side, top)
salt                  30                  3 (bottom, side, top)
soy sauce             30                  3 (bottom, side, top)
masking tape          30                  3 (front, side, top)

Table 6.1: Partitioning of the first test set (360 images in total).

Second test set

For the development of the active vision component, a second test set was built by Ruby Law under similar conditions as for the first set. Both sets have been merged and are used in the following to derive parameters for the artificially generated test sets.

6.1.2 Artificially generated test sets

Whilst real test sets are useful for developing and debugging components for detection and classification matching, the “real” test sets are far too small to allow estimation of the running time of the algorithm in real-life applications where databases with thousands of object views may occur. To generate databases of this size, the random generation function mentioned in section 4.3 and the mutation operator discussed in section 4.5 have been used. Firstly, a database of 10000 randomly generated object views has been created. This database is the “evaluation database”. Parameters for the generation of an object view were as follows:

• The number of shape-primitives has a range of [3, 10]. For the real test sets, the detection component returned values in [0, 6]. But object views with only a few shape-primitives can be matched very quickly. This would have had a positive influence on the performance of the algorithm, which was not desired for the evaluation.

• The size of each shape-primitive is initially (i.e. before normalization) in [0.7, 1.5]. Very small shape-primitives were not desired (and unlikely to be detected by the detection component). Therefore, the size had to be bounded downwards and upwards (because of the final normalization step).

• The position (i.e. the values of the coordinates of the circumcenter) of each shape-primitive is bounded by ±2.0 to avoid excessive scattering of the shape-primitives.

• The ratio of each shape-primitive is in [0.4, 2.0] to avoid degenerated shape-primitives.
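The generation parameters listed above can be summarized in a small Python sketch. It is meant as an illustration only: the dictionary representation, the function name `random_object_view` and the rotation range are assumptions made for this example, and the final normalization step (section 2.1) is omitted.

```python
import math
import random

# The three shape-primitive types used by the system.
SHAPE_TYPES = ("ellipse", "rectangle", "isosceles triangle")


def random_object_view(rng):
    """Generate one random, not yet normalized object view."""
    count = rng.randint(3, 10)  # number of shape-primitives in [3, 10]
    return [
        {
            "type": rng.choice(SHAPE_TYPES),
            "size": rng.uniform(0.7, 1.5),    # size before normalization
            "x": rng.uniform(-2.0, 2.0),      # circumcenter, bounded by +/-2.0
            "y": rng.uniform(-2.0, 2.0),
            "ratio": rng.uniform(0.4, 2.0),   # avoids degenerated primitives
            "rotation": rng.uniform(0.0, 2.0 * math.pi),  # assumed range
        }
        for _ in range(count)
    ]
```

Calling the function 10000 times with one shared generator, e.g. `rng = random.Random(42)`, would produce a set of the same size as the evaluation database.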

From the evaluation database, three different test sets of object views are derived. Each set resembles one of three scenarios that are explained in the following and contains 10 query object views. The performance of the algorithm on the evaluation database for a specific scenario is then the mean value of the queries with each object view from the corresponding set.

Ideal case

This scenario, though very unlikely, helps to estimate the lower bound of the processing time. In this set, for every query object view, α, there is an identical object view, β, in the database. Thus, the matching error of α and β would be 0. The query set for this scenario contains simple copies of the first 10 object views from the evaluation database.

Normal case

In this scenario, for the query object view, α, there is no identical object view in the database as in the previous case, but there is at least a similar object view, β. “Similar” means that the matching error of α and β is below some threshold which in this case has been chosen to be 5.0. The query set is initialized with copies of the first 10 object views from the evaluation database (as for the previous scenario). Then each object view in the set is mutated until the matching error with the evaluation database exceeds 3.0. Additionally, the entries of the evaluation database are reordered to ensure that the best matches for all query object views are among the first 1000 database entries.¹ The actual errors for the queries used in this evaluation are in the range of [3.17, 4.85], with a mean of 3.84.

(Approximated) Worst case

This scenario approximates worst cases and provides an estimate of the upper bound for the processing time. The query object view is not in the database and the matching error with the database exceeds a high threshold (5.0). The query set is constructed from random object views satisfying the additional constraint that the matching error for each query object view on the evaluation database is at least 5.0. Again, the entries of the evaluation database are reordered to ensure that the best matches for all query object views are amongst the first 1000 database entries.¹ The actual errors for the queries used in this evaluation are in the range of [5.09, 7.32], with a mean of 6.64.

6.2 Test results

Figure 6.1 shows the benchmark results for the three query scenarios on the evaluation database introduced in the preceding section. To demonstrate the dependency of the performance on the size of the database, the queries are run on the first k thousand object views of the evaluation database, where k takes every integer value from 1 to 10. The number of created nodes refers to the number of partial matchings that have been created during the matching process. The execution times (dotted lines) refer to an Intel Pentium M system running at 1.5 GHz with 512 MB memory. The results can be reproduced by running the command Evaluate from the submenu Match>Evaluation of the GUI. (The required database files are generated by Generate Benchmark DBs from the same submenu.) Additionally, figure 6.2 shows, as an illustrative example, the details of the matching process of one object view from the “normal case” query set (left) and one object view from the “worst case” query set (right) with the corresponding best matches from the evaluation database.

The following observations can be made:

• For all three scenarios, the processing time and the number of created nodes scale linearly with the size of the database.

• For the “ideal case” the number of created nodes is exactly the number of entries in the database. This is the absolute minimum number, considering that the branch & bound algorithm needs to be initialized with a set of nodes containing one node for each database entry. Table 6.2, which shows the number of nodes that have been expanded, reveals that not a single node has been expanded.

¹ The test queries are run on the first k thousand object views of the evaluation database, where k takes every integer value from 1 to 10. The reordering ensures that the best match for each query is in the evaluation database regardless of the value of k. Note that reordering, in general, has no impact on the performance.

Figure 6.1: Benchmark results for large-scale databases (the dotted lines refer to the execution times).

DB size    Ideal case    Normal case      Worst case

1000       0.0 (0.0)     311.5 (10.2)     2115.2 (424.9)
2000       0.0 (0.0)     610.3 (17.4)     4247.6 (855.2)
3000       0.0 (0.0)     908.5 (23.1)     6358.0 (1280.9)
4000       0.0 (0.0)     1196.0 (27.8)    8476.5 (1706.4)
5000       0.0 (0.0)     1477.5 (33.7)    10571.3 (2120.3)
6000       0.0 (0.0)     1765.3 (39.3)    12690.7 (2539.4)
7000       0.0 (0.0)     2046.4 (45.5)    14737.6 (2920.2)
8000       0.0 (0.0)     2342.2 (51.3)    16889.1 (3352.1)
9000       0.0 (0.0)     2633.3 (56.8)    18972.4 (3762.4)
10000      0.0 (0.0)     2909.7 (62.4)    21098.9 (4194.4)

Table 6.2: Number of expanded nodes (values in brackets refer to the number of nodes with depth > 2).

Figure 6.2: Changes in the lower and upper bounds for the matching error during the matching process of two example object views. Left: “normal case”-scenario. Right: “worst case”-scenario. (The lower bound is the sum of initial, unmatched and estimated error as described in section 3.2.3.)

• For the scenario that is considered as the “normal case”, the number of nodes created is about twice as high as in the previous scenario. Table 6.2 shows that more than 95% of the expanded nodes have depth ≤ 2. This indicates that the branch & bound search process gets more directed towards the optimal solution with increasing search depth.

• For the (approximated) worst case scenario the number of nodes created is significantly higher than in the other scenarios. Moreover, about 20% of the expanded nodes have depth > 2, indicating that the branch & bound search tree is still broad on deeper levels.
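The percentages quoted in the observations above can be rechecked from the first and last rows of Table 6.2. The short Python sketch below does this; only the numbers are taken from the table, while the variable and function names are chosen for this example.

```python
# (expanded nodes, expanded nodes with depth > 2), copied from Table 6.2
NORMAL_CASE = {1000: (311.5, 10.2), 10000: (2909.7, 62.4)}
WORST_CASE = {1000: (2115.2, 424.9), 10000: (21098.9, 4194.4)}


def deep_fraction(row):
    """Share of expanded nodes whose depth is greater than 2."""
    expanded, deep = row
    return deep / expanded


def growth_factor(table):
    """Expanded nodes for 10000 entries relative to 1000 entries."""
    return table[10000][0] / table[1000][0]


for name, table in (("normal", NORMAL_CASE), ("worst", WORST_CASE)):
    for size, row in sorted(table.items()):
        print(f"{name} case, {size} entries: {100 * deep_fraction(row):.1f}% deeper than 2")
    print(f"{name} case: growth factor {growth_factor(table):.2f} for a tenfold database")
```

In the normal case, less than 5% of the expanded nodes lie deeper than level 2, whereas in the worst case this share is close to 20%. A tenfold increase of the database size increases the number of expanded nodes by a factor between 9 and 10 in both scenarios, which is in line with the roughly linear scaling noted above.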

The observations can be explained as follows:

• The search performance for the “ideal case” scenario can only be optimal if the initial solution is already the optimal matching, i.e. the heuristic used to generate the initial solution, which is used as upper bound in the search process, is very efficient in this scenario.

• In the early stages of the search, the computation of the lower bound is limited. This is caused by the definition of the error functions that are used for the computation.² As can be observed in figure 6.2, the estimated error increases significantly once the 1st and 2nd correspondence points are found, whereas the unmatched error is independent of the number of correspondence points. In the early stages of the search, the unmatched error contributes the major part of the lower bound. Consequently, the search can be easily misled. This explains the dominating percentage of expanded nodes with depth ≤ 2.

² Recall that the comparison of sizes of shape-primitives requires at least one correspondence point. For the computation of the errors for the position and rotation, two correspondence points are necessary. For details refer to section 3.1.

• As long as all nodes in the branch & bound search tree have nearly identical lower bound values, selection of the node to expand is almost random, which renders the search process nearly undirected and uninformed. This is the case for the “worst case” scenario where all matchings are bad.³ In addition, using a bad complete solution for pruning does not bound the branch & bound search tree well. This explains the significantly higher number of created and expanded nodes in the “worst case” scenario.

³ See the definition of the query set for this scenario in section 6.1.2.


Chapter 7

Conclusion & future work

This chapter summarizes the entire work and presents possibilities for further improvement of the algorithm and thoughts on how the work could be continued.

7.1 Discussion

There is evidence that an object’s geometry is decomposed into several parts during the process of human object recognition¹ but this process is far from being fully understood. Whilst it is rather unlikely that human object recognition is based on artificial shapes such as ellipses, triangles and rectangles, a decompositional approach using this set of shape-primitives may yield advantages in terms of computational costs and real-time capability. Moreover, relying on such very basic shape-primitives, it may even be possible to migrate the detection of the shape-primitives from software to hardware. However, the fundamental restriction on the shape complexity limits the differences that can be captured between objects and possibly confines the setting to non-real-world environments with less complex shapes and structures like the manufacturing environment in a factory. Such an approach cannot be expected to return results that are on a level as achieved by other object recognition approaches such as those mentioned in section 1.2. Rather, it can be regarded as a possible preprocessing step in object recognition that helps to decide whether and which more sophisticated (and possibly computationally more expensive) further steps in object recognition should be taken to gain additional information about perceived objects.

¹ Refer to section 1.2.1 for an overview of the discussion in the field of cognitive science.


7.2 Ideas for further improvement

7.2.1 Introduction of an error threshold

The error of a matching of two object views, α and β, is obviously only bounded by |α| + |β|.² Thus, even cases worse than 5.0 are imaginable. In fact, such cases are not unlikely. It therefore seems advisable to define a certain threshold for the matching error and implement the following behavior: If this threshold is exceeded for a query, the matching process can be aborted and a message like “There is no similar object in the image.” or “unknown object” is returned. The threshold should be database specific and determined empirically.
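The proposed behavior could look roughly as follows. This Python sketch is not part of the implementation; it assumes that the matching step delivers a list of (database id, error) pairs sorted by ascending error, and the function name `classify` as well as the return value None for an unknown object are hypothetical. The sketch applies the threshold after the matching has finished, whereas aborting the search early would require the threshold to be checked inside the matching algorithm itself.

```python
def classify(best_matches, error_threshold):
    """Return the id of the best match, or None for an unknown object.

    best_matches: list of (database_id, error) pairs, sorted by error.
    error_threshold: database-specific value, to be determined empirically.
    """
    if not best_matches:
        return None
    database_id, error = best_matches[0]
    if error > error_threshold:
        return None  # corresponds to the message "unknown object"
    return database_id
```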

7.2.2 Optimization of the parameters for the shape error functions

The functions for the comparison of the size, aspect ratio, rotation and position of shape-primitives described in appendix A are parameterized. Currently, the parameter values are chosen based on empirical tests. An optimization of the values could improve the quality of the computed errors and have a positive impact on the performance of the classification matching component.

7.2.3 Incorporation of shape confidences

The detection component provides additional information about the quality of the detected patterns of the shape-primitives. This information is currently not used. It may be incorporated into the object representation syntax³ and used e.g. as a measure of uncertainty in the shape error function. (See appendix A for the current implementation of the shape error functions.)

7.3 Conclusion

In the context of an object recognition system that is based on the decomposition of 2D object views into shape-primitives⁴, a symbolic object view representation has been designed. This representation is capable of holding all information that can be gathered by a detection component such as type, size, aspect ratio, rotation and position of individual shape-primitives. It can be normalized and is rotation-, scaling- and translation-invariant, which benefits matching between object views. For the comparison of properties of shape-primitives, shape error functions based on the concept of fuzzy similarities are used. The symbolic representations of object views known to the system can be stored in a database that supports querying of other object views. To keep the size of the database as small as possible, generalizations of object views can be learned by a genetic algorithm. For querying the database, a classification matching algorithm has been implemented. It is based on a branch & bound algorithm that utilizes error estimates and heuristics that have been designed specifically for this problem. Furthermore, the branch & bound algorithm has been modified to return the k most similar database entries (instead of only the most similar database entry) for any given query. The implemented application can be accessed via command line (platform independent) or a graphical user interface (requires a Microsoft Windows system).

² | · | denotes the number of shape-primitives in an object view.
³ Import and storage of the values is already supported.
⁴ Shape-primitives are basic geometries such as ellipses, rectangles and isosceles triangles.

The query performance scales linearly with the size of the database. For a database containing 10000 entries, a response time of less than a second is expected on an average system. (It can be further improved by using a heuristic as a trade-off between the quality of the solution and speed.⁵) Thus, it is possible to apply the system in the active vision domain.

Based on this work, an active vision module is currently being developed.

⁵ Using the heuristic, it is not guaranteed that the most similar database entry is returned for any given query. The probability of this event can be influenced by a parameter.


Appendix A

Definition of the shape-primitive property errors

The implemented error functions for the comparison of the properties of two shape-primitives are based on the concept of fuzzy similarity relations [KGK94]. A fuzzy similarity relation $E$ based on a set $X$ is a mapping from $X \times X$ to $[0, 1]$ satisfying the following characteristics:

\[
\begin{aligned}
\text{reflexivity:} &\quad E(x, x) = 1 \\
\text{symmetry:} &\quad E(x, y) = E(y, x) \\
\text{pseudo-transitivity:} &\quad \max\{E(x, y) + E(y, z) - 1,\ 0\} \le E(x, z)
\end{aligned}
\]

$E(x, y) = 1$ denotes that $x$ and $y$ are identical, whereas $E(x, y) = 0$ denotes maximum dissimilarity of $x$ and $y$. For a fuzzy similarity relation $E$, a corresponding distance measure can simply be defined as $1 - E$.

Figure A.1: Triangular membership function used for modeling similarity and the corresponding distance function.

All fuzzy sets used in this implementation are based on triangular membership functions. Figure A.1 shows such a function modeling similarity and the corresponding distance function, which is used here for the error computation. The functions are parametrized by δ, which determines the width of the base of the “triangle” and thus can be used to control the error tolerance. For each shape-primitive property, a value defining the error tolerance can be set in config.h.

The error for matching a shape-primitive with “unmatched”, $e_{unmatched}$, is defined, for reasons of consistency, as the error of the shape-primitive property “size” for a matching with a zero-size shape-primitive.
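The triangular similarity, its distance function and the “unmatched” error described above could be sketched as follows. This is only an illustration under stated assumptions: the report does not give the exact formula, so the linear form `1 - |x - y| / delta` (a triangle with base width 2δ), the function names and the parameter `sizeDelta` (standing in for the tolerance value from config.h) are all hypothetical.

```cpp
#include <algorithm>
#include <cmath>

// Assumed triangular similarity: 1 for identical values, decreasing linearly
// to 0 once |x - y| reaches the tolerance delta (delta must be > 0).
double triangularSimilarity(double x, double y, double delta) {
    return std::max(0.0, 1.0 - std::fabs(x - y) / delta);
}

// Corresponding distance (error) function, defined as 1 - E.
double triangularError(double x, double y, double delta) {
    return 1.0 - triangularSimilarity(x, y, delta);
}

// Error for leaving a shape-primitive unmatched: the "size" error for a
// matching with a zero-size shape-primitive. sizeDelta is a hypothetical
// stand-in for the size tolerance configured in config.h.
double unmatchedError(double size, double sizeDelta) {
    return triangularError(size, 0.0, sizeDelta);
}
```

A larger `delta` widens the triangle and therefore tolerates larger property differences before the error saturates at 1.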


Appendix B

Overview of the source files

The following source files contain the implementation presented in this report (excluding the GUI):

• config.h - is the global configuration file.

• datatypes.h - contains definitions of the following basic datatypes:

    TShapeType           supported types of shape primitives
    TPoint               a point in the plane
    TShape               represents a shape-primitive
    TAbsoluteShape       used to export an absolute shape-primitive representation for visualization
    TAbsoluteObjectView  used to export an absolute object view representation for visualization
    TMatchDetailShape    holds details of a shape-match to be displayed in the GUI
    TMap                 represents a (partial) matching

• CAVLTree.h - contains the following template classes for the AVL tree used by the branch & bound algorithm:

    CListNode<T>  template class for a node of a simple linked list
    CAVLNode<T>   template class for a node of an AVL tree
    CAVLTree<T>   template class for an AVL tree

• tools.h/cpp - contains auxiliary functions for logging (for the command line version as well as for the GUI), random number generation, string tokenization and simple mathematical operations.

• CObjectView.h/cpp - contains the class CObjectView that represents an object view and provides functions for import, random initialization, modification and export.

• CGAIndividual.h/cpp - contains the class CGAIndividual that inherits from CObjectView and represents an individual for the genetic algorithm. Added functionality comprises initialization with a CObjectView and alteration operators (several implementations of mutation and crossover operators).

• CDatabase.h/cpp - contains the class CDatabase which is the main class of this work and incorporates the functionality of the database and the classification matching component (see figure 1.1), i.e. functions for import and export of the database, the implementation of the branch & bound algorithm, error computations, heuristics and the control function for the genetic algorithm. Some functions have recently been added for communication with the active vision component and do not belong to the content of this report.

• shapes.h/cpp - contains high-level control functions for CDatabase. These functions are called by the GUI. Some functions have recently been added for communication with the active vision component and do not belong to the content of this report.

• CActiveVision.h/cpp - contains the implementation of the active vision component, which is currently under development and does not belong to the content of this report. However, these files are needed to compile the project.

The following source files are only GUI-related:

• resource.h - is an automatically generated include file. It contains the IDs of the GUI components.

• shapesGUI.rc - is an automatically generated resource script.

• shapesGUI.h - is the main header file for the GUI. (The location of the header files for MFC and GDI+ needs to be specified in this file.)

• shapesGUI.cpp - defines the class behaviors for the GUI.

• shapesGUIDlg.h/cpp - contains the implementation of the main dialog window of the GUI.

• CLogEdit.h/cpp - contains the class CLogEdit which is an extension of the MFC class CEdit with additional functionality for logging.

• CObjectViewerCtrl.h/cpp - contains the class CObjectViewerCtrl which inherits from the MFC class CWnd and is the implementation of a display area used to visualize a TAbsoluteObjectView (see e.g. figure 5.2).

The implementation of the database and the classification matching component is in C++ and uses only STL classes¹. Thus, the core components can run on any platform, whereas the GUI that served as a tool in the debugging process is not platform-independent (for further explanation refer to section 5).

¹ The “Standard Template Library” (see e.g. [STL]) is a platform-independent collection of container classes, generic algorithms and related components that can greatly simplify many programming tasks in C++.


Appendix C

Number of possible matchings between two object views

Let $\alpha$ and $\beta$ be two object views. In the following, a formula that computes the number of possible matchings between $\alpha$ and $\beta$ is derived.

For any shape type $t$, let $n_{\alpha,t}$ be the number of shape-primitives of this shape type in object view $\alpha$ and $n_{\beta,t}$ the number of shape-primitives of this shape type in object view $\beta$. Obviously, the number of shape matches for this shape type, excluding shape matches with “unmatched”, is bounded by $\min(n_{\alpha,t}, n_{\beta,t})$.

For any integer $k$, $0 \le k \le \min(n_{\alpha,t}, n_{\beta,t})$, all sets containing exactly $k$ (non-overlapping) shape matches can be constructed as follows:

1. $k$ shape-primitives of $\alpha$ are chosen to be matched (shape-primitives that have not been chosen are matched with “unmatched”). This is a combination of $n_{\alpha,t}$ shape-primitives, $k$ at a time. Different arrangements of the same elements do not count. The number of different choices is:

\[
\binom{n_{\alpha,t}}{k} = \frac{n_{\alpha,t}!}{(n_{\alpha,t} - k)!\,k!} \tag{C.1}
\]

2. $k$ shape-primitives of $\beta$ are chosen to be matched (again, shape-primitives that have not been chosen are matched with “unmatched”), where the $i$-th chosen shape-primitive is matched with the $i$-th chosen shape-primitive of $\alpha$. Thus, different arrangements of the same elements count. This is a permutation of $k$ out of $n_{\beta,t}$ elements. The number of different choices is:

\[
\frac{n_{\beta,t}!}{(n_{\beta,t} - k)!} \tag{C.2}
\]

Combining steps (1) and (2) leads to the number of different sets containing exactly $k$ shape matches:

\[
\frac{n_{\alpha,t}!}{(n_{\alpha,t} - k)!\,k!} \cdot \frac{n_{\beta,t}!}{(n_{\beta,t} - k)!} \tag{C.3}
\]


The number of different matchings with any number of shape matches regarding only one particular shape type $t$ is then given by:

\[
\sum_{k=0}^{\min(n_{\alpha,t},\,n_{\beta,t})} \frac{n_{\alpha,t}!}{(n_{\alpha,t} - k)!\,k!} \cdot \frac{n_{\beta,t}!}{(n_{\beta,t} - k)!} \tag{C.4}
\]
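As a worked instance of (C.4), consider a single shape type with $n_{\alpha,t} = 2$ and $n_{\beta,t} = 3$ (values chosen purely for illustration, not taken from the report's experiments). The terms for $k = 0, 1, 2$ are:

\[
\sum_{k=0}^{2} \frac{2!}{(2 - k)!\,k!} \cdot \frac{3!}{(3 - k)!} = 1 \cdot 1 + 2 \cdot 3 + 1 \cdot 6 = 13
\]

That is, one matching leaves everything unmatched, six matchings contain exactly one shape match, and six contain two shape matches.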

Taking all shape types $t$ into consideration finally results in the total number of possible matchings between $\alpha$ and $\beta$, where $t_{max}$ is the maximum shape type value. (In this implementation there are 3 different shape types: ellipses (0), triangles (1) and rectangles (2).)

\[
\prod_{t=0}^{t_{max}} \; \sum_{k=0}^{\min(n_{\alpha,t},\,n_{\beta,t})} \frac{n_{\alpha,t}!}{(n_{\alpha,t} - k)!\,k!} \cdot \frac{n_{\beta,t}!}{(n_{\beta,t} - k)!} \tag{C.5}
\]
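The formula (C.5) can be evaluated directly, as in the following minimal sketch. The function names `matchingsForType` and `countMatchings` are hypothetical and not part of the described source files; the code also assumes that the counts are small enough for the result to fit into 64 bits.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

constexpr int NUM_SHAPE_TYPES = 3;  // ellipses (0), triangles (1), rectangles (2)

// Sum (C.4): number of matchings for one shape type with nA shape-primitives
// in view alpha and nB in view beta. No overflow protection (illustration only).
std::uint64_t matchingsForType(unsigned nA, unsigned nB) {
    const unsigned kMax = std::min(nA, nB);
    std::uint64_t sum = 0;
    std::uint64_t term = 1;  // term for k = 0: C(nA, 0) * nB!/(nB - 0)! = 1
    for (unsigned k = 0; k <= kMax; ++k) {
        sum += term;
        if (k < kMax) {
            // term(k+1) = term(k) * (nA - k) * (nB - k) / (k + 1); the
            // division is exact because term(k) * (nA - k) contains C(nA, k+1) * (k + 1).
            term = term * (nA - k) * (nB - k) / (k + 1);
        }
    }
    return sum;
}

// Product (C.5) over all shape types.
std::uint64_t countMatchings(const std::array<unsigned, NUM_SHAPE_TYPES>& nAlpha,
                             const std::array<unsigned, NUM_SHAPE_TYPES>& nBeta) {
    std::uint64_t product = 1;
    for (int t = 0; t < NUM_SHAPE_TYPES; ++t) {
        product *= matchingsForType(nAlpha[t], nBeta[t]);
    }
    return product;
}
```

For example, two object views that each contain two ellipses, one triangle and no rectangle yield 7 · 2 · 1 = 14 possible matchings, which illustrates how quickly the search space grows with the number of shape-primitives.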


Appendix D

Data structure for a (partial) matching

The data structure of a (partial) matching holds the following information:

• unsigned short dbEntry;

This is an explicit reference to an object view in the database. (A reference to the query is not stored because there is only one query.)

• vector<unsigned char> map;

A matching is constructed iteratively as demonstrated in section 3.1. The shape-primitives of the query are matched according to their order (introduced in section 2.1). After each shape match, the reference to the shape-primitive of the database entry is appended to this vector. That is, the n-th element of this vector holds the index of the shape-primitive of the database entry that has been matched with the n-th shape-primitive of the query.

• float error_sum;

For a complete matching, this value holds the exact matching error. In the case of a partial matching, this is a lower bound for the matching error of any complete matching that can be constructed from this partial one.

In addition to these attributes, others may be stored to avoid repeating computations and increase the performance of the branch & bound algorithm:

• float error_init;

This is the part of error_sum that does not need to be recomputed. It is called the initial error and is further explained in the paragraph that deals with the computation of the lower bound.

• double scale;

This is the scaling factor for the database entry to achieve scaling invariance (see section 3.1 and figure 3.3).

• TPoint shiftQ;

TPoint shiftDB;

These attributes indicate how the query and the database entry have to be shifted to achieve translation invariance (see section 3.1 and figure 3.3). The TPoint structure holds separate values for the x- and y-axis.

• double graPhi;

double graSin;

double graCos;

These attributes hold the value, sine value and cosine value of the global rotation angle. This is the angle by which the database entry has to be rotated to achieve rotation and position invariance (see section 3.1 and figure 3.4).

• short unmatched[NUM_SHAPE_TYPES];

Each element of this array holds the number of shape-primitives of the corresponding shape type that cannot be matched at all. A negative number means that the query has too many shape-primitives; a value greater than 0 indicates that there are too many in the database entry. (These are called u_t in algorithm 3.)

• vector<bool> matched;

Each element of this vector corresponds to a shape-primitive of the database entryand indicates whether this shape-primitive is “free” or has already been matched.

• unsigned char correspondencePoints;

This attribute is a counter for the correspondence points that have been found between the query and the database entry. Only the values 0, 1 and 2 are important. These values refer to the different stages in the comparison algorithm (see section 3.1 and figures 3.3 and 3.4).
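Put together, the attributes listed above might form a structure like the following sketch. The member names and types are taken from the list, and the name TMap from the overview of datatypes.h. Everything else is an assumption made for illustration: the `double` members of TPoint, all default values, and the two helper functions `makeInitialMatching` and `extendMatching`, which do not appear in the report.

```cpp
#include <array>
#include <vector>

constexpr int NUM_SHAPE_TYPES = 3;  // ellipses, triangles, rectangles

// Assumed layout: the report only states that TPoint holds x and y values.
struct TPoint {
    double x = 0.0;
    double y = 0.0;
};

struct TMap {
    unsigned short dbEntry = 0;          // referenced database object view
    std::vector<unsigned char> map;      // n-th query shape -> database shape index
    float error_sum = 0.0f;              // exact error or lower bound
    float error_init = 0.0f;             // part of error_sum that is not recomputed
    double scale = 1.0;                  // scaling factor for the database entry
    TPoint shiftQ;                       // shift of the query
    TPoint shiftDB;                      // shift of the database entry
    double graPhi = 0.0;                 // global rotation angle
    double graSin = 0.0;                 // its sine
    double graCos = 1.0;                 // its cosine
    short unmatched[NUM_SHAPE_TYPES] = {0, 0, 0};
    std::vector<bool> matched;           // one flag per database shape-primitive
    unsigned char correspondencePoints = 0;
};

// Hypothetical helper: empty matching between the query and one database entry.
// unmatched[t] < 0: surplus in the query; unmatched[t] > 0: surplus in the entry.
TMap makeInitialMatching(unsigned short dbEntry,
                         const std::array<int, NUM_SHAPE_TYPES>& nQuery,
                         const std::array<int, NUM_SHAPE_TYPES>& nDb) {
    TMap m;
    m.dbEntry = dbEntry;
    int totalDb = 0;
    for (int t = 0; t < NUM_SHAPE_TYPES; ++t) {
        m.unmatched[t] = static_cast<short>(nDb[t] - nQuery[t]);
        totalDb += nDb[t];
    }
    m.matched.assign(totalDb, false);
    return m;
}

// Hypothetical helper: match the next query shape-primitive with database
// shape-primitive dbIndex. The new lower bound is supplied by the caller.
TMap extendMatching(const TMap& partial, unsigned char dbIndex, float newLowerBound) {
    TMap next = partial;
    next.map.push_back(dbIndex);
    next.matched[dbIndex] = true;
    next.error_sum = newLowerBound;
    return next;
}
```

Copying the partial matching on each extension keeps the sketch simple; a branch & bound implementation would typically order such nodes by error_sum, which is where a structure like the AVL tree mentioned in Appendix B comes in.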


Bibliography

[AVL62] Georgii M. Adelson-Velskii and Evgenii M. Landis. An algorithm for the organization of information. Doklady Akademii Nauk SSSR, 1962. (Russian). English translation by Myron J. Ricci in Soviet Math. Doklady, 3:1259–1263, 1962.

[BET95] Heinrich H. Bülthoff, Shimon Edelman, and Michael J. Tarr. How are three-dimensional objects represented in the brain? Cerebral Cortex, 5(3):247–260, 1995. Available from: citeseer.lcs.mit.edu/525321.html.

[Bie87] Irving Biederman. Recognition by components: a theory of human image understanding, volume 94 of Psychol. Reviews, pages 115–147. 1987.

[COI] Columbia university image library (COIL-100) [online, cited Jan 13th, 2005]. Available from: www1.cs.columbia.edu/CAVE/research/softlib/coil-100.html.

[DY95] Gang Dong and Masahiko Yachida. Acquiring fuzzy relational model from 3-D hierarchical structure of objects. Proceedings of the Fourth IEEE International Conference on Fuzzy Systems, 3:1367–1374, March 1995. Available from: intl.ieeexplore.ieee.org/xpl/abs_free.jsp?arNumber=409859.

[Ede97] Shimon Edelman. Computational theories of object recognition. Trends in Cognitive Sciences, 1:296–304, 1997. Available from: citeseer.nj.nec.com/edelman97computational.html.

[EI02] Shimon Edelman and Nathan Intrator. Visual Processing of Object Structure. 2002. Available from: kybele.psych.cornell.edu/~edelman/arbib2e-final.pdf.

[EI03] Shimon Edelman and Nathan Intrator. Towards structural systematicity in distributed, statically bound visual representations. Cognitive Science, 27:73–110, 2003. Available from: kybele.psych.cornell.edu/~edelman/cogsci-03.pdf.


[EIJ02] Shimon Edelman, Nathan Intrator, and Judah S. Jacobson. Unsupervised learning of visual structure. 2002. Available from: kybele.psych.cornell.edu/~edelman/bmcv02longer.pdf.

[EN98] Shimon Edelman and Fiona Newell. On the representation of object structure in human vision: evidence from differential priming of shape and location. CSRP 500, 1998. Available from: citeseer.nj.nec.com/edelman98representation.html.

[FMK+03] Thomas A. Funkhouser, Patrick Min, Michael Kazhdan, Joyce Chen, Alex Halderman, David Dobkin, and David Jacobs. A search engine for 3D models. ACM Transactions on Graphics, 22(1), 2003. Available from: citeseer.ist.psu.edu/funkhouser02search.html.

[GB89] John J. Grefenstette and James E. Baker. How genetic algorithms work: A critical look at implicit parallelism. In J. David Schaffer, editor, Proceedings of the 3rd International Conference on Genetic Algorithms, pages 20–27, San Mateo, CA, June 1989. Morgan Kaufmann.

[GKK04] Ingrid Gerdes, Frank Klawonn, and Rudolf Kruse. Evolutionäre Algorithmen. Vieweg, July 2004.

[Gol89] David E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Professional, 1989.

[KGK94] Rudolf Kruse, Jörg Gebhardt, and Frank Klawonn. Foundations of fuzzy systems. Wiley, Chichester, 1994.

[Kos96] Stephen M. Kosslyn. Image and Brain. MIT Press, Cambridge, MA, 1996.

[Low85] David G. Lowe. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers, Boston, MA, 1985.

[LW66] Eugene L. Lawler and D. E. Wood. Branch-and-bound methods: A survey. Operations Research, 14(4):699–719, 1966.

[Mai] Introduction to computer vision and image processing [online, cited Aug 1st, 2004]. Available from: www.netnam.vn/unescocourse/computervision/computer.htm.

[Mar82] David Marr. A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, San Francisco, CA, 1982.

[Mica] Microsoft Corporation. GDI+ [online, cited Jan 24th, 2005]. Available from: msdn.microsoft.com/library/en-us/gdicpp/gdiplus/gdiplus.asp.


[Micb] Microsoft Corporation. Microsoft foundation class library (MFC) [online, cited Jan 24th, 2005]. Available from: msdn.microsoft.com/library/en-us/vcmfc98/html/mfchm.asp.

[Mic96] Zbigniew Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, Berlin, 3rd edition, March 1996.

[Mit97] Tom Mitchell. Machine Learning. McGraw Hill, 1997.

[MN78] David Marr and Herbert Keith Nishihara. Representation and recognition of the spatial organization of three dimensional structure. Proceedings of the Royal Society of London B, 200:269–294, 1978.

[OFCD01] Robert Osada, Thomas A. Funkhouser, Bernard Chazelle, and David P. Dobkin. Matching 3D models with shape distributions. In Shape Modeling International, pages 154–166. IEEE Computer Society, 2001. Available from: citeseer.nj.nec.com/373604.html.

[OMT03] Ryutarou Ohbuchi, Takahiro Minamitani, and Tsuyoshi Takei. Shape-similarity search of 3D models by using enhanced shape functions, 2003. Available from: citeseer.nj.nec.com/573301.html.

[Pop94] Arthur R. Pope. Model-based object recognition - A survey of recent research. Technical Report TR-94-04, Dept. Computer Science, Univ. British Columbia, January 1994. Available from: citeseer.nj.nec.com/pope94modelbased.html.

[SoCS] Pittsburgh PA School of Computer Science, Carnegie Mellon University. Computer vision test images [online, cited Jan 13th, 2005]. Available from: www-2.cs.cmu.edu/afs/cs/project/cil/www/v-images.html.

[STL] STLport [online, cited Oct 1st, 2004]. Available from: www.stlport.org.

[Tar] Tarrlab stimuli [online, cited Jan 21st, 2005]. Available from: www.cog.brown.edu/~tarr/stimuli.html.

[TB95] Michael J. Tarr and Heinrich H. Bülthoff. Is human object recognition better described by geon-structural-descriptions or by multiple views? Journal of Experimental Psychology, Human Perception and Performance, 21(6):1494–1505, 1995. Available from: citeseer.ist.psu.edu/tarr95is.html.

[TB98] Michael J. Tarr and Heinrich H. Bülthoff. Image-based object recognition in man, monkey and machine. Cognition, Special issue on Image-Based Object Recognition in Man, Monkey and Machine, 67:1–20, 1998. Available from: citeseer.ist.psu.edu/tarr98imagebased.html.


[TWHG98] Michael J. Tarr, Pepper Williams, William G. Hayward, and Isabel Gauthier. Three-dimensional object recognition is viewpoint dependent. Nature Neuroscience, 1:275–277, 1998. Available from: citeseer.nj.nec.com/35937.html.

[Ull89] Shimon Ullman. Aligning pictorial descriptions: An approach to object recognition. Cognition, 32:193–254, 1989.

[Ull96] Shimon Ullman. High Level Vision: Object Recognition and Visual Cognition. MIT Press, Cambridge, MA, 1996.

[Vel01] Remco C. Veltkamp. Shape matching: Similarity measures and algorithms. In Shape Modeling International, pages 188–199. IEEE Computer Society, 2001. Available from: citeseer.nj.nec.com/veltkamp01shape.html.



Declaration of Independent Work

I hereby declare that I have prepared this work independently and only with permitted aids.

Magdeburg, April 28, 2005

First name and last name of the author
