
EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING

by

JAIM AHMED

(Under the Direction of Maria Hybinette)

ABSTRACT

We introduce a new algorithm for K-nearest neighbor queries that uses clustering and

caching to improve performance. The main idea is to reduce the distance computation cost

between the query point and the data points in the data set. We use a divide-and-conquer

approach. First, we divide the training data into clusters based on similarity between the data

points in terms of Euclidean distance. Next we use linearization for faster lookup. The data

points in a cluster can be sorted based on their similarity (measured by Euclidean distance) to the

center of the cluster. Fast search data structures such as the B-tree can be utilized to store data

points based on their distance from the cluster center and perform fast data search. The B-tree

algorithm also supports efficient range search. We achieve a further performance boost by using B-tree-based data caching. In this work we provide details of the algorithm, an implementation,

and experimental results in a robot navigation task.

INDEX WORDS: K-Nearest Neighbors, Execution, Caching.


EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING

by

JAIM AHMED

B.S., Southern Polytechnic State University, 1997

A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment

of the Requirements for the Degree

MASTER OF SCIENCE

ATHENS, GEORGIA

2009


© 2009

Jaim Ahmed

All Rights Reserved


EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING

by

JAIM AHMED

Major Professor: Maria Hybinette

Committee: Eileen T. Kraemer

Khaled Rasheed

Electronic Version Approved:

Maureen Grasso

Dean of the Graduate School

The University of Georgia

May, 2009


DEDICATION

First of all my dedication goes out to my wife Jennifer for her support and inspiration

especially when the going got tough. Also, my dedication goes to my parents for their

unconditional love and motivation. My final dedication goes to my sister and brother-in-law for

their genuine friendship and kindness.


ACKNOWLEDGEMENTS

First of all, I express my sincere gratitude to my Major Advisor Dr. Maria Hybinette for

her constant support and encouragement. Dr. Hybinette has been very kind with her time and

wisdom. She has been a shining example of hard work and dedication and will remain a source

of inspiration for me forever.

I would also like to thank my committee members Dr. Eileen Kraemer and Dr. Khaled

Rasheed for their time and consideration. Special thanks to Dr. Tucker Balch for his helpful

suggestions and consultations. Also, thanks to the Borg lab for access to example data and their

helpful suggestions.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER

1 Introduction
1.1 Overview
1.2 Problem Domain
1.3 What is K-nearest Neighbor Search?
1.4 Contributions

2 Related Work

3 Background
3.1 Data Clustering
3.2 Data Caching
3.3 Basic KNN Search
3.4 KD-tree Data Structure

4 System Architecture
4.1 Pre-processing
4.2 ckSearch Runtime Queries

5 Experiments & Results
5.1 Setup Information
5.2 The effect of the size of the data set
5.3 The effect of data dimension on the performance
5.4 The effect of search radius on the performance
5.5 The effect of search radius on accuracy
5.6 The effect of the number of clusters

6 Conclusion

REFERENCES

APPENDICES
A Notation Table
B Implementation Pseudocode


LIST OF TABLES

Table 5.1: The effect of data size on performance (k=1)
Table 5.2: The effect of data size on performance (k=3)
Table 5.3: The effect of data size on performance (k=10)
Table 5.4: ckSearch speedup over linear search
Table 5.5: The effect of data dimension on performance (N=50K)
Table 5.6: The effect of data dimension on performance (N=100K)
Table 5.7: ckSearch speedup over linear search for various dimensions
Table 5.8: The effect of search radius on performance (k = 3)
Table 5.9: The effect of search radius on performance (k = 10)
Table 5.10: The effect of the search radius on query accuracy
Table 5.11: The effect of the number of clusters on performance (k=1)
Table 5.12: The effect of the number of clusters on performance (k=5)
Table A.1: List of various notations used in this thesis


LIST OF FIGURES

Figure 1.1: Autonomous robot being trained to navigate through obstacles
Figure 1.2: Autonomous robot navigation sensors input
Figure 1.3: Pictorial representations of KNN search
Figure 3.1: Data clustering in 2-dimensional space
Figure 3.2: Stages in data clustering
Figure 3.3: Typical application cache structure
Figure 3.4: Basic KNN search process represented in 2-dimensional space
Figure 3.5: Basic KNN Search Algorithm
Figure 3.6: KD-tree data structure
Figure 4.1: Cluster data linearization
Figure 4.2: B-tree data structure
Figure 4.3: Data cluster to B-tree correlation
Figure 4.4: ckSearch algorithm data caching scheme
Figure 4.5: Cluster search rule 1 (Cluster exclusion rule)
Figure 4.6: Cluster search rule 2 (Cluster search region rule)
Figure 4.7: Cluster search rule 3 (Cluster contains query sphere)
Figure 4.8: Cluster search rule 4 (Cluster intersects query sphere)
Figure 5.1: Performance vs. data set size chart (k = 1)
Figure 5.2: Performance vs. data set size chart (k = 3)
Figure 5.3: Performance vs. data set size chart (k = 10)
Figure 5.4: Chart showing ckSearch speedup over the linear search
Figure 5.5: Data dimension vs. performance chart (N = 50K)
Figure 5.6: Data dimension vs. performance chart (N = 100K)
Figure 5.7: Search radius vs. performance for 10,000 data records
Figure 5.8: Search radius vs. performance chart for 10,000 data records (k = 10)
Figure 5.9: Search radius vs. query accuracy chart
Figure 5.10: The number of clusters vs. performance chart for 50,000 data records (k = 1)
Figure 5.11: The number of clusters vs. performance chart (k = 5)
Figure B.1: ckSearch KNN algorithm
Figure B.2: SearchClusters(q) pseudocode
Figure B.3: The SearchCache(q) algorithm pseudocode
Figure B.4: The SearchLeftNodes(leafNode_i, key_left) pseudocode
Figure B.5: The SearchRightNodes(leafNode_i, key_right) pseudocode


CHAPTER 1

INTRODUCTION

In this research, we introduce an efficient algorithm for K-nearest neighbor queries that uses

clustering, a pruning of the search space, and caching to improve performance. We call our

algorithm ckSearch. The main goal of this work is to improve performance of queries in a k-

nearest neighbor (KNN) system.

In this chapter we provide an overview of the KNN algorithm, and brief coverage of

the performance challenges facing KNN implementations. We describe our application and

experimental domain, and then provide details on our approach.

1.1 Overview

The K-nearest neighbor algorithm (KNN) is a well-known statistical search or

learning method used in a wide range of problem solving domains: e.g., robotics navigation

[32], data mining [33], and image processing [11]. In robotic navigation KNN is used to

select an appropriate action of a robot by evaluating similar (K) instances from the ‘nearest

neighbor feature set’ in training data. In forestry KNN is used to map satellite image data to

inventory forest resources [34], and in wine evaluation KNN is used to classify wines, where the feature space includes alcohol level, hue, and wine opacity [35]. More formally, KNN

finds the K closest (or most similar) points to a query point among N points in a d-


dimensional attribute (or feature) space. K is the number of neighbors that are considered

from a training data set and typically ranges from 1 to 20.

Advantages of the KNN algorithm include that it is fairly simple to implement and that it is well suited for multi-modal classes [36]. However, a major disadvantage of KNN implementations is their high computational cost, especially when coupled with a large amount of data. The high cost is partly due to computing Euclidean distances between the N neighboring data points and the query point. Further, many KNN implementations degrade in performance as the data becomes higher dimensional (i.e., they suffer from the “curse of dimensionality”); typically, performance starts to degrade when the number of features reaches 20 or more [10]. Another drawback of KNN concerns its significant memory requirements,

especially for Locality Sensitive Hashing (LSH) based KNN systems [6].

A key idea of our ckSearch algorithm is to improve performance by avoiding costly

distance computations for the KNN search. We use a divide-and-conquer approach. First, we

divide the training data into clusters based on similarity between the data points in terms of

Euclidean distance. Next we perform a linearization of data points in each cluster for faster

lookup. The data points in a cluster can be sorted based on their similarity (measured by

Euclidean distance) to the center of the cluster. Our data linearization process takes

advantage of this similarity and produces metric indexes for each data point in a cluster. Fast

search data structures such as the B-tree can be utilized to store data points based on their

metric indexes. Next we load the data points into a memory aware B-tree data structure. We

achieve a further performance boost using B-tree based data caching.

The ckSearch cache policy pre-fetches closer (or more similar) clusters to the query

point into the cache in anticipation of what may be needed next and it avoids checking the


cache if the cluster needed has not been put in the cache. This policy avoids some cache

misses. At runtime, the ckSearch system first evaluates the cache upon receiving the query

point and then searches for the k closest points in the cache. The cache is organized

hierarchically in a B-tree structure, which reduces lookup cost. In the case of a cache miss,

the ckSearch algorithm searches the main B-tree for the k nearest neighbors using our new

method.

1.2 Problem Domain

A focus of this research is to improve performance of the KNN approach and to

demonstrate its performance in a real-world problem. We assessed our approach using data

from an autonomous robot navigation experiment. The existing solution for this system uses

the KD-tree algorithm that partitions the training data set recursively (KD-trees are

specialized BSP trees). A KD-tree based algorithm provided direction and speed commands

for a robot based on learned perception examples. One of our objectives is to improve

performance of the existing KD approach. In order to improve data processing speed, we

introduce our novel ckSearch algorithm that utilizes data clustering and data caching. In

addition, our ckSearch system utilizes several rules to further reduce or avoid costly distance

calculations. Even though our system has been assessed for efficient execution of KNN

algorithm in a robotics domain, this system is expected to perform well in any domain that

utilizes a KNN algorithm. One such domain could be image processing where a KNN

algorithm is used to classify comparable image pixels.


Figure 1.1: An autonomous robot being trained to navigate through obstacles.

Figure 1.1 shows an autonomous robot being trained to navigate through obstacles.

The green lines show sensor readings and the yellow arrow shows the direction. These sensor readings will be used as training data to classify speed and direction during autonomous runs. This image is

used by permission from the Borg Lab at Georgia Institute of Technology.

Autonomous robot navigation in unstructured outdoor environments is a challenging

area of active research. At the core of this navigation task, identifying obstacles and

traversing around these obstacles plays a vital role in reaching the robot’s target destination.

There is a recent trend of using KNN-based approaches in autonomous robotics research

[32]. Autonomous robots have the ability to function and can perform desired tasks in

unstructured environments without continuous human guidance. But, it relies on algorithm

such as KNN for learned data classification. Typically, sensors collect obstacle data and the

decision making system must decide which action to take based on previously learned

behavior [1].


Figure 1.2: Autonomous robot navigation sensors input.

Figure 1.2 shows a representation of a robot’s sensor input for navigation. Each green

line represents an estimate of free space from the robot to an obstacle. At each time step there are 60 such inputs, which make up a 60-dimensional data point. The yellow arrow shows the direction input by the robot trainer, and the blue arrowhead shows the original direction of the target path. Later, the robot uses this 60-dimensional sensor data and the direction taken by the trainer as the training data set to decide speed and direction during autonomous runs.

In this manner, the robot can move through its operating environment without human assistance, using the KNN algorithm to dictate which direction to move and what the speed should be based on previously learned data. Needless to say, this decision-making process must be efficient, accurate, and swift to enable the robot to cope with its environment and avoid obstacles. Most current KNN algorithms (such as the KD-tree) are too slow for the task. This is our problem domain. It was determined that the existing KD-tree based nearest neighbor search algorithm suffered performance degradation from the “curse of dimensionality” and that its performance needed improvement. In this research we worked to develop an algorithm that speeds up classification of such robots' direction and speed data.


Figure 1.3: Pictorial representations of KNN search.

1.3 What is K-nearest Neighbor Search?

The k-nearest neighbor (KNN) search is a variation of the nearest neighbor algorithm in which it is required to find the k closest points to the query point. The nearest neighbor search

algorithm along with its variations are frequently used to solve problems in areas such as

robotics, data mining, multi-key database retrieval, and pattern classification. Discovering a

way to reduce the computational complexity of nearest neighbor search is of considerable

interest in these areas.

The KNN search, also known as the similarity search, can be expressed as an

optimization problem for finding the closest points in a metric space [2]. Given a set Nset of N points in a metric space M and a query point q ∈ M, the problem is to find the k points in Nset that are closest to q. Usually, M is considered to be a d-dimensional

Euclidean space and distance is measured by Euclidean distance or Manhattan distance.
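To make the distance measures concrete, the following is a minimal sketch (in Python, which the thesis does not prescribe) of the Euclidean and Manhattan distance functions over d-dimensional points; the function names are ours, chosen for illustration.

    import math

    def euclidean_distance(p, q):
        # d-dimensional Euclidean (L2) distance between points p and q
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    def manhattan_distance(p, q):
        # d-dimensional Manhattan (L1) distance between points p and q
        return sum(abs(pi - qi) for pi, qi in zip(p, q))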


A significant cost of the KNN approach is due to the computation of the O(l) distance

function, especially when an application uses vectors with a high dimensionality such as

sensor data from an autonomous robot [3]. A full search solution involves calculating the

distance between the target vector q and every vector pi in order to find the k closest to q. Although full search ensures the best possible search results, this solution is often infeasible due to its O(nl) cost. Autonomous robot decision making applications often involve

searching a large database for a closest match to a query case [4].

A simple solution to the KNN search problem is to compute the distance from the query point to all the other points in the database, keeping track of the data points with the smallest distances calculated so far [5]. This sequential full search finds the k nearest neighbors by progressively

updating the current nearest neighbor pj when a data point is found closer to the query point

than the current nearest neighbor. With each update, the current KNN search radius shrinks

to the actual kth nearest neighbor distance. The final nearest neighbor is one of the data

points inside the current nearest neighbor search radius. Thus, in the sequential full search,

the distances of all N data points to the query point are computed with the search complexity

being N distance computations per query point.
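As a reference point for the methods discussed later, the sequential full search described above can be sketched in a few lines of Python; this is only an illustrative baseline (the helper names are ours), not the thesis's implementation.

    import heapq
    import math

    def linear_knn(data, q, k):
        # Sequential full search: compute the distance from q to every point
        # and keep the k smallest distances seen so far (N distance computations).
        def dist(p):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        # heapq.nsmallest scans all points while tracking the current k best.
        return heapq.nsmallest(k, data, key=dist)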

The number of distance calculations for any KNN algorithm grows with the number of data points in the data set. Further, the “curse of dimensionality” increases the number of

calculations tremendously. One approach to reducing the complexity of the nearest neighbor

search is to reduce the number of data points to be searched. Our approach to KNN search

focuses on an inexpensive way of eliminating data points from consideration using

computationally inexpensive rules, thereby avoiding a more expensive distance computation.


The rules determine those data points which cannot be nearer to the query point than the

current nearest neighbor.

The computational demands placed on KNN queries have increased in recent years.

Moreover, the advent of new research areas using learning algorithms such as autonomous

robotics and other artificial intelligence domains has drawn interest back to nearest neighbor

search. Currently, use of large databases containing millions of image records for a vision

based navigational system is quite common [1]. Naturally, these new challenges have

prompted a fresh look at nearest neighbor search and the ways it can help solve new

problems.

As mentioned above, we apply our cluster-based KNN search method to the task of steering and speed decision making for an autonomous robot based on training data. Our approach also utilizes a data caching strategy to improve performance. Moreover, the ckSearch algorithm is general enough to produce good performance in problem domains such as pattern recognition in image processing, information extraction in data mining, and text classification.

1.4 Contributions

Results of our research will be of interest to those investigating high performance

memory-based learning methods. In particular, we have implemented a system that supports

fast and exact KNN queries without scanning the entire data set. Our novel contributions

include:


• A geometry-based method for pruning the search space at query time. Some

existing approaches (e.g. Approximate Nearest Neighbor) also prune, but are

not able to provide exact responses to queries.

• Further improved performance using caching.

Our solution is based on a framework consisting of three major components: (1) Pre-

processing of data points into clusters; (2) Data point mapping to a metric data structure; and

(3) Implementation of smart caching. We have designed our caching strategy based on the

assumption that a data cache can boost performance in repeated calculation algorithms such

as KNN. The approach takes advantage of an algorithm that balances the cost and

performance of each component in order to achieve an overall reduction in cost to improve

performance [4]. Using the above-mentioned techniques along with rules that avoid unnecessary computation, our algorithm achieves a performance improvement over linear search and KD-

tree based KNN algorithms. The performance evaluation section details these experiments

and results.

The rest of the thesis is organized as follows: Chapter 2 discusses related work done

by various other researchers in this area. Chapter 3 presents background information and

various concepts used in this project. Chapter 4 describes in detail our proposed approach

and all the related information. The experiments are discussed and the results are presented in

Chapter 5. Finally, Chapter 6 presents the conclusions of this thesis and describes future

work.


CHAPTER 2

RELATED WORK

This chapter reviews recent research on autonomous robot navigation as well as on the KNN algorithm. Navigation is one of the most challenging skills required of a mobile robot, and there has been a recent trend among researchers of using the KNN algorithm to classify learned data. In this chapter, we present some of the related work done in this area.

The 6D SLAM (Simultaneous Localization and Mapping) system is based on a scan matching technique built on the well-known iterative closest point (ICP) algorithm [3]. This system employs a cached KD-tree to improve the performance of the iterative closest point algorithm. Since the KD-tree itself suffers from performance breakdown with high-dimensional data points, we believe the 6D SLAM system will suffer performance deterioration with high-dimensional navigation data [17].

Another approach to the navigation problem is based on the robot's stereo vision. Binary classifiers were used to augment stereo vision for enhanced autonomous robot navigation. However, this system does not use any single optimized binary classifier; instead, it suggests using several generic classifiers such as SVM, the Simple Fisher Algorithm, and Fisher LDA. This approach also suggests creating and storing learned models of traversable and non-traversable terrain. We believe generic binary classifiers are prone to performance degradation, which can affect the performance of this system [1].


Some researchers applied memory-based robot learning to solve similar problems. Memory-based neural networks were used to learn the task to be performed [22]; this task can be identifying navigational hot spots or making decisions. These researchers also augmented the nearest neighbor network with a local model network.

Next, we present related work in the KNN search area. There has been a long line of research on solving the nearest neighbor search problem. A large number of solutions have been proposed to reduce the cost of the nearest neighbor search. The quality and usefulness of these various proposed solutions are determined by the time complexity of the

queries as well as the space complexity of any search data structures that must be maintained.

The current KNN techniques can be divided into five major approaches. These approaches

are: data partitioning approach, dimensionality reduction approach, locality sensitive hashing

(LSH), scanning based approach, and linearization approach.

The most prominent is the data partitioning approach. It is also known as the space

partitioning, spatial index, or spatial access method. Data partitioning techniques such as

KD-tree [22] or Grid-file [25] iteratively bisects the search space into regions containing

fraction of the points of the parent region. Queries are performed via traversal of the tree

from the root to a leaf by evaluating the query point at each split. One of the main drawbacks

for this concept, is the “curse of dimensonality”. Curse of dimensionality is a problem

caused by the exponential increase in volume associated with adding extra dimensions to a

mathematical space. Data partitioning techniques perform comparable with low-dimension

data points. On the other hand with high dimensional data, partioning technique’s

performance quickly degrades. It is because of the exponential increase in volume associated

with iterative partitioning of the high-dimension euclidean search space. Multi-dimensional


indexes such as R-trees [46] have been shown to be inefficient for supporting range queries

in high-dimensional databases [19].

Dimensionality reduction approaches apply dimension reduction techniques to the data and then insert the reduced data into indexing trees. Dimension reduction is the process of reducing the number of random variables or attributes being considered, and it is divided into feature selection and feature extraction. There are costs associated with performing dimension reduction and the subsequent data indexing. That is why this technique performs well on low-dimensional data sets but its performance suffers when the data dimension increases.

Locality sensitive hashing (LSH) is a comparatively new nearest neighbor search approach. It is a technique for grouping points into buckets based on a distance metric defined on the points. Points that are close to each other under the chosen metric are

mapped to the same bucket with high probability. Theoretically, for a database of n vectors

of d dimensions, the time complexity of finding the nearest neighbor of an object using

locality sensitive hashing is sub-linear in n and only polynomial in d. A key requirement of

applying LSH to a particular space and distance measure is to identify a family of locality

sensitive functions satisfying the required properties [26]. Thus, locality sensitive hashing is only

applicable for specific spaces and distance measures where such families of functions have

been identified, such as real vector spaces with distance measures, or bit vectors with the

Hamming distance [28]. Also, because locality sensitive hashing techniques are based on hashing, they have a large memory footprint. A large amount of memory must be allocated to apply LSH, which is a major drawback [27].
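As an illustration of the bucketing idea (not of any specific LSH family cited in the thesis), the sketch below hashes vectors with random projections so that nearby vectors tend to share bucket keys; the parameters and names are our own assumptions.

    import random

    def make_lsh_hash(d, num_projections=4, bucket_width=1.0, seed=0):
        # Random-projection hash: nearby vectors are likely to get the same key.
        rng = random.Random(seed)
        projections = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_projections)]
        offsets = [rng.uniform(0, bucket_width) for _ in range(num_projections)]

        def hash_vector(v):
            return tuple(
                int((sum(a * x for a, x in zip(proj, v)) + b) // bucket_width)
                for proj, b in zip(projections, offsets)
            )
        return hash_vector

    # Usage: group points into buckets, then compare a query only against its bucket.
    h = make_lsh_hash(d=3)
    buckets = {}
    for p in [(0.1, 0.2, 0.3), (0.11, 0.19, 0.31), (5.0, 5.0, 5.0)]:
        buckets.setdefault(h(p), []).append(p)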


Scanning based approaches such as the VA-file [17] divide the data space into 2^b rectangular cells, where b denotes a user-specified number of bits. Each cell is allocated a bit-

string of length b that approximates the data points that fall into a cell. The VA-file is based

on the idea of object approximation and approximates object shapes by their minimum

bounding box. The VA-file itself is simply an array of these compact, geometric

approximations. The nearest neighbor search starts by scanning the entire file of

approximations and filtering out the irrelevant points based on their approximations. Instead

of hierarchically organizing these cells like in grid-files or R-trees, the VA-file allocates a

unique bit-string of length b for each cell, and approximates data points that fall into a cell by

that bit-string.

Linearization approaches, such as space-filling curve methods (e.g., the Z-order curve), map d-

dimensional points into a one-dimensional space (curve). As a result, one can issue a range

query along the curve to find k-nearest neighbors.
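A minimal sketch of the Z-order idea, assuming the coordinates have already been quantized to non-negative integers; interleaving their bits yields a single key along which a range query can be issued. The function name and bit width are our choices for illustration.

    def z_order_key(coords, bits=16):
        # Interleave the bits of each (already quantized) integer coordinate
        # to map a d-dimensional point onto a one-dimensional curve.
        d = len(coords)
        key = 0
        for bit in range(bits):
            for i, c in enumerate(coords):
                key |= ((c >> bit) & 1) << (bit * d + i)
        return key

    # Points that are close in space tend to get nearby keys:
    print(z_order_key((3, 5)), z_order_key((3, 6)), z_order_key((60, 2)))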

As is evident from the discussion so far, most conventional approaches to KNN search suffer from drawbacks related to either performance or memory space complexity. Our proposed ckSearch approach, described in detail later in this thesis, is a novel approach to the KNN search problem. It utilizes clustering to achieve data partitioning. It makes smart but balanced use of a data caching technique to boost performance. It avoids the curse of dimensionality by mapping d-dimensional points into a one-dimensional space using a linearization approach. It uses an indexing tree as its data structure to avoid large memory requirements. Moreover, it introduces metric index caching to the KNN algorithm. As described in this chapter, there have been metric-based KNN systems, but our proposed combination of data clustering with smart use of a data cache is a unique and novel approach. Our proposed solution has been carefully designed to overcome the disadvantages that many of these conventional approaches suffer. At the same time, our approach retains the benefits the above-mentioned techniques enjoy.


CHAPTER 3

BACKGROUND

This chapter provides comprehensive background information for our project. It is important to remind the reader that the main goal of this project is to design a fast cluster-based KNN algorithm. In addition, this KNN algorithm must be able to process autonomous robot navigation (sensor) data quickly so that the robot can decide on direction and speed without stalling or running into obstacles. As mentioned above, the actual algorithm will be described in the next chapter, while the necessary background information is explained here. For ease of exposition, this chapter is divided into four subsections: data clustering, data caching, basic KNN search, and the KD-tree data structure.

3.1 Data Clustering

Data clustering is an essential component of our ckSearch algorithm and is considered part of the pre-processing step. A large portion of the cost of the KNN search is due to the computation of the O(l) distance function, especially when an application contains points with a large number of dimensions, such as the navigation sensor readings of an autonomous robot. The central strategy to reduce these repeated, and in some cases unnecessary, distance computations is to partition the data space. Since the goal is to split the data space into partitions, data clustering is one of several ways to achieve it.


Figure 3.1: Data clustering in 2-dimensional space.

Cluster analysis is the organization of a collection of patterns, usually a vector of

measurements or a point in a multidimensional space, into clusters based on similarity [9].

Ideally, patterns within a valid cluster are more similar to each other than they are to a pattern

belonging to a different cluster. Since data points in a large database or data set are often

clustered or correlated, data clustering as a data partitioning technique seems ideal. The

diversity of techniques for data representation, similarity between data elements, and

categorizing data elements has generated a range of clustering methods.

Typical pattern clustering activity involves the following steps [9]:

(1) pattern representation

(2) definition of a pattern proximity measure appropriate to the data domain


(3) clustering or grouping

(4) data abstraction

(5) assessment of output if needed

Figure 3.2: Stages in data clustering

Pattern representation refers to the number of classes, the number of available patterns, and the features available to a clustering algorithm. It is divided into feature selection and feature extraction. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new prominent features. Either or both of these techniques can be used to obtain an appropriate set of features for clustering.

Pattern proximity is usually measured by a distance function defined on pairs of

patterns. A variety of distance functions are used depending on the data domain. The Euclidean distance function is the most popular of these and is often used to measure the similarity between two patterns. On the other hand, other similarity measures can be used to

show the conceptual similarity between patterns.

The clustering step can be performed in a variety of ways. There are several major clustering techniques available, such as hierarchical, partitional, fuzzy, probabilistic, and graph-theoretic methods, to name a few. K-means clustering, a partition-based clustering technique, was used in this project. K-means clustering is simple and a natural fit for the data partitioning required by a nearest neighbor search algorithm. There are several other clustering schemes in the literature, such as BIRCH [30], CLARANS, and DBSCAN [31].

Data abstraction is the next step in the clustering process (the output assessment step is optional). It is the process of extracting a simple representation of the data set. A

typical data abstraction is a compact description of each cluster, usually in terms of cluster

prototype or representative patterns such as the centroid.

In this project, data indexing is not dependent on the underlying clustering method.

But, it is expected that the clustering strategy will have an influence on data retrieval

performance.

Figure 3.3: Typical application cache structure


3.2 Data Caching

Data caching is a general technique used to enhance performance of data access where the

original data is expensive to compute compared to the cost of reading the cache. In KNN

search, a large number of high-dimensional dataset is repeatedly accessed with each query. A

data cache can prove extremely effective in such KNN search process. When data is cached,

the most recently accessed data from the high-dimensional data set is stored in a memory

buffer. Thus, this data cache is a temporary storage area where frequently accessed data can

be stored for rapid access. When our ckSearch algorithm needs to access data, it first checks

the cache to see if the data is there. If it finds what it is looking for in the cache, it will use

the data from the cache instead of going to the data source to find it. Thus, using a data cache, our proposed algorithm can achieve shorter access times and boost performance. Even though

a data cache is favorable, there are computational costs associated with data caching. This

cost is primarily an accumulation of data retrieval cost, data maintenance cost, and cache

miss cost. Thus, our proposed algorithm implements a comprehensive caching strategy to

keep cache cost from offsetting performance gains.
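The check-cache-first behaviour described above can be sketched as follows; this is a simplified stand-in (a Python dictionary rather than the B-tree cache used by ckSearch), with hypothetical names.

    def cached_lookup(key, cache, compute_from_source):
        # Return the cached value when present; otherwise fall back to the
        # expensive data source and remember the result for next time.
        if key in cache:
            return cache[key]             # cache hit: short access time
        value = compute_from_source(key)  # cache miss: pay the full cost once
        cache[key] = value
        return value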


Figure 3.4: Basic KNN search process represented in 2-dimensional space

3.3 Basic KNN Search

In order to search for the k nearest neighbors of a query point q, the distance of the kth nearest neighbor to q defines the minimum radius rmin required for retrieving the complete answer set. It is not possible to calculate such a distance in advance because we do not know which points surround the query point q without further scanning. Thus, iteratively increasing the search radius and examining the neighbors within that search sphere is a viable approach.

In this algorithm, the query point in question is q and the task is to find its k nearest neighbors. The search process starts with a query sphere defined by a relatively small radius r about the query point q, SearchSphere(q,r). Naturally, all

data spaces the query sphere intersects have to be searched for potential k nearest neighbors.

Iteratively, the search sphere is expanded until all k nearest neighbor points are found. In this


process, all the data subspaces intersecting the current query space are checked. If enlargement of the query sphere does not introduce new nearest neighbor points, the current KNN result set R is taken as the set of nearest neighbors (assuming the size of the current result set is k). The search starts with a small initial radius, which in turn keeps the search space small and avoids unwanted calculations. The goal here is to minimize unnecessary search costs. Arguably, a search sphere with a larger radius may contain all the k nearest points, but the cost of going through all the data points it contains outweighs the benefits.

Basic KNN Search(k):

1   R = empty;                                        // the result set
2   Search sphere radius, r = as small as possible;
3   find all data spaces intersecting the current query sphere;
4   check all intersecting data spaces for the k nearest neighbors;
5
6   if R.Size() == k                                  // the k nearest neighbors are found
7       exit;
8   else
9       increase the search radius;
10      goto line 3;                                  // start the search process again
END;

Figure 3.5: Basic KNN Search Algorithm
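A runnable sketch of the radius-expansion search in Figure 3.5, written in Python over a plain list of points; the starting radius, growth factor, and helper names are our own illustrative choices rather than values taken from the thesis.

    import math

    def expanding_radius_knn(data, q, k, r0=0.1, growth=2.0):
        # Grow a query sphere around q until it contains at least k points,
        # then return the k closest of the points found inside the sphere.
        def dist(p):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        r = r0
        r_max = max(dist(p) for p in data)             # a radius that surely suffices
        while True:
            inside = [p for p in data if dist(p) <= r]  # points in the query sphere
            if len(inside) >= k or r >= r_max:
                return sorted(inside, key=dist)[:k]
            r *= growth                                 # enlarge the search sphere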


We performed several performance comparisons between the ckSearch algorithm and the KD-tree based multi-dimensional indexing structure; these are detailed in the experiments chapter. We believe it is important to understand the KD-tree algorithm in order to understand those comparisons, so an account of the KD-tree is included in the following section.

Figure 3.6: KD-tree data structure

3.4 KD-tree data structure

K-dimensional search trees, i.e. KD-trees, are a generalization of binary search trees designed to handle multidimensional records. In a KD-tree, a multidimensional record is identified with its corresponding multidimensional key x = (x^(1), x^(2), ..., x^(K)), where each x^(n), 1 ≤ n ≤ K, refers to the value of the nth attribute of the key x. Each x^(n) belongs to some totally ordered domain Dn, and x is an element of D = D1 × D2 × ... × DK.

Therefore, each multidimensional key may be viewed as a point in a K-dimensional space, and its nth attribute can be viewed as the nth coordinate of such a point. Without loss of generality, we assume that Dn = [0,1] for all 1 ≤ n ≤ K, and hence that D is the hypercube [0,1]^K [10]. A KD-tree for a set of K-dimensional records is a binary tree such that:

(1) Each node contains a K-dimensional record and has an associated discriminant n ∈ {1, 2, ..., K}.

(2) For every node with key x and discriminant n, any record in the left sub-tree with key y satisfies y^(n) < x^(n), and any record in the right sub-tree with key y satisfies y^(n) > x^(n).

(3) The root node has depth 0 and discriminant 1. All nodes at depth d have discriminant (d mod K) + 1.

There are many implementations of KD-trees, including homogeneous and non-homogeneous KD-trees. Non-homogeneous KD-trees contain only one value in internal nodes along with pointers to the left and right sub-trees; all records are stored in external nodes. The expected cost of a single insertion in a random KD-tree is O(log n), while the expected cost of building the whole tree is O(n log n). On average, deletions in KD-trees have an expected cost of O(log n). Nearest neighbor queries are supported in O(log n) time in KD-trees [10].
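To make properties (1)-(3) concrete, here is a minimal KD-tree construction sketch in Python that cycles the discriminant with depth; it is an illustrative reading of the definition above, not code from the thesis.

    class KDNode:
        def __init__(self, point, discriminant, left=None, right=None):
            self.point = point                # the K-dimensional record stored at this node
            self.discriminant = discriminant  # attribute index used to split (property 1)
            self.left = left
            self.right = right

    def build_kd_tree(points, depth=0):
        # Property 3: the discriminant cycles with depth, (depth mod K) + 1,
        # which is index (depth mod K) when counting attributes from 0.
        if not points:
            return None
        K = len(points[0])
        n = depth % K
        points = sorted(points, key=lambda p: p[n])
        mid = len(points) // 2
        # Property 2: smaller values of attribute n go left, larger go right.
        return KDNode(points[mid], n,
                      build_kd_tree(points[:mid], depth + 1),
                      build_kd_tree(points[mid + 1:], depth + 1))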


CHAPTER 4

SYSTEM ARCHITECTURE

In this section, we describe the system architecture of ckSearch, which includes our scalable

and efficient KNN search mechanism.

A number of solutions have been introduced to reduce the cost of the KNN search.

The quality and usefulness of these various solutions are limited by the computational time

complexity of computing queries as well as the space complexity of the relevant search data

structures. As mentioned in the Related Work chapter, solutions face tradeoffs that affect performance and are prone to the curse of dimensionality phenomenon as the number of attributes increases. When the number of attributes is large, KNN implementations either require large memory allocations (due to space complexity) or fall victim to time complexity. A rule of thumb is that KNN algorithms work well for 20 or fewer attributes [10].

Our ckSearch technique balances both time and space complexities to achieve an

overall reduction in both. Our cluster-based approach uses caching to minimize the cost of

searching high-dimensional data. Our solution, detailed in this section, includes two phases:

(1) The pre-processing of data points; and (2) Runtime queries. In the pre-processing step the

d-dimensional data set is partitioned into data clusters based on similarity between the data

points. We discuss both phases in detail in the preprocessing and runtime query sections

below. The following observations influenced the design of the ckSearch system:


Observation 1: (Data partitioning)

Data space partitioning can reduce redundant distance computations while searching for k

nearest neighbors in a high dimensional data domain. Simple clustering algorithms such as

K-means clustering can reduce computational cost by separating high-dimensional data

points into clusters based on similarity.

Observation 2: (Data reference)

Reference to a cluster centroid may expose similarity or dissimilarity between data points

within a cluster and data points across different clusters. Moreover, data points in a cluster

can be sorted based on their distance from a reference point such as the cluster centroid.

Observation 3: (Data Caching)

Data caching can substantially reduce search time through pre-fetching and can reduce the cost of distance calculation for the KNN search. Cache miss expenditure must be kept in check by using smart cache strategies and rules that predict cache miss scenarios.

4.1 Pre-Processing

Step 1: Data Partitioning – K-means Clustering

Data clustering is an essential component of our algorithm. By clustering as a pre-processing

step, we are able to improve the performance of queries at runtime. A direct approach to

reducing the complexity of the nearest neighbor search is to reduce the number of data points

investigated. The central strategy to reduce these repeated, and in some cases unnecessary,

distance computations is to partition the data space. ckSearch splits the data space into


partitions and uses data clustering to avoid examining unnecessary data points in

multidimensional data by clustering based on data similarities (Observation 1). The first step

is to cluster the data set using an existing K-means clustering algorithm.

K-means clustering is a simple, partition-based clustering technique. It is a good fit

for the data partitioning required for nearest neighbor search. It is important to mention that

even though our approach uses K-means clustering, it does not depend on this particular clustering technique. We could just as easily have selected another clustering algorithm such as DBSCAN [31], CLARANS, or BIRCH [30]. In our algorithm, the number of clusters is selected based on the number of records present in the data set: we choose 5 clusters for up to 10,000 records and add 2 clusters for every additional 5,000 records.
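The sketch below illustrates this pre-processing step: the cluster-count heuristic as we read it (5 clusters for up to 10,000 records, plus 2 for every additional 5,000) and a plain Lloyd's K-means iteration as a stand-in for whatever K-means implementation is used; all names here are our own.

    import math
    import random

    def num_clusters(n_records):
        # Heuristic as we read it: 5 clusters up to 10,000 records,
        # then 2 more clusters for every additional 5,000 records.
        if n_records <= 10000:
            return 5
        return 5 + 2 * math.ceil((n_records - 10000) / 5000)

    def kmeans(points, k, iters=20, seed=0):
        # Plain Lloyd's algorithm: assign points to the nearest center,
        # then recompute each center as the mean of its assigned points.
        rng = random.Random(seed)
        centers = rng.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
                clusters[i].append(p)
            for j, members in enumerate(clusters):
                if members:
                    centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
        return centers, clusters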

Figure 4.1: Cluster data linearization


Once we have selected a number of cluster centers, we can use them to index our

data. Figure 4.1 shows cluster data linearization based on the distance between the center and each individual data point in that cluster. The cluster center is the starting point of the segment, and the cluster boundary is the maximum of the segment.

Step 2: Data index construction & Index structure

After the clustering phase, our algorithm constructs the data index. This data index is a single

dimensional value based on the distance between the data point and a reference point in a

data partition. During this part of the process, each high-dimensional point is transformed

into a point in a single dimensional space.

This conversion is commonly known as data linearization or data mapping.

Linearization is achieved by selecting a reference point and then ordering all partitions

according to their distances from the selected reference point. This aligns well with Observation 2, which states that reference to a cluster center may expose similarity or dissimilarity of data points in a cluster. This similarity or dissimilarity is exposed by linearization in the form of data mapping. There are several types of reference points that can be used for the linearization process. Typically, the center of a cluster is used as the reference point, but some linearization techniques use either a boundary (edge) point or a random point as the reference. Ad hoc linearization approaches, such as space-filling curve methods like the Z-order curve [15], map d-dimensional points into a one-dimensional space (curve). For further cost reduction, we use a three-step data linearization algorithm. In the following section, the data index construction (i.e. linearization) using this three-step algorithm is

described.


First, a reference point is identified for each partition or data cluster. The center of

each partition or cluster is selected as the reference point. In the second step, the Euclidean distance between the data point pi and the reference cluster center Ci is computed. In the final step, the following simple linear function is used to complete the conversion (i.e., the

data mapping). Each high-dimensional data point is transformed into a key, keyi, in a single

dimensional space.

keyi = distance(pi, Ci) + m × µ; (4.1)

In the above function (4.1), the term keyi represents the single dimensional index

value for a data point after the linearization process [11]. According to the research work on

data partitioning by Agbhari & Makinouchi [11], data points in a cluster can be referenced

and mapped by a fixed data point such as the cluster center. We utilized this concept to

perform data linearization in this project. The function distance(pi, Ci) represents the distance between the data point pi and the cluster center reference point Ci; it is a Euclidean distance function and returns a single-dimensional distance value. The next parameter, m, is the index of the data cluster being processed. If there are M clusters in total, then the value of m lies between 0 and M – 1, such that 0 ≤ m ≤ M – 1. For example, if there are 10 clusters, then m takes one of the values in the range [0, 1, 2, ..., 9].

The last parameter µ is a constant to stretch the data ranges. The constant µ serves as

a multiplier to parameter m so that all points in a partition or cluster can be mapped to a

region within m × µ and ((m + 1) × µ). Because of the µ multiplier, the function (4.1)


correctly maps the cluster center as the minimum boundary or starting index of this region and the furthest data point in the cluster as the maximum boundary or index of this region. Moreover, all the data points in the cluster map appropriately in between the minimum and maximum indices. As a result, one can issue a range query to find the nearest neighbors, enabling the use of an efficient single-dimensional index structure such as the B-tree.
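A sketch of the mapping in equation (4.1); choosing µ at least as large as the largest cluster radius keeps the per-cluster key ranges from overlapping. The helper names are ours, and the clusters and centers are assumed to come from the clustering step above.

    import math

    def euclidean(p, c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))

    def linearize(clusters, centers):
        # clusters[m] holds the points of cluster m, centers[m] its center C_m.
        # mu must be at least the largest distance(p, C_m) so that cluster m maps
        # into the non-overlapping key range [m * mu, (m + 1) * mu).
        mu = max(euclidean(p, centers[m])
                 for m, members in enumerate(clusters) for p in members) + 1e-9
        keys = {}
        for m, members in enumerate(clusters):
            for p in members:
                keys[p] = euclidean(p, centers[m]) + m * mu   # equation (4.1)
        return keys, mu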

Figure 4.2: B-tree data structure

Figure 4.2 above shows a B-tree data structure. The leaf nodes contain data points. The B-tree is especially optimized for search operations.

Step 3: Data structure & data loading

The selection of appropriate data structures is an integral part of any efficient search

algorithm design. For fast data retrieval algorithms such as our ckSearch, it is vital to use a

speedy data structure. In the ckSearch system, we used three different data structures. The core structure is the B-tree, which was used as the main data storage for our system. We have

also utilized one-dimensional arrays and two-dimensional arrays. The two-dimensional array

was used to store the minimum and maximum data distance for each cluster. Any balanced

tree such as a B-tree works well as a fast cache data structure because of its rapid data


retrieval time. Accordingly, an instance of the B-tree algorithm is used for the data caching

implementation as well.

B-tree is a data structure that keeps data sorted and allows searches, insertions, and

deletions in logarithmic time. It is optimized for systems that read and write segments of data such as data clusters, databases, and file systems. In B-trees, non-leaf nodes can have a variable number of child nodes and are used as guides to the leaf nodes. Search operations with in-memory B-trees are significantly faster than with in-memory red-black trees and AVL trees [28]. The B-tree fits our ckSearch algorithm well because costly B-tree insertion operations

are only performed during the pre-processing index loading time. During actual ckSearch

runtime, only inexpensive search operations are performed (on the B-tree) to locate k nearest

neighbors. This strategy further aids our algorithm in improving overall processing time.

After the data linearization process described in the previous section, the mapped points are loaded into the B-tree. The transformed data point indexes serve as the keys of the data structure, and only leaf nodes store the actual data points. The conventional B-tree was modified so that each leaf node is linked to its neighboring leaf nodes on both sides; this modification further speeds up retrieval of nearest neighbor points.

In our algorithm, a two-dimensional array stores the maximum distance distMaxi between each cluster center Ci and the furthest data point in that cluster, and the minimum distances distMini are stored in this array as well. Our algorithm uses the distMaxi and distMini values to eliminate unnecessary out-of-boundary (data space) computations. A separate one-dimensional array stores the cluster centers.
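As a rough illustration of this pre-processing step, the sketch below loads the linearized keys into a java.util.TreeMap, used here only as a stand-in for the modified B-tree, and fills the per-cluster distMin/distMax bookkeeping. It reuses the hypothetical linearizationKey method from the earlier sketch; all other names are illustrative as well, not the ckSearch code itself.

    import java.util.Arrays;
    import java.util.TreeMap;

    // Pre-processing sketch: index loading and per-cluster distance bookkeeping.
    final class IndexLoader {
        static TreeMap<Double, double[]> load(double[][] points, int[] clusterOf,
                                              double[][] centers, double mu,
                                              double[] distMin, double[] distMax) {
            Arrays.fill(distMin, Double.MAX_VALUE);
            Arrays.fill(distMax, 0.0);
            TreeMap<Double, double[]> index = new TreeMap<>();  // stand-in for the B-tree
            for (int i = 0; i < points.length; i++) {
                int m = clusterOf[i];
                double key = Linearizer.linearizationKey(points[i], centers[m], m, mu);
                index.put(key, points[i]);                      // leaf-level storage keyed by (4.1)
                                                                // (ties on the key would need extra handling)
                double dist = key - m * mu;                     // recover distance(p_i, C_m)
                distMin[m] = Math.min(distMin[m], dist);        // closest point to the center
                distMax[m] = Math.max(distMax[m], dist);        // furthest point from the center
            }
            return index;
        }
    }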


4.2 ckSearch Runtime Queries

In this section, we describe the ckSearch query. After loading the indexes into the tree-based

data structure, the pre-processing part of the algorithm concludes. At this point, our algorithm

performs the fast KNN search.

How ckSearch Works

In this section, we describe the search process of ckSearch. The overall technique is to solve the KNN problem iteratively. The search begins by selecting a small radius ri that defines a small area around the query point and then iteratively increases the radius up to a maximum radius rmax. The search space grows until all k nearest neighbors are found or the “STOP” criterion is met (r reaches rmax).

As explained above, during pre-processing the data points are clustered (using K-means clustering), reference points (the cluster centers) are selected, data linearization is completed, and the data points are loaded into a B-tree data structure. The actual search begins by consulting the cache hit-miss strategy and determining the outcome based on the cache rules described below in the “Cache strategy” section. Regardless of the outcome of the cache strategy, the ckSearch algorithm next inspects the following two stopping criteria:

• The search radius ri has reached its maximum threshold value rmax and the k nearest neighbors have still not been found.

• The distance(pmax,q) value, the distance between query point q and the furthest data

point pmax in the result set R, is less than or equal to the current search radius ri and

the size of the result set is k. In this case, we can be sure that the algorithm has found


all the k nearest neighbors of query point q, and any further increase of the query area (i.e., the search radius ri) would only result in redundant computational cost.

Next, if the outcome of the cache hit-miss strategy is a hit, the algorithm enters the SearchCache(q) sub-routine (see Appendix B, figure B.3). The data cache is a B-tree index structure modified to access left and right leaf nodes. The algorithm iteratively runs the SearchCache(q) sub-routine until stopped by the criteria mentioned in the “Cache strategy” section below; in each iteration it increases the search radius ri by the increment amount rincrement to widen the search space. If, instead of a cache hit, a cache miss occurs at the beginning of the search, our ckSearch algorithm enters a loop where it first checks the stopping criteria and then enters the SearchClusters(q) routine.

The SearchClusters(q) search routine (see Appendix B, figure B.2) is an important part of our algorithm because it applies the “Cluster search rules” to significantly reduce computation cost. It checks every cluster iteratively and takes one of the following three actions:

• Exclude the cluster from the search: If the cluster in question does not contain or intersect the search sphere of the query point q and thus falls under the Cluster exclusion rule (Rule 1), the cluster is exempted from the KNN search, yielding a significant reduction in computation cost.

• Call SearchLeftNodes(), searching the cluster inwards and ignoring nodes to the right: If the cluster in question “intersects” the query search sphere according to the Cluster intersects query sphere rule (Rule 4), the data space inward toward the cluster


center must be searched. In this case, only nodes to the left of the query node in the B-tree need to be searched; nodes to the right (in the B-tree) are ignored because they reside outside the cluster boundary. Thus, our algorithm only calls the SearchLeftNodes(leafNodei, keyleft) sub-routine (Appendix B, figure B.4) in the next step to search for the k nearest neighbors.

• Perform an exhaustive search: If the data cluster “contains” the query sphere of point q, as determined by the Cluster contains query sphere rule (Rule 3), then an exhaustive search of the cluster must be completed to find the k nearest neighbors. The data space is traversed by searching both inward and outward from the cluster center, because potential nearest neighbors can lie to the left or right of the query node in the B-tree. The search routines SearchLeftNodes(leafNodei, keyleft) and SearchRightNodes(leafNodei, keyright) are used for searching inward and outward from the cluster center.

Next, our ckSearch algorithm locates the leaf node leafNodei (in the B-tree) where a query point q with index keyquery would be stored. Intuitively, this leafNodei has a high probability of containing the nearest neighbors of the query point, because the data points stored in leafNodei have a similar distance from the cluster center as the query point q and therefore reside in the same region of the data space. The sub-routine getQueryLeaf(btree, keyquery) returns this leaf node.

Next, based on the Cluster search rules (see the “Cluster search rules” section), the ckSearch algorithm either calls SearchLeftNodes(leafNodei, keyleft) alone for Rule 4,


or calls both SearchLeftNodes(leafNodei, keyleft) and SearchRightNodes(leafNodei, keyright) for Rule 3. Each of these sub-routines has built-in loops that check for the k nearest neighbors within leafNodei. Moreover, these routines check the left and right sibling leaf nodes depending on whether the data search proceeds inward, outward, or both (Rule 4 or Rule 3).

Figure 4.3: Data cluster to B-tree correlation

Figure 4.3 above shows the correlation between a data cluster and the B-tree. It illustrates how the data points of a cluster are stored in the B-tree leaf nodes (bottom level). The data points are sorted by their one-dimensional, linearly transformed distance from the cluster center, which is used as the key.

It is important to mention that the actual discovery of the nearest neighbors happens in the SearchLeftNodes(leafNodei, keyleft) and SearchRightNodes(leafNodei, keyright) sub-routines, because each of these routines iteratively calculates the distance between each data point in leafNodei and the query point q. The k data points with the shortest distance to the query point are returned as the result set.
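A possible way to maintain the k best candidates while a leaf node is scanned is sketched below in Java. The max-heap bookkeeping and all names are illustrative assumptions; the pseudocode in Appendix B (figures B.4 and B.5) is the authoritative description.

    import java.util.PriorityQueue;

    // Sketch: scan one leaf node and keep only the k points closest to the query q.
    final class LeafScan {
        static PriorityQueue<double[]> scanLeaf(double[][] leafPoints, double[] q, int k) {
            // Max-heap ordered by distance to q, so the furthest candidate (p_max) is on top.
            PriorityQueue<double[]> result =
                new PriorityQueue<>((a, b) -> Double.compare(dist(b, q), dist(a, q)));
            for (double[] p : leafPoints) {
                if (result.size() < k) {
                    result.add(p);                               // result set not yet full
                } else if (dist(p, q) < dist(result.peek(), q)) {
                    result.poll();                               // drop the current furthest point
                    result.add(p);                               // keep the closer candidate
                }
            }
            return result;
        }

        static double dist(double[] a, double[] b) {
            double s = 0.0;
            for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
            return Math.sqrt(s);
        }
    }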


If the query sphere contains the first element of a node, then its predecessor with respect to distance from the cluster center may also be close to q; thus, SearchLeftNodes(leafNodei, keyleft) also examines the left sibling leaf node for nearest neighbors. Likewise, if the query sphere contains the last element of a node, then for the same reason the SearchRightNodes(leafNodei, keyright) routine examines the right sibling leaf node for nearest neighbors.

At the end of these phases the algorithm re-examines the two stopping criteria mentioned above. It checks the KNN result set R and stops once the k nearest neighbors have been identified and further enlargement of the search sphere cannot change the result set; that is, the search stops only when the distance from query point q to the furthest data point in R is less than or equal to the current search radius ri. Otherwise, it increases the search radius and repeats the entire process. Figures B.1 – B.5 in Appendix B give the pseudocode of the sub-routines mentioned above.
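The outer radius-expansion loop and its two stopping criteria can be sketched in Java roughly as follows. The cluster search step is passed in as a callback because its details are given separately (Appendix B), and all names here are illustrative assumptions rather than the ckSearch code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.BiConsumer;

    // Sketch of the iterative radius expansion with the two stopping criteria.
    final class RadiusLoop {
        static List<double[]> run(double[] q, int k, double rIncrement, double rMax,
                                  BiConsumer<Double, List<double[]>> searchClusters) {
            List<double[]> result = new ArrayList<>();
            double r = rIncrement;
            while (r < rMax) {                          // stopping criterion 1: r reaches rMax
                searchClusters.accept(r, result);       // gather candidates within radius r
                if (result.size() == k && furthest(result, q) <= r) {
                    break;                              // stopping criterion 2: result set is final
                }
                r += rIncrement;                        // widen the search sphere and retry
            }
            return result;
        }

        // Distance from q to the furthest point currently in the result set (p_max).
        static double furthest(List<double[]> points, double[] q) {
            double max = 0.0;
            for (double[] p : points) {
                double s = 0.0;
                for (int d = 0; d < q.length; d++) { double diff = p[d] - q[d]; s += diff * diff; }
                max = Math.max(max, Math.sqrt(s));
            }
            return max;
        }
    }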


Figure 4.4: ckSearch algorithm data caching scheme

Figure 4.4 shows the ckSearch algorithm data caching scheme. In this example, the

query point q and data point A reside in the same leaf node of the ckSearch cache. This is a

cache hit scenario.

Cache Strategy

Data caching is an important component of our KNN search algorithm. A data cache can prove extremely effective in a KNN search process where a large, high-dimensional data set is accessed repeatedly. A fast cache implementation can dramatically reduce the number of distance computations by simply storing frequently accessed data in a


data cache. On the other hand, expensive cache misses can degrade performance. Thus, we have developed a cache strategy that reduces redundant computation while avoiding expensive cache misses (and therefore costly B-tree insertion operations). This cache strategy comprises the following rules:

• Reduce the cost of the insertion operation as much as possible by reducing frequent cache updates. The underlying data structure of our cache strategy is a B-tree, which is ideal for fast cache implementations; in a B-tree, inserting a record requires O(log n) operations in the worst case.

• Conduct preliminary checks before performing costly cache searches to reduce the cache-miss cost. We take this conservative approach to make sure that cache hits remain a performance boost for the ckSearch system and are not overwhelmed by too many cache misses. For a given query point, we find the closest cluster to the query point by calculating the distance between the query point and the cluster centers. Then, we check whether the closest cluster to the query point is the same as the cluster stored in the data cache (a B-tree structure). Our assumption here is that two consecutive query points will fall in the same cluster, and possibly around the same region of that cluster, so their k nearest neighbors will also be in the same region of the cluster.

• Perform an additional check by matching the query point’s leaf node from the data cache B-tree with the corresponding leaf node from the actual data storage B-tree (a sketch of this hit-miss check appears after this list). These two leaf nodes essentially indicate the same region of the same data cluster. If these two leaf nodes turn


out to be the same, then the current query point falls in the same data region as the previous query point, because our data structure keeps the leaf nodes sorted by distance from the cluster center; for two leaf nodes to be the same, the data points stored in them must be located in the same region of a cluster. Thus, our ckSearch algorithm proceeds to retrieve the k nearest neighbors from the data cache.

• If it is a cache-miss scenario according to the above checks, our algorithm skips the data cache and the search is performed on the main data storage B-tree. At the end of the query search, the leaf nodes containing the nearest neighbors are loaded onto the data cache B-tree for the next query iteration by the CacheUpdate process.
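The hit-miss check described in the second and third rules might look roughly like the Java sketch below. All names are illustrative assumptions, and the cached leaf is represented here by the key range it covers; the thesis itself identifies the leaf by matching B-tree nodes directly.

    // Hypothetical sketch of the ckSearch cache hit-miss check.
    final class CacheCheck {
        static double euclidean(double[] a, double[] b) {
            double s = 0.0;
            for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
            return Math.sqrt(s);
        }

        // Returns true only if q maps to the cached cluster AND its linearization key
        // falls inside the key range covered by the cached leaf node.
        static boolean isCacheHit(double[] q, double[][] centers, double mu,
                                  int cachedCluster, double cachedLeafLow, double cachedLeafHigh) {
            int closest = 0;
            double best = Double.MAX_VALUE;
            for (int m = 0; m < centers.length; m++) {       // preliminary check: nearest cluster
                double d = euclidean(q, centers[m]);
                if (d < best) { best = d; closest = m; }
            }
            if (closest != cachedCluster) return false;      // different cluster: cache miss

            double key = closest * mu + best;                // key from equation (4.1)
            return key >= cachedLeafLow && key < cachedLeafHigh;  // same leaf region: cache hit
        }
    }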

Cluster Search Rules

Our online search strategy depends critically on a search radius parameter ri. We initially

select ri to be conservatively small. If there are not enough points returned from a query, we

can gradually increase the value of ri. In this section we describe several cluster search rules

based on the query radius, the cluster boundary, and the location of the query point. Using these parameters and simple geometric calculations, it is possible to determine with certainty that some clusters cannot contain any of the k nearest neighbors. These clusters can therefore be completely excluded from the computations, eliminating a significant amount of computational cost. The following rules are applied at query time (runtime) in ckSearch.


Figure 4.5: Cluster search rule 1 (Cluster exclusion rule)

The above figure (figure 4.5) illustrates cluster search rule 1 (the Cluster exclusion rule). In this example, the query point is outside the cluster M1, so this cluster can be excluded from the KNN search operations, thereby avoiding expensive distance computations.

Rule 1: The cluster exclusion rule

A cluster can be excluded from the nearest neighbor search if the following condition is true:

distance(Ci, q) - ri > distMaxi (4.2)

Employing this exclusion strategy, it is possible to exclude a cluster and all of its data points from the KNN search; naturally, excluding a cluster from the distance computations reduces the computation cost.


Let Ci be the reference point (cluster center) of the cluster Mi, and let the query point q have a search radius ri. As described above, ri is the radius of the search area within which the ckSearch system looks for possible nearest points. The distance between the cluster center and the query point q is denoted by distance(Ci, q), and the distance between Ci and the furthest data point in cluster Mi is denoted by distMaxi. Given distance(Ci, q) > distMaxi, i.e. the query point q lies outside the cluster boundary, the cluster Mi can be excluded from the KNN search when the entire query sphere also rests outside the cluster, which is exactly condition (4.2).

Figure 4.6: Cluster search rule 2 (Cluster search region rule)

The above figure (figure 4.6) shows cluster search rule 2 (the Cluster search region rule). This rule describes the valid search region for a query point within a cluster; it keeps the search computations within the valid region of a cluster and avoids unnecessary iterations over invalid regions.


Rule 2: Cluster search region rule

When a cluster is searched for nearest neighbor points, the effective search range is

distmin = max(distMini, distance(Ci, q) - ri)

distmax = min(distMaxi, distance(Ci, q) + ri)

and the effective search region lies within [distmin, distmax]. (4.3)

A carefully selected search region can further reduce the cost of the nearest neighbor search. Moreover, a range query can be performed over this search range within an affected cluster. Most importantly, search termination rules can be set up based on this range while searching the leaf nodes of the B-tree index structure for nearest neighbors, which speeds up data retrieval from the B-tree.

Let the distance between the cluster center Ci and the query point q be denoted by distance(Ci, q), and let the query point q have a search radius ri. Moreover, let the distances from the cluster center Ci to the furthest and closest data points in cluster Mi be denoted by distMaxi and distMini. From these quantities we can deduce the effective search region of a cluster, because no data point of the cluster lies beyond this region.


Figure 4.7: Cluster search rule 3 (Cluster contains query sphere)

Figure 4.7 above illustrates cluster search rule 3 (Cluster contains query sphere). In this case the query point q and its search region, based on query radius r1, lie completely inside the cluster M1; thus, cluster M1 contains q’s query sphere.

Rule 3: Cluster contains query sphere rule

The query sphere with radius ri is completely contained in the affected cluster Mi if the following condition is true:

distance(Ci, q) + ri ≤ distMaxi (4.4)

Whether the query point q and its search sphere are completely contained in a partition (cluster) is an important piece of information, because it can be used to formulate a smarter nearest neighbor search and in turn reduce search-related computation cost.


Let distance(Ci, q) be the distance between cluster center Ci and the query point q, and let distMaxi be the radius of the cluster Mi. Given distance(Ci, q) ≤ distMaxi, we can state that if the condition distance(Ci, q) + ri ≤ distMaxi holds, then the cluster Mi completely contains the query sphere (see figure 4.7).

Figure 4.8: Cluster search rule 4 (Cluster intersects query sphere)

Figure 4.8 above shows cluster search rule 4 (Cluster intersects query sphere). In this case the query sphere of q only intersects the cluster M1; thus, it is possible that some of the k nearest neighbors are not in the cluster M1.

Rule 4: Cluster intersects query sphere rule

The query sphere with radius ri intersects the affected cluster Mi if the following condition is true:

distance(Ci, q) - ri ≤ distMaxi (4.5)


Similar to the previous rule, it is important to know whether a cluster intersects the search sphere of the query point q. In this case, a nearest neighbor may be in the cluster in question, but it is also possible that it is located in another cluster; thus, the iterative search process may continue.

Let distance(Ci, q) be the distance between cluster center Ci and the query point q, and let distMaxi be the radius of the cluster Mi. Assuming distance(Ci, q) > distMaxi, i.e. the query point lies outside the affected cluster, we can state that if the condition distance(Ci, q) - ri ≤ distMaxi holds, then the cluster Mi partially intersects the query sphere (see figure 4.8).
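Taken together, rules 1 through 4 amount to a few simple comparisons per cluster. The Java sketch below collects them in one place; the names are illustrative, and the search-region helper follows equation (4.3) as reconstructed above.

    // Sketch of the cluster search rule predicates, with distCq = distance(Ci, q).
    final class ClusterRules {
        // Rule 1 (4.2): exclude the cluster if the query sphere lies entirely outside it.
        static boolean exclude(double distCq, double r, double distMax) {
            return distCq - r > distMax;
        }

        // Rule 2 (4.3): effective range of distances from the cluster center to search.
        static double[] searchRegion(double distCq, double r, double distMin, double distMax) {
            double lo = Math.max(distMin, distCq - r);
            double hi = Math.min(distMax, distCq + r);
            return new double[] { lo, hi };
        }

        // Rule 3 (4.4): the cluster completely contains the query sphere.
        static boolean contains(double distCq, double r, double distMax) {
            return distCq + r <= distMax;
        }

        // Rule 4 (4.5): the query point lies outside the cluster but its sphere reaches in.
        static boolean intersects(double distCq, double r, double distMax) {
            return distCq > distMax && distCq - r <= distMax;
        }
    }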


CHAPTER 5

EXPERIMENTS & RESULTS

In this section, we detail experimental setups, describe our experiments, and present the

results of those experiments. The main objective of our experiments is to evaluate the

performance of our ckSearch system. The indexing strategies of ckSearch are tested on different data sets, varying the data set size, dimensionality, and data distribution.

We use the KD-tree algorithm as a benchmark for comparison. The KD-tree

algorithm is an effective and commonly used KNN method based on a multi-dimensional

indexing structure. Moreover, the KD-tree algorithm is especially appealing for comparison

as it is similar to the multi-dimensional indexing tree structure of ckSearch. The focus of our research is to speed up the learned data classification process, which is especially applicable to an existing autonomous robot that currently uses a KD-tree based KNN system. According to Arya and Silverman [2], linear search serves as an effective KNN search technique. So, for

completeness we also compare the performance of ckSearch with the linear search KNN

technique.

5.1 Setup information

The ckSearch search algorithm and related K-means clustering technique were implemented

in Java. A tree-based indexing structure was used as the primary data structure, along with a two-dimensional array to store cluster information. The linear search and the KD-tree


implementations are also Java based. The KD-tree code was obtained from our colleagues at the Georgia Institute of Technology [37]. Experiments were performed on a 1.5-GHz PC with 512 megabytes of main memory, running Microsoft Windows XP (version 2002, SP3).

For our training set we used data generated by an autonomous robot guided by a human. We also created synthetic test data sets ranging from 10,000 to 100,000 records with various dimensionalities (9, 18, 36, 50, and 60). Each query is a d-dimensional point. One hundred query trials were run for each experiment, and we averaged the total performance time to even out the I/O cost.

5.2 The effect of the size of the data set

The size of the data set can play a significant role in the performance of a KNN algorithm, since a naive search is O(n) in the number of items stored per query. In order to evaluate the performance of our system against this criterion, we conducted a series of experiments using a 60-dimensional data set with k set to 1, 3, and 10. During these experiments we gradually increased the number of data points in the data set, starting with 5000 data points and increasing to 10000, 20000, 40000, 50000, 60000, 80000, and 100000.

With each data set we also recorded the performance time of the ckSearch and compared it

with the performance time of the linear search implementation. The results are tabulated in

table 5.1.

The following tables show the effect of the size of the data set on the ckSearch and linear search implementations of the KNN algorithm; the size of the data set was gradually increased and the performance times were recorded.


Data Size Dimension Linear search (ms) ckSearch (ms) k

5000 60 44.010621 3.574 1

10000 60 79.029039 5.663 1

20000 60 159.794052 15.7322 1

40000 60 219.094885 12.4334 1

50000 60 264.959094 16.3343 1

80000 60 264.959094 26.3441 1

100000 60 721.357044 34.59425 1

Table 5.1: The effect of data size on performance (k=1)

[Chart: execution time (ms) vs. data set size for linear search and ckSearch]

Figure 5.1: Performance vs. data set size chart (k = 1)


Data Size Dimension Linear search (ms) ckSearch (ms) k

5000 60 73.031907 13.91388 3

10000 60 98.910615 29.22268 3

20000 60 174.237228 40.72744 3

40000 60 278.986854 87.66623 3

50000 60 325.127635 68.59792 3

80000 60 520.718415 155.5077 3

100000 60 873.28166 122.8424 3

Table 5.2: Effect of data size on performance (k=3)

[Chart: execution time (ms) vs. data set size for linear search and ckSearch]

Figure 5.2: Performance vs. data set size chart (k = 3)


Data Size Dimension Linear search (ms) ckSearch (ms) k

5000 60 73.031907 13.91388 10

10000 60 98.910615 29.22268 10

20000 60 174.237228 40.72744 10

40000 60 278.986854 87.66623 10

50000 60 325.127635 68.59792 10

80000 60 520.718415 155.5077 10

100000 60 873.28166 122.8424 10

Table 5.3: Effect of data size on performance (k=10)

[Chart: execution time (ms) vs. data set size for linear search and ckSearch]

Figure 5.3: Performance vs. data set size chart (k = 10)


Data Size Dimension k = 1 k = 3 k = 10

5000 60 12.32791 5.248852694 3.226432

10000 60 13.96273 3.384720805 2.27282

20000 60 10.17797 4.278128799 2.999871

40000 60 17.66894 3.182375559 3.516738

50000 60 16.55994 4.739613768 2.510961

80000 60 10.19073 3.348506537 4.066482

100000 60 20.85194 7.10895797 3.356904

Table 5.4: ckSearch speedup over linear search

[Chart: speedup vs. data size for k = 1, k = 3, and k = 10]

Figure 5.4: Chart showing ckSearch speedup over the linear search

Tables 5.1, 5.2, and 5.3 show the results of the “effect of the size of the data set” experiment, in which we evaluated the performance of the ckSearch algorithm against an implementation of the linear search KNN algorithm. The results clearly show that ckSearch performed far better than the linear search. The speedup chart (figure 5.4) verifies that ckSearch achieves and maintains a steady speedup over the linear search


method for several values of k. The ckSearch process also copes much better than the linear search as larger data sets increase the number of computations.
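As a worked example of how the speedups in table 5.4 are obtained, consider the largest data set with k = 1: table 5.1 reports 721.357 ms for linear search and 34.594 ms for ckSearch, giving a speedup of 721.357 / 34.594 ≈ 20.85, matching the value listed in the last row of table 5.4.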

5.3 The effect of data dimension on the performance

The number of dimensions of a data set can influence the performance of a KNN algorithm because of the increased cost of Euclidean distance computations on high-dimensional data. An autonomous robot’s navigational data set can be high-dimensional: such a system may use high-dimensional sensor arrays or high-dimensional image processing for navigation. Thus, we focused on evaluating the ckSearch system’s performance on high-dimensional data.

In this experiment, we used large data sets with 50000 and 100000 records. The k value

was set to 1 for the first experiment and the data set size was set to 50000. For the second

experiment, the k value was set to 3 and the data set with 100000 records was used. For each

of these experiments, we used a 9-dimensional data set to begin with. Then, the number of

dimensions was gradually increased to 18, 36, 60, and 75. The performance time of each

experiment was recorded. In order to perform comparisons, the same experiments were performed with a KD-tree implementation and a linear search technique.

The following tables show the effect of the dimensionality of the data set on the ckSearch, KD-tree, and linear search implementations of the KNN algorithm. The dimensionality was gradually increased and the performance times were recorded.


Dimension Data Set Size KD-tree (ms) Linear Search (ms) ckSearch (ms)

9 50000 125.3342 268.728288 22.26529

18 50000 112.7768 291.717802 19.05041

36 50000 140.987 324.284511 18.37775

60 50000 157.7129 345.627796 30.71787

75 50000 222.6564 561.185849 27.38437

Table 5.5: The effect of data dimension on performance (N=50K)

[Chart (N = 50000, k = 1): execution time (ms) vs. data dimension for linear search, KD-tree, and ckSearch]

Figure 5.5: Data dimension vs. performance chart (N = 50K)

Dimension Data Set Size k Linear Search (ms) ckSearch (ms)

9 100000 3 419.812183 35.07446

18 100000 3 451.023915 92.92075

36 100000 3 477.775919 158.4776

60 100000 3 571.532918 122.8424

75 100000 3 660.196533 140.9969

Table 5.6: The effect of data dimension on performance (N=100K)


[Chart (N = 100000, k = 3): execution time (ms) vs. data dimension for linear search and ckSearch]

Figure 5.6: Data dimension vs. performance chart (N = 100K)

Dimension Data Set Size Linear Search (ms) ckSearch (ms) Speedup

9 100000 419.812183 35.07446 11.96917

18 100000 451.023915 92.92075 4.853856

36 100000 477.775919 158.4776 3.014785

60 100000 571.532918 122.8424 4.652569

75 100000 660.196533 140.9969 4.682349

Table 5.7: ckSearch speedup over linear search for various dimensions

In this experiment we compared the ckSearch performance with the KD-tree and linear search implementations. The results clearly show that the ckSearch system performed better than both the KD-tree and the linear search method, and ckSearch achieved a considerable speedup over linear search (see table 5.7). The results also show that as the number of dimensions increases, the KD-tree and linear search performance gradually degrades; in particular, with the larger data set (100,000 records) the linear search performance degrades at a higher rate. On the other hand, ckSearch shows robustness


to dimension increase. The ckSearch performance time increases at a much slower rate than

the KD-tree and the linear search system (see figure 5.6).

5.4 The effect of search radius on the performance

The search radius is an important factor for the ckSearch system, which uses an incremental radius-based search: typically, a small search sphere is used and is enlarged when the search condition cannot be met. Since our proposed ckSearch system relies on the search sphere to minimize repeated costly distance calculations and optimize performance, it is important to study the effect of the search radius on the performance of the ckSearch system.

In this experiment, we used a data set with 10000 records. We used a k value of 3 for the first part of the experiment and a k value of 10 for the second part, as reflected in tables 5.8 and 5.9. The radius value was gradually increased from 1.0 meters to 10.0 meters, and the performance time of each experiment was recorded.

The following table shows the effect of search radius on ckSearch query performance.

Even though the KD-tree and the linear search methods do not use a search radius, we have

listed the KD-tree and the linear search performance results to compare with the ckSearch

system.


Radius (m) Data Set Size KD-tree (ms) Linear Search (ms) ckSearch (ms)

1.0 10000 293.99938 109.38541 5.122073

2.0 10000 293.99938 109.38541 13.31026

3.0 10000 293.99938 109.38541 19.89829

4.0 10000 293.99938 109.38541 26.42299

5.0 10000 293.99938 109.38541 34.06723

6.0 10000 293.99938 109.38541 39.26128

7.0 10000 293.99938 109.38541 42.53429

8.0 10000 293.99938 109.38541 46.6953

9.0 10000 293.99938 109.38541 49.8422

10.0 10000 293.99938 109.38541 51.81746

Table 5.8: The effect of search radius on performance (k = 3)

[Chart (k = 3): execution time (ms) vs. search radius for linear scan, KD-tree, and ckSearch]

Figure 5.7: Search radius vs. performance for 10000 data records


Radius (m) Data Set Size Linear Search (ms) ckSearch (ms)

1.0 10000 169.24721 11.324462

2.0 10000 169.24721 27.077553

3.0 10000 169.24721 43.68484

4.0 10000 169.24721 63.019417

5.0 10000 169.24721 76.492185

6.0 10000 169.24721 85.420351

7.0 10000 169.24721 94.545402

8.0 10000 169.24721 102.19342

9.0 10000 169.24721 112.9639

10.0 10000 169.24721 114.11205

Table 5.9: The effect of search radius on performance (k = 10)

[Chart (k = 10): execution time (ms) vs. search radius for linear scan and ckSearch]

Figure 5.8: Search radius vs. performance chart for 10,000 data records (k = 10)

Considering the experimental results listed in tables 5.8 and 5.9, the search radius has a significant impact on ckSearch performance: we observe a sharp increase in execution time as the search radius increases. We believe this is due to an increase in the number of redundant distance computations. As will be shown in the accuracy experiments,


the ckSearch algorithm finds the results well before reaching the maximum radius of 10.0 meters used in the above experiments.

5.5 The effect of search radius on accuracy

This experiment is similar to the previous one on the effect of the search radius on performance time; here we evaluate the effect of the search radius on accuracy. Typically, a small search sphere is used as the starting radius and is enlarged when the search condition cannot be met. In this experiment we started with a radius of 1.0 meters and went up to 10.0 meters. Since the ckSearch system relies on the search sphere to minimize repeated costly distance calculations and optimize performance, it is important to study the effect of the search radius on the accuracy of the ckSearch system.

For each radius value used during the query, we recorded the percentage of true nearest neighbors found by the ckSearch algorithm. The results of this experiment are shown in the table below.

Radius (m) Data Set Size k = 3 k = 10

1.0 10000 0.0 0.0

2.0 10000 85.423 76.667

3.0 10000 97.355 97.702

4.0 10000 99.856 99.113

5.0 10000 100.0 99.822

6.0 10000 100.0 100.0

7.0 10000 100.0 100.0

8.0 10000 100.0 100.0

9.0 10000 100.0 100.0

10.0 10000 100.0 100.0

Table 5.10: The effect of the search radius on query accuracy


[Chart (N = 10000, d = 60): percentage of KNN found vs. search radius for k = 3 and k = 10]

Figure 5.9: Search radius vs. query accuracy chart

The experimental results in the table show that the accuracy of the query search improves as the search radius increases from 1.0 meters to 10.0 meters. A larger search radius allows the ckSearch algorithm to assess more of the nearest neighbors, so the accuracy of the search increases with the search radius. It is also important to notice that the ckSearch algorithm achieves 100% accuracy well before the maximum radius value of 10.0 meters. This indicates that selecting a proper radius is important for the performance of the ckSearch system.

5.6 The effect of the number of clusters

The number of clusters can affect the performance of a cluster-based algorithm. Even though clustering for ckSearch is part of the pre-processing stage and does not directly affect query time, it can indirectly influence the ckSearch algorithm's running time. In order to find out the effect of the number of clusters, we performed several experiments. In this set of


experiments, the effect of the number of clusters on the ckSearch system was investigated. As the number of clusters increases, it is plausible that the computational complexity increases and in turn the computation time grows.

The number of clusters was gradually increased and the resulting performance was recorded. We used 5, 10, 20, 30, and 50 clusters, and the size of the data set was 50,000 records. We varied the number of nearest neighbors k (1 and 5) and conducted two separate experiments. The results of the experiments are tabulated below.

Cluster k KD-tree (ms) Linear Search (ms) ckSearch (ms)

5 1 266.2874 246.287447 29.59359

10 1 266.2874 246.287447 27.03791

20 1 266.2874 246.287447 29.7673

30 1 266.2874 246.287447 17.83305

50 1 266.2874 246.287447 21.50379

Table 5.11: The effect of the number of clusters on performance (k=1)

[Chart (N = 50000, k = 1): execution time (ms) vs. number of clusters for linear search, KD-tree, and ckSearch]

Figure 5.10: The number of clusters vs. performance chart for 50000 data records (k = 1)


Cluster k Linear Search (ms) ckSearch (ms)

5 5 363.641468 116.6476

10 5 363.641468 121.0026

20 5 363.641468 104.7671

30 5 363.641468 86.15084

50 5 363.641468 114.5369

Table 5.12: The effect of the number of clusters on performance (k=5)

[Chart (N = 50000, k = 5): execution time (ms) vs. number of clusters for linear search and ckSearch]

Figure 5.11: The number of clusters vs. performance chart (k = 5)

Tables 5.11 and 5.12 illustrate the results of our experiments with the number of clusters. Our initial hypothesis was that performance would decrease as the number of clusters increases, because more clusters would take longer to search. Interestingly, according to our results, ckSearch performance times remain nearly the same or increase only slightly. We hypothesize that this is because the data records are spread out over a number of clusters and most


of the cluster search is eliminated using the “cluster search rules”, which prevent the ckSearch system from unnecessary searching.


CHAPTER 6

CONCLUSION

In this thesis, we introduced a new algorithm for K-nearest neighbor queries that uses

clustering and caching to improve performance. The main idea is to reduce the distance

computation cost between the query point and the data points in the data set. We used a

divide-and-conquer approach. First, we divide the training data into clusters based on

similarity between the data points in terms of Euclidean distance. Next we use linearization

for faster lookup. The data points in a cluster can be sorted based on their similarity

(measured by Euclidean distance) to the center of the cluster. Fast search data structures such

as the B-tree can be utilized to store data points based on their distance from the cluster

center and perform fast data search. The B-tree algorithm is good for range search as well.

We achieve a further performance boost by using B-tree based data caching. In this work we

provided details of the algorithm, an implementation, and experimental results in a robot

navigation task.

We conducted extensive experiments on the performance and the accuracy of the

ckSearch algorithm. In order to confirm performance improvement of KNN queries, we

performed experiments on the ckSearch system with large and small data sets. Several of our

experiments focused on performance of the ckSearch algorithm with high dimensional data

sets, since many KNN search algorithms suffer in performance on high-dimensional data. The results show that our algorithm is both effective and efficient. In fact,


the ckSearch algorithm achieves performance improvement over both the KD-tree and the

linear scan KNN algorithms.

In the future we will further improve our system by adding an analysis to select the best possible initial search radius for the ckSearch algorithm. It is conceivable that selecting too small a search radius results in many unnecessary iterations; we want to remedy this weakness of the system by adding such a radius selection analysis.


REFERENCES

[1] M. Procopio, T. Strohmann, A. Bates, G. Grudic, J. Mulligan. Using Binary

Classifiers to Augment Stereo Vision for Enhanced Autonomous Robot

Navigation. April 2007.

[2] Arya, S., D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An

Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed

Dimensions. Journal of the ACM, vol. 45, no. 6, pp. 891-923

[3] V. Ramasubramanian, Kuldip K. Paliwal. Fast nearest-neighbor search

algorithms based on approximation-elimination search. January 1999.

[4] J. Chua, P. Tischer. A Framework for the Construction of Fast Nearest

Neighbour Search Algorithms. Monash University, Australia.

[5] J. Chua, P. Tischer. Minimal Cost Spanning Trees for Nearest-Neighbour

Matching. Monash University, Australia.

[6] V. Athitsos, M. Potamias, P. Papapetrou, G. Kollios. Nearest Neighbor

Retrieval Using Distance-Based Hashing. In Proc. IEEE International

Conference on Data Engineering (ICDE), April 2008.

[7] Y. Hsueh, R. Zimmermann, M. Yang. Approximate Continuous K Nearest

Neighbor Queries for Continuous Moving Objects with Pre-Defined Paths.

Department of Computer Science, University of Southern California.

[8] W. Shang, H. Huang, H. Zhu, Y. Lin, Z. Wang, Y. Qu. An Improved kNN –

Fuzzy kNN Algorithm. School of Computer and Information Technology,

Beijing Jiaotong University, China.

[9] A. Jain, M. Murty, P. Flynn. Data Clustering: A Review. Michigan State

University, U.S.A.

[10] A. Duch, V. Castro, C. Martinez. Randomized K-Dimensional Binary Search

Trees. September, 1998.

[11] Z. Aghbari, A. Makinouchi. Linearization Approach for Efficient KNN Search

of High-Dimensional Data. University of Sharjah, Sharjah, UAE.


[12] R. Weber, H. Schek, S. Blott. A Quantative Analysis and Performance Study for

Similarity-Search Methods in High-Dimensional Spaces. ETH Zentrum, Zurich.

[13] A. Thomasian, L. Zhang. The Stepwise Dimensionality Increasing (SDI) Index

for High-Dimensional Data. May, 2006.

[14] B. Zheng, W. Lee, D. Lee. Search K Nearest Neighbors on Air. Hong Kong

University of Science and Technology, Clear Water Bay, Hong Kong.

[15] H. Zhang, A. Berg, M. Maire, J. Malik. SVM-KNN: Discriminative Nearest

Neighbor Classification for Visual Category Recognition. University of

California, Berkeley, California.

[16] C. Yu, B. Ooi, K. Tan, H. Jagadish. Indexing the Distance: An Efficient Method

to KNN Processing. Proc. Of the 27th VLDB Conference, Roma, Italy, 2001

[17] A. Nuchter, K. Lingemann, J. Hertzberg. 6D SLAM with Cached kd-tree Search.

University of Osnabruck, Osnabruck, Germany.

[18] G. Neto, H. Costelha, P. Lima. Topological Navigation in Configuration Space

Applied to Soccer Robots. Instituto Superior Tecnico, Portugal.

[19] C. Yu, S. Wang. Efficient Index based KNN join processing for high

dimensional data. Information and Software Technology. May 2006.

[20] G. DeSouza, A. Kak. Vision for Mobile Robot Navigation: A Survey. IEEE

Transactions on pattern analysis and machine intelligence, vol. 24, no. 2,

February, 2002.

[21] E. Plaku, L. Kavraki. Distributed Computation of the knn Graph for Large

High-Dimensional Point Sets. Journal of Parallel and Distributed Computing,

2007, vol. 67(3), pp. 346-359.

[22] J.L.Bentley Multidimensional Binary Search Trees in Database Applications.

IEEE Trans. on Software Engineering, SE-5(4):333-340, July 1979.

[23] N. Ripperda, C. Brenner. Marker-Free Registration of Terrestrial Laser Scans

Using the Normal Distribution Transform. University of Hannover, Germany.

[24] C. Atkeson, S. Schaal. Memory-Based Neural Networks For Robot Learning.

GIT, Atlanta, Georgia.

[25] J. Nievergelt, H. Hinterberger, K. Sevcik. The gridfile: An Adaptable

Symmetric Multikey File Stucture. ACM Trans. on Database Systems, 9(1):38

- 71, 1984.


[26] A. Gionis, P. Indyk, R. Motwani. Similarity search in high dimensions via

Hashing. In International Conference on Very Large Databases (VLDB), 1999

pp. 518-529.

[27] V. Athitsos, M. Potamias, P. Papapetrou, G. Killios. Nearest Neighbor Retrieval

Using Distance-Based Hashing.

[28] A. Andoni, P. Indyk. Efficient algorithms for substring nearest neighbor

Problem. In ACM-SIAM Symposium on Discrete Algorithms (SODA). 2006,

pp. 1203 – 1212.

[29] M. Zhang, T. Zhang, R. Ramakrishnan. BIRCH: A new data clustering

algorithm and its applications.Data Mining and Knowledge Discovery.

[30] G. Grizaite, R. Oberperfler. DBSCAN Clustering Algorithm. January 31, 2005.

[31] T. Bingmann. “STX B+ Trees Template Classes: Speed Test Results.” 2008. Idlebox. Accessed 4 April 2009. <http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html>.

[32] D. Bentivegna. Learning from Observation Using Primitives. Doctoral Dissertation,

Georgia Institute of Technology, 2004.

[33] L. Xiong, S. Chitti. Mining multiple private databases using a kNN classifier.

In Proceedings of the 2007 ACM symposium on Applied computing. 2007,

pp. 435 - 440.

[34] H. Franco-Lopez, A. Ek, M. Bauer. Estimation and mapping of forest stand density,

volume, and cover type using the k-nearest neighbors method. Remote Sensing of

Environment, Vol. 77, No. 3, 2001, pp. 251-274.

[35] H. Maarse, P. Slump, A.Tas, J. Schaefer. Classification of wines according to type

Journal Zeitschrift für Lebensmitteluntersuchung und -Forschung A. Vol .184,

No. 3, March, 1987, pp. 198-203.

[36] A. Sohail, P. Bhattacharya. Classification of Facial Expressions Using K-Nearest Neighbor Classifier. Computer Vision/Computer Graphics Collaboration Techniques. Vol. 4418, June 2007, pp. 555-566.

[37] S. Arya, D. Mount. “ANN: A Library for Approximate Nearest Neighbor Searching.” August 4, 2006. Accessed 14 April 2009. <http://www.cs.umd.edu/~mount/ANN/>.


APPENDIX A

NOTATION TABLE

Notation

Table A.1 lists the symbols, functions, and parameters used in this thesis. The following terms and notations are used throughout, especially in the pseudocode in Appendix B.

d                      Number of dimensions
N                      Number of data points
D ∈ Ω                  Data set
Ω = [0,1]^d            Data space
R                      Result set containing the k nearest neighbors
Ci                     Cluster center reference point
r                      Radius of a search sphere
rincrement             Radius increment value
rmax                   Maximum radius value for the STOP criterion
pi                     A data point p in the ith cluster
distMaxi               Maximum radius of a partition Mi
distMini               Distance between Ci and the closest point to Ci
pmax                   The furthest data point from q in the KNN result set R
FurthestPoint(R, q)    Furthest point from query point q in set R
SearchRadius(q)        Search radius of query point q
SearchSphere(q, r)     Sphere with query point q at its center and radius r
distNearestq           Nearest distance to query point q
distance(pi, Ci)       Distance between point pi and cluster center Ci
keyi                   B-tree index of nodes and data entries in leaf nodes
datai                  Data entries in a leaf node of a B-tree
distCenter             Distance from query point q to cluster center Ci
GetNearest(q)          Nearest neighbor to query point q

Table A.1: List of various notations used in this thesis


APPENDIX B

IMPLEMENTATION PSEUDOCODE

ckSearch_KNN(q):
    initialize();
    loadBTree();
    rincrement = increment value;
    R = empty;

    if (IsCacheHit(q) == true):
        while (r < rmax):
            if (distance(pmax, q) < r and R.Size() == k):
                STOP;
                return;
            r = r + rincrement;
            SearchCache(q);

    else if (IsCacheHit(q) == false):
        while (r < rmax):
            if (distance(pmax, q) < r and R.Size() == k):
                STOP;
                return;
            r = r + rincrement;
            SearchClusters(q);
            UpdateCache();

End ckSearch_KNN;

Figure B.1: ckSearch KNN algorithm

The figure above shows the pseudocode of the ckSearch KNN query algorithm. This is one of the several methods used to implement the ckSearch algorithm.


SearchClusters(q):
    for i = 0 to (M - 1):
        distCenter = distance(Ci, q);

        if (exclude(i, q) == true):                  // Rule 1: cluster exclusion
            SKIP CLUSTERi;

        else if (intersects(i, q) == true):          // Rule 4: cluster intersects query sphere
            keyquery = i * µ + distCenter;
            leafNodei = getQueryLeaf(btree, keyquery);
            keyleft = i * µ + (distCenter - r);
            SearchLeftNodes(leafNodei, keyleft);

        else if (contains(i, q) == true):            // Rule 3: cluster contains query sphere
            keyquery = i * µ + distCenter;
            leafNodei = getQueryLeaf(btree, keyquery);
            keyleft = i * µ + (distCenter - r);
            SearchLeftNodes(leafNodei, keyleft);
            keyright = i * µ + (distCenter + r);
            SearchRightNodes(leafNodei, keyright);
    // end of for loop
END;

Figure B.2: SearchClusters(q) pseudocode

Figure B.2 above shows the pseudocode of the main cluster search algorithm. This “SearchClusters(q)” routine is part of our proposed ckSearch KNN search algorithm.


SearchCache(q):
    index = index of cached cluster;
    for i = 0 to (M - 1):                            // searching all cached clusters
        distCenter = distance(Ci, q);

        if (exclude(i, q) == true):                  // Cluster Rule #1
            SKIP CLUSTERi;

        else if (intersects(i, q) == true):          // Cluster Rule #4
            keyquery = i * µ + distCenter;
            leafNodei = getQueryLeaf(btree, keyquery);
            keyleft = i * µ + (distCenter - r);
            SearchLeftNodes(leafNodei, keyleft);

        else if (contains(i, q) == true):            // Cluster Rule #3
            keyquery = i * µ + distCenter;
            leafNodei = getQueryLeaf(btree, keyquery);
            keyleft = i * µ + (distCenter - r);
            SearchLeftNodes(leafNodei, keyleft);
            keyright = i * µ + (distCenter + r);
            SearchRightNodes(leafNodei, keyright);
END;

Figure B.3: The “SearchCache(q)” algorithm pseudocode

The figure above (figure B.3) shows the pseudocode of the cache search algorithm. This “SearchCache(q)” routine is part of our proposed ckSearch KNN search system.


SearchLeftNodes(leafNodei, keyleft):
    for (i = 0; i < leafNodeSize(); i++):            // searching leafNodei for nearest neighbors
        if R.Size() == k:
            if (distance(pmax, q) > distance(datai, q)):
                Remove pmax from R;
                Add datai to R;
        else if R.Size() ≠ k:
            Add datai to R;
    // end of for loop

    distleft = distCenter - r;
    leftLeafNode = GetLeftLeafNode(leafNodei);

    while (true):
        SearchLeafNode(leftLeafNode);                // searching leftLeafNode for nearest neighbors
        keyOfMinRecord = key value of the left-most entry of leftLeafNode;

        if (keyOfMinRecord < distleft) or (the cluster boundary is reached):
            break;                                   // reached the search sphere limit, no need to search further
        leftLeafNode = GetLeftLeafNode(leftLeafNode);
END;

Figure B.4: The SearchLeftNodes(leafNodei, keyleft) pseudocode

Figure B.4 above shows the “SearchLeftNodes(leafNodei, keyleft)” function

pseudocode. This function searches the leaf nodes to the left in the data structure for nearest

neighbor points. It is considered one of the most important functions in the ckSearch

implementation.


SearchRightNodes(leafNodei, keyright):
    for (i = 0; i < leafNodeSize(); i++):            // searching leafNodei for nearest neighbors
        if R.Size() == k:
            if (distance(pmax, q) > distance(datai, q)):
                Remove pmax from R;
                Add datai to R;
        else if R.Size() ≠ k:
            Add datai to R;
    // end of for loop

    distright = distCenter + r;
    rightLeafNode = GetRightLeafNode(leafNodei);

    while (true):
        SearchLeafNode(rightLeafNode);               // searching rightLeafNode for nearest neighbors
        keyOfMaxRecord = key value of the right-most entry of rightLeafNode;

        if (keyOfMaxRecord > distright) or (the cluster boundary is reached):
            break;                                   // reached the search sphere limit, no need to search further
        rightLeafNode = GetRightLeafNode(rightLeafNode);
END;

Figure B.5: The SearchRightNodes(leafNodei, keyright) pseudocode

Figure B.5 above shows the “SearchRightNodes(leafNodei, keyright)” function

pseudocode. This function searches the leaf nodes to the right in the data structure for nearest

neighbor points. It is considered one of the most important functions in the ckSearch

implementation.