ATHABASCA UNIVERSITY
Modeling Uncertainty in datasets with FN-DBScan and Bayesian Networks
BY
AARON R. ULRICH
A project submitted in partial fulfillment
Of the requirements for the degree of
MASTER OF SCIENCE in INFORMATION SYSTEMS
Athabasca, Alberta
May, 2012
© Aaron R. Ulrich, 2012
DEDICATION
This work is dedicated to my loving wife Nicole and daughter Sophia.
ABSTRACT
This experiment investigates the effectiveness of using a hybrid AI solution in a semi-supervised learning
scenario. To achieve this, the hybrid applies clustering and classification techniques to the IRIS dataset and
measures various statistics used to determine a level of success. The clustering methods consist of DBScan
and FN-DBScan, whereas classification uses a Bayesian Network with hill climbing search and a simple estimator.
The measurements of the solution's effectiveness serve to answer the following three questions. First, is FN-DBScan
more effective than the original DBScan? Next, can additional accuracy be gained by using a classification step to
classify noise from the clustering process? And finally, is there a significant performance gain in using classification
for ad-hoc data rather than re-running the clustering process? The methodology used to achieve this first discovers
an optimal set of parameters for the algorithms in question and then applies three rounds of 10,000 10-fold cross-validation
iterations on the dataset to establish averages for the measurements. The outcome of this
experiment empirically implies that FN-DBScan is indeed more effective than DBScan, that additional accuracy can
be gained by classifying noise in some cases, and that using classification to assign ad-hoc data to clusters offers
significant performance gains. It is also interesting to note that although DBScan ran faster than FN-DBScan,
the classifier built from FN-DBScan's cluster-set was able to classify data faster. Additionally, there appears to
be an optimal noise level at which the classifier provides gains in accuracy. Although these results look
promising, future work should provide additional verification by extending this experiment to additional datasets.
ACKNOWLEDGMENTS
I would like to acknowledge the support and encouragement of Larbi Esmahi and Dragan Gasevic who inspired
many aspects of my research.
TABLE OF CONTENTS
CHAPTER I - Introduction ........................................................................................................................... 9
1.1 Statement of the Purpose................................................................................................................... 10
1.2 Chapter Organization ........................................................................................................................ 10
1.3 Definition of Terms ........................................................................................................................... 10
CHAPTER II – Review of Related Literature ............................................................................................ 12
2.1 Uncertainty ........................................................................................................................................ 12
2.2 Data Clustering ................................................................................................................................. 21
2.3 Applications for uncertainty .............................................................................................................. 26
2.4 Summary ........................................................................................................................................... 27
CHAPTER III – Methodology and Design ................................................................................................. 28
3.1 The Experiment Tools ....................................................................................................................... 28
3.2 Measurements & Parameters ............................................................................................................ 28
3.3 The System Design Overview ........................................................................................................... 30
3.4 The System Design Components ...................................................................................................... 32
3.5 Summary ........................................................................................................................................... 38
CHAPTER IV - Analysis ............................................................................................................................ 39
4.1 Parameter Selection for the IRIS Dataset.......................................................................................... 39
4.2 Optimal Parameter Discovery for DBScan and Bayes’ HC/SE ........................................................ 40
4.3 Optimal Parameter Discovery for FN-DBScan and Bayes’ HC/SE ................................................. 42
4.4 Optimal Parameter Analysis ............................................................................................................. 44
4.5 Summary ........................................................................................................................................... 46
CHAPTER V – Recommendations and Conclusions ................................................................................. 47
5.1 Suggestions for Further Research ..................................................................................................... 47
5.2 Conclusions ....................................................................................................................................... 48
REFERENCES ........................................................................................................................................... 49
LIST OF EQUATIONS
Equation 1 – Posterior Probability ............................................................................................................................... 13
Equation 2 - Posterior Probability #2 ......................................................................................................................... 13
Equation 3 – Bayesian Network JPD Factorization ..................................................................................................... 14
Equation 4 – Defuzzification Center of Gravity Calculation ....................................................................................... 21
Equation 5 – Euclidean Distance ................................................................................................................................. 23
Equation 6 – MinMax Normalization .......................................................................................................................... 23
Equation 7 – Cardinality Calculation........................................................................................................................... 25
Equation 8 – Exponential Neighborhood Relation Function ....................................................................................... 25
Equation 9 – Permutation Calculation ......................................................................................................................... 33
LIST OF TABLES
Table 1: Permutation Parameters ................................................................................................................................. 29
Table 2: Accuracy Measures ....................................................................................................................................... 29
Table 3: Timing Measures ........................................................................................................................................... 30
Table 4: Parameter Permutation Input Rules ............................................................................................................... 33
Table 5: DBScan vs. FN-DBScan ............................................................................................................................... 36
Table 6: DBScan Optimal Parameters ......................................................................................................................... 41
Table 7: FN-DBScan Optimal Parameters ................................................................................................................... 43
Table 8: Accuracy Analysis ......................................................................................................................................... 44
Table 9: Noise Classification Analysis ........................................................................................................................ 45
Table 10: Time Analysis .............................................................................................................................................. 45
LIST OF FIGURES
Figure 1: Trained Bayesian Network Visualization ..................................................................................................... 15
Figure 2: One Dimensional Membership Functions .................................................................................................... 16
Figure 3: Composite Membership Functions ............................................................................................................... 16
Figure 4: The Age Variable ......................................................................................................................................... 17
Figure 5: Deriving hedge terms ................................................................................................................................... 17
Figure 6: Simple Inference Engine .............................................................................................................................. 18
Figure 7: Fuzzification – Crisp Value to Fuzzy Value ................................................................................................ 19
Figure 8: Probability of Success .................................................................................................................................. 20
Figure 9: E-Neighborhood ........................................................................................................................................... 23
Figure 10: Density Connectivity .................................................................................................................................. 24
Figure 11: System Overview ....................................................................................................................................... 31
Figure 12: System Data Flow ...................................................................................................................................... 32
Figure 13: Permutation Engine .................................................................................................................................... 33
Figure 14: Data Engine ................................................................................................................................................ 34
Figure 15: AI Engine ................................................................................................................................................... 35
Figure 16: DBScan Parameter Analysis ...................................................................................................................... 41
Figure 17: High accuracy membership counts for e-radius ......................................................................................... 41
Figure 18: FN-DBscan Parameter Analysis ................................................................................................................. 43
CHAPTER I
INTRODUCTION
We exist in a world rife with uncertain and imprecise data. The era when rigid boundaries
between classifications would placate our analytic requirements is gone. In this world, modeling uncertainty provides
a way to deal with an all-too-often partial view of reality for the ever-growing flow of data. And it is this perpetual
flow of data which drives an insatiable need for effective clustering and rapid classification of real-world data.
Data clustering is a mostly unsupervised method used to group data into natural partitions while requiring a
minimal amount of human interaction. A popular clustering strategy is known as Density Based Spatial Clustering
of Applications with Noise (DBScan) which is capable of discovering arbitrarily shaped clusters and detecting noise.
A technique known as Fuzzy-Neighborhood DBScan (FN-DBScan) enhances the mechanism which the original
DBScan uses to discover core points; core points form the backbone of a cluster's shape. FN-DBScan is based on the
idea of passing a Fuzzy Neighborhood Relation Function (F-NRF) as a parameter to the clustering algorithm. The F-
NRF is a flexible concept which determines the strength with which a given point gravitates towards becoming a
core point based on its neighborhood. A common F-NRF is exponential, but it is also trivial to model the original
DBScan’s point count using this concept.
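The F-NRF concept can be sketched in a few lines. The exponential form and the core-point threshold below are illustrative assumptions, not the exact functions used in this experiment, but they show how both a fuzzy relation and the original crisp neighbor count fit the same abstraction:

```python
import math

def exponential_nrf(dist, scale=1.0):
    # Assumed exponential F-NRF: neighborhood membership decays smoothly
    # with distance; `scale` controls how quickly influence falls off.
    return math.exp(-dist / scale)

def crisp_nrf(dist, eps=1.0):
    # The original DBScan point count modeled as an NRF: full membership
    # inside the eps-neighborhood, none outside.
    return 1.0 if dist <= eps else 0.0

def fuzzy_cardinality(neighbor_dists, nrf):
    # A point's fuzzy neighborhood cardinality: the sum of its neighbors'
    # memberships. Comparing it against a threshold decides whether the
    # point gravitates towards becoming a core point.
    return sum(nrf(d) for d in neighbor_dists)

# With the crisp NRF the cardinality reduces to DBScan's neighbor count.
print(fuzzy_cardinality([0.2, 0.5, 1.2], crisp_nrf))        # 2.0
print(fuzzy_cardinality([0.2, 0.5, 1.2], exponential_nrf))
```

Passing a different `nrf` callable swaps the neighborhood semantics without touching the clustering loop, which is the flexibility the F-NRF parameter provides.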
Data classification is used to classify new data based on the knowledge obtained from existing data. A
trained classifier represents a model of the dataset under study and will produce a best fit answer when given a new
data object. One such technique, which is deeply rooted in probability theory, is known as a Bayesian Network
(BN). A BN uses a probabilistic logic model in the form of a structured probability network to deduce a
classification with the highest probability. Classification can potentially assist the clustering process by attempting
to classify noisy data points and also perform rapid ad-hoc cluster assignment to new points.
A widely accepted method to deal with uncertain boundaries is known as fuzzy set theory. Fuzzy set theory
was developed to mathematically support a system where objects have multiple set memberships, each of which
represents a degree of belief in the set membership. Due to this inherent ability to belong to multiple sets, fuzzy
values provide a way to deal with non-crisp (fuzzy) knowledge.
1.1 Statement of the Purpose
The purpose of this experiment is to measure the success of a hybrid AI solution which is capable of semi-
supervised learning through the use of both clustering and classification with a focus on fuzzy methods. In the
current experiment, the clustering methods under observation include DBScan and FN-DBScan with the
classification process using a Bayesian Network with hill climbing search and a simple estimator. This experiment
will serve to explore three primary questions: First, is FN-DBScan more effective than the original DBScan?
Second, can additional accuracy be gained by performing an additional classification step to classify the noise
generated by the clustering process? And finally, what performance gain is achievable by using Bayesian
classification on new data, rather than re-running the clustering process?
1.2 Chapter Organization
The subsequent sections are organized as follows. Chapter two introduces background knowledge and
previous research dealing with uncertainty and data clustering. Once the background is established chapter three
describes the experiment and the system architecture used to generate the experimental results. Next, given the
generated results chapter four provides an analytical study into how the results support the questions under study.
Finally, a conclusion and suggestions for further research are given.
1.3 Definition of Terms
Acronym      Definition
DBScan       Density Based Spatial Clustering of Applications with Noise
FN-DBScan    Fuzzy-Neighborhood DBScan
SFN-DBSCAN   Scalable FN-DBScan
F-NRF        Fuzzy Neighborhood Relation Function
BN           Bayesian Network
AI           Artificial Intelligence
GA           Genetic Algorithm
LLN          Law of Large Numbers
CLT          Central Limit Theorem
DAG          Directed Acyclic Graph
JPD          Joint Probability Distribution
CPT          Conditional Probability Table
MF           Membership Function
DIL          Dilation
CON          Concentration
INT          Contrast Intensification
DIM          Diminishing
COG          Center of Gravity
NRFJP        Noise Robust Fuzzy Joint Points
KNN          K-Nearest Neighbor
xFKM         Fuzzy K-Means
CSV          Comma Separated Values
CHAPTER II
REVIEW OF RELATED LITERATURE
Uncertainty is an ever present issue surrounding data analysis that plagues the process with potential error
at every step. This chapter will discuss several techniques which can help mitigate the effects of uncertainty. The
first technique is based on probability theory and is presented in the form of Bayesian Classification. The next is a
branch of set theory known as fuzzy set theory. And finally a branch of data mining known as density based
clustering will be introduced through a discussion of the DBScan algorithm.
2.1 Uncertainty
There are many instances where absolute data precision is simply not possible [1, 2, 3, 4], which is
particularly true when dealing with real-world models pertaining to time and space which have an infinite level of
detail. There are two primary branches in the field of uncertainty: probability theory, which is highly objective and
based on the establishment of causal relationships through the collection of empirical data, and fuzzy set theory,
which is usually considered more subjective in that the relationships modeled are based on the modeler's impression
of the variable in question. Two other important uncertainty techniques are rough set theory and evolutionary theory. Rough
set theory, developed by Pawlak [56], is similar to fuzzy theory in that they both represent uncertainty about set
memberships. Rough set theory, however, uses the concept of set boundaries defined through topology operations to
express vagueness of membership rather than the fuzzy method of using membership functions to establish a degree
of belief. An important aspect of the Rough set approach is that it does not require any preliminary meta-data such
as the membership functions required by fuzzy sets which makes it an appealing candidate for unsupervised
solutions. Evolutionary theory represents a field of study which is influenced by biological models for evolution
such as genetics. A popular evolutionary design is known as a Genetic Algorithm (GA) which was predominantly
pioneered by Holland [57] and uses a probabilistic transition scheme through populations. The purpose of a GA is to
effectively and efficiently evolve an optimal solution to the problem under study by first applying a fitness test to
determine the strongest members of a population and then evolving new populations based on these members until a
terminating condition is met.
2.1.1 General Probability Theory
The undeniable need to understand uncertain events has led to a branch of mathematics known as
probability theory. Kolmogorov's Grundbegriffe [5] represents a milestone synthesis establishing the fundamental
foundation for modern probability theory [6]. Two fundamental theorems emerge from the foundations of probability,
the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT), both of which apply to large sequences of
independent random variables. The LLN states that as a sample set grows, the sample average converges to the
expected value of the underlying distribution, whereas the CLT states that, given a sufficiently large sample set, the
average of the random variables will be approximately normally distributed.
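A quick simulation illustrates the LLN in action; the fair-coin probability and sample sizes below are arbitrary choices for demonstration:

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

def sample_mean(n, p=0.5):
    # Average of n Bernoulli(p) trials -- the empirical success frequency.
    return sum(random.random() < p for _ in range(n)) / n

# As n grows, the sample average converges toward the true mean p = 0.5.
for n in (100, 10_000, 1_000_000):
    print(n, sample_mean(n))
```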
When modeling a system using probability theory, the thematic objects consist of random (stochastic)
variables, random processes, and events. An event represents a set of possible outcomes, and a quantity whose value
is determined by the outcome is known as a random variable. The value associated with a random variable can be
modeled using a probability distribution. In the simplest case, which assumes discrete time, a sequence of random variables represents a random
process known as a time series. The two classes of probability distributions used to model random variables are
discrete and continuous. A discrete probability distribution deals with event distributions that occur in a countable
sample space, such as the probability of drawing a specific card from a 52-card deck. Continuous probability
distributions deal with event distributions that occur in a continuous sample space, such as a real number
measurement.
2.1.2 Bayesian Classification
The famous Bayes’ Theorem is commonly credited to the English mathematician and theologian Thomas
Bayes (1702 to 1761), although there is some evidence which implies otherwise [7]. This theorem is the basis for
much research dealing with situations that lack certainty, including Bayesian networks. Bayes’ theorem is based on
the concept of conditional probabilities (posterior probabilities):
P(B | A) = P(B ∧ A) / P(A)    Equation 1 – Posterior Probability
Which reads the probability of B given A is equal to the probability of the intersection between B and A (both B and
A are true) divided by the probability of A. In the above formula P(B|A) is referred to as the posterior probability of
B. This idea is the basis for Bayes’ Theorem which is often written as follows:
P(B | A) = [P(A | B) * P(B)] / P(A) Equation 2 - Posterior Probability #2
In its simplest implementation, Naïve Bayes’ (Simple Bayesian) directly applies Bayes’ theorem and predicts
that A belongs to the class having the highest posterior probability, P(B|A), conditioned on A. A key characteristic
of this algorithm is its simplicity, which is achieved by assuming class conditional independence, i.e., by treating
each attribute value as independent of the other attribute values given the class.
2.1.2.1 Bayesian networks
Bayesian networks (BNs) were first introduced by Pearl [8]. A key aspect of the design is the capability of
encoding variable dependencies as a causal model. To achieve dependent variables, a finite set of random attributes
(variables) are positioned within a Directed Acyclic Graph (DAG) to form a causal topology of nodes which encodes
the dependencies in the form of a network. Each node within the topology has a conditional probability table (CPT),
which gives the probability of the node's values for each specific combination of direct parent variable values;
together, these tables encode the network's joint probability distribution (JPD). It is an important condition of
Bayesian networks that only the direct parents need to be considered for the CPT; in this sense, BNs satisfy the
Markov condition, which means that, given their parents, the nodes within a network are conditionally independent
of their non-descendants. Because of this condition the JPD can be factorized into the following equation:
P(X1, X2, …, Xn) = Πi P(Xi | PA(Xi))    Equation 3 – Bayesian Network JPD Factorization

In Equation 3, Π is the product symbol, X1…Xn are the set of variables in the DAG, and PA(Xi) represents the
parents of Xi in the network.
2.1.2.2 Network Inference
In a Bayesian Network inference is achieved by maximizing the posterior (conditional) probability of a
class for a given set of attribute values. The simplest method of performing Bayesian inference is known as the
enumeration algorithm [9], which can be greatly enhanced by applying variable elimination [10]. Put concisely, the
enumeration algorithm is a recursive algorithm that sums products of the conditional
probabilities while ensuring probabilistic normalization [11], and in this way is able to find the maximal posterior
probability. Figure 1, adapted from [12], visualizes a simple trained network which can be used to infer the
probabilities of outcomes based on a set of evidence. For example, what is the probability that someone goes to
college (C), Studies (S), Does not party (¬P), Succeeds in their exams (E), and does not have fun (¬F)? The
calculation that follows can be traced by observing the highlighted areas in Figure 1.
P(C, S, ¬P, E, ¬F) = P(C) * P(S|C) * P(¬P|C) * P(E|S ∧ ¬P) * P(¬F|¬P)
= 0.2 * 0.8 * 0.4 * 0.9 * 0.3 = 0.01728
Figure 1: Trained Bayesian Network Visualization
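The chain of multiplications above is just Equation 3 applied to the five nodes; a minimal sketch using the conditional probabilities from the worked example:

```python
def joint_probability(factors):
    # Equation 3: the JPD is the product of each node's conditional
    # probability given its parents.
    p = 1.0
    for f in factors:
        p *= f
    return p

# P(C, S, ¬P, E, ¬F) using the CPT entries from the worked example.
factors = [0.2,   # P(C)
           0.8,   # P(S | C)
           0.4,   # P(¬P | C)
           0.9,   # P(E | S ∧ ¬P)
           0.3]   # P(¬F | ¬P)
print(round(joint_probability(factors), 5))  # 0.01728
```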
2.1.2.3 Network Training
The goal of training is to learn the Conditional Probability Table (CPT) entries for the network. A popular
technique described by Russell et al. [13] utilizes a gradient strategy to perform greedy hill climbing. A
similar strategy is also implemented by neural networks to back-propagate the error (gradient). It is important to
note, however, that back-propagation is not required in a probabilistic network because the information needed to
compute the gradient is available locally by accessing the direct parents. The purpose of the gradient strategy is to
converge at a local optimum by updating the weights during each iteration. Learning the CPT entries (training the
network) using the technique described by Russell et al. [13] proceeds as follows:
1. Define or discover the topology
2. Set each CPT entry to a random probability value (between 0.0 and 1.0)
3. For each piece of training data, iterate the topology nodes in a top-down manner
a. Compute the gradient of the Node based on the current training data
b. Update the weights based on learning rate and newly discovered gradient. This moves
towards the optimal local solution without backtracking (greedy hill-climbing).
c. Renormalize weights to ensure they add up to 1.0
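The loop above can be sketched for a single CPT column (one parent configuration). The value names, learning rate, and epoch count below are illustrative assumptions, and this naive update/renormalize scheme only approximates the full algorithm in [13]:

```python
import random

def train_cpt_column(samples, values, lr=0.01, epochs=300):
    random.seed(0)  # repeatable random initialization
    # Step 2: random initial probabilities, normalized to sum to 1.0.
    theta = {v: random.random() for v in values}
    z = sum(theta.values())
    theta = {v: p / z for v, p in theta.items()}
    for _ in range(epochs):
        for x in samples:
            # Steps 3a/3b: the gradient of the log-likelihood for this
            # sample w.r.t. theta[x] is 1/theta[x]; take a small greedy
            # step in that direction (no backtracking).
            theta[x] += lr / max(theta[x], 1e-6)
            # Step 3c: renormalize so the column remains a distribution.
            z = sum(theta.values())
            theta = {v: p / z for v, p in theta.items()}
    return theta

# A value observed more often ends up with a higher conditional probability.
cpt = train_cpt_column(["t"] * 8 + ["f"] * 2, ["t", "f"])
print(cpt)
```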
2.1.3 Fuzzy Set Theory
The concept of the fuzzy-based methodology has its origins in fuzzy-set theory as proposed by Zadeh [14]
and later refined by Kandel & Byatt [15]. The underlying concept behind Zadeh’s theory is simply that “a fuzzy set
expresses the degree to which an element belongs to a set” [14], where the degree is expressed as a value between
0.0 and 1.0 and is assigned to an object by a Membership Function (MF). Unlike classical sets, to which an object
either belongs or does not belong, “a fuzzy set is a set without a crisp boundary” [16].
2.1.3.1 Fuzzy Membership Functions
A membership function is the mechanism which defines a mapping between an input x value and an output
y value. Typical one-dimensional functions, shown in Figure 2, include the Triangular MF, Trapezoidal
MF, Gaussian MF, Generalized bell MF, Sigmoidal MF, and Left-Right MF. It is also possible to combine MFs
using logical statements, such as AND/OR, which is demonstrated in Figure 3, where the first image represents both
functions, the second applies AND, and the third applies OR.
Figure 2: One Dimensional Membership Functions
Figure 3: Composite Membership Functions (AND and OR)
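As a concrete sketch, the triangular and trapezoidal MFs can be written directly from their breakpoints, with min/max composition for AND/OR; the breakpoints are parameters, not values taken from Figure 2:

```python
def triangular_mf(x, a, b, c):
    # Rises linearly from a to the peak at b, then falls linearly to c.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trapezoidal_mf(x, a, b, c, d):
    # Like the triangular MF but with a flat top between b and c.
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

def mf_and(mu1, mu2):
    return min(mu1, mu2)  # AND combines memberships with min

def mf_or(mu1, mu2):
    return max(mu1, mu2)  # OR combines memberships with max

print(triangular_mf(2.5, 0, 5, 10))    # 0.5
print(trapezoidal_mf(5, 0, 2, 8, 10))  # 1.0
```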
2.1.3.2 Linguistic Variables
A universe of discourse X is often called a linguistic variable. When this variable spans a continuous space,
which is often the case, it is usually modeled using multiple MFs, each of which defines a linguistic value (aka
linguistic term) for the variable. In a linguistic statement, terms can be linked using connectives (and, or, etc.),
negated (not), and/or modified by hedges (very, sort-of, a-little-bit, etc.). Figure 4, adapted from [16], exemplifies the
Age variable with three linguistic terms: Young, Middle Aged, and Old. Notice that the shaded areas in the figure
represent regions in which both terms hold true; in this way it is possible to be in a state that has a degree of belief
associated with multiple linguistic terms.
Figure 4: The Age Variable
2.1.3.3 Language Modifiers
Several functions exist to generate derivative curves which establish an elegant method to handle language
modifiers (aka hedges). These include Dilation (DIL), Concentration (CON), Contrast Intensification (INT), and
Diminishing (DIM). These modifiers allow modeling more complex linguistic statements. For example, assume we
would like to extend the linguistic variable Age discussed above with terms modified by the hedges very, sort-of,
and definitely. In Figure 5 shown below, very young and very old have been modified with CON, sort-of young and
sort-of old are derived with DIL, and definitely middle aged is a result of applying INT which creates a crisper
curve. The modifier DIM, which is not represented below, is simply the inverse of INT.
Figure 5: Deriving hedge terms
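The standard realizations of these modifiers are simple point-wise transforms of a membership value. DIM is written below as the functional inverse of INT, per the text's description; its exact closed form is an assumption:

```python
import math

def con(mu):
    # Concentration (e.g. "very"): squaring sharpens the curve.
    return mu ** 2

def dil(mu):
    # Dilation (e.g. "sort-of"): square root flattens the curve.
    return math.sqrt(mu)

def intens(mu):
    # Contrast Intensification: pushes values away from 0.5 toward 0 or 1,
    # producing the crisper "definitely" curve.
    return 2 * mu ** 2 if mu <= 0.5 else 1 - 2 * (1 - mu) ** 2

def dim(mu):
    # Diminishing, taken here as the functional inverse of INT: pulls
    # values back toward 0.5.
    return math.sqrt(mu / 2) if mu <= 0.5 else 1 - math.sqrt((1 - mu) / 2)

print(con(0.5), dil(0.25))         # 0.25 0.5
print(round(intens(0.8), 2))       # 0.92
print(round(dim(intens(0.3)), 2))  # 0.3  (DIM undoes INT)
```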
2.1.3.4 Fuzzy Inference
The simple inference engine portrayed in Figure 6 takes a single input value X (could be crisp or fuzzy) and
outputs a fuzzy valued result Y with the option of defuzzification. Fuzzy values are used as inputs into the actual
inference engine. This engine contains a set of rules, sometimes referred to as a rule-base, and an aggregator which
composes the final result based on the combined outcome of the rule-base.
Figure 6: Simple Inference Engine
2.1.3.5 Fuzzification
The first step of inference is to find the fuzzy value for the input X. This is done by using the term MFs to
look up a degree of belief (or membership) for each term based on the input variable. Figure 7 shows the fuzzification
process for two variables. Given the linguistic variable Age with an input parameter of a=15, a table of belief values
for the set of terms is generated. Similarly, the Experience variable is given the parameter e=38, which produces a
similar table. It is this table of belief values that establishes the foundation for fuzzy values.
Figure 7: Fuzzification – Crisp Value to Fuzzy Value (Age and Experience)
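The lookup can be sketched as follows; the triangular breakpoints for the Age terms are invented for illustration and are not the curves from Figure 7:

```python
def triangular(x, a, b, c):
    # Simple triangular membership function.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Illustrative (assumed) term definitions for the Age linguistic variable.
AGE_TERMS = {
    "young":       lambda x: triangular(x, 0, 10, 35),
    "middle aged": lambda x: triangular(x, 25, 45, 65),
    "old":         lambda x: triangular(x, 55, 80, 120),
}

def fuzzify(x, terms):
    # Build the table of belief values -- the fuzzy value for input x.
    return {term: mf(x) for term, mf in terms.items()}

print(fuzzify(15, AGE_TERMS))  # {'young': 0.8, 'middle aged': 0.0, 'old': 0.0}
```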
2.1.3.6 The Rule-base
The next step in the inference engine involves iterating the rules in the rule set and calculating the
respective output belief values. The following rule listing summarizes this process on a rule-base containing three
simple rules.
IF (Age = very young) OR (Experience = low) THEN Success Prob = Poor
    OR-connective: µA∨B(x) = max[µA(x), µB(x)]
    µsuccess=poor = max[0.25, 0.15] = 0.25

IF (Age = young) AND (Experience = moderate) THEN Success Prob = Good
    AND-connective: µA∧B(x) = min[µA(x), µB(x)]
    µsuccess=good = min[0.75, 0.35] = 0.35

IF (Age = middle aged) THEN Success Prob = Excellent
    Single antecedent: µA(x) = µA(x)
    µsuccess=excellent = 0.1
The first rule performs an or-operation between the very young entry of the Age fuzzy value and the low entry of the
Experience fuzzy value. The outcome of the first rule represents the degree to which the inference engine believes
this rule's implication, which in this case is the probability of success being poor. After the belief values have been
calculated for each rule, the output is passed on to the aggregator.
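The three rule evaluations above reduce to max/min operations over the fuzzified inputs:

```python
# Belief values from the fuzzification step (taken from the rule listing).
age = {"very young": 0.25, "young": 0.75, "middle aged": 0.1}
experience = {"low": 0.15, "moderate": 0.35}

# Rule 1: OR-connective -> max.
poor = max(age["very young"], experience["low"])
# Rule 2: AND-connective -> min.
good = min(age["young"], experience["moderate"])
# Rule 3: single antecedent passes through unchanged.
excellent = age["middle aged"]

print(poor, good, excellent)  # 0.25 0.35 0.1
```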
2.1.3.7 The Aggregator
The aggregator varies between implementations but will typically average the rule outcomes of each
linguistic term to establish a new fuzzy value. Building upon the outcomes of the rule listing described above this
process would produce a fuzzy value that contains the following:
µsuccess=poor = 0.25 µsuccess=good = 0.35 µsuccess=excellent = 0.1
2.1.3.8 Defuzzification
The final step in the process is known as defuzzification. A common technique used to perform this task is
the centroid method which calculates the gravity for the area under a curve. Figure 8 highlights the area based on the
fuzzy value which was determined by the aggregator in the previous section.
Figure 8: Probability of Success
Calculation of the centroid, or center of gravity (COG), is done using the following formula:
COG = Σx=a..n [ x · µA(x) ] / Σx=a..n [ µA(x) ]    Equation 4 – Defuzzification Center of Gravity Calculation
The COG formula given in Equation 4 steps through the range of values on the x-axis using some
increment; using Figure 8 as an example, the range would be from 0 to 100 with an increment of 10. First, for
each step in the range a weighted value is created by multiplying the x-axis value by its membership value (y-axis);
these weighted values are then summed to create a weighted sum. Next, a participation weight for each
membership value is created by multiplying that membership value by the number of steps participating in it; these
weights are then summed to create a participation sum. Finally, the weighted sum is divided by the participation
sum. Applying this formula to the running example, with a curve sample rate of 10, yields the following calculation:
COG = { [ (0 + 10 + 20) * 0.25 ] + [ (30 + 40 + 50 + 60) * 0.35 ] + [ (70 + 80 + 90 + 100) * 0.1 ] } /
[ (3 * 0.25) + (4 * 0.35) + (4 * 0.1) ]
= 104.5 / 2.55 = 40.98
The value of 40.98 represents the defuzzified Probability of Success given the input values (a=15 and e=38) and the rule set described previously.
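The worked calculation above can be reproduced with a short Python sketch of the discrete centroid method (the term names and segment boundaries are taken from the running example):

```python
def centroid_defuzzify(memberships, segments):
    """Discrete center-of-gravity defuzzification (Equation 4).

    memberships: dict mapping linguistic term -> membership degree
    segments:    dict mapping linguistic term -> list of x-axis sample
                 points falling under that term's region of the curve
    """
    weighted_sum = sum(sum(xs) * memberships[t] for t, xs in segments.items())
    participation_sum = sum(len(xs) * memberships[t] for t, xs in segments.items())
    return weighted_sum / participation_sum

memberships = {"poor": 0.25, "good": 0.35, "excellent": 0.1}
segments = {
    "poor": [0, 10, 20],
    "good": [30, 40, 50, 60],
    "excellent": [70, 80, 90, 100],
}
print(round(centroid_defuzzify(memberships, segments), 2))  # 40.98
```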
2.2 Data Clustering
The data available in most disciplines is mostly unlabeled and unclassified [17]. The purpose of clustering
in data analysis is to partition a dataset into homogenous groupings known as clusters [18], and these groupings
represent naturally occurring data classifications. In data mining, clustering is an iterative process that attempts to minimize inter-cluster similarity while maximizing intra-cluster similarity [19]. This process has its roots
in statistics and originates in a paper by Fisher [20], in which it is stated that “it is often important to know how a
population may be decomposed into sub-groups that contrast sharply with each other, individuals of the same group
being fairly alike.”
2.2.1 Types of Clustering
Two key classifications which often help in distinguishing between types of clustering algorithms are based on the results produced and the algorithms used [21]. The first classification, discussed by Kaufman & Rousseeuw [22], organizes clustering methods by the results produced and includes hierarchical as well as partitioning-based methods. Hierarchical strategies split the data into dendrogram trees which can then be traversed using either a top-down or bottom-up approach. Partitioning algorithms work by splitting the data into a set of k clusters. The second classification organizes the methods by the style of the algorithm and includes optimization-based, distance-based, and density-based methods. Of these, the most popular are density- and distance-based [23].
Distance-based methods are commonly used with partitioning algorithms and are limited to discovering spherically
shaped clusters. Density-based methods are focused on “growing” the clusters based on thresholds set in the input
parameters. Because of this dynamic growth, density-based clustering is capable of determining arbitrarily shaped
clusters. Early attempts at density clustering [24] suffered from prohibitive space and run-time complexity but
research continues to mitigate these shortcomings.
2.2.2 DBScan
The need for effective and efficient clustering has well established roots in the complexities of spatial
database systems [25, 26, 27, 28, 29], and more recently has been applied to the study of applications involving
imprecise data [1, 2, 3, 4]. Density-based algorithms are often preferred where imprecise data exists, and therefore
fuzzy distances naturally occur [30]. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm capable of forming arbitrarily shaped clusters as well as detecting noise [31], and it has proven popular through its many variants [32, 33] designed to mitigate various limitations. The fundamental idea behind DBScan is the core object, which is discovered by analyzing the E-Neighborhood and forms the backbone of a cluster's shape.
The E-Neighborhood is the set of objects within the radius (ε) of the object in question. The neighborhood of object X is shown as the dotted surrounding circle in Figure 9. All y objects, with the exception of y6, are within the radius ε = 0.5 and are therefore considered to be part of object X's neighborhood. To discover the set of neighbors
the simplest form of DBScan iterates the dataset and measures the distance between the object in question (object x)
and the object currently being iterated (object y). These distances are the numbers beside the dotted line connecting
the two objects in the figure. A common distance function is known as Euclidean distance, given in the below
equation. In order to apply this function to measuring the distance between data objects it is important to note that it
is summing the distance between each attribute squared, where (x1i - x2i) is the distance between the attribute for the
two objects.
dist(x1, x2) = √( Σi=1..n (x1i − x2i)² )    Equation 5 – Euclidean Distance
It is often desirable to normalize the attribute values into a standard range such as [0, 1]. A common method to
achieve normalization is known as minmax and is given in Equation 6 shown below. It is important to realize that
minmax requires the maximum and minimum values for the attribute to which v (the value) belongs.
v′ = (v − minA) / (maxA − minA)    Equation 6 – MinMax Normalization
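Equations 5 and 6 can be sketched together in Python. This is an illustrative sketch (the example attribute values and ranges are hypothetical, not from the IRIS dataset):

```python
import math

def minmax_normalize(v, min_a, max_a):
    """Min-max normalization of value v into [0, 1] (Equation 6)."""
    return (v - min_a) / (max_a - min_a)

def euclidean_dist(x1, x2):
    """Euclidean distance between two attribute vectors (Equation 5)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

# Hypothetical two-attribute objects with raw attribute ranges [0, 10]:
obj_a = [minmax_normalize(v, 0, 10) for v in (2, 8)]   # -> [0.2, 0.8]
obj_b = [minmax_normalize(v, 0, 10) for v in (5, 4)]   # -> [0.5, 0.4]
print(euclidean_dist(obj_a, obj_b))  # 0.5
```

Normalizing first keeps any one attribute from dominating the distance simply because it is measured on a larger scale.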
Figure 9: E-Neighborhood
An object becomes a Core Object if the E-Neighborhood contains a number of objects equal to at least
MinPts. If an object is within the E-Neighborhood of a core object then it is directly density reachable from that core
object. Furthermore, two objects are indirectly density reachable if there is a path of core objects between them. A
walk of paths through the connectivity network is known as density connectivity.
In Figure 10, MinPts = 3 and the E-Radius creates the spheres encompassing the neighborhoods of the various points. The core points in this example are C, D and E, each of which contains 3 points. Furthermore, point B is directly density reachable from point C because B is within the neighborhood of point C, which is a core point; point F is directly density reachable from point E for similar reasons. B is indirectly density reachable from point E because there is a path of core point neighborhoods between them; E is not density reachable from B because B is not a core point; F and C have a similar relationship. Points B, C, D, E and F are all considered to be density connected and would represent a discovered cluster in DBScan. The points A and G are outliers and would be flagged as noise.
Figure 10: Density Connectivity
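The definitions above can be sketched as a minimal DBScan in Python. This is an illustrative toy, not the Weka implementation used in this experiment; the sample points and parameters are hypothetical:

```python
import math

def region_query(data, i, eps):
    """Indices of points within the E-Neighborhood (radius eps) of point i."""
    return [j for j, p in enumerate(data) if math.dist(data[i], p) <= eps]

def dbscan(data, eps, min_pts):
    """Minimal DBScan: returns one cluster label per point (-1 marks noise)."""
    labels = [None] * len(data)          # None = unclassified
    cluster = -1
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        seeds = region_query(data, i, eps)
        if len(seeds) < min_pts:         # not a core point
            labels[i] = -1               # tentatively noise
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:                     # grow the cluster via density reachability
            j = seeds.pop()
            if labels[j] in (None, -1):  # unclassified or previously noise
                labels[j] = cluster
                neighbours = region_query(data, j, eps)
                if len(neighbours) >= min_pts:   # j is also a core point
                    seeds.extend(neighbours)
    return labels

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(dbscan(points, eps=0.5, min_pts=3))  # [0, 0, 0, -1]
```

The three nearby points are density connected and form cluster 0, while the distant point has no core neighborhood and is flagged as noise, mirroring points A and G in Figure 10.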
2.2.3 DBScan Variants
DBScan is an elegant algorithm with great potential, but it is not without its limitations. The four major
difficulties plaguing DBScan implementations in current research include computation complexity, memory
consumption, datasets with varying densities, and parameter selection. The computational complexity of DBScan can reach a worst case of O(n²) [34], although its typical complexity is O(n log n). Various studies have addressed this issue in certain circumstances and demonstrated linear [35, 36] and near-linear [37] performance. As dataset size increases the memory resources required by DBScan will grow proportionally due to its
algorithmic requirement to load the entire dataset into memory. Achieving a more scalable algorithm is addressed in
various branches of research [3, 38, 39]. The issue of varying densities has also been acknowledged in literature as a
critical limitation of the DBScan algorithm. The algorithm OPTICS [40], for example, is a cluster analysis method
concerned with computing an augmented cluster ordering. This augmented data can be used to enable varying
densities. Additionally, this ordering is conducive to human analysis and/or automated parameter selection. Uncu et al. [41] further extend the notion of OPTICS by developing an algorithm that focuses on density parameter values for regions rather than across all data vectors.
2.2.4 Fuzzy Clustering
It has been said that the application of fuzzy set theory to the clustering problem can increase adequacy of
the results obtained [42]. There are currently two main branches of fuzzy clustering in research: the first defines fuzzy boundaries between object classes, and the second regards the objects themselves as fuzzy representations of their true values. This paper will focus on the former of these.
2.2.4.1 Fuzzy Neighborhood Relations
Traditional fuzzy clustering [43] evolved from a need for non-crisp, or fuzzy, boundaries between clusters
of objects. Traditionally each object in a dataset has a degree of membership for a set of clusters, where the
membership value is calculated by the specified MF. This is the premise behind the Fuzzy Neighborhood Relation
Function (F-NRF) [33] method which combines NRFJP (Noise robust fuzzy joint points) [44, 45] with DBScan [31]
to create FN-DBSCAN. This algorithm is further enhanced with scalability in [17] when Parker, Hall, and Kandel
combine FN-DBSCAN with SDBDC [3, 23] to produce an algorithm known as SFN-DBSCAN. In F-NRF the fuzzy membership is established by measuring the sum of all comparisons between an object and its neighborhood objects, where the comparison is defined by the neighborhood relation function.
2.2.4.2 Neighborhood Relation Function
A neighborhood relation function is used to calculate what has been referred to as the neighborhood
cardinality [17] which represents the cumulative density of the objects (y) surrounding the object (x) in question.
The cardinality calculation is shown below in Equation 7, which produces a summation of the exponential neighborhood relation function shown in Equation 8.
Card(x) = Σy∈Nε(x) Nx(y)    Equation 7 – Cardinality Calculation
Nx(y) = exp( −( K · (1/dmax) · dist(x, y) )² )    Equation 8 – Exponential Neighborhood Relation Function
In Equation 8 dmax is the maximum distance possible between two data objects and dist(x, y) represents the actual
distance between the two objects in question. The parameter K, such that K>0, provides additional sensitivity, and in
the case of Equation 8 has an effect on the radius [33]. Although implementations may vary, it is common to use the
distance of 1 if two nominal attributes differ and 0 if they are the same. Furthermore if an attribute is numeric it is
often normalized using MinMax normalization.
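Equations 7 and 8 can be sketched in Python. This is an illustrative sketch only; the neighbor distances below are hypothetical and assume min-max normalized attributes, so that dmax = 1:

```python
import math

def exp_nrf(dist_xy, k, d_max):
    """Exponential neighborhood relation function (Equation 8)."""
    return math.exp(-((k * dist_xy / d_max) ** 2))

def cardinality(neighbor_dists, k, d_max):
    """Neighborhood cardinality: sum of the relation function over the
    distances from the object in question to its neighbors (Equation 7)."""
    return sum(exp_nrf(d, k, d_max) for d in neighbor_dists)

# Hypothetical neighborhood: three neighbors at normalized distances,
# with d_max = 1 and sensitivity parameter K = 2.
print(round(cardinality([0.0, 0.2, 0.5], k=2, d_max=1.0), 3))
```

Unlike a crisp point count, nearby objects contribute close to 1 and distant objects contribute progressively less, which is how cardinality captures both density and magnitude.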
2.2.4.3 Fuzzy Object Representations
Another branch of the fuzzy clustering paradigm, referred to as Fuzzy Object Representations [46], views
the objects themselves as fuzzy rather than the boundaries separating the different classifications. This paradigm
embraces the imprecise nature of measurement data and allows modeling uncertainty in terms of knowledge about
the actual objects; this is achieved by using fuzzy distance functions to measure the distance between fuzzy objects.
Furthermore, the solution established in [46], is able to deal with Multi-Represented Objects as described in [47].
Multi-represented objects are objects in which queries need to be made at multiple resolutions, which can be thought
of as fuzzy queries; an example of this would be a spatial map which requires queries to be made at various
resolutions of detail.
2.2.4.4 Other fuzzy clustering methods
Fuzzy clustering first emerged in 1973 when Dunn [48] defined a fuzzy generalization for the min-variance partitioning problem, better known as fuzzy k-means, which was further refined by Bezdek [49]. In 1985 Keller, Gray and Givens [50] extended Cover and Hart's K-Nearest Neighbor (KNN) algorithm [51] with fuzzy set theory. A significant limitation of early clustering methods is their poor ability to deal with non-numeric data types. Recent research by Jiacai and Ruijun [52] describes an algorithm referred to as extended fuzzy k-means (xFKM) which incorporates techniques used for clustering categorical data. A key advantage of xFKM is that it does not require converting nominal attributes into binary attributes, which it achieves by using a categorical dissimilarity measure.
2.3 Applications for uncertainty
The effect of uncertainty can be found in any application requiring the measurement of real world
phenomena or data entry and/or collection conducted by human agents. Global Positioning Systems (GPS) provide position tracking of moving objects, but due to realities in timing and physical measurement they will always contain a certain degree of uncertainty [53], which also extends to any measurements pertaining to moving objects [1, 2, 3, 4].
Another important aspect of uncertainty comes from human error and deception. Human error is commonly
associated with data entry related tasks with which a certain probability of error is always present. Deception
pertains to the possibility that data entered may have been falsified during the collection process.
To help accommodate uncertainty in data mining, tools need to be developed that allow crisp data sets to be interrelated using fuzzy concepts. In the following sections the Fuzzy-Neighborhood DBScan algorithm will be implemented for experimentation and its performance compared with the original DBScan.
2.4 Summary
This chapter introduced research surrounding several important techniques addressing uncertainty issues in
data analysis. The chapter started with an introduction to general probability theory which then worked into the
areas of Bayesian Networks and Fuzzy Set theory. Additionally, data clustering with a focus on the density-based DBScan and FN-DBScan was discussed, followed by a section describing various areas in which uncertainty raises
issues. The following chapter proposes a system architecture that will harmonize the use of Bayesian classification,
fuzzy set theory, and the density-based DBScan clustering algorithm.
CHAPTER III
METHODOLOGY AND DESIGN
The architecture implemented for this experiment is a Weka-based hybrid model which uses both clustering and classification to establish and maintain a clustered dataset. The clustering strategies implemented are DBScan and FN-DBScan, and the classification strategy currently uses a Bayesian Network to classify noise from the clustering process. To verify the experimental results the architecture was designed to gather various measurements pertaining to the accuracy and efficiency of the algorithms used. In this design the Weka API is made directly accessible and extendible by adding it as an external reference library to the Eclipse workspace. To improve the extensibility of Weka's original DBScan, a new version, DBScanV2, was implemented as a slightly modified variant.
3.1 The Experiment Tools
This experiment utilizes the Weka API in a Java simulation developed with the Eclipse IDE and was run using an Intel i7 quad-core machine with 12 GB of memory. For data analysis of the parameter permutations the heap size allocated for the Weka explorer was also increased to 8192 MB. Further analysis was performed using 32-bit MS Excel, which required ensuring the permutation result dataset contained fewer than 1,048,576 entries.
3.2 Measurements & Parameters
The product of executing the broad exploration portion of the code is an output CSV file containing various
measures for each permutation explored. For each entry there will be a set of permutation parameters, accuracy
measures, and timing measures. The permutation parameters are used to initialize the hybrid algorithm which will
establish a test case for measurements to be taken from. The measurements, for each test case, relate to the
performance of the clustering algorithm (DBScan/FN-DBScan) and the effectiveness of the Bayesian Network.
3.2.1 Permutation Parameters
Permutation parameters represent the input parameters used for a result record's test case, and will be further discussed in the permutation engine section. Table 1, shown below, describes the various parameters which
are used in this experiment.
Table 1: Permutation Parameters
Column   Algorithm           Description
MinPts   DBScan              Determines the number of required points within a given neighborhood to signify a core point.
MinCard  FN-DBScan           Determines the minimum cardinality within a given neighborhood to signify a core point. The concept of cardinality represents a measure that encompasses both density and magnitude.
K        FN-DBScan           A constant value used with the neighborhood relation functions.
e        DBScan / FN-DBScan  The radius measure used to determine which objects fall within an object's neighborhood.
Alpha    Bayes'              Used in the calculation of conditional probabilities.
P        Bayes'              The maximum number of parents used in the search algorithm.
N        Bayes'              If true (1) then it sets the initial structure to empty instead of naïve Bayes.
MBC      Bayes'              If true (1) then a Markov Blanket correction is applied to the learned network structure.
R        Bayes'              If true (1) then use the arc reversal operation.
S        Bayes'              The score type: 0 = Bayes, 1 = MDL, 2 = Entropy, 3 = AIC, 4 = Cross Classic, 5 = Cross Bayes.
3.2.2 Accuracy Measures
The primary measure for the accuracy of a permutation is based on the percentage of errors in the testing
set after noise classification has occurred (PostNC), but observing the error in the training set is also useful. For both
sets (testing and training) there is a Pre Noise Classification (PreNC) and Post Noise Classification (PostNC) error
value. The noise classification process pertains to using the Bayesian classifier to classify the noise generated by the
clustering process. In Table 2, shown below, the set of various accuracy measures are summarized.
Table 2: Accuracy Measures
Column              Description
ClusterCount        The average number of clusters produced during the clustering process.
ClusterNoisePER     The average percentage of noise produced during the clustering process.
IsBayesRun          The average number of times the Bayesian classifier was run; it won't run if the noise level is equal to or greater than 80%. A value of 1.0 indicates Bayes is always run and 0.0 indicates it is never run.
ErrTestPreNCPER     The percentage of incorrectly classified objects in the testing set during the Pre-Noise Classification (PreNC) step.
ErrTestPostNCPER    The percentage of incorrectly classified objects in the testing set during the Post-Noise Classification (PostNC) step.
ErrTrainPreNCPER    The percentage of incorrectly classified objects in the training set during the Pre-Noise Classification (PreNC) step.
ErrTrainPostNCPER   The percentage of incorrectly classified objects in the training set during the Post-Noise Classification (PostNC) step.
The measure of error for the testing set is the percentage of unsuccessfully classified test objects using the input Bayesian classifier, which would be either the PreNC classifier or the PostNC classifier. Successful classification is determined by comparing the test object's actual class with the maximal class found within the cluster. Unlike test cases, measuring error in the training set does not directly use the classifier, and the PreNC error is unaffected by classification, which makes it a direct measure of only the clustering process. Just like test cases, the rate of error in training cases is measured by comparing the object's actual class with the cluster number's maximal class; training objects differ, however, in that any objects flagged as noise by the clusterer are automatically considered errors.
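The maximal-class error rule for the training set described above can be sketched in Python. This is an illustrative sketch with hypothetical labels, not the experiment's implementation; noise (labeled -1) is automatically counted as error, as the text specifies for training objects:

```python
from collections import Counter

def cluster_error_rate(cluster_labels, true_classes):
    """Error rate via the maximal-class rule: an object is correct when its
    true class equals the most frequent class in its cluster; objects
    flagged as noise (-1) are automatically counted as errors."""
    majority = {}
    for c in set(cluster_labels):
        if c != -1:
            members = [t for cl, t in zip(cluster_labels, true_classes) if cl == c]
            majority[c] = Counter(members).most_common(1)[0][0]
    errors = sum(1 for cl, t in zip(cluster_labels, true_classes)
                 if cl == -1 or majority[cl] != t)
    return errors / len(true_classes)

# Hypothetical labels: cluster 0's maximal class is "a", cluster 1's is "b";
# one mismatched object plus one noise object out of six gives 2/6 error.
labels = [0, 0, 0, 1, 1, -1]
classes = ["a", "a", "b", "b", "b", "a"]
print(round(cluster_error_rate(labels, classes), 3))  # 0.333
```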
3.2.3 Timing Measures
Time is a critical factor in the performance of real world applications. Table 3 below describes the various
values gathered by this experiment pertaining to time.
Table 3: Timing Measures
Column                 Description
TimeEntirePermutation  The time taken to run through the entire K-Fold process for the given permutation. 10-Fold Cross Validation, for example, would run the process once for each fold, accumulating values to produce averages.
TimeDBScanALL          The time taken to perform the entire clustering process, including the extra steps required to make the cluster number readily available for the classification phase.
TimeDBScanBuild        The time taken to perform the actual clustering process when using the specified DBScan variant.
TimeBayesAll           The time taken to perform the entire Bayesian process.
TimeBayesTrainPreNC    The time taken to remove the noise and train the initial classifier.
TimeBayesTrainPostNC   The time taken to classify the noise and retrain the Bayesian classifier.
TimeBayesClassify      Average time taken to classify an object during PreNC and PostNC classification.
3.2.4 Noise and Noise Classification Measures
A measure of noise, ClusterNoisePER, is given as a percentage which represents the amount of singleton
(unclustered) data objects. Furthermore, both ErrTrainPostNC and ErrTestPostNC represent accuracy measures
after Bayesian noise classification has occurred. These values are used to help determine if there is any added
benefit in adding a noise classification phase after clustering occurs.
3.3 The System Design Overview
This system was designed to execute a set of test cases, also referred to as parameter permutations, in such a way that a set of empirical results is generated for further analysis. The design is composed of three core packages shown in Figure 11: the permutation engine, the data engine, and the AI engine. The purpose of the permutation
engine is to encapsulate the mechanisms used to iterate through a set of possible parameter permutations while
gathering test case measurements to be used for further analysis. Next, the data engine provides a consolidation of
common tasks associated with dataset and data object management. And finally, the AI engine provides wrapper
classes for DBScan/FN-DBScan and Bayesian Networks which implement the hybrid logic and measurement
tracking required to conduct the experiment. In Figure 11, the AppMain and App classes are used to provide the logic pertaining to this specific experiment. Each of these packages will be discussed further in subsequent sections.
Figure 11: System Overview
A key feature of the design is the ability to repeat any number of K-Fold cross validation iterations to
generate an average for each parameter permutation it attempts. For the test cases used in this experiment the
permutations were explored in two ways. The first method provided a broad exploration by iterating the permutation
space defined by an input CSV file (see Table 4 in section 3.4.1.1) and the second method allowed the specific
permutations to be set and attempted in rapid succession to obtain a more accurate average. By conducting the broad exploration first, potential parameter combinations yielding desirable outcomes can be isolated for further exploration. Once selected, each of the candidates becomes the focus of the narrow but exhaustive second method to establish an average set of representative measurements.
There are three elements of flow in the application shown in Figure 12: altering the training set, building
the clusters and classifiers, and performing tests using the testing set. Training set alterations occur at different
stages to prepare the data for the next step. These alterations include preparing a classless dataset for clustering,
updating the classless dataset with a cluster number, and removing noise for classifier training. Next, building the
clusters and classifiers is the central task of this experiment and starts with using the training data to establish
naturally occurring classifications of objects using DBScan or FN-DBScan. Following this, all noisy objects are removed from the dataset and the initial Bayesian Network is trained using the Cluster Number as the class. By applying the initial Bayesian classifier to the noise, additional classifications are discovered, and a final Bayesian network is trained. Finally, accuracy is measured by comparing the maximal class name found in a cluster number with the expected class value for a data object; this method is applied to both the training data and the testing data.
The testing data, however, gives a more realistic measurement of how well the trained classifier will perform when
given a set of unknown data.
Figure 12: System Data Flow
3.4 The System Design Components
3.4.1 The Parameter Permutation Engine
Input parameters are a critical component of many clustering and classification solutions and it is often
difficult to quickly establish which parameters will generate an optimal solution. To help with this process a
parameter permutation engine was designed which is capable of iterating through all possible permutations
following the rules defined in an input CSV file (see Table 4). The rules simply describe the algorithm and its
parameters in terms of testable ranges and the size of each iterative step. The permutation engine is also responsible
for storing the various measures gathered during each iteration. Each measure is stored in a PermResultObject which
maintains the average and min/max aggregates until the object is cleared. Figure 13, shown below, describes the
general structure of this package.
Figure 13: Permutation Engine
To understand the time complexity of this engine, it is important to calculate the number of possible values for each input parameter and then multiply the totals together. So if there were n variables (V1…Vn), where the cardinality |Vi| represents the number of possible values for variable Vi, the number of permutations P could be calculated as follows:
P = |V1| * |V2| * … * |Vn| Equation 9 – Permutation Calculation
So, if there were 25,000 permutations, each of which took 100ms to execute, the engine would require 2500
seconds, or about 42 minutes to run. In terms of both time complexity and storage this is far from perfect and future
work may focus on applying evolutionary algorithms to guide test cases.
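Equation 9 and the runtime estimate above can be checked with a short Python sketch (the first set of cardinalities is hypothetical, chosen only to produce the 25,000-permutation example from the text; the second set comes from the step counts in Table 4):

```python
from functools import reduce
from operator import mul

def permutation_count(cardinalities):
    """Equation 9: P = |V1| * |V2| * ... * |Vn|."""
    return reduce(mul, cardinalities, 1)

# Hypothetical cardinalities multiplying out to 25,000 test cases;
# at 100 ms per case this matches the ~42 minute estimate above.
p = permutation_count([25, 10, 10, 10])
print(p, p * 0.100 / 60)  # 25000 cases, ~41.7 minutes

# Step counts from Table 4 for the Original DBScan + HC/SE Bayes run:
print(permutation_count([9, 20, 10, 2, 2, 2, 2, 6]))  # 172800
```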
3.4.1.1 Parameter permutations
It is often very difficult to select the input parameters required by various data mining algorithms. In order
to provide a thorough parameter result set for analysis, the testing methodology iterates through many input
parameter permutations for the data set under study. For each of these input parameter permutations 10-fold cross
validation is used to derive a set of averages to be recorded. The input parameter permutations for the test cases
discussed shortly were generated from a user defined CSV file containing the following:
Table 4: Parameter Permutation Input Rules
Algorithm     Variant    Parameter  Value Range  Step Size  Step Count
DBScan        Original   MinPts     2-8          1          9
DBScan        Original   e          0.05-1       0.05       20
DBScan        FN-DBScan  MinCard    5-50         5          10
DBScan        FN-DBScan  e          0.05-1       0.05       20
DBScan        FN-DBScan  K          5-20         5          4
Bayesian Net  HC/SE      Alpha      0.1-1        0.1        10
Bayesian Net  HC/SE      P          1-2          1          2
Bayesian Net  HC/SE      N *        0-1          1          2
Bayesian Net  HC/SE      MBC *      0-1          1          2
Bayesian Net  HC/SE      R *        0-1          1          2
Bayesian Net  HC/SE      S **       0-5          1          6
* Parameters with a 0-1 value range are booleans.
** S {0=Bayes, 1=MDL, 2=Entropy, 3=AIC, 4=Cross_Classic, 5=Cross_Bayes}
In Table 4 the Algorithm and Variant columns uniquely identify the AI algorithm which utilizes the input parameters. The value range to test for each algorithm variant's parameter is specified by the Parameter, Value Range, and Step Size columns. Step Size represents the incremental value with which to iterate the parameter's value range. The Step Count column was added for reference; it represents the number of steps required to cover the value range, which can be used to calculate the expected number of permutations. In terms of permutations, running the Original DBScan with a HC/SE Bayesian Network requires 172,800 permutations and the FN-DBScan variant requires 768,000.
3.4.2 The Data Engine
The Data Engine package was designed to help manage the data flow during the execution of the
experiment. The purpose of DataSet is to encapsulate various operations such as removing attributes and preparing
the training and testing sets using the cross-validation process. The key purpose of DataObject was to allow
additional information for an Instance to be collected; in the case of this experiment a way to easily access the cluster number was necessary. DataSetFold is used to provide a consistent lookup for each fold number's train/test partition, which ensures the same arrangement of instances for each fold will occur as the permutation set is iterated.
The CSVFileReader is used to read in the CSV file defining the algorithms and their respective parameter
permutations to test. Figure 14 below overviews the structure of this package.
Figure 14: Data Engine
3.4.2.1 Cross Validation
K-Fold cross validation is a process that randomly partitions the dataset into training/testing partitions k times. The purpose of cross-validation is to help guard against training the algorithms under study to fit only the available data, which is commonly known as over-fitting. Cross validation is achieved by randomly selecting a test subset from the complete data set to simulate the possibility of future data. For each fold, various statistics surrounding the performance and accuracy of the algorithms under study are gathered, which upon completion of all iterations are averaged to obtain representative values.
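The partitioning step can be sketched in Python. This is an illustrative sketch only (the experiment delegates fold handling to the DataSetFold class); the fixed seed mirrors the design goal of a consistent train/test arrangement across permutations:

```python
import random

def k_fold_partitions(indices, k, seed=42):
    """Randomly partition instance indices into k train/test splits.

    A fixed seed keeps the fold arrangement identical each time the
    permutation set is iterated.
    """
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

# 10-fold split over a 150-instance dataset (the size of IRIS):
splits = list(k_fold_partitions(range(150), k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 135 15
```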
3.4.2.2 Attribute Maintenance
When training/building a classifier or clusterer for experimentation it is important to remove the class label,
unique id, and any redundant/irrelevant attributes. Currently this experiment manually selects the attributes to
remove, however future work will automate the selection using an attribute selection measure such as Information
Gain [54, 55]. Furthermore, this experiment requires the dataset to be altered slightly for various tasks. During the DBScan phase the training data is classless, but upon completion the ClusterNumber attribute is added as a nominal attribute. It is important that ClusterNumber is set as the nominal class prior to running Bayes', which does not support continuous-valued attributes.
3.4.3 The AI Engine
The AI Engine, shown in Figure 15, encapsulates the various requirements of the Data Mining techniques
under study and leverages several core Weka AI algorithms.
Figure 15: AI Engine
The BayesNetWrap class serves to encapsulate the experiment's Bayesian-based process. The process consists of training a Bayesian classifier with the noiseless training set, then using this trained classifier to classify the noise, and finally re-training the Bayesian classifier with the full training set. This process produces accuracy results based on Pre-Noise Classification testing and Post-Noise Classification testing.
The DBScanWrap component encapsulates both FN-DBScan and DBScanV2 for the experiment's testing purposes and utilizes the ClusterSet to store information about ClusterObjects. In terms of efficiency the ClusterSet should be an integral part of DBScan, but is currently generated after the buildCluster process has been run, thus
adding an additional O(n) requirement to the computational complexity. Future work should continue the re-design
of DBScanV2 which is an alteration of the original Weka version; the alterations have currently been limited to
simply allowing better sub-classing for FNDBScan. For similar reasons SequentialDatabaseV2 re-implements SequentialDatabase to allow better sub-classing for the FNSequentialDatabase. The algorithms for DBScan and FN-DBScan are contrasted in Table 5, which marks the areas in which the algorithms differ.
Table 5: DBScan vs. FN-DBScan
Original DBScan FN-DBScan [Start] 1. Set parameters minpts/mincard and e-radius 2. Set all data objects as unclassified, then [Process Data Set] [Process Data Set] 3. For each data object X in the data set
If X is unclassified then [Expand Cluster(C, X)] ‐ Where C is the current cluster
[Expand Cluster(C, X)] 4. seedSetX = [Get Seed Set(X)]
5. If [Is Core Point(seedSetX)]
5.1. Add X to cluster C and remove X from seedSetX 5.2. While seedSetX.size > 0
5.2.1. Get next object Y from seedSetX and Remove Y from seedSetX
5.2.2. If Y is unclassified OR noise
5.2.2.1. seedSetY = [Get Seed Set(Y)]
5.2.2.2. If [Is Core Point(seedSetY)] Then add seedSetY to seedSetX
[Get Seed Set(X)]
‐ Get all objects within E-Neighbourhood of X [Is Core Point(X, seedSet)]
‐ If seedSet.size >= minpoints then return true
[Start] 1. Set parameters minpts/mincard, e-radius, k, and f 2. Set all data objects as unclassified, then [Process Data Set] [Process Data Set] 3. For each data object X in the data set
If X is unclassified then [Expand Cluster(C, X)] ‐ Where C is the current cluster
[Expand Cluster(C, X)] 4. seedSetX = [Get Seed Set(X)]
5. If [Is Core Point(seedSetX)]
5.1. Add X to cluster C and remove X from seedSetX 5.2. While seedSetX.size > 0
5.2.1. Get next object Y from seedSetX and Remove Y from seedSetX
5.2.2. If Y is unclassified OR noise
5.2.2.1. seedSetY = [Get Seed Set(Y)]
5.2.2.2. If [Is Core Point(seedSetY)] Then add seedSetY to seedSetX
[Get Seed Set(X)]
‐ Get all objects within E-Neighbourhood of X [Is Core Point(X, seedSet)]
‐ CARD = Neighborhood Relation(X, seedSet) o The neighborhood relation is specified by parameter f.
The parameter k is also used in f.
o Cardinality (CARD) is calculated by applying f to all objects within the neighborhood.
‐ If CARD >= mincard then return true
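The two core point tests above differ only in how neighbourhood support is measured: a crisp count for DBScan versus a fuzzy cardinality for FN-DBScan. A minimal sketch follows; the exponential membership form f(d) = exp(-k*d/e-radius) is an illustrative assumption, not the exact Weka implementation:

```python
import math

def is_core_dbscan(neighbor_dists, minpts):
    """Original DBScan: a point is core if its E-neighbourhood
    contains at least minpts objects (a crisp count)."""
    return len(neighbor_dists) >= minpts

def fuzzy_cardinality(neighbor_dists, eps, k):
    """FN-DBScan: each neighbour contributes a membership degree
    rather than a count of 1. An exponential relation of the form
    f(d) = exp(-k * d / eps) is assumed here for illustration."""
    return sum(math.exp(-k * d / eps) for d in neighbor_dists)

def is_core_fn(neighbor_dists, eps, k, mincard):
    """FN-DBScan core test: the fuzzy cardinality must reach mincard."""
    return fuzzy_cardinality(neighbor_dists, eps, k) >= mincard
```

With k = 0 every neighbour contributes exactly 1 and the fuzzy test degenerates to the crisp count, which is one way to sanity-check an implementation.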
3.4.3.1 Original DBScan Clustering
Density based clustering has the ability to detect naturally occurring groupings within a set of objects
without the need for human supervision. These groupings, known as clusters, represent object classifications. The
natural class label for a clustered instance is simply the number of the cluster to which it belongs. For the
purposes of this experiment, the class attribute of each training instance is set to the nominal
ClusterNumber; it is nominal because the Bayesian Network implementation requires discrete class values.
An implementation of the original DBScan algorithm, as shown in Table 5, is supplied with Weka. This
algorithm required some minor alterations to allow better inheritability but was otherwise preserved as much as possible. FN-DBScan, for instance, is ideally a subclass of the original DBScan but required many of the internal class properties
to be declared as protected rather than private. An important note about the Weka (3.6.6) DBScan clusterInstance
method is that it does not behave as one might expect. This method takes an instance as a parameter, but the
implementation ignores the input parameter and simply returns the cluster number of a data object based on an
internal counter. This means that as long as the entire data set (instance set) is iterated the counter will properly
retrieve the cluster number of the instance at the indicated position. In this way the cluster numbers assigned to the
instances can be gathered without adding significant alterations to the DBScan algorithm found in Weka.
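The counter behaviour described above can be mimicked in a few lines; ClusterCounter here is a hypothetical stand-in for Weka's DBScan class, written to illustrate why the full data set must be iterated in order:

```python
class ClusterCounter:
    """Mimics Weka 3.6.6 DBScan.clusterInstance: the instance
    argument is ignored and an internal counter indexes into the
    cluster assignments produced when the clusters were built."""
    def __init__(self, assignments):
        self.assignments = assignments  # cluster number per data object, in build order
        self.counter = 0

    def cluster_instance(self, instance):
        # 'instance' is deliberately unused, matching the Weka behaviour
        label = self.assignments[self.counter]
        self.counter += 1
        return label

# Correct usage: iterate the *entire* data set in its original order,
# so the counter stays aligned with the instance positions.
assignments = [0, 0, 1, 1, -1]          # -1 marking noise, for illustration
dataset = ["x0", "x1", "x2", "x3", "x4"]
cc = ClusterCounter(assignments)
labels = [cc.cluster_instance(x) for x in dataset]
```

Calling cluster_instance out of order, or for a subset of the data, would silently return the wrong cluster numbers, which is why the gathering step iterates every instance.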
3.4.3.2 Implementing FN-DBScan Clustering
Implementing FN-DBScan required a clustering class to be derived from DBScanV2 and a Sequential
Database class derived from SequentialDatabaseV2 (see section 3.4.3). These classes implement the specific
functionality to allow the additional parameters and the alterations to the core point test shown above in Table 5.
The current implementation of FN-DBScan supports only the Exponential Neighborhood Relation function, but
this could easily be expanded with additional functions and even an additional input parameter to specify which
function to use. Although not explored in this experiment, the implementation also adds an optional parameter to
normalize the neighborhood cardinality, constraining the value between 0 and 1; the normalized value is purely a
measure of density, nullifying the influence of magnitude and thus causing the algorithm to behave differently.
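The optional normalization described above reduces to a single helper; this is an illustrative sketch, not the thesis implementation:

```python
def normalized_cardinality(card, neighborhood_size):
    """Constrain fuzzy cardinality to the range [0, 1] by dividing
    by the neighbourhood size. Since each membership degree is at
    most 1, card <= neighborhood_size, yielding a pure density
    measure that discards the magnitude of the neighbourhood."""
    if neighborhood_size == 0:
        return 0.0
    return card / neighborhood_size
```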
3.4.3.3 Bayesian Network Classification
Classification using Bayesian networks is based on finding the maximal posterior probability for a class
given a set of evidence. The Weka API offers many variations of Bayesian networks by allowing the search
algorithm and estimator to be passed as parameters. The search algorithm is used to learn the structure of the
network based on the dataset’s attributes and the estimator is used to learn the conditional probability distributions
for each node in the learned structure. Currently this experiment focuses only on the Local Hill Climber search
algorithm and the Simple Estimator. Furthermore, the training data provided to the Bayesian network currently does
not allow for missing values and it must have all attributes discretized (no continuous values). The DataSet class
implements the Weka unsupervised method of discretization which is used prior to building a Bayesian classifier.
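The two requirements above, discretized attributes and maximum-posterior classification, can be sketched together. The equal-width binning and fully factored posterior below are simplifying assumptions for illustration, not Weka's Discretize filter or SimpleEstimator:

```python
def discretize(values, bins=10):
    """Equal-width unsupervised discretization of one continuous
    attribute, as required before building the Bayesian classifier."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0     # guard against a constant attribute
    return [min(int((v - lo) / width), bins - 1) for v in values]

def classify(priors, likelihoods, evidence):
    """Pick the class with maximal posterior P(c) * prod P(x_i | c).
    likelihoods[c][i] maps a discretized value to its probability;
    a small floor stands in for smoothing of unseen values."""
    best, best_p = None, -1.0
    for c, prior in priors.items():
        p = prior
        for i, x in enumerate(evidence):
            p *= likelihoods[c][i].get(x, 1e-6)
        if p > best_p:
            best, best_p = c, p
    return best
```

A full Bayesian network would additionally learn a dependency structure between attributes; the factored product above corresponds to the naive (empty-structure) special case.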
3.5 Summary
This chapter discussed the Weka-based implementation used to conduct the experiment described in the following
chapter. The implementation leverages both clustering and classification techniques and gathers various
measurements used during analysis. Although the scope of this experiment is small, the implementation was
designed to be broadened in future work.
CHAPTER IV
ANALYSIS
To effectively conduct this experiment the overall success of the previously proposed system must be
measured. All measurements are taken using the IRIS dataset and fall loosely into three categories: Time, Accuracy,
and Noise. Additionally, this experiment will focus on comparing DBScan & FN-DBScan with a Bayesian Network
using a simple estimator and a local hill climber search. The following sections start by introducing the IRIS dataset
and the parameters passed to the implementation. Next, the optimal parameters are discovered for both the DBScan
& FN-DBScan variants of the experiment. And finally, an analysis of the results generated by focusing on the
optimal parameters is conducted and discussed.
4.1 Parameter Selection for the IRIS Dataset
To help analyze the success of the proposed system, the IRIS dataset [58] was selected for two important
reasons. First, it is a very popular data set used in many pattern recognition experiments [58] as well as in
research directly related to this experiment [17]. Second, the relatively small size of the dataset (150 entities)
made the time complexity of this experiment viable without considerable optimization effort. The
IRIS data set consists of 150 instances with 50 instances existing in each of the 3 classifications. Existing research
[17] has established that performing density based clustering can yield 67% accuracy (or a 33% error rate). A
common outcome for clustering the IRIS dataset is a 2-cluster solution in which one cluster contains only iris-setosa
and the other contains both iris-versicolor and iris-virginica, thus producing a 66.66% (roughly 67%) success rate.
In the following test cases 10-Fold cross validation is performed against the dataset under study using a
broad and exhaustive parameter permutation engine. Once the results of the permutation engine have been collected
analysis of the optimal set of parameters is conducted by independently observing the average accuracy achieved by
each parameter value. To help initially limit the values of interest, Figures 16 and 18 use the Weka Explorer to analyze
the accuracy ratings of various parameter values. The Y axis in all figures presented is the percentage of errors
when running the final Bayesian classifier against the test cases (post-noise classification).
The Bayesian Classifier is trained to classify test cases into cluster numbers. During the testing process the
cluster number to which a test case is assigned is used to find the max-class representing the cluster. This max-class
is then compared with the actual class of the test cases, and an error is produced if they differ.
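The max-class comparison described above can be sketched as follows: cluster numbers are mapped to the majority true class of their training members, and a test case counts as an error when the majority class of its predicted cluster differs from its actual class. These helpers are illustrative, not the experiment's actual code:

```python
from collections import Counter

def max_class_map(train_clusters, train_classes):
    """For each cluster number, find the most frequent true class
    among the training instances assigned to it (the 'max-class')."""
    by_cluster = {}
    for cl, actual in zip(train_clusters, train_classes):
        by_cluster.setdefault(cl, Counter())[actual] += 1
    return {cl: counts.most_common(1)[0][0] for cl, counts in by_cluster.items()}

def error_rate(pred_clusters, actual_classes, mapping):
    """Fraction of cases whose cluster's max-class differs from
    their actual class."""
    errors = sum(1 for cl, actual in zip(pred_clusters, actual_classes)
                 if mapping[cl] != actual)
    return errors / len(actual_classes)
```

Under this scheme the common 2-cluster IRIS outcome scores 33.3% error: the mixed versicolor/virginica cluster's max-class misclassifies 50 of the 150 instances.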
After analysis is completed the full permutation set is reduced based on criteria which independently imply
optimal solutions. With this reduced dataset the data is sorted and patterns meeting optimal solution requirements
are observed. Once an optimal pattern is selected it is run through an intensive set of tests to ensure that an accurate
representation of its average accuracy is established. This intensive process performs 3 rounds of
10,000 iterations each, using 3-fold, 6-fold, and 10-fold cross validation against the specified parameters, and
averages the results.
4.2 Optimal Parameter Discovery for DBScan and Bayes’ HC/SE
The first broad exploration against the Iris dataset used the original DBScan algorithm and a Bayesian
Network with local hill climbing search and a simple estimator which required testing 134,400 parameter
permutations and resulted in approximately 2 hours of execution time (7,609,118 ms). The parameter trends shown
in Figure 16 are generated using the Weka Explorer and are based on analysis of the permutations after an initial
filter removing DBScan noise levels of 80% or more, which reduced the set to 127,680 records (this will vary slightly
between runs). The most observable trend is e-radius, shown in 16b, which visually demonstrates that an e-radius
value between 0.15 and 0.45 has a high concentration of accurate results. The trends for MinPts (16a), Alpha (16c),
and S (16d) are difficult to observe from the graph and all seem to imply a large percentage of high accuracy results.
Additionally, we can observe that good results are obtained when the total cluster count is between 2 and 4.7
(16e). And finally, a noise level between 0 and 10% seems to be an acceptable range found in the high accuracy
permutations.
Figure 16: DBScan Parameter Analysis (panels: a. MinPts, b. e-radius, c. Alpha, d. S, e. cluster count, f. noise)
To further limit the set of possible high accuracy permutations, several strategies are applied.
First, only permutations achieving an error rate of 33.33% or better need to be observed, which
reduces the set to 26,582 entries. This set represents the parameter permutations that achieved optimal solutions.
Next, to isolate the optimal set of parameters, each parameter is analyzed individually to determine which value
has the largest number of entries; for each parameter, the value with the largest count is chosen for the
proposed optimal set.
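The per-parameter selection rule above reduces to counting, for each value of each parameter, how many optimal-accuracy permutations contain it, and keeping the mode. A compact sketch over hypothetical permutation records (the record keys and threshold default are illustrative):

```python
from collections import Counter

def select_best(records, threshold=33.33):
    """records: list of dicts holding parameter values plus an 'error' key.
    For each parameter, choose the value appearing in the most records
    at or below the error threshold (the high-accuracy permutations)."""
    good = [r for r in records if r["error"] <= threshold]
    best = {}
    params = [k for k in records[0] if k != "error"]
    for p in params:
        counts = Counter(r[p] for r in good)
        best[p] = counts.most_common(1)[0][0]
    return best
```

Note that this treats each parameter independently, exactly as described above; it does not guarantee that the combined value set was itself ever tested as one permutation, which is why the chosen pattern is re-verified with the focused runs of section 4.4.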
In Figure 17, high accuracy membership counts for e-radius were gathered and summarized on the chart.
Put another way, the chart shows the number of highly accurate permutations (33.3% error or better) for each important
value of e-radius (0.15-0.45). The chart shows that e-radius values between 0.35 and 0.45 produce the highest
number of highly accurate permutations (all exactly 4,648).
Figure 17: High accuracy membership counts for e-radius
Performing a similar analysis of the remaining parameters produces the findings summarized below in Table 6, which
will be used to generate the results discussed shortly.
Table 6: DBScan Optimal Parameters
Parameter/Result   High Performers   Selected Best
MinPts             3, 5              5
E-Radius           0.35-0.45         0.4
Alpha              0.4-0.8           0.6
P                  1                 1
N                  0                 0
MBC                1                 1
R                  0                 0
S                  0, 4, 5           0
4.3 Optimal Parameter Discovery for FN-DBScan and Bayes’ HC/SE
The next broad exploration test is based on the Iris dataset using the FN-DBScan algorithm and a Bayesian
Network with local hill climbing search and a simple estimator. This test required running 768,000 parameter
permutations and took approximately 12 hours of execution time (43,665,616 ms). The trends shown in the following
figures are based on analysis of the parameter permutations tested by this process. For each permutation, if the noise
level produced by any of the folds is 80% or greater, the system does not build a Bayesian classifier and therefore
produces no results for analysis; these entries are purged from the initial dataset using the
IsBayesRun filter, which reduces the dataset to 479,040 records. As with the analysis of the original DBScan,
the permutation set can be further limited: observing only permutations achieving an error rate of 33.33% or
better reduces the set to 37,870 optimal solution entries.
Figure 18, shown below, helps visualize the patterns of these optimal solutions found with 10-Fold Cross
Validation. The coloring represents cluster noise, blue being low noise and orange being high noise. Importantly,
optimal solutions achieving error rates better than 33.3% (i.e., better than DBScan) have substantially
more noise, and thus rely more heavily on the classifier to classify the noise.
Figure 18: FN-DBScan Parameter Analysis (panels: a. MinCard, b. e-radius, c. K, d. Alpha, e. P, f. N, g. MBC, h. R, k. S)
Although the above figure gives a definite visual direction for optimal parameter values, some parameters, such as
Alpha and S, require additional exploration. To achieve this, the parameters are analyzed individually to help
determine an optimal configuration, using the method described during the DBScan parameter analysis. The
findings of this investigation imply that the parameters found in Table 7 represent an optimal solution.
Table 7: FN-DBScan Optimal Parameters
Parameter/Result   Independent Best   Selected Best
MinCard            10                 10
E-Radius           0.15               0.15
K                  15                 15
Alpha              0.2-0.5            0.3
P                  1                  1
N                  0                  0
MBC                0                  1
R                  0-1                0
S                  0-5                3
4.4 Optimal Parameter Analysis
4.4.1 Accuracy Analysis
To confirm the results implied by the parameters in the selected best columns, a narrowly focused algorithm
was designed to test the selected permutation 3 times using 10,000 iterations each of 10-Fold, 6-Fold, and 3-Fold cross
validation. In all test cases discussed below the results produced by 10-Fold cross validation are the best. This is due
to the simple fact that the full data set is split into 10 testing sets, which in the case of IRIS means there are only 15
test cases in each fold and 135 training cases. Put another way, more folds means larger training sets and smaller test
sets, and with larger training sets a tendency towards increased accuracy is observable.
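The fold arithmetic above generalizes directly; a small helper (hypothetical, but matching the counts reported in Table 8):

```python
def fold_sizes(n, folds):
    """For n instances under k-fold cross validation, each fold holds
    n // folds test cases and the remaining instances form the
    training set."""
    test = n // folds
    return n - test, test
```

For IRIS (n = 150) this reproduces the Train/Test counts of Table 8: 135/15, 125/25, and 100/50 for 10, 6, and 3 folds respectively.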
The results for a focused run using the DBScan selected best parameters found in Table 6 and the FN-
DBScan selected best found in Table 7, are summarized below in Table 8. The 10-Fold DBScan results for both the
test error and the training error was a 33.3% error rate, which agrees with previous research [17]. As
expected, both 6-Fold and 3-Fold performed slightly worse, with test errors of 34.68% and 34.18% respectively.
Furthermore, the results of this focused algorithm in all cases show that the cluster count is exactly 2 and the noise
level is 0%.
Table 8: Accuracy Analysis
Folds   Train/Test   Algorithm   Train Error       Train Error        Test Error        Test Error         Noise   Clusters
        Count                    Pre-Noise Cls.    Post-Noise Cls.    Pre-Noise Cls.    Post-Noise Cls.
10      135 / 15     DBScan      33.33%            33.33%             33.33%            33.33%             0%      2
6       125 / 25     DBScan      33.06%            33.06%             34.68%            34.68%             0%      2
3       100 / 50     DBScan      33.01%            33.01%             34.18%            34.18%             0%      2
10      135 / 15     FN-DBScan   40.9%             28.3%              38.9%             30.4%              39.3%   2.7
6       125 / 25     FN-DBScan   48.4%             34.8%              42.7%             36.9%              47.2%   2.3
3       100 / 50     FN-DBScan   65.5%             55.6%              59.3%             58.5%              65.2%   1.5
By performing a focused FN-DBScan run on the optimal pattern found in Table 7, an ErrTestPostNC value of 30.4%
is achievable, which represents an accuracy gain over the traditional DBScan. Furthermore, it is interesting to
note that the training set achieves a 28.3% error rate, which is significantly better than the original
DBScan (33.33%).
4.4.2 Noise Classification Analysis
Noise classification is a method which may enhance standard clustering by first training a classifier with
clustered objects and then using that classifier to assign noise objects to a best-fit cluster. By using such a hybrid
methodology in this experiment, accuracy increases are observable, as shown below in Table 9, which presents
averages across the entire permutation set grouped by noise level. An interesting pattern is that both the original
DBScan and FN-DBScan benefit most from noise classification when the amount of noise is between 11% and 20%.
This pattern merits additional future study to determine whether it generally holds true with other data sets.
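The hybrid step can be sketched end-to-end: train a model on the clustered (non-noise) objects, then assign each noise object to a best-fit cluster. A nearest-centroid rule is used below purely as a stand-in for the Bayesian network classifier:

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def classify_noise(points, labels, noise_label=-1):
    """labels[i] is a cluster number or noise_label. Noise points are
    reassigned to the cluster with the nearest centroid, mirroring
    the post-noise-classification phase of the experiment."""
    clusters = {}
    for p, l in zip(points, labels):
        if l != noise_label:
            clusters.setdefault(l, []).append(p)
    centroids = {l: centroid(ps) for l, ps in clusters.items()}
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [l if l != noise_label
            else min(centroids, key=lambda c: dist2(points[i], centroids[c]))
            for i, l in enumerate(labels)]
```

In the actual experiment the reassignment is performed by the trained Bayesian network rather than centroid distance, but the data flow, cluster on the clean objects first and then classify the noise, is the same.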
Table 9: Noise Classification Analysis
Noise Level   DBScan Accuracy   DBScan Average Accuracy   FN-DBScan Accuracy   FN-DBScan Average Accuracy
              Increase          (ErrTestPostNCPER)        Increase             (ErrTestPostNCPER)
0-10%         0.08%             56%                       0.2%                 58%
11-20%        6.74%             40%                       4.9%                 42%
21-30%        5.27%             54%                       2.6%                 52%
31-40%        -                 -                         0.7%                 63%
41-50%        3.45%             52%                       1.2%                 60%
51-60%        6.83%             51%                       0.4%                 55%
61-70%        3%                54%                       2.0%                 56%
71-80%        1.27%             64%                       0.1%                 67%
81-90%        -                 -                         -                    -
91-100%       -                 -                         -                    -
4.4.3 Time Analysis
Both DBScan and FN-DBScan perform at a similar time complexity level, as shown below in
Table 10. As expected, however, FN-DBScan requires slightly more time due to its more complicated core point
calculations. In the table, the “Entire Process” column represents the average time to perform the experiment
for a single permutation. “DBScan Build” is the portion of the experiment dealing only with building the clusters,
whereas “DBScan All” consists of not only building the clusters but also the various bookkeeping activities
required to ensure the cluster numbers are available to the noise classification phase. It is interesting that although a
time increase occurs with FN-DBScan in the clustering phase, the Bayesian classification phase is slightly faster on
average; in real-time applications where ad-hoc classification occurs very frequently, this could provide a
significant performance increase. Furthermore, it is apparent that utilizing Bayes’ classification for ad-hoc
classifications yields significant performance increases over re-running the clustering process.
Table 10: Time Analysis
Algorithm    Entire Process   DBScan All   DBScan Build   Bayes All   BayesTrain PreNC   BayesTrain PostNC   Bayes Classify
DBScan       55.7             4.21         4.06           1.26        0.61               0.63                0.0008
FN-DBScan    56.82            4.76         4.58           1.18        0.56               0.6                 0.0007
(times in milliseconds, averaged per permutation)
4.5 Summary
This experiment demonstrated several important performance gains. First, it is shown that FN-DBScan
produces a higher level of accuracy than the original DBScan. Next, performing noise classification in cases with a
moderate amount of noise can increase accuracy. And finally, using a Bayesian Network for ad-hoc
classification (into clusters) is significantly faster than re-generating the clusters.
CHAPTER V
RECOMMENDATIONS AND CONCLUSIONS
5.1 Suggestions for Further Research
To help further verify the results of this experiment it should be expanded beyond the IRIS dataset. Future
work should continue to expand on exploring the advantages found by combining a clustering algorithm with noise
classification on other datasets. The current system architecture should allow for the addition of new datasets but
will require some slight modifications to allow for missing data.
Broadening this experiment also requires additional consideration in terms of time complexity. The current
permutation engine is exhaustive but requires a significant amount of time to run. More intelligence in the
parameter permutation engine would therefore allow better coverage without extreme computation time. A genetic
algorithm with random mutations could be well suited to this task. A guiding principle of evolutionary algorithms is
the fitness test, whose purpose is to measure the success (or health) of a given population. This test
could be governed by fuzzy parameters such as “moderate noise” and “slightly low cluster count” to help direct the
generation of test cases.
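A minimal evolutionary loop of the kind suggested above might look as follows; the fitness function, mutation ranges, and parameter encoding are all illustrative assumptions (a real fitness test would score a candidate by actually running the clustering experiment):

```python
import random

def evolve(fitness, init, mutate, generations=20, pop_size=10, seed=0):
    """Generic elitist search: keep the best half of the population
    each generation and refill it with mutated copies of survivors."""
    rng = random.Random(seed)
    pop = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(rng.choice(survivors), rng) for _ in survivors]
    return max(pop, key=fitness)

# Hypothetical fitness: prefer e-radius near 0.4 and minpts near 5,
# standing in for fuzzy goals such as "moderate noise" without
# actually running DBScan.
def fitness(ind):
    eps, minpts = ind
    return -abs(eps - 0.4) - 0.1 * abs(minpts - 5)

best = evolve(
    fitness,
    init=lambda rng: (rng.uniform(0.05, 1.0), rng.randint(1, 10)),
    mutate=lambda ind, rng: (min(1.0, max(0.05, ind[0] + rng.gauss(0, 0.05))),
                             max(1, min(10, ind[1] + rng.choice([-1, 0, 1])))),
)
```

Compared with the exhaustive engine, a loop of this shape evaluates only a small, adaptively chosen fraction of the permutation space.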
Once time complexity is addressed and some measure of fitness is in place, it would be interesting to
explore additional clustering/classification mechanisms, including the many Bayesian Network variants, Artificial
Neural Networks, and other fuzzy clustering methods. It would also be of notable interest to explore which
clustering methods and data sets produce noise patterns that yield higher rates of classification success; in the
previous analysis section it was noted that optimal classification success occurred when the noise level was between
11% and 20%.
Another point of importance is furthering Weka as a research platform. Weka is a very useful tool,
however, as with any tool there is always room for improvement. The current implementation of DBScan is limited
in its ability to be an effective super class which led to the development of DBScanV2 and SequentialDatabaseV2.
Future work on the Weka implementations of DBScanV2 and SequentialDatabaseV2 should focus on offering the
ability to better maintain a lookup of data objects with their respective assigned cluster number and also allow a
more effective overloading of the core point test. Additionally, in order to properly link a new implementation of
Weka’s DBScan to existing research it will be important to ensure comparison testing is performed between the
original and the newly proposed implementations.
With an effective base class for DBScan in place it will become substantially easier to experiment with
variants, particularly those which simply mutate the core point test, which is ultimately the crux of neighborhoods
and the formation of a cluster's shape. The core point test is commonly based on the idea of a neighborhood measure,
which could be a simple point count (original DBScan) or a more complicated measure such as density or
magnitude based on a neighborhood relation function (FN-DBScan). Future work should continue to explore ways
to mutate the core point test.
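The suggested re-design amounts to treating the core point test as an overridable strategy. A sketch of what such an overload-friendly base class could look like follows; the class and method names are hypothetical, not Weka's, and the exponential membership form is assumed for illustration:

```python
import math

class DensityClusterer:
    """Base class: crisp DBScan-style core test (neighbour count)."""
    def __init__(self, minpts):
        self.minpts = minpts

    def core_measure(self, dists):
        # Simple point count, as in the original DBScan
        return len(dists)

    def is_core(self, dists):
        return self.core_measure(dists) >= self.minpts

class FuzzyNeighborhoodClusterer(DensityClusterer):
    """Variant: only the neighbourhood measure is overridden,
    mirroring how FN-DBScan should subclass DBScan."""
    def __init__(self, mincard, eps, k):
        super().__init__(mincard)
        self.eps, self.k = eps, k

    def core_measure(self, dists):
        # Fuzzy cardinality via an assumed exponential relation
        return sum(math.exp(-self.k * d / self.eps) for d in dists)
```

Because is_core is inherited unchanged, every mutation of the core point test reduces to overriding core_measure, which is precisely the extension point the current Weka DBScan lacks.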
5.2 Conclusions
The results of this paper have empirically shown that by using FN-DBScan a Bayesian Network can be
trained and verified using 10-Fold cross validation on the IRIS dataset to achieve a 30.4% error rate on the
testing data and 28.3% on the training data. The classic DBScan algorithm achieves only a 33.33% error rate
on both the testing and training data. The accuracy increase achievable through the use of FN-DBScan is
due to the generated noise, which can then be classified using a Bayes' Net; noise generated by the original DBScan
does not offer any improvement beyond the standard 33.33% error rate. Although further analysis is required on
additional datasets, these results imply that, when paired with a Bayesian classifier, FN-DBScan may be more
effective and accurate at clustering similar data objects and isolating noisy data, while requiring only a minimal
increase in computational time. Furthermore, leveraging a trained Bayesian Network for ad-hoc classification into
previously defined clusters offers significant performance increases over re-running DBScan. Overall, a hybrid FN-DBScan and Bayesian network solution has the potential to offer increased accuracy and efficiency when dealing
with real-world data analysis situations.
REFERENCES
[1] Y. Li, J. Han, and J. Yang, “Clustering Moving Objects,” KDD, pp.617-622, 2004.
[2] R. Cheng, D.V. Kalashnikov, and S. Prabhakar, “Evaluating probabilistic queries over imprecise data,” SIGMOD, pp.551-562, 2003.
[3] E. Januzaj, H. P. Kriegel and M. Pfeifle, “Scalable Density-Based Distributed Clustering,” The 15th European Conference on Machine Learning (ECML) and the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Pisa, Italy, September 2004.
[4] H.P. Kriegel, P. Kunath, M. Pfeifle, and M. Renz, “Approximated Clustering of Distributed High Dimensional Data,” PAKDD, pp. 432-441. 2005.
[5] A. Kolmogorov, Foundations of the Theory of Probability, 2nd ed., New York: Chelsea, 1956.
[6] G. Shafer and V. Vovk, "The origins and legacy of Kolmogorov's Grundbegriffe," [Online]. Available: http://www.probabilityandfinance.com/articles/04.pdf. [Accessed 02 December 2012].
[7] K. Murphy, "A brief introduction to Bayes' Rule," [Online]. Available: http://www.cs.ubc.ca/~murphyk/Bayes/bayesrule.html. [Accessed 5 December 2011].
[8] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, San Francisco, CA: Morgan Kaufmann, 1988.
[9] N. L. Zhang and D. Poole, “A simple approach to Bayesian network computations,” Proc. 10th Canadian Conference on Artificial Intelligence, pp. 171-178, 1994.
[10] R. D. Shachter, B. D’Ambrosio, and B.A. Del Favero, “Symbolic probabilistic inference in belief networks,” AAAI, pp. 126-131, 1990.
[11] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd Ed, New Jersey: Prentice Hall, 2010.
[12] B. Coppin, Artificial Intelligence Illuminated, Sudbury: Jones and Bartlett Publishers, 2004.
[13] S. Russell, J. Binder, D. Koller, and K. Kanazawa, “Local learning in probabilistic networks with hidden variables,” Proc. 1995 Joint Int. Conf. Artificial Intelligence (IJCAI), pages 1146–1152, 1995.
[14] L.A. Zadeh, Fuzzy sets, Information and Control, 8:338-353, 1965.
[15] A. Kandel and W. Byatt, “Fuzzy Sets, Fuzzy Algebra and Fuzzy Statistics”, Proceedings of the IEEE, vol 66, no 12, pp. 1619-1639, 1978.
[16] J.-S.R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, New Jersey: Prentice Hall, 1996.
[17] J.K. Parker, L.O. Hall, and A. Kandel, "Scalable fuzzy neighborhood DBSCAN," IEEE International Conference on Fuzzy Systems (FUZZ), pp.1-8, 18-23 July 2010.
[18] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, New Jersey: Prentice Hall, 1988.
[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, San Diego: Acad. Press, 2001.
[20] W.D. Fisher, "On grouping for maximum homogeneity," J. Amer. Statist. Assoc., Vol. 53, pp. 789-798, 1958.
[21] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, Vol. 31, No. 3, pp. 265-323, 1999.
[22] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[23] E. Januzaj, H. P. Kriegel and M. Pfeifle, “DBDC:Density Based Distributed Clustering”, Proc. 9th International Conference on Extending Database Technology (EDBT), Heraklion, Greece, pp. 88-105, 2004.
[24] A.K. Jain, Algorithms for Clustering Data, New Jersey: Prentice Hall, 1988.
[25] H. Samet, The Design and Analysis of Spatial Data Structures, Reading, MA: Addison-Wesley, 1990.
[26] R.H. Gueting, “An Introduction to Spatial Database Systems,” The VLDB Journal 3(4), pp. 357-399, 1994.
[27] T. Abraham and J.F. Roddick, “Survey of spatio-temporal databases,” GeoInformatica 3(1), pp. 61–99, 1999.
[28] J. Han, M. Kamber, and A.K.H. Tung, “Spatial clustering methods in data mining: a survey,” Geographic Data Mining and Knowledge Discovery, London: Taylor & Francis, 2001.
[29] D. Birant and A. Kut, “ST-DBSCAN: an algorithm for clustering spatial–temporal data,” Data & Knowledge Engineering 60, pp. 208–221, 2007.
[30] M. L. Yiu, N. Mamoulis, “Clustering Objects On a Spatial Network,” SIGMOD, pp.443-454, 2004.
[31] M. Ester, H.P. Kriegel, J. Sander, X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226-231, 1996.
[32] T. Ali, S. Asghar, N.A. Sajid, "Critical analysis of DBSCAN variations," International Conference on Information and Emerging Technologies (ICIET), pp.1-6, 14-16 June 2010.
[33] E.N. Nasibov and G. Ulutagay, “Robustness of density-based clustering methods with various neighborhood relations, “ Fuzzy Sets and Systems, Volume 160, Issue 24, Pages 3601-3615, 16 December 2009.
[34] P. Viswanath, R. Pinkesh, "l-DBSCAN: A Fast Hybrid Density Based Clustering Method," 18th International Conference on Pattern Recognition (ICPR), pp.912-915, 2006.
[35] Y. Wu, J. Guo, X. Zhang, "A Linear DBSCAN Algorithm Based on LSH," International Conference on Machine Learning and Cybernetics, vol.5, pp.2608-2614, 19-22 August 2007.
[36] B. Liu, "A Fast Density-Based Clustering Algorithm for Large Databases," International Conference on Machine Learning and Cybernetics, pp.996-1000, 13-16 August. 2006.
[37] S. Jiang, X. Li, "A Hybrid Clustering Algorithm," Sixth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp.366-370, 14-16 Aug. 2009.
[38] Y. El-Sonbaty, M.A. Ismail, M. Farouk, "An efficient density based clustering algorithm for large databases," 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 673- 677, 15-17 Nov. 2004.
[39] R.T. Ng and J. Han, “Efficient and Effective Clustering Methods for spatial data mining,” Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, pp. 144-155, 1994.
[40] M. Ankerst, M.M. Breunig, H.P. Kriegel, and J. Sander, "OPTICS: Ordering Points to Identify the Clustering Structure," Proc. of the ACM SIGMOD'99 International Conference on Management of Data, Philadelphia, PA, pp. 49-60, 1999.
[41] O. Uncu, W.A. Gruver, D.B. Kotak, D. Sabaz, Z. Alibhai, C. Ng, "GRIDBSCAN: GRId Density-Based Spatial Clustering of Applications with Noise," IEEE International Conference on Systems, Man and Cybernetics (SMC), vol.4, pp.2976-2981, 8-11 October 2006.
[42] D. Dumitrescu, B. Lazzerini, and L. C. Jain, Fuzzy Sets and Their Application to Clustering and Training, New York: CRC Press LLC, 2000.
[43] F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis, England: John Wiley & Sons, 1999.
[44] E.N. Nasibov, “A robust algorithm for fuzzy clustering problem on the base of fuzzy joint points method,” Cybernetics and Systems Analysis 44 (1), 2008.
[45] E. N. Nasibov, G. Ulutagay, “A new unsupervised approach for fuzzy clustering,” Fuzzy Sets and Systems, Volume 158, Issue 19, pp.2118-2133, October 2007.
[46] H.P. Kriegel and M. Pfeifle, “Density-based clustering of uncertain data,” In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (KDD), pp.672-677. 2005.
[47] H.-P Kriegel, K. Kailing, A. Pryakin, M. Schubert, “Clustering Multi-Represented Objects with Noise,” PAKDD, pp.394-403, 2004.
[48] J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” Journal of Cybernetics, 3:32–57, 1973.
[49] J.C. Bezdek, "A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence,vol.PAMI-2, no.1, pp.1-8, January 1980.
[50] J.M. Keller, M.R. Gray, J.A. Givens Jr., “A fuzzy K-nearest neighbor algorithm,” IEEE Trans., Syst., Man, and Cybern., Vol.15, No.4, pp. 580–585, 1985.
[51] T.M. Cover and P.E. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inform. Theory, vol. IT-13, pp. 21-27, Jan. 1967.
[52] W. Jiacai, G. Ruijun, "An Extended Fuzzy k-Means Algorithm for Clustering Categorical Valued Data," International Conference on Artificial Intelligence and Computational Intelligence (AICI), vol.2, pp.504-507, 23-24 Oct. 2010.
[53] A. Tepwankul and S. Maneewongwattana, "U-DBSCAN : A density-based clustering algorithm for uncertain objects," IEEE 26th International Conference on Data Engineering Workshops (ICDEW), pp.136-143, 1-6 March 2010.
[54] J. R. Quinlan, “Induction of decision trees”, Machine Learning, 1:81-106, 1986.
[55] C.E. Shannon and W. Weaver, The mathematical theory of communication, Illinois: University of Illinois Press, 1949.
[56] Z. Pawlak, “Rough sets,” International Journal of Computer and Information Sciences, 11, pp. 341-356, 1982.
[57] J. H. Holland, Adaptation in natural and artificial systems, Ann Arbor: The University of Michigan Press, 1975.
[58] R. A. Fisher, "IRIS Data Set," UCI Machine Learning Repository, [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Iris. [Accessed 5 December 2011].