CS583 Clustering
Transcript of CS583 Clustering
8/11/2019 CS583 Clustering
1/40
UIC - CS 594 1
Chapter 5: Clustering
Searching for groups
Clustering is unsupervised or undirected.
Unlike classification, in clustering there is no pre-classified data.
Search for groups or clusters of data points (records) that are similar to one another.
Similar points may mean: similar customers or products that will behave in similar ways.
Group similar points together
Group points into classes using some distance measures:
within-cluster distance and between-cluster distance.
Applications:
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
An Illustration
Examples of Clustering Applications
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
Insurance: identifying groups of motor insurance policy holders with some interesting characteristics.
City-planning: identifying groups of houses according to their house type, value, and geographical location.
Concepts of Clustering
Clusters
Different ways of representing clusters:
Division with boundaries
Spheres
Probabilistic
Dendrograms
[Figure: example probabilistic representation, e.g., items I1 ... In with membership probabilities (0.5, 0.2, 0.3) over clusters 1, 2, 3]
Clustering
Clustering quality:
Inter-cluster distance maximized
Intra-cluster distance minimized
The quality of a clustering result depends on both the similarity measure used by the method and its application.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Clustering vs. classification: which one is more difficult? Why?
There are a huge number of clustering techniques.
Dissimilarity/Distance Measure
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j).
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
Types of data in clustering analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Interval-valued variables
Continuous measurements in a roughly linear scale, e.g., weight, height, temperature, etc.
Standardize data (depending on applications):
Calculate the mean absolute deviation:
$s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|)$
where
$m_f = \frac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})$
Calculate the standardized measurement (z-score):
$z_{if} = \frac{x_{if} - m_f}{s_f}$
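A minimal Python sketch of the standardization above (the function name is my own; not from the slides):

```python
def standardize(values):
    """Standardize one interval-scaled variable using the mean
    absolute deviation s_f, which is less sensitive to outliers
    than the standard deviation."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]       # z-scores
```

For example, standardizing [1, 2, 3, 4, 5] gives mean 3 and mean absolute deviation 1.2, so the middle value maps to 0.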
Similarity Between Objects
Distance: measures the similarity or dissimilarity between two data objects.
Some popular ones include the Minkowski distance:
$d(i,j) = \left(|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \dots + |x_{ip}-x_{jp}|^q\right)^{1/q}$
where $(x_{i1}, x_{i2}, \dots, x_{ip})$ and $(x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.
If q = 1, d is the Manhattan distance:
$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \dots + |x_{ip}-x_{jp}|$
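The Minkowski distance can be sketched in one line of Python (a minimal illustration; the function name is my own):

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points.
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)
```

For instance, between (0, 0) and (3, 4) the Manhattan distance is 7 while the Euclidean distance is 5.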
Similarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:
$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \dots + |x_{ip}-x_{jp}|^2}$
Properties:
$d(i,j) \ge 0$
$d(i,i) = 0$
$d(i,j) = d(j,i)$
$d(i,j) \le d(i,k) + d(k,j)$
Also, one can use weighted distances, and many other similarity/distance measures.
Dissimilarity of Binary Variables
Example:
Gender is a symmetric attribute (not used below); the remaining attributes are asymmetric.
Let the values Y and P be set to 1, and the value N be set to 0.

Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack | M | Y | N | P | N | N | N
Mary | F | Y | N | P | N | P | N
Jim  | M | Y | P | N | N | N | N

Using the asymmetric binary dissimilarity $d(i,j) = \frac{b + c}{a + b + c}$, where a = # of (1,1) matches, b = # of (1,0) mismatches, and c = # of (0,1) mismatches:
$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
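The three values above can be reproduced with a short sketch (function and variable names are my own):

```python
def asym_binary_d(x, y):
    """Dissimilarity for asymmetric binary variables:
    d = (b + c) / (a + b + c), where a = # of (1,1) matches,
    b = # of (1,0) mismatches, c = # of (0,1) mismatches.
    (0,0) matches carry no information and are ignored."""
    a = sum(1 for u, v in zip(x, y) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
    return (b + c) / (a + b + c)

# Y/P -> 1 and N -> 0 for Fever, Cough, Test-1 ... Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
```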
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etc.
Method 1: simple matching
$d(i,j) = \frac{p - m}{p}$
m: # of matches, p: total # of variables
Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.
Ordinal Variables
An ordinal variable can be discrete or continuous; order is important, e.g., rank.
Can be treated like interval-scaled variables (f is a variable):
replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
compute the dissimilarity using methods for interval-scaled variables.
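The rank-to-[0, 1] mapping is a one-liner (a minimal sketch; the function name is my own):

```python
def ordinal_to_interval(ranks, M):
    """Map ranks r_if in {1, ..., M_f} onto [0, 1] via
    z_if = (r_if - 1) / (M_f - 1)."""
    return [(r - 1) / (M - 1) for r in ranks]
```

For example, the three ranks of a variable with M_f = 3 states map to 0, 0.5, and 1.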
Ratio-Scaled Variables
Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., growth of a bacteria population.
Methods:
treat them like interval-scaled variables: not a good idea! (why? the scale can be distorted)
apply a logarithmic transformation: $y_{if} = \log(x_{if})$
treat them as continuous ordinal data and then treat their ranks as interval-scaled
Variables of Mixed Types
A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
One may use a weighted formula to combine their effects:
$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
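The weighted formula can be sketched for two of the types (a simplified illustration; names are my own, and interval values are assumed already normalized to [0, 1], as ordinal/ratio values would be after the rank mapping):

```python
def mixed_d(i, j, types):
    """Weighted-average dissimilarity over variables of mixed types.
    types[f] is 'nominal' or 'interval'. None marks a missing value,
    for which the indicator delta_ij^(f) is 0."""
    num = den = 0.0
    for xf, yf, t in zip(i, j, types):
        if xf is None or yf is None:
            continue                           # delta = 0: skip this variable
        den += 1.0                             # delta_ij^(f) = 1
        if t == 'nominal':
            num += 0.0 if xf == yf else 1.0    # d = 0 on match, 1 otherwise
        else:
            num += abs(xf - yf)                # normalized interval distance
    return num / den
```

For example, two objects agreeing on a nominal variable but 0.5 apart on a normalized interval variable have dissimilarity (0 + 0.5) / 2 = 0.25.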
Major Clustering Techniques
Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
Density-based: based on connectivity and density functions.
Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the model to the data.
Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion:
Global optimal: exhaustively enumerate all partitions
Heuristic methods: the k-means and k-medoids algorithms
k-means: each cluster is represented by the center of the cluster
k-medoids or PAM (Partition Around Medoids): each cluster is represented by one of the objects in the cluster
The K-Means Clustering
Given k, the k-means algorithm is as follows:
1) Choose k cluster centers to coincide with k randomly chosen points
2) Assign each data point to the closest cluster center
3) Recompute the cluster centers using the current cluster memberships
4) If a convergence criterion is not met, go to 2)
Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or minimal decrease in the squared error
$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$
where p is a point and $m_i$ is the mean of cluster $C_i$.
Example
For simplicity, 1-dimensional data and k = 2.
Data: 1, 2, 5, 6, 7
K-means: randomly select 5 and 6 as initial centroids;
=> two clusters {1,2,5} and {6,7}; mean_C1 = 8/3, mean_C2 = 6.5
=> {1,2}, {5,6,7}; mean_C1 = 1.5, mean_C2 = 6
=> no change.
Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 1^2 = 2.5
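The four algorithm steps can be sketched in a few lines (a minimal illustration; names are my own, points are tuples, and ties/empty clusters are handled naively):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means; returns the final centers and cluster memberships."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                        # 1) random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # 2) assign to closest center
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        new_centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]   # 3) recompute means
        if new_centers == centers:                         # 4) converged: no change
            break
        centers = new_centers
    return centers, clusters
```

On the example data above (as 1-D tuples), this converges to centers 1.5 and 6, i.e., clusters {1,2} and {5,6,7}.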
Comments on K-Means
Strength: efficient: O(tkn), where n is # of data points, k is # of clusters, and t is # of iterations. Normally, k, t ≪ n.
Variations of the K-Means Method
A few variants of k-means, which differ in:
Selection of the initial k seeds
Dissimilarity measures
Strategies to calculate cluster means
Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
k-Medoids clustering method
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
Medoid: the most centrally located point in a cluster, used as a representative point of the cluster.
In contrast, a centroid is not necessarily inside a cluster.
[Figure: an example showing initial medoids]
Partition Around Medoids (PAM):
1. Given k
2. Randomly pick k instances as initial medoids
3. Assign each data point to the nearest medoid x
4. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion)
5. Randomly select a point y
6. Swap x with y if the swap reduces the objective function
7. Repeat (3-6) until no change
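Steps 2-7 can be sketched as a greedy swap loop (a minimal illustration; names are my own, and swaps are tried exhaustively rather than at random):

```python
import random

def pam(points, k, dist, seed=0):
    """PAM sketch: repeatedly swap a medoid x for a non-medoid y
    whenever the swap lowers the total dissimilarity of all points
    to their nearest medoid."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                 # 2) random initial medoids

    def cost(meds):                                 # 4) objective function
        return sum(min(dist(p, m) for m in meds) for p in points)

    improved = True
    while improved:                                 # 7) repeat until no change
        improved = False
        for x in list(medoids):
            for y in points:                        # 5)-6) try swapping x with y
                if y in medoids:
                    continue
                candidate = [y if m == x else m for m in medoids]
                if cost(candidate) < cost(medoids):
                    medoids, improved = candidate, True
    return medoids
```

On the earlier 1-D data {1, 2, 5, 6, 7} with k = 2, the final medoids give a total dissimilarity of 3 (e.g., medoids 2 and 6).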
Comments on PAM
PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean (why?).
PAM works well for small data sets but does not scale well for large data sets: $O(k(n-k)^2)$ for each change, where n is # of data points and k is # of clusters.
[Figure: example with an outlier 100 units away]
CLARA: Clustering Large Applications
CLARA: built into statistical analysis packages, such as S+.
It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output.
Strength: deals with larger data sets than PAM.
Weaknesses:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
There are other scale-up methods, e.g., CLARANS.
Hierarchical Clustering
Uses a distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition.
[Figure: objects a-e merged step by step (agglomerative, steps 0-4) and split step by step (divisive, steps 4-0)]
Agglomerative Clustering
At the beginning, each data point forms a cluster (also called a node).
Merge nodes/clusters that have the least dissimilarity.
Go on merging.
Eventually all nodes belong to the same cluster.
[Figure: three scatter plots showing points being merged into progressively larger clusters]
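The merging loop can be sketched as follows (a minimal illustration with my own names; single-link dissimilarity is my choice here, since the slides do not fix a linkage):

```python
def agglomerative(points, dist, num_clusters=1):
    """Bottom-up clustering: start with singleton clusters and keep
    merging the pair of clusters with the smallest single-link
    dissimilarity (closest pair of members) until num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None                       # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]        # merge cluster j into cluster i
        del clusters[j]
    return clusters
```

Stopping at num_clusters = 2 on the 1-D data {1, 2, 5, 6, 7} recovers {1,2} and {5,6,7}; stopping earlier or later corresponds to cutting the dendrogram at a different level.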
A Dendrogram Shows How the Clusters are Merged Hierarchically
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Divisive Clustering
Inverse order of agglomerative clustering.
Eventually each node forms a cluster on its own.
[Figure: three scatter plots showing one cluster being split into progressively smaller clusters]
More on Hierarchical Methods
Major weaknesses of agglomerative clustering methods:
do not scale well: time complexity at least $O(n^2)$, where n is the total number of objects
can never undo what was done previously
Integration of hierarchical with distance-based clustering to scale up these clustering methods:
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
Summary
Cluster analysis groups objects based on their similarity and has wide applications.
Measures of similarity can be computed for various types of data.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc.
Clustering can also be used for outlier detection, which is useful for fraud detection.
What is the best clustering algorithm?
Other Data Mining Methods
Sequence analysis
Market basket analysis analyzes things that happen at the same time.
How about things that happen over time? E.g., if a customer buys a bed, he/she is likely to come back later to buy a mattress.
Sequential analysis needs:
A time stamp for each data record
Customer identification
Sequence analysis (cont.)
The analysis shows which items come before, after, or at the same time as other items.
Sequential patterns can be used for analyzing cause and effect.
Other applications:
Finding cycles in association rules: some association rules hold strongly in certain periods of time, e.g., every Monday people buy items X and Y together
Stock market prediction
Predicting possible failures in networks, etc.
Discovering holes in data
Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain.
E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or that when a certain medicine is used, some test values never go beyond a certain range.
Such information could lead to significant discovery: a cure for a disease or some biological law.
Data and pattern visualization
Data visualization: use computer graphics effects to reveal the patterns in data:
2-D and 3-D scatter plots, bar charts, pie charts, line plots, animation, etc.
Pattern visualization: use good interfaces and graphics to present the results of data mining:
Rule visualizers, cluster visualizers, etc.
Scaling up data mining algorithms
Adapt data mining algorithms to work on very large databases:
Data reside on hard disk (too large to fit in main memory)
Make fewer passes over the data
Quadratic algorithms are too expensive: many data mining algorithms are quadratic, especially clustering algorithms.