1
Core Techniques: Cluster Analysis
Cluster: a number of things of the same kind being close together in a group (Longman Dictionary of Contemporary English)
2
Example: Customer Segmentation
Given: a large database of customer data containing their properties and past buying records.
Find groups of customers with similar behavior (clusters).
Find customers with unusual behavior (outliers).
3
Problem Definition
Given: a set of N items in D dimensions.
Find: a natural partitioning of the data set into a number of clusters (k) + outliers, such that:
items in the same cluster are similar (intra-cluster similarity is maximized)
items from different clusters are different (inter-cluster similarity is minimized)
No predefined classes! Unsupervised learning. Used either as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.
4
Clustering: Many Methods
Partitioning methods: k-means, k-medoids
Hierarchical methods: agglomerative/divisive, BIRCH, CURE
Linkage-based methods
Density-based methods: DBSCAN, DENCLUE
Statistical methods: IBM-IM demographic clustering, COBWEB
With different strengths and objectives
5
Differences Among Clustering Methods
Notion of distance between X = &lt;x1, …, xn&gt; and Y = &lt;y1, …, yn&gt;:
d(X, Y) = (|x1 - y1|^q + … + |xn - yn|^q)^(1/q)
Euclidean: q = 2; Manhattan: q = 1
Distance from the center? Or from neighbors (density-based)?
The Dimensionality Curse.
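As a quick illustration, the distance formula above can be written in plain Python (the helper name is an assumption, not part of the slides):

```python
def minkowski(x, y, q=2):
    """Minkowski distance: (|x1-y1|^q + ... + |xn-yn|^q)^(1/q).
    q=2 gives the Euclidean distance, q=1 the Manhattan distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

print(minkowski((0, 0), (3, 4), q=2))  # 5.0 (Euclidean)
print(minkowski((0, 0), (3, 4), q=1))  # 7.0 (Manhattan)
```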
6
Example Data Sets
Outliers are clear (or are they noise?)
Should we cluster according to a distance from a centroid or by the density of their neighborhood?
7
Partition to minimize distances from centers
People of similar age, income, education level…
Cluster and partition to minimize the cost of distribution or utilities in a flat location
8
K-Means
K-means (MacQueen, 1967) is one of the simplest clustering algorithms for minimizing distance from centers.
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
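Steps 1–4 above can be sketched as a minimal Lloyd's iteration in plain Python (the function name, the toy data, and the fixed initial centroids are illustrative assumptions):

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: assign points to the nearest centroid,
    then move each centroid to the mean of its assigned points,
    until the centroids no longer move (Steps 2-4 above)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate the K centroid positions (cluster means)
        new = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # Step 4: stop when centroids no longer move
            break
        centroids = new
    return centroids, clusters

# two well-separated blobs; the initial centroids are an assumed choice
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(pts, [(0, 0), (10, 10)])
print(centers)
```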
9
K-Means (cont.)
The procedure always terminates, but not necessarily in the optimal configuration; it is sensitive to the initial, randomly selected cluster centers.
Many variations and improvements exist.
10 Clusters Example (5 pairs)
[Figure: four panels, Iteration 1 through Iteration 4, showing k-means progress on the x-y plane (x in 0..20, y in -6..8). Starting with two initial centroids in one cluster of each pair of clusters.]
10 Clusters Example
[Figure: four panels, Iteration 1 through Iteration 4, on the x-y plane. Starting with two initial centroids in one cluster of each pair of clusters.]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: four panels, Iteration 1 through Iteration 4, on the x-y plane.]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: four panels, Iteration 1 through Iteration 4, on the x-y plane.]
Solutions to the Initial Centroids Problem
Multiple runs: helps, but probability is not on your side
Start with more than k initial centroids, then select k centroids from the most widely separated resulting clusters
Use hierarchical clustering on a small sample of the data to determine the initial centroids
Bisecting k-means: not as susceptible to initialization issues
Postprocessing
15
Partition to Minimize Distance from Neighbors: Density-Based Clustering
A natural model for describing the spread of information or diseases
Finding frequent trajectories, e.g. from cell-phone calls or RFID data
16
DBSCAN Algorithm: Density Concepts
Two global parameters:
Eps: maximum radius of the neighborhood
MinPts: minimum number of points in an Eps-neighborhood of that point
Core object: object with at least MinPts objects within its Eps-neighborhood (e.g. q)
Border object: object on the border of a cluster, with fewer than MinPts objects in its Eps-neighborhood (e.g. p)
[Figure: points p and q, with MinPts = 5 and Eps = 1 cm]
17
DBSCAN: The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
If p is a core point, a cluster is formed; repeat this process for all points density-reachable from p.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Repeat the process until all of the points have been processed.
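The steps above can be sketched as a minimal DBSCAN in plain Python (function name and toy data are assumptions; the neighborhood query scans all points, where a real implementation would use a spatial index):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns a cluster id per point, -1 for noise.
    Neighborhood queries here are O(n) scans, so the whole run is O(n^2)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:   # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cluster        # p is a core point: a cluster is formed
        queue = [j for j in seeds if j != i]
        while queue:               # expand via density-reachable points
            j = queue.pop()
            if labels[j] == -1:    # previously noise: becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(more)
        cluster += 1
    return labels

# two dense 2x2 blocks and one isolated point (assumed toy data)
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (8, 8), (8, 9), (9, 8), (9, 9), (4, 4)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```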
18
DBSCAN Summary
Density-based algorithm: DBSCAN can discover clusters of arbitrary shape.
An R*-tree spatial index reduces the time complexity from O(n^2) to O(n log n).
Not suitable for higher dimensions: the dimensionality curse.
19
The Dimensionality Curse
Adding a dimension stretches the points across that dimension:
High-dimensional data is extremely sparse.
Distance measures become meaningless, because all points become nearly equidistant.
Special algorithms based on dimensionality reduction and subspace clustering are used.
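A small experiment (illustrative, pure Python; function name and parameters are assumptions) shows the equi-distance effect: the spread of pairwise distances, relative to the smallest one, collapses as the dimensionality grows:

```python
import math
import random

def spread(dim, n=200, seed=0):
    """(max - min) pairwise distance divided by the min distance, for
    n uniform random points in [0,1]^dim; shrinks as dim grows."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    d = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(d) - min(d)) / min(d)

for dim in (2, 10, 100):
    print(dim, round(spread(dim), 2))
```

In 2 dimensions some pairs are far apart and some nearly coincide (a large ratio); in 100 dimensions all pairs are roughly equidistant (a ratio well below 1), which is why nearest-neighbor distances stop being informative.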