1
Core Techniques: Cluster Analysis
Cluster: a number of things of the same kind being close together in a group (Longman Dictionary of Contemporary English)
2
Example: Customer Segmentation
Given: a large database of customer data containing their properties and past buying records.
Find groups of customers with similar behavior (clusters).
Find customers with unusual behavior (outliers).
3
Problem Definition
Given: a set of N items in D dimensions.
Find: a natural partitioning of the data set into a number of clusters (k) + outliers, such that:
items in the same cluster are similar (intra-cluster similarity is maximized)
items from different clusters are different (inter-cluster similarity is minimized)
No predefined classes! Unsupervised learning. Used either as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.
4
Clustering: Many Methods
Partitioning methods: k-means, k-medoids
Hierarchical methods: agglomerative/divisive, BIRCH, CURE
Linkage-based methods
Density-based methods: DBSCAN, DENCLUE
Statistical methods: IBM-IM demographic clustering, COBWEB
With different strengths and objectives
5
Differences Among Clustering Methods
Notion of distance between X = &lt;x1, …, xn&gt; and Y = &lt;y1, …, yn&gt;:
d(X, Y) = (|x1 - y1|^q + … + |xn - yn|^q)^(1/q)
Euclidean: q = 2; Manhattan: q = 1
Distance from the center? Or from neighbors (density-based)?
The Dimensionality Curse.
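As a quick illustration, the distance formula above can be written in plain Python (the helper name is an assumption, not part of the slides):

```python
def minkowski(x, y, q=2):
    """Minkowski distance: (|x1-y1|^q + ... + |xn-yn|^q)^(1/q).
    q=2 gives the Euclidean distance, q=1 the Manhattan distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

print(minkowski((0, 0), (3, 4), q=2))  # 5.0 (Euclidean)
print(minkowski((0, 0), (3, 4), q=1))  # 7.0 (Manhattan)
```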
6
Example Data Sets
Outliers are clear (or are they noise?)
Should we cluster according to a distance from a centroid or by the density of their neighborhood?
7
Partition to minimize distances from centers
People of similar age, income, education level…
Cluster and partition to minimize the cost of distribution or utilities in a flat location
8
K-Means
K-means (MacQueen, 1967) is one of the simplest clustering algorithms for minimizing distance from centers.
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
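Steps 1–4 above can be sketched as a minimal Lloyd's iteration in plain Python (the function name, the toy data, and the fixed initial centroids are illustrative assumptions):

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: assign points to the nearest centroid,
    then move each centroid to the mean of its assigned points,
    until the centroids no longer move (Steps 2-4 above)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate the K centroid positions (cluster means)
        new = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # Step 4: stop when centroids no longer move
            break
        centroids = new
    return centroids, clusters

# two well-separated blobs; the initial centroids are an assumed choice
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(pts, [(0, 0), (10, 10)])
print(centers)
```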
9
K-Means (cont.)
The procedure always terminates, but not necessarily in the optimal configuration; it is sensitive to the initial, randomly selected cluster centers.
Many variations and improvements exist.
10 Clusters Example (5 pairs)
[Figure: four panels, Iteration 1 through Iteration 4, showing k-means progress on the x-y plane (x in 0..20, y in -6..8). Starting with two initial centroids in one cluster of each pair of clusters.]
10 Clusters Example
[Figure: four panels, Iteration 1 through Iteration 4, on the x-y plane. Starting with two initial centroids in one cluster of each pair of clusters.]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: four panels, Iteration 1 through Iteration 4, on the x-y plane.]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: four panels, Iteration 1 through Iteration 4, on the x-y plane.]
Solutions to the Initial Centroids Problem
Multiple runs: helps, but probability is not on your side
Start with more than k initial centroids, then select k centroids from the most widely separated resulting clusters
Use hierarchical clustering on a small sample of the data to determine the initial centroids
Bisecting k-means: not as susceptible to initialization issues
Postprocessing
15
Partition to Minimize Distance from Neighbors: Density-Based Clustering
A natural model for describing the spread of information or diseases
Finding frequent trajectories, e.g. from cell-phone calls or RFID data
16
DBSCAN Algorithm: Density Concepts
Two global parameters:
Eps: maximum radius of the neighborhood
MinPts: minimum number of points in an Eps-neighborhood of that point
Core object: object with at least MinPts objects within its Eps-neighborhood (e.g. q)
Border object: object on the border of a cluster, with fewer than MinPts objects in its Eps-neighborhood (e.g. p)
[Figure: points p and q, with MinPts = 5 and Eps = 1 cm]
17
DBSCAN: The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
If p is a core point, a cluster is formed; repeat this process for all points density-reachable from p.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Repeat the process until all of the points have been processed.
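The steps above can be sketched as a minimal DBSCAN in plain Python (function name and toy data are assumptions; the neighborhood query scans all points, where a real implementation would use a spatial index):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns a cluster id per point, -1 for noise.
    Neighborhood queries here are O(n) scans, so the whole run is O(n^2)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:   # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cluster        # p is a core point: a cluster is formed
        queue = [j for j in seeds if j != i]
        while queue:               # expand via density-reachable points
            j = queue.pop()
            if labels[j] == -1:    # previously noise: becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(more)
        cluster += 1
    return labels

# two dense 2x2 blocks and one isolated point (assumed toy data)
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (8, 8), (8, 9), (9, 8), (9, 9), (4, 4)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```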
18
DBSCAN Summary
Density-based algorithm: DBSCAN can discover clusters of arbitrary shape.
An R*-tree spatial index reduces the time complexity from O(n^2) to O(n log n).
Not suitable for higher dimensions: the dimensionality curse.
19
The Dimensionality Curse
Adding a dimension stretches the points across that dimension:
High-dimensional data is extremely sparse.
Distance measures become meaningless, because all points become nearly equidistant.
Special algorithms based on dimensionality reduction and subspace clustering are used.
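A small experiment (illustrative, pure Python; function name and parameters are assumptions) shows the equi-distance effect: the spread of pairwise distances, relative to the smallest one, collapses as the dimensionality grows:

```python
import math
import random

def spread(dim, n=200, seed=0):
    """(max - min) pairwise distance divided by the min distance, for
    n uniform random points in [0,1]^dim; shrinks as dim grows."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    d = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(d) - min(d)) / min(d)

for dim in (2, 10, 100):
    print(dim, round(spread(dim), 2))
```

In 2 dimensions some pairs are far apart and some nearly coincide (a large ratio); in 100 dimensions all pairs are roughly equidistant (a ratio well below 1), which is why nearest-neighbor distances stop being informative.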