Core Techniques: Cluster Analysis


Page 1:

Core Techniques: Cluster Analysis

Cluster: a number of things of the same kind being close together in a group (Longman Dictionary of Contemporary English).

Page 2:

Example: Customer Segmentation

Given: a large database of customer data containing their properties and past buying records.

Find groups of customers with similar behavior (clusters).

Find customers with unusual behavior (outliers).

Page 3:

Problem Definition

Given: a set of N items in D dimensions.

Find: a natural partitioning of the data set into a number of clusters (k) plus outliers, such that:

items in the same cluster are similar: intra-cluster similarity is maximized

items from different clusters are different: inter-cluster similarity is minimized

No predefined classes: this is unsupervised learning.

Used either as a stand-alone tool to get insight into the data distribution or as a preprocessing step for other algorithms.
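As a rough illustration of the two criteria above, the following sketch compares a partition that respects two visible groups with one that mixes them. It assumes Euclidean distance as the inverse of similarity (small distance means high similarity); the function name `mean_pairwise_distances` and the toy data are made up for this example.

```python
import math
from itertools import combinations


def mean_pairwise_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Pairs with the same label count as intra-cluster, all others as inter-cluster.
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
good = [0, 0, 0, 1, 1, 1]  # respects the two visible groups
bad = [0, 1, 0, 1, 0, 1]   # mixes them

good_intra, good_inter = mean_pairwise_distances(points, good)
bad_intra, bad_inter = mean_pairwise_distances(points, bad)
print("good partition:", good_intra, good_inter)
print("bad partition: ", bad_intra, bad_inter)
```

The partition that follows the groups should show a smaller intra-cluster distance and a larger inter-cluster distance than the mixed one.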

Page 4:

Clustering: Many Methods

Partitioning methods: k-means, k-medoids

Hierarchical methods: agglomerative/divisive, BIRCH, CURE

Linkage-based methods

Density-based methods: DBSCAN, DENCLUE

Statistical methods: IBM-IM demographic clustering, COBWEB

With different strengths and objectives.

Page 5:

Differences Among Clustering Methods

Notion of distance between X = <x1,…,xn> and Y = <y1,…,yn>:

$d(X, Y) = \left( |x_1 - y_1|^q + \dots + |x_n - y_n|^q \right)^{1/q}$

Euclidean: q = 2; Manhattan: q = 1.

Distance from the center, or from neighbors (density-based)?

The dimensionality curse.
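The distance formula on this slide can be sketched in a few lines. This is a minimal illustration, not a library API; the function name `minkowski_distance` and the sample points are chosen for this example only.

```python
def minkowski_distance(x, y, q):
    """Distance of order q between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


x = (0.0, 0.0)
y = (3.0, 4.0)
manhattan = minkowski_distance(x, y, 1)  # |3| + |4|
euclidean = minkowski_distance(x, y, 2)  # square root of 3^2 + 4^2
print(manhattan, euclidean)
```

For the same pair of points, the Manhattan distance (q = 1) is never smaller than the Euclidean distance (q = 2).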

Page 6:

Example Data Sets

Outliers are clear (or are they noise?)

Should we cluster according to a distance from a centroid or by the density of their neighborhood?

Page 7:

Partition to minimize distances from centers

People of similar age, income, education level…

Cluster and partition to minimize cost of distribution or utilities in a flat location

Page 8:

K-Means

K-means (MacQueen, 1967) is one of the simplest clustering algorithms that minimize distance from centers.

1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.

2. Assign each object to the group that has the closest centroid.

3. When all objects have been assigned, recalculate the positions of the K centroids.

4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
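The four steps above can be sketched as follows. This is a minimal pure-Python version under a few assumptions that the slide does not state: the initial centroids are K objects drawn at random from the data, distance is squared Euclidean, and an empty cluster keeps its previous centroid. All function names are illustrative.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Step 2: index of the closest centroid for every object."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: place K initial centroids (assumption: K random objects from the data).
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        # Step 3: recalculate each centroid as the mean of its assigned objects.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster keeps its old centroid.
            new_centroids.append(mean_point(members) if members else centroids[c])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Which group receives label 0 and which receives label 1 depends on the random initial centroids, so only the grouping itself is meaningful.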

Page 9:

K-Means (cont.)

The procedure will always terminate, but not always in the optimal configuration.

It is sensitive to the initial, randomly selected cluster centers.

Many variations and improvements exist.

Page 10:

10 Clusters Example (5 pairs)

[Figure: four scatter plots, Iteration 1 through Iteration 4; x axis from 0 to 20, y axis from -6 to 8.]

Starting with two initial centroids in one cluster of each pair of clusters.

Page 11:

10 Clusters Example

[Figure: four scatter plots, Iteration 1 through Iteration 4; x axis from 0 to 20, y axis from -6 to 8.]

Starting with two initial centroids in one cluster of each pair of clusters.

Page 12:

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: four scatter plots, Iteration 1 through Iteration 4; x axis from 0 to 20, y axis from -6 to 8.]

Page 13:

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: four scatter plots, Iteration 1 through Iteration 4; x axis from 0 to 20, y axis from -6 to 8.]

Page 14:

Solutions to Initial Centroids Problem

Multiple runs: helps, but probability is not on your side

Start with more than k initial centroids and then select k centroids from the most widely separated resulting clusters

Use hierarchical clustering to determine initial centroids on a small sample of data

Bisecting K-means: not as susceptible to initialization issues

Postprocessing
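The first option in the list, multiple runs, can be sketched as below: run K-means several times from different random starts and keep the run with the lowest sum of squared errors (SSE). The choice of SSE as the selection criterion, the number of runs, and all function names are assumptions made for this example; the other options in the list are not shown.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(p, centroids):
    return min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))


def k_means_once(points, k, rng, max_iter=100):
    """One K-means run from random initial centroids; returns (SSE, centroids)."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Mean of each group; an empty group keeps its old centroid (assumption).
        updated = [
            tuple(sum(v) / len(g) for v in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    sse = sum(sq_dist(p, centroids[nearest(p, centroids)]) for p in points)
    return sse, centroids


def best_of_runs(points, k, n_runs=10, seed=0):
    """Repeat K-means n_runs times and return the lowest-SSE run plus all runs."""
    rng = random.Random(seed)
    runs = [k_means_once(points, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[0]), runs


# Hypothetical toy data: three small groups.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 0.0), (10.0, 1.0), (11.0, 0.0),
    (5.0, 10.0), (5.0, 11.0), (6.0, 10.0),
]
(best_sse, best_centroids), runs = best_of_runs(points, k=3)
print(best_sse, sorted(sse for sse, _ in runs))
```

Keeping the best of several runs can only match or improve on any single run among them, but it gives no guarantee of finding the best possible configuration.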

Page 15:

Partition to Minimize Distance from Neighbors: Density-Based Clustering

A natural model for describing the spreading of information or diseases

Finding frequent trajectories: e.g. from cell-phone calls, or RFID data.

Page 16:

DBSCAN Algorithm: Density Concepts

Two global parameters:

Eps: maximum radius of the neighborhood

MinPts: minimum number of points in an Eps-neighborhood of a point

Core object: an object with at least MinPts objects within its Eps-neighborhood, e.g. q

Border object: an object on the border of a cluster, that is, not a core object itself but inside the Eps-neighborhood of one, e.g. p

[Figure: example points p and q, with MinPts = 5 and Eps = 1 cm.]
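The definitions above can be checked on a small example. The sketch below labels each point as core, border, or noise. It assumes Euclidean distance and that a point counts as a member of its own Eps-neighborhood, a convention the slide does not spell out; the function names and data are illustrative.

```python
import math


def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def classify_points(points, eps, min_pts):
    neighborhoods = [eps_neighborhood(points, i, eps) for i in range(len(points))]
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}
    kinds = []
    for i, nb in enumerate(neighborhoods):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            kinds.append("border")  # not core, but within eps of a core object
        else:
            kinds.append("noise")
    return kinds


# Hypothetical toy data: a dense square, one point on its edge, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.2, 0.5), (5.0, 5.0)]
kinds = classify_points(points, eps=0.8, min_pts=4)
print(kinds)
```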

Page 17:

DBSCAN: The Algorithm

Arbitrarily select a point p.

Retrieve all points density-reachable from p with respect to Eps and MinPts.

If p is a core point, a cluster is formed; repeat this process for all points density-reachable from p.

If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.

Repeat the process until all of the points have been processed.
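The procedure above can be sketched as follows. This is a simplified version that scans all points for every neighborhood query, so it runs in quadratic time and uses no spatial index. It assumes Euclidean distance and that a point belongs to its own Eps-neighborhood; the names `region_query`, `dbscan`, and `NOISE` are chosen for this example.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not yet processed"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point: mark as noise for now; it may later be
            # relabeled as a border point of some cluster.
            labels[i] = NOISE
            continue
        # i is a core point: form a new cluster and expand it.
        labels[i] = cluster_id
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                frontier.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster_id += 1
    return labels


# Hypothetical toy data: two dense squares, one edge point, one isolated point.
points = [
    (0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.2, 0.5),
    (5.0, 5.0),
    (10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5),
]
labels = dbscan(points, eps=0.8, min_pts=4)
print(labels)
```

Points that end with the label -1 belong to no cluster and are reported as noise.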

Page 18:

DBSCAN Summary

Density-based algorithm: DBSCAN can discover clusters of arbitrary shape.

An R*-tree spatial index reduces the time complexity from O(n^2) to O(n log n).

Not suitable for higher dimensions: the dimensionality curse.

Page 19:

The Dimensionality Curse

Adding a dimension stretches the points across that dimension:

High-dimensional data is extremely sparse.

Distance measures become meaningless, due to equi-distance.

Special algorithms based on dimensionality reduction and subspace clustering are used.
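The equi-distance effect can be observed with a small experiment: draw random points in the unit hypercube and compare the farthest and nearest distances from one query point as the number of dimensions grows. The uniform data, the sample size, and the name `relative_contrast` are assumptions made for this sketch.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(500, d), 3))
```

On data like this the printed ratio shrinks as the dimension grows, which is the sense in which nearest and farthest neighbors become hard to tell apart.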