Data Mining Techniques: Clustering. 2 Today Clustering Distance Measures Graph-based Techniques ...

Data Mining Techniques: Clustering

2

Today

Clustering Distance Measures Graph-based Techniques K-Means Clustering Tools and Software for Clustering

3

Prediction, Clustering, Classification What is Prediction?

The goal of prediction is to forecast or deduce the value of an attribute based on values of other attributes

A model is first created based on the data distribution The model is then used to predict future or unknown values

Supervised vs. Unsupervised Classification Supervised Classification = Classification

We know the class labels and the number of classes

Unsupervised Classification = Clustering We do not know the class labels and may not know the

number of classes

4

What is Clustering in Data Mining?

Cluster: a collection of data objects that

are “similar” to one another and thus can be treated collectively as one group

but as a collection, they are sufficiently different from other groups

Clustering unsupervised classification no predefined classes

Clustering is a process of partitioning a set of data (or objects) in a set of meaningful sub-classes, called clusters

Clustering is a process of partitioning a set of data (or objects) in a set of meaningful sub-classes, called clusters

Helps users understand the natural grouping or structure in a data set

5

Requirements of Clustering Methods

Scalability Dealing with different types of attributes Discovery of clusters with arbitrary

shape Minimal requirements for domain

knowledge to determine input parameters

Able to deal with noise and outliers Insensitive to order of input records The curse of dimensionality Interpretability and usability

Scalability Dealing with different types of attributes Discovery of clusters with arbitrary

shape Minimal requirements for domain

knowledge to determine input parameters

Able to deal with noise and outliers Insensitive to order of input records The curse of dimensionality Interpretability and usability

6

Applications of Clustering Clustering has wide applications in Pattern

Recognition

Spatial Data Analysis: create thematic maps in GIS by clustering feature spaces detect spatial clusters and explain them in spatial data mining

Image Processing

Market Research

Information Retrieval Document or term categorization Information visualization and IR interfaces

Web Mining Cluster Web usage data to discover groups of similar access patterns Web Personalization

7

Clustering Methodologies Two general methodologies

Partitioning Based Algorithms Hierarchical Algorithms

Partitioning Based divide a set of N items into K clusters (top-down)

Hierarchical agglomerative: pairs of items or clusters are successively linked to produce

larger clusters divisive: start with the whole set as a cluster and successively divide sets into

smaller partitions

8

Distance or Similarity Measures Measuring Distance

In order to group similar items, we need a way to measure the distance between objects (e.g., records)

Note: distance = inverse of similarity Often based on the representation of objects as “feature vectors”

ID Gender Age Salary1 F 27 19,0002 M 51 64,0003 M 52 100,0004 F 33 55,0005 M 45 45,000

T1 T2 T3 T4 T5 T6Doc1 0 4 0 0 0 2Doc2 3 1 4 3 1 2Doc3 3 0 0 0 3 0Doc4 0 1 0 3 0 0Doc5 2 2 2 3 1 4

An Employee DB Term Frequencies for Documents

Which objects are more similar?

9

Distance or Similarity Measures Properties of Distance Measures:

for all objects A and B, dist(A, B) 0, and dist(A, B) = dist(B, A) for any object A, dist(A, A) = 0 dist(A, C) dist(A, B) + dist (B, C)

Common Distance Measures: Manhattan distance:

Euclidean distance:

Cosine similarity:

1 2, , , nX x x x 1 2, , , nY y y y

1 1 2 2( , ) n ndist X Y x y x y x y

2 2

1 1( , ) n ndist X Y x y x y

Can be normalizedto make values fallbetween 0 and 1.

Can be normalizedto make values fallbetween 0 and 1.

( , ) 1 ( , )dist X Y sim X Y 2 2

( )( , )

i ii

i ii i

x ysim X Y

x y

10

Distance or Similarity Measures Weighting Attributes

in some cases we want some attributes to count more than others associate a weight with each of the attributes in calculating distance, e.g.,

Nominal (categorical) Attributes can use simple matching: distance=1 if values match, 0 otherwise or convert each nominal attribute to a set of binary attribute, then use the usual

distance measure if all attributes are nominal, we can normalize by dividing the number of

matches by the total number of attributes

Normalization: want values to fall between 0 an 1: other variations possible

2 2

1 1 1( , ) n n ndist X Y w x y w x y

min'

max mini i

i

i i

x xx

x x

11

Distance or Similarity Measures Example

max distance for age: 100000-19000 = 79000 max distance for age: 52-27 = 25

dist(ID2, ID3) = SQRT( 0 + (0.04)2 + (0.44)2 ) = 0.44 dist(ID2, ID4) = SQRT( 1 + (0.72)2 + (0.12)2 ) = 1.24

ID Gender Age Salary1 F 27 19,0002 M 51 64,0003 M 52 100,0004 F 33 55,0005 M 45 45,000

min'

max mini i

i

i i

x xx

x x

ID Gender Age Salary1 1 0.00 0.002 0 0.96 0.563 0 1.00 1.004 1 0.24 0.445 0 0.72 0.32

12

Domain Specific Distance Functions For some data sets, we may need to use specialized functions

we may want a single or a selected group of attributes to be used in the computation of distance - same problem as “feature selection”

may want to use special properties of one or more attribute in the data

natural distance functions may exist in the data

Example: Zip Codesdistzip(A, B) = 0, if zip codes are identicaldistzip(A, B) = 0.1, if first 3 digits are identicaldistzip(A, B) = 0.5, if first digits are identicaldistzip(A, B) = 1, if first digits are different

Example: Zip Codesdistzip(A, B) = 0, if zip codes are identicaldistzip(A, B) = 0.1, if first 3 digits are identicaldistzip(A, B) = 0.5, if first digits are identicaldistzip(A, B) = 1, if first digits are different

Example: Customer Solicitationdistsolicit(A, B) = 0, if both A and B respondeddistsolicit(A, B) = 0.1, both A and B were chosen but did not responddistsolicit(A, B) = 0.5, both A and B were chosen, but only one respondeddistsolicit(A, B) = 1, one was chosen, but the other was not

Example: Customer Solicitationdistsolicit(A, B) = 0, if both A and B respondeddistsolicit(A, B) = 0.1, both A and B were chosen but did not responddistsolicit(A, B) = 0.5, both A and B were chosen, but only one respondeddistsolicit(A, B) = 1, one was chosen, but the other was not

13

Distance (Similarity) Matrix

similarity (or distance) of to ij i jd D D

Note that dij = dji (i.e., the matrix is symmetric. So, we only need the lower triangle part of the matrix.

The diagonal is all 1’s (similarity) or all 0’s (distance)

Note that dij = dji (i.e., the matrix is symmetric. So, we only need the lower triangle part of the matrix.

The diagonal is all 1’s (similarity) or all 0’s (distance)

1 2

1 12 1

2 21 2

1 2

n

n

n

n n n

I I I

I d d

I d d

I d d

Similarity (Distance) Matrix based on the distance or similarity measure we can construct a symmetric

matrix of distance (or similarity values) (i, j) entry in the matrix is the distance (similarity) between items i and j

14

Example: Term Similarities in Documents

T1 T2 T3 T4 T5 T6 T7 T8Doc1 0 4 0 0 0 2 1 3Doc2 3 1 4 3 1 2 0 1Doc3 3 0 0 0 3 0 3 0Doc4 0 1 0 3 0 0 2 0Doc5 2 2 2 3 1 4 0 2

sim T T w wi j jkikk

N

( , ) ( )

1

T1 T2 T3 T4 T5 T6 T7T2 7T3 16 8T4 15 12 18T5 14 3 6 6T6 14 18 16 18 6T7 9 6 0 6 9 2T8 7 17 8 9 3 16 3

Term-TermSimilarity Matrix

Term-TermSimilarity Matrix

15

Similarity (Distance) Thresholds A similarity (distance) threshold may be used to mark pairs that are

“sufficiently” similar

T1 T2 T3 T4 T5 T6 T7T2 7T3 16 8T4 15 12 18T5 14 3 6 6T6 14 18 16 18 6T7 9 6 0 6 9 2T8 7 17 8 9 3 16 3

T1 T2 T3 T4 T5 T6 T7T2 0T3 1 0T4 1 1 1T5 1 0 0 0T6 1 1 1 1 0T7 0 0 0 0 0 0T8 0 1 0 0 0 1 0

Using a threshold value of 10 in the previous example

16

Graph Representation The similarity matrix can be visualized as an undirected graph

each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)

T1 T2 T3 T4 T5 T6 T7T2 0T3 1 0T4 1 1 1T5 1 0 0 0T6 1 1 1 1 0T7 0 0 0 0 0 0T8 0 1 0 0 0 1 0

T1 T3

T4

T6T8

T5

T2

T7If no threshold is used, thenmatrix can be represented asa weighted graph

If no threshold is used, thenmatrix can be represented asa weighted graph

17

Simple Clustering Algorithms If we are interested only in threshold (and not the degree of

similarity or distance), we can use the graph directly for clustering Clique Method (complete link)

all items within a cluster must be within the similarity threshold of all other items in that cluster

clusters may overlap generally produces small but very tight clusters

Single Link Method any item in a cluster must be within the similarity threshold of at least one

other item in that cluster produces larger but weaker clusters

Other methods star method - start with an item and place all related items in that cluster string method - start with an item; place one related item in that cluster; then

place anther item related to the last item entered, and so on

18

Simple Clustering Algorithms Clique Method

a clique is a completely connected subgraph of a graph in the clique method, each maximal clique in the graph becomes a cluster

T1 T3

T4

T6T8

T5

T2

T7

Maximal cliques (and therefore the clusters) in the previous example are:

{T1, T3, T4, T6}{T2, T4, T6}{T2, T6, T8}{T1, T5}{T7}

Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.

19

Simple Clustering Algorithms Single Link Method

selected an item not in a cluster and place it in a new cluster place all other similar item in that cluster repeat step 2 for each item in the cluster until nothing more can be added repeat steps 1-3 for each item that remains unclustered

T1 T3

T4

T6T8

T5

T2

T7

In this case the single link method produces only two clusters:

{T1, T3, T4, T5, T6, T2, T8} {T7}

Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.

20

Clustering with Existing Clusters The notion of comparing item similarities can be extended to clusters

themselves, by focusing on a representative vector for each cluster cluster representatives can be actual items in the cluster or other “virtual”

representatives such as the centroid this methodology reduces the number of similarity computations in clustering clusters are revised successively until a stopping condition is satisfied, or until no

more changes to clusters can be made

Partitioning Methods reallocation method - start with an initial assignment of items to clusters and then

move items from cluster to cluster to obtain an improved partitioning Single pass method - simple and efficient, but produces large clusters, and

depends on order in which items are processed

Hierarchical Agglomerative Methods starts with individual items and combines into clusters then successively combine smaller clusters to form larger ones grouping of individual items can be based on any of the methods discussed earlier

21

K-Means Algorithm The basic algorithm (based on reallocation method):

1. select K data points as the initial representatives

2. for i = 1 to N, assign item xi to the most similar centroid (this gives K clusters)

3. for j = 1 to K, recalculate the cluster centroid Cj

4. repeat steps 2 and 3 until these is (little or) no change in clusters

Example: Clustering Terms

T1 T2 T3 T4 T5 T6 T7 T8 C1 C2 C3Doc1 0 4 0 0 0 2 1 3 4/2 0/2 2/2Doc2 3 1 4 3 1 2 0 1 4/2 7/2 3/2Doc3 3 0 0 0 3 0 3 0 3/2 0/2 3/2Doc4 0 1 0 3 0 0 2 0 1/2 3/2 0/2Doc5 2 2 2 3 1 4 0 2 4/2 5/2 5/2

Initial (arbitrary) assignment:C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6} Cluster Centroids

22

Example: K-Means Example (continued)

T1 T2 T3 T4 T5 T6 T7 T8Class1 29/2 29/2 24/2 27/2 17/2 32/2 15/2 24/2Class2 31/2 20/2 38/2 45/2 12/2 34/2 6/2 17/2Class3 28/2 21/2 22/2 24/2 17/2 30/2 11/2 19/2Assign to Class2 Class1 Class2 Class2 Class3 Class2 Class1 Class1

Now using simple similarity measure, compute the new cluster-term similarity matrix

Now compute new cluster centroids using the original document-term matrix

T1 T2 T3 T4 T5 T6 T7 T8 C1 C2 C3Doc1 0 4 0 0 0 2 1 3 8/3 2/4 0/1Doc2 3 1 4 3 1 2 0 1 2/3 12/4 1/1Doc3 3 0 0 0 3 0 3 0 3/3 3/4 3/1Doc4 0 1 0 3 0 0 2 0 3/3 3/4 0/1Doc5 2 2 2 3 1 4 0 2 4/3 11/4 1/1

The process is repeated until no further changes are made to the clusters

23

K-Means Algorithm Strength of the k-means:

Relatively efficient: O(tkn), where n is # of objects, k is # of clusters, and t is # of iterations. Normally, k, t << n

Often terminates at a local optimum

Weakness of the k-means: Applicable only when mean is defined; what about

categorical data? Need to specify k, the number of clusters, in advance Unable to handle noisy data and outliers

Variations of K-Means usually differ in: Selection of the initial k means Dissimilarity calculations Strategies to calculate cluster means

24

Hierarchical Algorithms Use distance matrix as clustering criteria

does not require the number of clusters k as an input, but needs a termination condition

25

Hierarchical Agglomerative Clustering HAC starts with unclustered data and performs successive pairwise

joins among items (or previous clusters) to form larger ones this results in a hierarchy of clusters which can be viewed as a dendrogram useful in pruning search in a clustered item set, or in browsing clustering results

Some commonly used HACM methods Single Link: at each step join most similar pairs of objects that are not yet

in the same cluster Complete Link: use least similar pair between each cluster pair to

determine inter-cluster similarity - all items within one cluster are linked to each other within a similarity threshold

Ward’s method: at each step join cluster pair whose merger minimizes the increase in total within-group error sum of squares (based on distance between centroids) - also called the minimum variance method

Group Average (Mean): use average value of pairwise links within a cluster to determine inter-cluster similarity (i.e., all objects contribute to inter-cluster similarity)

26

Hierarchical Agglomerative ClusteringDendrogram for a hierarchy of clusters

A B C D E F G HI

Data Mining Techniques: Clustering. 2 Today Clustering Distance Measures Graph-based Techniques ...

Documents

Transcript of Data Mining Techniques: Clustering. 2 Today Clustering Distance Measures Graph-based Techniques ...