Definition Finding groups of objects such that the objects in a group will be similar (or related)...

Definition
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Inter-cluster distances are maximized
• Intra-cluster distances are minimized

Applications
• Group related documents for browsing
• Group genes and proteins that have similar functionality
• Group stocks with similar price fluctuations
• Reduce the size of large data sets
• Group users with similar buying mentalities

Clustering is ambiguous
There is no correct or incorrect solution for clustering.

How many clusters?
[Figure panels: Two Clusters, Four Clusters, Six Clusters]

Challenges faced
• Scalability
• Ability to deal with different types of attributes
• Noise & outliers
• Complex shapes and types of data
• Incremental clustering and insensitivity to the order of input records
• High dimensionality
• Constraint-based clustering
• Interpretability and usability

Types of Data
Data Matrix
• n objects with p variables.
• The structure is in the form of a relational table, or n x p matrix.
Dissimilarity Matrix
• Object-by-object structure.
• Stores a collection of proximities that are available for all pairs of the n objects.
• d(i, j) is the dissimilarity between objects i and j.
• d(i, j) = d(j, i) and d(i, i) = 0
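As a minimal sketch of how a dissimilarity matrix can be derived from a data matrix, the Python snippet below computes d(i, j) for every pair of rows. The Euclidean distance, the function name `dissimilarity_matrix`, and the sample values are assumptions made for illustration; the slides do not prescribe a particular distance here.

```python
import math

def dissimilarity_matrix(data):
    """Return the n x n matrix of pairwise Euclidean distances.

    `data` is an n x p data matrix given as a list of n rows, each with
    p numeric values. Euclidean distance is an assumed choice.
    """
    n = len(data)
    d = [[0.0] * n for _ in range(n)]  # d(i, i) = 0 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            d[i][j] = dist
            d[j][i] = dist  # symmetry: d(i, j) = d(j, i)
    return d

# Hypothetical data matrix: 3 objects, 2 variables.
data = [[0, 0], [3, 4], [6, 8]]
matrix = dissimilarity_matrix(data)
for row in matrix:
    print(row)
# [0.0, 5.0, 10.0]
# [5.0, 0.0, 5.0]
# [10.0, 5.0, 0.0]
```

Only the upper triangle is computed and then mirrored, which is what the two properties d(i, j) = d(j, i) and d(i, i) = 0 allow.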

Types of Data
• Interval-Scaled Variables
• Binary Variables
• Nominal
• Ordinal
• Ratio-Scaled Variables
• Variables of Mixed Types

Interval-Scaled Variables

Interval-scaled variables contd…

Binary variables
• A binary variable has only two states: 0 and 1.
• Dissimilarity between two objects described by binary variables is computed from a 2 x 2 contingency table:

                     Object j
                 1       0       sum
Object i   1     q       r       q+r
           0     s       t       s+t
          sum    q+s     r+t     p

Dissimilarity between binary variables

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       Y       N        N        N        N

D(Jack, Mary) = 0.33
D(Jack, Jim) = 0.67
D(Mary, Jim) = 0.75
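The three values above can be reproduced with the asymmetric binary dissimilarity d = (r + s) / (q + r + s), which ignores the 0/0 matches counted by t. The slide does not state which formula it uses, so that choice is an assumption, as are the encoding (Y and P map to 1, N maps to 0), the exclusion of Gender from the computation, and the helper names. A minimal Python sketch:

```python
def binary_dissimilarity(a, b):
    """Asymmetric binary dissimilarity (r + s) / (q + r + s).

    Assumes at least one of the two objects has a 1 somewhere;
    otherwise the denominator would be zero.
    """
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both are 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # only a is 1
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # only b is 1
    return (r + s) / (q + r + s)

def encode(values):
    """Map Y (yes) and P (positive) to 1, everything else to 0."""
    return [1 if v in ("Y", "P") else 0 for v in values]

# Fever, Cough, Test-1 .. Test-4 (Gender is left out: assumed symmetric).
patients = {
    "Jack": encode("YNPNNN"),
    "Mary": encode("YNPNPN"),
    "Jim": encode("YYNNNN"),
}

d_jack_mary = binary_dissimilarity(patients["Jack"], patients["Mary"])
d_jack_jim = binary_dissimilarity(patients["Jack"], patients["Jim"])
d_mary_jim = binary_dissimilarity(patients["Mary"], patients["Jim"])

print(f"{d_jack_mary:.2f}")  # 0.33
print(f"{d_jack_jim:.2f}")   # 0.67
print(f"{d_mary_jim:.2f}")   # 0.75
```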

Categorical Variables

Ordinal
• Similar to nominal variables, but values are ordered in some sequence.
• E.g., rank of employees can be assistant, associate, full.
Ratio-Scaled variables
• Makes a positive measurement on a non-linear scale.
• E.g., growth of bacteria, radioactivity.
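One common way to bring such variables into a distance computation is sketched below. It is an assumption here, since the slides do not spell out a procedure: ordinal values are replaced by their rank r in 1..M and rescaled to [0, 1] with z = (r - 1) / (M - 1), and ratio-scaled values are passed through a logarithm so they can be treated like interval-scaled values. The function names and the sample growth values are hypothetical.

```python
import math

def ordinal_to_interval(value, ordered_states):
    """Map an ordinal value to [0, 1] using its rank among M ordered states."""
    rank = ordered_states.index(value) + 1  # ranks run from 1 to M
    m = len(ordered_states)
    return (rank - 1) / (m - 1)

def log_transform(x):
    """Base-10 logarithm of a positive ratio-scaled measurement."""
    return math.log10(x)

ranks = ["assistant", "associate", "full"]
scaled_ranks = [ordinal_to_interval(v, ranks) for v in ranks]
print(scaled_ranks)  # [0.0, 0.5, 1.0]

growth = [1, 10, 100]  # hypothetical exponential-style measurements
scaled_growth = [log_transform(x) for x in growth]
print(scaled_growth)  # [0.0, 1.0, 2.0]
```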

Variables of Mixed Types

Other types of data

Types of clustering
Hierarchical clustering (BIRCH)
• A set of nested clusters organized as a hierarchical tree
Partitional clustering (k-means, k-medoids)
• A division of data objects into non-overlapping (distinct) subsets (i.e., clusters) such that each data object is in exactly one subset
Density-based (DBSCAN)
• Based on density functions
Grid-based (STING)
• Based on a multiple-level granularity structure
Model-based (SOM)
• Hypothesize a model for each of the clusters and find the best fit of the data to the given model

Partitional Clustering
[Figure panels: Original Points; A Partitional Clustering]

Hierarchical Clustering
[Figure panels over points p1, p2, p3, p4: Traditional Hierarchical Clustering; Non-traditional Hierarchical Clustering; Traditional Dendrogram; Non-traditional Dendrogram]

Clustering Algorithms
Partitional
• K-means
• K-medoids
Hierarchical
• Agglomerative
• Divisive

K-Means Algorithm
Each cluster is represented by the mean value of the objects in the cluster.
Input: a set of n objects, the number of clusters k
Output: a set of k clusters
Algorithm:
  Randomly select k samples and mark them as the initial cluster means.
  Repeat
    Assign (or reassign) each sample to the cluster to which it is most similar, based on the mean of the cluster.
    Update each cluster's mean.
  until no change.

K-Means (Array)
Step 1: Randomly assign objects to k clusters
Step 2: Find the mean of each cluster
Step 3: Re-assign objects to the cluster with the closest mean
Step 4: Go to step 2
Repeat until no change.
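The four steps above can be sketched in Python for one-dimensional data as follows. The initial partition is passed in explicitly so that a run is reproducible; the function name, the rule that an emptied cluster keeps its previous mean, and the sample values are assumptions made for illustration.

```python
def kmeans_from_partition(clusters):
    """1-D k-means starting from an initial partition (a list of non-empty lists).

    Assumption: a cluster that becomes empty keeps its previous mean.
    """
    means = [sum(c) / len(c) for c in clusters]  # Step 2: mean of each cluster
    while True:
        new_clusters = [[] for _ in means]
        for c in clusters:
            for x in c:
                # Step 3: closest mean (ties go to the lower cluster index)
                nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
                new_clusters[nearest].append(x)
        if new_clusters == clusters:  # no object moved: stop
            return clusters, means
        clusters = new_clusters
        # Step 4: back to Step 2 with the new clusters
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]

# Step 1 (fixed here instead of random): a hypothetical starting partition.
final_clusters, final_means = kmeans_from_partition([[1, 9], [2, 10]])
print(final_clusters)  # [[1, 2], [9, 10]]
print(final_means)     # [1.5, 9.5]
```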

Example 1
Given: {2,3,6,8,9,12,15,18,22}. Assume k=3.
Solution:
Randomly partition the given data set:
  K1 = 2,8,15        mean = 8.33
  K2 = 3,9,18        mean = 10
  K3 = 6,12,22       mean = 13.33
Reassign:
  K1 = 2,3,6,8,9     mean = 5.6
  K2 = (empty)       mean = 0
  K3 = 12,15,18,22   mean = 16.75
Reassign:
  K1 = 3,6,8,9       mean = 6.5
  K2 = 2             mean = 2
  K3 = 12,15,18,22   mean = 16.75
Reassign:
  K1 = 6,8,9         mean = 7.67
  K2 = 2,3           mean = 2.5
  K3 = 12,15,18,22   mean = 16.75
Reassign (12 is now closer to 7.67 than to 16.75, so it moves to K1):
  K1 = 6,8,9,12      mean = 8.75
  K2 = 2,3           mean = 2.5
  K3 = 15,18,22      mean = 18.33
Reassign:
  K1 = 6,8,9,12      mean = 8.75
  K2 = 2,3           mean = 2.5
  K3 = 15,18,22      mean = 18.33
STOP (no change)

Example 2
Given: {2,4,10,12,3,20,30,11,25}. Assume k=2.
Solution:
K1 = 2,3,4,10,11,12
K2 = 20,25,30
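The solution above can be checked with a short Python sketch of k-means that starts from k samples used as initial means. The slide does not say how the run was initialized, so taking the first two values (2 and 4) as the initial means is an assumption, as is the function name `kmeans_1d`.

```python
def kmeans_1d(data, initial_means):
    """1-D k-means starting from given initial means.

    Returns the sorted members of each cluster and the final means.
    Assumption: a cluster that becomes empty keeps its previous mean.
    """
    means = list(initial_means)
    assignment = None
    while True:
        # Assign every value to the cluster with the closest mean
        # (ties go to the lower cluster index).
        new_assignment = [
            min(range(len(means)), key=lambda i: abs(x - means[i])) for x in data
        ]
        if new_assignment == assignment:  # no change: stop
            break
        assignment = new_assignment
        # Update each cluster's mean.
        for i in range(len(means)):
            members = [x for x, a in zip(data, assignment) if a == i]
            if members:
                means[i] = sum(members) / len(members)
    clusters = [
        sorted(x for x, a in zip(data, assignment) if a == i)
        for i in range(len(means))
    ]
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, initial_means=data[:2])
print(clusters)  # [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
print(means)     # [7.0, 25.0]
```

A different choice of initial means can lead to a different final grouping, which is one reason the starting samples are usually picked at random and the run repeated.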

Advantages
• K-means is relatively scalable and efficient in processing large data sets
• The computational complexity of the algorithm is O(nkt)
    n: the total number of objects
    k: the number of clusters
    t: the number of iterations
    Normally: k << n and t << n
Disadvantages
• Can be applied only when the mean of a cluster is defined
• Users need to specify k
• K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different size
• It is sensitive to noise and outlier data points (they can influence the mean value)

K-Means (graph)
Step 1: Form k centroids, randomly
Step 2: Calculate the distance between the centroids and each object
  Use the Euclidean distance to determine the minimum distance:
  $d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Step 3: Assign objects based on minimum distance to the k clusters
Step 4: Calculate the centroid of each cluster using
  $C = \left( \frac{x_1 + x_2 + \dots + x_n}{n}, \frac{y_1 + y_2 + \dots + y_n}{n} \right)$
Go to step 2. Repeat until no change in centroids.
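The two formulas used in Steps 2 and 4 translate directly into small Python helpers. The function names and the sample points are hypothetical and only serve to show the arithmetic.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two 2-D points a = (x1, y1) and b = (x2, y2)."""
    return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

def centroid(points):
    """Centroid of a non-empty list of 2-D points: the mean of x and the mean of y."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

distance = euclidean((0, 0), (3, 4))
center = centroid([(0, 0), (2, 0), (1, 3)])
print(distance)  # 5.0
print(center)    # (1.0, 1.0)
```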

Example 1
There are four types of medicines and each has two attributes, as shown below. Find a way to group them into 2 groups based on their features.

Medicine   Weight   pH
A          1        1
B          2        1
C          4        3
D          5        4

Solution
Plot the values on a graph.
Mark any k centroids.
Calculate the Euclidean distance of each point (A, B, C, D) from the two centroids:

D = 0   1   3.61   5
    1   0   2.83   4.24

Based on minimum distance, we assign points to clusters:
K1 = A
K2 = B, C, D
Calculate the new centroid of K2:
$C = \left( \frac{2+4+5}{3}, \frac{1+3+4}{3} \right) = \left( \frac{11}{3}, \frac{8}{3} \right)$
Mark the new centroids.
Continue the iteration until there is no change in the centroids or clusters.

Final solution
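Carrying the iteration through in code is one way to check the final grouping. The sketch below assumes the initial centroids sit at A = (1, 1) and B = (2, 1), which is what the zero entries in the distance matrix suggest; the function name `kmeans_2d` and the rule that an emptied cluster keeps its centroid are also assumptions.

```python
import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

def kmeans_2d(points, centroids):
    """2-D k-means on named points, starting from the given centroids."""
    assignment = None
    while True:
        # Steps 2-3: assign each point to the nearest centroid.
        new_assignment = {
            name: min(
                range(len(centroids)),
                key=lambda i: math.hypot(p[0] - centroids[i][0], p[1] - centroids[i][1]),
            )
            for name, p in points.items()
        }
        if new_assignment == assignment:  # no change: stop
            return assignment, centroids
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean of its members.
        updated = []
        for i, c in enumerate(centroids):
            members = [points[n] for n, a in assignment.items() if a == i]
            if members:
                updated.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                updated.append(c)  # assumption: an empty cluster keeps its centroid
        centroids = updated

groups, final_centroids = kmeans_2d(points, [(1, 1), (2, 1)])
print(groups)           # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
print(final_centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```

Under these assumptions the run ends with A and B in one group and C and D in the other.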

Example 2
Use the K-means algorithm to create two clusters. Given:

Example 3
Group the below points into 3 clusters
