Definition
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Inter-cluster distances are maximized.
• Intra-cluster distances are minimized.
Applications
• Group related documents for browsing
• Group genes and proteins that have similar functionality
• Group stocks with similar price fluctuations
• Reduce the size of large data sets
• Group users with similar buying mentalities
Clustering is ambiguous
There is no correct or incorrect solution for clustering.
How many clusters?
[Figure: the same points grouped as Two Clusters, Four Clusters, and Six Clusters]
Challenges faced
• Scalability
• Ability to deal with different types of attributes
• Noise and outliers
• Complex shapes and types of data
• Incremental clustering and insensitivity to the order of input records
• High dimensionality
• Constraint-based clustering
• Interpretability and usability
Types of Data
Data Matrix
• n objects with p variables.
• The structure is in the form of a relational table, or n x p matrix.
Dissimilarity Matrix
• Object-by-object structure.
• Stores a collection of proximities that are available for all pairs of the n objects.
• d(i, j) is the dissimilarity between objects i and j.
• d(i, j) = d(j, i) and d(i, i) = 0
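The relationship between the two structures can be sketched in a few lines of Python. This is only an illustration: the Euclidean distance is assumed as the dissimilarity measure, and the function name `dissimilarity_matrix` and the sample data are hypothetical.

```python
import math

def dissimilarity_matrix(data):
    """Build the n x n matrix of pairwise Euclidean distances from an n x p data matrix."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]  # the diagonal stays 0: d(i, i) = 0
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            d[i][j] = d[j][i] = dist  # symmetry: d(i, j) = d(j, i)
    return d

# Hypothetical 3 x 2 data matrix: 3 objects, 2 variables each.
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the upper triangle is computed; the lower triangle is filled in by symmetry.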
Types of Data
• Interval-Scaled Variables
• Binary Variables
• Nominal
• Ordinal
• Ratio-Scaled Variables
• Variables of Mixed Types
Interval-Scaled Variables
Interval-scaled variables contd…
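Binary variables
• A binary variable has only two states: 0 and 1.
• Dissimilarity between two binary variables is computed from a 2 x 2 contingency table:

                  Object j
                  1      0      sum
Object i   1      q      r      q+r
           0      s      t      s+t
           sum    q+s    r+t    p

Here q counts the variables that are 1 for both objects, r those that are 1 for object i and 0 for object j, s those that are 0 for object i and 1 for object j, and t those that are 0 for both; p = q + r + s + t.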
Dissimilarity between binary variables

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       Y       N        N        N        N

D(Jack, Mary) = 0.33
D(Jack, Jim) = 0.67
D(Mary, Jim) = 0.75
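The slide does not state which coefficient produced these three values. A minimal sketch, assuming the asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s), with Y and P coded as 1, N coded as 0, and the Gender column left out, reproduces them. The function and variable names are hypothetical.

```python
def binary_dissimilarity(x, y):
    """Asymmetric binary dissimilarity (r + s) / (q + r + s) for two 0/1 vectors.

    Assumes at least one variable is 1 in x or y; otherwise the denominator is 0.
    """
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # 1 in both
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x only
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 1 in y only
    return (r + s) / (q + r + s)

def encode(row):
    """Assumed coding: Y (yes) and P (positive) become 1, N becomes 0."""
    return [1 if v in ("Y", "P") else 0 for v in row]

# Fever, Cough, Test-1 .. Test-4 (Gender is deliberately excluded).
patients = {
    "Jack": encode(["Y", "N", "P", "N", "N", "N"]),
    "Mary": encode(["Y", "N", "P", "N", "P", "N"]),
    "Jim":  encode(["Y", "Y", "N", "N", "N", "N"]),
}

for a, b in [("Jack", "Mary"), ("Jack", "Jim"), ("Mary", "Jim")]:
    print(f"D({a}, {b}) = {binary_dissimilarity(patients[a], patients[b]):.2f}")
```

The negative matches (t) do not enter the formula, which is why this form is usually chosen when a positive outcome carries more information than a negative one.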
Categorical Variables
Ordinal
Similar to nominal variables, but the values are ordered in some sequence.
E.g., the rank of employees can be assistant, associate, or full.
Ratio-Scaled Variables
Makes a positive measurement on a non-linear scale.
E.g., growth of bacteria, radioactivity.
Variables of Mixed Types
Other types of data
Types of clustering
• Hierarchical clustering (BIRCH): a set of nested clusters organized as a hierarchical tree
• Partitional clustering (k-means, k-medoids): a division of data objects into non-overlapping (distinct) subsets (i.e., clusters) such that each data object is in exactly one subset
• Density-based (DBSCAN): based on density functions
• Grid-based (STING): based on a multiple-level granularity structure
• Model-based (SOM): hypothesize a model for each of the clusters and find the best fit of the data to the given model
Partitional Clustering
[Figure: Original Points; A Partitional Clustering]
Hierarchical Clustering
[Figure: points p1, p2, p3, p4 shown as a Traditional Hierarchical Clustering with its Traditional Dendrogram, and as a Non-traditional Hierarchical Clustering with its Non-traditional Dendrogram]
Clustering Algorithms
• Partitional: K-means, K-medoids
• Hierarchical: Agglomerative, Divisive
K-Means Algorithm
Each cluster is represented by the mean value of the objects in the cluster.
Input: set of objects (n), number of clusters (k)
Output: set of k clusters
Algorithm:
  Randomly select k samples and mark them as the initial cluster means.
  Repeat
    Assign (or reassign) each sample to the cluster to which it is most similar, based on the mean of the cluster.
    Update each cluster's mean.
  until no change.
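The loop above can be sketched for one-dimensional data as follows. This is an illustrative sketch, not a reference implementation: the function name `k_means`, the `initial_means` and `seed` parameters, the tie-breaking rule (lowest cluster index wins), and the choice to keep the old mean when a cluster becomes empty are all assumptions.

```python
import random

def k_means(points, k, initial_means=None, seed=0):
    """1-D k-means; returns (clusters, means)."""
    rng = random.Random(seed)
    # Pick k samples as initial means unless the caller supplies them.
    means = list(initial_means) if initial_means is not None else rng.sample(points, k)
    assignment = None
    while True:
        # Assign each point to the cluster with the nearest mean.
        new_assignment = [min(range(k), key=lambda c: abs(p - means[c])) for p in points]
        if new_assignment == assignment:  # no change: stop
            break
        assignment = new_assignment
        # Update each cluster's mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous mean (assumption)
                means[c] = sum(members) / len(members)
    clusters = [[p for p, a in zip(points, assignment) if a == c] for c in range(k)]
    return clusters, means

points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
# Initial means 2 and 4 are a hypothetical choice (the first two samples).
clusters, means = k_means(points, 2, initial_means=[2, 4])
print(clusters, means)
```

With this particular starting choice the sketch ends with the groups {2, 3, 4, 10, 11, 12} and {20, 25, 30}; other starting means can lead to a different result.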
K-Means (Array)
Step 1: Randomly assign objects to k clusters.
Step 2: Find the mean of each cluster.
Step 3: Re-assign objects to the cluster with the closest mean.
Step 4: Go to Step 2.
Repeat until no change.
Example 1
Given: {2, 3, 6, 8, 9, 12, 15, 18, 22}. Assume k = 3.
Solution:
Randomly partition the given data set:
  K1 = 2, 8, 15          mean = 8.33
  K2 = 3, 9, 18          mean = 10
  K3 = 6, 12, 22         mean = 13.33
Reassign:
  K1 = 2, 3, 6, 8, 9     mean = 5.6
  K2 = (empty)           mean = 0
  K3 = 12, 15, 18, 22    mean = 16.75
Reassign:
  K1 = 3, 6, 8, 9        mean = 6.5
  K2 = 2                 mean = 2
  K3 = 12, 15, 18, 22    mean = 16.75
Reassign:
  K1 = 6, 8, 9           mean = 7.67
  K2 = 2, 3              mean = 2.5
  K3 = 12, 15, 18, 22    mean = 16.75
Reassign (12 is now closer to 7.67 than to 16.75, so it moves to K1):
  K1 = 6, 8, 9, 12       mean = 8.75
  K2 = 2, 3              mean = 2.5
  K3 = 15, 18, 22        mean = 18.33
Reassign: no change.
STOP
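A hand trace like this is easy to get wrong, so it can help to check it mechanically. The sketch below starts from the same initial partition and, following the example's convention, gives an empty cluster the mean 0. The function names are hypothetical.

```python
def mean_or_zero(values):
    """Mean of a list; an empty cluster gets mean 0, as in the worked example."""
    return sum(values) / len(values) if values else 0.0

def k_means_from_partition(partition):
    """Run 1-D k-means from an initial partition; returns every partition visited."""
    points = sorted(p for cluster in partition for p in cluster)
    k = len(partition)
    history = [[sorted(c) for c in partition]]
    while True:
        means = [mean_or_zero(c) for c in history[-1]]
        new = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - means[c]))
            new[nearest].append(p)
        if new == history[-1]:  # no change: stop
            return history
        history.append(new)

history = k_means_from_partition([[2, 8, 15], [3, 9, 18], [6, 12, 22]])
for step, clusters in enumerate(history):
    print(step, clusters, [round(mean_or_zero(c), 2) for c in clusters])
```

Running it gives the final partition {6, 8, 9, 12}, {2, 3}, {15, 18, 22}: the point 12 is 4.33 away from the mean 7.67 but 4.75 away from the mean 16.75, so it joins K1 before the clusters stabilise.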
Example 2
Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}. Assume k = 2.
Solution:
  K1 = 2, 3, 4, 10, 11, 12
  K2 = 20, 25, 30
Advantages
• K-means is relatively scalable and efficient in processing large data sets.
• The computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally k << n and t << n.
Disadvantages
• Can be applied only when the mean of a cluster is defined.
• Users need to specify k.
• K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
• It is sensitive to noise and outlier data points (they can influence the mean value).
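The last disadvantage can be seen with a small numeric illustration. The cluster values and the outlier 100 are hypothetical; the median is shown only as a point of contrast and is not part of k-means.

```python
from statistics import mean, median

cluster = [12, 15, 18, 22]
with_outlier = cluster + [100]  # hypothetical outlier added to the cluster

print(mean(cluster), median(cluster))            # 16.75 16.5
print(mean(with_outlier), median(with_outlier))  # 33.4 18
```

A single extreme value roughly doubles the mean, so the cluster centre is pulled away from where most of the points lie, while the median barely moves.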
K-Means (graph)
Step 1: Form k centroids, randomly.
Step 2: Calculate the distance between the centroids and each object. Use the Euclidean distance to determine the minimum distance:
  $d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Step 3: Assign objects to the k clusters based on minimum distance.
Step 4: Calculate the centroid of each cluster using
  $C = \left( \frac{x_1 + x_2 + \dots + x_n}{n}, \frac{y_1 + y_2 + \dots + y_n}{n} \right)$
Go to Step 2. Repeat until there is no change in the centroids.
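The building blocks of one iteration (distance, assignment, centroid update) might look like this in Python. The function names and the sample points are hypothetical, and ties are broken in favour of the lowest centroid index, which is an assumption.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two 2-D points a = (x1, y1) and b = (x2, y2)."""
    return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

def assign(points, centroids):
    """Index of the nearest centroid for each point (Steps 2 and 3)."""
    return [min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points]

def centroid(points):
    """Mean of the x coordinates and mean of the y coordinates (Step 4)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

# Hypothetical data: two obvious groups and two starting centroids.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids = [(0, 0), (10, 10)]

labels = assign(points, centroids)
new_centroids = [centroid([p for p, l in zip(points, labels) if l == c])
                 for c in range(len(centroids))]
print(labels, new_centroids)
```

Repeating `assign` and `centroid` until the labels stop changing gives the full loop; this fragment assumes no cluster becomes empty.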
Example 1
There are four types of medicines and each has two attributes, as shown below. Find a way to group them into 2 groups based on their features.

Medicine   Weight   pH
A          1        1
B          2        1
C          4        3
D          5        4
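Solution
Plot the values on a graph.
Mark any k centroids.
Calculate the Euclidean distance of each point from the centroids (one row per centroid, one column per medicine):

          A     B     C      D
D =  [    0     1     3.61   5     ]
     [    1     0     2.83   4.24  ]

The zero entries show that the initial centroids used here coincide with the points A and B.
Based on minimum distance, we assign points to clusters:
  K1 = A
  K2 = B, C, D
Calculate the new centroids. K1 contains only A, so its centroid stays at (1, 1). For K2:
  $C_2 = \left( \frac{2 + 4 + 5}{3}, \frac{1 + 3 + 4}{3} \right) = \left( \frac{11}{3}, \frac{8}{3} \right)$
Mark the new centroids.
Continue the iteration until there is no change in the centroids or clusters.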
Final solution
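The iteration can be carried through to convergence with a short sketch. It assumes the initial centroids are the points A and B, which is what the distances 0, 1, 3.61, 5 and 1, 0, 2.83, 4.24 imply; the function name `k_means_2d` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

def k_means_2d(points, centroids):
    """points: dict name -> (x, y); centroids: list of (x, y).

    Returns (clusters, centroids) once the clusters stop changing.
    """
    centroids = list(centroids)
    previous = None
    while True:
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for name, p in points.items():
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(name)
        if clusters == previous:  # no change: stop
            return clusters, centroids
        previous = clusters
        # Recompute each centroid as the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its centroid (assumption)
                xs = [points[m][0] for m in members]
                ys = [points[m][1] for m in members]
                centroids[c] = (sum(xs) / len(xs), sum(ys) / len(ys))

clusters, centroids = k_means_2d(medicines, [medicines["A"], medicines["B"]])
print(clusters)   # [['A', 'B'], ['C', 'D']]
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```

Under these assumptions the second pass moves B into the first cluster, and the third pass changes nothing, leaving {A, B} and {C, D} with centroids (1.5, 1) and (4.5, 3.5).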
Example 2
Use the K-means algorithm to create two clusters. Given:
Example 3
Group the below points into 3 clusters.