What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such...
Date posted: 20-Dec-2015
What is Cluster Analysis
• Clustering
  – Partitioning a data set into several groups (clusters) such that
    – Homogeneity: objects belonging to the same cluster are similar to each other
    – Separation: objects belonging to different clusters are dissimilar to each other
• Three fundamental elements of clustering
  – The set of objects
  – The set of attributes
  – Distance measure
Supervised versus Unsupervised Learning
• Supervised learning (classification)
  – Supervision: training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – Class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the goal is to establish the existence of classes or clusters in the data
What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
  – high intra-class similarity
  – low inter-class similarity
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Discovery of clusters with arbitrary shape
• Insensitivity to the order of input records
• Ability to handle high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Application Examples
• A stand-alone tool: explore data distribution
• A preprocessing step for other algorithms
• Pattern recognition, spatial data analysis, image processing, market research, WWW, …
  – Cluster documents
  – Cluster web log data to discover groups of similar access patterns
Co-expressed Genes
[Figure: panels labeled "Gene Expression Data Matrix", "Gene Expression Patterns", and "Co-expressed Genes"]
Why look for co-expressed genes?
• Co-expression indicates co-function
• Co-expression also indicates co-regulation
Gene-based Clustering
[Figure: three plots of expression value/level (y-axis, from -1.5 to 1.5) against time points (x-axis)]
Examples of co-expressed genes and coherent patterns in gene expression data
Iyer’s data [2]
[2] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.
Data Matrix
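• For memory-based clustering
  – Also called object-by-variable structure
• Represents n objects with p variables (attributes, measures)
  – A relational table

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$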
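Two-way Clustering of Microarray Data

| | sample 1 | sample 2 | sample 3 | sample 4 | sample … |
|---|---|---|---|---|---|
| gene 1 | 0.13 | 0.72 | 0.1 | 0.57 | |
| gene 2 | 0.34 | 1.58 | 1.05 | 1.15 | |
| gene 3 | 0.43 | 1.1 | 0.97 | 1 | |
| gene 4 | 1.22 | 0.97 | 1 | 0.85 | |
| gene 5 | -0.89 | 1.21 | 1.29 | 1.08 | |
| gene 6 | 1.1 | 1.45 | 1.44 | 1.12 | |
| gene 7 | 0.83 | 1.15 | 1.1 | 1 | |
| gene 8 | 0.87 | 1.32 | 1.35 | 1.13 | |
| gene 9 | -0.33 | 1.01 | 1.38 | 1.21 | |
| gene 10 | 0.10 | 0.85 | 1.03 | 1 | |
| gene … | | | | | |
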
• Clustering genes
  – Samples are attributes
  – Find genes with similar function
• Clustering samples
  – Genes are attributes
  – Find samples with similar phenotype, e.g. cancers
  – Feature selection
    – Informative genes
    – Curse of dimensionality
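
The two directions of clustering differ only in which axis of the matrix supplies the objects. A minimal sketch, assuming the first four genes and four samples of the table above are held as a list of rows (the names `gene_objects` and `sample_objects` are illustrative, not from the slides):

```python
# First four genes (rows) and four samples (columns) from the table above.
data = [
    [0.13, 0.72, 0.10, 0.57],  # gene 1
    [0.34, 1.58, 1.05, 1.15],  # gene 2
    [0.43, 1.10, 0.97, 1.00],  # gene 3
    [1.22, 0.97, 1.00, 0.85],  # gene 4
]


def transpose(matrix):
    """Swap rows and columns so that columns become the objects."""
    return [list(column) for column in zip(*matrix)]


# Clustering genes: each object is a gene, its attributes are the samples.
gene_objects = data

# Clustering samples: each object is a sample, its attributes are the genes.
sample_objects = transpose(data)

print(sample_objects[0])  # expression of genes 1..4 in sample 1
```

With real microarray data the sample view has far more attributes (genes) than objects, which is where feature selection and the curse of dimensionality come in.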
Dissimilarity Matrix
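• For memory-based clustering
  – Also called object-by-object structure
  – Proximities of pairs of objects
  – d(i,j): dissimilarity between objects i and j
  – Nonnegative
  – Close to 0: similar

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$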
Distance Matrix
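Original Data Matrix

| | s 1 | s 2 | s 3 | s 4 | … |
|---|---|---|---|---|---|
| g 1 | 0.13 | 0.72 | 0.1 | 0.57 | |
| g 2 | 0.34 | 1.58 | 1.05 | 1.15 | |
| g 3 | 0.43 | 1.1 | 0.97 | 1 | |
| g 4 | 1.22 | 0.97 | 1 | 0.85 | |
| g 5 | -0.89 | 1.21 | 1.29 | 1.08 | |
| g 6 | 1.1 | 1.45 | 1.44 | 1.12 | |
| g 7 | 0.83 | 1.15 | 1.1 | 1 | |
| g 8 | 0.87 | 1.32 | 1.35 | 1.13 | |
| g 9 | -0.33 | 1.01 | 1.38 | 1.21 | |
| g 10 | 0.10 | 0.85 | 1.03 | 1 | |
| … | | | | | |

Distance Matrix

| | g 1 | g 2 | g 3 | g 4 | … |
|---|---|---|---|---|---|
| g 1 | 0 | D(1,2) | D(1,3) | D(1,4) | |
| g 2 | | 0 | D(2,3) | D(2,4) | |
| g 3 | | | 0 | D(3,4) | |
| g 4 | | | | 0 | |
| … | | | | | |
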
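
The step from the data matrix to the distance matrix can be sketched as follows. The slide does not say which distance D is, so Euclidean distance is an assumption here, and only the first four genes and four samples are used; `distance_matrix` is an illustrative helper name.

```python
import math

# First four genes (rows) over four samples (columns) of the original data matrix.
genes = [
    [0.13, 0.72, 0.10, 0.57],  # g 1
    [0.34, 1.58, 1.05, 1.15],  # g 2
    [0.43, 1.10, 0.97, 1.00],  # g 3
    [1.22, 0.97, 1.00, 0.85],  # g 4
]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed choice of D)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(rows):
    """Return the full symmetric matrix of pairwise distances between rows."""
    n = len(rows)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each pair once and mirror it, since D(i,j) = D(j,i).
            D[i][j] = D[j][i] = euclidean(rows[i], rows[j])
    return D


D = distance_matrix(genes)
for row in D:
    print("  ".join(f"{d:.2f}" for d in row))
```

Because the matrix is symmetric with a zero diagonal, only the upper (or lower) triangle needs to be stored, which is why the slide shows a triangular layout.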
How Good Is the Clustering?
• Dissimilarity/similarity depends on the distance function
  – Different applications have different functions
  – Inter-cluster distance maximization
  – Intra-cluster distance minimization
• Judgment of clustering quality is typically highly subjective
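
One simple way to put numbers on the two criteria is to compare the mean distance between objects in the same cluster with the mean distance between objects in different clusters. This is a minimal sketch under assumptions: the points and labelings are made up, Euclidean distance is used, and `intra_inter` is an illustrative helper, not a standard quality index.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = euclidean(points[i], points[j])
        if labels[i] == labels[j]:
            intra.append(d)
        else:
            inter.append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical data: two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good_labels = [0, 0, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1]   # splits each nearby pair

good_intra, good_inter = intra_inter(points, good_labels)
bad_intra, bad_inter = intra_inter(points, bad_labels)
print(good_intra, good_inter)
print(bad_intra, bad_inter)
```

The labeling that keeps nearby points together has the smaller intra-cluster mean and the larger inter-cluster mean. The numbers still depend on the chosen distance function, which is part of why the judgment stays subjective.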
Types of Data in Clustering
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Interval-valued Variables
• Continuous measurements of a roughly linear scale
  – Weight, height, latitude and longitude coordinates, temperature, etc.
• Effect of measurement units in attributes
  – Smaller unit → larger variable range → larger effect on the result
  – Standardization + background knowledge
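
The effect of units can be seen on a toy example. The three people below are hypothetical, and Euclidean distance is an assumed choice; the only change between the two runs is that height is expressed in centimeters instead of meters.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical people described by (height in meters, weight in kilograms).
people_m = {"A": (1.60, 60.0), "B": (1.90, 61.0), "C": (1.61, 64.0)}

# The same people with height re-expressed in centimeters.
people_cm = {name: (h * 100, w) for name, (h, w) in people_m.items()}


def nearest_to_a(people):
    """Name of the person closest to A under Euclidean distance."""
    others = (name for name in people if name != "A")
    return min(others, key=lambda name: euclidean(people["A"], people[name]))


nearest_in_m = nearest_to_a(people_m)
nearest_in_cm = nearest_to_a(people_cm)
print(nearest_in_m, nearest_in_cm)
```

In meters the height differences are tiny, so weight dominates and B (1 kg away from A) is nearest. In centimeters the height range is 100 times larger, so height dominates and C becomes nearest, even though nothing about the people changed.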
Standardization
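• Calculate the mean absolute deviation

$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right),
\qquad
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
$$

  – The deviations from the mean are not squared, so the effect of outliers is reduced.
• Calculate the standardized measurement (z-score)

$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$

• Mean absolute deviation is more robust than the standard deviation
  – The effect of outliers is reduced but remains detectable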
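
The two steps can be sketched for a single variable f as below. The sample values are made up, and the standard-deviation variant is included only for comparison; both function names are illustrative.

```python
import math


def standardize(values):
    """z-scores of one variable, using the mean absolute deviation s_f."""
    n = len(values)
    m = sum(values) / n                      # m_f: mean of the variable
    s = sum(abs(x - m) for x in values) / n  # s_f: mean absolute deviation
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m) / s for x in values]


def standardize_std(values):
    """Same idea, but scaled by the population standard deviation (for comparison)."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in values) / n)
    return [(x - m) / sd for x in values]


# Hypothetical measurements with one outlier (100.0).
values = [1.0, 2.0, 3.0, 4.0, 100.0]
z = standardize(values)
z_std = standardize_std(values)
print(z[-1], z_std[-1])
```

Here m_f = 22 and s_f = 31.2, so the outlier gets a z-score of about 2.5, against about 2.0 when dividing by the standard deviation. Squaring inflates the scale factor more than absolute values do, so the outlier stands out more clearly after standardizing with the mean absolute deviation.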
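Minkowski Distance
• Minkowski distance: a generalization

$$
d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q} \qquad (q > 0)
$$

• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance
[Figure: X_i = (1, 7) and X_j = (7, 1); the points differ by 6 in each coordinate, giving a distance of 8.48 for q = 2 and 12 for q = 1]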
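
The formula translates directly into code. A minimal sketch, with `minkowski` as an illustrative function name, applied to the two points from the figure:

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if q <= 0:
        raise ValueError("q must be positive")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


x_i = (1, 7)
x_j = (7, 1)

manhattan = minkowski(x_i, x_j, 1)  # q = 1: |1 - 7| + |7 - 1|
euclid = minkowski(x_i, x_j, 2)     # q = 2: sqrt(6^2 + 6^2)
print(manhattan, euclid)
```

For these points the Manhattan distance is 6 + 6 = 12 and the Euclidean distance is the square root of 72, roughly 8.485, matching the values in the figure.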
Properties of Minkowski Distance
• Nonnegative: d(i,j) ≥ 0
• The distance of an object to itself is 0
  – d(i,i) = 0
• Symmetric: d(i,j) = d(j,i)
• Triangular inequality
  – d(i,j) ≤ d(i,k) + d(k,j)
[Figure: triangle with vertices i, j, and k]
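
These properties can be spot-checked numerically on a handful of points. The sketch below is a brute-force check, not a proof; the points and the helper `check_properties` are illustrative, and a small tolerance absorbs floating-point rounding.

```python
from itertools import product


def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def check_properties(points, q, tol=1e-12):
    """Return True if all four properties hold for every combination of points."""
    for a in points:
        if minkowski(a, a, q) != 0:  # distance of an object to itself
            return False
    for a, b in product(points, repeat=2):
        d_ab = minkowski(a, b, q)
        if d_ab < 0:  # nonnegativity
            return False
        if abs(d_ab - minkowski(b, a, q)) > tol:  # symmetry
            return False
    for a, b, c in product(points, repeat=3):
        # triangular inequality: d(a,b) <= d(a,c) + d(c,b)
        if minkowski(a, b, q) > minkowski(a, c, q) + minkowski(c, b, q) + tol:
            return False
    return True


points = [(1, 7), (7, 1), (0, 0), (3, 4)]
holds_q1 = check_properties(points, 1)
holds_q2 = check_properties(points, 2)

# With q < 1 the triangular inequality can fail, e.g. for these three points.
holds_q_half = check_properties([(0, 0), (1, 0), (0, 1)], 0.5)
print(holds_q1, holds_q2, holds_q_half)
```

The check passes for q = 1 and q = 2. For q = 0.5 the distance between (1, 0) and (0, 1) is 4, while the detour through (0, 0) costs only 2, so the triangular inequality is only guaranteed when q ≥ 1.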
Major Clustering Approaches
• Partitioning algorithms: construct various partitions and then evaluate them by some criterion
• Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
Clustering Algorithms
• If we "cluster" the clustering algorithms
  – Partition-based
    – Centroid-based: K-means
    – Medoid-based: PAM, CLARA, CLARANS
  – Hierarchical clustering
  – Density-based
  – Model-based
  – …
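
As an illustration of the centroid-based branch, here is a minimal sketch of the usual K-means iteration (assign each point to its nearest centroid, then recompute centroids). The slides only name the algorithm, so the details are assumptions: squared Euclidean distance, the first k points as initial centroids, and made-up data.

```python
def kmeans(points, k, iterations=100):
    """Basic K-means; returns (cluster index per point, list of centroids)."""
    # Naive initialization (an assumption): use the first k points as centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids will not move
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
assignment, centroids = kmeans(points, 2)
print(assignment, centroids)
```

On this data the procedure settles on the two pairs, with centroids (0.0, 0.5) and (5.0, 5.5). Medoid-based methods such as PAM follow the same assign-and-update pattern but restrict each cluster center to be one of the actual objects.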