What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such...

What is Cluster Analysis

• Clustering: partitioning a data set into several groups (clusters) such that

– Homogeneity: objects belonging to the same cluster are similar to each other

– Separation: objects belonging to different clusters are dissimilar to each other

• Three fundamental elements of clustering

– The set of objects

– The set of attributes

– Distance measure

Supervised versus Unsupervised Learning

• Supervised learning (classification)

– Supervision: training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

– New data are classified based on the training set

• Unsupervised learning (clustering)

– Class labels of the training data are unknown

– Given a set of measurements, observations, etc., the task is to establish the existence of classes or clusters in the data

What Is Good Clustering?

• A good clustering method produces high-quality clusters with

– high intra-class similarity

– low inter-class similarity

• Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

• Quality of a clustering result depends on both the similarity measure used by the method and its implementation

Requirements of Clustering in Data Mining

• Scalability

• Ability to deal with different types of attributes

• Minimal requirements for domain knowledge to determine input parameters

• Able to deal with noise and outliers

• Discovery of clusters with arbitrary shape

• Insensitive to order of input records

• High dimensionality

• Incorporation of user-specified constraints

• Interpretability and usability

Application Examples

• A stand-alone tool: explore data distribution

• A preprocessing step for other algorithms

• Pattern recognition, spatial data analysis, image processing, market research, WWW, …

– Cluster documents

– Cluster web log data to discover groups of similar access patterns

Co-expressed Genes

[Figure: gene expression data matrix and the corresponding gene expression patterns]

Why look for co-expressed genes? Co-expression indicates co-function; co-expression also indicates co-regulation.

Gene-based Clustering

[Figure: examples of co-expressed genes and coherent patterns in gene expression data, plotted against time points. Iyer's data [2]]

[2] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.

Data Matrix

• For memory-based clustering

– Also called object-by-variable structure

• Represents n objects with p variables (attributes, measures)

– A relational table
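Written out, the object-by-variable structure can be pictured as an n × p matrix with one row per object and one column per variable. The symbol x_{if}, for the value of variable f on object i, is notation assumed here for illustration.

```latex
\[
X =
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
```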

Two-way Clustering of Microarray Data

           sample 1  sample 2  sample 3  sample 4  sample …
gene 1       0.13      0.72      0.1       0.57
gene 2       0.34      1.58      1.05      1.15
gene 3       0.43      1.1       0.97      1
gene 4       1.22      0.97      1         0.85
gene 5      -0.89      1.21      1.29      1.08
gene 6       1.1       1.45      1.44      1.12
gene 7       0.83      1.15      1.1       1
gene 8       0.87      1.32      1.35      1.13
gene 9      -0.33      1.01      1.38      1.21
gene 10      0.10      0.85      1.03      1
gene …

• Clustering genes

– Samples are attributes

– Find genes with similar function

• Clustering samples

– Genes are attributes

– Find samples with similar phenotype, e.g. cancers

– Feature selection

– Informative genes

– Curse of dimensionality
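The two directions of clustering differ only in which dimension of the matrix is treated as the objects. The following is a minimal sketch of that idea, using the first three genes of the table above; the names `genes_by_samples`, `transpose`, `gene_objects`, and `sample_objects` are illustrative assumptions, not part of the slides.

```python
# Rows are genes, columns are samples (first three genes of the table above).
genes_by_samples = [
    [0.13, 0.72, 0.10, 0.57],  # gene 1
    [0.34, 1.58, 1.05, 1.15],  # gene 2
    [0.43, 1.10, 0.97, 1.00],  # gene 3
]


def transpose(matrix):
    """Swap rows and columns so the other dimension becomes the objects."""
    return [list(column) for column in zip(*matrix)]


# Clustering genes: each gene is an object described by its sample values.
gene_objects = genes_by_samples

# Clustering samples: each sample is an object described by its gene values.
sample_objects = transpose(genes_by_samples)

print(len(gene_objects), "genes with", len(gene_objects[0]), "attributes each")
print(len(sample_objects), "samples with", len(sample_objects[0]), "attributes each")
```

With thousands of genes and only a few samples, the transposed view has very many attributes per object, which is where feature selection and the curse of dimensionality come in.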

Dissimilarity Matrix

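• For memory-based clustering

– Also called object-by-object structure

– Proximities of pairs of objects

– d(i,j): dissimilarity between objects i and j

– Nonnegative

– Close to 0: similar

\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]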

Distance Matrix

Original Data Matrix

       s 1    s 2    s 3    s 4    …
g 1    0.13   0.72   0.1    0.57
g 2    0.34   1.58   1.05   1.15
g 3    0.43   1.1    0.97   1
g 4    1.22   0.97   1      0.85
g 5   -0.89   1.21   1.29   1.08
g 6    1.1    1.45   1.44   1.12
g 7    0.83   1.15   1.1    1
g 8    0.87   1.32   1.35   1.13
g 9   -0.33   1.01   1.38   1.21
g 10   0.10   0.85   1.03   1

Distance Matrix

       g 1    g 2      g 3      g 4      …
g 1    0      D(1,2)   D(1,3)   D(1,4)
g 2           0        D(2,3)   D(2,4)
g 3                    0        D(3,4)
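The step from the data matrix to the distance matrix can be sketched as follows. The slides do not say which distance function D stands for, so Euclidean distance is assumed here; `euclidean` and `distance_matrix` are hypothetical helper names.

```python
import math

# First four genes of the data matrix, one row per gene (samples s1..s4).
data = {
    "g1": [0.13, 0.72, 0.10, 0.57],
    "g2": [0.34, 1.58, 1.05, 1.15],
    "g3": [0.43, 1.10, 0.97, 1.00],
    "g4": [1.22, 0.97, 1.00, 0.85],
}


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors (assumed choice of D)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(rows):
    """Map every ordered pair of object names to the distance between them."""
    names = list(rows)
    return {(a, b): euclidean(rows[a], rows[b]) for a in names for b in names}


D = distance_matrix(data)
for a in data:
    print(a, " ".join(f"{D[(a, b)]:.2f}" for b in data))
```

Because D(i,j) = D(j,i) and D(i,i) = 0, only one triangle of the matrix needs to be stored, which is why the slide shows the upper triangle only.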

How Good Is the Clustering?

• Dissimilarity/similarity depends on the distance function

– Different applications have different functions

– Inter-cluster distance maximization

– Intra-cluster distance minimization

• Judgment of clustering quality is typically highly subjective
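One simple way to put numbers on the two criteria is to compare the average distance between objects in the same cluster with the average distance between objects in different clusters. This is only a sketch under assumptions: Euclidean distance is assumed, the toy points and labels are made up, and `mean_intra_inter` is a hypothetical helper, not a standard quality index.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(euclidean(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]

intra, inter = mean_intra_inter(points, labels)
print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A clustering that does well on both criteria shows a small first value and a large second one. The helper assumes at least one pair of each kind exists; otherwise the averages are undefined.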

Types of Data in Clustering

• Interval-scaled variables

• Binary variables

• Nominal, ordinal, and ratio variables

• Variables of mixed types

Interval-valued Variables

• Continuous measurements on a roughly linear scale

– Weight, height, latitude and longitude coordinates, temperature, etc.

• Effect of measurement units in attributes

– Smaller unit → larger variable range → larger effect on the result

– Standardization + background knowledge

Standardization

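• Calculate the mean absolute deviation

\[
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right),
\qquad
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
\]

– The deviations from the mean are not squared, so the effect of outliers is reduced.

• Calculate the standardized measurement (z-score)

\[
z_{if} = \frac{x_{if} - m_f}{s_f}
\]

• Mean absolute deviation is more robust than the standard deviation

– The effect of outliers is reduced but remains detectable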
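The standardization of a single variable f can be sketched as below, where `m_f` is the mean of the variable, `s_f` its mean absolute deviation, and the result is the list of z-scores. The function name `standardize` and the sample values are assumptions made for illustration.

```python
def standardize(values):
    """z-scores of one variable, using the mean absolute deviation as the scale."""
    n = len(values)
    m_f = sum(values) / n                         # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    return [(x - m_f) / s_f for x in values]


# Hypothetical measurements of one variable; 100.0 is an outlier.
measurements = [1.0, 2.0, 3.0, 4.0, 100.0]
z = standardize(measurements)
print([round(v, 2) for v in z])
```

The outlier still receives the z-score with the largest magnitude, so it remains detectable after standardization. The sketch assumes the values are not all identical, since `s_f` would then be zero.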

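Minkowski Distance

• Minkowski distance: a generalization

\[
d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} \quad (q > 0)
\]

• If q = 2, d is Euclidean distance

• If q = 1, d is Manhattan distance

[Figure: X_i = (1,7) and X_j = (7,1), separated by 6 along each axis; the distance is 12 for q = 1 and 8.48 for q = 2]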
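The Minkowski distance can be sketched in a few lines and checked against the two points from the example, (1,7) and (7,1). The function name `minkowski` and its argument order are assumptions for illustration.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length points."""
    if q <= 0:
        raise ValueError("q must be positive")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


x_i, x_j = (1, 7), (7, 1)
print(minkowski(x_i, x_j, 1))            # Manhattan distance
print(round(minkowski(x_i, x_j, 2), 2))  # Euclidean distance
```

Larger values of q give more weight to the coordinate with the largest difference, so the distance shrinks from the Manhattan value toward the largest single-coordinate difference as q grows.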

Properties of Minkowski Distance

• Nonnegative: d(i,j) ≥ 0

• The distance of an object to itself is 0

– d(i,i) = 0

• Symmetric: d(i,j) = d(j,i)

• Triangle inequality (holds for q ≥ 1)

– d(i,j) ≤ d(i,k) + d(k,j)

Major Clustering Approaches

• Partitioning algorithms: construct various partitions and then evaluate them by some criterion

• Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion

• Density-based: based on connectivity and density functions

• Grid-based: based on a multiple-level granularity structure

Clustering Algorithms

• If we “cluster” the clustering algorithms:

• Partition-based

– Centroid-based: K-means

– Medoid-based: PAM, CLARA, CLARANS

• Hierarchical clustering

• Density-based

• Model-based
