CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 16 - 1 - Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 16 Clustering Techniques


Bioinformatics: Issues and Algorithms

CSE 308-408 • Fall 2007 • Lecture 16

Clustering Techniques


Administrative notes

Your final project / paper proposal is due on Friday, November 9 at 5:00 pm.

The proposal just needs to be a couple of paragraphs telling me the problem area you plan to work on and some of the references you'll probably use.

If there's a possible connection between the work you'd like to do and the topics you've heard Professor Marzillier talk about, I'll discuss your proposal with her to get her feedback and suggestions (e.g., other papers you might read, datasets you might use for testing code you develop, etc.).

I'll send you feedback on your proposal by the middle of the following week – then you're off and running!


Outline

• DNA Microarrays
• Hierarchical Clustering
• K-Means Clustering
• Conservative & Greedy K-Means Clustering
• Corrupted Cliques Problem
• CAST Clustering Algorithm

http://www.bioalgorithms.info


Applications of clustering

Motivation for clustering (from a general perspective):

• Viewing and analyzing vast amounts of biological data in its unstructured entirety can be perplexing.

• It is easier to interpret data if it is organized into clusters that combine similar (i.e., related) data points.

From a biological perspective, applications include:

• Analyzing data from DNA microarray experiments (expression analysis – i.e., determining which genes are switched “on” or “off” under certain conditions of interest).

• Building and understanding phylogenetic (evolutionary) trees based on genomic or other data.

http://www.bioalgorithms.info


Inferring gene functionality

What's the problem?

• Biologists want to know functions of newly-sequenced genes.

• Simply comparing new gene sequence to known DNA sequences often does not reveal function of new gene.

• For 40% of sequenced genes, functionality cannot be ascertained by comparing to sequences of known genes.

• Microarrays allow biologists to infer gene function even when sequence similarity alone is insufficient to infer it.

http://www.bioalgorithms.info


Life: a recipe for making proteins

http://www.cbs.dtu.dk/courses/thaiworkshop/programme.php


Recall the Central Dogma

http://www.cbs.dtu.dk/courses/thaiworkshop/programme.php


Hybridization is central

http://www.cbs.dtu.dk/courses/thaiworkshop/programme.php


Microarrays: the concept

http://www.cbs.dtu.dk/courses/thaiworkshop/programme.php

Measure level of transcription for a very large number of genes in a single experiment.


Microarrays and expression analysis

http://www.bioalgorithms.info

Microarrays measure activity (expression level) of genes under varying conditions and/or points in time.

Expression level is estimated by measuring the amount of mRNA for that particular gene:

• A gene is active if it is being transcribed.

• More mRNA usually indicates more gene activity.


Microarrays: how?

http://www.cbs.dtu.dk/courses/thaiworkshop/programme.php


Stanford microarrays: production

http://www.cbs.dtu.dk/courses/thaiworkshop/programme.php



Stanford microarrays: production

http://www.cbs.dtu.dk/courses/thaiworkshop/programme.php

Coating:

1. Rinse of slides: NaOH and EtOH (2 h - shaking).

2. Wash with water.

3. Coat slides: poly-L-lysine (1 h - shaking).

4. Wash and dry.

Attach probes:

1. Produce probes (oligos, cDNA library, PCR products).

2. Print by the use of a robot.


Stanford microarrays: production

http://www.cbs.dtu.dk/courses/thaiworkshop/programme.php

Spotting – mechanical deposition of probes:



Stanford microarrays: production

Microarrayer

http://www.cbs.dtu.dk/courses/thaiworkshop/programme.php


Microarray experiments

Steps:

• Produce cDNA from mRNA (DNA is more stable).

• Attach phosphor to cDNA to see when gene is expressed.

• Different color phosphors are available to compare many samples at once.

• Hybridize cDNA over microarray.

• Scan microarray with phosphor-illuminating laser: illumination reveals transcribed genes.

• Scan microarray multiple times for different color phosphors.

http://www.bioalgorithms.info


Microarray experiments

http://www.affymetrix.com

Phosphors can be added here instead.

Then, instead of staining, laser illumination can be used.


Using microarrays

• Track sample over period of time to see how gene expression changes.

• Track two different samples under same conditions to see differences in gene expression.

Each box represents one gene’s expression over time

http://www.bioalgorithms.info


Using microarrays

Interpreting colors:

• Green: expressed only from control.

• Red: expressed only from experimental cell.

• Yellow: equally expressed in both samples.

• Black: NOT expressed in either control or experimental cells.

http://www.bioalgorithms.info


Microarray data

What does a biologist do with microarray data?

• Microarray data is usually transformed into an intensity matrix.

• The intensity matrix allows biologists to find correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related.

Time:     Time X   Time Y   Time Z
Gene 1    10       8        10
Gene 2    10       0        9
Gene 3    4        8.6      3
Gene 4    7        8        3
Gene 5    1        2        3

Intensity (expression level) of gene at measured time

http://www.bioalgorithms.info

Similar behavior? Clustering comes into play!


Clustering microarray data

• Plot each sample as data point in N-dimensional space.

• Build matrix for distances between every two gene points.

• Genes with a small distance share same expression patterns and might be functionally related or similar.

• Clustering reveals groups of functionally related genes.

From “Cluster analysis and display of genome-wide expression patterns” by Eisen, Spellman, Brown, and Botstein, Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863–14868, December 1998.
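As a rough illustration of the distance-matrix step, the sketch below computes pairwise distances between the five genes of the intensity matrix on the "Microarray data" slide. The choice of Euclidean distance and the names `intensity`, `euclidean`, and `distance_matrix` are assumptions made for this example; the slides do not fix a particular distance here.

```python
import math

# Intensity matrix as read from the "Microarray data" slide:
# one row per gene, one column per time point (Time X, Time Y, Time Z).
intensity = {
    "Gene 1": (10.0, 8.0, 10.0),
    "Gene 2": (10.0, 0.0, 9.0),
    "Gene 3": (4.0, 8.6, 3.0),
    "Gene 4": (7.0, 8.0, 3.0),
    "Gene 5": (1.0, 2.0, 3.0),
}


def euclidean(p, q):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(points):
    """Return an n x n list of lists holding all pairwise distances."""
    return [[euclidean(p, q) for q in points] for p in points]


genes = list(intensity)
d = distance_matrix([intensity[g] for g in genes])

# Print the matrix, one row per gene.
for name, row in zip(genes, d):
    print(name, " ".join(f"{x:6.2f}" for x in row))
```

With these values, Gene 3 and Gene 4 come out as the closest pair (distance about 3.06), so under this distance they would be the first candidates for sharing a cluster.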

Different genes that express similarly


Clustering microarray data

Three different clusters

Intensity matrix Pairwise distance matrix

Expression patterns as points in 3-D space

http://www.bioalgorithms.info


Homogeneity and Separation Principles

All approaches to clustering guided by two basic principles:

• Homogeneity: elements within a given cluster are close.

• Separation: elements in different clusters are further apart.

Note that clustering is not an easy task! (Don't be misled by simple illustrative examples.)

Given these points, a clustering algorithm might make two distinct clusters as follows ...

http://www.bioalgorithms.info


Bad clustering

This clustering violates both Homogeneity and Separation Principles:

• Close distances between points in separate clusters.

• Far distances between points in the same cluster.

http://www.bioalgorithms.info


Good clustering

This clustering satisfies both Homogeneity and Separation Principles:

http://www.bioalgorithms.info


Clustering techniques

• Hierarchical: organize elements into a tree, leaves represent genes and length of the paths between leaves represents distances between genes. Similar genes lie within same subtrees.

http://www.bioalgorithms.info

• Agglomerative: start with every element in its own cluster, and iteratively join clusters together.

• Divisive: start with one cluster and iteratively divide it into smaller clusters.


Hierarchical clustering

http://www.bioalgorithms.info


Hierarchical clustering

Hierarchical clustering is often used to reveal evolutionary history:

http://www.bioalgorithms.info


Hierarchical clustering algorithm

HierarchicalClustering(d, n)
  form n clusters, each with one element
  construct graph T by assigning one vertex to each cluster
  while there is more than one cluster
    find the two closest clusters C1 and C2
    merge C1 and C2 into new cluster C of size |C1| + |C2|
    compute distance from C to all other clusters
    add a new vertex C to T and connect it to vertices C1 and C2
    remove rows and columns of d corresponding to C1 and C2
    add a row and column to d corresponding to new cluster C
  return T

The algorithm takes an n x n matrix d of pairwise distances between points as input.


Different ways to define distances between clusters may lead to different clusterings!
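The agglomerative procedure above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the function name `hierarchical_clustering`, the nested-tuple representation of the tree T, and the `linkage` parameter are assumptions made for the example. For brevity it recomputes cluster distances from the original matrix instead of editing rows and columns of d.

```python
def hierarchical_clustering(d, linkage=min):
    """Agglomerative clustering on an n x n distance matrix d.

    `linkage` reduces all element-to-element distances between two
    clusters to a single number (min gives the smallest-distance rule).
    Returns the tree T as nested tuples whose leaves are element indices.
    """
    # Map cluster id -> (subtree, member indices); start with singletons.
    clusters = {i: (i, [i]) for i in range(len(d))}
    next_id = len(d)

    def cluster_distance(a, b):
        return linkage(d[x][y] for x in clusters[a][1] for y in clusters[b][1])

    while len(clusters) > 1:
        ids = sorted(clusters)
        # Find the two closest clusters C1 and C2.
        c1, c2 = min(
            ((a, b) for i, a in enumerate(ids) for b in ids[i + 1:]),
            key=lambda pair: cluster_distance(*pair),
        )
        tree1, members1 = clusters.pop(c1)
        tree2, members2 = clusters.pop(c2)
        # The merged cluster becomes a new vertex joined to C1 and C2.
        clusters[next_id] = ((tree1, tree2), members1 + members2)
        next_id += 1

    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical 1-D data: two tight pairs that are far from each other.
positions = [0, 1, 5, 7]
d = [[abs(p - q) for q in positions] for p in positions]
tree = hierarchical_clustering(d)
print(tree)  # ((0, 1), (2, 3))
```

Passing a different `linkage` function (for example one that averages its input) changes how the distance between clusters is defined, which is exactly the point at which different clusterings can arise.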


Computing distances

$$d_{\min}(C, C^*) = \min_{x \in C,\; y \in C^*} d(x, y)$$

Distance between two clusters is the smallest distance between any pair of elements.

$$d_{\mathrm{avg}}(C, C^*) = \frac{1}{|C^*|\,|C|} \sum_{x \in C,\; y \in C^*} d(x, y)$$

Distance between two clusters is the average distance between all pairs of elements.
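The two cluster-distance definitions translate almost directly into code. In this sketch the function names `d_min` and `d_avg`, the index-list representation of clusters, and the 1-D example data are assumptions made for illustration.

```python
def d_min(d, C, C_star):
    """Smallest distance between any x in C and any y in C_star."""
    return min(d[x][y] for x in C for y in C_star)


def d_avg(d, C, C_star):
    """Average distance over all |C| * |C_star| pairs (x, y)."""
    total = sum(d[x][y] for x in C for y in C_star)
    return total / (len(C) * len(C_star))


# Hypothetical 1-D data; d[i][j] is the distance between points i and j.
positions = [0, 1, 5, 7]
d = [[abs(p - q) for q in positions] for p in positions]

C, C_star = [0, 1], [2, 3]
print(d_min(d, C, C_star))  # 4
print(d_avg(d, C, C_star))  # 5.5
```

The two rules already disagree on this small example (4 versus 5.5), which is why the choice between them can change which clusters get merged first.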


Squared-error distortion

Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point in X.

Given a set of n data points V = {v1, …, vn} and a set of k points X, define the squared-error distortion as:

$$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$$
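A small sketch of the squared-error distortion, assuming points are given as coordinate tuples. The names `dist_to_set` and `distortion` and the sample data are hypothetical and chosen only for this example.

```python
import math


def dist_to_set(v, X):
    """d(v, X): Euclidean distance from v to the closest point in X."""
    return min(math.dist(v, x) for x in X)


def distortion(V, X):
    """d(V, X): mean of the squared distances d(v_i, X) over all v_i in V."""
    return sum(dist_to_set(v, X) ** 2 for v in V) / len(V)


# Hypothetical data: three points on a line and two candidate centers.
V = [(0, 0), (2, 0), (10, 0)]
X = [(1, 0), (10, 0)]
print(distortion(V, X))  # (1 + 1 + 0) / 3, about 0.667
```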


Clustering microarray data: k-means clustering

K-means clustering is one way to organize this data:

• Given set of n data points and an integer k.

• We want to find set of k points that minimizes mean-squared distance from each data point to its nearest cluster center.

Sketch of algorithm:

• Choose k initial center points randomly and cluster data.

• Calculate new centers for each cluster using points in cluster.

• Re-cluster all data using new center points.

• Repeat last two steps until no data points are moved from one cluster to another or some other convergence criterion is met.


Formal definition of K-Means Clustering

The K-Means Clustering Problem.

Input: A set, V, consisting of n points along with a parameter k.

Output: A set X consisting of k points (cluster centers) that minimizes squared-error distortion d(V, X) over all possible choices of X.

A (trivially) simple variation, 1-means clustering:

The 1-Means Clustering Problem.

Input: A set, V, consisting of n points.

Output: A single point x (cluster center) that minimizes squared-error distortion d(V, x) over all possible choices of x.

1-means clustering is easy. General k-means clustering is NP-hard, however.
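One way to see why the 1-means case is easy: the point minimizing the mean squared Euclidean distance to V is the center of gravity (coordinate-wise mean) of V, which takes a single pass to compute. The sketch below illustrates this; the names `center_of_gravity` and `mean_squared_distance` and the sample points are assumptions made for the example.

```python
def center_of_gravity(V):
    """Coordinate-wise mean of the points in V."""
    n = len(V)
    return tuple(sum(coords) / n for coords in zip(*V))


def mean_squared_distance(V, x):
    """Squared-error distortion d(V, x) for a single center x."""
    return sum(sum((a - b) ** 2 for a, b in zip(v, x)) for v in V) / len(V)


# Hypothetical data points.
V = [(0, 0), (2, 0), (4, 6)]
x = center_of_gravity(V)
print(x)  # (2.0, 2.0)

# Any other center, such as a slightly shifted one, gives a larger distortion.
print(mean_squared_distance(V, x) < mean_squared_distance(V, (2.0, 3.0)))  # True
```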


Clustering microarray data: k-means clustering

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici.

• Pick k = 2 centers at random.

• Cluster data around these center points.

• Re-calculate centers based on current clusters.


• Re-cluster data around new center points.

• Repeat last two steps until no more data points are moved into a different cluster.


K-means clustering: Lloyd's algorithm

K-Means Clustering (k)
  arbitrarily assign k cluster centers
  while cluster centers keep changing
    assign each data point to cluster Ci corresponding to closest cluster representative (center) (1 ≤ i ≤ k)
    after assignment of all data points, compute new cluster representatives according to cluster centers of gravity,
      i.e., new cluster representative is ∑v / |C| for all v in C
  output final cluster centers

Note that this may only lead to a locally optimal clustering.
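The pseudocode for Lloyd's algorithm can be sketched in Python as below. Several details are assumptions made for this example and are not specified on the slide: the initial centers are passed in explicitly (so the run is reproducible) instead of being chosen arbitrarily, an empty cluster keeps its previous center, and `max_iter` caps the number of rounds.

```python
import math


def lloyd_k_means(points, initial_centers, max_iter=100):
    """Lloyd's algorithm; returns (centers, clusters)."""
    centers = [tuple(c) for c in initial_centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's center of gravity.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for members, old in zip(clusters, centers)
        ]
        if new_centers == centers:  # centers stopped changing
            break
        centers = new_centers
    return centers, clusters


# Hypothetical expression data in two conditions, with k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = lloyd_k_means(points, initial_centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```

On this well-separated data the centers settle after one update. With a poor choice of initial centers the same code can stop at a worse, locally optimal clustering, as the note above warns.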


K-means clustering: another example

[Figure, shown over four slides: scatter plot with x-axis "expression in condition 1" (0–5) and y-axis "expression in condition 2" (0–5); cluster centers x1, x2, and x3 are marked on each slide.]


Conservative k-means clustering

Observations:

• This algorithm, known as Lloyd's algorithm, is fast, but in each iteration it moves many data points, not necessarily causing better convergence.

• A more conservative method would be to move one point at a time only if it improves the overall clustering cost.

• The smaller the clustering cost of a partition of data points, the better that clustering is.

• Different methods (e.g., squared-error distortion) can be used to measure this clustering cost.


Greedy k-means clustering

ProgressiveGreedyK-Means(k)
  select an arbitrary partition P into k clusters
  while forever
    bestChange ← 0
    for every cluster C
      for every element i not in C
        if moving i to cluster C reduces its clustering cost
          if cost(P) – cost(P_{i→C}) > bestChange
            bestChange ← cost(P) – cost(P_{i→C})
            i* ← i
            C* ← C
    if bestChange > 0
      change partition P by moving i* to C*
    else
      return P

Here P_{i→C} denotes the partition obtained from P by moving element i to cluster C.
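The greedy procedure above might be written in Python roughly as follows. This is a sketch under stated assumptions: the clustering cost is taken to be the squared-error distortion with each cluster's center of gravity as its center, a small tolerance `tol` replaces the strict comparison with 0 to guard against floating-point noise, and the names `centroid`, `cost`, and `progressive_greedy_k_means` are hypothetical.

```python
def centroid(cluster):
    """Center of gravity of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def cost(partition):
    """Mean squared distance from each point to its own cluster's centroid."""
    n = sum(len(c) for c in partition)
    total = 0.0
    for cluster in partition:
        if cluster:
            m = centroid(cluster)
            total += sum(sum((a - b) ** 2 for a, b in zip(p, m))
                         for p in cluster)
    return total / n


def progressive_greedy_k_means(partition, tol=1e-12):
    """Repeatedly apply the single move that reduces the cost the most."""
    partition = [list(c) for c in partition]
    while True:
        best_change, best_move = tol, None
        current = cost(partition)
        for target in range(len(partition)):          # every cluster C
            for source in range(len(partition)):      # every element i not in C
                if source == target:
                    continue
                for idx in range(len(partition[source])):
                    trial = [list(c) for c in partition]
                    trial[target].append(trial[source].pop(idx))
                    change = current - cost(trial)
                    if change > best_change:
                        best_change, best_move = change, (source, idx, target)
        if best_move is None:
            return partition
        source, idx, target = best_move
        partition[target].append(partition[source].pop(idx))


# Hypothetical starting partition with one point misplaced in each cluster.
start = [[(1, 1), (1, 2), (8, 8)], [(2, 1), (9, 8), (8, 9)]]
result = progressive_greedy_k_means(start)
print(result)
```

Each round moves exactly one point, so this version recomputes the cost many times and is much slower than Lloyd's algorithm. It is intended only to make the control flow concrete.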


Clique graphs

A more structured view of clustering:

• A clique is a graph with every vertex connected to every other vertex.

• A clique graph is a graph where each connected component is a clique.
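The two definitions can be checked mechanically. In the sketch below a graph is stored as a dictionary mapping each vertex to the set of its neighbors; that representation and the names `connected_components` and `is_clique_graph` are assumptions made for this example.

```python
def connected_components(adj):
    """Return the connected components of an undirected graph as vertex sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)
    return components


def is_clique_graph(adj):
    """True if every connected component is a clique."""
    return all(
        adj[v] == comp - {v}  # v must be adjacent to every other vertex
        for comp in connected_components(adj)
        for v in comp
    )


# A triangle plus a separate edge: two components, both cliques.
triangle_and_edge = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
# A path 1 - 2 - 3: connected, but vertices 1 and 3 are not adjacent.
path = {1: {2}, 2: {1, 3}, 3: {2}}

print(is_clique_graph(triangle_and_edge))  # True
print(is_clique_graph(path))               # False
```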

http://www.bioalgorithms.info

Clique of size 3

Clique of size 5

Clique of size 6

Clique graph with 3 connected components


Transformation into a clique graph

Any graph can be transformed into a clique graph by adding or removing edges. What can we do here?

[Figure: a graph on vertices 1–7, shown before and after deleting 2 edges to obtain a clique graph.]


Transformation into a clique graph

As with the edit distance we studied earlier, there are many possible transformations:

[Figure: a graph on vertices 1–5 transformed into a clique graph in two ways: add 2 edges, or delete 4 edges.]


Formal definition of Corrupted Cliques Problem

The Corrupted Cliques Problem.

Input: A graph, G.

Output: The smallest number of additions and removals of edges that will transform G into a clique graph.

Our ultimate goal is to have:

• Vertices represent data points.

• Edges represent relationships between data points.

• Cliques represent meaningful groupings (i.e., clusters).


Distance graphs

Transform a distance matrix into a distance graph:

• Genes are represented as vertices in graph.

• Choose a distance threshold θ.

• If distance between two vertices is below θ, draw an edge between them.

• Resulting graph may contain cliques.

• These cliques represent clusters of similar data points!
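The construction above can be sketched as a short function. The adjacency-set representation, the name `distance_graph`, and the 1-D sample data are assumptions made for this example.

```python
def distance_graph(d, theta):
    """Connect vertices i and j whenever their distance d[i][j] is below theta."""
    n = len(d)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] < theta:
                adj[i].add(j)
                adj[j].add(i)
    return adj


# Hypothetical 1-D "genes": three close together, two far away.
positions = [0, 1, 2, 10, 11]
d = [[abs(p - q) for q in positions] for p in positions]

graph = distance_graph(d, theta=3)
print(graph)  # {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
```

Here the threshold happens to produce two cliques directly. With noisier data the result is usually only close to a clique graph, which motivates the Corrupted Cliques Problem.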


Transforming distance graph into clique graph

[Figure: distance matrix d, the distance graph for θ = 7, and the resulting clique graph.]

The distance graph for θ = 7 is not quite a clique graph. However, it can be transformed into one by removing edges (g1, g10) and (g1, g9).


Heuristics for Corrupted Cliques Problem

The Corrupted Cliques Problem is NP-hard, but some heuristics exist to solve it approximately. For example:

• CAST (Cluster Affinity Search Technique) is a practical and fast algorithm for CCP.

• CAST is based on the notion of genes close to a given cluster C, or distant from cluster C.

• Distance between gene i and cluster C is defined as: d(i, C) = average distance between i and each gene in C.

• Gene i is close to cluster C if d(i, C) < θ, distant otherwise.


CAST algorithm

CAST(S, G, θ)
  P ← Ø
  while S ≠ Ø
    v ← vertex of maximal degree in distance graph G
    C ← {v}
    while a close gene i not in C or a distant gene i in C exists
      find the nearest close gene i not in C and add it to C
      remove the farthest distant gene i in C
    add cluster C to partition P
    S ← S – C
    remove vertices of cluster C from distance graph G
  return P

S = set of elements, G = distance graph, θ = distance threshold
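A possible Python rendering of CAST is sketched below. It makes several assumptions that the slide leaves open: the distance graph G is derived on the fly from a distance matrix `d` and the threshold `theta` instead of being passed in, the average d(i, C) for a gene already in C includes its zero distance to itself, ties are broken by lowest index, and `max_rounds` bounds the inner loop in case additions and removals oscillate.

```python
def cast(d, theta, max_rounds=1000):
    """CAST heuristic; returns a list of clusters (sets of gene indices)."""
    remaining = set(range(len(d)))  # S: genes not yet clustered
    partition = []                  # P

    def avg_dist(i, C):
        """d(i, C): average distance between gene i and the genes in C."""
        return sum(d[i][j] for j in C) / len(C)

    def degree(v):
        """Degree of v in the distance graph restricted to remaining genes."""
        return sum(1 for u in remaining if u != v and d[v][u] < theta)

    while remaining:
        # Seed the cluster with a vertex of maximal degree.
        v = max(sorted(remaining), key=degree)
        C = {v}
        for _ in range(max_rounds):
            changed = False
            close_outside = [i for i in remaining - C
                             if avg_dist(i, C) < theta]
            if close_outside:
                # Add the nearest close gene not in C.
                C.add(min(close_outside, key=lambda i: avg_dist(i, C)))
                changed = True
            distant_inside = [i for i in C if avg_dist(i, C) >= theta]
            if distant_inside and len(C) > 1:
                # Remove the farthest distant gene in C.
                C.remove(max(distant_inside, key=lambda i: avg_dist(i, C)))
                changed = True
            if not changed:
                break
        partition.append(C)
        remaining -= C  # S <- S - C; also drops C's vertices from the graph
    return partition


# Hypothetical 1-D "genes": three close together, two far away.
positions = [0, 1, 2, 10, 11]
d = [[abs(p - q) for q in positions] for p in positions]
clusters = cast(d, theta=3)
print(clusters)  # [{0, 1, 2}, {3, 4}]
```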


Wrap-up

Remember:
• Come to class having done the readings.
• Check Blackboard regularly for updates.

Readings for next time:
• BBP Chapters 17-18 and 20 (tools, datasets, and applications).