Clustering Applications Reminder Applications Spectral Clustering Assignment Clustering.

  • date post

    22-Dec-2015

Page 1: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering Applications

Reminder

Applications

Spectral Clustering

Assignment Clustering

Page 2: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

clustering methods

Clustering

hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains.

non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap.

Page 3: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Non-hierarchical Methods

non-hierarchical methods

partitioning methods - classes are mutually exclusive.

clumping methods - overlap is allowed.

Page 4: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Hierarchical methods

Agglomerative methods - The hierarchy is built up in a series of N-1 agglomerations, or fusions, of pairs of objects, beginning with the unclustered dataset.

Divisive methods - Begin with all objects in a single cluster and, at each of N-1 steps, divide one cluster into two smaller clusters, until each object resides in its own cluster.

Page 5: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Partitioning Methods

Partitioning methods are divided according to the number of passes over the data.

Single pass: basic partitioning methods

Multiple passes: K-means (very widely used)

Page 6: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

K-means: Sample Application

Gene clustering. Given: a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line.

Normalization allows comparisons across microarrays.

Produce clusters of genes which vary in similar ways over time.

Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway.

Sample array: rows are genes and columns are time points.

A cluster of co-regulated genes.
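A minimal sketch of the K-means step in this application, assuming NumPy, a per-gene z-score as the normalization, and a deterministic farthest-first seeding. None of these choices is stated on the slide, and the toy matrix `expr` (rows are genes, columns are time points) is invented for illustration.

```python
import numpy as np

def normalize_rows(expr):
    """Z-score each gene (row) so profiles are comparable across arrays."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    return (expr - mean) / np.where(std == 0, 1.0, std)

def farthest_first(points, k):
    """Assumed seeding: row 0, then repeatedly the row farthest from all seeds."""
    chosen = [0]
    for _ in range(1, k):
        d = ((points[:, None, :] - points[chosen][None, :, :]) ** 2).sum(axis=2)
        chosen.append(int(d.min(axis=1).argmax()))
    return points[chosen].copy()

def kmeans(points, k, n_iter=100):
    """Multiple passes over the data: assign to nearest centroid, then update."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(n_iter):
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical data: two genes rising over time, two genes falling.
expr = np.array([[1.0, 2.0, 3.0],
                 [1.0, 3.0, 4.0],
                 [3.0, 2.0, 1.0],
                 [4.0, 2.0, 1.0]])
labels, _ = kmeans(normalize_rows(expr), k=2)
```

Genes that share a label vary in a similar way over time; under the hypothesis above they are candidates for co-regulation. K-means is sensitive to its starting centroids, so a random seeding may give different clusters.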

Page 7: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering gene expression data

[Figure: expression matrix with genes as rows and samples as columns. Each row is the expression profile of a gene.]

Page 8: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering gene expression data

[Figure: expression matrix with genes as rows and samples as columns. Each row is the expression profile of a gene.]

Cluster genes with similar expression profiles

Page 9: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering genes on expression profiles

• The expression profile of each gene is a point in ‘sample space’.

• All genes together form a scatter in this space.

[Figure: gene g plotted at coordinates (e_g1, e_g2, e_g3) on the axes Sample 1, Sample 2 and Sample 3, next to the scatter of all genes on the same axes.]

Page 10: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

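Normalized expression data from microarrays

[Figure: expression matrix with Gene 1 to Gene N as rows and time points T1, T2, T3 as columns.]

Representation of expression data

[Figure: genes plotted as points on the axes Time-point 1, Time-point 2 and Time-point 3; d_ij marks the distance between two genes (Gene 1 and Gene 2 in the figure).]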

Page 11: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Identifying prevalent expression patterns

[Figure: genes plotted on the axes Time-point 1, Time-point 2 and Time-point 3, with three panels of normalized expression versus time-point (1 to 3), one per prevalent pattern.]

Page 12: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Evaluate cluster contents

Genes: gpm1, HTB1, RPL11A, RPL12B, RPL13A, RPL14A, RPL15A, RPL17A, RPL23A, TEF2, YDL228c, YDR133C, YDR134C, YDR327W, YDR417C, YKL153W, YPL142C

MIPS functional categories: Glycolysis, Nuclear Organization, Ribosome, Translation, Unknown

Page 13: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Hierarchical Agglomerative methods

Hierarchical agglomerative clustering methods are the most commonly used. A hierarchical agglomerative classification can be constructed with the following general algorithm.

1. Find the 2 closest objects and merge them into a cluster.

2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.

3. If more than one cluster remains, return to step 2.
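The three steps above can be sketched as a naive loop. This is an illustration only: it assumes NumPy, Euclidean distance, and the closest-pair (single-linkage) rule for the distance between clusters. The function name `agglomerate` and the toy `points` are hypothetical.

```python
import numpy as np

def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns a list of (cluster_a, cluster_b, distance) merges. Original
    objects have ids 0..N-1; each merge creates the next unused id.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: distance of the closest pair of members.
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, float(d)))
        next_id += 1
    return merges

# Hypothetical data: two tight pairs of points far from each other.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5]])
merges = agglomerate(points)
```

For N objects the loop always performs N-1 merges, and the recorded sequence is exactly what a hierarchical tree draws. The search over all cluster pairs is kept naive for readability; library implementations are considerably faster.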

Page 14: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering genes on expression profiles

Sample 1

Sample 3

Sample 2

• Define a distance/similarity measure between points.

Euclidean: $d(g, g') = \sqrt{\sum_s (e_{gs} - e_{g's})^2}$

Manhattan: $d(g, g') = \sum_s |e_{gs} - e_{g's}|$

• Define a distance between clusters of points.
  1) Distance between the closest pair of points from the two clusters (single linkage).
  2) Distance between the furthest pair of points (complete linkage).
  3) Average distance between points from both clusters (average linkage).
  4) Distance between the clusters’ centroids.
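The point distances and the four cluster distances can be written down directly. A small sketch assuming NumPy; the function names and the `linkage` keyword values are illustrative choices, not taken from the slides.

```python
import numpy as np

def euclidean(x, y):
    """Square root of the summed squared differences over samples."""
    return float(np.sqrt(((x - y) ** 2).sum()))

def manhattan(x, y):
    """Sum of absolute differences over samples."""
    return float(np.abs(x - y).sum())

def cluster_distance(A, B, linkage="single", dist=euclidean):
    """Distance between clusters A and B (arrays with one point per row)."""
    if linkage == "centroid":
        return dist(A.mean(axis=0), B.mean(axis=0))
    pairwise = np.array([[dist(a, b) for b in B] for a in A])
    if linkage == "single":      # closest pair
        return float(pairwise.min())
    if linkage == "complete":    # furthest pair
        return float(pairwise.max())
    if linkage == "average":     # mean over all cross-cluster pairs
        return float(pairwise.mean())
    raise ValueError(f"unknown linkage: {linkage}")

# Hypothetical clusters on a line.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
```

On this toy input the closest pair is 2 apart and the furthest pair 5 apart, so the choice of cluster distance can change which clusters are merged next.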

Page 15: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering genes on expression profiles

Sample 1

Sample 3

Sample 2

Hierarchical clustering:
• Start with each point as its own cluster.
• At each iteration, merge the two clusters with the smallest distance.

Page 16: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering genes on expression profiles

Sample 1

Sample 3

Sample 2

Hierarchical clustering:
• Start with each point as its own cluster.
• At each iteration, merge the two clusters with the smallest distance.

Eventually all points will be linked into a single cluster.

Page 17: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering genes on expression profiles

The sequence of mergers can be represented in a hierarchical tree.

[Figure: points a to g in sample space (axes Sample 1, Sample 2, Sample 3) and the corresponding tree with leaves a b c d e f g.]

Page 18: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering genes on expression profiles

Green = Expression level low with respect to reference sample.

Red = Expression level high with respect to reference sample.

Black = Expression level comparable to reference sample.

The columns are ordered such that similar expression profiles neighbor each other.

Eisen et al. PNAS 1998.

Page 19: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering gene expression data

[Figure: expression matrix with genes as rows and samples as columns. Each column is the expression profile of a sample.]

Instead of genes one may cluster samples with similar expression profiles.

Page 20: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering samples on expression profiles

Alizadeh et al. Nature 2000

Identifying different tumor types through sample clustering.

Page 21: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Alizadeh et al., Nature 403:503-11, 2000

Page 22: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Combinations of samples/genes

Cluster genes with similar sample expression profiles.

Cluster samples with similar gene expression profiles.

Combination model: each color corresponds to some “cause”. The cause affects a subset of genes in a subset of the samples, e.g. Ihmels et al. Nature Genetics 2002.

[Figure: three genes-by-samples matrices illustrating gene clustering, sample clustering and the combination model.]

Page 23: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Combinations of samples/genes

Ihmels et al. Nature genetics 2002

Page 24: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering genes: Clusters of homologous genes

• A set of protein or DNA sequences.
• Use an alignment algorithm (e.g. BLAST) to score the similarity of each pair of sequences.
• Task: Detect clusters in the graph.

Graph of similarities of proteins in Methanococcus jannaschii. The length of the links reflects similarity (short link = high similarity).

Enright and Ouzounis, Bioinformatics 2001

Page 25: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering genes: Clusters of homologous genes

Example solution: Put ‘random walkers’ on the graph and let them follow links at random. Look at the density of walkers, strengthen ‘high flow’ links, and weaken ‘low flow’ links.

Stijn van Dongen, Graph Clustering by Flow Simulation (PhD thesis, University of Utrecht).
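The flow idea can be sketched in the spirit of the Markov Cluster (MCL) procedure from the cited thesis. This is a simplified illustration, not that implementation: it assumes NumPy, a symmetric similarity matrix as input, added self-loops, and an inflation exponent of 2. The name `mcl`, its parameters and the toy graph are hypothetical.

```python
import numpy as np

def mcl(adjacency, inflation=2.0, n_iter=100, tol=1e-8):
    """Alternate random-walk expansion with inflation until the flow settles."""
    M = adjacency.astype(float) + np.eye(len(adjacency))  # add self-loops
    M /= M.sum(axis=0, keepdims=True)                     # columns sum to 1
    for _ in range(n_iter):
        expanded = M @ M                   # walkers take two steps
        inflated = expanded ** inflation   # strengthen high flow, weaken low flow
        inflated /= inflated.sum(axis=0, keepdims=True)
        converged = np.abs(inflated - M).max() < tol
        M = inflated
        if converged:
            break
    # Rows that still carry flow act as attractors; the columns they
    # reach form one cluster.
    return {frozenset(np.flatnonzero(row > 1e-6).tolist())
            for row in M if row.max() > 1e-6}

# Hypothetical similarity graph: two triangles with no link between them.
two_triangles = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    two_triangles[i, j] = two_triangles[j, i] = 1.0
clusters = mcl(two_triangles)
```

A larger inflation exponent sharpens the contrast between strong and weak flow and tends to give more, smaller clusters. The returned sets can overlap in rare symmetric cases.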

Page 26: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering DNA sequences: Transcription factor binding sites

Transcription factors recognize ‘fuzzy motifs’.

Alignment of known fruR binding sites:

AAGCTGAATCGATTTTATGATTTGGT

AGGCTGAATCGTTTCAATTCAGCAAG

CTGCTGAATTGATTCAGGTCAGGCCA

GTGCTGAAACCATTCAAGAGTCAATT

GTGGTGAATCGATACTTTACCGGTTG

CGACTGAAACGCTTCAGCTAGGATAA

TGACTGAAACGTTTTTGCCCTATGAG

TTCTTGAAACGTTTCAGCGCGATCTT

ACGGTGAATCGTTCAAGCAAATATAT

GCACTGAATCGGTTAACTGTCCAGTC

ATCGTTAAGCGATTCAGCACCTTACC

**gcTGAAtCG*TTcAg**c******

Task: thousands of such binding sites for hundreds of different TFs. Infer which binding sites bind the same TF.

Page 27: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering DNA sequences: Transcription factor binding sites

AAGCACTATATTGGTGCAACATTCACATCGTGGTGATGAACTGTTTTTTTATCCAGTATAATTTACTCATCTGGTACGACCAGATCACCTTGCGGAAAGCACCATGTTGGTGCAATGACCTTTGGATAAAGCTGAATCGATTTTATGATTTGGTTCAATTAGGCTGAATCGTTTCAATTCAGCAAGAGAGGACATTAACTCATCGGATCAGTTCAGTAACTATTCCTCTTTACTGTATATAAAACCAGTTTATACTTCCGAACTGATCGGACTTGTTCAGCGTACACGACTCACAACTGTATATAAATACAGTTACAGATGTGCTGAAACCATTCAAGAGTCAATTGGCGCGATCAAGCTGGTATGATGAGTTAATATTATGTTTTCCAATACTGTATATTCATTCAGGTCAATTTGTGGTGAATCGATACTTTACCGGTTGAATTTG CAGCATAACTGTATATACACCCAGGGGGCGGAGCCTTTTGCTGTATATACTCACAGCATAACTGCAGCGGCTGGTCCGCTGTTTCTGCATTCTTACACGGTGAATCGTTCAAGCAAATATATTTTTTTAGTAATGACTGTATAAAACCACAGCCAATCAAATCGTTAAGCGATTCAGCACCTTACCTCAGGC

TGGATGTACTGTACATCCATACAGTAACTCACATGCACTAAAATGGTGCAACCTGTTCAGGAGATATTTTACCTGTATAAATAACCAGTATATTCACAGCAAATCTGTATATATACCCAGCTTTTTGG GCGCACCAGATTGGTGCCCCAGAATGGTGCATACAGACTACTGTATATAAAAACAGTATAACTTTCGCCACTGGTCTGATTTCTAAGATGTACCTCAGTTTATACTGTACACAATAACAGTAATGGTTCTGCTGAATTGATTCAGGTCAGGCCAAATGGCACTTGATACTGTATGAGCATACAGTATAATTGTTCCAGCTGGTCCGACCTATACTCTCGCCACTTCGTTTTCCTGTATGAAAAACCATTACTGTTATTACACTCCTGTTAATCCATACAGCAACAGTACGACTGAAACGCTTCAGCTAGGATAAGCGAAATGACTGAAACGTTTTTGCCCTATGAGCTCCGG CATATTTACTGATGATATATACAGGTATTTAGTTCTTGAAACGTTTCAGCGCGATCTTGTCTTTCTGTTACACTGGATAGATAACCAGCATTCGGAATCCTTCGCTGGATATCTATCCAGCATTTTTTGCACTGAATCGGTTAACTGTCCAGTCGACGGCCCACAATATTGGCTGTTTATACAGTATTTCAG

Each line contains a binding site for a transcription factor.

Page 28: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering DNA sequences: Transcription factor binding sites


van Nimwegen et al. PNAS 2002

Page 29: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering DNA sequences: Transcription factor binding sites

AAGCACTATATTGGTGCAACATTCACATCGTG

AAGCACCATGTTGGTGCAATGACCTTTGGATAATGCACTAAAATGGTGCAACCTGTTCAGGAGAGCGCACCAGATTGGTGCCCCAGAATGGTGCATa*GCAC*A*atTGGTGCaac****t***g**ACTCATCTGGTACGACCAGATCACCTTGCGGACATTAACTCATCGGATCAGTTCAGTAACTATTTCCGAACTGATCGGACTTGTTCAGCGTACACGATCAAGCTGGTATGATGAGTTAATATTATGTTTTCCAGCTGGTCCGACCTATACTCTCGCCACTTCGCCACTGGTCTGATTTCTAAGATGTACCTCCAGCGGCTGGTCCGCTGTTTCTGCATTCTTAC****a*CTGgTc*Gat**GT******t*****AAGCTGAATCGATTTTATGATTTGGTTCAATTAGGCTGAATCGTTTCAATTCAGCAAGAGAGGACTGCTGAATTGATTCAGGTCAGGCCAAATGGCGTGCTGAAACCATTCAAGAGTCAATTGGCGCGGTGGTGAATCGATACTTTACCGGTTGAATTTGCGACTGAAACGCTTCAGCTAGGATAAGCGAAATGACTGAAACGTTTTTGCCCTATGAGCTCCGGTTCTTGAAACGTTTCAGCGCGATCTTGTCTTTACGGTGAATCGTTCAAGCAAATATATTTTTTTGCACTGAATCGGTTAACTGTCCAGTCGACGGCATCGTTAAGCGATTCAGCACCTTACCTCAGGC**gcTGAAtCG*TTcAg**c**************gcTGAAtCG*TTcAg**c************

GTGATGAACTGTTTTTTTATCCAGTATAATTTTGGATGTACTGTACATCCATACAGTAACTCACACAGACTACTGTATATAAAAACAGTATAACTTCCTCTTTACTGTATATAAAACCAGTTTATACTACTCACAACTGTATATAAATACAGTTACAGATAGTTTATACTGTACACAATAACAGTAATGGTTACTTGATACTGTATGAGCATACAGTATAATTGTTCCAATACTGTATATTCATTCAGGTCAATTTCAGCATAACTGTATATACACCCAGGGGGCGGAGCCTTTTGCTGTATATACTCACAGCATAACTGTATTTTACCTGTATAAATAACCAGTATATTCACAGCAAATCTGTATATATACCCAGCTTTTTGGTCGTTTTCCTGTATGAAAAACCATTACTGTTATTACACTCCTGTTAATCCATACAGCAACAGTACATATTTACTGATGATATATACAGGTATTTAGCTGTTACACTGGATAGATAACCAGCATTCGGAATCCTTCGCTGGATATCTATCCAGCATTTTTTCCACAATATTGGCTGTTTATACAGTATTTCAGAGTAATGACTGTATAAAACCACAGCCAATCAA****t*tACTGTATATa*A*ACAG********

Page 30: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Similarity/distance matrices

Useful if one wants to investigate a specific factor (advantage: no loss of information). Sort experiments according to that factor.

Array batch 1

Array batch 2

Page 31: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Clustering DNA sequences: Transcription factor binding sites

Alignment of known fruR binding sites:

AAGCTGAATCGATTTTATGATTTGGT

AGGCTGAATCGTTTCAATTCAGCAAG

CTGCTGAATTGATTCAGGTCAGGCCA

GTGCTGAAACCATTCAAGAGTCAATT

GTGGTGAATCGATACTTTACCGGTTG

CGACTGAAACGCTTCAGCTAGGATAA

TGACTGAAACGTTTTTGCCCTATGAG

TTCTTGAAACGTTTCAGCGCGATCTT

ACGGTGAATCGTTCAAGCAAATATAT

GCACTGAATCGGTTAACTGTCCAGTC

ATCGTTAAGCGATTCAGCACCTTACC

**gcTGAAtCG*TTcAg**c******

Page 32: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Probability Evaluation

$w^i_\alpha$: probability of finding base $\alpha$ at position $i$.

For instance: $w^3_A = 0.267$, $w^3_C = 0.2$, $w^3_G = 0.467$, $w^3_T = 0.067$.

Probability that a sequence s is a binding site for the factor represented by w:

$P(s \mid w) = \prod_{i=1}^{l} w^i_{s_i}$
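A short sketch of how such position-specific probabilities can be estimated from the eleven aligned fruR sites listed earlier, and how the product over positions is evaluated. The pseudocount of 1 per base is an assumption; it is chosen because it reproduces the four position-3 values quoted on this page. The function names are illustrative.

```python
import math

# The aligned fruR binding sites from the alignment slide.
SITES = [
    "AAGCTGAATCGATTTTATGATTTGGT", "AGGCTGAATCGTTTCAATTCAGCAAG",
    "CTGCTGAATTGATTCAGGTCAGGCCA", "GTGCTGAAACCATTCAAGAGTCAATT",
    "GTGGTGAATCGATACTTTACCGGTTG", "CGACTGAAACGCTTCAGCTAGGATAA",
    "TGACTGAAACGTTTTTGCCCTATGAG", "TTCTTGAAACGTTTCAGCGCGATCTT",
    "ACGGTGAATCGTTCAAGCAAATATAT", "GCACTGAATCGGTTAACTGTCCAGTC",
    "ATCGTTAAGCGATTCAGCACCTTACC",
]
BASES = "ACGT"

def weight_matrix(sites, pseudocount=1.0):
    """w[i][base]: estimated probability of `base` at position i."""
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        matrix.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return matrix

def site_probability(seq, matrix):
    """P(s | w): product over positions of the probability of the observed base."""
    return math.prod(matrix[i][base] for i, base in enumerate(seq))

w = weight_matrix(SITES)
third = {base: round(p, 3) for base, p in w[2].items()}  # position 3 (index 2)
p_first_site = site_probability(SITES[0], w)
```

At position 3 the sites contain G six times, A three times, C twice and T never, so with the assumed pseudocount the estimates are 7/15, 4/15, 3/15 and 1/15.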

Page 33: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Kohonen Self-organizing maps

K = r*s clusters are arranged as nodes of a two-dimensional grid. Nodes represent cluster centers/prototype vectors.

This makes it possible to represent similarity between clusters.

Algorithm:

Initialize nodes at random positions.

Iterate:
- Randomly pick one data point (gene) x.
- Move nodes towards x: the closest node the most, remote nodes (in terms of the grid) less. Decrease the amount of movement with the number of iterations.

from Tamayo et al. 1999
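A compact sketch of the update loop described above, assuming NumPy. The Gaussian neighbourhood on the grid and the linear decay of the learning rate and neighbourhood width are common choices but are assumptions here, as are the parameter names and the synthetic `data`.

```python
import numpy as np

def train_som(data, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Return prototype vectors arranged on a rows x cols grid."""
    rng = np.random.default_rng(seed)
    sigma0 = max(rows, cols) / 2.0
    # Grid coordinates of the K = rows * cols nodes.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize nodes at random positions inside the data's bounding box.
    nodes = rng.uniform(data.min(axis=0), data.max(axis=0),
                        size=(rows * cols, data.shape[1]))
    for t in range(n_iter):
        decay = 1.0 - t / n_iter
        lr = lr0 * decay                   # movement shrinks over time
        sigma = sigma0 * decay + 1e-3      # neighbourhood shrinks over time
        x = data[rng.integers(len(data))]  # randomly pick one data point
        winner = np.argmin(((nodes - x) ** 2).sum(axis=1))
        grid_dist2 = ((grid - grid[winner]) ** 2).sum(axis=1)
        influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        # The closest node moves most; nodes far away on the grid move less.
        nodes += lr * influence[:, None] * (x - nodes)
    return nodes.reshape(rows, cols, -1)

# Hypothetical data: two Gaussian clouds in two dimensions.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                  rng.normal(3.0, 0.3, size=(50, 2))])
som = train_som(data, rows=2, cols=3)
```

Each data point is then assigned to its nearest node. Because neighbouring nodes are pulled together during training, adjacent grid cells tend to hold similar prototypes.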

Page 34: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Self-organizing maps

from Tamayo et al. 1999 (yeast cell cycle data)

Page 35: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

MST-method : Graph Representation of data

A set of k n-dimensional points is represented as a graph:

• Each data point is represented as a node V (a vertex).

• The edge between the i-th and j-th points is a connection weighted by the “distance” between the two points V(i) and V(j).

$d_{i,j}$ - matrix of distances:

$$\begin{pmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,k-1} & d_{1,k} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,k-1} & d_{2,k} \\ \vdots & \vdots & & \vdots & \vdots \\ d_{k,1} & d_{k,2} & \cdots & d_{k,k-1} & d_{k,k} \end{pmatrix}$$

Page 36: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

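Graph Representation

$$\begin{pmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,7} & d_{1,8} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,7} & d_{2,8} \\ \vdots & \vdots & & \vdots & \vdots \\ d_{8,1} & d_{8,2} & \cdots & d_{8,7} & d_{8,8} \end{pmatrix}$$

$d_{i,j}$ - distance between V(i) and V(j)

[Figure: graph with vertices V(1) to V(8) connected by edges.]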

Page 37: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Intuitive Requirement for a Cluster

Page 38: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Closest point for among all non is Closest point for among all non is

For any partition :

Intuitive requirement for a cluster (IR)

Page 39: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

If a subset C has the IR, the points of C form a subtree of the MST.

In other words, by deleting a few edges of the MST one obtains a tree consisting only of points of C.

Set of don’t have IR !

Cluster Versus MST

Page 40: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

PRIM Algorithm for Cluster Identification

[Figure: data points with indices 1 to 10 and a root (index 0), shown next to their sequential presentation ordered by step index.]
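A sketch of the sequential presentation, assuming NumPy and Euclidean distances: Prim's algorithm grows the MST from a root and records, at each step, which point joins the tree and how long the connecting edge is. Runs of short edges are the valleys. The cut `threshold` used to separate valleys is a hypothetical parameter, not part of the slides.

```python
import numpy as np

def prim_sequence(points, root=0):
    """Return (order, lengths): Prim's insertion order and the edge used at each step."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[root] = True
    best = dist[root].copy()  # cheapest connection of each vertex to the tree
    order, lengths = [root], [0.0]
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        v = int(candidates.argmin())
        order.append(v)
        lengths.append(float(candidates[v]))
        in_tree[v] = True
        best = np.minimum(best, dist[v])
    return order, lengths

def valleys(order, lengths, threshold):
    """Split the sequence wherever the connecting edge exceeds `threshold`."""
    clusters, current = [], [order[0]]
    for v, w in zip(order[1:], lengths[1:]):
        if w > threshold:
            clusters.append(current)
            current = []
        current.append(v)
    clusters.append(current)
    return clusters

# Hypothetical data: two groups of three points on a line.
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                   [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])
order, lengths = prim_sequence(points)
groups = valleys(order, lengths, threshold=4.0)
```

Plotting `lengths` against the step index shows two valleys separated by one long edge, which is where the sequence is cut.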

Page 41: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Intuitive Requirement for a Cluster

Sequential Representation

Valley

Page 42: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.
Page 43: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.
Page 44: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Cluster analysis & graph theory

Graph Formulation

• View the data set as a set of vertices V = {1, 2, …, n}.

• The similarity between objects i and j is viewed as the weight $A_{ij}$ of the edge connecting these vertices. A is called the affinity matrix.

• We get a weighted undirected graph G = (V, A).

• Clustering (segmentation) is equivalent to a partition of G into disjoint subsets. The latter can be achieved by simply removing connecting edges.

Page 45: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Nature of the Affinity Matrix

$A_{ij} = e^{-\|s_i - s_j\|^2 / 2\sigma^2}$ for $i \neq j$, $A_{ii} = 0$

Weight as a function of $\sigma$

“Closer” vertices will get larger weight

Page 46: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Spectral Clustering

Algorithms that cluster points using eigenvectors of matrices derived from the data

Obtain data representation in the low-dimensional space that can be easily clustered

Variety of methods that use the eigenvectors differently

Page 47: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Spectral Clustering Algorithm Ng, Jordan, and Weiss

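Given a set of points $S = \{s_1, \ldots, s_n\}$:

1. Form the affinity matrix: $A_{ij} = e^{-\|s_i - s_j\|^2 / 2\sigma^2}$ for $i \neq j$, $A_{ii} = 0$.

2. Define the diagonal matrix D with $D_{ii} = \sum_k A_{ik}$.

3. Form the matrix $L = D^{-1/2} A D^{-1/2}$.

4. Stack the k largest eigenvectors of L (those with the k largest eigenvalues), $x_1, x_2, \ldots, x_k$, to form the columns of the new matrix X.

5. Renormalize each of X’s rows to have unit length, giving Y.

6. Cluster the rows of Y as points in $\mathbb{R}^k$.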

Page 48: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Spectral Clustering Algorithm Ng, Jordan, and Weiss

Motivation

Given a set of points $S = \{s_1, s_2, \ldots, s_n\}$ in $\mathbb{R}^l$, we would like to cluster them into k subsets.

Form the affinity matrix $A \in \mathbb{R}^{n \times n}$:

$A_{ij} = e^{-\|s_i - s_j\|^2 / 2\sigma^2}$, $A_{ii} = 0$

The scaling parameter $\sigma$ is chosen by the user.

Define D, a diagonal matrix whose (i,i) element is the sum of A’s row i.

Page 49: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Algorithm

Form the matrix $L = D^{-1/2} A D^{-1/2}$.

Find $x_1, x_2, \ldots, x_k$, the k largest eigenvectors of L.

These form the columns of the new matrix X.

We have reduced the dimension from n x n to n x k.

Page 50: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Algorithm

Form the matrix Y: renormalize each of X’s rows to have unit length, $Y \in \mathbb{R}^{n \times k}$:

$Y_{ij} = X_{ij} / \left( \sum_j X_{ij}^2 \right)^{1/2}$

Treat each row of Y as a point in $\mathbb{R}^k$.

Cluster into k clusters via K-means.

Final cluster assignment: assign point $s_i$ to cluster j if row i of Y was assigned to cluster j.
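The whole procedure can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a reference implementation: the small K-means helper uses a deterministic farthest-first seeding, and the two concentric rings, `sigma=0.5` and all function names are hypothetical.

```python
import numpy as np

def njw_embedding(points, k, sigma):
    """Rows of the returned Y are the points re-expressed in R^k."""
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-sq / (2.0 * sigma ** 2))          # affinity matrix
    np.fill_diagonal(A, 0.0)                      # A_ii = 0
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))     # D^{-1/2} as a vector
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    X = eigvecs[:, -k:]                           # k largest eigenvectors
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def kmeans(points, k, n_iter=100):
    """Plain K-means with an assumed farthest-first seeding."""
    chosen = [0]
    for _ in range(1, k):
        d = ((points[:, None, :] - points[chosen][None, :, :]) ** 2).sum(axis=2)
        chosen.append(int(d.min(axis=1).argmax()))
    centroids = points[chosen].copy()
    labels = None
    for _ in range(n_iter):
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

def ring(radius, n):
    angles = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

# Hypothetical non-convex data: a small ring inside a large ring.
points = np.vstack([ring(1.0, 20), ring(5.0, 40)])
Y = njw_embedding(points, k=2, sigma=0.5)
labels = kmeans(Y, k=2)   # point i gets the cluster of row i of Y
```

In the embedded space each ring collapses to nearly a single point, so K-means separates them even though the outer ring surrounds the inner one in the original coordinates. The result depends on the scaling parameter: a much larger `sigma` would connect the two rings.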

Page 51: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Why?

If we eventually use K-means, why not just apply K-means to the original data?

This method allows us to cluster non-convex regions

Page 52: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.
Page 53: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Basic intuition

Divide points in space. For basic intuition, consider the case of only two clusters.

• Distance between points is defined by the affinity matrix $C_{i,j}$.

• We want to pick the partition ($x_i$’s) to minimize the cost within a cluster.

• The partition can be 0 and 1, or it can be fuzzy.

• Only members of the cluster contribute to the distance ($x_i = x_j = 1$).

Page 54: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Basic Intuition

• Minimize squared length.

• Start with connection array B ($b_{i,j}$).

• “Placement” vector X, with $x_i$ the placement of node i.

• cost $= \sum_{i,j} x_i b_{i,j} x_j$

• Constraint: $X'X = 1$, which maintains normalization.

Page 55: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Basic Intuition

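Minimize cost $= X'BX$ with the constraint $X'X = 1$:

minimize $L = X'BX - \lambda (X'X - 1)$

$\partial L / \partial X = 2BX - 2\lambda X = 0$

$(B - \lambda I) X = 0$

X is an eigenvector of B; the cost is the eigenvalue $\lambda$.

X ($x_i$’s) is continuous.

Form the cut partition from the ordering.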

Page 56: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Spectral Partitioning

• Use the eigenvector to order the nodes.

• The real problem wants to place nodes at discrete locations.

• This is one case where we can solve the LP problem (continuous x’s) and then move back to the closest discrete point.
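A sketch of partitioning from an eigenvector ordering, assuming NumPy. Two assumptions are made that the slides do not state: the connection array B is taken to be the graph Laplacian (degree matrix minus adjacency), and the eigenvector of the second-smallest eigenvalue is used, since the smallest one is constant and carries no ordering. The balanced cut at the middle of the ordering is also an illustrative choice.

```python
import numpy as np

def spectral_bipartition(adjacency):
    """Order nodes by a continuous eigenvector, then cut the ordering in half."""
    degree = adjacency.sum(axis=1)
    B = np.diag(degree) - adjacency      # assumed form of the connection array
    _, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    x = eigvecs[:, 1]                    # continuous placement of the nodes
    order = np.argsort(x)                # move back to a discrete ordering
    half = len(order) // 2
    return sorted(order[:half].tolist()), sorted(order[half:].tolist())

# Hypothetical graph: two triangles joined by the single edge 2-3.
adjacency = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adjacency[i, j] = adjacency[j, i] = 1.0
left, right = spectral_bipartition(adjacency)
```

With this choice of B the cost X'BX equals the sum of squared placement differences over the edges, so nodes that are strongly connected end up close together in the ordering and the cut removes only the bridge edge.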

Page 57: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Simple Example

Consider two 2-dimensional slightly overlapping Gaussian clouds each containing 100 points.

Page 58: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Simple Example cont-d I

Page 59: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Simple Example cont-d II

Page 60: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Example 2 (not so simple)

Page 61: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Example 2 cont-d I

Page 62: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Example 2 cont-d II

Page 63: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Assignment Clustering

Given M 0-1-N vectors of length L each, find a resolution which results in the least number of distinct vectors.

A vector is called resolved if there are no Ns in it.

This is also called Binary Clustering with Missing Values.

where p is the maximum number of Ns

Page 64: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Motivation

Oligonucleotide fingerprinting: array-based method for characterizing cDNAs, tissues, etc.

Forensics: DNA profiling, etc.

Page 65: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Fingerprinting

Fingerprint vector formed from the intensity values of each probe

Quantize the values V using thresholds h and l:
  V > h: 1
  V < l: 0
  Others: N
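The quantization rule is a one-liner. The function name and the example thresholds below are hypothetical.

```python
def quantize(intensities, low, high):
    """Map probe intensities to a 0-1-N fingerprint string.

    Values above `high` become 1, values below `low` become 0,
    and everything in between is marked N (undecided).
    """
    return "".join("1" if v > high else "0" if v < low else "N"
                   for v in intensities)

fingerprint = quantize([0.9, 0.1, 0.5, 0.8], low=0.3, high=0.7)
```

Widening the gap between the two thresholds produces more Ns, which makes more pairs of vectors compatible but leaves more positions to resolve.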

Page 66: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Trivial solutions

Just make each of the vectors into a cluster ….

Find a clustering solution with minimum number of clusters.

Each cluster can be resolved to the same vector. So minimize the number of unique resolved vectors

Page 67: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Actual Problem

Identify clusters of mutually compatible vectors.

Compatibility: If the vectors differ only at Ns they are called compatible

Example: 110N0NN110 N10100N1N0
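The compatibility test can be checked on the example pair. A small sketch; the helper `merge`, which combines two compatible vectors into their least-committed common vector, is an illustrative addition.

```python
def compatible(u, v):
    """True if u and v agree at every position where neither is N."""
    return len(u) == len(v) and all(
        a == b or "N" in (a, b) for a, b in zip(u, v))

def merge(u, v):
    """Fill each N in one vector with the known value from the other."""
    if not compatible(u, v):
        raise ValueError("vectors are not compatible")
    return "".join(b if a == "N" else a for a, b in zip(u, v))

merged = merge("110N0NN110", "N10100N1N0")
```

The two example vectors never disagree at a position where both are known, so they are compatible. Position 7 is N in both and stays unresolved in the merged vector.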

Page 68: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Greedy Clique Partition

1 Find a Unique Maximal Clique

2 Remove it from the graph

3 Repeat 1, 2 until there exist no more unique maximal cliques

4 Find a Maximum Clique

5 Remove it from the graph

6 Repeat 1-5 until the graph is empty

Page 69: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Implementation

Definitions:

$R(f)$ - set of resolved vectors of f

$R = \bigcup_f R(f)$

$F(r) = \{ f \in F \mid r \in R(f) \}$

Hash table H with entries:

F(r), v(r) - a vector of length L

N(r) - positions of Ns in v(r)

Page 70: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Implementation

Hash the entries with the resolved vectors.

Double hashing: $h(r) = h_1(r) + k \cdot h_2(r)$

Chaining for avoiding over-writes due to collisions.

Page 71: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Fill the table

For each fingerprint vector, insert its resolved vectors into the table.

v(r) - the minimal resolution of the vectors

Example:

f1 = 01N01N, r = 01011, v = 01N01N

f2 = N1N11, r = 01011, v = 01N011

Page 72: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Finding cliques

Maximum clique: the entry r with the largest F(r) is the maximum clique.

Check for a unique vertex, i.e. a vertex belonging to only one maximal clique: if, for an f, all the v(r)s are mutually compatible, then f is a unique vertex.

Among all the cliques associated with f, choose the largest.
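A simplified sketch of the table-based clique search. It uses a Python dict in place of the double-hashed table, and it only performs the maximum-clique steps of the greedy partition (the unique-maximal-clique steps are skipped), so it illustrates the data structure rather than the full method. All names and the toy fingerprints are hypothetical.

```python
from itertools import product

def resolutions(f):
    """R(f): every resolved vector obtained by replacing each N with 0 or 1."""
    slots = [("0", "1") if c == "N" else (c,) for c in f]
    return ["".join(bits) for bits in product(*slots)]

def build_table(fingerprints, keep):
    """Map each resolved vector r to F(r), the indices of vectors that resolve to r."""
    table = {}
    for idx in keep:
        for r in resolutions(fingerprints[idx]):
            table.setdefault(r, set()).add(idx)
    return table

def greedy_partition(fingerprints):
    """Repeatedly remove the entry r with the largest F(r)."""
    remaining = set(range(len(fingerprints)))
    clusters = []
    while remaining:
        table = build_table(fingerprints, remaining)
        # Ties are broken by the resolved vector itself to stay deterministic.
        r, members = max(table.items(), key=lambda item: (len(item[1]), item[0]))
        clusters.append((r, sorted(members)))
        remaining -= members
    return clusters

# Hypothetical fingerprints of length 3.
fingerprints = ["1N0", "10N", "N11", "011"]
clusters = greedy_partition(fingerprints)
```

All vectors in F(r) share the resolution r, so they are mutually compatible and form a clique in the compatibility graph. A vector with p Ns has 2^p resolutions, so the table grows exponentially in the maximum number of Ns.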

Page 73: Clustering Applications Reminder Applications Spectral Clustring Assignment Clustering.

Data generation

Generate a cluster structure

Generate d random mutually non-compatible vectors

Make copies and randomly change 2 bits to N

1

),...,( 1

i

d

s

ssS
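The generation steps can be sketched as follows, using only the standard library. It assumes that the d cluster centers are distinct fully resolved vectors (two distinct resolved vectors are never compatible) and that every cluster receives the same number of copies; the function name, parameters and seed are hypothetical.

```python
import random

def generate(d, length, copies, n_missing=2, seed=0):
    """Return (centers, data), where data holds (cluster_label, masked_vector) pairs."""
    rng = random.Random(seed)
    centers = set()
    while len(centers) < d:  # requires d <= 2 ** length
        centers.add("".join(rng.choice("01") for _ in range(length)))
    centers = sorted(centers)
    data = []
    for label, center in enumerate(centers):
        for _ in range(copies):
            chars = list(center)
            # Randomly change `n_missing` positions of the copy to N.
            for pos in rng.sample(range(length), n_missing):
                chars[pos] = "N"
            data.append((label, "".join(chars)))
    return centers, data

centers, data = generate(d=3, length=10, copies=4)
```

Every generated vector stays compatible with its own center. Copies from different clusters can still become compatible with each other when their centers differ only at masked positions, which is what makes recovering the true structure non-trivial.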