K-means and Kohonen Maps Unsupervised Clustering Techniques Steve Hookway 4/8/04.


What is a DNA Microarray?

- An experiment on the order of 10k elements
- A way to explore the function of a gene
- A snapshot of the expression level of an entire phenotype under given test conditions

Some Microarray Terminology

Probe: ssDNA printed on the solid substrate (nylon or glass). These are the genes we are going to be testing.

Target: cDNA which has been labeled and is to be washed over the probe

Microarray Fabrication

- Deposition of DNA fragments
  - Deposition of PCR-amplified cDNA clones
  - Printing of already synthesized oligonucleotides
- In situ synthesis
  - Photolithography
  - Ink Jet Printing
  - Electrochemical Synthesis

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

cDNA Microarrays and Oligonucleotide Probes


cDNA Arrays

- Long sequences
- Spot unknown sequences
- More variability

Oligonucleotide Arrays

- Short sequences
- Spot known sequences
- More reliable data

In Situ Synthesis

- Photochemically synthesized on the chip
- Reduces noise caused by PCR, cloning, and spotting
- As previously mentioned, three kinds of in situ synthesis
  - Photolithography
  - Ink Jet Printing
  - Electrochemical Synthesis


Photolithography

- Similar to process used to build VLSI circuits
- Photolithographic masks are used to add each base
  - If base is present, there will be a hole in the corresponding mask
- Can create high density arrays, but sequence length is limited


[Figure: photodeprotection through a mask]

Ink Jet Printing

- Four cartridges are loaded with the four nucleotides: A, G, C, T
- As the printer head moves across the array, the nucleotides are deposited where they are needed


Electrochemical Synthesis

Electrodes are embedded in the substrate to manage individual reaction sites

Electrodes are activated in necessary positions in a predetermined sequence that allows the sequences to be constructed base by base

Solutions containing specific bases are washed over the substrate while the electrodes are activated


http://www.bio.davidson.edu/courses/genomics/chip/chip.html

Application of Microarrays

- We only know the function of about 20% of the 30,000 genes in the Human Genome
  - Gene exploration
  - Faster and better
- Can be used for DNA computing

http://www.gene-chips.com/sample1.html


A Data Mining Problem

- On a given microarray we test on the order of 10k elements at a time
- Data is obtained faster than it can be processed
- We need some ways to work through this large data set and make sense of the data

Grouping and Reduction

- Grouping: discovers patterns in the data from a microarray
- Reduction: reduces the complexity of data by removing redundant probes (genes) that will be used in subsequent assays

Unsupervised Grouping: Clustering

- Pattern discovery via grouping similarly expressed genes together
- Three techniques most often used
  - k-Means Clustering
  - Hierarchical Clustering
  - Kohonen Self Organizing Feature Maps

Clustering Limitations

- Any data can be clustered, therefore we must be careful what conclusions we draw from our results
- Clustering that starts from random initial conditions, as k-means and self organizing feature maps do, is non-deterministic and can produce different results on different runs

K-means Clustering

- Given a set of n data points in d-dimensional space and an integer k
- We want to find the set of k points in d-dimensional space that minimizes the mean squared distance from each data point to its nearest center
- No exact polynomial-time algorithms are known for this problem

“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.
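The objective described above can be written out as a formula. The symbols here are chosen for illustration and are not from the slides: x_1, ..., x_n are the data points and c_1, ..., c_k are the candidate centers.

$$\Delta(c_1, \ldots, c_k) = \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2$$

The k-means problem asks for the centers that make this quantity as small as possible.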

K-means Algorithm (Lloyd’s Algorithm)

Has been shown to converge to a locally optimal solution

But can converge to a solution arbitrarily bad compared to the optimal solution

•“K-means-type algorithms: A generalized convergence theorem and characterization of local optimality” by Selim and Ismail

•“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.

[Figure: example with k = 3 showing data points, optimal centers, and heuristic centers]

Euclidean Distance

$$d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Now to find the distance between two points, say the origin O and the point A = (3,4):

$$d_E(O, A) = \sqrt{3^2 + 4^2} = 5$$

Simple and fast! Remember this when we consider the complexity!
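As a minimal sketch, the distance formula can be written in a few lines of Python. The function name `euclidean_distance` is an illustrative choice and does not come from the slides.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Distance from the origin to the point (3, 4).
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The cost is one pass over the n coordinates, which is why each distance evaluation is cheap.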

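Finding a Centroid

We use the following equation to find the n-dimensional centroid point amid k n-dimensional points:

$$CP(x_1, x_2, \ldots, x_k) = \left( \frac{\sum_{i=1}^{k} x_{i,1}}{k}, \frac{\sum_{i=1}^{k} x_{i,2}}{k}, \ldots, \frac{\sum_{i=1}^{k} x_{i,n}}{k} \right)$$

Here x_{i,j} is the j-th coordinate of the i-th point.

Let’s find the midpoint between 3 2D points, say: (2,4) (5,2) (8,9)

$$CP = \left( \frac{2 + 5 + 8}{3}, \frac{4 + 2 + 9}{3} \right) = (5, 5)$$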
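The centroid calculation can be sketched in Python as a coordinate-wise mean. The function name `centroid` is hypothetical, and points are assumed to be tuples of equal length.

```python
def centroid(points):
    """Coordinate-wise mean of k points that all have n dimensions."""
    k = len(points)
    if k == 0:
        raise ValueError("need at least one point")
    # zip(*points) groups the first coordinates, the second coordinates, ...
    return tuple(sum(coords) / k for coords in zip(*points))

# The three 2D points from the example.
print(centroid([(2, 4), (5, 2), (8, 9)]))  # (5.0, 5.0)
```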

K-means Algorithm

1. Choose k initial center points randomly
2. Cluster data using Euclidean distance (or other distance metric)
3. Calculate new center points for each cluster using only points within the cluster
4. Re-cluster all data using the new center points
   - This step could cause data points to be placed in a different cluster
5. Repeat steps 3 & 4 until the center points have moved such that in step 4 no data points are moved from one cluster to another or some other convergence criterion is met
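The steps above can be sketched in plain Python as follows. This is one possible reading, not the presenter's implementation: the name `kmeans`, the `max_iter` cap, the `seed` parameter, and the choice to draw the initial centers from the existing data points are all assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=None):
    """Lloyd's algorithm on a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: choose k initial centers at random (here: from the data itself).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 5: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center from the points in its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels

data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(data, k=2, seed=0)
print(labels)
```

On this small data set the two well-separated groups end up in different clusters, although which group gets label 0 depends on the random start.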


An example with k=2

1. We pick k=2 centers at random

2. We cluster our data around these center points

Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

K-means example with k=2

3. We recalculate centers based on our current clusters



4. We re-cluster our data around our new center points



5. We repeat the last two steps until no more data points are moved into a different cluster


Choosing k

- Use another clustering method
- Run algorithm on data with several different values of k
- Use advance knowledge about the characteristics of your test
  - Cancerous vs Non-Cancerous

Cluster Quality

- Since any data can be clustered, how do we know our clusters are meaningful?
  - The size (diameter) of the cluster vs. the inter-cluster distance
  - Distance between the members of a cluster and the cluster’s center
  - Diameter of the smallest sphere


Cluster Quality Continued

[Figure: pairs of clusters of size 5 drawn at two different inter-cluster distances]

Quality of cluster assessed by ratio of distance to nearest cluster and cluster diameter
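A rough sketch of this ratio in Python is shown below. The slides do not pin down the exact definitions, so two assumptions are made here: the diameter is taken as the largest pairwise distance between members, and the distance to another cluster is measured between centroids. All function names are hypothetical.

```python
import math
from itertools import combinations

def diameter(cluster):
    """Largest pairwise distance between members (an assumed definition)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def centroid(cluster):
    """Coordinate-wise mean of the cluster members."""
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))

def quality(cluster, other_clusters):
    """Distance to the nearest other cluster divided by this cluster's diameter."""
    own = centroid(cluster)
    nearest = min(math.dist(own, centroid(o)) for o in other_clusters)
    d = diameter(cluster)
    return math.inf if d == 0 else nearest / d

a = [(0.0, 0.0), (3.0, 4.0)]        # diameter 5
far = [(20.0, 0.0), (23.0, 4.0)]    # centroid 20 away from a's centroid
near = [(5.0, 0.0), (8.0, 4.0)]     # centroid 5 away from a's centroid
print(quality(a, [far]))   # 4.0
print(quality(a, [near]))  # 1.0
```

A larger ratio means the cluster is tight relative to how far it sits from its nearest neighbor.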


Cluster Quality Continued

Quality can be assessed simply by looking at the diameter of a cluster

A cluster can be formed even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created.

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Characteristics of k-means Clustering

- The random selection of initial center points creates the following properties
  - Non-determinism
  - May produce clusters without patterns
    - One solution is to choose the centers randomly from existing patterns


Algorithm Complexity

- Linear in the number of data points, N
- Can be shown to have time of cN
  - c does not depend on N, but rather the number of clusters, k
- Low computational complexity
- High speed

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

The Need for a New Algorithm

-Each data point is assigned to the correct cluster

-Data points that seem to be far away from each other in heuristic are in reality very closely related to each other


The Need for a New Algorithm

Eisen et al., 1998

Kohonen Self Organizing Feature Maps (SOFM)

Creates a map in which similar patterns are plotted next to each other

Data visualization technique that reduces n dimensions and displays similarities

More complex than k-means or hierarchical clustering, but more meaningful

Neural Network Technique Inspired by the brain


SOFM Description

Each unit of the SOFM has a weighted connection to all inputs

As the algorithm progresses, neighboring units are grouped by similarity

[Figure: network with an input layer and an output layer]


SOFM Algorithm

Initialize Map
For t from 0 to 1 // t is the learning factor
    Randomly select a sample
    Get best matching unit
    Scale neighbors
    Increase t a small amount // decrease learning factor
End for

From: http://davis.wpi.edu/~matt/courses/soms/
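The loop above could be sketched in Python roughly as follows. Several details are assumptions made for illustration and are not taken from the slides or the cited tutorial: the function name `train_som`, the grid size, the number of steps, a learning factor that decays linearly as 1 - t, and a Gaussian neighborhood with a fixed `radius` measured in grid cells.

```python
import math
import random

def train_som(samples, rows=10, cols=10, steps=1000, radius=3.0, seed=None):
    """Train a rows x cols map of weight vectors on a list of sample tuples."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Initialize map: one random weight vector per unit.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for step in range(steps):
        t = step / steps              # progress, runs from 0 toward 1
        rate = 1.0 - t                # assumed learning factor, shrinks as t grows
        sample = rng.choice(samples)  # randomly select a sample
        # Get best matching unit: the unit whose weights are closest to the sample.
        br, bc = min(((r, c) for r in range(rows) for c in range(cols)),
                     key=lambda rc: math.dist(weights[rc[0]][rc[1]], sample))
        # Scale neighbors: units near the winner move toward the sample.
        for r in range(rows):
            for c in range(cols):
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                influence = rate * math.exp(-grid_d2 / (2 * radius ** 2))
                w = weights[r][c]
                for i in range(dim):
                    w[i] = w[i] * (1 - influence) + sample[i] * influence
    return weights

colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
som = train_som(colors, rows=4, cols=4, steps=200, seed=1)
print(len(som), len(som[0]), len(som[0][0]))  # 4 4 3
```

Each update is a weighted average of the old weight and the sample, so weights stay inside the range spanned by the initial values and the samples.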

An Example Using Color

Three dimensional data: red, blue, green

Will be converted into a 2D image map with dark blue and greys clustered together and yellow close to both the red and the green

From http://davis.wpi.edu/~matt/courses/soms/

An Example Using Color

Each color in the map is associated with a weight


An Example Using Color

1. Initialize the weights

[Figure: three weight initializations shown: random values, colors in the corners, and equidistant]


An Example Using Color Continued

2. Get best matching unit

After randomly selecting a sample, go through all weight vectors and calculate the best match (in this case using Euclidean distance)

Think of colors as 3D points, with each component (red, green, blue) on an axis



2. Getting the best matching unit continued…

For example, let's say we chose green as the sample. Then it can be shown that light green is closer to green than red:

Green: (0,6,0) Light Green: (3,6,3) Red: (6,0,0)

$$d_{\text{LightGreen}} = \sqrt{3^2 + 0^2 + 3^2} \approx 4.24$$

$$d_{\text{Red}} = \sqrt{6^2 + (-6)^2 + 0^2} \approx 8.49$$

This step is repeated for entire map, and the weight with the shortest distance is chosen as the best match
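A minimal sketch of this search in Python, using the color values from the example. The function name `best_matching_unit` and the flat list of candidate weights are illustrative assumptions; a real map would scan every unit in the grid.

```python
import math

def best_matching_unit(weights, sample):
    """Return (index, distance) of the weight vector closest to the sample."""
    distances = [math.dist(w, sample) for w in weights]
    best = min(range(len(weights)), key=distances.__getitem__)
    return best, distances[best]

green = (0, 6, 0)
candidates = [(3, 6, 3), (6, 0, 0)]  # light green, red
index, distance = best_matching_unit(candidates, green)
print(index, round(distance, 2))  # 0 4.24
```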



3. Scale neighbors
   1. Determine which weights are considered neighbors
   2. How much each weight can become more like the sample vector


1. Determine which weights are considered neighbors

In the example, a Gaussian function is used where every point above 0 is considered a neighbor:

$$f(x, y) = e^{-6.66666667 (x^2 + y^2)}$$
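A small sketch of this neighborhood function in Python, assuming the Gaussian has the form f(x, y) = e^{-6.66666667(x^2 + y^2)} and that x and y are offsets from the best matching unit. The offset convention and the function name `neighborhood` are assumptions, not stated in the slides.

```python
import math

def neighborhood(x, y):
    """Gaussian weight for a unit at offset (x, y) from the best matching unit."""
    return math.exp(-6.66666667 * (x * x + y * y))

print(neighborhood(0.0, 0.0))                           # 1.0
print(neighborhood(0.5, 0.0) > neighborhood(1.0, 0.0))  # True
```

The value is 1 at the winning unit itself and falls off quickly with distance, so far-away units are affected very little.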



2. How much each weight can become more like the sample

When the weight with the smallest distance is chosen and the neighbors are determined, it and its neighbors ‘learn’ by changing to become more like the sample. The farther away a neighbor is, the less it learns.

NewColorValue = CurrentColor*(1-t)+sampleVector*t

For the first iteration t=1, since t can range from 0 to 1. For following iterations the value of t used in this formula decreases, because there are fewer values left in the range (as t increases in the for loop).
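The update formula can be tried out directly. In this sketch the function name `learn` is hypothetical, and t is simply passed in as the amount learned, a value between 0 and 1.

```python
def learn(current, sample, t):
    """Blend a weight vector toward the sample; t in [0, 1] is the amount learned."""
    return tuple(c * (1 - t) + s * t for c, s in zip(current, sample))

light_green = (3, 6, 3)
green = (0, 6, 0)
print(learn(light_green, green, 1.0))  # (0.0, 6.0, 0.0)
print(learn(light_green, green, 0.5))  # (1.5, 6.0, 1.5)
```

With t=1 the weight becomes the sample outright, and smaller values of t move it only part of the way.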


Conclusion of Example

Samples continue to be chosen at random until t becomes 1 (learning stops)

At the conclusion of the algorithm, we have a nicely clustered data set. Also note that we have achieved our goal: Similar colors are grouped closely together


SOFM Applied to Genetics

- Consider clustering 10,000 genes
- Each gene was measured in 4 experiments
  - Input vectors are 4 dimensional
  - Initial pattern of 10,000 each described by a 4D vector
- Each of the 10,000 genes is chosen one at a time to train the SOM


SOFM Applied to Genetics

The pattern found to be closest to the current gene (determined by weight vectors) is selected as the winner

The weight is then modified to become more similar to the current gene based on the learning rate (t in the previous example)

The winner then pulls its neighbors closer to the current gene by causing a lesser change in weight


SOFM Applied to Genetics

- This process continues for all 10,000 genes
- Process is repeated until over time the learning rate is reduced to zero


Our Favorite Example With Yeast

- Reduce data set to 828 genes
- Clustered data into 30 clusters using a SOFM

“Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.

Each pattern is represented by its average (centroid) pattern

Clustered data has same behavior

Neighbors exhibit similar behavior

A SOFM Example With Yeast


Benefits of SOFM

- SOFM contains the set of features extracted from the input patterns (reduces dimensions)
- SOFM yields a set of clusters
- A gene will always be more similar to a gene in its immediate neighborhood than to a gene further away


Conclusion

- K-means is a simple yet effective algorithm for clustering data
- Self-organizing feature maps are slightly more computationally expensive, but they solve the problem of spatial relationship


References

- Basic microarray analysis: grouping and feature reduction by Soumya Raychaudhuri, Patrick D. Sutphin, Jeffery T. Chang and Russ B. Altman; Trends in Biotechnology Vol. 19 No. 5 May 2001
- Self Organizing Maps, Tom Germano, http://davis.wpi.edu/~matt/courses/soms
- “Data Analysis Tools for DNA Microarrays” by Sorin Draghici; Chapman & Hall/CRC 2003
- Self-Organizing-Feature-Maps versus Statistical Clustering Methods: A Benchmark by A. Ultsch, C. Vetter; FG Neuroinformatik & Künstliche Intelligenz Research Report 0994
- Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation by Tamayo et al.
- A Local Search Approximation Algorithm for k-Means Clustering by Kanungo et al.
- K-means-type algorithms: A generalized convergence theorem and characterization of local optimality by Selim and Ismail
