K-means and Kohonen Maps Unsupervised Clustering Techniques Steve Hookway 4/8/04.
-
Upload
barry-rostron -
Category
Documents
-
view
218 -
download
0
Transcript of K-means and Kohonen Maps Unsupervised Clustering Techniques Steve Hookway 4/8/04.
K-means and Kohonen MapsUnsupervised Clustering Techniques
Steve Hookway4/8/04
What is a DNA Microarray? An experiment on the order of 10k
elements A way to explore the function of a
gene A snapshot of the expression level
of an entire phenotype under given test conditions
Some Microarray Terminology
Probe: ssDNA printed on the solid substrate (nylon or glass) These are the genes we are going to be testing
Target: cDNA which has been labeled and is to be washed over the probe
Microarray Fabrication Deposition of DNA fragments
Deposition of PCR-amplified cDNA clones
Printing of already synthesized oligonucleotieds
In Situ synthesis Photolithography Ink Jet Printing Electrochemical Synthesis
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
cDNA Microarrays and Oligonucleotide Probes
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
cDNA Arrays Oligonucleotide Arrays
Long SequencesSpot Unknown SequencesMore variability
Short SequencesSpot Known SequencesMore reliable data
In Situ Synthesis Photochemically synthesized on the
chip Reduces noise caused by PCR,
cloning, and Spotting As previously mentioned, three
kinds of In Situ Synthesis Photolithography Ink Jet Printing Electrochemical Synthesis
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
Photolithography Similar to process used
to build VLSI circuits Photolithographic masks
are used to add each base
If base is present, there will be a hole in the corresponding mask
Can create high density arrays, but sequence length is limited
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
Photodeprotection
mask
C
Ink Jet Printing Four cartridges are loaded with the
four nucleotides: A, G, C,T As the printer head moves across
the array, the nucleotides are deposited where they are needed
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
Electrochemical Synthesis
Electrodes are embedded in the substrate to manage individual reaction sites
Electrodes are activated in necessary positions in a predetermined sequence that allows the sequences to be constructed base by base
Solutions containing specific bases are washed over the substrate while the electrodes are activated
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
Application of Microarrays We only know the
function of about 20% of the 30,000 genes in the Human Genome Gene exploration Faster and better
Can be used for DNA computing
http://www.gene-chips.com/sample1.html
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
A Data Mining Problem On a given Microarray we test on
the order of 10k elements at a time
Data is obtained faster than it can be processed
We need some ways to work through this large data set and make sense of the data
Grouping and Reduction Grouping: discovers patterns in the
data from a microarray Reduction: reduces the complexity
of data by removing redundant probes (genes) that will be used in subsequent assays
Unsupervised Grouping: Clustering
Pattern discovery via grouping similarly expressed genes together
Three techniques most often usedk-Means ClusteringHierarchical ClusteringKohonen Self Organizing Feature Maps
Clustering Limitations Any data can be clustered,
therefore we must be careful what conclusions we draw from our results
Clustering is non-deterministic and can and will produce different results on different runs
K-means Clustering Given a set of n data points in d-
dimensional space and an integer k We want to find the set of k points in
d-dimensional space that minimizes the mean squared distance from each data point to its nearest center
No exact polynomial-time algorithms are known for this problem
“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et. al
K-means Algorithm (Lloyd’s Algorithm)
Has been shown to converge to a locally optimal solution
But can converge to a solution arbitrarily bad compared to the optimal solution
•“K-means-type algorithms: A generalized convergence theorem and characterization of local optimality” by Selim and Ismail
•“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.
K=3
Data Points
Optimal Centers
Heuristic Centers
Euclidean Distance
n
iiiE yxyxd
1
2)(),(
543),( 22 AOd E
Now to find the distance between two points, say the origin and the point (3,4):
Simple and Fast! Remember this when we consider the complexity!
Finding a CentroidWe use the following equation to find the n dimensional centroid point amid k n dimensional points:
),...,2
,1
(),...,,( 11121 k
xnth
k
ndx
k
stxxxxCP
k
ii
k
ii
k
ii
k
Let’s find the midpoint between 3 2D points, say: (2,4) (5,2) (8,9)
)5,5()3
924,
3
852(
CP
K-means Algorithm1. Choose k initial center points randomly2. Cluster data using Euclidean distance (or other
distance metric)3. Calculate new center points for each cluster
using only points within the cluster4. Re-Cluster all data using the new center points
1. This step could cause data points to be placed in a different cluster
5. Repeat steps 3 & 4 until the center points have moved such that in step 4 no data points are moved from one cluster to another or some other convergence criteria is met
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
An example with k=2
1. We Pick k=2 centers at random
2. We cluster our data around these center points
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
K-means example with k=2
3. We recalculate centers based on our current clusters
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
K-means example with k=2
4. We re-cluster our data around our new center points
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
K-means example with k=2
5. We repeat the last two steps until no more data points are moved into a different cluster
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
Choosing k Use another clustering method Run algorithm on data with several
different values of k Use advance knowledge about the
characteristics of your test Cancerous vs Non-Cancerous
Cluster Quality Since any data can be clustered, how do
we know our clusters are meaningful? The size (diameter) of the cluster vs. The
inter-cluster distance Distance between the members of a cluster
and the cluster’s center Diameter of the smallest sphere
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
Cluster Quality Continued
size=5
size=5distance=2
0
distance=5
Quality of cluster assessed by ratio of distance to nearest cluster and cluster diameter
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
Cluster Quality Continued
Quality can be assessed simply by looking at the diameter of a cluster
A cluster can be formed even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created.From “Data Analysis Tools for DNA Microarrays” by
Sorin Draghici
Characteristics of k-means Clustering
The random selection of initial center points creates the following properties Non-Determinism May produce clusters without
patterns One solution is to choose the centers
randomly from existing patterns
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
Algorithm Complexity Linear in the number of data
points, N Can be shown to have time of cN
c does not depend on N, but rather the number of clusters, k
Low computational complexity High speedFrom “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
The Need for a New Algorithm
-Each data point is assigned to the correct cluster
-Data points that seem to be far away from each other in heuristic are in reality very closely related to each other
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
The Need for a New Algorithm
Eisen et al., 1998
Kohonen Self Organizing Feature Maps (SOFM)
Creates a map in which similar patterns are plotted next to each other
Data visualization technique that reduces n dimensions and displays similarities
More complex than k-means or hierarchical clustering, but more meaningful
Neural Network Technique Inspired by the brain
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
SOFM Description Each unit of the
SOFM has a weighted connection to all inputs
As the algorithm progresses, neighboring units are grouped by similarity
Input Layer
Output Layer
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
SOFM Algorithm
Initialize MapFor t from 0 to 1 //t is the learning factor
Randomly select a sampleGet best matching unitScale neighborsIncrease t a small amount //decrease learning factor
End for
From: http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color
Three dimensional data: red, blue, green
Will be converted into 2D image map with clustering of Dark Blue and Greys together and Yellow close to Both the Red and the Green
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color
Each color in the map is associated with a weight
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color1. Initialize the weights
Random Values
Colors in the Corners
Equidistant
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. Get best matching unit
After randomly selecting a sample, go through all weight vectors and calculate the best match (in this case using Euclidian distance)
Think of colors as 3D points each component (red, green, blue) on an axis
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. Getting the best matching unit continued…
For example, lets say we chose green as the sample. Then it can be shown that light green is closer to green than red:
Green: (0,6,0) Light Green: (3,6,3) Red(6,0,0)
49.80)6(6Re
24.4303
222
222
d
LightGreen
This step is repeated for entire map, and the weight with the shortest distance is chosen as the best match
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
3. Scale neighbors1. Determine which weights are
considred nieghbors2. How much each weight can become
more like the sample vector
From http://davis.wpi.edu/~matt/courses/soms/
1. Determine which weights are considered neighbors
In the example, a gaussian function is used where every point above 0 is considered a neighbor2266666667.6),( yxeyxf
An Example Using Color Continued
From http://davis.wpi.edu/~matt/courses/soms/
2. How much each weight can become more like the sample
When the weight with the smallest distance is chosen and the neighbors are determined, it and its neighbors ‘learn’ by changing to become more like the sample…The farther away a neighbor is, the less it learns
An Example Using Color Continued
NewColorValue = CurrentColor*(1-t)+sampleVector*t
For the first iteration t=1 since t can range from 0 to 1, for following iterations the value of t used in this formula decreases because there are fewer values in the range (as t increases in the for loop)
From http://davis.wpi.edu/~matt/courses/soms/
Conclusion of Example
Samples continue to be chosen at random until t becomes 1 (learning stops)
At the conclusion of the algorithm, we have a nicely clustered data set. Also note that we have achieved our goal: Similar colors are grouped closely together
From http://davis.wpi.edu/~matt/courses/soms/
SOFM Applied to Genetics
Consider clustering 10,000 genes Each gene was measured in 4
experiments Input vectors are 4 dimensional Initial pattern of 10,000 each
described by a 4D vector Each of the 10,000 genes is
chosen one at a time to train the SOM
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
SOFM Applied to Genetics
The pattern found to be closest to the current gene (determined by weight vectors) is selected as the winner
The weight is then modified to become more similar to the current gene based on the learning rate (t in the previous example)
The winner then pulls its neighbors closer to the current gene by causing a lesser change in weight
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
SOFM Applied to Genetics This process continues for all
10,000 genes Process is repeated until over time
the learning rate is reduced to zero
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
Our Favorite Example With Yeast
Reduce data set to 828 genes Clustered data into 30 clusters
using a SOFM
“Interpresting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.
Each pattern is represented by its average (centroid) pattern
Clustered data has same behavior
Neighbors exhibit similar behavior
A SOFM Example With Yeast
“Interpresting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.
Benefits of SOFM SOFM contains the set of features
extracted from the input patterns (reduces dimensions)
SOFM yields a set of clusters A gene will always be most similar
to a gene in its immediate neighborhood than a gene further away
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
Conclusion K-means is a simple yet effective
algorithm for clustering data Self-organizing feature maps are
slightly more computationally expensive, but they solve the problem of spatial relationship
“Interpresting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.
References Basic microarray analysis: grouping and feature
reduction by Soumya Raychaudhuri, Patrick D. Sutphin, Jeffery T. Chang and Russ B. Altman; Trends in Biotechnology Vol. 19 No. 5 May 2001
Self Organizing Maps, Tom Germano, http://davis.wpi.edu/~matt/courses/soms
“Data Analysis Tools for DNA Microarrays” by Sorin Draghici; Chapman & Hall/CRC 2003
Self-Organizing-Feature-Maps versus Statistical Clustering Methods: A Benchmark by A. Ultsh, C. Vetter; FG Neuroinformatik & Kunstliche Intelligenz Research Report 0994
References Interpreting patterns of gene expression with
self-organizing maps: Methods and application to hematopoietic differentiation by Tamayo et al.
A Local Search Approximation Algorithm for k-Means Clustering by Kanungo et al.
K-means-type algorithms: A generalized convergence theorem and characterization of local optimality by Selim and Ismail