Clustering and Network


Page 1: Clustering and Network

Clustering and Network

Bioinformatics in Biosophy

Park, Jong Hwa

MRC-DUNN, Hills Road, Cambridge CB2 2XY, England

Next: 02/06/2001

Page 2: Clustering and Network

What is clustering?

Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.

http://www-cse.ucsd.edu/~rik/foa/l2h/foa-5-4-2.html
http://cne.gmu.edu/modules/dau/stat/clustgalgs/clust1_frm.html

In other words, clustering means grouping data: dividing a large data set into smaller data sets that share some similarity.

Page 3: Clustering and Network

What is a clustering algorithm ?

A clustering algorithm attempts to find natural groups of components (or data) based on some similarity. It also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is essentially a statistical description of the cluster centroids, together with the number of components in each cluster.

Page 4: Clustering and Network

An error function is a function that indicates the quality of a clustering.

Definition: The centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the cluster.
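In symbols, for a cluster C of points x, the centroid c is the coordinate-wise mean:

$$\mathbf{c} = \frac{1}{|C|} \sum_{\mathbf{x} \in C} \mathbf{x}$$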

Page 5: Clustering and Network

What is the common metric for clustering techniques ?

Generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as:

$$d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}$$

Page 6: Clustering and Network

For sequence comparison, the distance can be a genetic distance (such as PAM).

For clustering expression profiles, the Euclidean distance can be used.

Distances are defined according to the problem.
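As a minimal illustration of the Euclidean metric on expression profiles (using numpy; the profile values here are made up):

```python
import numpy as np

# Two hypothetical expression profiles (one value per condition).
p = np.array([2.1, 0.5, 3.3, 1.0])
q = np.array([1.9, 0.7, 2.8, 1.4])

# Euclidean distance: square root of the sum of squared differences.
d = np.sqrt(np.sum((p - q) ** 2))   # equivalently np.linalg.norm(p - q)
print(d)
```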

Page 7: Clustering and Network

Kinds of Clustering Algorithms

Non-hierarchical clustering methods
• Single-pass methods
• Reallocation methods
• K-means clustering

Hierarchical clustering methods
• Group average link method (UPGMA)
• Single link method / MST algorithms
• Complete link method / Voorhees algorithm
• Ward's method (minimum variance method)
• Centroid and median methods
• General algorithm for HACM

Page 8: Clustering and Network

Hierarchical Clustering

Dendrograms are used for representation.

• General strategy: represent the similarity matrix as a graph, form a separate cluster around each node, and traverse the edges in decreasing order of similarity, merging two clusters according to some criterion.
• Merging criteria:
  • Single-link: merge maximally connected components.
  • Minimum spanning tree based approach: merge clusters connected by the MST edge with the smallest weight.
  • Complete-link: merge to get a maximally complete component.
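As a concrete illustration, SciPy's scipy.cluster.hierarchy module implements these merging criteria; a small sketch on synthetic 2-D data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data: two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [5.2, 5.1]])

# 'single' merges by minimum inter-cluster distance,
# 'complete' by maximum, and 'average' gives UPGMA.
Z = linkage(X, method='single')

# Cut the dendrogram into 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```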

Page 9: Clustering and Network

Partitional: a single partition is found.

Hierarchical: a sequence of nested partitions is found by merging two partitions at every step.

• Agglomerative: glue together smaller clusters

• Divisive: fragment a larger cluster into smaller ones.

Page 10: Clustering and Network

Partitional Clustering

Find a single partition of k clusters based on some clustering criterion.

• Clustering criteria:
  • local: forms clusters by utilizing local structure in the data (e.g. nearest neighbor clustering)
  • global: represents each cluster by a prototype and assigns a pattern to the cluster with the most similar prototype (e.g. K-means, Self-Organizing Maps)
• Many other techniques in the literature, such as density estimation and mixture decomposition.
• From [Jain & Dubes] Algorithms for Clustering Data, 1988

Page 11: Clustering and Network

Nearest Neighbor Clustering

• Input:
  • A threshold, t, on the nearest-neighbor distance.
  • A set of data points {x1, x2, …, xn}.
• Algorithm:
  • [Initialize: assign x1 to cluster C1. Set i = 1, k = 1.]
  • Set i = i + 1. Find the nearest neighbor of xi among the points already assigned to clusters.
  • Let the nearest neighbor be in cluster m. If its distance > t, then increment k and assign xi to a new cluster Ck; else assign xi to Cm.
  • If every data point is assigned to a cluster, then stop; else go to the first step above.
• From [Jain & Dubes] Algorithms for Clustering Data, 1988
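A minimal Python sketch of this procedure (the function name and the test data are mine, not from the slides):

```python
import numpy as np

def nearest_neighbor_clustering(points, t):
    """Threshold-based nearest-neighbor clustering (after Jain & Dubes).

    points: array of shape (n, d); t: nearest-neighbor distance threshold.
    Returns one cluster label per point.
    """
    labels = [0]                      # x1 starts cluster C1 (label 0)
    k = 0                             # index of the newest cluster
    for i in range(1, len(points)):
        # Distances from x_i to every already-assigned point.
        dists = np.linalg.norm(points[:i] - points[i], axis=1)
        j = int(np.argmin(dists))     # nearest already-assigned point
        if dists[j] > t:
            k += 1                    # open a new cluster
            labels.append(k)
        else:
            labels.append(labels[j])  # join the neighbor's cluster
    return labels

X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])
print(nearest_neighbor_clustering(X, t=1.0))  # -> [0, 0, 1, 1]
```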

Page 12: Clustering and Network

Iterative Partitional Clustering

Input:
• K, the number of clusters; a set of data points {x1, x2, …, xn};
• a measure of distance between them (e.g. Euclidean, Mahalanobis); and a clustering criterion (e.g. minimize squared error)

Algorithm:
• [Initialize: a random partition with K cluster centers.]
• Generate a new partition by assigning each data point to its closest cluster center.
• Compute new cluster centers as centroids of the clusters.
• Repeat the above two steps until an optimum value of the criterion is found.
• Finally, adjust the number of clusters by merging/splitting existing clusters, or by removing small (outlier) clusters.
• From [Jain & Dubes] Algorithms for Clustering Data, 1988
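A compact Python sketch of this loop (the initialization and convergence test are choices of mine; the slide leaves them open):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Iterative partitional (Lloyd-style K-means) clustering.

    X: (n, d) data matrix; k: number of clusters.
    Returns (labels, centers).
    """
    rng = np.random.default_rng(seed)
    # Initialize: k distinct data points serve as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its closest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers as centroids of the new clusters
        # (an empty cluster keeps its old center).
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # criterion has converged
            break
        centers = new_centers
    return labels, centers
```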

Page 13: Clustering and Network

AVERAGE LINKAGE CLUSTERING: The dissimilarity between clusters is calculated using average values. Unfortunately, there are many ways of calculating an average! The most common (and recommended if there is no reason for using other methods) is UPGMA, the Unweighted Pair-Groups Method Average.

The average distance is calculated from the distances between each point in one cluster and all points in the other cluster. The two clusters with the lowest average distance are joined together to form the new cluster.
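In symbols, the UPGMA distance between clusters A and B is the mean of all pairwise distances:

$$d(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$$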

(Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy, pp. 230-234, W.H. Freeman and Company, San Francisco, California, USA)

Page 14: Clustering and Network

The GCG program PILEUP uses UPGMA to create its dendrogram of DNA sequences, and then uses this dendrogram to guide its multiple alignment algorithm.

The GCG program DISTANCES calculates pairwise distances between a group of sequences.

Page 17: Clustering and Network

COMPLETE LINKAGE CLUSTERING

(Maximum or Furthest-Neighbour Method): The dissimilarity between 2 groups is equal to the greatest dissimilarity between a member of cluster i and a member of cluster j.

Page 18: Clustering and Network

Furthest Neighbour

This method tends to produce very tight clusters of similar cases.

Page 19: Clustering and Network

SINGLE LINKAGE CLUSTERING

(Minimum or Nearest-Neighbour Method): The dissimilarity between 2 clusters is the minimum dissimilarity between members of the two clusters.

This method produces long chains, which form loose, straggly clusters. It has been widely used in numerical taxonomy.

Page 20: Clustering and Network

WITHIN GROUPS CLUSTERING

This is similar to UPGMA except clusters are fused so that within cluster variance is minimised. This tends to produce tighter clusters than the UPGMA method.

UPGMA: Unweighted Pair-Groups Method Average

Page 21: Clustering and Network

Ward’s method

Cluster membership is assessed by calculating the total sum of squared deviations from the mean of a cluster.

The criterion for fusion is that it should produce the smallest possible increase in the error sum of squares.

Lance, G. N. and Williams, W. T. 1967. A general theory of classificatory sorting strategies. Computer Journal, 9: 373-380.

Page 22: Clustering and Network

K-Means Clustering Algorithm

This non-hierarchical method initially takes a number of components of the population equal to the final required number of clusters.

In this first step, these seed points are chosen to be mutually farthest apart.

Next, it examines each component in the population and assigns it to one of the clusters depending on the minimum distance.

The centroid's position is recalculated every time a component is added to the cluster, and this continues until all the components are grouped into the final required number of clusters.
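The slide describes a sequential variant in which the centroid moves as each component arrives. A minimal sketch of that update (the incremental-mean identity is standard, not from the slides):

```python
import numpy as np

def sequential_kmeans(X, centers):
    """One sequential pass: each point joins the nearest cluster and
    that cluster's centroid is updated immediately.

    centers: initial (k, d) array, e.g. k mutually far-apart points.
    """
    centers = centers.astype(float).copy()
    counts = np.ones(len(centers))        # each center starts as one point
    labels = []
    for x in X:
        j = int(np.linalg.norm(centers - x, axis=1).argmin())
        counts[j] += 1
        # Incremental mean: move the centroid toward x by 1/count.
        centers[j] += (x - centers[j]) / counts[j]
        labels.append(j)
    return labels, centers
```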

Page 23: Clustering and Network

Complexity of K-means Algorithm

• Time complexity = O(RKN), where R is the number of iterations, K the number of clusters, and N the number of points

• Space complexity = O(N)

Page 24: Clustering and Network

K-Medians Algorithm

The K-medians algorithm is similar to the K-means algorithm except that it uses a median instead of a mean.

• Time complexity = O(RN²), where R is the number of iterations

• Space complexity = O(N)

Page 25: Clustering and Network

K-Means VS. K-Medians (1)

• The K-means algorithm requires a continuous space, so that a mean is a potential element of the space

• The K-medians algorithm also works in discrete spaces, where a mean has no meaning

• K-means requires less computational time because it is easier to compute a mean than a median

Page 26: Clustering and Network

Problems with K-means Clustering

• Achieving a globally minimal error is NP-complete

• Very sensitive to the initial points

• When used with large databases, the time complexity can easily become intractable

• Existing algorithms are not generic enough to detect clusters of various shapes (spherical, non-spherical, etc.)

Page 27: Clustering and Network

Genetic Clustering Algorithm

• Genetic clustering algorithms
  * achieve a "better" clustering result than K-means

• Refining the initial points
  * achieves a "better" local minimum and reduces convergence time

Page 28: Clustering and Network

A Genetic Clustering Algorithm

• "Clustering Using a Coarse-Grained Parallel Genetic Algorithm: A Preliminary Study", N. K. Ratha, A. K. Jain, and M. J. Chung, IEEE, 1995

• Uses a genetic algorithm to solve a K-means clustering problem formulated as an optimization problem.

• We can also view it as a label assignment problem: assign a label from {1, 2, …, K} to each pattern so as to optimize the similarity function.

Page 29: Clustering and Network

Definition of Genetic Algorithm

• Search based on the "survival of the fittest" principle [R. Bianchini et al., 1993]

• The "fittest candidate" is the best solution found at any given time.

• Run the evolution process for a sufficiently large number of generations.

Page 30: Clustering and Network

Simple Genetic Algorithm

function GENETIC-ALGO(population, FITNESS-FN) returns an individual
  inputs: population, a set of individuals (fixed number)
          FITNESS-FN, a function that measures the fitness of an individual
  repeat
    parents = SELECTION(population, FITNESS-FN)
    population = REPRODUCTION(parents)
  until some individual is fit enough
  return the best individual in population, according to FITNESS-FN
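A runnable Python rendering of this pseudocode for bit-string individuals (the choices of fitness-proportional SELECTION, one-point crossover, and single-bit mutation are mine; the pseudocode leaves them abstract):

```python
import random

def genetic_algo(population, fitness_fn, fit_enough, max_gens=1000):
    """Evolve tuple-of-bits individuals until one is fit enough."""
    for _ in range(max_gens):
        best = max(population, key=fitness_fn)
        if fitness_fn(best) >= fit_enough:       # "fit enough" test
            break
        # SELECTION: fitness-proportional sampling of parents.
        weights = [fitness_fn(ind) + 1e-9 for ind in population]
        parents = random.choices(population, weights=weights,
                                 k=len(population))
        # REPRODUCTION: one-point crossover plus one-bit mutation.
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = random.randrange(1, len(a))    # crossover point
            child = a[:cut] + b[cut:]
            i = random.randrange(len(child))     # flip one random bit
            child = child[:i] + (1 - child[i],) + child[i + 1:]
            children.append(child)
        population = children + parents[len(children):]  # keep size fixed
    return max(population, key=fitness_fn)

# Toy usage: maximize the number of 1-bits in a length-10 string.
pop = [tuple(random.randint(0, 1) for _ in range(10)) for _ in range(20)]
print(genetic_algo(pop, fitness_fn=sum, fit_enough=10))
```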

Page 31: Clustering and Network

Pros and Cons

Pros

• Clustering results are better than those of the K-means algorithm.

Cons

• Search space grows exponentially as a function of the problem size.

• Parallel computing helps, but not much.

Page 32: Clustering and Network

Need for better clustering algorithms

Enormity of data
• Hierarchical clusterings soon become impractical.

High dimensionality
• Distance-based algorithms become ill-defined because of the curse of dimensionality.
• The notion of a neighborhood as physical proximity collapses.
• All the data is far from the mean!

Handling noise
• The similarity measure becomes noisy as the hierarchical algorithm groups more and more points; hence clusters that should not have been merged may get merged!

Page 33: Clustering and Network

Handling High Dimensionality

• Reduce the dimensionality and apply traditional techniques.
• Dimensionality reduction:
  • Principal Component Analysis (PCA), Latent Semantic Indexing (LSI):
    • Use Singular Value Decomposition (SVD) to determine the most influential features (maximum eigenvalues).
    • Given data in an n x m matrix format (n data points, m attributes), PCA computes the SVD of the covariance matrix of the attributes, whereas LSI computes the SVD of the original data matrix. LSI is faster and more memory-efficient, and has been successful in the information retrieval domain (clustering documents).
  • Multidimensional Scaling (MDS):
    • Preserves the original rank ordering of the distances among data points.
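A small numpy sketch contrasting the two decompositions (synthetic data; centering the data before the SVD makes it equivalent to the covariance-based PCA described above):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 8))        # n=100 data points, m=8 attributes

# PCA: SVD of the mean-centered data, equivalent to an
# eigendecomposition of the attribute covariance matrix.
Ac = A - A.mean(axis=0)
_, s, Vt = np.linalg.svd(Ac, full_matrices=False)
pcs = Vt[:2]                          # two most influential directions
scores = Ac @ pcs.T                   # data projected onto 2 components

# LSI-style reduction: SVD of the original data matrix itself.
U, s2, Vt2 = np.linalg.svd(A, full_matrices=False)
reduced = U[:, :2] * s2[:2]           # rank-2 representation of the rows
```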

Page 34: Clustering and Network

Clustering in High Dimensional Data Sets

DNA/protein/interaction data are high-dimensional.

• Traditional distance-based approach

• Hypergraph-based approach

Page 35: Clustering and Network

Hypergraph-Based Clustering

• Construct a hypergraph in which related data are connected via hyperedges.

• How do we find related sets of data items? Use Association Rules!

• Partition this hypergraph in a way such that each partition contains highly connected data.

Page 36: Clustering and Network

Graph

• Definition: A set of items connected by edges. Each item is called a vertex or node. Formally, a graph is a set of vertices and a relation between vertices, adjacency.

• See also: directed graph, undirected graph, acyclic graph, biconnected graph, connected graph, complete graph, sparse graph, dense graph, hypergraph, multigraph, labeled graph, weighted graph, self-loop, isomorphic, homomorphic, graph drawing, diameter, degree, dual, adjacency-list representation, adjacency-matrix representation.

• Note: Graphs are so general that many other data structures, such as trees, are just special kinds of graphs. Graphs are usually represented G = (V,E), where V is the set of vertices and E is the set of edges. If the graph is undirected, the adjacency relation is symmetric. If the graph does not allow self-loops, adjacency is irreflexive.

• A graph is like a road map. Cities are vertices. Roads from city to city are edges. (How about junctions or branches in a road? You could consider junctions to be vertices, too. If you don't want to count them as vertices, a road may connect more than two cities. So strictly speaking you have hyperedges in a hypergraph. It all depends on how you want to define it.)

• Another way to think of a graph is as a bunch of dots connected by lines. Because mathematicians stopped talking to regular people long ago, the dots in a graph are called vertices, and the lines that connect the dots are called edges. The important things are the edges and the vertices: the dots and the connections between them. The actual position of a given dot or the length or straightness of a given line isn't at issue. Thus the dots can be anywhere, and the lines that join them are infinitely stretchy. Moreover, a mathematical graph is not a comparison chart, nor a diagram with an x- and y-axis, nor a squiggly line on a stock report. A graph is simply dots and lines between them (pardon me, vertices and edges).
  Michael Bolton <[email protected]>, 22 February 2000

Page 37: Clustering and Network

Graph

• Formally a graph is a pair (V,E) where V is any set, called the vertex set, and the edge set E is any subset of the set of all 2-element subsets of V. Usually the elements of V, the vertices, are illustrated by bold points or small circles, and the edges by lines between them.

Page 38: Clustering and Network

Hypergraph

• Definition: A graph whose hyperedges connect two or more vertices.

• See also: multigraph, undirected graph.

• Note: Consider "family," a relation connecting two or more people. If each person is a vertex, a family edge connects the father, mother, and all of their children. So G = (people, family) is a hypergraph. Contrast this with the binary relations "married to," which connects a man and a woman, or "child of," which is directed from a child to his or her father or mother.
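In code, a hyperedge can simply be stored as a set of vertices rather than a pair; a toy sketch of the family example from the note:

```python
# Vertices are people; one hyperedge connects the whole family.
people = {"father", "mother", "child1", "child2", "aunt"}
family = [{"father", "mother", "child1", "child2"}]   # a hyperedge

# An ordinary binary relation, by contrast, uses 2-element edges.
married_to = [("father", "mother")]
```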

Page 39: Clustering and Network

General Approach for High Dimensional Data Sets

[Flow diagram] Data is converted either into a sparse hypergraph (via association rules), which is then partitioned (partitioning-based clustering), or into a sparse graph (via a similarity measure), which is clustered agglomeratively.

Page 40: Clustering and Network

References

[1] Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, 1996.
[2] Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales, and Customer Support), John Wiley & Sons, 1997.
[3] Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining (A Practical Guide), Morgan Kaufmann Publishers, 1998.
[4] Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998.
[5] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[6] V. Cherkassky and F. Mulier, Learning from Data, John Wiley & Sons, 1998.
[7] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis, Introduction to Parallel Computing: Algorithm Design and Analysis, Benjamin Cummings / Addison Wesley, Redwood City, 1994.

Research paper references:

[1] J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[2] M. Mehta, R. Agrawal, and J. Rissanen, SLIQ: A Fast Scalable Classifier for Data Mining, Proc. of the Fifth Int. Conf. on Extending Database Technology (EDBT), Avignon, France, 1996.
[3] J. Shafer, R. Agrawal, and M. Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proc. 22nd Int. Conf. on Very Large Databases, Mumbai, India, 1996.

Page 41: Clustering and Network

Gene expression and genetic network analysis

A gene's expression level is the number of copies of that gene's RNA produced in a cell, and it correlates with the amount of the corresponding protein produced.

DNA microarrays greatly improve the scalability and accuracy of gene expression level monitoring: they can simultaneously monitor thousands of gene expression levels.

http://www.ib3.gmu.edu/gref/S01/csi739/overview.pdf

Page 44: Clustering and Network

Goals of Gene Expression Analysis

What genes are or are not expressed?

Correlate expression with other parameters:
• developmental state
• cell types
• external conditions
• disease states

Outcomes of the analysis:
• functions of unknown genes
• identify co-regulated groups
• identify gene regulators and inhibitors
• environmental impact on gene expression
• diagnostic gene expression patterns

Page 45: Clustering and Network

Methods for Gene Expression Analysis

Early processing:
• image analysis
• statistical analysis of redundant array elements
• output raw or normalized expression levels
• store results in a database

Clustering:
• visualization
• unsupervised methods
• supervised methods

Modeling:
• reverse engineering
• genetic network inference

Page 48: Clustering and Network

Unsupervised Clustering Methods

Direct visual inspection
• Carr et al (1997) Stat Comp Graph News 8(1)
• Michaels et al (1998) PSB 3:42-53

Hierarchical clustering
• DeRisi et al (1996) Nature Genetics 14:457-460

Average linkage
• Eisen et al (1998) PNAS 95:14863-14868
• Alizadeh (2000) Nature 403:503-511

k-means
• Tavazoie et al (1999) Nature Genetics 22:281-285

Page 49: Clustering and Network

Unsupervised Clustering Methods

SOMs
• Toronen et al (1999) FEBS Letters 451:142-146
• Tamayo et al (1999) PNAS 96:2907-2912

Relevance networks
• Butte et al (2000) PSB 5:415-426

SVD/PCA
• Alter et al (2000) PNAS 97(18):10101-10106

Two-way clustering
• Getz et al (2000) PNAS 97(22):12079-12084
• Alon et al (1999) PNAS 96:6745-6750

Page 50: Clustering and Network

Supervised Learning

Goal: classification
• genes
• disease state
• developmental state
• effects of environmental signals

Linear discriminant

Decision trees

Support vector machines
• Brown et al (2000) PNAS 97(1):262-267

Page 52: Clustering and Network

Somogyi and Sniegoski, Complexity, 1996

Page 53: Clustering and Network

Gene regulation network models

– Somogyi and Sniegoski (1996) Complexity 1(6)

Boolean models
• Kauffman

Weight matrix
• Weaver et al (1999) PSB 4

Petri nets
• Matsuno et al (2000) PSB 5

Differential equation models
• Chen et al (1999) PSB 4

Page 54: Clustering and Network

Gene Network Inference Methods

Reverse engineering

• – Liang et al (1998), PSB 3:18-29

• – Akutsu et al (1999), PSB 4: 17-28

Perturbation methods

• – Ideker (2000) PSB 5: 302-313

Determinations

• – Kim et al (2000) Genomics 67:201-209

Page 55: Clustering and Network

Recent Applications

Gene function assignment
• Brown et al (2000) PNAS 97(1):262-267
• Alon et al (1999) PNAS 96:6745-6750

Cell cycle
• DeRisi et al (1997) Science 278:680-686
• Toronen et al (1999) FEBS Letters 451:142-146
• Alter et al (2000) PNAS 97(18):10101-10106

Cell response to external conditions
• Alter et al (2000) PNAS 97(18):10101-10106

Cancer therapeutics
• Butte et al (2000) PNAS 97(22):12182-12186
• Getz et al (2000) PNAS 97(22):12079-12084
• Tamayo et al (1999) PNAS 96:2907-2912

Cancer diagnosis
• DeRisi et al (1996) Nature Genetics 14:457-460
• Alon et al (1999) PNAS 96:6745-6750

Page 56: Clustering and Network

Microarray Analysis Software

Michael Eisen’s Lab (http://rana.lbl.gov)

Data Analysis

• Cluster: performs a variety of types of cluster analysis and other types of processing on large microarray datasets. Currently includes hierarchical clustering, self-organizing maps (SOMs), k-means clustering, and principal component analysis. (Eisen et al. (1998) PNAS 95:14863)

• TreeView: graphically browse results of clustering and other analyses from Cluster. Supports tree-based and image-based browsing of hierarchical trees. Multiple output formats for generation of images for publications.

Page 57: Clustering and Network

Informatics
• image analysis
• gene expression raw data
• database issues
• data volumes
• sources of errors

http://www.ib3.gmu.edu/gref/S01/csi739/schedule.html

Page 58: Clustering and Network

Boolean Network

(Binary Network)

Page 59: Clustering and Network

Boolean Genetic Network Modeling

Goals
• Understand global characteristics of genetic regulation networks

Topics
• Boolean network models
  – terminology
  – dynamics
• Inference of models from gene expression data
  – cluster analysis
  – mutual information
• Extensions to the model

Page 60: Clustering and Network

Patterns of Gene Regulation

Genes typically interact with more than one partner.

Page 61: Clustering and Network

Wiring Diagrams

Three genes: A, B, C
• A activates B
• B activates A and C
• C inhibits A

Many ways to represent interaction rules:
• Boolean (logical) function
• Sigmoid function
• Semi-linear models
• etc.

http://www.ib3.gmu.edu/gref/S01/csi739/gene_networks.pdf
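A tiny simulation of this three-gene wiring as a Boolean network (the rule combining B's activation and C's inhibition of A is one plausible reading; the slides do not pin it down):

```python
# Boolean (logical) rules for the three-gene wiring above. The rule
# for A assumes activation by B succeeds only when C is off.
def step(state):
    a, b, c = state["A"], state["B"], state["C"]
    return {
        "A": b and not c,   # B activates A, C inhibits A (assumption)
        "B": a,             # A activates B
        "C": b,             # B activates C
    }

state = {"A": True, "B": False, "C": False}
for t in range(6):
    print(t, state)
    state = step(state)     # all genes updated simultaneously
```

Running this shows the trajectory settling into a periodic cycle of states, one of the characteristic behaviors of Boolean networks noted later in these slides.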

Page 64: Clustering and Network

• The dynamics of Boolean networks of any complexity are determined by the wiring and rules, or state-transition tables.

Page 65: Clustering and Network

• Time is discrete and all genes are updated simultaneously.

Page 77: Clustering and Network

Data Requirements

Data sources
• time series
• different environmental conditions

A fully connected Boolean model with N genes requires 2^N observations.

A Boolean model with at most k inputs per gene requires O(2^k log(N)) observations [Akutsu, PSB 1999]

• e.g., 1000 genes, 3 inputs => 80 data points (arrays)
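The quoted figure follows directly from the bound:

$$2^k \log_2 N = 2^3 \cdot \log_2 1000 \approx 8 \times 9.97 \approx 80$$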

Page 78: Clustering and Network

Reverse Engineering

Given: a (large) set of gene expression observations
Find: a wiring diagram and transition rules such that the network fits the observed data

• Example methods
  – cluster analysis
  – mutual information

Page 82: Clustering and Network

Information can be quantified: Shannon entropy (H)

Page 83: Clustering and Network

Shannon entropy (H)

• can be calculated from the probabilities of occurrence of individual or combined events.
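For a discrete variable X whose states occur with probabilities p_i, the entropy (in bits) is:

$$H(X) = -\sum_i p_i \log_2 p_i$$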

Page 84: Clustering and Network

The Shannon entropy is maximal when all states are equiprobable.

Page 85: Clustering and Network

• Mutual information (M): the information (Shannon entropy) shared by non-independent elements
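For two variables this is commonly computed as M(X,Y) = H(X) + H(Y) - H(X,Y). A small Python sketch estimating entropies and mutual information from binarized expression profiles (the data here is made up):

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy H (bits) of a sequence of discrete observations."""
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """M(X,Y) = H(X) + H(Y) - H(X,Y), estimated from joint observations."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Toy binarized expression profiles across 8 arrays:
gene_x = [0, 1, 1, 0, 1, 0, 1, 0]
gene_y = [1, 0, 0, 1, 0, 1, 0, 1]   # perfectly anti-correlated with x
print(mutual_information(gene_x, gene_y))   # 1.0 bit: fully dependent
```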

Page 93: Clustering and Network

Summary

Gene regulation involves distributed function, redundancy and combinatorial coding

Boolean networks provide a promising initial framework for understanding gene regulation networks

Page 94: Clustering and Network

Boolean Net and Reverse Engineering

Boolean networks exhibit:

– Global complex behavior

– Self-organization

– Stability

– Redundancy

– Periodicity

Reverse Engineering

– tries to infer wiring diagrams and transition functions from observed gene expression patterns

More realistic network models include

– continuous expression levels

– continuous time

– continuous transition functions

– many more biologically important variables