L16: Micro-array analysis Dimension reduction Unsupervised clustering.

  • date post

    20-Dec-2015
L16: Micro-array analysis

Dimension reduction
Unsupervised clustering


PCA: motivating example

• Consider the expression values of 2 genes over 6 samples.

• Clearly, the expression of g1 is not informative, and it suffices to look at the g2 values.

• Dimensionality can be reduced by discarding the gene g1.

[Figure: plot of the 6 samples, with axes g1 and g2]


PCA: Ex2

• Consider the expression values of 2 genes over 6 samples.

• Clearly, the expression of the two genes is highly correlated.

• Projecting all the points onto a single line could explain most of the data.


PCA

• Suppose all of the data were to be reduced by projecting onto a single line through the mean m.

• How do we select the line?

[Figure: data points and a line through the mean m]


PCA cont’d

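• Let each point $x_k$ map to $x'_k = m + a_k\beta$, where $\beta$ is a unit vector along the line. We want to minimize the error

$$\sum_k \lVert x_k - x'_k \rVert^2$$

• Observation 1: Each point $x_k$ maps to $x'_k = m + \beta\beta^T(x_k - m)$
– ($a_k = \beta^T(x_k - m)$)

[Figure: a point $x_k$ and its projection $x'_k$ onto the line through the mean m]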


Proof of Observation 1

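$$
\begin{aligned}
\min_{a_k} \lVert x_k - x'_k \rVert^2
&= \min_{a_k} \lVert x_k - m + m - x'_k \rVert^2 \\
&= \min_{a_k} \lVert x_k - m \rVert^2 + \lVert m - x'_k \rVert^2 - 2 (x'_k - m)^T (x_k - m) \\
&= \min_{a_k} \lVert x_k - m \rVert^2 + a_k^2 \beta^T \beta - 2 a_k \beta^T (x_k - m) \\
&= \min_{a_k} \lVert x_k - m \rVert^2 + a_k^2 - 2 a_k \beta^T (x_k - m)
\end{aligned}
$$

Differentiating w.r.t. $a_k$:

$$
\begin{aligned}
2 a_k - 2 \beta^T (x_k - m) &= 0 \\
a_k &= \beta^T (x_k - m) \\
\Rightarrow\ a_k^2 &= a_k \beta^T (x_k - m) \\
\Rightarrow\ \lVert x_k - x'_k \rVert^2 &= \lVert x_k - m \rVert^2 - \beta^T (x_k - m)(x_k - m)^T \beta
\end{aligned}
$$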
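The result of the proof can be checked numerically. The following is only a sketch in plain Python: the point `x`, the mean `m`, and the direction `beta` are made-up values, and the helper names `project` and `sq_dist` are illustrative, not part of the lecture.

```python
import math

def project(x, m, beta):
    """Return (a, x_proj) with a = beta^T (x - m) and x_proj = m + a * beta."""
    a = sum(b * (xi - mi) for b, xi, mi in zip(beta, x, m))
    return a, [mi + a * b for mi, b in zip(m, beta)]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Hypothetical values, chosen only for illustration.
x, m = [3.0, 1.0], [1.0, 1.0]
beta = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # unit vector, so beta^T beta = 1

a, x_proj = project(x, m, beta)
lhs = sq_dist(x, x_proj)        # ||x - x'||^2
rhs = sq_dist(x, m) - a ** 2    # ||x - m||^2 - (beta^T (x - m))^2

# Any other coefficient on the same line should give a larger error.
worse = sq_dist(x, [mi + (a + 0.5) * b for mi, b in zip(m, beta)])
```

Here `lhs` and `rhs` should agree up to rounding, and `worse` should exceed `lhs`, which is what the derivative condition predicts.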


Minimizing PCA Error

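$$\sum_k \lVert x_k - x'_k \rVert^2 = C - \beta^T \Big[ \sum_k (x_k - m)(x_k - m)^T \Big] \beta = C - \beta^T S \beta$$

where $C = \sum_k \lVert x_k - m \rVert^2$ and $S = \sum_k (x_k - m)(x_k - m)^T$.

• To minimize error, we must maximize $\beta^T S \beta$

• Under the constraint $\beta^T \beta = 1$, the maximizer satisfies $S\beta = \lambda\beta$, so $\lambda = \beta^T S \beta$ is an eigenvalue of $S$, and $\beta$ the corresponding eigenvector.

• Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.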
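As a rough illustration of the error formula, the sketch below builds the scatter matrix for two correlated genes and compares the projection error along different directions. The expression values are invented for this example, and the leading eigenvector is obtained with the closed-form principal-axis angle for a symmetric 2x2 matrix, a shortcut assumed here that only works in two dimensions.

```python
import math

# Hypothetical expression values of two correlated genes over 6 samples.
g1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
g2 = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
n = len(g1)
m = (sum(g1) / n, sum(g2) / n)
centered = [(x - m[0], y - m[1]) for x, y in zip(g1, g2)]

# Scatter matrix S = sum_k (x_k - m)(x_k - m)^T, stored as s11, s12, s22.
s11 = sum(x * x for x, _ in centered)
s12 = sum(x * y for x, y in centered)
s22 = sum(y * y for _, y in centered)
C = s11 + s22  # sum_k ||x_k - m||^2

def quad_form(beta):
    """beta^T S beta for a 2-vector beta."""
    return s11 * beta[0] ** 2 + 2 * s12 * beta[0] * beta[1] + s22 * beta[1] ** 2

def projection_error(beta):
    """Sum of squared distances from each point to its projection on the line."""
    total = 0.0
    for x, y in centered:
        a = beta[0] * x + beta[1] * y
        total += (x - a * beta[0]) ** 2 + (y - a * beta[1]) ** 2
    return total

# Leading eigenvector of a symmetric 2x2 matrix via the principal-axis angle.
theta = 0.5 * math.atan2(2 * s12, s11 - s22)
beta = (math.cos(theta), math.sin(theta))
```

With these values, `projection_error(beta)` should match `C - quad_form(beta)` and should not exceed the error obtained by projecting onto either gene axis alone.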


PCA

• The single best dimension is given by the eigenvector of the largest eigenvalue of S

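• The best k dimensions can be obtained by the eigenvectors $\{\beta_1, \beta_2, \ldots, \beta_k\}$ corresponding to the k largest eigenvalues.

• To obtain the k-dimensional surface, take $B^T M$, where the rows of $B^T$ are $\beta_1^T, \ldots, \beta_k^T$.

[Figure: the matrix $B^T$, with first row $\beta_1^T$, multiplied by the matrix $M$]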
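One way to sketch the k-dimensional projection without any libraries is power iteration with deflation, which is a stand-in chosen here and not a method named in the slides. The sketch assumes `M` holds mean-centered data with one row per gene and one column per sample; the numbers and the function name `top_eigenvectors` are hypothetical.

```python
import math

def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def top_eigenvectors(S, k, iters=1000):
    """Approximate the k leading eigenvectors of a symmetric PSD matrix S
    by power iteration with deflation."""
    d = len(S)
    A = [row[:] for row in S]             # work on a copy; S is left untouched
    vectors = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d      # arbitrary unit start vector
        for _step in range(iters):
            w = matvec(A, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(A, v)))
        vectors.append(v)
        # Deflate: remove the component just found, A <- A - lam * v v^T.
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return vectors

# Hypothetical mean-centered data: 3 genes (rows) over 6 samples (columns).
M = [
    [-5.0, -3.0, -1.0, 1.0, 3.0, 5.0],
    [2.0, -2.0, 1.0, -1.0, 2.0, -2.0],
    [0.5, 0.5, -1.0, 0.0, -0.5, 0.5],
]
n_samples = len(M[0])
S = [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in M] for row_i in M]

B_T = top_eigenvectors(S, k=2)            # rows are beta_1^T and beta_2^T
Y = [[sum(b * M[g][s] for g, b in enumerate(beta)) for s in range(n_samples)]
     for beta in B_T]                     # Y = B^T M, a 2 x 6 matrix
```

Each column of `Y` is then the 2-dimensional representation of one sample. In practice a library eigensolver would normally replace `top_eigenvectors`.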


Clustering

• Suppose we are not given any classes.

• Instead, we are asked to partition the samples into clusters that make sense.

• Alternatively, partition genes into clusters.

• Clustering is part of unsupervised learning


Microarray Data

• Microarray data are usually transformed into an intensity matrix (below)

• The intensity matrix allows biologists to make correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related

• Clustering comes into play

Intensity (expression level) of gene at measured time

          Time 1   …   Time i   …   Time N
Gene 1    10           8            10
Gene 2    10           0            9
Gene 3    4            8.6          3
Gene 4    7            8            3
Gene 5    1            2            3


Clustering of Microarray Data

• Plot each gene as a point in N-dimensional space

• Make a distance matrix for the distance between every two gene points in the N-dimensional space

• Genes with a small distance share the same expression characteristics and might be functionally related or similar

• Clustering reveals groups of functionally related genes
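The distance-matrix step can be sketched as follows, using the five genes of the intensity matrix as points in 3-dimensional space. Euclidean distance is an assumption made for this example, and the variable names are illustrative.

```python
import math

# Intensity matrix from the slide: one row per gene, columns Time 1, Time i, Time N.
intensity = {
    "Gene 1": (10, 8, 10),
    "Gene 2": (10, 0, 9),
    "Gene 3": (4, 8.6, 3),
    "Gene 4": (7, 8, 3),
    "Gene 5": (1, 2, 3),
}
names = list(intensity)

# D[i][j] is the Euclidean distance between gene i and gene j.
D = [[math.dist(intensity[a], intensity[b]) for b in names] for a in names]

# Index pair of the two closest genes (candidates for the same cluster).
closest = min(
    ((i, j) for i in range(len(names)) for j in range(i + 1, len(names))),
    key=lambda ij: D[ij[0]][ij[1]],
)
```

For this table the closest pair is Gene 3 and Gene 4, whose profiles differ mainly at the first time point.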


Clusters

Graphing the intensity matrix in multi-dimensional space


The Distance Matrix, d


Homogeneity and Separation Principles

• Homogeneity: Elements within a cluster are close to each other

• Separation: Elements in different clusters are further apart from each other

• …clustering is not an easy task!

Given these points a clustering algorithm might make two distinct clusters as follows


Bad Clustering

This clustering violates both Homogeneity and Separation principles

Close distances from points in separate clusters

Far distances from points in the same cluster


Good Clustering

This clustering satisfies both Homogeneity and Separation principles
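The two principles can be turned into numbers in several ways. The sketch below uses one possible choice, assumed here for illustration: the largest within-cluster distance as a homogeneity score and the smallest between-cluster distance as a separation score. The points and both clusterings are invented.

```python
import math
from itertools import combinations

def homogeneity(clusters):
    """Largest distance between two points in the same cluster (smaller is better)."""
    return max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )

def separation(clusters):
    """Smallest distance between points in different clusters (larger is better)."""
    return min(
        math.dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    )

# Hypothetical 2-D points forming two visibly distinct groups.
good = [[(0, 0), (0, 1), (1, 0)], [(5, 5), (5, 6), (6, 5)]]
# The same points, but with members of the two groups mixed together.
bad = [[(0, 0), (0, 1), (5, 5)], [(1, 0), (5, 6), (6, 5)]]
```

For `good`, the within-cluster distances stay well below the between-cluster distances; for `bad`, the relation is reversed, which matches the "far distances in the same cluster, close distances across clusters" description.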


Clustering Techniques

• Agglomerative: Start with every element in its own cluster, and iteratively join clusters together

• Divisive: Start with one cluster and iteratively divide it into smaller clusters

• Hierarchical: Organize elements into a tree; leaves represent genes, and the length of the path between two leaves represents the distance between the corresponding genes. Similar genes lie within the same subtrees.


Hierarchical Clustering

• Initially, each element is its own cluster

• Merge the two closest clusters, and recurse

• Key question: What is closest?

• How do you compute the distance between clusters?


Hierarchical Clustering: Computing Distances

• $d_{min}(C, C^*) = \min d(x,y)$ for all elements x in C and y in C*

– Distance between two clusters is the smallest distance between any pair of their elements

• $d_{avg}(C, C^*) = \frac{1}{|C^*||C|} \sum d(x,y)$ for all elements x in C and y in C*

– Distance between two clusters is the average distance between all pairs of their elements
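The two cluster distances, plugged into the merge-the-closest-pair loop, can be sketched as below. This is a naive illustration, not an efficient implementation: the function names `d_min`, `d_avg`, and `agglomerate` are chosen here, Euclidean distance is assumed as the base metric, and the points are the gene rows of the intensity matrix.

```python
import math
from itertools import combinations

def d_min(c1, c2, dist=math.dist):
    """Smallest distance between any pair of elements from the two clusters."""
    return min(dist(x, y) for x in c1 for y in c2)

def d_avg(c1, c2, dist=math.dist):
    """Average distance over all pairs of elements from the two clusters."""
    return sum(dist(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def agglomerate(points, linkage):
    """Repeatedly merge the two closest clusters; return the list of merges."""
    clusters = [(p,) for p in points]     # initially, each element is its own cluster
    merges = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

# Gene 1 .. Gene 5 from the intensity matrix.
genes = [(10, 8, 10), (10, 0, 9), (4, 8.6, 3), (7, 8, 3), (1, 2, 3)]
merges_min = agglomerate(genes, d_min)
merges_avg = agglomerate(genes, d_avg)
```

Both linkages start by merging Gene 3 and Gene 4, since the two rules agree whenever both clusters contain a single element; they can diverge in later merges.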


Computing Distances (continued)

However, we still need a base distance metric for pairs of genes:

• Euclidean distance: $d = \lVert x - y \rVert_2 = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$

• Manhattan distance: $d = \lVert x - y \rVert_1 = \sum_{i=1}^{N} |x_i - y_i|$

• Dot product: $\dfrac{x^T y}{\lVert x \rVert_2 \lVert y \rVert_2}$

• Mutual information

What are some qualitative differences between these?
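The first three measures can be compared on a small example. The vectors below are made up so that `y` has the same shape as `x` but twice the magnitude; the function names are illustrative.

```python
import math

def euclidean(x, y):
    """||x - y||_2"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """||x - y||_1"""
    return sum(abs(a - b) for a, b in zip(x, y))

def normalized_dot(x, y):
    """x^T y / (||x||_2 ||y||_2), a similarity rather than a distance."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical expression vectors: same shape, different magnitude.
x = (1.0, 2.0, 3.0)
y = (2.0, 4.0, 6.0)

d_euclid = euclidean(x, y)
d_manhattan = manhattan(x, y)
similarity = normalized_dot(x, y)
```

The Euclidean and Manhattan distances are clearly non-zero here, while the normalized dot product is at its maximum, because it ignores the difference in magnitude.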


Geometrical interpretation of distances

• The distance measures are all related.

• In some cases, the magnitude of the vector is important, in other cases it is not.

[Figure: two vectors X and Y, labeled with $\lVert X - Y \rVert_2$, $\lVert X - Y \rVert_1$, and $c \cdot \cos^{-1}(X^T Y)$]


Comparison between metrics

• Euclidean and Manhattan tend to perform similarly and emphasize the overall magnitude of expression.

• The dot-product is very useful if the ‘shape’ of the expression vector is more important than its magnitude.

• The above metrics are less useful for identifying genes for which the expression levels are anti-correlated. One might imagine an instance in which the same transcription factor can cause both enhancement and repression of expression. In this case, the squared correlation ($r^2$) or mutual information is sometimes used.
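A short sketch of the anti-correlated case: the two made-up profiles below move in exactly opposite directions, so a distance-based view treats them as far apart while the squared correlation treats them as strongly related. The helper name `pearson_r` is illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles: one gene rises while the other falls.
x = (1.0, 2.0, 3.0, 4.0)
y = (8.0, 6.0, 4.0, 2.0)

distance = math.dist(x, y)   # Euclidean distance, large for this pair
r = pearson_r(x, y)          # close to -1: perfectly anti-correlated
r_squared = r ** 2           # close to 1: flagged as strongly related
```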


But how many orderings can we have?

[Figure: tree whose leaves appear in the order 1 2 4 5 3]


• For n leaves there are n-1 internal nodes

• Each flip in an internal node creates a new linear ordering of the leaves

• There are therefore $2^{n-1}$ orderings

[Figure: tree with leaf order 1 2 4 5 3; e.g., flip this node]
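The count can be checked by brute force on a small tree. The nested-tuple tree below is an assumed shape that is consistent with the leaf order 1 2 4 5 3; the actual tree in the figure is not recoverable from the transcript.

```python
def orderings(tree):
    """All leaf orderings obtainable by flipping internal nodes of a binary tree.

    A tree is either a leaf label or a pair (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = orderings(tree[0]), orderings(tree[1])
    result = []
    for a in left:
        for b in right:
            result.append(a + b)   # this node as drawn
            result.append(b + a)   # this node flipped
    return result

# Hypothetical tree with 5 leaves and therefore 4 internal nodes.
tree = ((1, 2), ((4, 5), 3))
all_orders = orderings(tree)
```

With 5 leaves the list contains 16 distinct orderings, one per combination of flips at the 4 internal nodes.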


Bar-Joseph et al. Bioinformatics (2001)


Computing an Optimal Ordering

• Define $L_T(u,v)$ as the optimum score of all orderings for the subtree rooted at T where
– u is the leftmost leaf, and
– v is the rightmost leaf

• Is it sufficient to compute $L_T(u,v)$ for all T, u, v?

[Figure: subtree rooted at T with leftmost leaf u and rightmost leaf v]


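[Figure: tree T with subtrees $T_1$ and $T_2$ and leaves u, k, m, v]

$$L_T(u,v) = \max_{k,m} \{ L_{T_1}(u,k) + L_{T_2}(m,v) + \mathrm{sim}(k,m) \}$$

where k is the rightmost leaf of $T_1$, m is the leftmost leaf of $T_2$, and $\mathrm{sim}(k,m)$ is the similarity between the two leaves k and m that become adjacent.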
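A naive version of this recursion can be sketched as follows. It assumes the score of an ordering is the sum of similarities between adjacent leaves, so joining two subtrees adds the similarity `sim[k][m]` of the two leaves that meet in the middle (following Bar-Joseph et al.). The tree encoding, the function name `best_orderings`, and the similarity matrix are all hypothetical.

```python
def best_orderings(tree, sim):
    """Return {(u, v): (score, order)}: for every feasible leftmost leaf u and
    rightmost leaf v, the best score and one ordering achieving it.

    A tree is either a leaf index or a pair (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        return {(tree, tree): (0.0, [tree])}
    left = best_orderings(tree[0], sim)
    right = best_orderings(tree[1], sim)
    table = {}
    # The second pass swaps the subtrees, i.e. flips this internal node.
    for first, second in ((left, right), (right, left)):
        for (u, k), (score1, order1) in first.items():
            for (m, v), (score2, order2) in second.items():
                score = score1 + score2 + sim[k][m]
                if (u, v) not in table or score > table[(u, v)][0]:
                    table[(u, v)] = (score, order1 + order2)
    return table

# Hypothetical symmetric similarity matrix for 4 leaves.
sim = [
    [0, 1, 5, 2],
    [1, 0, 3, 4],
    [5, 3, 0, 1],
    [2, 4, 1, 0],
]
tree = ((0, 1), (2, 3))
best_score, best_order = max(best_orderings(tree, sim).values(), key=lambda t: t[0])
```

Storing a full ordering in every table entry keeps the sketch short but wastes memory; a practical implementation would store back-pointers instead.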


Time complexity of the algorithm?

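• The recursion $L_T(u,w)$ is applied for each T, u, w. Each recursion takes $O(n^2)$ time.

• Each pair of nodes has a unique least common ancestor.

• $L_T(u,w)$ only needs to be computed if LCA(u,w) = T

• There are $O(n^2)$ pairs (u,w), each handled at a single T, so the total time is $O(n^4)$

[Figure: tree T with leaves u and w]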


Speed Improvements

• For all m in $L_{T_1}(u,R)$
  – If $L_{T_1}(u,m) + L_{T_2}(k_0,w) + C(T_1,T_2) \le$ CurrMax
    • Exit loop
  – For all k in $L_{T_2}(w,L)$
    • If $L_{T_1}(u,m) + L_{T_2}(k,w) + C(T_1,T_2) \le$ CurrMax
      – Exit loop
    • Else recompute CurrMax.

• In practice, this leads to great speed improvements

• For 1500 genes, the running time drops from 7 hrs. to 7 min.
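The early-termination idea can be sketched for a single pair (u, w) as below. The sketch assumes both candidate lists are sorted by decreasing score, that `c_max` is an upper bound on the similarity between any leaf of $T_1$ and any leaf of $T_2$, and that `k0` is the best-scoring k. All names and numbers are hypothetical.

```python
def best_join(a_scores, b_scores, sim, c_max):
    """Maximize a_scores[m] + b_scores[k] + sim[m][k] with early termination.

    a_scores[m] plays the role of L_T1(u, m), b_scores[k] that of L_T2(k, w).
    Returns (best value, number of (m, k) pairs actually evaluated).
    """
    ms = sorted(a_scores, key=a_scores.get, reverse=True)
    ks = sorted(b_scores, key=b_scores.get, reverse=True)
    k0 = ks[0]                       # candidate with the largest L_T2 score
    curr_max = float("-inf")
    evaluated = 0
    for m in ms:
        # Even the best k cannot beat curr_max, and later m are no better.
        if a_scores[m] + b_scores[k0] + c_max <= curr_max:
            break
        for k in ks:
            # Remaining k have lower scores, so none of them can improve.
            if a_scores[m] + b_scores[k] + c_max <= curr_max:
                break
            evaluated += 1
            curr_max = max(curr_max, a_scores[m] + b_scores[k] + sim[m][k])
    return curr_max, evaluated

# Hypothetical scores for leaves 0, 1 (in T1) and 2, 3 (in T2).
a_scores = {0: 10.0, 1: 2.0}
b_scores = {2: 8.0, 3: 1.0}
sim = {0: {2: 1.0, 3: 0.5}, 1: {2: 0.9, 3: 1.0}}
result, evaluated = best_join(a_scores, b_scores, sim, c_max=1.0)
```

In this toy case only one of the four (m, k) pairs is evaluated before both loops exit, and the result still equals the exhaustive maximum.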
