Principal Component Analysis
• 20 food products, 16 European countries
Country  Gr_coffee  Inst_coffee  Tea  Sweetener  Biscuit  Pe_Soup  Ti_soup  In_Portat  Fro_Fish
Germany  90         49           88   19         57       51       19       21         27
(rows for Italy, France, and the remaining countries follow)
PCA Example: FOODS
PCA Example: Red Sox
• Dataset: 110 years of Red Sox performance data
• Question: do pitcher and batter ages matter for performance?
Redundancy
• Two arbitrary measurement types r1 and r2
• Redundancy increases from (a) to (c)
• (c) can be represented by a single variable
• Spread across the best-fit line – covariance between two variables
Transform
• Linear transformation
• (x, y) in one Cartesian coordinate system
• The same point becomes (a, b) in another coordinate system
• Assuming a linear transformation:
  a = f(x, y) = c11·x + c12·y
  b = g(x, y) = c21·x + c22·y
• In matrix form:
  ( a )   ( c11  c12 ) ( x )
  ( b ) = ( c21  c22 ) ( y )
For review of matrix, www.cs.uml.edu/~kim/580/review_matrix.pdf
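The change of coordinates above is just a matrix–vector product. A minimal sketch, with illustrative coefficient values that are not from the slides:

```python
import numpy as np

# A linear change of coordinates: (x, y) -> (a, b) via a 2x2 matrix C.
# The coefficients c11..c22 below are made-up example values.
C = np.array([[2.0, 1.0],
              [0.0, 3.0]])   # rows: [c11 c12; c21 c22]

xy = np.array([1.0, 2.0])    # the point (x, y)
ab = C @ xy                  # a = c11*x + c12*y, b = c21*x + c22*y
print(ab)                    # [4. 6.]
```
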
Eigenvector
M v = λ v, e.g.

  ( 2  3 ) ( 3 )   ( 12 )       ( 3 )
  ( 2  1 ) ( 2 ) = (  8 ) = 4 · ( 2 )

so (3, 2) is an eigenvector of this matrix with eigenvalue 4.
• Eigenvector – multiplication by the matrix projects it onto the same direction
• Eigenvectors of a symmetric matrix are orthogonal
• Each eigenvector is associated with an eigenvalue; distinct eigenvalues give linearly independent eigenvectors
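These properties can be checked directly; the symmetric matrix below is an illustrative example, not from the slides:

```python
import numpy as np

# Eigen-decomposition of a symmetric matrix: M v = lambda v,
# and the eigenvectors are mutually orthogonal.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

vals, vecs = np.linalg.eigh(M)           # eigh is for symmetric matrices
for lam, v in zip(vals, vecs.T):
    assert np.allclose(M @ v, lam * v)   # mapped onto the same direction
assert abs(vecs[:, 0] @ vecs[:, 1]) < 1e-12   # orthogonality
print(vals)                              # eigenvalues 1 and 3
```
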
Transform
M = U Σ V^T (singular value decomposition of a general matrix)
• http://www.ams.org/samplings/feature-column/fcarc-svd
Transform (Symmetric)
• http://www.ams.org/samplings/feature-column/fcarc-svd
Eigenvectors
• M vi = λi vi
  λi is a scalar; for symmetric M, the vi are orthogonal vectors
• Non-symmetric M: needs two sets of orthonormal vectors
• For orthonormal unit vectors v1 and v2, a general vector x has coefficients given by projection:
  x = (v1·x) v1 + (v2·x) v2
• The matrix product then takes the form:
  M x = (v1·x) M v1 + (v2·x) M v2 = σ1 (v1·x) u1 + σ2 (v2·x) u2
  where M vi = σi ui with u1, u2 orthonormal
• => M = U Σ V^T
• SVD (Singular Value Decomposition)
  M = U Σ V^T
M: mxn; U: mxm; Σ: mxn; V: nxn
• Columns of U are eigenvectors of MM^T
• Columns of V are eigenvectors of M^TM
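A quick numerical check of the SVD and its link to MM^T, using an illustrative 2×3 matrix (not from the slides):

```python
import numpy as np

# SVD of an m x n matrix: M = U Sigma V^T; the squared singular values
# are the eigenvalues of M M^T (and of M^T M).
M = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])        # m = 2, n = 3

U, s, Vt = np.linalg.svd(M)

# Rebuild M from the three factors
Sigma = np.zeros_like(M)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, M)

# Squared singular values match the eigenvalues of M M^T
evals = np.linalg.eigvalsh(M @ M.T)
assert np.allclose(sorted(evals), sorted(s**2))
```
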
Transform = New Coordinates
• What makes a good transform matrix for PCA?
• The covariance matrix
Mean, Variance, Covariance
• X = (x1, x2, …, xn), Y = (y1, y2, …, yn)
• E[X] = ∑i xi / n, E[Y] = ∑i yi / n
• Variance = (st. dev.)²:
  V[X] = ∑i (xi − E[X])² / (n − 1)
• Covariance:
  cov[X, Y] = ∑i (xi − E[X]) (yi − E[Y]) / (n − 1)
Covariance Matrix
• Three variables X, Y, Z:
  cov[X,X]  cov[X,Y]  cov[X,Z]
  cov[Y,X]  cov[Y,Y]  cov[Y,Z]
  cov[Z,X]  cov[Z,Y]  cov[Z,Z]
• cov[X,X] = V[X]
• cov[X,Y] = cov[Y,X]
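These two properties (diagonal entries are variances, the matrix is symmetric) can be sketched with numpy; the three-column data below is an illustrative example, not from the slides:

```python
import numpy as np

# Covariance matrix for three variables X, Y, Z.
# np.cov uses the (n-1) denominator, matching the formulas above.
data = np.array([[2.1, 2.4, 0.5],
                 [0.7, 0.9, 1.8],
                 [1.5, 1.6, 1.1],
                 [2.9, 3.1, 0.2]])      # rows = observations, cols = X, Y, Z

C = np.cov(data, rowvar=False)
assert np.allclose(C, C.T)                                # cov[X,Y] = cov[Y,X]
assert np.allclose(np.diag(C), data.var(axis=0, ddof=1))  # cov[X,X] = V[X]
print(np.round(C, 3))
```
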
PCA
X Y
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3.0
2.3 2.7
2.0 1.6
1.0 1.1
1.5 1.6
1.1 0.9
Mean  1.81  1.91
Numerical Example
X (adj) Y (adj)
0.69 0.49
-1.31 -1.21
0.39 0.99
0.09 0.29
1.29 1.09
0.49 0.79
0.19 -0.31
-0.81 -0.81
-0.31 -0.31
-0.71 -1.01
• After subtracting the means
• Covariance matrix:
  cov = ( .6165  .6154 )
        ( .6154  .7166 )
• Eigenvalues:
  | .6165−λ   .6154   |
  | .6154    .7166−λ  | = 0
  λ = 1.2840, 0.0491
• Eigenvectors
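The worked example can be checked numerically; the sketch below rebuilds the covariance matrix from the X, Y table and recovers the stated eigenvalues:

```python
import numpy as np

# Data columns X and Y from the table above
X = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
Y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Subtract the means (the "adjusted" values), then form the covariance matrix
Z = np.column_stack([X - X.mean(), Y - Y.mean()])
C = Z.T @ Z / (len(X) - 1)            # (n-1) denominator, as defined earlier
print(np.round(C, 4))                 # matches .6165/.6154/.7166 up to rounding

vals, vecs = np.linalg.eigh(C)        # eigenvalues and eigenvectors
assert np.allclose(sorted(vals), [0.0491, 1.2840], atol=1e-3)
```
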
Example: Amino Acid (AA) - Basic
Clustering of AAs How many clusters ?
• Use 4 AA groups
• Good for acidic and basic
• P in the polar group
• The nonpolar group is widely spread
Similarities of AA’s determine the ease of substitutions
Some alignment tools show similar AA's in colors; a more systematic approach is needed
Physico-Chemical Properties
Physico-chemical properties of AA's determine protein structures:
(1) Size (volume)
(2) Partial volume – the expanded volume in solution when dissolved
(3) Bulkiness – the ratio of side-chain volume to its length: the average cross-sectional area of the side chain
(4) pH of the isoelectric point of the AA (pI)
(5) Hydrophobicity
(6) Polarity index
(7) Surface area
(8) Fraction of area – the fraction of the accessible surface area that is buried in the interior in a set of known crystal structures
Name        Abbr  Code  Vol  Bulk  Pol   pI    Hydro  Surf  Frac
Alanine Ala A 67 11.5 0.0 6.0 1.8 113 0.74
Arginine Arg R 148 14.3 52.0 10.8 -4.5 241 0.64
Asparagine Asn N 96 12.3 3.4 5.4 -3.5 158 0.63
Aspartic Asp D 91 11.7 49.7 2.8 -3.5 151 0.62
Cysteine Cys C 86 13.5 1.5 5.1 2.5 140 0.91
Glutamine Gln Q 114 14.5 3.5 5.7 -3.5 189 0.62
Glu. Acid Glu E 109 13.6 49.9 3.2 -3.5 183 0.62
Glycine Gly G 48 3.4 0.0 6.0 -0.4 85 0.72
Histidine His H 118 13.7 51.6 7.6 -3.2 194 0.78
Isoleucine Ile I 124 21.4 0.1 6.0 4.5 182 0.88
Leucine Leu L 124 21.4 0.1 6.0 3.8 180 0.85
Lysine Lys K 135 13.7 49.5 9.7 -3.9 211 0.52
Methionine Met M 124 16.3 1.4 5.7 1.9 204 0.85
Phenyl. Phe F 135 10.8 0.4 5.5 2.9 218 0.88
Proline Pro P 90 17.4 1.6 6.3 -1.6 143 0.64
Serine Ser S 73 9.5 1.7 5.7 -0.8 122 0.66
Threonine Thr T 93 15.8 1.7 5.7 -0.7 146 0.70
Tryptophan Trp W 163 21.7 2.1 5.9 -0.9 259 0.85
Tyrosine Tyr Y 141 18.0 1.6 5.7 -1.3 229 0.76
Valine Val V 105 21.6 0.1 6.0 4.2 160 0.86
Mean 109 15.4 13.6 6.0 -0.5 175 0.74
Red: acidic; Orange: basic; Green: polar (hydrophilic); Yellow: non-polar (hydrophobic)
PCA of AAs
• How to incorporate the different properties in order to group similar AA's?
• Visual clustering with volume and pI
PCA
Given an N×P matrix (e.g., 20×7), each row represents a P-dimensional data point. Each data point is
• scaled and shifted to the origin
• rotated to spread out the points as much as possible
Scaling
For property j, compute the average and the s.d.:
  μj = ∑i xij / N,  σj² = ∑i (xij − μj)² / N
Since each property has a different scale and mean, define normalized variables
  zij = (xij − μj) / σj
zij measures the deviation from the mean for each property, with mean 0 and s.d. 1
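This z-score scaling can be sketched in a few lines; the data below uses the volume and bulkiness values for the first four AA's from the table:

```python
import numpy as np

# Z-score scaling: each property (column) is shifted to mean 0 and
# scaled to s.d. 1, so properties with different units become comparable.
x = np.array([[67.0, 11.5],     # Ala: volume, bulkiness
              [148.0, 14.3],    # Arg
              [96.0, 12.3],     # Asn
              [91.0, 11.7]])    # Asp

mu = x.mean(axis=0)             # mu_j
sigma = x.std(axis=0)           # sigma_j (N denominator, as in the slide)
z = (x - mu) / sigma            # z_ij

assert np.allclose(z.mean(axis=0), 0.0)   # mean 0 per property
assert np.allclose(z.std(axis=0), 1.0)    # s.d. 1 per property
```
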
PCA
New orthogonal coordinate system: find vj = (vj1, vj2, …, vjP) such that
  ∑k vik vjk = 0 for i ≠ j (orthogonal) and ∑k vjk² = 1 (unit length)
vj represents a new coordinate vector. Data points in the z-coordinates become
  yij = ∑k zik vjk
The new y coordinate system is a rotation of the z coordinate system.
vjk turns out to be related to the correlation coefficient.
PCA
Correlation coefficient Cij:
  Cij = ∑k (zik − mi)(zjk − mj) / (P si sj)   (mi, si: mean and s.d. of the i-th row)
  −1 ≤ Cij ≤ 1
Results in an N×N similarity matrix Sij
       Vol   Bulk  Polar  pI    Hyd    SA    FrA
Vol    1.00  0.73  0.24   0.37  -0.08  0.99  0.18
Bulk         1.00  -0.20  0.08  0.44   0.64  0.49
Polar              1.00   0.27  -0.69  0.29  -0.53
pI                        1.00  -0.20  0.36  -0.18
Hyd                             1.00   -0.18 0.84
SA                                     1.00  0.12
FrA                                          1.00
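One entry of this matrix can be spot-checked with the table's own numbers, e.g. the strong volume/surface-area correlation, using the 20 Vol and Surf values copied from the table above:

```python
import numpy as np

# Volume and surface area for the 20 amino acids (table order, Ala..Val)
vol  = np.array([67, 148, 96, 91, 86, 114, 109, 48, 118, 124,
                 124, 135, 124, 135, 90, 73, 93, 163, 141, 105.0])
surf = np.array([113, 241, 158, 151, 140, 189, 183, 85, 194, 182,
                 180, 211, 204, 218, 143, 122, 146, 259, 229, 160.0])

r = np.corrcoef(vol, surf)[0, 1]   # Pearson correlation coefficient
print(round(r, 2))                 # high, consistent with the 0.99 entry
assert r > 0.95
```
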
Clustering
A family of related sequences evolved from a common ancestor is studied with phylogenetic trees showing the order of evolution.
Criteria needed:
• closeness between sequences
• the number of clusters
• Hierarchical clustering algorithm – connectivity-based
• k-means – centroid-based
Hierarchical Clustering
Hierarchical clustering algorithm:
• Initially, each point forms its own cluster
• Join the two clusters with the highest similarity into a single larger cluster
• Recompute similarities between all clusters
• Repeat the two steps above until all points are connected into clusters
Criterion of similarity? Use the scaled coordinates z:
• vector zi from the origin to data point i, with length |zi|² = ∑k zik²
• use the cosine of the angle between two points as similarity:
  cos θij = ∑k zik zjk / (|zi| |zj|)
For N elements, an N×N distance matrix d
Hierarchical_Clustering(d, n)
  form n clusters, each with 1 element
  construct a graph T by assigning an isolated vertex to each cluster
  while there is more than 1 cluster
    find the two closest clusters C1 and C2
    merge C1 and C2 into a new cluster C with |C1| + |C2| elements
    compute the distance from C to all other clusters
    add a new vertex C to T
    remove the rows and columns of d for C1 and C2, and add a row and column for C
  return T
k-means Clustering
• The number of clusters, k, is known ahead of time
• Minimize the squared error between data points and the k cluster centers
• No known polynomial algorithm
• Heuristic – Lloyd's algorithm: initially partition the n points arbitrarily into k clusters, then move points between clusters
• Converges to a local minimum; may move many points in each iteration
k-means Clustering Problem
Given n data points, find k center points minimizing the squared error distortion
  d(V, X) = ∑i d(vi, X)² / n
input: a set V of n data points and a parameter k
output: a set X of k center points minimizing d(V, X) over all possible choices of X
k-means Clustering
• Consider every possible partition P of the n elements into k clusters, each with a cost(P)
• Move one point in each iteration
Progressive_Greedy_k-means(n)
  select an arbitrary partition P into k clusters
  while forever
    bestChange ← 0
    for every cluster C
      for every element i not in C
        if moving i to C reduces Cost(P)
          if Δ(i → C) > bestChange
            bestChange ← Δ(i → C)
            i* ← i
            C* ← C
    if bestChange > 0
      change partition P by moving i* to C*
    else
      return P
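For comparison, Lloyd's heuristic named earlier can be sketched as batch assign/update steps (the progressive greedy variant above instead moves one point per iteration); the six test points are illustrative:

```python
import numpy as np

def lloyd_kmeans(points, k, iters=100, seed=0):
    """Lloyd's heuristic: assign points to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # assignment step: nearest center for every point
        d = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: move each center to the mean of its cluster
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # converged (local minimum)
            break
        centers = new_centers
    return centers, labels

# Two well-separated groups of three points each
pts = np.array([[0.0, 0.0], [0.2, 0.0], [0.1, 0.1],
                [5.0, 5.0], [5.2, 5.0], [5.1, 5.1]])
centers, labels = lloyd_kmeans(pts, k=2)
```

Note that, as the slides say, this only converges to a local minimum; the result can depend on the initial partition.
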
Dynamic Modeling in Chameleon
Similarity between clusters is determined by:
• relative interconnectivity (RI)
• relative closeness (RC)
Select pairs with high RI and RC to merge
Hierarchical Clustering - Cluto
• Generates a set of clusters within clusters; the result can be arranged as a tree
• Each node marks where two smaller clusters join
• CLUTO package with cosine similarity and group-average rules
• Red/green indicates values significantly higher/lower than the average; dark colors are close to the average
1. red on both pI and polarity scale
2. green on hydrophobicity and pI (can be separated into two smaller clusters)
3. green on volume and surface area
4. C is unusual in protein structure due to its potential to form disulfide bonds between pairs of cysteine residues (thus, difficult to interchange for other residues)
5. Hydrophobic
6. The two largest AA's
Clustering of properties: the properties can be ordered to illustrate groups of properties that are correlated
6 clusters
Cluster  Property          AA's
1        Basic             K, R, H
2        Acidic and amide  E, D, Q, N
3        Small             P, T, S, G, A
4        Cysteine          C
5        Hydrophobic       V, L, I, M, F
6        Large, aromatic   W, Y
• The PAM matrix considers probabilities of pairs of amino acids appearing together
• Pairs of amino acids that tend to appear together are grouped into a cluster
• Six clusters: (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)
• Contrast with the clusters from hierarchical clustering: (KRH) (EDQN) (PTSGA) (C) (VLIMF) (WY)
Dayhoff Clustering - 1978
(KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)
• To study protein folding, the BLOSUM50 similarity matrix was used
• Determine correlation coefficients between similarity-matrix elements for all pairs of AA's, e.g.,
  CAV = (∑i MA,i MV,i) / √[(∑i MA,i MA,i)(∑i MV,i MV,i)]
  with the summation over i taken over the 20 AA's
• Group the two AA's with the highest CC, then either add the AA with the next-highest CC to an existing group or start a new group
Murphy, Wallqvist, Levy, 2000