Principal Component Analysis
• 20 food products, 16 European countries
Country  Gr_coffee  Inst_coffee  Tea  Sweetener  Biscuit  Pe_Soup  Ti_soup  In_Portat  Fro_Fish
Germany  90         49           88   19         57       51       19       21         27
(rows for Italy, France, and the remaining countries follow)
PCA Example: FOODS
PCA Example: Red Sox
• Dataset: 110 years of Red Sox performance data
• Question: do pitcher and batter ages matter for performance?
Redundancy
• Two arbitrary measurement types r1 and r2
• Redundancy increases from (a) to (c)
• (c) can be represented by a single variable
• Spread across the best-fit line – covariance between two variables
Transform
• Linear transformation
• (x, y) in one Cartesian coordinate system
• The same point becomes (a, b) in another coordinate system
• Assuming a linear transformation:
  a = f(x, y) = c11·x + c12·y
  b = g(x, y) = c21·x + c22·y
• In matrix form:
  ( a )   ( c11  c12 ) ( x )
  ( b ) = ( c21  c22 ) ( y )
For review of matrix, www.cs.uml.edu/~kim/580/review_matrix.pdf
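The change of coordinates above is just a matrix–vector product. A minimal sketch, with illustrative coefficient values that are not from the slides:

```python
import numpy as np

# A linear change of coordinates: (x, y) -> (a, b) via a 2x2 matrix C.
# The coefficients c11..c22 below are made-up example values.
C = np.array([[2.0, 1.0],
              [0.0, 3.0]])   # rows: [c11 c12; c21 c22]

xy = np.array([1.0, 2.0])    # the point (x, y)
ab = C @ xy                  # a = c11*x + c12*y, b = c21*x + c22*y
print(ab)                    # [4. 6.]
```
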
Eigenvector
M v = λ v, e.g.

  ( 2  3 ) ( 3 )   ( 12 )       ( 3 )
  ( 2  1 ) ( 2 ) = (  8 ) = 4 · ( 2 )

so (3, 2) is an eigenvector of this matrix with eigenvalue 4.
• Eigenvector – multiplication by the matrix projects it onto the same direction
• Eigenvectors of a symmetric matrix are orthogonal
• Each eigenvector is associated with an eigenvalue; distinct eigenvalues give linearly independent eigenvectors
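These properties can be checked directly; the symmetric matrix below is an illustrative example, not from the slides:

```python
import numpy as np

# Eigen-decomposition of a symmetric matrix: M v = lambda v,
# and the eigenvectors are mutually orthogonal.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

vals, vecs = np.linalg.eigh(M)           # eigh is for symmetric matrices
for lam, v in zip(vals, vecs.T):
    assert np.allclose(M @ v, lam * v)   # mapped onto the same direction
assert abs(vecs[:, 0] @ vecs[:, 1]) < 1e-12   # orthogonality
print(vals)                              # eigenvalues 1 and 3
```
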
Transform
M = U Σ V^T (singular value decomposition of a general matrix)
• http://www.ams.org/samplings/feature-column/fcarc-svd
Transform (Symmetric)
• http://www.ams.org/samplings/feature-column/fcarc-svd
Eigenvectors
• M vi = λi vi
  λi is a scalar; for symmetric M, the vi are orthogonal vectors
• Non-symmetric M: needs two sets of orthonormal vectors
• For orthonormal unit vectors v1 and v2, a general vector x has coefficients given by projection:
  x = (v1·x) v1 + (v2·x) v2
• The matrix product then takes the form:
  M x = (v1·x) M v1 + (v2·x) M v2 = σ1 (v1·x) u1 + σ2 (v2·x) u2
  where M vi = σi ui with u1, u2 orthonormal
• => M = U Σ V^T
• SVD (Singular Value Decomposition)
  M = U Σ V^T
M: mxn; U: mxm; Σ: mxn; V: nxn
• Columns of U are eigenvectors of MM^T
• Columns of V are eigenvectors of M^TM
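A quick numerical check of the SVD and its link to MM^T, using an illustrative 2×3 matrix (not from the slides):

```python
import numpy as np

# SVD of an m x n matrix: M = U Sigma V^T; the squared singular values
# are the eigenvalues of M M^T (and of M^T M).
M = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])        # m = 2, n = 3

U, s, Vt = np.linalg.svd(M)

# Rebuild M from the three factors
Sigma = np.zeros_like(M)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, M)

# Squared singular values match the eigenvalues of M M^T
evals = np.linalg.eigvalsh(M @ M.T)
assert np.allclose(sorted(evals), sorted(s**2))
```
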
Transform = New Coordinates
• What makes a good transform matrix for PCA?
• The covariance matrix
Mean, Variance, Covariance
• X = (x1, x2, …, xn), Y = (y1, y2, …, yn)
• E[X] = ∑i xi / n, E[Y] = ∑i yi / n
• Variance = (st. dev.)²:
  V[X] = ∑i (xi − E[X])² / (n − 1)
• Covariance:
  cov[X, Y] = ∑i (xi − E[X]) (yi − E[Y]) / (n − 1)
Covariance Matrix
• Three variables X, Y, Z:
  cov[X,X]  cov[X,Y]  cov[X,Z]
  cov[Y,X]  cov[Y,Y]  cov[Y,Z]
  cov[Z,X]  cov[Z,Y]  cov[Z,Z]
• cov[X,X] = V[X]
• cov[X,Y] = cov[Y,X]
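These two properties (diagonal entries are variances, the matrix is symmetric) can be sketched with numpy; the three-column data below is an illustrative example, not from the slides:

```python
import numpy as np

# Covariance matrix for three variables X, Y, Z.
# np.cov uses the (n-1) denominator, matching the formulas above.
data = np.array([[2.1, 2.4, 0.5],
                 [0.7, 0.9, 1.8],
                 [1.5, 1.6, 1.1],
                 [2.9, 3.1, 0.2]])      # rows = observations, cols = X, Y, Z

C = np.cov(data, rowvar=False)
assert np.allclose(C, C.T)                                # cov[X,Y] = cov[Y,X]
assert np.allclose(np.diag(C), data.var(axis=0, ddof=1))  # cov[X,X] = V[X]
print(np.round(C, 3))
```
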
PCA
X Y
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3.0
2.3 2.7
2.0 1.6
1.0 1.1
1.5 1.6
1.1 0.9
Mean  1.81  1.91
Numerical Example
X (adj) Y (adj)
0.69 0.49
-1.31 -1.21
0.39 0.99
0.09 0.29
1.29 1.09
0.49 0.79
0.19 -0.31
-0.81 -0.81
-0.31 -0.31
-0.71 -1.01
• After subtracting the means
• Covariance matrix:
  cov = ( .6165  .6154 )
        ( .6154  .7166 )
• Eigenvalues:
  | .6165−λ   .6154   |
  | .6154    .7166−λ  | = 0
  λ = 1.2840, 0.0491
• Eigenvectors
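The worked example can be checked numerically; the sketch below rebuilds the covariance matrix from the X, Y table and recovers the stated eigenvalues:

```python
import numpy as np

# Data columns X and Y from the table above
X = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
Y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Subtract the means (the "adjusted" values), then form the covariance matrix
Z = np.column_stack([X - X.mean(), Y - Y.mean()])
C = Z.T @ Z / (len(X) - 1)            # (n-1) denominator, as defined earlier
print(np.round(C, 4))                 # matches .6165/.6154/.7166 up to rounding

vals, vecs = np.linalg.eigh(C)        # eigenvalues and eigenvectors
assert np.allclose(sorted(vals), [0.0491, 1.2840], atol=1e-3)
```
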
Example: Amino Acid (AA) - Basic
Clustering of AAs How many clusters ?
• Use 4 AA groups
• Good for acidic and basic
• P in the polar group
• The nonpolar group is widely spread
Similarities of AA’s determine the ease of substitutions
Some alignment tools show similar AA's in colors; a more systematic approach is needed
Physico-Chemical Properties
Physico-chemical properties of AA's determine protein structures:
(1) Size (volume)
(2) Partial volume – the expanded volume in solution when dissolved
(3) Bulkiness – the ratio of side-chain volume to its length: the average cross-sectional area of the side chain
(4) pH of the isoelectric point of the AA (pI)
(5) Hydrophobicity
(6) Polarity index
(7) Surface area
(8) Fraction of area – the fraction of the accessible surface area that is buried in the interior in a set of known crystal structures
Name        Abbr  Code  Vol  Bulk  Pol   pI    Hydro  Surf  Frac
Alanine Ala A 67 11.5 0.0 6.0 1.8 113 0.74
Arginine Arg R 148 14.3 52.0 10.8 -4.5 241 0.64
Asparagine Asn N 96 12.3 3.4 5.4 -3.5 158 0.63
Aspartic Asp D 91 11.7 49.7 2.8 -3.5 151 0.62
Cysteine Cys C 86 13.5 1.5 5.1 2.5 140 0.91
Glutamine Gln Q 114 14.5 3.5 5.7 -3.5 189 0.62
Glu. Acid Glu E 109 13.6 49.9 3.2 -3.5 183 0.62
Glycine Gly G 48 3.4 0.0 6.0 -0.4 85 0.72
Histidine His H 118 13.7 51.6 7.6 -3.2 194 0.78
Isoleucine Ile I 124 21.4 0.1 6.0 4.5 182 0.88
Leucine Leu L 124 21.4 0.1 6.0 3.8 180 0.85
Lysine Lys K 135 13.7 49.5 9.7 -3.9 211 0.52
Methionine Met M 124 16.3 1.4 5.7 1.9 204 0.85
Phenyl. Phe F 135 10.8 0.4 5.5 2.9 218 0.88
Proline Pro P 90 17.4 1.6 6.3 -1.6 143 0.64
Serine Ser S 73 9.5 1.7 5.7 -0.8 122 0.66
Threonine Thr T 93 15.8 1.7 5.7 -0.7 146 0.70
Tryptophan Trp W 163 21.7 2.1 5.9 -0.9 259 0.85
Tyrosine Tyr Y 141 18.0 1.6 5.7 -1.3 229 0.76
Valine Val V 105 21.6 0.1 6.0 4.2 160 0.86
Mean 109 15.4 13.6 6.0 -0.5 175 0.74
Red: acidic; Orange: basic; Green: polar (hydrophilic); Yellow: non-polar (hydrophobic)
PCA of AAs
• How to incorporate the different properties in order to group similar AA's?
• Visual clustering with volume and pI
PCA
Given an N×P matrix (e.g., 20×7), each row represents a P-dimensional data point. Each data point is
• scaled and shifted to the origin
• rotated to spread out the points as much as possible
Scaling
For property j, compute the average and the s.d.:
  μj = ∑i xij / N,  σj² = ∑i (xij − μj)² / N
Since each property has a different scale and mean, define normalized variables
  zij = (xij − μj) / σj
zij measures the deviation from the mean for each property, with mean 0 and s.d. 1
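This z-score scaling can be sketched in a few lines; the data below uses the volume and bulkiness values for the first four AA's from the table:

```python
import numpy as np

# Z-score scaling: each property (column) is shifted to mean 0 and
# scaled to s.d. 1, so properties with different units become comparable.
x = np.array([[67.0, 11.5],     # Ala: volume, bulkiness
              [148.0, 14.3],    # Arg
              [96.0, 12.3],     # Asn
              [91.0, 11.7]])    # Asp

mu = x.mean(axis=0)             # mu_j
sigma = x.std(axis=0)           # sigma_j (N denominator, as in the slide)
z = (x - mu) / sigma            # z_ij

assert np.allclose(z.mean(axis=0), 0.0)   # mean 0 per property
assert np.allclose(z.std(axis=0), 1.0)    # s.d. 1 per property
```
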
PCA
New orthogonal coordinate system: find vj = (vj1, vj2, …, vjP) such that
  ∑k vik vjk = 0 for i ≠ j (orthogonal) and ∑k vjk² = 1 (unit length)
vj represents a new coordinate vector. Data points in the z-coordinates become
  yij = ∑k zik vjk
The new y coordinate system is a rotation of the z coordinate system.
vjk turns out to be related to the correlation coefficient.
PCA
Correlation coefficient Cij:
  Cij = ∑k (zik − mi)(zjk − mj) / (P si sj)   (mi, si: mean and s.d. of the i-th row)
  −1 ≤ Cij ≤ 1
Results in an N×N similarity matrix Sij
       Vol   Bulk  Polar  pI    Hyd    SA    FrA
Vol    1.00  0.73  0.24   0.37  -0.08  0.99  0.18
Bulk         1.00  -0.20  0.08  0.44   0.64  0.49
Polar              1.00   0.27  -0.69  0.29  -0.53
pI                        1.00  -0.20  0.36  -0.18
Hyd                             1.00   -0.18 0.84
SA                                     1.00  0.12
FrA                                          1.00
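One entry of this matrix can be spot-checked with the table's own numbers, e.g. the strong volume/surface-area correlation, using the 20 Vol and Surf values copied from the table above:

```python
import numpy as np

# Volume and surface area for the 20 amino acids (table order, Ala..Val)
vol  = np.array([67, 148, 96, 91, 86, 114, 109, 48, 118, 124,
                 124, 135, 124, 135, 90, 73, 93, 163, 141, 105.0])
surf = np.array([113, 241, 158, 151, 140, 189, 183, 85, 194, 182,
                 180, 211, 204, 218, 143, 122, 146, 259, 229, 160.0])

r = np.corrcoef(vol, surf)[0, 1]   # Pearson correlation coefficient
print(round(r, 2))                 # high, consistent with the 0.99 entry
assert r > 0.95
```
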
Clustering
A family of related sequences evolved from a common ancestor is studied with phylogenetic trees showing the order of evolution.
Criteria needed:
• closeness between sequences
• the number of clusters
• Hierarchical clustering algorithm – connectivity-based
• k-means – centroid-based
Hierarchical Clustering
Hierarchical clustering algorithm:
• Initially, each point forms its own cluster
• Join the two clusters with the highest similarity into a single larger cluster
• Recompute similarities between all clusters
• Repeat the two steps above until all points are connected into clusters
Criterion of similarity? Use the scaled coordinates z:
• vector zi from the origin to data point i, with length |zi|² = ∑k zik²
• use the cosine of the angle between two points as similarity:
  cos θij = ∑k zik zjk / (|zi| |zj|)
For N elements, an N×N distance matrix d
Hierarchical_Clustering(d, n)
  form n clusters, each with 1 element
  construct a graph T by assigning an isolated vertex to each cluster
  while there is more than 1 cluster
    find the two closest clusters C1 and C2
    merge C1 and C2 into a new cluster C with |C1| + |C2| elements
    compute the distance from C to all other clusters
    add a new vertex C to T
    remove the rows and columns of d for C1 and C2, and add a row and column for C
  return T
k-means Clustering
• The number of clusters, k, is known ahead of time
• Minimize the squared error between data points and the k cluster centers
• No known polynomial algorithm
• Heuristic – Lloyd's algorithm: initially partition the n points arbitrarily into k clusters, then move points between clusters
• Converges to a local minimum; may move many points in each iteration
k-means Clustering Problem
Given n data points, find k center points minimizing the squared error distortion
  d(V, X) = ∑i d(vi, X)² / n
input: a set V of n data points and a parameter k
output: a set X of k center points minimizing d(V, X) over all possible choices of X
k-means Clustering
• Consider every possible partition P of the n elements into k clusters, each with a cost(P)
• Move one point in each iteration
Progressive_Greedy_k-means(n)
  select an arbitrary partition P into k clusters
  while forever
    bestChange ← 0
    for every cluster C
      for every element i not in C
        if moving i to C reduces Cost(P)
          if Δ(i → C) > bestChange
            bestChange ← Δ(i → C)
            i* ← i
            C* ← C
    if bestChange > 0
      change partition P by moving i* to C*
    else
      return P
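For comparison, Lloyd's heuristic named earlier can be sketched as batch assign/update steps (the progressive greedy variant above instead moves one point per iteration); the six test points are illustrative:

```python
import numpy as np

def lloyd_kmeans(points, k, iters=100, seed=0):
    """Lloyd's heuristic: assign points to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # assignment step: nearest center for every point
        d = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: move each center to the mean of its cluster
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # converged (local minimum)
            break
        centers = new_centers
    return centers, labels

# Two well-separated groups of three points each
pts = np.array([[0.0, 0.0], [0.2, 0.0], [0.1, 0.1],
                [5.0, 5.0], [5.2, 5.0], [5.1, 5.1]])
centers, labels = lloyd_kmeans(pts, k=2)
```

Note that, as the slides say, this only converges to a local minimum; the result can depend on the initial partition.
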
Dynamic Modeling in Chameleon
Similarity between clusters is determined by:
• relative interconnectivity (RI)
• relative closeness (RC)
Select pairs with high RI and RC to merge
Hierarchical Clustering - Cluto
• Generates a set of clusters within clusters; the result can be arranged as a tree
• Each node marks where two smaller clusters join
• CLUTO package with cosine similarity and group-average rules
• Red/green indicates values significantly higher/lower than the average; dark colors are close to the average
1. red on both pI and polarity scale
2. green on hydrophobicity and pI (can be separated into two smaller clusters)
3. green on volume and surface area
4. C is unusual in protein structure due to its potential to form disulfide bonds between pairs of cysteine residues (thus, difficult to interchange for other residues)
5. Hydrophobic
6. The two largest AA's
Clustering of properties: the properties can be ordered to illustrate groups of properties that are correlated
6 clusters
Cluster  Property          AA's
1        Basic             K, R, H
2        Acidic and amide  E, D, Q, N
3        Small             P, T, S, G, A
4        Cysteine          C
5        Hydrophobic       V, L, I, M, F
6        Large, aromatic   W, Y
• The PAM matrix considers probabilities of pairs of amino acids appearing together
• Pairs of amino acids that tend to appear together are grouped into a cluster
• Six clusters: (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)
• Contrast with the clusters from hierarchical clustering: (KRH) (EDQN) (PTSGA) (C) (VLIMF) (WY)
Dayhoff Clustering - 1978
(KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)
• To study protein folding, the BLOSUM50 similarity matrix was used
• Determine correlation coefficients between similarity-matrix elements for all pairs of AA's, e.g.,
  CAV = (∑i MA,i MV,i) / √[(∑i MA,i MA,i)(∑i MV,i MV,i)]
  with the summation over i taken over the 20 AA's
• Group the two AA's with the highest CC, then either add the AA with the next-highest CC to an existing group or start a new group
Murphy, Wallqvist, Levy, 2000