Biological Networks – Graph Theory and Matrix Theory
Ka-Lok Ng, Department of Bioinformatics
Asia University
2
Content
Topological Statistics of the protein interaction networks
How to characterize a network?
– Graph theory, topological parameters (node degree, average path length, clustering coefficient, and node degree correlation)
– Random graph, scale-free network, hierarchical network
– Evolution of Biological Networks
3
Biological Networks - metabolic networks
Metabolism is the most basic network of biochemical reactions, which generate energy for driving various cell processes, and degrade and synthesize many different bio-molecules.
4
Biological Networks - Protein-protein interaction network (PIN)
Proteins perform distinct and well-defined functions, but little is known about how interactions among them are structured at the cellular level. Protein-protein interactions account for binding interactions and the formation of protein complexes.
- Experiment: yeast two-hybrid method or co-immunoprecipitation
www.utoronto.ca/boonelab/proteomics.htm
Limitation: no subcellular location and no temporal information.
Cliques – protein complexes ?
5
Biological Networks - PIN
Yeast protein-protein interaction network
- Protein-protein interactions are not random
- Highly connected proteins are unlikely to interact with each other
Not a random network
- Data from the high-throughput two-hybrid experiment (T. Ito et al., PNAS (2001))
- The full set contains 4549 interactions among 3278 yeast proteins
- 87% of the nodes are in the largest component
- kmax ~ 285
- The figure shows nuclear proteins only
6
Biological Networks – Gene regulation networks
Example of a genetic regulatory network of two genes (a and b), each coding for a regulatory protein (A and B).
In a gene regulatory network, the protein encoded by a gene can regulate the expression of other genes, for instance, by activating or inhibiting DNA transcription. These genes in turn produce new regulatory proteins that control other genes.
7
Biological Networks – Gene regulation networks
Transcription regulatory network in yeast
- From the YPD database: 1276 regulations among 682 proteins by 125 transcription factors (~10 regulated genes per TF)
- Part of a bigger genetic regulatory network of 1772 regulations among 908 proteins
Transcription regulatory network in H. sapiens
- Data courtesy of Ariadne Genomics, obtained from a literature search: 1449 regulations among 689 proteins
Transcription regulatory network in E. coli
- Data (courtesy of Uri Alon) curated from the Regulon database: 606 interactions between 424 operons (by 116 TFs)
8
Biological Networks – Signal transduction networks
Nuclear transcription factor NF-kB
- control of apoptosis (cell suicide)
- development of B and T cells
- anti-viral and bacterial responses
Oxidant-induced activation of NF-kB signal transduction
[Figure labels: hormones (first messenger), receptor, cAMP and Ca++ (second messengers), phosphorylation]
9
Biological Networks – Cell cycle regulation networks
10
Biological Networks
Biological networks are not randomly connected
Underlying architecture clustering
How to characterize them? Are there universal features across different species?
11
Graph theory – The Bridge Obsession Problem
Bridges of Königsberg
Find a tour crossing every bridge just once. Leonhard Euler (Switzerland) showed in 1735 that it is impossible.
L. Euler, 1707-1783
12
Eulerian Cycle Problem
Find a cycle that visits every edge exactly once
Linear time
More complicated Königsberg
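The linear-time claim rests on Euler's criterion: a connected graph has an Eulerian cycle exactly when every vertex has even degree. The following is a minimal sketch of that check, not the slides' own code. The labels A to D for the four land masses and the bridge list `KONIGSBERG` are assumptions chosen to reproduce the classical seven-bridge layout.

```python
from collections import defaultdict


def has_eulerian_cycle(edges):
    """Return True if the multigraph given as (u, v) pairs has an Eulerian cycle.

    Euler's criterion: the graph must be connected (ignoring isolated
    vertices) and every vertex must have even degree. Runs in O(V + E).
    """
    if not edges:
        return True
    degree = defaultdict(int)
    neighbors = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Depth-first search to test connectivity.
    start = next(iter(degree))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    connected = len(seen) == len(degree)
    return connected and all(d % 2 == 0 for d in degree.values())


# Assumed labelling: A is the island, B and C the river banks, D the eastern land mass.
KONIGSBERG = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]
SQUARE = [(1, 2), (2, 3), (3, 4), (4, 1)]
```

With this layout all four land masses have odd degree, so no Eulerian cycle exists, whereas the simple 4-cycle `SQUARE` passes the test.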
13
Hamiltonian Cycle Problem
Find a cycle that visits every vertex exactly once
Around 20 famous cities in the world
NP-complete
Game invented by Sir William Hamilton in 1857
Sir W. Hamilton (Irish mathematician), 1805-1865
14
Mapping Problems to Graphs
Arthur Cayley (English mathematician) studied chemical structures of hydrocarbons in the mid-1800s
He used trees (acyclic connected graphs) to enumerate structural isomers.
15
Real networks
Many networks show scale-free (power-law) behavior:
- World-Wide Web
- Internet
- Ecological networks (food webs)
- Science collaboration network
- Movie actor collaboration network
- Cellular networks
- Networks in linguistics
- Power and neural networks
- Sexual contacts within a population (important for disease prevention!)
- etc.
16
Graph Theory
[Figure: two vertices A and B joined by an edge (a binary relation); example graph types: pathway, cluster, hierarchical tree]
Network representation. A network (graph) consists of a set of elements (vertices) and a set of binary relations (edges).
17
Graph Theory – Basic concepts
Different systems, different graphs:
Graph
G = (N, E), N = {n1, n2, ..., nN}, E = {e1, e2, ..., eM}, ek = {ni, nj}
Nodes: proteins. Edges: protein interactions.
Multigraph
ek = {ni, nj} plus duplicate edges, i.e. em = {ni, nj}
Nodes: proteins. Edges: interactions of different sorts, such as binding and similarity.
Hypergraph
Hyperedge: ex = {ni, nj, nk, ...}
Nodes: proteins. Edges: protein complexes.
Directed hypergraph
Hyperedge: ex = {ni, nj, ... | nk, nl, ...}
Nodes: substances. Edges: chemical reactions, e.g. A + B → C + D gives ex = {A, B, ... | C, D, ...}
Directed graph
ek = (ni, nj), an ordered pair
Nodes: genes and their products. Edges from A to B: gene regulation (gene A regulates the expression of gene B).
The node degrees d(n) satisfy $\sum_{n \in N} d(n) = 2|E|$.
18
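Graph Theory – Basic concepts
Node degree
Components
Complete graph (clique)
Shortest path length
Clustering coefficient Ci
If A-B and B-C, then it is highly probable that A-C.
$C_i = \frac{2E_i}{k_i(k_i-1)}$
where k_i is the degree of node i and E_i is the number of edges among its k_i neighbors. Example: for a node A with k_A = 5 and E_A = 1,
$C_A = \frac{2 \cdot 1}{5(5-1)} = 0.1$
Two ways to compute Ci:
- E_i actual connections out of $C^{k_i}_2 = k_i(k_i-1)/2$ possible connections
- the number of triangles that include i, divided by k_i(k_i-1)/2
Average clustering coefficient
$C = \frac{1}{N}\sum_{i=1}^{N} C_i$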
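As a hedged illustration of the clustering coefficient, C_i = 2E_i / (k_i(k_i - 1)), the sketch below counts the links among the neighbors of a node. The example graph `EDGES` is an assumption built to mirror the slide's numbers (a node A with five neighbors and one link among them). Returning 0 for nodes with fewer than two neighbors is a convention chosen here, not something the slides specify.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, node):
    """C_i = 2 E_i / (k_i (k_i - 1)); defined as 0.0 when k_i < 2 (assumed convention)."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # E_i: number of edges that actually exist among the neighbors of `node`.
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """C = (1/N) * sum of C_i over all nodes."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)


# Hypothetical graph: A has 5 neighbors, and only B-C are linked among them.
EDGES = [("A", x) for x in "BCDEF"] + [("B", "C")]
ADJ = build_adjacency(EDGES)
C_A = clustering_coefficient(ADJ, "A")  # 2*1 / (5*4)
```

Counting pairs of linked neighbors is the same as counting triangles through the node, so both ways of computing Ci listed above give the same value.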
19
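Graph Theory – Vertex adjacency matrix
[Figure: undirected graph with vertices 1, 2, 3, 4, in which vertex 2 is joined to vertices 1, 3 and 4]
$A = \begin{pmatrix} 0 & 1 & \infty & \infty \\ 1 & 0 & 1 & 1 \\ \infty & 1 & 0 & \infty \\ \infty & 1 & \infty & 0 \end{pmatrix}$
- ∞ means not directly connected
- node i connectivity, ki = countj(mij = 1), giving ki = 1, 3, 1, 1 for vertices 1 to 4
- Undirected graph: A is symmetric
Bipartite graph
$A = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}$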
20
Graph Theory – Edge adjacency matrix
[Figure: graph G with vertices 1, 2, 3, 4 and edges a, b, c, d]
With rows and columns ordered a, b, c, d:
$E(G) = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$ (symmetric)
The edge adjacency matrix (E) of a graph G is identical to the vertex adjacency matrix (A) of the line graph of G, L(G). That is, the edges in G are replaced by vertices in L(G), and two vertices in L(G) are connected whenever the corresponding edges in G are adjacent.
[Figure: line graph L(G) with vertices a, b, c, d]
A(L(G)) = E(G)
Two labelings of the same graph G are related by a similarity transformation, $P^{-1} A(G_1) P = A(G_2)$.
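The relation A(L(G)) = E(G) can be checked mechanically: two edges of G are adjacent when they share an endpoint. This is a small sketch under an assumption. The edge list `G_EDGES` (a = 1-2, b = 2-3, c = 3-4, d = 2-4) is inferred from the incidence matrix example later in the deck and is not stated on this slide.

```python
def edge_adjacency(edges):
    """Return (labels, matrix) where matrix[x][y] = 1 if edges x and y share a vertex.

    `edges` maps an edge label to its pair of end vertices. The result is
    the vertex adjacency matrix of the line graph L(G).
    """
    labels = sorted(edges)
    matrix = []
    for x in labels:
        row = []
        for y in labels:
            shared = set(edges[x]) & set(edges[y])
            row.append(1 if x != y and shared else 0)
        matrix.append(row)
    return labels, matrix


# Assumed edge list for the example graph G.
G_EDGES = {"a": (1, 2), "b": (2, 3), "c": (3, 4), "d": (2, 4)}
LABELS, E = edge_adjacency(G_EDGES)
```

For this edge list the only non-adjacent pair is a and c, which have no vertex in common.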
21
Graph Theory – average network distance
Interaction path length or average network distance, d
- the average of the distances between all pairs of nodes
- frequency of the shortest interaction path length, f(j)
- determined by using Floyd's algorithm
The average network distance d is given by
$d = \frac{\sum_j j\, f(j)}{\sum_j f(j)}$
where j is the shortest path length between two nodes.
Network diameter (global), average network distance (local)
22
Graph Theory – the shortest path
The shortest path
- Floyd algorithm, an O(N^3) algorithm
- For iteration n, given three nodes i, j and k, check whether it is shorter to reach j from i by passing through k:
$M^{n}_{ij} = \min\{M^{n-1}_{ij},\; M^{n-1}_{ik} + M^{n-1}_{kj}\}$
- search all possible paths, e.g. 1-2, 1-2-3, 1-2-4, 2-3, 2-4
[Figure: graph with vertices 1, 2, 3, 4; a path from i to j passing through k]
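The update rule and the average network distance d = Σ j f(j) / Σ f(j) can be sketched together. This is an illustrative implementation of the standard Floyd-Warshall recursion, not the code used for the results in this deck. The example graph is assumed from the listed paths 1-2, 2-3, 2-4, with vertices relabelled 0 to 3.

```python
import math
from collections import Counter


def floyd_warshall(n, edges):
    """All-pairs shortest path lengths for an undirected, unweighted graph on 0..n-1."""
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    for k in range(n):          # allow paths through intermediate vertex k
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def path_length_frequency(dist):
    """f(j): number of connected node pairs whose shortest path length is j."""
    n = len(dist)
    return Counter(dist[i][j] for i in range(n) for j in range(i + 1, n)
                   if dist[i][j] != math.inf)


def average_distance(freq):
    """d = sum_j j*f(j) / sum_j f(j)."""
    return sum(j * f for j, f in freq.items()) / sum(freq.values())


# Assumed example: vertex 2 of the slide (index 1 here) is joined to the other three.
EDGES = [(0, 1), (1, 2), (1, 3)]
DIST = floyd_warshall(4, EDGES)
FREQ = path_length_frequency(DIST)
D = average_distance(FREQ)
```

Here three pairs are at distance 1 and three at distance 2, so d = (3·1 + 3·2) / 6 = 1.5.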
23
Graph Theory – number of the shortest path in a graph
A nonvanishing element of A(G), A_ij = 1, represents a walk of length one between the vertices i and j. Therefore, in general,
$A_{ij} = 1$ if there is a walk of length one between vertices i and j, and 0 otherwise.
There are walks of various lengths which can be found in a given graph. Thus
$A_{ik} A_{kj} = 1$ if there is a walk of length two between vertices i and j passing through the vertex k, and 0 otherwise.
Therefore, the expression
$(A^2)_{ij} = \sum_{k=1}^{N} A_{ik} A_{kj}$
represents the total number of walks of length 2 in G between the vertices i and j.
For a walk of length L, we have
$A_{ir} A_{rs} \cdots A_{zj} = 1$ if there is a walk of length L between vertices i and j passing through the vertices r, s, ..., z, and 0 otherwise.
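Since (A^L)_ij counts the walks of length L between i and j, the counts can be read off a matrix power. The sketch below uses plain nested lists. The 0/1 matrix `A` is an assumed example: the four-vertex graph in which vertex 2 is joined to vertices 1, 3 and 4, with indices starting at 0.

```python
def mat_mul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def mat_pow(a, power):
    """Return a**power for a square matrix and a non-negative integer power."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(power):
        result = mat_mul(result, a)
    return result


# Assumed example graph: edges 1-2, 2-3, 2-4 (0-based indices below).
A = [[0, 1, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 0],
     [0, 1, 0, 0]]
A2 = mat_pow(A, 2)
A3 = mat_pow(A, 3)
```

For instance, `A2[0][2]` is 1 (the single walk 1-2-3) and `A3[0][1]` is 3 (the walks 1-2-1-2, 1-2-3-2 and 1-2-4-2).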
24
Graph Theory – Trace of a matrix
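Trace of the N×N matrix A
$\mathrm{Tr}(A) = \sum_{i=1}^{N} A_{ii}$
In the case of the adjacency matrix of a graph without loops, Tr A = 0.
The trace of powers of A is a graph invariant:
$\mathrm{Tr}(A^2) = \sum_{i=1}^{N} (A^2)_{ii} = \sum_{i=1}^{N} D(i) = 2M$
$\mathrm{Tr}(A^3) = \sum_{i=1}^{N} (A^3)_{ii} = 6C_3$
where D(i) is the degree of vertex i, M is the number of edges, and C_3 is the number of three-membered cycles.
In the case of a graph with n loops,
$\mathrm{Tr}(A) = \sum_{i=1}^{N} h_i$
where h_i is the diagonal entry contributed by the loops at vertex i.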
25
Random Graph Theory = Graph Theory +Probability
26
Random Graph Theory = Graph Theory +Probability
27
Random Graph Theory= Graph Theory + Probability
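Random graph (Erdős and Rényi, 1960)
N labeled nodes connected by n edges
$C^N_2 = N(N-1)/2$ possible edges
$C^{C^N_2}_n$ possible graphs with N nodes and n edges
Number of possible graphs for N = 4, $C^6_n$:
n    C^6_n
1    6
2    15
3    20
4    15
5    6
6    1
[Figure: example graphs with N = 4 and n = 3, 3, 4, 4, 5, 6 edges]
With connection probability p, the degree of a node follows a binomial distribution:
$P(k_i = k) = C^{N-1}_{k}\, p^{k} (1-p)^{N-1-k}$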
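The counts in the N = 4 table and the binomial degree distribution follow directly from binomial coefficients. The helper names below are illustrative, and p is assumed to be the probability that any given pair of nodes is connected.

```python
from math import comb


def possible_edges(n_nodes):
    """Number of node pairs, C(N, 2) = N(N-1)/2."""
    return comb(n_nodes, 2)


def possible_graphs(n_nodes, n_edges):
    """Number of labeled graphs with N nodes and n edges, C(C(N, 2), n)."""
    return comb(possible_edges(n_nodes), n_edges)


def degree_probability(n_nodes, p, k):
    """P(k_i = k) = C(N-1, k) p^k (1-p)^(N-1-k)."""
    return comb(n_nodes - 1, k) * p ** k * (1 - p) ** (n_nodes - 1 - k)


# Reproduces the N = 4 table: n -> number of possible graphs.
TABLE = {n: possible_graphs(4, n) for n in range(1, 7)}
```

Summing `degree_probability(N, p, k)` over k = 0, ..., N-1 gives 1, which is a quick sanity check on the formula.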
28
Random Graph Theory – Random network, Scale free network
Connectivity distribution P(k)
In a random network, the links are randomly connected and most of the nodes have degrees close to <k> = 2E/N. The degree distribution P(k) vs. k is a Poisson distribution, i.e. $P(k) \sim \langle k \rangle^{k} e^{-\langle k \rangle}/k!$
In many real-life networks, the degree distribution has no well-defined peak but follows a power law, $P(k) \sim k^{-\gamma}$, where γ is a constant. Such networks are known as scale-free networks.
Random network
- Log[P(k)] vs Log[k] plot has a peak
- homogeneous nodes
- d ~ log N
Scale-free network
- Log[P(k)] vs Log[k] plot is a line with negative slope
- inhomogeneous nodes
- d ~ log(log N)
Albert R. and Barabasi A.L.(2002) Rev. Mod. Phys. 74, 47
Random network Scale-free network
http://physicsweb.org/box/world/
29
Example – metabolic pathways
- WIT database (43 organisms); node = substrate, edge = reaction
- scale-free network, $P(k) \sim k^{-\gamma}$, with γ_in = 2.2 and γ_out = 2.2
- similar scaling behavior of the connectivity distribution
- Fig. 2d: connectivity distribution averaged over the 43 organisms
- This suggests that metabolic networks belong to the class of scale-free networks
It is interesting to notice that most real networks have 1 < γ < 3.
http://ergo.integratedgenomics.com/IGwit/
30
Random Graph, Scale-free network, Hierarchical network
[Figure panels: clustering coefficient and node degree distribution for the three network types]
Scaling $C_{ave}(k) \sim k^{-1}$ for the deterministic hierarchical network model
Hierarchical network: coexistence of (1) modularity, (2) local clustering, and (3) scale-free behavior
31
Graph Theory – Network motifs
Compare the abundance of small loops in the E. coli transcription regulatory network to its randomized counterpart.
- Treat the transcription network as a directed graph: node = operon (a group of contiguous genes); edge = from an operon that encodes a TF to an operon regulated by that TF
- The frequencies of occurrence of three types of motifs (feed-forward loops, single input modules, and dense overlapping regulons) are much higher than in the randomized version of the network
- There are 13 types of 3-node connected, directed subgraphs
- Feed-forward loops (FFL) were significantly over-represented (40 in the real network vs 7 +/- 5 in random networks)
Reference: S.S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, Nature Genetics, 31(1):64-68 (2002)
32
Graph Theory – Network motifs
Feed-forward loop
- A TF X regulates a second TF Y, and both jointly regulate one or more operons Z1, ..., Zn.
Single input module (SIM)
- A single TF X regulates a set of operons Z1, ..., Zn. X is usually autoregulatory.
Dense overlapping regulons (DOR)
- A set of operons Z1, ..., Zm are each regulated by a combination of a set of TFs, X1, ..., Xn.
33
Graph Theory – Node degree correlation
- In random graph models, node degrees are uncorrelated
- Count the frequency P(K1,K2) that two proteins with connectivity K1 and K2 are connected to each other by a link
- Compare it to the same quantity PR(K1,K2) measured in a randomized version of the same network
- The average node connectivity for a fixed K1 is given by
$\langle K_2 \rangle = \sum_{K_2} K_2 \frac{P(K_1,K_2)}{P_R(K_1,K_2)}$
where <> denotes the multiple sampling average, and the summation runs over all K2 with a fixed K1.
- In the randomized version, the node degrees of each protein are kept the same as in the original network, whereas their linking partners are totally random.
[Figure: two linked nodes with K1 = 2 and K2 = 5; example graph with vertices 1, 2, 3, 4 and its adjacency matrix M]
34
Input - Database of Interacting Proteins (DIP)
DIP http://dip.doe-mbi.ucla.edu
DIP is a database that documents experimentally determined protein-protein interactions. We analyze the protein-protein interactions for seven different species: C. elegans, D. melanogaster, E. coli, H. pylori, H. sapiens, M. musculus and S. cerevisiae.
- Look for general and different features of PINs for different species
35
Input - Database of Interacting Proteins (DIP)
36
Results – scale-free network study
• Large standard deviation of k
• Coefficient of determination, r2 = SSR/SST > 0.90
• To account for the flat plateau and long tail behaviors, assume a short-length scale correction k0 and an exponential cut-off tail at kc:
$P(k) \sim (k + k_0)^{-\gamma} e^{-k/k_c}$
γ_yeast ~ 2.1, γ_fly ~ 1.9
[Figure panels: fly, C. elegans, E. coli, H. pylori, yeast, H. sapiens, M. musculus]
37
Results
Highly connected proteins (k ≥ 25): yeast (39 sequences) and fly (317 sequences). Most of the sequences do not have high sequence similarity (E-value ≤ 0.01), implying different functions.
38
Results
These highly connected proteins are compared pairwise in an all-against-all manner using gapped BLAST (16), and none of the sequences shows significant sequence similarity (E-value < 0.001) except the tryptophan protein and SEC27 protein, nuclear pore protein, 26S proteasome regulatory particle chain and DNA-directed RNA polymerase.
39
Results
[Figure: Log f(L) vs L for E. coli, H. pylori, S. cerevisiae, H. sapiens, M. musculus, D. melanogaster and C. elegans]
Fig. 4. The logarithm of the normalized frequency distribution of connected paths vs the logarithm of their length for S. cerevisiae (CORE), H. pylori, E. coli, H. sapiens, M. musculus and D. melanogaster.
40
Results – node degrees correlation
Highly connected proteins are unlikely to interact.
41
Results – Hierarchical structures
The plots of Log Cave(k) vs Log k for the seven species.
All the species exhibit a rather flat plateau for small values of k, and they fall rapidly for large k.
$C_{ave}(k) \sim k^{-\beta}$, where β is a constant exponent
[Figure panels: yeast, E. coli]
42
Results – identification of cliques
To identify protein complexes:
- compute the clustering coefficients
- find the cliques or pseudo-cliques
43
Identification of cliques
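Theorem
Let $A^3_{ij}$ be the (i,j)-th element of $A^3$. Then a vertex P_i belongs to some clique if and only if $A^3_{ii} \neq 0$.
Example
$A = \begin{pmatrix} 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}$
where A_ij = 1 if there is a walk of length one between vertices i and j, and 0 otherwise, and
$A^3 = \begin{pmatrix} 2 & 4 & 0 & 4 & 3 \\ 4 & 2 & 0 & 3 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 4 & 3 & 0 & 2 & 1 \\ 3 & 1 & 0 & 1 & 0 \end{pmatrix}$
The non-zero diagonal entries of $A^3$ are $a^3_{11}$, $a^3_{22}$ and $a^3_{44}$. Consequently, nodes 1, 2 and 4 belong to cliques. Since a clique must contain at least three vertices, the graph has only one clique.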
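The worked example can be reproduced by cubing the adjacency matrix and reading the diagonal. This is a sketch of the check only. The matrix `A` is the 5-node example as reconstructed from the slide, and the triangle count uses the standard identity Tr(A^3) = 6 C_3.

```python
def mat_mul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


# Adjacency matrix of the 5-node example (node 3 is isolated).
A = [[0, 1, 0, 1, 1],
     [1, 0, 0, 1, 0],
     [0, 0, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 0, 0, 0, 0]]

A3 = mat_mul(mat_mul(A, A), A)

# Nodes (1-based, as on the slide) with a non-zero diagonal entry in A^3.
IN_CLIQUE = [i + 1 for i in range(len(A)) if A3[i][i] != 0]

# Each triangle contributes 6 closed walks of length 3 to the trace.
TRIANGLES = sum(A3[i][i] for i in range(len(A))) // 6
```

The diagonal test flags nodes 1, 2 and 4, and the trace gives exactly one triangle, in agreement with the conclusion above.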
44
Results - protein complexes
Identification of the highest clique degree with protein complexes
We identified all possible cliques within the seven PINs. To identify the relation between cliques and protein complexes, we only considered cliques with the largest number of connected proteins in our preliminary study, and succeeded in showing that some of the cliques do correspond to protein complexes (by comparison with data from the BIND database).
45
Evolution of Biological Networks
Databases – DIP and MIPS
Motif identification
- detecting all n-node subgraphs, i.e. all 2-, 3-, 4- and some 5-node motifs (a set of 28 five-node motifs) in the yeast PIN
- the network consists of 3183 yeast proteins and contains 1000 to 1,000,000 copies of the specific motif types
46
Evolution of Biological Networks
- studied the conservation of 678 (47% of 1443) yeast proteins with an ortholog in each of five higher eukaryotes (A. thaliana, C. elegans, D. melanogaster, M. musculus and H. sapiens) deposited in the InParanoid database
- 47% of the 1443 fully connected pentagons (#11) in yeast have each of their five protein components conserved in each of the five higher eukaryotes
- this result suggests that blocks of cohesive motifs tend to be evolutionarily conserved
47
Evolution of Biological Networks
Growth model of a scale-free PIN
- New protein nodes are added (gene duplication)
- Redundant links are lost (in an asymmetric fashion)
- Preferential attachment
48
Evolution of Biological Networks
Growth
1. start with m0 nodes
2. add a node with m edges
3. connect these edges to existing nodes
At time step t: t + m0 nodes, tm edges
Preferential attachment
The probability q of connection to node i depends on the degree ki of this node:
$q(k_i) = \frac{k_i}{\sum_j k_j}$
[Figure: example with m0 = 3, m = 2]
This model leads to the power-law distribution
$P(k) = 2m^2 k^{-3} \sim k^{-3}$
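The growth and preferential attachment steps can be simulated in a few lines. The sketch below is one possible reading of the model, not the authors' implementation. Two details are assumptions: the m0 starting nodes are isolated, so the first new node picks its targets uniformly because all degrees are zero, and each new node links to m distinct existing nodes.

```python
import random


def grow_network(m0, m, steps, seed=0):
    """Grow a network by preferential attachment; return (degree dict, edge list)."""
    if not 1 <= m <= m0:
        raise ValueError("need 1 <= m <= m0")
    rng = random.Random(seed)
    degree = {node: 0 for node in range(m0)}
    edges = []
    for t in range(steps):
        new_node = m0 + t
        existing = list(degree)
        weights = [degree[n] for n in existing]
        targets = set()
        while len(targets) < m:
            if sum(weights) == 0:
                # Assumed start-up rule: no links yet, so choose uniformly.
                candidate = rng.choice(existing)
            else:
                # q(k_i) = k_i / sum_j k_j
                candidate = rng.choices(existing, weights=weights, k=1)[0]
            targets.add(candidate)
        degree[new_node] = 0
        for target in targets:
            edges.append((new_node, target))
            degree[new_node] += 1
            degree[target] += 1
    return degree, edges


DEGREE, EDGES = grow_network(m0=3, m=2, steps=50)
```

After t steps the network has t + m0 nodes and tm edges, as stated above. Checking the k^-3 tail would need a much larger run and a log-log histogram of the degrees.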
49
Summary
Protein-protein interaction network
• PINs are not random networks; they have rather heterogeneous structures. For the highly connected proteins, blastp shows that they do not share sequence similarity.
• The plots of Log[Pcum(k)] vs Log[k] indicate that PINs are well approximated by scale-free networks.
• γ ~ 2: a general biological evolution mechanism across species, the growth + preferential attachment model.
• The plots of Log[Pcum(k)] vs Log[k] for fly and yeast seem to deviate at small and large k values, suggesting a modification of the growth + preferential attachment model.
• Highly connected proteins are unlikely to interact.
• The hierarchical network model is a better description for certain species' PINs.
[Figure: Log Cave(k) vs Log k for S. cerevisiae (CORE), with regression line]
50
Matrix
Permutations
A one-to-one mapping of the set {1, 2, 3, ..., n} onto itself is called a permutation. We denote the permutation σ by σ = j1 j2 ... jn.
The number of possible permutations is n!, and the set of them is denoted by Sn. For example, S2 = {12, 21}, S3 = {123, 132, 213, 231, 312, 321}.
Consider an arbitrary permutation σ in Sn: σ = j1 j2 ... jn. We say σ is even or odd according to whether there is an even or odd number of pairs (i,k) for which i > k but i precedes k in σ. We then define the sign of σ, written sgn σ, by
$\mathrm{sgn}(\sigma) = \begin{cases} 1 & \text{if } \sigma \text{ is even} \\ -1 & \text{if } \sigma \text{ is odd} \end{cases}$
51
Matrix
Example
Consider the permutation σ = 35142 in S5.
- 3 and 5 precede and are greater than 1; hence (3,1) and (5,1)
- 3, 5 and 4 precede and are greater than 2; hence (3,2), (5,2), (4,2)
- 5 precedes and is greater than 4; hence (5,4)
Since exactly 6 pairs satisfy the condition, σ is even and sgn(σ) = 1.
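The parity rule amounts to counting pairs that appear out of natural order. The sketch below lists those pairs for the example permutation 35142. The function names are illustrative.

```python
def inversions(perm):
    """Pairs (i, k) with i > k where i precedes k in the permutation."""
    return [(perm[a], perm[b])
            for a in range(len(perm))
            for b in range(a + 1, len(perm))
            if perm[a] > perm[b]]


def sign(perm):
    """sgn = +1 for an even number of inversions, -1 for an odd number."""
    return 1 if len(inversions(perm)) % 2 == 0 else -1


SIGMA = (3, 5, 1, 4, 2)
PAIRS = inversions(SIGMA)
```

`PAIRS` contains the same six pairs that are listed by hand above, so `sign(SIGMA)` is 1.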
52
Matrix
Determinant
The determinant of matrix A is defined by
$\det(A) = \sum_{\sigma \in S_n} \mathrm{sgn}(\sigma)\, a_{1j_1} a_{2j_2} \cdots a_{nj_n}$
where the factors come from successive rows, so the first subscripts are in the natural order 1, 2, ..., n. The sequence of second subscripts forms a permutation σ = j1 j2 ... jn in Sn, and the sum runs over all permutations in Sn.
Example
In S2, the permutation 12 is even and the permutation 21 is odd:
det(A) = a11a22 – a12a21
In S3, the permutations 123, 231 and 312 are even, and the permutations 132, 213 and 321 are odd:
det(A) = a11a22a33 + a12a23a31 + a13a21a32 – a13a22a31 – a12a21a33 – a11a23a32
Permanent of A
$\mathrm{per}(A) = \sum_{\sigma \in S_n} a_{1j_1} a_{2j_2} \cdots a_{nj_n}$
i.e. the same sum with sgn(σ) always taken to be 1.
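The permutation-sum definitions of the determinant and the permanent translate directly into code. The sketch below is only meant to illustrate the definition: it visits all n! permutations, so it is practical for small matrices only. The test matrices are arbitrary examples.

```python
from itertools import permutations


def sign(perm):
    """+1 if the permutation has an even number of inversions, else -1."""
    inv = sum(1 for a in range(len(perm)) for b in range(a + 1, len(perm))
              if perm[a] > perm[b])
    return 1 if inv % 2 == 0 else -1


def permutation_sum(matrix, signed):
    """Sum over all permutations of a[0][j0] * a[1][j1] * ... (times sgn if signed)."""
    n = len(matrix)
    total = 0
    for perm in permutations(range(n)):
        term = sign(perm) if signed else 1
        for row, col in enumerate(perm):
            term *= matrix[row][col]
        total += term
    return total


def det(matrix):
    return permutation_sum(matrix, signed=True)


def per(matrix):
    return permutation_sum(matrix, signed=False)


M2 = [[1, 2], [3, 2]]
M3 = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
```

For `M2` the two terms are 1·2 and 2·3, so the determinant is 2 - 6 = -4 and the permanent is 2 + 6 = 8.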
53
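Matrix
The incidence matrix
The incidence matrix T(G) of a graph G with N vertices and M edges is the N×M matrix whose rows and columns correspond to the vertices and edges, respectively, of G. It is defined as
T_ij = 1 if the j-th edge is incident with the i-th vertex, and 0 otherwise.
Example
[Figure: graph G with vertices 1, 2, 3, 4 and edges a, b, c, d]
With rows 1, 2, 3, 4 and columns a, b, c, d:
$T(G) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}$
$\sum_{i,j} T_{ij} = \sum_i D_{ii} = 2M$, here $2 \times 4 = 8$
where D_ii is the degree of vertex i and M is the number of edges.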
54
Matrix
The circuit matrix
The circuit matrix C(G) of a graph G, whose cycles (circuits) c and edges e are labeled, is a c×e matrix whose rows and columns correspond to the circuits and edges, respectively, of G. It is defined as
C_ij = 1 if the i-th cycle contains the j-th edge, and 0 otherwise.
Example
[Figure: graph G with vertices 1, 2, 3, 4 and edges a, b, c, d, e]
Cycles: C1 = {a,d,e}, C2 = {b,c,e}, C3 = {a,b,c,d}
With rows C1, C2, C3 and columns a, b, c, d, e:
$C(G) = \begin{pmatrix} 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 \end{pmatrix}$
55
Principal Components Analysis (PCA)
General outline
Suppose you have a microarray dataset composed of 1000 genes, each of which has an expression value over 10 experiments. The dimensionality of that dataset is therefore 10 or 1000, depending on whether the genes or the experiments are treated as the points.
The data, though clumped around several central points in that hyperspace, will generally tend towards one direction. If one were to draw a solid line that best describes that direction, then that line is the first principal component (PC). The result is a space in which the axes are the eigenvectors of the covariance matrix of the experiments (in this space each point is a gene) or the genes (in this space each point is an experiment).
Any variation that is not captured by that first PC is captured by subsequent orthogonal PCs (each captures the maximum amount of variation left in the data).
The first 3 PCs could themselves act as Cartesian axes. The data they capture can therefore be plotted in terms of these axes. Hence there is a reduction of dimensionality.
When the data are plotted in this manner they are said to be plotted in PC-space.
References
1. http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/PCA_1.htm
2. Draghici S., 2003. Data Analysis Tools for DNA Microarrays. Chapman & Hall/CRC
56
Principal Components Analysis (PCA)
- PCA is commonly used in microarray research as a cluster analysis tool.
- It captures the variance in a dataset in terms of principal components.
- It tries to reduce the dimensionality of the data to summarize the most important (i.e. defining) parts whilst simultaneously filtering out noise.
- Normalization, however, can sometimes remove this noise and make the data less variant, which could affect the ability of PCA to capture data structure.
- PCA can be imposed on datasets to capture the cluster structure (just using the first few PCs) prior to cluster analysis (e.g. before performing k-means clustering to determine a good value for K).
[Figure: coordinate transformation (translation + rotation). Most of the variance lies along the first eigenvector; variance along the second eigenvector is probably due to noise.]
57
Principal Components Analysis (PCA)
- PCA pays attention to those dimensions that account for a large variance in the data and ignores the dimensions in which the data do not vary very much.
- The direction of a PC is determined by calculating the eigenvectors of the covariance matrix of the pattern.
- An eigenvector of a matrix A is defined as a vector z such that Az = λz, where λ is the eigenvalue.
- For example,
$A = \begin{pmatrix} -1 & 1 \\ 0 & -2 \end{pmatrix}$
has eigenvalues λ1 = -1 and λ2 = -2, and eigenvectors z1 = (1, 0) and z2 = (1, -1):
$A z_1 = \begin{pmatrix} -1 & 1 \\ 0 & -2 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = (-1) \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \lambda_1 z_1$
A similar expression holds for λ2 = -2.
- In other words, the covariance matrix captures the shape of the set of data.
- The eigenvalue with the largest absolute value implies that the data have the largest variance along its eigenvector.
58
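Principal Components Analysis (PCA)
How to calculate the eigenvalues of a matrix? Solve
$\det(A - \lambda I) = \begin{vmatrix} -1-\lambda & 1 \\ 0 & -2-\lambda \end{vmatrix} = (-1-\lambda)(-2-\lambda) = 0$
How to find the eigenvectors? For λ1 = -1:
$\begin{pmatrix} -1 & 1 \\ 0 & -2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = (-1) \begin{pmatrix} x \\ y \end{pmatrix}$
gives -x + y = -x and -2y = -y, so y = 0 and x can be anything; choose x = 1, y = 0.
For λ2 = -2:
$\begin{pmatrix} -1 & 1 \\ 0 & -2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = (-2) \begin{pmatrix} x \\ y \end{pmatrix}$
gives -x + y = -2x and -2y = -2y, so y = -x; choose x = 1, y = -1.
Normalization of eigenvectors: $|z| = \sqrt{x^2 + y^2} = 1$
Eigenvectors are defined up to a multiplicative constant.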
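The hand calculation can be cross-checked numerically. The sketch below solves the 2×2 characteristic equation with the quadratic formula and verifies Az = λz for the chosen eigenvectors. It assumes the example matrix A = [[-1, 1], [0, -2]] (as implied by the stated eigenpairs) and real eigenvalues.

```python
import math

A = [[-1, 1], [0, -2]]


def eigenvalues_2x2(matrix):
    """Roots of lambda^2 - (trace) lambda + det = 0, largest first (real case only)."""
    (a, b), (c, d) = matrix
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues are not handled in this sketch")
    root = math.sqrt(disc)
    return ((trace + root) / 2, (trace - root) / 2)


def mat_vec(matrix, vector):
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]


def is_eigenpair(matrix, eigenvalue, vector, tol=1e-12):
    """Check A z = lambda z component by component."""
    lhs = mat_vec(matrix, vector)
    return all(abs(l - eigenvalue * v) < tol for l, v in zip(lhs, vector))


def normalize(vector):
    """Scale a vector to unit length, |z| = 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


LAMBDAS = eigenvalues_2x2(A)
Z2_UNIT = normalize([1, -1])
```

Normalizing z2 = (1, -1) gives (1/√2, -1/√2), which is still an eigenvector because eigenvectors are only fixed up to a constant factor.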
59
Eigenvalues and eigenvectors
If A is a square matrix, then λ is an eigenvalue of A if, for some nonzero vector v,
Av = λv
where v is an eigenvector of A belonging to λ.
Linear dependence
The vectors v1, ..., vm are said to be linearly dependent if there exist scalars a1, ..., am, not all of them 0, such that
a1v1 + ... + amvm = 0
Otherwise, the vectors are said to be linearly independent.
Example
The vectors u = (1, -1, 0), v = (1, 3, -1) and w = (5, 3, -2) are dependent since 3u + 2v - w = 0.
Theorem
Nonzero eigenvectors belonging to distinct eigenvalues are linearly independent.
60
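Eigenvalues and eigenvectors
Theorem
An n-square matrix A is similar to a diagonal matrix B if and only if A has n linearly independent eigenvectors. In this case the diagonal elements of B are the corresponding eigenvalues.
In the above theorem, if we let P be the matrix whose columns are the n independent eigenvectors of A, then $B = P^{-1}AP$.
Example: Consider the matrix $A = \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}$.
A has two independent eigenvectors $\begin{pmatrix} 2 \\ 3 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$.
Set $P = \begin{pmatrix} 2 & 1 \\ 3 & -1 \end{pmatrix}$ and so $P^{-1} = \begin{pmatrix} 1/5 & 1/5 \\ 3/5 & -2/5 \end{pmatrix}$.
Then A is similar to the diagonal matrix
$B = P^{-1}AP = \begin{pmatrix} 1/5 & 1/5 \\ 3/5 & -2/5 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 3 & -1 \end{pmatrix} = \begin{pmatrix} 4 & 0 \\ 0 & -1 \end{pmatrix}$
As expected, the diagonal elements 4 and -1 of the diagonal matrix B are the eigenvalues corresponding to the given eigenvectors.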
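The diagonalization B = P^-1 A P can be verified with exact rational arithmetic. The sketch below assumes the example values A = [[1, 2], [3, 2]] and eigenvector columns (2, 3) and (1, -1). The helper names are illustrative.

```python
from fractions import Fraction


def mat_mul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def inverse_2x2(matrix):
    """Exact inverse of a 2x2 integer matrix using Fractions."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]


A = [[1, 2], [3, 2]]
P = [[2, 1], [3, -1]]          # columns are the eigenvectors (2, 3) and (1, -1)
P_INV = inverse_2x2(P)
B = mat_mul(mat_mul(P_INV, A), P)
```

`B` comes out as the diagonal matrix with entries 4 and -1, and `P_INV` matches the inverse quoted in the example.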
61
Characteristic Polynomial – Cayley-Hamilton Theorem
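The matrix tI - A, where I is the n-square identity matrix and t is an indeterminate, is called the characteristic matrix of A. Its determinant $D_A(t) = \det(tI - A)$, which is a polynomial in t, is called the characteristic polynomial of A.
Cayley-Hamilton Theorem
Every matrix is a zero of its characteristic polynomial.
Example:
The characteristic polynomial of the matrix $A = \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}$ is
$D(t) = (t-1)(t-2) - (-2)(-3) = t^2 - 3t - 4$. As expected, A is a zero of D(t):
$D(A) = \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}^2 - 3 \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix} - 4 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$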
62
Singular Value Decomposition
Performing PCA is the equivalent of performing Singular Value Decomposition (SVD) on the covariance matrix of the data.
Applications: image processing and compression, information retrieval, immunology, molecular dynamics, least-squares error
What is SVD? (See MIT paper at the bottom of this page for more depth)
For the sake of example, let A be an n×p matrix that represents gene expression experiments such that the rows are genes and the columns are experiments (i.e. arrays). The SVD of A is said to be the factorization
$A = USV^T$ (Eq. 1)
(Note that $V^T$ means 'the transpose of matrix V'.)
How to determine U, V and S?
In equation 1, matrices U and V are orthogonal (their columns are orthonormal, $U^TU = I$, $V^TV = I$). The columns of U are called left singular vectors (gene coefficients) and the rows of $V^T$ are called right singular vectors (expression level vectors).
To calculate the matrices U and V, one must calculate the eigenvectors and eigenvalues of $AA^T$ and $A^TA$. These multiplications of A by its transpose result in square matrices (the number of columns is equal to the number of rows).
63
Singular Value Decomposition
- The eigenvectors of $AA^T$ (an n×n matrix) give the columns of U (an n×n matrix)
- The eigenvectors of $A^TA$ (a p×p matrix) give the columns of V (a p×p matrix)
- The eigenvalues of $AA^T$ or $A^TA$, when square-rooted (s1 ≥ s2 ≥ ...), give the diagonal entries of S (an n×p matrix)
The diagonal entries of S are said to be the singular values (min(n,p) of them) of the original matrix A.
Each eigenvector described above represents a principal component. PC1 (Principal Component 1) is defined as the eigenvector with the highest corresponding eigenvalue. The individual eigenvalues are numerically related to the variance captured by the PCs: the higher the value, the more variance they have captured.
Please also see Yeung & Ruzzo (2001), where it is shown that lower PCs can capture data structure.
http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
http://www.netlib.org/lapack/lug/node53.html or http://kwon3d.com/theory/jkinem/svd.html
64
Singular Value Decomposition
[Figure: worked SVD example showing the eigenvalues of $AA^T$ or $A^TA$, the eigenvectors of $AA^T$, and the eigenvectors of $A^TA$]
65
Yeast Two-hybrid System
pubs.acs.org/cen/coverstory/ 7831/7831scit1.html
http://cmbi.bjmu.edu.cn/
Have two plasmids:
- fusion of the target to the activation domain (AD)
- fusion of the bait to the DNA binding domain (BD)
If the target protein binds to the bait protein, it brings AD close to BD.
BD binds to the GAL4 promoter and brings AD close so that it can activate transcription. LacZ is expressed from a GAL4 promoter when the proteins are interacting. LacZ will make the yeast blue.
66
Immunoprecipitation
(1) Antibody added to a mixture of radiolabeled (*) and unlabeled proteins binds specifically to its antigen (A) (left tube).
(2) The antibody-antigen complex is absorbed from solution through the addition of an immobilized antibody-binding protein such as Protein A-Sepharose beads (middle panel).
(3) Upon centrifugation, the antibody-antigen complex is brought down in the pellet (right panel). Subsequent liberation of the antigen can be achieved by boiling the sample in the presence of SDS.
67
Co-Immunoprecipitation
Co-IP vs. IP
Co-immunoprecipitation (Co-IP) is a popular technique for protein interaction discovery. Co-IP is conducted in essentially the same manner as an IP. However, in a co-IP the target antigen precipitated by the antibody "co-precipitates" a binding partner/protein complex from a cell lysate, i.e., the interacting protein is bound to the target antigen, which becomes bound by the antibody that becomes captured on the Protein A or G gel support.