C. Ding, Protein Interaction Module Detection using Graph Algorithms
Protein Interaction Module Detection using Matrix-Based Graph Algorithms
Chris Ding, Lawrence Berkeley National Laboratory
Bioinformatics & Computational Biology
Computational genomics: molecular biology at the genomic level
Genomics Research
• More than 100 genomes' DNA sequenced
• DNA microarray chip technology
• Protein – protein interaction technology
• Gene knock-out for gene regulatory networks
• Many high-throughput technologies
• Bio-imaging (embryo imaging, EM)
• Huge number of databases
– GenBank, Protein Data Bank, SCOP, Pfam
• Gene Ontology
A Genomics Research Trend
• A large number of genomes have been sequenced.
• Traditional approach: predict genes, predict proteins, predict structures, predict functions
• This structural genomics is inadequate
• Protein interactions: a new approach
Protein – Protein Interactions
Proteins carry out tasks together with other proteins
• 83% of proteins interact with others
• Proteins interact in promoters
• Multi-protein complexes (assemblies)
• Synergistic interactions
• Complex – complex cross-talk
Proteins work in a modular fashion
• Gene regulation
• Biological pathways
Most drugs block certain pathways
Major goal of research: detect protein modules
Protein interaction experiments
• Two-hybrid assay
– Protein coordination in promoter region
– Binary interactions
– Captures transient and unstable interactions
• Mass spectrometry
– TAP-MS: tandem affinity purification
– HMS-PCI: high-throughput protein interaction identification
– Uses bait proteins
– Captures multi-protein complexes
• Problems:
– Results do not agree
– Lots of noise
Tandem affinity purification with mass spectrometry (TAP-MS) determines the constituents of multi-protein complexes.
Gavin et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002;415(6868):141-147.
More reliable technology (Deng et al.)
Many baits are processed simultaneously to obtain many complexes
Protein Interaction Experiments
Different experiments don't agree: small overlap
                           Two-hybrid   Two-hybrid   Mass Spec     Mass Spec
                           Ito et al    Uetz et al   Gavin et al   Ho et al
Two-hybrid, Ito et al        4363         186           54            63
Two-hybrid, Uetz et al        186        1403           54            56
Mass Spec, Gavin et al         54          54         3222           198
Mass Spec, Ho et al            63          56          198          3596
(Salwinski and Eisenberg, 2003)
Protein Interactions
A genome has 5000 proteins. Each interacts with ~5 others.
Outline
• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
• Results
Bipartite Graph Model
Protein complex: p-nodes: proteins; c-nodes: protein complexes
Protein domain: p-nodes: proteins; c-nodes: protein domains
Unified Representation of Protein Complex Data
Input: protein complex data B
Protein – protein network:
$(BB^T)_{ij}$ = # of protein complexes containing both proteins $p_i, p_j$
Protein complex – protein complex network:
$(B^TB)_{ij}$ = # of proteins shared by protein complexes $c_i, c_j$
(Ding, He, Meraz, Holbrook, Proteins, 2004)
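The two projections can be sketched on a toy example. The matrix below is hypothetical illustration data (4 proteins, 3 complexes), not taken from the talk; it assumes rows of B are proteins and columns are complexes, with B[i, k] = 1 when protein i belongs to complex k.

```python
import numpy as np

# Hypothetical toy data: rows are proteins p0..p3, columns are complexes c0..c2.
B = np.array([
    [1, 1, 0],  # p0 is in c0 and c1
    [1, 0, 0],  # p1 is in c0
    [0, 1, 1],  # p2 is in c1 and c2
    [0, 1, 1],  # p3 is in c1 and c2
])

# (B B^T)[i, j]: number of complexes that contain both protein i and protein j.
protein_net = B @ B.T

# (B^T B)[i, j]: number of proteins shared by complex i and complex j.
complex_net = B.T @ B

print(protein_net)
print(complex_net)
```

The diagonal entries carry the number of complexes per protein and the number of proteins per complex, respectively.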
Bridged Bipartite Graph: Pfam domains match SCOP domains
Bipartite graphs: A, B
Matching:
$C(s_i, f_j) = \sum_k A_{ik} B_{jk} = (AB^T)_{ij}$
Co-location of domains:
$R(s_i, s_j) = \sum_k A_{ik} A_{jk} = (AA^T)_{ij}$
$R(f_i, f_j) = \sum_k B_{ik} B_{jk} = (BB^T)_{ij}$
Reaches 90% accuracy compared to direct match
(Zhang, Chandonia, Ding, Holbrook, BMC Bioinformatics 2004)
Protein Interaction Module: densely connected subgraphs
Protein Interaction Modules
Find highly connected regions:
• Graph clustering
• Cliques
• Bi-cliques
Outline
• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
Spectral Clustering: MinMaxCut
• max within-cluster similarities (weights): $\mathrm{sim}(A,A) = \sum_{i \in A} \sum_{j \in A} w_{ij}$
• min between-cluster similarities (weights): $\mathrm{sim}(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$
(Ding, RECOMB'02)
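Spectral Clustering Method (MinMaxCut)
Minimize similarity between A and B; maximize similarity within A and B:
$\min_{A,B} J_{MMC} = \frac{\mathrm{sim}(A,B)}{\mathrm{sim}(A,A)} + \frac{\mathrm{sim}(A,B)}{\mathrm{sim}(B,B)}$
where $\mathrm{sim}(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$ and $\mathrm{sim}(A,A) = \sum_{i \in A} \sum_{j \in A} w_{ij}$.
Cluster membership indicator:
$q(i) = \begin{cases} \sqrt{n_2 / (n_1 n)} & \text{if } i \in A \\ -\sqrt{n_1 / (n_2 n)} & \text{if } i \in B \end{cases}$
Minimizing $J_{MMC}$ leads to
$\min_q J(q) \Rightarrow \min_q \frac{q^T (D - W) q}{q^T D q}$
Solution given by eigenvector: $(D - W) q = \lambda D q$
Cluster assignment: $A = \{ i \mid q_1(i) \ge 0 \}$, $B = \{ i \mid q_1(i) < 0 \}$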
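A minimal sketch of the 2-way step, under stated assumptions: W is a symmetric non-negative similarity matrix of a connected graph, D is the diagonal matrix of row sums of W, and the generalized eigenproblem is solved through the symmetric normalization D^{-1/2}(D - W)D^{-1/2}. The function names and the toy graph are illustrative only, not taken from the talk.

```python
import numpy as np

def spectral_bipartition(W):
    """Split nodes by the sign of the first non-trivial eigenvector of (D - W) q = lambda D q."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                      # node degrees (diagonal of D)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric form: D^{-1/2} (D - W) D^{-1/2} z = lambda z, with q = D^{-1/2} z.
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    q = d_inv_sqrt * eigvecs[:, 1]             # skip the trivial eigenvector (lambda = 0)
    return q, q >= 0

def minmaxcut_objective(W, in_A):
    """J_MMC = sim(A,B)/sim(A,A) + sim(A,B)/sim(B,B) for a boolean membership mask."""
    W = np.asarray(W, dtype=float)
    in_A = np.asarray(in_A, dtype=bool)
    sim_AB = W[in_A][:, ~in_A].sum()
    sim_AA = W[in_A][:, in_A].sum()
    sim_BB = W[~in_A][:, ~in_A].sum()
    return sim_AB / sim_AA + sim_AB / sim_BB

# Hypothetical toy graph: two triangles joined by one weak edge (weight 0.1).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

q, in_A = spectral_bipartition(W)
print(in_A, minmaxcut_objective(W, in_A))
```

The sign of an eigenvector is arbitrary, so which triangle ends up labeled A can differ between runs or libraries; the partition itself is what matters.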
An NP-hard, intractable combinatorial optimization problem can be effectively solved by computing a single eigenvector!
Spectral Clustering
• 2-way clustering
• K-way clustering
– Recursive 2-way clustering
– K-way relaxation (K eigenvectors)
• Cluster self-aggregation and perturbation analysis
• Characteristics
– Principled approach
– Clear and well-motivated clustering objective functions
– Everything is proved rigorously
– Based on well-established matrix/algebra theory
– A rich framework (clustering, ordering, ranking, etc.)
• State-of-the-art algorithm
Recursive MinMaxCut Clustering of Lymphoma Tissues
B cell lymphoma goes through different stages
• 3 normal stages
• 3 cancer stages
Key question: can we detect them automatically?
(Alizadeh et al, 2000)
(Ding, RECOMB'02)
Gene expression of lymphoma (Stanford)
(Ding, RECOMB’02)
Spectral Clustering
• 2-way clustering
• K-way clustering
– Recursive 2-way clustering
– K-way relaxation (K eigenvectors) (principled)
• Cluster self-aggregation and perturbation analysis
• Characteristics
– Principled approach
– Clear and well-motivated clustering objective functions
– Everything is proved rigorously
– Based on well-established matrix/algebra theory
– A rich framework (clustering, ordering, ranking, etc.)
• State-of-the-art algorithm
Outline
• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
– Application to computing protein interaction modules
• Cliques
• Bi-cliques
• Results
Clustering Protein Complex Graph
Input: protein complex data B
Protein – protein network:
$(BB^T)_{ij}$ = # of protein complexes containing both proteins $p_i, p_j$
Protein complex – protein complex network:
$(B^TB)_{ij}$ = # of proteins shared by protein complexes $c_i, c_j$
(Ding, He, Meraz, Holbrook, Proteins, 2004)
Experimental Protein Complex vs. Computed
Implications of discovered protein clusters on protein interactions: F-statistics
F-statistics of amino acids and physical properties across all protein clusters (statistical significance):
Lys 100   Asn 56   Val 30   Ile 24
Asp  89   Gln 50   Tyr 29   Ser 23
Arg  73   Cys 39   Met 29   Leu 22
Pro  70   His 33   Trp 28   Gly 21
Glu  66   Ala 31   Thr 28   Phe 21
pI 169   Basic 149   Acidic 97   MW 60   Aromatic 30
Helix 37   Beta-Sheet 33   Coil 27
$F = \frac{\frac{1}{K-1} \sum_{k=1}^{K} n_k (\bar{f}_k - \bar{f})^2}{\frac{1}{n-K} \sum_{k=1}^{K} (n_k - 1) \sigma_k^2}$
Lys, Arg, Asp are most significant => electrostatic forces are the dominant surface factors influencing protein interactions.
Surprise: secondary structure is not an important factor in protein module formation.
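As a sketch of how such a score could be computed for one feature (for example, the fraction of one amino acid in each protein), assuming the standard one-way F-statistic: between-cluster variance over within-cluster variance, with K clusters and n proteins. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def f_statistic(values, cluster_ids):
    """One-way F-statistic of a per-protein feature across clusters."""
    values = np.asarray(values, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    clusters = np.unique(cluster_ids)
    n, K = len(values), len(clusters)
    grand_mean = values.mean()
    between = 0.0   # sum_k n_k (mean_k - grand_mean)^2
    within = 0.0    # sum_k (n_k - 1) * sample_variance_k
    for c in clusters:
        v = values[cluster_ids == c]
        between += len(v) * (v.mean() - grand_mean) ** 2
        within += ((v - v.mean()) ** 2).sum()
    return (between / (K - 1)) / (within / (n - K))

# Hypothetical feature values for 9 proteins in 3 clusters.
values = [1, 2, 3, 2, 3, 4, 5, 6, 7]
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(f_statistic(values, cluster_ids))
```

A large value means the feature differs between clusters far more than it varies inside them.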
Protein Secondary Structure
• Alpha helix
• Beta sheet
• Coil regions
Outline
• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
• Results
Protein Interaction Modules
• Find highly connected regions
• Cliques
• k-core: subgraph in which every node has degree ≥ k
Clique: every node connects to everyone else
K-core (k = 3): every node connects to at least 3 others
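For contrast with cliques, a k-core is easy to compute by repeatedly peeling off low-degree nodes. A minimal sketch, assuming the usual definition (every remaining node has at least k neighbors inside the subgraph) and a hypothetical adjacency-dictionary input:

```python
def k_core(adj, k):
    """Return the node set of the k-core of an undirected graph.

    adj maps each node to the set of its neighbors.
    """
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # Count only neighbors that are still in the candidate subgraph.
            if len(adj[v] & alive) < k:
                alive.discard(v)
                changed = True
    return alive

# Hypothetical toy graph: a 4-clique {a, b, c, d} with a tail a - e - f.
adj = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "f"},
    "f": {"e"},
}
print(sorted(k_core(adj, 3)))
```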
Motzkin-Straus Formalism for computing maximal cliques
Clique computing is NP-hard. Even approximating the clique is hard.
Motzkin-Straus Theorem:
$\max_{x \ge 0} x^T A x$ s.t. $x_1 + \cdots + x_n = 1$, with $A_{ii} = 0$
$x = (x_1, \ldots, x_n)^T$: vector on all nodes of the graph
L1 enforces sparsity
Non-zero entries define the clique
Clique: every node connects to everyone else
Generalized Motzkin-Straus Formalism
$\max_{x \ge 0} x^T A x$ s.t. $x_1^\beta + \cdots + x_n^\beta = 1$, $\beta \ge 1$, with $A_{ii} = 1$
$x = (x_1, \ldots, x_n)^T$: vector on all nodes of the graph
L1 enforces sparsity
Non-zero entries define the clique
Setting $\beta = 1.05$, we can compute the maximum clique better than with the standard approach, $\beta = 1.0$.
(Ding, Zhang, Holbrook, 2006)
Algorithm for computing clique
Solving the constrained quadratic programming problem
$\max_{x \ge 0} x^T A x$ s.t. $x_1^\beta + \cdots + x_n^\beta = 1$, $\beta \ge 1$
Initialize: $x = (1, \ldots, 1)^T$
Update: $x_i \leftarrow \left[ \frac{x_i (Ax)_i}{x^T A x} \right]^{1/\beta}$
Theorem 1. Correctness: the solution converges to a local maximum
Theorem 2. Convergence: the iterative algorithm converges
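A minimal sketch of this iteration, assuming the multiplicative update x_i <- [x_i (Ax)_i / x^T A x]^(1/beta) with A_ii = 1 and an all-ones start. The function name, the iteration count, and the relative threshold used to read off the "non-zero" entries are assumptions of this sketch; with beta > 1 the discarded entries become extremely small rather than exactly zero.

```python
import numpy as np

def clique_by_generalized_ms(adj, beta=1.05, n_iter=1000, tol=1e-6):
    """Iterate the multiplicative update and return (x, indices of the large entries)."""
    A = np.array(adj, dtype=float)
    np.fill_diagonal(A, 1.0)          # generalized formalism uses A_ii = 1
    x = np.ones(A.shape[0])           # initialize x = (1, ..., 1)^T
    for _ in range(n_iter):
        Ax = A @ x
        # After this step sum_i x_i^beta = 1 holds by construction.
        x = (x * Ax / (x @ Ax)) ** (1.0 / beta)
    support = np.flatnonzero(x > tol * x.max())
    return x, support

# Hypothetical toy graph: a 4-clique {0,1,2,3} and a triangle {0,4,5} sharing node 0.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4), (0, 5), (4, 5)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1

x, support = clique_by_generalized_ms(adj)
print(support)
```

On this toy graph the weights concentrate uniformly on the larger clique, while the entries of the triangle-only nodes shrink by many orders of magnitude.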
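Proof of Correctness
Constrained optimization theory
Introduce the Lagrangian function:
$L = x^T A x - \lambda (x_1^\beta + \cdots + x_n^\beta - 1)$
KKT optimality condition (complementary slackness):
$[\, 2 (Ax)_i - \lambda \beta x_i^{\beta - 1} \,]\, x_i = 0$
Lagrangian multiplier value (sum the KKT condition over i):
$\lambda = 2 (x^T A x) / \beta$
Update rule:
$x_i \leftarrow \left[ \frac{x_i (Ax)_i}{\lambda \beta / 2} \right]^{1/\beta}$
At convergence, $x^*$ satisfies the KKT condition:
$x_i^* = \left[ \frac{x_i^* (Ax^*)_i}{\lambda \beta / 2} \right]^{1/\beta}$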
Proof of Convergence
Using an auxiliary function (from machine learning)
G(x,x') is an auxiliary function of L(x) if
$G(x, x') \le L(x)$, $G(x, x) = L(x)$
We maximize a lower bound. Set
$x^{(t+1)} = \arg\max_x G(x, x^{(t)})$
Then
$L(x^{(t)}) = G(x^{(t)}, x^{(t)}) \le G(x^{(t+1)}, x^{(t)}) \le L(x^{(t+1)})$
so $L(x^{(1)}) \le L(x^{(2)}) \le L(x^{(3)}) \le \cdots$
L(x) is monotonically increasing and bounded from above. Thus the algorithm converges.
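Proof of Convergence (cont.)
Key: (1) find an auxiliary function, (2) find its global maximum
The auxiliary function is
$G(x, x') = \sum_{i,j} x'_i A_{ij} x'_j \left( 1 + \log \frac{x_i x_j}{x'_i x'_j} \right) - \lambda \left( \sum_i x_i^\beta - 1 \right)$
First-order derivative:
$\frac{\partial G(x, x')}{\partial x_i} = \frac{2 x'_i (Ax')_i}{x_i} - \lambda \beta x_i^{\beta - 1}$
Second-order derivative:
$\frac{\partial^2 G(x, x')}{\partial x_i \partial x_j} = - \left[ \frac{2 x'_i (Ax')_i}{x_i^2} + \lambda \beta (\beta - 1) x_i^{\beta - 2} \right] \delta_{ij}$
is negative definite. Thus G(x,x') is concave in x, and the global maximum is easily obtained.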
Partial list of Discovered Cliques
• snoRNA binding: Emg1 Imp3 Imp4 Kre31 Mpp10 Nop14 Sof1 YMR093W YPR144C
• RNA binding: Cus1 Msl1 Prp3 Prp9 Sme1 Smx2 Smx3 Yhc1 YJR084W
• structural constituent of ribosome: Fyv4 Mrp1 Mrp10 Mrp13 Mrp17 Mrp21 Mrp4 Mrp51 Mrps9 Nam9 Pet123 Rsm10 Rsm19 Rsm22 Rsm23 Rsm24 Rsm25 Rsm26 Rsm27 Trf4 Ubp10 YDR036C YGR150C YMR158W YMR188C YNL306W YOR205C YPL013C
• 3'-5' exoribonuclease activity: Atp11 Caf130 Caf40 Ccr4 Cdc36 Cdc39 Fas2 Not3 Not5 Pop2 Sig1 YDR214W
• ubiquitin-protein ligase & protein binding: Apc1 Apc2 Cdc16 Cdc23 Cdc27 Doc1
• signal recognition particle: Sec65 Srp14 Srp21 Srp54 Srp68 Srp72
• 3'-5' exonuclease activity: Csl4 Mtr3 Rrp42 Rrp43 Rrp45 Rrp6 Ski6 Ski7
• cleavage and polyadenylylation: Cft2 Fip1 Pap1 Pfs2 Pta1 Ref2 Rna14 YGR156W Ysh1
• RNA binding: Lsm1 Lsm2 Lsm5 Lsm6 Lsm7 Pat1 Prp24 Prp38 Snu23
• Constituents of ribosome: Apl1 Apl3 Apl5 Apl6 Apm3 Apm4 Aps2 Aps3
Subunits of SRP (signal recognition particle) Complex
Clique: Srp19, Srp14, Srp21, Srp54, Srp68, Srp72
The clique also includes a yeast protein SRP21, which is not found in mammalian SRP; forms a pre-SRP structure in the nucleolus that is translocated to the cytoplasm
(Halic et al, 2004)
Signal Recognition Particle (SRP) helps proteins pass through the ER membrane
Network to transport proteins and lipids
(Fig 27-33, Lehninger; figure label: ribosome)
Outline
• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
• Results
Cliques in a bipartite graph
• Finding a complete block in the adjacency matrix
• Similar to bi-clustering, widely used in bioinformatics
• Example: gene expression profiles. A gene is relevant only for a certain subset of cellular processes, not all processes.
• Two types of maximal bi-cliques:
Maximum node bicliques: max |R|+|C| (perimeter)
Maximum edge bicliques: max |R|*|C| (area)
DNA Gene expression
Effects of feature selection: select 900 genes out of 4025 genes
(Figure: two expression heat maps, axes: genes x tissue samples)
Lymphoma Cancer (Alizadeh et al, 2000)
Generalized Motzkin-Straus Theorem for maximal edge biclique
Given a bipartite graph with adjacency matrix B, compute the maximal edge biclique.
Generalized Motzkin-Straus Theorem:
$\max_{x \ge 0,\, y \ge 0} x^T B y$ s.t. $x_1^\beta + \cdots + x_n^\beta = 1$, $y_1^\beta + \cdots + y_m^\beta = 1$, $\beta \ge 1$
$x = (x_1, \ldots, x_n)^T$: vector on row nodes of the graph
$y = (y_1, \ldots, y_m)^T$: vector on column nodes of the graph
Non-zero entries define the biclique
(Ding, Zhang, Holbrook, 2006)
Algorithm for computing bicliques
Solving the constrained quadratic programming problem
$\max_{x \ge 0,\, y \ge 0} x^T B y$ s.t. $x_1^\beta + \cdots + x_n^\beta = 1$, $y_1^\beta + \cdots + y_m^\beta = 1$, $\beta \ge 1$
Update:
$x_i \leftarrow \left[ \frac{x_i (By)_i}{x^T B y} \right]^{1/\beta}$
$y_j \leftarrow \left[ \frac{y_j (B^T x)_j}{x^T B y} \right]^{1/\beta}$
Theorem 1. At convergence, the solution satisfies the KKT condition.
Theorem 2. L monotonically increases under the update. The algorithm converges.
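A minimal sketch of the biclique iteration, under assumptions that go beyond the slide: the same exponent beta is used for both vectors, x and y are updated alternately (x first, then y with the new x), and the biclique is read off with a relative threshold. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def biclique_by_generalized_ms(B, beta=1.05, n_iter=1000, tol=1e-6):
    """Alternate multiplicative updates on x (rows) and y (columns)."""
    B = np.asarray(B, dtype=float)
    x = np.ones(B.shape[0])
    y = np.ones(B.shape[1])
    for _ in range(n_iter):
        By = B @ y
        x = (x * By / (x @ By)) ** (1.0 / beta)     # keeps sum_i x_i^beta = 1
        Btx = B.T @ x
        y = (y * Btx / (y @ Btx)) ** (1.0 / beta)   # keeps sum_j y_j^beta = 1
    rows = np.flatnonzero(x > tol * x.max())
    cols = np.flatnonzero(y > tol * y.max())
    return rows, cols

# Hypothetical toy matrix: a complete 3 x 4 block plus one sparse extra row and column.
B = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
])
rows, cols = biclique_by_generalized_ms(B)
print(rows, cols)
```

No rows or columns are permuted explicitly; the selection emerges from which entries of x and y stay large.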
A New Upper Bound on the size of the maximum-edge biclique
Using the generalized Motzkin-Straus theorem, derive an upper bound in terms of the largest singular value of B
(Ding, Li, Jordan, 2007)
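One bound of this type can be sketched as follows. This is an assumption about the general shape of such a result, and the exact bound derived in the cited work may differ. For a biclique with row set R and column set C, take the L2-normalized indicator vectors of R and C; since $x^T B y \le \sigma_1(B)$ for any unit vectors x and y, where $\sigma_1(B)$ is the largest singular value of B:

```latex
x = \frac{\mathbf{1}_R}{\sqrt{|R|}}, \quad y = \frac{\mathbf{1}_C}{\sqrt{|C|}}
\;\Rightarrow\;
x^T B y = \frac{|R|\,|C|}{\sqrt{|R|\,|C|}} = \sqrt{|R|\,|C|} \le \sigma_1(B)
\;\Rightarrow\;
|R|\,|C| \le \sigma_1(B)^2
```

So the number of edges in any biclique, and in particular in the maximum-edge biclique, is at most $\sigma_1(B)^2$.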
Biclique Example
Solution vector y
Solution vector x
Biclique Example
1 x 12 biclique: node size 1+12 = 13, edge size 1x12 = 12
3 x 8 biclique: node size 3+8 = 11, edge size 3x8 = 24
There are 6 maximal overlapping bicliques. The algorithm correctly picks up the maximum-edge biclique.
Biclique/biclustering Algorithms
• All existing algorithms explicitly permute (select) rows and columns in delicate ways
• In our approach, this is done automatically by the vectors x and y
• Our algorithm is far superior to other existing approaches.
Outline
• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
– Protein – protein-domain interactions
• Results
Domain-based Approach
Proteins are built from domains
Domains: independent units (function, evolution, folding)
Protein-protein interactions are mediated by domain-domain interactions
Domain-domain interaction (figure): p1 = (d1, d2, d3); p2 = (d4, d5); p3 = (d5, d3); p4 = (d2, d4)
Results on Protein Domains – Protein interactions
MIPS Yeast Genome Database. Bipartite graph B: rows: protein domains; columns: protein complexes (permanent assemblies)
Discovered biclique:
Complexes:
Cytoplasmic ribosomal large subunit (500.40.10)
Mitochondrial ribosomal large subunit (500.60.10)
Domains:
Ribosomal_L14; Ribosoml_l1; Ribosomal_L1; Ribosomal_L5_C; Ribosomal_L23; Ribosomal_L11_N; Ribosomal_L11; KOW; Ribosomal_L6; Ribosomal_L2; Ribosomal_L2_C; Ribosomal_L6; L15; Ribosomal_L3; Ribosomal_L13
These two protein complexes share no common proteins, but they share 15 common protein domains.
Importance of working with protein domains
Clathrin-associated Protein Adaptor complex (AP, AP2)
Adaptin_N (green, silver)
Adap_comp_sub (blue)
Clat_adaptor_s (red)
AP for Cholesterol Transportation
LDL: low-density lipoprotein
Regulating the formation of transport vesicles as well as cargo selection, between organelles of the post-Golgi network, namely, the trans-Golgi network.
(Fig 21-42, Lehninger)
Outline
• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
• Results
– Pyrococcus, Halobacterium, Sulfolobus
Prolinks Database
A collection of predicted functional linkages between proteins.
Genomic data:
a) Gene order
b) Gene fusion (Rosetta Stone)
c) Phylogenetic profiles
d) Operon structure
Organisms:
Pyrococcus furiosus (2245 proteins, 11220 interactions)
Halobacterium NRC-1 (1962 proteins, 12056 interactions)
Sulfolobus solfataricus (2432 proteins, 11368 interactions)
Cliques detected on all 3 organisms
Sulfolobus, Pyrococcus
Conserved pathway system: Oxidoreductase
Predicted protein complexes from cliques (Prolinks); orthologs
Experiment verified the computed protein complex (Mike Adams' group at UGA)
Conserved across 3 organisms
Oligopeptide ABC Transporter Complex
Conserved Complex:
Sulfolobus
Pyrococcus
Halobacterium
Conserved Complex
Sulfolobus
Pyrococcus
Halobacterium
Proteins in overlapping clique
Histidine – Histamine (C5H9N3)
Human tissue histamine is stored in mast cells, in a person's nose, mouth, feet
• Histamine disorder
• Histapenia (histamine low):
– hyperactivity
– schizophrenia (mental disorder)
– allergy (canker sores)
– low sexual response
• Histadelia (histamine high)
Summary
Protein interactions raise a rich variety of questions, most of them not well understood at present.
A bipartite graph representation captures many essential features of protein interactions.
Two significant graph algorithms are developed to find protein modules: spectral clustering, and clique and biclique finding.
A large number of results are obtained on yeast and archaea, and their biological meaning is identified.
Other Significant Work in Bioinformatics
• Protein 3D Structure Prediction. Extract numerical features from sequence and predict the 3D fold using a support vector machine. Even for proteins with less than 25% sequence similarity, we get 50% accuracy. (This 2001 paper was cited 179 times according to Google Scholar.)
• Minimum Redundancy Maximum Relevance Feature Selection. We proposed a feature/gene selection method that minimizes redundancy and increases the representativeness of the feature set. This significantly improves generalization ability and therefore prediction accuracy and stability. (This 2003 paper is cited 31 times.)
• PSoL: Positive Samples Only Learning. In most bioinformatics prediction problems, such as predicting a function (e.g., binding to a metal site or not), there are no true negative examples. The problem is to predict the positive examples embedded in a large set of unlabeled examples. We developed an SVM-based algorithm for novel functional RNA gene prediction. (Bioinformatics 2006).
Research Goal
• Establish a nationally recognized research program on protein interaction networks, integrating multiple types of data, data mining, graph algorithms, complex network theory, etc.