
1C. Ding, Protein Interaction Module Detection using Graph Algorithms

Protein Interaction Module Detection

using Matrix-Based Graph Algorithms

Chris Ding Lawrence Berkeley National Laboratory


Bioinformatics & Computational Biology Computational genomics:

Molecular biology at genomic level

Genomics Research

• More than 100 genomes' DNA sequenced
• DNA microarray chip technology
• Protein – protein interaction technology
• Gene knock-out for gene regulatory network
• Many high-throughput technologies
• Bio-imaging (embryo imaging, EM)
• Huge number of databases
  - GenBank, Protein Data Bank, SCOP, Pfam
• Gene Ontology

A Genomics Research Trend

• A large number of genomes have been sequenced.
• Traditional approach: predict genes, predict proteins, predict structures, predict functions
• This structural genomics is inadequate
• Protein interactions: a new approach

Protein – Protein Interactions

Proteins carry out tasks together with other proteins
• 83% of proteins interact with others
• Proteins interact in promoters
• Multi-protein complexes (assemblies)
• Synergistic interactions
• Complex – complex cross-talk

Proteins work in a modular fashion
• Gene regulation
• Biological pathways (most drugs block certain pathways)

Major goal of research: detect protein modules

DOE Genome to Life

Protein Interactions

Antibody – antigen binding

Protein interaction experiments

• Two-hybrid assay
  - Protein coordination in promoter region
  - Binary interactions
  - Captures transient and unstable interactions
• Mass spectrometry
  - TAP-MS: tandem affinity purification
  - HMS-PCI: high-throughput protein interaction identification
  - Use bait proteins
  - Captures multi-protein complexes
• Problems:
  - Results do not agree
  - Lots of noise

Tandem-Affinity Purification with Mass-Spectrometry (TAP-MS) determines constituents of multi-protein complexes.

Gavin, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002;415(6868):141-147.

More reliable technology (Deng, et al)

Many baits are simultaneously processed to obtain many complexes

Protein Interaction Experiments

Different experiments don't agree: small overlap. Number of interactions shared between data sets (diagonal: size of each data set):

                           Ito et al   Uetz et al   Gavin et al   Ho et al
Two-hybrid, Ito et al         4363        186           54           63
Two-hybrid, Uetz et al         186       1403           54           56
Mass Spec, Gavin et al          54         54         3222          198
Mass Spec, Ho et al             63         56          198         3596

Salwinski and Eisenberg, 2003
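To make "small overlap" concrete, the counts in the table can be turned into fractions. The sketch below only re-uses the numbers quoted on the slide; the helper name `shared_fraction` is hypothetical.

```python
# Overlap counts copied from the slide (diagonal = size of each data set).
names = ["Ito", "Uetz", "Gavin", "Ho"]
overlap = [
    [4363, 186, 54, 63],
    [186, 1403, 54, 56],
    [54, 54, 3222, 198],
    [63, 56, 198, 3596],
]

def shared_fraction(i, j):
    """Fraction of data set i's interactions that also appear in data set j."""
    return overlap[i][j] / overlap[i][i]

for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i != j:
            print(f"{a} interactions also found in {b}: {shared_fraction(i, j):.1%}")
```

Even the best-agreeing pair (Uetz interactions also found by Ito) shares well under a fifth of its interactions.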


Protein Interactions

A genome has about 5000 proteins. Each interacts with ~5 others.
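A quick back-of-the-envelope reading of these numbers, assuming interactions are undirected so that each one is counted from both ends (this assumption is not stated on the slide):

```latex
\[
\text{interactions} \approx \frac{5000 \times 5}{2} = 12{,}500,
\qquad
\text{possible pairs} = \binom{5000}{2} = 12{,}497{,}500,
\qquad
\text{density} \approx \frac{12{,}500}{12{,}497{,}500} \approx 0.1\%
\]
```

Under that assumption the interaction graph is very sparse, which is why dense subgraphs stand out as candidate modules.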

Outline

• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
• Results

Bipartite Graph Model

Protein complex: p-nodes: proteins; c-nodes: protein complexes

Protein domain: p-nodes: proteins; c-nodes: protein domains

Unified Representation of Protein Complex Data

Input: protein complex data as a bipartite matrix B (proteins x complexes)

protein – protein network:

\[ (B B^T)_{ij} = \#\text{ of protein complexes containing both proteins } p_i, p_j \]

protein complex – protein complex network:

\[ (B^T B)_{ij} = \#\text{ of proteins shared by protein complexes } c_i, c_j \]

(Ding, He, Meraz, Holbrook, Proteins, 2004)
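The two derived networks are plain matrix products. The sketch below uses a made-up 4-protein, 3-complex membership matrix (the data and variable names are illustrative assumptions, not taken from the talk).

```python
import numpy as np

# Hypothetical toy data: rows = 4 proteins, columns = 3 complexes.
# B[p, c] = 1 if protein p is a member of complex c.
B = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
])

# (i, j): number of complexes that contain both protein i and protein j.
protein_net = B @ B.T
# (i, j): number of proteins shared by complex i and complex j.
complex_net = B.T @ B

print(protein_net)
print(complex_net)
```

The diagonals carry the node sizes: `protein_net[i, i]` is the number of complexes protein i belongs to, and `complex_net[j, j]` is the number of proteins in complex j.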

Bridged Bipartite Graph: Pfam domains match SCOP domain

Two bipartite graphs with matrices A and B.

Matching:

\[ C(s_i, f_j) = \sum_k A_{ik} B_{jk} = (A B^T)_{ij} \]

Co-location of domains:

\[ R(s_i, s_j) = \sum_k A_{ik} A_{jk} = (A A^T)_{ij} \]

\[ R(f_i, f_j) = \sum_k B_{ik} B_{jk} = (B B^T)_{ij} \]

Reaches 90% accuracy compared to direct match

(Zhang, Chandonia, Ding, Holbrook, BMC Bioinformatics 2004)
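A minimal sketch of the matching score, under an assumption the transcript does not spell out: rows of A are the s-nodes, rows of B are the f-nodes, and the shared index k runs over a common set of "bridge" nodes that both graphs attach to. The toy matrices are invented for illustration.

```python
import numpy as np

# Assumed layout: 2 s-nodes x 4 bridge nodes, 3 f-nodes x 4 bridge nodes.
A = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])
B = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

C = A @ B.T      # C[i, j] = number of bridge nodes shared by s_i and f_j
R_s = A @ A.T    # co-location among s-nodes
R_f = B @ B.T    # co-location among f-nodes

best_match = C.argmax(axis=1)   # highest-scoring f-node for each s-node
print(C, best_match)
```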


Protein Interaction Module: densely connected subgraphs

Protein Interaction Modules

Find highly connected regions:
• Graph clustering
• Cliques
• Bi-cliques

Outline

• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques

Spectral Clustering: MinMaxCut

• Maximize within-cluster similarities (weights): $\mathrm{sim}(A,A) = \sum_{i \in A} \sum_{j \in A} w_{ij}$
• Minimize between-cluster similarities (weights): $\mathrm{sim}(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$

(Ding, RECOMB’02)

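Spectral Clustering Method (MinMaxCut)

Minimize similarity between A and B: $\mathrm{sim}(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$

Maximize similarity within A and B: $\mathrm{sim}(A,A) = \sum_{i \in A} \sum_{j \in A} w_{ij}$

\[ \min_{A,B} J_{MMC}(A,B) = \frac{\mathrm{sim}(A,B)}{\mathrm{sim}(A,A)} + \frac{\mathrm{sim}(A,B)}{\mathrm{sim}(B,B)} \]

Cluster membership indicator:

\[ q(i) = \begin{cases} \sqrt{n_2 / (n_1 n)} & \text{if } i \in A \\ -\sqrt{n_1 / (n_2 n)} & \text{if } i \in B \end{cases} \]

Minimizing $J_{MMC}$ leads to

\[ \min_q J(q) = \min_q \frac{q^T (D - W) q}{q^T D q} \]

Solution given by eigenvector: $(D - W) q = \lambda D q$

Cluster assignment: $A = \{ i \mid q_1(i) \ge 0 \}$, $B = \{ i \mid q_1(i) < 0 \}$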
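The 2-way split can be sketched in a few lines of NumPy. This is an illustrative sketch, not the author's code: it assumes D is the diagonal degree matrix with $d_i = \sum_j w_{ij}$, takes the eigenvector of the smallest non-zero eigenvalue of $(D - W) q = \lambda D q$ as $q_1$, and solves the generalized problem through the equivalent symmetric matrix $I - D^{-1/2} W D^{-1/2}$. The function names and the toy graph are hypothetical.

```python
import numpy as np

def minmaxcut_bipartition(W):
    """Split nodes by the sign of the first non-trivial eigenvector of (D - W) q = lambda D q."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                        # node degrees (assumed all positive)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric form: I - D^{-1/2} W D^{-1/2} has the same eigenvalues.
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)          # eigenvalues in ascending order
    q = d_inv_sqrt * vecs[:, 1]              # map back: q = D^{-1/2} z
    return np.flatnonzero(q >= 0), np.flatnonzero(q < 0), q

def mmc_objective(W, A, B):
    """J_MMC = sim(A,B)/sim(A,A) + sim(A,B)/sim(B,B)."""
    cut = W[np.ix_(A, B)].sum()
    return cut / W[np.ix_(A, A)].sum() + cut / W[np.ix_(B, B)].sum()

# Toy graph: two triangles joined by one weak edge (weight 0.1).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

A, B, q = minmaxcut_bipartition(W)
print(A, B, mmc_objective(W, A, B))
```

The sign of an eigenvector is arbitrary, so which triangle ends up labelled A can differ between runs or libraries; only the partition itself is meaningful.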


Graph clustering examples

An NP-hard, intractable combinatorial optimization problem can be effectively solved by a simple eigenvector!

Spectral Clustering

• 2-way clustering
• K-way clustering
  - Recursive 2-way clustering
  - K-way relaxation (K eigenvectors)
• Cluster self-aggregation and perturbation analysis
• Characteristics
  - Principled approach
  - Clear and well-motivated clustering objective functions
  - Everything is proved rigorously
  - Based on well-established matrix/algebra theory
  - A rich framework (clustering, ordering, ranking, etc.)
• State-of-the-art algorithm

Recursive MinMaxCut Clustering of Lymphoma Tissues

B-cell lymphoma goes through different stages:

3 normal stages

3 cancer stages

Key question: can we detect them automatically?

(Alizadeh et al, 2000)

(Ding, RECOMB’02)


Gene expression of lymphoma (Stanford)

(Ding, RECOMB’02)

Spectral Clustering

• 2-way clustering
• K-way clustering
  - Recursive 2-way clustering
  - K-way relaxation (K eigenvectors) (principled)
• Cluster self-aggregation and perturbation analysis
• Characteristics
  - Principled approach
  - Clear and well-motivated clustering objective functions
  - Everything is proved rigorously
  - Based on well-established matrix/algebra theory
  - A rich framework (clustering, ordering, ranking, etc.)
• State-of-the-art algorithm

Outline

• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
  - Application to computing protein interaction modules
• Cliques
• Bi-cliques
• Results

Clustering Protein Complex Graph

Input: protein complex data as a bipartite matrix B (proteins x complexes)

protein – protein network:

\[ (B B^T)_{ij} = \#\text{ of protein complexes containing both proteins } p_i, p_j \]

protein complex – protein complex network:

\[ (B^T B)_{ij} = \#\text{ of proteins shared by protein complexes } c_i, c_j \]

(Ding, He, Meraz, Holbrook, Proteins, 2004)


Computed Protein Clusters

Experimental Protein Complex / Computed

Implications of discovered protein clusters on protein interactions: F-statistics

F-statistics of amino acids and physical properties across all protein clusters (statistical significance):

Amino acids:
Lys 100   Asn 56   Val 30   Ile 24
Asp  89   Gln 50   Tyr 29   Ser 23
Arg  73   Cys 39   Met 29   Leu 22
Pro  70   His 33   Trp 28   Gly 21
Glu  66   Ala 31   Thr 28   Phe 21

Physical properties:
pI 169   Basic 149   Acidic 97   MW 60   Aromatic 30   Helix 37   Beta-Sheet 33   Coil 27

\[ F = \frac{\frac{1}{K-1} \sum_{k=1}^{K} n_k (\bar{f}_k - \bar{f})^2}{\frac{1}{n-K} \sum_{k=1}^{K} (n_k - 1) \sigma_k^2} \]

Lys, Arg, Asp are most significant => electrostatic forces are the dominant surface factors influencing protein interactions.

Surprise: secondary structure is not an important factor in protein module formation.
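For readers who want to reproduce this kind of score, the sketch below computes a standard one-way (ANOVA-style) F-statistic of a per-protein feature across clusters. It is an assumption that the talk used exactly this form; the reading of the symbols (K clusters, n proteins in total, cluster sizes, cluster means and variances) and the toy data are illustrative.

```python
import numpy as np

def f_statistic(values, labels):
    """Between-cluster variance over within-cluster variance for one feature."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == k] for k in np.unique(labels)]
    n, K = len(values), len(groups)
    grand_mean = values.mean()
    between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (K - 1)
    # ddof=1 gives the unbiased per-cluster variance; clusters need >= 2 members.
    within = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (n - K)
    return between / within

# Hypothetical feature (e.g. fraction of some residue) for 6 proteins in 2 clusters.
F = f_statistic([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(F)
```

A large F means the feature differs between clusters far more than it varies inside them, which is how the slide ranks residues and physical properties.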


Protein Secondary Structure

• Alpha helix

• Beta sheet

• Coil regions

Outline

• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
• Results

Protein Interaction Modules

• Find highly connected regions
• Clique: every node connects to everyone else
• k-core: subgraph in which every node has degree at least k (e.g. 3-core: every node connects to at least 3 others)
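The difference between the two notions is easy to check on a small graph. The following is a minimal sketch using the usual peeling procedure for k-cores (repeatedly drop nodes with fewer than k remaining neighbours); the function names and the example graph are hypothetical.

```python
def k_core(adj, k):
    """Nodes of the k-core: every remaining node has at least k remaining neighbours."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if sum(1 for u in adj[v] if u in alive) < k:
                alive.discard(v)
                changed = True
    return alive

def is_clique(adj, nodes):
    """True if every pair of distinct nodes in `nodes` is connected."""
    return all(u in adj[v] for v in nodes for u in nodes if u != v)

# Hypothetical graph: a 4-clique a-b-c-d with a tail d-e-f.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}

core3 = k_core(adj, 3)
print(sorted(core3), is_clique(adj, core3))
```

Here the 3-core happens to be a clique, but in general a k-core is a weaker requirement: it only bounds the degree, not full pairwise connectivity.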

Motzkin-Straus Formalism for computing maximal cliques

Clique computing is NP-hard. Even approximating the maximum clique is hard.

Motzkin-Straus Theorem.

\[ \max_{x \ge 0} x^T A x \quad \text{s.t.} \quad x_1 + \cdots + x_n = 1, \qquad A_{ii} = 0 \]

$x = (x_1, \ldots, x_n)^T$: vector on all nodes of the graph

• The L1 constraint enforces sparsity
• Non-zero entries define the clique

Clique: every node connects to everyone else
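The classical statement of the Motzkin-Straus theorem, which the slide leaves implicit, is that the maximum of this program equals $1 - 1/\omega$, where $\omega$ is the size of the largest clique, and that the uniform vector on a largest clique attains it. The sketch below checks this numerically on a made-up graph; the graph and the random sampling are illustrative only.

```python
import numpy as np

# Hypothetical graph: a 4-clique {0,1,2,3} plus a tail 3-4-5.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0          # symmetric adjacency, A_ii = 0

x = np.zeros(n)
x[:4] = 1.0 / 4                      # uniform weight on the 4-clique
clique_value = x @ A @ x             # should equal 1 - 1/4

# Random feasible points (x >= 0, sum x = 1) never exceed that value.
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(n), size=1000)
sample_values = np.einsum("si,ij,sj->s", samples, A, samples)

print(clique_value, sample_values.max())
```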

Generalized Motzkin-Straus Formalism

\[ \max_{x \ge 0} x^T A x \quad \text{s.t.} \quad x_1^\beta + \cdots + x_n^\beta = 1,\ \beta \ge 1, \qquad A_{ii} = 1 \]

$x = (x_1, \ldots, x_n)^T$: vector on all nodes of the graph

• L1 enforces sparsity
• Non-zero entries define the clique

Setting $\beta = 1.05$ we can compute the maximum clique better than the standard approach $\beta = 1.0$.

(Ding, Zhang, Holbrook, 2006)

Algorithm for computing clique

Solving the constrained quadratic programming problem

\[ \max_{x \ge 0} x^T A x \quad \text{s.t.} \quad x_1^\beta + \cdots + x_n^\beta = 1,\ \beta \ge 1 \]

Initialize: $x = (1, \ldots, 1)^T$

Update:

\[ x_i \leftarrow \left[ \frac{x_i (Ax)_i}{x^T A x} \right]^{1/\beta} \]

Theorem 1. Correctness: the solution converges to a local maximum.

Theorem 2. Convergence: the iterative algorithm converges.
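A minimal NumPy sketch of this iteration follows. It assumes the update has the form $x_i \leftarrow [x_i (Ax)_i / x^T A x]^{1/\beta}$ with $A_{ii} = 1$, which keeps $\sum_i x_i^\beta = 1$ after every step. With $\beta > 1$ the off-clique entries shrink to tiny values rather than exact zeros, so the sketch reads off the clique with a relative threshold; the threshold `tol`, the iteration count, and the example graph are illustrative choices, not values from the talk.

```python
import numpy as np

def find_clique(adj, beta=1.05, n_iter=1000, tol=1e-3):
    """Multiplicative update for max x^T A x subject to sum_i x_i^beta = 1, x >= 0."""
    A = np.array(adj, dtype=float)
    np.fill_diagonal(A, 1.0)                 # generalized formalism: A_ii = 1
    x = np.ones(A.shape[0])                  # initialize x = (1, ..., 1)^T
    for _ in range(n_iter):
        Ax = A @ x
        x = (x * Ax / (x @ Ax)) ** (1.0 / beta)
    support = np.flatnonzero(x > tol * x.max())   # "non-zero" entries
    return support, x

# Hypothetical graph: a 4-clique {0,1,2,3} plus a tail 3-4-5.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

clique, x = find_clique(adj)
print(clique, np.round(x, 4))
```

Like any local method, the iteration finds a local maximum; which clique it settles on can depend on the starting vector.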

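Proof of Correctness

Constrained Optimization Theory

Introduce the Lagrangian function:

\[ L = x^T A x - \lambda (x_1^\beta + \cdots + x_n^\beta - 1) \]

KKT optimality condition (complementary slackness):

\[ [\, 2 (Ax)_i - \lambda \beta x_i^{\beta - 1} \,]\, x_i = 0 \]

Lagrangian multiplier value:

\[ \lambda = 2 (x^T A x) / \beta \]

Update rule:

\[ x_i \leftarrow \left[ \frac{x_i (Ax)_i}{\lambda \beta / 2} \right]^{1/\beta} \]

At convergence, the solution $x^*$ satisfies the KKT condition:

\[ x_i^* = \left[ \frac{x_i^* (Ax^*)_i}{\lambda \beta / 2} \right]^{1/\beta} \]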

Proof of Convergence

Using Auxiliary Function (from Machine Learning)

G(x,x') is an auxiliary function of L(x) if

\[ G(x, x') \le L(x), \qquad G(x, x) = L(x) \]

We maximize a lower bound. Set

\[ x^{(t+1)} = \arg\max_x G(x, x^{(t)}) \]

Then

\[ L(x^{(t)}) = G(x^{(t)}, x^{(t)}) \le G(x^{(t+1)}, x^{(t)}) \le L(x^{(t+1)}) \]

so that

\[ L(x^{(1)}) \le L(x^{(2)}) \le L(x^{(3)}) \le \cdots \]

L(x) is monotonically increasing and bounded from above. Thus the algorithm converges.

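Proof of Convergence (cont)

Key: (1) find an auxiliary function, (2) find its global maximum

The auxiliary function is

\[ G(x, x') = \sum_{i,j} x'_i A_{ij} x'_j \left( 1 + \log \frac{x_i x_j}{x'_i x'_j} \right) - \lambda \Big( \sum_i x_i^\beta - 1 \Big) \]

First order derivative:

\[ \frac{\partial G(x, x')}{\partial x_i} = \frac{2 x'_i (Ax')_i}{x_i} - \lambda \beta x_i^{\beta - 1} \]

2nd order derivative:

\[ \frac{\partial^2 G(x, x')}{\partial x_i \partial x_j} = -\left[ \frac{2 x'_i (Ax')_i}{x_i^2} + \lambda \beta (\beta - 1) x_i^{\beta - 2} \right] \delta_{ij} \]

is negative definite. Thus G(x,x') is concave in x, and the global maximum is easily obtained.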
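For completeness, here is how the update rule follows from setting the first derivative of the auxiliary function to zero. This is a reconstruction under the assumption that the constraint is $\sum_i x_i^\beta = 1$ and that A is symmetric and non-negative; it is a sketch of the missing step, not a transcription of the slide.

```latex
\[
\frac{\partial G(x, x')}{\partial x_i} = 0
\;\Longrightarrow\;
x_i^{\beta} = \frac{2\, x'_i (A x')_i}{\lambda \beta}.
\]
Summing over $i$ and imposing $\sum_i x_i^{\beta} = 1$ fixes the multiplier:
\[
1 = \frac{2\, x'^{T} A x'}{\lambda \beta}
\;\Longrightarrow\;
\frac{\lambda \beta}{2} = x'^{T} A x',
\qquad
x_i = \left[ \frac{x'_i (A x')_i}{x'^{T} A x'} \right]^{1/\beta}.
\]
```

With $x' = x^{(t)}$ and $x = x^{(t+1)}$ this is the multiplicative update, and every iterate satisfies the constraint by construction.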

Partial list of Discovered Cliques

• Emg1 Imp3 Imp4 Kre31 Mpp10 Nop14 Sof1 YMR093W YPR144C: snoRNA binding
• Cus1 Msl1 Prp3 Prp9 Sme1 Smx2 Smx3 Yhc1 YJR084W: RNA binding
• Fyv4 Mrp1 Mrp10 Mrp13 Mrp17 Mrp21 Mrp4 Mrp51 Mrps9 Nam9 Pet123 Rsm10 Rsm19 Rsm22 Rsm23 Rsm24 Rsm25 Rsm26 Rsm27 Trf4 Ubp10 YDR036C YGR150C YMR158W YMR188C YNL306W YOR205C YPL013C: structural constituent of ribosome
• Atp11 Caf130 Caf40 Ccr4 Cdc36 Cdc39 Fas2 Not3 Not5 Pop2 Sig1 YDR214W: 3'-5'-exoribonuclease activity
• Apc1 Apc2 Cdc16 Cdc23 Cdc27 Doc1: ubiquitin-protein ligase & protein binding
• Sec65 Srp14 Srp21 Srp54 Srp68 Srp72: signal recognition particle
• Csl4 Mtr3 Rrp42 Rrp43 Rrp45 Rrp6 Ski6 Ski7: 3'-5' exonuclease activity
• Cft2 Fip1 Pap1 Pfs2 Pta1 Ref2 Rna14 YGR156W Ysh1: cleavage and polyadenylylation
• Lsm1 Lsm2 Lsm5 Lsm6 Lsm7 Pat1 Prp24 Prp38 Snu23: RNA binding
• Apl1 Apl3 Apl5 Apl6 Apm3 Apm4 Aps2 Aps3: Constituents of ribosome

Subunits of SRP (signal recognition particle) Complex

Clique: Srp19, Srp14, Srp21, Srp54, Srp68, Srp72

The clique also includes a yeast protein, SRP21, which is not found in mammalian SRP and forms a pre-SRP structure in the nucleolus that is translocated to the cytoplasm.

(Halic et al, 2004)

Signal Recognition Particle (SRP) helps proteins pass through the ER membrane

(Fig 27-33, Lehninger)

Network to transport proteins and lipids

ribosome

Outline

• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
• Results

Cliques in a bipartite graph

• Finding a complete block in the adjacency matrix
• Similar to bi-clustering, widely used in bioinformatics
• Example. Gene expression profiles: a gene is relevant only for a certain subset of cellular processes, not all processes.
• Two types of maximal bi-cliques:
  - Maximum Node Bicliques: max |R|+|C| (perimeter)
  - Maximum Edge Bicliques: max |R|*|C| (area)


Bicliques in a 2D Dataset

DNA Gene expression

Effects of feature selection: select 900 genes out of 4025 genes

[Figure: two gene expression panels, axes labeled Genes and Tissue sample]

Lymphoma Cancer (Alizadeh et al, 2000)

Generalized Motzkin-Straus Theorem for maximal edge biclique

Given a bipartite graph with adjacency matrix B, compute the maximal edge bi-clique.

Generalized Motzkin-Straus Theorem.

\[ \max_{x \ge 0,\ y \ge 0} x^T B y \quad \text{s.t.} \quad x_1^\alpha + \cdots + x_n^\alpha = 1,\ \alpha \ge 1, \qquad y_1^\beta + \cdots + y_m^\beta = 1,\ \beta \ge 1 \]

$x = (x_1, \ldots, x_n)^T$: vector on row nodes of the graph

$y = (y_1, \ldots, y_m)^T$: vector on column nodes of the graph

Non-zero entries define the biclique

(Ding, Zhang, Holbrook, 2006)

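Algorithm for computing bicliques

Solving the constrained quadratic programming problem

\[ \max_{x \ge 0,\ y \ge 0} x^T B y \quad \text{s.t.} \quad x_1^\alpha + \cdots + x_n^\alpha = 1,\ \alpha \ge 1, \qquad y_1^\beta + \cdots + y_m^\beta = 1,\ \beta \ge 1 \]

Update:

\[ x_i \leftarrow \left[ \frac{x_i (By)_i}{x^T B y} \right]^{1/\alpha}, \qquad y_j \leftarrow \left[ \frac{y_j (B^T x)_j}{x^T B y} \right]^{1/\beta} \]

Theorem 1. At convergence, the solution satisfies the KKT condition.

Theorem 2. L monotonically increases under the update. The algorithm converges.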
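The biclique iteration can be sketched the same way as the clique one. The code below is an illustrative sketch, not the author's implementation: it assumes separate exponents for the row and column vectors (here both set to 1.05), updates x and y alternately, and reads off the biclique with a relative threshold `tol`, because with exponents above 1 the off-block entries become tiny rather than exactly zero. The toy matrix contains a planted 3 x 4 block of ones plus a few stray entries.

```python
import numpy as np

def find_biclique(B, alpha=1.05, beta=1.05, n_iter=1000, tol=1e-3):
    """Alternating multiplicative updates for max x^T B y with x, y >= 0."""
    B = np.asarray(B, dtype=float)
    x = np.ones(B.shape[0])
    y = np.ones(B.shape[1])
    for _ in range(n_iter):
        By = B @ y
        x = (x * By / (x @ By)) ** (1.0 / alpha)     # keeps sum_i x_i^alpha = 1
        BTx = B.T @ x
        y = (y * BTx / (y @ BTx)) ** (1.0 / beta)    # keeps sum_j y_j^beta = 1
    rows = np.flatnonzero(x > tol * x.max())
    cols = np.flatnonzero(y > tol * y.max())
    return rows, cols

# Hypothetical 5 x 6 bipartite adjacency matrix with a 3 x 4 complete block.
B = np.array([
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
])

rows, cols = find_biclique(B)
print(rows, cols)
```

No rows or columns are permuted explicitly; the selection emerges from which entries of x and y stay large.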

A New Upper Bound on the size of maximum-edge biclique

Using the generalized Motzkin-Straus theorem, derive

Largest singular value of B

(Ding, Li, Jordan, 2007)
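The formula itself did not survive in this transcript. As a hedged illustration only, one elementary bound that involves the largest singular value $\sigma_1(B)$ (and which may or may not be the bound on the slide) is $|R| \cdot |C| \le \sigma_1(B)^2$: taking x and y as the L2-normalized indicator vectors of the rows R and columns C of a biclique gives $x^T B y = \sqrt{|R| |C|}$, and $x^T B y \le \sigma_1(B)$ for unit vectors. The sketch checks this on a small made-up matrix by brute force.

```python
import numpy as np
from itertools import combinations

# Hypothetical 5 x 6 bipartite adjacency matrix with a 3 x 4 complete block.
B = np.array([
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
], dtype=float)

def max_edge_biclique_bruteforce(B):
    """Largest |R| * |C| over all row subsets R (only feasible for tiny matrices)."""
    n_rows = B.shape[0]
    best = 0
    for r in range(1, n_rows + 1):
        for rows in combinations(range(n_rows), r):
            # Columns that are all ones on the chosen rows.
            n_cols = int(B[list(rows)].all(axis=0).sum())
            best = max(best, r * n_cols)
    return best

sigma1 = np.linalg.svd(B, compute_uv=False)[0]   # largest singular value
best_area = max_edge_biclique_bruteforce(B)
print(best_area, sigma1 ** 2)
```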


Biclique Example

Solution vector y

Solution vector x


Biclique Example


Biclique Example

Max-node biclique: 1+12

Max-edge biclique: 1x12

Max-node biclique: 3+8

Max-edge biclique: 3x8

There are 6 maximal overlapping bicliques. The algorithm correctly picks up the maximum-edge biclique

Biclique/biclustering Algorithms

• All existing algorithms explicitly permute (select) rows and columns in delicate ways

• In our approach, this is automatically done by vectors x and y

• Our algorithm is far superior to other existing approaches.

Outline

• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
  - Protein – protein-domain interactions
• Results

Domain-based Approach

• Proteins are built from domains
• Domains: independent units (function, evolution, folding)
• Protein-protein interactions are mediated by domain-domain interactions

[Figure: domain-domain interaction among proteins p1 (d1 d2 d3), p2 (d4 d5), p3 (d5 d3), p4 (d2 d4)]

Results on Protein Domains – Protein interactions

MIPS Yeast Genome Database. Bipartite graph B:
• rows: protein domains
• columns: protein complexes (permanent assemblies)

Discovered biclique:
• Complexes: Cytoplasmic ribosomal large subunit (500.40.10); Mitochondrial ribosomal large subunit (500.60.10)
• Domains: Ribosomal_L14; Ribosoml_l1; Ribosomal_L1; Ribosomal_L5_C; Ribosomal_L23; Ribosomal_L11_N; Ribosomal_L11; KOW; Ribosomal_L6; Ribosomal_L2; Ribosomal_L2_C; Ribosomal_L6; L15; Ribosomal_L3; Ribosomal_L13

These two protein complexes share no common proteins, but they share 15 common protein domains. This shows the importance of working with protein domains.


Clathrin-associated Protein Adaptor complex (AP, AP2)

Adaptin_N (green, silver)

Adap_comp_sub (blue)

Clat_adaptor_s(red)

(Fig 21-42, Lehninger)

AP for Cholesterol Transportation

LDL-- low-density lipoprotein

Regulating the formation of transport vesicles as well as cargo selection, between organelles of the post-Golgi network, namely, the trans-Golgi network.

Outline

• Protein interaction
• Interaction Data
• Graph models
• Spectral Clustering
• Cliques
• Bi-cliques
• Results
  - Pyrococcus, Halobacterium, Sulfolobus

Prolinks Database

A collection of predicted functional linkages between proteins.

Genomic data:
a) Gene order
b) Gene fusion (Rosetta Stone)
c) Phylogenetic profiles
d) Operon structure

Organisms:
• Pyrococcus furiosus (2245 proteins, 11220 interactions)
• Halobacterium NRC-1 (1962 proteins, 12056 interactions)
• Sulfolobus solfataricus (2432 proteins, 11368 interactions)


Cliques detected on all 3 organisms

Sulfolobus Pyrococcus

Conserved Pathway System: Oxidoreductase

Experiment verified the computed protein complex

Predicted protein complexes from cliques

Orthologs

Prolinks

Mike Adams' Group at UGA


Conserved across 3 organisms

Oligopeptide ABC Transporter Complex

Conserved Complex:

Sulfolobus

Pyrococcus

Halobacterium


Conserved Complex

Sulfolobus

Pyrococcus

Halobacterium


Proteins in overlapping clique

Histidine – Histamine (C5H9N3)

Human tissue histamine is stored in mast cells, in a person's nose, mouth, and feet.
• Histamine disorder
• Histapenia (histamine low):
  - hyperactivity
  - schizophrenia (mental disorder)
  - allergy (canker sores)
  - low sexual response
• Histadelia (histamine high)

Summary

Protein interactions pose a rich variety of questions, mostly not well understood at present.

A bipartite graph representation captures many essential features of protein interactions.

Two significant graph algorithms are developed to find protein modules: spectral clustering, and clique and biclique finding.

A large number of results are obtained on yeast and archaea, and their biological meaning is identified.

Other Significant Work in Bioinformatics

• Protein 3D Structure Prediction. Extract numerical features from sequence and predict the 3D fold using a support vector machine. Even for proteins with less than 25% sequence similarity, we get 50% accuracy. (This 2001 paper was cited 179 times according to Google Scholar.)

• Minimum Redundancy Maximum Relevance Feature Selection. We proposed a feature/gene selection method that minimizes redundancy and increases the representativeness of the feature set. This significantly improves generalization ability, and therefore prediction accuracy and stability. (This 2003 paper is cited 31 times.)

• PSoL: Positive Samples Only Learning. In most bioinformatics prediction problems, such as predicting a function (e.g., binding to a metal site or not), there are no true negative examples. The problem is to predict the positive examples embedded in a large set of unlabeled examples. We developed an SVM-based algorithm for novel functional RNA gene prediction. (Bioinformatics 2006).

Research Goal

• Establish a nationally recognized research program on protein interaction networks with integration of multiple types of data, data mining, graph algorithms, complex network theory, etc.