
Bioinformatics for Protein Interactions and Biological Networks

Adrian Heilbut

Hogue Lab, Samuel Lunenfeld Research Institute, Mount Sinai Hospital

Department of Biochemistry, University of Toronto

[email protected] http://individual.utoronto.ca/amh

CPI Protein Protein Interactions Tutorial 2004 Montreal


CPI PPI Tutorial 2004 - AMH

Bioinformatics for Protein Interactions and Networks

1. Laboratory Information Management and Relational Databases

2. Experimental Design and Statistics

3. MS/MS Protein Identification

4. Graph models of biological networks

a) Graph theory

b) Statistical mechanics of biological graphs

5. Accessing and Visualizing Data

6. Interaction Prediction & Confidence measures

[Diagram labels: experimentation; data integration and analysis]


Objectives

1. To appreciate the importance of effective laboratory information management systems.

2. To understand the usefulness of relational databases and to be able to use some basic SQL.

3. To think more critically about experimental design issues in large-scale interaction experiments.

4. To be able to integrate and visualize interaction data from public databases to help generate biological hypotheses.

5. To understand graph-theory approaches for local and global characterization and analysis of biological networks.


tools

1. relational databases

2. statistics and machine learning

3. graph theory


1. LIMS & Relational Databases

1. HMS-PCI protocol review

2. Laboratory Information Management Systems

3. Relational databases

4. An Example LIMS database

5. Using SQL

6. Summary


example: HMS-PCI workflow

• sample explosion and tracking

• grounding to biology

• process monitoring and quality control

[Workflow diagram: High-throughput mass spectrometry protein complex identification (HMS-PCI). Steps: Bait selection → Cloning → Transfection & Expression → Immunoprecipitation → Separation → Digestion → LC MS/MS → Protein Identification → Interpretation → Data distribution. Supporting components: LIMS; biological databases (Sequences, Localization, Quantitation, Genetics, Function, Expression, Interactions).]


Laboratory Information Management

• LIMS: laboratory information management system

• Research vs. Manufacturing vs. Clinical labs - very different requirements

• Homebrew vs. commercial

• Development time & resources

• it takes longer than you think...

• Relational databases and SQL


LIMS system architectures

[Architecture diagram labels: SQL database server; client software; application server; web application; web browser; "frustrated scientist"; "SQL-savvy scientist"; instrument software; bioinformatics software; Cytoscape; rest of the biological data in the universe (GenBank, BIND, ...)]


Relational Databases for Biology

• Advantages of using a real database

• forces careful thought about data model

• centralized storage and security

• dissemination of data

• scalability

• ad-hoc queries with SQL

• ACID (Atomicity, Consistency, Isolation, Durability)

• Commercial: DB2, SQL Server, Oracle, Access

• Free: MySQL, PostgreSQL, Firebird

• Excel, XML: great for certain things, but not relational databases


Relational Model - 1970

• Developed to solve problems of data independence - how do you store structured data when the structure of your data is evolving?

• Users should be insulated from internal representations of data

• Everything in a relational database can be viewed in terms of tables and operations on tables that result in new tables

• Firmly grounded in math and logic

Codd, E.F. “A Relational Model of Data for Large Shared Data Banks” Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387


Relational Model

• Data is modeled as “relations”

• Attributes: S1, S2, S3 from a specific domain (fields, each of a specific data type)

• Relations: set of n-tuples (rows), where first element from S1, second from S2, etc.

• Relations can be represented as tables

• All rows are distinct

• Columns are labelled

• Rows are in no particular order


Normalization

• Basic idea:

• Unrelated facts should be stored separately

• makes updating and querying much easier

• First Normal Form: all records have the same number of fields, and every attribute is atomic (single-valued)

• Second Normal Form: 1NF + every nonkey attribute depends on the whole primary key, not just part of it

• Third Normal Form: 2NF + every nonkey attribute is independent of all other nonkey attributes

ref: Kent, W. “A Simple Guide to Five Normal Forms in Relational Database Theory” Communications of the ACM 26(2), Feb 1983, 120-125. available at: http://www.bkent.net/Doc/simple5.htm


Relational Algebra

• Language for manipulating sets of relations

• ∪ Union

• ∩ Intersection

• − Difference

• σ Selection

• π Projection

• × Cartesian product

• ρ Renaming
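The operators above can be tried out directly on small relations. The following is a minimal sketch, assuming relations are modelled as Python sets of tuples with column names kept alongside; the helper names (`select`, `project`, `cartesian`) and all rows except YDL102W/Cdc2 are invented for illustration.

```python
from itertools import product

# A relation is a set of tuples plus a tuple of column names.
# Only YDL102W/Cdc2 comes from the slides; the other rows are invented placeholders.
orfs_cols = ("ORFName", "CommonName")
orfs = {("YDL102W", "Cdc2"), ("ORF_K1", "KinA")}
more_orfs = {("ORF_K1", "KinA"), ("ORF_K2", "KinB")}

# Renaming: the ORF column is called DomORF here so names do not clash after a product.
domains_cols = ("DomORF", "PfamID")
domains = {("ORF_K1", "PF00069")}

def select(rel, cols, pred):
    """Selection: keep the rows for which pred(row_as_dict) is true."""
    return {row for row in rel if pred(dict(zip(cols, row)))}

def project(rel, cols, keep):
    """Projection: keep only the named columns; duplicate rows collapse."""
    idx = [cols.index(c) for c in keep]
    return {tuple(row[i] for i in idx) for row in rel}

def cartesian(r, s):
    """Cartesian product: every row of r paired with every row of s."""
    return {a + b for a, b in product(r, s)}

union = orfs | more_orfs
intersection = orfs & more_orfs
difference = orfs - more_orfs

# An inner join is a selection over a Cartesian product.
joined_cols = orfs_cols + domains_cols
joined = select(cartesian(orfs, domains), joined_cols,
                lambda row: row["ORFName"] == row["DomORF"])
kinase_names = project(joined, joined_cols, ["CommonName"])
print(kinase_names)
```

Because relations are sets, every operation returns another relation, which is what lets the operators be chained freely.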


SQL

• Structured Query Language

• Based on relational algebra

• A (semi)standard way to define a database (DDL), store, and retrieve data

• Easy - just logic

• CREATE TABLE...

• SELECT ... INNER JOIN ... WHERE

• DELETE

• UPDATE


a very simple LIMS...

Constructs: ORF VARCHAR(15), TagPosition CHAR(1), CloneID INT

Pulldowns: PulldownID INT, CloneID INT

Bands: CloneID INT, PullDownID INT, BandID INT, PlateNum INT, WellNo INT

Hits: PlateNum INT, WellNo INT, GI INT, Score INT

GIORF: GI INT, ORF VARCHAR(15)


and a little bit of biological data...

orfs: ORFName VARCHAR(15), CommonName VARCHAR(15), SGDID VARCHAR(15), Description VARCHAR(250)

orfdomains: ORFName VARCHAR(15), PfamID VARCHAR(25), Evalue FLOAT (ie. from running hmmer on all the yeast proteins)

domdesc: PfamID VARCHAR(15), DomDesc VARCHAR(250)

ORFGO: ORFName VARCHAR(15), GOID VARCHAR(20)

localizations: ORFName VARCHAR(15), Loc VARCHAR(20)

ie. from ...


Relational Integrity and Foreign Keys

• Foreign keys allow you to specify that a field in one table must relate to a field in another table

• Database will not allow changes (deletions, updates) that violate integrity of links
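As a rough illustration of both points, the sketch below uses Python's built-in `sqlite3` module with two tables from the example LIMS. The choice of SQLite and the sample rows are assumptions made for the demonstration; other database systems enforce foreign keys without the `PRAGMA` line.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.execute("""CREATE TABLE Constructs (
    CloneID     INTEGER PRIMARY KEY,
    ORF         VARCHAR(15),
    TagPosition CHAR(1))""")
conn.execute("""CREATE TABLE Pulldowns (
    PulldownID INTEGER PRIMARY KEY,
    CloneID    INTEGER NOT NULL REFERENCES Constructs(CloneID))""")

conn.execute("INSERT INTO Constructs VALUES (1, 'YDL102W', 'N')")
conn.execute("INSERT INTO Pulldowns VALUES (10, 1)")  # valid: clone 1 exists

def try_sql(sql):
    """Run a statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

# A pulldown that points at a clone that does not exist
orphan_insert = try_sql("INSERT INTO Pulldowns VALUES (11, 99)")
# Deleting a clone that still has pulldowns linked to it
parent_delete = try_sql("DELETE FROM Constructs WHERE CloneID = 1")
print(orphan_insert)
print(parent_delete)
```

Both statements are refused, so a pulldown can never end up pointing at a construct that is not in the database.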


Indexing & Performance

• Query performance depends on having appropriate indexes, to avoid doing linear searches

• O(N) → O(log N)

• with 10^8 rows, that makes a big difference

• once your data is in a database, easy as CREATE INDEX ON...

• Commercial database management systems do sophisticated cost-based query optimization to execute complicated queries efficiently
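To see the effect of an index, one can ask the database how it plans to run a query before and after `CREATE INDEX`. This is a small sketch using SQLite through Python's `sqlite3` module; the synthetic rows and the index name `idx_orfdomains_pfam` are made up for the example, and the wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orfdomains (ORFName VARCHAR(15), PfamID VARCHAR(25), Evalue FLOAT)")
# Synthetic rows: 10,000 invented ORF names spread over 500 invented Pfam IDs
conn.executemany(
    "INSERT INTO orfdomains VALUES (?, ?, ?)",
    [(f"ORF{i:05d}", f"PF{i % 500:05d}", 1e-5) for i in range(10000)],
)

query = "SELECT ORFName FROM orfdomains WHERE PfamID = 'PF00069'"

def plan(sql):
    """Return SQLite's human-readable query plan (last column of each plan row)."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)   # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_orfdomains_pfam ON orfdomains (PfamID)")
plan_after = plan(query)    # the planner can now search the index instead

print("before:", plan_before)
print("after: ", plan_after)
matches = conn.execute(query).fetchall()
```

For scale, a search over 10^8 sorted entries needs on the order of log2(10^8) ≈ 27 comparisons, against up to 10^8 for a linear scan.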


SQL: Data Definition

• To define a table:

• Give it a name

• Specify the columns, their data types, and references to other tables

• Specify a primary key

• all other columns should depend (only) on primary key

• primary key is unique


SQL example: SELECT

• Find out more about Cdc2

> SELECT * FROM orfs WHERE CommonName = 'Cdc2'

Result (table orfs):

ORFName   CommonName   SGDID      Description
YDL102W   Cdc2         S0002260   Catalytic subunit of DNA polymerase delta; required for chromosomal DNA replication during mitosis ...


SQL: SELECT with INNER JOIN

• Find all proteins that have a protein kinase domain

Tables joined: orfs (ORFName, CommonName, SGDID, Description) and orfdomains (ORFName, PfamID, Evalue)

> SELECT * FROM orfs
  INNER JOIN orfdomains ON orfs.ORFName = orfdomains.ORFName
  WHERE PfamID = 'PF00069' AND Evalue < 0.001

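The table definition and the protein kinase join can be run end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: SQLite stands in for whatever database server is actually used, and every row except YDL102W is an invented placeholder (including the `PF_OTHER` domain ID).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orfs (
    ORFName     VARCHAR(15) PRIMARY KEY,
    CommonName  VARCHAR(15),
    SGDID       VARCHAR(15),
    Description VARCHAR(250)
);
CREATE TABLE orfdomains (
    ORFName VARCHAR(15) REFERENCES orfs(ORFName),
    PfamID  VARCHAR(25),
    Evalue  FLOAT,
    PRIMARY KEY (ORFName, PfamID)
);
-- Only YDL102W comes from the slides; the other rows are invented placeholders.
INSERT INTO orfs VALUES ('YDL102W', 'Cdc2', 'S0002260', 'Catalytic subunit of DNA polymerase delta');
INSERT INTO orfs VALUES ('ORF_K1', 'KinA', 'X1', 'hypothetical kinase');
INSERT INTO orfs VALUES ('ORF_K2', 'KinB', 'X2', 'hypothetical weak kinase match');
INSERT INTO orfdomains VALUES ('ORF_K1', 'PF00069', 1e-30);
INSERT INTO orfdomains VALUES ('ORF_K2', 'PF00069', 0.5);
INSERT INTO orfdomains VALUES ('YDL102W', 'PF_OTHER', 1e-20);
""")

# Proteins with a confident protein kinase domain (PF00069) hit
kinases = conn.execute("""
    SELECT orfs.ORFName, orfs.CommonName, orfdomains.Evalue
    FROM orfs
    INNER JOIN orfdomains ON orfs.ORFName = orfdomains.ORFName
    WHERE orfdomains.PfamID = 'PF00069' AND orfdomains.Evalue < 0.001
""").fetchall()
print(kinases)
```

The weak match (E-value 0.5) is filtered out by the `Evalue < 0.001` condition, and the protein with a different domain never matches the `PfamID` test.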

Views

• To get a list of interactions from our LIMS, we have to join 5 tables together

• A VIEW is like a canned select statement that acts like a read-only table

CREATE VIEW ORF_INTX AS (
  SELECT Constructs.ORF AS BaitORF, GIORF.ORF AS HitORF
  FROM Constructs
  INNER JOIN Pulldowns ON Pulldowns.CloneID = Constructs.CloneID
  INNER JOIN Bands ON Bands.PullDownID = Pulldowns.PulldownID
  INNER JOIN Hits ON Hits.PlateNum = Bands.PlateNum AND Hits.WellNo = Bands.WellNo
  INNER JOIN GIORF ON Hits.GI = GIORF.GI)

Tables joined: Constructs, Pulldowns, Bands, Hits, GIORF

Resulting view ORF_INTX: BaitORF VARCHAR(15), HitORF VARCHAR(15)


slightly more interesting queries...

• Let's find all the kinases that pull down other kinases

> SELECT orf_intx.* FROM orf_intx
  INNER JOIN orfdomains db ON orf_intx.BaitORF = db.ORFName
  INNER JOIN orfdomains dh ON orf_intx.HitORF = dh.ORFName
  WHERE db.PfamID = 'PF00069' AND db.Evalue < 0.001
    AND dh.PfamID = 'PF00069' AND dh.Evalue < 0.001

• Find all the kinases that interact with known transcription factors or with nuclear-localized proteins

> SELECT orf_intx.* FROM orf_intx
  INNER JOIN orfdomains db ON orf_intx.BaitORF = db.ORFName
  WHERE db.PfamID = 'PF00069' AND db.Evalue < 0.001
    AND orf_intx.HitORF IN (
      SELECT ORFName FROM localizations WHERE Loc = 'Nucleus'
      UNION
      SELECT ORFName FROM ORFGO WHERE GOID = 'GO:0000130')
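The view and the kinase-kinase query can be exercised on a toy LIMS. The sketch below assumes SQLite (via Python's `sqlite3`) and uses entirely invented data: one bait `KIN1` that pulls down `KIN2` (a kinase) and `STRUCT1` (not a kinase).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Constructs (ORF VARCHAR(15), TagPosition CHAR(1), CloneID INT);
CREATE TABLE Pulldowns  (PulldownID INT, CloneID INT);
CREATE TABLE Bands      (CloneID INT, PulldownID INT, BandID INT, PlateNum INT, WellNo INT);
CREATE TABLE Hits       (PlateNum INT, WellNo INT, GI INT, Score INT);
CREATE TABLE GIORF      (GI INT, ORF VARCHAR(15));
CREATE TABLE orfdomains (ORFName VARCHAR(15), PfamID VARCHAR(25), Evalue FLOAT);

-- Invented toy data: two gel bands from one pulldown, each identifying one protein.
INSERT INTO Constructs VALUES ('KIN1', 'N', 1);
INSERT INTO Pulldowns  VALUES (100, 1);
INSERT INTO Bands      VALUES (1, 100, 1, 7, 1), (1, 100, 2, 7, 2);
INSERT INTO Hits       VALUES (7, 1, 5001, 90), (7, 2, 5002, 80);
INSERT INTO GIORF      VALUES (5001, 'KIN2'), (5002, 'STRUCT1');
INSERT INTO orfdomains VALUES ('KIN1', 'PF00069', 1e-40), ('KIN2', 'PF00069', 1e-25),
                              ('STRUCT1', 'PF_OTHER', 1e-10);

-- The five-table join, stored once as a view
CREATE VIEW ORF_INTX AS
    SELECT Constructs.ORF AS BaitORF, GIORF.ORF AS HitORF
    FROM Constructs
    INNER JOIN Pulldowns ON Pulldowns.CloneID = Constructs.CloneID
    INNER JOIN Bands ON Bands.PulldownID = Pulldowns.PulldownID
    INNER JOIN Hits ON Hits.PlateNum = Bands.PlateNum AND Hits.WellNo = Bands.WellNo
    INNER JOIN GIORF ON Hits.GI = GIORF.GI;
""")

all_interactions = conn.execute(
    "SELECT BaitORF, HitORF FROM ORF_INTX ORDER BY HitORF").fetchall()

# orfdomains is joined twice under two aliases: db for the bait, dh for the hit
kinase_pairs = conn.execute("""
    SELECT ORF_INTX.BaitORF, ORF_INTX.HitORF
    FROM ORF_INTX
    INNER JOIN orfdomains db ON ORF_INTX.BaitORF = db.ORFName
    INNER JOIN orfdomains dh ON ORF_INTX.HitORF = dh.ORFName
    WHERE db.PfamID = 'PF00069' AND db.Evalue < 0.001
      AND dh.PfamID = 'PF00069' AND dh.Evalue < 0.001
""").fetchall()
print(all_interactions)
print(kinase_pairs)
```

Each alias must be used in its own `ON` condition; joining both aliases on the unaliased table name is a common slip that most databases reject.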


LIMS & Database Summary

• Interaction proteomics experiments generate large amounts of data

• Large amounts of data are best stored in a relational database

• SQL is an invaluable basic tool

• easy to learn and use - worth learning

• allows surprisingly sophisticated queries

• More info:

• SQL for Web Nerds http://philip.greenspun.com/sql

• many tutorials on the web


2. Experimental Design & Statistics

1. Issues with high throughput interaction data

2. Measuring performance:

• Confusion matrix: types of errors

• ROC curves

3. Example: Design of HMS-PCI Experiments

• Reproducibility

• False positive and negative rate estimates

• Biochemical factors

• Planning numbers of replicates

4. Summary


High Throughput Interaction Data

• Prone to false positives and negatives

• Low overlap between data from different methods

• Impractical to validate every interaction by traditional methods

• High costs of false positive data

• Need to be able to prioritize results

Bader & Hogue 2002


Confusion matrix

                      true class
                      positive          negative
hypothesis   yes      True positive     False positive
             no       False negative    True negative
totals:               P                 N

$\text{fp rate} = \frac{FP}{N}$

$\text{tp rate} = \frac{TP}{P}$

$\text{precision} = \frac{TP}{TP + FP}$

$\text{accuracy} = \frac{TP + TN}{P + N}$

$\text{sensitivity} = \text{recall} = \frac{TP}{P}$

$\text{specificity} = 1 - \text{fp rate}$

reference: Fawcett, 2004 http://www.hpl.hp.com/personal/Tom_Fawcett/papers/ROC101.pdf
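These definitions translate directly into code. The sketch below is illustrative only: the class name `ConfusionMatrix` and the counts (40 true positives, 10 false positives, 20 false negatives, 930 true negatives) are invented and do not come from any dataset in this tutorial.

```python
from dataclasses import dataclass

@dataclass
class ConfusionMatrix:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @property
    def positives(self):   # P: all truly positive instances
        return self.tp + self.fn

    @property
    def negatives(self):   # N: all truly negative instances
        return self.fp + self.tn

    def tp_rate(self):     # also called sensitivity or recall
        return self.tp / self.positives

    def fp_rate(self):
        return self.fp / self.negatives

    def precision(self):
        return self.tp / (self.tp + self.fp)

    def accuracy(self):
        return (self.tp + self.tn) / (self.positives + self.negatives)

    def specificity(self):
        return 1 - self.fp_rate()

# Invented counts, e.g. predicted interactions scored against a reference set
cm = ConfusionMatrix(tp=40, fp=10, fn=20, tn=930)
print(f"tp rate={cm.tp_rate():.3f} fp rate={cm.fp_rate():.3f} "
      f"precision={cm.precision():.3f} accuracy={cm.accuracy():.3f}")
```

With a large negative class, as is typical for interaction data, accuracy can look high even when a third of the true interactions are missed, which is why the rates are reported separately.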


ROC curves

• Receiver Operating Characteristic

• originally used in signal detection theory and applied to medical diagnostic systems

• illustrates tradeoff between true-positive and false positive rates

• area under curve (AUC) provides a simple scalar value that can be used to compare performance

[Excerpt from Fawcett, 2004, p. 15]

Figure 7. Two ROC graphs (axes: False Positive rate vs. True Positive rate, each from 0 to 1.0). The graph on the left shows the area under two ROC curves. The graph on the right shows the area under the curves of a discrete classifier (A) and a probabilistic classifier (B).

Every instance that is classified to this leaf node will be assigned the same score. The rectangle of figure 6 will be of size $\frac{nm}{PN}$, and if these instances are not averaged this one leaf may account for errors in ROC curve area as high as $\frac{nm}{2PN}$.

5. Area under an ROC Curve (AUC)

An ROC curve is a two-dimensional depiction of classifier performance. To compare classifiers we may want to reduce ROC performance to a single scalar value representing expected performance. A common method is to calculate the area under the ROC curve, abbreviated AUC (Bradley, 1997; Hanley and McNeil, 1982). Since the AUC is a portion of the area of the unit square, its value will always be between 0 and 1.0. However, because random guessing produces the diagonal line between (0, 0) and (1, 1), which has an area of 0.5, no realistic classifier should have an AUC less than 0.5.

The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. This is equivalent to the Wilcoxon test of ranks (Hanley and McNeil, 1982). The AUC is also closely related to the Gini coefficient (Breiman et al., 1984), which is twice the area between the diagonal and the ROC curve. Hand and Till (2001) point out that Gini + 1 = 2 × AUC.

Figure 7a shows the areas under two ROC curves, A and B. Classifier B has greater area and therefore better average performance. Figure 7b

Fawcett, 2004
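The rank interpretation of the AUC gives a very direct way to compute it. The following is a minimal sketch, assuming each instance has a numeric score and counting ties as half a win; the function name `auc` and the scores are invented for illustration.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Invented classifier scores for known positives and known negatives
pos = [0.9, 0.8, 0.6, 0.55]
neg = [0.7, 0.5, 0.4, 0.3, 0.1]

area = auc(pos, neg)
gini = 2 * area - 1   # from Gini + 1 = 2 x AUC
print(area, gini)
```

The pairwise loop is quadratic in the number of instances; for large datasets the same quantity is normally obtained from a single sort of the scores.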


example: HMS-PCI workflow

• reproducibility

• fp, fn rates at given # of replicates

• cross-contamination

• sources of variability

[Workflow diagram: Bait selection → Cloning → Transfection & Expression → Immunoprecipitation → Separation → Digestion → LC MS/MS → Protein Identification → Interpretation → Distribution. Supporting components: LIMS; biological databases.]


HMS-PCI: Design of Large-Scale Interaction Proteomics Projects

• How many times should a bait be expressed?

• How many hits can be expected?

• Should both N & C termini be tagged?

• How many trials are required?

• What is the false negative rate?

• What is the false positive rate?

Needed for:

• Project design

• Data integrity and interpretation

• Cost estimation and control


HMS-PCI design: Reproducibility Study

• 49 baits from diverse protein families

• N- and C- tagged

• 4-10 replicates with each construct

• Controls

• Negative controls: FLAG-tag in empty vector

• Positive controls: VHL protein

• had been observed to reproducibly pull down high and low abundance interactors

• Replicates collected over different days and run side-by-side on gels


HMS-PCI design: Success Rate

• Expression success ≡ bait observed by MS

• N and C have roughly equivalent expression rates

• 5-6 attempts required for 4 successful replicates, on average

[Figure: Expression Success. x-axis: fraction of attempts successful (0 to 1); y-axis: % of total baits attempted (0.00 to 0.50); series: N, C]


HMS-PCI design: “Hit” definition

• Some operational definition of a hit required to filter through data

• Hit should be:

• Specific: < 5% background observation frequency

• Reproducible: 2+ observations

Filter                       # proteins
after control subtraction    3031
< 5% BOF                     1081
2+ observations              190


HMS-PCI design: Reproducibility Rates

• Reproducibility varies greatly between hits

[Figure: Observed Reproducibility Rate. x-axis: reproducibility rate (0 to 1); y-axis: fraction of hits (0.00 to 0.50); series: N, C, Average. Data labels shown: 0.01, 0.02, 0.07, 0.39, 0.01, 0.04, 0.17, 0.31]


Reminder: Binomial Distribution

• Bernoulli process:

• repeated trials, each of which is either a “success” or “failure”

• probability p of success in each trial

• # of successes in n trials has binomial distribution:

$P[X = k] = \binom{n}{k} p^k (1 - p)^{n-k}$, where $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$

• probability of k or more successes (shown here for 2 or more):

$P[X \ge 2] = \sum_{k=2}^{n} \binom{n}{k} p^k (1 - p)^{n-k}$
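The binomial formulas are short enough to evaluate directly. The sketch below uses only the Python standard library; the function names `binom_pmf` and `prob_at_least` are chosen for this example.

```python
from math import comb

def binom_pmf(k, n, p):
    """P[X = k]: probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least(k_min, n, p):
    """P[X >= k_min]: sum of the pmf from k_min up to n."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))

# Probability of seeing a prey 2+ times, for a few reproducibility rates p
# and n = 2..6 trials
for p in (0.1, 0.3, 0.5, 0.8):
    print(p, [round(prob_at_least(2, n, p), 2) for n in range(2, 7)])
```

For example, a prey that shows up in half of all experiments (p = 0.5) is seen at least twice in 4 trials with probability 11/16 ≈ 0.69.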


HMS-PCI design: Minimizing False-Negative Risk

• How many trials are required to observe a hit twice?

• Depends on reproducibility rate for that hit

[Figure: Number of Trials Needed to Observe Prey 2+ Times. x-axis: reproducibility rate (0.0 to 1.0); y-axis: fraction of hit pool (0 to 1); one curve per # of trials (2, 3, 4, 5, 6)]

Theoretical probability of 2+ observations in X # of trials

Reproducibility rate   2      3      4      5      6
0                      0.00   0.00   0.00   0.00   0.00
0.1                    0.01   0.03   0.05   0.08   0.11
0.2                    0.04   0.10   0.18   0.26   0.34
0.3                    0.09   0.22   0.35   0.47   0.58
0.4                    0.16   0.35   0.52   0.66   0.77
0.5                    0.25   0.50   0.69   0.81   0.89
0.6                    0.36   0.65   0.82   0.91   0.96
0.7                    0.49   0.78   0.92   0.97   0.99
0.8                    0.64   0.90   0.97   0.99   1.00
0.9                    0.81   0.97   1.00   1.00   1.00
1                      1.00   1.00   1.00   1.00   1.00

Predicted fraction of observed prey pool found in X # of trials

Reproducibility rate   Fraction of prey pool   2      3      4      5      6
0                      0.00                    0.00   0.00   0.00   0.00   0.00
0.1                    0.00                    0.00   0.00   0.00   0.00   0.00
0.2                    0.01                    0.00   0.00   0.00   0.00   0.00
0.3                    0.02                    0.00   0.00   0.01   0.01   0.01
0.4                    0.07                    0.01   0.02   0.03   0.04   0.05
0.5                    0.39                    0.10   0.19   0.27   0.31   0.34
0.6                    0.01                    0.00   0.01   0.01   0.01   0.01
0.7                    0.04                    0.02   0.03   0.04   0.04   0.04
0.8                    0.17                    0.11   0.15   0.16   0.17   0.17
0.9                    0.00                    0.00   0.00   0.00   0.00   0.00
1                      0.31                    0.31   0.31   0.31   0.31   0.31
Total                  1.00                    0.55   0.72   0.83   0.89   0.93
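The bottom row of the second table is a weighted sum: each reproducibility rate contributes its share of the prey pool times the probability of seeing such a prey at least twice. The sketch below recomputes it from the rounded fractions printed on the slide, so it is expected to agree with the slide only to within about 0.02; the function names are made up for this example.

```python
# Fraction of the prey pool at each reproducibility rate (rounded values from the slide)
observed = {0.0: 0.00, 0.1: 0.00, 0.2: 0.01, 0.3: 0.02, 0.4: 0.07, 0.5: 0.39,
            0.6: 0.01, 0.7: 0.04, 0.8: 0.17, 0.9: 0.00, 1.0: 0.31}

def prob_two_or_more(n, p):
    """P[X >= 2] for n trials: one minus the chance of zero or exactly one observation."""
    return 1 - (1 - p)**n - n * p * (1 - p)**(n - 1)

def fraction_found(n):
    """Expected fraction of the prey pool observed 2+ times in n trials."""
    return sum(frac * prob_two_or_more(n, rate) for rate, frac in observed.items())

slide_totals = {2: 0.55, 3: 0.72, 4: 0.83, 5: 0.89, 6: 0.93}
for n in range(2, 7):
    print(n, round(fraction_found(n), 3), slide_totals[n])
```

The curve flattens quickly: going from 2 to 4 trials recovers far more of the pool than going from 4 to 6.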


HMS-PCI design: Controlling False-Positive Risk

• Assume each protein is observed randomly, iid across all experiments

• Estimate frequency from observed background observation frequency (normalized for number of repeats of specific baits)

• Choose prey frequency cutoff based on number of trials and false positive rate considered acceptable
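One way to read the procedure above is as a binomial tail calculation: if a background protein turns up in a fraction `freq` of all experiments, how likely is it to appear 2 or more times in n trials with one bait purely by chance? The sketch below is an interpretation under the iid assumption, not the exact calculation behind the original analysis; the function names and the grid search are invented for illustration.

```python
from math import comb

def chance_hit_probability(freq, n_trials, k_required=2):
    """P(a background protein with observation frequency freq is seen
    k_required or more times in n_trials independent trials)."""
    return sum(comb(n_trials, k) * freq**k * (1 - freq)**(n_trials - k)
               for k in range(k_required, n_trials + 1))

def largest_safe_cutoff(n_trials, acceptable=0.05, k_required=2, steps=1000):
    """Largest frequency (on a grid of 1/steps) whose chance-hit probability
    stays at or below the acceptable level."""
    best = 0.0
    for i in range(steps + 1):
        freq = i / steps
        if chance_hit_probability(freq, n_trials, k_required) <= acceptable:
            best = freq
        else:
            break   # the probability only grows with freq, so stop at the first failure
    return best

p_chance = chance_hit_probability(0.05, 4)
print(round(p_chance, 4))
for n in range(2, 7):
    print(n, largest_safe_cutoff(n))
```

Under these assumptions a protein at the 5% frequency cutoff has roughly a 1.4% chance of passing the 2+ observation rule in 4 trials, and the tolerable cutoff shrinks as more trials are run.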

[Poster reproduced on this slide]

Optimizing Experimental Design in High-Throughput Interaction Proteomics

Adrian Heilbut*, Paul Taylor, Lynda Moore, Mike F. Moran, Daniel Figeys, Thodoros Topaloglou, Gregg B. Morin*

MDS Proteomics Inc., Toronto, Canada

* current address: Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, ON

1. Motivation

• High-throughput mass spectrometry protein complex identification (HMS-PCI) has emerged as a key technology for functional genomics.

• Based on lack of concordance among datasets and the literature, HMS-PCI data is thought to be non-saturating and prone to both false-positive and false-negative interactions. Verifying every putative experimental interaction by traditional methods is not practical.

• Intuitively, experiments need to be done more than once, but how many repetitions are necessary and sufficient?

• Protein biochemistry varies enormously. When HMS-PCI is applied to large numbers of proteins, biochemistry protocols cannot be optimized for each individual bait.

2. Objectives

A: Determine the prey reproducibility rate distribution, and hence the number of trials that must be performed in order to observe all reproducible interactions.

B: Develop a criterion for accepting an observed prey as ‘real’

C: Compare the effectiveness of N-terminal vs. C-terminal FLAG tags

D: Estimate the false-negative rate for a given number of trials

E: Estimate the false-positive rate

3. Experimental System

[Workflow diagram: Bait Selection & Cloning → Ectopic expression → Lysis and Immunoprecipitation → Gel Separation → Band Excision → LC-MS/MS → Informatics]

• 49 different bait cDNAs, in both N and C-terminally FLAG tagged constructs

• Calcium phosphate transfection into HEK293 cells

• Empty vectors lacking a cDNA insert were used as negative controls with each batch of samples. Proteins identified in negative control lanes were automatically subtracted.

• Each construct was expressed and immunoprecipitated 4-6 times

4. Prey Frequency Distribution

• Prey Frequency is defined as the percentage of baits with which any given prey protein is observed.

• We have accumulated a database containing the results of hundreds of immunoprecipitation experiments. Over the entire dataset, the prey frequency distribution is extremely skewed, due to variations in background binding and protein abundance.

• The global observation frequency of each protein can be used to help assess the significance of a new observation of that protein.

5. Defining True Interactions

• Real interactions must be reproducible.

• Worst-case assumption: Preys are observed randomly, and distributed uniformly throughout the dataset.

• If individual experiments are assumed independent and identically distributed, then experiments can be modelled as Bernoulli trials, and an expectation for each protein can be calculated based on its observed frequency.

• Low frequency preys are those more likely to be true, bait-specific interactions. Frequently observed preys can easily be reproduced, but are meaningless.

6. Bait Biochemistry, Success, and Productivity

• Bait ‘success’ is defined by detection of the expressed bait protein by mass spectrometry.

• Bait proteins were of varying sizes and functional classes. The size distribution of the proteins is shown in Fig. 6A.

• A subset of baits were chosen from a common pathway, which could potentially invalidate our assumptions of independence between experiments.

• Overall, the N-terminal tag expression success rate is slightly better than C-terminal expression, and the difference is borderline significant (chi squared = 3.65, df = 1, p = 0.056). (6B)

• Each bait was attempted in 4-6 replicate experiments. Fig. 6C shows the expression success rate over those trials, and compares the distributions of success rates for N-tagged vs. C-tagged constructs. Once a construct expresses, the expression rate is comparable for both tag positions.

[Figure 6A: Bait size histogram. x-axis: protein size (# aa), 200 to 2000; y-axis: number of baits, 0 to 30]

Table 6B:

tag        successful   attempted   success rate
N          40           49          0.82
C          33           49          0.67
combined   42           49          0.86

[Figure 6C: Expression Success. x-axis: fraction of attempts successful (0 to 1); y-axis: % of total baits attempted; series: N, C]

7. Prey Reproducibility Rate Distribution

• If the presence of a prey is a result of a biologically meaningful interaction, then prey observation should be reproducible between replicate experiments with the same bait.

• Reproducibility varies between preys for many possible reasons, such as protein abundance, different interaction affinities, transient interactions, and peptide chemistry affecting mass spectrometry identification.

[Figure 7A: Hit Reproducibility (frequency). x-axis: # times observed (1 to 12); y-axis: % of total hits (n=190), 0 to 50]

8. Prey Reproducibility: N-tags vs. C-tags

• Reproducibility varies greatly between preys. As shown in figure 8A, many reproducible proteins are only observed in 50% of experiments.

• Possible reasons for variability include protein abundance, different interaction affinities, transient interactions, peptide chemistry.

• N and C-tagged constructs have no significant difference in prey reproducibility, once they are successfully expressed.

[Figure 8A: Observed Reproducibility Rate. x-axis: reproducibility rate (0 to 1); y-axis: fraction of hits (0.00 to 0.50); series: N, C, Average. Data labels shown: 0.01, 0.02, 0.07, 0.39, 0.01, 0.04, 0.17, 0.31]

9. Planning Trial Size: Minimizing false-negative risk

• How many trials are required to observe a prey twice?

• The number of trials required will depend on the reproducibility rate for that prey. We estimate the actual reproducibility rates using the observed reproducibility rates.

• Figure 9A illustrates the number of trials required to observe a prey at least twice.

• 1 or 2 trials will provide a highly incomplete dataset, from which it will be difficult to distinguish real preys from background.

• Table 9C shows the effect of the non-uniform distribution of reproducibility rates on the expected number of false negatives, as the number of repetitions is varied.

• Experiments can be improved either by improving reproducibility (by improving biochemistry or analytical sensitivity) or by increasing the number of trials.

• We will experience diminishing returns if the number of trials is set too high, unless the success criterion is also tightened. For instance, we may demand 3 or more observations, or include semi-quantitative information such as the number or intensity of peptides.

[Figure 9A: Number of Trials Needed to Observe Prey 2+ Times. x-axis: reproducibility rate (0.0 to 1.0); y-axis: fraction of hit pool; one curve per # of trials (2 to 6)]

[Figure: Predicted False Negative Rate. x-axis: reproducibility rate (0.0 to 1.0); y-axis: fraction of hit pool; one curve per # of trials (2 to 6)]

Theoretical probability of NOT observing 2+ in X # of trials

Reproducibility rate   2      3      4      5      6
0                      1.00   1.00   1.00   1.00   1.00
0.1                    0.99   0.97   0.95   0.92   0.89
0.2                    0.96   0.90   0.82   0.74   0.66
0.3                    0.91   0.78   0.65   0.53   0.42
0.4                    0.84   0.65   0.48   0.34   0.23
0.5                    0.75   0.50   0.31   0.19   0.11
0.6                    0.64   0.35   0.18   0.09   0.04
0.7                    0.51   0.22   0.08   0.03   0.01
0.8                    0.36   0.10   0.03   0.01   0.00
0.9                    0.19   0.03   0.00   0.00   0.00
1                      0.00   0.00   0.00   0.00   0.00

Table 9C: Predicted fraction of prey pool NOT found in X # of trials

Reproducibility rate   Fraction of prey pool   2      3      4      5      6
0                      0.00                    0.00   0.00   0.00   0.00   0.00
0.1                    0.00                    0.00   0.00   0.00   0.00   0.00
0.2                    0.01                    0.00   0.00   0.00   0.00   0.00
0.3                    0.02                    0.01   0.01   0.01   0.01   0.01
0.4                    0.07                    0.05   0.04   0.03   0.02   0.01
0.5                    0.39                    0.29   0.19   0.12   0.07   0.04
0.6                    0.01                    0.01   0.00   0.00   0.00   0.00
0.7                    0.04                    0.02   0.01   0.00   0.00   0.00
0.8                    0.17                    0.06   0.02   0.01   0.00   0.00
0.9                    0.00                    0.00   0.00   0.00   0.00   0.00
1                      0.31                    0.00   0.00   0.00   0.00   0.00
Total                  1.00                    0.45   0.28   0.17   0.11   0.07

10. Estimating false-positives

• Using the global prey frequencies, we can also estimate the rate of false-positives due to background.

• Assume background proteins have a uniform random distribution, and that background does not vary over time or experimental conditions.

• Choose a prey frequency cutoff based on the number of trials performed and the false positive rate considered acceptable.

Symbols used:

p: prey observation frequency
n: number of trials
k: number of observations required (2)
E: expected number of false positives
cutoff: frequency cutoff
Numhits(p): number of hits at each prey observation frequency

[Figure: Predicted False Positive Rate vs. Database Frequency. Both axes run from 0 to 1; one curve per # of trials (2 to 6); the “safe” region is marked at frequency < 0.05 and false positive rate 5%]

11. N-tag vs. C-tag Hit Overlap

• Is it necessary to tag both the amino- and carboxy-termini to detect all of a protein’s interactions?

• Table 11A indicates that proteins must be tagged at both positions in order to obtain all of the interactions that can be found with the HMS-PCI protocol.

• Table 11B shows that the overlap between N- and C-tagged constructs is extremely small.

• The N-terminal tags are more productive overall. It may be possible to improve the design of the C-tags.

Table 11A:

                        Fraction of total hits observed
N-tag only experiment   0.77
C-tag only experiment   0.27

Table 11B:

                             # hits   fraction of total hits
seen with N only             110      0.68
seen with C only             29       0.18
seen in both N&C             15       0.09
seen when N+C are combined   8        0.05
total                        162

12. Summary of Results

• To maximize recovery of complexes in an HMS-PCI experiment, both N and C-terminally tagged baits should be attempted.

• An experiment using only one tag position will miss 25%-75% of all observable interactions.

• 5-6 repetitions for each bait should be attempted to obtain the 4 successful trials that are required for acceptable false-positive and false-negative rates.

• With 4 trials, the false-negative rate is approximately 15%.

• With 4 trials and a 5% frequency cut-off, the false-positive rate is less than 5%.

• Approximately 5 hits can be expected per bait, on average.

Conclusions

• High throughput capability can be leveraged to significantly improve the quality of protein interaction data, in addition to expanding coverage.

• Even a very simple statistical model can be helpful to rationally guide experimental designs.

• A statistical approach allows for prioritization of preys based on experimental quality, and permits potentially questionable data to be flagged.

• Experimental confidence estimates and higher quality underlying data will facilitate the integration of data from orthogonal experimental methods, and will make interaction data more useful for generating hypotheses and constructing models.

37

CPI PPI Tutorial 2004 - AMH

HMS-PCI design: N- vs. C- tags

• need to tag both positions to get all interactions that can be found by this method on a large set of proteins

• overlap between N and C tagged constructs is small

                   # hits   % total hits
seen with N only   151      79%
seen with C only   47       25%
seen in N and C    18       9%
seen N ∪ C         10       5%
total              190


HMS-PCI example: Conclusions

• For the protocol employed, 5-6 attempts required for acceptable false-positive and false-negative rates

• Both N and C terminally tagged baits should be attempted for maximal coverage

• Lots of opportunities to improve process:

• constructs, tags, replicates, controls, sensitivity...

• With more data, protocols can be optimized (given available resources and requirements)

• Lots of work still to do...


Experimental Design Summary

• Careful experimental design and error estimation are essential for high-throughput biology

• Need to be able to filter and prioritize results

• Pilot studies will be required to establish optimal designs for novel protocols

• LIMS facilitates analysis of process issues

• Don’t forget about basic statistics and basic design

• Reproducibility should be just as important for large-scale studies as in regular biology experiments


3. MS/MS Protein ID

• Protein mass spectrometry review

• Peptide fragmentation & nomenclature

• ESI: Multiply-charged ions and deconvolution

• Spectra

• MS/MS database search

• Example: X! Tandem

• Search statistics and scoring

• De novo sequencing


MS/MS Protein Identification

Proteins → (trypsin) digestion → Peptides → MALDI or electrospray ionization → ionized peptides → CID or PSD → fragment ion spectrum

Example peptides:

DLTDYLMK

VAPEEHPVLLTEAPLNPK

PEEHPVLLTEAPLNPK

QEYDESGPSIVHR


Tandem Mass Spectrometry

• many different possible configurations

• common types used in proteomics:

• Quadrupole Time-of-flight (e.g. QStar)

• Quadrupole Ion trap (e.g. LCQ)

• Linear Ion Trap (LTQ, QTRAP)

• accuracy & resolution of instrument impacts search

• m/z - mass to charge ratio

• mass accuracy (ppm) = (measured m/z - actual m/z) / actual m/z * 10^6

• mass resolution = m / peak width
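The two definitions above translate directly into code. This is a minimal sketch; the function names and the example numbers are illustrative assumptions.

```python
def mass_error_ppm(measured_mz, actual_mz):
    """Relative mass error in parts per million (ppm)."""
    return (measured_mz - actual_mz) / actual_mz * 1e6


def resolution(mz, peak_width):
    """Mass resolution m / (peak width), both in the same m/z units."""
    return mz / peak_width


if __name__ == "__main__":
    # A peak measured at m/z 1000.01 for a true value of 1000.00.
    print(mass_error_ppm(1000.01, 1000.0))
    # A peak at m/z 1000 that is 0.1 wide.
    print(resolution(1000.0, 0.1))
```

The mass error feeds directly into the precursor and fragment tolerances chosen for a database search.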


QSTAR XL

The API QSTAR® XL Hybrid LC/MS/MS System is the premier quadrupole time-of-flight LC/MS/MS system, setting a new dimension of flexibility and performance. The enhanced ion optics and new detector provide answers to the most challenging analytical questions at the highest sensitivity. This high sensitivity, coupled with excellent mass accuracy, yields unequivocal molecular weights and high-quality structural information for both protein and small molecule analysis. Novel scan functions offer a high degree of selectivity for low-level protein analysis, together with the most sensitive precursor ion scanning capability for accurate analysis of post-translational modifications (PTMs) and target compound analysis for drug metabolites. The QSTAR XL system is the most flexible MS/MS platform, offering fast, easy switching between the broadest range of ionisation, including NanoSpray™ source, oMALDI™ source, APCI, PhotoSpray™ source, and TurboIonSpray® source.

Key Features of the QSTAR XL System:

• Most flexible MS/MS platform for both electrospray and MALDI analysis

• New oMALDI 2 ion source with enhanced collisional cooling for better sensitivity

• New NanoSpray source for increased productivity with capillary and nanoflow HPLC

• New ion optics and detector to improve ruggedness in 24/7 working environment

• Increased quadrupole mass selection of up to 6,000 amu and time-of-flight mass range of up to 40,000 amu

• Increased efficiency of high mass transmission

• Improved low mass fragment ion transmission

• Unique trapping pulsing capability for maximum duty cycle and ultimate sensitivity

• Unique scan functions for enhanced selectivity and sensitivity for low level compound analysis

New Increased Mass Selection Capability

The high-mass ion transmission properties have been significantly improved in the new generation QSTAR XL system. Ions of up to 40,000 amu can now be analysed by the time-of-flight detector, and their efficiency of transmission has been increased. The CID capabilities have been enhanced so that ions of up to 6,000 amu can be isolated and fragmented for sequence analysis, with improved transmission of low mass fragments (see figure 2).

Novel Multiple Charge Separation

A unique, proprietary charge separation method is applied to the QSTAR XL system to improve detection limits of peptides and proteins when analysed from complex mixtures. Multiple charge separation (MCS) eliminates singly charged ions in the spectra, thereby enhancing the signal-to-noise ratio of multiply charged ions at very low levels. This high degree of separation offers significant gains in signal-to-noise for species that have a charge state higher than 1. The benefits of charge separation are particularly apparent at low femtomole concentration levels, where in regular TOF MS spectra peptide ions are often lost in a sea of chemical noise. The suppression of chemical noise reduces the need for chromatography and makes peptide mass fingerprinting using an electrospray source equivalent to peptide mass fingerprinting by MALDI.

europe.appliedbiosystems.com

The API QSTAR® XL Hybrid LC/MS/MS System: A new dimension in flexibility and performance (New Product Review)

Figure 1. Schematic of the new QSTAR XL system featuring the oMALDI 2 source. Other enhancements include a DC quad and new detector for improved ruggedness. Labelled parts: Laser, Carrier Plate, DC Quad, Q0, Ultra Stable Quadrupole Mass Filter, High Efficiency LINAC Collision Cell, New Detector.

• DC Quad: The quadrupole lens provides a marked improvement in the ability to optimise the ion beam profile.

• Q0: Patented collisional focusing maximises ion transmission for superior sensitivity.

• Q2: Patented LINAC™ High Pressure collision cell provides increased sensitivity and unique trapping pulsing capability for maximum duty-cycle.

Figure 2. Fragmentation of the synthetic peptide Bovine Corticotropin Releasing Factor (CRF) at 4695.5 Da, showing the high mass fragment ion sequence information obtainable from large parent peptides.


LCQ and LTQ

[Figure: Schematic Diagram of the Finnigan LCQ Ion Trap MS. Labelled parts: ESI needle, sheath liquid, heated capillary, tube lens, RF-only multipoles, ion trap (end caps, ring electrode), electron multiplier, dynode, vacuum pumps. Labelled stages: ion formation, desolvation, ion transport into vacuum.]

• strengths: compact, MSn, excellent full scan MS/MS sensitivity, excellent data dependency, the ultimate peptide mapping system

• limitations: limited storage of ions, 1/3 of low mass range lost in MS/MS modes

• “tandem-in-time” mass spectrometer: different stages of mass analysis performed in the same physical space at discrete points in time

Thermo LTQ www.thermo.com

Thermo LCQ www.thermo.com; fig from www.enovatia.com


ESI: Multiply-charged peptides

• Electrospray ionization results in multiply charged peptides - spectra require deconvolution

Fenn J, 1989
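One common way to deconvolute such a spectrum uses two adjacent peaks of the same molecule, carrying charges z and z + 1. The sketch below assumes positive-mode protonation, so m/z = (M + z·H)/z with H the proton mass; the function names and the example mass are hypothetical.

```python
PROTON = 1.007276  # mass of a proton in Da


def mz_of(neutral_mass, z):
    """m/z of a molecule of the given neutral mass carrying z protons."""
    return (neutral_mass + z * PROTON) / z


def charge_from_adjacent_peaks(mz_high, mz_low):
    """Charge z of the peak at mz_high, given that mz_low is the same
    molecule carrying z + 1 protons: z = (mz_low - H) / (mz_high - mz_low)."""
    return round((mz_low - PROTON) / (mz_high - mz_low))


def neutral_mass(mz, z):
    """Neutral mass recovered from one peak once its charge is known."""
    return z * (mz - PROTON)


if __name__ == "__main__":
    mass = 16951.5  # hypothetical protein mass in Da
    high, low = mz_of(mass, 10), mz_of(mass, 11)
    z = charge_from_adjacent_peaks(high, low)
    print(z, neutral_mass(high, z))
```

Real deconvolution software fits the whole charge-state envelope rather than a single pair of peaks; the pairwise relation only shows why the spacing of the peaks determines the charge.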


Mass Spec Protein Identification

• MS: Peptide mass mapping

• MS/MS: Sequence Tags

• MS/MS: De novo sequencing

• MS/MS: Database search / correlation


Peptide Fragmentation

http://www.matrixscience.com/help_index.html

• a,b,c series: charge stays on N terminal fragment

• x,y,z series: charge stays on C terminal fragment
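As an illustration of the nomenclature, the sketch below computes singly charged b and y ion m/z values for one of the example peptides. It assumes standard monoisotopic residue masses and no modifications; the function names are illustrative.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residues plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER


def b_ions(peptide):
    """b1..b(n-1): N-terminal fragments, residues plus one proton."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += MONO[aa]
        masses.append(total + PROTON)
    return masses


def y_ions(peptide):
    """y1..y(n-1): C-terminal fragments, residues plus water plus one proton."""
    masses, total = [], WATER
    for aa in reversed(peptide[1:]):
        total += MONO[aa]
        masses.append(total + PROTON)
    return masses


if __name__ == "__main__":
    for i, (b, y) in enumerate(zip(b_ions("DLTDYLMK"), y_ions("DLTDYLMK")), 1):
        print(f"b{i} {b:9.4f}   y{i} {y:9.4f}")
```

Each b ion and its complementary y ion together account for the whole peptide, which is a useful consistency check when reading a spectrum.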


MS/MS Database Searching

• Given an experimental MS/MS spectrum, which sequences in the protein database are most likely to have generated that spectrum?

• Algorithm

• For each protein in the database

• in silico digestion to generate peptides with appropriate precursor mass for queries

• theoretically fragment peptides

• include possibilities of mutations/post-translational modification?

• compare query spectra to predicted spectra of each peptide, and assign a score and statistical significance

• collect peptide matches for each protein
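The steps above can be sketched as a toy search. It is a simplification under stated assumptions: trypsin cleaves after K or R except before P, no missed cleavages or modifications are considered, and the score is a plain shared-peak count rather than the statistical scoring used by real engines. The database contents and function names are hypothetical.

```python
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.010565, 1.007276


def tryptic_digest(protein):
    """In silico digestion: cut after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    return sum(MONO[aa] for aa in peptide) + WATER


def fragment_mzs(peptide):
    """Theoretical singly charged b and y ion m/z values."""
    res = [MONO[aa] for aa in peptide]
    b = [sum(res[:i]) + PROTON for i in range(1, len(res))]
    y = [sum(res[i:]) + WATER + PROTON for i in range(1, len(res))]
    return b + y


def shared_peaks(observed, theoretical, tol=0.5):
    """Number of theoretical peaks with an observed peak within tol."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)


def search(spectrum, precursor_mass, database, mass_tol=0.5):
    """Return (score, protein, peptide) tuples, best match first."""
    hits = []
    for name, sequence in database.items():
        for pep in tryptic_digest(sequence):
            if abs(peptide_mass(pep) - precursor_mass) <= mass_tol:
                hits.append((shared_peaks(spectrum, fragment_mzs(pep)), name, pep))
    return sorted(hits, reverse=True)
```

Filtering by precursor mass first keeps the number of spectrum comparisons small, which is the same idea that makes real search engines tractable on large databases.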


Database Search Engines

• Mascot

• Sequest

• X! Tandem

• Protein or translated genomic/EST sequences can be searched


Sequence Databases

• NCBI NR - All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF

• IPI

• UniProt - successor to SwissProt

• Issues

• Databases need to be updated regularly

• Databases contain significant redundancy

• Some identifiers (GIs) are unstable and can disappear from one release to the next; accessions should be stable

• Identifications are often ambiguous - clustering required


X! Tandem

• Open Source software for matching MS/MS spectra to sequences

• Uses a new 2-step algorithm to speed up searches

• Fast and free

http://www.thegpm.org

Craig R & Beavis RC. A method for reducing the time required to match protein sequences to tandem mass spectra. Rapid Commun Mass Spec 2003.


4. Modelling biological networks as graphs

1. Basic elements of graph theory & definitions
2. Statistical measures of graph topology
3. Small world networks
4. Scale-free networks
5. Biological implications of topology
   1. Hubs & lethality
   2. Date hubs vs. party hubs


Graph Theory

• Graph

• set of Nodes (Vertices) V = {v1,v2,v3,v4}

• set of Edges E = {(v1,v2),(v1,v3),(v1,v4)}

• the edge set E is a symmetric relation on V (edges are undirected)

• Directed graph (digraph): directed edges, or arcs - asymmetric

• Graphs can be used to model many, many different kinds of real-world objects and relationships

• Standard algorithms can be applied

[Figure: graph with nodes v1, v2, v3, v4 and edges (v1,v2), (v1,v3), (v1,v4)]


Degree

• Degree of a node ≡ # of edge ends connected to it

[Figure: example nodes labelled with their degrees (0, 1, 1, 2)]


Clustering Coefficient

• In a cluster, neighbours of a node tend to be connected to each other

• Clustering can be measured by counting number of “triangles” around a node, vs. how many there might be

[Figure: two example graphs on nodes A to E, illustrating clustering]
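Counting triangles around a node can be sketched as follows. The example assumes an undirected graph stored as a dict of neighbour sets; the graph and the function name are illustrative, not taken from the slide's figure.

```python
def clustering_coefficient(adj, v):
    """Edges present among the neighbours of v, divided by the
    k(k - 1)/2 edges that could exist among k neighbours."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no pair of neighbours, so no triangle is possible
    links = sum(1 for a in neighbours for b in neighbours
                if a < b and b in adj[a])
    return links / (k * (k - 1) / 2)


if __name__ == "__main__":
    # A triangle A-B-C with one extra node D attached to A.
    adj = {
        "A": {"B", "C", "D"},
        "B": {"A", "C"},
        "C": {"A", "B"},
        "D": {"A"},
    }
    for node in adj:
        print(node, clustering_coefficient(adj, node))
```

Here A has three neighbours but only one of the three possible links among them, so its coefficient is 1/3, while B and C sit in a fully connected neighbourhood.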


Graph Data Structures: Adjacency List vs. Matrix

2 major ways to represent a graph in the computer

• Adjacency list

• Store a list of adjacent edges of each node

• Efficient use of memory for sparse graphs

• Adjacency matrix

• Matrix of all possible edges

• Mark those edges that are actually present

     v1  v2  v3  v4
v1   0   1   1   1
v2   1   0   0   0
v3   1   0   0   0
v4   1   0   0   0

[Figure: graph with nodes v1, v2, v3, v4 and edges (v1,v2), (v1,v3), (v1,v4)]

E = {(v1,v2),(v1,v3),(v1,v4)}
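Both representations can be built from the edge set E above. This is a minimal sketch for an undirected graph without self-loops; the function names are illustrative.

```python
def adjacency_list(nodes, edges):
    """Map each node to the list of its neighbours."""
    adj = {v: [] for v in nodes}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def adjacency_matrix(nodes, edges):
    """Square 0/1 matrix; entry [i][j] is 1 when nodes i and j share an edge."""
    index = {v: i for i, v in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix


def degree(adj, v):
    """Number of edge ends at v (valid here because there are no self-loops)."""
    return len(adj[v])


if __name__ == "__main__":
    nodes = ["v1", "v2", "v3", "v4"]
    edges = [("v1", "v2"), ("v1", "v3"), ("v1", "v4")]
    print(adjacency_list(nodes, edges))
    for row in adjacency_matrix(nodes, edges):
        print(row)
```

The list stores 6 entries for this graph while the matrix stores 16, which is the memory argument for adjacency lists on sparse networks such as protein interaction maps.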


Statistical Mechanics of Networks

• Many natural graphs are characterized by distinctive statistical properties.

• Characteristic distance

• Average Clustering coefficient

• Degree distribution


Small-World Networks

• Graphs traditionally modeled as regular lattices or random graphs

• L(p) = characteristic path length: average of shortest path lengths between each pair of vertices

• C(p) = average clustering coefficient: Cv = clustering coefficient = number of triangles around node v / total possible number of triangles
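The characteristic path length can be computed with one breadth-first search per vertex. The sketch assumes an unweighted, undirected graph stored as a dict of neighbour collections, and it skips unreachable pairs, which is a choice made here rather than something the slide specifies.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: number of edges from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def characteristic_path_length(adj):
    """Mean shortest-path length over all connected pairs of vertices."""
    nodes = list(adj)
    total = pairs = 0
    for i, u in enumerate(nodes):
        dist = shortest_path_lengths(adj, u)
        for w in nodes[i + 1:]:
            if w in dist:
                total += dist[w]
                pairs += 1
    return total / pairs if pairs else float("inf")


if __name__ == "__main__":
    star = {"v1": ["v2", "v3", "v4"], "v2": ["v1"], "v3": ["v1"], "v4": ["v1"]}
    print(characteristic_path_length(star))
```

For the four-node star, three pairs are one edge apart and three pairs are two edges apart, giving an average of 1.5.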

Excerpt from letters to nature, NATURE | VOL 393 | 4 JUNE 1998, p. 441 (Nature © Macmillan Publishers Ltd 1998):

removed from a clustered neighbourhood to make a short cut has, at most, a linear effect on C; hence C(p) remains practically unchanged for small p even though L(p) drops rapidly. The important implication here is that at the local level (as reflected by C(p)), the transition to a small world is almost undetectable. To check the robustness of these results, we have tested many different types of initial regular graphs, as well as different algorithms for random rewiring, and all give qualitatively similar results. The only requirement is that the rewired edges must typically connect vertices that would otherwise be much farther apart than L_random.

The idealized construction above reveals the key role of short cuts. It suggests that the small-world phenomenon might be common in sparse networks with many vertices, as even a tiny fraction of short cuts would suffice. To test this idea, we have computed L and C for the collaboration graph of actors in feature films (generated from data available at http://us.imdb.com), the electrical power grid of the western United States, and the neural network of the nematode worm C. elegans [17]. All three graphs are of scientific interest. The graph of film actors is a surrogate for a social network [18], with the advantage of being much more easily specified. It is also akin to the graph of mathematical collaborations centred, traditionally, on P. Erdős (partial data available at http://www.acs.oakland.edu/~grossman/erdoshp.html). The graph of the power grid is relevant to the efficiency and robustness of power networks [19]. And C. elegans is the sole example of a completely mapped neural network.

Table 1 shows that all three graphs are small-world networks. These examples were not hand-picked; they were chosen because of their inherent interest and because complete wiring diagrams were available. Thus the small-world phenomenon is not merely a curiosity of social networks [13,14] nor an artefact of an idealized model; it is probably generic for many large, sparse networks found in nature.

We now investigate the functional significance of small-world connectivity for dynamical systems. Our test case is a deliberately simplified model for the spread of an infectious disease. The population structure is modelled by the family of graphs described in Fig. 1. At time t = 0, a single infective individual is introduced into an otherwise healthy population. Infective individuals are removed permanently (by immunity or death) after a period of sickness that lasts one unit of dimensionless time. During this time, each infective individual can infect each of its healthy neighbours with probability r. On subsequent time steps, the disease spreads along the edges of the graph until it either infects the entire population, or it dies out, having infected some fraction of the population in the process.

Figure 1 (panels in order of increasing randomness: Regular, p = 0; Small-world; Random, p = 1) Random rewiring procedure for interpolating between a regular ring lattice and a random network, without altering the number of vertices or edges in the graph. We start with a ring of n vertices, each connected to its k nearest neighbours by undirected edges. (For clarity, n = 20 and k = 4 in the schematic examples shown here, but much larger n and k are used in the rest of this Letter.) We choose a vertex and the edge that connects it to its nearest neighbour in a clockwise sense. With probability p, we reconnect this edge to a vertex chosen uniformly at random over the entire ring, with duplicate edges forbidden; otherwise we leave the edge in place. We repeat this process by moving clockwise around the ring, considering each vertex in turn until one lap is completed. Next, we consider the edges that connect vertices to their second-nearest neighbours clockwise. As before, we randomly rewire each of these edges with probability p, and continue this process, circulating around the ring and proceeding outward to more distant neighbours after each lap, until each edge in the original lattice has been considered once. (As there are nk/2 edges in the entire graph, the rewiring process stops after k/2 laps.) Three realizations of this process are shown, for different values of p. For p = 0, the original ring is unchanged; as p increases, the graph becomes increasingly disordered until for p = 1, all edges are rewired randomly. One of our main results is that for intermediate values of p, the graph is a small-world network: highly clustered like a regular graph, yet with small characteristic path length, like a random graph. (See Fig. 2.)
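The rewiring procedure in the caption can be sketched in a few lines. This is one reading of the description, with assumptions stated: k is even and smaller than n, self-loops are excluded along with duplicate edges, and the fixed `seed` parameter is added only to make runs repeatable.

```python
import random


def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    return adj


def rewire(n, k, p, seed=0):
    """Watts-Strogatz style rewiring: one lap per neighbour distance j,
    visiting each clockwise edge (i, i + j) exactly once."""
    rng = random.Random(seed)
    adj = ring_lattice(n, k)
    for j in range(1, k // 2 + 1):
        for i in range(n):
            if rng.random() >= p:
                continue  # leave this edge in place
            old = (i + j) % n
            # Candidate targets: not i itself and not already a neighbour.
            candidates = [v for v in range(n) if v != i and v not in adj[i]]
            if not candidates:
                continue
            new = rng.choice(candidates)
            adj[i].discard(old)
            adj[old].discard(i)
            adj[i].add(new)
            adj[new].add(i)
    return adj


if __name__ == "__main__":
    graph = rewire(20, 4, 0.1)
    print(sum(len(nb) for nb in graph.values()) // 2, "edges")
```

Because every rewiring step removes one edge and adds one new edge, the total stays at nk/2 for any p, as the caption requires.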

Table 1 Empirical examples of small-world networks

              L_actual   L_random   C_actual   C_random
Film actors   3.65       2.99       0.79       0.00027
Power grid    18.7       12.4       0.080      0.005
C. elegans    2.65       2.25       0.28       0.05

Characteristic path length L and clustering coefficient C for three real networks, compared to random graphs with the same number of vertices (n) and average number of edges per vertex (k). (Actors: n = 225,226, k = 61. Power grid: n = 4,941, k = 2.67. C. elegans: n = 282, k = 14.) The graphs are defined as follows. Two actors are joined by an edge if they have acted in a film together. We restrict attention to the giant connected component [16] of this graph, which includes ~90% of all actors listed in the Internet Movie Database (available at http://us.imdb.com), as of April 1997. For the power grid, vertices represent generators, transformers and substations, and edges represent high-voltage transmission lines between them. For C. elegans, an edge joins two neurons if they are connected by either a synapse or a gap junction. We treat all edges as undirected and unweighted, and all vertices as identical, recognizing that these are crude approximations. All three networks show the small-world phenomenon: L ≳ L_random but C ≫ C_random.

Figure 2 (plot of L(p)/L(0) and C(p)/C(0), from 0 to 1, against p on a logarithmic axis from 0.0001 to 1) Characteristic path length L(p) and clustering coefficient C(p) for the family of randomly rewired graphs described in Fig. 1. Here L is defined as the number of edges in the shortest path between two vertices, averaged over all pairs of vertices. The clustering coefficient C(p) is defined as follows. Suppose that a vertex v has $k_v$ neighbours; then at most $k_v(k_v - 1)/2$ edges can exist between them (this occurs when every neighbour of v is connected to every other neighbour of v). Let $C_v$ denote the fraction of these allowable edges that actually exist. Define C as the average of $C_v$ over all v. For friendship networks, these statistics have intuitive meanings: L is the average number of friendships in the shortest chain connecting two people; $C_v$ reflects the extent to which friends of v are also friends of each other; and thus C measures the cliquishness of a typical friendship circle. The data shown in the figure are averages over 20 random realizations of the rewiring process described in Fig. 1, and have been normalized by the values L(0), C(0) for a regular lattice. All the graphs have n = 1,000 vertices and an average degree of k = 10 edges per vertex. We note that a logarithmic horizontal scale has been used to resolve the rapid drop in L(p), corresponding to the onset of the small-world phenomenon. During this drop, C(p) remains almost constant at its value for the regular lattice, indicating that the transition to a small world is almost undetectable at the local level.


Watts & Strogatz, Nature 393 440-442 (1998)


Scale-Free Networks

• Degree distribution is described by a power law:

$P(k) \sim k^{-\gamma}$

- only a few nodes with lots of links

- but more nodes with an intermediate number of links than you’d expect in an ER random network

‘scale-free’: no representative average type of node, or characteristic scale of local connectivity distribution

Barabási A-L & Albert R. “Emergence of Scaling in Random Networks” Science 286 (509-512) 1999
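The contrast with an ER random network shows up most clearly in the tail of the degree distribution. The sketch below compares a Poisson distribution (the ER case) with a power law normalized over degrees 1 to k_max. The parameter values (mean degree 5, γ = 2.5, the k_max cut-off) are arbitrary illustrative choices, not fitted to any data set.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Degree distribution of an ER random graph with mean degree lam."""
    return exp(-lam) * lam**k / factorial(k)


def power_law_pmf(k, gamma, k_max=10_000):
    """P(k) proportional to k^(-gamma), normalized over k = 1..k_max."""
    norm = sum(j ** -gamma for j in range(1, k_max + 1))
    return k ** -gamma / norm


if __name__ == "__main__":
    for k in (1, 5, 10, 50):
        print(k, poisson_pmf(k, 5.0), power_law_pmf(k, 2.5))
```

At k = 50 the Poisson probability is vanishingly small while the power law still assigns an appreciable probability, which is why highly connected hubs are expected in scale-free networks but not in ER graphs.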


Example Scale-Free Networks

ing systems form a huge genetic network whose vertices are proteins and genes, the chemical interactions between them representing edges (2). At a different organizational level, a large network is formed by the nervous system, whose vertices are the nerve cells, connected by axons (3). But equally complex networks occur in social science, where vertices are individuals or organizations and the edges are the social interactions between them (4), or in the World Wide Web (WWW), whose vertices are HTML documents connected by links pointing from one page to another (5, 6). Because of their large size and the complexity of their interactions, the topology of these networks is largely unknown.

Traditionally, networks of complex topology have been described with the random graph theory of Erdős and Rényi (ER) (7), but in the absence of data on large networks, the predictions of the ER theory were rarely tested in the real world. However, driven by the computerization of data acquisition, such topological information is increasingly available, raising the possibility of understanding the dynamical and topological stability of large networks.

Here we report on the existence of a high degree of self-organization characterizing the large-scale properties of complex networks. Exploring several large databases describing the topology of large networks that span fields as diverse as the WWW or citation patterns in science, we show that, independent of the system and the identity of its constituents, the probability P(k) that a vertex in the network interacts with k other vertices decays as a power law, following $P(k) \sim k^{-\gamma}$. This result indicates that large networks self-organize into a scale-free state, a feature unpredicted by all existing random network models. To explain the origin of this scale invariance, we show that existing network models fail to incorporate growth and preferential attachment, two key features of real networks. Using a model incorporating these two ingredients, we show that they are responsible for the power-law scaling observed in real networks. Finally, we argue that these ingredients play an easily identifiable and important role in the formation of many complex systems, which implies that our results are relevant to a large class of networks observed in nature.

Although there are many systems that form complex networks, detailed topological data is available for only a few. The collaboration graph of movie actors represents a well-documented example of a social network. Each actor is represented by a vertex, two actors being connected if they were cast together in the same movie. The probability that an actor has k links (characterizing his or her popularity) has a power-law tail for large k, following $P(k) \sim k^{-\gamma_{\mathrm{actor}}}$, where $\gamma_{\mathrm{actor}} = 2.3 \pm 0.1$ (Fig. 1A). A more complex network with over 800 million vertices (8) is the WWW, where a vertex is a document and the edges are the links pointing from one document to another. The topology of this graph determines the Web’s connectivity and, consequently, our effectiveness in locating information on the WWW (5). Information about P(k) can be obtained using robots (6), indicating that the probability that k documents point to a certain Web page follows a power law, with $\gamma_{\mathrm{www}} = 2.1 \pm 0.1$ (Fig. 1B) (9). A network whose topology reflects the historical patterns of urban and industrial development is the electrical power grid of the western United States, the vertices being generators, transformers, and substations and the edges being the high-voltage transmission lines between them (10). Because of the relatively modest size of the network, containing only 4941 vertices, the scaling region is less prominent but is nevertheless approximated by a power law with an exponent $\gamma_{\mathrm{power}} \simeq 4$ (Fig. 1C). Finally, a rather large complex network is formed by the citation patterns of the scientific publications, the vertices being papers published in refereed journals and the edges being links to the articles cited in a paper. Recently Redner (11) has shown that the probability that a paper is cited k times (representing the connectivity of a paper within the network) follows a power law with exponent $\gamma_{\mathrm{cite}} = 3$.
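The exponents quoted above are read off log-log plots of P(k). As a rough illustration of how such an exponent can be recovered from data, the sketch below draws synthetic values from a continuous power law and estimates the exponent with the standard maximum-likelihood estimator, 1 + n / Σ ln(x_i / x_min). The estimator, the synthetic sample, and the function names `sample_power_law` and `estimate_gamma` are assumptions made for illustration; the excerpt does not say how the published exponents were fitted, and real degree data are discrete.

```python
import math
import random

def sample_power_law(gamma, x_min, n, seed=0):
    """Draw n values from a continuous power law p(x) ~ x^(-gamma), x >= x_min,
    by inverse-transform sampling."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]

def estimate_gamma(values, x_min):
    """Maximum-likelihood estimate of the exponent, using only values >= x_min."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

# Synthetic sample with the exponent reported for the actor network (2.3).
sample = sample_power_law(gamma=2.3, x_min=1.0, n=20000)
gamma_hat = estimate_gamma(sample, x_min=1.0)
print(f"estimated exponent: {gamma_hat:.2f}")
```

With 20,000 samples the estimate should land close to the value used to generate the data; on real networks the choice of x_min matters and is itself a modelling decision.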

The above examples (12) demonstrate that many large random networks share the common feature that the distribution of their local connectivity is free of scale, following a power law for large k with an exponent $\gamma$ between 2.1 and 4, which is unexpected within the framework of the existing network models. The random graph model of ER (7) assumes that we start with N vertices and connect each pair of vertices with probability p. In the model, the probability that a vertex has k edges follows a Poisson distribution $P(k) = e^{-\lambda} \lambda^{k} / k!$, where

$$\lambda = N \binom{N-1}{k} p^{k} (1-p)^{N-1-k}$$
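To make the ER construction concrete, the following sketch links every pair of N vertices with probability p and compares the resulting degree distribution with a Poisson distribution. It uses the textbook large-N approximation in which the Poisson mean is the average degree p(N − 1); that choice, the parameter values, and the function names are assumptions of this sketch rather than the paper's own expression for λ.

```python
import math
import random
from collections import Counter

def er_degrees(n, p, seed=0):
    """Degree of every vertex in an ER random graph: each pair is linked with probability p."""
    rng = random.Random(seed)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return degree

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean ** k / math.factorial(k)

n, p = 1000, 0.006
degrees = er_degrees(n, p)
mean_k = sum(degrees) / n          # expected value is p * (n - 1)
observed = Counter(degrees)

# Total variation distance between observed and Poisson distributions
# (summed over the observed degree range only).
tv = 0.5 * sum(abs(observed.get(k, 0) / n - poisson_pmf(k, mean_k))
               for k in range(max(degrees) + 1))
print(f"mean degree {mean_k:.2f}, largest degree {max(degrees)}, distance to Poisson {tv:.3f}")
```

The largest degree stays within a few standard deviations of the mean, which is the exponential suppression of highly connected vertices that the text contrasts with power-law tails.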

In the small-world model recently introduced by Watts and Strogatz (WS) (10), N vertices form a one-dimensional lattice, each vertex being connected to its two nearest and next-nearest neighbors. With probability p, each edge is reconnected to a vertex chosen at random. The long-range connections generated by this process decrease the distance between the vertices, leading to a small-world phenomenon (13), often referred to as six degrees of separation (14). For $p = 0$, the probability distribution of the connectivities is $P(k) = \delta(k - z)$, where z is the coordination number in the lattice; whereas for finite p, P(k) still peaks around z, but it gets broader (15). A common feature of the ER and WS models is that the probability of finding a highly connected vertex (that is, a large k) decreases exponentially with k; thus, vertices with large connectivity are practically absent. In contrast, the power-law tail characterizing P(k) for the networks studied indicates that highly connected (large k) vertices have a large chance of occurring, dominating the connectivity.
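The WS construction can be sketched in a few lines: build a ring lattice with coordination number z = 4 (two nearest and two next-nearest neighbors), rewire each edge with probability p, and measure the mean shortest-path length. The network size, the value p = 0.1, and the rule of averaging only over reachable pairs are assumptions chosen to keep the example small; they are not taken from the paper.

```python
import random
from collections import deque

def ring_lattice(n, z):
    """Ring of n vertices, each linked to its z nearest neighbours (z/2 per side)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, z // 2 + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj

def rewire(adj, z, p, seed=0):
    """With probability p, reconnect each lattice edge to a randomly chosen vertex."""
    rng = random.Random(seed)
    n = len(adj)
    new = {v: set(nb) for v, nb in adj.items()}
    for v in range(n):
        for step in range(1, z // 2 + 1):
            w = (v + step) % n
            if rng.random() < p:
                candidates = [t for t in range(n) if t != v and t not in new[v]]
                if candidates:
                    t = rng.choice(candidates)
                    new[v].discard(w)
                    new[w].discard(v)
                    new[v].add(t)
                    new[t].add(v)
    return new

def mean_path_length(adj):
    """Average shortest-path length over all reachable ordered pairs (breadth-first search)."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

lattice = ring_lattice(200, 4)
small_world = rewire(lattice, 4, p=0.1)
print("degrees at p = 0:", {len(nb) for nb in lattice.values()})
print("mean path length, p = 0:  ", round(mean_path_length(lattice), 2))
print("mean path length, p = 0.1:", round(mean_path_length(small_world), 2))
```

At p = 0 every vertex has exactly z links, matching the delta-function degree distribution; a small amount of rewiring shortens typical distances while the degree distribution stays narrowly peaked around z.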

There are two generic aspects of real networks that are not incorporated in these models. First, both models assume that we start with a fixed number (N) of vertices that are then randomly connected (ER model), or reconnected (WS model), without modifying N. In contrast, most real world networks are open and they form by the continuous addition of new vertices to the system, thus the number of vertices N increases throughout the lifetime of the network. For example, the actor network grows by the addition of new actors to the system, the WWW grows exponentially over time by the addition of new Web pages (8), and the research literature constantly grows by the publication of new papers. Consequently, a common feature of

Fig. 1. The distribution function of connectivities for various large networks. (A) Actor collaboration graph with $N = 212{,}250$ vertices and average connectivity $\langle k \rangle = 28.78$. (B) WWW, $N = 325{,}729$, $\langle k \rangle = 5.46$ (6). (C) Power grid data, $N = 4941$, $\langle k \rangle = 2.67$. The dashed lines have slopes (A) $\gamma_{\mathrm{actor}} = 2.3$, (B) $\gamma_{\mathrm{www}} = 2.1$ and (C) $\gamma_{\mathrm{power}} = 4$.

Science, Vol. 286, 15 October 1999, p. 510 (www.sciencemag.org)

Panels: Actor collaboration network, WWW, Power grid

Albert & Barabasi, 1999

61


Scale-free Protein Interaction Networks

Proteomics 2004, 4, 928–942 Protein interaction networks 931

Figure 2. Large-scale characteristics of the protein interaction databases. (a) Degree distribution of the four databases, shown on a log-log plot. Note that all datasets have a power law tail, indicating that the underlying network has a scale-free topology. The solid line is obtained from fitting the function $P(k) \sim k^{-\gamma}$ to the DIP data, the best fit indicating $\gamma \approx 2.5$ for the DIP data set. (b) Distribution of the clustering coefficient for the four studied databases shown on a log-log plot. The straight line has slope $-2$. (c) Cluster size distribution for the four databases shown on a log-log plot. Apart from the points corresponding to the giant component (far right) the P(n) curves follow a power law. The solid line is obtained from the least square fitting to $P(n) \sim n^{-\alpha}$ for the MIPS dataset, providing $\alpha = 3.4$.

lar biology [28], a series of studies have focused on identifying the biological modules in various cellular networks, ranging from the metabolism [15, 29–31] to genetic networks [2, 32]. Modularity assumes the existence of groups of proteins that work together to achieve some well-defined biological function. For example, it is experimentally well established that protein complexes that act as functional modules carry out many biological functions. From the network perspective these modules should appear as distinct groups of nodes that are highly interconnected with each other but have only a few links to nodes outside of the module. Yet, the scale-free topology apparently forbids the existence of independent modules in the network, as the hub proteins’ ability to interact with a high fraction of each module’s components makes a module’s relative isolation all but impossible. Recently, we proposed that the network’s scale-free topology can be reconciled with its potential modularity within the framework of a hierarchical modularity [15, 30, 33]. The most important test of such hierarchical modularity is the scaling of the clustering coefficient, C, defined as $C_i = 2n_i / [k_i(k_i - 1)]$ for each node i that has $k_i$ links, where $n_i$ denotes the number of direct links between the $k_i$ neighbors of node i. For the random and the scale-free model the clustering coefficient of a node with k links is independent of k, that is, on average hubs have the same clustering coefficient as small nodes do. In contrast, for a hierarchical network the clustering coefficient C(k) depends on the node’s degree as [15, 33–35]

$$C(k) \sim k^{-\beta} \qquad (1)$$

where $\beta$ is the modularity exponent characterizing the network’s hierarchical modularity. Therefore, the C(k) function, which can be measured for arbitrary networks [30], can provide direct evidence if the network has a hierarchical modularity. To test the organization of modularity in protein interaction networks we measured the C(k) function for each of the four studied protein network databases. As Fig. 2b shows, we find that C(k) is not independent of k, but can be well approximated by a power law with exponent $\beta \approx 2$, giving direct evidence of hierarchical modularity in protein interaction networks.
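The definition of the clustering coefficient above translates directly into code. The sketch below computes C_i = 2 n_i / [k_i (k_i − 1)] for every node of an undirected network and then averages it per degree to obtain C(k). The toy network and the convention C_i = 0 for nodes with fewer than two links are assumptions for illustration only.

```python
from collections import defaultdict

def clustering_coefficients(adj):
    """Return {node: C_i} with C_i = 2 n_i / (k_i (k_i - 1)); C_i is set to 0 when k_i < 2."""
    coeffs = {}
    for node, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            coeffs[node] = 0.0
            continue
        nb = list(neighbours)
        # n_i: number of direct links among the neighbours of this node.
        links = sum(1 for a in range(k) for b in range(a + 1, k) if nb[b] in adj[nb[a]])
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs

def clustering_by_degree(adj):
    """Average C_i over all nodes that share the same degree k, giving C(k)."""
    by_k = defaultdict(list)
    for node, c in clustering_coefficients(adj).items():
        by_k[len(adj[node])].append(c)
    return {k: sum(cs) / len(cs) for k, cs in sorted(by_k.items())}

# Hypothetical toy network: a triangle A-B-C with a pendant node D attached to A.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")]
graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)
adj = dict(graph)

coeffs = clustering_coefficients(adj)
c_of_k = clustering_by_degree(adj)
print(coeffs)
print(c_of_k)
```

In this toy case the best-connected node A has the lowest non-trivial clustering coefficient (1/3), while the degree-2 nodes have C = 1; plotting C(k) against k on log-log axes is how the scaling in Eq. 1 would be checked on a real data set.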

Another important property of the currently available protein interaction networks is that they are fragmented into many distinct clusters [11, 22, 26]. Indeed, we find that each of the four databases are dominated by a giant cluster that contains a significant fraction of all connected proteins, such that one can find a path of protein interactions between any two proteins belonging to this giant component. A small fraction of proteins, however, are either completely isolated (i.e., do not have any known interactions to other proteins) or form small islands of isolated groups of interconnected proteins. To characterize the fragmented nature of the protein interaction network we determined the size n of each isolated cluster, and prepared a normalized histogram of the results, obtaining the cluster size distribution. As Fig. 2c shows, each of the datasets have a giant component of approximately $10^3$ proteins. However, the giant component coexists with many isolated proteins, somewhat fewer two protein clus-

© 2004 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.de

Yook, Oltvai, Barabasi, 2004

Degree distribution

Clustering coefficient

Cluster sizes
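The cluster size distribution described in the excerpt (the size n of each isolated cluster, collected into a normalized histogram) can be sketched as a connected-components computation. The protein names and interactions below are hypothetical toy data, and the function names are illustrative.

```python
from collections import Counter, defaultdict

def cluster_sizes(nodes, edges):
    """Sizes of the connected components of an undirected graph, largest first."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    sizes = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)

def size_distribution(sizes):
    """Normalized histogram P(n): fraction of clusters that have size n."""
    counts = Counter(sizes)
    return {n: c / len(sizes) for n, c in sorted(counts.items())}

# Hypothetical toy data: one 4-protein cluster, one pair, two isolated proteins.
proteins = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P5", "P6")]

sizes = cluster_sizes(proteins, interactions)
print(sizes)                      # [4, 2, 1, 1]
print(size_distribution(sizes))   # {1: 0.5, 2: 0.25, 4: 0.25}
```

On a real interaction database the first entry of `sizes` would be the giant component, and the remaining entries would supply the power-law part of P(n).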

62


Hubs and Lethality

Barabasi, 2001

Proteins are traditionally identified on the basis of their individual actions as catalysts, signalling molecules, or building blocks in cells and microorganisms. But our post-genomic view is expanding the protein’s role into an element in a network of protein–protein interactions as well, in which it has a contextual or cellular function within functional modules (1, 2). Here we provide quantitative support for this idea by demonstrating that the phenotypic consequence of a single gene deletion in the yeast Saccharomyces cerevisiae is affected to a large extent by the topological position of its protein product in the complex hierarchical web of molecular interactions.

The S. cerevisiae protein–protein interaction network we investigate has 1,870 proteins as nodes, connected by 2,240 identified direct physical interactions, and is derived from combined, non-overlapping data (3, 4), obtained mostly by systematic two-hybrid analyses (3). Owing to its size, a complete map of the network (Fig. 1a), although informative, in itself offers little insight into its large-scale characteristics. Our first goal was therefore to identify the architecture of this network, determining whether it is best described by an inherently uniform exponential topology, with proteins on average possessing the same number of links, or by a highly heterogeneous scale-free topology, in which proteins have widely different connectivities (5).

As we show in Fig. 1b, the probability that a given yeast protein interacts with k other yeast proteins follows a power law (5) with an exponential cut-off (6) at $k_c \approx 20$, a topology that is also shared by the protein–protein interaction network of the bacterium Helicobacter pylori (7). This indicates that the network of protein interactions in two separate organisms forms a highly inhomogeneous scale-free network in which a few highly connected proteins play a central role in mediating interactions among numerous, less connected proteins.

An important known consequence of the inhomogeneous structure is the network’s simultaneous tolerance to random errors, coupled with fragility against the removal of the most connected nodes (8). We find that random mutations in the genome of S. cerevisiae, modelled by the removal of randomly selected yeast proteins, do not affect the overall topology of the network. By contrast, when the most connected proteins are computationally eliminated, the network diameter increases rapidly. This simulated tolerance against random mutation is in agreement with results from systematic mutagenesis experiments, which identified a striking capacity of yeast to tolerate the deletion of a substantial number of individual proteins from its proteome (9, 10). However, if this is indeed due to a topological component to error tolerance, then, on average, less connected proteins should prove to be less essential than highly connected ones.
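The contrast between random removal and targeted removal of hubs can be illustrated on a toy network. The sketch below is an assumption-laden stand-in for the analysis described above: the network is a hypothetical hub-and-leaf construction rather than the yeast interaction map, and it tracks the size of the largest connected component instead of the network diameter used in the paper.

```python
import random
from collections import defaultdict

def build_toy_network(n_hubs=5, leaves_per_hub=20):
    """Hypothetical hub-dominated network: fully linked hubs, each with its own leaves."""
    adj = defaultdict(set)
    hubs = [f"hub{i}" for i in range(n_hubs)]
    for i, h in enumerate(hubs):
        for other in hubs[i + 1:]:
            adj[h].add(other)
            adj[other].add(h)
        for j in range(leaves_per_hub):
            leaf = f"leaf{i}_{j}"
            adj[h].add(leaf)
            adj[leaf].add(h)
    return dict(adj)

def largest_component(adj, removed):
    """Size of the largest connected component after deleting the nodes in `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best

network = build_toy_network()
n_remove = 3

# Targeted removal: delete the most connected nodes.
by_degree = sorted(network, key=lambda v: len(network[v]), reverse=True)
after_attack = largest_component(network, by_degree[:n_remove])

# Random removal: delete the same number of randomly chosen nodes, averaged over trials.
rng = random.Random(0)
trials = [largest_component(network, rng.sample(sorted(network), n_remove))
          for _ in range(50)]
after_random = sum(trials) / len(trials)

print("intact network:", largest_component(network, []))
print("after removing 3 hubs:", after_attack)
print("after removing 3 random nodes (mean):", round(after_random, 1))
```

Random deletions mostly hit leaves and leave the network almost intact, whereas deleting three hubs strands their leaves and shrinks the connected core to 42 of the original 105 nodes.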

To test this, we rank-ordered all interacting proteins based on the number of links they have, and correlated this with the phenotypic effect of their individual removal from the yeast proteome. As shown in Fig. 1c, the likelihood that removal of a protein will prove lethal correlates with the number of interactions the protein has. For example, although proteins with five or fewer links constitute about 93% of the total number of proteins, we find that only about 21% of them are essential. By contrast, only some 0.7% of the yeast proteins with known phenotypic profiles have more than 15 links, but single deletion of 62% or so of these proves lethal. This implies that highly connected proteins with a central role in the network’s architecture are three times more likely to be essential than proteins with only a small number of links to other proteins.

The simultaneous emergence of an inhomogeneous structure in both metabolic (5, 11) and protein interaction networks suggests that there has been evolutionary selection of a common large-scale structure of biological networks and indicates that future systematic protein–protein interaction studies in other organisms will uncover an essentially identical protein-network topology. The correlation between the connectivity and indispensability of a given protein confirms that, despite the importance of individual biochemical function and genetic redundancy, the robustness against mutations in yeast is also derived from the organization of interactions and the topological positions of individual proteins (12). A better understanding of cell dynamics and robustness will be obtained from an integrated approach that simultaneously incorporates the individual and contextual properties of all constituents in complex cellular networks.

H. Jeong*, S. P. Mason†, A.-L. Barabási*, Z. N. Oltvai†

*Department of Physics, University of Notre Dame, Notre Dame, Indiana 46556, USA

e-mail: [email protected], [email protected]

Brief Communications. Nature, Vol. 411, 3 May 2001, p. 41 (www.nature.com)

Lethality and centrality in protein networks

The most highly connected proteins in the cell are the most important for its survival.

Figure 1 Characteristics of the yeast proteome. a, Map of protein–protein interactions. The largest cluster, which contains ~78% of all proteins, is shown. The colour of a node signifies the phenotypic effect of removing the corresponding protein (red, lethal; green, non-lethal; orange, slow growth; yellow, unknown). b, Connectivity distribution P(k) of interacting yeast proteins, giving the probability that a given protein interacts with k other proteins. The exponential cut-off (6) indicates that the number of proteins with more than 20 interactions is slightly less than expected for pure scale-free networks. In the absence of data on the link directions, all interactions have been considered as bidirectional. The parameter controlling the short-length scale correction has value $k_0 \approx 1$. c, The fraction of essential proteins with exactly k links versus their connectivity, k, in the yeast proteome. The list of 1,572 mutants with known phenotypic profile was obtained from the Proteome database (13). Detailed statistical analysis, including $r \approx 0.75$ for Pearson’s linear correlation coefficient, demonstrates a positive correlation between lethality and connectivity. For additional details, see http://www.nd.edu/~networks/cell.

© 2001 Macmillan Magazines Ltd

63


Date Hubs & Party Hubs

Different proteins are expressed at different times and places - not all interactions need occur at the same state/time/place


64


Network Motifs

• Motifs: small patterns of interactions that occur more frequently than random in biological networks

• Different motifs may play different information-processing roles

• Count the number of occurrences of each possible subnetwork containing 2, 3, 4, ... nodes, and compare to randomized networks
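The counting step in the last bullet can be sketched for one specific three-node pattern, the feed-forward loop (X→Y, Y→Z, X→Z). The brute-force enumeration below is an illustrative assumption, suitable only for small networks; it counts every ordered triple that contains the three edges, whether or not extra edges exist among the same nodes, so its totals may differ from those of a published motif-detection tool.

```python
from itertools import permutations

def count_feedforward_loops(edges):
    """Count ordered node triples (X, Y, Z) with edges X->Y, Y->Z and X->Z."""
    edge_set = set(edges)
    nodes = sorted({n for e in edge_set for n in e})
    return sum(
        1
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    )

# Hypothetical toy directed network with one feed-forward loop and one extra edge.
toy_edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]
print(count_feedforward_loops(toy_edges))  # 1
```

The same count would then be repeated on randomized versions of the network to judge whether the observed number is unusually high.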

65


Searching for Network Motifs

Milo R. et al “Network Motifs: Simple Building Blocks of Complex Networks” Science 298 p824-827 (2002)


Network Motifs: Simple Building Blocks of Complex Networks

R. Milo,1 S. Shen-Orr,1 S. Itzkovitz,1 N. Kashtan,1 D. Chklovskii,2 U. Alon1*

Complex networks are studied across many fields of science. To uncover their structural design principles, we defined “network motifs,” patterns of interconnections occurring in complex networks at numbers that are significantly higher than those in randomized networks. We found such motifs in networks from biochemistry, neurobiology, ecology, and engineering. The motifs shared by ecological food webs were distinct from the motifs shared by the genetic networks of Escherichia coli and Saccharomyces cerevisiae or from those found in the World Wide Web. Similar motifs were found in networks that perform information processing, even though they describe elements as different as biomolecules within a cell and synaptic connections between neurons in Caenorhabditis elegans. Motifs may thus define universal classes of networks. This approach may uncover the basic building blocks of most networks.

Many of the complex networks that occur in nature have been shown to share global statistical features (1–10). These include the “small world” property (1–9) of short paths between any two nodes and highly clustered connections. In addition, in many natural networks, there are a few nodes with many more connections than the average node has. In these types of networks, termed “scale-free networks” (4, 6), the fraction of nodes having k edges, p(k), decays as a power law $p(k) \sim k^{-\gamma}$ (where $\gamma$ is often between 2 and 3). To go beyond these global features would require an understanding of the basic structural elements particular to each class of networks (9). To do this, we developed an algorithm for detecting network motifs: recurring, significant patterns of interconnections. A detailed application to a gene regulation network has been presented (11). Related methods were used to test hypotheses on social networks (12, 13). Here we generalize this approach to virtually any type of connectivity graph and find the striking appearance of

1Departments of Physics of Complex Systems and Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel 76100. 2Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA.

*To whom correspondence should be addressed. E-mail: [email protected]

Fig. 1. (A) Examples of interactions represented by directed edges between nodes in some of the networks used for the present study. These networks go from the scale of biomolecules (transcription factor protein X binds regulatory DNA regions of a gene to regulate the production rate of protein Y), through cells (neuron X is synaptically connected to neuron Y), to organisms (X feeds on Y). (B) All 13 types of three-node connected subgraphs.


66


Searching For Network Motifs

motifs in networks representing a broad range of natural phenomena.

We started with networks where the interactions between nodes are represented by directed edges (Fig. 1A). Each network was scanned for all possible n-node subgraphs (in the present study, n = 3 and 4), and the number of occurrences of each subgraph was recorded. Each network contains numerous types of n-node subgraphs (Fig. 1B). To focus on those that are likely to be important, we compared the real network to suitably randomized networks (12–16) and only selected patterns appearing in the real network at numbers significantly higher than those in the randomized networks (Fig. 2). For a stringent comparison, we used randomized networks that have the same single-node characteristics as does the real network: Each node in the randomized networks has the same number of incoming and outgoing edges as the corresponding node has in the real network. The comparison to this randomized ensemble accounts for patterns that appear only because of the single-node characteristics of the network (e.g., the presence of nodes with a large number of edges). Furthermore, the randomized networks used to calculate the significance of n-node subgraphs were generated to preserve the same number of appearances of all (n – 1)-node subgraphs as in the real network (17, 18). This ensures that a high significance was not assigned to a pattern only because it has a highly significant subpattern. The “network motifs” are those patterns for which the probability P of appearing in a randomized network an equal or greater number of times than in the real network is lower than a cutoff value (here P = 0.01). Patterns that are functionally important but not statistically significant could exist, which would be missed by our approach.
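The comparison against randomized networks can be sketched as follows. The randomization here uses repeated edge swaps (a→b, c→d becomes a→d, c→b), which keeps every node's number of incoming and outgoing edges; this is a common technique but an assumption of the sketch, and unlike the procedure described above it does not also preserve the counts of (n – 1)-node subgraphs. The toy network, the number of randomizations, and all function names are hypothetical.

```python
import random
from itertools import permutations

def count_ffl(edges):
    """Count ordered triples (X, Y, Z) with edges X->Y, Y->Z and X->Z."""
    edge_set = set(edges)
    nodes = sorted({n for e in edge_set for n in e})
    return sum(1 for x, y, z in permutations(nodes, 3)
               if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set)

def randomize(edges, rng, swaps_per_edge=10):
    """Shuffle a directed network by edge swaps, keeping every node's in- and out-degree."""
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(swaps_per_edge * len(edge_list)):
        i, j = rng.randrange(len(edge_list)), rng.randrange(len(edge_list))
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b or (a, d) in edge_set or (c, b) in edge_set:
            continue  # swap would create a self-loop or a duplicate edge
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list

def empirical_p_value(edges, n_random=200, seed=0):
    """Fraction of randomized networks with at least as many feed-forward loops as the real one."""
    rng = random.Random(seed)
    real = count_ffl(edges)
    hits = sum(1 for _ in range(n_random) if count_ffl(randomize(edges, rng)) >= real)
    return real, hits / n_random

# Hypothetical toy network: four separate feed-forward loops on 12 nodes.
real_edges = []
for g in range(4):
    x, y, z = 3 * g, 3 * g + 1, 3 * g + 2
    real_edges += [(x, y), (y, z), (x, z)]

n_real, p_value = empirical_p_value(real_edges)
print("feed-forward loops in the toy network:", n_real)
print("empirical P:", p_value)
```

A pattern would be called a motif under the criterion quoted above when this empirical P falls below the chosen cutoff.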

We applied the algorithm to several networks from biochemistry (transcriptional gene regulation), ecology (food webs), neurobiology (neuron connectivity), and engineering (electronic circuits, World Wide Web). The network motifs found are shown in Table 1. Transcription networks are biochemical networks responsible for regulating the expression of genes in cells (11, 19). These are directed graphs, in which the nodes represent genes (Fig. 1A). Edges are directed from a gene that encodes for a transcription factor protein to a gene transcriptionally regulated by that transcription factor. We analyzed the two best characterized transcriptional regulation networks, corresponding to organisms from different kingdoms: a eukaryote (the yeast Saccharomyces cerevisiae) (20) and a bacterium (Escherichia coli) (11, 19). The two transcription networks show the same motifs: a three-node motif termed “feedforward loop” (11) and a four-node motif termed “bi-fan.” These motifs appear numerous times in each network (Table 1), in nonhomologous gene systems that perform diverse biological functions. The number of times they appear is more than 10 standard deviations greater than their mean number of appearances in randomized networks. Only these subgraphs, of the 13 possible different three-node subgraphs (Fig. 1B) and 199 different four-node subgraphs, are significant and are therefore considered network motifs. Many other three- and four-node subgraphs recur throughout the networks, but at numbers that are less than the mean plus 2 standard deviations of their appearance in randomized networks.
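The "standard deviations greater than the mean" measure in this paragraph is the Z score defined in the Table 1 caption, Z = (Nreal – Nrand)/SD. A minimal sketch, using the E. coli entries of Table 1 as inputs (the function name is illustrative):

```python
def z_score(n_real, n_rand_mean, n_rand_sd):
    """Z = (Nreal - Nrand) / SD, following the definition in the Table 1 caption."""
    return (n_real - n_rand_mean) / n_rand_sd

# E. coli feed-forward loop: Nreal = 40, Nrand = 7 ± 3.
print(z_score(40, 7, 3))      # 11.0
# E. coli bi-fan: Nreal = 203, Nrand = 47 ± 12.
print(z_score(203, 47, 12))   # 13.0
```

The table lists Z = 10 for the feed-forward loop; the small difference from 11.0 is consistent with the caption's note that the tabulated values are rounded.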

We next applied the algorithm to ecosystem food webs (21, 22), in which nodes represent groups of species. Edges are directed from a node representing a predator to the node representing its prey. We analyzed data collected by different groups at seven distinct ecosystems (22), including both aquatic and terrestrial habitats. Each of the food webs displayed one or two three-node network motifs and one to five four-node network motifs. One can define the “consensus motifs” as the motifs shared by networks of a given type. Five of the seven food webs shared one three-node motif, and all seven shared one four-node motif (Table 1). In contrast to the three-node motif (termed “three chain”), the three-node feedforward loop was underrepresented in the food webs. This suggests that direct interactions between species at a separation of two layers [as in the case of omnivores (23)] are selected against. The bi-parallel motif indicates that two species that are prey of the same predator both tend to share the same prey. Both network motifs may thus represent general tendencies of food webs (21, 22).

We next studied the neuronal connectivity network of the nematode Caenorhabditis elegans (24). Nodes represent neurons (or neuron

Fig. 2. Schematic view of network motif detection. Network motifs are patterns that recur much more frequently (A) in the real network than (B) in an ensemble of randomized networks. Each node in the randomized networks has the same number of incoming and outgoing edges as does the corresponding node in the real network. Red dashed lines indicate edges that participate in the feedforward loop motif, which occurs five times in the real network.

[Plot: concentration of feedforward loop (y-axis, 0 to 0.015) versus subnetwork size (x-axis, 150 to 400); curves: Real, Random]

Fig. 3. Concentration C of the feedforward loop motif in real and randomized subnetworks of the E. coli transcription network (11). C is the number of appearances of the motif divided by the total number of appearances of all connected three-node subgraphs (Fig. 1B). Subnetworks of size S were generated by choosing a node at random and adding to it nodes connected by an incoming or outgoing edge, until S nodes were obtained, and then including all of the edges between these S nodes present in the full network. Each of the subnetworks was randomized (17, 18) (shown are mean and SD of 400 subnetworks of each size).
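The subnetwork sampling procedure in the caption can be sketched as below. The caption does not say which neighbor is added at each step, so this sketch assumes a uniformly random choice among all nodes adjacent (by an incoming or outgoing edge) to the nodes collected so far; the toy network and the function name are hypothetical.

```python
import random

def sample_subnetwork(edges, size, seed=0):
    """Grow a node set from a random start node by following incoming or outgoing
    edges until `size` nodes are collected, then keep all edges among those nodes."""
    rng = random.Random(seed)
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    start = rng.choice(sorted(neighbours))
    chosen = {start}
    frontier = set(neighbours[start]) - chosen
    while len(chosen) < size and frontier:
        node = rng.choice(sorted(frontier))
        chosen.add(node)
        frontier |= neighbours[node]
        frontier -= chosen
    sub_edges = [(u, v) for u, v in edges if u in chosen and v in chosen]
    return chosen, sub_edges

# Hypothetical toy network: a directed chain 0 -> 1 -> ... -> 10.
chain = [(i, i + 1) for i in range(10)]
nodes, sub_edges = sample_subnetwork(chain, size=5)
print(sorted(nodes), sub_edges)
```

Repeating the sampling for many start nodes and sizes, and computing the motif concentration in each sample and in its randomized counterpart, would reproduce the kind of comparison shown in Fig. 3.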


Milo R. et al “Network Motifs: Simple Building Blocks of Complex Networks” Science 298 p824-827 (2002)

67


Motifs in Networks

classes), and edges represent synaptic connections between the neurons. We found the feedforward loop motif in agreement with anatomical observations of triangular connectivity structures (24). The four-node motifs include the bi-fan and the bi-parallel (Table 1). Two of these motifs (feedforward loop and bi-fan) were also found in the transcriptional gene regulation networks. This similarity in motifs may point to a fundamental similarity in the design constraints of the two types of networks. Both networks function to carry information from sensory components (sensory neurons/transcription factors regulated by biochemical signals) to effectors (motor neurons/structural genes). The feedforward loop motif common to both types of networks may play a functional role in information processing. One possible function of this circuit is to activate output only if the input signal is persistent and to allow a rapid deactivation when the input goes off (11). Indeed, many of the input nodes in the neural feedforward loops are sensory neurons, which may require this type of information processing to reject transient input fluctuations that are inherent in a variable or noisy environment.
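The proposed function of the feedforward loop (respond only to persistent input, switch off quickly) can be illustrated with a discrete-time toy model. Everything in this sketch is an assumption made for illustration: Y is modelled as switching on only after X has been on for `delay` consecutive steps, and the output Z requires both X and Y (AND logic). It is not the kinetic model of reference (11).

```python
def simulate_ffl(x_signal, delay=3):
    """Toy coherent feed-forward loop with AND logic.

    Y turns on only after X has been on for `delay` consecutive steps;
    Z is on when both X and Y are on.
    """
    z_output = []
    x_run = 0  # number of consecutive steps X has been on
    for x in x_signal:
        x_run = x_run + 1 if x else 0
        y = x_run >= delay
        z_output.append(int(bool(x) and y))
    return z_output

brief_pulse = [0, 1, 1, 0, 0, 0, 0, 0]
long_pulse = [0, 1, 1, 1, 1, 1, 0, 0]
print(simulate_ffl(brief_pulse))  # [0, 0, 0, 0, 0, 0, 0, 0]
print(simulate_ffl(long_pulse))   # [0, 0, 0, 1, 1, 1, 0, 0]
```

The brief pulse never reaches the output, while the long pulse does after a delay, and the output drops in the same step that the input goes off.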

We also studied several technological networks. We analyzed the ISCAS89 benchmark set of sequential logic electronic circuits (7, 25). The nodes in these circuits represent logic gates and flip-flops. These nodes are linked by directed edges. We found that the motifs separate the circuits into classes that correspond to the circuit’s functional description. In Table 1, we present two classes, consisting of five forward-logic chips and three digital fractional multipliers. The digital fractional multipliers share three motifs, including three- and four-node feedback loops. The forward logic chips share the feedforward loop, bi-fan, and bi-parallel motifs, which are similar to the motifs found in the genetic and neuronal information-processing networks. We found a different set of motifs in a network of directed hyperlinks between World Wide Web pages within a single domain (4). The World Wide Web motifs may reflect a design aimed at short paths between related pages. Application of our approach to nondirected networks shows distinct sets of motifs in networks of protein interactions and Internet router connections (18).

None of the network motifs shared by the food webs matched the motifs found in the gene regulation networks or the World Wide Web. Only one of the food web consensus motifs also appeared in the neuronal network. Different motif sets were found in electronic circuits with different functions. This suggests that motifs can define broad classes of networks, each with specific types of elementary structures. The motifs reflect the underlying processes that generated each type of network; for example, food webs evolve to allow a flow of energy from the bottom to the top of food chains, whereas gene regulation and neuron networks evolve to process information. Information processing seems to give rise to significantly different structures than does energy flow.

We further characterized the statistical significance of the motifs as a function of network size, by considering pieces of various sizes (subnetworks) of the full network. The concentration of motifs in the subnetworks is about the same as that in the full network (Fig. 3). In contrast, the concentration of the corresponding subgraphs in the randomized versions of the subnetworks decreases sharply with size. In analogy with statistical physics, the number of appearances of each motif in the real networks

Table 1. Network motifs found in biological and technological networks. The numbers of nodes and edgesfor each network are shown. For each motif, the numbers of appearances in the real network (Nreal) andin the randomized networks (Nrand! SD, all values rounded) (17, 18) are shown. The P value of all motifsis P " 0.01, as determined by comparison to 1000 randomized networks (100 in the case of the WorldWide Web). As a qualitative measure of statistical significance, the Z score # (Nreal – Nrand)/SD is shown.NS, not significant. Shown are motifs that occur at least U # 4 times with completely different sets ofnodes. The networks are as follows (18): transcription interactions between regulatory proteins and genesin the bacterium E. coli (11) and the yeast S. cerevisiae (20); synaptic connections between neurons inC. elegans, including neurons connected by at least five synapses (24); trophic interactions in ecologicalfood webs (22), representing pelagic and benthic species (Little Rock Lake), birds, fishes, invertebrates(Ythan Estuary), primarily larger fishes (Chesapeake Bay), lizards (St. Martin Island), primarily inverte-brates (Skipwith Pond), pelagic lake species (Bridge Brook Lake), and diverse desert taxa (CoachellaValley); electronic sequential logic circuits parsed from the ISCAS89 benchmark set (7, 25), where nodesrepresent logic gates and flip-flops (presented are all five partial scans of forward-logic chips and threedigital fractional multipliers in the benchmark set); and World Wide Web hyperlinks between Web pagesin a single domain (4) (only three-node motifs are shown). e, multiplied by the power of 10 (e.g., 1.46e6# 1.46$ 106).

*Has additional four-node motif: (X→Z, W; Y→Z, W; Z→W), Nreal = 150, Nrand = 85 ± 15, Z = 4.
†Has additional four-node motif: (X→Y, Z; Y→Z; Z→W), Nreal = 204, Nrand = 80 ± 20, Z = 6. The three-node pattern (X→Y, Z; Y→Z; Z→Y) also occurs significantly more than at random. It is not a motif by the present definition because it does not appear with completely distinct sets of nodes more than U = 4 times.
‡Has additional four-node motif: (X→Y; Y→Z, W; Z→X; W→X), Nreal = 914, Nrand = 500 ± 70, Z = 6.
§Has two additional three-node motifs: (X→Y, Z; Y→Z; Z→Y), Nreal = 3e5, Nrand = 1.4e3 ± 6e1, Z = 6000, and (X→Y, Z; Y→Z), Nreal = 5e5, Nrand = 9e4 ± 1.5e3, Z = 250.

Each block lists, per motif: Nreal, Nrand ± SD, Z score. (Motif diagrams on nodes X, Y, Z, W omitted.)

Gene regulation (transcription)
                                   Feed-forward loop             Bi-fan
Network          Nodes    Edges    Nreal  Nrand ± SD   Z score   Nreal  Nrand ± SD   Z score
E. coli          424      519      40     7 ± 3        10        203    47 ± 12      13
S. cerevisiae*   685      1,052    70     11 ± 4       14        1812   300 ± 40     41

Neurons
                                   Feed-forward loop             Bi-fan                        Bi-parallel
Network          Nodes    Edges    Nreal  Nrand ± SD   Z score   Nreal  Nrand ± SD   Z score   Nreal  Nrand ± SD   Z score
C. elegans†      252      509      125    90 ± 10      3.7       127    55 ± 13      5.3       227    35 ± 10      20

Food webs
                                   Three chain                   Bi-parallel
Network          Nodes    Edges    Nreal  Nrand ± SD   Z score   Nreal  Nrand ± SD   Z score
Little Rock      92       984      3219   3120 ± 50    2.1       7295   2220 ± 210   25
Ythan            83       391      1182   1020 ± 20    7.2       1357   230 ± 50     23
St. Martin       42       205      469    450 ± 10     NS        382    130 ± 20     12
Chesapeake       31       67       80     82 ± 4       NS        26     5 ± 2        8
Coachella        29       243      279    235 ± 12     3.6       181    80 ± 20      5
Skipwith         25       189      184    150 ± 7      5.5       397    80 ± 25      13
B. Brook         25       104      181    130 ± 7      7.4       267    30 ± 7       32

Electronic circuits (forward logic chips)
                                   Feed-forward loop             Bi-fan                        Bi-parallel
Network          Nodes    Edges    Nreal  Nrand ± SD   Z score   Nreal  Nrand ± SD   Z score   Nreal  Nrand ± SD   Z score
s15850           10,383   14,240   424    2 ± 2        285       1040   1 ± 1        1200      480    2 ± 1        335
s38584           20,717   34,204   413    10 ± 3       120       1739   6 ± 2        800       711    9 ± 2        320
s38417           23,843   33,661   612    3 ± 2        400       2404   1 ± 1        2550      531    2 ± 2        340
s9234            5,844    8,197    211    2 ± 1        140       754    1 ± 1        1050      209    1 ± 1        200
s13207           8,651    11,831   403    2 ± 1        225       4445   1 ± 1        4950      264    2 ± 1        200

Electronic circuits (digital fractional multipliers)
                                   Three-node feedback loop      Bi-fan                        Four-node feedback loop
Network          Nodes    Edges    Nreal  Nrand ± SD   Z score   Nreal  Nrand ± SD   Z score   Nreal  Nrand ± SD   Z score
s208             122      189      10     1 ± 1        9         4      1 ± 1        3.8       5      1 ± 1        5
s420             252      399      20     1 ± 1        18        10     1 ± 1        10        11     1 ± 1        11
s838‡            512      819      40     1 ± 1        38        22     1 ± 1        20        23     1 ± 1        25

World Wide Web
                                   Feedback with two mutual dyads   Fully connected triad         Uplinked mutual dyad
Network          Nodes    Edges    Nreal  Nrand ± SD   Z score      Nreal  Nrand ± SD   Z score   Nreal  Nrand ± SD   Z score
nd.edu§          325,729  1.46e6   1.1e5  2e3 ± 1e2    800          6.8e6  5e4 ± 4e2    15,000    1.2e6  1e4 ± 2e2    5000

Milo R. et al., Science 298, p. 826 (25 October 2002), www.sciencemag.org
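The Z score in the table caption is simple enough to recompute by hand. A minimal sketch, assuming only the caption's definition Z = (Nreal – Nrand)/SD; the function name is hypothetical. Because the published Nreal, Nrand and SD are rounded, values recomputed this way need not match the Z column exactly.

```python
def motif_z_score(n_real, n_rand_mean, n_rand_sd):
    """Z = (Nreal - Nrand) / SD, following the table caption."""
    if n_rand_sd <= 0:
        raise ValueError("SD of the randomized counts must be positive")
    return (n_real - n_rand_mean) / n_rand_sd


# Rounded counts taken from the E. coli row of the table
print(motif_z_score(40, 7, 3))     # feed-forward loop
print(motif_z_score(203, 47, 12))  # bi-fan
```

For the feed-forward loop this gives 11 from the rounded inputs, against a published Z of 10, which illustrates the rounding caveat.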


Graph Theory & Networks References

Gary Chartrand “Introductory Graph Theory” Dover 1977

Albert-László Barabási & Zoltán N. Oltvai. “Network Biology: Understanding the cell’s functional organization” Nature Reviews Genetics 5, 101-114 (Feb 2004)

Albert-László Barabási. “Linked” (Nice easy read.)

Réka Albert & Albert-László Barabási. “Statistical mechanics of complex networks” Reviews of Modern Physics 74, 47-97 (January 2002) (Nice, not so easy read.)


5. Accessing & Visualizing Interaction Data

1. Types of interaction data
2. Interaction Databases
3. Integrating other biological data
4. Cytoscape for integrating biological data
5. Pajek for graph analysis


more than just protein interactions...

• Genetic interactions (epistatic, synthetic lethal)

• Coexpression

• Small-molecule/protein interactions

• Chemical-genetic interactions

• Chemical reactions

• Transcription factor binding (ChIP)

• Localization

• etc...

• Quantitative data:

• transcripts, proteins, metabolites...


Major Public Interaction Databases

Database   Institution (PI)         Access                    Notes                                                URL
BIND       Blueprint (Hogue)        Free                      complex data model; extensive curation               http://www.blueprint.org
DIP        UCLA (Eisenberg)         Free only for academics   curated                                              http://dip.doe-mbi.ucla.edu/
GRID       MSHRI (Tyers)            Free                      simple, easy-to-use format; high throughput only     http://biodata.mshri.on.ca/grid/servlet/Index
INTACT     EBI (Apweiler)           Free                                                                           http://www.ebi.ac.uk/intact/index.html
MINT       Rome                     Free                                                                           http://mint.bio.uniroma2.it/mint/index.php
HPRD       Hopkins/IOB (Pandey)     Free only for academics   Human data                                           http://www.hprd.org

many databases are limited to protein-protein interaction data


Commercial Sources

• Ingenuity

• Database of mainly mammalian functional interactions.

• Analysis tools are useful, but very focussed on interpreting microarray data.

• high cost (but discounted for academics)

Ingenuity Pathways Analysis, Application Note 0104 (page 5)

The functional analysis and the molecular relationships provided in Network 12 suggest that circadian phased expression of nuclear hormone receptors affects circadian regulation of lipid metabolism. This hypothesis is further explored by examining the well-characterized glycerolipid biosynthesis pathway in its entirety, and seeing which of the enzymes in that pathway have an expression pattern that correlates with circadian phasing.

Clicking on the Pathway Tag icon in the Network Explorer bar for Network 12 provides a direct link to a graphical representation of the canonical metabolic pathway “Glycerolipid Metabolism”. As seen in Figure 6, this graphic shows all of the genes in the user-defined input list that play a role in Glycerolipid Metabolism, their corresponding Enzyme Class (EC) numbers, and the Ingenuity Pathways Analysis networks that gene is involved in.

By providing seamless navigation between the glycerolipid metabolism pathway and Network 12, Ingenuity Pathways Analysis enables users to quickly answer additional questions about the network. Specifically, users can address the question of therapeutic relevance of this network by activating the Drug View icon in the Network Explorer toolbar. As displayed in Figure 7, Network 12 contains several targets of FDA approved drugs used in the treatment of cholesterol disorders, adding additional relevance to this circadianly regulated network.

Figure 5: Coordinate regulation of metabolic enzymes. Network 12 identifies the functional relationship between circadianly regulated metabolic enzymes MGLL and LPL (diamond shape) and the nuclear hormone receptor PPARA (rectangle shape). The arrowhead reflects the directionality of the relationships (PPARA acts on MGLL and LPL).

Figure 6: Circadian regulation of glycerolipid metabolism enzymes. The coloring scheme is identical to that of Network 12 (circadianly regulated Focus Genes are green). Membership of individual genes in enzyme classes was established using the LIGAND database 5.


BIND Data Policy

• Source Code

• BIND source code is available at SourceForge.net under the terms and agreements of the GNU General Public License (GPL).

• Data

• BIND data is free for both commercial and academic use. If you use BIND data, please cite:

• Bader GD, Betel D, Hogue CW. (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 31(1):248-50 PMID: 12519993

• This data is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

BIND + NCBI RefSeq: Biomolecular Assembly – the “edge”

Bader, Hogue 2002


Modeling Protein complex data as interactions

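Possible models: matrix, spoke

[Diagram: a set of co-purified proteins A, B, C, D, E, redrawn as pairwise interactions under the matrix model and under the spoke model]

Topology and number of complexes remain unknown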
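The two models can be sketched in a few lines. This is an illustration only, assuming the usual definitions: the spoke model pairs the bait with each co-purified protein, and the matrix model pairs every protein in the purification with every other. Treating A as the bait is an assumption made for the example; the slide does not say which protein was tagged, and the function names are hypothetical.

```python
from itertools import combinations


def spoke_model(bait, preys):
    """Bait-to-prey pairs only (one 'spoke' per prey)."""
    return {frozenset((bait, p)) for p in preys if p != bait}


def matrix_model(proteins):
    """All pairwise combinations of the co-purified proteins."""
    return {frozenset(pair) for pair in combinations(sorted(set(proteins)), 2)}


purification = ["A", "B", "C", "D", "E"]   # proteins seen in one pull-down
spoke = spoke_model("A", purification[1:])  # assumes A is the bait
matrix = matrix_model(purification)

print(len(spoke), "spoke edges;", len(matrix), "matrix edges")
```

For n proteins the spoke model yields n - 1 edges and the matrix model n(n - 1)/2, so the choice of model strongly affects how many pairwise interactions a complex data set appears to contain.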


Data

• At present, different databases contain non-overlapping data - need to collect data from multiple sources

• Emerging standards and consortia: PSI (psidev.sourceforge.net) and Biopax (www.biopax.org) will eventually facilitate synchronization


Visualization

• Informative layouts

• Integration of many interaction types

• Integration of state, function data

• Exploration

• Filtering biologically interesting subgraphs

• Network vs. matrix


Visualization Tools

Pajek http://vlado.fmf.uni-lj.si/pub/networks/pajek/

Cytoscape www.cytoscape.org

Osprey biodata.mshri.on.ca/osprey


Cytoscape demo


6. Assessing and Predicting Interactions

1. Supervised Classification

2. Rating confidence in interactions
   • Statistical methods
   • Graph-theoretic methods

3. Predicting interactions
   • literature mining
   • interologs
   • integrating functional & genomic data: phylogenetic profiles, gene fusion, coexpression, GO, localization
   • protein-protein docking


Classification

• Rating confidence in interactions and predicting novel interactions both pose a classification problem

• Classification:

• multiple inputs, x - ‘feature vectors’

• single discrete output y

• predict output from future inputs

• Supervised learning:

• Train classifier using known positive and negative examples

• apply standard methods: naive Bayes, support vector machine, logistic regression


Naive Bayes Classifier

• Assumes that all features are independent, given class labels

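• Calculate the probability (i.e. of two proteins interacting) based on each feature separately, and then just multiply them together to get the overall probability

$p(x \mid y) = \prod_{i=1}^{n} p(x_i \mid y)$

[Diagram: graphical model with the class label y as the single parent of the features x1, x2, ..., xn]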
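The factorization above translates directly into code. A minimal sketch for binary features, assuming add-one (Laplace) smoothing and a toy training set; the feature meanings, function names and data are all hypothetical, and both classes must appear in the training labels.

```python
import math


def train_naive_bayes(examples, labels, alpha=1.0):
    """Estimate p(y) and p(x_i = 1 | y) from 0/1 feature tuples and 0/1 labels."""
    n_features = len(examples[0])
    class_counts = {0: 0, 1: 0}
    feature_counts = {0: [0] * n_features, 1: [0] * n_features}
    for x, y in zip(examples, labels):
        class_counts[y] += 1
        for i, xi in enumerate(x):
            feature_counts[y][i] += xi
    model = {}
    for y in (0, 1):
        prior = class_counts[y] / len(labels)
        # Smoothed estimate of p(x_i = 1 | y)
        probs = [(feature_counts[y][i] + alpha) / (class_counts[y] + 2 * alpha)
                 for i in range(n_features)]
        model[y] = (prior, probs)
    return model


def posterior_interacting(model, x):
    """p(y = 1 | x), using p(x | y) = prod_i p(x_i | y), computed in log space."""
    log_joint = {}
    for y, (prior, probs) in model.items():
        lp = math.log(prior)
        for xi, p in zip(x, probs):
            lp += math.log(p if xi else 1.0 - p)
        log_joint[y] = lp
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return math.exp(log_joint[1] - m) / z


# Toy data: feature 0 = "coexpressed", feature 1 = "same localization" (made up)
X = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
y = [1, 1, 1, 0, 0, 0]
nb = train_naive_bayes(X, y)
print(posterior_interacting(nb, (1, 1)), posterior_interacting(nb, (0, 0)))
```

Working in log space avoids underflow when many per-feature probabilities are multiplied together.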


Support Vector Machine


Literature Mining

• Search for abstracts containing two protein names, and a set of interaction words

• # of papers containing two proteins together is strong evidence of an interaction

• Apply machine learning, natural language processing methods to identify likely interactions

• PreBIND (Donaldson et al, 2003) (data available at ftp.bind.org)

• Support Vector Machine to classify “interaction” abstracts
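The first two bullets (co-occurrence of two protein names plus an interaction word) can be sketched as a simple counter. This is not PreBIND's method, which uses a trained classifier; the word list, function name and example sentences below are made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny list of interaction words
INTERACTION_WORDS = {"binds", "interacts", "phosphorylates", "complex"}


def cooccurrence_counts(abstracts, protein_names, interaction_words=INTERACTION_WORDS):
    """Count abstracts that mention a protein pair together with an interaction word."""
    names = {n.lower(): n for n in protein_names}
    counts = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[A-Za-z0-9]+", text.lower()))
        if not tokens & interaction_words:
            continue  # no interaction word: skip this abstract
        present = sorted(names[t] for t in tokens if t in names)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Toy sentences standing in for abstracts (not real citations)
toy_abstracts = [
    "Cdc28 binds Cln2 during G1.",
    "Cln2 interacts with Cdc28 in vivo.",
    "Cdc28 and Far1 were both sequenced.",
]
pair_counts = cooccurrence_counts(toy_abstracts, ["Cdc28", "Cln2", "Far1"])
print(pair_counts)
```

Exact token matching misses synonyms and multi-word protein names, which is one reason real systems add name dictionaries and learned classifiers.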


‘Interologs’

• Two proteins are more likely to interact if they both have homologs in another species that are known to interact

[Diagram: proteins A and B and proteins a' and b', connected by edges labelled "homology" (twice), "experimental interaction" and "inferred interaction"]

interacting proteins may have coevolved such that only discrete interacting domains were conserved.

The data described above suggest that the approach of sequence-based searches for candidate interologs can be used globally to identify potential networks of interactions. However, such networks only can be considered as biological hypotheses. Hence, we investigated methods to generate reagents that can be used to study potential interaction networks identified by interolog searches. The reverse two-hybrid system provides a genetic selection that allows the rapid identification of cis-acting mutations or trans-acting molecules that dissociate potential interactions (Vidal 1997). The two-hybrid SPAL10::URA3 inducible reporter gene (Vidal et al. 1996) confers sensitivity to 5-Fluoroorotic acid (5-FOA). The dissociation of the yeast two-hybrid interaction confers a selective advantage allowing screens for dissociating compounds or for mutations that prevent the normal association of a protein pair using positive selection. Such reagents can be used back in vivo to characterize the role of the corresponding protein-protein interactions (Endoh et al. 2001).

To test the degree to which the reverse two-hybrid system can be applied to our network of identified interologs, we determined the percentage of the interactions described above that could be counter-selected on media containing 5-FOA (Vidal 1997). Starting from the 35 true worm interologs described above, 77% (27/35) of C. elegans interactions were detected as 5-FOA sensitive (Fig. 3). Because the reverse two-hybrid system can be automated (Endoh et al. 2001), it is possible that relatively large numbers of yeast two-hybrid interactions that emerge from interolog searches could indeed be tested back in the relevant biological settings.

This work suggests that interaction maps from one species may be useful in predicting interactions in another species and may provide insight into the function of otherwise uncharacterized proteins. In addition, the identification of an interolog provides additional support for the validity of the initial interaction found in the “reference” species. This may be most meaningful if the only evidence for the original interaction comes, itself, from a high-throughput experiment. When the function of one of the proteins in the starting spe-

Figure 1 Experimentally verified interactions between Saccharomyces cerevisiae and Caenorhabditis elegans. (A–D) Yeast diploid cells expressing each of 35 C. elegans potential interologs. Pairs are arranged in the order described in Table 2. The five patches at the bottom are controls (negative control on the left side and controls of increasing interaction strength towards the right side). See Vidal (1997) for a detailed description of these controls. (B) β-Galactosidase assay to detect the expression of GAL1::lacZ. (C) Growth assay on SC-Leu-Trp-His, +20 mM 3AT plates to detect the expression of GAL1::HIS3. (D) Growth assay on SC-Leu-Trp-Ura plates to detect the expression of SPAL10::URA3. (E) Conservation of interactions. Each C. elegans protein pair tested was plotted according to two E-values. The first E-value corresponds to the conservation between the X (from yeast) and X′ (from C. elegans) proteins while the second E-value corresponds to the conservation between the Y (from yeast) and Y′ (from C. elegans) proteins. The smaller of the two E-values was plotted on the X-axis and the greater on the Y-axis. The C. elegans protein pairs that tested positive in the two-hybrid system are labeled in black.

Matthews et al., Genome Research, p. 2122 (www.genome.org)

Matthews, 2001
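The interolog idea amounts to mapping known interactions through a homology table. A minimal sketch, assuming the homolog assignments have already been made (for example from sequence searches); the data structures, function name and protein labels are hypothetical.

```python
def infer_interologs(known_interactions, homologs):
    """Transfer interactions from a reference species to a target species.

    known_interactions: iterable of (x, y) pairs observed in the reference species
    homologs: dict mapping a reference protein to a set of target-species homologs
    """
    predicted = set()
    for x, y in known_interactions:
        for xp in homologs.get(x, ()):
            for yp in homologs.get(y, ()):
                if xp != yp:
                    predicted.add(frozenset((xp, yp)))
    return predicted


# Toy example: only X and Y have homologs in the target species
reference_interactions = [("X", "Y"), ("X", "Q")]
homolog_table = {"X": {"Xp"}, "Y": {"Yp"}}
candidates = infer_interologs(reference_interactions, homolog_table)
print(candidates)
```

As the excerpt stresses, such candidates are hypotheses to be tested, not confirmed interactions.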


Domain Fusion

• Two proteins A and B with homologs in another organism that are fused into a single protein chain are likely to functionally or physically interact

Detecting Protein Function and Protein-Protein Interactions from Genome Sequences

Edward M. Marcotte, Matteo Pellegrini, Ho-Leung Ng, Danny W. Rice, Todd O. Yeates, David Eisenberg*

A computational method is proposed for inferring protein interactions from genome sequences on the basis of the observation that some pairs of interacting proteins have homologs in another organism fused into a single protein chain. Searching sequences from many genomes revealed 6809 such putative protein-protein interactions in Escherichia coli and 45,502 in yeast. Many members of these pairs were confirmed as functionally related; computational filtering further enriches for interactions. Some proteins have links to several other proteins; these coupled links appear to represent functional interactions such as complexes or pathways. Experimentally confirmed interacting pairs are documented in a Database of Interacting Proteins.

The lives of biological cells are controlled by interacting proteins in metabolic and signaling pathways and in complexes such as the molecular machines that synthesize and use adenosine triphosphate (ATP), replicate and translate genes, or build up the cytoskeletal infrastructure (1). Our knowledge of protein-protein interactions has been accumulated from biochemical and genetic experiments, including the widely used yeast two-hybrid test (2). Here we ask if protein-protein interactions can be recognized from genome sequences by purely computational means.

Some interacting proteins such as the Gyr A and Gyr B subunits of Escherichia coli DNA gyrase are fused into a single chain in another organism, in this case the topoisomerase II of yeast (3). Thus, the sequence similarities of Gyr A (804 amino acid residues) and Gyr B (875 residues) to different segments of the topoisomerase II (1429 residues) might be used to predict that Gyr A and Gyr B interact in E. coli.

To find other such putative protein interactions in E. coli, we searched the 4290 protein sequences of the E. coli genome (4) for these patterns of sequence homology (5). We found 6809 pairs of nonhomologous sequences, both members of the pair having significant similarity (6) to a single protein in some other genome that we term a Rosetta Stone sequence because it deciphers the interaction between the protein pairs. The 4290 proteins could form at most (4290)^2/2 ≈ 9 × 10^6 pair interactions, but we would expect many fewer interactions in a functioning cell; roughly 2 to 10 interactions for each protein does not seem unreasonably many.

Each of these 6809 pairs is a candidate for a pair of interacting proteins in E. coli. Five such candidates are shown in Fig. 1. The first three pairs of E. coli proteins were among those easily determined from the biochemical literature in fact to interact. The final two pairs of proteins are not known to interact. They are representatives of many such pairs whose putative interactions at this time must be taken as testable hypotheses.

We devised three independent tests of interactions predicted by the method we term domain fusion analysis, each showing that a reasonable fraction may in fact interact. The first method uses the annotation of proteins given in the SWISS-PROT database (7). For cases where the interacting proteins have both been annotated, we compare their annotations, looking for a similar function for both members of the pair. Similar function would imply at least a functional interaction. Of the 3950 E. coli pairs of known function, 2682 (68%) share at least one keyword in their SWISS-PROT annotations (ignoring the keyword “hypothetical protein”), suggesting related functional roles. When pairs of annotated E. coli proteins are selected at random, only 15% share a keyword. In short, of the E. coli pairs that the domain fusion analysis turns up as candidates for protein-protein interactions, more than half have both members with a similar function; the method therefore seems to be a robust predictor of protein function. Where the function of one member of a protein pair is known, the function of the other member can be predicted. Performing a similar analysis in yeast turns up 45,502 protein pairs. Of the 9857 pairs of known function, 32% share at least one keyword in their annotations compared with 14% when proteins are selected at random.

The second test of the interactions predicted by the domain fusion analysis uses as confirmation the Database of Interacting Proteins (8). This database is a compilation of protein pairs that have been found to interact in some published experiment. As of December 1998, the database contained 939 entries, 724 of which have both members of the pair listed in the ProDom database. Of these 724 pairs, we found 46 or 6.4% linked by Rosetta Stone sequences. We expect this percentage to rise as more genomes are sequenced.

The third test of domain fusion predictions is by another computational method for predicting interactions (9), the method of phylogenetic profiles, which detects functional interactions by analyzing correlated evolution of proteins. This method was applied to the 6809 interactions predicted by the domain fusion analysis for E. coli proteins. Some 321 of these predictions (~5%) were suggested by the phylogenetic profile method to interact, more than eight times as many interactions in common as for randomly cho-

UCLA–Department of Energy Laboratory of Structural Biology and Molecular Medicine, Departments of Chemistry and Biochemistry and Biological Chemistry, Box 951570, University of California at Los Angeles, Los Angeles, CA 90095–1570, USA.

*To whom correspondence should be addressed: E-mail: [email protected]

Fig. 1. Five examples of pairs of E. coli proteins predicted to interact by the domain fusion analysis. Each protein is shown schematically with boxes representing domains [as defined in the ProDom domain database (17)]. For each example, a triplet of proteins is pictured: The second and third proteins are predicted to interact because their homologs are fused in the first protein (called the Rosetta Stone protein in the text). The first three predictions are known to interact from experiments (18). The final two examples show pairs of proteins from the same pathway (two nonsequential enzymes from the histidine biosynthesis pathway and the first two steps of the proline biosynthesis pathway) that are not known to interact directly.


Marcotte et al, Science 285 751-753 1999
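The Rosetta Stone search can be pictured as grouping sequence hits by their target protein and keeping pairs of queries that align to different segments of it. A minimal sketch under strong assumptions: hits are given as precomputed (query, target, start, end) tuples, and non-overlap on the target stands in for the paper's requirement that the two queries be nonhomologous. The function name and all coordinates below are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def rosetta_stone_pairs(hits, max_overlap=0):
    """Return (pair, target) for queries matching distinct segments of one target."""
    by_target = defaultdict(list)
    for query, target, start, end in hits:
        by_target[target].append((query, start, end))
    pairs = set()
    for target, segments in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(segments, 2):
            if q1 == q2:
                continue
            # Overlap length in residues; zero or negative means disjoint segments
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                pairs.add((frozenset((q1, q2)), target))
    return pairs


# Made-up alignment coordinates on a 1429-residue fused protein
toy_hits = [
    ("GyrB", "TopoII", 1, 690),
    ("GyrA", "TopoII", 700, 1400),
    ("Other", "TopoII", 650, 800),  # overlaps both segments, so it pairs with neither
]
fusion_candidates = rosetta_stone_pairs(toy_hits)
print(fusion_candidates)
```

A real pipeline would also filter on alignment significance and screen out promiscuous domains, which the excerpt refers to as computational filtering.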


Phylogenetic Profiling

• Proteins that interact tend to be evolutionarily correlated - they are either both conserved or both lost.

the clustering of phylogenetic profiles is that these as yet uncharacterized proteins have functions associated with the ribosome.

The comparisons of the phylogenetic profiles of flagellar proteins (Fig. 2B) further support the idea that proteins with similar profiles are likely to be functionally linked; 10 flagellar proteins share a common profile. Their homologs are found in a subset of five bacterial genomes: those of Aquifex aeolicus, Borrelia burgdorferi, B. subtilis, Helicobacter pylori, and Mycobacterium tuberculosis. Other proteins that appear in neighboring clusters (groups of proteins that share a common profile) include various flagellar proteins and cell-wall maintenance proteins. Flagellar and cell-wall maintenance proteins may be biochemically linked, because flagella are inserted through the cell wall. For example, the lytic murein transglycosylase MltD has a phylogenetic profile that differs by only one bit from that of the flagellar structural protein FlgL. This transglycosylase cuts the cell wall for unknown reasons. Therefore, another prediction is that this enzyme may participate in flagellar assembly.

Fig. 2 A and B includes proteins in structural complexes, whereas Fig. 2C shows proteins involved in amino acid metabolism. We find that more than half of the proteins with phylogenetic profiles similar (within one bit) to that of the histidine synthesis protein His5 are involved in amino acid metabolism. With the 16 currently available fully sequenced genomes, however, phylogenetic profiles are not able to separate the metabolic pathways of specific amino acids. Instead, because of the limitations of currently available data, a histidine biosynthesis protein seems to have the same profile as a tryptophan, arginine, and cysteine synthesis protein. It is probable that, as more genomes are fully sequenced and the number of entries in phylogenetic profiles is increased, similar but distinct amino acid metabolic pathways will cluster separately in phylogenetic-profile spaces.

The examples included in Fig. 2 show that proteins with phylogenetic profiles similar to a query protein are likely to be functionally linked with it. We next show the converse: that groups of proteins known to be functionally linked often have similar phylogenetic profiles. As shown in Table 1, we chose groups of E. coli proteins that share a common keyword in their SwissProt (7) annotation, reflecting well known families of functionally linked proteins. Because homologous proteins coded by the same genome necessarily have similar profiles, they were eliminated from the groups. For each group, we computed the number of protein pairs that are neighbors;

FIG. 1. Our method of analyzing protein phylogenetic profiles is illustrated schematically for the hypothetical case of four fully sequenced genomes (from E. coli, Saccharomyces cerevisiae, Haemophilus influenzae, and Bacillus subtilis) in which we focus on seven proteins (P1–P7). For each E. coli protein, we construct a profile, indicating which genomes code for homologs of the protein. We next cluster the profiles to determine which proteins share the same profiles. Proteins with identical (or similar) profiles are boxed to indicate that they are likely to be functionally linked. Boxes connected by lines have phylogenetic profiles that differ by one bit and are termed neighbors.

Table 1. Phylogenetic profiles link proteins with similar keywords

Keyword                                          No. proteins   No. neighbors in keyword group   No. neighbors in random group
Ribosome                                         60             197                              27
Transcription                                    36             17                               10
tRNA synthase and ligase                         26             11                               5
Membrane proteins*                               25             89                               5
Flagellar                                        21             89                               3
Iron, ferric, and ferritin                       19             31                               2
Galactose metabolism                             18             31                               2
Molybdoterin and Molybdenum, and molybdoterin    12             6                                1
Hypothetical†                                    1,084          108,226                          8,440

Proteins grouped on the basis of similar keywords in SwissProt have more similar phylogenetic profiles than random proteins. Column 2 gives the number of nonhomologous proteins in the keyword group. Column 3 gives the number of protein pairs in the keyword group with profiles that differ by less than 3 bits. These pairs are called neighbors. Column 4 lists the number of neighbors found on average for a random group of proteins of the same size as the keyword group.
*Only membrane proteins without uniformly zero phylogenetic profiles were included.
†Unlike the other rows of the table, the hypothetical proteins do contain homologous pairs.

Table 2. Phylogenetic profiles link proteins in EcoCyc classes

EcoCyc class                  No. proteins   No. neighbors in EcoCyc class   No. neighbors in random group
Carbon compounds              88             798                             60
Anaerobic respiration         66             275                             30
Aerobic respiration           28             39                              6
Electron transport            26             91                              5
Purine biosynthesis           21             11                              3
Salvage nucleosides           15             10                              1
Fermentation                  19             17                              3
Tricarboxylic acid cycle      16             6                               1
Glycolysis                    14             5                               1
Peptidoglycan biosynthesis    12             10                              1

Proteins grouped according to metabolic function on the basis of EcoCyc classes have more similar phylogenetic profiles than random proteins. Column 2 gives the number of proteins in the EcoCyc class. Column 3 gives the number of protein pairs in the EcoCyc class with profiles that differ by less than 3 bits. These pairs are called neighbors. Column 4 lists the number of neighbors found on average for a random group of proteins of the same size as the keyword group.


Pellegrini et al. PNAS 96(4285-4288) 1999
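A phylogenetic profile is just a bit vector over genomes, and "neighbors" are proteins whose profiles differ in at most a few bits. A minimal sketch with made-up profiles over four genomes; the threshold is a parameter because the excerpt uses one bit in Fig. 1 and fewer than 3 bits in the tables. Function names and profile values are hypothetical.

```python
from itertools import combinations


def hamming(p, q):
    """Number of genomes in which exactly one of the two proteins has a homolog."""
    return sum(a != b for a, b in zip(p, q))


def neighbor_pairs(profiles, max_bits=1):
    """Protein pairs whose profiles differ by at most max_bits."""
    return [(n1, n2)
            for (n1, p1), (n2, p2) in combinations(sorted(profiles.items()), 2)
            if hamming(p1, p2) <= max_bits]


# 1 = homolog present in that genome, 0 = absent (invented values)
toy_profiles = {
    "P1": (1, 1, 0, 1),
    "P2": (1, 1, 0, 1),
    "P3": (1, 1, 1, 1),
    "P4": (0, 0, 1, 0),
}
print(neighbor_pairs(toy_profiles, max_bits=0))  # identical profiles
print(neighbor_pairs(toy_profiles, max_bits=1))  # within one bit
```

With only a handful of genomes many unrelated proteins share a profile by chance, which is the limitation the excerpt notes for the 16 genomes then available.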


Integrating Genomic Data to Predict Interactions

Jansen et al. “A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data” Science 302 449-452 2003

in our filtered version (19)]. A negatives gold-standard is harder to define, but essential for successful training. Thus, we synthesized negatives from lists of proteins in separate subcellular compartments (9). Our positive and negative gold-standards satisfy the first two criteria and provide a good practical solution for the third. Hence, our goal, precisely defined, was to predict whether two proteins are in the same complex, not whether they necessarily had direct physical contact.

As a measure of reliability, the overlap of information sources (i.e., “interaction data sets,” which could either be noisy experimental data or sets of genomic features) with the gold-standards can be expressed in terms of a “likelihood ratio.” For example, consider a genomic feature f expressed in binary terms (i.e., “present” or “absent”). The likelihood ratio L(f) is then defined as the fraction of gold-standard positives having feature f divided by the fraction of negatives having f. For two features f1 and f2 with uncorrelated evidence, the likelihood ratio of the combined evidence is simply the product L(f1, f2) = L(f1)L(f2). For correlated evidence, L(f1, f2) cannot be factorized in this way. Bayesian networks are a formal representation of such relationships between features. The combined likelihood ratio is proportional to the estimated odds that two proteins are in the same complex, given multiple sources of information.

We predict a protein pair as positive if its combined likelihood ratio exceeds a particular cutoff (L > Lcut) (negative otherwise). To get an overall assessment of how the prediction performs, we segmented the gold-standard into separate training and testing sets (using a sevenfold cross-validation protocol). Then we evaluated the number of true- (TP) and false-positive (FP) predictions in the testing set. Finally, we applied the Bayesian network beyond the testing set, computing likelihood ratios for all possible protein pairs in the genome.

Figure 1 schematically shows the information sources and results of our calculations. We term the results “probabilistic interactomes” (PIs), in which each protein pair is associated with a probability measure for being in the same complex (i.e., likelihood ratio L). Our procedure not only allows combining existing experimental interaction data sets (resulting in a PI-experimental or “PIE”), but also the de novo prediction of protein complexes from genomic data sets (when the input data are not interaction data sets per se, resulting in a PI-predicted or “PIP”).

We combined four interaction data sets from high-throughput experiments into the PIE (1–4) (Fig. 1B). The PIE represents a transformation of the individual binary-valued interaction sets into a data set where every protein pair is weighted according to the likelihood that it exists within a complex.

We computed the PIP from several genomic data sources: the correlation of mRNA amounts in two expression data sets (one with temporal profiles during the cell cycle, one of expression levels under 300 cellular conditions), two sets of information on biological function, and information about whether proteins are essential for survival (6, 20–22). Although none of these information sources are interaction data per se, they contain information weakly associated with interaction: Two subunits of the same protein complex often have coregulated mRNA expression and similar biological functions and are more likely to be both essential or nonessential (8).

For computing the PIE and the PIP, we used two different types of Bayesian networks: a “naïve” network for the PIP and a fully connected one for the PIE (19). The naïve network is simpler to compute but requires information sources with essentially uncorrelated evidence. In contrast, the fully connected Bayesian network accommodates correlated evidence, which is the case for the four experimental interaction data sets.

Finally, we combined the PIP, PIE, and gold-standard into a total PI (PIT), which represents our most comprehensive view of the known and putative protein complexes in yeast (23). Because the PIP and PIE data provide essentially uncorrelated evidence for protein-protein interactions, we chose a naïve network to construct the PIT.

Figure 1C gives an overview of how we compared the PIP, PIE, gold-standard, and our new experiments. In particular, Fig. 2 shows the performance of the integration resulting in the PIP and PIE. When tested against the gold-standard, we observed that the ratio of true to false positives (TP/FP) increases monotonically with Lcut, confirming L as an appropriate measure of the odds of a real interaction. Conservatively estimated, protein pairs with L > 600 have a better than 50% chance of being in the same complex, suggesting Lcut = 600 as a useful threshold (19). Unless otherwise noted, we use this throughout our analysis. It gives 9897 predicted interactions from the PIP and

Fig. 1. The information sources integrated in our analysis and their comparison with each other. (A) The three different types of data used: (i) Interaction data from high-throughput experiments. These comprise large-scale two-hybrid screens (Y2H) (1, 2) and in vivo pull-down experiments (3, 4). (ii) Other genomic features. We considered expression data, biological function of proteins (from Gene Ontology biological process and the MIPS functional catalog), and data about whether proteins are essential (6, 19–22). (iii) Gold-standards of known interactions and noninteracting protein pairs. (The MIPS functional catalog differs from the MIPS complexes catalog used for the gold-standard.) (B) Combination of data sets into probabilistic interactomes. (C) Comparison of the probabilistic interactomes with the gold-standards and our new experimental data. Numbers next to the arrows indicate which figures refer to these various comparisons.
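The likelihood-ratio bookkeeping described in the excerpt is easy to sketch for binary features with uncorrelated evidence (the "naïve" case). The definition of L(f), the product rule and the cutoff Lcut = 600 come from the text; the function names and all counts below are invented for illustration.

```python
import math


def likelihood_ratio(pos_with_f, n_pos, neg_with_f, n_neg):
    """L(f) = fraction of gold-standard positives with f / fraction of negatives with f."""
    if neg_with_f == 0:
        raise ValueError("feature never seen in negatives; L(f) is undefined without smoothing")
    return (pos_with_f / n_pos) / (neg_with_f / n_neg)


def combined_likelihood_ratio(ratios):
    """Product of per-feature ratios, valid only for uncorrelated evidence."""
    return math.prod(ratios)


L_CUT = 600  # threshold suggested in the excerpt

# Invented counts against a toy gold standard of 100 positives and 1000 negatives
l_f1 = likelihood_ratio(80, 100, 10, 1000)
l_f2 = likelihood_ratio(50, 100, 50, 1000)
l_combined = combined_likelihood_ratio([l_f1, l_f2])
predicted_positive = l_combined > L_CUT
print(l_f1, l_f2, l_combined, predicted_positive)
```

For correlated sources, such as the four experimental data sets in the excerpt, the product rule does not hold and the joint likelihood ratio has to be estimated over combinations of feature values instead.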

R E P O R T S

17 OCTOBER 2003 VOL 302 SCIENCE www.sciencemag.org450

in our filtered version (19)]. A negatives gold-standard is harder to define, but essential for successful training. Thus, we synthesized negatives from lists of proteins in separate subcellular compartments (9). Our positive and negative gold-standards satisfy the first two criteria and provide a good practical solution for the third. Hence, our goal, precisely defined, was to predict whether two proteins are in the same complex, not whether they necessarily had direct physical contact.

As a measure of reliability, the overlap of information sources (i.e., “interaction data sets,” which could either be noisy experimental data or sets of genomic features) with the gold-standards can be expressed in terms of a “likelihood ratio.” For example, consider a genomic feature f expressed in binary terms (i.e., “present” or “absent”). The likelihood ratio L(f) is then defined as the fraction of gold-standard positives having feature f divided by the fraction of negatives having f. For two features f1 and f2 with uncorrelated evidence, the likelihood ratio of the combined evidence is simply the product L(f1, f2) = L(f1)L(f2). For correlated evidence, L(f1, f2) cannot be factorized in this way. Bayesian networks are a formal representation of such relationships between features. The combined likelihood ratio is proportional to the estimated odds that two proteins are in the same complex, given multiple sources of information.
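The likelihood-ratio definition and the product rule for uncorrelated features can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and all counts are hypothetical and are not taken from the paper.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L(f): fraction of gold-standard positives having f, divided by
    the fraction of gold-standard negatives having f."""
    frac_pos = pos_with_f / pos_total
    frac_neg = neg_with_f / neg_total
    return frac_pos / frac_neg


def combine_uncorrelated(ratios):
    """Product rule, valid only when the features carry uncorrelated evidence."""
    combined = 1.0
    for ratio in ratios:
        combined *= ratio
    return combined


# Hypothetical counts, chosen only to illustrate the arithmetic.
l_f1 = likelihood_ratio(pos_with_f=400, pos_total=1000, neg_with_f=2000, neg_total=100000)
l_f2 = likelihood_ratio(pos_with_f=300, pos_total=1000, neg_with_f=5000, neg_total=100000)
l_combined = combine_uncorrelated([l_f1, l_f2])
print(l_f1, l_f2, l_combined)  # roughly 20, 6 and 120
```

Two features that are each only modest predictors (L of about 20 and 6 in this made-up case) multiply to a much stronger combined ratio, which is the effect the integration relies on.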

We predict a protein pair as positive if its combined likelihood ratio exceeds a particular cutoff (L > Lcut) (negative otherwise). To get an overall assessment of how the prediction performs, we segmented the gold-standard into separate training and testing sets (using a sevenfold cross-validation protocol). Then we evaluated the number of true- (TP) and false-positive (FP) predictions in the testing set. Finally, we applied the Bayesian network beyond the testing set, computing likelihood ratios for all possible protein pairs in the genome.
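The sevenfold protocol described above might look roughly like the following sketch for a single binary feature. The fold construction, the helper names, and the toy gold-standard are all assumptions made for illustration; the paper's actual procedure is described in its supplementary material (19) and is not reproduced here.

```python
import random


def make_folds(gold, n_folds=7, seed=None):
    """Split gold-standard pairs into n_folds disjoint test folds."""
    items = list(gold)
    if seed is not None:  # optional shuffle; left off here to keep the toy run deterministic
        random.Random(seed).shuffle(items)
    return [items[i::n_folds] for i in range(n_folds)]


def train_ratios(train):
    """Estimate L(present) and L(absent) for one binary feature."""
    pos = [has_f for has_f, is_pos in train if is_pos]
    neg = [has_f for has_f, is_pos in train if not is_pos]
    ratios = {}
    for value in (True, False):
        frac_pos = pos.count(value) / len(pos)
        frac_neg = neg.count(value) / len(neg)
        ratios[value] = frac_pos / frac_neg if frac_neg > 0 else float("inf")
    return ratios


def cross_validate(gold, l_cut, n_folds=7, seed=None):
    """Return total (TP, FP) over all test folds at cutoff l_cut."""
    folds = make_folds(gold, n_folds, seed)
    tp = fp = 0
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        ratios = train_ratios(train)
        for has_f, is_pos in test:
            if ratios[has_f] > l_cut:  # predicted positive
                if is_pos:
                    tp += 1
                else:
                    fp += 1
    return tp, fp


# Hypothetical gold-standard entries: (feature present?, gold-standard positive?)
gold = ([(True, True)] * 42 + [(False, True)] * 28 +
        [(True, False)] * 70 + [(False, False)] * 630)
tp, fp = cross_validate(gold, l_cut=1.0)
print(tp, fp, tp / fp)
```

Each pair is scored exactly once, by likelihood ratios estimated without it, so the TP and FP totals are not inflated by training on the test data.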

Figure 1 schematically shows the information sources and results of our calculations. We term the results “probabilistic interactomes” (PIs), in which each protein pair is associated with a probability measure for being in the same complex (i.e., likelihood ratio L). Our procedure not only allows combining existing experimental interaction data sets (resulting in a PI-experimental or “PIE”), but also the de novo prediction of protein complexes from genomic data sets (when the input data are not interaction data sets per se, resulting in a PI-predicted or “PIP”).

We combined four interaction data sets from high-throughput experiments into the PIE (1–4) (Fig. 1B). The PIE represents a transformation of the individual binary-valued interaction sets into a data set where every protein pair is weighted according to the likelihood that it exists within a complex.

We computed the PIP from several genomic data sources: the correlation of mRNA amounts in two expression data sets (one with temporal profiles during the cell cycle, one of expression levels under 300 cellular conditions), two sets of information on biological function, and information about whether proteins are essential for survival (6, 20–22). Although none of these information sources are interaction data per se, they contain information weakly associated with interaction: Two subunits of the same protein complex often have coregulated mRNA expression and similar biological functions and are more likely to be both essential or nonessential (8).

For computing the PIE and the PIP, we used two different types of Bayesian networks: a “naïve” network for the PIP and a fully connected one for the PIE (19). The naïve network is simpler to compute but requires information sources with essentially uncorrelated evidence. In contrast, the fully connected Bayesian network accommodates correlated evidence, which is the case for the four experimental interaction data sets.
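The difference between the two network types can be illustrated with a small sketch. Here the "fully connected" case is approximated by estimating one likelihood ratio per joint combination of binary evidence, and the naïve case multiplies per-feature ratios. The function names and the toy data are hypothetical, and this is not the authors' implementation.

```python
from collections import Counter


def joint_likelihood_ratios(positives, negatives):
    """One likelihood ratio per observed combination of evidence values."""
    pos_counts = Counter(positives)
    neg_counts = Counter(negatives)
    ratios = {}
    for combo in set(pos_counts) | set(neg_counts):
        frac_pos = pos_counts[combo] / len(positives)
        frac_neg = neg_counts[combo] / len(negatives)
        ratios[combo] = frac_pos / frac_neg if frac_neg > 0 else float("inf")
    return ratios


def naive_likelihood_ratio(positives, negatives, combo):
    """Product of per-feature ratios; assumes uncorrelated evidence."""
    ratio = 1.0
    for i, value in enumerate(combo):
        frac_pos = sum(1 for e in positives if e[i] == value) / len(positives)
        frac_neg = sum(1 for e in negatives if e[i] == value) / len(negatives)
        ratio *= frac_pos / frac_neg
    return ratio


# Toy gold-standard in which two data sets always agree (perfectly correlated).
positives = [(1, 1)] * 50 + [(0, 0)] * 50
negatives = [(1, 1)] * 10 + [(0, 0)] * 90

joint = joint_likelihood_ratios(positives, negatives)
naive = naive_likelihood_ratio(positives, negatives, (1, 1))
print(joint[(1, 1)], naive)  # roughly 5 versus 25
```

Because the second data set adds no new information in this toy case, the joint estimate stays near 5 while the naïve product counts the same evidence twice. That over-counting is the reason correlated experimental data sets call for the fully connected treatment.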

Finally, we combined the PIP, PIE, and gold-standard into a total PI (PIT), which represents our most comprehensive view of the known and putative protein complexes in yeast (23). Because the PIP and PIE data provide essentially uncorrelated evidence for protein-protein interactions, we chose a naïve network to construct the PIT.

Figure 1C gives an overview of how we compared the PIP, PIE, gold-standard, and our new experiments. In particular, Fig. 2 shows the performance of the integration resulting in the PIP and PIE. When tested against the gold-standard, we observed that the ratio of true to false positives (TP/FP) increases monotonically with Lcut, confirming L as an appropriate measure of the odds of a real interaction. Conservatively estimated, protein pairs with L > 600 have a better than 50% chance of being in the same complex, suggesting Lcut = 600 as a useful threshold (19). Unless otherwise noted, we use this throughout our analysis. It gives 9897 predicted interactions from the PIP and

Fig. 1. The information sources integrated in our analysis and their comparison with each other. (A) The three different types of data used: (i) Interaction data from high-throughput experiments. These comprise large-scale two-hybrid screens (Y2H) (1, 2) and in vivo pull-down experiments (3, 4). (ii) Other genomic features. We considered expression data, biological function of proteins (from Gene Ontology biological process and the MIPS functional catalog), and data about whether proteins are essential (6, 19–22). (iii) Gold-standards of known interactions and noninteracting protein pairs. (The MIPS functional catalog differs from the MIPS complexes catalog used for the gold-standard.) (B) Combination of data sets into probabilistic interactomes. (C) Comparison of the probabilistic interactomes with the gold-standards and our new experimental data. Numbers next to the arrows indicate which figures refer to these various comparisons.

R E P O R T S

17 OCTOBER 2003 VOL 302 SCIENCE www.sciencemag.org 450

88



Prediction Performance

163 from the PIE. In contrast, likelihood ratios derived from single genomic features (e.g., mRNA coexpression) or from individual interaction experiments (e.g., the Ho data set) did not exceed the cutoff when used alone, with TP/FP values far below 1. This demonstrates that information sources that, taken alone, are only weak predictors of interactions can yield reliable predictions when combined.
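The link between the cutoff of 600 and the "better than 50% chance" statement follows from Bayes' rule in odds form. The worked step below assumes prior odds of roughly 1 in 600 for a random protein pair being in the same complex; that value is inferred from the stated threshold rather than quoted from this excerpt, so treat it as an assumption.

```latex
% Bayes' rule in odds form: posterior odds = likelihood ratio x prior odds
O_{\mathrm{post}} = L \cdot O_{\mathrm{prior}}, \qquad
P(\text{same complex} \mid \text{evidence}) = \frac{O_{\mathrm{post}}}{1 + O_{\mathrm{post}}}

% Assumption (inferred, not quoted): O_prior is about 1/600
O_{\mathrm{prior}} \approx \frac{1}{600}
\;\Longrightarrow\;
\bigl( L > 600 \iff O_{\mathrm{post}} > 1 \iff P > \tfrac{1}{2} \bigr)
```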

The PIP had a higher sensitivity than the PIE for comparable TP/FP ratios (Fig. 2C). (“Sensitivity” measures coverage and is defined as TP/P, where P is the number of gold-standard positives.) Specifically, the sensitivity of the PIP is ~27% at our cutoff. This may seem low, but compares favorably with the PIE, which had a sensitivity of less than 1%. This means that we can predict, at comparable error levels, more complex interactions de novo than are present in the high-throughput experimental interaction data sets.

One might ask whether simpler voting procedures can match the performance of more complicated machine-learning methods such as Bayesian networks. To test this hypothesis, we compared the PIP with a voting procedure where each of the four genomic features contributes an additive vote toward positive classification. We found that the Bayesian network achieved greater sensitivity for comparable TP/FP ratios (Fig. 2C) (19).
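An additive voting baseline of the kind described above could be sketched as follows. How each genomic feature is turned into a yes/no vote is not specified in this excerpt, so the binary feature tuples, the function names, and the toy data here are assumptions for illustration only.

```python
def vote_count(features):
    """Number of features casting a positive vote for this protein pair."""
    return sum(1 for present in features if present)


def predict_by_vote(features, min_votes):
    """Classify as positive when at least min_votes features vote yes."""
    return vote_count(features) >= min_votes


def sensitivity_and_fp(pairs, min_votes):
    """Return (sensitivity = TP/P, number of false positives)."""
    tp = sum(1 for feats, is_pos in pairs if is_pos and predict_by_vote(feats, min_votes))
    fp = sum(1 for feats, is_pos in pairs if not is_pos and predict_by_vote(feats, min_votes))
    p = sum(1 for _, is_pos in pairs if is_pos)
    return tp / p, fp


# Hypothetical pairs: (four binary feature votes, gold-standard positive?)
pairs = [
    ((1, 1, 1, 0), True),
    ((1, 1, 0, 0), True),
    ((1, 0, 0, 0), True),
    ((0, 0, 0, 0), True),
    ((1, 1, 0, 0), False),
    ((0, 0, 0, 0), False),
]
for k in (1, 2, 3):
    print(k, sensitivity_and_fp(pairs, min_votes=k))
```

Voting treats every feature as equally informative, whereas the likelihood-ratio approach weights each feature by how strongly it separates gold-standard positives from negatives.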

Figure 3 shows parts of the PIP and PIE graphs and how these compare with the gold-standard and our new experiments. First, to test whether the thresholded PIP was biased toward certain complexes, we looked at the distribution of predictions among gold-standard positives (Fig. 3A); they were roughly equally apportioned among the different complexes, suggesting a lack of bias.

We have thus far treated all interactions as independent. However, the joint distribution of interactions in the PIs can help identify large complexes: An ideal complex should be a “clique” in an interaction graph (i.e., a subgraph with N(N − 1)/2 links between N proteins). Although this rarely happens in practice, because of incorrect or missing links, large complexes tend to have many interconnections within them, whereas false-positive links to outside proteins tend to occur randomly, without a coherent pattern (Fig. 4).
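One simple way to quantify how close a candidate complex comes to the ideal clique is the fraction of its N(N − 1)/2 possible links that are actually present. The sketch below is illustrative only; the function names and the example proteins A to D are hypothetical.

```python
from itertools import combinations


def clique_link_count(n):
    """Number of links in a clique of n proteins: n(n - 1)/2."""
    return n * (n - 1) // 2


def link_density(proteins, links):
    """Fraction of all possible pairwise links among `proteins` that are present."""
    if len(proteins) < 2:
        raise ValueError("need at least two proteins")
    link_set = {frozenset(link) for link in links}  # undirected links
    present = sum(
        1 for a, b in combinations(proteins, 2) if frozenset((a, b)) in link_set
    )
    return present / clique_link_count(len(proteins))


# Hypothetical four-protein complex with one missing link (C-D).
proteins = ["A", "B", "C", "D"]
links = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]
density = link_density(proteins, links)
print(clique_link_count(4), density)  # 6 possible links, 5 of them present
```

A density near 1 indicates many interconnections within the group, while isolated false-positive links to outside proteins contribute little to the density of any one group.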

Figure 3B shows parts of the thresholded PIP that are restricted to proteins with ≥20 links (23), highlighting large complexes. Some predicted complexes overlap with the gold-standard positives (cytoplasmic ribosome) or the PIE (exosome, RNA polymerase I, 26S proteasome). Comparison with the gold-standard negatives showed where the PIP likely produced false complexes. Many protein associations only appear in the PIP and thus potentially represent new interactions and complexes. An interesting example is the mitochondrial ribosome; it has appreciable overlap with both gold-standard positives and the PIE and contains plausible, newly predicted interactions with three proteins (19).

To further test the predictions in the PIP, we conducted TAP-tagging experiments, in which a protein expressed at its normal intracellular concentration (“bait”) is tagged and used to “pull down” endogenous protein complexes. We picked 98 proteins as TAP-tagging baits. These produced 424 experimental interactions overlapping with the PIP thresholded at Lcut = 300. (Of these, 185, in turn, overlapped with gold-standard positives, and 16 with negatives, highlighting the reliability of our experiments.)

Figure 3C shows three examples of the overlap between the PIP and TAP-tagging. We predicted that the putative DEAD-box RNA helicase Dbp3 interacts with three other RNA helicases (Hca4, Mak5, and Dbp7), with proteins implicated in ribosomal RNA (rRNA) metabolism (e.g., Nop2, Rrp5, Mak5, and components of RNA polymerase I), and with Nsr1, the yeast homolog of mammalian Nucleolin and a GAR domain–containing protein (24). When Dbp3 was TAP-tagged and purified, we found previously unknown interactions with Nsr1, Hca4, and Nop1, connecting Dbp3 with known rRNA-processing proteins. Further purifications with TAP-tagged versions of Mak5, Rrp5, Dbp7, Dbp3, Nsr1, Hca4, and Nop2 verified the physical association.

The nucleosome, a fundamental unit within chromatin, provides a second example of overlap. It is composed of eight histones (two H2A, two H2B, two H3, and two H4), which can block RNA polymerase II progression. This blockage is relieved upon interaction with the FACT complex (also known as SPN or yFACT), which consists of Spt16 and Pob3 in yeast. Mammalian Pob3 has a high mobility group (HMG) domain for interaction with histones; however, yeast Pob3 lacks this domain. Instead, the HMG protein Nhp6 (with two virtually identical isoforms, Nhp6A and Nhp6B) binds histones (25–27).

Fig. 2. Comparison of PIP and PIE with each other and with the individual information sources. (A) The TP/FP ratio as a function of Lcut for the PIP and the individual data from which it was computed. The ratio is computed as follows:

\[ \frac{TP(L_{cut})}{FP(L_{cut})} = \frac{\sum_{L > L_{cut}} pos(L)}{\sum_{L > L_{cut}} neg(L)} \]

where pos(L) and neg(L) are the number of positives and negatives in the gold-standard with a given likelihood ratio L. The vertical line indicates our standard threshold Lcut = 600. (B) The same plot as in (A), but for the PIE. (C) Comparison of TP/FP ratios between the PIP and PIE. The abscissa represents the sensitivity of the probabilistic interactomes. The gray area indicates the gain of sensitivity of the PIP over the PIE for equal TP/FP ratios. The arrow shows the difference in sensitivity at TP/FP = 0.3. At this level, the PIP contains 183,295 protein pairs, of which 6179 are gold-standard positives (75% sensitivity), whereas the PIE contains 31,511 protein pairs and 1758 gold-standard positives among these (21% sensitivity). This difference in sensitivity between PIE and PIP illustrates the value of the de novo prediction. It also reflects, to some degree, that the experiments were done only on subsets of the genome and may have been measuring different types of interactions than the complexes’ gold-standard, which we used to parameterize the PIP. The white circles show the performance of a voting procedure in which each of the four genomic features (from which we computed the PIP) contributed an additive vote. There are four possible outcomes in the additive voting procedure, depending on how many data sets contribute a positive vote (19).
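The TP/FP ratio in the caption is a ratio of two counts above the cutoff, so a curve like the ones in panels (A) and (B) can be traced by sweeping Lcut. The following sketch uses made-up likelihood ratios and a hypothetical function name, and it counts pairs with L strictly greater than the cutoff, matching the prediction rule L > Lcut.

```python
def tp_fp_curve(pos_scores, neg_scores, cutoffs):
    """For each cutoff return (l_cut, TP, FP, sensitivity).

    TP and FP count gold-standard positives and negatives whose
    likelihood ratio exceeds the cutoff; sensitivity is TP / P.
    """
    rows = []
    for l_cut in cutoffs:
        tp = sum(1 for l in pos_scores if l > l_cut)
        fp = sum(1 for l in neg_scores if l > l_cut)
        rows.append((l_cut, tp, fp, tp / len(pos_scores)))
    return rows


# Made-up likelihood ratios for gold-standard positives and negatives.
pos_scores = [1000, 800, 650, 200, 50, 5]
neg_scores = [700, 150, 40, 10, 2, 1, 0.5, 0.1]

rows = tp_fp_curve(pos_scores, neg_scores, cutoffs=[1, 100, 600])
for l_cut, tp, fp, sens in rows:
    ratio = tp / fp if fp else float("inf")
    print(l_cut, tp, fp, round(ratio, 2), round(sens, 2))
```

In this toy data the TP/FP ratio rises as the cutoff increases while sensitivity falls, which is the trade-off that panel (C) summarizes.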


www.sciencemag.org SCIENCE VOL 302 17 OCTOBER 2003 451

Jansen et al. “A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data” Science 302 449-452 2003

• Experimental validation

• Relies on

89


Protein-Protein Docking

• Predict interactions and/or binding orientations of complexes based on molecular structure of constituent proteins

• CAPRI: Critical Assessment of Prediction of Interactions

• Challenges

• Conformational flexibility

• Transient complexes

Table 1

Targets and predictions in the CAPRI experiment.

Target                                   Number of predictions*   Quality of models†            Remarks                                                   Refs
                                         Groups   Models          High   Medium   Acceptable

Round 1 (July–September 2001)
T01  HPr–HPr kinase                      16       69              0      0        8             Helix movement in kinase                                  [35•]
T02  Rotavirus VP6–Fab                   15       70              0      1        6             Published electron micrograph of Fab bound to the virus   [33]
T03  Flu virus hemagglutinin–Fab         13       62              0      2        0                                                                       [32•]

Round 2 (January–March 2002)
T04  α-amylase–VHH domain AM-D10         13       65              0      0        0             Camelid single-chain antibody                             [31•]
T05  α-amylase–VHH domain AM-B07         13       64              0      0        0             Camelid single-chain antibody                             [31•]
T06  α-amylase–VHH domain AM-D09         13       65              4      4        0             Camelid single-chain antibody                             [31•]
T07  Streptococcal superantigen–TCRβ     14       70              5      7        8             Homolog complex in PDB                                    [34•]

*Number of groups submitting models and number of models submitted for each target.
†Two criteria were used to judge a docking model: Irms, the root mean square distance between the Cα of interface residues in the X-ray structure and the model; and fNC, the fraction of native contacts, defined as the number of correctly predicted pairs of contact residues divided by the number of pairs present in the X-ray structure. High-quality models: Irms <1 Å, fNC >0.5; medium-quality models: Irms <2 Å, fNC >0.3; acceptable models: Irms <4 Å, fNC >0.1 [28••].
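The quality categories in the table footnote can be expressed as a small classifier. This sketch follows the thresholds exactly as quoted in the footnote and assumes that both criteria must hold for a category to apply. The function names, the "incorrect" fallback label, and the residue-pair encoding are illustrative assumptions, not part of the CAPRI assessment software.

```python
def fraction_native_contacts(model_contacts, native_contacts):
    """fNC: correctly predicted contact pairs / contact pairs in the X-ray structure.

    Each contact is a (residue in protein 1, residue in protein 2) tuple.
    """
    native = set(native_contacts)
    if not native:
        raise ValueError("native structure has no contacts")
    return len(native & set(model_contacts)) / len(native)


def capri_quality(i_rms, f_nc):
    """Classify a docking model from Irms (in angstroms) and fNC."""
    if i_rms < 1.0 and f_nc > 0.5:
        return "high"
    if i_rms < 2.0 and f_nc > 0.3:
        return "medium"
    if i_rms < 4.0 and f_nc > 0.1:
        return "acceptable"
    return "incorrect"


# Hypothetical contact lists for a model and its reference structure.
native = [(10, 55), (11, 55), (12, 60), (15, 61)]
model = [(10, 55), (12, 60), (20, 70)]
f_nc = fraction_native_contacts(model, native)
print(f_nc, capri_quality(i_rms=1.5, f_nc=f_nc))
```

The categories are checked from strictest to loosest, so a model is assigned the best label whose thresholds it satisfies.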

Figure 2

A CAPRI target and its prediction. The ribbon drawing shows the X-ray structure [31•] of T06, a complex between pig α-amylase (green) and the VHH domain of a camelid antibody (purple), which binds at the enzyme active site. Spheres mark the geometric centers of the VHH domain in high-quality (green), medium-quality (blue) and incorrect (yellow) models of the complex derived by docking the VHH domain on the α-amylase. Not only the green and blue spheres, but also about one-third of the yellow spheres cluster in the active site region. Figure courtesy of R Leplae and SJ Wodak (Brussels).

386 Sequences and topology

Current Opinion in Structural Biology 2003, 13:383–388 www.current-opinion.com

Janin and Seraphin, 2003

90


Summary

• Integration of prediction methods with experiments needed for more efficient experimentation and better interpretation of experimental data

• Standard supervised classification algorithms can be applied, if suitable positive and negative training data can be assembled

• Prediction methods already showing significant success in combination with experimental validation

91