Biological Networks CSci 732: Introduction to Bioinformatics

Biological Networks

CSci 732: Introduction to Bioinformatics

Anne DentonAssistant ProfessorDepartment of Computer Science

North Dakota State University, Fargo, ND

The Promise of Bioinformatics

Rich supply of data from high-throughput experiments– Sequencing– Microarray experiments– Scores of specialized high-throughput experiments

Data made available– Most biological data disseminated in large biological

databases– Dissemination is a condition of funding (in US)– Makes biology different from most other disciplines!

Challenge and Opportunity

Traditional approach– Studies of one or a few gene at a time– Conclusions based on thorough domain knowledge

Challenge– Standard data analysis techniques inadequate for hundreds

or thousands of genes

Opportunity– Massive data valuable to quantify evidence– New aspects can be studied, such as network structure

From Sequencing to Functional Genomics

Sequencing genomes is a mature discipline experimentally and computationally– Whole-genome sequencing centrally involved

computations

But what do the proteins do?– Sequence comparison (BLAST) among species– Great computational and experimental challenges– Networks have a fundamental role in specifying

relationships

Computer Scientist’s Introduction to Genomics

Encoding– Smallest unit of information

Computer: 1 bit (0 or 1) Cell: 1 nucleotide of DNA (A,C,T, or G)

– Most practical unit of information Computer: 8 bit are 1 byte

(can represent 256 values) Cell: 3 nucleotide determine 1 amino acid of the protein

(20 different amino acids)

Further Analogy between Cells and Computers

Encoded values can serve very different purposes and are stored in the same location– Computer: Instructions and data are stored in the

same memory (von Neumann architecture)– Cell: Proteins in the cell do very different things

Catalyze chemical reactions (proteins are part of the process)

Regulate other proteins (proteins change the process)

Networks in Bioinformatics

Many network definitions:– Protein-protein

interactions– Biochemical

pathways – Annotation Networks

Different definitions in each category

Scientific American 05/03

Why Study Networks?

Biochemical pathways tell us about functioning of cell– Chemical processes (Metabolic pathways)– Control of other proteins (Regulatory pathways)

Neighbors in networks often have similar function

Structure of networks can tell us about evolution Combined study of networks and data can

uncover yet more information about cells

Outline

Part 1: Properties of the Networks– Scale-free networks

Part 2: Networks and data– Relational data mining– Problem: Similarity between network neighbors– Solution: Focus on differences between neighbors– Comparison of different networks

Example of a Biological Network:Physical Protein-Protein Interactions

Proteins interact (attach to each other)– Tests if proteins are stable in a close position– Proteins may perform function together (not tested)– Mathematically: Undirected graph

Only one definition of interactions between proteins? No!– Definition based on function: Genetic– Definition based on evolution: Domain fusion

Physical

Protein-Protein in Yeast


Scale-free Networks

Properties Barabasi, Bonabeau 1998 Some nodes have large number of links,

most have only a few Number of nodes that have a particular

number of links decreases as a power law Robust against accidental failures Hubs with high connectivity

Power-law Behavior

Probability that any node is connected to k other nodes is 1/k n with n between 2 and 3

For k = 2: Probability of having twice as many links is a quarter as likely


Robust against accidental failures

As many as 80% of nodes can fail in a scale-free network without breaking up the entire cluster

I.e., even with large number of random mutations in genes, unaffected proteins continue to work together

Note that random removal of nodes is unlikely to remove hubs

Removal of only a few hubs does break up network significantly

Examples Outside Biology

Hyperlink structure of the Internet Physical structure (routers and

communication lines) of the Internet Social networks Airline system Scientific papers connected by citations

Scale-free Network (Airline System)


Random Networks in Contrast

Links placed randomly – Mathematical model (Erdos 1959)

Example: Highway system Bell-shaped (Poisson, similar to Gauss) distribution of number of

nodes around typical value For large number of links k,

probability of k links

decreases exponentially Very unlikely to have

nodes with a very large

number of links

Random Network (Highway System)


Reasons for the development of scale-free networks

Networks grow over time– Older nodes have had longer to accumulate links – The most connected nodes in E.coli metabolic

network have an early evolutionary history

Preferential attachment – "The rich get richer“

cf. many people hyper-link to Google

"Small World" Property

Game "Six Degrees of Kevin Bacon ": Acting in the same movie as links connects most actors in 6 steps

Internet pages are typically 19 clicks apart Any two chemicals in a cell are only 3

reactions apart! Small-world property is not limited to scale-

free networks

Are Protein Networks Pure Scale-free Networks?

Clusters of tightly connected nodes Example

– Proteins that perform function together

Recovering scale free network – Clustering of nodes leads to groups that interact

as scale-free networks

Summary of Scale-free Networks

Characterized by – Few hubs with a large number of edges– Many nodes with few edges

Show small-world property– Any node can be reached from any other in only a

few steps

Ubiquitous in biology and outside

Has Everything Interesting Related to Networks Been Done?

Graph theory is an old topic:

Euler 1736 Work on scale-

free network

has added to it Surely now

everything is done!

Protein-Protein Interaction Networks

Name: YAL003W

Function: Protein Synthesis

Localization: Cytoplasm

Class: GTP/GDP-exchange factors

Complex: Translation complexes

MOTIF: PDOC00648

Name: YAL040C

Function: Cell Cycle & DNA Process

Localization: nucleus

Class: Cyclins

Complex: Cyclin-dependent kinases

Motif: PDOC00013

Phenotype: Cell cycle defects

Outline (2): Data and Networks

Relational rather than graph-theoretic approach to mining of data on a graph– Problem of similarity between neighbors

New Algorithm– Focuses on differences

Comparison of Networks– Different definitions of Networks– Generalization of difference-based algorithm

Questions of Interest

How do data between nodes relate?– Are there typical patterns among interacting

proteins? Can we find relationships that are not yet

known to biologists?– It is expected that proteins in the nucleus interact

with other proteins in the nucleus– It is more surprising if proteins in the nucleus

interact with proteins in the mitochondria How do different networks compare?

Frequent Patterns in Tables:Association Rule Mining

1st step: Finding sets of items that are frequent– Originally: Items in shopping carts– Here: Properties of proteins– Support: Fraction of transactions, in which the set of

items occurs 2nd step: Finding associations A -> B

– If we find set A, we are likely to find set B– Similar to correlation, but goes in one direction only– Confidence: Fraction of transactions that have A,

which also have B

Relational Approach

Node Table

ORF Annotations

YPR184W {cytoplasm}

YER146W {cytoplasm}

YNL287W {SensitivityTOaaaod}

YBL026W {transcription, nucleus}

YMR207C {nucleus}

Edge Table

ORF1 ORF1

YPR184W YER146W

YNL287W YBL026W

YBL026W YMR207C

Node Table

ORF Annotations

YPR184W {cytoplasm}

YER146W {cytoplasm}

YNL287W {SensitivityTOaaaod}

YBL026W {transcription, nucleus}

YMR207C {nucleus}

0. 1.

Generalization

Works for any number of nodes

Even 2-node structure allows complex rules

Results become harder to interpret for many nodes

“Small world property”: Any protein can be reached in 3-4 hops

Effect of Joining on Item Sets

ARM on Joined Tables

Protein names (key) used for joining but not in ARM One node table participates multiple times

– Items labeled to keep track of instance

Typical transactions– Simplest approach had been done

overemphasizes similarities

T1 {0.cyto, 0.Cond_pheno, 0. PDOC00030, 1.Hydrolases}

T2 {0.Cond_pheno, 0.Transferases, 0.Polymerases, 1.Cond_pheno, 1.Transferases, 1.Polymerases}

Problem with Naive Implementation

Many rules that are not interesting Rules that reflect similarity between neighbors

– 0.nucleus → 1.nucleus

Rules involving only one protein– 0.transcription → 0.nucleus– Note: Different support and confidence compared with ARM

on node relation (protein data ignoring network)

Rules that are a consequence of both above– 0.transcription → 1.nucleus

Algorithm

TID Unique Operation

1 {0.cytoplasm} {1.cytoplasm}

2 {0.metabolism, 0.nucleus} {1.transcription, 1.nucleus}

3 {0.transcription, 0.nucleus} {1.nucleus}

TID Final Transactions

2 {0.metabolism} {1.transcription}

Properties of Algorithm

Significant pruning at transaction level– Fewer items in transactions – Fewer transactions

Note: Pruning at rule or item set level would not be consistent

– Example: 0.transcription → 1.nucleus– Differs from conventional setting: pruning at itemset level is

gold standard Modular approach

– Unique operation can be combined with different ARM implementations

Results for {0.transcription} {1.nucleus}

Standard ARM: – Support = 0.29%, Confidence = 28.38%Rule: {0.transcription} {0.nucleus} (0.70%, 69.59%)

Rule: {0.nucleus} {1.nucleus} (5.74%, 29.06%)

Differential ARM: – Support 0.02%, Confidence 2.08%

Typical range for differential ARM – Support 0.2-2%, Confidence 6-20%

Focus on Differences

Similarity can be tested through calculation of correlation of items with themselves

– Computationally easy– Association meaningless

Typical rules that are interesting to biologists– Which kinds of proteins show compartmental cross-talk?

E.g., proteins in the nucleus interacting with proteins in the mitochondria

– How do interaction definitions differ?E.g., which protein families show physical interactions but no domain fusion interactions

Other Results of Differential ARM

Expected Cross-Talk past analysis papers– {1.mitochondria} {0.cytoplasm} (1.2%, 27.3%)

Interesting Related Rules not found before– {1.mitochondria} {0.nucleus} (0.72%, 16%)– {1.ER} {0.mitochondira} (0.21%, 6%)

Performance

Comparison of Different Protein-Protein Interaction Networks

Different Definitions– Physical interactions– Genetic interactions– Domain fusion

Are the resulting networks biologically equivalent?– Common assumption: All networks signify

similarity in neighbors

Physical Interactions

Tests whether proteins physically interact when brought close together

Yeast-2-hybrid method– A gene is cut in half, and each half attached to

one of the proteins in question – The gene can only perform its job if its parts come

close through the protein-protein interactions

Can be done in any cell, i.e. does not test functions in cell

Genetic Interactions

In vivo analysis, i.e., in the living organism Typical scenario

– Assume gene A and B can individually be deleted and the organism survives

– Assume deleting A and B together means the organism does not survive

Other combinations are possible– Organism does not survive deletion of A and B

individually but does survive combined deletion– Or: Organism survives but is noticeably changed

Domain Fusion Interactions

Comparative Genome Analysis Purely computational analysis Based on evolutionary relationships

– Assume species A has one gene with two domains 1 and 2 – Assume species B has two genes that have the same

evolutionary origin (orthologs), 1' and 2' – Likely that proteins from genes 1' and 2' interact to generate

the same function

24,000 protein-protein interactions in yeast No experimental verification!

Can ARM Results be Compared Between Networks?

Not all proteins studied for all interactions Networks have very different properties

Table Int/orf Max int #>20 #int

Physical 3.55 289 73 14672

Genetic 7.88 157 93 8336

Domain fusion 44.6 231 305 28040

Construction of Network Comparison Basis Set

Transactions

C D E

C D K

G D E

G D K

J H I

Algorithm

Only nodes are considered that are involved in both networks Items are eliminated if they occur in either of the two interaction

types

1.A1.C1.D

0.C0.E

2.A2.B

Results

Rules based on physical interactions have higher confidence than genetic or domain fusion

Example: Physical compared with Domain Fusion{0.ABC trans family signature(PDOC00185)} {1.ATP/GTP binding site motif A(PDOC00017)} (0.45%, 90%)

– ABC family is known to function together with ATP binding site

– Nevertheless ABC family signature don’t occur in one gene together with ATP binding site in other species (no domain fusion)

– Possible reason: many proteins with ATP binding site

Conclusions of Part 2

Differential algorithm to ARM in relations that describe networks contrasts

– Neighbors within neighbors– Multiple networks

Solves other problems of network setting – Overwhelming number of rules due to neighbor similarity– Problem of rules that don’t involve all nodes– Rules that follow from combinations of both above problems

Some results confirmed by biologists Some results interesting but plausible to biologists

Overall Conclusions

Computer science techniques can uncover patterns in data that could not be identified by simple inspection– Large amount of data– Complex data

Exciting times are only starting– Functional genomics only at its beginning

Mutual understanding of language takes time

Overall Conclusions (Data Miner’s perspective)

Bioinformatics excellent “playing field” for data miners

Data more easily available than in most other disciplines (possible exception: astronomy)

Results can directly benefit researchers in biology

Algorithms can/have to pass test of reality Real data motivate fundamentally new

algorithms

Acknowledgements (Part 2)

Computational side Christopher BesemannBiological interpretation (Collaborators from NDSU Dept. of Biology) Ajay Yekkirala Ron Hutchison Marc Anderson

The work was funded by the Dept. of CS, EPSCoR and the NDSU Research Foundation

Biological Networks CSci 732: Introduction to Bioinformatics

Documents

Transcript of Biological Networks CSci 732: Introduction to Bioinformatics