Biological Networks CSci 732: Introduction to Bioinformatics
-
Upload
barbara-blackwell -
Category
Documents
-
view
25 -
download
2
description
Transcript of Biological Networks CSci 732: Introduction to Bioinformatics
Biological Networks
CSci 732: Introduction to Bioinformatics
Anne DentonAssistant ProfessorDepartment of Computer Science
North Dakota State University, Fargo, ND
The Promise of Bioinformatics
Rich supply of data from high-throughput experiments– Sequencing– Microarray experiments– Scores of specialized high-throughput experiments
Data made available– Most biological data disseminated in large biological
databases– Dissemination is a condition of funding (in US)– Makes biology different from most other disciplines!
Challenge and Opportunity
Traditional approach– Studies of one or a few gene at a time– Conclusions based on thorough domain knowledge
Challenge– Standard data analysis techniques inadequate for hundreds
or thousands of genes
Opportunity– Massive data valuable to quantify evidence– New aspects can be studied, such as network structure
From Sequencing to Functional Genomics
Sequencing genomes is a mature discipline experimentally and computationally– Whole-genome sequencing centrally involved
computations
But what do the proteins do?– Sequence comparison (BLAST) among species– Great computational and experimental challenges– Networks have a fundamental role in specifying
relationships
Computer Scientist’s Introduction to Genomics
Encoding– Smallest unit of information
Computer: 1 bit (0 or 1) Cell: 1 nucleotide of DNA (A,C,T, or G)
– Most practical unit of information Computer: 8 bit are 1 byte
(can represent 256 values) Cell: 3 nucleotide determine 1 amino acid of the protein
(20 different amino acids)
Further Analogy between Cells and Computers
Encoded values can serve very different purposes and are stored in the same location– Computer: Instructions and data are stored in the
same memory (von Neumann architecture)– Cell: Proteins in the cell do very different things
Catalyze chemical reactions (proteins are part of the process)
Regulate other proteins (proteins change the process)
Networks in Bioinformatics
Many network definitions:– Protein-protein
interactions– Biochemical
pathways – Annotation Networks
Different definitions in each category
Scientific American 05/03
Why Study Networks?
Biochemical pathways tell us about functioning of cell– Chemical processes (Metabolic pathways)– Control of other proteins (Regulatory pathways)
Neighbors in networks often have similar function
Structure of networks can tell us about evolution Combined study of networks and data can
uncover yet more information about cells
Outline
Part 1: Properties of the Networks– Scale-free networks
Part 2: Networks and data– Relational data mining– Problem: Similarity between network neighbors– Solution: Focus on differences between neighbors– Comparison of different networks
Example of a Biological Network:Physical Protein-Protein Interactions
Proteins interact (attach to each other)– Tests if proteins are stable in a close position– Proteins may perform function together (not tested)– Mathematically: Undirected graph
Only one definition of interactions between proteins? No!– Definition based on function: Genetic– Definition based on evolution: Domain fusion
Physical
Protein-Protein in Yeast
Scientific American 05/03
Scale-free Networks
Properties Barabasi, Bonabeau 1998 Some nodes have large number of links,
most have only a few Number of nodes that have a particular
number of links decreases as a power law Robust against accidental failures Hubs with high connectivity
Power-law Behavior
Probability that any node is connected to k other nodes is 1/k n with n between 2 and 3
For k = 2: Probability of having twice as many links is a quarter as likely
Scientific American 05/03
Robust against accidental failures
As many as 80% of nodes can fail in a scale-free network without breaking up the entire cluster
I.e., even with large number of random mutations in genes, unaffected proteins continue to work together
Note that random removal of nodes is unlikely to remove hubs
Removal of only a few hubs does break up network significantly
Examples Outside Biology
Hyperlink structure of the Internet Physical structure (routers and
communication lines) of the Internet Social networks Airline system Scientific papers connected by citations
Scale-free Network (Airline System)
Scientific American 05/03
Random Networks in Contrast
Links placed randomly – Mathematical model (Erdos 1959)
Example: Highway system Bell-shaped (Poisson, similar to Gauss) distribution of number of
nodes around typical value For large number of links k,
probability of k links
decreases exponentially Very unlikely to have
nodes with a very large
number of links
Random Network (Highway System)
Scientific American 05/03
Reasons for the development of scale-free networks
Networks grow over time– Older nodes have had longer to accumulate links – The most connected nodes in E.coli metabolic
network have an early evolutionary history
Preferential attachment – "The rich get richer“
cf. many people hyper-link to Google
"Small World" Property
Game "Six Degrees of Kevin Bacon ": Acting in the same movie as links connects most actors in 6 steps
Internet pages are typically 19 clicks apart Any two chemicals in a cell are only 3
reactions apart! Small-world property is not limited to scale-
free networks
Are Protein Networks Pure Scale-free Networks?
Clusters of tightly connected nodes Example
– Proteins that perform function together
Recovering scale free network – Clustering of nodes leads to groups that interact
as scale-free networks
Summary of Scale-free Networks
Characterized by – Few hubs with a large number of edges– Many nodes with few edges
Show small-world property– Any node can be reached from any other in only a
few steps
Ubiquitous in biology and outside
Has Everything Interesting Related to Networks Been Done?
Graph theory is an old topic:
Euler 1736 Work on scale-
free network
has added to it Surely now
everything is done!
Protein-Protein Interaction Networks
Name: YAL003W
Function: Protein Synthesis
Localization: Cytoplasm
Class: GTP/GDP-exchange factors
Complex: Translation complexes
MOTIF: PDOC00648
Name: YAL040C
Function: Cell Cycle & DNA Process
Localization: nucleus
Class: Cyclins
Complex: Cyclin-dependent kinases
Motif: PDOC00013
Phenotype: Cell cycle defects
Outline (2): Data and Networks
Relational rather than graph-theoretic approach to mining of data on a graph– Problem of similarity between neighbors
New Algorithm– Focuses on differences
Comparison of Networks– Different definitions of Networks– Generalization of difference-based algorithm
Questions of Interest
How do data between nodes relate?– Are there typical patterns among interacting
proteins? Can we find relationships that are not yet
known to biologists?– It is expected that proteins in the nucleus interact
with other proteins in the nucleus– It is more surprising if proteins in the nucleus
interact with proteins in the mitochondria How do different networks compare?
Frequent Patterns in Tables:Association Rule Mining
1st step: Finding sets of items that are frequent– Originally: Items in shopping carts– Here: Properties of proteins– Support: Fraction of transactions, in which the set of
items occurs 2nd step: Finding associations A -> B
– If we find set A, we are likely to find set B– Similar to correlation, but goes in one direction only– Confidence: Fraction of transactions that have A,
which also have B
Relational Approach
Node Table
ORF Annotations
YPR184W {cytoplasm}
YER146W {cytoplasm}
YNL287W {SensitivityTOaaaod}
YBL026W {transcription, nucleus}
YMR207C {nucleus}
Edge Table
ORF1 ORF1
YPR184W YER146W
YNL287W YBL026W
YBL026W YMR207C
Node Table
ORF Annotations
YPR184W {cytoplasm}
YER146W {cytoplasm}
YNL287W {SensitivityTOaaaod}
YBL026W {transcription, nucleus}
YMR207C {nucleus}
0. 1.
Generalization
Works for any number of nodes
Even 2-node structure allows complex rules
Results become harder to interpret for many nodes
“Small world property”: Any protein can be reached in 3-4 hops
Effect of Joining on Item Sets
ARM on Joined Tables
Protein names (key) used for joining but not in ARM One node table participates multiple times
– Items labeled to keep track of instance
Typical transactions– Simplest approach had been done
overemphasizes similarities
T1 {0.cyto, 0.Cond_pheno, 0. PDOC00030, 1.Hydrolases}
T2 {0.Cond_pheno, 0.Transferases, 0.Polymerases, 1.Cond_pheno, 1.Transferases, 1.Polymerases}
Problem with Naive Implementation
Many rules that are not interesting Rules that reflect similarity between neighbors
– 0.nucleus → 1.nucleus
Rules involving only one protein– 0.transcription → 0.nucleus– Note: Different support and confidence compared with ARM
on node relation (protein data ignoring network)
Rules that are a consequence of both above– 0.transcription → 1.nucleus
Algorithm
TID Unique Operation
1 {0.cytoplasm} {1.cytoplasm}
2 {0.metabolism, 0.nucleus} {1.transcription, 1.nucleus}
3 {0.transcription, 0.nucleus} {1.nucleus}
TID Final Transactions
2 {0.metabolism} {1.transcription}
Properties of Algorithm
Significant pruning at transaction level– Fewer items in transactions – Fewer transactions
Note: Pruning at rule or item set level would not be consistent
– Example: 0.transcription → 1.nucleus– Differs from conventional setting: pruning at itemset level is
gold standard Modular approach
– Unique operation can be combined with different ARM implementations
Results for {0.transcription} {1.nucleus}
Standard ARM: – Support = 0.29%, Confidence = 28.38%Rule: {0.transcription} {0.nucleus} (0.70%, 69.59%)
Rule: {0.nucleus} {1.nucleus} (5.74%, 29.06%)
Differential ARM: – Support 0.02%, Confidence 2.08%
Typical range for differential ARM – Support 0.2-2%, Confidence 6-20%
Focus on Differences
Similarity can be tested through calculation of correlation of items with themselves
– Computationally easy– Association meaningless
Typical rules that are interesting to biologists– Which kinds of proteins show compartmental cross-talk?
E.g., proteins in the nucleus interacting with proteins in the mitochondria
– How do interaction definitions differ?E.g., which protein families show physical interactions but no domain fusion interactions
Other Results of Differential ARM
Expected Cross-Talk past analysis papers– {1.mitochondria} {0.cytoplasm} (1.2%, 27.3%)
Interesting Related Rules not found before– {1.mitochondria} {0.nucleus} (0.72%, 16%)– {1.ER} {0.mitochondira} (0.21%, 6%)
Performance
Comparison of Different Protein-Protein Interaction Networks
Different Definitions– Physical interactions– Genetic interactions– Domain fusion
Are the resulting networks biologically equivalent?– Common assumption: All networks signify
similarity in neighbors
Physical Interactions
Tests whether proteins physically interact when brought close together
Yeast-2-hybrid method– A gene is cut in half, and each half attached to
one of the proteins in question – The gene can only perform its job if its parts come
close through the protein-protein interactions
Can be done in any cell, i.e. does not test functions in cell
Genetic Interactions
In vivo analysis, i.e., in the living organism Typical scenario
– Assume gene A and B can individually be deleted and the organism survives
– Assume deleting A and B together means the organism does not survive
Other combinations are possible– Organism does not survive deletion of A and B
individually but does survive combined deletion– Or: Organism survives but is noticeably changed
Domain Fusion Interactions
Comparative Genome Analysis Purely computational analysis Based on evolutionary relationships
– Assume species A has one gene with two domains 1 and 2 – Assume species B has two genes that have the same
evolutionary origin (orthologs), 1' and 2' – Likely that proteins from genes 1' and 2' interact to generate
the same function
24,000 protein-protein interactions in yeast No experimental verification!
Can ARM Results be Compared Between Networks?
Not all proteins studied for all interactions Networks have very different properties
Table Int/orf Max int #>20 #int
Physical 3.55 289 73 14672
Genetic 7.88 157 93 8336
Domain fusion 44.6 231 305 28040
Construction of Network Comparison Basis Set
Transactions
C D E
C D K
G D E
G D K
J H I
Algorithm
Only nodes are considered that are involved in both networks Items are eliminated if they occur in either of the two interaction
types
1.A1.C1.D
0.C0.E
2.A2.B
Results
Rules based on physical interactions have higher confidence than genetic or domain fusion
Example: Physical compared with Domain Fusion{0.ABC trans family signature(PDOC00185)} {1.ATP/GTP binding site motif A(PDOC00017)} (0.45%, 90%)
– ABC family is known to function together with ATP binding site
– Nevertheless ABC family signature don’t occur in one gene together with ATP binding site in other species (no domain fusion)
– Possible reason: many proteins with ATP binding site
Conclusions of Part 2
Differential algorithm to ARM in relations that describe networks contrasts
– Neighbors within neighbors– Multiple networks
Solves other problems of network setting – Overwhelming number of rules due to neighbor similarity– Problem of rules that don’t involve all nodes– Rules that follow from combinations of both above problems
Some results confirmed by biologists Some results interesting but plausible to biologists
Overall Conclusions
Computer science techniques can uncover patterns in data that could not be identified by simple inspection– Large amount of data– Complex data
Exciting times are only starting– Functional genomics only at its beginning
Mutual understanding of language takes time
Overall Conclusions (Data Miner’s perspective)
Bioinformatics excellent “playing field” for data miners
Data more easily available than in most other disciplines (possible exception: astronomy)
Results can directly benefit researchers in biology
Algorithms can/have to pass test of reality Real data motivate fundamentally new
algorithms
Acknowledgements (Part 2)
Computational side Christopher BesemannBiological interpretation (Collaborators from NDSU Dept. of Biology) Ajay Yekkirala Ron Hutchison Marc Anderson
The work was funded by the Dept. of CS, EPSCoR and the NDSU Research Foundation