Bioinformatics III “Systems biology”,“Integrative cell biology”

1. Lecture WS 2004/05

Course will address two areas:

- 25% genomics: single protein phylogenies versus genome rearrangement, comparative genomics

- 75% integrated view of cellular networks

Content

Week1 scale-free networks in biology

Week2 transcription, regulatory networks

Week3 protein complexes (Cellzome, Aloy et al. 2004)

Week4 protein networks: exp. data (Y2H; MS), computational data (Rosetta)

Week5 protein networks: graphical layout (force minimization)

Week6 protein networks: quality check (Bayesian analysis)

Week7 protein networks: modularity?

Week8 phylogeny

Week9 genome rearrangement (breakpoint analysis)

Week10+11 metabolic networks: metabolic flux analysis, extreme pathways, elementary modes, C13 method

Week12 mathematical modelling of signal transduction networks

Week13 integration of protein networks with metabolic pathways

Week14 exam

Literature

lecture slides will be available 1-2 days prior to lecture

suggested reading: links will be put up on course website

http://gepard.bioinformatik.uni-saarland.de/teaching...

assignments

12 weekly assignments planned

Homework assignments are handed out in the Thursday lectures and are available on the course website on the same day.

Solutions must be returned by 14.00 on Thursday of the following week to Tihamer Geyer in room 1.09, Geb. 17.1, first floor, or handed in prior (!) to the lecture starting at 14.15. Two students may submit one joint solution.

Also possible: submit the solution by e-mail as one printable PDF file to [email protected].

Tutorial: participation is recommended but not mandatory. Tue 11-13.

Homeworks submitted on Thursdays will be discussed on the following Tuesday.

In case of illness please send E-mail to:

[email protected] and provide a medical certificate.

Schein = successful written exam

The successful participation in the lecture course („Schein“) will be certified upon successful completion of the written exam in February 2005.

Participation in the exam is open to those students who have received 50% of the credit points for the 12 assignments.

Unless published otherwise on the course website at least 3 weeks prior to the exam, the exam will be based on all material covered in the lectures and in the assignments.

In case of illness please send E-mail to:

[email protected] and provide a medical certificate.

A „second and final chance“ exam will be offered in April 2005.

tutor

Dr. Tihamer Geyer – assignments

Geb. 17.1, room 1.09

[email protected]

Systems biology

Biological research in the 1900s followed a reductionist approach: detect an unusual phenotype, isolate/purify one protein/gene, determine its function.

However, it is increasingly clear that a discrete biological function can only rarely be attributed to an individual molecule.

This creates the new task of understanding the structure and dynamics of the complex intercellular web of interactions that contribute to the structure and function of a living cell.

Systems biology

The development of high-throughput data-collection techniques, e.g. microarrays, protein chips and yeast two-hybrid screens, allows us to simultaneously interrogate all cell components at any given time.

There exist various types of interaction webs/networks:

- protein-protein interaction network

- metabolic network

- signalling network

- transcription/regulatory network ...

These networks are not independent but form a „network of networks“.

DOE initiative: Genomes to Life, a coordinated effort

slides borrowed from a talk by Marvin Frazier, Life Sciences Division, U.S. Dept of Energy

Facility I Production and Characterization of Proteins

Estimating Microbial Genome Capability

• Computational Analysis

– Genome analysis of genes, proteins, and operons

– Metabolic pathways analysis from reference data

– Protein machines estimate from PM reference data

• Knowledge Captured

– Initial annotation of genome

– Initial perceptions of pathways and processes

– Recognized machines, function, and homology

– Novel proteins/machines (including prioritization)

– Production conditions and experience

Facility II Whole Proteome Analysis

Modeling Proteome Expression, Regulation, and Pathways

• Analysis and Modeling

– Mass spectrometry expression analysis

– Metabolic and regulatory pathway/network analysis and modeling

• Knowledge Captured

– Expression data and conditions

– Novel pathways and processes

– Functional inferences about novel proteins/machines

– Genome super annotation: regulation, function, and processes (deep knowledge about cellular subsystems)

Facility III Characterization and Imaging of Molecular Machines

Exploring Molecular Machine Geometry and Dynamics

• Computational Analysis, Modeling and Simulation

– Image analysis/cryoelectron microscopy

– Protein interaction analysis/mass spec

– Machine geometry and docking modeling

– Machine biophysical dynamic simulation

• Knowledge Captured

– Machine composition, organization, geometry,

assembly and disassembly

– Component docking and dynamic simulations

of machines

Facility IV Analysis and Modeling of Cellular Systems

Simulating Cell and Community Dynamics

• Analysis, Modeling and Simulation

– Couple knowledge of pathways, networks, and

machines to generate an understanding of

cellular and multi-cellular systems

– Metabolism, regulation, and machine simulation

– Cell and multicell modeling and flux visualization

• Knowledge Captured

– Cell and community measurement data sets

– Protein machine assembly time-course data sets

– Dynamic models and simulations of cell processes

“Genomes To Life” Computing Roadmap

[Figure: roadmap chart. Horizontal axis: Biological Complexity. Vertical axis: Computing and Information Infrastructure Capabilities. Labelled milestones: Comparative Genomics; Constraint-Based Flexible Docking; Constrained rigid docking; Genome-scale protein threading; Community metabolic, regulatory, signaling simulations; Molecular machine classical simulation; Protein machine Interactions; Cell, pathway, and network simulation; Molecule-based cell simulation; Current U.S. Computing.]

First breakthrough: scale-free metabolic networks

(d) The degree distribution, P(k), of the metabolic network illustrates its scale-free topology.

(e) The scaling of the clustering coefficient C(k) with the degree k illustrates the hierarchical architecture of metabolism. (The data shown in d and e represent an average over 43 organisms.)

(f) The flux distribution in the central metabolism of Escherichia coli follows a power law, which indicates that most reactions have small metabolic flux, whereas a few reactions, with high fluxes, carry most of the metabolic activity.

It should be noted that on all three plots the axes are logarithmic, and a straight line on such log–log plots indicates a power-law scaling.

CTP, cytidine triphosphate; GLC, aldo-hexose glucose; UDP, uridine diphosphate; UMP, uridine monophosphate; UTP, uridine triphosphate.

Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)

Second breakthrough: Yeast protein interaction network: first example of a scale-free network

A map of protein–protein interactions in Saccharomyces cerevisiae, which is based on early yeast two-hybrid measurements, illustrates that a few highly connected nodes (which are also known as hubs) hold the network together.

The largest cluster, which contains 78% of all proteins, is shown. The colour of a node indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown).

Barabasi & Oltvai, Nature Rev Gen 5, 101 (2004)

Characterising metabolic networks

Barabasi & Oltvai, Nature Rev Gen 5, 101 (2004)

To study the network characteristics of the metabolism, a graph theoretic description needs to be established.

(a) illustrates the graph theoretic description for a simple pathway (catalysed by Mg2+-dependent enzymes).

(b) In the most abstract approach all interacting metabolites are considered equally. The links between nodes represent reactions that interconvert one substrate into another. For many biological applications it is useful to ignore co-factors, such as the high-energy-phosphate donor ATP, which results in

(c) a second type of mapping that connects only the main source metabolites to the main products.

Degree

Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)

The most elementary characteristic of a node is its degree (or connectivity), $k$, which tells us how many links the node has to other nodes.

(a) In the undirected network, node A has $k = 5$.

(b) In networks in which each link has a selected direction there is an incoming degree, $k_{\mathrm{in}}$, which denotes the number of links that point to a node, and an outgoing degree, $k_{\mathrm{out}}$, which denotes the number of links that start from it. E.g., node A in (b) has $k_{\mathrm{in}} = 4$ and $k_{\mathrm{out}} = 1$.

An undirected network with $N$ nodes and $L$ links is characterized by an average degree $\langle k \rangle = 2L/N$ (where $\langle\,\rangle$ denotes the average).
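As a minimal sketch of these definitions, the Python snippet below counts degrees from an edge list. The two edge lists are hypothetical stand-ins for the figure (which is not reproduced in this transcript); they were chosen only so that node A has degree 5 in the undirected case and in-degree 4, out-degree 1 in the directed case. The sketch also assumes every node appears in at least one edge.

```python
from collections import Counter

def degrees(edges):
    """Degree k of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def in_out_degrees(edges):
    """Incoming and outgoing degree for a directed edge list of (source, target)."""
    k_in, k_out = Counter(), Counter()
    for u, v in edges:
        k_out[u] += 1
        k_in[v] += 1
    return k_in, k_out

def average_degree(edges):
    """Average degree <k> = 2L/N of an undirected network without isolated nodes."""
    n_nodes = len(degrees(edges))
    return 2 * len(edges) / n_nodes

# Hypothetical example networks (not taken from the original figure).
undirected = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "G"), ("B", "C")]
directed = [("B", "A"), ("C", "A"), ("D", "A"), ("E", "A"), ("A", "F")]

k = degrees(undirected)
k_in, k_out = in_out_degrees(directed)
print(k["A"], k_in["A"], k_out["A"], average_degree(undirected))
```

Each undirected link contributes to the degree of two nodes, which is where the factor 2 in 2L/N comes from.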

Degree distribution

Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)

The degree distribution, $P(k)$, gives the probability that a selected node has exactly $k$ links. $P(k)$ is obtained by counting the number of nodes $N(k)$ with $k = 1, 2, \ldots$ links and dividing by the total number of nodes $N$.

The degree distribution allows us to distinguish between different classes of networks.

For example, a peaked degree distribution, as seen in a random network, indicates that the system has a characteristic degree and that there are no highly connected nodes (which are also known as hubs).

By contrast, a power-law degree distribution indicates that a few hubs hold together numerous small nodes.
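The counting rule P(k) = N(k)/N can be sketched in a few lines of Python. The degree table used here is a made-up example, not data from the lecture.

```python
from collections import Counter

def degree_distribution(degree_by_node):
    """Return {k: P(k)} with P(k) = N(k) / N for a mapping node -> degree."""
    n_total = len(degree_by_node)
    n_k = Counter(degree_by_node.values())  # N(k): number of nodes with degree k
    return {k: n_k[k] / n_total for k in sorted(n_k)}

# Hypothetical degrees of a six-node network with one small hub.
example_degrees = {"A": 5, "B": 2, "C": 2, "D": 1, "E": 1, "G": 1}
print(degree_distribution(example_degrees))
```

Plotting the resulting values on log–log axes is the usual way to check by eye whether a distribution looks peaked or heavy-tailed.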

Random networks

Barabasi & Oltvai, Nature Rev Gen 5, 101 (2004)

(Aa) The Erdős–Rényi (ER) model of a random network starts with $N$ nodes and connects each pair of nodes with probability $p$, which creates a graph with approximately $pN(N-1)/2$ randomly placed links.

(Ab) The node degrees follow a Poisson distribution, where most nodes have approximately the same number of links (close to the average degree $\langle k \rangle$). The tail (high $k$ region) of the degree distribution $P(k)$ decreases exponentially, which indicates that nodes that significantly deviate from the average are extremely rare.

(Ac) The clustering coefficient is independent of a node's degree, so $C(k)$ appears as a horizontal line if plotted as a function of $k$. The mean path length is proportional to the logarithm of the network size, $\ell \sim \log N$, which indicates that it is characterized by the small-world property.
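A minimal sketch of the ER construction, using only the Python standard library: every one of the N(N-1)/2 node pairs is linked independently with probability p. The function name and the `seed` parameter are illustrative choices, not part of the lecture material.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Edge list of an ER random graph on nodes 0..n-1."""
    rng = random.Random(seed)
    # Visit each unordered pair once and keep it with probability p.
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

n, p = 200, 0.1
edges = erdos_renyi(n, p, seed=0)
expected_links = p * n * (n - 1) / 2
print(len(edges), expected_links)
```

The number of links fluctuates around the expected value from run to run; it matches pN(N-1)/2 only on average.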

Origin of scale-free topology and hubs in biological networks

Barabasi & Oltvai, Nature Rev Gen 5, 101 (2004)

The origin of the scale-free topology in complex networks can be reduced to two basic mechanisms: growth and preferential attachment.

Growth means that the network emerges through the subsequent addition of new nodes, such as the new red node that is added to the network that is shown in part a.

Preferential attachment means that new nodes prefer to link to more connected nodes. For example, the probability that the red node will connect to node 1 is twice as large as connecting to node 2, as the degree of node 1 ($k_1 = 4$) is twice the degree of node 2 ($k_2 = 2$).

Growth and preferential attachment generate hubs through a 'rich-gets-richer' mechanism: the more connected a node is, the more likely it is that new nodes will link to it, which allows the highly connected nodes to acquire new links faster than their less connected peers.

Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)

Scale-free networks

Scale-free networks are characterized by a power-law degree distribution; the probability that a node has $k$ links follows $P(k) \sim k^{-\gamma}$, where $\gamma$ is the degree exponent. The probability that a node is highly connected is statistically more significant than in a random graph, the network's properties often being determined by a relatively small number of highly connected nodes („hubs“, see blue nodes in Ba).

In the Barabási–Albert model of a scale-free network, at each time point a node with $M$ links is added to the network, which connects to an already existing node $I$ with probability $\Pi_I = k_I / \sum_J k_J$, where $k_I$ is the degree of node $I$ and $J$ is the index denoting the sum over network nodes. The network that is generated by this growth process has a power-law degree distribution with $\gamma = 3$.

Such distributions are seen as a straight line on a log–log plot (Bb). The network that is created by the Barabási–Albert model does not have an inherent modularity, so $C(k)$ is independent of $k$ (Bc).

Scale-free networks with degree exponents $2 < \gamma < 3$, a range that is observed in most biological and non-biological networks, are ultra-small, with the average path length following $\ell \sim \log \log N$, which is significantly shorter than the $\log N$ that characterizes random small-world networks.
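The growth rule of the Barabási–Albert model can be sketched as follows. Several details are assumptions made for this illustration and are not specified on the slide: the network starts from a fully connected core of m + 1 nodes, and the m targets of a new node are drawn by repeated sampling until they are distinct, which only approximates independent draws with probability proportional to degree.

```python
import random

def barabasi_albert(n, m, seed=None):
    """Edge list of a network grown by preferential attachment (n > m >= 1)."""
    rng = random.Random(seed)
    # Assumed starting point: a fully connected core of m + 1 nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # Every node appears in 'stubs' once per link end, so a uniform draw
    # from this list picks node I with probability k_I / sum_J k_J.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for target in sorted(targets):
            edges.append((new, target))
            stubs.extend((new, target))
    return edges

edges = barabasi_albert(1000, 2, seed=42)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(max(degree.values()), sum(degree.values()) / len(degree))
```

In a typical run the largest degree is many times the average degree, which is the hub signature of the 'rich-gets-richer' mechanism.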

Network measures

Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)

Scale-free networks and the degree exponent

Most biological networks are scale-free, which means that their degree distribution approximates a power law, $P(k) \sim k^{-\gamma}$, where $\gamma$ is the degree exponent and $\sim$ indicates 'proportional to'. The value of $\gamma$ determines many properties of the system. The smaller the value of $\gamma$, the more important the role of the hubs is in the network.

Whereas for $\gamma > 3$ the hubs are not relevant, for $2 < \gamma < 3$ there is a hierarchy of hubs, with the most connected hub being in contact with a small fraction of all nodes, and for $\gamma = 2$ a hub-and-spoke network emerges, with the largest hub being in contact with a large fraction of all nodes.

In general, the unusual properties of scale-free networks are valid only for $\gamma < 3$, when the dispersion of the $P(k)$ distribution, which is defined as $\sigma^2 = \langle k^2 \rangle - \langle k \rangle^2$, increases with the number of nodes (that is, $\sigma$ diverges), resulting in a series of unexpected features, such as a high degree of robustness against accidental node failures. For $\gamma > 3$, however, most unusual features are absent, and in many respects the scale-free network behaves like a random one.
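To make the dispersion concrete, the sketch below evaluates the variance of the degree, <k^2> - <k>^2, for a degree sequence. The star ("hub-and-spoke") and regular sequences are illustrative extremes chosen for this example; they are not taken from the lecture.

```python
def degree_dispersion(degrees):
    """Variance of the degree: <k^2> - <k>^2."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    return mean_k2 - mean_k ** 2

def star_degrees(n):
    """Degree sequence of a star: one hub linked to n - 1 leaves."""
    return [n - 1] + [1] * (n - 1)

for n in (5, 101, 1001):
    print(n, degree_dispersion(star_degrees(n)))
print(degree_dispersion([3, 3, 3, 3]))  # every node has the same degree
```

For the star, the dispersion keeps growing with the number of nodes, whereas a network in which all nodes have the same degree has zero dispersion regardless of size.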

Shortest path and mean path length

Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)

Distance in networks is measured with the path length, which tells us how many links we need to pass through to travel between two nodes. As there are many alternative paths between two nodes, the shortest path (the path with the smallest number of links between the selected nodes) has a special role.

In directed networks, the distance $\ell_{AB}$ from node A to node B is often different from the distance $\ell_{BA}$ from B to A. E.g. in b, $\ell_{BA} = 1$, whereas $\ell_{AB} = 3$. Often there is no direct path between two nodes. As shown in b, although there is a path from C to A, there is no path from A to C.

The mean path length, $\langle \ell \rangle$, represents the average over the shortest paths between all pairs of nodes and offers a measure of a network's overall navigability.
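Shortest path lengths in an unweighted network can be found with breadth-first search; the sketch below shows one way to do it. The directed graph is a hypothetical example (the original figure is not part of this transcript) built so that the distance from B to A is 1, from A to B is 3, and C can reach A but not the reverse. As a further assumption, the mean is taken over ordered pairs that are actually connected by a path.

```python
from collections import deque

def shortest_path_length(adj, source, target):
    """Number of links on a shortest directed path, or None if unreachable."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == target:
                    return dist[v]
                queue.append(v)
    return None

def mean_path_length(adj):
    """Average shortest path length over all reachable ordered node pairs."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    lengths = []
    for s in nodes:
        for t in nodes:
            if s != t:
                d = shortest_path_length(adj, s, t)
                if d is not None:
                    lengths.append(d)
    return sum(lengths) / len(lengths) if lengths else None

# Hypothetical directed network: node -> list of nodes it points to.
adj = {"A": ["D"], "D": ["E"], "E": ["B"], "B": ["A"], "C": ["A"]}
print(shortest_path_length(adj, "B", "A"), shortest_path_length(adj, "A", "B"))
print(shortest_path_length(adj, "C", "A"), shortest_path_length(adj, "A", "C"))
```

Running one search per node pair is fine for small examples; for larger networks a single breadth-first search per source node gives all distances from that node at once.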

Clustering coefficient

Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)

In many networks, if node A is connected to B, and B is connected to C, then it is highly probable that A also has a direct link to C. This phenomenon can be quantified using the clustering coefficient $C_I = 2n_I / (k_I(k_I - 1))$, where $n_I$ is the number of links connecting the $k_I$ neighbours of node $I$ to each other. In other words, $n_I$ gives the number of 'triangles' that go through node $I$, whereas $k_I(k_I - 1)/2$ is the total number of triangles that could pass through node $I$, should all of node $I$'s neighbours be connected to each other.

For example, only one pair of node A's five neighbours in a are linked together (B and C), which gives $n_A = 1$ and $C_A = 2/20$. By contrast, none of node F's neighbours link to each other, giving $C_F = 0$.

The average clustering coefficient, $\langle C \rangle$, characterizes the overall tendency of nodes to form clusters or groups. An important measure of the network's structure is the function $C(k)$, which is defined as the average clustering coefficient of all nodes with $k$ links. For many real networks $C(k) \sim k^{-1}$, which is an indication of a network's hierarchical character.

The average degree $\langle k \rangle$, average path length $\langle \ell \rangle$ and average clustering coefficient $\langle C \rangle$ depend on the number of nodes and links ($N$ and $L$) in the network. By contrast, the $P(k)$ and $C(k)$ functions are independent of the network's size and they therefore capture a network's generic features, which allows them to be used to classify various networks.
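The clustering coefficient can be computed directly from its definition, as in the sketch below. The small undirected network is hypothetical: it was constructed only to reproduce the quoted values for node A (five neighbours, one linked pair) and node F (no linked neighbours). Returning 0 for nodes with fewer than two neighbours is a convention assumed here.

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    """C = 2n / (k(k-1)), with n the number of links among the node's k neighbours."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: no neighbour pair, no triangle possible
    n_links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2 * n_links / (k * (k - 1))

def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

# Hypothetical network: A has neighbours B, C, D, E, G; only B and C are linked.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "G"),
         ("B", "C"), ("F", "D"), ("F", "E")]
adj = build_adjacency(edges)
print(clustering_coefficient(adj, "A"), clustering_coefficient(adj, "F"))
```

Averaging these values over all nodes gives the average clustering coefficient, and averaging over nodes of equal degree gives C(k).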

Barabasi & Oltvai, Nature Rev Gen 5, 101 (2004)

Hierarchical networks

To account for the coexistence of modularity, local clustering and scale-free topology in many real systems it has to be assumed that clusters combine in an iterative manner, generating a hierarchical network.

The starting point of this construction is a small cluster of 4 densely linked nodes (4 central nodes in Ca).

Next, 3 replicas of this module are generated and the 3 external nodes of the replicated clusters connected to the central node of the old cluster, which produces a large 16-node module.

3 replicas of this 16-node module are then generated and the 16 peripheral nodes connected to the central node of the old module, which produces a new module of 64 nodes.

The hierarchical network model seamlessly integrates a scale-free topology with an inherent modular structure by generating a network that has a power-law degree distribution with degree exponent $\gamma = 1 + \ln 4 / \ln 3 = 2.26$ (Cb) and a large, system-size independent average clustering coefficient $\langle C \rangle \sim 0.6$.

The most important signature of hierarchical modularity is the scaling of the clustering coefficient, which follows $C(k) \sim k^{-1}$, a straight line of slope -1 on a log–log plot (Cc). A hierarchical architecture implies that sparsely connected nodes are part of highly clustered areas, with communication between the different highly clustered neighbourhoods being maintained by a few hubs (Ca).

Summary

Many cellular networks show properties of scale-free networks

- protein-protein interaction networks

- metabolic networks

- genetic regulatory networks (where nodes are individual genes and links are derived from expression correlation, e.g. from microarray data)

- protein domain networks

However, not all cellular networks are scale-free.

E.g. the transcription regulatory networks of S. cerevisiae and E. coli are examples of mixed scale-free and exponential characteristics.

Next lecture:

- mathematical properties of networks

- origin of scale-free topology

- topological robustness

Barabasi & Oltvai, Nature Rev Gen 5, 101 (2004)