582746 Modelling and Analysis in Bioinformatics
Transcript of 582746 Modelling and Analysis in Bioinformatics
![Page 1: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/1.jpg)
582746 Modelling and Analysis in Bioinformatics
Lecture 1: Global Network Models
26.10.2015
![Page 2: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/2.jpg)
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
![Page 3: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/3.jpg)
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
![Page 4: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/4.jpg)
Course topics
I Computational models for biological networks (Leena Salmela)
I Gene regulation (Antti Honkela)
I Probabilistic analysis of sequence level problems(Veli Makinen)
![Page 5: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/5.jpg)
Practical arrangements
I Mondays: Lectures to introduce the topics
I Thursday mornings: Study group to deepen the knowledge onthe subject
I Thursday afternoons: Exercise sessions
I 23.11.-27.11. Visiting lecturers (no exercise session)
![Page 6: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/6.jpg)
How to pass the course?
I Attending study groups on Thursday mornings is mandatory
I Attending visiting lectures on Monday 23.11. and Thursday26.11. is mandatory
I Submit the exercises and get at least 6 points for each threeexercise sets (network models, gene regulation, probabilisticanalysis of sequence-level problems)
I If you miss a study group or visiting lecture, contact thelecturers for an alternative assignment
![Page 7: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/7.jpg)
Grading
I Grading is based on submitted exercises
I 60 points will be available
I 30 points =⇒ Passed, 50 points =⇒ 5
I No exam
I Not possible to pass with a separate exam
![Page 8: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/8.jpg)
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
![Page 9: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/9.jpg)
Protein-protein interaction network
I Vertices are proteins
I The proteins are connected if they interact with each other.
![Page 10: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/10.jpg)
Metabolic network
I Vertices are metabolites,i.e. chemical compounds
I Edges describe how thecell can transform ametabolite into another
![Page 11: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/11.jpg)
Gene regulatory networkI Vertices are genesI Genes are linked if one regulates the other
Source: Shen-Orr SS, Milo R, Mangan S, Alon U. 2002
![Page 12: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/12.jpg)
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
![Page 13: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/13.jpg)
(Shortest path) distance
I Distance dij is the length of theshortest path between vertices ni
and nj , i.e. the minimal number ofedges one needs to traverse to getfrom ni to nj
I The shortest path may not beunique, but the length of theshortest path is unique
I In directed network, we may havedij 6= dji
I If there is no path between ni andnj , we have dij =∞
Ignoring weights, above wehave d12 = 2,d13 = 1,d14 = 2,. . .
![Page 14: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/14.jpg)
Diameter and average path length
I The diameter dm = max(dij ) isthe maximal distance between anytwo nodes (= the longest shortestpath)
I Average or characteristic pathlengthd = 〈dij〉 = 1
N2V
∑NVi=1
∑NVj=1 dij Ignoring weights, above
dm = 3, d ≈ 1.57
![Page 15: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/15.jpg)
Efficiency
I Efficiency, or average inverse pathlength: deff = 〈1/dij〉
I Useful when average path lengthis infinite (disconnected network)
I Fully connected network hasefficiency deff = 1, graph with noedges has deff = 0 Ignoring weights, above
deff ≈ 0.73
![Page 16: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/16.jpg)
Weighted graphs
I If the edges in the graph haveassociated weights wij , it isnatural to define distances basedon the weights:
I dij as the sum of weights in theminimum weight path between ni
and nj
I Maximum and average pathlength as well as efficiencynaturally generalize by changingthe distance measure to theweighted version
With weights, d12 = 40 (redpath), d ≈ 24.2,dm = 50 = d25, deff = 0.06
![Page 17: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/17.jpg)
Finding shortest paths
Finding shortest paths in graphs is part of classical algorithmtheory, two efficient algorithms
I Dijkstra’s algorithm: given a vertex find shortest paths to allother vertices, basic implementation runs in O(N2
V ) time, canbe implemented faster for sparse graphs
I Floyd-Warshall algorithm: find shortest paths for all pairs ofvertices in the graph in O(N3
V ) time; outputs a distance
matrix (dij )NVi ,j=1 in same time.
I Both work with weighted formulations
![Page 18: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/18.jpg)
Shortest path distances in empirical networks
Path length analysis of many networksthat occur in nature reveals thesmall-world property
I Metabolite graphs: average pathlength d ≈ 3 (NV ≈ 103 − 104)
I WWW: links chains between twoweb documents d ≈ 16(NV > 109)
I Erdos number: shortest co-authorchain to Paul Erdos, d = 4.65(NV ≈ 4× 105)
I ...
Paul Erdos (1913-1996, a Hungarian
mathematician, published over 1400 scientific
papers over his lifetime with over 500
different co-authors
![Page 19: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/19.jpg)
Node degree
I Degree ki of vertex ni is thenumber of edges adjacent to avertex
I In a network without self-loopsand without multiple edgesbetween any pair of edges: degree= number of neighbours
I In directed networks: in-degree isthe number of incoming edges andout-degree is the number outgoingedges
![Page 20: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/20.jpg)
Degree distribution
I Given a fixed set of vertices, p(k) denotes the probability thata randomly chosen vertex has degree k .
I (Empirical) degree distribution is the list of probabilities (orrelative frequencies) p(k), k = 0 . . .NV .
I Analysis of the degree distribution is an important means tocharacterize networks
![Page 21: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/21.jpg)
Degree distribution
Fitting the empirical degreedistribution to a theoretical distributiongiven by a mathematical law is animportant tool for network analysis
I Regular lattice: p(k) ≈ 1, where kis a constant
I Scale free network: p(k) ∝ k−γ
I Random network:p(k) ∝
(NV−1k
)pk (1− p)NV−1−k
![Page 22: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/22.jpg)
Degree distribution
The degree distributions of of scale-free network and randomnetwork look markedly different
I Scale free network: p(k) ∝ k−γ (power law, heavy tail)
I Random network: p(k) ∝(NV−1
k
)pk (1− p)NV−1−k (binomial,
light tail)
![Page 23: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/23.jpg)
Fitting degree distributions
I Typically the fitting of the empirical distribution is based onthe histogram of observations for p(k)
I This is prone to errors in the region of high degree nodes dueto low number of observations
I Binning can help: divide the range of k into intervals and putall observations in the interval into a common bin
I Cumulative degree distribution pc (k) =∑∞
l=k p(l), thelikelyhood that a given node has degree at least k , is morereliable and does not require binning
![Page 24: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/24.jpg)
Degree correlations and assortative mixing
Degree correlation is a statistic that reveals additional informationof the connection patterns of the nodes
I Assortative networks: high correlation between the degrees ofadjacent nodes; highly connected nodes mostly connect toother highly connected nodes
I Disassortative networks: highly connected nodes mostlyconnect to low degree nodes
I Assortativity index −1 ≤ r ≤ 1: Pearson correlationcoefficient of degrees of adjacent nodes, r > 0 assortative,r < 0 disassortative
![Page 25: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/25.jpg)
Examples
Social networks are typically assortative, technological andbiological networks tend to be disassortative
(M. Newman. Phys. Rev. E 67, 026126 (2003) )
![Page 26: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/26.jpg)
Clustering coefficient
I Clustering coefficient measures theprobability that two vertices witha common neighbor are connected
I Let Ei denote the number of edgesbetween the neighbors of vi , andEmax = ki (ki − 1)/2 thetheoretical maximum. Clusteringcoefficient for vertex ni is now
Ci =Ei
Emax=
2Ei
ki (ki − 1)
I Clustering coefficient for the wholegraph is obtained by averagingover the vertices
![Page 27: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/27.jpg)
Clustering coefficient in natural networks
I Natural networks often haverelatively high clusteringcoefficient indicating localclustering within the network
I Negative correlation between thedegree and the clusteringcoefficient has also been observed;
I Low degree nodes lie in localclusters, while the neighbors ofhigh degree nodes are less oftenconnected
I Indicates modular networkstructure
Example: PPIs in Mouse andHuman:
(http://bccs.bristol.ac.uk/toProgramme
/ project/2008/Angela Onslow S08/)
![Page 28: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/28.jpg)
Matching index
To be functionally related, twovertices do not need to beconnected, examples:
I Two transcription factorproteins regulating thesame gene
I Two metabolite moleculestaking part in similarreactions
Zamora-Lopez. Frontiers in Neuroinformatics 4, 2010
![Page 29: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/29.jpg)
Matching index
I Matching index measuresthe amount of neighborsthe two nodes share:
MIij =Sharedij
ki + kj − Sharedij
I Similarity in terms ofperceiving theneighborhood similarly
Zamora-Lopez. Frontiers in Neuroinformatics 4, 2010
![Page 30: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/30.jpg)
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
![Page 31: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/31.jpg)
Models of complex networks
I Theoretical models of networks areneeded as a basis for comparisonto determine the significance ofglobal properties or non-trivialsubstructures of natural networks.
I We will look at three specificmodels
I Erdos-Renyi ModelI Watts-Strogatz ModelI Barabasi-Albert model
![Page 32: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/32.jpg)
Erdos-Renyi Model
I ER network consists of NV vertices
I Edge is drawn between a pair ofnodes randomly with probability p
I Degree distribution of the ERmodel is binomial:p(k) ∝
(NV−1k
)pk (1− p)NV−1−k
I Degree distribution can beapproximated by Poissondistribution for large graphs
![Page 33: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/33.jpg)
Erdos-Renyi Model
Significant body of theoretical research exists for the ER model,e.g.
I For NV p < 1 the network almost surely has no largeconnected components
I For NV p ≈ 1 the network will almost surely have one largeconnected component
I For NV p > logNV the network will almost surely beconnected
I ER network has the small-world property when p > 1/NV
with average path length scaling as l ∼ logNV
I No local clustering, expected clustering coefficientC = p = 〈k〉 /NV for all nodes
![Page 34: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/34.jpg)
Watts-Strogatz model
1. Arrange vertices in a ring structure
2. Connect each vertex to K closestneighbours
3. With probability prew , rewire eacheach edge by detaching from oneend and attaching to a randomlychosen vertex.
After steps (1-2) there is localclustering, step (3) lowers average pathlength by creating shortcuts
http://en.wikipedia.org/wiki/File:Watts strogatz.svg
![Page 35: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/35.jpg)
Watts-Strogatz model
I Even for low rewiring probability(prew << 1) the average pathlength goes down rapidly
I Small average path length andlocal clustering is retained forintermediate prew
I When prew → 1, we get ER model,i.e. local clustering is destroyed
I Degree distribution is similar toER graph: homogeneuos andpeaked around k = K
http://en.wikipedia.org/wiki/File:Watts strogatz.svg
![Page 36: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/36.jpg)
Barabasi-Albert model
I Start with an initial smallconnected network of N0 vertices
I Iteratively add new vertices andconnect the new vertex to m ≤ N0
vertices
I Draw the nodes that will beconnected the the new vertex withprobability proportional to theirdegree (preferential attachment):
ρ(ni ) = ki/∑
j
kj (BA graph from http://melihsozdinler.blogspot.com/)
![Page 37: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/37.jpg)
Barabasi-Albert model
I Unlike ER or WS model,Barabasi-Albert model explain theinhomogeneuos degree distributionobserved in natural graphs
I With enough iterations, the degreedistribution of the BA model isscale-free, with p(k) ∼ k−3
I Average path length in BAnetworks has been found to besmaller than in ER and WS models
![Page 38: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/38.jpg)
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
![Page 39: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/39.jpg)
Robustness and Attack tolerance
I Robustness against pertubations(mutations, environment changes)is a preferable property forbiological networks
I Networks analysis is interested inpreservation of network topologyunder perturbations (usually:removals of vertices or edges)
![Page 40: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/40.jpg)
Robustness and Attack tolerance
I Both ER networks and scale-freenetworks (such as BA model) arerobust towards random deletionsof nodes and connections
I A random mutation is likely tohit a low degree node in BAmodel
I Scale-free networks are not robusttowards intentioanl attacks
I Removal a set of highlyconnected nodes may collapsethe global structure
I ”Robust, yet fragile”
![Page 41: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/41.jpg)
Modularity and hierarchical organization
I Many natural networks areobserved to possesmodular structure withdensely connectedfunctional clusters ofnodes that are sparselyconnected to other nodes.
I Also, hierarchicalorganization of networkstructure can be observed
I The random networkmodels discussed above,do not directly explainthese phenomena
(Zhao et al. BMC Bioinformatics 2006, 7:386)
![Page 42: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/42.jpg)
Modularity and hierarchical organization
I Barabasi and Albert model hasbeen later extended to thatdirection
I Based on replicating basicmodules and wiring them to thecentral module of rest of thenetwork
I Recursive application leads tohierarchical organization
I Deterministic rather thanrandom procedure
![Page 43: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/43.jpg)
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
![Page 44: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/44.jpg)
Statistical testing of network properties
I How to determine if an observed property of the network issignificant or if it occured just by chance?
I Set up a null hypothesis
I Test if the observed property is consistent with the nullhypothesis
![Page 45: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/45.jpg)
Statistical testing of network properties: Example
I Suppose that we have observed aclustering coefficient C for a givennetwork. Is the network highly clustered?
I Null hypothesis: The clustering coefficientis consistent with a network of the samesize and degree distribution.
I Create an ensemble of random networkswith same size and degree distribution andcompute the clustering coefficient of eachnetwork
I Reject the null hypothesis if theprobability of a network with clusteringcoefficient of at least C is low enough
I If the null hypothesis can be rejected, wecan conclude that the network is highlyclustered as compared to the null model.
![Page 46: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/46.jpg)
What next?
I Thursday 10-12: Study group on analytical properties of ERnetworks
I Thursday 12-14: Exercise session
![Page 47: 582746 Modelling and Analysis in Bioinformatics](https://reader034.fdocuments.us/reader034/viewer/2022042302/625a8902a9e1f13ac7387f81/html5/thumbnails/47.jpg)
Moodle Enrolment
I https:
//moodle.helsinki.fi/course/view.php?id=18471
I Enrolment key: BIOMODELS