Graphs in R and Bioconductor
![Page 1: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/1.jpg)
Graphs in R and Bioconductor
Statistical Analysis of Microarray Expression Data with R and Bioconductor
Copenhagen DK, November 2007
Denise Scholtens, Ph.D.
Assistant Professor, Department of Preventive Medicine
Northwestern University Medical School, Chicago IL USA
![Page 2: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/2.jpg)
Graphs
Sets of nodes and edges
Nodes: objects of interest
Edges: relationships between them
A useful abstraction to talk about relationships, interactions, etc.
![Page 3: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/3.jpg)
Graphs
Nodes
  Node types
  Conditions on interactions between nodes
Edges
  Edge types
  Direction
  Weights
Graphs are very flexible, but can quickly get complicated!
![Page 4: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/4.jpg)
Graphs
Knowledge representation
  Data structure, visualization
Exploratory data analysis (EDA)
  Graph traversal and analysis
Inference
  Adopting a statistical paradigm for making conclusions about data recorded in graphs
![Page 5: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/5.jpg)
Graphs
Inference
  Ex. Testing association between two graphs
  Ex. Identifying local features of large global graphs
Statistical approaches
  FP: edges that were tested, were found, but are not there in nature
  FN: edges that were tested, were not found, but are there in nature
  Untested: edges that were never tested, may or may not exist in nature, but we don't know
![Page 6: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/6.jpg)
Example: undirected graph
![Page 7: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/7.jpg)
Example: directed graph
![Page 8: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/8.jpg)
Elementary computations on IMCA pathway
> library("graph")
> data("integrinMediatedCellAdhesion")
> class(IMCAGraph)
> s = acc(IMCAGraph, "SOS")
           Ha-Ras                Raf                MEK
                1                  2                  3
              ERK               MYLK                MYO
                4                  5                  6
          F-actin cell proliferation
                7                  5
![Page 9: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/9.jpg)
Example: Directed Acyclic Graph (DAG)
Gene Ontology (GO)
A structured vocabulary to describe molecular function of gene products, biological processes, and cellular components.
A set of "is a", "is part of", and "has a" relationships between these terms
![Page 10: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/10.jpg)
GO graphs
![Page 11: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/11.jpg)
Example: Bipartite graph
Two distinct sets of nodes U and V
Edges exist between elements of U and V
Edges cannot connect nodes in U to other nodes in U, and similarly for V
E.g. literature co-citation graphs
![Page 12: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/12.jpg)
Gene-Literature graphs
DKC1
![Page 13: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/13.jpg)
An adjacency matrix $A_G$ ($n \times m$) is often used to represent a bipartite graph $G$ with node sets $U$, $V$
One-mode graphs (products taken in Boolean algebra):
$A_U = A_G^{t} A_G$: co-citation of genes in literature
$A_V = A_G A_G^{t}$: literature containing common genes
Bipartite graph transformation
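As a small illustration of the one-mode transformation, the following base R sketch builds a toy incidence matrix and forms the two Boolean products. It assumes papers in rows and genes in columns, matching the orientation of the formulas above; every label except DKC1 is made up for the example, and `boolProd` is a hypothetical helper, not a function of the graph package.

```r
# Toy bipartite incidence matrix A_G: rows are papers (V), columns are genes (U).
# All labels except DKC1 are hypothetical and only serve the illustration.
AG <- matrix(c(1, 1, 0,
               0, 1, 1,
               0, 0, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("paper1", "paper2", "paper3"),
                             c("DKC1", "geneB", "geneC")))

# Boolean matrix product: an entry is 1 if the ordinary product is non-zero.
boolProd <- function(X, Y) (X %*% Y > 0) * 1

AU <- boolProd(t(AG), AG)   # gene-by-gene: genes cited together in some paper
AV <- boolProd(AG, t(AG))   # paper-by-paper: papers sharing at least one gene

AU
AV
```

The diagonal entries are 1 whenever a gene is cited at all (or a paper cites any gene); they are usually ignored when the result is read as a graph.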
![Page 14: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/14.jpg)
Data structure
Flexible way to record and visualize data
Undirected/directed edges
Node colors
Structures (DAG, bipartite graph)
Multiple edges
Weighted or labeled edges
Note: user has responsibility to recognize and make use of the graph structure!
![Page 15: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/15.jpg)
Directed, undirected graphs
Adjacent nodes
Accessible nodes
Self-loop
Node degree
Walk: alternating sequence of nodes and incident edges
Closed walk
Distance between nodes, shortest walk
Trail: walk with no repeated edges
Path: trail with no repeated nodes (except possibly first/last)
Connected graph
Weakly connected directed graph
Strongly connected directed graph
Graphs: vocabulary
![Page 16: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/16.jpg)
graph: basic class definitions and functionality
Rgraphviz: rendering functionality. Different layout algorithms. Node plotting, line type, color etc. can be controlled by the user.
RBGL: interface to graph algorithms
graph, Rgraphviz, RBGL
![Page 17: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/17.jpg)
graph package
Classes
graph: a general class that all other classes should extend
graphNEL: node/edge-list representation. Can specify direction of edges, edge weights, etc.
distGraph: graph based on distances between nodes
clusterGraph: series of completely connected subgraphs (cliques) with no edges between them
![Page 18: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/18.jpg)
> library("graph"); library(Rgraphviz)
> myNodes = c("s", "p", "q", "r")
> myEdges = list(s = list(edges = c("p", "q")),
+                p = list(edges = c("p", "q")),
+                q = list(edges = c("p", "r")),
+                r = list(edges = c("s")))
> g = new("graphNEL", nodes = myNodes, edgeL = myEdges,
+         edgemode = "directed")
> plot(g)
Creating a graph
![Page 19: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/19.jpg)
> nodes(g)
[1] "s" "p" "q" "r"
> edges(g)
$s
[1] "p" "q"
$p
[1] "p" "q"
$q
[1] "p" "r"
$r
[1] "s"
> degree(g)
$inDegree
s p q r
1 3 2 1
$outDegree
s p q r
2 2 2 1
Querying nodes, edges, degree
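The degree output above can be checked by hand. The sketch below uses only base R (no graph package) and a plain named list as a stand-in for the node/edge-list representation; the variable names are chosen for this illustration.

```r
# Edge list of the example graph, as named character vectors (source -> targets).
edgeL <- list(s = c("p", "q"),
              p = c("p", "q"),
              q = c("p", "r"),
              r = c("s"))
nodeNames <- names(edgeL)

# Out-degree: number of targets listed for each node.
outDeg <- sapply(edgeL, length)

# In-degree: how often each node appears as a target.
inDeg <- table(factor(unlist(edgeL), levels = nodeNames))

outDeg
inDeg
```

Note that the self-loop on p contributes to both its in-degree and its out-degree.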
![Page 20: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/20.jpg)
> g1 <- addNode("e", g)
> g2 <- removeNode("d", g)
> ## addEdge(from, to, graph, weights)
> g3 <- addEdge("e", "a", g1, pi/2)
> ## removeEdge(from, to, graph)
> g4 <- removeEdge("e", "a", g3)
> identical(g4, g1)
[1] TRUE
Graph manipulation
![Page 21: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/21.jpg)
> adj(g, c("b", "c"))
$b
[1] "b" "c"
$c
[1] "b" "d"
> acc(g, c("b", "c"))
$b
a c d
3 1 2
$c
a b d
2 1 1
Adjacent and accessible nodes
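The values reported for accessible nodes are the number of steps needed to reach each node. A minimal base R sketch of that idea is a breadth-first search over an edge list. The edge list below is an assumption: it reuses the structure of the earlier example graph with nodes relabelled a to d, which is consistent with the output shown but is not given explicitly on the slide. `accessible` is a hypothetical helper, not the package's `acc`.

```r
# Assumed edge list: the earlier example graph with nodes relabelled a-d.
edgeL <- list(a = c("b", "c"),
              b = c("b", "c"),
              c = c("b", "d"),
              d = c("a"))

# Breadth-first search returning the number of steps to every reachable node.
accessible <- function(edgeL, start) {
  dist <- setNames(rep(NA_integer_, length(edgeL)), names(edgeL))
  dist[start] <- 0L
  frontier <- start
  while (length(frontier) > 0) {
    nxt <- character(0)
    for (v in frontier) {
      for (w in edgeL[[v]]) {
        if (is.na(dist[w])) {          # first visit gives the shortest distance
          dist[w] <- dist[v] + 1L
          nxt <- c(nxt, w)
        }
      }
    }
    frontier <- nxt
  }
  dist[!is.na(dist) & names(dist) != start]
}

accB <- accessible(edgeL, "b")
accC <- accessible(edgeL, "c")
accB
accC
```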
![Page 22: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/22.jpg)
![Page 23: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/23.jpg)
Node-edge lists
Adjacency matrix (straightforward)
Adjacency matrix (sparse)
From-To matrix
They are equivalent, but may be hugely different in performance and convenience for different applications.
Can coerce between the representations
Graph representation
![Page 24: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/24.jpg)
> ft
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    1
[4,]    4    4
> ftM2adjM(ft)
  1 2 3 4
1 0 1 0 0
2 0 0 1 0
3 1 0 0 0
4 0 0 0 1
> ftM2graphNEL(ft)
A graphNEL graph with directed edges
Number of Nodes = 4
Number of Edges = 4
Graph representations: from-to matrix
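To make the equivalence of the two representations concrete, here is a base R sketch of the from-to to adjacency conversion, using R's matrix indexing. It only mirrors what the conversion does conceptually; it is not the package's implementation.

```r
ft <- cbind(c(1, 2, 3, 4),
            c(2, 3, 1, 4))   # each row is one directed edge: from, to

n <- max(ft)
adjM <- matrix(0, n, n,
               dimnames = list(as.character(1:n), as.character(1:n)))
adjM[ft] <- 1                # matrix indexing: one (row, col) pair per edge

adjM

# Going back: the non-zero positions are the rows of the from-to matrix.
back <- which(adjM == 1, arr.ind = TRUE)
back
```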
![Page 25: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/25.jpg)
Connected components
cc = connComp(rg)
table(listLen(cc))
 1  2  3  4 15 18
36  7  3  2  1  1
Choose the largest component
wh = which.max(listLen(cc))
sg = subGraph(cc[[wh]], rg)
Depth first search
dfsres = dfs(sg, node = "N14")
nodes(sg)[dfsres$discovered]
 [1] "N14" "N94" "N40" "N69" "N02" "N67" "N45" "N53"
 [9] "N28" "N46" "N51" "N64" "N07" "N19" "N37" "N35"
[17] "N48" "N09"
[Figure: plot of the graph rg]
RBGL: interface to Boost Graph Library
![Page 26: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/26.jpg)
dfs(sg, "N14")
bfs(sg, "N14")
depth/breadth first search
![Page 27: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/27.jpg)
Connected components
sc = strongComp(g2)
nattrs = makeNodeAttrs(g2, fillcolor = "")
for(i in 1:length(sc))
  nattrs$fillcolor[sc[[i]]] = myColors[i]
plot(g2, "dot", nodeAttrs = nattrs)
wc = connComp(g2)
![Page 28: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/28.jpg)
Different algorithms for different types of graph
o all edge weights the same
o positive edge weights
o real numbers
…and different settings of the problem
o single pair
o single source
o single destination
o all pairs
Functions
bfs
dijkstra.sp
sp.between
johnson.all.pairs.sp
Shortest path algorithms
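For the "single source, positive edge weights" setting listed above, the classical method is Dijkstra's algorithm. The following is a compact base R sketch of that algorithm on a weight matrix, written for readability rather than speed; it is not the RBGL implementation, and the node names and weights are made up.

```r
# Weighted directed graph as a matrix; Inf marks "no edge".
# Node names and weights are hypothetical.
W <- matrix(Inf, 4, 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
W["A", "B"] <- 1; W["B", "C"] <- 2; W["A", "C"] <- 5; W["C", "D"] <- 1

dijkstra <- function(W, source) {
  dist <- setNames(rep(Inf, nrow(W)), rownames(W))
  dist[source] <- 0
  done <- setNames(rep(FALSE, nrow(W)), rownames(W))
  while (!all(done)) {
    cand <- dist
    cand[done] <- Inf
    if (all(is.infinite(cand))) break   # remaining nodes are unreachable
    u <- names(which.min(cand))         # closest unfinished node
    done[u] <- TRUE
    dist <- pmin(dist, dist[u] + W[u, ])  # relax all edges leaving u
  }
  dist
}

dA <- dijkstra(W, "A")
dD <- dijkstra(W, "D")
dA
```

The direct edge from A to C (weight 5) is not used, because the route through B is shorter.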
![Page 29: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/29.jpg)
set.seed(123)
rg2 = randomEGraph(nodeNames, edges = 100)
fromNode = "N43"
toNode = "N81"
sp = sp.between(rg2, fromNode, toNode)
sp[[1]]$path
 [1] "N43" "N08" "N88"
 [4] "N73" "N50" "N89"
 [7] "N64" "N93" "N32"
[10] "N12" "N81"
sp[[1]]$length
[1] 10
Shortest path
![Page 30: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/30.jpg)
ap = johnson.all.pairs.sp(rg2)
hist(ap)
Shortest path
![Page 31: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/31.jpg)
mst = mstree.kruskal(gr)
[Figure: plot of the graph gr]
Minimal spanning tree
![Page 32: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/32.jpg)
Consider graph g with a single connected component.
Edge connectivity of g: minimum number of edges in g that can be cut to produce a graph with two components.
Minimum disconnecting set: the set of edges in this cut.
> edgeConnectivity(g)
$connectivity
[1] 2

$minDisconSet
$minDisconSet[[1]]
[1] "D" "E"

$minDisconSet[[2]]
[1] "D" "H"
Connectivity
![Page 33: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/33.jpg)
![Page 34: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/34.jpg)
dot: directed graphs. Works best on DAGs and other graphs that can be drawn as hierarchies.
neato: undirected graphs using 'spring' models
twopi: radial layout. One node ('root') chosen as the center. Remaining nodes on a sequence of concentric circles about the origin, with radial distance proportional to graph distance. Root can be specified or chosen heuristically.
Rgraphviz: layout engines
![Page 35: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/35.jpg)
Rgraphviz: layout engines
![Page 36: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/36.jpg)
Rgraphviz: layout engines
![Page 37: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/37.jpg)
Combining R graphics and Rgraphviz: custom node drawing functions
![Page 38: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/38.jpg)
Inference questions
Compare two graphs: GraphAT
Identify local features in global graphs: apComplex
Large scale topological features of graphs: RBGL, measurement error effects
Estimating error probabilities: ppiStats
![Page 39: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/39.jpg)
Compare two graphs
GraphAT
![Page 40: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/40.jpg)
Compare two graphs
Observed graph of the literature protein-protein interactions used in Ge et al. (315 edges, 298 nodes)
![Page 41: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/41.jpg)
Cluster graph for the 30 clusters reported in Ge et al. (156,205 edges, 2885 nodes)
(All genes that are not in the list of literature-reported PPIs have been removed from this graph for visualization purposes.)
![Page 42: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/42.jpg)
Comparing two graphs
Nodes: yeast genes
Graph 1: literature reported protein-protein interactions
Graph 2: cell cycle gene expression cluster membership
Do the graphs overlap more than random?
Is there anything special about the overlapping edges?
![Page 43: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/43.jpg)
Graph of the reported intracluster edges in Ge et al. (42 edges, 65 nodes)
This graph was derived by intersecting the observed literature graph with the cluster graph.
![Page 44: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/44.jpg)
Graph resulting from random reassignment of 315 edges among 2885 nodes
Note that the structure of this graph is quite different from the observed literature graph.
![Page 45: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/45.jpg)
Intersection of random edge graph with cluster graph
![Page 46: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/46.jpg)
Intersecting Edges
Random Edge algorithm (RE)
Permuting Node Labels algorithm (PN)
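The node-label permutation idea can be sketched in a few lines of base R. Everything below is a hypothetical toy set-up (30 nodes, a cluster graph of three cliques, and a second graph whose edges are simulated to fall mostly inside clusters); it is meant only to show the mechanics of counting intersecting edges under permuted labels, and it does not use or reproduce the GraphAT functions.

```r
set.seed(1)

# Two hypothetical undirected graphs on the same 30 nodes, stored as
# symmetric 0/1 adjacency matrices. g2 is a cluster graph (3 cliques of 10);
# g1 places most of its edges inside those clusters.
n <- 30
cluster <- rep(1:3, each = 10)
g2 <- outer(cluster, cluster, "==") * 1
diag(g2) <- 0

g1 <- matrix(0, n, n)
up <- which(upper.tri(g1))
p <- ifelse(g2[up] == 1, 0.3, 0.02)
g1[up] <- rbinom(length(up), 1, p)
g1 <- g1 + t(g1)

# Test statistic: number of edges present in both graphs.
nShared <- function(a, b) sum(a[upper.tri(a)] * b[upper.tri(b)])
observed <- nShared(g1, g2)

# Permuting Node Labels: relabel the nodes of g1, keeping its structure.
perm <- replicate(1000, {
  idx <- sample(n)
  nShared(g1[idx, idx], g2)
})

# One-sided permutation p-value (with the usual +1 correction).
pval <- (sum(perm >= observed) + 1) / (length(perm) + 1)
c(observed = observed, pval = pval)
```

A Random Edge variant would instead redistribute the edges of g1 uniformly over all node pairs, which does not preserve the structure of g1.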
![Page 47: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/47.jpg)
Graph with the same test statistic as the observed graph of intracluster edges reported in Ge et al.: 42 intracluster edges
![Page 48: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/48.jpg)
Other Test Statistics
![Page 49: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/49.jpg)
Questions of Interest
Is it reasonable to condition on the structure of the observed graphs, something like an ancillary/sufficient statistic?
Why is the number of intersecting edges invariant to the node label permutation and random edge reassignment algorithms?
What are the most informative test statistics?
![Page 50: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/50.jpg)
EDA Questions of Interest
Which expression clusters have intersections with which of the literature clusters?
Are known cell-cycle regulated protein complexes indeed clustered together in both graphs?
Are there expression clusters that have a number of literature cluster edges going between them, suggesting that expression clustering was too fine, or that literature clusters are not cell-cycle regulated?
Is the expression behavior of genes that are involved in multiple protein complexes different from that of genes that are involved in only one complex?
![Page 51: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/51.jpg)
Identify local features in global graphs
apComplex
![Page 52: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/52.jpg)
Local Modeling of Global Interactome Data
AP-MS (Affinity Purification - Mass Spectrometry)
Measures Complex Comembership
Gavin, et al. (Nature, 2002 and Nature, 2006); Ho, et al. (Nature, 2002); Krogan, et al. (Mol Cell 2004, Nature 2006)
Y2H (Yeast Two Hybrid)
Measures Physical Interactions
Ito, et al. (PNAS, 1998); Uetz, et al. (Nature, 2000)
![Page 53: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/53.jpg)
AP-MS data:
Using a bait protein, AP-MS technology finds prey proteins that are comembers of at least one complex with the bait.
Y2H data:
Y2H technology finds pairs of physically interacting proteins.
(one purification)
bait
prey
![Page 54: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/54.jpg)
AP-MS data: Y2H data:
We want to estimate the bipartite protein complex membership graph, A:
*Estimation of A requires estimation of K, the number of complexes.
![Page 55: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/55.jpg)
1. Some proteins participate in more than one complex
2. In an AP-MS experiment, some proteins are used as baits and some proteins are only ever found as prey
3. Graph theoretic paradigm to allow for succinct formulation
• Bipartite graph for complex membership (A)
• Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
• AP-MS and Y2H are different technologies that measure different relationships between proteins
4. Statistical paradigm to allow for false positive and false negative observations
Four unique aspects to the algorithm
![Page 56: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/56.jpg)
PP2A
Heterotrimeric complex consisting of:
Tpd3: regulatory A subunit
Rts1 or Cdc55: regulatory B subunits
Pph21 or Pph22: catalytic subunits
Jiang and Broach (1999). EMBO.
1. Some proteins participate in more than one complex
Gavin, et al. (2002): Rgraphviz plot of yTAP C151
Bader & Hogue (2002): Portion of Figure 2: Overlap of the spoke models of TAP and HMS-PCI.
Jansen, et al. (2003): PIT Bayesian Network, LR>600
http://genecensus.org/intint
[Figure node labels: Tpd3, Pph21, Myo5, Cdc55, Cdc11, Pph22, Cdc10]
![Page 57: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/57.jpg)
1. Some proteins participate in more than one complex
PP2A
Heterotrimeric complex consisting of:
Tpd3: regulatory A subunit
Rts1 or Cdc55: regulatory B subunits
Pph21 or Pph22: catalytic subunits
Jiang and Broach (1999). EMBO.
The apComplex algorithm detects:
Zds1 and Zds2 (known cell-cycle regulators) only exist in complexes with the Cdc55-Pph22 trimer!
![Page 58: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/58.jpg)
2. Graph theoretic paradigm to allow for succinct expression of constructs involved
• Bipartite graph for complex membership
• Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
• AP-MS and Y2H are different technologies that measure different relationships between proteins
![Page 59: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/59.jpg)
2. Graph theoretic paradigm to allow for succinct expression of constructs involved
• Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
![Page 60: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/60.jpg)
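[Diagram: complex membership $A$ determines complex comembership $Y = A \otimes A'$; $Y$ is assayed as $Z$, the AP-MS data for bait proteins and hit-only proteins; the estimation algorithm maps $Z$ to the estimates $\hat{A}$ and $\hat{Y}$. Diagram labels also include M and N.]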
In summary…
We start with an initial estimate for A, and then refine that estimate according to a two-component probability measure:
$P(Z \mid A, \mu, \alpha) = L(Z \mid Y = A \otimes A', \mu, \alpha)\; C(Z \mid A, \mu, \alpha)$
where $L$ is the usual likelihood and $C$ is a regularization/penalty term (no. of complexes).
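The relationship $Y = A \otimes A'$ can be illustrated with the PP2A example from the earlier slides. The sketch below is a base R illustration, not apComplex code: the membership matrix is written by hand for the four trimers implied by "Tpd3 + (Rts1 or Cdc55) + (Pph21 or Pph22)", and the complex names C1 to C4 are made up.

```r
# Hypothetical membership matrix A (proteins x complexes) for the four
# possible PP2A trimers; complex names C1..C4 are invented for illustration.
proteins <- c("Tpd3", "Rts1", "Cdc55", "Pph21", "Pph22")
A <- matrix(c(1, 1, 1, 1,    # Tpd3 is in every trimer
              1, 1, 0, 0,    # Rts1
              0, 0, 1, 1,    # Cdc55
              1, 0, 1, 0,    # Pph21
              0, 1, 0, 1),   # Pph22
            nrow = 5, byrow = TRUE,
            dimnames = list(proteins, paste0("C", 1:4)))

# Comembership Y = A (x) A' under Boolean algebra: two proteins are
# comembers if they share at least one complex.
Y <- (A %*% t(A) > 0) * 1
diag(Y) <- 0   # ignore self-comembership
Y
```

The two regulatory B subunits never appear as comembers, and neither do the two catalytic subunits, although each pairs with every other protein. A single-clique reading of the comembership graph would miss this.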
![Page 61: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/61.jpg)
Large scale topological features of graphs
RBGL
measurement error
![Page 62: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/62.jpg)
[Figure: two graph panels. Node labels: Apl6, Apm3, Apl5 (first panel); Apl6, Apm3, Apl5, Eno2, Aps3, Ckb1 (second panel). Legend: untested: ?; tested: absent]
6 AP-MS observations
Gavin et al. (2002)
Apl5: Apl5, Apl6, Apm3, Aps3, Ckb1
Apl6: Apl5, Apl6, Apm3, Eno2
Apm3: Apl6, Apm3
Measurement Error
  FPs
  FNs
  Stochastic
  Systematic
Missing Data
  Untested Edges
Suitable Inference
  What can we conclude?
![Page 63: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/63.jpg)
Statistical Modeling
Network data are experimentally obtained and hence should be subjected to the same types of data analysis as other data
Given some model, what can we conclude?
Likelihood methods
Global and local feature estimation
![Page 64: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/64.jpg)
Measurement Error
Stochastic: FPs and FNs made ‘randomly’
Systematic: FPs and FNs made in some predictable way
‘Sticky’ proteins may cause FPs
Conformationally deformed proteins may cause FNs
Errant treatment of untested edges as absent will cause systematic FNs
![Page 65: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/65.jpg)
Random graphs vs regular networks and what’s in between?
Small-world connectivity
Watts & Strogatz (1998)
6 degrees of separation
$L_{\text{small-world}} \approx L_{\text{random}}$, where $L$ = average path length
$C_{\text{small-world}} \gg C_{\text{random}}$, where $C$ = average clustering coefficient of nodes
If node $n$ has $k_n$ neighbors, then
$C_n = \dfrac{\#\,\text{observed edges between neighbors}}{k_n (k_n - 1)/2}$
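A direct transcription of the clustering coefficient into base R might look as follows. The five-node graph is a hypothetical example, and `clusterCoef` is a helper written for this sketch (RBGL offers its own clustering-coefficient routine, which is not used here).

```r
# Hypothetical undirected graph on 5 nodes as a symmetric adjacency matrix.
adjM <- matrix(0, 5, 5, dimnames = list(letters[1:5], letters[1:5]))
edges <- rbind(c("a", "b"), c("a", "c"), c("a", "d"), c("b", "c"), c("d", "e"))
adjM[edges] <- 1            # character matrix indexing by node names
adjM[edges[, 2:1]] <- 1     # add the reverse direction to keep it symmetric

clusterCoef <- function(adjM) {
  sapply(rownames(adjM), function(n) {
    nb <- names(which(adjM[n, ] == 1))
    k <- length(nb)
    if (k < 2) return(NA_real_)          # undefined for fewer than 2 neighbors
    observed <- sum(adjM[nb, nb]) / 2    # each edge is counted twice
    observed / (k * (k - 1) / 2)
  })
}

cc <- clusterCoef(adjM)
cc
mean(cc, na.rm = TRUE)   # one common way to average; nodes with k < 2 are dropped
```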
![Page 66: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/66.jpg)
Small-world
Watts & Strogatz (Nature 1998)
![Page 67: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/67.jpg)
Scale-free
Class of small world networks
Node degree distribution follows a power law
In scale-free graphs there are a few highly connected “hubs”
Biology: Relative robustness of network to random perturbation, but huge breakdown to targeted disruption
![Page 68: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/68.jpg)
Simulation Study
Random, Scale-Free, Overlapping Cluster Graphs
50, 500, and 1000 nodes
Approx 5 edges/node
Stochastic FNs: 0.05, 0.15, 0.25
Stochastic FPs: PPV=0.50
Systematic FPs: ‘sticky’ baits detect neighbors of neighbors with p=0.50
![Page 69: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/69.jpg)
L, stochastic FNs
![Page 70: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/70.jpg)
L, stochastic FNs
In general, FNs increase L
In the overlapping cluster graph, FNs splinter graph into unconnected components
How to treat L for an unconnected graph?
Misleading results if treated naïvely
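The problem with unconnected graphs can be seen on a tiny example. In the base R sketch below (a hypothetical five-node path graph, with all-pairs distances from a simple Floyd-Warshall loop), one plausible "naive" treatment is to average only over the finite distances; removing a single edge then makes L smaller, even though the graph has become less connected.

```r
# All-pairs shortest path lengths by Floyd-Warshall on a 0/1 adjacency matrix.
allPairs <- function(adjM) {
  D <- ifelse(adjM == 1, 1, Inf)
  diag(D) <- 0
  for (k in seq_len(nrow(D)))
    D <- pmin(D, outer(D[, k], D[k, ], "+"))
  D
}

# "Naive" L: mean over reachable pairs only, plus the count of unreachable pairs.
avgPath <- function(D) {
  d <- D[upper.tri(D)]
  c(naive = mean(d[is.finite(d)]), nUnreachable = sum(!is.finite(d)))
}

# Path graph a-b-c-d-e (hypothetical example).
adjM <- matrix(0, 5, 5, dimnames = list(letters[1:5], letters[1:5]))
for (i in 1:4) adjM[i, i + 1] <- adjM[i + 1, i] <- 1

intact <- avgPath(allPairs(adjM))

fn <- adjM
fn["c", "d"] <- fn["d", "c"] <- 0    # one false negative splits the graph
split <- avgPath(allPairs(fn))

rbind(intact, split)
```

Reporting the number of unreachable pairs alongside L, or computing L per component, are two simple ways to avoid the misleading reading.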
![Page 71: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/71.jpg)
C, stochastic FPs
(probability an observed edge is true = 0.5)
![Page 72: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/72.jpg)
C, stochastic FPs
If a node has k neighbors, C is the fraction of the possible edges between those neighbors that exist
Both numerator and denominator are affected
For cluster graphs, nodes have more neighbors due to FPs, but the number of edges between the neighbors does not increase proportionately
![Page 73: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/73.jpg)
Node degree distribution, systematic FPs
Looked at log-log plot of complementary cumulative distribution function (not frequency distribution)
Based on theoretical work by Li, et al (2006), Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Mathematics.
Assess fit of straight line using $R^2$
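A minimal base R sketch of this diagnostic follows. The degree sample is simulated from a Pareto-type distribution purely for illustration (in practice `deg` would hold the observed node degrees), and the ordinary least-squares fit is one simple way to obtain a slope and $R^2$; the exact procedure used in the simulation study is not specified on the slide.

```r
set.seed(42)

# Hypothetical degree sample with a heavy tail (Pareto-type, minimum degree 1).
deg <- floor((1 - runif(500))^(-1 / 1.5))

# Empirical complementary CDF: P(degree >= k) for each observed degree value k.
k <- sort(unique(deg))
ccdf <- sapply(k, function(x) mean(deg >= x))

# Straight-line fit on the log-log scale and its R^2.
fit <- lm(log(ccdf) ~ log(k))
slope <- unname(coef(fit)[2])
R2 <- summary(fit)$r.squared
c(slope = slope, R2 = R2)

# plot(log(k), log(ccdf)); abline(fit)   # visual check of the fit
```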
![Page 74: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/74.jpg)
Node degree distribution, systematic FPs
![Page 75: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/75.jpg)
Take Home Message from Simulation Study
FPs and FNs do affect statistics on graphs
Even with small amounts of measurement error, the effects can lead to biological misinterpretations
Specifically, scale-free
![Page 76: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/76.jpg)
Estimating error probabilities
ppiData
![Page 77: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/77.jpg)
Systematic Error: Bias in Bait-Prey Systems
[Figure: two graph panels on nodes Apl6, Apm3, Apl5]
Doubly tested bait-bait edges may be
• reciprocated: tested twice, observed twice
• unreciprocated: tested twice, observed once
For a bait subject only to stochastic error, we expect the set of unreciprocated edges to consist of an approximately equal number of in- and out-edges.
If this is not the case, the bait is subject to systematic bias.
![Page 78: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/78.jpg)
In-degree vs. Out-degree
unreciprocated edges in bait-induced subgraphs (square root scale)
Gavin, 2006 (AP-MS); Krogan, 2006 (AP-MS)
![Page 79: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/79.jpg)
Per protein ‘coin-tossing’ model
Quantify departure from symmetry using a binomial distribution with probability parameter p=0.50.
Previous pictures show nodes with p-value < 0.01 in dark blue.
Removing these nodes has implications for overall estimates of stochastic FP and FN probabilities.
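The coin-tossing check maps directly onto an exact binomial test. The sketch below uses `binom.test` from base R's stats package; the edge counts are hypothetical, and whether the original analysis used a two-sided test exactly as here is an assumption.

```r
# p-value for the hypothesis that a bait's unreciprocated edges are
# equally likely to be in-edges or out-edges (two-sided exact binomial test).
coinToss <- function(nIn, nOut) binom.test(nIn, nIn + nOut, p = 0.5)$p.value

balanced <- coinToss(7, 7)    # hypothetical bait: 7 in-edges, 7 out-edges
lopsided <- coinToss(1, 13)   # hypothetical bait: 1 in-edge, 13 out-edges

c(balanced = balanced, lopsided = lopsided)
lopsided < 0.01   # would be flagged at the 0.01 level used on the slides
```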
![Page 80: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/80.jpg)
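Global estimation of $p_{TP}$ and $p_{FP}$
Let $R_T$ and $U_T$ be random variables for the true positive observations.
Then, given the true number of edges in the graph, $\Delta_T$,
$$\Pr(R_T = r_T, U_T = u_T \mid \Delta_T, p_{TP}) = \binom{\Delta_T}{r_T,\, u_T} \left(p_{TP}^2\right)^{r_T} \left(2 p_{TP}(1 - p_{TP})\right)^{u_T} (1 - p_{TP})^{2(\Delta_T - r_T - u_T)}.$$
Similarly, let $R_F$ and $U_F$ be random variables for the false positive observations.
Then, for the total number of potential edges in the graph, $N$,
$$\Pr(R_F = r_F, U_F = u_F \mid \Delta_T, p_{FP}) = \binom{N - \Delta_T}{r_F,\, u_F} \left(p_{FP}^2\right)^{r_F} \left(2 p_{FP}(1 - p_{FP})\right)^{u_F} (1 - p_{FP})^{2(N - \Delta_T - r_F - u_F)}.$$
Here $\binom{n}{r,\,u} = \dfrac{n!}{r!\,u!\,(n - r - u)!}$ is the multinomial coefficient.
We observe $R = R_T + R_F$ and $U = U_T + U_F$.
So we deal with the convolution.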
![Page 81: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/81.jpg)
Estimating $p_{TP}$ and $p_{FP}$
Using the expectations of R and U, we can derive two independent equations with three unknown parameters ($p_{TP}$, $p_{FP}$, and $\Delta_T$).
For the resulting method of moments estimators, any one of the parameters defines the other two, so we can easily derive a one-dimensional solution manifold (with variance bounds).
In practice for AP-MS data, we choose to estimate $p_{TP}$ using a gold standard set of complex co-membership relationships that exist under similar experimental conditions to those used for AP-MS studies.
![Page 82: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/82.jpg)
Including systematically biased baits Without systematically biased baits
![Page 83: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/83.jpg)
[Figure: two graph panels. Node labels: Apl6, Apm3, Apl5 (first panel); Apl6, Apm3, Apl5, Eno2, Aps3, Ckb1 (second panel). Legend: untested: ?; tested: absent]
What is an example and do statistics really make a difference?
Bait-induced subgraph:
Degree = number of edges incident on a node
For node $n$ we observe:
$IE_n$ = set of in-edges
$OE_n$ = set of out-edges
$ID_n = |IE_n|$ = in-degree
$OD_n = |OE_n|$ = out-degree
$R_n = |IE_n \cap OE_n|$ = ‘reciprocated’ degree
$U_n = |\{IE_n \cup OE_n\} \setminus \{IE_n \cap OE_n\}|$ = ‘unreciprocated’ degree
$IE_{Apm3}$ = {Apl5, Apl6}
$OE_{Apm3}$ = {Apl6}
$ID_{Apm3} = 2$
$OD_{Apm3} = 1$
$R_{Apm3} = 1$
$U_{Apm3} = 1$
Estimating degree under stochastic error only
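These definitions translate one-to-one into R's set operations. The short sketch below recomputes the Apm3 values from the in-edge and out-edge sets given above; the variable names are chosen for this illustration.

```r
IE <- c("Apl5", "Apl6")   # in-edges of Apm3: baits that found Apm3 as prey
OE <- c("Apl6")           # out-edges of Apm3: prey found when Apm3 was bait

Rn <- length(intersect(IE, OE))                          # reciprocated degree
Un <- length(setdiff(union(IE, OE), intersect(IE, OE)))  # unreciprocated degree

c(ID = length(IE), OD = length(OE), R = Rn, U = Un)
```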
![Page 84: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/84.jpg)
Current practice: $D_n = R_n + U_n$, so $D_{Apm3} = 2$
[Figure: two graph panels. Node labels: Apl6, Apm3, Apl5 (first panel); Apl6, Apm3, Apl5, Eno2, Aps3, Ckb1 (second panel)]
What is an example and do statistics really make a difference?
![Page 85: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/85.jpg)
Likelihood Approach
$p_{TP}$ = true positive probability
$p_{FP}$ = false positive probability
If $p_{FP} = 0$, then for ‘true’ edges
$p_{T2}$ = p(observing reciprocated edge) = $p_{TP}^2$
$p_{T1}$ = p(observing unreciprocated edge) = $2 p_{TP}(1 - p_{TP})$
$p_{T0}$ = p(observing no edge) = $(1 - p_{TP})^2$
Given degree $\Delta$:
$$\Pr(R = r, U = u \mid \Delta, p_{TP}) = \binom{\Delta}{r,\, u}\, p_{T2}^{\,r}\, p_{T1}^{\,u}\, p_{T0}^{\,\Delta - r - u}$$
Write a similar statement for FPs and then maximize the convolution to estimate degree.
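For the special case $p_{FP} = 0$ written out above, the likelihood can be maximized over the degree by a simple grid search. The sketch below uses `dmultinom` from base R; the function names, the search range, and the two values of $p_{TP}$ are assumptions for illustration, and the full approach with false positives (the convolution) is not covered.

```r
# Likelihood of a candidate degree d given r reciprocated and u unreciprocated
# observed edges, assuming pFP = 0 (every observed edge is a true edge).
degreeLik <- function(d, r, u, pTP) {
  probs <- c(pTP^2, 2 * pTP * (1 - pTP), (1 - pTP)^2)   # pT2, pT1, pT0
  dmultinom(c(r, u, d - r - u), prob = probs)
}

# Grid search over candidate degrees; maxExtra bounds the search (assumption).
estimateDegree <- function(r, u, pTP, maxExtra = 50) {
  cand <- (r + u):(r + u + maxExtra)   # degree cannot be below r + u
  lik <- sapply(cand, degreeLik, r = r, u = u, pTP = pTP)
  cand[which.max(lik)]
}

# Apm3 has r = 1 and u = 1; the pTP values are hypothetical.
dHigh <- estimateDegree(r = 1, u = 1, pTP = 0.6)
dLow  <- estimateDegree(r = 1, u = 1, pTP = 0.3)
c(pTP_0.6 = dHigh, pTP_0.3 = dLow)
```

With a low true positive probability, the estimate exceeds the "current practice" value $R_n + U_n$, because some true edges are likely to have been missed in both directions.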
![Page 86: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/86.jpg)
Notes
Independence assumption allows product of multinomial probabilities
Only models stochastic error
Only one observation in the likelihood
Truncated binomial for data not subject to FPs was developed by Blumenthal and Dahiya (JASA 1981) and Olkin et al. (JASA 1981).
![Page 87: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/87.jpg)
Summary
Three main applications
(1) Knowledge representation
(2) Exploratory data analysis
(3) Inference
Bioconductor provides a rich set of tools for (1) and (2)…need more of (3)!
![Page 88: Graphs in R and Bioconductor](https://reader034.fdocuments.us/reader034/viewer/2022052023/6286b5135d955c00aa66f095/html5/thumbnails/88.jpg)
Acknowledgements
Robert Gentleman
Wolfgang Huber (thanks for lots of slides)
Vince Carey
Jeff Gentry
Elizabeth Whalen
Seth Falcon