Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

The first part of my lectures will be devoted to the design of practical algorithms for very large graphs. The second part will be devoted to algorithms resilient to memory errors. Modern memory devices may suffer from faults, in which some bits arbitrarily flip and corrupt the values of the affected memory cells. Such faults may seriously compromise the correctness and performance of computations, and the larger the memory usage, the higher the probability of incurring memory errors. In recent years, many algorithms for computing in the presence of memory faults have been introduced in the literature. In particular, an algorithm or a data structure is called resilient if it is able to work correctly on the set of uncorrupted values. This part will cover recent work on resilient algorithms and data structures.

Transcript of Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Page 1: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

The Centrality of Centrality (Robustness and Centrality)

(Based on Boldi et al [BRV11b])


Page 2: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

From Demetrescu et al., McGraw Hill 2004

15th Century Florentine Marriages Data

Page 3: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

From Demetrescu et al., McGraw Hill 2004

The social network of friendships within a 34-person karate club provides clues to the fault lines that eventually split the club apart (Zachary, 1977)

Page 4: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Network structure questions

•  Are only few nodes holding a given network together? Or, is the network robust?

•  In particular, which nodes have a stronger impact in determining the network’s structure?

•  In other words, which nodes are more “central” to a network?

•  How do social networks differ (in “centrality”) from known networks, such as the Web or the Internet?


Page 5: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Intuition on Centrality

•  Ideally, every node (often representing an individual) has some degree of influence or importance within the social domain under consideration;

•  One expects such importance to be reflected in the structure of the social network;

•  Centrality is a quantitative measure that aims at revealing the importance of a node


Page 6: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Types of Centrality

•  Different types of centrality are considered in the literature (see Borgatti [Bor05] for a survey)

•  Many have to do with shortest paths

•  E.g., the betweenness centrality of a node v is the sum, over all pairs of nodes x and y, of the fraction of shortest paths from x to y passing through v.

•  The role played by shortest paths is justified by the small-world phenomenon (Milgram’s experiment).

Page 7: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Types of Centrality

•  Albeit interesting, centrality measures based on distances, like betweenness, are often computationally too expensive on real-world large-scale graphs;

•  The best known algorithm for betweenness centrality [Bra01] takes O(nm) time and space for O(n + m) integers (n nodes, m arcs)

•  Both bounds are infeasible for large networks, where typically n ≈ 10^9 and m ≈ 10^11.

•  For this reason, strictly local measures of centrality (e.g., degree centrality) are usually preferred.
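To make the O(nm) bound concrete, here is a minimal sketch of a Brandes-style computation for unweighted directed graphs: one BFS per source, followed by a backward accumulation of dependencies. The function name `betweenness` and the dictionary-of-lists graph representation are assumptions made for this illustration; they are not taken from [Bra01].

```python
from collections import deque

def betweenness(adj):
    """Betweenness of every node; adj maps each node to a list of out-neighbours.

    Illustrative sketch: one BFS per source gives O(n m) time overall.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

On the path a → b → c, only b lies on a shortest path between two other nodes, so it gets betweenness 1 and the endpoints get 0.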

Page 8: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Centrality and Robustness

Evaluate node centrality based on how much the removal of a node disrupts the graph structure (Albert et al [AJB00]):

•  remove nodes by following a certain strategy

•  observe whether the graph structure is affected (e.g., distance distribution, connectivity…)

This idea also provides a notion of robustness:

•  If removing a few nodes has no noticeable impact → the network structure is robust in a strong sense

•  If the removal strategy quickly affects the structure → this probably reflects an importance order of the nodes

Page 9: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Metrics for Network Structure?

•  Diameter or some analogous measure

•  Number of reachable pairs of the graph (pairs <x,y> such that there is a directed path from x to y)

•  Distance distribution (the discrete distribution that gives, for every integer t, the fraction of pairs of nodes that are at distance t).

•  Others may be conceived

Page 10: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Neighborhood Function

Quick detour: given a directed graph G, its neighborhood function N_G(t) returns, for each t ∈ ℕ, the number of pairs of nodes <x,y> such that y is reachable from x within t steps.

Page 11: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Neighborhood of a Node

N(u,h) = # of nodes within h steps of u = |{ v : dist(u,v) ≤ h }|

[Figure: an example graph containing node u, and a plot of its neighbourhood function N(u,h) against h]

Page 12: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Neighborhood Function

N(u,h) = # of nodes within h steps of u = |{ v : dist(u,v) ≤ h }|

N(h) = # of pairs of nodes within h steps of each other = Σ_u N(u,h)
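The two definitions above can be evaluated exactly with one truncated BFS per node, which is also why the exact computation is expensive on large graphs. The sketch below is only an illustration; the function names and the dictionary-of-lists representation are assumptions, and a node counts as being within 0 steps of itself.

```python
from collections import deque

def node_neighbourhood(adj, u, h):
    """N(u, h): number of nodes v with dist(u, v) <= h (u itself included)."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == h:
            continue  # do not expand beyond h steps
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return len(dist)

def neighbourhood_function(adj, h):
    """N(h): sum of N(u, h) over all nodes u."""
    return sum(node_neighbourhood(adj, u, h) for u in adj)
```

For the directed path 1 → 2 → 3 → 4, `neighbourhood_function` gives 4 at h = 0 and 10 once h reaches 3.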

Page 13: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Neighborhood Function

Provides a wealth of information about the graph (e.g., it easily allows one to compute its diameter), but it is very expensive to compute exactly. Recently, some progress on computing N_G(t):

•  ANF (Approximate Neighborhood Function), proposed by Palmer et al [PGF02], is able to approximate N_G(t) on medium/large graphs.

•  HyperANF by Boldi et al [BRV11a] improves over ANF in terms of speed and scalability.

Page 14: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Removal Strategy

•  Consider an ordering of the graph nodes (supposed to represent their “importance” or “centrality”).

•  Remove nodes (and their incident arcs) following this order, until a certain percentage θ of the arcs has been deleted;

•  Finally, compare the number of reachable pairs and the distance distribution of the new graph with those of the original one.

•  The chosen node ordering is considered a reliable measure of centrality if the measured difference increases rapidly with θ (i.e., it is sufficient to delete a small fraction of important nodes to change the structure of the graph).

•  Next: Experiments by Boldi et al [BRV11b]

Page 15: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Removal Strategies

•  Random. Used as a baseline (important to show that the phenomena observed are due to the peculiar choice of nodes, not to some generic property of the graph).

•  Largest-degree first. Decreasing (out)degree order. Used as a baseline.

•  Near-Root. In web graphs, the roots of web sites and their (quasi-)immediate successors (e.g., pages linked by the root) are likely to be most important in establishing the distance distribution, as people tend to link to the higher levels of web sites. This strategy essentially removes root nodes first, then nodes that are children of a root, and so on.

Page 16: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Removal Strategies

•  PageRank. A kind of refinement of degree centrality: nodes with high degree connected to nodes of low rank will have rather low rank, too.

•  Label propagation. Used for graph clustering (Raghavan et al [RAK07]): each node has an initial label (cluster identifier) and, through rounds, takes the label of the majority of its neighbors. This removal strategy picks, for each cluster in decreasing size order, the node with the highest number of neighbors in other clusters. Intuitively, this node is tightly connected within the cluster, but also has significant connections outside of the cluster: thus, one expects that its removal should seriously disrupt the distance distribution.

Page 17: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Measures of Divergence

•  Reachable nodes %. # of node pairs that are still reachable divided by # of pairs initially reachable, expressed as a percentage

•  δ-average distance: δ(G’,G) = d_avg(G’) / d_avg(G) – 1

δ(G’,G) = 0.3 means that node removal has increased the average distance of the original G by 30%.
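The removal experiment and the two measures on this slide can be sketched on a toy scale as follows. This is an exact, BFS-based illustration and not the HyperANF-based pipeline used in the experiments; the function names are assumptions, pairs <x,x> are not counted, and the code assumes at least one reachable pair survives (otherwise the averages are undefined).

```python
from collections import deque

def distance_counts(adj):
    """Map t -> number of ordered pairs (x, y), x != y, with dist(x, y) = t."""
    counts = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    counts[dist[w]] = counts.get(dist[w], 0) + 1
                    queue.append(w)
    return counts

def remove_until(adj, order, theta):
    """Delete nodes following `order` until a fraction theta of the arcs is gone."""
    total = sum(len(ws) for ws in adj.values())
    g = {v: set(ws) for v, ws in adj.items()}
    for v in order:
        remaining = sum(len(ws) for ws in g.values())
        if total - remaining >= theta * total:
            break
        del g[v]
        for ws in g.values():
            ws.discard(v)
    return g

def average_distance(counts):
    return sum(t * k for t, k in counts.items()) / sum(counts.values())

def divergence(original, reduced):
    """Return (reachable pairs %, delta-average distance) of reduced vs original."""
    before, after = distance_counts(original), distance_counts(reduced)
    reachable_pct = 100.0 * sum(after.values()) / sum(before.values())
    delta_avg = average_distance(after) / average_distance(before) - 1
    return reachable_pct, delta_avg
```

On a directed 4-cycle, removing a single node already deletes half of the arcs and leaves only 25% of the pairs reachable; the average distance drops rather than grows, because the long pairs are exactly the ones that become unreachable. This is one reason to look at both measures together.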

Page 18: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Measures of Divergence

•  δ-harmonic diameter. The harmonic average of the distances, computed over all pairs: [formula not preserved in this transcript]

Incorporates reachability information, as unreachable pairs contribute zero to the sum. Easily computable from the neighborhood function (as shown).
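The formula on this slide did not survive in the transcript. A common definition of the harmonic diameter, which the sketch below assumes rather than quotes, is n(n-1) divided by the sum of 1/d(x,y) over all ordered pairs x ≠ y, with unreachable pairs contributing zero. Under that assumption the sum can be read off the neighborhood function, because N_G(t) - N_G(t-1) is the number of pairs at distance exactly t. The name `harmonic_diameter` and the list-based input are illustrative.

```python
def harmonic_diameter(n, nf):
    """Harmonic diameter from the neighborhood function (assumed definition).

    n  : number of nodes
    nf : list with nf[t] = N_G(t) for t = 0, 1, ..., T, where T is large
         enough that N_G no longer grows.
    Assumes at least one pair x != y is reachable, otherwise the sum is zero.
    """
    # nf[t] - nf[t-1] pairs are at distance exactly t; each contributes 1/t.
    total = sum((nf[t] - nf[t - 1]) / t for t in range(1, len(nf)))
    return n * (n - 1) / total
```

For a directed 4-cycle the neighborhood function is 4, 8, 12, 16, and the sketch returns 18/11 ≈ 1.64, below the average distance of 2, as expected for a harmonic mean.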

Page 19: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Measures of Divergence

Others can be used:

•  Kullback-Leibler divergence

•  ℓ-norms (ℓ1, ℓ2)

For robustness, they seem to agree with δ-average distance.

Page 20: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Social Networks vs. Web Graphs


Page 21: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Discussion

•  Social networks suffer less disconnection than web graphs under the removal strategies considered:

•  Label propagation can disconnect almost all pairs of a web graph by removing 30% of the arcs, but disconnects less than 50% of the pairs on social networks

•  The average distance of web graphs increases by 50-80% upon removal of 30% of the arcs; in most social networks there is an increase of just a few percent (always less than 20%).

Page 22: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Discussion (Web graphs)

•  Do web graphs have a (shortest) path structure that passes through fundamental hubs (identified by removal strategies such as label propagation, LP)?

•  Fundamental hubs are not necessarily home pages (LP is more disruptive than near-root)

•  Hubs are not necessarily of high degree (quite the opposite is true). Behavior of web graphs under the largest-degree strategy: smallest reduction in reachable pairs and an almost unnoticeable change in the average distance. Are high-degree nodes not relevant for the global structure of the network?

Page 23: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Discussion (Web graphs)

•  Random removal separates a good number of reachable pairs, but the increase in average distance is marginal. Are both measures important in evaluating removal strategies?

•  PageRank and LP always seem to be the best.

•  Is LP actually able to identify structurally important nodes in the graph, at least significantly better than the other methods considered?

•  How are the rankings correlated? Kendall’s tau between PageRank and LP is close to 0 (essentially no correlation)…

Page 24: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Discussion (Social networks)

•  Social networks appear much more resistant to node removal. There is no strict clustering, and there are no definite hubs that can be used to eliminate or elongate shortest paths.

•  None of the strategies considered is able to disrupt social networks as much as web graphs.

•  Do we need different strategies, or is this intrinsic to social networks?

•  Disclaimer: conclusions about social networks should be taken with a grain of salt, due to the heterogeneity of such networks and the lack of a large repertoire of examples.

Page 25: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Bridges and Strong Articulation Points

(Based on work by Firmani, I., Laura, and Santaroni)

Page 26: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

2-Edge Connectivity

Let G = (V,E) be an undirected connected graph, with m edges and n vertices.

An edge e∈E is a bridge if its removal disconnects G (i.e., increases the number of connected components of G).

Graph G is 2-edge-connected if it has no bridges.

The 2-edge-connected components of G are its maximal 2-edge-connected subgraphs.

[Figure: example graph on vertices 1–7]

Page 28: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

2-Vertex Connectivity

Let G = (V,E) be an undirected connected graph, with m edges and n vertices.

A vertex v∈V is an articulation point if its removal disconnects G (i.e., increases the number of connected components of G).

Graph G is 2-vertex-connected if it has no articulation points.

The 2-vertex-connected components of G are its maximal 2-vertex-connected subgraphs.

[Figure: example graph on vertices 1–7]

Page 30: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Bounds for Undirected G

Q1: Find whether G is 2-vertex-connected (2-edge-connected), i.e., find one connectivity cut (if any): O(m+n)

Q2: Find all connectivity cuts (articulation points, bridges) in G: O(m+n)

Q3: Find the connectivity (2-vertex-, 2-edge-connected) components of G: O(m+n)

[Hopcroft & Tarjan 1973], [Tarjan 1974]
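As an illustration of how Q2 is answered in O(m+n) time, here is a minimal sketch of the classical DFS low-point technique. It assumes a connected, simple, undirected graph given as a dictionary of neighbour lists; the function name `cuts` is illustrative, and the recursion would need to be made iterative (or the recursion limit raised) for deep graphs.

```python
def cuts(adj):
    """Return (articulation_points, bridges) of a connected simple undirected graph."""
    disc, low = {}, {}
    aps, bridges = set(), set()

    def dfs(v, parent):
        disc[v] = low[v] = len(disc)      # discovery time
        children = 0
        for w in adj[v]:
            if w == parent:
                continue                  # skip the tree edge we came from
            if w in disc:
                low[v] = min(low[v], disc[w])      # back edge
            else:
                children += 1
                dfs(w, v)
                low[v] = min(low[v], low[w])
                if low[w] > disc[v]:
                    bridges.add(frozenset((v, w)))  # no back edge jumps over (v, w)
                if parent is not None and low[w] >= disc[v]:
                    aps.add(v)                      # subtree of w cannot bypass v
        if parent is None and children > 1:
            aps.add(v)                    # root with two or more DFS children

    dfs(next(iter(adj)), None)
    return aps, bridges
```

On two triangles {1,2,3} and {4,5,6} joined by the edge 3–4, the sketch reports vertices 3 and 4 as articulation points and {3,4} as the only bridge.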

Page 31: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Directed Graphs

Let G = (V,E) be a directed graph, with m edges and n vertices.

G is strongly connected if there is a directed path from each vertex to every other vertex in G.

The strongly connected components (SCCs) of G are its maximal strongly connected subgraphs.

[Figure: example directed graph on vertices 1–7]

Page 32: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Directed: 2-Vertex Connectivity

Let G = (V,E) be a directed strongly connected graph, with m edges and n vertices.

A vertex v∈V is a strong articulation point if its removal increases the number of strongly connected components of G.

Graph G is 2-vertex-connected if it has no strong articulation points.

The 2-vertex-connected components of G are its maximal 2-vertex-connected subgraphs.

Page 33: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Articulation Points

[Figure: example directed graph on vertices 1–7]

Page 34: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Articulation Points

[Figure: the example graph with vertex 5 removed]

Vertex 5 is a strong articulation point

Page 37: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Articulation Points

[Figure: the example graph with vertex 2 removed]

Vertex 2 is NOT a strong articulation point

Page 38: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Directed: 2-Edge Connectivity

Let G = (V,E) be a directed strongly connected graph, with m edges and n vertices.

An edge (u,v)∈E is a strong bridge if its removal increases the number of strongly connected components of G.

Graph G is 2-edge-connected if it has no strong bridges.

The 2-edge-connected components of G are its maximal 2-edge-connected subgraphs.

Page 39: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Bridges

[Figure: example directed graph on vertices 1–7]

Page 40: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Bridges

[Figure: the example graph]

Edge (2,3) is a strong bridge

Page 41: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Bounds for Directed G

Q1: Find whether directed G is 2-vertex-connected (2-edge-connected), i.e., find one connectivity cut (if any): O(m+n) [Tarjan 76] + [Gabow & Tarjan 83], [Georgiadis 10]

Q2: Find all connectivity cuts (strong articulation points, strong bridges) in G: O(m+n) [Italiano et al 10]

Q3: Find the connectivity (2-vertex-, 2-edge-connected) components of G: ?????

Page 42: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Further Motivation

•  Constraint programming - filtering for the tree constraint: computing all strong articulation points was posed as an open problem by Beldiceanu et al [2005]

•  Reliability in directed networks

•  Connectivity and flow of information in social networks [Mislove et al. 2007]

•  Speeding up the computation of matrix determinants [Bini & Pan 1994] [Maybee et al. 1989]

•  …

Page 43: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Product Co-Purchase

Page 47: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

2012 Elections

Page 48: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Naive Algorithms

Check whether vertex v is a strong articulation point in G: compute the strongly connected components of G\{v}.

O(n(m+n)) for computing all strong articulation points.

Check whether edge e is a strong bridge in G: compute the strongly connected components of G\{e}.

O(m(m+n)) for computing all strong bridges. It is not difficult to get an O(n(m+n)) algorithm.

Computationally infeasible on big graphs!
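A minimal sketch of the naive O(n(m+n)) test for strong articulation points follows. Since G is assumed strongly connected, it checks strong connectivity of G\{v} with one forward and one backward search from an arbitrary remaining vertex instead of computing all SCCs; this is a simplification chosen for brevity, and the function names are illustrative.

```python
def reaches_all(adj, root, skip):
    """True if every vertex other than `skip` is reachable from root avoiding `skip`."""
    seen, stack = {root}, [root]
    while stack:
        for w in adj[stack.pop()]:
            if w != skip and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj) - 1

def naive_strong_articulation_points(adj):
    """All v such that G minus v is no longer strongly connected (G strongly connected)."""
    radj = {v: [] for v in adj}          # reversed graph
    for u, ws in adj.items():
        for w in ws:
            radj[w].append(u)
    saps = set()
    for v in adj:
        others = [u for u in adj if u != v]
        if len(others) < 2:
            continue                      # a single remaining vertex is trivially fine
        root = others[0]
        # G\{v} is strongly connected iff root reaches everything and
        # everything reaches root.
        if not (reaches_all(adj, root, v) and reaches_all(radj, root, v)):
            saps.add(v)
    return saps
```

On a small test graph made of a bidirected triangle {1,2,3} with the directed cycle 3 → 4 → 5 → 3 attached, the sketch reports 3, 4 and 5.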

Page 49: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Redundant Edges

Given a directed graph G = (V,E), we say that an edge (u,v) is redundant if there is an alternative path from vertex u to vertex v avoiding edge (u,v). Otherwise, we say that (u,v) is non-redundant.

Observation. Let G = (V,E) be a strongly connected graph. Then the edge (u,v) ∈ E is a strong bridge if and only if (u,v) is non-redundant in G.

Computing strong bridges is therefore equivalent to computing the non-redundant edges.
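The observation gives a direct, if slow, way to list strong bridges: test each edge for redundancy with one search. The sketch below takes O(m(m+n)) time and assumes a strongly connected graph without parallel edges; the function names are illustrative.

```python
def is_redundant(adj, u, v):
    """True if v can be reached from u without using the edge (u, v)."""
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if x == u and y == v:
                continue              # this is the edge under test
            if y == v:
                return True           # found an alternative path
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

def naive_strong_bridges(adj):
    """Strong bridges of a strongly connected graph = its non-redundant edges."""
    return {(u, v) for u in adj for v in adj[u] if not is_redundant(adj, u, v)}
```

On the same kind of test graph as before (a bidirected triangle {1,2,3} plus the cycle 3 → 4 → 5 → 3), exactly the three cycle edges are strong bridges.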

Page 50: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Boolean Matrix Multiplication?

In a directed acyclic graph, finding all redundant edges is the transitive reduction problem.

Transitive reduction equivalent to transitive closure [Aho, Garey & Ullman 72]

Transitive closure equivalent to Boolean matrix multiplication [Furman 70], [Fischer & Meyer 71]

Thus, for DAGs the best known bound for computing redundant edges is O(n^ω), where ω is the exponent of matrix multiplication.

Page 51: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Warm Up: How many SAP?

[Figure: example graph on vertices 1–6]

At most n

Page 52: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

How Many Strong Bridges?

[Figure: example graph on vertices 1–7]

At most 2n-2 (will prove it later)

Page 53: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Vertex Dominators

A flowgraph G(s) = (V,E,s) is a directed graph with a start vertex s in V such that every vertex in V is reachable from s.

Given a flowgraph G(s) = (V,E,s), we can define a dominance relation: vertex u is a dominator of vertex v if every path from s to v includes u.

Let dom(v) be the set of dominators of v. For any v ≠ s we have that {s,v} ⊆ dom(v): s and v are the trivial dominators of v.

[Figure: example flowgraph on vertices 1–7]
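The definition of dom(v) can be turned into a simple fixed-point computation: dom(s) = {s}, and for every other vertex dom(v) is v together with the intersection of dom(p) over all predecessors p of v. The sketch below is this textbook iteration, not a linear-time algorithm; the function name and graph representation are assumptions, and every vertex is assumed reachable from s.

```python
def dominators(adj, s):
    """dom[v] = set of all dominators of v in the flowgraph (adj, s)."""
    pred = {v: [] for v in adj}
    for u, ws in adj.items():
        for w in ws:
            pred[w].append(u)
    dom = {v: set(adj) for v in adj}   # start from "everything dominates v"
    dom[s] = {s}
    changed = True
    while changed:
        changed = False
        for v in adj:
            if v == s:
                continue
            # v is dominated by itself and by whatever dominates all its predecessors.
            new = set.intersection(*(dom[p] for p in pred[v])) | {v}
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom
```

Removing the trivial dominators s and v from each dom(v) leaves the non-trivial ones.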

Page 54: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Dominator Trees

The dominance relation is reflexive and transitive, and its transitive reduction is referred to as the dominator tree DT(s).

DT(s) is rooted at s.

u dominates v if and only if u is an ancestor of v in DT(s).

If u is a dominator of v, and every other non-trivial dominator of v also dominates u, then u is an immediate dominator of v.

If v has any non-trivial dominators, then v has a unique immediate dominator: the immediate dominator of v is the parent of v in the dominator tree DT(s).

[Figure: the example flowgraph on vertices 1–7 and its dominator tree]

Dominators (and dominator trees) can be computed in O(m+n) time [Buchsbaum et al 1998]
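For illustration, the parent pointers of DT(s) can be computed with the simple iterative scheme of Cooper, Harvey and Kennedy, swapped in here for the linear-time algorithm cited on the slide because it fits in a few lines. It processes vertices in reverse postorder and repeatedly intersects the dominator-tree paths of the predecessors. The function name is illustrative, and every vertex is assumed reachable from s.

```python
def immediate_dominators(adj, s):
    """Return {v: parent of v in the dominator tree DT(s)} for every v != s."""
    # Reverse postorder of a DFS from s (iterative, to avoid deep recursion).
    order, seen = [], {s}
    stack = [(s, iter(adj[s]))]
    while stack:
        v, it = stack[-1]
        for w in it:
            if w not in seen:
                seen.add(w)
                stack.append((w, iter(adj[w])))
                break
        else:
            order.append(v)
            stack.pop()
    order.reverse()
    rank = {v: i for i, v in enumerate(order)}

    pred = {v: [] for v in adj}
    for u in adj:
        for w in adj[u]:
            pred[w].append(u)

    idom = {s: s}

    def intersect(a, b):
        # Walk both vertices up the current tree until they meet.
        while a != b:
            while rank[a] > rank[b]:
                a = idom[a]
            while rank[b] > rank[a]:
                b = idom[b]
        return a

    changed = True
    while changed:
        changed = False
        for v in order[1:]:
            done = [p for p in pred[v] if p in idom]
            new = done[0]
            for p in done[1:]:
                new = intersect(new, p)
            if idom.get(v) != new:
                idom[v] = new
                changed = True
    del idom[s]                        # the root has no parent
    return idom
```

The non-trivial dominators of the flowgraph are then exactly the parents that are different from s.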

Page 55: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Vertex Dominators and SAP

Lemma 1. Let G = (V,E) be a strongly connected graph, and let s be any vertex in G. Let G(s) = (V,E,s) be the flowgraph with start vertex s. If u is a non-trivial dominator of a vertex v in G(s), then u is a strong articulation point in G.

[Figure: example graph G on vertices 1–7 and the dominator tree of G(2)]

Vertex 3 is a non-trivial dominator in G(2)

Vertex 3 is a strong articulation point in G

Page 56: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Vertex Dominators and SAP

Lemma 2. Let G = (V,E) be a strongly connected graph. If u is a strong articulation point in G, then there must be a vertex s ∈ V such that u is a non-trivial dominator of a vertex v in the flowgraph G(s) = (V,E,s).

[Figure: example graph G on vertices 1–7 and the dominator tree of G(6)]

Vertex 5 is a strong articulation point in G

Vertex 5 must be a non-trivial dominator in some G(s). Here s=6.

Page 57: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Still Not Efficient

Corollary. Let G = (V,E) be a strongly connected graph. Vertex u is a strong articulation point in G if and only if there is a vertex s∈V such that u is a non-trivial dominator of a vertex v in the flowgraph G(s) = (V,E,s).

Must compute dominator trees for all flowgraphs G(v), for each vertex v in V, and output all non-trivial dominators found.

Page 58: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Dominator Trees

[Figure: the example graph G on vertices 1–7 together with dominator trees of its flowgraphs]

Page 59: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Still Not Efficient

Corollary. Let G = (V,E) be a strongly connected graph. Vertex u is a strong articulation point in G if and only if there is a vertex s∈V such that u is a non-trivial dominator of a vertex v in the flowgraph G(s) = (V,E,s).

Must compute dominator trees for all flowgraphs G(v), for each vertex v in V, and output all non-trivial dominators found.

Like trivial algorithm

Takes O(n(m+n)) time

Only more complicated...

Page 60: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Reversal Graph

Reversal graph G^R = (V,E^R): reverse all edges in G. If (u,v) is in G then (v,u) is in G^R.

Observation. Let G = (V,E) be a strongly connected graph and G^R = (V,E^R) be its reversal graph. Then G^R is strongly connected. Furthermore, vertex v is a strong articulation point in G if and only if v is a strong articulation point in G^R.

Page 61: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Exploit Dominators

Given a strongly connected graph G = (V,E), let

•  G(s) = (V,E,s) be the flowgraph with start vertex s

•  D(s) be the set of non-trivial dominators in G(s)

•  G^R(s) = (V,E^R,s) be the reversed flowgraph with start vertex s

•  D^R(s) be the set of non-trivial dominators in G^R(s)

Theorem. Let G = (V,E) be a strongly connected graph, and let s ∈ V be any vertex in G. Then vertex v ≠ s is a strong articulation point in G if and only if v ∈ D(s) ∪ D^R(s).

Page 62: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Articulation Points

Theorem. Let G = (V,E) be a strongly connected graph, and let s ∈ V be any vertex in G. Then vertex v ≠ s is a strong articulation point in G if and only if v ∈ D(s) ∪ D^R(s).

Proof:

If v ∈ D(s) ∪ D^R(s), we know from the previous lemmas (and from the observation on the reversal graph) that v must be a strong articulation point.

So, we need to prove only the other direction.

Page 63: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Articulation Points

Proof (continued): Let v ≠ s be a strong articulation point, so that G\{v} is not strongly connected.

[Figure: G and G\{v}, with the start vertex s marked]

Then there is a vertex w in G\{v} that either cannot be reached from s or cannot reach s.

•  If w is not reachable from s in G\{v}, then every path from s to w in G passes through v, so v ∈ D(s).

•  If s is not reachable from w in G\{v}, then every path from w to s in G passes through v, so v ∈ D^R(s).

Page 68: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Linear-Time Algorithm

Input: A strongly connected graph G = (V,E), with n vertices and m edges.

Output: The strong articulation points of G.

1.  Choose arbitrarily a vertex s ∈ V in G, and test whether s is a strong articulation point in G. If s is a strong articulation point, output s.

2.  Compute and output D(s), the set of non-trivial dominators in the flowgraph G(s) = (V,E,s).

3.  Compute the reversal graph G^R = (V,E^R).

4.  Compute and output D^R(s), the set of non-trivial dominators in the flowgraph G^R(s) = (V,E^R,s).

Total time is O(m+n)
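The four steps above can be sketched as follows. To stay short, the sketch computes dominators with a simple iterative fixed point rather than a linear-time dominator algorithm, so it follows the structure of the algorithm but does not achieve the O(m+n) bound. All function names are illustrative.

```python
def reverse(adj):
    """Reversal graph: every edge (u, w) becomes (w, u)."""
    radj = {v: [] for v in adj}
    for u, ws in adj.items():
        for w in ws:
            radj[w].append(u)
    return radj

def nontrivial_dominators(adj, s):
    """Union over all v of dom(v) minus {s, v}, by simple iteration (not linear time)."""
    pred = reverse(adj)
    dom = {v: set(adj) for v in adj}
    dom[s] = {s}
    changed = True
    while changed:
        changed = False
        for v in adj:
            if v == s:
                continue
            new = set.intersection(*(dom[p] for p in pred[v])) | {v}
            if new != dom[v]:
                dom[v], changed = new, True
    return {u for v in adj for u in dom[v] if u not in (s, v)}

def reaches_all_without(adj, root, removed):
    seen, stack = {root}, [root]
    while stack:
        for w in adj[stack.pop()]:
            if w != removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj) - 1

def strong_articulation_points(adj):
    """Strong articulation points of a strongly connected graph."""
    s = next(iter(adj))                      # step 1: arbitrary start vertex
    radj = reverse(adj)                      # step 3: reversal graph
    saps = set()
    others = [v for v in adj if v != s]
    if len(others) > 1:                      # step 1: test s directly
        root = others[0]
        if not (reaches_all_without(adj, root, s)
                and reaches_all_without(radj, root, s)):
            saps.add(s)
    saps |= nontrivial_dominators(adj, s)    # step 2: D(s)
    saps |= nontrivial_dominators(radj, s)   # step 4: D^R(s)
    return saps
```

The result does not depend on which vertex is picked as s; only the split between the direct test of step 1 and the two dominator sets changes.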

Page 69: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Bridges

Page 70: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Bridges

Lemma. If there is an algorithm to compute the strong articulation points of a strongly connected graph in time T(m,n), then there is an algorithm to compute the strong bridges of a strongly connected graph in time O(m + n + T(2m, n + m)).

“Proof”:

1. Reduction:

Mainly of theoretical interest (# vertices blows up)
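The picture of the reduction is not preserved here. The bound T(2m, n + m) is consistent with subdividing every edge (u,v) by a new vertex, which yields a graph with n + m vertices and 2m edges in which (u,v) is a strong bridge exactly when its subdividing vertex is a strong articulation point. The sketch below assumes that this is the intended reduction; it uses a naive strong articulation point test as a stand-in for the algorithm with running time T, and all names are illustrative.

```python
def subdivide(adj):
    """Replace every edge (u, w) by u -> ("edge", u, w) -> w."""
    sub = {v: [] for v in adj}
    for u, ws in adj.items():
        for w in ws:
            mid = ("edge", u, w)        # hypothetical label for the new vertex
            sub[u].append(mid)
            sub[mid] = [w]
    return sub

def reaches_all(adj, root, skip):
    seen, stack = {root}, [root]
    while stack:
        for w in adj[stack.pop()]:
            if w != skip and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj) - 1

def strong_articulation_points(adj):
    """Naive stand-in for any SAP algorithm (input must be strongly connected)."""
    radj = {v: [] for v in adj}
    for u, ws in adj.items():
        for w in ws:
            radj[w].append(u)
    saps = set()
    for v in adj:
        root = next(u for u in adj if u != v)
        if not (reaches_all(adj, root, v) and reaches_all(radj, root, v)):
            saps.add(v)
    return saps

def strong_bridges_by_reduction(adj):
    """Strong bridges of G = subdividing vertices that are SAPs in the subdivided graph."""
    saps = strong_articulation_points(subdivide(adj))
    return {(x[1], x[2]) for x in saps if isinstance(x, tuple) and x[0] == "edge"}
```

The sketch assumes that original vertex labels are not tuples starting with "edge" and that the graph has at least two vertices.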

Page 71: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Strong Bridges

2. Edge Dominators

Edge (u,v) is an edge dominator of vertex w if every path from s to w contains edge (u,v).

If (u,v) is an edge dominator of vertex w, and every other edge dominator of w also dominates u, we say that (u,v) is an immediate edge dominator of vertex w.

If a vertex has an edge dominator, then it has a unique immediate edge dominator.

With some care, one can extend all the theory from (vertex) dominators to edge dominators.

Given a flowgraph G(s) = (V,E,s), edge dominators can be computed in time O(m+n).

Need to re-implement the code for dominators.

Page 72: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Edge Dominators in Practice

Lemma. [Tarjan 1974] Let G = (V,E,s) be a flowgraph and let T be a DFS tree of G with start vertex s. Edge (v,w) is an edge dominator in flowgraph G if and only if all of the following conditions are met:

- (v,w) is a tree edge,

- w has no entering forward edge or cross edge, and

- there is no back edge (x,w) such that w does not dominate x.

Need to (1) compute the dominator tree DT(s) and (2) check whether w is an ancestor of x in DT(s) for each back edge (x,w).

Given a flowgraph G(s) = (V,E,s), edge dominators can be computed in time O(m+n).

Reuse the code for (vertex) dominators. More efficient in practice, but still slightly slower than (vertex) dominators.

Page 73: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Computing All Strong Bridges

Given a strongly connected graph G = (V,E), let

•  G(s) = (V,E,s) be the flowgraph with start vertex s

•  ED(s) be the set of edge dominators in G(s)

•  G^R(s) = (V,E^R,s) be the reversed flowgraph with start vertex s

•  ED^R(s) be the set of edge dominators in G^R(s)

Theorem. Let G = (V,E) be a strongly connected graph, and let s ∈ V be any vertex in G. Then edge (u,v) is a strong bridge in G if and only if (u,v) ∈ ED(s) ∪ ED^R(s).

Incidentally, this also proves that there can be at most 2n-2 strong bridges in a directed graph: every edge dominator is a tree edge of a DFS tree, so each of the two sets has at most n-1 elements.

Page 74: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Preliminary Experiments

CPU Intel Xeon X5650 (6 cores) @ 2.67GHz

12MB cache

32GB of DDR3 RAM @ 1GHz

Linux Red Hat 4.1.2-46 (Kernel 2.6.18)

Java Virtual Machine 1.6.0_16 (64-Bit)

WebGraph library 3.0.1

Implementations written in Java to exploit features offered by WebGraph (designed to deal with large graphs)

Page 75: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Datasets

Real-World Large Scale Graphs (up to billions of edges):

•  Web Graphs (nodes are web pages, edges are hyperlinks)

•  Social Graphs (social networks, edges represent interactions between people)

•  Communication Graphs (email networks)

•  Peer2Peer (nodes are hosts in a P2P network topology, edges are connections between P2P hosts)

•  Product Co-Purchase Graphs (nodes are products, edges link commonly co-purchased products)

Page 76: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Datasets

Page 77: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Analysis of SAPs and SBs

Graph        Vertices   Edges   SAPs    Vertices in giant SCC   SAPs in giant SCC   Running times
cnr-2000     325K       3.2M    21K     112K                    14K                 2s
uk-2002      18M        298M    1.8M    12M                     1.8M                21s
it-2004      41M        1.15B   4.5M    29.8M                   4.5M                4m01s
uk-2005      39M        0.93B   3.2M    25M                     3.2M                2m02s
sk-2005      50M        1.94B   5.5M    35M                     5.5M                10m02s
uk-2007-05   105M       3.73B   10.5M   68M                     10M                 48m14s

Running times (secs)

Page 78: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Analysis of SAPs and SBs Graph Vertices Edges SAPs Vertices in

giant SCC SAPs in

giant SCC Running

times

cnr-2000 325K 3.2M 21K 112K 14K 2s

uk-2002 18M 298M 1.8M 12M 1.8M 21s

it-2004 41M 1.15B 4.5M 29.8M 4.5M 4m01s

uk-2005 39M 0.93B 3.2M 25M 3.2M 2m02s

sk-2005 50M 1.94B 5.5M 35M 5.5M 10m02s

uk-2007-05 105M 3.73B 10.5M 68M 10M 48m14s

…able to process massive graphs (billions of edges) in 10-15 minutes
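As a rough sanity check on the running times, the throughput in edges per second can be derived from the edge counts and times reported in the table. The sketch below is only arithmetic on those rounded figures; the helper names (`count`, `seconds`, `edges_per_second`) are hypothetical and not part of the original implementation.

```python
import re

# Edge counts and running times copied from the table in these slides.
RUNS = {
    "cnr-2000": ("3.2M", "2s"),
    "uk-2002": ("298M", "21s"),
    "it-2004": ("1.15B", "4m01s"),
    "uk-2005": ("0.93B", "2m02s"),
    "sk-2005": ("1.94B", "10m02s"),
    "uk-2007-05": ("3.73B", "48m14s"),
}
UNITS = {"K": 1e3, "M": 1e6, "B": 1e9}


def count(text):
    """Convert a rounded count such as '1.15B' into a number."""
    return float(text[:-1]) * UNITS[text[-1]]


def seconds(text):
    """Convert a time such as '4m01s' or '21s' into seconds."""
    minutes, secs = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text).groups()
    return 60 * int(minutes or 0) + int(secs)


def edges_per_second(graph):
    """Edges processed per second for one row of the table."""
    edges, time = RUNS[graph]
    return count(edges) / seconds(time)


if __name__ == "__main__":
    for graph in RUNS:
        print(f"{graph:12s} {edges_per_second(graph) / 1e6:6.2f} M edges/s")
```

Because the table entries are rounded, the resulting rates are approximate.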

Page 79: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Faster Implementations?

(Plot panels: SAP, SB)

Page 80: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Analysis of SAPs and SBs

SAPs appear often (15-25% in co-purchase graphs, 11-18% in social graphs)

Page 81: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Analysis of SAPs and SBs

The vast majority of SAPs are in the giant SCC (less so for Web graphs)
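To make this observation concrete, the share of SAPs that fall in the giant SCC can be compared with the share of all vertices in the giant SCC, using only the figures reported in the table. This is a small arithmetic sketch; the function names are hypothetical, and since the table values are rounded the ratios are approximate.

```python
# (graph, vertices, SAPs, vertices in giant SCC, SAPs in giant SCC),
# copied from the table in these slides.
ROWS = [
    ("cnr-2000", "325K", "21K", "112K", "14K"),
    ("uk-2002", "18M", "1.8M", "12M", "1.8M"),
    ("it-2004", "41M", "4.5M", "29.8M", "4.5M"),
    ("uk-2005", "39M", "3.2M", "25M", "3.2M"),
    ("sk-2005", "50M", "5.5M", "35M", "5.5M"),
    ("uk-2007-05", "105M", "10.5M", "68M", "10M"),
]
UNITS = {"K": 1e3, "M": 1e6, "B": 1e9}


def count(text):
    """Convert a rounded count such as '29.8M' into a number."""
    return float(text[:-1]) * UNITS[text[-1]]


def giant_scc_shares(row):
    """Return (share of vertices in giant SCC, share of SAPs in giant SCC)."""
    _, vertices, saps, giant_vertices, giant_saps = row
    return (count(giant_vertices) / count(vertices),
            count(giant_saps) / count(saps))


if __name__ == "__main__":
    for row in ROWS:
        vertex_share, sap_share = giant_scc_shares(row)
        print(f"{row[0]:12s} vertices {vertex_share:5.1%}   SAPs {sap_share:5.1%}")
```

In every row the SAP share is at least as large as the vertex share, which is consistent with the claim on this slide.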

Page 82: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Analysis of SAPs and SBs

SBs are less frequent (except for email graphs)

Page 83: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Analysis of SAPs and SBs

SBs are also mostly in the giant SCC (less so for social graphs)

Page 84: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Some Properties of SAPs

Page 85: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Robustness and Centrality

Can we exploit SAPs to obtain a better node removal strategy for centrality? (still working on this)

Simpler question: how many of the high-degree (removed) vertices are actually SAPs?

Page 86: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Robustness and Centrality

Page 87: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Summary

Linear-time algorithms to compute all strong articulation points and all strong bridges of directed graphs. Theoretically optimal.

Intuitively, SAPs and SBs “connect” different groups / communities of directed networks.
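For small examples, the definitions of SAPs and SBs can be checked directly by deleting one vertex or one edge at a time and testing strong connectivity. The sketch below is that naive baseline, not the linear-time algorithm from the lecture: it costs one graph search per vertex and per edge. It assumes a simple, strongly connected digraph given as an edge list, and all function names are hypothetical.

```python
def reachable(edges, source, skip_vertex=None, skip_edge=None):
    """Vertices reachable from source when one vertex or one edge is ignored."""
    adj = {}
    for u, v in edges:
        if (u, v) == skip_edge or u == skip_vertex or v == skip_vertex:
            continue
        adj.setdefault(u, []).append(v)
    seen, stack = {source}, [source]
    while stack:
        for v in adj.get(stack.pop(), []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def strongly_connected(vertices, edges, skip_vertex=None, skip_edge=None):
    """True if the graph minus the skipped vertex or edge is strongly connected."""
    rest = {v for v in vertices if v != skip_vertex}
    if len(rest) <= 1:
        return True
    root = next(iter(rest))
    reverse = [(v, u) for u, v in edges]
    reverse_skip = skip_edge[::-1] if skip_edge else None
    forward = reachable(edges, root, skip_vertex, skip_edge)
    backward = reachable(reverse, root, skip_vertex, reverse_skip)
    return forward == rest and backward == rest


def strong_articulation_points(vertices, edges):
    """Vertices whose removal destroys strong connectivity (naive check)."""
    return [v for v in vertices
            if not strongly_connected(vertices, edges, skip_vertex=v)]


def strong_bridges(vertices, edges):
    """Edges whose removal destroys strong connectivity (naive check)."""
    return [e for e in edges
            if not strongly_connected(vertices, edges, skip_edge=e)]


# Toy graph: a 3-cycle 0 -> 1 -> 2 -> 0, a shortcut 0 -> 2, and a 2-cycle 2 <-> 3.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 2), (0, 2)]

if __name__ == "__main__":
    print("SAPs:", strong_articulation_points(V, E))
    print("SBs: ", strong_bridges(V, E))
```

On this toy graph the shortcut edge (0, 2) is the only edge that is not a strong bridge, and the number of strong bridges stays within the 2n-2 bound.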

Page 88: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Summary

Fast in practice: a first rough implementation on real-world large-scale graphs is able to process graphs with billions of edges in 10-15 minutes.

SAPs tend to appear frequently in real-world graphs (especially in co-purchase and social graphs). SBs tend to be less frequent.

Both SAPs and SBs are mostly concentrated in the giant SCC.

The average in-degree, out-degree and PageRank of SAPs are considerably higher than those of other vertices (a further indication of their importance?)

Page 89: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Open Problems / Future Work

Higher connectivity cuts in strongly connected graphs? (e.g., separation pairs: vertex and edge cuts of cardinality 2)

Can the 2-vertex- and 2-edge-connected components of a directed graph be computed in linear time?

Best known time is O(n(m+n)) by repeatedly deleting SAPs / SBs.
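The repeated-deletion idea can be sketched for the 2-edge-connected case as follows: split the graph into strongly connected components, delete the strong bridges of each component, and repeat until no component has a strong bridge. This is a minimal illustration under stated assumptions (simple digraph, edge-list input, hypothetical function names). It finds strong bridges naively, so it does not achieve the O(n(m+n)) bound quoted above, which relies on the linear-time strong bridge algorithm.

```python
def reach(vertices, edges, source):
    """Vertices reachable from source in the graph (vertices, edges)."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
    seen, stack = {source}, [source]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def sccs(vertices, edges):
    """Strongly connected components via forward and backward searches."""
    reverse = [(v, u) for u, v in edges]
    left, components = set(vertices), []
    while left:
        root = min(left)
        comp = reach(vertices, edges, root) & reach(vertices, reverse, root)
        components.append(comp)
        left -= comp
    return components


def strong_bridges(vertices, edges):
    """Naive strong bridges of a strongly connected graph."""
    return [e for e in edges
            if len(sccs(vertices, [f for f in edges if f != e])) > 1]


def two_edge_connected_components(vertices, edges):
    """Repeatedly delete strong bridges until every piece has none."""
    result, todo = [], [(set(vertices), list(edges))]
    while todo:
        vs, es = todo.pop()
        for comp in sccs(vs, es):
            inner = [(u, v) for u, v in es if u in comp and v in comp]
            bridges = set(strong_bridges(comp, inner))
            if not bridges:
                result.append(comp)
            else:
                # Strictly fewer edges each round, so the loop terminates.
                todo.append((comp, [e for e in inner if e not in bridges]))
    return result


# Toy graph: complete digraph on {0, 1, 2} plus a 2-cycle 2 <-> 3.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0), (2, 3), (3, 2)]

if __name__ == "__main__":
    print(sorted(sorted(c) for c in two_edge_connected_components(V, E)))
```

The approach works because a strong bridge of a strongly connected graph cannot lie inside any 2-edge-connected subgraph, so deleting bridges never splits a component that should stay together.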

Page 90: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

2-vertex- and 2-edge-connected components are strange creatures

Page 92: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Open Problems / Future Work

Higher connectivity cuts in strongly connected graphs? (e.g., separation pairs: vertex and edge cuts of cardinality 2)

Can the 2-vertex- and 2-edge-connected components of a directed graph be computed in linear time?

Best known time is O(n(m+n)) by repeatedly deleting SAPs / SBs.

Perform more experiments for centrality on social networks.

Better understand the semantics of SAPs and SBs in real-world graphs.

Page 93: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Homework

Page 94: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Definition

Let G = (V,E) be a strongly connected directed graph and let x and y be any two vertices in V. We say that x and y are 2-vertex-connected (resp. 2-edge-connected) if the deletion of any vertex (resp. edge) leaves x and y in the same strongly connected component (s.c.c.).

Denote these relations by x ~V y (2-vertex-connectedness) and x ~E y (2-edge-connectedness).
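The definition can be tested by brute force on small graphs, which is handy for experimenting with examples. The sketch below deletes each edge (or each vertex) in turn and checks whether x and y still reach each other. It assumes a simple digraph given as an edge list and, for the vertex case, that the deleted vertex is different from x and y; that reading of the definition and the function names are assumptions made for this illustration.

```python
def reachable(edges, source, skip_vertex=None, skip_edge=None):
    """Vertices reachable from source when one vertex or one edge is ignored."""
    adj = {}
    for u, v in edges:
        if (u, v) == skip_edge or u == skip_vertex or v == skip_vertex:
            continue
        adj.setdefault(u, []).append(v)
    seen, stack = {source}, [source]
    while stack:
        for v in adj.get(stack.pop(), []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def same_scc(edges, x, y, skip_vertex=None, skip_edge=None):
    """True if x and y reach each other after the deletion."""
    return (y in reachable(edges, x, skip_vertex, skip_edge)
            and x in reachable(edges, y, skip_vertex, skip_edge))


def two_edge_connected(edges, x, y):
    """x ~E y: no single edge deletion separates x and y."""
    return all(same_scc(edges, x, y, skip_edge=e) for e in edges)


def two_vertex_connected(vertices, edges, x, y):
    """x ~V y: no deletion of a vertex other than x and y separates them."""
    others = (w for w in vertices if w not in (x, y))
    return all(same_scc(edges, x, y, skip_vertex=w) for w in others)


# Toy graph: complete digraph on {0, 1, 2} plus a 2-cycle 2 <-> 3.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0), (2, 3), (3, 2)]

if __name__ == "__main__":
    print(two_edge_connected(E, 0, 1), two_edge_connected(E, 0, 3))
    print(two_vertex_connected(V, E, 0, 1), two_vertex_connected(V, E, 0, 3))
```

Each query costs one pair of graph searches per deleted edge or vertex, so this is only a checking tool for small inputs, not an efficient algorithm.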

Page 95: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Homework 1

Prove or disprove:

a)  2-vertex-connectedness is an equivalence relation (reflexive, symmetric, transitive).

b)  2-edge-connectedness is an equivalence relation.

Page 96: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Homework 2

Prove or disprove:

a)  Two vertices are 2-vertex-connected if and only if they are in the same 2-vertex-connected component of G.

b)  Two vertices are 2-edge-connected if and only if they are in the same 2-edge-connected component of G.

Page 97: Algorithms for Big Data: Graphs and Memory Errors 2 (Lecture by Giuseppe Italiano)

Extra Credits

Let G = (V,E) be a directed strongly connected graph and let v be any vertex in V.

Design an efficient algorithm to compute all vertices in V that are 2-edge-connected to v.