Complexity and Efficient Algorithms Group / Department of Computer Science
Approximating Structural Properties of Graphs by Random Walks
Christian Sohler
Very Large Networks
Examples
• Social networks
• The human brain
• Crystals
• Chip design
Size
• 10^9 – 10^23 vertices
• Petabytes of additional information possible
Very Large Networks
Classical graph problems
• Connectivity
• MinCut, MaxCut
• Graph clustering
• Graph isomorphism
Difficulties
• Graph does not fit into main memory
Classification of Very Large Networks – A Vision
Example questions
• Is a country a democracy or a totalitarian country?
• Is a patient schizophrenic?
• Is software malicious?
Formalization
• Given a set of graphs with class labels (training set)
• Find a classifier for new graphs
Classification of Very Large Networks – A Vision
A typical scenario
• Hundreds or thousands of graphs
• Each graph is extremely large
• Graphs are sparse
A possible approach
• Describe graphs by features (graph properties)
• Apply classical learning algorithms
The challenge
• Computation of tens of thousands of features for graphs with billions of vertices
Feature vector (illustration): (12,3,-5,10,0,0,…,20,3)
Classification of Very Large Networks – A Sampling Approach
Random Sampling
• Compute a graph property approximately by random sampling
Informal Question
• What can we learn from the local structure of a sparse graph about its global properties?
Sampling from Graphs
• How can we sample a graph?
Classification of Very Large Networks – A Sampling Approach
Examples of different sampling strategies
1. Sample a set S of s vertices and look at all edges within S (the subgraph G[S] induced by S)
2. Sample a set S of s edges and look at the graph they form
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them
Many more possibilities…
Question
• Which is the right sampling strategy for my learning problem?
Answer
• It depends on the problem…
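As a rough illustration of two of the strategies listed above (the induced subgraph G[S] and the random walk), here is a minimal Python sketch. The adjacency-dictionary representation, the toy graph, and the function names are assumptions made for this example and do not come from the slides.

```python
import random

def induced_subgraph(adj, sample):
    """Strategy 1: keep only the edges with both endpoints in the sample."""
    chosen = set(sample)
    return {v: [u for u in adj[v] if u in chosen] for v in chosen}

def random_walk(adj, start, steps, rng):
    """Strategy 4: take `steps` steps, each to a uniformly random neighbor."""
    path = [start]
    for _ in range(steps):
        neighbors = adj[path[-1]]
        if not neighbors:          # isolated vertex: the walk cannot move
            break
        path.append(rng.choice(neighbors))
    return path

# Toy graph (hypothetical): a path 0-1-2-3 plus the edge 1-3.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
rng = random.Random(0)
sample = rng.sample(sorted(adj), 2)
print(induced_subgraph(adj, sample))
print(random_walk(adj, sample[0], 5, rng))
```

The induced subgraph only sees edges that happen to fall inside the sample, which in a sparse graph is usually very few; the walk always follows existing edges, which is one reason the later slides focus on walks.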
Classification of Very Large Networks – A Sampling Approach
Question 1
• Assume you have some classification task that involves city maps. Which of our four sampling methods is your method of choice?
Possible Answers
1. Sample a set S of s vertices and look at all edges within S
2. Sample a set S of s edges and look at the graph they form
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them
Classification of Very Large Networks – A Sampling Approach
Question 2
• Assume you have some classification task that involves social networks. Which of our four sampling methods is your method of choice?
Possible Answers
1. Sample a set S of s vertices and look at all edges within S
2. Sample a set S of s edges and look at the graph they form
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them
First Wrap-Up
Motivation
• Some classification problems involve sets of huge graphs
• No efficient algorithm is known for some fundamental graph problems
Sampling approach
• We would like to pick small samples from the graph(s) and use them for graph classification
Challenge
• There are many different sampling procedures
• We need to understand which is the right one for which problem
Sampling from Very Large Networks
Property Testing [Rubinfeld, Sudan, 1996; Goldreich, Goldwasser, Ron, 1998]
• Formal framework to study sampling algorithms for very large networks
Relaxation of "Standard Decision Problems"
• Want to distinguish whether the input graph G has a property or is far away from it
• If G neither has the property nor is far away from it, the algorithm may give an arbitrary answer
• Randomized algorithms with bounded (worst case) error probability
• Only looks at a small part of the graph
Different graph models
• Dense graphs, bounded degree graphs, directed graphs
Property Testing in Bounded Degree Graphs
Bounded degree graphs [Goldreich, Ron, 2002]
• Undirected graph G=(V,E)
• Maximum degree bounded by D
• D constant
Oracle access
• V={1,…,n}
• n is known to the algorithm
• Query(i,j) returns the j-th neighbor of vertex i, or a symbol that indicates that this neighbor does not exist
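Example (D = 3): adjacency lists, where "/" marks a neighbor that does not exist
Vertex | 1st | 2nd | 3rd
1      | 2   | 4   | /
2      | 1   | 3   | 5
3      | 2   | /   | /
4      | 1   | 5   | /
5      | 2   | 4   | /
[Figure: drawing of the example graph on vertices 1 to 5]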
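The oracle model can be made concrete with a small Python sketch. It assumes the five rows of the adjacency table above belong to vertices 1 to 5 in order; the class name, the use of `None` for the "does not exist" symbol, and the query counter are illustrative choices, not part of the model's definition.

```python
class BoundedDegreeOracle:
    """Query access to a graph with maximum degree D (hypothetical helper)."""

    def __init__(self, adjacency, max_degree):
        self.adjacency = adjacency
        self.max_degree = max_degree
        self.n = len(adjacency)      # n is known to the algorithm
        self.queries = 0             # lets us measure query complexity

    def query(self, i, j):
        """Return the j-th neighbor of vertex i (1-indexed), or None if absent."""
        if not (1 <= i <= self.n and 1 <= j <= self.max_degree):
            raise ValueError("query out of range")
        self.queries += 1
        neighbors = self.adjacency[i]
        return neighbors[j - 1] if j <= len(neighbors) else None

# Adjacency lists read off the example table (D = 3).
adjacency = {1: [2, 4], 2: [1, 3, 5], 3: [2], 4: [1, 5], 5: [2, 4]}
oracle = BoundedDegreeOracle(adjacency, max_degree=3)
print(oracle.query(2, 3))   # third neighbor of vertex 2 -> 5
print(oracle.query(3, 2))   # vertex 3 has only one neighbor -> None
```

Counting calls to `query` is what the later slides mean by query complexity.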
Property Testing in Bounded Degree Graphs
Graph properties
• A graph property is a set of graphs that is closed under isomorphism
Definition [Goldreich, Ron, 2002]
• G=(V,E) is ε-far from P, if one has to modify more than εDn edges to obtain a bounded degree graph with property P.
[Figure labels: "connected", "ε-far"]
Property Testing in Bounded Degree Graphs
Property Tester for property P [Goldreich, Ron, 2002]
• Oracle access to input graph G
• Accepts with probability at least 2/3, if G has property P
• Rejects with probability at least 2/3, if G is ε-far from P
Quality measures
• Query complexity: maximum number of oracle queries
• Running time
A First Example: Connectivity
ConnectivityTester(G,ε,D) [Goldreich, Ron, 2002]
(1) Sample set S with s=8/(εD) vertices uniformly at random from V
(2) For every vertex from S:
(3)   Perform a BFS until
      (a) 4/(εD) vertices have been discovered or
      (b) all vertices of a small connected component have been discovered
(4)   if (b) then reject
(5) accept
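The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions: the graph is given as an adjacency dictionary rather than through oracle queries, vertices are sampled with replacement, both bounds are rounded up, and a fully explored component counts as "small" whenever it does not cover the whole graph.

```python
import math
import random
from collections import deque

def connectivity_tester(adj, eps, D, rng=random):
    """Return True (accept) or False (reject), following steps (1)-(5)."""
    vertices = list(adj)
    s = math.ceil(8 / (eps * D))        # sample size from step (1)
    limit = math.ceil(4 / (eps * D))    # BFS budget from step (3a)
    for _ in range(s):
        start = rng.choice(vertices)
        seen = {start}
        queue = deque([start])
        # Truncated BFS: stop once the budget is reached or nothing is left.
        while queue and len(seen) < limit:
            v = queue.popleft()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        # Empty queue: a whole component was explored. If it is not the
        # entire graph, it witnesses that G is disconnected.
        if not queue and len(seen) < len(vertices):
            return False
    return True

cycle = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
isolated = {i: [] for i in range(10)}
print(connectivity_tester(cycle, 0.5, 2, random.Random(0)))     # True
print(connectivity_tester(isolated, 0.5, 2, random.Random(0)))  # False
```

The sketch only ever rejects after seeing a complete component smaller than the graph, so a connected input is never rejected.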
Observation
• ConnectivityTester accepts every connected graph
Claim
• If G is ε-far from connected, then G has more than εDn/2 connected components.
Claim
• At least εDn/4 of the connected components have size at most 4/(εD).
Theorem
• ConnectivityTester is a property tester with query complexity O(1/(ε²D)).
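The two claims combine into the theorem roughly as follows. This is a sketch of the standard counting argument, writing ε for the distance parameter and assuming the s sampled vertices are drawn independently; the slides do not spell these steps out.

```latex
\begin{align*}
\#\{\text{components of size} > \tfrac{4}{\epsilon D}\}
  &< \frac{n}{4/(\epsilon D)} = \frac{\epsilon D n}{4}, \\
\#\{\text{components of size} \le \tfrac{4}{\epsilon D}\}
  &> \frac{\epsilon D n}{2} - \frac{\epsilon D n}{4} = \frac{\epsilon D n}{4}, \\
\Pr[\text{a sampled vertex lies in a small component}]
  &\ge \frac{\epsilon D n / 4}{n} = \frac{\epsilon D}{4}, \\
\Pr[\text{no sampled vertex lies in a small component}]
  &\le \Bigl(1 - \frac{\epsilon D}{4}\Bigr)^{8/(\epsilon D)} \le e^{-2} < \frac{1}{3}, \\
\text{number of queries}
  &\le \frac{8}{\epsilon D} \cdot \frac{4}{\epsilon D} \cdot D
   = \frac{32}{\epsilon^{2} D} = O\!\Bigl(\frac{1}{\epsilon^{2} D}\Bigr).
\end{align*}
```

The last line multiplies the number of sampled vertices by the BFS budget and by the at most D queries needed per discovered vertex.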
Second Wrap-Up – Introduction to Property Testing
Property Testing
• Approximately decide, based on random sampling, whether a graph has a property or is far away from it
• Quality measure: query complexity
Connectivity
• Sampling + BFS
• Check whether the sample violates the property
Second Wrap-Up – Introduction to Property Testing
Question 3
• Is the following algorithm a property tester for planarity (for the right choice of f)?
PlanarityTester(G,ε,D)
(1) Sample set S with s = f(ε,D) vertices uniformly at random from V
(2) For every vertex from S:
(3)   Perform a BFS until
      (a) f(ε,D) vertices have been discovered or
      (b) the discovered graph is not planar
(4)   if (b) then reject
(5) accept
Second Wrap-Up – Introduction to Property Testing
Bad news
• There is a class of graphs in which every cycle has length Ω(log n) and that are ε-far from planar
Good news
• The sampling is fine, we just need to modify our acceptance condition
Random Walks, Stationary Distributions & Convergence
Random Walk
• In each step: move from the current vertex v to a neighbor chosen uniformly at random
Convergence
• If G is connected and not bipartite, a random walk converges to a unique stationary distribution
• Pr[random walk is at vertex v] ∝ deg(v)
Random Walks, Stationary Distributions & Convergence
Random Walks on Maps
• A random walk on a planar graph has the tendency to stay local
• It takes a long time to reach the stationary distribution
• Reason: the network has sparse cuts
Random Walks on Social Networks
• A random walk will quickly move to a "random place"
• Fast convergence
• The network does not have sparse cuts
Random Walks, Stationary Distributions & Convergence
Lazy Random Walk
• In each step:
  - the probability to move from the current vertex v to a neighbor u is 1/(2D)
  - the walk stays at v with the remaining probability
Convergence of Lazy Random Walks
• Stationary distribution is uniform
Rate of Convergence
• Can be expressed in terms of the conductance of G or the second largest eigenvalue of the transition matrix
• O(log n) steps, if G is an expander graph
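To see the uniform stationary distribution emerge, the following Python sketch pushes a probability vector through the lazy walk exactly (no sampling). The five-vertex graph with D = 3 and the number of steps are assumptions chosen for illustration.

```python
def lazy_walk_step(dist, adj, D):
    """One lazy step: move to each neighbor w.p. 1/(2D), otherwise stay."""
    new = {v: 0.0 for v in adj}
    for v, mass in dist.items():
        move = mass / (2 * D)
        for u in adj[v]:
            new[u] += move
        new[v] += mass - move * len(adj[v])   # remaining probability stays
    return new

# Hypothetical connected graph with maximum degree D = 3.
adj = {1: [2, 4], 2: [1, 3, 5], 3: [2], 4: [1, 5], 5: [2, 4]}
D = 3
dist = {v: 0.0 for v in adj}
dist[3] = 1.0                                  # start the walk at vertex 3
for _ in range(200):
    dist = lazy_walk_step(dist, adj, D)
print({v: round(p, 3) for v, p in dist.items()})
```

Because every edge is used with the same probability 1/(2D) in both directions, the transition matrix is symmetric, which is why the limit is uniform even though the degrees differ.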
Conductance, Expanders & Small Worlds
Definition
• The expansion Φ(U) of a set U is defined as
  $$\Phi(U) = \frac{|\{(u,v) \in E : u \in U \text{ and } v \in V \setminus U\}|}{D\,|U|}$$
• The conductance Φ_G of G is $\Phi_G = \min_{U:\,1 \le |U| \le |V|/2} \Phi(U)$
Definition
• A graph G=(V,E) is called a φ-expander, if Φ_G ≥ φ for some constant φ.
Interpretations
• Expander graphs satisfy the "small-world phenomenon"
• Conductance can be viewed as a measure for the social connectivity of a network
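For very small graphs the two definitions can be evaluated by brute force, which helps to get a feeling for the numbers. The sketch below is exponential in the number of vertices and only meant as an illustration; the example graph and function names are assumptions.

```python
from itertools import combinations

def expansion(adj, U, D):
    """Expansion of U: number of edges leaving U divided by D * |U|."""
    U = set(U)
    cut = sum(1 for u in U for v in adj[u] if v not in U)
    return cut / (D * len(U))

def conductance(adj, D):
    """Minimum expansion over all sets U with 1 <= |U| <= |V|/2."""
    vertices = list(adj)
    return min(
        expansion(adj, U, D)
        for size in range(1, len(vertices) // 2 + 1)
        for U in combinations(vertices, size)
    )

# Hypothetical graph with maximum degree D = 3.
adj = {1: [2, 4], 2: [1, 3, 5], 3: [2], 4: [1, 5], 5: [2, 4]}
print(expansion(adj, {3}, 3))   # one leaving edge, so 1/3
print(conductance(adj, 3))
```

On this graph the minimum is attained, among others, by the single low-degree vertex 3, giving a conductance of 1/3.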
Testing Expanders
Facts
• A lazy random walk converges to the uniform distribution
• A lazy random walk converges quickly in expander graphs
Hope
• A lazy random walk converges much slower, if the graph is ε-far from an expander graph
• In particular, we hope that the distribution of the endpoints of a Θ(log n)-step lazy random walk differs significantly from the uniform distribution
Question
• If so, how could we exploit this to design a property testing algorithm?
The Birthday Problem & Testing Uniform Distributions
Birthday Problem
• n possible birthdays
• k persons with birthdays chosen uniformly at random
• How large must k be so that, with constant probability, two persons have the same birthday?
Analysis
• p = (1/n,…,1/n)^T
• ||p||² is the collision probability of two birthdays
• If we have k persons, then the expected number of collisions is $\binom{k}{2}\,\|p\|^2$
• So, for k = Θ(√n) we expect to see a collision
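Worked out for the uniform distribution, the expected number of collisions is (a short calculation added for illustration, using linearity of expectation over all pairs of persons):

```latex
\mathbb{E}[\#\text{collisions}]
  \;=\; \binom{k}{2}\,\|p\|_2^2
  \;=\; \frac{k(k-1)}{2}\sum_{i=1}^{n}\frac{1}{n^{2}}
  \;=\; \frac{k(k-1)}{2n}
```

This reaches 1 once k(k-1) ≥ 2n, that is, for k of order √n. For example, with n = 365 and k = 28 the value is 756/730 ≈ 1.04.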
Testing Uniform Distributions
Observation
• The uniform distribution minimizes the expected number of pairwise collisions
• If a distribution q differs significantly from the uniform distribution p, then ||q||² >> ||p||²
TestUniformDistribution(distribution q)
1. Sample Θ(√n) elements according to q
2. if the number of pairwise collisions is too large then reject
3. else accept
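A collision-counting test along these lines might look as follows in Python. The slides leave "too large" unspecified, so the `slack` factor of 1.5 over the uniform expectation is purely an assumption for this sketch, as are the function names and the sampler interface.

```python
import random
from collections import Counter
from math import comb

def count_collisions(samples):
    """Number of unordered pairs of samples that are equal."""
    return sum(comb(c, 2) for c in Counter(samples).values())

def looks_uniform(sampler, n, k, slack=1.5):
    """Accept if k samples collide at most `slack` times the uniform expectation."""
    samples = [sampler() for _ in range(k)]
    expected_uniform = comb(k, 2) / n          # C(k,2) * ||p||^2 for uniform p
    return count_collisions(samples) <= slack * expected_uniform

rng = random.Random(0)
n, k = 10_000, 1_000
print(looks_uniform(lambda: rng.randrange(n), n, k))        # uniform on n items
print(looks_uniform(lambda: rng.randrange(n // 4), n, k))   # support of size n/4
```

The second sampler has squared norm 4/n, so it produces about four times as many collisions as the uniform distribution would.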
Testing Expanders
TestingExpanders(G)
1. Sample set S of s vertices uniformly at random
2. for each v ∈ S do
3.   Let q be the distribution of endpoints of a Θ(log n)-step lazy random walk starting at v
4.   if TestUniformDistribution(q) rejects then reject
5. accept
History
• The algorithm was invented by [Goldreich and Ron, 2000], who conjectured it to be a property tester
• First complete analysis by [Czumaj and Sohler, 2010] (but weaker than conjectured)
• Later improved by [Nachmias and Shapira, 2010] and [Kale and Seshadhri, 2011]
Final Result
Theorem [Nachmias and Shapira, 2010; Kale and Seshadhri, 2011]
• Algorithm TestingExpanders accepts every φ-expander and rejects every graph that is ε-far from a Θ(φ²)-expander.
• The algorithm has a running time of O(n^(1/2+δ)).
Key structural property of "ε-far" graphs
• If G is ε-far from a Θ(φ²)-expander, then there exists a set U of Ω(εn) vertices with Φ(U) = O(φ²).
• This implies that for many vertices, the distribution of endpoints of a random walk of length O(log n) is significantly different from the uniform distribution
Third Wrap-Up – Testing Expansion
(Lazy) Random Walks
• Moves from a vertex to a random neighbor
• Converges to the uniform distribution
• Speed of convergence depends on graph structure
Testing Expansion
• Random walk converges quickly in expander graphs
• Random walk converges slower if we are far from expander graphs
• Number of collisions among end points of random walks is minimized in expander graphs
• We can test expansion by counting collisions
Graph Clustering & Web Communities
Web Graph Communities
• Set of vertices that induces an expander graph and has a sparse cut to the rest of the graph
• Question: Is the web graph composed of a set of at most k communities?
Definition
• A subset C ⊆ V is called a (Φ_in, Φ_out)-cluster, if
  - Φ(G[C]) ≥ Φ_in (conductance of the induced subgraph G[C])
  - Φ(C) ≤ Φ_out
Definition
• A partition of V into at most k (Φ_in, Φ_out)-clusters is called a (k, Φ_in, Φ_out)-clustering
Testing k-Clusterings
A Simple Case?
• Distinguish between a union of at most k expander graphs with no edges in between and a set of more than k (large) expander graphs with no edges in between
• Can we use our previous algorithm to test for a k-clustering?
[Figure: four graphs, each labeled "Expander"]
Testing k-Clusterings
A Simple Case? No!
• We do not know the size of the clusters (expander graphs), and estimating the support size of a distribution is hard [Raskhodnikova et al., 2009]
Testing k-Clusterings
New idea
• If two vertices come from the same cluster, the random walks quickly converge to the same distribution
• So, we could try to sample a set of vertices and check for sets of vertices whose random walks induce the same distributions
Testing Closeness of Distributions
Main Idea [Batu et al. 2013; Chan et al. 2014]
• If p ≈ q, then the following experiments should give roughly the same number of collisions between elements from S and T:
  - Draw two sets S and T of m elements from p
  - Draw two sets S and T of m elements from q
  - Draw a set S of m elements from p and a set T of m elements from q
• If p and q differ significantly, at least one of the three values is different
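One way to combine the three collision counts is through the identity ||p-q||² = ||p||² + ||q||² - 2⟨p,q⟩, since a pair (x from S, y from T) collides with probability ⟨p,q⟩ when S is drawn from p and T from q. The Python sketch below uses this identity as an estimator; it is an illustration of the idea and not necessarily the exact statistic of the cited papers. All names are assumptions.

```python
from collections import Counter

def cross_collisions(S, T):
    """Number of pairs (x in S, y in T) with x == y."""
    counts_T = Counter(T)
    return sum(counts_T[x] for x in S)

def l2_distance_sq_estimate(S_p, T_p, S_q, T_q):
    """Estimate ||p - q||^2 from four samples of equal size m.

    S_p, T_p are drawn from p; S_q, T_q are drawn from q.
    """
    m = len(S_p)
    c_pp = cross_collisions(S_p, T_p)    # estimates m^2 * ||p||^2
    c_qq = cross_collisions(S_q, T_q)    # estimates m^2 * ||q||^2
    c_pq = cross_collisions(S_p, T_q)    # estimates m^2 * <p, q>
    return (c_pp + c_qq - 2 * c_pq) / (m * m)

# Two point masses on different elements have squared distance 2.
print(l2_distance_sq_estimate(["a"] * 4, ["a"] * 4, ["b"] * 4, ["b"] * 4))
```

If p and q are equal, all three counts have the same expectation and the estimate concentrates around 0.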
Testing Closeness of Distributions
Theorem [Batu et al. 2013; Chan et al. 2014]
• There is a tester that w.p. 2/3 accepts, if ||p-q|| ≤ ε/2, and rejects, if ||p-q|| ≥ ε.
• The query complexity of the algorithm is O(b/ε²), where b is an upper bound on ||p||² and ||q||².
We will need b to be O(1/n)
The Algorithm
ClusteringTest
1. Sample set S of s vertices uniformly at random
2. For any v ∈ S let D(v) be the distribution of end points of a random walk of length Θ(log n) starting at v
3. for each pair u,v ∈ S do
4.   if D(u) and D(v) are close then add an edge (u,v) to the "cluster graph" on vertex set S
5. accept, if and only if the cluster graph is a collection of at most k cliques
Testing k-Clusterings
Observation
• Algorithm ClusteringTest distinguishes between at most k expanders and more than k (large) expanders
Question
• Can we generalize it to testing of (k, Φ_in, Φ_out)-clusterings?
Testing k-Clusterings - Soundness
Challenge
• Since the clusters may be connected in a (k, Φ_in, Φ_out)-clustering, the stationary distribution may be uniform over G (and not over the cluster)
• Need to show that, for a proper length of the random walk, there is an "intermediate" distribution that is "reasonably stable" w.r.t. ℓ2-error
The Algorithm
ClusteringTest
1. Sample set S of s vertices uniformly at random
2. For any v ∈ S let D(v) be the distribution of end points of a random walk of length Θ(log n) starting at v
3. if ||D(v)||² > O(1/n) then reject
4. for each pair u,v ∈ S do
5.   if D(u) and D(v) are close then add an edge (u,v) to the "cluster graph" on vertex set S
6. accept, if and only if the cluster graph is a collection of at most k connected components
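The structure of the six steps can be sketched in Python as below. To keep the example short and deterministic, it computes the endpoint distributions D(v) exactly by pushing probability mass through a lazy walk, whereas the actual tester only has sample access and compares distributions through collision statistics. The thresholds `closeness` and `norm_bound`, the use of a lazy walk, and all names are assumptions.

```python
def lazy_walk_distribution(adj, D, start, steps):
    """Exact endpoint distribution of a lazy walk of the given length."""
    dist = {v: 0.0 for v in adj}
    dist[start] = 1.0
    for _ in range(steps):
        new = {v: 0.0 for v in adj}
        for v, mass in dist.items():
            move = mass / (2 * D)
            for u in adj[v]:
                new[u] += move
            new[v] += mass - move * len(adj[v])
        dist = new
    return dist

def clustering_test(adj, D, sample, steps, k, closeness, norm_bound):
    """`sample` is a list of distinct vertices playing the role of S."""
    dists = {v: lazy_walk_distribution(adj, D, v, steps) for v in sample}
    # Step 3: reject if some distribution is too concentrated.
    if any(sum(p * p for p in d.values()) > norm_bound for d in dists.values()):
        return False
    # Steps 4-5: union-find over the cluster graph on the sample.
    parent = {v: v for v in sample}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for i, u in enumerate(sample):
        for v in sample[i + 1:]:
            gap = sum((dists[u][w] - dists[v][w]) ** 2 for w in adj)
            if gap <= closeness:
                parent[find(u)] = find(v)
    # Step 6: accept iff there are at most k connected components.
    return len({find(v) for v in sample}) <= k

# Two disjoint triangles (hypothetical input) form two clusters.
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(clustering_test(two_triangles, 2, [0, 1, 3, 4], 50, 2, 1e-6, 0.5))  # True
print(clustering_test(two_triangles, 2, [0, 1, 3, 4], 50, 1, 1e-6, 0.5))  # False
```

Walks started in the same triangle converge to the same uniform distribution on that triangle, so the cluster graph on the sample has exactly two components.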
Testing k-Clusterings - Completeness
Required Properties of a (k, Φ_in, Φ_out)-clustering
• For most vertices v: the distribution D(v) of end points of a lazy random walk of proper length has ||D(v)||² = O(1/n)
• For most pairs u,v from the same cluster: ||D(v) - D(u)||² is very small
Useful Tool – Higher Order Cheeger's Inequality [Lee et al. 2014]
• Relates (k, Φ_in, Φ_out)-clusterings to the k+1 largest eigenvalues
Testing k-Clusterings - Soundness
Structural property of "ε-far" graphs (similar to expanders)
• If G is ε-far from every (k, Φ_in*, Φ_out*)-clustering, then there exists a partition into k+1 sets C_1,…,C_{k+1}, each of Ω(ε²n/k) vertices and with Φ(C_i) = O(Φ_in*/ε²).
Testing k-Clusterings
Theorem [Czumaj, Peng, Sohler, 2015]
• Algorithm ClusteringTest accepts every (k, Φ_in, Φ_out)-clustering with probability at least 2/3 and rejects every graph that is ε-far from every (k, Φ_in*, Φ_out*)-clustering with probability at least 2/3, where Φ_out = O(ε⁴ Φ_in²) and Φ_in* = Θ(ε⁴ Φ_in²/log n) for constants k, D.
• The running time of the algorithm is O*(√n).
Fourth Wrap-Up
Testing Clusterings
• End points of a random walk of proper length should be uniform on its cluster, with not much probability "outside"
• If random walks start from two different points of the same cluster, their end point distributions are similar
• Collision statistics can be used to pairwise test similarity of distributions
• This can be used to approximate the cut structure
Take-away message
• The distribution of end points of random walks (possibly comparing different starting vertices) contains a lot of information about the cut structure of a graph
Summary
Vision
• Learning from very large sets of massive graphs
Approach
• Feature computation by random sampling
• Analysis in the framework of property testing
Two Examples
• Expanders (connectivity measure in social networks)
• Clustering (structure of social networks)
Thank you!
Image sources
Slide 2: Allan Ajifo and cobalt123; Creative Commons license
Slide 3: GustavoG and Jasper Nance; Creative Commons license
Slide 4: Wikipedia; Jason Brown; Creative Commons license
Slide 5: GustavoG; Creative Commons license
Slide 6: GoldenRibbon; Creative Commons license