Complexity and Efficient Algorithms Group / Department of Computer Science

Approximating Structural Properties of Graphs by Random Walks
Christian Sohler


2

Very Large Networks

Examples
• Social networks
• The human brain
• Crystals
• Chip design

Size
• 10⁹ – 10²³ vertices
• Petabytes of additional information possible


3

Very Large Networks

Classical graph problems
• Connectivity
• MinCut, MaxCut
• Graph clustering
• Graph isomorphism

Difficulties
• Graph does not fit into main memory


4

Classification of Very Large Networks – A Vision

Example questions
• Is a country a democracy or a totalitarian country?
• Is a patient schizophrenic?
• Is software malicious?

Formalization
• Given a set of graphs with class labels (training set)
• Find a classifier for new graphs


5

Classification of Very Large Networks – A Vision

A typical scenario
• Hundreds or thousands of graphs
• Each graph is extremely large
• Graphs are sparse

A possible approach
• Describe graphs by features (graph properties), e.g. (12,3,-5,10,0,0,…,20,3)
• Apply classical learning algorithms

The challenge
• Computation of tens of thousands of features for graphs with billions of vertices


6

Classification of Very Large Networks – A Sampling Approach

Random Sampling
• Compute a graph property approximately by random sampling

Informal Question
• What can we learn from the local structure of a sparse graph about its global properties?

Sampling from Graphs
• How can we sample a graph?


7


Examples of different sampling strategies
1. Sample a set S of s vertices and look at all edges within S (the subgraph G[S] induced by S)
2. Sample a set S of s edges and look at the graph they form
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them
Many more possibilities…

Question
• Which is the right sampling strategy for my learning problem?
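Strategies 1 and 4 can be sketched in a few lines. This is only an illustration: the adjacency-list dictionary `adj`, the helper names and the small example graph are assumptions made for this sketch, not part of the slides.

```python
import random

def induced_subgraph(adj, sample):
    """Strategy 1: keep only the edges with both endpoints in the sample."""
    chosen = set(sample)
    return {v: [u for u in adj[v] if u in chosen] for v in chosen}

def random_walk(adj, start, length, rng):
    """Strategy 4: a simple random walk of the given length from start."""
    path = [start]
    for _ in range(length):
        neighbors = adj[path[-1]]
        if not neighbors:            # isolated vertex: the walk cannot move
            break
        path.append(rng.choice(neighbors))
    return path

# Small example graph as adjacency lists (an assumption for illustration).
adj = {1: [2, 4], 2: [1, 3, 5], 3: [2], 4: [1, 5], 5: [2, 4]}
rng = random.Random(0)
sample = rng.sample(sorted(adj), 3)
sub = induced_subgraph(adj, sample)
walk = random_walk(adj, sample[0], 5, rng)
```

The induced subgraph only sees edges inside the sample, so in a sparse graph it is usually almost empty, while the walk follows edges and therefore always returns local structure.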


8



Question
• Which is the right sampling strategy for my learning problem?
• Answer: Depends on the problem…


9


Question 1
Assume you have some classification task that involves city maps. Which of our four sampling methods is your method of choice?

Possible Answers
1. Sample a set S of s vertices and look at all edges within S
2. Sample a set S of s edges and look at the graph they form
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them


10


Question 2
Assume you have some classification task that involves social networks. Which of our four sampling methods is your method of choice?

Possible Answers
1. Sample a set S of s vertices and look at all edges within S
2. Sample a set S of s edges and look at the graph they form
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them


11

First Wrap-Up

Motivation
• Some classification problems involve sets of huge graphs
• No efficient algorithm is known for some fundamental graph problems

Sampling approach
• We would like to pick small samples from the graph(s) and use them for graph classification

Challenge
• There are many different sampling procedures
• We need to understand which is the right one for which problem


12

Sampling from Very Large Networks

Property Testing [Rubinfeld, Sudan, 1996; Goldreich, Goldwasser, Ron, 1998]
• Formal framework to study sampling algorithms for very large networks

Relaxation of „Standard Decision Problems“
• Want to distinguish whether the input graph G has a property or is far away from it
• If G neither has the property nor is far away from it, the algorithm may give an arbitrary answer
• Randomized algorithms with bounded (worst case) error probability
• The algorithm only looks at a small part of the graph

Different graph models
• Dense graphs, bounded degree graphs, directed graphs


13

Property Testing in Bounded Degree Graphs

Bounded degree graphs [Goldreich, Ron, 2002]
• Undirected graph G=(V,E)
• Maximum degree bounded by D
• D constant

Oracle access
• V={1,…,n}
• n is known to the algorithm
• Query(i,j) returns the j-th neighbor of vertex i, or a symbol that indicates that this neighbor does not exist

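Example (n=5, D=3; „/“ indicates that the neighbor does not exist)
Vertex 1: 2 4 /
Vertex 2: 1 3 5
Vertex 3: 2 / /
Vertex 4: 1 5 /
Vertex 5: 2 4 /

[Figure: drawing of a graph on the vertices 1,…,5]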
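The oracle can be mimicked by a thin wrapper around adjacency lists. This is a minimal sketch: the class name, the use of `None` as the "does not exist" symbol and the query counter are assumptions for illustration, and the adjacency lists are read off this slide's example array under the assumption that its rows belong to the vertices 1,…,5.

```python
class BoundedDegreeOracle:
    """Answers Query(i, j) on a graph with maximum degree D."""

    def __init__(self, adj, max_degree):
        self.adj = adj
        self.max_degree = max_degree
        self.queries = 0                 # running count of oracle queries

    def query(self, i, j):
        """Return the j-th neighbor of vertex i (1-indexed), or None if it does not exist."""
        if not 1 <= j <= self.max_degree:
            raise ValueError("j must be in 1..D")
        self.queries += 1
        neighbors = self.adj[i]
        return neighbors[j - 1] if j <= len(neighbors) else None

adj = {1: [2, 4], 2: [1, 3, 5], 3: [2], 4: [1, 5], 5: [2, 4]}
oracle = BoundedDegreeOracle(adj, max_degree=3)
third_neighbor_of_2 = oracle.query(2, 3)     # vertex 2 has three neighbors
second_neighbor_of_3 = oracle.query(3, 2)    # vertex 3 has only one neighbor
```

Counting queries inside the wrapper makes it easy to measure the query complexity of any tester that is written against it.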


14


Graph properties
• A graph property is a set of graphs that is closed under isomorphism

Definition [Goldreich, Ron, 2002]
• G=(V,E) is ε-far from P, if one has to modify more than εDn edges to obtain a bounded degree graph with property P.

[Figure labels: connected, ε-far]


15


Property Tester for property P [Goldreich, Ron, 2002]
• Oracle access to input graph G
• Accepts with probability at least 2/3, if G has property P
• Rejects with probability at least 2/3, if G is ε-far from P

Quality measures
• Query complexity: maximum number of oracle queries
• Running time


16

A First Example: Connectivity

ConnectivityTester(G,ε,D) [Goldreich, Ron, 2002]
(1) Sample set S with s=8/(εD) vertices uniformly at random from V
(2) For every vertex from S:
(3)   Perform a BFS until
      (a) 4/(εD) vertices have been discovered or
      (b) all vertices of a small connected component have been discovered
(4)   if (b) then reject
(5) accept
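The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph is given as an adjacency-list dictionary instead of an oracle, 8/(εD) and 4/(εD) are rounded up, and "small connected component" is read as "the BFS exhausted a component that is not the whole graph".

```python
import math
import random
from collections import deque

def connectivity_tester(adj, eps, D, rng):
    """Return True (accept) or False (reject)."""
    vertices = sorted(adj)
    s = math.ceil(8 / (eps * D))        # number of sampled start vertices
    limit = math.ceil(4 / (eps * D))    # BFS stops once this many vertices are seen
    for _ in range(s):
        start = rng.choice(vertices)
        seen = {start}
        queue = deque([start])
        while queue and len(seen) < limit:
            v = queue.popleft()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        # An empty queue means the whole component of `start` was explored.
        if not queue and len(seen) < len(vertices):
            return False                # case (b): found a small component
    return True

cycle = {v: [(v - 1) % 20, (v + 1) % 20] for v in range(20)}   # connected
isolated = {v: [] for v in range(100)}                         # 100 components
accepts_cycle = connectivity_tester(cycle, 0.5, 3, random.Random(0))
accepts_isolated = connectivity_tester(isolated, 0.5, 3, random.Random(0))
```

The sketch rejects only when it has actually seen a complete component smaller than the graph, so a connected input is never rejected.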


17








Observation
• ConnectivityTester accepts every connected graph


18








Claim
• If G is ε-far from connected, then G has more than εDn/2 connected components.


19








Claim
• At least εDn/4 of the connected components have size at most 4/(εD).


20








Theorem
• ConnectivityTester is a property tester with query complexity O(1/(ε²D)).
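The two claims and the theorem fit together by a short counting argument. The following is a sketch of the standard reasoning and is not spelled out on the slides; the first line is the claim about the number of components, and "small" means at most 4/(εD) vertices.

```latex
\begin{align*}
\#\{\text{components}\} &> \tfrac{\varepsilon D n}{2}, \\
\#\{\text{components with more than } \tfrac{4}{\varepsilon D} \text{ vertices}\}
  &< \frac{n}{4/(\varepsilon D)} = \tfrac{\varepsilon D n}{4}, \\
\#\{\text{small components}\}
  &> \tfrac{\varepsilon D n}{2} - \tfrac{\varepsilon D n}{4} = \tfrac{\varepsilon D n}{4}, \\
\Pr[\text{no sampled vertex lies in a small component}]
  &\le \left(1 - \tfrac{\varepsilon D}{4}\right)^{8/(\varepsilon D)} \le e^{-2} < \tfrac{1}{3}, \\
\#\{\text{queries}\}
  &\le \tfrac{8}{\varepsilon D} \cdot \tfrac{4}{\varepsilon D} \cdot D = \tfrac{32}{\varepsilon^{2} D}.
\end{align*}
```

The fourth line uses that every small component contains at least one vertex, so a uniformly random vertex lies in a small component with probability more than εD/4.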


21

Second Wrap-Up – Introduction to Property Testing

Property Testing
• Approximately decide based on random sampling whether a graph has a property or is far away from it
• Quality measure: Query complexity

Connectivity
• Sampling + BFS
• Check whether the sample violates the property


22


Question 3
Is the following algorithm a property tester for planarity (for the right choice of f)?

Planaritytester(G,ε,D)
(1) Sample set S with s=f(ε,D) vertices uniformly at random from V
(2) For every vertex from S:
(3)   Perform a BFS until
      (a) f(ε,D) vertices have been discovered or
      (b) the discovered graph is not planar
(4)   if (b) then reject
(5) accept


23


Bad news
• There is a class of graphs such that every cycle has length Ω(log n) and that are ε-far from planar

Good news
• The sampling is fine, we just need to modify our acceptance condition


24

Random Walks, Stationary Distributions & Convergence

Random Walk
• In each step: move from the current vertex v to a neighbor chosen uniformly at random

Convergence
• If G is connected and not bipartite, a random walk converges to a unique stationary distribution
• In the stationary distribution, Pr[random walk is at vertex v] is proportional to deg(v)


25


Random Walks on Maps
• A random walk on a planar graph has the tendency to stay local
• It takes a long time to reach the stationary distribution
• Reason: The network has sparse cuts

Random Walks on Social Networks
• A random walk will quickly move to a „random place“
• Fast convergence
• The network does not have sparse cuts


26


Lazy Random Walk
• In each step:
  - Probability to move from current vertex v to neighbor u is 1/(2D)
  - The walk stays at v with the remaining probability

Convergence of Lazy Random Walks
• Stationary distribution is uniform

Rate of Convergence
• Can be expressed in terms of the conductance of G or the second largest eigenvalue of the transition matrix
• O(log n) steps, if G is an expander graph
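On a small graph the convergence to the uniform distribution can be checked directly by pushing probability mass along the edges. This is a sketch for illustration; the function names and the example graph are assumptions, and the exact computation is only feasible for tiny graphs.

```python
def lazy_walk_step(dist, adj, D):
    """One step of the lazy walk: each neighbor w.p. 1/(2D), otherwise stay."""
    new = {v: 0.0 for v in adj}
    for v, mass in dist.items():
        for u in adj[v]:
            new[u] += mass / (2 * D)
        new[v] += mass * (1 - len(adj[v]) / (2 * D))
    return new

def endpoint_distribution(adj, D, start, steps):
    """Exact distribution of the walk's position after `steps` steps."""
    dist = {v: 0.0 for v in adj}
    dist[start] = 1.0
    for _ in range(steps):
        dist = lazy_walk_step(dist, adj, D)
    return dist

# Connected example graph with maximum degree 3 (an assumption for illustration).
adj = {1: [2, 4], 2: [1, 3, 5], 3: [2], 4: [1, 5], 5: [2, 4]}
dist = endpoint_distribution(adj, D=3, start=3, steps=500)
```

Because every move from v to u has the same probability as the move from u to v, the transition matrix is symmetric, which is why the limit is uniform even though the degrees differ.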


27

Conductance, Expanders & Small Worlds

Definition
• The expansion Φ(U) of a set U is defined as
  $$\Phi(U) = \frac{|\{(u,v) \in E : u \in U \text{ and } v \in V \setminus U\}|}{D \cdot |U|}$$
• The conductance Φ_G of G is min over all U with 1≤|U|≤|V|/2 of Φ(U)

Definition
• A graph G=(V,E) is called a φ-expander, if Φ_G ≥ φ for some constant φ.

Interpretations
• Expander graphs satisfy the „small-world phenomenon“
• Conductance can be viewed as a measure for the social connectivity of a network
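For very small graphs the conductance can be computed by brute force over all vertex sets. The sketch below assumes that the expansion of U is the number of edges leaving U divided by D·|U|; the function names and the example graph are assumptions for illustration, and the enumeration is exponential in n.

```python
from itertools import combinations

def expansion(adj, D, U):
    """Number of edges leaving U, divided by D * |U|."""
    U = set(U)
    cut = sum(1 for u in U for v in adj[u] if v not in U)
    return cut / (D * len(U))

def conductance(adj, D):
    """Minimum expansion over all sets U with 1 <= |U| <= |V|/2."""
    vertices = sorted(adj)
    best = float("inf")
    for size in range(1, len(vertices) // 2 + 1):
        for U in combinations(vertices, size):
            best = min(best, expansion(adj, D, U))
    return best

adj = {1: [2, 4], 2: [1, 3, 5], 3: [2], 4: [1, 5], 5: [2, 4]}
phi = conductance(adj, D=3)
```

In this example the minimum is attained, among others, by the single vertex 3, which has one leaving edge and therefore expansion 1/3.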


28

Testing Expanders

Facts
• A lazy random walk converges to the uniform distribution
• A lazy random walk converges quickly in expander graphs

Hope
• A lazy random walk converges much slower, if the graph is ε-far from an expander graph
• In particular, we hope that the distribution of the endpoints of a Θ(log n)-step lazy random walk differs significantly from the uniform distribution

Question
• If so, how could we exploit this to design a property testing algorithm?


29

The Birthday Problem & Testing Uniform Distributions

Birthday Problem
• n possible birthdays
• k persons with birthday chosen uniformly at random
• How large must k be so that with constant probability two persons have the same birthday?

Analysis
• p=(1/n,…,1/n)^T
• ||p||² is the collision probability of two birthdays
• If we have k persons then the expected number of collisions is
  $$\binom{k}{2} \cdot \|p\|^2$$
• So, for k = Θ(√n) we expect to see a collision
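The expected number of colliding pairs, (k choose 2)·||p||², is easy to evaluate numerically. A small sketch, where the function name and the choice n = 365, k = 23 are assumptions for illustration:

```python
import math

def expected_collisions(p, k):
    """Expected number of colliding pairs among k independent draws from p."""
    collision_probability = sum(x * x for x in p)    # this is ||p||²
    return math.comb(k, 2) * collision_probability

n = 365
uniform = [1 / n] * n
value = expected_collisions(uniform, 23)             # 253 pairs, each colliding w.p. 1/365
```

For the uniform distribution the value is about k²/(2n), which becomes constant once k is of order √n.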


30

Testing Uniform Distributions

Observation
• The uniform distribution minimizes the expected number of pairwise collisions
• If a distribution q differs significantly from the uniform distribution p then ||q||² >> ||p||²

TestUniformDistribution(distribution q)
1. Sample Θ(√n) elements according to q
2. if the number of pairwise collisions is too large then reject
3. else accept
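A collision-counting tester along these lines might look as follows. The slide leaves "too large" and the constant in the sample size open, so the sample size 16·√n and the threshold of twice the uniform expectation are assumptions chosen for this sketch, as are the function names.

```python
import math
import random
from collections import Counter

def count_collisions(samples):
    """Number of pairs of equal elements among the samples."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())

def uniformity_tester(sampler, n, rng, sample_factor=16, slack=2.0):
    """Accept iff the collision count is at most `slack` times the uniform expectation."""
    k = sample_factor * math.isqrt(n)
    samples = [sampler(rng) for _ in range(k)]
    threshold = slack * math.comb(k, 2) / n          # uniform expectation is C(k,2)/n
    return count_collisions(samples) <= threshold

n = 10_000
accepts_uniform = uniformity_tester(lambda r: r.randrange(n), n, random.Random(0))
# A distribution supported on only 100 of the n elements produces many collisions.
accepts_concentrated = uniformity_tester(lambda r: r.randrange(100), n, random.Random(0))
```

The tester never looks at which elements were drawn, only at how often draws coincide, which is why a sample of order √n is enough.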


31

Testing Expanders

TestingExpanders(G)
1. Sample set S of s vertices uniformly at random
2. for each v ∈ S do
3.   Let q be the distribution of endpoints of a Θ(log n)-step lazy random walk starting at v
4.   if TestUniformDistribution(q) rejects then reject
5. accept
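Putting random walks and collision counting together gives the following sketch. All constants (number of start vertices, walk length 4·log₂ n, 16·√n walks per start vertex, threshold of twice the uniform expectation) are assumptions for illustration and are not the values used in the cited analyses.

```python
import math
import random
from collections import Counter

def lazy_walk_endpoint(adj, D, start, steps, rng):
    """Endpoint of one lazy random walk: each neighbor w.p. 1/(2D), else stay."""
    v = start
    for _ in range(steps):
        j = rng.randrange(2 * D)
        if j < len(adj[v]):
            v = adj[v][j]
    return v

def collisions(samples):
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())

def expander_tester(adj, D, rng, s=5, walk_factor=4, sample_factor=16, slack=2.0):
    n = len(adj)
    vertices = sorted(adj)
    steps = walk_factor * max(1, math.ceil(math.log2(n)))
    k = sample_factor * math.isqrt(n)
    threshold = slack * math.comb(k, 2) / n
    for _ in range(s):
        v = rng.choice(vertices)
        endpoints = [lazy_walk_endpoint(adj, D, v, steps, rng) for _ in range(k)]
        if collisions(endpoints) > threshold:
            return False                 # endpoints are far from uniform
    return True

complete = {v: [u for u in range(64) if u != v] for v in range(64)}
cycles = {4 * c + i: [4 * c + (i + 1) % 4, 4 * c + (i - 1) % 4]
          for c in range(100) for i in range(4)}     # 100 disjoint 4-cycles
accepts_complete = expander_tester(complete, 63, random.Random(0))
accepts_cycles = expander_tester(cycles, 2, random.Random(0))
```

On the disjoint 4-cycles every walk is trapped among four vertices, so the collision count is far above the threshold regardless of the random choices.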

History
• The algorithm was proposed by [Goldreich and Ron, 2000], who conjectured it to be a property tester
• First complete analysis by [Czumaj and Sohler, 2010] (but weaker than conjectured)
• Later improved by [Nachmias and Shapira, 2010] and [Kale and Seshadhri, 2011]


32

Final Result

Theorem [Nachmias and Shapira, 2010; Kale and Seshadhri, 2011]
• Algorithm TestingExpanders accepts every φ-expander and rejects every graph that is ε-far from a Θ(φ²)-expander.
• The algorithm has a running time of O(n^(1/2+δ)).

Key structural property of „ε-far“-graphs
• If G is ε-far from a Θ(φ²)-expander then there exists a set U of Ω(εn) vertices with Φ(U) = O(φ²).
• This implies that for many vertices, the distribution of endpoints of a random walk of length O(log n) is significantly different from the uniform distribution


33

Third Wrap-Up – Testing Expansion

(Lazy) Random Walks
• Moves from a vertex to a random neighbor
• Converges to uniform distribution
• Speed of convergence depends on graph structure

Testing Expansion
• Random walk converges quickly in expander graphs
• Random walk converges slower if we are far from expander graphs
• Number of collisions among end points of random walks is minimized in expander graphs
• We can test expansion by counting collisions


34

Graph Clustering & Web Communities

Web Graph Communities
• Set of vertices that induces an expander graph and has a sparse cut to the rest of the graph
• Question: Is the web graph composed of a set of at most k communities?

Definition
• A subset C ⊆ V is called a (Φin, Φout)-cluster, if
  - Φ(G[C]) ≥ Φin (conductance of the subgraph induced by C)
  - Φ(C) ≤ Φout (expansion of C in G)

Definition
• A partition of V into at most k (Φin, Φout)-clusters is called a (k, Φin, Φout)-clustering


35

Testing k-Clusterings

A Simple Case?
• Distinguish between a union of at most k expander graphs with no edges in between and a set of more than k (large) expander graphs with no edges in between
• Can we use our previous algorithm to test for a k-clustering?

[Figure: four separate graphs, each labeled „Expander“]


36


A Simple Case? No!
• We do not know the size of the clusters (expander graphs) and estimating the support size of a distribution is hard [Raskhodnikova et al., 2009]


37


New idea
• If two vertices come from the same cluster, the random walks quickly converge to the same distribution
• So, we could try to sample a set of vertices and check for sets of vertices whose random walks induce the same distributions


38

Testing Closeness of Distributions

Main Idea [Batu et al. 2013; Chan et al. 2014]
• If p ≈ q, then the following experiments should give roughly the same number of collisions between elements from S and T:
  - Draw two sets S and T of m elements from p
  - Draw two sets S and T of m elements from q
  - Draw set S of m elements from p and set T of m elements from q
• If p and q differ significantly, at least one of the three values is different
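The three experiments can be combined into a single statistic. The sketch below uses the fact that the expected number of collisions between m draws from p and m draws from q is m²·Σ p(x)q(x), so "pp + qq − 2·pq" estimates m²·||p−q||². The function names and the interface (a function that returns m fresh samples) are assumptions for illustration; the cited testers use more careful statistics and thresholds.

```python
from collections import Counter

def cross_collisions(S, T):
    """Number of pairs (x in S, y in T) with x == y."""
    counts = Counter(T)
    return sum(counts[x] for x in S)

def squared_distance_estimate(draw_p, draw_q, m):
    """Unbiased estimate of ||p - q||² from the three collision experiments."""
    pp = cross_collisions(draw_p(m), draw_p(m))   # S and T both from p
    qq = cross_collisions(draw_q(m), draw_q(m))   # S and T both from q
    pq = cross_collisions(draw_p(m), draw_q(m))   # S from p, T from q
    return (pp + qq - 2 * pq) / (m * m)

# Two point distributions make the outcome deterministic.
point_0 = lambda m: [0] * m
point_1 = lambda m: [1] * m
same = squared_distance_estimate(point_0, point_0, 10)
different = squared_distance_estimate(point_0, point_1, 10)
```

For identical distributions the three counts agree in expectation and the estimate is close to 0; for the two point distributions it equals ||p−q||² = 2.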


39

Theorem [Batu et al. 2013; Chan et al. 2014]
• There is a tester that w.p. 2/3 accepts, if ||p-q|| ≤ ε/2, and rejects, if ||p-q|| ≥ ε.
• The query complexity of the algorithm is O(b/ε²), where b is an upper bound on ||p||² and ||q||².



40


We will need b to be O(1/n)



41

The Algorithm

ClusteringTest
1. Sample set S of s vertices uniformly at random
2. For any v ∈ S let D(v) be the distribution of end points of a random walk of length Θ(log n) starting at v
3. for each pair u,v ∈ S do
4.   if D(u) and D(v) are close then add an edge (u,v) to the „cluster graph“ on vertex set S
5. accept, if and only if the cluster graph is a collection of at most k cliques
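The structure of the algorithm can be illustrated on a tiny graph. The sketch below makes a strong simplification that is an assumption of this example: it computes the endpoint distributions D(v) exactly and compares them by their squared l2-distance, whereas the actual tester only has samples and uses a closeness tester. The constants, the threshold 0.1/n and all names are likewise assumptions.

```python
import itertools
import math
import random

def endpoint_distribution(adj, D, start, steps):
    """Exact endpoint distribution of a lazy walk (each neighbor w.p. 1/(2D))."""
    dist = {v: 0.0 for v in adj}
    dist[start] = 1.0
    for _ in range(steps):
        new = {v: 0.0 for v in adj}
        for v, mass in dist.items():
            for u in adj[v]:
                new[u] += mass / (2 * D)
            new[v] += mass * (1 - len(adj[v]) / (2 * D))
        dist = new
    return dist

def count_cliques(vertices, edges):
    """Number of components if the graph is a disjoint union of cliques, else None."""
    closed = {v: {v} for v in vertices}
    for u, v in edges:
        closed[u].add(v)
        closed[v].add(u)
    seen, components = set(), 0
    for v in vertices:
        if v in seen:
            continue
        # In a union of cliques all members share the same closed neighborhood.
        if any(closed[u] != closed[v] for u in closed[v]):
            return None
        seen |= closed[v]
        components += 1
    return components

def clustering_test(adj, D, k, rng, s=12, walk_factor=8):
    n = len(adj)
    steps = walk_factor * max(1, math.ceil(math.log2(n)))
    S = rng.sample(sorted(adj), min(s, n))
    D_of = {v: endpoint_distribution(adj, D, v, steps) for v in S}
    edges = [(u, v) for u, v in itertools.combinations(S, 2)
             if sum((D_of[u][w] - D_of[v][w]) ** 2 for w in adj) <= 0.1 / n]
    cliques = count_cliques(S, edges)
    return cliques is not None and cliques <= k

# Two disjoint complete graphs on 8 vertices each: exactly two clusters.
graph = {v: [u for u in range(b, b + 8) if u != v] for b in (0, 8) for v in range(b, b + 8)}
accepts_k2 = clustering_test(graph, D=7, k=2, rng=random.Random(1))
accepts_k1 = clustering_test(graph, D=7, k=1, rng=random.Random(1))
```

With 12 of the 16 vertices sampled, both cliques are always represented, so the cluster graph consists of exactly two cliques.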


42


Observation
• Algorithm ClusteringTest distinguishes between at most k expanders and more than k (large) expanders


43


Question
• Can we generalize ClusteringTest to testing of (k, Φin, Φout)-clusterings?


44

Testing k-Clusterings - Soundness

Challenge
• Since the clusters may be connected in a (k, Φin, Φout)-clustering, the stationary distribution may be uniform over G (and not over the cluster)


45


Challenge (continued)
• Need to show that for a proper length of the random walk there is an „intermediate“ distribution that is „reasonably stable“ w.r.t. the l2-error



47

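The Algorithm

ClusteringTest
1. Sample set S of s vertices uniformly at random
2. For any v ∈ S let D(v) be the distribution of end points of a random walk of length Θ(log n) starting at v
3. if ||D(v)||² > O(1/n) then reject
4. for each pair u,v ∈ S do
5.   if D(u) and D(v) are close then add an edge (u,v) to the „cluster graph“ on vertex set S
6. accept, if and only if the cluster graph is a collection of at most k connected components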


48

Testing k-Clusterings - Completeness

Required Properties of a (k, Φin, Φout)-clustering
• For most vertices v: The distribution D(v) of end points of a lazy random walk of proper length has ||D(v)||² = O(1/n)
• For most pairs u,v from the same cluster: ||D(v)-D(u)||² is very small

Useful Tool – Higher Order Cheeger's Inequality [Lee et al. 2014]
• Relates (k, Φin, Φout)-clustering to the k+1 largest eigenvalues


49


Structural property of „ε-far“-graphs (similarly to expanders)
• If G is ε-far from a (k, Φin*, Φout*)-clustering then there exists a partition into k+1 sets C1,…,Ck+1, each of Ω(ε²n/k) vertices and with Φ(Ci) = O(Φin*/ε²).


50


Theorem [Czumaj, Peng, Sohler, 2015]
• Algorithm ClusteringTest accepts every (k, Φin, Φout)-clustering with probability at least 2/3 and rejects every graph that is ε-far from every (k, Φin*, Φout*)-clustering with probability at least 2/3, where Φout = O(ε⁴ Φin²) and Φin* = Θ(ε⁴ Φin²/log n) for constants k,D.
• The running time of the algorithm is O*(√n).


51

Fourth Wrap-Up

Testing Clusterings
• End points of a random walk of proper length should be uniform on its cluster with not much probability „outside“
• If random walks start from two different points of the same cluster, their end point distributions are similar
• Collision statistics can be used to pairwise test similarity of distributions
• This can be used to approximate the cut structure

Take away message
• The distribution of end points of random walks (possibly comparing different starting vertices) contains a lot of information about the cut structure of a graph


52

Summary

Vision
• Learning from very large sets of massive graphs

Approach
• Feature computation by random sampling
• Analysis in the framework of property testing

Two Examples
• Expanders (connectivity measure in social networks)
• Clustering (structure of social networks)


53

Thank you!

Source
Slide 2: Allan Ajifo and cobalt123; Creative Commons license
Slide 3: GustavoG and Jasper Nance; Creative Commons license
Slide 4: Wikipedia; Jason Brown; Creative Commons license
Slide 5: GustavoG; Creative Commons license
Slide 6: GoldenRibbon; Creative Commons license