COMS 6998-06 Network Theory Week 6: February 28, 2008 Dragomir R. Radev Thursdays, 6-8 PM 233 Mudd...

COMS 6998-06 Network TheoryWeek 6: February 28, 2008

Dragomir R. RadevThursdays, 6-8 PM

233 MuddSpring 2008

(11) Community identification

(12) Spectral clustering

Community structure

Assortative mixing and community structure

• Community structure usually implies assortative mixing

• Assortative mixing doesn’t necessarily imply community structure (e.g., assortative mixing by age)

Building the graph

• Two nodes are linked if the two end points share a lot of attributes

• Examples: Hamming distance, Pearson correlation, cosine

Traditional methods

• Spectral bisection

• Kernighan-Lin (1970) O(n3)

• Fiduccia-Mattheyses heuristic: O(m)

Spectral algorithms

• The spectrum of a matrix is the list of all eigenvalues of a matrix sorted from smallest to largest.

• The eigenvectors in the spectrum are sorted by the absolute value of their corresponding eigenvalues.

• In spectral methods, eigenvectors are based on the Laplacian of the original matrix.

Laplacian matrix

• The Laplacian L of a matrix is a symmetric matrix.

• L = D – G, where D is the degree matrix corresponding to G.

• Example:

A

B

C

G

F

E

D

A B C D E F G

A 3 -1 0 0 0 -1 -1

B -1 3 0 -1 0 0 -1

C 0 0 3 -1 -1 -1 0

D 0 -1 -1 3 -1 0 0

E 0 0 -1 -1 2 0 0

F -1 0 -1 0 0 2 0

G -1 -1 0 0 0 0 2

Fiedler vector

• The Fiedler vector is the eigenvector of L(G) with the second smallest eigenvalue. That quantity is called the algebraic connectivity of the matrix.

A

B

C

G

F

E

D

A -0.3682 C1

B -0.2808 C1

C 0.3682 C2

D 0.2808 C2

E 0.5344 C2

F 0.0000 ?

G -0.5344 C1

Spectral bisection algorithm

• Compute 2• Compute the corresponding v2• For each node n of G

– If v2(n) < 0• Assign n to cluster C1

– Else if v2(n) > 0• Assign n to cluster C2

– Else if v2(n) = 0• Assign n to cluster C1 or C2 at random

A. Pothen, H. Simon, K.-P. Liou, “Partitioning sparse matrices with eigenvectors of graphs”, SIAM J. Mat. Anal. Appl. 11:430-452 (1990)M. Fiedler, “Algebraic Connectivity of Graphs”, Czech. Math. J., 23:298-305 (1973)M. Fiedler, Czech. Math. J., 25:619-637 (1975)

[Figure by Demmel]

Kernighan-Lin

• Q=benefit function = #edges within groups - #edges across groups

• Start with a bipartition of specified sizes. Let’s call the two parts A and B.

• Compute Q for each pair of vertices such that one of them is in A and the other in B.

• Pick the pair with the largest Q and swap its elements.• Repeat until there are no more swaps possible (no

vertex can change partitions more than once).• Retrace the algorithm and stop when Q is highest. Note

that Q did not have to change monotonically.

Kernighan-Lin

• Disadvantage: need split size– Can be fixed in O(n3)– However, this is likely to produce a lopsided

split (of sizes 0 and n).

Hierarchical clustering

• Based on similarity and dendrograms.

Centrality-based methods

• Girvan-Newman method based on edge betweenness

• Divisive method

• Computed in O(mn).

• Edge betweenness is the number of geodesic paths that run along a given edge.

Modularity

• Measures the quality of a split

• g groups, gxg matrix E such that eij is the fraction of edges in the original network that connect vertices in group i to vertices in group j.

where ||x|| is the sum of all values of x• Q is the fraction of all edges that lie within communities –

the expected value of the same quantity in a network in which the vertices have the same degrees but edges are placed at random.

2Trii ij kii ijk

Q e e e e e

Alternatives

• The method is slow – there are m edges to be removed so total runtime is O(m2n).

• Improvement 1 – Monte Carlo speedup (Tyler et al.) – compute EB only using a sample of vertices. An extra advantage – can give probabilistic node assignments (how?)

• Improvement 2 – local information only (Radicchi) – triangles are unlikely across communities. This method works better for social networks.

Wu and Huberman

• Use voltage gaps.• Challenge – pick the poles; randomized method works: pick random

pairs of non-neighbors• This method can be used to find nearby nodes too (how?)

F Wu, BA Huberman: Finding communities in linear time: a physics approach- The European Physical Journal B-Condensed Matter, 2004

http://www.springerlink.com/index/H6JFB70VQXADYDRM.pdf

A B

D E

H I

F G

C

J

Example from Caldarelli 2007]

Girvan/Newman example

H I D A B E F L C G

Edge betweenness dendrogram

Example from Caldarelli 2007]

[Detecting community structure in networks, M. E. J. Newman, Eur. Phys. J. B 38, 321–330 (2004). ]

http://www-personal.umich.edu/%7Emejn/papers/epjb.pdf

Existing software

• http://www.sandia.gov/~bahendr/chaco.html (CHACO)

• http://glaros.dtc.umn.edu/gkhome/views/metis/index.html (METIS)

• http://www.cs.utexas.edu/users/dml/Software/graclus.html (GRACLUS)

http://www.sandia.gov/~bahendr/chaco.html

http://www.sandia.gov/~bahendr/chaco.html

http://glaros.dtc.umn.edu/gkhome/views/metis/index.html

http://glaros.dtc.umn.edu/gkhome/views/metis/index.html

http://www.cs.utexas.edu/users/dml/Software/graclus.html

http://www.cs.utexas.edu/users/dml/Software/graclus.html

Minimum Spanning Tree• A spanning tree for a connected, undirected graph G=(V,E)

– a subgraph of G that is

– an undirected tree and contains

– all the vertices of G.

• In a weighted graph G=(V,E,W), the weight of a subgraph – the sum of the weights of the edges in the subgraph

• A minimum spanning tree for a weighted graph – a spanning tree with the minimum weight.

[MST and mincut slides by Rada Mihalcea]

Prim/Jarnik’s Minimum Spanning Tree Algorithm

• Select an arbitrary starting vertex (the root)• Branch out from the tree constructed so far by

– choosing an edge at each iteration– attach the edge to the tree

• the edge that has minimum weight among all edges that can be attached

– add to the tree the vertex associated with the edge

• Vertices are divided into three disjoint categories:– Tree vertices: in the tree constructed so far,– Fringe vertices: not in the tree, but adjacent to some vertex in the

tree,– Unseen vertices: all others

Prim’s Algorithm

Algorithm PrimMinSpanningTree(G)Initialize all nodes as unseen

Select an arbitrary vertex s to start the tree

Reclassify it as tree

Reclassify all vertices adjacent to s as fringe

While there are fringe vertices

Select an edge of minimum weight between a tree vertex t and a fringe vertex v

Reclassify v as tree; add edge tv to the tree

Reclassify all unseen vertices adjacent to v as fringe

Prim’s Algorithmv1 v2

v3 v4 v5

v6 v7

2

14

1

4

10

6

72

5 8

3

v1 v2

v3 v4 v5

v6 v7

2

14

1

4

10

6

72

5 8

3

v1 v2

v3 v4 v5

v6 v7

2

14

1

4

10

6

72

5 8

3

v1 v2

v3 v4 v5

v6 v7

2

14

1

4

10

6

72

5 8

3

Prim’s Algorithmv1 v2

v3 v4 v5

v6 v7

2

14

1

4

10

6

72

5 8

3

v1 v2

v3 v4 v5

v6 v7

2

14

1

4

10

6

72

5 8

3

v1 v2

v3 v4 v5

v6 v7

2

14

1

4

10

6

72

5 8

3

Properties of Minimum Spanning Trees

• Definition: Minimum spanning tree property– Let G be a connected, weighted graph G=(V,E,W), and let T be

any spanning tree of G. Suppose that for every edge vw of G that is not in T:

• if uv is added to T, then it creates a cycle • uv is a maximum-weight edge on that cycle

– The tree T is said to have the minimum spanning tree property

• A graph can admit several minimum spanning trees• Lemma: In a connected, weighted graph G = (V, E, W), if

T1 and T2 are two spanning trees that have the MST property, then they have the same total weight

• The number of edges in a minimum spanning tree is |V|-1

Flow networksWhat if weights in a graph are maximum capacities of some

flow of material?Examples:

– Pipe network to transport fluid (e.g., water, oil)• Edges – pipes, vertices – junctions of pipes

– Data communication network • Edges – network connections of different capacity, vertices – routers

(do not produce or consume data just move it)

Concepts (informally):– Source vertex s (where material is produced)– Sink vertex t (where material is consumed)– For all other vertices – what goes in must go out– Goal: maximum rate of material flow from source to sink

Formalization

Flow network – G=(V,E)– Directed, each edge has capacity c(u,v) 0– Two special vertices: source s, and sink t– For any other vertex v, there is a path s…v…t

Flow – f : V V R– Capacity constraint: u,v V: f(u,v) c(u,v)– Skew symmetry: u,v V: f(u,v) = –f(v,u)– Flow conservation: u V – {s, t}:

v V

v V

, of(u,v)= f(u,V)= 0

f(v,u)= f(V

r

,u)= 0

Cancellation of flows

Do we want to have positive flows going in both directions between two vertices?

No! such flows cancel (maybe partially) each other.

13

11

54

15

10

14

19

3

s t9

a b

c d

2 5

3

3

6

6

13

11

54

15

10

14

19

3

s t9

a b

c d

3

3

3

6

6

Maximum flow

Want to maximize the total

value of the flow f: v V

f = f(s,v)= f(s,V)= f(V,t)

Find a flow of maximum value.

13

11

54

15

10

14

19

3

s t9

a b

c d

5

13

3

8

6

108

2

Augmenting path

Idea for the algorithm:– If we have some flow, and can find a path p from s to t

(augmenting path), such that there is a > 0, and for each edge (u,v) in p we can add a units of flow: f(u,v) + a c(u,v)

– then mark down that path, to get a better flow

Augmenting path in this graph?

13

11

54

15

10

14

19

3

s t9

a b

c d

5

13

3

8

6

108

2

Ford-Fulkerson methodFord-Fulkerson(G,s,t) 1 initialize flow f to 0 everywhere

2 while there is an augmenting path p do

3 augment flow f along p

4 return f

• How do we find augmenting path?• How much additional flow can we send through that

path?

Residual network

How do we find augmenting path?

It is any path in residual network:– Residual capacities: cf(u,v) = c(u,v) – f(u,v)

– Residual network: Gf=(V,Ef), where Ef = {(u,v) V V : cf(u,v) > 0}

Compute residual network:

13

11

54

15

10

14

19

3

s t9

a b

c d

5

13

3

8

6

108

2

Residual capacity of a path

How much additional flow can we send through an augmenting path?

• Residual capacity of a path p in Gf:

cf(p) = min{cf(u,v): (u,v) p}

• Doing augmentation:– (u,v) in p, add cf(p) to f(u,v) (and subtract from f(v,u))

– Resulting flow is a valid flow with a larger value.

13

11

54

15

10

14

19

3

s t9

a b

c d

5

13

3

8

6

108

2What is the residual capacity of the

path (s,a,b,t)?

Ford-Fulkerson method, with details

Ford-Fulkerson(G,s,t) 1 for each edge (u,v)G.E do 2 f(u,v) = f(v,u) = 0

3 while path p from s to t in residual network Gf do

4 cf = min{cf(u,v): (u,v)p}

5 for each edge (u,v) in p do

6 f(u,v) = f(u,v) + cf

7 f(v,u) = -f(u,v)

8 return f

The algorithms based on this method differ in how they choose p in step 3.

Ford-Fulkerson method, example13

11

54

15

10

14

19

3

s t

a b

c d

33

3

13

11

54

15

10

14

19

3

s t

a b

c d

13

33

3

1313

13

11

54

15

10

14

19

3

s t

a b

c d

18

38

8

1313

5

• An s-t cut is a partition (S, T) such that s S, t T.– The capacity of an s-t cut (S, T) is:

• Min s-t cut: find an s-t cut of minimum capacity.

Cuts

TwSvEwvSe

wvueu

,),( ofout

).,(:)(

s

2

3

4

5

6

7

t

15

5

30

15

10

8

15

9

6 10

10

10 15 4

4

Capacity = 30

Cuts

s

2

3

4

5

6

7

t

15

5

30

15

10

8

15

9

6 10

10

10 15 4

4

Capacity = 62

s

2

3

4

5

6

7

t

15

5

30

15

10

8

15

9

6 10

10

10 15 4

4

Capacity = 28

Flows and Cuts

10

6

6

10

10

0 10

4 8 8

04 0

00

s

2

3

4

5

6

7

t

15

5

30

15

10

8

15

9

6 10

10

10 15 4

4

Value = 24

• L1. Let f be a flow, and let (S, T) be a cut. Then, the net flow sent across the cut is equal to the amount reaching t.

fefefefseSeSe

ofout to in ofout

)()()(

Max Flow and Min Cut• Corollary. Let f be a flow, and let (S, T) be a cut. If |f| = cap(S, T),

then f is a max flow and (S, T) is a min cut.

• Max-flow / min-cut theorem (Ford-Fulkerson): In any network, the value of the max flow is equal to the value of the min cut.

s

2

3

4

5

6

7

t

15

5

30

15

10

8

15

9

6 10

10

10 15 4

4

10

9

9

15

15

4 10

4 8 9

10 0

Flow value = 28

00

Cut capacity = 28

COMS 6998-06 Network Theory Week 6: February 28, 2008 Dragomir R. Radev Thursdays, 6-8 PM 233 Mudd...

Documents

Transcript of COMS 6998-06 Network Theory Week 6: February 28, 2008 Dragomir R. Radev Thursdays, 6-8 PM 233 Mudd...