
© 2005 IBM Corporation

Discovering Large Dense Subgraphs in Massive Graphs

David Gibson, IBM Almaden Research Center
Ravi Kumar, Yahoo! Research*
Andrew Tomkins, Yahoo! Research*

VLDB, Trondheim, September 1, 2005

* (Work performed while at IBM Almaden)

Agenda

Application Areas
Other Approaches
Shingling
Recursion
Data Set
Performance
Results
Evolution studies

Applications

Web communities
– 4B Web pages + hyperlinks

Host collusion
– 50M Web hosts + intersite links

Blogging neighbourhoods
– 4M users + friend links

Telephone call networks
– Subscribers + people called

Email graph
– Enron employees + correspondents

Other Approaches

Trawling for bipartite cores [Kumar et al. 1999]
Network flow [Flake et al. 2000]
Peeling [Abello et al. 2002]
Bursts [Tomkins et al. 2003]

Why is discovering dense subgraphs hard?
– Size of locally dense regions is highly variable

Graphs, cliques, and dense subgraphs

Our goal: find large, dense subgraphs

Constraints:
– Stream processing model
– Out-of-core sort

[Figure: graph G containing subgraphs C1 and C2, with labels 60%, 67%, and 100%]

Shingling

The text problem: create a document fingerprint that is immune to small changes.

1. Convert to a set of shingles
2. Hash each element of the set
3. Return minimum hash value
4. Repeat with different hash functions

Element  Shingle                        Hash
1        'overlapping subsequences of'  23
2        'subsequences of words'        12  (minimum)
3        'of words in'                  39
4        'words in the'                 22
5        'in the document'              44
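The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the three-word shingle width, and the use of a seeded SHA-256 digest as a family of hash functions are all assumptions made for the example.

```python
import hashlib

def word_shingles(text, w=3):
    """Step 1: the set of overlapping w-word subsequences of the text."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def salted_hash(item, seed):
    """Deterministic 64-bit hash; each seed stands in for one hash function (assumption)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def fingerprint(text, num_hashes=4, w=3):
    """Steps 2-4: hash every shingle, keep the minimum, repeat per hash function.

    Raises ValueError if the text has fewer than w words (no shingles).
    """
    shingles = word_shingles(text, w)
    return [min(salted_hash(s, seed) for s in shingles) for seed in range(num_hashes)]

original = "overlapping subsequences of words in the document"
edited = "overlapping subsequences of words in the document text"

print(sorted(word_shingles(original)))  # the five shingles from the table
agree = sum(a == b for a, b in zip(fingerprint(original), fingerprint(edited)))
print(agree, "of 4 minimum hashes agree after a small edit")
```

Appending one word adds a single new shingle, so most of the minimum hashes are likely to survive the edit; the exact count depends on the hash functions chosen.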

Shingling II

Shingling general sets
– Jaccard similarity between sets A and B:
– P[shingle matches] = J(A, B) = |A ∩ B| / |A ∪ B|

Parameters
– Pick c shingles to improve estimate
– Pick s = size of shingles for stricter matching

[Figure: sets A and B]
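A small Python sketch can make the roles of c and s concrete. It assumes one common reading of the slide: each of c hash functions orders the set, and a shingle is the tuple of the s elements with the smallest hashes. The function names and the seeded SHA-256 hash family are illustrative assumptions.

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def salted_hash(item, seed):
    """Deterministic 64-bit hash; each seed stands in for one hash function (assumption)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def set_shingles(elements, s, c):
    """Return c shingles; shingle j holds the s elements with the smallest hashes under function j."""
    shingles = []
    for seed in range(c):
        ranked = sorted(elements, key=lambda e: salted_hash(e, seed))
        shingles.append(tuple(ranked[:s]))
    return shingles

def match_rate(a, b, s, c):
    """Fraction of the c shingle positions on which the two sets agree."""
    pairs = zip(set_shingles(a, s, c), set_shingles(b, s, c))
    return sum(x == y for x, y in pairs) / c

A = set(range(0, 60))
B = set(range(20, 80))
print("exact J(A, B):", jaccard(A, B))
print("estimate, s=1, c=200:", match_rate(A, B, s=1, c=200))
print("match rate, s=2, c=200:", match_rate(A, B, s=2, c=200))
```

With s = 1 the match rate estimates J(A, B), and a larger c tightens that estimate. With s = 2 two sets must agree on both smallest elements, so the match rate can only drop, which is the stricter matching the slide refers to.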

Algorithm

Edge table representation:
v1 w1 w2
v2 w2 w3
v3 w4 w5

UNION-FIND identifies clusters
– Scan edge table once
– O(log n) memory is possible

UNION-FIND: too lenient
Exact-Match: too strict

Need to find dense clusters of similar edge lists

Use shingles to compare edge lists
– And reduce data volume

[Figure: bipartite graph with sources v1, v2, v3 and destinations w1 to w5]
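To see why UNION-FIND alone is too lenient, here is a minimal Python sketch run on the three-row edge table from this slide. The helper names are illustrative, and the sketch keeps one parent entry per node, so it does not attempt the memory bound mentioned above.

```python
def find(parent, x):
    """Return the root of x, shortening the path as it goes."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the components containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def source_clusters(edge_table):
    """Scan the edge table once and group sources by connected component."""
    parent = {}
    for v, outlinks in edge_table:
        for w in outlinks:
            union(parent, v, w)
    groups = {}
    for v, _ in edge_table:
        groups.setdefault(find(parent, v), []).append(v)
    return sorted(sorted(g) for g in groups.values())

edge_table = [
    ("v1", ["w1", "w2"]),
    ("v2", ["w2", "w3"]),
    ("v3", ["w4", "w5"]),
]
print(source_clusters(edge_table))  # [['v1', 'v2'], ['v3']]
```

v1 and v2 end up in one cluster although they share only w2 out of three distinct destinations. Exact matching of edge lists would instead leave all three sources separate, which is the other extreme named on the slide.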

Algorithm

Shingle outlink sets:
v w1 … wN
v s1 … sC

Transpose to find sets of v’s:
s1 v1 v2 …
s2 v1 v3 …

Could run UnionFind now

Or, reduce graph again!
– Reduces data volume
– Finds dense clusters of v’s

[Figure: node sets V, W, and S, with N outlinks and C shingles per vertex, connected by a Shingle step]

Algorithm

0. Base case: UnionFind
1. Shingle
2. Transpose
3. Recurse
4. Map back

[Figure: recursion from V and W through V’, V’’, etc., with edge sets E0, E1, E2 produced by successive Shingle steps]

Algorithm

RecursiveShingle( E )

  Shingle:   S[v] = Shingle( E[v] ) for v in V

  Transpose: E’[s] = { v | s in S[v] }

  Recurse:   clusters = RecursiveShingle( E’ )
             base: clusters = UnionFind( E’ )

  Map back:  return { ∪_{v in C} E[v] | C in clusters }

[Figure: edge sets E0, E1, E2 at successive recursion levels]
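The pseudocode above can be turned into a small in-memory Python sketch. It is an illustration under stated assumptions, not the authors' code: the explicit `depth` argument, the defaults s = 2 and c = 20, the seeded SHA-256 hash family, and the choice to have the base case group the values of E’ that share a key are all introduced here for the example.

```python
import hashlib
from collections import defaultdict

def salted_hash(item, seed):
    """Deterministic 64-bit hash; each seed stands in for one hash function (assumption)."""
    digest = hashlib.sha256(f"{seed}:{item!r}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def shingle(elements, s, c):
    """Return up to c shingles, each the s elements with the smallest hashes."""
    elements = list(elements)
    if len(elements) < s:
        return set()  # too small to produce an s-element shingle
    result = set()
    for seed in range(c):
        ranked = sorted(elements, key=lambda e: salted_hash(e, seed))
        result.add(tuple(ranked[:s]))
    return result

def union_find(edges):
    """Base case: group the values of `edges` that are linked through a shared key."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in edges.values():
        members = list(members)
        root = find(members[0])
        for m in members[1:]:
            other = find(m)
            if other != root:
                parent[other] = root
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())

def recursive_shingle(edges, depth, s=2, c=20):
    """Shingle, transpose, recurse (or union-find at depth 1), then map back."""
    shingles = {v: shingle(ws, s, c) for v, ws in edges.items()}
    transposed = defaultdict(set)
    for v, shs in shingles.items():
        for sh in shs:
            transposed[sh].add(v)
    if depth > 1:
        clusters = recursive_shingle(transposed, depth - 1, s, c)
    else:
        clusters = union_find(transposed)
    # Map back: each cluster of keys becomes the union of their edge lists.
    return [set().union(*(edges[v] for v in cluster)) for cluster in clusters]

# Two groups of sources with identical outlink sets inside each group.
graph = {f"a{i}": {f"x{j}" for j in range(10)} for i in range(6)}
graph.update({f"b{i}": {f"y{j}" for j in range(10)} for i in range(6)})
for cluster in recursive_shingle(graph, depth=2):
    print(sorted(cluster))
```

As in the pseudocode, the value returned at the top level is the union of the outlink sets of each cluster of sources. A production version would replace the dictionaries with scans and sorts over files, as the next slide describes.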

Data Stream Processing

RecursiveShingle( E )

Shingle: Linear scan of E

Transpose: Sort of size |E’|

Recurse: (2 or 3 times); UnionFind is linear

Map back: Linear scan of clusters and E
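The Transpose step is the only one that needs a sort. A minimal Python sketch of that step, assuming the shingled records arrive as (v, list of shingles) pairs: the in-memory `sorted` call is a stand-in for the out-of-core sort, and the function names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def emit_pairs(shingled):
    """Linear scan: turn each (v, [s1, ..., sC]) record into (s, v) pairs."""
    for v, shingles in shingled:
        for s in shingles:
            yield (s, v)

def transpose(shingled):
    """Sort the (s, v) pairs by shingle, then stream out one record per shingle."""
    pairs = sorted(emit_pairs(shingled))  # stand-in for an external (out-of-core) sort
    for s, run in groupby(pairs, key=itemgetter(0)):
        yield s, [v for _, v in run]

records = [("v1", ["s1", "s2"]), ("v2", ["s1"]), ("v3", ["s2"])]
print(list(transpose(records)))
# [('s1', ['v1', 'v2']), ('s2', ['v1', 'v3'])]
```

After the sort, every shingle's vertices sit in one contiguous run, so grouping needs only a single forward pass and memory for one run at a time.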

Data Set: The Web Host Graph

2.1 billion pages in the WebFountain store in September 2004

Site Browser system aggregates site information
– 50 million hostnames
– 11 billion host-to-host links; mean outdegree = 220

Historical trace, June to September, every two weeks
– How do large clusters form?

Test Runs

     Strict shingles                      Nonstrict shingles
     Vertices (M)  Edges (M)              Vertices (M)  Edges (M)
0    50            11 000     110 GB      50            11 000
1    275           420        5.5 GB      957           2 500
2    60            98         690 MB      1000          1 200

700 750

Running time: O(days)

Link Spam and Search Engines

Some results
– Several hundred giant dense subgraphs of at least 10 000 nodes
– 2000 dense subgraphs of at least 1000 nodes
– 64 000 dense subgraphs of at least 100 nodes

Sampling of clusters
– 88% are clearly spam networks

Clusters can be used to weight search engine results
– Easy to integrate into search engine workflow

Reduction in outdegree

[Figure: reduction in outdegree, with series labelled 1, 2, and 3]

Cluster Sizes

[Figure: cluster sizes, with panels "Depth 2" and "Depth 3"]

Historical study

Study the growth of inlinks to cluster centers

10% growth in 3 months. Most growth is bursty

[Figure: plot with axis label "Unique IP address inlinks"]

Summary

Shingles + Recursion = Large Dense Subgraphs

Extensions:
– Undirected graphs, hierarchical decompositions
– Other application areas, such as blogs

Data stream algorithms scale well

Thank you!
