© 2005 IBM Corporation

Discovering Large Dense Subgraphs in Massive Graphs

David Gibson, IBM Almaden Research Center
Ravi Kumar, Yahoo! Research*
Andrew Tomkins, Yahoo! Research*

VLDB, Trondheim, September 1, 2005

* (Work performed while at IBM Almaden)

VLDB 2005

Agenda

Application Areas
Other Approaches
Shingling
Recursion
Data Set
Performance
Results
Evolution studies


Applications

Web communities
– 4B Web pages + hyperlinks

Host collusion
– 50M Web hosts + intersite links

Blogging neighbourhoods
– 4M users + friend links

Telephone call networks
– Subscribers + people called

Email graph
– Enron employees + correspondents


Other Approaches

Trawling for bipartite cores [Kumar et al 1999]
Network flow [Flake et al 2000]
Peeling [Abello et al 2002]
Bursts [Tomkins et al 2003]

Why is discovering dense subgraphs hard?
– Size of locally dense regions is highly variable


Graphs, cliques, and dense subgraphs

Our goal:

Find large, dense subgraphs

Constraints:
– Stream processing model
– Out-of-core sort

[Figure: graph G with subgraphs C1 and C2; density labels 60%, 67%, and 100%]


Shingling

The text problem: create a document fingerprint that is immune to small changes.

1. Convert to a set of shingles

2. Hash each element of the set

3. Return minimum hash value

4. Repeat with different hash functions

Element    Shingle                          Hash
1          'overlapping subsequences of'    23
2          'subsequences of words'          12  (minimum)
3          'of words in'                    39
4          'words in the'                   22
5          'in the document'                44
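The four steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the three-word shingle width, and the use of salted BLAKE2b as a family of hash functions are all assumptions made for the example.

```python
import hashlib

def word_shingles(text, k=3):
    """Step 1: return the set of overlapping k-word subsequences of text."""
    words = text.split()
    # A text shorter than k words yields a single shingle holding all of it.
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def seeded_hash(shingle, seed):
    """Step 2: 64-bit hash; each seed stands in for a different hash function."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def fingerprint(text, num_hashes=4, k=3):
    """Steps 3 and 4: keep the minimum hash value under each hash function."""
    shingles = word_shingles(text, k)
    return [min(seeded_hash(s, seed) for s in shingles)
            for seed in range(num_hashes)]

text = "overlapping subsequences of words in the document"
print(sorted(word_shingles(text)))  # the five shingles listed in the table
print(fingerprint(text))            # four minimum hash values
```

Two texts that share most of their shingles will often share the same minimum under a given hash function, which is what makes the fingerprint tolerant of small edits.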


Shingling II

Shingling general sets
– Jaccard similarity between sets A and B:
– P[shingle matches] = J(A,B) = | A ∩ B | / | A ∪ B |

Parameters
– Pick c shingles to improve estimate
– Pick s = size of shingles for stricter matching

[Figure: overlapping sets A and B]
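As a rough illustration of the two parameters, the sketch below builds c shingles of size s for an arbitrary set and uses size-1 shingles to estimate Jaccard similarity. The helper names and the SHA-256-based hash family are assumptions for this example; the slides do not specify how the hash functions are chosen.

```python
import hashlib

def h(item, seed):
    """Hash an item under the hash function identified by seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def shingles(items, s, c):
    """Return c shingles; shingle j holds the s items with the smallest
    hash values under hash function j."""
    return [(seed, tuple(sorted(items, key=lambda x: h(x, seed))[:s]))
            for seed in range(c)]

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def estimate_jaccard(a, b, c):
    """Fraction of the c size-1 shingles on which the two sets agree."""
    sa, sb = shingles(a, 1, c), shingles(b, 1, c)
    return sum(x == y for x, y in zip(sa, sb)) / c

a = set(range(100))
b = set(range(50, 150))
print(jaccard(a, b))                # exact value, 1/3
print(estimate_jaccard(a, b, 400))  # estimate from 400 shingles
```

Raising c averages over more hash functions and tightens the estimate. Raising s requires the s smallest items to coincide, so two sets with the same overlap match less often.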


Algorithm

Edge table representation:
v1 w1 w2
v2 w2 w3
v3 w4 w5

UNION-FIND identifies clusters
– Scan edge table once
– O(log n) memory is possible

UNION-FIND: too lenient
Exact-Match: too strict

Need to find dense clusters of similar edge lists

Use shingles to compare edge lists
– And reduce data volume

[Figure: graph with edges from v1, v2, v3 to w1 … w5]
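To make the "too lenient" point concrete, here is a small union-find sketch run over the three-row edge table from this slide. It is an in-memory illustration with hypothetical function names; it keeps one parent entry per vertex and does not attempt the low-memory variant mentioned above.

```python
def find(parent, x):
    """Return the representative of x, compressing the path on the way."""
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def union(parent, a, b):
    """Merge the clusters containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def connected_clusters(edge_table):
    """Single scan of the edge table: join every v to each of its w's."""
    parent = {}
    for v, ws in edge_table:
        find(parent, v)  # register v even if it has no outlinks
        for w in ws:
            union(parent, v, w)
    clusters = {}
    for node in parent:
        clusters.setdefault(find(parent, node), set()).add(node)
    return list(clusters.values())

edge_table = [("v1", ["w1", "w2"]), ("v2", ["w2", "w3"]), ("v3", ["w4", "w5"])]
for cluster in connected_clusters(edge_table):
    print(sorted(cluster))
```

Here v1 and v2 land in one cluster because they share the single destination w2. One shared edge is enough to merge two groups, which is why plain union-find is too lenient for finding dense regions.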


Algorithm

Shingle outlink sets:
v w1 … wN
v s1 … sC

Transpose to find sets of v’s:
s1 v1 v2 …
s2 v1 v3 …

Could run UnionFind now

Or, reduce graph again!
– Reduces data volume
– Finds dense clusters of v’s

[Figure: the Shingle step maps V to W edges (N per vertex) to V to S edges (C per vertex)]


Algorithm

0. Base case: UnionFind
1. Shingle
2. Transpose
3. Recurse
4. Map back

[Figure: vertex sets V, W, V’, V’’, etc., linked by edge sets E0, E1, E2; each level is produced by a Shingle step]


Algorithm

RecursiveShingle( E )

Shingle: S[v] = Shingle( E[v] ) for v in V

Transpose: E’[s] = { v | s in S[v] }

Recurse: clusters = RecursiveShingle( E’ )
  base: clusters = UnionFind( E’ )

Map back: return { ∪_{v in C} E[v] | C in clusters }
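The pseudocode above can be turned into a small in-memory Python sketch. Several details are assumptions made for illustration and are not stated on the slides: the `levels` parameter that decides when the base case applies, the SHA-256-based hash family, the default values of `s` and `c`, and the choice to give no shingles to a set with fewer than `s` elements.

```python
import hashlib
from collections import defaultdict

def h(item, seed):
    """Hash an item under the hash function identified by seed."""
    data = f"{seed}:{item!r}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def shingle(items, s, c):
    """c shingles, each the s items with the smallest hash under one function."""
    if len(items) < s:  # assumption: sets that are too small get no shingles
        return set()
    return {(seed, tuple(sorted(items, key=lambda x: h(x, seed))[:s]))
            for seed in range(c)}

def transpose(S):
    """E'[s] = { v | s in S[v] }"""
    E1 = defaultdict(set)
    for v, shingles in S.items():
        for sh in shingles:
            E1[sh].add(v)
    return dict(E1)

def union_find(E):
    """Base case: cluster together all values that share a key of E."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for members in E.values():
        members = list(members)
        for m in members:
            parent[find(m)] = find(members[0])
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())

def recursive_shingle(E, levels=2, s=2, c=3):
    S = {v: shingle(ws, s, c) for v, ws in E.items()}        # Shingle
    E1 = transpose(S)                                        # Transpose
    if levels <= 1:
        clusters = union_find(E1)                            # base case
    else:
        clusters = recursive_shingle(E1, levels - 1, s, c)   # Recurse
    return [set().union(*(E[v] for v in C)) for C in clusters]  # Map back

# Two groups of sources with identical outlink sets.
E = {f"a{i}": {"x1", "x2", "x3", "x4", "x5"} for i in range(4)}
E.update({f"b{i}": {"y1", "y2", "y3", "y4", "y5"} for i in range(4)})
for cluster in recursive_shingle(E, levels=2):
    print(sorted(cluster))
```

As in the pseudocode, each call returns clusters over the values of the edge table it was given, so the outermost call reports sets of destination vertices. In this toy input the two groups separate cleanly because sources within a group have identical outlink sets.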


Data Stream Processing

RecursiveShingle( E )

Shingle: Linear scan of E

Transpose: Sort of size |E’|

Recurse: (2 or 3 times)
  UnionFind is linear

Map back: Linear scan of clusters and E
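The transpose step is the only one that needs a sort. A minimal sketch of the idea, assuming rows of the form (v, [s1 … sC]): a linear scan emits (shingle, v) pairs, the pairs are sorted, and a second linear scan groups them by shingle. Python's in-memory `sorted` stands in here for the out-of-core sort, and the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def emit_pairs(shingle_rows):
    """Linear scan: flatten (v, [s1 … sC]) rows into (shingle, v) pairs."""
    for v, shingles in shingle_rows:
        for s in shingles:
            yield (s, v)

def transpose_by_sort(shingle_rows):
    """Sort the pairs by shingle, then group adjacent pairs in one pass."""
    pairs = sorted(emit_pairs(shingle_rows))  # stand-in for an external sort
    for s, group in groupby(pairs, key=itemgetter(0)):
        yield s, [v for _, v in group]

rows = [("v1", ["s1", "s2"]), ("v2", ["s1"]), ("v3", ["s2"])]
for s, vs in transpose_by_sort(rows):
    print(s, vs)
```

Because grouping only ever looks at adjacent pairs, the second pass can read the sorted data as a stream without holding the whole table in memory.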


Data Set: The Web Host Graph

2.1 billion pages in the WebFountain store in September 2004

Site Browser system aggregates site information
– 50 million hostnames
– 11 billion host-to-host links. Mean outdegree = 220

Historical trace June – September, every two weeks

– How do large clusters form?


Test Runs

        Strict shingles                          Nonstrict shingles
        Vertices (M)   Edges (M)                 Vertices (M)   Edges (M)
0       50             11 000      110 GB        50             11 000
1       275            420         5.5 GB        957            2 500
2       60             98          690 MB        1000           1 200

700 750

Running time: O(days)


Link Spam and Search Engines

Some results
– Several hundred giant dense subgraphs of at least 10 000 nodes
– 2000 dense subgraphs of at least 1000 nodes
– 64 000 dense subgraphs of at least 100 nodes

Sampling of clusters
– 88% are clearly spam networks

Clusters can be used to weight search engine results
– Easy to integrate into search engine workflow


Reduction in outdegree

[Figure: outdegree plot with series labeled 1, 2, and 3]


Cluster Sizes

[Figure: cluster size plots; panels: Depth 2, Depth 3]


Historical study

Study the growth of inlinks to cluster centers

10% growth in 3 months. Most growth is bursty

[Figure: plot; axis label: Unique IP address inlinks]


Summary

Shingles + Recursion = Large Dense Subgraphs

Extensions:
– Undirected graphs, hierarchical decompositions

– Other application areas, such as blogs

Data stream algorithms scale well

Thank you!
