© 2005 IBM Corporation
Discovering Large Dense Subgraphs in Massive Graphs
David Gibson, IBM Almaden Research Center
Ravi Kumar, Yahoo! Research*
Andrew Tomkins, Yahoo! Research*
VLDB, Trondheim, September 1, 2005
* (Work performed while at IBM Almaden)
VLDB 2005
Discovering Dense Subgraphs © 2005 IBM Corporation
Agenda
Application Areas
Other Approaches
Shingling
Recursion
Data Set
Performance Results
Evolution studies
Applications
Web communities
– 4B Web pages + hyperlinks
Host collusion
– 50M Web hosts + intersite links
Blogging neighbourhoods
– 4M users + friend links
Telephone call networks
– Subscribers + people called
Email graph
– Enron employees + correspondents
Other Approaches
Trawling for bipartite cores [Kumar et al 1999]
Network flow [Flake et al 2000]
Peeling [Abello et al 2002]
Bursts [Tomkins et al 2003]
Why is discovering dense subgraphs hard?
– Size of locally dense regions is highly variable
Graphs, cliques, and dense subgraphs
Our goal:
Find large, dense, subgraphs
Constraints:
– Stream processing model
– Out-of-core sort
[Figure: graph G with subgraphs C1 and C2; density labels 60%, 67%, and 100%]
Shingling
The text problem: Create a document fingerprint which is immune to small changes.
1. Convert to a set of shingles
2. Hash each element of the set
3. Return minimum hash value
4. Repeat with different hash functions
Element  Shingle                          Hash
1        "overlapping subsequences of"    23
2        "subsequences of words"          12  (minimum)
3        "of words in"                    39
4        "words in the"                   22
5        "in the document"                44
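The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `word_shingles`, `seeded_hash`, and `fingerprint` are hypothetical, the three-word shingle length matches the slide's example, and seeding BLAKE2 with an integer is just one assumed way to obtain "different hash functions".

```python
import hashlib

def word_shingles(text, k=3):
    """Step 1: the set of overlapping k-word subsequences of the text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def seeded_hash(seed, item):
    """Deterministic 64-bit hash; each seed acts as a different hash function (assumption)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def fingerprint(text, num_hashes=4, k=3):
    """Steps 2-4: for each hash function, keep the minimum hash over all shingles."""
    shingles = word_shingles(text, k)
    if not shingles:  # text shorter than k words has no shingles
        return []
    return [min(seeded_hash(seed, sh) for sh in shingles)
            for seed in range(num_hashes)]
```

Calling `word_shingles("overlapping subsequences of words in the document")` yields the five elements listed on the slide. Two documents that differ in only a few words share most of their shingles, so many of their minimum hash values coincide.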
Shingling II
Shingling general sets
– Jaccard similarity between sets A and B:
– P[shingle matches] = J(A,B) = |A ∩ B| / |A ∪ B|
Parameters
– Pick c shingles to improve estimate
– Pick s = size of shingles for stricter matching
[Figure: sets A and B]
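As a rough illustration of the matching probability, the sketch below compares the exact Jaccard similarity with the fraction of c min-hash values that agree. It covers only the single-element case (s = 1); the function names and the seeded BLAKE2 hash are assumptions made for this example, not part of the talk.

```python
import hashlib

def seeded_hash(seed, item):
    """Deterministic 64-bit hash; each seed plays the role of one hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two non-empty sets."""
    return len(a & b) / len(a | b)

def min_hash_signature(items, c):
    """For each of c hash functions, the element of `items` with the smallest hash."""
    return [min(items, key=lambda x: seeded_hash(seed, x)) for seed in range(c)]

def estimate_jaccard(a, b, c=200):
    """Fraction of the c min-hash elements on which A and B agree."""
    sig_a = min_hash_signature(a, c)
    sig_b = min_hash_signature(b, c)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / c

a = set(range(0, 100))
b = set(range(50, 150))
exact = jaccard(a, b)              # 50 shared elements out of 150
approx = estimate_jaccard(a, b)    # approaches `exact` as c grows
```

A larger c tightens the estimate at the cost of more hashing work per set.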
Algorithm
Edge table representation:
v1 w1 w2
v2 w2 w3
v3 w4 w5
UNION-FIND identifies clusters
– Scan edge table once
– O(log n) memory is possible
UNION-FIND: too lenient. Exact match: too strict.
Need to find dense clusters of similar edge lists
Use shingles to compare edge lists
– And reduce data volume
[Figure: graph on vertices v1, v2, v3 and w1, w2, w3, w4, w5]
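To see why UNION-FIND alone is too lenient, here is a small sketch that scans the slide's three-row edge table once and merges every vertex with its outlink targets. The helper names are hypothetical, and the in-memory dictionary is a simplification that makes no attempt at the memory bound mentioned on the slide.

```python
def find(parent, x):
    """Return the representative of x, creating a singleton set if x is new."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the sets containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a

def clusters_from_edge_table(rows):
    """One scan over rows of the form (v, w1, w2, ...)."""
    parent = {}
    for v, *targets in rows:
        find(parent, v)
        for w in targets:
            union(parent, v, w)
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return list(groups.values())

edge_table = [("v1", "w1", "w2"), ("v2", "w2", "w3"), ("v3", "w4", "w5")]
clusters = clusters_from_edge_table(edge_table)
```

Here v1 and v2 land in the same cluster because they share the single target w2, even though their edge lists overlap only partially. Requiring identical edge lists instead would separate them, which is the "too strict" end of the spectrum.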
Algorithm
Shingle outlink sets:
v w1 … wN
v s1 … sC
Transpose to find sets of v’s:
s1 v1 v2 …
s2 v1 v3 …
Could run UnionFind now
Or, reduce graph again!
– Reduces data volume
– Finds dense clusters of v’s
[Figure: V, W, and S, with N outlinks and C shingles per vertex; arrow labeled "Shingle"]
Algorithm
0. Base case: UnionFind
1. Shingle
2. Transpose
3. Recurse
4. Map back
[Figure: V and W joined by E0; successive "Shingle" steps produce V’ (E1), V’’ (E2), etc.]
Algorithm
RecursiveShingle( E )
Shingle: S[v] = Shingle( E[v] ) for v in V
Transpose: E’[s] = { v | s in S[v] }
Recurse: clusters = RecursiveShingle( E’ )
  (base case: clusters = UnionFind( E’ ))
Map back: return { ∪ (v in C) E[v] | C in clusters }
[Figure: successive edge sets E0, E1, E2]
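The pseudocode above can be turned into a small runnable sketch. Everything beyond the four named steps is an assumption made for illustration: the `depth`, `s`, and `c` parameters, the choice of "the s elements with the smallest hash" as a shingle, the seeded BLAKE2 hash, and the in-memory dictionaries (the talk targets streaming and out-of-core processing instead).

```python
import hashlib
from collections import defaultdict

def seeded_hash(seed, item):
    """Deterministic 64-bit hash of repr(item) under hash function `seed` (assumption)."""
    data = f"{seed}:{item!r}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def shingle(items, s, c):
    """Up to c shingles: for each hash function, the s items with the smallest hash."""
    if len(items) < s:
        return set()  # sets smaller than s produce no shingle
    return {tuple(sorted(items, key=lambda x: seeded_hash(seed, x))[:s])
            for seed in range(c)}

def union_find(E):
    """Cluster the values of E: values that share a key end up together."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for members in E.values():
        members = list(members)
        for m in members:
            parent[find(m)] = find(members[0])
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())

def recursive_shingle(E, depth=2, s=2, c=3):
    # Shingle: S[v] = Shingle(E[v]) for every v
    S = {v: shingle(E[v], s, c) for v in E}
    # Transpose: E'[sh] = {v | sh in S[v]}
    E_next = defaultdict(set)
    for v, shingles in S.items():
        for sh in shingles:
            E_next[sh].add(v)
    # Recurse, or fall back to UnionFind at the last level
    if depth <= 1:
        clusters = union_find(E_next)
    else:
        clusters = recursive_shingle(E_next, depth - 1, s, c)
    # Map back: each cluster C of v's becomes the union of their edge lists
    return [set().union(*(E[v] for v in C)) for C in clusters]

# Toy graph: two groups of five vertices, each group sharing six targets.
E = {f"v{i}": {f"w{j}" for j in range(1, 7)} for i in range(1, 6)}
E.update({f"u{i}": {f"x{j}" for j in range(1, 7)} for i in range(1, 6)})
dense = recursive_shingle(E, depth=2)
```

On this toy input the two groups come back as two separate clusters of targets. On real data the choice of s and c controls how much overlap two edge lists need before they share a shingle.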
Data Stream Processing
RecursiveShingle( E )
Shingle: Linear scan of E
Transpose: Sort of size |E’|
Recurse: (2 or 3 times)
  UnionFind is linear
Map back: Linear scan of clusters and E
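The transpose step is the only one that needs a sort. A minimal sketch of the idea, assuming rows of the form (v, s1, …, sC): emit (shingle, v) pairs in one linear scan, sort them by shingle, then group runs of equal shingles in a second linear scan. The in-memory `sorted` call stands in for the out-of-core sort, and the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def emit_pairs(shingle_rows):
    """Linear scan: turn each row (v, s1, ..., sC) into (shingle, v) pairs."""
    for v, *shingles in shingle_rows:
        for sh in shingles:
            yield (sh, v)

def transpose(shingle_rows):
    """Sort pairs by shingle, then group runs of equal shingles."""
    pairs = sorted(emit_pairs(shingle_rows))  # stand-in for an out-of-core sort
    for sh, group in groupby(pairs, key=itemgetter(0)):
        yield (sh, [v for _, v in group])

rows = [("v1", "s1", "s2"), ("v2", "s1"), ("v3", "s2")]
transposed = list(transpose(rows))
```

For these rows the result is `s1 → [v1, v2]` and `s2 → [v1, v3]`, the same shape as the transposed table shown earlier in the talk.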
Data Set: The Web Host Graph
2.1 billion pages in the WebFountain store in September 2004
Site Browser system aggregates site information
– 50 million hostnames
– 11 billion host-to-host links. Mean outdegree = 220
Historical trace June – September, every two weeks
– How do large clusters form?
Test Runs
         Strict shingles                       Nonstrict shingles
Level    Vertices (M)   Edges (M)   Size       Vertices (M)   Edges (M)
0        50             11 000      110 GB     50             11 000
1        275            420         5.5 GB     957            2 500
2        60             98          690 MB     1000           1 200
                                               700            750
Running time: O(days)
Link Spam and Search Engines
Some results
– Several hundred giant dense subgraphs of at least 10 000 nodes
– 2000 dense subgraphs of at least 1000 nodes
– 64 000 dense subgraphs of at least 100 nodes
Sampling of clusters
– 88% are clearly spam networks
Clusters can be used to weight search engine results
– Easy to integrate into search engine workflow
Reduction in outdegree
[Figure: reduction in outdegree; series labeled 1, 2, 3]
Cluster Sizes
[Figure: cluster sizes; panels labeled Depth 2 and Depth 3]
Historical study
Study the growth of inlinks to cluster centers
10% growth in 3 months. Most growth is bursty
[Figure: axis label "Unique IP address inlinks"]
Summary
Shingles + Recursion = Large Dense Subgraphs
Extensions:
– Undirected graphs, hierarchical decompositions
– Other application areas, such as blogs
Data stream algorithms scale well
Thank you!