CS728-2008 Lecture 9: Storing and Querying Large Web Graphs
Last Time
• Algorithms for link-based clustering
• Finding “tightly knit communities” (TKCs) in web graphs
Today’s lecture
Dealing with large graphs
Building Indexes for adjacency and connectivity testing
Distance and Transitive Closure
• New Data Structure: 2-hop covers
Connectivity Server
• Support for fast queries on the web graph
– Which URLs point to a given URL?
– Which URLs does a given URL point to?
• Stores mappings in main memory from
– URL to outlinks, URL to inlinks
• Applications
– Crawl control, Web graph analysis
– Connectivity, crawl optimization
– TKCs, other link analysis
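The two mappings above can be sketched as a pair of in-memory dictionaries. This is only an illustration: the class name `ConnectivityServer`, its methods, and the example URLs are hypothetical and not taken from any real system.

```python
from collections import defaultdict


class ConnectivityServer:
    """Toy in-memory index: URL -> outlinks and URL -> inlinks (hypothetical API)."""

    def __init__(self):
        self.outlinks = defaultdict(set)  # URL -> URLs it points to
        self.inlinks = defaultdict(set)   # URL -> URLs that point to it

    def add_link(self, src, dst):
        # Record the hyperlink in both directions so either query is one lookup.
        self.outlinks[src].add(dst)
        self.inlinks[dst].add(src)

    def points_to(self, url):
        """Which URLs does the given URL point to?"""
        return sorted(self.outlinks.get(url, ()))

    def pointed_to_by(self, url):
        """Which URLs point to the given URL?"""
        return sorted(self.inlinks.get(url, ()))


server = ConnectivityServer()
server.add_link("a.example/x", "b.example/y")
server.add_link("c.example/z", "b.example/y")
print(server.pointed_to_by("b.example/y"))
```

Storing full URL strings like this is exactly what the following slides try to avoid; the integer IDs and compressed adjacency lists replace the sets used here.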
Problem of Adjacency lists
• Adjacency lists (ALs) store the set of neighbors of each node
• Assume each URL is represented by an integer
– Use some natural ordering or hashing
– E.g., for a 60K page web, need 16 bits
– For a 4 billion page web, need 32 bits per node
• Naively, this demands 32-64 bits to represent each hyperlink
• Can we compress this?
Adjacency list compression
• Properties exploited in compression:
– Similarity (between lists)
– Locality (many links from a page go to “nearby” pages)
– Can use gap encodings in sorted lists
– Look at distribution of gap values
Gap encoding (Elias)
• Given a list of integers in increasing order.
– E.g., 33,47,154,159,202 …
• It suffices to store gaps.
– 33,14,107,5,43 …
• We hope: most gaps are encoded with far fewer bits.
• Represent a gap G as the pair <length,offset>
• length is $\lfloor \log_2 G \rfloor$ in unary; it uses $\lfloor \log_2 G \rfloor + 1$ bits and specifies the length of the binary encoding of offset
• offset = $G - 2^{\lfloor \log_2 G \rfloor}$ in binary, written with $\lfloor \log_2 G \rfloor$ bits.
Recall that the unary encoding of x is a sequence of x 1’s followed by a 0.
Elias codes for gap encoding
• e.g., 9 is represented as <1110,001>.
• 2 is represented as <10,0>.
• Exercise: does zero have a code?
• Encoding G takes $2\lfloor \log_2 G \rfloor + 1$ bits.
– codes are always of odd length.
– 1 = $2^0$ + 0 = 0
– 2 = $2^1$ + 0 = 100
– 3 = $2^1$ + 1 = 101
– 4 = $2^2$ + 0 = 11000
– 5 = $2^2$ + 1 = 11001
– 6 = $2^2$ + 2 = 11010
– 7 = $2^2$ + 3 = 11011
– 8 = $2^3$ + 0 = 1110000
– 9 = $2^3$ + 1 = 1110001
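The <length,offset> scheme can be sketched in a few lines, assuming the unary convention used on these slides (x ones followed by a zero). The function names `to_gaps` and `elias_gamma` are illustrative choices, not names from the lecture.

```python
def to_gaps(sorted_ids):
    """Turn an increasing list of integers into its gap sequence."""
    gaps = []
    previous = 0
    for value in sorted_ids:
        gaps.append(value - previous)
        previous = value
    return gaps


def elias_gamma(gap):
    """Encode a gap G >= 1 as unary(floor(log2 G)) followed by the offset bits."""
    if gap < 1:
        raise ValueError("only positive integers have a code")
    k = gap.bit_length() - 1                  # floor(log2 G)
    offset = gap - (1 << k)                   # G - 2^floor(log2 G)
    offset_bits = format(offset, "b").zfill(k) if k else ""
    return "1" * k + "0" + offset_bits        # 2k + 1 bits in total


print(to_gaps([33, 47, 154, 159, 202]))
print(elias_gamma(9), elias_gamma(2))
```

Each code has length 2k + 1 with k = floor(log2 G), which is why the codes always have odd length.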
Exercise
• Given the following sequence of coded gaps, reconstruct the gap sequence:
1110001110101011111101101111011
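One way to check an answer to this exercise is a small decoder. The sketch below assumes a well-formed bit string in the <length,offset> format described earlier (unary length as ones followed by a zero); the name `decode_gaps` is hypothetical.

```python
def decode_gaps(bits):
    """Decode a concatenation of Elias-coded gaps given as a string of '0'/'1'."""
    gaps = []
    i = 0
    while i < len(bits):
        k = 0
        while bits[i] == "1":     # unary part: count the leading ones
            k += 1
            i += 1
        i += 1                    # skip the terminating zero of the unary part
        offset = int(bits[i:i + k], 2) if k else 0
        i += k
        gaps.append((1 << k) + offset)
    return gaps


# The two example codes from the slides, concatenated: <1110,001> then <10,0>.
print(decode_gaps("1110001" + "100"))
```

Running `decode_gaps` on the exercise string and taking running sums of the result recovers the original sorted list.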
Storage Requirements
• A recent paper by Boldi/Vigna reports that we can get down to an average of ~3 bits/link
– (URL to URL edge)
– For a 118M node web graph
• How can this be possible?
Why is this remarkable?
Main ideas of Boldi/Vigna
• First consider the lexicographically ordered list of all URLs, e.g.,
– www.stanford.edu/alchemy
– www.stanford.edu/biology
– www.stanford.edu/biology/plant
– www.stanford.edu/biology/plant/copyright
– www.stanford.edu/biology/plant/people
– www.stanford.edu/chemistry
Boldi/Vigna
• Each of these URLs has an adjacency list
• Main thesis: because of the use of webpage templates, the adjacency list of a node is usually similar to that of one of the 7 preceding URLs in the lexicographic ordering
• Express the adjacency list in terms of one of these
• E.g., consider these adjacency lists
– 1, 2, 4, 8, 16, 32, 64
– 1, 4, 9, 16, 25, 36, 49, 64
– 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
– 1, 4, 8, 16, 25, 36, 49, 64
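To see why similarity helps, the fourth list above can be written relative to the second one: a bit mask saying which entries of the reference to copy, plus the few entries that are new. The sketch below is a simplified illustration of that idea only; it is not the actual Boldi/Vigna storage format, and the function names are hypothetical.

```python
def encode_with_reference(reference, target):
    """Describe `target` as (copy mask over `reference`, extra entries)."""
    target_set = set(target)
    copy_mask = [1 if x in target_set else 0 for x in reference]
    copied = {x for x, keep in zip(reference, copy_mask) if keep}
    extras = [x for x in target if x not in copied]
    return copy_mask, extras


def decode_with_reference(reference, copy_mask, extras):
    """Rebuild the sorted adjacency list from the reference, mask and extras."""
    copied = [x for x, keep in zip(reference, copy_mask) if keep]
    return sorted(copied + extras)


reference = [1, 4, 9, 16, 25, 36, 49, 64]   # second list on the slide
target = [1, 4, 8, 16, 25, 36, 49, 64]      # fourth list on the slide
mask, extras = encode_with_reference(reference, target)
print(mask, extras)
```

Here the whole list costs one bit per reference entry plus a single extra value, and the extras can themselves be gap-encoded.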
Connectivity Queries
• Beyond basic adjacency we’d like to answer other queries…
– Transitive closure: is there a path from x to y?
– Distance: what is the length of the shortest path from x to y?
• Applications
– Link analysis
– XML path queries with wildcards
Naïve Solutions
• Given a web graph, we can compute and store All Pairs Shortest Paths (APSPs) off-line
– Then answer any query in constant time
– What are the space requirements for an n-node graph?
• Alternatively, given a node, we can compute the answer online
– Answer the query with a single-source shortest path algorithm
– Minimal additional space required.
– What is the time complexity of answering a query?
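For an unweighted web graph, the online alternative can be sketched as a breadth-first search from the source. The function name `bfs_distance` and the tiny example graph are assumptions made for illustration.

```python
from collections import deque


def bfs_distance(adj, source, target):
    """Length of a shortest path from source to target, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj.get(u, ()):
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


adj = {"x": ["a"], "a": ["y"], "y": []}   # hypothetical graph: x -> a -> y
print(bfs_distance(adj, "x", "y"), bfs_distance(adj, "y", "x"))
```

Each query may touch every node and edge, so it costs O(n + m) time in the worst case, while the precomputed all-pairs table needs on the order of n² entries. The index structures that follow aim between these two extremes.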
Transitive Closure Encoding Problem
We want to find a compact representation for the transitive closure
• whose size is comparable to the data’s size
• that supports connection tests (almost) as fast as the naive transitive closure lookup
• that can be built efficiently for large data sets
Main Idea: 2-Hop Covers and 2-Hop Labeling
• A 2-hop cover is a set of hops (x,y) such that every connected pair is covered by 2 hops
• For each node a, we maintain two sets of labels (which are simply lists of nodes): Lin(a) and Lout(a)
• For each connection (a,b),
– choose a node c on the path from a to b (center node)
– add c to Lout(a) and to Lin(b)
• Then $(a,b) \in T \iff L_{out}(a) \cap L_{in}(b) \neq \emptyset$, where T is the transitive closure
[Figure: a → c → b]
Reachability and distance queries via 2-hop Labels
(Cohen et al., SODA 2002)
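A reachability test with 2-hop labels is a set intersection. The sketch below uses hand-built labels for the three-node path a → c → b and assumes, as a convention chosen here, that a center node is also stored in its own Lin and Lout so that pairs ending or starting at the center are covered; the names are hypothetical.

```python
def reachable(l_out, l_in, a, b):
    """True if some center node appears in both Lout(a) and Lin(b)."""
    if a == b:
        return True
    return bool(l_out.get(a, set()) & l_in.get(b, set()))


# Labels for the path a -> c -> b with c as the only center node.
l_out = {"a": {"c"}, "c": {"c"}}
l_in = {"b": {"c"}, "c": {"c"}}

print(reachable(l_out, l_in, "a", "b"))   # covered through center c
print(reachable(l_out, l_in, "b", "a"))   # no common center
```

Four label entries cover the three connections (a,c), (c,b) and (a,b); the savings grow when many paths share a center.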
2-hop Covers
• Conjecture: For any graph with n nodes and m edges, a 2-hop cover always exists and has size bounded by O(n √m )
• Optimization Problem: Find a cover which minimizes the sum of the label sizes
• Problem is NP-hard
– => approximation required
• Theorem: There exists a polynomial-time algorithm that approximates the optimal-size 2-hop cover within a factor of O(log n).
• Based on a greedy (set cover) algorithm
[Figure: example graph on nodes 1 to 6]
(We can cover 8 connections with 6 cover entries)
Approximation Algorithm
What are good center nodes? Nodes that can cover many uncovered connections.
Consider the center graph of candidates
Initial step: All connections are uncovered
[Figure: center graph of candidate node 2, with in-side I = {1, 2} and out-side O = {2, 4, 5, 6}; 8 edges]
initial density: $\frac{8}{2+4} = \frac{8}{6} \approx 1.33$
Approximation Algorithm
[Figure: example graph on nodes 1 to 6]
Consider the center graph of candidates
[Figure: center graph of candidate node 4, with in-side I and out-side O]
Initial step: All connections are uncovered
Cover connections in subgraph with greatest density with corresponding center node
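Approximation Algorithm
[Figure: example graph on nodes 1 to 6]
Consider the center graph of candidates
[Figure: center graph of candidate node 2 after some connections have been covered]
Next step: Some connections already covered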
Repeat this algorithm until all connections are covered
Theorem: Generated Cover is optimal up to a logarithmic factor
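The greedy loop can be sketched as follows. This is a simplified variant written for illustration: for each candidate it scores the whole center graph of still-uncovered connections, whereas the algorithm behind the theorem selects a densest subgraph of it, so no approximation guarantee is claimed for this sketch. The function names and the example graph are hypothetical.

```python
from collections import deque


def reachable_from(adj, s):
    """All nodes reachable from s by BFS, including s itself."""
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def greedy_two_hop_cover(adj, nodes):
    desc = {u: reachable_from(adj, u) for u in nodes}            # u reaches desc[u]
    anc = {w: {u for u in nodes if w in desc[u]} for w in nodes}  # anc[w] reach w
    uncovered = {(u, v) for u in nodes for v in desc[u] if u != v}
    l_out = {u: set() for u in nodes}
    l_in = {u: set() for u in nodes}
    while uncovered:
        best = None
        for w in nodes:
            # Center graph of w: uncovered connections that can be routed through w.
            edges = {(u, v) for (u, v) in uncovered if u in anc[w] and v in desc[w]}
            if not edges:
                continue
            ins = {u for u, _ in edges}
            outs = {v for _, v in edges}
            density = len(edges) / (len(ins) + len(outs))
            if best is None or density > best[0]:
                best = (density, w, ins, outs)
        _, w, ins, outs = best
        for u in ins:
            l_out[u].add(w)      # w becomes a center in Lout(u)
        for v in outs:
            l_in[v].add(w)       # and in Lin(v)
        uncovered -= {(u, v) for u in ins for v in outs}
    return l_out, l_in


nodes = [1, 2, 3, 4, 5, 6]
adj = {1: [2], 2: [4, 5], 3: [4], 4: [6], 5: [6]}   # hypothetical example graph
l_out, l_in = greedy_two_hop_cover(adj, nodes)
print(sum(len(s) for s in l_out.values()) + sum(len(s) for s in l_in.values()))
```

The loop always terminates, because any uncovered connection (u,v) appears in the center graph of u itself. Materializing the full transitive closure first, as done here, is what makes the approach expensive on large graphs.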
Experimental Results
Small example from real world: subset of DBLP
6,210 documents (publications)
168,991 elements
25,368 links (citations)
14 Megabytes (uncompressed XML)
Element-level graph has 168,991 nodes and 188,149 edges
Its transitive closure: 344,992,370 connections (2,632.1 MB)
Experimental Results
For the example above:
Transitive Closure: 344,992,370 connections
Two-Hop Cover: 1,289,930 entries
compression factor of ~267
queries are still fast (~7.6 entries/node)
But: Computation took 45 hours and 80 GB RAM!
Need: Smart partitioning of the problem to fit memory
Final Results for Index Creation
Transitive Closure: 344,992,370 connections
Two-Hop Cover: 9,999,052 entries
compression factor of ~34.5
queries are still ok (~59.2 entries/node)
build time is good (~23 minutes with 1 CPU and 1GB RAM)
Cover size is 8 times larger than the best, but the build is ~118 times faster with ~1% of the memory
Why Distances are much more Difficult than TC
• Should be simple to add distance information:
[Figure: path v → u → w with dist(v,u) = 2 and dist(u,w) = 4]
Lout(v)={u, …} becomes Lout(v)={(u,2), …}
Lin(w)={u, …} becomes Lin(w)={(u,4), …}
• Is this correct?
dist(v,w)=dist(v,u)+dist(u,w)=2+4=6
Why Distances are Difficult
[Figure: v → u → w with distances 2 and 4, and a direct connection from v to w]
dist(v,w)=1
Center node u does not reflect the correct distance between v and w
Solution: Distance-aware Centergraph
• Add edges to the center graph only if the corresponding connection is a shortest path
• Correct, but with problems:
– Expensive to build the center graph (2 additional lookups per connection)
– Approximation bound is no longer tight
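With distance-annotated labels, a query takes the minimum of dist(v,c) + dist(c,w) over all common centers c. The sketch below replays the v, u, w example: with only u as a center the answer is 6, and once a center on the true shortest path is present the answer is 1. Using v itself as that center, and the function name `two_hop_distance`, are assumptions made for this illustration.

```python
def two_hop_distance(l_out, l_in, v, w):
    """Minimum of dist(v,c) + dist(c,w) over common centers c, or None."""
    out_v = l_out.get(v, {})     # center -> dist(v, center)
    in_w = l_in.get(w, {})       # center -> dist(center, w)
    common = out_v.keys() & in_w.keys()
    if not common:
        return None
    return min(out_v[c] + in_w[c] for c in common)


# Labels built without checking shortest paths: only center u is known.
naive_out = {"v": {"u": 2}}
naive_in = {"w": {"u": 4}}
print(two_hop_distance(naive_out, naive_in, "v", "w"))   # overestimates

# Distance-aware labels also contain a center on the shortest path (here v itself).
aware_out = {"v": {"u": 2, "v": 0}}
aware_in = {"w": {"u": 4, "v": 1}}
print(two_hop_distance(aware_out, aware_in, "v", "w"))
```

The query code is the same in both cases; correctness depends entirely on how the labels were built, which is why the center graph has to be restricted to shortest-path connections.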
[Figure: example graph on nodes 1 to 6 and a distance-aware center graph with in-side I and out-side O]