CS728-2008 Lecture 9: Storing and Querying Large Web Graphs


Last Time

• Algorithms for link-based clustering
• Finding “tightly knit communities” (TKCs) in web graphs

Today’s lecture

Dealing with large graphs

Building Indexes for adjacency and connectivity testing

Distance and Transitive Closure

• New Data Structure: 2-hop covers

Connectivity Server

• Support for fast queries on the web graph
– Which URLs point to a given URL?
– Which URLs does a given URL point to?

• Stores mappings in main memory from
– URL to outlinks
– URL to inlinks

• Applications
– Crawl control, web graph analysis
– Connectivity, crawl optimization
– TKCs, other link analysis
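The two query types above can be sketched as a pair of in-memory maps. This is a minimal illustration only: the class name `ConnectivityServer`, its method names, and the example URLs are assumptions, not part of the lecture.

```python
from collections import defaultdict


class ConnectivityServer:
    """Toy in-memory link index (hypothetical); URLs are mapped to dense integer IDs."""

    def __init__(self):
        self.url_to_id = {}
        self.id_to_url = []
        self.outlinks = defaultdict(set)  # node ID -> IDs it points to
        self.inlinks = defaultdict(set)   # node ID -> IDs that point to it

    def _intern(self, url):
        # Assign the next free integer ID the first time a URL is seen.
        if url not in self.url_to_id:
            self.url_to_id[url] = len(self.id_to_url)
            self.id_to_url.append(url)
        return self.url_to_id[url]

    def add_link(self, src, dst):
        s, d = self._intern(src), self._intern(dst)
        self.outlinks[s].add(d)
        self.inlinks[d].add(s)

    def _neighbors(self, url, table):
        i = self.url_to_id.get(url)
        if i is None:
            return []
        return sorted(self.id_to_url[j] for j in table.get(i, ()))

    def points_to(self, url):
        """Which URLs does the given URL point to?"""
        return self._neighbors(url, self.outlinks)

    def pointed_by(self, url):
        """Which URLs point to the given URL?"""
        return self._neighbors(url, self.inlinks)


server = ConnectivityServer()
server.add_link("a.example/x", "a.example/y")
server.add_link("b.example/z", "a.example/y")
print(server.pointed_by("a.example/y"))
```

Storing both directions doubles the memory for links, which is one reason the compression of adjacency lists matters.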

Problem of Adjacency lists

• Adjacency lists (ALs) store the set of neighbors of each node
• Assume each URL is represented by an integer
– Use some natural ordering or hashing
– E.g., for a 60K-page web, need 16 bits per node
– For a 4-billion-page web, need 32 bits per node

• Naively, this demands 32-64 bits to represent each hyperlink

• Can we compress this?

Adjacency list compression

• Properties exploited in compression:

– Similarity (between lists)

– Locality (many links from a page go to “nearby” pages)

– Can use gap encodings in sorted lists

– Look at distribution of gap values

Gap encoding (Elias)

• Given a list of integers in increasing order
– E.g., 33, 47, 154, 159, 202, …

• It suffices to store the gaps
– 33, 14, 107, 5, 43, …

• We hope: most gaps can be encoded with far fewer bits

• Represent a gap G as the pair <length, offset>

• length is ⌊log2 G⌋ written in unary; it uses ⌊log2 G⌋ + 1 bits and specifies the length of the binary encoding of offset

• offset = G − 2^⌊log2 G⌋, written in binary using ⌊log2 G⌋ bits

Recall that the unary encoding of x is a sequence of x 1’s followed by a 0.

Elias codes for gap encoding

• E.g., 9 is represented as <1110,001>
• 2 is represented as <10,0>
• Exercise: does zero have a code?
• Encoding G takes 2⌊log2 G⌋ + 1 bits
– Codes are always of odd length
– 1 = 2^0 + 0 → 0
– 2 = 2^1 + 0 → 100
– 3 = 2^1 + 1 → 101
– 4 = 2^2 + 0 → 11000
– 5 = 2^2 + 1 → 11001
– 6 = 2^2 + 2 → 11010
– 7 = 2^2 + 3 → 11011
– 8 = 2^3 + 0 → 1110000
– 9 = 2^3 + 1 → 1110001
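The <length, offset> scheme can be sketched in a few lines of Python. This follows the unary convention stated on the slide (x 1's followed by a 0); the function names are assumptions made for illustration, and bit strings are kept as Python `str` for readability rather than packed into bytes.

```python
def unary(x):
    """Unary code as defined on the slide: x 1's followed by a 0."""
    return "1" * x + "0"


def encode_gap(g):
    """Encode a gap g >= 1 as <length, offset>."""
    if g < 1:
        raise ValueError("only gaps >= 1 can be encoded")
    k = g.bit_length() - 1            # floor(log2 g)
    offset = g - (1 << k)             # g - 2^k, fits in k bits
    offset_bits = format(offset, "b").zfill(k) if k else ""
    return unary(k) + offset_bits


def encode_list(sorted_ids):
    """Gap-encode a strictly increasing list of positive integers."""
    bits, prev = [], 0
    for x in sorted_ids:
        bits.append(encode_gap(x - prev))
        prev = x
    return "".join(bits)


def decode_gaps(bits):
    """Decode a concatenation of gap codes back into the list of gaps."""
    gaps, i = [], 0
    while i < len(bits):
        k = 0
        while bits[i] == "1":         # read the unary length
            k += 1
            i += 1
        i += 1                        # skip the terminating 0
        offset = int(bits[i:i + k], 2) if k else 0
        i += k
        gaps.append((1 << k) + offset)
    return gaps


code = encode_list([33, 47, 154, 159, 202])
print(decode_gaps(code))              # the gaps, not the original IDs
```

Because each code announces its own length, codes can be concatenated without separators, which is what makes decoding a coded gap sequence possible.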

Exercise

• Given the following sequence of coded gaps, reconstruct the gap sequence:

1110001110101011111101101111011

Storage Requirements

• A recent paper by Boldi and Vigna reports that we can get down to an average of ~3 bits/link (per URL-to-URL edge)
– For a 118M-node web graph

• How can this be possible?

Why is this remarkable?

Main ideas of Boldi/Vigna

• First consider the lexicographically ordered list of all URLs, e.g.,
– www.stanford.edu/alchemy
– www.stanford.edu/biology
– www.stanford.edu/biology/plant
– www.stanford.edu/biology/plant/copyright
– www.stanford.edu/biology/plant/people
– www.stanford.edu/chemistry

Boldi/Vigna

• Each of these URLs has an adjacency list
• Main thesis: because of the use of webpage templates, the adjacency list of a node is usually similar to that of one of the 7 preceding URLs in the lexicographic ordering
• Express the adjacency list in terms of one of these
• E.g., consider these adjacency lists
– 1, 2, 4, 8, 16, 32, 64
– 1, 4, 9, 16, 25, 36, 49, 64
– 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
– 1, 4, 8, 16, 25, 36, 49, 64
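As a rough illustration of expressing a list in terms of an earlier one, the sketch below picks the preceding list that needs the fewest edits and stores only the differences. It is a simplified stand-in, not the actual Boldi/Vigna encoding; the function names and the (reference index, removed, added) representation are assumptions.

```python
def encode_with_reference(adj, window):
    """Return (reference index or None, removed, added) with the fewest edits.

    `window` holds the adjacency lists of the preceding nodes.
    """
    target = set(adj)
    best = (None, [], sorted(target))   # fallback: store the list explicitly
    best_cost = len(target)
    for idx, ref in enumerate(window):
        ref_set = set(ref)
        removed = sorted(ref_set - target)   # in the reference, not in adj
        added = sorted(target - ref_set)     # in adj, not in the reference
        cost = len(removed) + len(added)
        if cost < best_cost:
            best, best_cost = (idx, removed, added), cost
    return best


def decode_with_reference(encoded, window):
    idx, removed, added = encoded
    base = set(window[idx]) if idx is not None else set()
    return sorted((base - set(removed)) | set(added))


window = [
    [1, 2, 4, 8, 16, 32, 64],
    [1, 4, 9, 16, 25, 36, 49, 64],
    [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144],
]
fourth = [1, 4, 8, 16, 25, 36, 49, 64]
print(encode_with_reference(fourth, window))
```

For the fourth list on the slide, the second list is the cheapest reference: one entry is dropped and one is added, instead of eight entries being stored from scratch.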

Connectivity Queries

• Beyond basic adjacency, we’d like to answer other queries…
– Transitive closure: is there a path from x to y?
– Distance: what is the length of the shortest path from x to y?

• Applications
– Link analysis
– XML path queries with wildcards

Naïve Solutions

• Given a web graph, we can compute and store All Pairs Shortest Paths (APSP) off-line
– Then answer any query in constant time
– What are the space requirements for an n-node graph?

• Alternatively, given a node, we can compute the answer online
– Answer the query with a single-source shortest path algorithm
– Minimal additional space required
– What is the time complexity of answering a query?
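For the online alternative, a minimal sketch is breadth-first search on an unweighted graph, assuming the graph is given as a dict from node to a list of out-neighbors (the function name and representation are illustrative assumptions). Each query may touch every node and edge, so it costs O(n + m) time, whereas the precomputed table needs on the order of n² entries.

```python
from collections import deque


def bfs_distance(adj, x, y):
    """Shortest path length from x to y in an unweighted digraph, or None if unreachable."""
    if x == y:
        return 0
    dist = {x: 0}
    queue = deque([x])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:          # first visit gives the shortest distance
                dist[w] = dist[v] + 1
                if w == y:
                    return dist[w]
                queue.append(w)
    return None


# Hypothetical example graph
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs_distance(adj, "a", "d"))
```

A reachability (transitive closure) query is the special case of checking whether the result is not `None`.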

Transitive Closure Encoding Problem

We want to find a compact representation for the transitive closure

• whose size is comparable to the data’s size
• that supports connection tests (almost) as fast as the naive transitive closure lookup
• that can be built efficiently for large data sets

Main Idea: 2-Hop Covers and 2-Hop Labeling

• A 2-hop cover is a set of hops (x,y) such that every connected pair is covered by 2 hops

• For each node a, we maintain two sets of labels (which are simply lists of nodes): Lin(a) and Lout(a)

• For each connection (a,b),
– choose a node c on the path from a to b (center node)
– add c to Lout(a) and to Lin(b)

• Then (a,b) ∈ transitive closure T ⇔ Lout(a) ∩ Lin(b) ≠ ∅

[Figure: path a → c → b with center node c]

Reachability and distance queries via 2-hop Labels

(Cohen et al., SODA 2002)
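A reachability test with 2-hop labels reduces to a set intersection. The sketch below assumes labels are stored as Python sets and that every node implicitly belongs to its own Lout and Lin, so that connections whose center is one of the endpoints need no explicit entry; both the function name and that convention are assumptions for illustration.

```python
def reaches(a, b, l_out, l_in):
    """True if Lout(a) and Lin(b) share a center node (self-entries are implicit)."""
    out_a = l_out.get(a, set()) | {a}
    in_b = l_in.get(b, set()) | {b}
    return not out_a.isdisjoint(in_b)


# Labels for the path a -> c -> b with center node c
l_out = {"a": {"c"}}
l_in = {"b": {"c"}}
print(reaches("a", "b", l_out, l_in))
```

The cost of a query depends only on the label sizes of the two endpoints, not on the size of the graph.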

2-hop Covers

• Conjecture: For any graph with n nodes and m edges, a 2-hop cover always exists and has size bounded by O(n √m )

• Optimization Problem: Find a cover which minimizes the sum of the label sizes

• The problem is NP-hard, so approximation is required

• Theorem: There exists a polynomial-time algorithm that approximates the optimal-size 2-hop cover within a factor of log n.

• Based on a greedy (set cover) algorithm

[Figure: example graph on nodes 1-6]

(We can cover 8 connections with 6 cover entries)

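Approximation Algorithm

What are good center nodes? Nodes that can cover many uncovered connections.

Consider the center graph of each candidate center node.

Initial step: All connections are uncovered.

[Figure: center graph of candidate node 2, with 2 nodes on the in-side (I) and 4 nodes on the out-side (O)]

Initial density: edges / (|I| + |O|) = 8 / (2 + 4) = 8/6 ≈ 1.33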

Approximation Algorithm

Initial step: All connections are uncovered.

[Figure: example graph on nodes 1-6 and the center graph of candidate node 4]

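Cover the connections in the subgraph with the greatest density using the corresponding center node.

Approximation Algorithm

Next step: Some connections are already covered.

[Figure: example graph on nodes 1-6 and the center graph of candidate node 2 after the first step]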

Repeat this algorithm until all connections are covered

Theorem: Generated Cover is optimal up to a logarithmic factor
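The greedy loop can be sketched as follows. This is a deliberate simplification: it scores each candidate by the density of its whole center graph of still-uncovered connections, rather than searching for the densest subgraph, so the logarithmic guarantee is not claimed for it. The graph, the function names, and the implicit self-entry convention are assumptions for illustration.

```python
def reachable_sets(edges, nodes):
    """Map each node to the set of nodes reachable from it (including itself)."""
    succ = {v: set() for v in nodes}
    for a, b in edges:
        succ[a].add(b)
    reach = {}
    for s in nodes:
        seen, stack = {s}, [s]
        while stack:
            v = stack.pop()
            for w in succ[v] - seen:
                seen.add(w)
                stack.append(w)
        reach[s] = seen
    return reach


def greedy_two_hop_cover(edges, nodes):
    reach = reachable_sets(edges, nodes)
    uncovered = {(a, b) for a in nodes for b in reach[a] if a != b}
    l_out = {v: set() for v in nodes}
    l_in = {v: set() for v in nodes}
    while uncovered:
        best = None
        for w in nodes:
            # Center graph of w: uncovered connections (a, b) with a ->* w ->* b.
            cg = {(a, b) for a, b in uncovered if w in reach[a] and b in reach[w]}
            if not cg:
                continue
            ins = {a for a, _ in cg}
            outs = {b for _, b in cg}
            density = len(cg) / (len(ins) + len(outs))
            if best is None or density > best[0]:
                best = (density, w, cg, ins, outs)
        _, w, cg, ins, outs = best
        l_out_targets = ins - {w}      # w itself is an implicit entry
        l_in_targets = outs - {w}
        for a in l_out_targets:
            l_out[a].add(w)
        for b in l_in_targets:
            l_in[b].add(w)
        uncovered -= cg
    return l_out, l_in


def connected(a, b, l_out, l_in):
    return not (l_out[a] | {a}).isdisjoint(l_in[b] | {b})


# Hypothetical six-node graph
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 4), (3, 4), (4, 5), (4, 6)]
l_out, l_in = greedy_two_hop_cover(edges, nodes)
print(l_out, l_in)
```

Every round covers at least one uncovered connection (a connection's own source is always a valid center), so the loop terminates with a complete cover.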

Experimental Results

Small example from real world: subset of DBLP

6,210 documents (publications)

168,991 elements

25,368 links (citations)

14 Megabytes (uncompressed XML)

Element-level graph has 168,991 nodes and 188,149 edges

Its transitive closure: 344,992,370 connections (2,632.1 MB)

Experimental Results

For the example above:

Transitive Closure: 344,992,370 connections

Two-Hop Cover: 1,289,930 entries

compression factor of ~267

queries are still fast (~7.6 entries/node)

But: Computation took 45 hours and 80 GB RAM!

Need: Smart partitioning of the problem to fit memory

Final Results for Index Creation

Transitive Closure: 344,992,370 connections

Two-Hop Cover: 9,999,052 entries

compression factor of ~34.5

queries are still ok (~59.2 entries/node)

build time is good (~23 minutes with 1 CPU and 1GB RAM)

Cover size is ~8 times larger than the best, but the build is ~118 times faster with ~1% of the memory
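The quoted factors can be rechecked directly from the numbers on these slides. The short script below only restates that arithmetic; the variable names are illustrative.

```python
tc_connections = 344_992_370     # size of the transitive closure
nodes = 168_991                  # nodes in the element-level graph
first_cover = 1_289_930          # entries, 45 hours and 80 GB RAM
partitioned_cover = 9_999_052    # entries, ~23 minutes and 1 GB RAM


def ratios(cover_entries):
    """Return (compression factor, average label entries per node)."""
    return tc_connections / cover_entries, cover_entries / nodes


speedup = (45 * 60) / 23                       # build time in minutes
size_ratio = partitioned_cover / first_cover   # how much larger the fast cover is
memory_fraction = 1 / 80                       # 1 GB versus 80 GB

for name, entries in [("first", first_cover), ("partitioned", partitioned_cover)]:
    factor, per_node = ratios(entries)
    print(f"{name}: compression {factor:.1f}, entries/node {per_node:.1f}")
print(f"speedup {speedup:.0f}x, size {size_ratio:.1f}x, memory {memory_fraction:.2%}")
```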

Why Distances are much more Difficult than TC

• Should be simple to add distance information:

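[Figure: path v → u → w, with dist(v,u) = 2 and dist(u,w) = 4]

Reachability labels: Lout(v) = {u, …}, Lin(w) = {u, …}

Labels with distances: Lout(v) = {(u,2), …}, Lin(w) = {(u,4), …}

• Is this correct?

dist(v,w) = dist(v,u) + dist(u,w) = 2 + 4 = 6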

Why Distances are Difficult

[Figure: nodes v, u, w as before, with dist(v,u) = 2 and dist(u,w) = 4]

dist(v,w) = 1

Center node u does not reflect the correct distance between v and w.

Solution: Distance-aware Center Graph

• Add edges to the center graph only if the corresponding connection is a shortest path

• Correct, but with problems:
– Expensive to build the center graph (2 additional lookups per connection)
– Approximation bound is no longer tight
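A distance query over labels takes the minimum of dist(v,u) + dist(u,w) over all common centers u. The sketch below assumes labels are dicts from center node to distance and that self-entries with distance 0 are stored explicitly; both are illustrative assumptions. It reproduces the v, u, w example: with only the center u the answer is 6, and it becomes 1 once a center on the shortest path is recorded.

```python
import math


def two_hop_distance(v, w, l_out, l_in):
    """min over common centers u of dist(v,u) + dist(u,w); inf if no common center."""
    best = math.inf
    in_w = l_in.get(w, {})
    for u, d_vu in l_out.get(v, {}).items():
        d_uw = in_w.get(u)
        if d_uw is not None:
            best = min(best, d_vu + d_uw)
    return best


# Labels built without checking shortest paths: only center u is recorded.
naive_out = {"v": {"u": 2}}
naive_in = {"w": {"u": 4}}
print(two_hop_distance("v", "w", naive_out, naive_in))    # overestimates

# Distance-aware labels: the connection (v, w) is covered by a center
# on a shortest path (here w itself, with a self-entry of distance 0).
aware_out = {"v": {"u": 2, "w": 1}}
aware_in = {"w": {"u": 4, "w": 0}}
print(two_hop_distance("v", "w", aware_out, aware_in))
```

The query itself stays cheap; the extra cost noted above is paid when building the cover, since each candidate connection must be checked against the true shortest distance.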

[Figure: example graph on nodes 1-6 and the center graph of node 4]
