Accelerating Ranking-System Using WebGraph

Project Report

by

Padmaja Adipudi

Outline of My Talk

• Needle Search Engine/Ranking-System

• Ranking-System Issue/Resolution

– Accelerating Ranking-System using WebGraph

– Ranking Algorithms Overview

– Google’s PageRank, ClusterRank, SourceRank & Truncated PageRank

• Experimental Results

– Efficiency Measure

– Quality Measure

• Conclusion

– Which algorithm is better in terms of Efficiency & Quality

Search Engine

• The Web is a terrific place to find information on any topic.

• A Search Engine is a useful application for information retrieval on the WWW.

• A Search Engine has five basic components: a Crawler, a Parser, a Ranking-System, a Repository and a Front-End.

Ranking-System

• Determines the importance of a Web page.

• Google’s PageRank algorithm is the best-known Ranking-System and is based on the URL link structure.

• In Google’s PageRank, the importance of a Web page is based on the importance of its parent Web pages.

Needle Search Engine

• A Search Engine developed by former students at UCCS.

• ClusterRank algorithm is implemented as the Ranking-System.

• A former student, Yi Zhang, developed the ClusterRank ranking system, which takes an average of 3 hours to rank 300,000 URLs.

Ranking-System Issue

• The major issue with the current ranking system is its long update time: 3 hours for 300K URLs.

• As the number of pages increases, this will become a severe problem.

Project Goal

• Accelerate the existing Ranking-System of the Needle Search Engine at UCCS using a package called “WebGraph”.

• Scale the Needle Search Engine up from 50K crawled Web pages to 1 Million crawled Web pages.

Steps to reach Goal

• Use the WebGraph package to represent the graph efficiently using compression techniques.

• Compute the Page-Rank using three algorithms: ClusterRank, SourceRank and Truncated PageRank.

• Compare the time and quality results of ClusterRank with those of SourceRank and Truncated PageRank, and choose the best algorithm for the Needle Search Engine.

Work Flow

[Diagram: Compressed Graph → ClusterRank, SourceRank, Truncated PageRank → Page Rank Results]

Why Truncated & Source Algorithms

• Both algorithms come from the most recent papers available in the page-ranking area.

• Their authors used the WebGraph package for the experiments while developing the algorithms.

Node Graph

• A Node graph is used in the ranking system.

• A Node graph consists of nodes and directed links from node to node.

• URLs are represented by nodes, and hyperlinks are represented as directed links between nodes.

• Compression techniques are used to represent the Node graph efficiently.

Google’s PageRank

• Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd from Stanford University, 1999.

• The importance of a page is based on the number of incoming links and on how important the pages behind those links are.

• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

– PR(Tn): Each page has a notion of its own self-importance. T1 to Tn are the pages that link to page A, so that is PR(T1) for the first of them, all the way up to PR(Tn) for the last.

– C(Tn): Each page spreads its vote out evenly amongst all of its outgoing links. The number of outgoing links of page T1 is C(T1), of page Tn is C(Tn), and so on for all pages.

– PR(Tn)/C(Tn): If page A has a back link from page Tn, the share of the vote that page A gets from it is PR(Tn)/C(Tn).

– d: All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is "damped down" by multiplying it by 0.85 (the factor d).
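The formula above can be applied repeatedly until the ranks settle. The following is a minimal sketch of that iteration, not the Needle implementation: the function name, the fixed iteration count, the starting rank of 1.0 and the three-page graph are all assumptions made for illustration.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A.

    out_links[i] lists the pages that page i links to.
    """
    n = len(out_links)
    pr = [1.0] * n  # assumed starting rank for every page
    for _ in range(iterations):
        new_pr = [1.0 - d] * n
        for page, targets in enumerate(out_links):
            if not targets:
                continue  # a page without out-links casts no votes in this sketch
            share = pr[page] / len(targets)  # PR(T) / C(T)
            for target in targets:
                new_pr[target] += d * share
        pr = new_pr
    return pr


# Hypothetical three-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
ranks = pagerank([[1, 2], [2], [0]])
```

Page 2 receives votes from both other pages, so it ends up ranked above page 1, which only receives half of page 0's vote.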

ClusterRank

• Yi Zhang, a student at UCCS, is the author, 2006.

• The algorithm is based on Google’s PageRank.

• Designed to speed up the PageRank calculation and to group similar Web pages together into clusters.

• The original PageRank algorithm is applied to the clusters.

• The rank is then distributed to the members of each cluster by weighted average.

ClusterRank (Cont’d)

• Group all pages into clusters.

• Perform first-level clustering for dynamically generated pages.

• URLs are grouped based on the “?” and “#” symbols.

• Example: All URLs below will be grouped into one cluster

– http://www.uccs.edu/057/cs_sub.shtml

– http://www.uccs.edu/057/cs_sub.shtml#news

– http://www.uccs.edu/057/cs_sub.shtml#dates

– http://www.uccs.edu/057/cs_sub.shtml#spotlight
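A rough sketch of this first-level grouping follows. It assumes that "grouped based on ? and #" means cutting each URL at the first of those symbols; the function names are hypothetical and not taken from the ClusterRank code.

```python
from collections import defaultdict


def first_level_key(url):
    """Drop everything from the first '?' or '#' onward (assumed grouping rule)."""
    for marker in ("?", "#"):
        url = url.split(marker, 1)[0]
    return url


def first_level_clusters(urls):
    """Map each cluster key to the list of URLs that share it."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[first_level_key(url)].append(url)
    return dict(clusters)


urls = [
    "http://www.uccs.edu/057/cs_sub.shtml",
    "http://www.uccs.edu/057/cs_sub.shtml#news",
    "http://www.uccs.edu/057/cs_sub.shtml#dates",
    "http://www.uccs.edu/057/cs_sub.shtml#spotlight",
]
clusters = first_level_clusters(urls)
```

All four example URLs end up under the single key http://www.uccs.edu/057/cs_sub.shtml.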

ClusterRank (Cont’d)

• Perform second level clustering on virtual directory and graph density.

• URLs are grouped based on the last “/” symbol of the URL.

• Density is calculated for the proposed clusters.

• Approve the cluster based on the pre-set threshold value.
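The second-level steps might look roughly like the sketch below. The slides do not define how density is computed, so the ratio used here (links between members divided by the number of possible ordered member pairs) is purely an assumption, as are the function names, the threshold of 0.5 and the example pages.

```python
from collections import defaultdict


def directory_key(url):
    """Everything up to and including the last '/' of the URL."""
    return url[: url.rfind("/") + 1]


def propose_clusters(urls):
    """Propose one cluster per virtual directory."""
    proposed = defaultdict(set)
    for url in urls:
        proposed[directory_key(url)].add(url)
    return proposed


def density(members, links):
    """ASSUMED definition: internal links / possible ordered pairs of members."""
    n = len(members)
    if n < 2:
        return 0.0
    internal = sum(
        1 for src, dst in links if src != dst and src in members and dst in members
    )
    return internal / (n * (n - 1))


def approve(proposed, links, threshold=0.5):
    """Keep only proposed clusters whose density reaches the preset threshold."""
    return {
        key: members
        for key, members in proposed.items()
        if density(members, links) >= threshold
    }


# Hypothetical pages and links, for illustration only
a = "http://www.uccs.edu/057/a.shtml"
b = "http://www.uccs.edu/057/b.shtml"
c = "http://www.uccs.edu/other/c.shtml"
approved = approve(propose_clusters([a, b, c]), links={(a, b), (b, a)})
```

Here the two pages under /057/ link to each other and are approved as a cluster, while the single page under /other/ is not.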

ClusterRank (Cont’d)

• Calculate the rank for each cluster using the original PageRank algorithm.

• Distribute the rank to the cluster’s members by weighted average, using:

– PR = CR * Pi/Ci

– The notations here are:

– PR: The rank of a member page

– CR: The cluster rank from the previous stage

– Pi: The incoming links of this page

– Ci: Total incoming links of this cluster
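As a small worked example of PR = CR * Pi/Ci, the sketch below spreads one cluster's rank over its members in proportion to their incoming links. The function name, the handling of a cluster with no incoming links and the sample numbers are assumptions.

```python
def distribute_cluster_rank(cluster_rank, incoming_links):
    """Apply PR = CR * Pi / Ci to every member page of one cluster.

    incoming_links maps each member page to its incoming link count Pi.
    """
    total = sum(incoming_links.values())  # Ci
    if total == 0:
        # Assumed behaviour: with no incoming links there is nothing to weight by.
        return {page: 0.0 for page in incoming_links}
    return {
        page: cluster_rank * count / total
        for page, count in incoming_links.items()
    }


# Hypothetical cluster with rank 2.0 and two member pages
member_ranks = distribute_cluster_rank(2.0, {"a": 3, "b": 1})
```

With Ci = 4, page "a" receives 2.0 * 3/4 = 1.5 and page "b" receives 2.0 * 1/4 = 0.5, so the member ranks add up to the cluster rank.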

SourceRank

• James Caverlee, Ling Liu, and S. Webb from Georgia Institute of Technology, 2007.

• The Web graph is represented as Sources.

• A Source is a logical collection of Web pages.

• Assigns a score to each page based on the overall quality of the source that the page belongs to, through a random walk over Web sources.

SourceRank (Cont’d)

• Group all pages into Sources based on “Domain”.

• URLs are grouped on the part of the URL before the first “/” that follows the scheme, that is, the host name.

• Example: All URLs below will be grouped into one Source

– http://office.microsoft.com/en-us/default.aspx

– http://office.microsoft.com/en-us/assistance/default.aspx

– http://office.microsoft.com/en-us/assistance/CH790018071033.aspx
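One way to sketch this grouping is to use the host name of each URL as the Source key. That reading of “Domain” is an assumption, and the function names are hypothetical; `urlsplit` comes from the Python standard library.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def source_key(url):
    """Host name of the URL, i.e. the text between the scheme and the first '/'."""
    return urlsplit(url).netloc.lower()


def group_into_sources(urls):
    """Map each Source key to the list of URLs that belong to it."""
    sources = defaultdict(list)
    for url in urls:
        sources[source_key(url)].append(url)
    return dict(sources)


sources = group_into_sources([
    "http://office.microsoft.com/en-us/default.aspx",
    "http://office.microsoft.com/en-us/assistance/default.aspx",
    "http://office.microsoft.com/en-us/assistance/CH790018071033.aspx",
])
```

The three example URLs share the host office.microsoft.com and therefore form a single Source.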

SourceRank (Cont’d)

• Calculate the rank for each Source with the original PageRank algorithm.

• Distribute the rank to the Source’s members by weighted average, using:

– PR = SR * Si

– The notations here are:

– PR: The rank of a member page

– SR: The source rank from the previous stage

– Si: Total incoming unique links of this source

Truncated PageRank

• L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates from Italy, 2006.

• In PageRank, a Web page can gain a high Page-Rank score from supporters (in-links) that are topologically “close” to the target node.

• Spammers can afford to influence only a few levels.

• Truncated PageRank is similar to PageRank, except that supporters that are too “close” to a target node do not contribute towards its ranking.

Truncated PageRank (Cont’d)

• $PR(p) = \sum_{t} C \alpha^{t} \cdot M^{t} = \sum_{t} \mathrm{damping}(t) \cdot M^{t}$

The notations here are:

– $C$: Normalization constant

– $\alpha$: The damping factor
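In the Truncated PageRank paper, the damping function is zero for path lengths up to a threshold T and equals C·α^t for longer paths, so the closest supporters contribute nothing. The sketch below illustrates that idea under several assumptions: the function name, the uniform starting vector, the cut-off `max_t` for the infinite sum, and choosing C so that the scores sum to 1 are all illustrative choices, not the authors' implementation.

```python
def truncated_pagerank(out_links, T=1, alpha=0.85, max_t=50):
    """Sum alpha**t * (mass arriving over paths of length t) for t > T only.

    out_links[i] lists the nodes that node i links to.
    """
    n = len(out_links)
    walk = [1.0 / n] * n  # mass of paths of length 0: uniform start (assumed)
    score = [0.0] * n
    weight = 1.0  # alpha ** t
    for t in range(max_t + 1):
        if t > T:  # damping(t) is zero for t <= T
            for i in range(n):
                score[i] += weight * walk[i]
        # Advance every path by one link.
        nxt = [0.0] * n
        for src, targets in enumerate(out_links):
            if targets:
                share = walk[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
        walk = nxt
        weight *= alpha
    total = sum(score)
    # Normalization constant C chosen here so that the scores sum to 1.
    return [s / total for s in score] if total > 0 else score


# Hypothetical graph: node 0 is supported only by direct in-links from 1 and 2.
close_only = truncated_pagerank([[], [0], [0]], T=1)
plain = truncated_pagerank([[], [0], [0]], T=0)
```

With T=1 the direct supporters of node 0 are ignored and its score drops to zero, whereas with T=0 the same links give node 0 the whole score.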

WebGraph Package

• Paolo Boldi and Sebastiano Vigna from Italy, 2004.

• Represents the Node graph efficiently using a Differential compression technique.

• Differential compression allows applications to compactly encode a new version of data with respect to a previous or reference version of the same data.

• WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link.

• WebBase is a repository of crawled Web pages maintained at Stanford University.

WebGraph Package (Cont’d)

• Node graph initial representation:

• Node graph with Reference compression:


• Node graph with Differential compression:

• Differential compression makes it possible to code a link in less than a bit (not possible with plain Reference compression).
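To give a feel for reference compression, the sketch below encodes one successor list relative to a reference list as a row of copy bits plus the leftover nodes. This only illustrates the idea: the real BVGraph format layers further steps and bit-level codes on top of it, and the function names and the example lists are assumptions.

```python
def reference_encode(reference, successors):
    """Encode `successors` relative to `reference` (both sorted lists of node ids).

    Returns one copy bit per reference entry plus the nodes that were not copied.
    """
    wanted = set(successors)
    copy_bits = [1 if node in wanted else 0 for node in reference]
    copied = {node for node, bit in zip(reference, copy_bits) if bit}
    extras = [node for node in successors if node not in copied]
    return copy_bits, extras


def reference_decode(reference, copy_bits, extras):
    """Rebuild the successor list from the reference list, copy bits and extras."""
    copied = [node for node, bit in zip(reference, copy_bits) if bit]
    return sorted(copied + extras)


# Hypothetical successor lists of two neighbouring nodes
reference = [13, 15, 16, 17]
successors = [15, 16, 22]
copy_bits, extras = reference_encode(reference, successors)
restored = reference_decode(reference, copy_bits, extras)
```

Because neighbouring pages often link to many of the same targets, most of a list can be described by a few copy bits instead of full node numbers.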


Graph in BV Format

[Diagram: Link Structure From DB → Graph in ASCII format → Graph in BV format → PageRank Module]

BVGraph Details

• BVGraph: Boldi Vigna Graph

• A BVGraph is generated from a graph that is represented in ASCII format.

• The first line contains the number of nodes ‘n’; then ‘n’ lines follow, the i-th line containing the successors of node ‘i’ in increasing order (nodes are numbered from 0 to n-1). The successors are separated by a single space.

BVGraph Details (Cont’d)

• For example, consider a graph of three vertices, a, b, and c, consisting of the following edges:

• (a, b) (a, c) (b, c) (b, a)

• (a:0, b:1, c:2)

• This graph could be expressed as below:

3
1 2
0 2
1

[Figure: directed graph with nodes A, B and C]
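Producing this ASCII format from a list of edges takes only a few lines. The sketch below assumes nodes are already numbered from 0 to n-1; the function name and the four-node example graph are hypothetical and unrelated to the slide's example.

```python
def to_ascii_graph(num_nodes, edges):
    """Render a directed graph in the ASCII format described above.

    The first line is the node count; line i + 1 holds the successors of node i
    in increasing order, separated by single spaces.
    """
    successors = [set() for _ in range(num_nodes)]
    for src, dst in edges:
        successors[src].add(dst)
    lines = [str(num_nodes)]
    for succ in successors:
        lines.append(" ".join(str(node) for node in sorted(succ)))
    return "\n".join(lines) + "\n"


# Hypothetical four-node graph: 0 -> 1, 0 -> 2, 1 -> 3, 3 -> 0
text = to_ascii_graph(4, [(0, 2), (0, 1), (1, 3), (3, 0)])
```

Node 2 has no successors in this example, so its line in the output is empty.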

BVGraph – Current Implementation

• The URLLinkStructure table in the database holds the linking information.

• An ASCII graph is generated from the data in the URLLinkStructure table, and the BVGraph is then generated from it.

• The ASCII graph is stored as basename.graph-txt

• The BVGraph is generated using the command:

– java it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph basename bvbasename

BVGraph – Current Implementation (Cont’d)

• The graph can be generated for incoming links as well as outgoing links.

• BVnode-in, BVnode-out and BVSource-in graphs are generated.

• A BVGraph can be loaded using two loading methods, load and loadOffline.

• The load method is used for small graphs.

• The loadOffline method is used for large graphs.

ClusterRank Using BVGraph

URLs | Without BVGraph (per iteration, in sec) | With BVGraph (per iteration, in sec)

300K | 9452 | 7737

ClusterRank Using BVGraph (Cont’d)

• Time gain using WebGraph for 300K URLS

[Bar chart: Total time gain using WebGraph for 300K URLs. Time in seconds: Without WebGraph 9452, With WebGraph 7737, a saving of 1715 seconds (about 18%).]

Time Measure for Algorithms (in Seconds)

URLs | 633061 | 289503 | 4 M

Node InLinks | 2905183 | 21781790 | 28346447

Average InLinks per Node | 4.6 | 78.06 | 5.82

Clusters | 48271 | 164136 | 256919

Cluster InLinks | 983579 | 18210270 | 9120926

Average InLinks per Cluster | 16.35 | 109.35 | 32.54

Sources | 425 | 14892 | 482

Source InLinks | 75217 | 9988138 | 509693

Average InLinks per Source | 176.98 | 670.8 | 1057.45

Cluster Rank time | 422 | 6780 | 2520

Source Rank time | 3 | 660 | 21

Truncated PageRank time | 2 | 12 | 17

Time Measure for Algorithms (Cont’d)

[Bar chart: Time Measure between algorithms per iteration. x-axis: Node InLinks (1: 2905183, 2: 21781790, 3: 28346447); y-axis: Time in seconds. Cluster Rank: 422, 6780, 2520; Source Rank: 3, 660, 21; Truncated PageRank: 2, 12, 17.]


[Bar chart: Cluster Rank Time Measure based on Cluster InLinks. x-axis: Cluster InLinks (1: 983579, 2: 9120926, 3: 18210270); y-axis: Time in seconds. Cluster Rank: 422, 2520, 6780.]


[Bar chart: Source Rank Time Measure based on Source InLinks. x-axis: Source InLinks (1: 75217, 2: 509693, 3: 9988138); y-axis: Time in seconds. Source Rank: 3, 21, 660.]


[Bar chart: Truncated PageRank Time Measure based on Node InLinks. x-axis: Node InLinks (1: 2905183, 2: 21781790, 3: 28346447); y-axis: Time in seconds. Truncated PageRank: 2, 12, 17.]

Node In-Link Distribution across Nodes (4M URLs)

[Chart: Distribution of Nodes and InLinks for 4M. x-axis: # of InLinks; y-axis: # of Nodes.]

Cluster In-Link Distribution across Clusters (4M URLs)

[Chart: Distribution of Clusters and InLinks for 4M. x-axis: # of InLinks; y-axis: # of Clusters.]

Source In-Link Distribution across Sources (4M URLs)

[Chart: Distribution of Sources and InLinks for 4M. x-axis: # of InLinks; y-axis: # of Sources.]

Quality Measure for Algorithms

• A survey on the quality of the ranking algorithms was performed by a group of people, using 25 search keywords.

• The keywords were obtained from Google’s Keyword tool at: https://adwords.google.com/select/KeywordToolExternal

• The keywords identified are listed below.

pictures university faculty stadium undergraduate

map admissions scholarships loan mba

alumni computer graduate business research

students technology accommodation campus vacations

dean department aid gpa parking

Quality Measure for Algorithms (Cont’d)

• The survey was performed to identify the following from keyword search:

– First page accuracy

– Second page accuracy

– Result order on the first page

– Result order on the second page

– Overall, are the important pages showing up early?

– Overall, what percentage of the result hits is relevant?

Quality Measure For Algorithms (Cont’d)

Algorithm | Quality measure on a scale of 1 to 5 (1 being the best)

ClusterRank | 2.06

SourceRank | 1.65

Truncated PageRank | 2.94

Conclusion

• The ClusterRank computation can be accelerated using WebGraph.

• The SourceRank algorithm takes less time for the Page-Rank calculation than ClusterRank and is close to Truncated PageRank for the existing 4M URLs (21 seconds, against 2520 and 17 seconds).

• SourceRank has the best quality score of the three algorithms (1.65 on a scale where 1 is the best).

• Considering both Efficiency and Quality, SourceRank is the best of the three for the existing data, based on the experiments performed.

Success Criteria

• Identified the efficiency of Page-Rank computation algorithm using time-measure generated by experiments

• Identified the quality of the algorithm using manual survey results

• Implemented the most efficient algorithm for the Needle Search Engine at UCCS

• Upgraded the existing Needle Search Engine to 1 Million pages (crawled, actual URLs are 4 Million) from the current 50K URLs (crawled, actual URLs are 300K).

References

• [1] Paolo Boldi, Sebastiano Vigna. The WebGraph Framework I: Compression Techniques. http://www2004.org/proceedings/docs/1p595.pdf

• [2] Yen-Yu Chen, Qingqing Gan, Torsten Suel. I/O-Efficient Techniques for Computing PageRank. http://cis.poly.edu/suel/papers/pagerank.pdf

• [3] Taher H. Haveliwala. Efficient Computation of PageRank.

References (Cont’d)

• [4] Yi Zhang. Design and Implementation of a Search Engine with the Cluster Rank Algorithm.

• [5] John A. Tomlin. A New Paradigm for Ranking Pages on the World Wide Web.

• [6] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://www.cs.huji.ac.il/~csip/1999-66.pdf

• [7] Ricardo Baeza-Yates, Paolo Boldi, Carlos Castillo. Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms. http://www.dcc.uchile.cl/~ccastill/papers/baeza06_general_pagerank_damping_functions_link_ranking.pdf

• [8] Gonzalo Navarro. Compressing Web Graphs like Texts.

• [9] The Spiders Apprentice. http://www.monash.com/spidap1.html


• [10] James Caverlee, Ling Liu, S. Webb. Spam-Resilient Web Ranking via Influence Throttling. http://www-static.cc.gatech.edu/~caverlee/pubs/caverlee07ipdps.pdf

• [11] G. Jeh, J. Widom. SimRank: A Measure of Structural-Context Similarity. http://www-cs-students.stanford.edu/~glenj/simrank.pdf

• [12] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, R. Baeza-Yates. Using rank propagation and probabilistic counting for link-based spam detection. Technical report, 2006.