Accelerating Ranking-System Using WebGraph
Project Report
by
Padmaja Adipudi
Outline of My Talk
• Needle Search Engine/Ranking-System
• Ranking-System Issue/Resolution
– Accelerating Ranking-System using WebGraph
– Ranking Algorithms Overview
– Google’s PageRank, ClusterRank, SourceRank & Truncated PageRank
• Experimental Results
– Efficiency Measure
– Quality Measure
• Conclusion
– Which algorithm is better in terms of Efficiency & Quality
Search Engine
• The Web is a terrific place to get information on any topic.
• A Search Engine is a useful application for information retrieval on the WWW.
• A Search Engine has five basic components: a Crawler, a Parser, a Ranking-System, a Repository and a Front-End.
Ranking-System
• Determines the importance of a Web page.
• Google's PageRank algorithm is the best-known Ranking-System and is based on the URL link structure.
• In Google’s PageRank, the importance of a Web page is based on the importance of its parent Web pages.
Needle Search Engine
• A Search Engine developed by former students at UCCS.
• ClusterRank algorithm is implemented as the Ranking-System.
• The former student Yi Zhang developed the ClusterRank ranking system, which takes an average of 3 hours to rank 300,000 URLs.
Ranking-System Issue
• The major issue with the current ranking system is its long update time: 3 hours for 300K URLs.
• As the number of pages increases, this is going to become a severe problem.
Project Goal
• Accelerate the existing Ranking-System of the Needle Search Engine at UCCS using a package called “WebGraph”.
• Scale the Needle Search Engine from the current 50K crawled Web pages up to 1 Million crawled Web pages.
Steps to reach Goal
• Use WebGraph package to represent the graph efficiently using compression techniques.
• Compute the Page-Rank using algorithms namely ClusterRank, SourceRank and Truncated PageRank.
• Compare the time and quality results of ClusterRank with those of SourceRank and Truncated PageRank, and choose the best algorithm for the Needle Search Engine.
Work Flow
[Figure: work flow diagram with boxes Compressed Graph; ClusterRank; SourceRank; Truncated PageRank; Page Rank Results]
Why Truncated & Source Algorithms
• These algorithms come from the latest papers available in the Page Ranking area.
• The authors used the WebGraph package for their experiments while developing the algorithms.
Node Graph
• A Node graph is used in the ranking system.
• A Node graph consists of nodes and directed links from node to node.
• URLs are represented by nodes and the hyperlinks are represented as directed links between nodes.
• Compression techniques are used to represent the Node graph in an efficient manner.
Google’s PageRank
• Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd from Stanford University, 1999.
• Importance of a page is based on the incoming link count and also on how important those incoming links are.
• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
– PR(Tn): Each page has a notion of its own self-importance. That’s “PR(T1)” for the first page in the web all the way up to PR(Tn) for the last page.
– C(Tn): Each page spreads its vote out evenly amongst all of its outgoing links. The count, or number, of outgoing links for page 1 is C(T1), C(Tn) for page n, and so on for all pages.
– PR(Tn)/C(Tn): if a page (page A) has a back link from page N, the share of the vote page A gets is PR(Tn)/C(Tn).
– d: All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is "damped down" by multiplying it by 0.85 (the factor d).
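The formula above can be iterated until the ranks settle. The following is a minimal sketch in Python, not the Needle implementation: the function name `pagerank`, the fixed iteration count, and the toy three-page web are assumptions made for illustration.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A.

    out_links maps each page to the list of pages it links to (hypothetical input shape).
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    pr = {page: 1.0 for page in pages}  # every page starts with the same rank
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in out_links.items():
            if not targets:
                continue  # in this sketch a page without out-links casts no votes
            share = d * pr[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Toy web: A links to B and C, B links to C, C links back to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Page C collects votes from both A and B, so it ends up ranked above B, which only receives half of A's vote.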
ClusterRank
• Yi Zhang, a student at UCCS, is the author, 2006.
• Algorithm is based on Google’s PageRank.
• Designed to speed up PageRank calculation and also to provide a feature of grouping similar Web pages together into clusters.
• The original PageRank algorithm is applied on Clusters.
• The rank is then distributed to members of the cluster by weighted average.
ClusterRank (Cont’d)
• Group all pages into clusters.
• Perform first level clustering for dynamically generated pages.
• URLs are grouped based on the “?” and “#” symbols.
• Example: All URLs below will be grouped into one Cluster
– http://www.uccs.edu/057/cs_sub.shtml
– http://www.uccs.edu/057/cs_sub.shtml#news
– http://www.uccs.edu/057/cs_sub.shtml#dates
– http://www.uccs.edu/057/cs_sub.shtml#spotlight
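The first-level grouping can be sketched as follows. This is an illustration only: it assumes that the cluster key is the URL cut at the first “?” or “#”, and the helper names `first_level_key` and `first_level_clusters` are hypothetical.

```python
import re
from collections import defaultdict


def first_level_key(url):
    """Return the URL truncated at the first '?' or '#' (assumed grouping rule)."""
    return re.split(r"[?#]", url, maxsplit=1)[0]


def first_level_clusters(urls):
    """Group URLs that share the same first-level key."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[first_level_key(url)].append(url)
    return dict(clusters)


urls = [
    "http://www.uccs.edu/057/cs_sub.shtml",
    "http://www.uccs.edu/057/cs_sub.shtml#news",
    "http://www.uccs.edu/057/cs_sub.shtml#dates",
    "http://www.uccs.edu/057/cs_sub.shtml#spotlight",
]
clusters = first_level_clusters(urls)
for key, members in clusters.items():
    print(key, len(members))
```

All four example URLs share the key `http://www.uccs.edu/057/cs_sub.shtml`, so they land in a single cluster.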
ClusterRank (Cont’d)
• Perform second level clustering on virtual directory and graph density.
• URLs are grouped based on the last “/” symbol of the URL.
• Density is calculated for the proposed clusters.
• Approve the cluster based on the pre-set threshold value.
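The virtual-directory part of the second level can be sketched as below. The sketch covers only the grouping on the last “/”; the density calculation and the threshold check are left out because the slides do not define them. The helper names and the second and third URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def virtual_directory(url):
    """Return scheme, host and the path up to the last '/' (assumed cluster key)."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0]  # drop everything after the last "/"
    return f"{parts.scheme}://{parts.netloc}{directory}/"


def propose_clusters(urls):
    """Group URLs by virtual directory; density approval is not modelled here."""
    proposed = defaultdict(list)
    for url in urls:
        proposed[virtual_directory(url)].append(url)
    return dict(proposed)


urls = [
    "http://www.uccs.edu/057/cs_sub.shtml",
    "http://www.uccs.edu/057/index.shtml",  # hypothetical page in the same directory
    "http://www.uccs.edu/",                 # hypothetical top-level page
]
proposed = propose_clusters(urls)
for directory, members in proposed.items():
    print(directory, len(members))
```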
ClusterRank (Cont’d)
• Calculate the rank for each cluster using the original PageRank algorithm.
• Distribute the rank number to its members by weighted average using:
– PR = CR * Pi/Ci
– The notations here are:
– PR: The rank of a member page
– CR: The cluster rank from previous stage
– Pi: The incoming links of this page
– Ci: Total incoming links of this cluster
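A small worked example of PR = CR * Pi/Ci follows. It assumes that Ci is the sum of the members' incoming link counts and that a cluster with no incoming links splits its rank evenly; both are assumptions of this sketch, as are the function name and the numbers.

```python
def distribute_cluster_rank(cluster_rank, member_in_links):
    """Split a cluster's rank over its pages in proportion to their incoming links.

    cluster_rank    -- CR, the rank computed for the whole cluster
    member_in_links -- maps each member page to Pi, its incoming link count
    """
    total_in_links = sum(member_in_links.values())  # Ci (assumed to be the sum of Pi)
    if total_in_links == 0:
        # Assumption: with no incoming links at all, share the rank evenly.
        even_share = cluster_rank / len(member_in_links)
        return {page: even_share for page in member_in_links}
    return {
        page: cluster_rank * in_links / total_in_links  # PR = CR * Pi / Ci
        for page, in_links in member_in_links.items()
    }


# Hypothetical cluster with rank 2.0 and three member pages.
page_ranks = distribute_cluster_rank(2.0, {"page1": 6, "page2": 3, "page3": 1})
for page, rank in page_ranks.items():
    print(page, round(rank, 2))
```

With these numbers the page holding 6 of the 10 incoming links receives 60% of the cluster rank.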
SourceRank
• James Caverlee, Ling Liu, and S. Webb from Georgia Institute of Technology, 2007.
• The Web graph is represented as Sources.
• The Source is a logical collection of Web pages.
• Assigns a score to each page based on the overall quality of the source that the page belongs to, through a random walk over Web sources.
SourceRank (Cont’d)
• Group all pages into Sources based on “Domain”.
• URLs are grouped based on the first “/” symbol of the URL
• Example: All URLs below will be grouped into one Source
– http://office.microsoft.com/en-us/default.aspx
– http://office.microsoft.com/en-us/assistance/default.aspx
– http://office.microsoft.com/en-us/assistance/CH790018071033.aspx
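Grouping by domain can be sketched with the standard library URL parser. The sketch assumes that the Source key is the host part of the URL, that is, everything between “//” and the first “/” after it; the function name `group_by_source` is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_source(urls):
    """Group URLs by host name (assumed definition of a Source)."""
    sources = defaultdict(list)
    for url in urls:
        host = urlsplit(url).netloc.lower()  # part of the URL before the first path "/"
        sources[host].append(url)
    return dict(sources)


urls = [
    "http://office.microsoft.com/en-us/default.aspx",
    "http://office.microsoft.com/en-us/assistance/default.aspx",
    "http://office.microsoft.com/en-us/assistance/CH790018071033.aspx",
]
sources = group_by_source(urls)
for host, members in sources.items():
    print(host, len(members))
```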
SourceRank (Cont’d)
• Calculate the rank for each Source with the original PageRank algorithm
• Distribute the rank number to its members by weighted average using:
– PR = SR * Si
– The notations here are:
– PR: The rank of a member page
– SR: The source rank from previous stage
– Si: Total incoming unique links of this source
Truncated PageRank
• L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates from Italy, 2006.
• In PageRank, a Web page can gain a high Page-Rank score with supporters (in-links) that are topologically “close” to the target node.
• Spammers can afford to influence only a few levels.
• Truncated PageRank is similar to PageRank, except that the supporters that are too “close” to a target node do not contribute towards its ranking.
Truncated PageRank (Cont’d)
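• PR(p) = Σ_{t>T} C · α^t · M^t = Σ_t damping(t) · M^t
The notations here are:
– C: Normalization constant
– α: The damping factor
– T: The number of closest levels of supporters that are ignored (damping(t) = 0 for t ≤ T)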
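The idea of ignoring close supporters can be illustrated with a small sketch. It is an assumption-heavy illustration, not the authors' code: it takes damping(t) = 0 for t ≤ T and C · α^t for t > T, chooses C so that the damping weights sum to 1, uses uniform random-walk transitions, and cuts the infinite sum off at `max_t`. All names and the toy graph are hypothetical.

```python
def truncated_pagerank(out_links, alpha=0.85, T=1, max_t=100):
    """Score pages while ignoring supporters at distance T or less."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    c = (1.0 - alpha) / alpha ** (T + 1)  # assumed normalization constant C

    # walk[v]: probability mass arriving at v after exactly t steps,
    # starting from a uniform distribution over all pages.
    walk = {v: 1.0 / n for v in nodes}
    score = {v: 0.0 for v in nodes}
    for t in range(1, max_t + 1):
        next_walk = {v: 0.0 for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                next_walk[v] += walk[u] / len(targets)
        walk = next_walk
        if t > T:  # paths of length t <= T contribute nothing
            weight = c * alpha ** t
            for v in nodes:
                score[v] += weight * walk[v]
    return score


# Toy chain a -> b -> c: b only has a supporter at distance 1,
# while c also has a supporter (a) at distance 2.
chain = {"a": ["b"], "b": ["c"], "c": []}
scores = truncated_pagerank(chain, alpha=0.85, T=1)
for node in sorted(scores):
    print(node, round(scores[node], 4))
```

With T = 1, page b scores zero because its only supporter is a direct in-link, while page c keeps the contribution that reaches it over the two-step path from a.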
WebGraph Package
• Paolo Boldi and Sebastiano Vigna from Italy, 2004.
• Represents the Node graph in an efficient manner using a Differential compression technique.
• Allows applications to encode compactly a new version of data with respect to a previous or reference version of the same data.
• WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link.
• WebBase is a repository of Web pages crawled by Ubi crawler from Stanford University.
WebGraph Package (Cont’d)
• Node graph initial representation:
• Node graph with Reference compression:
WebGraph Package (Cont’d)
• Node graph with Differential compression:
• Differential compression makes it possible to code a link in less than a bit (not possible with plain Reference compression)
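The idea behind reference compression can be illustrated with a simplified sketch: a node's successor list is stored as a bit mask over a reference node's list plus the few successors the reference does not have. This is only an illustration of the idea; it does not reproduce WebGraph's actual bit-level coding or the differential variant, and the function names and successor lists are hypothetical.

```python
def reference_encode(reference, successors):
    """Encode a successor list relative to a reference successor list."""
    successor_set = set(successors)
    reference_set = set(reference)
    # One bit per reference entry: 1 if the entry is also a successor of this node.
    copy_list = [1 if node in successor_set else 0 for node in reference]
    # Successors that cannot be copied from the reference are stored explicitly.
    extra = [node for node in successors if node not in reference_set]
    return copy_list, extra


def reference_decode(reference, copy_list, extra):
    """Rebuild the sorted successor list from the reference, copy list and extras."""
    copied = [node for node, bit in zip(reference, copy_list) if bit]
    return sorted(copied + extra)


# Hypothetical successor lists of two neighbouring nodes.
node_15 = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
node_16 = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]

copy_list, extra = reference_encode(node_15, node_16)
print("copy list:", "".join(str(bit) for bit in copy_list))
print("extra nodes:", extra)
print("round trip ok:", reference_decode(node_15, copy_list, extra) == node_16)
```

Pages on the same site tend to link to many of the same targets, which is why most successors of a node can be copied from a nearby node instead of being stored again.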
WebGraph Package (Cont’d)
[Figure: data flow diagram with boxes Link Structure From DB; Graph in ASCII format; Graph in BV format; PageRank Module]
BVGraph Details
• BVGraph: Boldi Vigna Graph
• BVGraph is generated using a graph that is represented in ASCII format.
• The first line contains the number of nodes ‘n’; then ‘n’ lines follow, the i-th line containing the successors of node ‘i’ in increasing order (nodes are numbered from 0 to n-1). The successors are separated by a single space.
BVGraph Details (Cont’d)
• For example, consider a graph of three vertices, a, b, and c, consisting of the following edges:
• (a, b) (a, c) (b, c) (b, a)
• (a:0, b:1, c:2)
• This graph could be expressed as below: 3 1 2 0 2 1
[Figure: directed graph with nodes A, B and C]
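Writing the ASCII format from a list of edges can be sketched as below. The function name `to_ascii_graph` and the edge list are hypothetical and chosen only to show the layout: the node count on the first line, then one line of sorted successors per node.

```python
def to_ascii_graph(num_nodes, edges):
    """Render edges (source, target) in the ASCII graph layout described above."""
    successors = [set() for _ in range(num_nodes)]
    for source, target in edges:
        successors[source].add(target)

    lines = [str(num_nodes)]  # first line: number of nodes
    for node_successors in successors:
        # i-th following line: successors of node i, increasing, space separated
        lines.append(" ".join(str(node) for node in sorted(node_successors)))
    return "\n".join(lines) + "\n"


# Hypothetical edge list over nodes 0, 1 and 2.
edges = [(0, 2), (1, 2), (1, 0), (2, 1)]
text = to_ascii_graph(3, edges)
print(text, end="")
```

A node without successors would produce an empty line, so the file always has exactly n lines after the node count.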
BVGraph – Current Implementation
• The URLLinkStructure table in the Database holds the link information.
• The ASCII graph is generated from the data in the URLLinkStructure table, and then the BVGraph is generated from it.
• The ASCII graph is stored as basename.graph-txt
• BVGraph is generated using the command:
– java it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph basename bvbasename
BVGraph – Current Implementation (Cont’d)
• The graph can be generated for incoming links as well as outgoing links.
• BVnode-in, BVnode-out, BVSource-in graphs are generated.
• BVGraph can be loaded using two loading methods, load and loadOffline.
• The load method is used for small graphs.
• The loadOffline method is used for large graphs.
ClusterRank Using BVGraph
URLs    Without BVGraph (per iteration in sec)    With BVGraph (per iteration in sec)
300K    9452                                      7737
ClusterRank Using BVGraph (Cont’d)
• Time gain using WebGraph for 300K URLs
[Figure: bar chart "Total time gain using WebGraph for 300K URLs"; x-axis: Time in seconds; bars: Without WebGraph 9452, With WebGraph 7737]
Time Measure for Algorithms (in Seconds)
Statistic / Algorithm            Dataset 1    Dataset 2    Dataset 3
URLs                             633061       289503       4 M
Node InLinks                     2905183      21781790     28346447
Average InLinks per Node         4.6          78.06        5.82
Clusters                         48271        164136       256919
Cluster InLinks                  983579       18210270     9120926
Average InLinks per Cluster      16.35        109.35       32.54
Sources                          425          14892        482
Source InLinks                   75217        9988138      509693
Average InLinks per Source       176.98       670.8        1057.45
Cluster Rank (sec)               422          6780         2520
Source Rank (sec)                3            660          21
Truncated PageRank (sec)         2            12           17
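The measured times can be turned into speed-up factors relative to ClusterRank with a few lines of arithmetic. The numbers are the ones reported above for the three data sets; the variable names are hypothetical.

```python
# Seconds per data set, in the order the three data sets are listed above.
seconds = {
    "Cluster Rank": [422, 6780, 2520],
    "Source Rank": [3, 660, 21],
    "Truncated PageRank": [2, 12, 17],
}

baseline = seconds["Cluster Rank"]
speedups = {
    name: [round(base / time, 2) for base, time in zip(baseline, times)]
    for name, times in seconds.items()
    if name != "Cluster Rank"
}

for name, ratios in speedups.items():
    print(name, ratios)
```

For the third data set (4 M URLs) this gives a factor of 120 for SourceRank and about 148 for Truncated PageRank.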
Time Measure for Algorithms (Cont’d)
[Figure: bar chart "Time Measure between algorithms per iteration"; x-axis: Node InLinks (1: 2905183, 2: 21781790, 3: 28346447); y-axis: Time in seconds; series: Cluster Rank (422, 6780, 2520), Source Rank (3, 660, 21), Truncated PageRank (2, 12, 17)]
Time Measure for Algorithms (Cont’d)
[Figure: chart "Cluster Rank Time Measure based on Cluster InLinks"; x-axis: Cluster InLinks (1: 983579, 2: 9120926, 3: 18210270); y-axis: Time in seconds; values: 422, 2520, 6780]
Time Measure for Algorithms (Cont’d)
[Figure: chart "Source Rank Time Measure based on Source InLinks"; x-axis: Source InLinks (1: 75217, 2: 509693, 3: 9988138); y-axis: Time in seconds; values: 3, 21, 660]
Time Measure for Algorithms (Cont’d)
[Figure: chart "Truncated PageRank Time Measure based on Node InLinks"; x-axis: Node InLinks (1: 2905183, 2: 21781790, 3: 28346447); y-axis: Time in seconds; values: 2, 12, 17]
Node In-Link Distribution across Nodes (4M URLs)
[Figure: chart "Distribution of Nodes and InLinks for 4M"; x-axis: # of InLinks; y-axis: # of Nodes]
Cluster In-Link Distribution across Clusters (4M URLs)
[Figure: chart "Distribution of Clusters and InLinks for 4M"; x-axis: # of InLinks; y-axis: # of Clusters]
Source In-Link Distribution across Sources (4M URLs)
[Figure: chart "Distribution of Sources and InLinks for 4M"; x-axis: # of InLinks; y-axis: # of Sources]
Quality Measure for Algorithms
• A group of people surveyed the quality of the ranking algorithms using 25 search keywords.
• Obtained keywords from Google’s Keyword tool at: https://adwords.google.com/select/KeywordToolExternal
• Listed below are the keywords identified.
pictures university faculty stadium undergraduate
map admissions scholarships loan mba
alumni computer graduate business research
students technology accommodation campus vacations
dean department aid gpa parking
Quality Measure for Algorithms (Cont’d)
• Survey performed to identify the following from the keyword search:
– First page accuracy
– Second page accuracy
– Result order on the first page
– Result order on the second page
– Overall, are the important pages showing up early?
– Overall, what percentage of the result hits is relevant?
Quality Measure For Algorithms (Cont’d)
Algorithm             Quality measure on the scale 1 to 5 (1 being the best)
ClusterRank           2.06
SourceRank            1.65
Truncated PageRank    2.94
Conclusion
• The ClusterRank computation can be accelerated using WebGraph (from 9452 to 7737 seconds for 300K URLs).
• SourceRank takes far less time for Page-Rank calculation than ClusterRank (21 vs. 2520 seconds for the existing 4M URLs) and is close to Truncated PageRank (17 seconds).
• SourceRank has the best quality score of the three algorithms (1.65 on a scale where 1 is the best).
• Considering both Efficiency and Quality, SourceRank is the best of the three for the existing data, based on the experiments performed.
Success Criteria
• Identified the efficiency of Page-Rank computation algorithm using time-measure generated by experiments
• Identified the quality of the algorithm using manual survey results
• Implemented the most efficient algorithm for the Needle Search Engine at UCCS
• Upgraded the existing Needle Search Engine to 1 Million pages (crawled, actual URLs are 4 Million) from the current 50K URLs (crawled, actual URLs are 300K).
References
• [1] Paolo Boldi, Sebastiano Vigna. The WebGraph Framework 1: Compression Techniques. http://www2004.org/proceedings/docs/1p595.pdf
• [2] Yen-Yu Chen, Qingqing Gan, Torsten Suel. I/O-Efficient Techniques for Computing PageRank. http://cis.poly.edu/suel/papers/pagerank.pdf
• [3] Taher H. Haveliwala. Efficient Computation of PageRank.
References (Cont’d)
• [4] Yi Zhang. Design and Implementation of a Search Engine with the Cluster Rank Algorithm.
• [5] John A. Tomlin. A New Paradigm for Ranking Pages on the World Wide Web.
• [6] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://www.cs.huji.ac.il/~csip/1999-66.pdf
References (Cont’d)
• [7] Ricardo Baeza-Yates, Paolo Boldi, Carlos Castillo. Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms. http://www.dcc.uchile.cl/~ccastill/papers/baeza06_general_pagerank_damping_functions_link_ranking.pdf
• [8] Gonzalo Navarro. Compressing Web Graphs like Texts.
• [9] The Spiders Apprentice. http://www.monash.com/spidap1.html
References (Cont’d)
• [10] James Caverlee, Ling Liu, S. Webb. Spam-Resilient Web Ranking via Influence Throttling. http://www-static.cc.gatech.edu/~caverlee/pubs/caverlee07ipdps.pdf
• [11] G. Jeh, J. Widom. SimRank: A Measure of Structural-Context Similarity. http://www-cs-students.stanford.edu/~glenj/simrank.pdf
• [12] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection. Technical report, 2006.