Pagerank (1)

The PageRank Citation Ranking: Bringing Order to the Web. Larry Page etc., Stanford University. Presented by Guoqiang Su & Wei Li

Transcript of Pagerank (1)

Page 1: Pagerank (1)

The PageRank Citation Ranking: Bringing Order to the Web

Larry Page etc.

Stanford University

Presented by

Guoqiang Su & Wei Li

Page 2: Pagerank (1)

Contents

- Motivation
- Related work
- PageRank & Random Surfer Model
- Implementation
- Application
- Conclusion

Page 3: Pagerank (1)

Motivation

- Web: heterogeneous and unstructured
- Free of quality control on the web
- Commercial interest to manipulate ranking

Page 4: Pagerank (1)

Related Work

- Academic citation analysis
- Link-based analysis
- Clustering methods of link structure
- Hubs & Authorities Model

Page 5: Pagerank (1)

Backlink

- Link structure of the web
- Approximation of importance / quality

Page 6: Pagerank (1)

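PageRank

- Pages with lots of backlinks are important
- Backlinks coming from important pages convey more importance to a page

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

where $B_u$ is the set of pages linking to u, $N_v$ is the number of links out of v, and c is a normalization factor

Problem: Rank Sink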

Page 7: Pagerank (1)

Rank Sink

A cycle of pages that is pointed to by some incoming link

Problem: this loop will accumulate rank but never distribute any rank outside
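The rank-sink effect can be reproduced on a toy graph. The sketch below is illustrative only: the four-page graph, the page names, and the helper `simple_rank_step` are assumptions and not part of the slides. It iterates the simple ranking formula (without an escape term) and records how much rank sits inside the loop c <-> d.

```python
def simple_rank_step(links, rank):
    """One application of R(u) = sum over backlinks v of R(v) / N_v (no escape term)."""
    new_rank = {page: 0.0 for page in links}
    for v, outlinks in links.items():
        share = rank[v] / len(outlinks)  # v splits its rank evenly over its links
        for u in outlinks:
            new_rank[u] += share
    return new_rank


# Hypothetical graph: a and b link to each other, a also links into the loop c <-> d.
links = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": ["c"]}
rank = {page: 1.0 / len(links) for page in links}

loop_mass = []  # rank held by the loop {c, d} after each iteration
for _ in range(50):
    rank = simple_rank_step(links, rank)
    loop_mass.append(rank["c"] + rank["d"])

print(f"rank inside the loop after 50 iterations: {loop_mass[-1]:.6f}")
```

In this graph the loop never passes rank back to a or b, so its share only grows while the total stays constant.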

Page 8: Pagerank (1)

Escape Term

Solution: Rank Source

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)$$

- c is maximized and $\|R\|_1 = 1$
- E(u) is some vector over the web pages
– uniform, favorite page etc.

Page 9: Pagerank (1)

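Matrix Notation

$$R = c\,(A^T + E e^T)\,R$$

where e is the all-ones vector, so that $e^T R = \|R\|_1 = 1$

R is the dominant eigenvector of $(A^T + E e^T)$, and c plays the role of the dominant eigenvalue because c is maximized (written as $M R = \lambda R$ with $M = A^T + E e^T$, the eigenvalue is $\lambda = 1/c$)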

Page 10: Pagerank (1)

Computing PageRank

$R_0 \leftarrow S$ - initialize vector over web pages

loop:

$R_{i+1} \leftarrow A^T R_i$ - new ranks: sum of normalized backlink ranks

$d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$ - compute normalizing factor

$R_{i+1} \leftarrow R_{i+1} + dE$ - add escape term

$\delta \leftarrow \|R_{i+1} - R_i\|_1$ - control parameter

while $\delta > \epsilon$ - stop when converged
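The loop on this slide can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the presenters' implementation: the function name `pagerank`, the uniform choices for S and E, the threshold `epsilon`, and the four-page graph are all hypothetical. Note that in this literal reading d is nonzero only when some rank is lost in the multiplication step, which here happens through the page d that has no outgoing links.

```python
def pagerank(links, source, epsilon=1e-8, max_iter=1000):
    """Iterate R <- A^T R + d*E until the L1 change drops to epsilon."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # R_0 <- S (assumed uniform)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # R_{i+1} <- A^T R_i: each page passes rank / N_v along every outlink.
        new_rank = {p: 0.0 for p in pages}
        for v, outlinks in links.items():
            for u in outlinks:
                new_rank[u] += rank[v] / len(outlinks)
        # d <- ||R_i||_1 - ||R_{i+1}||_1 (ranks are non-negative, so L1 norm = sum).
        d = sum(rank.values()) - sum(new_rank.values())
        # R_{i+1} <- R_{i+1} + d*E: hand the lost rank back via the rank source.
        for p in pages:
            new_rank[p] += d * source[p]
        # delta <- ||R_{i+1} - R_i||_1
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta <= epsilon:  # the slide's "while delta > epsilon", as a break
            break
    return rank, iterations


# Hypothetical graph; page "d" has no outgoing links.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
source = {p: 1.0 / len(links) for p in links}  # E assumed uniform, sums to 1
rank, iterations = pagerank(links, source)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```

Because E sums to 1 here, each iteration returns exactly the rank that was lost, so the total rank stays at 1.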

Page 11: Pagerank (1)

Random Surfer Model

- PageRank corresponds to the probability distribution of a random walk on the web graph
- E(u) can be rephrased as follows: the random surfer periodically gets bored and jumps to a different page, so is not kept in a loop forever
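The random surfer reading can be checked with a small simulation. The sketch below is an illustration under assumptions that are not in the slides: the function name `random_surfer`, a boredom probability of 0.15, a uniform jump target, the step count, and the four-page graph with a loop c <-> d. It counts how often the surfer visits each page.

```python
import random


def random_surfer(links, bored=0.15, steps=200_000, seed=0):
    """Estimate visit frequencies of a surfer who sometimes jumps to a random page."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same counts
    pages = list(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        outlinks = links[current]
        if not outlinks or rng.random() < bored:
            current = rng.choice(pages)     # bored (or stuck): jump anywhere
        else:
            current = rng.choice(outlinks)  # otherwise follow a random link
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph with a loop c <-> d that would otherwise trap the surfer.
links = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": ["c"]}
frequencies = random_surfer(links)
for page, freq in sorted(frequencies.items()):
    print(page, round(freq, 3))
```

The loop pages still collect most of the visits, but the jumps keep the pages outside the loop at a clearly non-zero frequency.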

Page 12: Pagerank (1)

Implementation

- Computing resources
– 24 million pages
– 75 million URLs
- Memory and disk storage
– Weight vector (4-byte float)
– Matrix A (linear access)
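The size of the weight vector follows from the figures on this slide. The arithmetic below assumes one 4-byte weight per URL and a single copy of the vector. The variable names are illustrative.

```python
NUM_URLS = 75_000_000   # from the slide
BYTES_PER_WEIGHT = 4    # one 4-byte float per URL (assumed: one weight per URL)

weight_vector_bytes = NUM_URLS * BYTES_PER_WEIGHT
weight_vector_mb = weight_vector_bytes / 1_000_000  # decimal megabytes

print(f"{weight_vector_bytes} bytes = {weight_vector_mb:.0f} MB")
```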

Page 13: Pagerank (1)

Implementation (cont'd)

- Unique integer ID for each URL
- Sort and remove dangling links
- Rank initial assignment
- Iteration until convergence
- Add back dangling links and re-compute
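The first two steps can be sketched as below. This is an illustration under assumptions: a dangling link is taken to be a link to a page with no outgoing links, removal is repeated because it can expose new dangling links, and the helper names and example URLs are hypothetical. Adding the removed links back and re-computing is not shown.

```python
def assign_ids(edges):
    """Map each URL to a unique integer ID, in order of first appearance."""
    ids = {}
    for src, dst in edges:
        for url in (src, dst):
            if url not in ids:
                ids[url] = len(ids)
    return ids


def remove_dangling(edges):
    """Repeatedly drop links that point to pages without outgoing links."""
    kept = sorted(set(edges))  # sorted, de-duplicated link list
    removed = []
    while True:
        has_outlinks = {src for src, _ in kept}
        dangling = [edge for edge in kept if edge[1] not in has_outlinks]
        if not dangling:
            return kept, removed
        removed.extend(dangling)  # remembered so they could be added back later
        kept = [edge for edge in kept if edge[1] in has_outlinks]


# Hypothetical URLs, for illustration only.
url_edges = [
    ("a.example", "b.example"),
    ("b.example", "a.example"),
    ("b.example", "c.example"),
    ("c.example", "d.example"),
]
ids = assign_ids(url_edges)
edges = [(ids[src], ids[dst]) for src, dst in url_edges]
kept, removed = remove_dangling(edges)
print("kept:", kept)
print("removed:", removed)
```

Removing the link to d.example leaves c.example without outgoing links, so the link to c.example is dropped in the next pass.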

Page 14: Pagerank (1)

Convergence Properties

- Graph (V, E) is an expander with factor $\alpha$ if for all (not too large) subsets S: $|A_S| \ge \alpha |S|$, where $A_S$ is the neighborhood of S
- Eigenvalue separation: the largest eigenvalue is sufficiently larger than the second-largest eigenvalue
- A random walk converges fast to a limiting probability distribution on a set of nodes in the graph

Page 15: Pagerank (1)

Convergence Properties (cont'd)

- The number of iterations PageRank needs to converge is roughly O(log(|V|)), due to the rapidly mixing graph G of the web
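Fast convergence can be observed by tracking the L1 change between successive rank vectors. The sketch below does not reproduce the O(log(|V|)) claim. It only illustrates geometric decay of the residual on a small graph. It assumes the random-surfer form of the update with a jump probability of 0.15 and a uniform rank source. The function name `residuals` and the eight-node graph are hypothetical.

```python
def residuals(links, jump=0.15, iterations=30):
    """Run R <- (1 - jump) * A^T R + jump * E (E uniform) and record L1 changes."""
    pages = list(links)
    n = len(pages)
    rank = {p: 0.0 for p in pages}
    rank[pages[0]] = 1.0  # start far from the limit: all rank on one page
    history = []
    for _ in range(iterations):
        new_rank = {p: jump / n for p in pages}
        for v, outlinks in links.items():
            share = (1.0 - jump) * rank[v] / len(outlinks)
            for u in outlinks:
                new_rank[u] += share
        history.append(sum(abs(new_rank[p] - rank[p]) for p in pages))
        rank = new_rank
    return rank, history


# Hypothetical 8-node graph in which every node has at least one outgoing link.
n = 8
links = {i: sorted({(i + 1) % n, (3 * i + 2) % n}) for i in range(n)}
rank, history = residuals(links)
for step, delta in enumerate(history[:5], start=1):
    print(step, f"{delta:.6f}")
```

Since every node has outgoing links, each step shrinks the residual by at least the factor (1 - jump).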

Page 16: Pagerank (1)

Personalized PageRank

Rank source E can be initialized:

– uniformly over all pages: e.g. copyright warnings, disclaimers, mailing list archives result in overly high ranking
– total weight on a single page, e.g. Netscape, McCarthy: great variation of ranks under different single pages as rank source
– and everything in-between, e.g. server root pages: allow manipulation by commercial interests
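The effect of the rank source can be seen by running the same iteration with two choices of E. This is a sketch under assumptions that are not in the slides: the random-surfer form of the update with a jump probability of 0.15, a fixed iteration count, the function name `pagerank_with_source`, and a hypothetical four-page graph in which page d has no backlinks.

```python
def pagerank_with_source(links, source, jump=0.15, iterations=100):
    """Iterate R <- (1 - jump) * A^T R + jump * E for a given rank source E."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: jump * source[p] for p in pages}
        for v, outlinks in links.items():
            share = (1.0 - jump) * rank[v] / len(outlinks)
            for u in outlinks:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical graph: a -> b -> c -> a, plus d -> a (nothing links to d).
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}

uniform_source = {p: 1.0 / len(links) for p in links}
single_source = {p: (1.0 if p == "d" else 0.0) for p in links}  # all weight on d

uniform = pagerank_with_source(links, uniform_source)
personalized = pagerank_with_source(links, single_source)
for page in sorted(links):
    print(page, round(uniform[page], 4), round(personalized[page], 4))
```

Page d receives rank only through E, so moving all of E onto d raises its rank from jump/4 to jump.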

Page 17: Pagerank (1)

Applications I

- Estimate web traffic
– Server/page aliases
– Link/traffic disparity, e.g. porn sites, free web-mail
- Backlink predictor
– Citation counts have been used to predict future citations
– Very difficult to map the citation structure of the web completely
– Avoid the local maxima that citation counts get stuck in and get better performance

Page 18: Pagerank (1)

Applications II - Ranking Proxy

Surfer's Navigation Aid

Annotating links by PageRank (bar graph)

Not query dependent

Page 19: Pagerank (1)

Issues

- Users are not random walkers
– Content based methods
- Starting point distribution
– Actual usage data as starting vector
- Reinforcing effects/bias towards main pages
- How about traffic to ranking pages?
- No query specific rank
- Linkage spam
– PageRank favors pages that managed to get other pages to link to them
– Linkage not necessarily a sign of relevancy, only of promotion (advertisement…)

Page 20: Pagerank (1)

Evaluation I

Page 21: Pagerank (1)

Evaluation II

Page 22: Pagerank (1)

Conclusion

- PageRank is a global ranking based on the web's graph structure
- PageRank uses backlink information to bring order to the web
- PageRank can separate out representative pages as cluster centers
- A great variety of applications