Methods of Computing the PageRank Vector
Tom Mangan


Brief History of Web Search

•Boolean term matching

•Sergey Brin and Larry Page

•Reputation based ranking

•PageRank

Reputation

•Count links to a page

•Weight links by how many come from a page

•Further weight links by the reputation of the linker

[Figure: mini-web of six pages, numbered 1 to 6]

Link Matrix

Calculating Rank

$r(P) = \sum_{Q \in B_P} \frac{r(Q)}{|Q|}$

Where:
$B_P$ = the set of all pages linking to P
$|Q|$ = # of links from page Q
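The rank formula can be illustrated with a minimal sketch. The three-page web below is hypothetical (it is not the mini-web from the slides), and the function name `rank_update` is an assumption made for illustration. Each page Q passes r(Q)/|Q| to every page it links to.

```python
def rank_update(links, rank):
    """One sweep of r(P) = sum over Q in B_P of r(Q) / |Q|.

    links maps each page to the list of pages it links to;
    rank maps each page to its current rank.
    """
    new_rank = {page: 0.0 for page in links}
    for q, targets in links.items():
        for p in targets:
            # Page q splits its rank evenly among its |q| outlinks.
            new_rank[p] += rank[q] / len(targets)
    return new_rank


# Hypothetical three-page web: A links to B and C, B to C, C to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = {page: 1.0 / len(links) for page in links}
new_rank = rank_update(links, rank)
```

Starting from equal ranks, page C ends the sweep with the largest share because it collects rank from both A and B.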

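The PageRank Vector

Define:

$\pi^{(k)T} = [\, r_k(P_1) \;\; r_k(P_2) \;\; \cdots \;\; r_k(P_n) \,]$

where $r_k(P_i)$ is the rank of page $P_i$ after $k$ iterations, and the link matrix $H$

where $H_{ij} = 1/|P_i|$ if $P_i$ links to $P_j$, and $H_{ij} = 0$ otherwise.

From our earlier mini-web:

[Matrix: link matrix H of the six-page mini-web]

Taken one row at a time:

$r_{k+1}(P_i) = \sum_{Q \in B_{P_i}} \frac{r_k(Q)}{|Q|}$

or, in matrix form,

$\pi^{(k+1)T} = \pi^{(k)T} H$

Iterating this equation is called the Power Method

and we define the PageRank vector:

$\pi^T = \lim_{k \to \infty} \pi^{(k)T}$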
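As a rough sketch of the iteration, the code below repeats the vector-matrix product until successive vectors stop changing. The matrix is a hypothetical three-page web with no dangling nodes (not the mini-web from the slides); the uniform starting vector, the tolerance, and the name `power_method` are assumptions.

```python
import numpy as np


def power_method(H, tol=1e-12, max_iter=1000):
    """Iterate pi^T <- pi^T H from a uniform start; return (pi, iterations)."""
    n = H.shape[0]
    pi = np.full(n, 1.0 / n)
    for k in range(1, max_iter + 1):
        new_pi = pi @ H  # row vector times matrix
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi, k
        pi = new_pi
    return pi, max_iter


# Hypothetical web: page 0 links to 1 and 2, page 1 to 2, page 2 to 0.
H = np.array([
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])
pi, iterations = power_method(H)
```

This particular H happens to be row-stochastic and irreducible, so the iteration settles; with dangling nodes or a reducible web it may not.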

Power Method

•Convergence requires: irreducibility (Perron-Frobenius Thm)

Definitions

•Markov chain

•The conditional probability of each future state depends only on the present state

•Markov matrix

•Transition matrix of a Markov chain

Transition Matrix

From our earlier mini-web:

Markov Matrix Properties

•Row-stochastic

•Stationary vector gives long-term probability of each state

•All eigenvalues satisfy |λ| ≤ 1
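These properties can be checked numerically. The sketch below uses a hypothetical two-state chain (not taken from the slides); the helper names are assumptions. The stationary vector is taken as the left eigenvector for the eigenvalue 1, scaled to sum to 1.

```python
import numpy as np


def is_row_stochastic(P, tol=1e-12):
    """True if all entries are nonnegative and every row sums to 1."""
    return bool(np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0, atol=tol))


def stationary_vector(P):
    """Left eigenvector of P for the eigenvalue closest to 1, scaled to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)  # eigenvectors of P^T are left eigenvectors of P
    lead = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, lead])
    return pi / pi.sum()


# Hypothetical two-state Markov matrix.
P = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
])
pi = stationary_vector(P)
spectral_radius = np.max(np.abs(np.linalg.eigvals(P)))
```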

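H is not row-stochastic: a dangling node (a page with no outlinks) leaves a row of zeros.

Define a vector a such that:

$a_i = 1$ if page $i$ is a dangling node, and $a_i = 0$ otherwise.

Then we obtain a row-stochastic matrix:

$S = H + \frac{1}{n} a e^T$

where $e$ is the vector of all ones; that is, every zero row of $H$ is replaced by $\frac{1}{n} e^T$.

S may or may not be reducible, so we make one more fix:

The Google Matrix:

$G = \alpha S + (1 - \alpha) \frac{1}{n} e e^T, \quad 0 < \alpha < 1$

Now G is a positive, irreducible, row-stochastic matrix, and the power method will converge, but we’ve lost sparsity.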
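A minimal sketch of the two fixes, assuming a uniform replacement row for dangling nodes and a damping value of 0.85 (a commonly quoted choice, not stated in the slides). The four-page matrix is hypothetical and `google_matrix` is an illustrative name.

```python
import numpy as np


def google_matrix(H, alpha=0.85):
    """Return (a, S, G): dangling indicator, stochastic fix, Google matrix."""
    n = H.shape[0]
    e = np.ones(n)
    a = (H.sum(axis=1) == 0).astype(float)  # 1 for dangling rows, else 0
    S = H + np.outer(a, e) / n              # replace zero rows by (1/n) e^T
    G = alpha * S + (1 - alpha) * np.outer(e, e) / n
    return a, S, G


# Hypothetical four-page web; page 3 is a dangling node (zero row).
H = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1 / 3, 1 / 3, 0.0, 1 / 3],
    [0.0, 0.0, 0.0, 0.0],
])
a, S, G = google_matrix(H)
```

Note that G is fully dense even though H has only six nonzero entries, which is the loss of sparsity mentioned above.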

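Note that:

$G = \alpha S + (1 - \alpha) \frac{1}{n} e e^T = \alpha H + (\alpha a + (1 - \alpha) e) \frac{1}{n} e^T$

so now the power method looks like:

$\pi^{(k+1)T} = \pi^{(k)T} G = \alpha \pi^{(k)T} H + (\alpha \pi^{(k)T} a + 1 - \alpha) \frac{1}{n} e^T$

which needs only products with the sparse matrix H.

The power method converges at the same rate as $|\lambda_2|^k \to 0$, where $\lambda_2$ is the subdominant eigenvalue of G. For the Google matrix $|\lambda_2| \le \alpha$,

thus convergence is at least as fast as $\alpha^k \to 0$.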
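The rewritten iteration can be sketched without ever forming G. The example below is an illustration under assumptions: a hypothetical four-page web, damping 0.85, a uniform start, and a dense NumPy array standing in for what would be a sparse H in practice.

```python
import numpy as np


def pagerank_power(H, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power method on G using only products with H; returns (pi, iterations)."""
    n = H.shape[0]
    a = (H.sum(axis=1) == 0).astype(float)  # dangling-node indicator
    pi = np.full(n, 1.0 / n)
    for k in range(1, max_iter + 1):
        # pi^T G = alpha pi^T H + (alpha pi^T a + 1 - alpha) (1/n) e^T
        new_pi = alpha * (pi @ H) + (alpha * (pi @ a) + 1 - alpha) / n
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi, k
        pi = new_pi
    return pi, max_iter


# Hypothetical four-page web; page 3 is a dangling node.
H = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1 / 3, 1 / 3, 0.0, 1 / 3],
    [0.0, 0.0, 0.0, 0.0],
])
pi, iterations = pagerank_power(H)
```

Each step costs one vector-matrix product with H plus a scalar correction added to every entry.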

A Linear System Formulation

•Amy Langville and Carl Meyer

•Exploit dangling nodes

•Solve a system instead of iterating

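By Langville and Meyer, solving the system

$x^T (I - \alpha H) = v^T$

and letting

$\pi^T = x^T / (x^T e)$

produces the PageRank vector (proof omitted). Here $v^T$ is the teleportation vector, $\frac{1}{n} e^T$ in the uniform case.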
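A small sketch of the linear-system route, assuming a uniform teleportation vector and damping 0.85; the matrix is hypothetical and `pagerank_linear_system` is an illustrative name. Because the unknown is a row vector, the code solves the transposed system with a dense solver.

```python
import numpy as np


def pagerank_linear_system(H, alpha=0.85, v=None):
    """Solve x^T (I - alpha H) = v^T, then normalize x to sum to 1."""
    n = H.shape[0]
    if v is None:
        v = np.full(n, 1.0 / n)  # assumed uniform teleportation vector
    # x^T A = v^T is equivalent to A^T x = v.
    x = np.linalg.solve((np.eye(n) - alpha * H).T, v)
    return x / x.sum()


# Hypothetical four-page web; page 3 is a dangling node.
H = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1 / 3, 1 / 3, 0.0, 1 / 3],
    [0.0, 0.0, 0.0, 0.0],
])
pi = pagerank_linear_system(H)
```

The result should agree with the vector the power method converges to for the same damping and teleportation vector.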

Exploiting Dangling Nodes:

Re-order the rows and columns of H such that

$H = \begin{pmatrix} H_{11} & H_{12} \\ 0 & 0 \end{pmatrix}$

then

$(I - \alpha H) = \begin{pmatrix} I - \alpha H_{11} & -\alpha H_{12} \\ 0 & I \end{pmatrix}$

has some nice properties that simplify solving the linear system.

•Non-singular

Source: L&M, A Reordering for the PageRank Problem

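Langville and Meyer Algorithm 1

•Re-order rows and columns so that dangling nodes are lumped at bottom

•Solve $x_1^T (I - \alpha H_{11}) = v_1^T$ for $x_1^T$

•Compute $x_2^T = \alpha x_1^T H_{12} + v_2^T$

•Normalize $\pi^T = [x_1^T \;\; x_2^T] / \|[x_1^T \;\; x_2^T]\|_1$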
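The steps above can be sketched as follows. This is an illustration under assumptions (uniform teleportation vector, damping 0.85, a hypothetical five-page web, dense arrays), not Langville and Meyer's implementation. Instead of physically permuting H, the code indexes the nondangling and dangling pages separately, which has the same effect as the reordering.

```python
import numpy as np


def pagerank_algorithm1(H, alpha=0.85):
    """Dangling-node reordering: solve only on the nondangling block H11."""
    n = H.shape[0]
    v = np.full(n, 1.0 / n)                 # assumed uniform teleportation vector
    dangling = H.sum(axis=1) == 0
    nd = np.flatnonzero(~dangling)          # nondangling pages (top block)
    d = np.flatnonzero(dangling)            # dangling pages (bottom block)
    H11 = H[np.ix_(nd, nd)]
    H12 = H[np.ix_(nd, d)]
    # Solve x1^T (I - alpha H11) = v1^T via the transposed system.
    x1 = np.linalg.solve((np.eye(nd.size) - alpha * H11).T, v[nd])
    # x2^T = alpha x1^T H12 + v2^T needs no solve at all.
    x2 = alpha * (x1 @ H12) + v[d]
    x = np.empty(n)
    x[nd], x[d] = x1, x2                    # scatter back to the original order
    return x / x.sum()                      # entries are nonnegative, so sum = 1-norm


# Hypothetical five-page web; pages 3 and 4 are dangling nodes.
H = np.array([
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [1 / 3, 0.0, 0.0, 1 / 3, 1 / 3],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
pi = pagerank_algorithm1(H)
```

Here the solve involves a 3 × 3 system rather than the full 5 × 5 one; the saving grows with the fraction of dangling nodes.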

Improvement

•In testing, Algorithm 1 reduces the time necessary to find the PageRank vector by a factor of 1 to 6

•The speedup is data-dependent

Further Improvement?

•First improvement came from finding zero rows in $H$

•Now find zero rows in $H_{11}$


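Langville and Meyer Algorithm 2

•Reorder rows and columns so that all submatrices have zero rows at bottom

•Solve $x_1^T (I - \alpha H_{11}) = v_1^T$ for $x_1^T$

•For i = 2 to b, compute $x_i^T = \alpha \sum_{j=1}^{i-1} x_j^T H_{ji} + v_i^T$

•Normalize $\pi^T = [x_1^T \;\; x_2^T \;\; \cdots \;\; x_b^T] / \|[x_1^T \;\; x_2^T \;\; \cdots \;\; x_b^T]\|_1$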
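A sketch of the recursive idea, again under assumptions: uniform teleportation vector, damping 0.85, a hypothetical five-page web, and dense arrays. The helper `reorder_blocks` (an illustrative name) repeatedly peels off the pages whose rows are zero within the pages still remaining; the pages left over form the block that needs a solve, and every other block follows by forward substitution.

```python
import numpy as np


def reorder_blocks(H):
    """Return index blocks [core, ..., dangling]; each peeled block has zero
    rows within the submatrix that remained when it was peeled."""
    remaining = np.arange(H.shape[0])
    peeled = []
    while remaining.size:
        sub = H[np.ix_(remaining, remaining)]
        zero = sub.sum(axis=1) == 0
        if not zero.any():
            break
        peeled.append(remaining[zero])
        remaining = remaining[~zero]
    return [remaining] + peeled[::-1]


def pagerank_algorithm2(H, alpha=0.85):
    n = H.shape[0]
    v = np.full(n, 1.0 / n)                 # assumed uniform teleportation vector
    blocks = reorder_blocks(H)
    x = np.zeros(n)
    core = blocks[0]
    if core.size:
        # The only solve: x1^T (I - alpha H11) = v1^T.
        H11 = H[np.ix_(core, core)]
        x[core] = np.linalg.solve((np.eye(core.size) - alpha * H11).T, v[core])
    for i in range(1, len(blocks)):
        bi = blocks[i]
        acc = np.zeros(bi.size)
        for j in range(i):                  # contributions from earlier blocks
            bj = blocks[j]
            acc += x[bj] @ H[np.ix_(bj, bi)]
        x[bi] = alpha * acc + v[bi]
    return x / x.sum()


# Hypothetical web: 0 <-> 1 form a core; 2 -> 3 -> 4, and page 4 is dangling.
H = np.array([
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
blocks = reorder_blocks(H)
pi = pagerank_algorithm2(H)
```

On this example the solve shrinks to a 2 × 2 system, but the repeated search for zero rows is extra work that has to be weighed against that saving.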

Problem with Algorithm 2

•Finding submatrices of zero rows takes longer than the time saved in the solve step

•L&M wait until all submatrices are reordered before solving the primary system

Proposal

•As each submatrix is isolated, send it out for parallel solving


Sources

DeGroot, M. and Schervish, M., Probability and Statistics, 3rd Ed., Addison Wesley, 2002

Langville, A. and Meyer, C., A Reordering for the PageRank Problem, SIAM Journal on Scientific Computing, Vol. 27 No. 6, 2006

Langville, A. and Meyer, C., Deeper Inside PageRank, 2004

Lee, C., Golub, G. and Zenios, S., A Fast Two-Stage Algorithm for Computing PageRank, undated

Rebaza, J., Lecture Notes