5 Understanding Page Rank

Understanding Google’s PageRank™

Amy Langville, Carl Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006

description

An introductory lecture on the Google PageRank algorithm stressing the mathematical underpinnings. Based on the excellent book by Langville & Meyer.

Transcript of 5 Understanding Page Rank

Page 1: 5 Understanding Page Rank

Understanding Google’s PageRank™

Amy Langville, Carl Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006

Page 2: 5 Understanding Page Rank

Review: The Search Engine

Page 3: 5 Understanding Page Rank

An Elegant Formula

$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

Google’s (Brin & Page) PageRank™ equation.

US Patent #6285999, filed 1998, granted 2001

This formula resolves the world’s largest matrix calculation.

Page 4: 5 Understanding Page Rank

$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

Derived from a formula B&P worked out in graduate school (itself derived from traditional bibliometrics research literature).

$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$

Essential characteristic: high-ranking pages associate with high-ranking pages

Page 5: 5 Understanding Page Rank

$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$

Must be applied to a set of linked pages, or a graph. To do this we analyze the graph to see its out-links and back-links.

Therefore. . .

$r(P_i)$ : the rank of a given page

$P_j \in B_{P_i}$ : the set of back-linking pages (the pages that link to $P_i$)

$r(P_j)$ : the rank of a back-linking page

$|P_j|$ : the number of out-links on page $P_j$
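As a rough illustration of the out-link and back-link analysis, the sketch below evaluates the sum for a single page. The `links` dictionary is an assumption: a six-page link list chosen to match the example matrix shown later in these slides. The helper names `back_links` and `rank_of` are hypothetical.

```python
from fractions import Fraction

# Assumed example: page -> pages it links to (its out-links).
links = {
    1: [2, 3],
    2: [],
    3: [1, 2, 5],
    4: [5, 6],
    5: [4, 6],
    6: [4],
}

def back_links(links):
    """Invert the out-link lists: page -> list of pages that link to it."""
    incoming = {page: [] for page in links}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)
    return incoming

def rank_of(page, ranks, links, incoming):
    """r(P_i) = sum over back-linking pages P_j of r(P_j) / |P_j|."""
    return sum((ranks[j] / len(links[j]) for j in incoming[page]), Fraction(0))

incoming = back_links(links)
equal_ranks = {page: Fraction(1, 6) for page in links}
print(incoming[4])                               # pages that link to page 4
print(rank_of(4, equal_ranks, links, incoming))  # one application of the formula
```

Exact fractions are used here only so the result can be compared by hand with the slides; floating point would work equally well.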

Page 6: 5 Understanding Page Rank

$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

A site graph like this:

[Figure: site graph of six pages, numbered 1 to 6]

Page 7: 5 Understanding Page Rank

$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

becomes a directed graph like this:

[Figure: directed graph of the six pages, nodes 1 to 6]

Page 8: 5 Understanding Page Rank

But there’s a problem

Nothing’s ranked! The formula needs the ranks of the back-linking pages, and those are not known yet.

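$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$

$r(P_i)$ : the rank of a given page

$P_j \in B_{P_i}$ : the set of back-linking pages (the pages that link to $P_i$)

$r(P_j)$ : the rank of a back-linking page

$|P_j|$ : the number of out-links on page $P_j$

[Figure: directed graph of the six pages, nodes 1 to 6]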

Page 9: 5 Understanding Page Rank

The solution. . . sort of

Start by assuming all the ranks are equal. In this example each page is just 1 of 6, so the initial rank is expressed as 1/6.

Then, you keep feeding the ranks back through the formula until you get a ranking.

This results in a rank matrix. . .

[Figure: directed graph of the six pages, nodes 1 to 6]

Page 10: 5 Understanding Page Rank

Directed graph iterative node values

Page   r0    r1     r2      Rank at iteration 2
P1     1/6   1/18   1/36    5
P2     1/6   5/36   1/18    4
P3     1/6   1/12   1/36    5
P4     1/6   1/4    17/72   1
P5     1/6   5/36   11/72   3
P6     1/6   1/6    14/72   2

[Figure: directed graph of the six pages, nodes 1 to 6]
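The iteration behind these columns can be sketched in a few lines. This is a minimal sketch, assuming the link structure below (read off the example matrix on the later slides); `step` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumed link structure (page -> out-links), read off the example matrix.
links = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}

def step(ranks):
    """Apply r(P_i) = sum of r(P_j)/|P_j| over back-linking pages, for every page."""
    new_ranks = {page: Fraction(0) for page in links}
    for source, targets in links.items():
        for target in targets:
            # Each page passes an equal share of its rank along each out-link.
            new_ranks[target] += ranks[source] / len(targets)
    return new_ranks

r0 = {page: Fraction(1, len(links)) for page in links}  # all ranks equal
r1 = step(r0)
r2 = step(r1)

for page in links:
    print(f"P{page}  {r0[page]}  {r1[page]}  {r2[page]}")
```

`Fraction` prints values in lowest terms, so an entry such as 14/72 appears as 7/36.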

Page 11: 5 Understanding Page Rank

CMS matrix

This can’t go on forever

Some values are equivalent (ties).

In the interest of speed and efficiency, we need to know if the ranks converge—that is, will we break all ties, or will we keep doing this indefinitely and never have a decisive ranking?

To determine this, the formula must be transformed using a binary adjacency transformation and Markov chain theory.

[Figure: directed graph of the six pages, nodes 1 to 6]
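Before turning to the theory, one can at least check empirically whether the ranks settle on this small example. The sketch below is an experiment, not a proof: the link structure, the tolerance, and the step limit are all assumptions, and `iterate` is a hypothetical helper.

```python
# Assumed link structure (page -> out-links), read off the example matrix.
links = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}

def step(ranks):
    """One pass of the rank formula over every page."""
    new_ranks = {page: 0.0 for page in links}
    for source, targets in links.items():
        for target in targets:
            new_ranks[target] += ranks[source] / len(targets)
    return new_ranks

def iterate(tolerance=1e-12, max_steps=1000):
    """Repeat the update until no rank changes by more than `tolerance`."""
    ranks = {page: 1.0 / len(links) for page in links}
    for steps in range(1, max_steps + 1):
        new_ranks = step(ranks)
        change = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if change < tolerance:
            return ranks, steps, True
    return ranks, max_steps, False

ranks, steps, converged = iterate()
print("converged:", converged, "after", steps, "steps")
print("total rank:", sum(ranks.values()))
for page, value in ranks.items():
    print(f"P{page}: {value:.4f}")
```

On this graph the total rank falls below 1, because page 2 has no out-links and the rank that reaches it is never passed on. Pages 1 to 3 drift toward zero and stay tied, which is one reason the raw formula needs the adjustments introduced later.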

Page 12: 5 Understanding Page Rank

Convert the iterative calculation to a matrix calculation using a binary adjacency transformation, with the ranks held in a 1×n matrix (a row vector).

      P1    P2    P3    P4    P5    P6
P1    0     1/2   1/2   0     0     0
P2    0     0     0     0     0     0
P3    1/3   1/3   0     0     1/3   0
P4    0     0     0     0     1/2   1/2
P5    0     0     0     1/2   0     1/2
P6    0     0     0     1     0     0
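One way to read the transformation is as two steps: mark each link with a 1, then divide every row by its number of out-links. The sketch below follows that reading. The link list and the names `A` and `H` are labels chosen here for illustration, not taken from the slides.

```python
from fractions import Fraction

# Assumed link structure (page -> out-links) for the six-page example.
links = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}
pages = sorted(links)

# Step 1: binary adjacency matrix, A[i][j] = 1 if page i links to page j.
A = [[1 if q in links[p] else 0 for q in pages] for p in pages]

# Step 2: divide each row by its out-link count (rows with no links stay zero).
H = []
for row in A:
    out_count = sum(row)
    H.append([Fraction(a, out_count) if out_count else Fraction(0) for a in row])

for p, row in zip(pages, H):
    print(f"P{p}", *row)
```

Each row holds the shares that one page hands to the pages it links to, so a row sums to 1 unless the page has no out-links at all.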

Page 13: 5 Understanding Page Rank

Now, you can treat a row as a vector, or set of values

      P1    P2    P3    P4    P5    P6
P1    0     1/2   1/2   0     0     0
P2    0     0     0     0     0     0
P3    1/3   1/3   0     0     1/3   0
P4    0     0     0     0     1/2   1/2
P5    0     0     0     1/2   0     1/2
P6    0     0     0     1     0     0

$\pi$ : the vector of ranks

Page 14: 5 Understanding Page Rank

This is a sparse matrix. That’s good.

      P1    P2    P3    P4    P5    P6
P1    0     1/2   1/2   0     0     0
P2    0     0     0     0     0     0
P3    1/3   1/3   0     0     1/3   0
P4    0     0     0     0     1/2   1/2
P5    0     0     0     1/2   0     1/2
P6    0     0     0     1     0     0
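Sparsity is good because only the nonzero entries need to be stored or multiplied. A minimal sketch, assuming a plain dictionary keyed by (row, column) as the storage format; real systems use dedicated sparse formats, and `row_vector_times_sparse` is a hypothetical helper.

```python
from fractions import Fraction

# Assumed sparse storage: only the nonzero entries, keyed by (row, column).
nonzero = {
    (1, 2): Fraction(1, 2), (1, 3): Fraction(1, 2),
    (3, 1): Fraction(1, 3), (3, 2): Fraction(1, 3), (3, 5): Fraction(1, 3),
    (4, 5): Fraction(1, 2), (4, 6): Fraction(1, 2),
    (5, 4): Fraction(1, 2), (5, 6): Fraction(1, 2),
    (6, 4): Fraction(1),
}
n = 6

def row_vector_times_sparse(vector, entries):
    """Compute vector * matrix touching only the stored nonzero entries."""
    result = {page: Fraction(0) for page in vector}
    for (i, j), value in entries.items():
        result[j] += vector[i] * value
    return result

start = {page: Fraction(1, n) for page in range(1, n + 1)}
product = row_vector_times_sparse(start, nonzero)
print(len(nonzero), "stored entries instead of", n * n)
print(product)
```

The work per multiplication grows with the number of links rather than with the square of the number of pages.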

Page 15: 5 Understanding Page Rank

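$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

So now this:

$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$

Has become this:

$\pi^T = \pi^T H$

where $H$ is the matrix from the previous slides and $\pi^T$ is the row vector of ranks.

We only need a couple more adjustments.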
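To see why the two forms agree, it may help to write out one entry. Assuming the matrix from the previous slides is called $H$ (a label used here for illustration) and $\pi^T$ holds the ranks $r(P_1), \dots, r(P_6)$, the fourth entry of the product picks up only the two nonzero values in column P4:

```latex
\begin{aligned}
(\pi^T H)_4 &= \sum_{j=1}^{6} \pi_j H_{j4} \\
            &= \pi_5 \cdot \tfrac{1}{2} + \pi_6 \cdot 1 \\
            &= \frac{r(P_5)}{|P_5|} + \frac{r(P_6)}{|P_6|}
\end{aligned}
```

This is the original sum for $r(P_4)$, since P5 (two out-links) and P6 (one out-link) are the only pages that link to P4.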

Page 16: 5 Understanding Page Rank

$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

Sometimes, people teleport to a page. They just enter the URL and go. And just as easily, they can teleport out. To account for this, B&P added two adjustments:

S accounts for people who reach a dead end and jump to another page within a site. $\alpha$ is a weighted probability that someone will leave the page by following its links.

S is a matrix of probable page destinations.
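A minimal sketch of the dead-end adjustment follows. It assumes that a surfer at a dead end jumps to any of the n pages with equal probability; the slides do not state the jump rule, so the uniform row is an assumption, as is the helper name `fix_dead_ends`.

```python
from fractions import Fraction

# Row-normalized link matrix from the six-page example (row = source page).
half, third = Fraction(1, 2), Fraction(1, 3)
H = [
    [0, half, half, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],              # P2 is a dead end: no out-links
    [third, third, 0, 0, third, 0],
    [0, 0, 0, 0, half, half],
    [0, 0, 0, half, 0, half],
    [0, 0, 0, 1, 0, 0],
]

def fix_dead_ends(matrix):
    """Replace each all-zero row with a uniform row (assumed jump rule)."""
    n = len(matrix)
    uniform = [Fraction(1, n)] * n
    return [list(row) if any(row) else list(uniform) for row in matrix]

S = fix_dead_ends(H)
print(S[1])                       # the repaired row for P2
print([sum(row) for row in S])    # every row now sums to 1
```

With every row summing to 1, each row can be read as a set of probabilities for where a surfer goes next.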

Page 17: 5 Understanding Page Rank

$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

What about people who jump out to a completely new destination? To account for this, B&P added the final adjustments:

$1-\alpha$ is the complementary weighted probability that someone will leave and go to a completely new site.

E is a random teleportation matrix of probable page destinations.
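Putting the pieces together, the full formula can be tried on the six-page example. This is a sketch under stated assumptions, not Google's implementation: `ALPHA = 0.85` is an assumed value (the slides give none), both the dead-end rows of S and the matrix E are assumed to be uniform with entries 1/n, and `G` and `power_iteration` are names chosen here.

```python
# Row-normalized link matrix from the six-page example (row = source page).
H = [
    [0, 1/2, 1/2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1/3, 1/3, 0, 0, 1/3, 0],
    [0, 0, 0, 0, 1/2, 1/2],
    [0, 0, 0, 1/2, 0, 1/2],
    [0, 0, 0, 1, 0, 0],
]
n = len(H)
ALPHA = 0.85  # assumed value; the slides do not give one

# S: dead-end rows replaced by a uniform row (assumed jump rule).
S = [row if any(row) else [1 / n] * n for row in H]
# E: uniform teleportation matrix (assumed), every entry 1/n.
E = [[1 / n] * n for _ in range(n)]
# G = alpha*S + (1 - alpha)*E, the matrix inside the parentheses.
G = [[ALPHA * S[i][j] + (1 - ALPHA) * E[i][j] for j in range(n)] for i in range(n)]

def power_iteration(matrix, tolerance=1e-12, max_steps=1000):
    """Repeat pi <- pi * matrix from a uniform start until pi stops changing."""
    pi = [1 / n] * n
    for _ in range(max_steps):
        new_pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tolerance:
            return new_pi
        pi = new_pi
    return pi

pi = power_iteration(G)
order = sorted(range(n), key=lambda j: pi[j], reverse=True)
print([f"P{j + 1}" for j in order])   # pages from highest to lowest rank
print(round(sum(pi), 6))              # the ranks form a probability vector
```

Unlike the raw iteration, every page ends up with a positive rank and the ranks sum to 1, so a full ordering of the pages can be read off.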