N COMP 465: Data Mining More on PageRankcs.rhodes.edu/welshc/COMP465_S15/Lecture16.pdf4/14/2015 1...
Transcript of N COMP 465: Data Mining More on PageRankcs.rhodes.edu/welshc/COMP465_S15/Lecture16.pdf4/14/2015 1...
4/14/2015
1
COMP 465: Data Mining More on PageRank
Slides Adapted From: www.mmds.org (Mining Massive Datasets)
Power Iteration:
Set ๐๐ = 1/N
1: ๐โฒ๐ = ๐๐
๐๐๐โ๐
2: ๐ = ๐โฒ
Goto 1
Example: ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 โฆ 6/15
rm 1/3 1/6 3/12 1/6 3/15
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
y
a m
y a m
y ยฝ ยฝ 0
a ยฝ 0 1
m 0 ยฝ 0
2
Iteration 0, 1, 2, โฆ
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Power Iteration:
Set ๐๐ = 1/N
1: ๐โฒ๐ = ๐๐
๐๐๐โ๐
2: ๐ = ๐โฒ
Goto 1
Example: ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 โฆ 6/15
rm 1/3 1/6 3/12 1/6 3/15
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
y
a m
y a m
y ยฝ ยฝ 0
a ยฝ 0 1
m 0 ยฝ 0
3
Iteration 0, 1, 2, โฆ
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Imagine a random web surfer:
At any time ๐, surfer is on some page ๐
At time ๐ + ๐, the surfer follows an out-link from ๐ uniformly at random
Ends up on some page ๐ linked from ๐
Process repeats indefinitely
Let: ๐(๐) โฆ vector whose ๐th coordinate is the
prob. that the surfer is at page ๐ at time ๐
So, ๐(๐) is a probability distribution over pages
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4
ji
ij
rr
(i)dout
j
i1 i2 i3
4/14/2015
2
Where is the surfer at time t+1?
Follows a link uniformly at random
๐ ๐ + ๐ = ๐ด โ ๐(๐)
Suppose the random walk reaches a state ๐ ๐ + ๐ = ๐ด โ ๐(๐) = ๐(๐)
then ๐(๐) is stationary distribution of a random walk
Our original rank vector ๐ satisfies ๐ = ๐ด โ ๐
So, ๐ is a stationary distribution for the random walk
)(M)1( tptp
j
i1 i2 i3
5 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Does this converge?
Does it converge to what we want?
Are results reasonable?
ji
t
it
j
rr
i
)()1(
d Mrr or
equivalently
7 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Example: ra 1 0 1 0
rb 0 1 0 1
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8
=
b a
Iteration 0, 1, 2, โฆ
ji
t
it
j
rr
i
)()1(
d
4/14/2015
3
Example: ra 1 0 0 0
rb 0 1 0 0
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9
=
b a
Iteration 0, 1, 2, โฆ
ji
t
it
j
rr
i
)()1(
d
2 problems: (1) Some pages are
dead ends (have no out-links)
Random walk has โnowhereโ to go to
Such pages cause importance to โleak outโ
(2) Spider traps:
(all out-links are within the group)
Random walked gets โstuckโ in a trap
And eventually spider traps absorb all importance
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10
Dead end
Power Iteration:
Set ๐๐ = 1
๐๐ = ๐๐
๐๐๐โ๐
And iterate
Example: ry 1/3 2/6 3/12 5/24 0
ra = 1/3 1/6 2/12 3/24 โฆ 0
rm 1/3 3/6 7/12 16/24 1
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11
Iteration 0, 1, 2, โฆ
y
a m
y a m
y ยฝ ยฝ 0
a ยฝ 0 0
m 0 ยฝ 1
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm
m is a spider trap
All the PageRank score gets โtrappedโ in node m.
The Google solution for spider traps: At each time step, the random surfer has two options
With prob. , follow a link at random
With prob. 1-, jump to some random page
Common values for are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps
12 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
y
a m
y
a m
4/14/2015
4
Power Iteration:
Set ๐๐ = 1
๐๐ = ๐๐
๐๐๐โ๐
And iterate
Example: ry 1/3 2/6 3/12 5/24 0
ra = 1/3 1/6 2/12 3/24 โฆ 0
rm 1/3 1/6 1/12 2/24 0
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13
Iteration 0, 1, 2, โฆ
y
a m
y a m
y ยฝ ยฝ 0
a ยฝ 0 0
m 0 ยฝ 0
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2
Here the PageRank โleaksโ out since the matrix is not stochastic.
Teleports: Follow random teleport links with probability 1.0 from dead-ends
Adjust matrix accordingly
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 14
y
a m
y a m
y ยฝ ยฝ โ
a ยฝ 0 โ
m 0 ยฝ โ
y a m
y ยฝ ยฝ 0
a ยฝ 0 0
m 0 ยฝ 0
y
a m
Why are dead-ends and spider traps a problem and why do teleports solve the problem? Spider-traps are not a problem, but with traps
PageRank scores are not what we want
Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
Dead-ends are a problem
The matrix is not column stochastic so our initial assumptions are not met
Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15
Googleโs solution that does it all: At each step, random surfer has two options:
With probability , follow a link at random
With probability 1-, jump to some random page
PageRank equation [Brin-Page, 98]
๐๐ = ๐ฝ ๐๐๐๐
๐โ๐
+ (1 โ ๐ฝ)1
๐
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 16
di โฆ out-degree of node i
This formulation assumes that ๐ด has no dead ends. We can either
preprocess matrix ๐ด to remove all dead ends or explicitly follow random
teleport links with probability 1.0 from dead-ends.
4/14/2015
5
PageRank equation [Brin-Page, โ98]
๐๐ = ๐ฝ๐๐๐๐
๐โ๐
+ (1 โ ๐ฝ)1
๐
The Google Matrix A:
๐ด = ๐ฝ ๐ + 1 โ ๐ฝ1
๐ ๐ร๐
We have a recursive problem: ๐ = ๐จ โ ๐ And the Power method still works!
What is ?
In practice =0.8,0.9 (make 5 steps on avg., jump)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17
[1/N]NxNโฆN by N matrix
where all entries are 1/N
y
a =
m
1/3
1/3
1/3
0.33
0.20
0.46
0.24
0.20
0.52
0.26
0.18
0.56
7/33
5/33
21/33
. . .
18 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
y
a m
13/15
7/15
1/2 1/2 0
1/2 0 0
0 1/2 1
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
y 7/15 7/15 1/15
a 7/15 1/15 1/15
m 1/15 7/15 13/15
0.8 + 0.2
M [1/N]NxN
A
Key step is matrix-vector multiplication rnew = A โ rold
Easy if we have enough main memory to hold A, rold, rnew
Say N = 1 billion pages We need 4 bytes for
each entry (say)
2 billion entries for vectors, approx 8GB
Matrix A has N2 entries 1018 is a large number!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20
ยฝ ยฝ 0
ยฝ 0 0
0 ยฝ 1
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
7/15 7/15 1/15
7/15 1/15 1/15
1/15 7/15 13/15
0.8 +0.2
A = โM + (1-) [1/N]NxN
=
A =
4/14/2015
6
Suppose there are N pages Consider page i, with di out-links
We have Mji = 1/|di| when i โ j and Mji = 0 otherwise
The random teleport is equivalent to: Adding a teleport link from i to every other page
and setting transition probability to (1-)/N
Reducing the probability of following each out-link from 1/|di| to /|di|
Equivalent: Tax each page a fraction (1-) of its score and redistribute evenly
21 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
๐ = ๐จ โ ๐, where ๐จ๐๐ = ๐ท ๐ด๐๐ +๐โ๐ท
๐ต
๐๐ = ๐ด๐๐ โ ๐๐๐i=1
๐๐ = ๐ฝ ๐๐๐ +1โ๐ฝ
๐โ ๐๐
๐๐=1
= ๐ฝ ๐๐๐ โ ๐๐ +1โ๐ฝ
๐๐i=1 ๐๐
๐i=1
= ๐ฝ ๐๐๐ โ ๐๐ +1โ๐ฝ
๐๐i=1 since ๐๐ = 1
So we get: ๐ = ๐ท ๐ด โ ๐ +๐โ๐ท
๐ต ๐ต
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 22
[x]N โฆ a vector of length N with all entries x Note: Here we assumed M
has no dead-ends
We just rearranged the PageRank equation
๐ = ๐ท๐ด โ ๐ +๐ โ ๐ท
๐ต๐ต
where [(1-)/N]N is a vector with all N entries (1-)/N
M is a sparse matrix! (with no dead-ends)
10 links per node, approx 10N entries So in each iteration, we need to:
Compute rnew = M โ rold
Add a constant value (1-)/N to each entry in rnew
Note if M contains dead-ends then ๐๐๐๐๐
๐ < ๐ and we also have to renormalize rnew so that it sums to 1
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 23
Input: Graph ๐ฎ and parameter ๐ท Directed graph ๐ฎ (can have spider traps and dead ends) Parameter ๐ท
Output: PageRank vector ๐๐๐๐
Set: ๐๐๐๐๐ =
1
๐
repeat until convergence: ๐๐๐๐๐ค โ ๐๐
๐๐๐ > ๐๐
โ๐: ๐โฒ๐๐๐๐ = ๐ท
๐๐๐๐๐
๐ ๐๐โ๐
๐โฒ๐๐๐๐ = ๐ if in-degree of ๐ is 0
Now re-insert the leaked PageRank:
โ๐: ๐๐๐๐๐ = ๐โฒ๐
๐๐๐+๐โ๐บ
๐ต
๐๐๐๐ = ๐๐๐๐
24
where: ๐ = ๐โฒ๐๐๐๐ค
๐
If the graph has no dead-ends then the amount of leaked PageRank is 1-ฮฒ. But since we have dead-ends
the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org