Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody...

29
Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer

Transcript of Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody...

Page 1: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Exploiting Web Matrix Permutations to Speedup PageRank ComputationPresented by: Aries Chan, Cody Lawson, and Michael Dwyer

Page 2: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Introduction

Internet Statistics 151 million active internet user as

of January 2004 76% used a search engine at least

once during the month Average time spent searching was

about 40 minutes

Page 3: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Introduction

Search Engines Most common means of accessing

Web Easiest method of organizing and

accessing information Therefore, high quality / usable

search engines are very importantRank Provider Searches (000) Share of Searches

- All Search 10,812,734 100.0%

1 Google Search 6,986,580 64.6%

2 Yahoo! Search 1,726,060 16.0%

3 MSN/Windows Live/Bing Search 1,156,415 10.7%

Page 4: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Introduction

Basic Idea of a Search Engine Scanning Web Graph

Using crawlers to create an index of the information

Ranking Organizing and ordering information in

a usable way

Page 5: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Introduction

Webpage Ranking Problems Personalization

http://www.google.com/history/?hl=en Updating – Keeping order up to date

25% of links are changed in one week 5% “new content” in one week 35% of entire web changed in eleven

weeks Reducing Computation Time

Page 6: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Introduction

Accelerating the PageRank Use compression to fit the Web

Graph into main memory Use sequence of scans of the Web

Graph to efficiently compute in external memory

Reduce computation time through numerical methods

Page 7: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Introduction Reduce computation time through

numerical methods Use iterative methods such as Gauss-

Seidel and Jacobi method (Arasu et al and Bianchini et al)

Subtract off estimates of non-principal eigenvectors (Kamvar et al )

Sort the graph lexicographically by url resulting in an approximate block structure (Kamvar et al)

Split into two problems: dangling nodes and non-dangling nodes (Lee et al)

Others have been working on ways to only update the ranks of nodes influenced by Web Graph changes

Page 8: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Introduction

Del Corso, Gulli, and Romani(authors of the paper) Numerical optimization of PageRank

View the PageRank computation as a linear system

Transform a dense problem into one which uses a sparse matrix

Treatment of dangling nodes which naturally adapts to the random surfer model

Exploiting web matrix permutations Increase data locality and reduce number

of iterations necessary to solve the problem

Page 9: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Google’s PageRank Overview Web as an oriented graph The random surfer

vi = ∑ pji vj

(sum of PageRanks of nodes linking to i weighted by the transition probability)

Equilibrium distribution vT = vTH (left eigenvector of H with

eigenvalue 1)

Page 10: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Problems with the ideal model Dangling nodes (trap the user)

Impose a random jump to every other page

B = H + auT

Cyclic paths in the Web Graph (reducibility) Brin and Page suggested adding

artificial transitions (low probability jump to all nodes)

G = B + (1 )euT

Page 11: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Current PageRank Solution Since G is just a rank one

modification of H, the power method takes advantage of the sparsity of matrix H.

Page 12: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Google’s PageRank eigenproblem as a linear system Expand

vTG = vT (eigenproblem) G = (H + auT) + (1 )euT

(expansion of Google matrix) vT (H + auT) + (1 )vTeuT = vT

Restructure Taking S = I HT uaT

And with vTe = 1 Sv = (1 )u

(after taking transpose and rearranging)

Page 13: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Dangling Nodes Problems

Dangling Nodes Pages with no links to other pages Pages whose existence is inferred

but crawler has not reached According to a 2001 sample,

approximately 76% are dangling nodes.

Page 14: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Dangling Nodes Problems Natural Model

B = H + auT

Jump with probability 1 to any other node Drastic Model

Completely removes dangling nodes Problems

Dangling nodes themselves are not ranked Removing nodes create new dangling nodes Self-loop Model eg.

B = H + F Fij = {1 if i = j & dangling; 0 otherwise} Still row stochastic and is similar to

natural Problem

Gives unreasonably high rank to the dangling nodes

Page 15: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Which model is the best?

Natural model the most “accurate”. Problem

Gives a much more dense matrix B It is at least partially for this reason

that the problem is approached as an eigenproblem to exploit the sparsity of H

Can we have an equally lightweight iterative approach?

Page 16: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Iterative Approach with Sparsity Sparse matrix R

R = I HT

PageRank v obtained by solving Ry = u, v = y such that ||v||1=1

Why does this work? Since S = R uaT

and Sv = (1 )u (R uaT)v = (1 )u Use Sherman-Morrison formula to

calculate the inverse

Page 17: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Iterative Approach with Sparsity We get

The vector v is our PageRank vector and was solved using sparse matrix R

(R uaT)-1 = R-1 + R-1uaTR-1

1/ + aTR-1u

v = (1 )(1 +aTy

)y1/ + aTy

v = y

= (1 )(1 +aTy

)1/ + aTy

Page 18: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Exploiting Web Matrix Permutations Use a variety of “cheap” operators to

permute the web graph in an organized way in hopes to: increase data locality reduce the number of iterations in order

to solve the problem Explore different iterative methods in

order to solve the problem quicker. Compare the performance of different

iterative methods based on specifically permuted web graphs.

Page 19: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Permutation Strategies The following operations were chosen

based on their “limited impact on the computational cost” (Del Corso et al.) O - orders nodes by increasing out-degree Q - orders nodes by decreasing out-degree X - orders nodes by increasing in-degree Y - orders nodes by decreasing in-degree B - ordering the nodes according to their

BFS (Breadth First Search) order of visit T - transposes the matrix

Page 20: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Permutation Strategies (cont.) O, Q, X, Y, and T operators a full matrix The B operator conveniently arranges R into

a lower block triangular structure due to BFS order of visit.

Combining these operations on R, the following structures of the permuted web matrix are produced.

Page 21: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Permutation Strategies (cont.) Visual representations of the permuted web

graph.

Page 22: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Iterative Strategies Power Method

Computes dominant eigenvector Jacobi Method

Using an initial guess, approximates the solution to the linear system of equations. Each successive iteration uses the previous approximated solution as its next guess till a degree of convergence is reached.(*further explained)

Gauss-Seidel Method (Reverse Gauss-Seidel) Modification of the Jacobi Method, which approximates

the solution to each successive equation in the linear system based on the values derived from the previous equations, all within each iterative loop. (*further explained)

*(http://college.cengage.com/mathematics/larson/elementary_linear/5e/students/ch08-10/chap_10_2.pdf)

Page 23: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Iterative Strategies (cont.) Further exploration led to iterative

methods based on the distinct block structure of certain web graph permutations.

DN (or DNR) Method The permuted matrix ROT has the property that the

lower diagonal block coincides with the identity matrix. The matrix can be easily partitioned into non-dangling and dangling portions. Then the non-dangling part is solved by Gauss-Seidel (or Reverse Gauss-Seidel respectively).

LB/UB/LBR/UBR Methods Uses the Gauss-Seidel or Reverse Gauss-Seidel

methods to solve the individual blocks of the triangular block matrices produced by the B operator.

Page 24: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Results

Page 25: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Results (cont.)

For both the Power Method and Jacobi Method, the number of Mflops is not dependent on permutations of the web matrix. (“the small differences in numerical data are due to the finite precision” [Del Corso et al.])

The Jacobi Method (applied to the matrix R) is only a slight improvement (about 3%) compared to the Power Method (applied to the matrix G).

Page 26: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Results (cont.)

The Gauss-Seidel and Reverse Gauss-Seidel Methods reduced Mflops by around 40% and running time by around 45% on the full matrix compared to the Power Method on the full matrix.

In particular the Reverse Gauss-Seidel Method performed on the permuted matrix RYTB reduced the number of Mflops by 51% and running time by 82% when compared to the Power method on the full matrix.

Page 27: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Results (cont.)

The block methods achieved even better results In particular, the best overall reduction in computation time and

number of Mflops was achieved by LBR method on the permuted matrix RQTB.

58% reduction in terms of Mflops 89% reduction in terms of running time

Page 28: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

Conclusion Objective:

Accelerating Google PageRank by numerical methods

Contribution: Viewed web matrix as a sparse linear

system Formalized new method for treating

dangling nodes Explored new iterative methods and

applied them to web matrix permutations

Achievement: 1/10 of the computation time Reduced over 50% Mflops

Page 29: Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.

References

G.M. Del Corso, A. Gulli, F. Romani, “Exploiting Web Matrix Permutations to Speedup PageRank Computation”

Nielsen MegaView Search (http://en-us.nielsen.com/rankings/insights/rankings/internet)