Fast Katz and Commuters

FAST KATZ AND COMMUTERS Quadrature Rules and Sparse Linear Solvers for Link Prediction Heuristics David F. Gleich Sandia National Labs Berkeley LAPACK Seminar February 3, 2011 With Pooya Esfandiar, Francesco Bonchi, Chen Greif, Laks V. S. Lakshmanan, and Byung-Won On David F. Gleich (Sandia) Berkeley SCMC Seminar Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. 1 / 43

Transcript of Fast Katz and Commuters

Page 1: Fast Katz and Commuters

FAST KATZ AND COMMUTERS

Quadrature Rules and Sparse Linear Solvers

for Link Prediction Heuristics

David F. Gleich

Sandia National Labs

Berkeley LAPACK Seminar

February 3, 2011

With Pooya Esfandiar, Francesco Bonchi, Chen Greif,

Laks V. S. Lakshmanan, and Byung-Won On

David F. Gleich (Sandia) Berkeley SCMC Seminar

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin

Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

1 / 43

Page 2: Fast Katz and Commuters

MAIN RESULTS

A – adjacency matrix

L – Laplacian matrix

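Katz score: $K_{i,j} = \left[ (I - \alpha A)^{-1} - I \right]_{i,j}$

Commute time: $C_{i,j} = \mathrm{vol}(G) \left( L^\dagger_{i,i} + L^\dagger_{j,j} - 2 L^\dagger_{i,j} \right)$

For Katz

Compute one $K_{i,j}$ fast

Compute top $K_{i,:}$ fast

For Commute

Compute one $C_{i,j}$ fast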

Page 3: Fast Katz and Commuters

OUTLINE

Why study these measures?

Katz Rank and Commute Time

How else do people compute them?

Quadrature rules for pairwise scores

Sparse linear system solves for top-k

As many results as we have time for…


Page 4: Fast Katz and Commuters

WHY? LINK PREDICTION

Liben-Nowell and Kleinberg (2003, 2006) found that path-based link prediction was more efficient

Neighborhood based

Path based


Page 5: Fast Katz and Commuters

NOTE

All graphs are undirected

All graphs are connected


Page 6: Fast Katz and Commuters

LEO KATZ


Page 7: Fast Katz and Commuters

NOT QUITE, WIKIPEDIA

$A$: adjacency, $P = D^{-1} A$: random walk

PageRank $(I - \alpha P^T) x = (1 - \alpha) e / n$

Katz $(I - \alpha A^T) k = \alpha A^T e$

These are equivalent if $G$ has constant degree

Page 8: Fast Katz and Commuters

WHAT KATZ ACTUALLY SAID

Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometrika 18(1):39-43

“we assume that each link independently has the same probability of being effective” …

“we conceive a constant $\alpha$, depending on the group and the context of the particular investigation, which has the force of a probability of effectiveness of a single link. A k-step chain then, has probability $\alpha^k$ of being effective.”

“We wish to find the column sums of the matrix”

Page 9: Fast Katz and Commuters

A MODERN TAKE

The Katz score (node-based) is

$k_j = \sum_i \sum_{\ell \ge 1} \alpha^\ell \, \mathrm{paths}_\ell(i, j)$

The Katz score (edge-based) is

$K_{i,j} = \sum_{\ell \ge 1} \alpha^\ell \, \mathrm{paths}_\ell(i, j)$

Page 10: Fast Katz and Commuters

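RETURNING TO THE MATRIX

$\mathrm{paths}_\ell(i, j) = [A^\ell]_{i,j}$

$K = \sum_{\ell \ge 1} \alpha^\ell A^\ell = (I - \alpha A)^{-1} - I$

Carl Neumann: $(I - \alpha A)^{-1} = \sum_{\ell \ge 0} (\alpha A)^\ell$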

Page 11: Fast Katz and Commuters

Carl Neumann

I’ve heard the Neumann series called the “von Neumann”

series more than I’d like! In fact, the von Neumann kernel

of a graph should be named the “Neumann” kernel!

Wikipedia page

Page 12: Fast Katz and Commuters

PROPERTIES OF KATZ’S MATRIX

$K$ is symmetric

$K$ exists when $\alpha < 1/\|A\|_2$

$(I - \alpha A)$ is spd when $\alpha < 1/\|A\|_2$

Note that $\alpha <$ 1/max-degree suffices
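A minimal NumPy sketch of the Neumann-series view of the Katz matrix, assuming the definition $K = \sum_{\ell \ge 1} \alpha^\ell A^\ell = (I - \alpha A)^{-1} - I$. The small graph, the choice of `alpha`, and the helper name `katz_truncated` are illustrative assumptions, not taken from the talk.

```python
import numpy as np

# Hypothetical 5-node undirected, connected graph (illustration only).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5
A = np.zeros((n, n))
for a, b in edges:
    A[a, b] = A[b, a] = 1.0

max_degree = A.sum(axis=1).max()
alpha = 0.5 / max_degree          # alpha < 1/max-degree <= 1/||A||_2

def katz_truncated(A, alpha, terms):
    """Sum of alpha^l A^l for l = 1..terms (truncated Neumann series)."""
    K = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for _ in range(terms):
        power = alpha * (power @ A)   # now holds alpha^l A^l
        K += power
    return K

# Dense inverse for reference; only sensible on tiny graphs.
K_exact = np.linalg.inv(np.eye(n) - alpha * A) - np.eye(n)
err_short = np.abs(K_exact - katz_truncated(A, alpha, 4)).max()
err_long = np.abs(K_exact - katz_truncated(A, alpha, 60)).max()
print(err_short, err_long)
```

With this choice of `alpha` the series converges geometrically, so a few terms already capture most of each score, which is why truncation is a common heuristic.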

Page 13: Fast Katz and Commuters

COMMUTE TIME

Picture taken from Google images, seems to be Bay Bridge Traffic by Jim M. Goldstein.

Page 14: Fast Katz and Commuters

COMMUTE TIME

Consider a uniform random walk on a graph, $P = D^{-1} A$

$H_{i,j}$ = expected number of steps to first reach node $j$ starting from node $i$

$C_{i,j} = H_{i,j} + H_{j,i}$

$H_{i,j}$ is also called the hitting time from node i to j, or the first transition time

Page 15: Fast Katz and Commuters

SKIPPING DETAILS

$L = D - A$: graph Laplacian

$C_{i,j} = \mathrm{vol}(G) \, (e_i - e_j)^T L^\dagger (e_i - e_j)$

$e$ is the only null-vector
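A small NumPy sketch relating the two views of commute time, assuming the standard identities $C_{i,j} = H_{i,j} + H_{j,i}$ and $C_{i,j} = \mathrm{vol}(G)\,(e_i - e_j)^T L^\dagger (e_i - e_j)$ with $\mathrm{vol}(G)$ the sum of the degrees. The graph and the helper names `commute_time` and `hitting_time` are hypothetical.

```python
import numpy as np

# Hypothetical 5-node undirected, connected graph (illustration only).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5
A = np.zeros((n, n))
for a, b in edges:
    A[a, b] = A[b, a] = 1.0

deg = A.sum(axis=1)
L = np.diag(deg) - A              # graph Laplacian L = D - A
vol = deg.sum()                   # vol(G) = sum of degrees
L_pinv = np.linalg.pinv(L)        # dense pseudoinverse; tiny graphs only

def commute_time(i, j):
    """Commute time from the Laplacian pseudoinverse."""
    d = np.zeros(n)
    d[i] += 1.0
    d[j] -= 1.0
    return vol * (d @ L_pinv @ d)

def hitting_time(i, j):
    """Expected steps for the uniform random walk to first reach j from i."""
    if i == j:
        return 0.0
    P = A / deg[:, None]
    keep = [v for v in range(n) if v != j]
    # Solve h = 1 + P h on the non-target nodes, with h[j] = 0.
    h = np.linalg.solve(np.eye(n - 1) - P[np.ix_(keep, keep)], np.ones(n - 1))
    return h[keep.index(i)]

c_formula = commute_time(0, 2)
c_walk = hitting_time(0, 2) + hitting_time(2, 0)
print(c_formula, c_walk)
```

Both numbers should agree up to rounding; the pseudoinverse route is the one the pairwise algorithms approximate without ever forming $L^\dagger$.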

Page 16: Fast Katz and Commuters

WHAT DO OTHER PEOPLE DO?

1) Just work with the linear algebra formulations

2) For Katz, truncate the Neumann series to a few (3-5) terms

3) Use low-rank approximations from EVD(A) or EVD(L)

4) For commute, use Johnson-Lindenstrauss inspired random sampling

5) Approximately decompose into smaller problems

Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009, Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007, Wang et al. ICDM2007

Page 17: Fast Katz and Commuters

THE PROBLEM

All of these techniques are preprocessing based because most people’s goal is to compute all the scores.

We want to avoid preprocessing the graph.

There are a few caveats here! For example, one could solve the linear system instead of computing the matrix inverse.

Page 18: Fast Katz and Commuters

WHY NO PREPROCESSING?

The graph is constantly changing as I rate new movies.

Page 19: Fast Katz and Commuters

WHY NO PREPROCESSING?

Top-k predicted “links” are movies to watch!

Pairwise scores give user similarity

Page 20: Fast Katz and Commuters

PAIRWISE ALGORITHMS

Katz $K_{i,j} = e_i^T (I - \alpha A)^{-1} e_j - \delta_{i,j}$

Commute $C_{i,j} = \mathrm{vol}(G) \, (e_i - e_j)^T L^\dagger (e_i - e_j)$

Golub and Meurant to the rescue!

Page 21: Fast Katz and Commuters

MMQ - THE BIG IDEA

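Quadratic form $z^T f(A) z$

Weighted sum $\sum_i f(\lambda_i) \, \tilde{z}_i^2$ ($A$ is s.p.d., use EVD: $A = Q \Lambda Q^T$, $\tilde{z} = Q^T z$)

Stieltjes integral $\int f(\lambda) \, d\omega(\lambda)$ (“A tautology”)

Quadrature approximation $\sum_j \omega_j f(\eta_j)$

Matrix equation $\|z\|^2 e_1^T f(T_k) e_1$ (Lanczos)

Think $f(x) = 1/x$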

Page 22: Fast Katz and Commuters

LANCZOS

Starting from $z$, k-steps of the Lanczos method on $A$ produce

$A V_k = V_k T_k + \beta_{k+1} v_{k+1} e_k^T$ and $V_k^T V_k = I$

Page 23: Fast Katz and Commuters

PRACTICAL LANCZOS

Only need to store the last 2 vectors in $V_k$

Updating requires O(matvec) work

$V_k$ is not orthogonal in floating-point arithmetic

Page 24: Fast Katz and Commuters

MMQ PROCEDURE

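Goal: lower and upper bounds on $z^T f(A) z$

Given: $A$, $z$, and eigenvalue bounds $l \le \lambda_{\min}(A)$, $u \ge \lambda_{\max}(A)$

1. Run k-steps of Lanczos on $A$ starting with $z$

2. Compute $\underline{T}_k$, $T_k$ with an additional eigenvalue at $u$, set $\underline{\rho}_k = \|z\|^2 e_1^T f(\underline{T}_k) e_1$ (corresponds to a Gauss-Radau rule, with $u$ as a prescribed node)

3. Compute $\overline{T}_k$, $T_k$ with an additional eigenvalue at $l$, set $\overline{\rho}_k = \|z\|^2 e_1^T f(\overline{T}_k) e_1$ (corresponds to a Gauss-Radau rule, with $l$ as a prescribed node)

4. Output $\underline{\rho}_k$ and $\overline{\rho}_k$ as lower and upper bounds on $z^T f(A) z$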
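The procedure above can be sketched in NumPy for the case $f(x) = 1/x$. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `lanczos` and `radau_inverse_estimate`, the test graph, and the use of $1 \pm \alpha \cdot$ max-degree as eigenvalue bounds are all choices made here. The Gauss-Radau step uses the standard modification of the tridiagonal matrix, where the new diagonal entry is the prescribed node plus the last component of the solution of $(T_k - \theta I)\,\delta = \beta_k^2 e_k$.

```python
import numpy as np

def lanczos(M, z, k):
    """k steps of Lanczos on symmetric M from z, keeping only the last two vectors."""
    alphas, betas = [], []
    v_prev = np.zeros_like(z)
    v = z / np.linalg.norm(z)
    beta = 0.0
    for _ in range(k):
        w = M @ v - beta * v_prev
        a = v @ w
        w = w - a * v
        beta = np.linalg.norm(w)      # assumed nonzero (no breakdown)
        alphas.append(a)
        betas.append(beta)
        v_prev, v = v, w / beta
    return np.array(alphas), np.array(betas)

def radau_inverse_estimate(alphas, betas, node, z_norm2):
    """Gauss-Radau estimate of z^T M^{-1} z with one prescribed node."""
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    rhs = np.zeros(k)
    rhs[-1] = betas[-1] ** 2
    delta = np.linalg.solve(T - node * np.eye(k), rhs)
    T_ext = np.zeros((k + 1, k + 1))  # T_k extended so that `node` is an eigenvalue
    T_ext[:k, :k] = T
    T_ext[k, k - 1] = T_ext[k - 1, k] = betas[-1]
    T_ext[k, k] = node + delta[-1]
    e1 = np.zeros(k + 1)
    e1[0] = 1.0
    return z_norm2 * np.linalg.solve(T_ext, e1)[0]

# Hypothetical graph: a 12-cycle with three chords (max degree 3).
n = 12
edges = [(v, (v + 1) % n) for v in range(n)] + [(0, 5), (2, 9), (3, 7)]
A = np.zeros((n, n))
for a, b in edges:
    A[a, b] = A[b, a] = 1.0
d_max = A.sum(axis=1).max()
alpha = 0.25                               # below 1/max-degree
M = np.eye(n) - alpha * A
l, u = 1 - alpha * d_max, 1 + alpha * d_max  # crude eigenvalue bounds

z = np.zeros(n)
z[0] = 1.0
alphas, betas = lanczos(M, z, 4)
lower = radau_inverse_estimate(alphas, betas, u, z @ z)  # node at u: lower bound
upper = radau_inverse_estimate(alphas, betas, l, z @ z)  # node at l: upper bound
exact = z @ np.linalg.solve(M, z)
print(lower, exact, upper)
```

Increasing the number of Lanczos steps tightens the bracket, and the gap `upper - lower` gives a natural stopping criterion.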

Page 25: Fast Katz and Commuters

PRACTICAL MMQ

Increase k to become more accurate

Bad eigenvalue bounds yield worse results

      and     are easy to compute

      not required, we can iteratively

update its LU factorization

Page 26: Fast Katz and Commuters

PRACTICAL MMQ


Page 27: Fast Katz and Commuters

ONE LAST STEP FOR KATZ

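Katz $K_{i,j} = e_i^T (I - \alpha A)^{-1} e_j - \delta_{i,j}$

For $i \ne j$ this is a bilinear form, so write it as a difference of two quadratic forms. With $M = I - \alpha A$:

$e_i^T M^{-1} e_j = \tfrac{1}{4} \left[ (e_i + e_j)^T M^{-1} (e_i + e_j) - (e_i - e_j)^T M^{-1} (e_i - e_j) \right]$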
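A short NumPy check of this last step, assuming the Katz score is $K_{i,j} = e_i^T (I - \alpha A)^{-1} e_j - \delta_{i,j}$: an off-diagonal entry of the inverse is recovered from two quadratic forms via the polarization identity, so a quadratic-form estimator suffices. The graph and the helper name `quad_form` are illustrative assumptions; here the quadratic forms are evaluated with a dense solve rather than with quadrature.

```python
import numpy as np

# Hypothetical 5-node undirected, connected graph (illustration only).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5
A = np.zeros((n, n))
for a, b in edges:
    A[a, b] = A[b, a] = 1.0

alpha = 0.5 / A.sum(axis=1).max()
M = np.eye(n) - alpha * A          # symmetric positive definite for this alpha

def quad_form(v):
    """v^T M^{-1} v, the quantity a quadrature rule would bound."""
    return v @ np.linalg.solve(M, v)

i, j = 0, 3
e_i, e_j = np.eye(n)[i], np.eye(n)[j]

# Polarization: one bilinear form from two quadratic forms.
bilinear = 0.25 * (quad_form(e_i + e_j) - quad_form(e_i - e_j))
katz_ij = bilinear - (1.0 if i == j else 0.0)

katz_direct = (np.linalg.inv(M) - np.eye(n))[i, j]
print(katz_ij, katz_direct)
```

When each quadratic form comes with lower and upper bounds, subtracting them appropriately gives bounds on the bilinear form as well.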

Page 28: Fast Katz and Commuters

KATZ SCORES ARE LOCALIZED

Up to 50 neighbors is 99.65% of the total mass

Page 29: Fast Katz and Commuters

TOP-K ALGORITHM FOR KATZ

Approximate $x$

$(I - \alpha A) x = e_i$

where $e_i$ is sparse

Keep $x$ sparse too

Ideally, don’t “touch” all of $A$

Page 30: Fast Katz and Commuters

INSPIRATION - PAGERANK

Approximate $x$

$(I - \alpha P^T) x = (1 - \alpha) v$

where $v$ is sparse

Keep $x$ sparse too? YES!

Ideally, don’t “touch” all of $P$? YES!

McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008. Thanks to Reid Anderson for telling me McSherry did this too.

Page 31: Fast Katz and Commuters

THE ALGORITHM - MCSHERRY

For $Ax = b$

Start with the Richardson iteration

$x^{(k+1)} = x^{(k)} + r^{(k)}$, where $r^{(k)} = b - A x^{(k)}$

Rewrite

$r^{(k+1)} = r^{(k)} - A r^{(k)}$

Richardson converges if $\rho(I - A) < 1$
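A minimal sketch of the Richardson iteration in residual form, assuming the system matrix is $M = I - \alpha A$ so that $\rho(I - M) = \alpha\,\rho(A) < 1$. The function name `richardson` and the test graph are hypothetical.

```python
import numpy as np

def richardson(M, b, iters):
    """Richardson iteration for M x = b, tracking the residual explicitly."""
    x = np.zeros_like(b)
    r = b.copy()                  # r = b - M x with x = 0
    for _ in range(iters):
        x = x + r                 # x_{k+1} = x_k + r_k
        r = r - M @ r             # r_{k+1} = (I - M) r_k
    return x, r

# Hypothetical 5-node undirected, connected graph (illustration only).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5
A = np.zeros((n, n))
for a, c in edges:
    A[a, c] = A[c, a] = 1.0

alpha = 0.5 / A.sum(axis=1).max()  # gives rho(alpha * A) <= 0.5
M = np.eye(n) - alpha * A
b = np.zeros(n)
b[0] = 1.0

x, r = richardson(M, b, 80)
print(np.abs(r).max())
```

Each step here adds the whole residual to the iterate, which costs one matrix-vector product; the sparse variants instead move a single residual component at a time.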

Page 32: Fast Katz and Commuters

THE ALGORITHM

Note $b$ is sparse.

If $x^{(0)} = 0$, then $r^{(0)} = b$ is sparse.

Idea

only add one component of $r^{(k)}$ to $x^{(k)}$

Page 33: Fast Katz and Commuters

THE ALGORITHM

For $Ax = b$

Init: $x^{(0)} = 0$, $r^{(0)} = b$

$x^{(k+1)} = x^{(k)} + r_j^{(k)} e_j$

$r^{(k+1)} = r^{(k)} - r_j^{(k)} A e_j$

How to pick $j$?

Page 34: Fast Katz and Commuters

THE ALGORITHM FOR KATZ

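For $(I - \alpha A) x = e_i$

Init: $x^{(0)} = 0$, $r^{(0)} = e_i$

$x^{(k+1)} = x^{(k)} + r_j^{(k)} e_j$

$r^{(k+1)} = r^{(k)} - r_j^{(k)} e_j + \alpha r_j^{(k)} A e_j$

Pick $j$ as the index of $\max_j r_j^{(k)}$

Storing the non-zeros of the residual in a heap makes picking the max $O(\log n)$ time. See Anderson et al. FOCS2008 for more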
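A minimal Python sketch of this one-component-at-a-time iteration for $(I - \alpha A) x = e_i$, using dictionaries so that only touched nodes are stored. The function name `katz_row_push`, the tolerance, and the graph are assumptions for illustration; for brevity it finds the largest residual with a linear scan rather than the heap mentioned above.

```python
import numpy as np

def katz_row_push(adj, alpha, i, tol=1e-10, max_steps=100000):
    """Approximately solve (I - alpha A) x = e_i, relaxing one residual entry per step."""
    x = {}                        # sparse iterate
    r = {i: 1.0}                  # sparse residual, r = e_i when x = 0
    for _ in range(max_steps):
        j = max(r, key=lambda v: abs(r[v]))   # linear scan; a heap gives O(log n)
        rj = r[j]
        if abs(rj) < tol:
            break
        x[j] = x.get(j, 0.0) + rj             # x <- x + r_j e_j
        r[j] = 0.0                            # r <- r - r_j e_j ...
        for nb in adj[j]:                     # ... + alpha r_j A e_j
            r[nb] = r.get(nb, 0.0) + alpha * rj
    return x, r

# Hypothetical 5-node undirected, connected graph (illustration only).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5
adj = {v: [] for v in range(n)}
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)

alpha = 0.5 / max(len(nbrs) for nbrs in adj.values())  # below 1/max-degree
x, r = katz_row_push(adj, alpha, 0)

# Katz scores from node 0 are x minus the identity term.
katz_row = {v: s - (1.0 if v == 0 else 0.0) for v, s in x.items()}
top2 = sorted((v for v in katz_row if v != 0), key=katz_row.get, reverse=True)[:2]
print(top2)
```

Each step touches only the neighbors of one node, so on a large graph the work can stay well below a full matrix-vector product when the scores are localized.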

Page 35: Fast Katz and Commuters

CONVERGENCE?

The “standard” proof works for $\alpha <$ 1/max-degree.

For $\alpha < 1/\|A\|_2$ and symmetric $A$, this algorithm is the Gauss-Southwell procedure and it still converges.

Page 36: Fast Katz and Commuters

RESULTS – DATA, PARAMETERS

All unweighted, connected graphs

Easy                                    

Hard                            


Page 37: Fast Katz and Commuters

KATZ BOUND CONVERGENCE


Page 38: Fast Katz and Commuters

COMMUTE BOUND CONVERG.


Page 39: Fast Katz and Commuters

KATZ SET CONVERGENCE

For arXiv graph.

Page 40: Fast Katz and Commuters

TIMING


Page 41: Fast Katz and Commuters

CONCLUSIONS

These algorithms are faster than many alternatives.

For pairwise commute, stopping criteria are simpler with bounds

For top-k problems, we often need less than 1 matvec for good enough results


Page 42: Fast Katz and Commuters

WARTS AND TODOS

Stopping criteria on our top-k algorithm can be a bit hairy… we should refine it.

The top-k approach doesn’t work right for commute time… we have an alternative.

$C_{i,j} = \mathrm{vol}(G) \left( L^\dagger_{i,i} + L^\dagger_{j,j} - 2 L^\dagger_{i,j} \right)$

$C_{i,:}$ requires the entire diagonal of $L^\dagger$

Take advantage of new research to “seed” commute time better!

von Luxburg et al. NIPS2010

Page 43: Fast Katz and Commuters

By AngryDogDesign on DeviantArt

Paper at WAW2010

Slides should be online soon

Code is online already

stanford.edu/~dgleich/publications/2010/codes/fast-katz

Page 44: Fast Katz and Commuters

F-MEASURE


Page 45: Fast Katz and Commuters

MAIN RESULTS – SLIDE ONE

A – adjacency matrix

L – Laplacian matrix

Katz score: $K_{i,j} = \left[ (I - \alpha A)^{-1} - I \right]_{i,j}$

Commute time: $C_{i,j} = \mathrm{vol}(G) \left( L^\dagger_{i,i} + L^\dagger_{j,j} - 2 L^\dagger_{i,j} \right)$

Page 46: Fast Katz and Commuters

MAIN RESULTS – SLIDE TWO

For Katz

Compute one $K_{i,j}$ fast

Compute top $K_{i,:}$ fast

For Commute

Compute one $C_{i,j}$ fast

For almost commute

Compute top $F_{i,:}$ fast

Page 47: Fast Katz and Commuters

MAIN RESULTS – SLIDE THREE


Page 48: Fast Katz and Commuters

KATZ BOUND CONVERGENCE


Page 49: Fast Katz and Commuters

KATZ ERROR CONVERGENCE


Page 50: Fast Katz and Commuters

COMMUTE BOUND CONVERG.


Page 51: Fast Katz and Commuters

COMMUTE ERROR CONVERG.


Page 52: Fast Katz and Commuters

KATZ SET CONVERGENCE


Page 53: Fast Katz and Commuters

KATZ ORDER CONVERGENCE
