Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

46
FAST KATZ AND COMMUTERS Efficient Estimation of Social Relatedness David F. Gleich Sandia National Labs WAW2010 16 December 2010 Palo Alto, CA With Pooya Esfandiar, Francesco Bonchi, Chen Grief, Laks V. S. Lakshmanan, and Byung-Won On David F. Gleich (Sandia) ICME la/opt seminar Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. 1 / 28

description

WAW2010 presentation

Transcript of Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

Page 1: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

FAST KATZ AND COMMUTERS Efficient Estimation of Social Relatedness

David F. Gleich

Sandia National Labs

WAW2010

16 December 2010

Palo Alto, CA

With Pooya Esfandiar, Francesco Bonchi, Chen Grief,

Laks V. S. Lakshmanan, and Byung-Won On

David F. Gleich (Sandia) ICME la/opt seminar

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin

Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

1 / 28

Page 2: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

MAIN RESULTS

A – adjacency matrix

L – Laplacian matrix

Katz score :

                                               

Commute time:

                             

David F. Gleich (Sandia) ICME la/opt seminar

For Katz Compute one       fast Compute top       fast For Commute Compute one       fast

2 of 28

Page 3: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

OUTLINE

Katz Rank and Commute Time, then Why

Matrices, moments, and quadrature rules for pairwise scores

Sparse linear systems solves for top-k

Some results

David F. Gleich (Sandia) ICME la/opt seminar 3 of 28

Page 4: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

NOTE – EVERYTHING IS SIMPLE

All graphs are undirected

All graphs are connected

All graphs are unweighted

David F. Gleich (Sandia) ICME la/opt seminar 4 of 28

Page 5: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

KATZ SCORES

The Katz score is

                   

                                                                           

Carl Neumann                                                              

David F. Gleich (Sandia) ICME la/opt seminar 5 of 28

Page 6: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

COMMUTE TIME

Consider a uniform random walk on a graph                  

                                                     

                                               

David F. Gleich (Sandia) ICME la/opt seminar

Fouss et al. TKDE 2007

Also called the hitting

time from node i to j, or

the first transition time

                    : graph Laplacian

                                                             

              is the only null-vector

6 of 28

Page 7: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

WHAT DO OTHER PEOPLE DO?

1) Just work with the linear algebra formulations

2) For Katz, truncate the Neumann series as a few (3-5) terms

3) Use low-rank approximations from EVD(A) or EVD(L)

4) For commute, use Johnson-Lindenstrauss inspired random sampling

5) Approximately decompose into smaller problems

David F. Gleich (Sandia) ICME la/opt seminar

Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009,

Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007, Wang et al. ICDM2007

7 of 28

Page 8: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

THE PROBLEM

All of these techniques are

preprocessing based because

most people’s goal is to compute

all the scores.

We want to avoid

preprocessing the graph.

David F. Gleich (Sandia) ICME la/opt seminar

There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse

8 of 28

Page 9: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

WHY? LINK PREDICTION

David F. Gleich (Sandia) ICME la/opt seminar

Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient

Neighborhood based

Path based

9 of 28

Page 10: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

WHY NO PREPROCESSING?

The graph is constantly changing

as I rate new movies.

David F. Gleich (Sandia) ICME la/opt seminar 10 of 28

Page 11: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

WHY NO PREPROCESSING?

David F. Gleich (Sandia) ICME la/opt seminar

Top-k predicted “links”

are movies to watch!

Pairwise scores give

user similarity

11 of 28

Page 12: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

PAIRWISE ALGORITHMS

Katz                    

Commute                                        

David F. Gleich (Sandia) ICME la/opt seminar

Golub and Meurant

to the rescue!

12 of 28

Page 13: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

MMQ - THE BIG IDEA

Quadratic form                    

Weighted sum                      

Stieltjes integral                      

Quadrature approximation                      

Matrix equation               David F. Gleich (Sandia) ICME la/opt seminar

Think                    

A is s.p.d. use EVD

“A tautology”

Lanczos

13 of 28

Page 14: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

MMQ PROCEDURE SKETCH

Goal                        

Given                        

1. Run k-steps of Lanczos on     starting with    

2. Compute       ,     with an additional eigenvalue at     ,

set                    

3. Compute     ,     with an additional eigenvalue at   , set

                  4. Output               as lower and upper bounds on b

David F. Gleich (Sandia) ICME la/opt seminar

Correspond to a Gauss-Radau rule, with

u as a prescribed node

Correspond to a Gauss-Radau rule, with

l as a prescribed node

Bad bounds give worse

performance!

Larger k gives better results;

k steps is k matvecs

Easy to “update” this inverse as k increases.

14 of 28

Page 15: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

PRACTICAL MMQ

David F. Gleich (Sandia) ICME la/opt seminar 15 of 28

Page 16: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

ONE LAST STEP FOR KATZ

Katz                      

         

                                                                   

                                       

David F. Gleich (Sandia) ICME la/opt seminar 16 of 28

Page 17: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

TOP-K ALGORITHM FOR KATZ

Approximate    

                           

where     is sparse

Keep     sparse too

Ideally, don’t “touch” all of    

For PageRank, we can do this!

David F. Gleich (Sandia) ICME la/opt seminar 17 of 28

Page 18: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

THE ALGORITHM - MCSHERRY

For              

Start with the Richardson iteration

                                                     

Rewrite

                                     

Note     is sparse

If                 , then         is sparse.

Idea Only add one component of         to        

David F. Gleich (Sandia) ICME la/opt seminar

McSherry WWW2005

Richardson converges if                        

18 of 28

Page 19: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

THE ALGORITHM FOR KATZ

For                          

Init:                                

                             

                                           

Pick   as max       David F. Gleich (Sandia) ICME la/opt seminar

Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more

19 of 28

Page 20: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

CONVERGENCE?

The “standard” proof works for         1/max-degree.

If you pick   as the maximum element, we can show this is convergent if Richardson converges. This proof requires     to be symmetric positive definite.

David F. Gleich (Sandia) ICME la/opt seminar 20 of 28

Page 21: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

RESULTS – DATA, PARAMETERS

All unweighted, connected graphs

Easy     :                                

Hard     :                        

David F. Gleich (Sandia) ICME la/opt seminar 21 of 28

Page 22: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

KATZ BOUND CONVERGENCE

David F. Gleich (Sandia) ICME la/opt seminar 22 of 28

Page 23: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

COMMUTE BOUND CONVERG.

David F. Gleich (Sandia) ICME la/opt seminar 23 of 28

Page 24: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

KATZ SET CONVERGENCE

David F. Gleich (Sandia) ICME la/opt seminar

For arXiv graph.

24 of 28

Page 25: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

TIMING

David F. Gleich (Sandia) ICME la/opt seminar 25 of 28

Page 26: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

CONCLUSIONS

These algorithms are faster than many alternatives in these special cases.

For pairwise commute, stopping criteria are simpler with bounds.

For top-k problems, we often need less than 1 matvec for good enough results

David F. Gleich (Sandia) ICME la/opt seminar 26 of 28

Page 27: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

WARTS AND TODOS

Stopping criteria on our top-k algorithm can be a bit hairy… we should refine it.

The top-k approach doesn’t work right for commute time… we have an alternative.

Take advantage of new research to “seed” commute time better!

Evaluate more datasets, like Netflix.

David F. Gleich (Sandia) ICME la/opt seminar

von Luxburg et al. NIPS2010

27 of 28

Page 28: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

By AngryDogDesign on DeviantArt

More details in the paper

Slides should be online soon

Code is online already

stanford.edu/~dgleich/

publications/2010/codes/fast-katz

David F. Gleich (Sandia) ICME la/opt seminar 28

Page 29: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

LEO KATZ

David F. Gleich (Sandia) ICME la/opt seminar 29 of 28

Page 30: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

NOT QUITE, WIKIPEDIA

    : adjacency,                     : random walk

PageRank                        

Katz                        

These are equivalent if     has constant degree

David F. Gleich (Sandia) ICME la/opt seminar 30 of 28

Page 31: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

WHAT KATZ ACTUALLY SAID

Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43

“we assume that each link independently has the

same probability of being effective” …

“we conceive a constant     , depending

on the group and the context of the particular

investigation, which has the force of a probability

of effectiveness of a single link. A k-step chain

then, has probability       of being effective.”

“We wish to find the column sums of the matrix”

David F. Gleich (Sandia) ICME la/opt seminar 31 of 28

Page 32: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

RETURNING TO THE MATRIX

                     

                                                     

Carl Neumann                          

                                   

David F. Gleich (Sandia) ICME la/opt seminar 32 of 28

Page 33: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

PROPERTIES OF KATZ’S MATRIX

    is symmetric

    exists when                      

            is sym. pos. def. when                      

Note that         1/max-degree suffices

David F. Gleich (Sandia) ICME la/opt seminar 33 of 28

Page 34: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

SKIPPING DETAILS

                    : graph Laplacian

                                                             

              is the only null-vector

David F. Gleich (Sandia) ICME la/opt seminar 34 of 28

Page 35: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

Carl Neumann

I’ve heard the Neumann series called the “von Neumann”

series more than I’d like! In fact, the von Neumann kernel

of a graph should be named the “Neumann” kernel!

David F. Gleich (Sandia) ICME la/opt seminar

Wikipedia page

35 / 28

Page 36: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

LANCZOS

          , k-steps of the Lanczos method produce

                                    and                      

David F. Gleich (Sandia) ICME la/opt seminar

                    =

           

36 of 28

Page 37: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

PRACTICAL LANCZOS

Only need to store the last 2 vectors in      

Updating requires O(matvec) work

      is not orthogonal

David F. Gleich (Sandia) ICME la/opt seminar 37 of 28

Page 38: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

PRACTICAL MMQ

Increase k to become more accurate

Bad eigenvalue bounds yield worse results

      and     are easy to compute

      not required, we can iteratively

update it’s LU factorization

David F. Gleich (Sandia) ICME la/opt seminar 38 of 28

Page 39: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

THE ALGORITHM

For              

Init:                                  

                               

                               

How to pick   ?

David F. Gleich (Sandia) ICME la/opt seminar 39 of 28

Page 40: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

MAIN RESULTS – SLIDE ONE

A – adjacency matrix

L – Laplacian matrix

Katz score :

                                               

Commute time:

                               

David F. Gleich (Sandia) ICME la/opt seminar 40 of 28

Page 41: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

TOP-K INSPIRATION: PAGERANK

Approximate    

                           

where     is sparse

Keep     sparse too? YES!

Ideally, don’t “touch” all of     ? YES!

David F. Gleich (Sandia) ICME la/opt seminar

McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too.

41 of 28

Page 42: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

THE ALGORITHM

Note     is sparse.

If                 , then         is sparse.

Idea

only add one component of         to        

David F. Gleich (Sandia) ICME la/opt seminar 42 of 28

Page 43: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

MAIN RESULTS – SLIDE TWO

For Katz Compute one       fast

Compute top       fast

For Commute

Compute one       fast

For almost commute

Compute top       fast

David F. Gleich (Sandia) ICME la/opt seminar 43 of 28

Page 44: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

MAIN RESULTS – SLIDE THREE

David F. Gleich (Sandia) ICME la/opt seminar 44 of 28

Page 45: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

F-MEASURE

David F. Gleich (Sandia) ICME la/opt seminar 45 of 28

Page 46: Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

PAIRWISE RESULTS

Katz upper and lower bounds

Katz error convergence

Commute-time upper and lower bounds

Commute-time error convergence

For the arXiv graph here

David F. Gleich (Sandia) ICME la/opt seminar 46 of 28