Understanding Cycling Behaviours of Commuters - Methodological Issues
Fast Katz and Commuters
-
Upload
david-gleich -
Category
Documents
-
view
700 -
download
0
description
Transcript of Fast Katz and Commuters
FAST KATZ AND COMMUTERS
Quadrature Rules and Sparse Linear Solvers
for Link Prediction Heuristics
David F. Gleich
Sandia National Labs
Berkeley LAPACK Seminar
February 3, 2011
With Pooya Esfandiar, Francesco Bonchi, Chen Grief,
Laks V. S. Lakshmanan, and Byung-Won On
David F. Gleich (Sandia) Berkeley SCMC Seminar
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin
Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
1 / 43
MAIN RESULTS
A – adjacency matrix
L – Laplacian matrix
Katz score :
Commute time:
For Commute
Compute one fast
David F. Gleich (Sandia) Berkeley SCMC Seminar
For Katz Compute one fast Compute top fast
2 of 43
OUTLINE
Why study these measures?
Katz Rank and Commute Time
How else do people compute them?
Quadrature rules for pairwise scores
Sparse linear systems solves for top-k
As many results as we have time for…
David F. Gleich (Sandia) Berkeley SCMC Seminar 3 of 43
WHY? LINK PREDICTION
David F. Gleich (Sandia) Berkeley SCMC Seminar
Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient
Neighborhood based
Path based
4 of 43
NOTE
All graphs are undirected
All graphs are connected
David F. Gleich (Sandia) Berkeley SCMC Seminar 5 of 43
LEO KATZ
David F. Gleich (Sandia) Berkeley SCMC Seminar 6 of 43
NOT QUITE, WIKIPEDIA
: adjacency, : random walk
PageRank
Katz
These are equivalent if has constant degree
David F. Gleich (Sandia) Berkeley SCMC Seminar 7 of 43
WHAT KATZ ACTUALLY SAID
Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43
“we assume that each link independently has the
same probability of being effective” …
“we conceive a constant , depending
on the group and the context of the particular
investigation, which has the force of a probability
of effectiveness of a single link. A k-step chain
then, has probability of being effective.”
“We wish to find the column sums of the matrix”
David F. Gleich (Sandia) Berkeley SCMC Seminar 8 of 43
A MODERN TAKE
The Katz score (node-based) is
The Katz score (edge-based) is
David F. Gleich (Sandia) Berkeley SCMC Seminar 9 of 43
RETURNING TO THE MATRIX
Carl Neumann
David F. Gleich (Sandia) Berkeley SCMC Seminar 10 of 43
Carl Neumann
I’ve heard the Neumann series called the “von Neumann”
series more than I’d like! In fact, the von Neumann kernel
of a graph should be named the “Neumann” kernel!
David F. Gleich (Sandia) Berkeley SCMC Seminar
Wikipedia page
11 / 43
PROPERTIES OF KATZ’S MATRIX
is symmetric
exists when
is spd when
Note that 1/max-degree suffices
David F. Gleich (Sandia) Berkeley SCMC Seminar 12 of 43
COMMUTE TIME
David F. Gleich (Sandia) Berkeley SCMC Seminar
Picture taken from Google images, seems to be
Bay Bridge Traffic by Jim M. Goldstein.
13 / 43
COMMUTE TIME
Consider a uniform random walk on a graph
David F. Gleich (Sandia) Berkeley SCMC Seminar
Also called the hitting
time from node i to j, or
the first transition time
14 of 43
SKIPPING DETAILS
: graph Laplacian
is the only null-vector
David F. Gleich (Sandia) Berkeley SCMC Seminar 15 of 43
WHAT DO OTHER PEOPLE DO?
1) Just work with the linear algebra formulations
2) For Katz, Truncate the Neumann series as a few (3-5) terms
3) Use low-rank approximations from EVD(A) or EVD(L)
4) For commute, use Johnson-Lindenstrauss inspired random sampling
5) Approximately decompose into smaller problems
David F. Gleich (Sandia) Berkeley SCMC Seminar
Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009,
Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007, Wang et al. ICDM2007 16 of 43
THE PROBLEM
All of these techniques are
preprocessing based because
most people’s goal is to compute
all the scores.
We want to avoid
preprocessing the graph.
David F. Gleich (Sandia) Berkeley SCMC Seminar
There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse
17 of 43
WHY NO PREPROCESSING?
The graph is constantly changing
as I rate new movies.
David F. Gleich (Sandia) Berkeley SCMC Seminar 18 of 43
WHY NO PREPROCESSING?
David F. Gleich (Sandia) Berkeley SCMC Seminar
Top-k predicted “links”
are movies to watch!
Pairwise scores give
user similarity
19 of 43
PAIRWISE ALGORITHMS
Katz
Commute
David F. Gleich (Sandia) Berkeley SCMC Seminar
Golub and Meurant
to the rescue!
20 of 43
MMQ - THE BIG IDEA
Quadratic form
Weighted sum
Stieltjes integral
Quadrature approximation
Matrix equation David F. Gleich (Sandia) Berkeley SCMC Seminar
Think
A is s.p.d. use EVD
“A tautology”
Lanczos
21 of 43
LANCZOS
, k-steps of the Lanczos method produce
and
David F. Gleich (Sandia) Berkeley SCMC Seminar
=
22 of 43
PRACTICAL LANCZOS
Only need to store the last 2 vectors in
Updating requires O(matvec) work
is not orthogonal
David F. Gleich (Sandia) Berkeley SCMC Seminar 23 of 43
MMQ PROCEDURE
Goal
Given
1. Run k-steps of Lanczos on starting with
2. Compute , with an additional eigenvalue at ,
set 3. Compute , with an additional eigenvalue at , set
4. Output as lower and upper bounds on
David F. Gleich (Sandia) Berkeley SCMC Seminar
Correspond to a Gauss-Radau rule, with
u as a prescribed node
Correspond to a Gauss-Radau rule, with
l as a prescribed node
24 of 43
PRACTICAL MMQ
Increase k to become more accurate
Bad eigenvalue bounds yield worse results
and are easy to compute
not required, we can iteratively
update it’s LU factorization
David F. Gleich (Sandia) Berkeley SCMC Seminar 25 of 43
PRACTICAL MMQ
David F. Gleich (Sandia) Berkeley SCMC Seminar 26 of 43
ONE LAST STEP FOR KATZ
Katz
David F. Gleich (Sandia) Berkeley SCMC Seminar 27 of 43
KATZ SCORES ARE LOCALIZED
David F. Gleich (Sandia) Berkeley SCMC Seminar
Up to 50 neighbors is 99.65% of the total mass
28 of 43
TOP-K ALGORITHM FOR KATZ
Approximate
where is sparse
Keep sparse too
Ideally, don’t “touch” all of
David F. Gleich (Sandia) Berkeley SCMC Seminar 29 of 43
INSPIRATION - PAGERANK
Approximate
where is sparse
Keep sparse too? YES!
Ideally, don’t “touch” all of ? YES!
David F. Gleich (Sandia) Berkeley SCMC Seminar
McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too.
30 of 43
THE ALGORITHM - MCSHERRY
For
Start with the Richardson iteration
Rewrite
Richardson converges if
David F. Gleich (Sandia) Berkeley SCMC Seminar 31 of 43
THE ALGORITHM
Note is sparse.
If , then is sparse.
Idea
only add one component of to
David F. Gleich (Sandia) Berkeley SCMC Seminar 32 of 43
THE ALGORITHM
For
Init:
How to pick ?
David F. Gleich (Sandia) Berkeley SCMC Seminar 33 of 43
THE ALGORITHM FOR KATZ
For
Init:
Pick as max David F. Gleich (Sandia) Berkeley SCMC Seminar
Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more
34 of 43
CONVERGENCE?
The “standard” proof works for 1/max-degree.
For , then for symmetric , this algorithm is the Gauss-Southwell procedure and it still converges.
David F. Gleich (Sandia) Berkeley SCMC Seminar
35 of 43
RESULTS – DATA, PARAMETERS
All unweighted, connected graphs
Easy
Hard
David F. Gleich (Sandia) Berkeley SCMC Seminar 36 of 43
KATZ BOUND CONVERGENCE
David F. Gleich (Sandia) Berkeley SCMC Seminar 37 of 43
COMMUTE BOUND CONVERG.
David F. Gleich (Sandia) Berkeley SCMC Seminar 38 of 43
KATZ SET CONVERGENCE
David F. Gleich (Sandia) Berkeley SCMC Seminar
For arXiv graph.
39 of 43
TIMING
David F. Gleich (Sandia) Berkeley SCMC Seminar 40 of 43
CONCLUSIONS
These algorithms are faster than many alternatives.
For pairwise commute, stopping criteria are simpler with bounds
For top-k problems, we often need less than 1 matvec for good enough results
David F. Gleich (Sandia) Berkeley SCMC Seminar 41 of 43
WARTS AND TODOS
Stopping criteria on our top-k algorithm can be a bit hairy… we should refine it.
The top-k approach doesn’t work right for commute time… we have an alternative.
requires the entire diagonal of
Take advantage of new research to “seed” commute time better! David F. Gleich (Sandia) Berkeley SCMC Seminar
von Luxburg et al. NIPS2010
42 of 43
By AngryDogDesign on DeviantArt
Paper at WAW2010
Slides should be online soon
Code is online already
stanford.edu/~dgleich/
publications/2010/codes/fast-katz
David F. Gleich (Sandia) Berkeley SCMC Seminar 43
F-MEASURE
David F. Gleich (Sandia) Berkeley SCMC Seminar 44 of 43
MAIN RESULTS – SLIDE ONE
A – adjacency matrix
L – Laplacian matrix
Katz score :
Commute time:
David F. Gleich (Sandia) Berkeley SCMC Seminar 45 of 43
MAIN RESULTS – SLIDE TWO
For Katz Compute one fast
Compute top fast
For Commute
Compute one fast
For almost commute
Compute top $F_{i,:}$ fast
David F. Gleich (Sandia) Berkeley SCMC Seminar 46 of 43
MAIN RESULTS – SLIDE THREE
David F. Gleich (Sandia) Berkeley SCMC Seminar 47 of 43
KATZ BOUND CONVERGENCE
David F. Gleich (Sandia) Berkeley SCMC Seminar 48 of 43
KATZ ERROR CONVERGENCE
David F. Gleich (Sandia) Berkeley SCMC Seminar 49 of 43
COMMUTE BOUND CONVERG.
David F. Gleich (Sandia) Berkeley SCMC Seminar 50 of 43
COMMUTE ERROR CONVERG.
David F. Gleich (Sandia) Berkeley SCMC Seminar 51 of 43
KATZ SET CONVERGENCE
David F. Gleich (Sandia) Berkeley SCMC Seminar 52 of 43
KATZ ORDER CONVERGENCE
David F. Gleich (Sandia) Berkeley SCMC Seminar 53 of 43