SCS CMU Proximity on Large Graphs Speaker: Hanghang Tong 2008-4-10 15-826 Guest Lecture.

Post on 04-Jan-2016

214 views 1 download

Transcript of SCS CMU Proximity on Large Graphs Speaker: Hanghang Tong 2008-4-10 15-826 Guest Lecture.

SCS CMU

Proximity on Large Graphs

Speaker:

Hanghang Tong

2008-4-10 15-826 Guest Lecture

SCS CMU

2

Graphs are everywhere!

SCS CMU

4

Graph Mining: the big picture Graph/Global Level

Subgraph/Community Level

Node Level We are here!

SCS CMU

5

Proximity on Graph: What?

A BH1 1

D1 1

E

F

G1 11

I J1

1 1

a.k.a Relevance, Closeness, ‘Similarity’…

SCS CMU

6

Proximity is the main tool behind…

• Link prediction [Liben-Nowell+], [Tong+]• Ranking [Haveliwala], [Chakrabarti+]• Email Management [Minkov+]• Image caption [Pan+]• Neighborhooh Formulation [Sun+]• Conn. subgraph [Faloutsos+], [Tong+], [Koren+]• Pattern match [Tong+]• Collaborative Filtering [Fouss+]• Many more…

Will return to this later

SCS CMU

7

Roadmap

• Motivation

• Part I: Definitions

• Part II: Fast Solutions

• Part III: Applications

• Conclusion

• Basic: RWR

• Variants

• Asymmetry of Prox.

• Group Prox

• Prox w/ Attributes

• Prox w/ Time

SCS CMU

8

Why not shortest path?

‘pizza delivery guy’ problem

A BD1 1

A BD1 1

E

F

G1 11

A BD1 1

E F1

1 1

A BD1 1

‘multi-facet’ relationship

Some ``bad’’ proximities

SCS CMU

9

A BD1 1

A BD1 11 E

Why not max. netflow?

No punishment on long paths

Some ``bad’’ proximities

SCS CMU

11

What is a ``good’’ Proximity?

A BH1 1

D1 1

E

F

G1 11

I J1

1 1

• Multiple Connections

• Quality of connection

•Direct & In-directed Conns

•Length, Degree, Weight…

SCS CMU

12

1

4

3

2

56

7

910

8

11

12

Random walk with restart

SCS CMU

13

Random walk with restart

Node 4

Node 1Node 2Node 3Node 4Node 5Node 6Node 7Node 8Node 9Node 10Node 11Node 12

0.130.100.130.220.130.050.050.080.040.030.040.02

1

4

3

2

56

7

910

811

120.13

0.10

0.13

0.13

0.05

0.05

0.08

0.04

0.02

0.04

0.03

Ranking vector More red, more relevant

Nearby nodes, higher scores

4r

SCS CMU

Why RWR is a good score?

14

2c 3cc ...

W 2W 3W

all paths from i to j with length 1

all paths from i to j with length 2

all paths from i to j with length 3

SCS CMU

15

Roadmap

• Motivation

• Part I: Definitions

• Part II: Fast Solutions

• Part III: Applications

• Conclusion

• Basic: RWR

• Variants

• Asymmetry of Prox.

• Group Prox

• Prox w/ Attributes

• Prox w/ Time

SCS CMU

16

Variant: escape probability

• Define Random Walk (RW) on the graph• Esc_Prob(AB)

– Prob (starting at A, reaches B before returning to A)

Esc_Prob = Pr (smile before cry)

A Bthe remaining graph

SCS CMU

17

Other Variants

• Other measure by RWs– Community Time/Hitting Time [Fouss+]– SimRank [Jeh+]

• Equivalence of Random Walks– Electric Networks:

• EC [Doyle+]; SAEC[Faloutsos+]; CFEC[Koren+]

– String Systems

• Katz [Katz], [Huang+], [Scholkopf+]• Matrix-Forest-based Alg [Chobotarev+]

SCS CMU

18

Other Variants

• Other measure by RWs– Community Time/Hitting Time [Fouss+]– SimRank [Jeh+]

• Equivalence of Random Walks– Electric Networks:

• EC [Doyle+]; SAEC[Faloutsos+]; CFEC[Koren+]

– String Systems

• Katz [Katz], [Huang+], [Scholkopf+]• Matrix-Forest-based Alg [Chobotarev+]

All are related to, or similar to random walk with restart!

SCS CMU

19

Roadmap

• Motivation

• Part I: Definitions

• Part II: Fast Solutions

• Part III: Applications

• Conclusion

• Basic: RWR

• Variants

• Asymmetry of Prox.

• Group Prox

• Prox w/ Attributes

• Prox w/ Time

SCS CMU

20

Asymmetry of Proximity [Tong+ KDD07 a]

A B

1 1

1

111

1 A B

1 1

1

10.51

0.5

What is Prox from A to B?What is Prox from B to A?

What is Prox between A and B?

SCS CMU

21

Asymmetry also exists in un-directed graphs

• Hanghang’s most important conf. is KDD

• The most important author in KDD is ...

So is love…

HanghangKDD

SCS CMU

22

Roadmap

• Motivation

• Part I: Definitions

• Part II: Fast Solutions

• Part III: Applications

• Conclusion

• Basic: RWR

• Variants

• Asymmetry of Prox.

• Group Prox

• Prox w/ Attributes

• Prox w/ Time

SCS CMU

23

Group Proximity [Tong+ 2007]

• Q: How close are Accountants to SECs?

• A: Prob (starting at any RED, reaches any GREEN before touching any RED again)

Accountant

CEO

Manager

SEC

SCS CMU

24

Proximity on Attribute Graphs

What is the proximity from node 7 to 10?If we know that…

Accountant

CEO

Manager

SEC

SCS CMU

25

Sol: Augmented graphs

12

1311

9

10

5

6

7

8

1

2

3

4

SCS CMU

26

Attributes on nodes/edges (ER graph) [Chakrabarti+ WWW07]

skip

Wrote Sent Received

In-Replied-toCited

Works

SCS CMU

27

Proximity w/ Time

• Sol #1: treat time an categorical attr. [Minkov+]

• Sol #2: aggregate slice matrices [Tong+]

Time

•Global aggregation

•Slide window

•Exponential emphasis

SCS CMU

28

Summary of Part I

• Goal: Summarize multiple … relationships

• Solutions– Basic: Random Walk with Restart– Property: Asymmetry– Variants: Esc_Prob and many others.– Generalization: Group Prox.; w/ Attr.; w/ Time

SCS CMU

29

• Motivation

• Part I: Definitions

• Part II: Fast Solutions

• Part III: Applications

• Conclusion

Roadmap

• B_Lin: RWR

• FastAllDAP: Esc_Prob

• BB_Lin: Skewed BGs

• FastUpdate: Time-Evolving

SCS CMU

Preliminary: Sherman–Morrison Lemma

30

Tv

uAA =

If:

1 11 1 1

1( )

1

TT

T

A u v AA A u v A

v A u

Then:

SCS CMU

SM Lemma: Applications

• RLS – and almost any algorithm in time series!

• Leave-one-out cross validation for LS

• Kalman filtering

• Incremental matrix decomposition

• … and all the fast sols we will introduce!

31

SCS CMU

32

Computing RWR

1

43

2

5 6

7

9 10

811

12

0.13 0 1/3 1/3 1/3 0 0 0 0 0 0 0 0

0.10 1/3 0 1/3 0 0 0 0 1/4 0 0 0

0.13

0.22

0.13

0.050.9

0.05

0.08

0.04

0.03

0.04

0.02

0

1/3 1/3 0 1/3 0 0 0 0 0 0 0 0

1/3 0 1/3 0 1/4 0 0 0 0 0 0 0

0 0 0 1/3 0 1/2 1/2 1/4 0 0 0 0

0 0 0 0 1/4 0 1/2 0 0 0 0 0

0 0 0 0 1/4 1/2 0 0 0 0 0 0

0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 0

0 0 0 0 0 0 0 1/4 0 1/3 0 0

0 0 0 0 0 0 0 0 1/2 0 1/3 1/2

0 0 0 0 0 0 0 1/4 0 1/3 0 1/2

0 0 0 0 0 0 0 0 0 1/3 1/3 0

0.13 0

0.10 0

0.13 0

0.22

0.13 0

0.05 00.1

0.05 0

0.08 0

0.04 0

0.03 0

0.04 0

2 0

1

0.0

n x n n x 1n x 1

Ranking vector Starting vectorAdjacent matrix

1

(1 )i i ir cWr c e

Restart p

SCS CMU

33

Beyond RWR

P-PageRank[Haveliwala]

PageRank[Haveliwala]

RWR[Pan, Sun]

SM Learning[Zhou, Zhu]

RL in CBIR[He]

Fast RWR (B_Lin) Finds the Root Solution !

: Maxwell Equation for Web![Chakrabarti]

SCS CMU

34

• RWR is the building block for computing…– Escape Probability (augmented w/sink)

[Tong+]– ..Effective ConductancResistance Dist.

Commute Time– MRF (special structure) [Cohen]

• Similar Idea of B_Lin to compute other measurements

Beyond RWR

SCS CMU

35

• Q: Given query i, how to solve it?

0 1/3 1/3 1/3 0 0 0 0 0 0 0 0

1/3 0 1/3 0 0 0 0 1/4 0 0 0 0

1/3 1/3 0 1/3 0 0 0 0 0 0 0 0

1/3 0 1/3 0 1/4

0.9

0 0 0 0 0 0 0

0 0 0 1/3 0 1/2 1/2 1/4 0 0 0 0

0 0 0 0 1/4 0 1/2 0 0 0 0 0

0 0 0 0 1/4 1/2 0 0 0 0 0 0

0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 0

0 0 0 0 0 0 0 1/4 0 1/3 0 0

0 0 0 0 0 0 0 0 1/2 0 1/3 1/2

0 0 0 0 0

0

0

0

0

00.1

0

0

0

0

0 0 1/4 0 1/3 0 1/2 0

0 0 0 0 0 0 0 0 0 1/3 1/3

1

0 0

??

Adjacent matrix Starting vector

SCS CMU

36

1

43

2

5 6

7

9 10

8 11

120.130.10

0.13

0.130.05

0.05

0.08

0.04

0.02

0.04

0.03

OntheFly: 0 1/3 1/3 1/3 0 0 0 0 0 0 0 0

1/3 0 1/3 0 0 0 0 1/4 0 0 0 0

1/3 1/3 0 1/3 0 0 0 0 0 0 0 0

1/3 0 1/3 0 1/4

0.9

0 0 0 0 0 0 0

0 0 0 1/3 0 1/2 1/2 1/4 0 0 0 0

0 0 0 0 1/4 0 1/2 0 0 0 0 0

0 0 0 0 1/4 1/2 0 0 0 0 0 0

0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 0

0 0 0 0 0 0 0 1/4 0 1/3 0 0

0 0 0 0 0 0 0 0 1/2 0 1/3 1/2

0 0 0 0 0

0

0

0

0

00.1

0

0

0

0

0 0 1/4 0 1/3 0 1/2 0

0 0 0 0 0 0 0 0 0 1/3 1/3

1

0 0

0

0

0

1

0

0

0

0

0

0

0

0

0.13

0.10

0.13

0.22

0.13

0.05

0.05

0.08

0.04

0.03

0.04

0.02

1

43

2

5 6

7

9 10

811

12

0.3

0

0.3

0.1

0.3

0

0

0

0

0

0

0

0.12

0.18

0.12

0.35

0.03

0.07

0.07

0.07

0

0

0

0

0.19

0.09

0.19

0.18

0.18

0.04

0.04

0.06

0.02

0

0.02

0

0.14

0.13

0.14

0.26

0.10

0.06

0.06

0.08

0.01

0.01

0.01

0

0.16

0.10

0.16

0.21

0.15

0.05

0.05

0.07

0.02

0.01

0.02

0.01

0.13

0.10

0.13

0.22

0.13

0.05

0.05

0.08

0.04

0.03

0.04

0.02

No pre-computation/ light storage

Slow on-line response O(mE)

ir

ir

SCS CMU

37

0.20 0.13 0.14 0.13 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.34

0.28 0.20 0.13 0.96 0.64 0.53 0.53 0.85 0.60 0.48 0.53 0.45

0.14 0.13 0.20 1.29 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.33

0.13 0.10 0.13 2.06 0.95 0.78 0.78 0.61 0.43 0.34 0.38 0.32

0.09 0.09 0.09 1.27 2.41 1.97 1.97 1.05 0.73 0.58 0.66 0.56

0.03 0.04 0.04 0.52 0.98 2.06 1.37 0.43 0.30 0.24 0.27 0.22

0.03 0.04 0.04 0.52 0.98 1.37 2.06 0.43 0.30 0.24 0.27 0.22

0.08 0.11 0.04 0.82 1.05 0.86 0.86 2.13 1.49 1.19 1.33 1.13

0.03 0.04 0.03 0.28 0.36 0.30 0.30 0.74 1.78 1.00 0.76 0.79

0.04 0.04 0.04 0.34 0.44 0.36 0.36 0.89 1.50 2.45 1.54 1.80

0.04 0.05 0.04 0.38 0.49 0.40 0.40 1.00 1.14 1.54 2.28 1.72

0.02 0.03 0.02 0.21 0.28 0.22 0.22 0.56 0.79 1.20 1.14 2.05

4

PreCompute

1 2 3 4 5 6 7 8 9 10 11 12r r r r r r r r r r r r

1

43

2

5 6

7

9 10

8 11

120.130.10

0.13

0.130.05

0.05

0.08

0.04

0.02

0.04

0.03

13

2

5 6

7

9 10

811

12

[Haveliwala]

R:

SCS CMU

38

2.20 1.28 1.43 1.29 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.34

1.28 2.02 1.28 0.96 0.64 0.53 0.53 0.85 0.60 0.48 0.53 0.45

1.43 1.28 2.20 1.29 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.33

1.29 0.96 1.29 2.06 0.95 0.78 0.78 0.61 0.43 0.34 0.38 0.32

0.91 0.86 0.91 1.27 2.41 1.97 1.97 1.05 0.73 0.58 0.66 0.56

0.37 0.35 0.37 0.52 0.98 2.06 1.37 0.43 0.30 0.24 0.27 0.22

0.37 0.35 0.37 0.52 0.98 1.37 2.06 0.43 0.30 0.24 0.27 0.22

0.84 1.14 0.84 0.82 1.05 0.86 0.86 2.13 1.49 1.19 1.33 1.13

0.29 0.40 0.29 0.28 0.36 0.30 0.30 0.74 1.78 1.00 0.76 0.79

0.35 0.48 0.35 0.34 0.44 0.36 0.36 0.89 1.50 2.45 1.54 1.80

0.39 0.53 0.39 0.38 0.49 0.40 0.40 1.00 1.14 1.54 2.28 1.72

0.22 0.30 0.22 0.21 0.28 0.22 0.22 0.56 0.79 1.20 1.14 2.05

PreCompute:

1

43

2

5 6

7

9 10

8 11

120.130.10

0.13

0.130.05

0.05

0.08

0.04

0.02

0.04

0.03

1

43

2

5 6

7

9 10

811

12

Fast on-line response

Heavy pre-computation/storage costO(n ) O(n )

0.13

0.10

0.13

0.22

0.13

0.05

0.05

0.08

0.04

0.03

0.04

0.02

3 2

SCS CMU

39

Q: How to Balance?

On-line Off-line

SCS CMU

40

B_Lin: Basic Idea[Tong+]

1

43

2

5 6

7

9 10

8 11

120.130.10

0.13

0.130.05

0.05

0.08

0.04

0.02

0.04

0.03

1

43

2

5 6

7

9 10

811

12

Find Community

Fix the remaining

Combine1

43

2

5 6

7

9 10

8 11

12

1

43

2

5 6

7

9 10

8 11

12

5 6

7

9 10

811

12

1

43

2

5 6

7

9 10

8 11

12

1

43

2

5 6

7

9 10

8 11

12

1

43

2

SCS CMU

41

Pre-computational stage

• Q: • A: A few small, instead of ONE BIG, matrices inversions

Efficiently compute and store Q-1

SCS CMU

42

• Q: Efficiently recover one column of Q• A: A few, instead of MANY, matrix-vector multiplication

On-Line Query Stage

+

0

0

0

0

0

0

1

0

0

0

0

0

-1

ie ir

SCS CMU

43

Pre-compute Stage

• p1: B_Lin Decomposition– P1.1 partition– P1.2 low-rank approximation

• p2: Q matrices– P2.1 computing (for each partition)– P2.2 computing (for concept space)

11Q

SCS CMU

44

P1.1: partition

1

43

2

5 6

7

9 10

811

12

1

43

2

5 6

7

9 10

811

12

Within-partition links cross-partition links

skip

SCS CMU

45

P1.1: block-diagonal

1

43

2

5 6

7

9 10

811

12

1

43

2

5 6

7

9 10

811

12

skip

SCS CMU

46

P1.2: LRA for

31

4

2

5 6

7

9 10

811

12

1

43

2

5 6

7

9 10

811

12

|S| << |W2|~

skip

SCS CMU

47

+

=

skip

SCS CMU

48

p2.1 Computing

c

skip

SCS CMU

49

Comparing and

• Computing Time– 100,000 nodes; 100 partitions– Computing 100,00x is Faster!

• Storage Cost – 100x saving!

Q 1,1

Q 1,2

Q 1,k

11Q

=

11Q

11Q

skip

SCS CMU

50

• Q: How to fix the green portions?

W +~

~~

11Q

+ ?

skip

SCS CMU

51

p2.2 Computing:

1S UV=_

-1

1

43

2

5 6

7

9 10

811

12

Q 1,1

Q 1,2

Q 1,k

skip

SCS CMU

52

SM Lemma says:

We have:

Communities Bridges

1 1 11 1

11U VcQQ Q Q

skip

SCS CMU

53

On-Line Stage

• Q

+

Query

0

0

0

0

0

0

1

0

0

0

0

0

Result

?

• A (SM lemma)

Pre-Computation

ie ir

skip

SCS CMU

54

On-Line Query Stage

q1:q2:q3:q4:q5:q6:

skip

SCS CMU

55

skip

SCS CMU

56

Query Time vs. Pre-Compute Time

Log Query Time

Log Pre-compute Time

•Quality: 90%+ •On-line:

•Up to 150x speedup•Pre-computation:

•Two orders saving

SCS CMU

57

• Motivation

• Part I: Definitions

• Part II: Fast Solutions

• Part III: Applications

• Conclusion

Roadmap

• B_Lin: RWR

• FastAllDAP: Esc_Prob

• BB_Lin: Skewed BGs

• FastUpdate: Time-Evolving

SCS CMU

58

FastAllDAP [Tong+]

Footnote: augmented w/ universal sink as practical modification

A Bthe remaining graph

• Q: How to compute – Esc_Prob = Pr (smile before cry)?

SCS CMU

59

Solving DAP (Straight-forward way)

One matrix inversion, one proximity!

2 1,

ˆProx( )=c ( )ti j i ji j p I cP p c p

1 x (n-2) 1 x (n-2)(n-2) x (n-2)

1-c: fly-out probability (to black-hole)

SCS CMU

60

Esc_Prob(1->5) =

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

P=

I - +

-1

1 5

3

2

6

4

0.5 0.5

0.5

0.50.5

0.5

0.5

1

0.5 1

P: Transition matrix (row norm.)

2cc

SCS CMU

61

• Case 1, Medium Size Graph– Matrix inversion is feasible, but…– What if we want many proximities?– Q: How to get all (n ) proximities efficiently?– A: FastAllDAP!

• Case 2: Large Size Graph – Matrix inversion is infeasible– Q: How to get one proximity efficiently?– A: FastOneDAP!

Challenges

2

skip

SCS CMU

62

FastAllDAP

• Q1: How to efficiently compute all possible proximities on a medium size graph?– a.k.a. how to efficiently solve multiple

linear systems simultaneously?

• Goal: reduce # of matrix inversions!

SCS CMU

63

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

FastAllDAP: Observation

1 5

3

2

6

4

0.5 0.5

0.5

0.50.5

0.5

0.5

1

0.5 1

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

Need two different matrix inversions!

P=

P=

SCS CMU

64

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

FastAllDAP: Rescue

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

Redundancy among different linear systems!

P=

P=

Overlap between two gray parts!

Prox(1 5)

Prox(1 6)

SCS CMU

65

FastAllDAP: Theorem

• Theorem:

• Proof: by SM Lemma

• Example:

SCS CMU

66

FastAllDAP: Algorithm

• Alg.– Compute Q– For i,j =1,…, n, compute

• Computational Save O(1) instead of O(n )!

• Example– w/ 1000 nodes, – 1m matrix inversion vs. 1 matrix!

2

SCS CMU

67

FastAllDAP

Size of Graph

Time (sec)

Straight-Solver

FastAllDAP

1,000xfaster!

SCS CMU

68

• Motivation

• Part I: Definitions

• Part II: Fast Solutions

• Part III: Applications

• Conclusion

Roadmap

• B_Lin: RWR

• FastAllDAP: Esc_Prob

• BB_Lin: Skewed BGs

• FastUpdate: Time-Evolving

SCS CMU

RWR on Bipartite Graph

69

n

m

authors

Conferences

Author-Conf. Matrix

Observation: n >> m!

Examples: 1. DBLP: 400k aus, 3.5k confs 2. NetFlix: 2.7M usrs, 18k mvs

SCS CMU

70

• Q: Given query i, how to solve it?

0 1/3 1/3 1/3 0 0 0 0 0 0 0 0

1/3 0 1/3 0 0 0 0 1/4 0 0 0 0

1/3 1/3 0 1/3 0 0 0 0 0 0 0 0

1/3 0 1/3 0 1/4

0.9

0 0 0 0 0 0 0

0 0 0 1/3 0 1/2 1/2 1/4 0 0 0 0

0 0 0 0 1/4 0 1/2 0 0 0 0 0

0 0 0 0 1/4 1/2 0 0 0 0 0 0

0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 0

0 0 0 0 0 0 0 1/4 0 1/3 0 0

0 0 0 0 0 0 0 0 1/2 0 1/3 1/2

0 0 0 0 0

0

0

0

0

00.1

0

0

0

0

0 0 1/4 0 1/3 0 1/2 0

0 0 0 0 0 0 0 0 0 1/3 1/3

1

0 0

RWR on Skewed bipartite graphs

??

… . . .. . …. .. .

. …. . ..

0

0

n m

Ar

… .

. .. . …. .. .

. …. . .. Ac

SCS CMU

• Step 1:

• Step 2:

• Cost:

• Examples – NetFlix: 1.5hr for pre-computation; – DBLP: 1 few minutes

71

BB_Lin: Pre-Computation [Tong+ 06]

M = Ac ArX

1( 0.9 )I M 3( )O m m E

2-step RWR for Conferences

All Conf-Conf Prox. Scores

SCS CMU

72

BB_Lin: Pre-Computation [Tong+ 06]

• Step 1:

• Step 2:

M = Ac ArX

1( 0.9 )I M

2-step RWR for Conferences

All Conf-Conf Prox. Scores

SCS CMU

73

BB_Lin: Pre-Computation [Tong+ 06]

• Step 1:

• Step 2:

• Cost:

• Examples – NetFlix: 1.5hr for pre-computation; – DBLP: 1 few minutes

M = Ac ArX

1( 0.9 )I M 3( )O m m E

2-step RWR for Conferences

All Conf-Conf Prox. Scores

Ac/ArE edges

m x m

SCS CMU

BB_Lin: On-Line Stage

74Ac/Ar

E edges

Case 1: - Conf - Conf

authors

Conferences

Read out !

SCS CMU

BB_Lin: On-Line Stage

75Ac/Ar

E edges

Case 2: - Au - Conf

authors

Conferences

1 matrix-vec!

SCS CMU

BB_Lin: On-Line Stage

76Ac/Ar

E edges

Case 3: - Au - Au

authors

Conferences

2 m atrix-vec!

SCS CMU

BB_Lin: Examples

• NetFlix dataset (2.7m user x 18k movies)– 1.5hr for pre-computation; – <1 sec for on-line

• DBLP dataset (400k authors x 3.5k confs)– A few minutes for pre-computation– <0.01 sec for on-line

77

SCS CMU

78

• Motivation

• Part I: Definitions

• Part II: Fast Solutions

• Part III: Applications

• Conclusion

Roadmap

• B_Lin: RWR

• FastAllDAP: Esc_Prob

• BB_Lin: Skewed BGs

• FastUpdate: Time-Evolving

SCS CMU

79

Challenges

• BB_Lin is good for skewed bipartite graphs– for NetFlix (2.7M nodes and 100M edges)– w/ 1.5 hr pre-computation for m x m core matrix– fraction of seconds for on-line query

• But…what if the graph is evolving over time– New edges/nodes arrive; edge weights increase…– 1.5hr itself becomes a part of on-line cost!

SCS CMU

80

1( 0.9 )I M 3( )O m m E t=0

Q: How to update the core matrix?

t=11( 0.9 )I M

~ ~3( )O m m E

?

SCS CMU

Update the core matrix

• Step 1:

• Step 2:

81

M = Ac ArX

1( 0.9 )I M ~

~

~ ?

M= X+

Rank 2 update

= + X

2( )O m 3( )O m m E

SCS CMU

Update : General Case [Tong+ 2008]

• E’ edges changed

• Involves n’ authors, m’ confs.

• Observation

82

M = Ac ArX~

min( ', ') 'n m E

n authors

m Conferences

SCS CMU

83

• Observation: – the rank of update is small!

• Algorithm:– E’ edges changed– Involves n’ authors, m’ confs.

– our Alg.

– (details in the paper)

Update : General Case

2(min( ', ') ')O n m m E3( )O m m E

min( ', ') 'n m E

83

n authors

m Conferences

SCS CMU

84

FastOneUpdate

176x speedup

40x speedup

Time (Seconds)

Datasets

SCS CMU

85

Fast-Batch-Update

Min (n’, m’)E’

Time (Seconds)Time (Seconds)

15x speed-up on average!

SCS CMU

86

Summary of Part II

• Goal: Efficiently Solve Linear System(s)

• Sols.– B_Lin: Approximate one large linear system– FastAllDAP: multiple inner-related linear

systems– BB_Lin: the intrinsic complexity is small– FastUpdate: (smooth) dynamic linear system

SCS CMU

87

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

B_Lin

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

p p p p p p

FastAllDAP

BB_Lin…FastUpdate

SCS CMU

88

• Motivation

• Part I: Definitions

• Part II: Fast Solutions

• Part III: Applications

• Conclusion

Roadmap

• Link Prediction

• NF

• gCap

• CePS

• G-Ray

• pTrack/cTrack

SCS CMU

890 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0

0.05

0.1

0.15

0.2

0.25

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.090

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18 Link Prediction: existence

no link

with link

density

density

Prox (ij)+Prox (ji)

Prox (ij)+Prox (ji)

Prox. is effective to distinguish red and blue!

SCS CMU

90

Link Prediction: direction

• Q: Given the existence of the link, what is the direction of the link?

• A: Compare prox(ij) and prox(ji)>70%

Prox (ij) - Prox (ji)

density

SCS CMU

91

Neighborhood Formulation

ICDM

KDD

SDM

Philip S. Yu

IJCAI

NIPS

AAAI M. Jordan

Ning Zhong

R. Ramakrishnan

… …

Conference Author

A: RWR! [Sun ICDM2005]

Q: what is most related conference to ICDM

SCS CMU

92

NF: example

ICDM

KDD

SDM

ECML

PKDD

PAKDD

CIKM

DMKD

SIGMOD

ICML

ICDE

0.009

0.011

0.0080.007

0.005

0.005

0.005

0.0040.004

0.004

SCS CMU

93

gCaP: Automatic Image Caption• Q

Sea Sun Sky Wave{ } { }Cat Forest Grass Tiger

{?, ?, ?,}

?A: RWR! [Pan KDD2004]

SCS CMU

94

Test Image

Sea Sun Sky Wave Cat Forest Tiger Grass

Image

Keyword

Region

SCS CMU

95

Test Image

Sea Sun Sky Wave Cat Forest Tiger Grass

Image

Keyword

Region

{Grass, Forest, Cat, Tiger}

SCS CMU

96

Center-Piece Subgraph(CePS)

A C

B

A C

B

?

Original GraphBlack: query nodes

CePS

Q

A: RWR! [Tong KDD 2006]

Red: Max (Prox(Red, A) x Prox(Red, B) x Prox(Red, C))

CePS guy

SCS CMU

97

CePS: Example

R. Agrawal Jiawei Han

V. Vapnik M. Jordan

H.V. Jagadish

Laks V.S. Lakshmanan

Heikki Mannila

Christos Faloutsos

Padhraic Smyth

Corinna Cortes

15 1013

1 1

6

1 1

4 Daryl Pregibon

10

2

11

3

16

SCS CMU

98

K_SoftAnd: Relaxation of AND

Asking AND query? No Answer!

Disconnected Communities

Noise

SCS CMU

99

1 7

5

10

11

9 8

12

13

4

3

62

0.0103

0.1617

0.1387

0.1617

0.0849

0.1617

0.1617

0.0103

0.1387

0.13870.16170.1617

0.0103

2_SoftAnd

And

1_SoftAnd (OR)

x 1e-4

1 7

5

10

11

9 8

12

13

4

3

62

0.4505

0.1010

0.0710

0.1010

0.2267

0.1010

0.1010

0.4505

0.0710

0.07100.10100.1010

0.4505

1 7

5

10

11

9 8

12

13

4

3

62

0.0103

0.0046

0.0019

0.0046

0.0024

0.0046

0.0046

0.0103

0.0019

0.00190.00460.0046

0.0103

SCS CMU

100

R. Agrawal Jiawei Han

V. Vapnik M. Jordan

H.V. Jagadish

Laks V.S. Lakshmanan

Umeshwar Dayal

Bernhard Scholkopf

Peter L. Bartlett

Alex J. Smola

1510

13

3 3

5 2 2

327

4

CePS: 2 Soft_AND

Stat.

DB

SCS CMU

101

OutputInput

Attributed Data Graph

Query Graph

Matching SubgraphAccountant

CEO

Manager

SEC

Graph X-Ray

SCS CMU

102

G-Ray: How to?

matching node

matching node

matching node

matching node

Goodness = Prox (12, 4) x Prox (4, 12) x Prox (7, 4) x Prox (4, 7) x Prox (11, 7) x Prox (7, 11) x Prox (12, 11) x Prox (11, 12)

SCS CMU

103

Effectiveness: star-query

Query Result

SCS CMU

104

Effectiveness: line-query

Query

Result

SCS CMU

105

Query

Result

Effectiveness: loop-query

SCS CMU

106

pTrack

• [Given] – (1) a large, skewed time-evolving bipartite

graphs, – (2) the query nodes of interest

• [Track] – (1) top-k most related nodes for each query

node at each time step t; – (2) the proximity score (or rank of proximity)

between any two query nodes at each time step t

Author A’ Rank in KDD

Year

SCS CMU

107

Philip S. Yu’s Top-5 conferences up to each year

ICDE

ICDCS

SIGMETRICS

PDIS

VLDB

CIKM

ICDCS

ICDE

SIGMETRICS

ICMCS

KDD

SIGMOD

ICDM

CIKM

ICDCS

ICDM

KDD

ICDE

SDM

VLDB

1992 1997 2002 2007

DatabasesPerformanceDistributed Sys.

DatabasesData Mining

SCS CMU

108

KDD’s Rank wrt. VLDB over years

Rank

Year

Data Mining and Databases are more and more relavant!

SCS CMU

109

cTrack

• [Given] – (1) a large, skewed time-evolving graphs, – (2) the query nodes of interest

• [Track] – (1) top-k most central nodes at each time step t; – (2) the centrality score (or rank of centrality) for

each query node at each time step t

SCS CMU

110

Ranking of Centrality up to each year (in NIPS)

M. Jordan

G.HintonC. Koch

T. Sejnowski

Year

Rank of Influential-ness

SCS CMU

111

10 most influential authors up to each year

Author-paper bipartite graph from NIPS 1987-1999. 3k. 1740 papers, 2037 authors, spreading over 13 years

T. Sejnowski

M. Jordan

SCS CMU

112

A BH1 1

D1 1

E

F

G1 11

I J1

1 1

RWR

Varia

nts

w/ Tim

e

w/ Attr

ibut

e

Gro

up P

orx.

Definitions

B_Lin

FastAllDAP

BB_Lin

FastUpdate

Computations

Link Prediction

NF

gCap

CePS

G-Ray

pTrack

cTrack

Applications

ProximityOn

Graphs

Weighted Multiple

Relationship

EfficientlySolve Linear

System(s)

Use Proximity as

Building block

SCS CMU

Take-home Messages

• Proximity Definitions – RWR

– and a lot of variants

• Computations– SM Lemma

113

(1 )i i ir cWr c e

1 11 1 1

1( )

1

TT

T

A u v AA A u v A

v A u

SCS CMU

References

• L. Page, S. Brin, R. Motwani, & T. Winograd. (1998), The PageRank Citation Ranking: Bringing Order to the Web, Technical report, Stanford Library.

• T.H. Haveliwala. (2002) Topic-Sensitive PageRank. In WWW, 517-526, 2002

• J.Y. Pan, H.J. Yang, C. Faloutsos & P. Duygulu. (2004) Automatic multimedia cross-modal correlation discovery. In KDD, 653-658, 2004.

• C. Faloutsos, K. S. McCurley & A. Tomkins. (2002) Fast discovery of connection subgraphs. In KDD, 118-127, 2004.

• J. Sun, H. Qu, D. Chakrabarti & C. Faloutsos. (2005) Neighborhood Formation and Anomaly Detection in Bipartite Graphs. In ICDM, 418-425, 2005.

• W. Cohen. (2007) Graph Walks and Graphical Models. Draft.114

SCS CMU

References

• P. Doyle & J. Snell. (1984) Random walks and electric networks, volume 22. Mathematical Association America, New York.

• Y. Koren, S. C. North, and C. Volinsky. (2006) Measuring and extracting proximity in networks. In KDD, 245–255, 2006.

• A. Agarwal, S. Chakrabarti & S. Aggarwal. (2006) Learning to rank networked entities. In KDD, 14-23, 2006.

• S. Chakrabarti. (2007) Dynamic personalized pagerank in entity-relation graphs. In WWW, 571-580, 2007.

• F. Fouss, A. Pirotte, J.-M. Renders, & M. Saerens. (2007) Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Trans. Knowl. Data Eng. 19(3), 355-369 2007.

115

SCS CMU

References

• H. Tong & C. Faloutsos. (2006) Center-piece subgraphs: problem definition and fast solutions. In KDD, 404-413, 2006.

• H. Tong, C. Faloutsos, & J.Y. Pan. (2006) Fast Random Walk with Restart and Its Applications. In ICDM, 613-622, 2006.

• H. Tong, Y. Koren, & C. Faloutsos. (2007) Fast direction-aware proximity for graph mining. In KDD, 747-756, 2007.

• H. Tong, B. Gallagher, C. Faloutsos, & T. Eliassi-Rad. (2007) Fast best-effort pattern matching in large attributed graphs. In KDD, 737-746, 2007.

• H. Tong, S. Papadimitriou, P.S. Yu & C. Faloutsos. (2008) Proximity Tracking on Time-Evolving Bipartite Graphs. to appear in SDM 2008.

116

SCS CMU

117

Thank you!

htong@cs.cmu.edu

www.cs.cmu.edu/~htong