Transcript of: Semi-supervised learning on graphs via stationary empirical correlations
Ya Xu, Justin S. Dyer, Art B. Owen (Department of Statistics, Stanford University)
finmath.stanford.edu/~owen/pubtalks/asa2010.pdf

Page 1:

Semi-supervised learning on graphs

via stationary empirical correlations

Ya Xu

Justin S. Dyer

Art B. Owen

Department of Statistics

Stanford University

Kriging on graphs

Page 2:

Prediction on graphs

[Figure: a small graph with edge weights wij; some nodes labeled spam / non-spam, others marked '?'.]

Some nodes are labeled, some not.

We want to predict the unlabeled using labels and graph structure.

Operative assumption: nearby nodes are similar.

Page 3:

The story in one slide

1) Many graph-based predictions are linear in the observed responses.

2) So there’s a “Gaussian model” story.

3) We find the implied correlations,

4) and replace them with empirical ones.

5) Sometimes it makes a big improvement.

6) We did small examples, but with scaling in mind

7) Now · · · early results for Wikipedia graph

Why it improves

The semi-supervised learning methods we found had preconceived notions of how correlation varies with graph distance.

We estimate the correlation vs distance pattern from the data.

Page 4:

Graph notation

G       The graph
wij     Edge weight from i to j
W       Adjacency matrix
wi+ = Σj wij    Out-degree of i
w+j = Σi wij    In-degree of j
w++ = Σij wij   Graph volume
Yi      Response value at node i
Y(0)    Measured responses, i = 1, …, r
Y(1)    Unknown responses, i = r + 1, …, n

Page 5:

Graph random walk

Transition probability    Pij = wij / wi+

Stationary distribution   πi,  e.g. PageRank

The associated random walk leaves node i for node j with probability proportional to wij .

We assume it is aperiodic and irreducible. (If necessary add teleportation.)

∴ it has a stationary distribution π

Graph Laplacian (needed later)

    ∆ij = { wi+ − wii   i = j
          { −wij        i ≠ j
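The transition matrix, stationary distribution and Laplacian above are easy to compute directly; a minimal NumPy sketch on a made-up 3-node weighted digraph (the weights are purely illustrative):

```python
import numpy as np

# Toy directed graph: W[i, j] = edge weight from i to j (hypothetical weights).
W = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0]])

w_out = W.sum(axis=1)               # out-degrees w_{i+}
P = W / w_out[:, None]              # transition matrix P_ij = w_ij / w_{i+}

# Stationary distribution pi: left eigenvector of P with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

# Graph Laplacian: diagonal w_{i+} - w_{ii}, off-diagonal -w_{ij}.
Delta = np.diag(w_out) - W
```

Since the chain is irreducible and aperiodic, pi satisfies pi @ P == pi, and the Laplacian's rows sum to zero.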

Page 6:

What is a graph Laplacian (anyway)?

For a function f(i) ∈ R, let f = (f(1), f(2), …, f(n))^T ∈ R^n and

    (∆f)(i) = Σj wij ( f(i) − f(j) )

In 1 dimension, with i − 1 and i + 1 as neighbors of i:

    (f(i) − f(i−1)) + (f(i) − f(i+1)) = −( f(i+1) − 2f(i) + f(i−1) )

So ∆ is like negative curvature with respect to the graph.
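The 1-D identity above can be checked numerically; a small sketch on a path graph with unit weights (the choice of n and of f is arbitrary):

```python
import numpy as np

# Path graph on n nodes, unit weights between neighbors i and i+1.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

Delta = np.diag(W.sum(axis=1)) - W   # graph Laplacian

f = np.arange(n, dtype=float) ** 2   # any function on the nodes
Df = Delta @ f

# At an interior node, (Delta f)(i) = -(f(i+1) - 2 f(i) + f(i-1)):
# the negative second difference, i.e. negative discrete curvature.
i = 3
assert np.isclose(Df[i], -(f[i + 1] - 2 * f[i] + f[i - 1]))
```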

Page 7:

Zhou, Huang, Scholkopf (2005)

Node similarity:

    sij ≡ πi Pij + πj Pji

Variation functional:

    Ω(Z) = (1/2) Σ_{i,j} sij ( Zi/√πi − Zj/√πj )²

Criterion:

    Ẑ = argmin_{Z ∈ R^n}  Ω(Z) + λ‖Z − Y*‖²

    Y*_i = Yi if observed; µi (default, e.g. 0) otherwise

ZHS trades off fit to the observations against graph smoothness via λ.

The result is a linear function of Y*.

There must be an equivalent Gaussian process story.
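Setting the gradient of Ω(Z) + λ‖Z − Y*‖² to zero gives the closed form Ẑ = λ(L + λI)^{-1} Y* with L = Π^{-1/2} ∆s Π^{-1/2}, which makes the linearity in Y* explicit. A hedged sketch on a random toy graph (all sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.random((n, n))
np.fill_diagonal(W, 0.0)

P = W / W.sum(axis=1, keepdims=True)            # random-walk transition matrix
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()                                  # stationary distribution

S = pi[:, None] * P + pi[None, :] * P.T         # s_ij = pi_i P_ij + pi_j P_ji
L_s = np.diag(S.sum(axis=1)) - S                # similarity Laplacian
L = np.diag(pi ** -0.5) @ L_s @ np.diag(pi ** -0.5)

lam = 0.5
Y_star = rng.standard_normal(n)                 # observed labels / defaults

# Minimizer of Z^T L Z + lam * ||Z - Y*||^2, linear in Y_star.
Z = lam * np.linalg.solve(L + lam * np.eye(n), Y_star)
```

Doubling Y_star doubles Z, confirming the claimed linearity.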

Page 8:

Kriging

[Figure: a node marked '?' surrounded by labeled observations 2.1, 2.3, 2.4, 1.1, 1.9, 1.7 at varying distances.]

1) Predict at ‘?’ by weighting the obs

2) 1.9 gets more weight than 1.7 because it is closer

3) the ≈2s get more weight than the 1.1, because there are 3 of them

4) but not triple the weight, because they’re somewhat redundant

5) the tradeoffs are automatic · · · given covariance in a Gaussian model

Kriging originated in geostatistics.

Page 9:

Kriging model

    obs               Y = Xβ + S + ε ∈ R^n
    coefficients      β ∈ R^k  (we'll have k = 1)
    predictors        X ∈ R^{n×k},  e.g. X = √π or 1n
    correlated part   S ∼ N(0, Σ)
    noise             ε ∼ N(0, Γ),  Γ diagonal

Predictions

Now Y = Z + ε, for signal Z = Xβ + S.

Taking X fixed and β ∼ N(µ, δ^{-1}) makes Z ∼ N(Xµ, Ψ), where Ψ = XX^T δ^{-1} + Σ.

Predict by Ẑ = E(Z | Y(0)).

Page 10:

Kriging some more

Z is the signal; Y(0) holds the observed responses.

Partition Ψ:

    Ψ = Cov( Z(0), Z(1) ) = [ Ψ00  Ψ01
                              Ψ10  Ψ11 ] = ( Ψ•0  Ψ•1 )

Joint distribution of the signal (everywhere) and the observations:

    ( Z, Y(0) ) ∼ N( ( X E(β), X0 E(β) ),  [ Ψ    Ψ•0
                                             Ψ0•  Ψ00 + Γ00 ] )

· · · yields an expression for E(Z | Y(0)).
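The joint normal above yields the usual Gaussian conditioning formula, E(Z | Y(0)) = XE(β) + Ψ•0 (Ψ00 + Γ00)^{-1} (Y(0) − X0 E(β)). A sketch with an arbitrary positive-definite Ψ (all sizes and values illustrative; the mean is taken as 0 for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 10, 6                                  # n nodes total, r observed

A = rng.standard_normal((n, n))
Psi = A @ A.T + np.eye(n)                     # an arbitrary PD signal covariance
Gamma00 = 0.1 * np.eye(r)                     # noise variance on observed nodes
mean_Z = np.zeros(n)                          # X E(beta), taken as 0 here

Y0 = rng.standard_normal(r)                   # observed responses Y(0)

Psi_dot0 = Psi[:, :r]                         # Psi_{.0}: all rows, observed cols
Psi00 = Psi[:r, :r]

# Kriging predictor: conditional mean of the signal given the observations.
Z_hat = mean_Z + Psi_dot0 @ np.linalg.solve(Psi00 + Gamma00, Y0 - mean_Z[:r])
```

As the noise Γ00 shrinks to 0, the predictor interpolates the observations exactly.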

Page 11:

ZHS method as kriging

Let Π = diag(πi) and define

    ∆ij = { si+ − sii   i = j
          { −sij        i ≠ j

the Laplacian after replacing wij by sij = πi Pij + πj Pji.

Choose

    noise variance    Γ = λ^{-1} In
    signal variance   Σ = Π^{1/2} ∆^+ Π^{1/2}   (+ denotes the generalized inverse)
    predictors        X = (√π1, …, √πn)^T
    defaults          µi = µXi,  r + 1 ≤ i ≤ n

Then

    lim_{δ→0+} Kriging(Γ, Σ, X, δ) = ZHS method.

Page 12:

Interpretation

The ZHS method is a kind of kriging.

The correlation matrix depends on the graph but not on the nature of the response.

This seems strange: shouldn't some variables correlate strongly with their neighbors, others weakly, and still others negatively?

It also anticipates Ẑ ∝ √π (for every response variable).

Page 13:

Belkin, Matveeva, Niyogi (2004)

Graph Tikhonov regularization:

    Z^T ∆ Z + λ0 ‖Z(0) − Y(0)‖²

∆ is the graph Laplacian; the penalty is only on the observed responses.

As kriging

    noise variance    Γ = diag( λ0^{-1} Ir,  λ1^{-1} I_{n−r} )
    signal variance   Σ = ∆^+  (no Π^{1/2})
    predictors        X = 1n  (no √πi)

Let δ → 0+ and then let λ1 → 0+.

Page 14:

Zhou et al (2004)

Undirected graph precursor to ZHS, using Dii = wi+ = w+i:

    (1/2) Σ_{i,j} wij ( Zi/√Dii − Zj/√Djj )² + λ‖Z − Y*‖²

As kriging

    noise variance    Γ = λ^{-1} I
    signal variance   Σ = D^{1/2} ∆^+ D^{1/2}
    predictors        X = (√D11, …, √Dnn)^T

with δ → 0+.

Page 15:

More examples

Zhou, Scholkopf, Hofmann (2005): They define a hub walk and an authority walk. Each has a transition matrix, stationary distribution, similarity matrix and similarity Laplacian. They replace Ω(Z) by the convex combination

    γ ΩH(Z) + (1 − γ) ΩA(Z),    0 < γ < 1.

The resulting signal variance is the corresponding convex combination of the hub and authority signal variance matrices.

Belkin, Niyogi, Sindhwani (2006): Manifold regularization. Get covariance (K + γ∆)^{-1} when their Mercer kernel is linear with matrix K.

Kondor and Lafferty (2002), Smola and Kondor (2003) and Zhu, Ghahramani and Lafferty (2003) use the spectral criterion Z^T L Z where L = Σi f(di) ui ui^T and the (di, ui) are eigenvalue/eigenvector pairs of ∆. The kriging covariance is Σ = Σi f(di)^{-1} ui ui^T.
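The spectral pair (criterion matrix, covariance) can be sketched directly from an eigendecomposition of the Laplacian; here f is an arbitrary positive spectral transform and the graph is a toy complete graph:

```python
import numpy as np

n = 5
W = np.ones((n, n)) - np.eye(n)          # complete graph, unit weights
Delta = np.diag(W.sum(axis=1)) - W       # graph Laplacian

d, U = np.linalg.eigh(Delta)             # eigenpairs (d_i, u_i)

f = lambda x: 1.0 + x                    # an assumed positive spectral transform
L_spec = U @ np.diag(f(d)) @ U.T         # criterion matrix sum_i f(d_i) u_i u_i^T
Sigma = U @ np.diag(1.0 / f(d)) @ U.T    # kriging covariance sum_i f(d_i)^-1 u_i u_i^T

# By construction the two are inverses of each other.
assert np.allclose(L_spec @ Sigma, np.eye(n))
```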

Page 16:

Empirical stationary correlations

In random walk smoothing (ZHS):

    Y ∼ N( µ√π,  Π^{1/2}(∆^+ + 11^T δ^{-1})Π^{1/2} + λ^{-1} I )

In Tikhonov smoothing (BMN):

    Y ∼ N( µ1,  I(∆^+ + 11^T δ^{-1})I + λ^{-1} I )

Our proposal (XDO):

    Y ∼ N( µX,  V^{1/2}(σ² R)V^{1/2} + λ^{-1} I )

where X ∈ R^n and V = diag(vi) are given, and R is a correlation matrix we choose, via Rij = ρ(sij) for a smooth function ρ(·) of the similarity sij (e.g. sij = πi Pij + πj Pji). We also choose σ > 0.

Stationary because ρ depends only on s; empirical because we get ρ from data.

NB: E(Y) and Var(Y) are not necessarily stationary.

Page 17:

What we’ll do

Estimate ρ(sij) at every pair of nodes i and j using (essentially)

E(Yi − Yj)² ≈ (Yi − Yj)²,    ∀ 1 ≤ i < j ≤ r

Extreme overfitting.

Then smooth them.

Page 18:

The variogram estimator

    Φij ≡ (1/2) E( ( (Yi − µXi) − (Yj − µXj) )² )
        = 1/λ + (1/2) σ² ( Xi² − 2 Xi Xj Rij + Xj² )    (by model)

    Φ̂ij ≡ (1/2) ( (Yi − µXi) − (Yj − µXj) )²,    1 ≤ i < j ≤ r

Set Φ̂ij = Φij and solve for R̂ij, a naive correlation for the signals Zi and Zj.
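Solving Φ̂ij = Φij for Rij gives Rij = (Xi² + Xj²)/(2 Xi Xj) − (Φ̂ij − 1/λ)/(σ² Xi Xj). A sketch of that inversion (the function name and inputs are illustrative, and X must be nonzero):

```python
import numpy as np

def naive_R(Y, X, mu, sigma, lam):
    """Naive variogram-based correlation estimate on the r observed nodes."""
    r = len(Y)
    resid = Y - mu * X                      # Y_i - mu * X_i
    R = np.ones((r, r))
    for i in range(r):
        for j in range(r):
            if i == j:
                continue
            Phi_hat = 0.5 * (resid[i] - resid[j]) ** 2
            # Invert Phi_ij = 1/lam + 0.5*sigma^2*(Xi^2 - 2*Xi*Xj*Rij + Xj^2).
            R[i, j] = (X[i] ** 2 + X[j] ** 2) / (2 * X[i] * X[j]) \
                      - (Phi_hat - 1.0 / lam) / (sigma ** 2 * X[i] * X[j])
    return R
```

The resulting matrix is symmetric with unit diagonal, but (as the slides note) badly overfit, hence the smoothing step that follows.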

Page 19:

Using the variogram estimator

1) Φ̂ij is a naive estimator of Φij.

2) We plug it in and solve for a naive R̂ij.

3) Then fit a spline curve to the (log(1 + sij), R̂ij) pairs: R̂ij ≈ ρ̂(sij).

4) Put Σ̂ = σ² V R̂ V, and make it positive definite: Σ̂+

4′) (Variant) Use a low-rank approximation to Σ̂ (might scale better for large n)

Then we use kriging with the estimated correlation matrix.
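Step 4 needs the estimated covariance to be positive definite; one standard way to do this (a sketch, not necessarily the authors' exact choice) is to clip negative eigenvalues:

```python
import numpy as np

def make_psd(Sigma, eps=1e-8):
    """Project a symmetric matrix onto the PD cone by clipping eigenvalues."""
    Sigma = (Sigma + Sigma.T) / 2          # symmetrize first
    d, U = np.linalg.eigh(Sigma)
    return U @ np.diag(np.clip(d, eps, None)) @ U.T

# Illustrative indefinite "correlation-like" matrix.
A = np.array([[ 1.0, 0.9, -0.5],
              [ 0.9, 1.0,  0.9],
              [-0.5, 0.9,  1.0]])
Sigma_plus = make_psd(A)
assert np.all(np.linalg.eigvalsh(Sigma_plus) > 0)
```

Keeping only the largest clipped eigenvalues instead of all of them gives the low-rank variant 4′.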

Page 20:

Smoothing to get ρ̂

[Figure: two panels with vertical axis rho; the raw naive correlations plotted against the similarity scale, and the fitted smooth curve ρ̂.]

Page 21:

UK web link dataset

• Nodes are 107 UK universities
• Edges are web links
• Weights wij: # links from i to j
• Yi: research score measuring the quality of university i's research

We will try to predict the university research scores from the graph structure and some of the scores.

Data features

• RAE scores in [0.4, 6.5] with mean ∼ 3 and standard deviation ∼ 1.9.
• 15% of weights wij are 0, 50% are below 7, the max is 2130

Page 22:

Experiment

1) Randomly hold out some universities (ranging from ∼10% to ∼90%)

2) Predict held-out scores

3) Find mean square error

4) Repeat 50 times

Methods: random walk smoothing, Tikhonov smoothing, and empirical correlation versions of both.

Tuning

Empirical correlation has two tuning parameters: λ and σ.

The other methods have just one.

The comparison is fair because we use holdouts.

For the RW and Tikhonov methods we eventually just took their best parameter value, and it still did not beat cross-validated empirical correlations.

Page 23:

Implementation notes

Tikhonov

This method is defined for undirected graphs, so we use W ← W + W^T
. . . in both the original and empirical stationary versions.

Choosing µ, for which β ∼ N(µ, δ^{-1})

For RW: use µ = 0 for binary responses, but for the UNI data take

    µ̂ = (1/r) Σ_{i=1}^r Yi/Xi

over the 'held in' nodes.

For Tikhonov: µ disappears from the equations in the δ → 0 limit, so we don't need it.

Page 24:

Random walk ZHS for Uni data

Recall the criterion

    (1/2) Σ_{i,j} sij ( Zi/√πi − Zj/√πj )² + λ‖Z − Y*‖²

We find (empirically) that the estimate Ẑi is nearly ∝ √πi.

Nodes with comparable PageRank πi get similar predictions.

The similarity sij is virtually ignored.

Page 25:

Results for University data

Notes

• RW has Ẑ nearly ∝ X = √π

• Tikhonov ignores direction of links

• Empirical correlation performance is not sensitive to rank reduction

Page 26:

Numerical summary

Improvement over baseline

                Random walk   Tikhonov
Baseline MSE        1.71        3.64
Random walk         3.8%         -
Tikhonov             -          3.2%
Empirical          25.0%       50.9%
Empirical R5       32.4%       53.9%
Empirical R1       19.1%       50.9%

Mean square prediction errors when 50 of 107 university scores are held out.

Baseline is plain regression on X with no other graphical input.

Page 27:

Web KB data

We used the data for Cornell, omitting 'other'.

    Yi = { 1    student web page
         { −1   faculty, staff, dept, course, project

    Wij = { 1   i links to j
          { 0   else

Page 28:

Results for Web KB data

Notes

• Now X = 1, so ∝ X is not helpful; the solid line is a coin toss
• Tikhonov ignores direction of links, but now it helps
• Empirical correlation performance is not sensitive to rank reduction

Page 29:

Numerical results for webKB

Improvement over baseline

                   Random walk   Tikhonov
Baseline (1−AUC)       0.5         0.5
Random walk           −5.4%         -
Tikhonov                -          8.5%
Empirical             43.0%       37.5%
Empirical R5          40.0%       31.9%
Empirical R1          29.0%       16.3%

Baseline is a coin toss, AUC = 0.5.

Page 30:

Next steps

1) more examples

Biological examples . . . noisy edges

Wikipedia preliminary results

2) scaling issues

sparsity helps; sparse plus low rank promising

Ya Xu has Markov chain algorithms for some similarities

3) more similarity measures

Page 31:

Peek ahead

For the Wikipedia graph we can make up responses, e.g. Yi = 1 iff the page links to a given page such as the one for US, China, France or Bob Dylan.

We have 2.4 million nodes and observe 10%.

Empirical stationary correlation got 22.5% (1−AUC) error in the rank-1 version vs. near 53.8% for a local average.

PROPACK sparse SVD code works for the empirical algorithm because it needs eigenvectors for the largest eigenvalues, but it fails for random walk smoothing and Tikhonov smoothing, which need the smallest.

Page 32:

Randomized labels

[Figure: panels for Protein, Webspam and Wikipedia (RW on top, Tikhonov below), each showing observed roughness Ω(Y) = • against a histogram of Ω(permuted(Y)).]

Page 33:

Second peek ahead

Work in progress with Justin Dyer

Copula of links between Wikipedia pages

Page 34:

Thanks

Ying Lu asked me to speak

San Francisco Bay Area Chapter of the ASA for organizing:

Spencer Graves, Chris Barker, Dean Fearn

Steve Scott and Google for hosting

NSF division of mathematics and statistics
