Motivation Related Work Anchor-Based Label Prediction AnchorGraph Experiments Conclusions

Large Graph Construction for Scalable Semi-Supervised Learning

Wei Liu Junfeng He Shih-Fu Chang

Department of Electrical Engineering

Columbia University

New York, NY, USA

International Conference on Machine Learning, 2010


Outline

1 Motivation

2 Related Work
    Mid-Scale SSL
    Large-Scale SSL

3 Anchor-Based Label Prediction
    Anchors
    Anchor-to-Data Nonparametric Regression

4 AnchorGraph
    Coupling Label Prediction and Graph Construction
    AnchorGraph Regularization

5 Experiments

6 Conclusions


Why Semi-Supervised Learning (SSL)?

In practice, labeling data is expensive.

Using both labeled and unlabeled data, SSL generalizes and improves on supervised learning.

Most SSL methods cannot handle a large amount of unlabeled data because they scale poorly with the data size.

Large-scale SSL is critical nowadays, as we can easily collect massive amounts (up to hundreds of millions) of unlabeled data from the Internet (Facebook, Flickr).


SSL on Large Social Networks

http://www.cmu.edu/joss/content/articles/volume1/Freeman.html


Mid-Scale SSL

Manifold and Graph

Manifold Assumption

Close samples under some similarity measure have similar class labels. Graph-based SSL methods, including LGC, GFHF, LapSVM, etc., adopt this assumption.

Neighborhood Graph

A sparse graph can reveal the data manifold, where adjacent graph nodes represent nearby data points. The commonly used graph is the kNN graph (linking edges between each data point and its k nearest neighbors), whose construction takes O(kn^2) time. Inefficient at large scale!
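To make the quadratic cost concrete, here is a minimal brute-force kNN-graph sketch (not from the paper; the function name and NumPy implementation are illustrative). It computes all n^2 pairwise distances, which is exactly the bottleneck avoided later:

```python
import numpy as np

def knn_graph(X, k):
    """Brute-force kNN adjacency: computes all n^2 pairwise distances."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # n x n squared distances
    np.fill_diagonal(d2, np.inf)                         # no self-links
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[:k]] = 1.0                # link k nearest neighbors
    return np.maximum(W, W.T)                            # symmetrize

X = np.random.RandomState(0).randn(50, 2)
W = knn_graph(X, k=6)
```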


Mid-Scale SSL

Manifold and Graph

Dense graphs hardly reveal manifolds.

Sparse graphs are preferred.

Example

Figure: a toy data manifold consisting of two submanifolds (left) and its 6NN graph (right).


Large-Scale SSL

Existing Large-Scale SSL Methods

Nonparametric Function Induction (Delalleau et al., 2005): uses the truncated graph on a sample subset and its connections to the remaining samples.

Harmonic Mixtures (Zhu & Lafferty, 2005): needs a large sparse graph, but does not explain how to construct one.

Prototype Vector Machine (PVM) (Zhang et al., 2009): uses the Nyström approximation of the fully-connected graph adjacency matrix. There is no guarantee that the graph Laplacian computed from this approximation is positive semidefinite.

Eigenfunction (Fergus et al., 2010): relies on the dimension-independent data density assumption, which does not always hold.


Large-Scale SSL

Our Aim

Develop a scalable SSL approach that is easy to implement and has closed-form solutions.

Address the critical issue of how to build an entire sparse graph over a large dataset.

Achieve low space and time complexities with respect to the data size. Our goal: linear complexity!


Anchors

How to scale up?

Graph-based SSL methods usually require cubic time complexity O(n^3) in the data size n, because an n × n matrix inversion is needed to infer the full-size label prediction.

At large scale, full-size label prediction is infeasible, which has motivated the idea of landmark samples, i.e., anchors.

If one instead inferred the labels on a much smaller landmark set of size m (≪ n), the time complexity could be reduced to O(m^3).


Anchors

Anchor Points

Example

Figure: data points x_i and anchor points u_1, ..., u_4; the anchors are obtained as k-means cluster centers.
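The anchor-selection step can be sketched with plain Lloyd iterations (a hedged illustration: the slides use k-means centers as anchors, and this toy NumPy routine is one possible implementation; the function name and parameters are my own):

```python
import numpy as np

def kmeans_anchors(X, m, iters=20, seed=0):
    """Return m anchor points as k-means cluster centers (plain Lloyd's algorithm)."""
    rng = np.random.RandomState(seed)
    U = X[rng.choice(len(X), m, replace=False)].copy()   # init centers from data
    for _ in range(iters):
        d2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)                            # nearest center per point
        for k in range(m):
            members = X[assign == k]
            if len(members):
                U[k] = members.mean(0)                   # recompute center
    return U

X = np.random.RandomState(1).randn(200, 2)
U = kmeans_anchors(X, m=4)
```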


Anchors

Anchor-to-Data Relationship

Input data points X = {x_1, ..., x_n} ⊂ R^d and introduced anchor points U = {u_1, ..., u_m}.

Suppose a soft label prediction function f: R^d → R. Predict the label of each data point as a weighted average of the labels on anchors:

f(x_i) = Σ_{k=1}^m Z_ik f(u_k),  i = 1, ..., n  (1)

where the Z_ik denote the combination weights.

The matrix form:

f = Za,  Z ∈ R^{n×m},  m ≪ n  (2)

where a = [f(u_1), ..., f(u_m)]^T is the label vector that we will solve for.
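As a tiny numeric illustration of eqs. (1)-(2) (toy values, not from the slides), prediction is just a matrix-vector product once Z is known:

```python
import numpy as np

# toy regression matrix: each row holds convex weights over m = 3 anchors
Z = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.5, 0.5],
              [0.2, 0.0, 0.8]])
a = np.array([1.0, -1.0, 1.0])   # soft labels on the anchors

f = Z @ a                        # eq. (1): f(x_i) = sum_k Z_ik f(u_k)
# f == [0.4, 0.0, 1.0]: each prediction is a weighted average of anchor labels
```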


Anchor-to-Data Nonparametric Regression

How to obtain Z?

We impose the nonnegative normalization constraints Σ_{k=1}^m Z_ik = 1 and Z_ik ≥ 0 in order to maintain a unified range of values for all soft labels predicted via anchors.

The manifold assumption implies that nearby points should have similar labels, while distant points are very unlikely to take similar labels.

So we further impose Z_ik = 0 whenever anchor u_k is far away from x_i. Specifically, Z_ik ≠ 0 only for the s closest anchor points of x_i (whose indexes form ⟨i⟩ ⊂ [1 : m]).


Anchor-to-Data Nonparametric Regression

The weights Z_ik are sample-adaptive, so the anchor-based label prediction scheme falls into nonparametric regression: the label of x_i is a locally weighted average of the labels on anchors.

The regression matrix Z ∈ R^{n×m} is nonnegative and sparse.

Example

Figure: with s = 2, x_1 is linked to its two closest anchors u_1 and u_2 with weights Z_11 and Z_12.


Anchor-to-Data Nonparametric Regression

Predefined Kernel-based Weights

The Nadaraya-Watson Kernel Regression

Z_ik = K_h(x_i, u_k) / Σ_{k'∈⟨i⟩} K_h(x_i, u_k'),  ∀k ∈ ⟨i⟩  (3)

Typically, we may adopt the Gaussian kernel K_h(x_i, u_k) = exp(−‖x_i − u_k‖^2 / 2h^2).

These predefined weights are sensitive to the hyperparameter h and lack a meaningful interpretation.
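A minimal NumPy sketch of eq. (3) (the function name and toy data are my own; the slides only specify the formula): each point keeps Gaussian-kernel weights on its s closest anchors, normalized to sum to one.

```python
import numpy as np

def kernel_weights(X, U, s=2, h=1.0):
    """Eq. (3): Gaussian-kernel weights over the s closest anchors of each point."""
    n, m = len(X), len(U)
    d2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)   # point-anchor distances
    Z = np.zeros((n, m))
    for i in range(n):
        idx = np.argsort(d2[i])[:s]                       # <i>: s nearest anchors
        K = np.exp(-d2[i, idx] / (2 * h ** 2))            # K_h(x_i, u_k)
        Z[i, idx] = K / K.sum()                           # rows sum to 1
    return Z

X = np.random.RandomState(0).randn(30, 2)
U = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])       # m = 3 toy anchors
Z = kernel_weights(X, U)
```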


Anchor-to-Data Nonparametric Regression

Optimized Weights

Local Anchor Embedding (LAE)

min_{z_i ∈ R^s}  g(z_i) = (1/2) ‖x_i − U_⟨i⟩ z_i‖^2   s.t.  1^T z_i = 1, z_i ≥ 0  (4)

U = [u_1, ..., u_m], and U_⟨i⟩ ∈ R^{d×s} is the sub-matrix composed of the s nearest anchors of x_i.

Similar to Locally Linear Embedding (LLE), LAE reconstructs each x_i as a convex combination of its nearest anchors. The combination coefficients are kept as the weights Z_ik used in nonparametric regression:

Z_{i,⟨i⟩} = z_i^T,  Z_{i,k} = 0 for k ∉ ⟨i⟩,  |⟨i⟩| = s.  (5)


Anchor-to-Data Nonparametric Regression

We propose a Nesterov-accelerated projected gradient algorithm, a first-order optimization procedure, to solve this problem.

In contrast to the predefined kernel-based weights, LAE is more advantageous because it provides optimized regression weights that are also sparser.

Example

Figure: project x_i onto the convex hull of its closest anchors U_⟨i⟩. With s = 2: x_4 yields Z_42 = 0, Z_43 = 1, while x_7 yields Z_73 = 0.5, Z_74 = 0.5.
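The LAE subproblem (4) can be sketched as plain projected gradient descent (without the Nesterov acceleration the authors use, and with a standard Euclidean simplex-projection routine; all names here are illustrative):

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto the probability simplex {z : z >= 0, 1^T z = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def lae_weights(x, Us, iters=200):
    """Projected gradient on eq. (4): min 0.5 * ||x - Us z||^2 over the simplex."""
    s = Us.shape[1]
    z = np.ones(s) / s                                   # feasible start
    step = 1.0 / (np.linalg.norm(Us, 2) ** 2)            # 1/L for this quadratic
    for _ in range(iters):
        grad = Us.T @ (Us @ z - x)                       # gradient of g(z)
        z = proj_simplex(z - step * grad)                # gradient step + project
    return z

Us = np.array([[0.0, 2.0],       # columns are two anchors: (0,0) and (2,0)
               [0.0, 0.0]])
x = np.array([0.5, 0.0])         # lies a quarter of the way along the segment
z = lae_weights(x, Us)           # converges to the convex weights [0.75, 0.25]
```

For a point on the segment between two anchors, the solver recovers exactly the convex weights that reconstruct it, as the slides' example (Z_73 = Z_74 = 0.5) suggests.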


Coupling Label Prediction and Graph Construction

Large Graph Construction

Construct a large graph G(X, W) over the dataset X; W ∈ R^{n×n} denotes the adjacency matrix.

The exact kNN graph construction (quadratic time complexity) is infeasible at large scale, and it is unrealistic to store a full matrix W in memory.

Our idea is low-rank matrix factorization: represent W as a Gram matrix.


Coupling Label Prediction and Graph Construction

Principles for Designing W

W ≥ 0, which is sufficient to make the resulting graph Laplacian L = D − W (with D = diag(W1)) positive semidefinite. Keeping the graph Laplacian p.s.d. is important for guaranteeing the global optimum of graph-based SSL (it ensures that the graph regularizer f^T L f is convex).

Sparse W. Sparse graphs have far fewer spurious connections between dissimilar points and tend to exhibit high quality.


Coupling Label Prediction and Graph Construction

Nyström Approximation

Using the Nyström method to design the graph adjacency matrix, W = K̃, produces an improper dense graph.

K̃ = K_{·U} K_{UU}^{-1} K_{U·} approximates a predefined kernel matrix K ∈ R^{n×n}.

There is no guarantee that K̃ ≥ 0, and K̃ is dense.


Coupling Label Prediction and Graph Construction

Design W Using Z

In anchor-based label prediction, we have already obtained the nonnegative and sparse Z ∈ R^{n×m}. Don't waste it.

In depth, each row Z_i· of Z is a new representation of the raw sample x_i. The map x_i → Z_i·^T is reminiscent of sparse coding with the basis U, since x_i ≈ U_⟨i⟩ z_i = U Z_i·^T.

Guess the Gram matrix W = ZZ^T?


Coupling Label Prediction and Graph Construction

Design W Using Z

Low-Rank Matrix Factorization

W = Z Λ^{-1} Z^T,  (6)

in which the diagonal matrix Λ ∈ R^{m×m} contains the inbound degrees of the anchors: Λ_kk = Σ_{i=1}^n Z_ik.

W balances the popularity of anchors.

W is nonnegative and empirically sparse.

Theoretically, we can derive this equation probabilistically, e.g., via the bipartite graph on the next slide.
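A quick NumPy sanity check of eq. (6) on a toy Z (values invented for illustration): the resulting W is nonnegative and symmetric, and it yields a p.s.d. Laplacian with D = I, because the rows of Z sum to one.

```python
import numpy as np

# toy nonnegative Z with rows summing to 1 (n = 6 points, m = 3 anchors)
Z = np.array([[0.8, 0.2, 0.0],
              [0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.5, 0.5],
              [0.1, 0.0, 0.9],
              [0.0, 0.0, 1.0]])

Lam = np.diag(Z.sum(axis=0))            # Λ_kk = Σ_i Z_ik (anchor in-degrees)
W = Z @ np.linalg.inv(Lam) @ Z.T        # eq. (6): AnchorGraph adjacency
D = np.diag(W.sum(axis=1))              # degree matrix; here D = I since Z1 = 1
L = D - W                               # graph Laplacian, p.s.d. by construction
```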


Coupling Label Prediction and Graph Construction

Bipartite Data-Anchor Graph

Figure: A bipartite graph representation B(X, U, Z) of data X = {x_1, ..., x_6} and anchors U = {u_1, u_2}. Z is defined as the cross adjacency matrix between X and U (Z1 = 1); e.g., x_2 links to u_1 and u_2 with weights Z_21 and Z_22.


Coupling Label Prediction and Graph Construction

One-Step Transition in Markov Random Walks

On the bipartite graph B above:

p^(1)(u_k | x_i) = Z_ik / Σ_{k'=1}^m Z_ik' = Z_ik


Coupling Label Prediction and Graph Construction

One-Step Transition in Markov Random Walks

p^(1)(x_i | u_k) = Z_ik / Σ_{j=1}^n Z_jk = Z_ik / Λ_kk


Coupling Label Prediction and Graph Construction

Two-Step Transition in Markov Random Walks

p^(2)(x_j | x_i) = p^(2)(x_i | x_j) = Σ_{k=1}^m p^(1)(x_j | u_k) p^(1)(u_k | x_i) = Σ_{k=1}^m Z_ik Z_jk / Λ_kk
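We can numerically confirm on a toy Z (invented values) that the entries of ZΛ^{-1}Z^T coincide with these two-step transition probabilities:

```python
import numpy as np

Z = np.array([[0.7, 0.3],        # rows sum to 1, so p(u_k | x_i) = Z_ik
              [0.4, 0.6],
              [0.0, 1.0]])

Lam = Z.sum(axis=0)              # Λ_kk = Σ_i Z_ik
W = Z @ np.diag(1.0 / Lam) @ Z.T # W = Z Λ^{-1} Z^T

# two-step walk x_i -> u_k -> x_j: Σ_k Z_ik * (Z_jk / Λ_kk)
P2 = np.array([[sum(Z[i, k] * Z[j, k] / Lam[k] for k in range(2))
                for j in range(3)] for i in range(3)])
```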


Coupling Label Prediction and Graph Construction

Interpret W as two-step transition probabilities on the bipartite data-anchor graph B:

W_ij = p^(2)(x_j | x_i) = p^(2)(x_i | x_j)  (7)

⇒ W = Z Λ^{-1} Z^T

We term the large graph G described by such a W an AnchorGraph.

We never need to compute W or the graph Laplacian L = D − W explicitly; we only need to keep Z (O(sn)) in memory.


AnchorGraph Regularization

With f = Za, the objective function of SSL is

Q(a) = ‖Z_l a − y_l‖^2  (label fitting)  +  γ a^T Z^T L Z a  (graph regularization)  (8)

where we compute a "reduced" Laplacian matrix:

L̃ = Z^T L Z = Z^T (D − Z Λ^{-1} Z^T) Z = Z^T Z − (Z^T Z) Λ^{-1} (Z^T Z),

using D = diag(W1) = diag(ZΛ^{-1}Z^T 1) = diag(ZΛ^{-1}Λ1) = diag(Z1) = diag(1) = I.

Computing L̃ is tractable, taking O(m^3 + m^2 n) time.
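A small NumPy check (toy Z, my own construction) that the factored form of the reduced Laplacian matches Z^T L Z computed from the explicit n × n Laplacian:

```python
import numpy as np

Z = np.array([[0.8, 0.2, 0.0],   # toy regression matrix, rows sum to 1
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6],
              [0.5, 0.0, 0.5]])

ZtZ = Z.T @ Z
Lam_inv = np.diag(1.0 / Z.sum(axis=0))
L_red = ZtZ - ZtZ @ Lam_inv @ ZtZ         # reduced Laplacian, only m x m

# brute-force check against the explicit n x n Laplacian L = D - W
W = Z @ Lam_inv @ Z.T
L_full = np.eye(len(Z)) - W               # D = I because Z1 = 1
```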


AnchorGraph Regularization

We can easily obtain the globally optimal solution as follows:

a* = (Z_l^T Z_l + γ L̃)^{-1} Z_l^T y_l,  (9)

where L̃ = Z^T L Z is the reduced Laplacian. As such, we obtain a closed-form solution for large-scale SSL.

Final prediction over multiple classes (Y = {1, ..., c}):

ŷ_i = arg max_{j ∈ Y}  Z_i· a*_j / λ_j,  i = l+1, ..., n  (10)

where a*_j is solved separately for each class, and the normalization factor λ_j = 1^T Z a*_j balances skewed class distributions.
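Putting eqs. (9)-(10) together as a toy end-to-end sketch (random data; the shapes, γ value, and Dirichlet-sampled Z are my own choices for illustration, not the paper's setup):

```python
import numpy as np

rng = np.random.RandomState(0)
n, m, l, gamma = 40, 5, 6, 0.01               # data size, anchors, labeled count

Z = rng.dirichlet(0.3 * np.ones(m), size=n)   # toy Z: nonnegative rows summing to 1
ZtZ = Z.T @ Z
L_red = ZtZ - ZtZ @ np.diag(1.0 / Z.sum(axis=0)) @ ZtZ   # reduced Laplacian

Zl = Z[:l]                                    # rows of the l labeled points
Y = np.zeros((l, 2))                          # one-hot labels, c = 2 classes
Y[:3, 0] = 1.0
Y[3:, 1] = 1.0

# eq. (9), solved for all classes at once: a*_j = (Zl^T Zl + gamma L~)^-1 Zl^T y_j
A = np.linalg.solve(Zl.T @ Zl + gamma * L_red, Zl.T @ Y)

# eq. (10): normalized class scores for the unlabeled points
lam = Z.sum(axis=0) @ A                       # λ_j = 1^T Z a*_j
pred = np.argmax((Z[l:] @ A) / lam, axis=1)
```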


AnchorGraph Regularization

Complexity Analysis

Space Complexity

O(n), since we only keep Z and the reduced m × m Laplacian in memory.

Time Complexity

O(m^2 n), since n ≫ m.

Table: Time complexity analysis. n is the data size, m is the number of anchor points, s is the number of nearest anchors used in designing Z, and T is the number of LAE iterations (n ≫ m ≫ s).

Stage                    AnchorGraphReg
find anchors             O(mn)
design Z                 O(smn) or O(smn + s^2 T n)
graph regularization     O(m^3 + m^2 n)
total time complexity    O(m^2 n)


Compared Methods

LGC, GFHF, PVM, Eigenfunction, etc.

random AGR0: random anchors, predefined Z

random AGR: random anchors, LAE-optimized Z

AGR0: k-means anchors, predefined Z

AGR: k-means anchors, LAE-optimized Z


Mid-sized Dataset

Table: Classification error rates (%) on USPS-Train (7,291 samples) with 100 labeled samples. The four versions of AGR use 1,000 anchors; the k-means clustering takes 7.65 seconds. Error rate: GFHF < AGR < AGR0 < LGC, and AGR is much faster than GFHF.

Method                 Error Rate (%)    Running Time (s)
1NN                    20.15±1.80        0.12
LGC with 6NN graph     8.79±2.27         403.02
GFHF with 6NN graph    5.19±0.43         413.28
random AGR0            11.15±0.77        2.55
random AGR             10.30±0.75        8.85
AGR0                   7.40±0.59         10.20
AGR                    6.56±0.55         16.57


Mid-sized Dataset

Figure: classification error rates (%) vs. the number of anchor points (200-1000) on USPS-Train with 100 labeled samples, comparing LGC, GFHF, random AGR0, random AGR, AGR0, and AGR. When the anchor size reaches 600, both AGR0 and AGR begin to outperform LGC.


Large Dataset

Table: Classification error rates (%) on Extended MNIST (630,000 samples) with 100 labeled samples. The two versions of AGR use 500 anchors; the k-means clustering takes 195.16 seconds. AGR achieves the best classification accuracy, roughly halving the error rate of the 1NN baseline.

Method               Error Rate (%)    Running Time (s)
1NN                  39.65±1.86        5.46
Eigenfunction        36.94±2.67        44.08
PVM (square loss)    29.37±2.53        266.89
AGR0                 24.71±1.92        232.37
AGR                  19.75±1.83        331.72


Conclusions

Develop a scalable SSL approach using the proposed large graph construction, AnchorGraph, which is guaranteed to result in positive semidefinite graph Laplacians.

Both the time and memory needed by AGR grow only linearly with the data size, so it enables us to apply SSL to even larger datasets with millions of samples.

AnchorGraph has wider utility, such as large-scale nonlinear dimensionality reduction, spectral clustering, regression, etc.


Thanks! Ask me via wliu@ee.columbia.edu.