Motivation Related Work Anchor-Based Label Prediction AnchorGraph Experiments Conclusions
Large Graph Construction for Scalable Semi-Supervised Learning
Wei Liu Junfeng He Shih-Fu Chang
Department of Electrical Engineering
Columbia University
New York, NY, USA
International Conference on Machine Learning, 2010
Outline
1 Motivation
2 Related Work
   Mid-Scale SSL
   Large-Scale SSL
3 Anchor-Based Label Prediction
   Anchors
   Anchor-to-Data Nonparametric Regression
4 AnchorGraph
   Coupling Label Prediction and Graph Construction
   AnchorGraph Regularization
5 Experiments
Why Semi-Supervised Learning (SSL)?
In practice, labeling data is expensive.
Using both labeled and unlabeled data, SSL generalizes and improves upon supervised learning.
Most SSL methods cannot handle a large amount of unlabeled data because they scale poorly with the data size.
Large-scale SSL is critical nowadays, as we can easily collect massive amounts of unlabeled data (up to hundreds of millions of samples) from the Internet (Facebook, Flickr).
SSL on Large Social Networks
http://www.cmu.edu/joss/content/articles/volume1/Freeman.html
Mid-Scale SSL
Manifold and Graph
Manifold Assumption
Close samples under some similarity measure have similar class labels. Graph-based SSL methods including LGC, GFHF, LapSVM, etc. adopt this assumption.
Neighborhood Graph
A sparse graph can reveal the data manifold: adjacent graph nodes represent nearby data points. The most commonly used graph is the kNN graph (linking an edge between each data point and its k nearest neighbors), whose construction takes O(kn²) time. Inefficient at large scale!
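To see where the quadratic cost comes from, here is a minimal NumPy sketch of brute-force kNN graph construction: it evaluates all n² pairwise distances before linking edges. The function name `knn_graph` and the Gaussian edge weighting are illustrative assumptions, not code from the talk.

```python
import numpy as np

def knn_graph(X, k, sigma=1.0):
    """Brute-force kNN graph: O(n^2) distance evaluations, the
    scalability bottleneck pointed out above."""
    n = X.shape[0]
    # all pairwise squared Euclidean distances (n x n)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(sq[i])[1:k + 1]      # skip self at position 0
        W[i, idx] = np.exp(-sq[i, idx] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                 # symmetrize the adjacency

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
W = knn_graph(X, k=6)                          # a toy 6NN graph
```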
Manifold and Graph
Dense graphs hardly reveal manifolds.
Sparse graphs are preferred.
Example
Figure: a toy data manifold (two submanifolds) and its 6NN graph.
Large-Scale SSL
Existing Large-Scale SSL Methods
Nonparametric Function Induction, Delalleau et al., 2005
Uses a truncated graph on a sample subset and its connections to the remaining samples.
Harmonic Mixtures, Zhu & Lafferty, 2005
Needs a large sparse graph, but does not explain how to construct one.
Prototype Vector Machine (PVM), Zhang et al., 2009
Uses the Nyström approximation of the fully connected graph adjacency matrix. There is no guarantee that the graph Laplacian computed from this approximation is positive semidefinite.
Eigenfunction, Fergus et al., 2010
Relies on the dimension-independent data density assumption, which is not always true.
Our Aim
Develop a scalable SSL approach that is easy to implement and has closed-form solutions.
Address the critical issue of how to build an entire sparse graph over a large dataset.
Achieve low space and time complexities with respect to the data size. Our goal: linear complexity!
Anchors
How to scale up?
Graph-based SSL methods usually need cubic time O(n³) in the data size n, because inferring the full-size label prediction requires an n × n matrix inversion.
At large scale, full-size label prediction is infeasible, which motivates the idea of landmark samples, i.e., anchors.
If one instead infers the labels on a much smaller landmark set of size m (≪ n), the time complexity drops to O(m³).
Anchor Points
Example

Figure: data points (dots) with anchor points u₁, …, u₄; the anchors are obtained as K-means cluster centers.
Anchor-to-Data Relationship
Input data points X = {x₁, · · · , xₙ} ⊂ R^d and introduced anchor points U = {u₁, · · · , uₘ}.

Suppose a soft label prediction function f : R^d → R. Predict the label of each data point as a weighted average of the labels on anchors:

f(x_i) = Σ_{k=1}^{m} Z_ik f(u_k),   i = 1, · · · , n   (1)

where the Z_ik denote the combination weights. In matrix form,

f = Za,   Z ∈ R^{n×m},   m ≪ n   (2)

where a = [f(u₁), · · · , f(uₘ)]^⊤ is the label vector that we will solve for.
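Once Z and the anchor labels are known, Eq. (2) makes full-size prediction a single matrix product. A tiny hedged example (the toy Z and a below are made up for illustration, not from the talk):

```python
import numpy as np

# toy regression matrix: 5 data points, 3 anchors, rows sum to one
Z = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.8, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.1, 0.9],
              [1.0, 0.0, 0.0]])
a = np.array([1.0, -1.0, 1.0])    # soft labels on the 3 anchors

f = Z @ a                          # Eq. (2): n predictions from m anchor labels
```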
Anchor-to-Data Nonparametric Regression
How to obtain Z?
We impose the nonnegativity and normalization constraints Σ_{k=1}^{m} Z_ik = 1 and Z_ik ≥ 0 in order to maintain a unified range of values for all soft labels predicted via anchors.
The manifold assumption implies that nearby points should have similar labels, while distant points are very unlikely to take similar labels.
So we further impose Z_ik = 0 whenever anchor u_k is far from x_i. Specifically, Z_ik ≠ 0 only for the s closest anchor points of x_i (whose indexes form the set ⟨i⟩ ⊂ [1 : m]).
The weights Z_ik are sample-adaptive, so the anchor-based label prediction scheme falls into nonparametric regression: the label of x_i is a locally weighted average of the labels on anchors. The regression matrix Z ∈ R^{n×m} is nonnegative and sparse.

Example

Figure: with s = 2, x₁ receives nonzero weights Z₁₁ and Z₁₂ only from its two closest anchors u₁ and u₂.
Predefined Kernel-Based Weights

The Nadaraya-Watson Kernel Regression

Z_ik = K_h(x_i, u_k) / Σ_{k'∈⟨i⟩} K_h(x_i, u_k'),   ∀k ∈ ⟨i⟩   (3)

Typically, we may adopt the Gaussian kernel K_h(x_i, u_k) = exp(−‖x_i − u_k‖² / 2h²).

These predefined weights are sensitive to the hyperparameter h and lack a meaningful interpretation.
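Eq. (3) can be sketched in a few lines of NumPy. The toy `nw_weights` below (an illustrative name; it uses dense storage and brute-force nearest-anchor search for clarity, where a real implementation would store Z sparsely) builds Gaussian-kernel weights over each point's s closest anchors:

```python
import numpy as np

def nw_weights(X, U, s, h):
    """Nadaraya-Watson weights of Eq. (3): for each x_i, Gaussian kernel
    values over its s nearest anchors <i>, normalized to sum to one."""
    n, m = X.shape[0], U.shape[0]
    sq = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)   # n x m squared distances
    Z = np.zeros((n, m))
    for i in range(n):
        idx = np.argsort(sq[i])[:s]                        # the s closest anchors <i>
        k = np.exp(-sq[i, idx] / (2 * h ** 2))
        Z[i, idx] = k / k.sum()                            # normalize within <i>
    return Z

rng = np.random.default_rng(1)
X, U = rng.normal(size=(100, 2)), rng.normal(size=(10, 2))
Z = nw_weights(X, U, s=3, h=0.5)
```

Each row of the resulting Z is nonnegative, sums to one, and has exactly s nonzero entries, matching the constraints on the previous slide.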
Optimized Weights
Local Anchor Embedding (LAE)

min_{z_i ∈ R^s}  g(z_i) = (1/2) ‖x_i − U_⟨i⟩ z_i‖²
s.t.  1^⊤ z_i = 1,  z_i ≥ 0   (4)

where U = [u₁, · · · , uₘ] and U_⟨i⟩ ∈ R^{d×s} is the sub-matrix composed of the s nearest anchors of x_i.

Similar to Locally Linear Embedding (LLE), LAE reconstructs each x_i as a convex combination of its nearest anchors. The combination coefficients are kept as the weights Z_ik used in nonparametric regression:

Z_i,⟨i⟩ = z_i^⊤,   Z_i,⟨i⟩ᶜ = 0,   |⟨i⟩| = s.   (5)
We solve this problem with a Nesterov-accelerated projected gradient algorithm, a first-order optimization procedure. Compared with the predefined kernel-based weights, LAE is advantageous because it yields optimized regression weights that are also sparser.
Example

Figure: project x_i onto the convex hull of its s = 2 closest anchors U_⟨i⟩. Here x₄ lands exactly on anchor u₃ (Z₄₂ = 0, Z₄₃ = 1), while x₇ falls midway between u₃ and u₄ (Z₇₃ = 0.5, Z₇₄ = 0.5).
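Problem (4) can be sketched with plain projected gradient descent onto the probability simplex (the talk uses a Nesterov-accelerated variant; this simplification keeps only the projection idea). `project_simplex` and `lae` are illustrative names, and the fixed step size assumes it stays below the reciprocal of the largest eigenvalue of U_⟨i⟩^⊤U_⟨i⟩:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {z : z >= 0, 1^T z = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def lae(x, Us, steps=2000, lr=0.1):
    """Local Anchor Embedding, Eq. (4): reconstruct x as a convex combination
    of its s nearest anchors (the columns of Us, shape d x s). Plain projected
    gradient; lr must stay below 1 / lambda_max(Us^T Us) to converge."""
    z = np.full(Us.shape[1], 1.0 / Us.shape[1])   # start at the simplex center
    for _ in range(steps):
        grad = Us.T @ (Us @ z - x)                # gradient of (1/2)||x - Us z||^2
        z = project_simplex(z - lr * grad)
    return z

# x = (0.25, 0.25) lies inside the hull of anchors (0,0), (1,0), (0,1)
Us = np.array([[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
z = lae(np.array([0.25, 0.25]), Us)
```

Because x sits inside the convex hull of the anchors, the reconstruction residual ‖x − U_⟨i⟩z‖ goes to (numerically) zero while z stays on the simplex.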
Coupling Label Prediction and Graph Construction
Large Graph Construction
Construct a large graph G(X, W) over the dataset X, where W ∈ R^{n×n} denotes the adjacency matrix.
Exact kNN graph construction (quadratic time complexity) is infeasible at large scale, and it is unrealistic to hold a full matrix W in memory.
Our idea is low-rank matrix factorization: represent W as a Gram matrix.
Principles for Designing W
W ≥ 0, which is sufficient to make the resulting graph Laplacian L = D − W (with D = diag(W1)) positive semidefinite. Keeping the graph Laplacian p.s.d. is important for guaranteeing the global optimum of graph-based SSL (it ensures the graph regularizer f^⊤Lf is convex).
Sparse W. Sparse graphs have far fewer spurious connections between dissimilar points and tend to exhibit high quality.
Nyström Approximation
Using the Nyström method to design the graph adjacency matrix W = K̂ produces an improper dense graph:

K̂ = K_·U K_UU^{−1} K_U·

approximates a predefined kernel matrix K ∈ R^{n×n}.

It cannot guarantee K̂ ≥ 0, and K̂ is dense.
Design W Using Z
In anchor-based label prediction, we have already obtained the nonnegative and sparse Z ∈ R^{n×m}. Don't waste it.

In fact, each row Z_i· of Z is a new representation of the raw sample x_i. The map x_i → Z_i·^⊤ is reminiscent of sparse coding with the basis U, since x_i ≈ U_⟨i⟩ z_i = U Z_i·^⊤.

A first guess: the Gram matrix W = ZZ^⊤?
Design W Using Z
Low-Rank Matrix Factorization

W = Z Λ^{−1} Z^⊤,   (6)

in which the diagonal matrix Λ ∈ R^{m×m} contains the inbound degrees of the anchors: Λ_kk = Σ_{i=1}^{n} Z_ik.

W balances the popularity of anchors.

W is nonnegative and empirically sparse.

Theoretically, we can derive this equation by probabilistic means (e.g., via the bipartite graph on the next slide).
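Eq. (6) in NumPy, on a toy Z (in practice W is never materialized; this dense form only illustrates its properties):

```python
import numpy as np

def anchor_graph_adjacency(Z):
    """Eq. (6): W = Z Λ^{-1} Z^T, with Λ_kk = Σ_i Z_ik the anchor degrees.
    Dividing the columns of Z by the anchor degrees realizes Λ^{-1}."""
    return (Z / Z.sum(axis=0)) @ Z.T

# toy regression matrix whose rows sum to one
Z = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.8, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.1, 0.9]])
W = anchor_graph_adjacency(Z)
```

Because each row of Z sums to one, W1 = ZΛ^{−1}Z^⊤1 = Z1 = 1, so this W is symmetric, nonnegative, and automatically normalized (its degree matrix is the identity).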
Bipartite Data-Anchor Graph
Figure: a bipartite graph representation B(X, U, Z) of data X = {x₁, …, x₆} and anchors U = {u₁, u₂}; edges such as Z₂₁ and Z₂₂ connect data nodes to anchor nodes. Z is defined as the cross adjacency matrix between X and U (Z1 = 1).
One-Step Transition in Markov Random Walks
(On the bipartite graph B, a random walk steps from a data node to an anchor node.)

p^(1)(u_k | x_i) = Z_ik / Σ_{k'=1}^{m} Z_ik' = Z_ik,

since each row of Z sums to one.
One-Step Transition in Markov Random Walks
(The walk then steps back from an anchor node to a data node.)

p^(1)(x_i | u_k) = Z_ik / Σ_{j=1}^{n} Z_jk = Z_ik / Λ_kk
Two-Step Transition in Markov Random Walks
p^(2)(x_j | x_i) = p^(2)(x_i | x_j) = Σ_{k=1}^{m} p^(1)(x_j | u_k) p^(1)(u_k | x_i) = Σ_{k=1}^{m} Z_ik Z_jk / Λ_kk
Interpret W as the two-step transition probabilities of Markov random walks on the bipartite data-anchor graph B:

W_ij = p^(2)(x_j | x_i) = p^(2)(x_i | x_j)   (7)

⇒ W = Z Λ^{−1} Z^⊤

We term the large graph G described by such a W an AnchorGraph.

We never need to compute W or the graph Laplacian L = D − W explicitly; we only need to keep Z (O(sn) space) in memory.
AnchorGraph Regularization
With f = Za, the objective function of SSL becomes

Q(a) = ‖Z_l a − y_l‖² (label fitting) + γ a^⊤ Z^⊤ L Z a (graph regularization)   (8)

where we compute a "reduced" Laplacian matrix

L̃ = Z^⊤ L Z = Z^⊤ (D − Z Λ^{−1} Z^⊤) Z = Z^⊤Z − (Z^⊤Z) Λ^{−1} (Z^⊤Z),

using D = diag(W1) = diag(Z Λ^{−1} Z^⊤ 1) = diag(Z Λ^{−1} Λ 1) = diag(Z1) = diag(1) = I.

Computing L̃ is tractable, taking O(m³ + m²n) time.
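A small sanity check of the reduced-Laplacian formula against the explicit L = I − W on toy data (illustrative only; at scale one would never form the n × n matrices):

```python
import numpy as np

def reduced_laplacian(Z):
    """Reduced Laplacian L~ = Z^T L Z = Z^T Z - (Z^T Z) Λ^{-1} (Z^T Z),
    computed in O(m^2 n + m^3) time without forming the n x n graph."""
    ZtZ = Z.T @ Z
    Lam = Z.sum(axis=0)                           # anchor degrees Λ_kk
    return ZtZ - ZtZ @ ((1.0 / Lam)[:, None] * ZtZ)

# explicit n x n check: since rows of Z sum to one, D = I and L = I - W
Z = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.8, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.1, 0.9]])
W = (Z / Z.sum(axis=0)) @ Z.T                     # W = Z Λ^{-1} Z^T
L_full = np.eye(4) - W
```

The m × m result agrees with Z^⊤(I − W)Z and is positive semidefinite, as the design principles require.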
We can easily obtain the globally optimal solution:

a* = (Z_l^⊤ Z_l + γ L̃)^{−1} Z_l^⊤ y_l.   (9)

As such, we obtain a closed-form solution for large-scale SSL.

Final prediction over multiple classes (Y = {1, · · · , c}):

ŷ_i = arg max_{j∈Y}  Z_i· a_j* / λ_j,   i = l + 1, · · · , n   (10)

where a_j* is solved for each class and the normalization factor λ_j = 1^⊤ Z a_j* balances skewed class distributions.
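The whole pipeline of Eqs. (9)-(10) fits in a few lines. The toy `agr_solve` below is an illustrative sketch with made-up Z, labels, and γ, not the authors' code: it solves for the anchor labels of all classes at once and predicts with the λ_j normalization.

```python
import numpy as np

def agr_solve(Z, labeled_idx, Y_l, gamma=0.01):
    """AnchorGraph Regularization in closed form: Eq. (9) for the anchor
    labels A* (one column per class) and Eq. (10) for the prediction with
    the class-balancing normalization λ_j = 1^T Z a_j*."""
    ZtZ = Z.T @ Z
    Lam = Z.sum(axis=0)
    L_red = ZtZ - ZtZ @ ((1.0 / Lam)[:, None] * ZtZ)        # reduced Laplacian
    Zl = Z[labeled_idx]
    A = np.linalg.solve(Zl.T @ Zl + gamma * L_red, Zl.T @ Y_l)  # Eq. (9), m x c
    lam = Lam @ A                                            # λ_j = 1^T Z a_j*
    return (Z @ A / lam).argmax(axis=1)                      # Eq. (10)

# toy example: 6 points softly assigned to 2 anchors, one labeled point per class
Z = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
              [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
Y_l = np.array([[1.0, 0.0], [0.0, 1.0]])                     # one-hot labels
pred = agr_solve(Z, np.array([0, 4]), Y_l)
```

With points 0 and 4 labeled, the remaining points inherit the class of the anchor they lean on.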
Complexity Analysis
Space Complexity

O(n), since we only keep Z and L̃ ∈ R^{m×m} in memory.

Time Complexity

O(m²n), since n ≫ m.

Table: Time complexity analysis. n is the data size, m is the number of anchor points, s is the number of nearest anchors used in designing Z, and T is the number of LAE iterations (n ≫ m ≫ s).

Stage                  | AnchorGraphReg
find anchors           | O(mn)
design Z               | O(smn) or O(smn + s²Tn)
graph regularization   | O(m³ + m²n)
total time complexity  | O(m²n)
Compared Methods
LGC, GFHF, PVM, Eigenfunction, etc.
random AGR0: random anchors, predefined Z
random AGR: random anchors, LAE-optimized Z
AGR0: k-means anchors, predefined Z
AGR: k-means anchors, LAE-optimized Z
Mid-sized Dataset
Table: Classification error rates (%) on USPS-Train (7,291 samples) with 100 labeled samples, using 1,000 anchors for the four versions of AGR. The running time of k-means clustering is 7.65 seconds. Error rates: GFHF < AGR < AGR0 < LGC; AGR is much faster than GFHF.

Method              | Error Rate (%) | Running Time (seconds)
1NN                 | 20.15 ± 1.80   | 0.12
LGC with 6NN graph  | 8.79 ± 2.27    | 403.02
GFHF with 6NN graph | 5.19 ± 0.43    | 413.28
random AGR0         | 11.15 ± 0.77   | 2.55
random AGR          | 10.30 ± 0.75   | 8.85
AGR0                | 7.40 ± 0.59    | 10.20
AGR                 | 6.56 ± 0.55    | 16.57
Figure: classification error rates (%) vs. number of anchor points (200 to 1,000) on USPS-Train, comparing LGC, GFHF, random AGR0, random AGR, AGR0, and AGR, with 100 labeled samples. When the anchor size reaches 600, both AGR0 and AGR begin to outperform LGC.
Large Dataset
Table: Classification error rates (%) on Extended MNIST (630,000 samples) with 100 labeled samples, using 500 anchors for the two versions of AGR. The running time of k-means clustering is 195.16 seconds. AGR attains the best classification accuracy, roughly halving the error of the 1NN baseline.

Method            | Error Rate (%) | Running Time (seconds)
1NN               | 39.65 ± 1.86   | 5.46
Eigenfunction     | 36.94 ± 2.67   | 44.08
PVM (square loss) | 29.37 ± 2.53   | 266.89
AGR0              | 24.71 ± 1.92   | 232.37
AGR               | 19.75 ± 1.83   | 331.72
Conclusions
We develop a scalable SSL approach based on the proposed large graph construction, the AnchorGraph, which is guaranteed to yield positive semidefinite graph Laplacians.
Both the time and memory required by AGR grow only linearly with the data size, enabling SSL on even larger datasets with millions of samples.
The AnchorGraph has wider utility, e.g., large-scale nonlinear dimensionality reduction, spectral clustering, and regression.
Thanks! Ask me via [email protected].