Motivation Related Work Anchor-Based Label Prediction AnchorGraph Experiments Conclusions

Large Graph Construction for Scalable Semi-Supervised Learning

Wei Liu Junfeng He Shih-Fu Chang

Department of Electrical Engineering

Columbia University

New York, NY, USA

International Conference on Machine Learning, 2010


Outline

1 Motivation

2 Related Work
    Mid-Scale SSL
    Large-Scale SSL

3 Anchor-Based Label Prediction
    Anchors
    Anchor-to-Data Nonparametric Regression

4 AnchorGraph
    Coupling Label Prediction and Graph Construction
    AnchorGraph Regularization

5 Experiments

6 Conclusions


Why Semi-Supervised Learning (SSL)?

In practice, labeling data is expensive.

Using both labeled and unlabeled data, SSL generalizes and improves on supervised learning.

Most SSL methods cannot handle a large amount of unlabeled data because they scale poorly with the data size.

Large-scale SSL is critical nowadays, as we can easily collect massive amounts (up to hundreds of millions) of unlabeled data from the Internet (Facebook, Flickr).


SSL on Large Social Networks

http://www.cmu.edu/joss/content/articles/volume1/Freeman.html


Mid-Scale SSL

Manifold and Graph

Manifold Assumption

Close samples under some similarity measure have similar class labels. Graph-based SSL methods, including LGC, GFHF, LapSVM, etc., adopt this assumption.

Neighborhood Graph

A sparse graph can reveal the data manifold, where adjacent graph nodes represent nearby data points. The commonly used graph is the kNN graph (linking edges between each data point and its k nearest neighbors), whose construction takes O(kn^2) time. Inefficient at large scale!
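To make the quadratic cost concrete, here is a minimal brute-force kNN-graph sketch (not from the paper; the function name and NumPy implementation are illustrative). It computes all n^2 pairwise distances, which is exactly the bottleneck avoided later:

```python
import numpy as np

def knn_graph(X, k):
    """Brute-force kNN adjacency: computes all n^2 pairwise distances."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # n x n squared distances
    np.fill_diagonal(d2, np.inf)                         # no self-links
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[:k]] = 1.0                # link k nearest neighbors
    return np.maximum(W, W.T)                            # symmetrize

X = np.random.RandomState(0).randn(50, 2)
W = knn_graph(X, k=6)
```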


Mid-Scale SSL

Manifold and Graph

Dense graphs hardly reveal manifolds.

Sparse graphs are preferred.

Example

Figure: a toy data manifold consisting of two submanifolds (left) and its 6NN graph (right).


Large-Scale SSL

Existing Large-Scale SSL Methods

Nonparametric Function Induction (Delalleau et al., 2005): uses the truncated graph on a sample subset and its connections to the remaining samples.

Harmonic Mixtures (Zhu & Lafferty, 2005): needs a large sparse graph, but does not explain how to construct one.

Prototype Vector Machine (PVM) (Zhang et al., 2009): uses the Nyström approximation of the fully-connected graph adjacency matrix. There is no guarantee that the graph Laplacian computed from this approximation is positive semidefinite.

Eigenfunction (Fergus et al., 2010): relies on the dimension-independent data density assumption, which does not always hold.


Large-Scale SSL

Our Aim

Develop a scalable SSL approach that is easy to implement and has closed-form solutions.

Address the critical issue of how to build an entire sparse graph over a large dataset.

Achieve low space and time complexities with respect to the data size. Our goal: linear complexity!


Anchors

How to scale up?

Graph-based SSL methods usually require cubic time complexity O(n^3) in the data size n, because an n × n matrix inversion is needed to infer the full-size label prediction.

At large scale, full-size label prediction is infeasible, which has motivated the idea of landmark samples, i.e., anchors.

If one instead inferred the labels on a much smaller landmark set of size m (≪ n), the time complexity could be reduced to O(m^3).


Anchors

Anchor Points

Example

Figure: data points x_i and anchor points u_1, ..., u_4; the anchors are obtained as k-means cluster centers.
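The anchor-selection step can be sketched with plain Lloyd iterations (a hedged illustration: the slides use k-means centers as anchors, and this toy NumPy routine is one possible implementation; the function name and parameters are my own):

```python
import numpy as np

def kmeans_anchors(X, m, iters=20, seed=0):
    """Return m anchor points as k-means cluster centers (plain Lloyd's algorithm)."""
    rng = np.random.RandomState(seed)
    U = X[rng.choice(len(X), m, replace=False)].copy()   # init centers from data
    for _ in range(iters):
        d2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)                            # nearest center per point
        for k in range(m):
            members = X[assign == k]
            if len(members):
                U[k] = members.mean(0)                   # recompute center
    return U

X = np.random.RandomState(1).randn(200, 2)
U = kmeans_anchors(X, m=4)
```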


Anchors

Anchor-to-Data Relationship

Input data points X = {x_1, ..., x_n} ⊂ R^d and introduced anchor points U = {u_1, ..., u_m}.

Suppose a soft label prediction function f: R^d → R. Predict the label of each data point as a weighted average of the labels on anchors:

f(x_i) = Σ_{k=1}^m Z_ik f(u_k),  i = 1, ..., n  (1)

where the Z_ik denote the combination weights.

The matrix form:

f = Za,  Z ∈ R^{n×m},  m ≪ n  (2)

where a = [f(u_1), ..., f(u_m)]^T is the label vector that we will solve for.
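As a tiny numeric illustration of eqs. (1)-(2) (toy values, not from the slides), prediction is just a matrix-vector product once Z is known:

```python
import numpy as np

# toy regression matrix: each row holds convex weights over m = 3 anchors
Z = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.5, 0.5],
              [0.2, 0.0, 0.8]])
a = np.array([1.0, -1.0, 1.0])   # soft labels on the anchors

f = Z @ a                        # eq. (1): f(x_i) = sum_k Z_ik f(u_k)
# f == [0.4, 0.0, 1.0]: each prediction is a weighted average of anchor labels
```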


Anchor-to-Data Nonparametric Regression

How to obtain Z?

We impose the nonnegative normalization constraints Σ_{k=1}^m Z_ik = 1 and Z_ik ≥ 0 in order to maintain a unified range of values for all soft labels predicted via anchors.

The manifold assumption implies that nearby points should have similar labels, while distant points are very unlikely to take similar labels.

So we further impose Z_ik = 0 whenever anchor u_k is far away from x_i. Specifically, Z_ik ≠ 0 only for the s closest anchor points of x_i (whose indexes form ⟨i⟩ ⊂ [1 : m]).


Anchor-to-Data Nonparametric Regression

The weights Z_ik are sample-adaptive, so the anchor-based label prediction scheme falls into nonparametric regression: the label of x_i is a locally weighted average of the labels on anchors.

The regression matrix Z ∈ R^{n×m} is nonnegative and sparse.

Example

Figure: with s = 2, x_1 is linked to its two closest anchors u_1 and u_2 with weights Z_11 and Z_12.


Anchor-to-Data Nonparametric Regression

Predefined Kernel-based Weights

The Nadaraya-Watson Kernel Regression

Z_ik = K_h(x_i, u_k) / Σ_{k'∈⟨i⟩} K_h(x_i, u_k'),  ∀k ∈ ⟨i⟩  (3)

Typically, we may adopt the Gaussian kernel K_h(x_i, u_k) = exp(−‖x_i − u_k‖^2 / 2h^2).

These predefined weights are sensitive to the hyperparameter h and lack a meaningful interpretation.
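A minimal NumPy sketch of eq. (3) (the function name and toy data are my own; the slides only specify the formula): each point keeps Gaussian-kernel weights on its s closest anchors, normalized to sum to one.

```python
import numpy as np

def kernel_weights(X, U, s=2, h=1.0):
    """Eq. (3): Gaussian-kernel weights over the s closest anchors of each point."""
    n, m = len(X), len(U)
    d2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)   # point-anchor distances
    Z = np.zeros((n, m))
    for i in range(n):
        idx = np.argsort(d2[i])[:s]                       # <i>: s nearest anchors
        K = np.exp(-d2[i, idx] / (2 * h ** 2))            # K_h(x_i, u_k)
        Z[i, idx] = K / K.sum()                           # rows sum to 1
    return Z

X = np.random.RandomState(0).randn(30, 2)
U = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])       # m = 3 toy anchors
Z = kernel_weights(X, U)
```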


Anchor-to-Data Nonparametric Regression

Optimized Weights

Local Anchor Embedding (LAE)

min_{z_i ∈ R^s}  g(z_i) = (1/2) ‖x_i − U_⟨i⟩ z_i‖^2   s.t.  1^T z_i = 1, z_i ≥ 0  (4)

U = [u_1, ..., u_m], and U_⟨i⟩ ∈ R^{d×s} is the sub-matrix composed of the s nearest anchors of x_i.

Similar to Locally Linear Embedding (LLE), LAE reconstructs each x_i as a convex combination of its nearest anchors. The combination coefficients are kept as the weights Z_ik used in nonparametric regression:

Z_{i,⟨i⟩} = z_i^T,  Z_{i,k} = 0 for k ∉ ⟨i⟩,  |⟨i⟩| = s.  (5)


Anchor-to-Data Nonparametric Regression

We propose a Nesterov-accelerated projected gradient algorithm, a first-order optimization procedure, to solve this problem.

In contrast to the predefined kernel-based weights, LAE is more advantageous because it provides optimized regression weights that are also sparser.

Example

Figure: project x_i onto the convex hull of its closest anchors U_⟨i⟩. With s = 2: x_4 yields Z_42 = 0, Z_43 = 1, while x_7 yields Z_73 = 0.5, Z_74 = 0.5.
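The LAE subproblem (4) can be sketched as plain projected gradient descent (without the Nesterov acceleration the authors use, and with a standard Euclidean simplex-projection routine; all names here are illustrative):

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto the probability simplex {z : z >= 0, 1^T z = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def lae_weights(x, Us, iters=200):
    """Projected gradient on eq. (4): min 0.5 * ||x - Us z||^2 over the simplex."""
    s = Us.shape[1]
    z = np.ones(s) / s                                   # feasible start
    step = 1.0 / (np.linalg.norm(Us, 2) ** 2)            # 1/L for this quadratic
    for _ in range(iters):
        grad = Us.T @ (Us @ z - x)                       # gradient of g(z)
        z = proj_simplex(z - step * grad)                # gradient step + project
    return z

Us = np.array([[0.0, 2.0],       # columns are two anchors: (0,0) and (2,0)
               [0.0, 0.0]])
x = np.array([0.5, 0.0])         # lies a quarter of the way along the segment
z = lae_weights(x, Us)           # converges to the convex weights [0.75, 0.25]
```

For a point on the segment between two anchors, the solver recovers exactly the convex weights that reconstruct it, as the slides' example (Z_73 = Z_74 = 0.5) suggests.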


Coupling Label Prediction and Graph Construction

Large Graph Construction

Construct a large graph G(X, W) over the dataset X; W ∈ R^{n×n} denotes the adjacency matrix.

The exact kNN graph construction (quadratic time complexity) is infeasible at large scale, and it is unrealistic to store a full matrix W in memory.

Our idea is low-rank matrix factorization: represent W as a Gram matrix.


Coupling Label Prediction and Graph Construction

Principles for Designing W

W ≥ 0, which is sufficient to make the resulting graph Laplacian L = D − W (with D = diag(W1)) positive semidefinite. Keeping the graph Laplacian p.s.d. is important for guaranteeing the global optimum of graph-based SSL (it ensures that the graph regularizer f^T L f is convex).

Sparse W. Sparse graphs have far fewer spurious connections between dissimilar points and tend to exhibit high quality.


Coupling Label Prediction and Graph Construction

Nyström Approximation

Using the Nyström method to design the graph adjacency matrix, W = K̃, produces an improper dense graph.

K̃ = K_{·U} K_{UU}^{-1} K_{U·} approximates a predefined kernel matrix K ∈ R^{n×n}.

There is no guarantee that K̃ ≥ 0, and K̃ is dense.


Coupling Label Prediction and Graph Construction

Design W Using Z

In anchor-based label prediction, we have already obtained the nonnegative and sparse Z ∈ R^{n×m}. Don't waste it.

In depth, each row Z_i· of Z is a new representation of the raw sample x_i. The map x_i → Z_i·^T is reminiscent of sparse coding with the basis U, since x_i ≈ U_⟨i⟩ z_i = U Z_i·^T.

Guess the Gram matrix W = ZZ^T?


Coupling Label Prediction and Graph Construction

Design W Using Z

Low-Rank Matrix Factorization

W = Z Λ^{-1} Z^T,  (6)

in which the diagonal matrix Λ ∈ R^{m×m} contains the inbound degrees of the anchors: Λ_kk = Σ_{i=1}^n Z_ik.

W balances the popularity of anchors.

W is nonnegative and empirically sparse.

Theoretically, we can derive this equation probabilistically, e.g., via the bipartite graph on the next slide.
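A quick NumPy sanity check of eq. (6) on a toy Z (values invented for illustration): the resulting W is nonnegative and symmetric, and it yields a p.s.d. Laplacian with D = I, because the rows of Z sum to one.

```python
import numpy as np

# toy nonnegative Z with rows summing to 1 (n = 6 points, m = 3 anchors)
Z = np.array([[0.8, 0.2, 0.0],
              [0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.5, 0.5],
              [0.1, 0.0, 0.9],
              [0.0, 0.0, 1.0]])

Lam = np.diag(Z.sum(axis=0))            # Λ_kk = Σ_i Z_ik (anchor in-degrees)
W = Z @ np.linalg.inv(Lam) @ Z.T        # eq. (6): AnchorGraph adjacency
D = np.diag(W.sum(axis=1))              # degree matrix; here D = I since Z1 = 1
L = D - W                               # graph Laplacian, p.s.d. by construction
```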


Coupling Label Prediction and Graph Construction

Bipartite Data-Anchor Graph

Figure: A bipartite graph representation B(X, U, Z) of data X = {x_1, ..., x_6} and anchors U = {u_1, u_2}. Z is defined as the cross adjacency matrix between X and U (Z1 = 1); e.g., x_2 links to u_1 and u_2 with weights Z_21 and Z_22.


Coupling Label Prediction and Graph Construction

One-Step Transition in Markov Random Walks

On the bipartite graph B above:

p^(1)(u_k | x_i) = Z_ik / Σ_{k'=1}^m Z_ik' = Z_ik


Coupling Label Prediction and Graph Construction

One-Step Transition in Markov Random Walks

p^(1)(x_i | u_k) = Z_ik / Σ_{j=1}^n Z_jk = Z_ik / Λ_kk


Coupling Label Prediction and Graph Construction

Two-Step Transition in Markov Random Walks

p^(2)(x_j | x_i) = p^(2)(x_i | x_j) = Σ_{k=1}^m p^(1)(x_j | u_k) p^(1)(u_k | x_i) = Σ_{k=1}^m Z_ik Z_jk / Λ_kk
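We can numerically confirm on a toy Z (invented values) that the entries of ZΛ^{-1}Z^T coincide with these two-step transition probabilities:

```python
import numpy as np

Z = np.array([[0.7, 0.3],        # rows sum to 1, so p(u_k | x_i) = Z_ik
              [0.4, 0.6],
              [0.0, 1.0]])

Lam = Z.sum(axis=0)              # Λ_kk = Σ_i Z_ik
W = Z @ np.diag(1.0 / Lam) @ Z.T # W = Z Λ^{-1} Z^T

# two-step walk x_i -> u_k -> x_j: Σ_k Z_ik * (Z_jk / Λ_kk)
P2 = np.array([[sum(Z[i, k] * Z[j, k] / Lam[k] for k in range(2))
                for j in range(3)] for i in range(3)])
```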


Coupling Label Prediction and Graph Construction

Interpret W as two-step transition probabilities on the bipartite data-anchor graph B:

W_ij = p^(2)(x_j | x_i) = p^(2)(x_i | x_j)  (7)

⇒ W = Z Λ^{-1} Z^T

We term the large graph G described by such a W an AnchorGraph.

We never need to compute W or the graph Laplacian L = D − W explicitly; we only need to keep Z (O(sn)) in memory.


AnchorGraph Regularization

With f = Za, the objective function of SSL is

Q(a) = ‖Z_l a − y_l‖^2  (label fitting)  +  γ a^T Z^T L Z a  (graph regularization)  (8)

where we compute a "reduced" Laplacian matrix:

L̃ = Z^T L Z = Z^T (D − Z Λ^{-1} Z^T) Z = Z^T Z − (Z^T Z) Λ^{-1} (Z^T Z),

using D = diag(W1) = diag(ZΛ^{-1}Z^T 1) = diag(ZΛ^{-1}Λ1) = diag(Z1) = diag(1) = I.

Computing L̃ is tractable, taking O(m^3 + m^2 n) time.
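A small NumPy check (toy Z, my own construction) that the factored form of the reduced Laplacian matches Z^T L Z computed from the explicit n × n Laplacian:

```python
import numpy as np

Z = np.array([[0.8, 0.2, 0.0],   # toy regression matrix, rows sum to 1
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6],
              [0.5, 0.0, 0.5]])

ZtZ = Z.T @ Z
Lam_inv = np.diag(1.0 / Z.sum(axis=0))
L_red = ZtZ - ZtZ @ Lam_inv @ ZtZ         # reduced Laplacian, only m x m

# brute-force check against the explicit n x n Laplacian L = D - W
W = Z @ Lam_inv @ Z.T
L_full = np.eye(len(Z)) - W               # D = I because Z1 = 1
```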


AnchorGraph Regularization

We can easily obtain the globally optimal solution as follows:

a* = (Z_l^T Z_l + γ L̃)^{-1} Z_l^T y_l,  (9)

where L̃ = Z^T L Z is the reduced Laplacian. As such, we obtain a closed-form solution for large-scale SSL.

Final prediction over multiple classes (Y = {1, ..., c}):

ŷ_i = arg max_{j ∈ Y}  Z_i· a*_j / λ_j,  i = l+1, ..., n  (10)

where a*_j is solved separately for each class, and the normalization factor λ_j = 1^T Z a*_j balances skewed class distributions.
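Putting eqs. (9)-(10) together as a toy end-to-end sketch (random data; the shapes, γ value, and Dirichlet-sampled Z are my own choices for illustration, not the paper's setup):

```python
import numpy as np

rng = np.random.RandomState(0)
n, m, l, gamma = 40, 5, 6, 0.01               # data size, anchors, labeled count

Z = rng.dirichlet(0.3 * np.ones(m), size=n)   # toy Z: nonnegative rows summing to 1
ZtZ = Z.T @ Z
L_red = ZtZ - ZtZ @ np.diag(1.0 / Z.sum(axis=0)) @ ZtZ   # reduced Laplacian

Zl = Z[:l]                                    # rows of the l labeled points
Y = np.zeros((l, 2))                          # one-hot labels, c = 2 classes
Y[:3, 0] = 1.0
Y[3:, 1] = 1.0

# eq. (9), solved for all classes at once: a*_j = (Zl^T Zl + gamma L~)^-1 Zl^T y_j
A = np.linalg.solve(Zl.T @ Zl + gamma * L_red, Zl.T @ Y)

# eq. (10): normalized class scores for the unlabeled points
lam = Z.sum(axis=0) @ A                       # λ_j = 1^T Z a*_j
pred = np.argmax((Z[l:] @ A) / lam, axis=1)
```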


AnchorGraph Regularization

Complexity Analysis

Space Complexity

O(n), since we only keep Z and the reduced m × m Laplacian in memory.

Time Complexity

O(m^2 n), since n ≫ m.

Table: Time complexity analysis. n is the data size, m is the number of anchor points, s is the number of nearest anchors used in designing Z, and T is the number of LAE iterations (n ≫ m ≫ s).

Stage                    AnchorGraphReg
find anchors             O(mn)
design Z                 O(smn) or O(smn + s^2 T n)
graph regularization     O(m^3 + m^2 n)
total time complexity    O(m^2 n)


Compared Methods

LGC, GFHF, PVM, Eigenfunction, etc.

random AGR0: random anchors, predefined Z

random AGR: random anchors, LAE-optimized Z

AGR0: k-means anchors, predefined Z

AGR: k-means anchors, LAE-optimized Z


Mid-sized Dataset

Table: Classification error rates (%) on USPS-Train (7,291 samples) with 100 labeled samples. The four versions of AGR use 1,000 anchors; the k-means clustering takes 7.65 seconds. Error rate: GFHF < AGR < AGR0 < LGC, and AGR is much faster than GFHF.

Method                 Error Rate (%)    Running Time (s)
1NN                    20.15±1.80        0.12
LGC with 6NN graph     8.79±2.27         403.02
GFHF with 6NN graph    5.19±0.43         413.28
random AGR0            11.15±0.77        2.55
random AGR             10.30±0.75        8.85
AGR0                   7.40±0.59         10.20
AGR                    6.56±0.55         16.57


Mid-sized Dataset

Figure: classification error rates (%) vs. the number of anchor points (200-1000) on USPS-Train with 100 labeled samples, comparing LGC, GFHF, random AGR0, random AGR, AGR0, and AGR. When the anchor size reaches 600, both AGR0 and AGR begin to outperform LGC.


Large Dataset

Table: Classification error rates (%) on Extended MNIST (630,000 samples) with 100 labeled samples. The two versions of AGR use 500 anchors; the k-means clustering takes 195.16 seconds. AGR achieves the best classification accuracy, roughly halving the error rate of the 1NN baseline.

Method               Error Rate (%)    Running Time (s)
1NN                  39.65±1.86        5.46
Eigenfunction        36.94±2.67        44.08
PVM (square loss)    29.37±2.53        266.89
AGR0                 24.71±1.92        232.37
AGR                  19.75±1.83        331.72


Conclusions

Develop a scalable SSL approach using the proposed large graph construction, AnchorGraph, which is guaranteed to result in positive semidefinite graph Laplacians.

Both the time and memory needed by AGR grow only linearly with the data size, so it enables us to apply SSL to even larger datasets with millions of samples.

AnchorGraph has wider utility, such as large-scale nonlinear dimensionality reduction, spectral clustering, regression, etc.


Thanks! Ask me via wliu@ee.columbia.edu.