Consistency of Spectral Algorithms for Hypergraphs under Planted Partition Model
Debarghya Ghoshdastidar
Ph.D. Thesis Defense
Advisor: Prof. Ambedkar Dukkipati
January 2, 2017
Debarghya Ghoshdastidar Ph.D. Thesis Defense Jan 2, 2017 1 / 47
Overview
Purpose of the work:
Theoretical study of spectral methods for hypergraph partitioning
Contributions:
Model for random hypergraphs with planted partition
Error bounds for partitioning planted hypergraphs
New algorithms with improved error rates
Analysis of edge sampling strategies
Bi-partite hypergraph coloring
Spectral Algorithm for Graph Partitioning: Spectral Clustering
Graph Partitioning
Objective:
High connectivity within clusters
Few edges across clusters (small cut)
Balanced partitions
Applications:
Network partitioning · Data clustering · Image segmentation
Spectral Graph Partitioning / Spectral Clustering
Input graph → good balanced cut
Pipeline: (normalized) adjacency matrix → find k dominant eigenvectors → run k-means on the rows
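The pipeline above can be sketched in a few lines. This is a minimal illustration with numpy only; the function name, the example graph, and the naive deterministic seeding (used in place of k-means++ for brevity) are all illustrative, not from the thesis:

```python
import numpy as np

def spectral_clustering(A, k, n_iter=50):
    """Sketch: normalized adjacency -> k dominant eigenvectors ->
    plain Lloyd's k-means on the rows of the eigenvector matrix."""
    d = np.maximum(A.sum(axis=1), 1e-12)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    M = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(M)                          # eigenvalues ascending
    X = vecs[:, -k:]                                     # k dominant eigenvectors
    # naive deterministic seeding for this sketch (not k-means++)
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# two dense blocks of 3 nodes joined by a single edge
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
A[2, 3] = A[3, 2] = 1.0
labels = spectral_clustering(A, 2)
```

On this toy graph the two triangles end up in different clusters despite the bridging edge.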
Theoretical analysis
Stochastic block model: [Holland, Laskey & Leinhardt '83]
Random hypergraph (V, E) on |V| = n nodes
Nodes have (hidden) class labels, ψ : {1, …, n} → {1, …, k}
P(e_uv ∈ E) depends on the labels of u, v
Question:
Error(ψ, ψ′) = min_σ ∑_{i=1}^n 1{ψ_i ≠ σ(ψ′_i)}   (ψ′ is the output labeling, σ ranges over label permutations)
Find β_n such that
Error(ψ, ψ′) ≤ β_n with probability 1 − o(1)
Consistency of algorithms:
Weakly consistent if βn = o(n); Strongly consistent if βn = o(1)
Spectral clustering is weakly consistent [Rohe, Chatterjee & Yu '11]
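The error measure above minimizes over label permutations σ. A brute-force sketch, fine for small k (labels are assumed to be 0, …, k−1; the function name is illustrative):

```python
from itertools import permutations

def partition_error(psi, psi_prime, k):
    """Error(psi, psi') from the slide: minimum over label
    permutations sigma of #{i : psi_i != sigma(psi'_i)}.
    Brute force over all k! permutations."""
    best = len(psi)
    for sigma in permutations(range(k)):
        mismatches = sum(a != sigma[b] for a, b in zip(psi, psi_prime))
        best = min(best, mismatches)
    return best

psi       = [0, 0, 1, 1, 2, 2]   # hidden labels
psi_prime = [2, 2, 0, 0, 1, 1]   # same partition, labels renamed
print(partition_error(psi, psi_prime, 3))   # -> 0
```

Renaming clusters costs nothing; only genuinely misplaced nodes count.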
Hypergraph Partitioning: Applications and Algorithms
Hypergraphs
Collection of sets / Generalization of graphs
Each edge can connect more than two nodes
Graph (2-uniform hypergraph) vs. 3-uniform hypergraph
m-uniform hypergraph:
Each edge connects m nodes
Adjacencies can be represented by an mth-order tensor:
A_{i1 i2 … im} = 1 if there is an edge on {i1, i2, …, im}, and 0 otherwise
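A minimal sketch of building such an adjacency tensor; the helper name and the example edges are illustrative:

```python
import numpy as np
from itertools import permutations

def adjacency_tensor(n, edges):
    """Adjacency tensor of an m-uniform hypergraph as on the slide:
    entry 1 whenever (i1, ..., im) is an ordering of some edge,
    so the tensor is symmetric in its indices."""
    m = len(edges[0])
    A = np.zeros((n,) * m)
    for e in edges:
        for idx in permutations(e):   # symmetrize over orderings
            A[idx] = 1.0
    return A

# 3-uniform hypergraph on 4 nodes with edges {0,1,2} and {1,2,3}
A = adjacency_tensor(4, [(0, 1, 2), (1, 2, 3)])
```

Each edge contributes m! = 6 unit entries, one per ordering of its nodes.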
Hypergraphs in Databases [Gibson, Kleinberg & Raghavan '00]
Gender:  Male | Female | Male | Male  | Female
Hair:    Red  | Black  | Bald | Black | Red
Glasses: Yes  | No     | Yes  | No    | No
Edges can be of varying sizes (non-uniform hypergraph)
Male, Black hair, Without glasses, and so on . . .
Hypergraphs in Computer Vision [Agarwal et al. '05]
Subspace clustering · Motion segmentation · Matching / Image registration
Involves 3-way / 4-way similarities (uniform hypergraph)
Hypergraph Partitioning Methods
Partitioning circuits [Schweikert & Kernighan '79]
Graph approximation for hypergraphs [Hadley '95]
Spectral hypergraph partitioning [Zien et al. '99]
hMETIS for VLSI design [Karypis & Kumar '00]
Uniform hypergraph in databases [Gibson et al. '00]
Uniform hypergraph in vision [Agarwal et al. '05]
Tensor based algorithms [Govindu '05; Chen & Lerman '09]
Learning with non-uniform hypergraph [Zhou et al. '07]
Higher order learning [Duchenne et al. '11; Rota Bulo & Pelillo '13; etc.]
Algorithms studied in our work
HOSVD / SCC: [Govindu '05; Chen & Lerman '09]
Uniform hypergraph partitioning using higher-order SVD of the adjacency tensor.
TTM / TeTrIS: (proposed)
Uniform hypergraph partitioning by solving a tensor trace maximization problem.
TeTrIS is an efficient (sampled) version of TTM.
NH-Cut: [Zhou, Huang & Scholkopf '07]
Non-uniform hypergraph partitioning by minimizing normalized hypergraph cut.
COLOR: (proposed)
Vertex 2-coloring of bi-partite non-uniform hypergraph.
Uniform Hypergraph Partitioning: Spectral Algorithms
Approach 1: Higher order SVD of adjacency tensor
Approach 2: Associativity or tensor trace maximization
Approach 1: Higher Order SVD
Matrix eigendecomposition:
A = U Σ Uᵀ, with U orthonormal and Σ diagonal
HOSVD of a 3rd-order tensor: [De Lathauwer et al. '00]
A = S ×₁ U⁽¹⁾ ×₂ U⁽²⁾ ×₃ U⁽³⁾, with orthonormal factor matrices U⁽ⁱ⁾ and core tensor S
HOSVD based Partitioning [Govindu '05]
m-uniform hypergraph
Adjacency tensor A → flattened matrix A (n × n^{m−1}) → find dominant left singular vectors → run k-means on the rows
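The spectral step above (flatten the tensor, then take dominant left singular vectors) can be sketched as follows; the example hypergraph with two groups of four nodes is illustrative:

```python
import numpy as np
from itertools import combinations, permutations

def hosvd_embedding(A_tensor, k):
    """HOSVD spectral step (sketch): mode-1 flattening of the adjacency
    tensor, then the k dominant left singular vectors; k-means on the
    rows would complete the partitioning."""
    n = A_tensor.shape[0]
    A_flat = A_tensor.reshape(n, -1)              # n x n^(m-1) flattening
    U, _, _ = np.linalg.svd(A_flat, full_matrices=False)
    return U[:, :k]                               # rows embed the n nodes

# 3-uniform hypergraph: all triples inside each of two groups of 4 nodes
n, groups = 8, [(0, 1, 2, 3), (4, 5, 6, 7)]
A = np.zeros((n, n, n))
for g in groups:
    for e in combinations(g, 3):
        for idx in permutations(e):
            A[idx] = 1.0
X = hosvd_embedding(A, 2)
```

Rows of the embedding coincide within a group and stay well separated across groups, so the k-means step is trivial here.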
Approach 2: Associativity Maximization
Normalized associativity:
For any cluster V₁ ⊂ V,
associativity(V₁) = ∑_{e ⊆ V₁} w(e),   volume(V₁) = ∑_{v ∈ V₁} degree(v)
Normalized associativity of a partition:
N-assoc(V₁, …, V_k) = ∑_{j=1}^k associativity(V_j) / volume(V_j)
Problem:
Find partition that maximizes N-assoc(V1, . . . ,Vk)
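A direct computation of N-assoc for a small weighted hypergraph; the function name and the example are illustrative:

```python
def n_assoc(edges, weights, clusters):
    """Normalized associativity from the slide:
    sum_j associativity(V_j) / volume(V_j), where
    associativity(V_j) = total weight of edges inside V_j and
    volume(V_j)        = sum of weighted degrees of nodes in V_j."""
    total = 0.0
    for Vj in clusters:
        Vj = set(Vj)
        assoc = sum(w for e, w in zip(edges, weights) if set(e) <= Vj)
        vol = sum(w for e, w in zip(edges, weights)
                  for v in e if v in Vj)       # w(e) counted once per member
        total += assoc / vol
    return total

edges   = [(0, 1, 2), (0, 1, 3), (2, 3, 4), (3, 4, 5)]
weights = [1.0, 1.0, 1.0, 1.0]
score = n_assoc(edges, weights, [{0, 1, 2}, {3, 4, 5}])   # 1/6 + 1/6
```

Each cluster contains one of the four edges entirely, and each cluster has volume 6, giving 1/3 in total.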
Tensor Trace Maximization (TTM)
Problem (reformulated):
For an m-uniform hypergraph,
N-assoc(V₁, …, V_k) = (1/m!) · Trace( A ×₁ Y^{b₁} ×₂ ⋯ ×ₘ Y^{b_m} ),
where Y ∈ R^{n×k} has orthogonal columns, and ∑_j b_j = 1
Spectral relaxation of TTM:
Set b₁ = b₂ = 1/2, b₃ = … = b_m = 0 and X = Y^{1/2}
Optimize over all orthonormal X
Spectral TTM Algorithm [Ghoshdastidar & Dukkipati, ICML'15]
m-uniform hypergraph: adjacency tensor A → add slices of the tensor to form matrix A → find k dominant eigenvectors → run k-means on the rows
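The "add slices of tensor" step for m = 3 is just a sum along one mode; a sketch, with an illustrative two-edge example:

```python
import numpy as np
from itertools import permutations

def ttm_matrix(A_tensor):
    """Spectral TTM step for a 3-uniform hypergraph (sketch): sum the
    slices of the adjacency tensor into an n x n matrix; its k dominant
    eigenvectors then feed the usual k-means step."""
    return A_tensor.sum(axis=2)

# symmetric adjacency tensor for edges {0,1,2} and {1,2,3}
n = 4
A = np.zeros((n, n, n))
for e in [(0, 1, 2), (1, 2, 3)]:
    for idx in permutations(e):
        A[idx] = 1.0
M = ttm_matrix(A)
```

Entry M[i, j] counts the edges containing both i and j, so the slice sum acts like a weighted graph reduction of the hypergraph.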
Uniform Hypergraph Partitioning: Consistency
Planted partition model for uniform hypergraphs
Error bounds for algorithms
Planted Partition Model (graph)
Sparse Stochastic Block Model: [Lei & Rinaldo '15]
Given n nodes, and k (hidden) classes
An unknown matrix B ∈ [0, 1]k×k symmetric
An unknown sparsity factor αn
Independent edges with probabilities depending on labels
Classes: Class-1, Class-2, Class-3
Prob(edge between two Class-1 nodes) = α_n B₁₁, between Class-1 and Class-2 = α_n B₁₂, between Class-1 and Class-3 = α_n B₁₃, …
Planted Partition Model (uniform hypergraph)
Extension of Sparse SBM: (proposed)
Given n nodes, and k (hidden) classes
Unknown mth-order tensor B ∈ [0, 1]k×k×...×k
Unknown sparsity factor αn
Independent edges with label-dependent distribution
Unweighted hypergraph: Prob(edge) = α_n B_{i1 i2 … im}
Weighted hypergraph: w(edge) ∈ [0, 1] with E[w(edge)] = α_n B_{i1 i2 … im}
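A sketch of sampling from this model in the unweighted 3-uniform case; the function and variable names are illustrative, and the sanity check uses the degenerate setting of within-cluster probability 1 and 0 otherwise:

```python
import numpy as np
from itertools import combinations

def planted_3uniform(labels, B, alpha, seed=0):
    """Sample an unweighted 3-uniform planted-partition hypergraph:
    each triple {i,j,l} becomes an edge independently with probability
    alpha * B[label_i, label_j, label_l]."""
    rng = np.random.default_rng(seed)
    edges = []
    for (i, j, l) in combinations(range(len(labels)), 3):
        p = alpha * B[labels[i], labels[j], labels[l]]
        if rng.random() < p:
            edges.append((i, j, l))
    return edges

# degenerate check: within-cluster triples certain, all others impossible
B = np.zeros((2, 2, 2))
B[0, 0, 0] = B[1, 1, 1] = 1.0
edges = planted_3uniform([0, 0, 0, 1, 1, 1], B, alpha=1.0)
```

With these extreme probabilities exactly the two within-cluster triples appear; smaller α_n values sparsify the hypergraph.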
Consistency of HOSVD [Ghoshdastidar & Dukkipati, NIPS, 2014]
Define:
n_max (n_min) = maximum (minimum) cluster size
𝒜 = E[A Aᵀ] and A_min = min{𝒜_ij : 𝒜_ij > 0}
δ = kth eigen-gap of normalized 𝒜
Theorem
There exists a constant C > 0 such that if
δ > 0 and A_min > C k n_max (log n)² / (n_min δ²),
then with probability 1 − o(1),
Error(ψ, ψ′) = O( k n_max log n / (δ² A_min) ) = o(n).
Consistency of TTM [Ghoshdastidar & Dukkipati, ICML, 2015]
Define:
d = min_i E[degree(i)] = min_i ∑_{e ∋ i} E[w(e)]
δ = kth eigen-gap of normalized E[A]
Theorem
There exists a constant C > 0 such that if
δ > 0 and d > C k n_max (log n)² / (n_min δ²),
then with probability 1 − o(1),
Error(ψ, ψ′) = O( k n_max log n / (δ² d) ) = o(n).
Special Case
m-uniform hypergraph
k = O(log n) clusters of equal size
Edge probabilities
Prob(edge) = α_n p if the edge lies within a cluster, α_n q otherwise (p > q)

Allowable sparsity:
HOSVD: α_n = Ω( (log n)^{m+1.5} / n^{(m−1)/2} )    TTM: α_n = Ω( (log n)^{2m+1} / n^{m−1} )
Error for dense hypergraphs (α_n = 1):
HOSVD: O( (log n)^{2m+1} / n^{m−2} )    TTM: O( (log n)^{2m−1} / n^{m−2} )
Non-uniform Hypergraph Partitioning: Algorithm and Consistency
Approach 3: Normalized hypergraph cut minimization
Planted partition model for non-uniform hypergraphs
Consistency result (with proof sketch)
Normalized Hypergraph Cut
Approach: [Zhou, Huang & Scholkopf '07]
Solve spectral relaxation of minimizing normalized hypergraph cut
Reduction to graph:
A, D ∈ R^{n×n} so that A_ij = ∑_{e ∋ i,j} 1/|e|, D_ii = degree(i)
Spectral clustering:
Normalized Laplacian, L = I − D^{−1/2} A D^{−1/2}
Compute k leading orthonormal eigenvectors of L
k-means on normalized rows of eigenvector matrix
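The reduction above can be sketched directly. Here degree(i) is taken as the number of edges containing node i, which matches the weighted-degree convention for unit edge weights (a sketch, with illustrative names):

```python
import numpy as np

def nhcut_laplacian(n, edges):
    """Graph reduction used by NH-Cut: A_ij = sum over edges e
    containing both i and j of 1/|e|, D_ii = degree(i), and then
    L = I - D^{-1/2} A D^{-1/2}."""
    A = np.zeros((n, n))
    deg = np.zeros(n)
    for e in edges:
        for i in e:
            deg[i] += 1.0                    # edges containing node i
            for j in e:
                if i != j:
                    A[i, j] += 1.0 / len(e)  # 1/|e| per shared edge
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    return np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

L = nhcut_laplacian(3, [(0, 1, 2)])   # single 3-edge on three nodes
```

For the single 3-edge every degree is 1, so L = I − A with off-diagonal entries −1/3.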
Planted Partition Model (non-uniform hypergraph)
Model: (proposed)
Given n nodes, and k (hidden) classes
Maximum edge cardinality M
Unknown mth-order tensors B^{(m)} ∈ [0, 1]^{k×k×…×k}
Unknown sparsity factors αm,n, m = 2, 3, . . . ,M
Independent edges with label-dependent distribution
Prob(m-edge) = α_{m,n} B^{(m)}_{i1 i2 … im}
Consistency of NH-Cut [Ghoshdastidar & Dukkipati, Ann. Stat., 2017]
Define:
𝒜 = E[A], 𝒟 = E[D], and ℒ = I − 𝒟^{−1/2} 𝒜 𝒟^{−1/2}
d = min_i E[degree(i)]
δ = kth eigen-gap of ℒ
Theorem
There exists a constant C > 0 such that if
δ > 0 and d > C k n_max (log n)² / (n_min δ²),
then with probability 1 − o(1),
Error(ψ, ψ′) = O( k n_max log n / (δ² d) ) = o(n).
Proof of consistency
Stage 1: (expected case)
If δ > 0, then 𝒜 is essentially of rank k
If 𝒜 were used instead of A, then Error = 0
Stage 2: (matrix concentration)
A can be expressed as a sum of independent random matrices:
A = ∑_{e ∈ 2^V} 1{e ∈ E} · (1/|e|) h_e h_eᵀ
If d > 9 log n for all large n, then w.p. 1 − 4/n²,
‖L − ℒ‖₂ ≤ 12 √(log n / d)
Proof uses a matrix concentration inequality [Tropp '12]
Proof of consistency
Stage 3: (matrix perturbation)
X, 𝒳 = row-normalized eigenvector matrices of L, ℒ
If δ > 24 √(log n / d) for all large n, then w.p. 1 − 4/n²,
‖X − 𝒳‖_F ≤ (24/δ) √(2 k n_max log n / d)
Proof uses matrix perturbation bounds [Davis & Kahan '70]
Stage 4: (analyzing k-means)
Rows of 𝒳 are ε-separated for ε = (log n)^{−1/2}
k-means succeeds w.p. 1 − o(1), and Error = O(‖X − 𝒳‖_F²)
Based on guarantees for k-means [Ostrovsky et al. '12]
Sampling Hypergraph Edges
Consistency of partitioning with edge sampling
Approach 4: TTM with iterative sampling
Numerical comparison
Edge Sampling (weighted m-uniform hypergraph)
Complexity of tensor methods:
O(nm) runtime to compute all edge weights
Typically m = 3 to 8 in practice
Efficient variant: use only N ≪ n^m sampled edges
Question:
Edges sampled with replacement
Sampling distribution (pe)e∈E
Find min. number of samples needed for consistency
Sampling bound for TTM: [Ghoshdastidar & Dukkipati, arXiv:1602.06516]
(Special case) Error = o(n) if
Uniform sampling: N = Ω( α_n^{−1} k^{2m−1} n (log n)² )
Weighted sampling, p_e ∝ w(e): N = Ω( k^{2m−1} n (log n)² )
TTM with Iterative Sampling (TeTrIS)
Iterative Sampling:
Principle:
Sample edges with large weight more frequently
Edges within a cluster usually have large weight
Approach (SCC): [Chen & Lerman '09]
Sample few edges
Cluster using the HOSVD based method
Re-sample with preference to within-cluster edges
Re-cluster and repeat till convergence
TeTrIS Algorithm: [proposed]
Replace HOSVD step by TTM
The sampling bound for TTM justifies the usefulness of sampling large-weight edges via iterative sampling
Numerical Comparison
Motion Segmentation:
Cluster motion trajectories
Posed as subspace clustering problem
Each motion – subspace of dimension ≤ 4
Mean clustering error on Hopkins 155 data set (%)
Method        2-motion (120 videos)  3-motion (35 videos)  All
k-means       19.57                  26.16                 21.06
k-flats       13.05                  15.78                 13.67
SSC            1.53                   4.40                  2.18
LRR            2.13                   4.03                  2.56
NSN            3.62                   8.28                  4.67
SCC (HOSVD)    2.38                   5.71                  3.13
TeTrIS (TTM)   1.36                   5.38                  2.27
Hypergraph vertex 2-coloring
Objective: No edge can be mono-chromatic
Assume: planted bi-partite hypergraph with M = O(1) and E[#edges] ≥ C n log n
Algorithm: [Ghoshdastidar & Dukkipati, arXiv:1507.00763]
Spectral step:
Let A ∈ R^{n×n} with A_ij = ∑_{e ∋ i,j} 1/|e|
Compute the eigenvector x for the smallest eigenvalue of A
Color node i red if x_i > 0, else blue
⇒ Achieves error < cn for some constant c ≪ 1
Iterative refinement:
Re-color node i red if ∑_{j ∈ V_R} A_ij < ∑_{j ∈ V_B} A_ij, else blue
⇒ Error reduces by half in each iteration (log₂ n steps suffice)
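A sketch of both steps. The test case below is a bipartite structure given as 2-edges, though the construction of A handles edges of any cardinality; all names are illustrative:

```python
import numpy as np

def spectral_two_color(n, edges, refine_steps=5):
    """Sketch of the 2-coloring algorithm: sign of the eigenvector for
    the smallest eigenvalue of A gives an initial coloring; each
    refinement pass re-colors node i toward the side it is *less*
    connected to, as on the slide."""
    A = np.zeros((n, n))
    for e in edges:
        for i in e:
            for j in e:
                if i != j:
                    A[i, j] += 1.0 / len(e)
    _, vecs = np.linalg.eigh(A)          # eigenvalues ascending
    red = vecs[:, 0] > 0                 # initial coloring by sign
    for _ in range(refine_steps):
        to_red = A[:, red].sum(axis=1)   # weight toward current reds
        to_blue = A[:, ~red].sum(axis=1)
        red = to_red < to_blue           # join the lighter side
    return red

# bipartite structure: every edge joins the two sides {0,1,2} and {3,4,5}
edges = [(i, j) for i in range(3) for j in range(3, 6)]
red = spectral_two_color(6, edges)
```

On this example the spectral step already recovers the two sides exactly, and the refinement passes leave the coloring fixed.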
Summary
Hypergraph partitioning can be done efficiently
First study of planted non-uniform hypergraphs
Literature considers only planted k-SAT / 2-coloring
Statistical analysis of tensor based methods
Popular in practice, but no known error bound
Removing the assumptions on k-means
First study of sampled spectral algorithms
Justification for iterative sampling
Debarghya Ghoshdastidar Ph.D. Thesis Defense Jan 2, 2017 37 / 47
Further works & open questions
Extension to large-scale hypergraph partitioning
Down-sampling of hypergraphs
Analysis of other approaches under the planted model
Move-based strategies
Optimization-based algorithms
Study of sparse planted hypergraphs
Overlapping communities / degree heterogeneity
Algorithmic barrier for partitioning [Angelini et al. '15; Florescu & Perkins '16]
Generalization of graph problems to hypergraphs
Theoretical studies
Applications
Thank You
Acknowledgment: The work was supported by a Google Ph.D. Fellowship in Statistical Learning Theory
References
Agarwal, S., Lim, J., Zelnik-Manor, L., Perona, P., Kriegman, D. & Belongie,S. (2005). In IEEE Computer Vision and Pattern Recognition 838-845.
Angelini, M. C., Caltagirone, F., Krzakala, F. and Zdeborova, L. (2015). InAnnual Allerton Conference on Communication, Control, and Computing.
Chen, G. & Lerman, G. (2009). International Journal of Computer Vision81(3) 317-330.
Davis, C. & Kahan, W. M. (1970). SIAM Journal on Numerical Analysis 7(1)1-46.
De Lathauwer, L., De Moor, B. and Vandewalle, J. (2000). SIAM Journal onMatrix Analysis and Applications 21(4) 1253-1278.
Duchenne, O., Bach, F., Kweon, I.-S. & Ponce, J. (2011). IEEE Transactionson Pattern Analysis and Machine Intelligence 33(12) 2383-2395.
Florescu, L. & Perkins, W. (2016). In Conference on Learning Theory.
Ghoshdastidar, D. & Dukkipati, A. (2014). In Advances in Neural Information Processing Systems 397-405.
References
Ghoshdastidar, D. & Dukkipati, A. (2015). In International Conference onMachine Learning.
Ghoshdastidar, D. & Dukkipati, A. (2015). Annals of Statistics (in press).
Ghoshdastidar, D. & Dukkipati, A. (2015). arXiv preprint 1507.00763.
Ghoshdastidar, D. & Dukkipati, A. (2016). arXiv preprint 1602.06516.
Gibson, D., Kleinberg, J. & Raghavan, P. (2000). VLDB Journal 8 222-236.
Govindu, V. M. (2005). In IEEE Computer Vision and Pattern Recognition1150-1157.
Hadley, S. W. (1995). Discrete Applied Mathematics 59 115-127.
Hein, M., Setzer, S., Jost, L. and Rangapuram, S. (2013). In Advances in Neural Information Processing Systems 2427-2435.
Holland, P. W., Laskey, K. B. & Leinhardt, S. (1983). Social Networks 5109-137.
Karypis, G. & Kumar, V. (2000). VLSI Design 11 285-300.
Lei, J. & Rinaldo, A. (2015). Annals of Statistics 43 215-237.
References
Ostrovsky, R., Rabani, Y., Schulman, L. J. & Swamy, C. (2012). Journal of theACM 59(6) 28:128.
Rohe, K., Chatterjee, S., & Yu, B. (2011). Annals of Statistics 39 1878-1915.
Rota Bulo, S. & Pelillo, M. (2013). IEEE Transactions on Pattern Analysis and Machine Intelligence 35(6) 1312-1327.
Schweikert, G. & Kernighan, B. W. (1979). In Design Automation Workshop57-62.
Tropp, J. A. (2012). Foundations of Computational Mathematics 12(4)389-434.
Zien, J. Y., Schlag, M. D. F. and Chan, P. K. (1999). IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems 13(9) 1088-1096.
Zhou, D., Huang, J. and Scholkopf, B. (2007). In Advances in Neural Information Processing Systems 1601-1608.
Consistency of Sampled TTM (General Case)
Define:
γ = max_e w(e)/p(e), where p(e) = P(e is sampled)
d = min_i E[degree(i)], δ = kth eigen-gap of normalized E[A]
Theorem [Ghoshdastidar & Dukkipati '16]
There exist constants C, C′ > 0 such that if
δ > 0, d > C k n_max (log n)² / (n_min δ²),
and N > C′ (1 + 2γ/d) · k n_max (log n)² / (n_min δ²),
then with probability 1 − o(1),
Error(ψ, ψ′) = o(n).
More Numerical Results
Numerical Comparison (uniform hypergraph)
Subspace clustering
60 points in 5-dim ambient space
Data from union of three random lines (1-dim subspaces)
Data perturbed by Gaussian noise of standard deviation σa
Fractional error (over 20 runs)
Algorithm   σ_a = 0.02   σ_a = 0.05
SNTF        0.025        0.086
hMETIS      0.045        0.118
HGT         0.083        0.222
HOSVD       0.052        0.126
TTM         0.033        0.103
Numerical Comparison (sampled uniform hypergraph)
Subspace clustering
5-dim ambient space
Data from union of five 3-dim subspaces (added noise)
[Figure: fractional error (over 50 runs) vs. number of points in each subspace, n/k, at several noise levels σ_a]
Numerical Comparison (non-uniform hypergraph)
Categorical data clustering
Data set         #instances   #attributes   #attr. values
Voting records   435          16            3
Mushroom         8124         22            varies

Fractional error
Data set    ROCK   CoolCat   LIMBO   hMETIS   NH-Cut
Voting      0.16   0.15      0.13    0.24     0.12
Mushroom    0.43   0.27      0.11    0.48     0.11