Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering

M. Defferrard (1), X. Bresson (2), P. Vandergheynst (1)

(1) EPFL, Lausanne, Switzerland

(2) Nanyang Technological University, Singapore

Presented by Itay Boneh and Asher Kabakovitch
Tel-Aviv University, Deep Learning Seminar

2017

Spectral Filtering: Non-parametric

From the convolution theorem:

$x \ast_G g = U\,(U^T g \odot U^T x) = U\,(\hat g \odot U^T x)$

or algebraically:

$x \ast_G g = U \operatorname{diag}(\hat g_1, \ldots, \hat g_n)\, U^T x$

- Not localized
- O(n) parameters to train
- O(n^2) multiplications (no FFT)
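
To make the cost concrete, here is a minimal numpy sketch of this non-parametric filtering (all names are illustrative; `g_hat` holds one free parameter per eigenvalue):

```python
import numpy as np

def spectral_filter(x, L, g_hat):
    """Non-parametric spectral filtering: y = U diag(g_hat) U^T x.

    x     : (n,) graph signal
    L     : (n, n) symmetric graph Laplacian
    g_hat : (n,) filter response, one free parameter per eigenvalue
    """
    lam, U = np.linalg.eigh(L)   # eigendecomposition: L = U diag(lam) U^T
    x_hat = U.T @ x              # graph Fourier transform of the signal
    y_hat = g_hat * x_hat        # pointwise filtering in the spectral domain
    return U @ y_hat             # inverse graph Fourier transform
```

The two dense products with U are the O(n^2) cost noted above, and there is no graph analogue of the FFT to avoid them.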

Spectral Filtering: Polynomial Parametrization

$g_\theta$ is a continuous 1-D function of the eigenvalues, parametrized by $\theta$:

$x \ast_G g = U\,(g_\theta(\lambda) \odot U^T x) = U \operatorname{diag}(g_\theta(\lambda_1), \ldots, g_\theta(\lambda_n))\, U^T x = U g_\theta(\Lambda) U^T x = g_\theta(L)\, x$

(Recall: if $T\phi_i = \lambda_i \phi_i$, then $f(T) = \Phi f(\Lambda) \Phi^{-1}$.)

Polynomial parametrization:

$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$

- K-localized: each application of L spreads the filter by one hop
- K parameters to train, independent of the graph's size
- Still O(n^2) multiplications (dense multiplications with the Fourier basis U)
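
Since $g_\theta(L)x = \sum_k \theta_k L^k x$, the filter can also be evaluated by repeatedly applying L, without ever forming U; a sketch (with `theta` holding the K coefficients):

```python
import numpy as np

def poly_filter(x, L, theta):
    """Evaluate y = sum_k theta[k] * L^k x by iterated matrix-vector products.

    K-localized: L^k x only mixes values within k hops of each vertex."""
    y = np.zeros_like(x)
    Lk_x = x.copy()              # starts at L^0 x = x
    for theta_k in theta:
        y += theta_k * Lk_x
        Lk_x = L @ Lk_x          # advance to the next power of L
    return y
```

With a sparse L each product costs O(|E|) rather than O(n^2); the Chebyshev recursion below keeps this cheap evaluation while being numerically better behaved than raw monomials.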

Spectral Filtering: Polynomial Parametrization

[Figures: impulse response of a localized filter on a 2D Euclidean domain vs. on a graph]

Spectral Filtering: Recursive Polynomial Parametrization

Chebyshev polynomials: $T_k(\lambda) = 2\lambda T_{k-1}(\lambda) - T_{k-2}(\lambda)$, with $T_0(\lambda) = 1$ and $T_1(\lambda) = \lambda$.

Parametrization:

$g_\theta(\tilde L) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde L)$

$\tilde L = 2L/\lambda_{\max} - I_n$ (rescaled so the spectrum lies in $[-1, 1]$, where the $T_k$ form an orthogonal basis)

Spectral Filtering: Recursive Polynomial Parametrization

Filtering:

$y = g_\theta(\tilde L)\, x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde L)\, x = \sum_{k=0}^{K-1} \theta_k \bar x_k$

Recurrence: $\bar x_k = T_k(\tilde L)\, x = 2\tilde L \bar x_{k-1} - \bar x_{k-2}$, with $\bar x_0 = x$ and $\bar x_1 = \tilde L x$.

- K-localized: each application of L spreads the filter by one hop
- K parameters to train, independent of the graph's size
- O(Kn) multiplications (actually $O(K|\mathcal{E}|)$ for a sparse Laplacian)
- No EVD of L needed
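
A scipy sketch of this recurrence, assuming a sparse Laplacian `L` and its largest eigenvalue `lmax` (obtainable once with `scipy.sparse.linalg.eigsh`):

```python
import numpy as np
import scipy.sparse as sp

def chebyshev_filter(x, L, theta, lmax):
    """y = sum_k theta[k] T_k(L_tilde) x, computed in O(K |E|) without any EVD."""
    n = L.shape[0]
    L_tilde = (2.0 / lmax) * L - sp.identity(n)  # rescale spectrum into [-1, 1]
    x0 = x                                       # T_0(L~) x = x
    y = theta[0] * x0
    if len(theta) > 1:
        x1 = L_tilde @ x                         # T_1(L~) x = L~ x
        y = y + theta[1] * x1
    for theta_k in theta[2:]:
        x2 = 2 * (L_tilde @ x1) - x0             # T_k = 2 L~ T_{k-1} - T_{k-2}
        y = y + theta_k * x2
        x0, x1 = x1, x2
    return y
```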

Graph CNN

Graph CNN: Learning Filters

$y_{s,j} = \sum_{i=1}^{F_{in}} g_{\theta_{i,j}}(L)\, x_{s,i}$

$\frac{\partial L}{\partial \theta_{i,j}} = \sum_{s=1}^{S} [\bar x_{s,i,0}, \ldots, \bar x_{s,i,K-1}]^T \frac{\partial L}{\partial y_{s,j}}$

$\frac{\partial L}{\partial x_{s,i}} = \sum_{j=1}^{F_{out}} g_{\theta_{i,j}}(L)\, \frac{\partial L}{\partial y_{s,j}}$

- $s = 1, \ldots, S$: sample index
- $i = 1, \ldots, F_{in}$: input feature map index
- $j = 1, \ldots, F_{out}$: output feature map index
- $\theta_{i,j}$: one of the $F_{in} \times F_{out}$ Chebyshev coefficient vectors, each of order K
- $L$: loss over a mini-batch of S samples (context distinguishes it from the Laplacian)

Cost: $O(K |\mathcal{E}| F_{in} F_{out} S)$ operations

Easily parallelized
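
A dense-numpy sketch of the forward pass only (an autograd framework would supply the two gradients above); the shapes and the `einsum` formulation are assumptions for illustration:

```python
import numpy as np

def graph_conv(X, L_tilde, Theta):
    """One graph-convolutional layer: y_{s,j} = sum_i g_{theta_{i,j}}(L) x_{s,i}.

    X       : (S, n, F_in)     mini-batch of multi-channel graph signals
    L_tilde : (n, n)           rescaled Laplacian (dense here, for clarity)
    Theta   : (K, F_in, F_out) Chebyshev coefficients
    returns : (S, n, F_out)
    """
    K = Theta.shape[0]
    Xb = [X]                                               # T_0(L~) X = X
    if K > 1:
        Xb.append(np.einsum('nm,smi->sni', L_tilde, X))    # T_1(L~) X = L~ X
    for _ in range(2, K):
        Xb.append(2 * np.einsum('nm,smi->sni', L_tilde, Xb[-1]) - Xb[-2])
    Xb = np.stack(Xb)                                      # (K, S, n, F_in)
    # Contract the Chebyshev order and the input feature maps with Theta.
    return np.einsum('ksni,kio->sno', Xb, Theta)
```

Each sample and each output feature map is independent of the others, which is what makes the computation easy to parallelize.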

Graph Pooling: Coarsening

- Multilevel clustering algorithm
- Reduces the size of the graph by a specified factor (here, 2)
- Does all this efficiently

Graclus multilevel clustering algorithm:

- Maximizes the local normalized cut.
- Greedily pick an unmarked vertex i and match it with an unmatched vertex j that maximizes the local normalized cut $W_{i,j}(1/d_i + 1/d_j)$ (see the sketch after this list).
- Extremely fast.
- Divides the number of nodes by approximately 2 at each level.
- May generate singleton (unmatched) nodes; this is solved by adding fake nodes.
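
A simplified dense sketch of one level of this greedy matching (Graclus itself visits vertices in random order and works on sparse adjacency structures; `W` and the cluster representation here are illustrative):

```python
import numpy as np

def greedy_local_cut_matching(W):
    """Match vertex pairs by greedily maximizing W_ij * (1/d_i + 1/d_j).

    W: (n, n) symmetric weight matrix with no isolated vertices.
    Returns clusters as 1- or 2-element lists of vertex ids."""
    n = W.shape[0]
    d = W.sum(axis=1)                      # vertex degrees
    unmatched = set(range(n))
    clusters = []
    for i in range(n):
        if i not in unmatched:
            continue
        unmatched.discard(i)
        best_j, best_cut = None, 0.0
        for j in unmatched:
            if W[i, j] > 0:
                cut = W[i, j] * (1.0 / d[i] + 1.0 / d[j])  # local normalized cut
                if cut > best_cut:
                    best_j, best_cut = j, cut
        if best_j is None:
            clusters.append([i])           # singleton: later paired with a fake node
        else:
            unmatched.discard(best_j)
            clusters.append([i, best_j])
    return clusters
```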

Graph Pooling: Fast Pooling

Example: pooling by 4

- Graclus generates singletons: $n_0 = 8 \to n_1 = 5 \to n_2 = 3$
- Adding fake nodes gives: $n_2 = 3 \to n_1 = 6 \to n_0 = 12$
- $z = [\max(x_0, x_1),\ \max(x_4, x_5, x_6),\ \max(x_8, x_9, x_{10})]$ (fake nodes are ignored by the max)
- Balanced binary trees $\Rightarrow$ pooling becomes regular 1-D pooling, efficient on GPUs.
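
With the vertices of each level reordered so that siblings are adjacent (and fake nodes marked), pooling reduces to regular 1-D pooling; a sketch, where `perm` is assumed to come from the coarsening step:

```python
import numpy as np

def fake_node_max_pool(x, perm):
    """Pool pairs of sibling vertices with max, treating fake nodes as -inf.

    x    : (n,) signal on the real vertices
    perm : (n_padded,) vertex ids rearranged so siblings are adjacent,
           with -1 marking fake nodes; n_padded is even.
    """
    xp = np.full(len(perm), -np.inf)
    real = perm >= 0
    xp[real] = x[perm[real]]                 # scatter real values; fakes stay -inf
    return xp.reshape(-1, 2).max(axis=1)     # regular 1-D max-pooling by 2
```

Pooling by 4, as in the example above, is just two such levels.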

Experiments: MNIST

$W_{i,j} = e^{-\|x_i - x_j\|_2^2 / \sigma^2}$

- 28×28 pixels + 192 fake nodes $\Rightarrow n = |V| = 976$
- 8-NN graph $\Rightarrow |\mathcal{E}| = 3{,}198$
- Architecture based on LeNet-5 $\Rightarrow K = 5$

- Isotropic filters (no orientation)
- Optimizations and initializations were left uninvestigated
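
A scikit-learn sketch of the graph construction described above (the $\sigma$ heuristic is an assumption; the slide does not pin it down):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_gaussian_graph(coords, k=8):
    """k-NN graph with Gaussian weights W_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    dist = kneighbors_graph(coords, k, mode='distance')  # sparse distances
    sigma2 = dist.data.mean() ** 2                       # heuristic bandwidth
    dist.data = np.exp(-dist.data ** 2 / sigma2)
    return dist.maximum(dist.T)                          # symmetrize

# MNIST: vertices are the 28x28 pixel positions.
grid = np.stack(np.meshgrid(np.arange(28), np.arange(28)),
                axis=-1).reshape(-1, 2).astype(float)
W = knn_gaussian_graph(grid, k=8)
```

The same recipe with word2vec embeddings and k = 16 produces the 20NEWS word graph of the next slide.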

Experiments: 20NEWS

- 18,846 text documents associated with 20 classes
- 10K most common words out of the 94K unique words $\Rightarrow n = |V| = 10{,}000$
- 16-NN graph with $W_{i,j} = e^{-\|z_i - z_j\|_2^2 / \sigma^2}$, where $z_i$ is the word2vec embedding of word $i$ $\Rightarrow |\mathcal{E}| = 132{,}834$
- Each signal $x$ is a bag-of-words representation of a document (see the sketch after this list)
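
A sketch of the document signals with scikit-learn (vocabulary size matches the slide; the paper's exact preprocessing may differ):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(subset='train')
# Keep the 10,000 most frequent words: one graph vertex per word.
vectorizer = CountVectorizer(max_features=10_000, stop_words='english')
X = vectorizer.fit_transform(docs.data)       # (n_docs, 10000) bag-of-words signals
words = vectorizer.get_feature_names_out()    # the vertices of the word graph
```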

Experiments: 20NEWS

Comparison with other methods (K = 5):

- Slightly worse than a multinomial naive Bayes classifier.
- Beats fully-connected networks with far fewer parameters.

Total training time divided by the number of gradient steps:

- Scales as O(n), as opposed to O(n^2) for non-parametric spectral filtering.

Experiments: Graph Quality

$\Rightarrow$ The structure of the data is important.

$\Rightarrow$ A well-constructed graph is important.

$\Rightarrow$ Proper approximations (e.g., LSHForest) can be used for larger databases.

Conclusions and Future Work

- Introduced a model with linear computational complexity.
- The quality of the input graph is of paramount importance.
- Local stationarity and compositionality are verified for text documents as long as the graph is well constructed.