
Fisher-Bures Adversary Graph Convolutional Networks

Ke Sun†∗   Piotr Koniusz†‡   Zhen Wang†
†CSIRO Data61   ‡Australian National University
{Ke.Sun, Peter.Koniusz, Jeff.Wang}@data61.csiro.au

Abstract

In a graph convolutional network, we assume that the graph G is generated w.r.t. some observation noise. During learning, we make small random perturbations ∆G of the graph and try to improve generalization. Based on quantum information geometry, ∆G can be characterized by the eigendecomposition of the graph Laplacian matrix. We try to minimize the loss w.r.t. the perturbed G + ∆G while making ∆G effective in terms of the Fisher information of the neural network. Our proposed model can consistently improve graph convolutional networks on semi-supervised node classification tasks with reasonable computational overhead. We present three different geometries on the manifold of graphs: the intrinsic geometry measures the information theoretic dynamics of a graph; the extrinsic geometry characterizes how such dynamics can affect externally a graph neural network; the embedding geometry is for measuring node embeddings. These new analytical tools are useful in developing a good understanding of graph neural networks and fostering new techniques.

1 INTRODUCTION

Recently, neural network architectures have been introduced [15, 36, 7, 12, 21, 17, 42] to learn high-level features of objects based on a given graph among these objects. These graph neural networks, especially graph convolutional networks (GCNs), showed record-breaking scores on diverse learning tasks. Similar to the idea of data augmentation, this paper improves GCN generalization by minimizing the expected loss w.r.t. small random perturbations of the input graph. In order to do so, we must first have a rigorous definition of the manifold of graphs denoted by M, which is the space of all graphs satisfying certain constraints. Then, based on the local geometry of M around a graph G ∈ M, we can derive a compact parameterization of the perturbation so that it can be plugged into a GCN. We will show empirically that the performance of GCN can be improved and present theoretical insights on the differential geometry of M.

∗Corresponding author

Notations

We assume an undirected graph G without self-loops consisting of n nodes indexed as 1, · · · , n. X_{n×D} denotes the given node features, H_{n×d} denotes some learned high-level features, and Y_{n×O} denotes the one-hot node labels. All these matrices contain one sample per row. The graph structure is represented by the adjacency matrix A_{n×n} that can be binary or weighted, so that a_{ij} ≥ 0, and a_{ij} = 0 indicates no link between nodes i and j. The neural network weights are denoted by the matrix W^l, where l = 1, · · · , L indexes the layers. We use capital letters such as A, B, · · · to denote matrices and small letters such as a, b, · · · to denote vectors. We try to use Greek letters such as α, β, · · · to denote scalars. These rules have exceptions.

Problem Formulation

In a vanilla GCN model (see the approximations [21] based on [12]), the network architecture is recursively defined by

$$H^{l+1} = \sigma\left(\tilde{A} H^l W^l\right), \qquad H^0 = X,$$

where H^l_{n×d_l} is the feature matrix of the l'th layer with its rows corresponding to the samples, W^l_{d_l×d_{l+1}} is the sample-wise feature transformation matrix, Ã_{n×n} is the normalized adjacency matrix so that Ã = (D + I)^{−1/2} (A + I) (D + I)^{−1/2}, I is the identity matrix, D = diag(A1) is the degree matrix, 1 is the vector of all ones, diag(·) means a diagonal matrix w.r.t. the given diagonal entries, and σ is an element-wise nonlinear activation function. Based on a given set of samples X and optionally the corresponding labels Y, learning is implemented by min_W ℓ(X, Y, A, W), where ℓ is a loss (e.g. cross-entropy), usually expressed in terms of Y and H^L, the feature matrix obtained by stacking multiple GCN layers.
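To make the recursion concrete, here is a minimal NumPy sketch of the forward pass above; the function names, the toy graph, and the random weights are illustrative and not part of the paper's released implementation.

```python
import numpy as np

def renormalized_adjacency(A):
    """Compute (D + I)^{-1/2} (A + I) (D + I)^{-1/2} for a dense adjacency matrix A."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, weights, activation=lambda z: np.maximum(z, 0.0)):
    """Apply H^{l+1} = sigma(A_tilde H^l W^l) layer by layer; the last layer is kept linear."""
    A_tilde = renormalized_adjacency(A)
    H = X
    for l, W in enumerate(weights):
        Z = A_tilde @ H @ W
        H = Z if l == len(weights) - 1 else activation(Z)
    return H

# Toy usage: 4 nodes on a path graph, 3 input features, 2 output classes, random weights.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
weights = [rng.normal(size=(3, 8)), rng.normal(size=(8, 2))]
logits = gcn_forward(A, X, weights)   # shape (4, 2)
```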

Our basic assumption is that A is observed w.r.t. an underlying generative model as well as some random observation noise. In order to make learning robust to this noise and generalize well, we minimize the expected loss

$$\min_W \int q(\phi \,|\, \varphi)\, \ell\left(X, Y, A(\phi), W\right) d\phi, \qquad (1)$$

where A(φ) is a parameterization of graph adjacency matrices, A(0) = A is the original adjacency matrix, q(φ | ϕ) is a zero-centered random perturbation so that A(φ) is in a "neighborhood" of A, and ϕ is the freedom of this perturbation.

To implement this machinery, we must answer a set of fundamental questions: ① How to define the manifold M of graphs, i.e. the space of A? ② How to properly define the neighborhood {A(φ) : φ ∼ q(φ | ϕ)}? ③ What are the guiding principles to learn the neighborhood parameters ϕ? We will build a geometric solution to these problems, provide an efficient implementation of eq. (1), and test the empirical improvement in generalization. Our contributions are both theoretical and practical, which are summarized as follows:

• We bridge quantum information theory with graph neural networks, and provide Riemannian metrics in closed form on the manifold of graphs;

• We build a modified GCN [21] called the FisherGCN that can consistently improve generalization;

• We introduce algorithm 1 to pre-process the graph adjacency matrix for GCN so as to incorporate high-order proximities.

The rest of this paper is organized as follows. We first review related works in section 2. Section 3 introduces basic quantum information geometry, based on which the following section 4 formulates the manifold of graphs M and graph neighborhoods. Sections 5 and 6 present the technical details and experimental results of our proposed FisherGCN. Sections 7 and 8 provide our theoretical analysis on different ways to define the geometry of M. Section 9 concludes and discusses future extensions.

2 RELATED WORKS

Below, we relate our work to deep learning on graphs (with a focus on sampling strategies), adversarial learning, and quantum information geometry.

Graph Neural Networks

The graph convolutional network [21] is a state-of-the-art graph neural network [15, 36] which performs convolution on graphs in the spectral domain. While the performance of GCNs is very attractive, spectral convolution is a costly operation. Thus, the most recent implementations, e.g. GraphSAGE [17], take convolution from the spectral to the spatial domain defined by the local neighborhood of each node. Average pooling over the nearest neighbors of each node is performed to capture the contents of the neighborhood. Below we describe related works which, one way or another, focus on various sampling strategies to improve aggregation and performance.

Structural Similarity and Sampling Strategies

Graph embeddings [33, 16] capture structural similarities in the graph. DeepWalk [33] takes advantage of simulated localized walks in the node proximity which are then forwarded to the language modeling neural network to form the node context. Node2Vec [16] interpolates between breadth- and depth-first sampling strategies to aggregate different types of neighborhood.

MoNet [26] generalizes the notion of coordinate spaces by learning a set of parameters of Gaussian functions to encode some distance for the node embedding, e.g. the difference between degrees of a pair of nodes. Graph attention networks [42] learn such weights via a self-attention mechanism. Jumping Knowledge Networks (JK-Nets) [46] also target the notion of node locality. Experiments on JK-Nets show that depending on the graph topology, the notion of the subgraph neighborhood varies, e.g. random walks progress at different rates in different graphs. Thus, JK-Nets aggregate over various neighborhoods and consider multiple node localities. By contrast, we apply mild adversarial perturbations of the graph Laplacian based on quantum Fisher information so as to improve generalization. Thus, we infer a "correction" of the Laplacian matrix while JK-Net aggregates multiple node localities.

The sampling strategy also has an impact on the total size of receptive fields. In the vanilla GCN [21], the receptive field of a single node grows exponentially w.r.t. the number of layers, which is computationally costly and results in so-called over-smoothing of signals [22]. Thus, stochastic GCN [10] controls the variance of the activation estimator by keeping the history/summary of activations in the previous layer to be reused.

Both our work and Deep Graph Infomax (DGI) [43] take an information theoretic approach. DGI maximizes the mutual information between representations of local subgraphs (a.k.a. patches) and high-level summaries of graphs, while minimizing the mutual information between negative samples and the summaries. This "contrasting" strategy is somewhat related to our approach, as we generate adversarial perturbations of the graph to flatten the most abrupt curvature directions. DGI relies on the notion of positive and negative samples. In contrast, we learn maximally perturbed parameters of our extrinsic graph representation, which are analogous to negative samples.

Lastly, noteworthy are application-driven pipelines, e.g. for molecule classification [13].

Adversarial Learning

The role of adversarial learning is to generate difficult-to-classify data samples by identifying them along the decision boundary and "pushing" them over this boundary. In a recent DeepFool approach [27], a cumulative sparse adversarial pattern is learned to maximally confuse predictions on the training dataset. Such an adversarial pattern generalizes well to confuse prediction on test data. Adversarial learning is directly connected to sampling strategies, e.g. sampling hard negatives (obtaining the most difficult samples), and it has long been investigated [14, 37] in the community, especially in the shallow setting [5, 19, 45].

Adversarial attacks under the Fisher information metric (FIM) [50] carry out perturbations in the spectral domain. Given a quadratic form of the FIM, the optimal adversarial perturbation is given by the first eigenvector corresponding to the largest eigenvalue. The larger the eigenvalues of the FIM are, the larger is the susceptibility of the classification approach to attacks on the corresponding eigenvectors.

Our work is related in that we also construct a quantum version of the FIM w.r.t. a parameterization of the graph Laplacian. We perform a maximization w.r.t. these parameters to condition the FIM around the local optimum, thus making our approach well regularized in the sense of flattening the most curved directions associated with the FIM. With a smoothness constraint, the classification performance typically degrades, e.g. see the impact of smoothness on kernel representations [24]. Indeed, study [41] further shows there is a fundamental trade-off between high accuracy and adversarial robustness.

However, our min-max formulation seeks the most effective perturbations (according to [50]), which thus simultaneously prevents unnecessary degradation of the decision boundary. With robust regularization for medium-size datasets, we avoid overfitting, which boosts our classification performance, as demonstrated in the following section 6.

Quantum Information Geometry

Natural gradient [2, 3, 32, 49, 40] is a second-order optimization procedure which takes the steepest descent w.r.t. the Riemannian geometry defined by the FIM, taking small steps in the directions with a large scale of FIM. This also suggests that the largest eigenvectors of the FIM are the most susceptible to attacks.

The Bethe Hessian [35], or deformed Laplacian, was shown to improve the performance of spectral clustering on a par with non-symmetric and higher dimensional operators, while keeping the advantages of a symmetric positive-definite representation. Our graph Laplacian parameterization also draws on this view.

Tools from quantum information geometry have been applied to machine learning [4, 29] but not yet ported to the domain of graph neural networks. In information geometry, one can have different matrix divergences [30] that can be applied on the cone of p.s.d. matrices. We point the reader to related definitions of the discrete Fisher information [11] without illuminating the details.

3 PREREQUISITES

Fisher Information Metric

The discipline of information geometry [2] studies the space of probability distributions based on the Riemannian geometry framework. As the most fundamental concept, the Fisher information matrix is defined w.r.t. a given statistical model, i.e. a parametric form of the conditional probability distribution p(X | Θ), by

$$G(\Theta) = \int p(X \,|\, \Theta)\, \frac{\partial \log p(X \,|\, \Theta)}{\partial \Theta}\, \frac{\partial \log p(X \,|\, \Theta)}{\partial \Theta^\top}\, dX. \qquad (2)$$

By definition, we must have G(Θ) ⪰ 0. Following H. Hotelling and C. R. Rao, this G(Θ) is used (see section 3.5 [2] for the history) to define the Riemannian metric of a statistical model M = {Θ : p(X | Θ)}, which is known as the Fisher information metric ds² = dΘ^⊤ G(Θ) dΘ. Intuitively, the scale of ds² corresponds to the intrinsic change of the model w.r.t. the movement dΘ. The FIM is invariant to reparameterization and is the unique Riemannian metric in the space of probability distributions under certain conditions [9, 2].
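As a quick sanity check of eq. (2), the following sketch estimates the Fisher information of a univariate Gaussian with unknown mean (and unit variance) by Monte Carlo; the analytic value is 1. The example model is our own and not from the paper.

```python
import numpy as np

# Monte Carlo estimate of the Fisher information of N(theta, 1) w.r.t. its mean theta.
# The score is d/dtheta log p(x | theta) = (x - theta), so G(theta) = E[(x - theta)^2] = 1.
rng = np.random.default_rng(0)
theta = 0.3
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)
score = x - theta
print((score ** 2).mean())   # approximately 1.0, matching the analytic Fisher information
```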


Bures Metric

In quantum mechanics, a quantum state can be represented by a graph (see e.g. [6]). Denote a parametric graph Laplacian matrix as L(Θ); then the trace-normalized Laplacian ρ(Θ) = L(Θ)/tr(L(Θ)) is known as the density matrix, where tr(·) means the trace. One can therefore generalize the FIM to define a geometry of the Θ space. In analogy to eq. (2), the quantum version of the Fisher information matrix is

$$G_{ij}(\Theta) = \frac{1}{2} \operatorname{tr}\left[\rho(\Theta)\left(\partial L_i \partial L_j + \partial L_j \partial L_i\right)\right], \qquad (3)$$

where ∂L_i is the symmetric logarithmic derivative that generalizes the notion of the derivative of the logarithm:

$$\frac{\partial \rho}{\partial \theta_i} = \frac{1}{2}\left(\rho \cdot \partial L_i + \partial L_i \cdot \rho\right).$$

Let ρ(Θ) be diagonal; then ∂L_i = ∂ log ρ/∂θ_i, and plugging this into eq. (3) recovers the traditional Fisher information defined in eq. (2). The quantum Fisher information metric ds² = dΘ^⊤ G(Θ) dΘ, up to constant scaling, is known as the Bures metric [8]. We use BM to denote these equivalent metrics and abuse G to denote both the BM and the FIM. We develop upon the BM without considering its meanings in quantum mechanics. This is because ① it can fall back to classical Fisher information; ② its formulations are well-developed and can be useful to develop deep learning on graphs.

4 AN INTRINSIC GEOMETRY

In this section, we define an intrinsic geometry of graphs based on the BM, so that one can measure distances on the manifold of all graphs with a given number of nodes and have the notion of a neighborhood.

We parameterize a graph by its density matrix

$$\rho = U \operatorname{diag}(\lambda)\, U^\top = \sum_{i=1}^n \lambda_i u_i u_i^\top \succeq 0, \qquad (4)$$

where U U^⊤ = U^⊤ U = I so that U is on the unitary group, i.e. the manifold of unitary matrices, u_i is the i'th column of U, and λ satisfies λ ≥ 0, λ^⊤ 1 = 1 and is on the closed probability simplex. Notice that the τ ≥ 1 smallest eigenvalue(s) of the graph Laplacian (and of ρ, which shares the same spectrum up to scaling) are zero, where τ is the number of connected components of the graph.

Fortunately for us, the BM w.r.t. this canonical parameterization was already derived in closed form (see eq. (10) of [18]), given by

$$ds^2 = \frac{1}{2} \sum_{j=1}^n \sum_{k=1}^n \frac{\left(u_j^\top d\rho\, u_k\right)^2}{\lambda_j + \lambda_k}. \qquad (5)$$

For simplicity, we are mostly interested in the diagonal blocks of the FIM. Plugging

$$d\rho = \sum_{i=1}^n \left[ d\lambda_i\, u_i u_i^\top + \lambda_i\, du_i u_i^\top + \lambda_i\, u_i du_i^\top \right]$$

into eq. (5), we get the following theorem.

Theorem 1. In the canonical parameterization ρ = U diag(λ) U^⊤, the BM is

$$ds^2 = d\lambda^\top G(\lambda)\, d\lambda + \sum_{i=1}^n du_i^\top G(u_i)\, du_i
= \sum_{i=1}^n \left[ \frac{1}{4\lambda_i}\, d\lambda_i^2 + d\lambda_i\, c_i^\top du_i
+ \frac{1}{2}\, du_i^\top \sum_{j=1}^n \left( \frac{(\lambda_i - \lambda_j)^2}{\lambda_i + \lambda_j}\, u_j u_j^\top \right) du_i \right],$$

where c_i are some coefficients that can be ignored in this paper.

One can easily verify that the first term in theorem 1 coincides with the simplex geometry induced by the FIM. Note that the BM is invariant to reparameterization, and we can write it in the following equivalent form.

Corollary 2. Under the reparameterization λ_i = exp(θ_i) and ρ(θ, U) = U diag(exp(θ)) U^⊤, the BM is

$$ds^2 = \sum_{i=1}^n \left[ \frac{\exp(\theta_i)}{4}\, d\theta_i^2 + \exp(\theta_i)\, d\theta_i\, c_i^\top du_i
+ \frac{1}{2}\, du_i^\top \sum_{j=1}^n \left[ \frac{\left(\exp(\theta_i) - \exp(\theta_j)\right)^2}{\exp(\theta_i) + \exp(\theta_j)}\, u_j u_j^\top \right] du_i \right].$$

This parameterization is favored in our implementation because after a small movement in the θ-coordinates, the density matrix is still p.s.d.

The BM allows us to study quantitatively the intrinsic change of the graph measured by ds². For example, a constant scaling of the edge weights results in ds² = 0 because the density matrix does not vary. The BM of the eigenvalue λ_i is proportional to 1/λ_i, therefore as the network scales up and n → ∞, the BM of the spectrum will scale up. By the Cauchy-Schwarz inequality, we have

$$\operatorname{tr}\left(G(\lambda)\right) = \frac{1}{4}\left(\mathbf{1}^\top \lambda^{-1}\right)\left(\mathbf{1}^\top \lambda\right) \ge \frac{n^2}{4}. \qquad (6)$$

It is, however, not straightforward to see the scale of G(u_i), that is, the BM w.r.t. the eigenvector u_i. We therefore have the following result.

Corollary 3.

$$\operatorname{tr}\left(G(u_i)\right) = \frac{1}{2} \sum_{j=1}^n \frac{(\lambda_i - \lambda_j)^2}{\lambda_i + \lambda_j} \le \frac{1}{2}; \qquad
\operatorname{tr}\left(G(U)\right) = \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \frac{(\lambda_i - \lambda_j)^2}{\lambda_i + \lambda_j} \le \frac{n}{2}.$$


Remark 3.1. The scale (measured by trace) of G(U) is O(n) and U has O(n²) parameters. The scale of G(λ) is O(n²) and λ has (n − 1) parameters.

Therefore, informally, the λ parameters carry more information than U. Moreover, it is computationally more expensive to parameterize U. We will therefore make our perturbations on the spectrum λ.

We need to make a low-rank approximation of ρ so as to reduce the degrees of freedom and make our perturbation cheap to compute. Based on the Frobenius norm, the best low-rank approximation of any given matrix can be expressed by its largest singular values and their corresponding singular vectors. Similar results hold for approximating a density matrix based on the BM. While the BM is defined on an infinitesimal neighborhood, its corresponding non-local distance is known as the Bures distance D_B(ρ_1, ρ_2), given by

$$D_B^2(\rho_1, \rho_2) = 2\left(1 - \operatorname{tr}\left(\sqrt{\rho_1^{\frac{1}{2}} \rho_2\, \rho_1^{\frac{1}{2}}}\right)\right).$$

For diagonal matrices, the Bures distance reduces to the Hellinger distance up to constant scaling.
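The Bures distance above can be evaluated directly with a matrix square root; the sketch below also checks the stated reduction to the Hellinger distance (up to the factor √2) for diagonal density matrices. Function names are our own.

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_distance(rho1, rho2):
    """D_B(rho1, rho2) with D_B^2 = 2 (1 - tr sqrt(sqrt(rho1) rho2 sqrt(rho1)))."""
    s1 = sqrtm(rho1)
    fidelity = np.trace(sqrtm(s1 @ rho2 @ s1)).real
    return np.sqrt(max(2.0 * (1.0 - fidelity), 0.0))

# For diagonal (commuting) density matrices, D_B equals sqrt(2) times the Hellinger distance.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
hellinger = np.sqrt(1.0 - np.sum(np.sqrt(p * q)))
print(np.isclose(bures_distance(np.diag(p), np.diag(q)), np.sqrt(2.0) * hellinger))   # True
```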

We have the following low-rank projection of a given density matrix.

Theorem 4. Given ρ_0 = U diag(λ) U^⊤, where λ_1, · · · , λ_n are monotonically non-increasing, its rank-k projection is

$$\rho_0^k = \arg\min_{\rho:\, \operatorname{rank}(\rho) = k} D_B(\rho, \rho_0) = \frac{\sum_{i=1}^k \lambda_i u_i u_i^\top}{\sum_{i=1}^k \lambda_i}.$$

Our proof in the supplementary material¹ is based on Theorem 3 of [25]. We may simply denote ρ_0^k as ρ_0 with the spectral decomposition ρ_0 = U_0 diag(λ_0) U_0^⊤.
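Theorem 4 translates into a few lines of linear algebra: keep the k leading eigenpairs and renormalize the truncated spectrum to unit trace. The sketch below assumes a dense eigendecomposition for clarity.

```python
import numpy as np

def rank_k_projection(rho, k):
    """Best rank-k approximation of a density matrix under the Bures distance (theorem 4):
    keep the k leading eigenpairs and renormalize the truncated spectrum to unit trace."""
    lam, U = np.linalg.eigh(rho)             # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]           # re-order to non-increasing eigenvalues
    lam_k, U_k = lam[:k], U[:, :k]
    return (U_k * (lam_k / lam_k.sum())) @ U_k.T

# Toy usage on a random density matrix.
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
rho = M @ M.T
rho /= np.trace(rho)
rho_k = rank_k_projection(rho, k=2)
print(np.linalg.matrix_rank(rho_k), np.isclose(np.trace(rho_k), 1.0))   # 2 True
```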

Hence, we can define a neighborhood of A by varying the spectrum of ρ^k(A). Formally, the graph Laplacian of the perturbed A w.r.t. the perturbation φ is

$$L(A(\phi)) = \operatorname{tr}\left(L(A)\right) \rho(A(\phi)) \qquad (7)$$

so that its trace is not affected by the perturbation, and the perturbed density matrix is

$$\rho(A(\phi)) = \rho(A) + U \operatorname{diag}\left( \frac{\exp(\theta + \phi)}{\mathbf{1}^\top \exp(\theta + \phi)} - \lambda \right) U^\top, \qquad (8)$$

where the second low-rank term on the rhs is a perturbation of ρ^k(A) whose trace is 0, so that ρ(A(φ)) is still a density matrix. The random variable φ follows

$$q_{\mathrm{iso}}(\phi) = \mathcal{U}\left(\phi \,|\, 0,\, G^{-1}(\theta)\right), \qquad (9)$$

which can be either a Gaussian distribution or a uniform distribution², with zero mean and precision matrix G(θ) up to constant scaling. Intuitively, it has smaller variance on the directions with a large G, so that q(φ) is intrinsically isotropic w.r.t. the BM.

¹The supplementary material is in the appendix of https://arxiv.org/abs/1903.04154. Our codes to reproduce all reported experimental results are available at https://github.com/stellargraph/FisherGCN.

In summary, our neighborhood of a graph with adjacency matrix A has the k most informative dimensions selected by the BM, and is defined by eqs. (7) to (9). To compute this neighborhood, one needs to pre-compute the k largest eigenvectors of ρ(A), which can be performed efficiently [28] for small k. LanczosNet [23] also utilizes an eigendecomposition sub-module for a low-rank approximation of the graph Laplacian. Their focus is on building spectral filters rather than geometric perturbations. An empirical range of k is 10 ∼ 50.
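A sketch of this pre-computation step, assuming ρ(A) is the trace-normalized combinatorial Laplacian of the (possibly pre-processed) adjacency matrix; the actual implementation may normalize differently, and the eigensolver choice is illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def top_k_spectrum(A, k=10):
    """Return the k leading eigenvalues/eigenvectors of rho(A) = L / tr(L) for an adjacency A."""
    A = sp.csr_matrix(A)
    deg = np.asarray(A.sum(axis=1)).ravel()
    L = sp.diags(deg) - A                    # sparse combinatorial Laplacian
    rho = L / deg.sum()                      # tr(L) equals the sum of degrees
    lam, U = eigsh(rho, k=k, which='LM')     # k largest eigenpairs, no dense factorization
    order = np.argsort(lam)[::-1]
    return lam[order], U[:, order]

# Toy usage on a random sparse undirected graph with 200 nodes.
rng = np.random.default_rng(0)
A_dense = np.triu((rng.random((200, 200)) < 0.05).astype(float), 1)
A_dense = A_dense + A_dense.T
lam, U = top_k_spectrum(A_dense, k=10)
```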

One may alternatively parameterize a neighborhood by corrupting the graph links. However, it is hard to control the scale of the perturbation based on information theory and to have a compact parameterization.

5 FISHER-BURES ADVERSARY GCN

Based on the previous section 4, we know how to define the graph neighborhood. Now we are ready to implement our perturbed GCN, which we call the "FisherGCN".

We parameterize the perturbation as

$$\phi(\varphi, \varepsilon) = G^{-1/2}(\theta) \operatorname{diag}(\varphi)\, \varepsilon = \exp\left(-\frac{\theta}{2}\right) \circ \varphi \circ \varepsilon, \qquad (10)$$

where "∘" means element-wise product, ε follows the uniform distribution over [−1/2, 1/2]^k or the multivariate Gaussian distribution, and corollary 2 is used here to get G(θ). The vector 0 < ϕ ≤ ϵ contains shape parameters (one can implement the constraint through the reparameterization ϕ = ϵ/(1 + exp(−ξ))), where ϵ is a hyper-parameter specifying the radius of the perturbation. If ϕ = ϵ1, then φ follows q_iso(φ) in eq. (9). Then, one can compute the randomly perturbed density matrix ρ(A(φ)) and the corresponding Laplacian matrix L(A(φ)) based on eqs. (7) and (8).
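A small sketch of eq. (10): sample the noise ε, squash a free parameter ξ into the shape vector ϕ = ϵ·sigmoid(ξ), and scale by exp(−θ/2). The variable names are illustrative; in the real model ξ would be trained by the maximization described next.

```python
import numpy as np

def sample_perturbation(theta, xi, epsilon, rng):
    """Sample phi = exp(-theta / 2) * shape * noise (eq. 10), with the shape vector
    parameterized as epsilon * sigmoid(xi) so that it stays in (0, epsilon]."""
    shape = epsilon / (1.0 + np.exp(-xi))                  # reparameterized shape parameters
    noise = rng.uniform(-0.5, 0.5, size=theta.shape)       # or rng.standard_normal(theta.shape)
    return np.exp(-theta / 2.0) * shape * noise

# Toy usage with k = 10 spectral coordinates; xi would be a trainable parameter in practice.
rng = np.random.default_rng(0)
k = 10
theta = np.log(np.full(k, 1.0 / k))    # e.g. log-eigenvalues of a flat low-rank spectrum
xi = np.zeros(k)
phi = sample_perturbation(theta, xi, epsilon=0.1, rng=rng)
```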

Our learning objective is to make predictions that are robust to such graph perturbations by solving the following minimax problem

$$\min_W \max_{\varphi} -\frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \log p\left(Y_i \,|\, X_i, A(\phi(\varphi, \varepsilon_j)), W\right), \qquad (11)$$

where M (e.g. M = 5) is the number of perturbations. Similar to the training procedure of a GAN [14], one can solve the optimization problem by alternatingly updating ϕ along ∇ϕ, the gradient w.r.t. ϕ, and updating W along −∇W.

²Strictly speaking, U(φ | µ, Σ) should be the pushforward distribution w.r.t. the Riemannian exponential map, which maps the distribution on the tangent space to the parameter manifold.
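The alternating update scheme can be illustrated on a toy scalar surrogate loss; this only shows the structure of eq. (11), not the GCN objective, and in practice the gradients come from automatic differentiation.

```python
import numpy as np

# Toy surrogate loss(W, phi) = (W - 1)^2 - 0.5 * (phi - 0.3)^2: descend on W, ascend on phi.
def grad_W(W, phi):
    return 2.0 * (W - 1.0)

def grad_phi(W, phi):
    return -(phi - 0.3)

W, phi, lr = 0.0, 0.0, 0.1
for step in range(200):
    phi += lr * grad_phi(W, phi)   # maximization step on the perturbation shape
    W -= lr * grad_W(W, phi)       # minimization step on the network weights
print(round(W, 3), round(phi, 3))  # W -> 1.0, phi -> 0.3
```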

For brevity, we highlight the key equations and steps (instead of a full workflow) of FisherGCN as follows:

① Normalize A (use the renormalization trick [21] or our algorithm 1 that will be introduced in section 6);

② Compute ρ^k(A) (theorem 4) by sparse matrix factorization [28] (only the top k eigenvectors of ρ(A) are needed, and this only needs to be done once);

③ Perform regular GCN optimization:

(a) Use eq. (10) to get the perturbation φ;
(b) Use eqs. (7) and (8) to get the perturbed density matrix ρ(A(φ)) and the graph Laplacian matrix L(A(φ));
(c) Plug A(φ) = I − tr(L(A)) ρ(A(φ)) into eq. (1).

Notice that the A matrix (and the graph Laplacian) is normalized in step ① before computing the density matrix, so that the multiple multiplications with A in different layers do not cause numerical instability. This can be varied depending on the implementation.

Our loss only imposes k (e.g. k = 10 ∼ 50) additional free parameters (the rank of the projected ρ^k(A)), while W contains the majority of the free parameters. As compared to GCN, we need to solve for the k leading eigenvectors of ρ(A) before training, and multiply the computational cost of training by a factor of M. Notice that ρ(A) is sparse and the eigendecomposition of a sparse matrix only needs to be performed once. Instead of computing the perturbed density matrix explicitly, which is not sparse anymore, one only needs to compute the correction term

$$\left[ U_{n\times k} \operatorname{diag}\left( \frac{\exp(\theta + \phi)}{\mathbf{1}^\top \exp(\theta + \phi)} - \lambda \right) U_{k\times n}^\top \right] X_{n\times D},$$

which can be computed efficiently in O(knD) time. If k is small, this computational cost can be ignored (with no increase in the overall complexity) as computing AX has O(md) complexity (m is the number of links). In summary, our FisherGCN is several times slower than GCN with roughly the same number of free parameters and complexity.
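A sketch of this low-rank correction applied to the features X without ever forming the dense perturbed density matrix; here the softmax stands for exp(θ+φ)/1^⊤exp(θ+φ), and the shapes and names are hypothetical.

```python
import numpy as np

def correction_term(U, lam, theta, phi, X):
    """Compute [U diag(softmax(theta + phi) - lam) U^T] X in O(k n D) time,
    where U is n-by-k (leading eigenvectors) and lam holds the matching eigenvalues."""
    z = theta + phi
    w = np.exp(z - z.max())
    w = w / w.sum() - lam                  # perturbed spectrum minus the original spectrum
    return (U * w) @ (U.T @ X)             # never forms the dense n-by-n matrix

# Hypothetical shapes: n = 1000 nodes, k = 10 spectral directions, D = 32 features.
rng = np.random.default_rng(0)
n, k, D = 1000, 10, 32
U = np.linalg.qr(rng.normal(size=(n, k)))[0]     # orthonormal columns
lam = np.full(k, 1.0 / n)
theta = np.log(lam)
phi = 0.01 * rng.normal(size=k)
delta = correction_term(U, lam, theta, phi, rng.normal(size=(n, D)))   # shape (n, D)
```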

FisherGCN can be intuitively understood as running multiple GCNs in parallel, each based on a randomly perturbed graph. Implementing the method does not require understanding our geometric theory; one only needs to follow the list of steps ①②③ shown above.

6 EXPERIMENTS

In this section, we perform an experimental study on semi-supervised transductive node classification tasks. We use three benchmark datasets, namely, the Cora, CiteSeer and PubMed citation networks [47, 21]. The statistics of these datasets are displayed in table 1. As suggested recently [38], we use random splits of the training:validation:testing datasets based on the same ratio as the Planetoid split [47], as given in the "Train:Valid:Test" column in table 1.

We will mainly compare against GCN, which can represent the state of the art on these datasets, because our method serves as an "add-on" to GCN. We will discuss how to adapt this add-on to other graph neural networks in section 9 and refer the reader to [38] for how the performance of GCN compares against the other methods. Nevertheless, we introduce a stronger baseline called GCN^T. It was known that random walk similarities can help improve learning of graph neural networks [48]. We found that pre-processing the graph adjacency matrix A (with detailed steps listed in algorithm 1) can improve the performance of GCN on semi-supervised node classification tasks.³ This processing is based on DeepWalk similarities [33] that are explicitly formulated in Table 1 of [34]. Algorithm 1 involves two hyperparameters: the order T ≥ 1 determines the order of the proximities (the larger, the denser the resulting A; T = 1 falls back to the regular GCN); the threshold ν > 0 helps remove links with small probabilities to enhance sparsity. In the experiments we fix T = 5 and ν = 10⁻⁴. These procedures correspond to a polynomial filter with hand-crafted coefficients. One can look at table 1 and compare the sparsity of the adjacency matrix processed by algorithm 1 (in the "Sparsity_T" column) vs. the original sparsity (in the "Sparsity" column) to get a rough idea of the computational overhead of GCN^T vs. GCN.

Our proposed methods are denoted as FisherGCN and FisherGCN^T, which are respectively based on GCN and GCN^T. We fix the perturbation radius parameter ϵ = 0.1 and the rank parameter k = 10.

The testing accuracy and loss are reported in table 2.

³During the review period of this paper, related works appeared [44, 1] which build a high-order GCN. Comparatively, our GCN^T is closely based on DeepWalk similarities [33] instead of power transformations of the adjacency matrix.


Table 1: Dataset statistics. Note that the number of links reported in previous works [21] counts duplicate links and some self-links, which is corrected here. "#Comps" means the number of connected components. "Sparsity" shows the sparsity of the matrix A. "Sparsity_T" shows its sparsity in GCN^T (with the mentioned settings of T and ν).

Dataset    #Nodes   #Links   #Comps   #Features   #Classes   Train:Valid:Test   Sparsity   Sparsity_T
Cora        2,708    5,278       78       1,433          7      140:500:1000      0.18%        9.96%
CiteSeer    3,327    4,552      438       3,703          6      120:500:1000      0.11%        3.01%
PubMed     19,717   44,324        1         500          3       60:500:1000      0.03%        3.31%

Table 2: Testing loss and accuracy in percentage. The hyperparameters (learning rate 0.01; 64 hidden units; dropout rate 0.5; weight decay 5 × 10⁻⁴) are selected based on the best overall testing accuracy of GCN on Cora and CiteSeer. Then we use these hyperparameters across all the four methods and three datasets. The reported mean ± std scores are based on 200 runs (20 random splits; 10 different initializations per split). The splits used for hyperparameter selection and testing are different.

                         Testing Accuracy                            Testing Loss
              Cora         CiteSeer     PubMed          Cora         CiteSeer     PubMed
GCN           80.52±2.3    69.59±2.0    78.17±2.4       1.07±0.04    1.36±0.03    0.75±0.04
FisherGCN     80.70±2.2    69.80±2.0    78.43±2.4       1.06±0.04    1.35±0.03    0.74±0.04
GCN^T         81.20±2.3    70.31±1.9    78.99±2.6       1.04±0.04    1.33±0.03    0.70±0.05
FisherGCN^T   81.46±2.2    70.48±1.7    79.34±2.7       1.03±0.03    1.32±0.03    0.69±0.04

Algorithm 1: Pre-process A to capture high-order proximities (T ≥ 2 is the order; ν > 0 is a threshold)

A ← diag⁻¹(A1) A;
S ← A;  B ← A;
for t ← 2 to T do
    B ← B A;
    S ← S + B;
A ← (1/T) S ∘ (1_{n×n} − I);
A ← A ∘ (A > ν);
A ← A + A^⊤ + 2I;
A ← diag^{−1/2}(A1) A diag^{−1/2}(A1);
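A dense NumPy transcription of algorithm 1 (the released code presumably uses sparse matrices; this version assumes no isolated nodes and is only meant to make the steps explicit):

```python
import numpy as np

def preprocess_high_order(A, T=5, nu=1e-4):
    """Dense transcription of algorithm 1: average the first T random-walk transition powers,
    threshold small entries, symmetrize, add self-loops, then renormalize symmetrically.
    Assumes every node has at least one link (no division by zero)."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)        # row-normalized transition matrix
    S, B = P.copy(), P.copy()
    for _ in range(2, T + 1):
        B = B @ P
        S = S + B
    A_new = (S / T) * (1.0 - np.eye(n))         # average the powers, drop the diagonal
    A_new = A_new * (A_new > nu)                # sparsify by thresholding
    A_new = A_new + A_new.T + 2.0 * np.eye(n)   # symmetrize and add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_new.sum(axis=1))
    return A_new * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Example on a small random adjacency matrix.
rng = np.random.default_rng(0)
A = np.triu((rng.random((50, 50)) < 0.2).astype(float), 1)
A = A + A.T
A_high_order = preprocess_high_order(A, T=5, nu=1e-4)
```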

We adapt the GCN codes [21] so that the four methods are compared in exactly the same settings and only differ in the matrix A that is used for computing the graph convolution. One can observe that FisherGCN and GCN^T can both improve over GCN, which means that our perturbation and the pre-processing by algorithm 1 both help to improve generalization. The best results are given by FisherGCN^T with both techniques added. The large variation is due to different splits of the training:validation:testing datasets [38], and therefore these scores vary with the split. In repeated experiments, we observed a consistent improvement of the proposed methods as compared to the baselines.

Figure 1 shows the learning curves on the Cora dataset (see the supplementary material for the other cases). We can observe that the proposed perturbation presents higher training and testing scores during learning. The performance boost of FisherGCN is more significant if the number of epochs is limited to a small value.

Figure 1: Learning curves (averaged over 200 runs) in accuracy on the Cora dataset (training and validation accuracy of GCN, FisherGCN, GCN^T and FisherGCN^T over the first 50 epochs).

7 AN EXTRINSIC GEOMETRY

In this section and the following section 8, we present analytical results on the geometry of the manifold of graphs. These results are useful to interpret the proposed FisherGCN and to understand graph-based machine learning.

We first derive an extrinsic geometry of a parametric graph embedded in a neural network. Based on this geometry, the learner can capture the curved directions of the loss surface and make more effective perturbations than the isotropic perturbation in eq. (8). While the intrinsic geometry in section 4 measures how much the graph itself has changed due to a movement on M, the extrinsic geometry measures how varying the parameters of the graph will change the external model. Intuitively, if a dynamic ∆G causes little change based on the intrinsic geometry, one may also expect ∆G to have little effect on the external neural network. However, in general, these two geometries impose different Riemannian metrics on the same manifold M of graphs.

Consider the predictive model represented by the conditional distribution p(Y | X, A(φ), W). W.l.o.g. consider φ to be a scalar, which serves as a coordinate system of graphs. We use G^E to denote the extrinsic Riemannian metric (the superscript "E" is for extrinsic), to be distinguished from the intrinsic G. Based on the GCN computation introduced in section 1, we can get an explicit expression of G^E.

Theorem 5. Let ℓ = − log p(Y | X, A(φ), W), let ∆_l = ∂ℓ/∂H^l denote the back-propagated error of layer l's output H^l, and let Σ'_l denote the derivative of layer l's activation function. Then

$$G^E(\phi) = \frac{1}{N} \sum_{i=1}^N \left( \sum_{l=0}^{L-1} \left( H^l W^l \left(\Delta_{l+1} \circ \Sigma'_{l+1}\right)^\top \frac{\partial A}{\partial \phi} \right)_{ii} \right)^2
= \frac{1}{N} \sum_{i=1}^N \left( \sum_{l=0}^{L-1} \left( H^l \Delta_l^\top \frac{\partial A}{\partial \phi} \right)_{ii} \right)^2;$$

$$G^E(W^l) = \frac{1}{N} \sum_{i=1}^N \operatorname{vec}\left( \left(\Delta_{l+1} \circ \Sigma'_{l+1}\right)^\top A H^l \right) \operatorname{vec}^\top\left( \left(\Delta_{l+1} \circ \Sigma'_{l+1}\right)^\top A H^l \right),$$

where vec(·) means rearranging a matrix into a column vector.

The information geometry of neural networks is mostly used to develop second-order optimization [32, 3], where G^E(W) is used. Here we are mostly interested in G^E(φ), and our target is not better optimization but to find a neighborhood of a given graph with large intrinsic variations. A movement with a large scale of G^E can most effectively change the predictive model p(Y | X).

Let us develop some intuitions based on the term inside the trace on the rhs of G^E(φ). In order to change the predictive model, the most effective edge increment da_{ij} should be positively correlated with (h_i^{l⊤} ∆_{lj} + ∆_{li}^⊤ h_j^l), which means how the hidden feature of node i (node j) is correlated with the increment of the hidden feature of node j (node i). This makes intuitive sense.

The meaning of theorem 5 is mainly theoretical, giving an explicit expression of G^E for the GCN model, which, to the best of the authors' knowledge, was not derived before (most literature studies the FIM of a feed-forward model such as a multi-layer perceptron). This could be useful for future work on natural gradient optimizers specifically tailored for GCN.

On the practical side, theorem 5 also helps to understand the proposed minimax optimization. On the manifold M of graphs, we make the rough assumption that A(0) = A is a local minimum of ℓ along the φ coordinate system, that is, adding a small noise to A will always cause an increment in the loss. The random perturbation in eq. (10) corresponds to the distribution

$$q(\phi \,|\, \varphi) = \mathcal{U}\left(\phi \,|\, 0,\; G^{-1/2}(\theta) \operatorname{diag}(\varphi \circ \varphi)\, G^{-1/2}(\theta)\right),$$

and our loss function in eq. (11) is obtained by applying the reparameterization trick [20] to solve the expectation in eq. (1). If ϕ = ϵ1, then q(φ | ϕ) = U(φ | 0, ϵ² G⁻¹(θ)) falls back to the isotropic q_iso(φ). Letting ϕ be free allows the neighborhood to deform (see fig. 2, left). Then, through the maximization in eq. (11) w.r.t. ϕ, the density q(φ) will focus on the neighborhood of the original graph A where the loss surface is most upcurved (see fig. 2, right). These directions have large G^E and make the perturbation effective in terms of the FIM of the graph neural network. Consider the reverse case, when the density q(φ) corresponds to small values of G^E. Such perturbations are along the flat directions of the loss surface and will have little effect on learning the predictive model.

Figure 2: Learning a neighborhood (yellow region) of a graph where the loss surface is most curved, corresponding to large G^E. (The figure depicts a graph A with two candidate neighborhoods N_{s1}(A) and N_{s2}(A).)

8 AN EMBEDDING GEOMETRY

We present a geometry of graphs which is constructed in the spatial domain and is closely related to graph embeddings [33]. Consider representing a graph by a node similarity matrix W_{n×n} (e.g. based on algorithm 1), which is row-normalized and has zero diagonal entries. These similarities are assumed to be based on a latent graph embedding Y_{n×d}:

$$p_{ij}(Y) = \frac{1}{Z_i} \exp\left( -\lVert y_i - y_j \rVert^2 \right),$$

where P_{n×n} is the generative model with the same constraints as the W matrix, and Z_i is the partition function. Then, the observed FIM (which leads to the FIM as the number of observations increases) is given by the Hessian matrix of KL(W : P(Y)) evaluated at the maximum likelihood estimation Y* = arg min_Y KL(W : P(Y)), where KL denotes the Kullback-Leibler divergence. We have the following result.

Theorem 6. W.r.t. the generative model p_ij(Y), the diagonal blocks of the observed FIM G of a graph represented by the similarity matrix W are

$$G(y^k) = 4 L\left(W - P(Y)\right) + 8 L\left(P(Y) \circ D^k\right) - 4 \left(B^k\right)^\top B^k,$$

where y^k is the k'th column of Y, L(W − P(Y)) is the Laplacian matrix computed based on the indefinite weights (W − P(Y)) after symmetrization, D^k_{ij} = (y_{ik} − y_{jk})², and B^k = L(p_{ij} (y_{ik} − y_{jk})).

The theorem gives the observed FIM, while the expected FIM (the 2nd and 3rd terms in theorem 6) can be alternatively derived based on [39]. To understand this result, we can assume that P(Y) → W as the number of observations increases. Then

$$dy^{k\top} G(y^k)\, dy^k = 4 \sum_{i=1}^n \left[ \sum_{j=1}^n p_{ij} (y_{ik} - y_{jk})^2 (dy_{ik} - dy_{jk})^2 - \left( \sum_{j=1}^n p_{ij} (y_{ik} - y_{jk}) (dy_{ik} - dy_{jk}) \right)^2 \right]$$

is in the form of a variance of (y_{ik} − y_{jk})(dy_{ik} − dy_{jk}) = ½ d(y_{ik} − y_{jk})² w.r.t. p_{ij}. Therefore a large Riemannian metric dy^{k⊤} G(y^k) dy^k corresponds to a motion dy^k which causes a large variance of the neighbors' distance increments. For example, a rigid motion, or a uniform expansion/shrinking of the latent network embedding, will cause little or no effect on the variance of d(y_{ik} − y_{jk})², and hence corresponds to a small distance in this geometry.

This metric can be useful for developing theoretical perspectives on network embeddings, or for building spatial perturbations of graphs (instead of our proposed spectral perturbation). As compared to the intrinsic geometry in section 4, the embedding geometry is based on a generative model instead of the BM. As compared to the extrinsic geometry in section 7, the embedding geometry is not related to a neural network model.

9 CONCLUSION AND DISCUSSIONS

We imported new tools and adapted the notations from quantum information geometry to the area of geometric deep learning. We discussed three different geometries on the ambient space of graphs, with their Riemannian metrics provided in closed form. The results and adaptations are useful to develop new deep learning methods. We demonstrated their usage by perturbing graph structures in a GCN, showing consistent improvements in transductive node classification tasks.

It is possible to generalize FisherGCN to a scalable setting, where a mini-batch only contains a sub-graph [17] of m ≪ n nodes. This is because our perturbation has a low-rank factorization given by the second term in eq. (8). One can reuse this spectral factorization of the global matrix to build sub-graph perturbations.

If A contains free parameters [42], one can compute the low-rank projection ρ^k(A) using the original graph, which is parameter free, based on which the perturbation term can be constructed. Alternatively, one can periodically save the graph and recompute ρ^k(A) during learning.

Based on [21], we express a graph convolution operation on an input signal x = ∑_{i=1}^n α_i u_i ∈ ℝ^n as

$$\left[ I - \operatorname{tr}(L)\, \rho(A) \right] x = x - \operatorname{tr}(L)\, E_{i \sim \lambda}\left(\alpha_i u_i\right),$$

where E denotes the expectation. The von Neumann entropy of the quantum state ρ is defined by the Shannon entropy of λ, that is, −∑_{i=1}^n λ_i log λ_i. Consider a higher-order convolutional operator (in plain polynomial form), given by

$$\rho^\omega(A)\, x = \frac{1}{\lVert \lambda^\omega \rVert_1}\, U \operatorname{diag}(\lambda)^\omega U^\top x = E_{i \sim \lambda^\omega / \lVert \lambda^\omega \rVert_1}\left(\alpha_i u_i\right).$$

The von Neumann entropy is monotonically decreasing as ω ≥ 1 increases. As ω → ∞, we have ρ^ω(A)x → α_1 u_1 (if λ_1 is the largest eigenvalue of ρ(A) without multiplicity). Therefore, high-order convolutions enhance the signal w.r.t. the largest eigenvectors of ρ, and our perturbation is equivalent to adding high-order polynomial filters. It is interesting to explore alternative perturbations based on other distances, e.g. matrix Bregman divergences [31]. An empirical study comparing different types of perturbations in the GCN setting is left as future work.
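The effect of the plain polynomial operator can be checked numerically: as ω grows, the normalized spectrum λ^ω/‖λ^ω‖₁ concentrates and its Shannon (von Neumann) entropy drops. The sketch below uses a random density matrix as a stand-in for ρ(A); the names are illustrative.

```python
import numpy as np

def high_order_filter(rho, x, omega):
    """Apply the trace-normalized power of rho to a signal x and return the filtered
    signal together with the Shannon (von Neumann) entropy of the new spectrum."""
    lam, U = np.linalg.eigh(rho)
    lam_w = np.clip(lam, 0.0, None) ** omega
    p = lam_w / lam_w.sum()
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return (U * p) @ (U.T @ x), entropy

# The entropy drops as omega grows, concentrating the filter on the leading eigenvectors.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
rho = M @ M.T
rho /= np.trace(rho)
x = rng.normal(size=5)
for omega in (1, 2, 4, 8):
    _, H = high_order_filter(rho, x, omega)
    print(omega, round(H, 3))
```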

Acknowledgements

The authors gratefully thank the anonymous UAI reviewers and Frank Nielsen for their valuable and constructive comments. We thank the authors of [47, 21, 38] for making their processed datasets and codes public. This work is supported by StellarGraph@Data61.


References

[1] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. V. Steeg, and A. Galstyan. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In ICML, volume 97 of PMLR, pages 21–29, 2019.

[2] S. Amari. Information Geometry and Its Applications, volume 194 of Applied Mathematical Sciences. Springer-Verlag, Berlin, 2016.

[3] S. Amari, R. Karakida, and M. Oizumi. Fisher information and natural gradient learning in random deep networks. In AISTATS, volume 89 of PMLR, pages 694–702, 2019.

[4] R. Bhatia, T. Jain, and Y. Lim. On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 2018.

[5] B. Biggio, B. Nelson, and P. Laskov. Support vector machines under adversarial label noise. In ACML, volume 20 of PMLR, pages 97–112, 2011.

[6] S. L. Braunstein, S. Ghosh, and S. Severini. The Laplacian of a graph as a density matrix: a basic combinatorial approach to separability of mixed states. Annals of Combinatorics, 10(3):291–317, 2016.

[7] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.

[8] D. Bures. An extension of Kakutani's theorem on infinite product measures to the tensor product of semifinite w∗-algebras. Transactions of the AMS, 135:199–212, 1969.

[9] N. N. Cencov. Statistical Decision Rules and Optimal Inference, volume 53 of Translations of Mathematical Monographs. American Mathematical Society, 1982. (Published in Russian in 1972).

[10] J. Chen, J. Zhu, and L. Song. Stochastic training of graph convolutional networks with variance reduction. In ICML, volume 80 of PMLR, pages 942–950, 2018.

[11] S. Chow, W. Li, and H. Zhou. A discrete Schrödinger bridge problem via optimal transport on graphs. Journal of Functional Analysis, 276(8):2440–2469, 2019.

[12] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pages 3844–3852. Curran Associates, Inc., 2016.

[13] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS 28, pages 2224–2232. Curran Associates, Inc., 2015.

[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS 27, pages 2672–2680. Curran Associates, Inc., 2014.

[15] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IJCNN, volume 2, pages 729–734, 2005.

[16] A. Grover and J. Leskovec. Node2Vec: Scalable feature learning for networks. In KDD, pages 855–864, 2016.

[17] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1024–1034. Curran Associates, Inc., 2017.

[18] M. Hubner. Explicit computation of the Bures distance for density matrices. Physics Letters A, 163(4):239–242, 1992.

[19] M. Kantarcıoglu, B. Xi, and C. Clifton. Classifier evaluation and attribute selection against active adversaries. Data Mining and Knowledge Discovery, 22(1):291–335, 2011.

[20] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[21] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[22] Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.

[23] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel. LanczosNet: Multi-scale deep graph convolutional networks. In ICLR, 2019.

[24] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In NIPS, pages 2627–2635. Curran Associates, Inc., 2014.

[25] D. Markham, J. Adam Miszczak, Z. Puchała, and K. Zyczkowski. Quantum state discrimination: a geometric approach. Phys. Rev. A, 77:042111, 2008.

[26] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, pages 5425–5434, 2017.

[27] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In CVPR, pages 2574–2582, 2016.

[28] C. Musco and C. Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In NIPS 28, pages 1396–1404. Curran Associates, Inc., 2015.

[29] B. Muzellec and M. Cuturi. Generalizing point embeddings using the Wasserstein space of elliptical distributions. In NeurIPS 31, pages 10237–10248. Curran Associates, Inc., 2018.

[30] F. Nielsen and R. Bhatia. Matrix Information Geometry. Springer-Verlag Berlin Heidelberg, 2013.

[31] R. Nock, B. Magdalou, E. Briys, and F. Nielsen. Mining matrix data with Bregman matrix divergences for portfolio selection. In F. Nielsen and R. Bhatia, editors, Matrix Information Geometry, pages 373–402. Springer Berlin Heidelberg, 2013.

[32] R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. In ICLR, 2014.

[33] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In KDD, pages 701–710, 2014.

[34] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and Node2Vec. In WSDM, pages 459–467, 2018.

[35] A. Saade, F. Krzakala, and L. Zdeborova. Spectral clustering of graphs with the Bethe Hessian. In NIPS 27, pages 406–414. Curran Associates, Inc., 2014.

[36] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[37] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.

[38] O. Shchur, M. Mumme, A. Bojchevski, and S. Gunnemann. Pitfalls of graph neural network evaluation. In NeurIPS Workshop on Relational Representation Learning, 2018.

[39] K. Sun and S. Marchand-Maillet. An information geometry of statistical manifold learning. In ICML 31, volume 32 of PMLR, pages 1–9, 2014.

[40] K. Sun and F. Nielsen. Relative Fisher information and natural gradient for learning large modular models. In ICML 34, volume 70 of PMLR, pages 3289–3298, 2017.

[41] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. In ICLR, 2019.

[42] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In ICLR, 2018.

[43] P. Velickovic, W. Fedus, W. L. Hamilton, P. Lio, Y. Bengio, and R. D. Hjelm. Deep graph infomax. In ICLR, 2019.

[44] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger. Simplifying graph convolutional networks. In ICML, volume 97 of PMLR, pages 6861–6871, 2019.

[45] H. Xiao, B. Biggio, B. Nelson, H. Xiao, C. Eckert, and F. Roli. Support vector machines under adversarial label contamination. Neurocomputing, 160:53–62, 2015.

[46] K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka. Representation learning on graphs with jumping knowledge networks. In ICML 35, volume 80 of PMLR, pages 5453–5462, 2018.

[47] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, volume 48 of PMLR, pages 40–48, 2016.

[48] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. In KDD, pages 974–983, 2018.

[49] G. Zhang, S. Sun, D. Duvenaud, and R. Grosse. Noisy natural gradient as variational inference. In ICML, volume 80 of PMLR, pages 5852–5861, 2018.

[50] C. Zhao, P. T. Fletcher, M. Yu, Y. Peng, G. Zhang, and C. Shen. The adversarial attack and detection under the Fisher information metric. In AAAI, 2019.