
Adaptive Edge Features Guided Graph Attention Networks

Liyu Gong,1 Qiang Cheng1,2
1 Institute for Biomedical Informatics, University of Kentucky, Lexington, USA

2 Department of Computer Science, University of Kentucky, Lexington, USA
{liyu.gong, Qiang.Cheng}@uky.edu

Abstract

Edge features contain important information about graphs. However, current state-of-the-art neural network models designed for graph learning do not consider incorporating edge features, especially multi-dimensional edge features. In this paper, we propose an attention mechanism which combines both node features and edge features. Guided by the edge features, the attention mechanism on a pair of graph nodes will not only depend on node contents, but also adjust automatically with respect to the properties of the edge connecting these two nodes. Moreover, the edge features are adjusted by the attention function and fed to the next layer, which means our edge features are adaptive across network layers. As a result, our proposed adaptive edge features guided graph attention model can consolidate a rich source of graph information that current state-of-the-art graph learning methods cannot. We apply our proposed model to graph node classification, and experimental results on three citation network datasets and a biological network dataset show that our method outperforms the current state-of-the-art methods, testifying to the discriminative capability of edge features and the effectiveness of our adaptive edge features guided attention model. An additional ablation study further shows that both the edge features and the adaptiveness components contribute to our model.

Introduction

Neural networks, especially deep neural networks, have become one of the most successful machine learning techniques in recent years. In many important problems, they achieve state-of-the-art performance, e.g., convolutional neural networks (CNN) (Lecun et al. 1998) in image recognition, and recurrent neural networks (RNN) (Elman 1990) and Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) in natural language processing. In the real world, many problems can be better and more naturally modeled with graphs rather than conventional tables, grid-type images or time sequences. Generally, a graph contains nodes and edges, where nodes represent entities in the real world, and edges represent interactions or relationships between entities. For example, a social network naturally models users as nodes and friendship relationships as edges. For each node, there is often an associated feature vector describing it, e.g. a user's profile in a social network. Similarly, each edge is also often associated with features depicting relationship strengths or other properties.

(a) GAT (b) EGAT

Figure 1: Schematic illustration of the proposed EGAT network. Compared with the GAT network in (a), EGAT in (b) improves on it in three ways. Firstly, the adjacency matrix A in GAT is a binary matrix indicating merely the neighborhood of each node; in contrast, EGAT replaces it with real-valued edge features E, which may exploit the weights of edges. Secondly, the adjacency matrix A in GAT is an N × N matrix, so each edge is associated with only one binary value; whereas the input E to EGAT is a 3-way tensor, so each edge can be associated with a multi-dimensional feature vector. Lastly, in GAT the original adjacency matrix A is fed to every layer; in contrast, the edge features in EGAT are adjusted by each layer before being fed to the next layer.

Due to their complex structures, a challenge in machine learning on graphs is to find effective ways to incorporate different sources of information contained in graphs into machine learning models such as neural networks.

Recently, several models of neural networks have been developed for graph learning, which obtain better performance than traditional techniques. Inspired by the graph Fourier transform, Defferrard et al. (Defferrard, Bresson, and Vandergheynst 2016) propose a graph convolution operator as an analogue to the standard convolutions used in CNNs.


Just as the convolution operation in the image spatial domain is equivalent to multiplication in the frequency domain, convolution operators defined by polynomials of the graph Laplacian are equivalent to filtering in the graph spectral domain. Particularly, by applying Chebyshev polynomials to the graph Laplacian, spatially localized filtering is obtained. Kipf et al. (Kipf and Welling 2017) approximate the polynomials using a renormalized first-order adjacency matrix to obtain comparable results on graph node classification. These graph convolutional networks (GCNs) (Defferrard, Bresson, and Vandergheynst 2016)(Kipf and Welling 2017) combine the graph node features and the graph topological structure information to make predictions. Velickovic et al. (Velickovic et al. 2018) adopt an attention mechanism for graph learning, and propose the graph attention network (GAT). Unlike GCNs, which use a fixed or learnable polynomial of the Laplacian or adjacency matrix to aggregate (filter) node information, GAT aggregates node information by using an attention mechanism on graph neighborhoods. The essential difference between GAT and GCNs is stark: in GCNs the weights for aggregating (filtering) neighbor nodes are defined by the graph structure, which is independent of node contents; in contrast, the weights in GAT are a function of node contents thanks to the attention mechanism. Results on graph node classification show that the adaptiveness of GAT makes it more effective at fusing information from node features and graph topological structures.

One major problem in current graph neural network models such as GAT and GCNs is that edge features are not incorporated. In GAT, graph topological information is injected into the model by forcing the attention coefficient between two nodes to zero if they are not connected. Therefore, the edge information used in GAT is only an indication of whether there is an edge or not, i.e. connectivity. However, graph edges often possess rich information such as strengths, types, etc. Instead of being a binary indicator variable, edge features could be continuous, e.g. strengths, or multi-dimensional. Properly addressing this problem is likely to benefit many graph learning problems. Another problem of GAT and GCNs is that each GAT or GCN layer filters node features based on the original input adjacency matrix. The original adjacency matrix is likely to be noisy and not optimal, which will restrict the effectiveness of the filtering operation.

In this paper, we propose a model of adaptive edge features guided graph attention networks (EGAT) for graph learning. In EGAT, the attention function depends on both node features and edge features. Its value will be adaptive not only to the feature vectors of a pair of nodes, but also to the possibly multi-dimensional feature vector of the edge between this pair of nodes. We conduct experiments on several commonly used datasets, including three citation datasets. Simply using the citation direction as additional edge features, we observe significant improvement compared with current state-of-the-art methods on all these citation datasets. We also conduct experiments on a weighted biological dataset. The results confirm that edge features are important for graph learning, and that our proposed EGAT model effectively incorporates edge features.

In summary, the novelties of our proposed EGAT model include the following:

• New multi-dimensional edge features guided graph attention mechanism. We generalize the attention mechanism in GAT to incorporate multi-dimensional real-valued edge features. The new edge features guided attention mechanism eliminates the limitation of GAT, which can handle only binary edge indicators.

• Attention-based edge adaptiveness across neural network layers. Based on the edge features guided graph attention mechanism, we design a new graph network architecture which can not only filter node features but also adjust edge features across layers. Therefore, the edges are adaptive to both the local contents and the global layers when passing through the layers of the neural network.

• Multi-dimensional edge features for directed edges. We propose a method to encode edge directions as multi-dimensional edge features. Therefore, our EGAT can effectively learn from directed graph data.

Related works

The challenge of graph learning is the complex non-Euclidean structure of graph data. To address this issue, traditional machine learning approaches extract graph statistics (e.g. degrees) (Bhagat, Cormode, and Muthukrishnan 2011), kernel functions (Vishwanathan et al. 2010)(Shervashidze et al. 2011) or other hand-crafted features which measure local neighborhood structures. Those methods are limited in their flexibility in that designing sensible hand-crafted features is time consuming and extensive experiments are needed to generalize to different tasks or settings. Instead of extracting structural information using hand-engineered statistics, graph representation learning attempts to embed graphs or graph nodes in a low-dimensional vector space using a data-driven approach. One kind of embedding approach is based on matrix factorization, e.g. Laplacian Eigenmaps (LE) (Belkin and Niyogi 2001), the Graph Factorization (GF) algorithm (Ahmed et al. 2013), GraRep (Cao, Lu, and Xu 2015), and HOPE (Ou et al. 2016). Another class of approaches focuses on employing a flexible, stochastic measure of node similarity based on random walks, e.g. DeepWalk (Perozzi, Al-Rfou, and Skiena 2014), node2vec (Ahmed et al. 2013), LINE (Tang et al. 2015), HARP (Chen et al. 2018), etc. There are several limitations in matrix factorization-based and random walk-based graph learning approaches. First, the embedding function which maps to the low-dimensional vector space is either linear or simple, so that complex patterns cannot be captured. Second, they do not incorporate node features. Finally, they are inherently transductive because the whole graph structure is required in the training phase.

Recently these limitations in graph learning have been addressed by adopting new advances in deep learning. Deep learning with neural networks can represent complex mapping functions and be efficiently optimized by gradient-descent methods. To embed graph nodes in a Euclidean space, deep autoencoders are adopted to extract connectivity patterns from the node similarity matrix or adjacency matrix, e.g. Deep Neural Graph Representations (DNGR) (Cao, Lu, and Xu 2016) and Structural Deep Network Embeddings (SDNE) (Wang, Cui, and Zhu 2016).


Although autoencoder-based approaches are able to capture more complex patterns than matrix factorization based and random walk based methods, they are still unable to leverage node features.

With the success of CNNs in image recognition, there has recently been a growing interest in adopting convolutions to graph learning. In (Bruna et al. 2013), the convolution operation is defined in the Fourier domain, that is, the eigenspace of the graph Laplacian. The method is afflicted by two major problems: first, the eigendecomposition is computationally intensive; second, filtering in the Fourier domain may result in non-spatially localized effects. In (Henaff, Bruna, and LeCun 2015), a parameterization of the Fourier filter with smooth coefficients is introduced to make the filter spatially localized. (Defferrard, Bresson, and Vandergheynst 2016) proposes to approximate the filters using a Chebyshev expansion of the graph Laplacian, which produces spatially localized filters and also avoids computing the eigenvectors of the Laplacian.

Attention mechanisms have been widely employed in many sequence-based tasks (Bahdanau, Cho, and Bengio 2014)(Zhou et al. 2017)(Kim et al. 2018). Compared with convolution operators, attention mechanisms enjoy two benefits: firstly, they are able to aggregate any variable-sized neighborhood or sequence; further, the weights for aggregation are functions of the contents of a neighborhood or sequence. Therefore, they are adaptive to the contents. (Velickovic et al. 2018) adopts an attention mechanism for graph learning and proposes graph attention networks (GAT), which obtain current state-of-the-art performance on several graph node classification problems.

Edge feature guided graph attention networks

Notations

Given a graph with N nodes, let x_i ∈ R^F, i = 1, 2, ..., N, be the F-dimensional feature vector of the ith node. Let e_ij ∈ R^P, i = 1, 2, ..., N; j = 1, 2, ..., N, be the P-dimensional feature vector of the edge connecting the ith and jth nodes. Specifically, we denote the pth channel of the edge feature e_ij as e_ijp. Without loss of generality, we set e_ij = 0 to mean that there is no edge between the ith and jth nodes. Let N_i, i = 1, 2, ..., N, be the set of neighboring nodes of node i.

Let X be the N × F matrix representation of the node features of the whole graph. Similarly, let E be the N × N × P tensor representing the edge features of the graph.

Architecture overview

Our proposed network has a feedforward architecture. The inputs are X and E. After passing through the first EGAT layer, X is filtered to a new N × F_1 node feature matrix X^(1). In the meantime, the edge features are adjusted to E^(1), which preserves the dimensionality of E. The adapted edge features are fed to the next layer, and this procedure is repeated for every subsequent layer. Within each hidden layer, non-linear activations can be applied to the filtered node features X^(l). The node features X^(L) can be considered an embedding of the graph nodes in an F_L-dimensional space. For a node classification problem, a softmax operator is applied to X^(L) and the weights of the network are trained with supervision from the ground-truth labels. Figure 1 gives a schematic illustration of the EGAT network with a comparison to the GAT network.

Edge features guided attention

For a single EGAT layer, let X and X' be the input and output node feature matrices with sizes N × F and N × F', respectively. Each row represents the feature vector of a specific node. Let E and E' be the input and output edge feature tensors, respectively, both with size N × N × P. Let x_i and x'_i be the input and output feature vectors of the ith node, respectively. With our new attention mechanism, x'_i will be aggregated from the feature vectors of the neighboring nodes of the ith node, i.e. {x_j, j ∈ N_i}, by simultaneously incorporating the corresponding edge features. The aggregation operation is defined as follows:

\[ x'_i = \sigma\left( \Big\Vert_{p=1}^{P} \sum_{j \in N_i} \alpha(x_i, x_j, e_{ijp})\, g(x_j) \right), \tag{1} \]

where σ is a non-linear activation and g is a transformation which maps the node features from the input space to the output space. Usually, a linear mapping is used:

\[ g(x_j) = W x_j, \tag{2} \]

where W is an F' × F matrix.

In Eq. (1), α is the so-called attention coefficient, which is a function of x_i, x_j and e_ijp, the pth feature channel of the edge connecting the two nodes. In previous attention mechanisms, the attention coefficient α depends only on the two points x_i and x_j. Here we let the attention operation be guided by the features of the edge connecting the two nodes, so α depends on the edge features as well. For multi-dimensional edge features, we consider them as multi-channel signals, and each channel guides a separate attention operation. The results from different channels are combined by the concatenation operation. For a specific channel of edge features, our attention function is chosen to be the following:

\[ \alpha(x_i, x_j, e_{ijp}) = \frac{1}{C_{ip}} f(x_i, x_j)\, e_{ijp}, \tag{3} \]

\[ C_{ip} = \sum_{j \in N_i} f(x_i, x_j)\, e_{ijp}, \tag{4} \]

where C_ip is a normalization term for the ith node at the pth edge feature channel, and f could be any ordinary attention function. In this paper, we use a linear function as the attention function for simplicity:

\[ f(x_i, x_j) = \exp\left\{ L\left( a^{T} [W x_i \,\Vert\, W x_j] \right) \right\}, \tag{5} \]

where L is the LeakyReLU activation function, W is the same mapping as in (2), and ‖ is the concatenation operation.

The attention coefficients will be used as the new edge features for the next layer, i.e.

\[ e'_{ijp} = \alpha(x_i, x_j, e_{ijp}), \]

where e'_{ijp} is the entry of E' at position (i, j, p).
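To make the layer computation concrete, the following is a minimal dense NumPy sketch of Eqs. (1)-(5); the function name `egat_layer`, the dense (non-sparse) formulation, and the default activation are our own choices for illustration, not the authors' TensorFlow implementation.

```python
import numpy as np

def leaky_relu(z, negative_slope=0.2):
    return np.where(z > 0, z, negative_slope * z)

def egat_layer(X, E, W, a, activation=np.tanh):
    """One dense EGAT layer following Eqs. (1)-(5).

    X: (N, F) node features; E: (N, N, P) edge features, zero where no edge;
    W: (F_out, F) weight matrix of the linear map g; a: (2 * F_out,) attention
    parameters.  Returns the filtered node features of shape (N, P * F_out)
    and the adapted edge features E' of shape (N, N, P).
    """
    E = np.asarray(E, dtype=float)
    F_out = W.shape[0]
    H = X @ W.T                                   # g(x_j) = W x_j, Eq. (2)

    # f(x_i, x_j) = exp(LeakyReLU(a^T [W x_i || W x_j])), Eq. (5)
    scores = (H @ a[:F_out])[:, None] + (H @ a[F_out:])[None, :]
    f = np.exp(leaky_relu(scores))                # shape (N, N)

    outputs = []
    E_new = np.zeros_like(E)
    for p in range(E.shape[2]):                   # one attention per edge channel
        weighted = f * E[:, :, p]                 # f(x_i, x_j) * e_ijp, zero off-neighborhood
        C = weighted.sum(axis=1, keepdims=True)   # normalization C_ip, Eq. (4)
        alpha = weighted / np.maximum(C, 1e-12)   # attention coefficients, Eq. (3)
        E_new[:, :, p] = alpha                    # e'_ijp = alpha, fed to the next layer
        outputs.append(alpha @ H)                 # sum over j in N_i of alpha * g(x_j)
    X_new = activation(np.concatenate(outputs, axis=1))  # channel concatenation, Eq. (1)
    return X_new, E_new
```

Stacking two such layers, with the returned adapted edge features passed in place of E, mirrors the feedforward architecture described above; an ELU activation would be used for hidden layers as in the experiments.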


Other feasible choices of attention function

Our attention function in (3)-(4) is designed to have a factorized form for ease of computation and use. The factor term f could be replaced with any feasible attention function. A list of possible choices has been summarized in (Wang et al. 2017). For example, an alternative to the f in (5) is an embedded Gaussian:

\[ f(x_i, x_j) = \exp\{\phi(x_i)^{T} \theta(x_j)\}, \tag{6} \]

\[ \phi(x_i) = W_\phi x_i, \tag{7} \]

\[ \theta(x_j) = W_\theta x_j, \tag{8} \]

where W_φ and W_θ are different linear embeddings. Essentially, the embedded Gaussian is a bilinear function of x_i and x_j. This can be clearly seen if we let Σ = W_φ^T W_θ; then f(x_i, x_j) = x_i^T Σ x_j. A bilinear function is more flexible than the linear function used in (5), but it also requires more trainable parameters, which might be more difficult to train if the dataset is small. Which one to use should be decided according to specific applications and tasks.

Another way of designing f is to use different functions for different edge feature channels. For example, by using different parameters a_p for different edge channels, we get

\[ f_p(x_i, x_j) = \exp\left\{ L\left( a_p^{T} [W x_i \,\Vert\, W x_j] \right) \right\}. \tag{9} \]

Compared with the bilinear function in (6), the linear one in (9) does not increase the number of trainable parameters too much. We tested (9) on the Citeseer dataset, and observed a slight accuracy improvement of about 0.1%.
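As a rough sketch of these alternatives (helper names are ours, not from the paper's code), either choice can be written as a drop-in replacement for the factor f used above:

```python
import numpy as np

def f_embedded_gaussian(X, W_phi, W_theta):
    """Embedded Gaussian, Eqs. (6)-(8): f(x_i, x_j) = exp(phi(x_i)^T theta(x_j)),
    which is a bilinear function of x_i and x_j."""
    Phi = X @ W_phi.T                       # phi(x_i) = W_phi x_i
    Theta = X @ W_theta.T                   # theta(x_j) = W_theta x_j
    return np.exp(Phi @ Theta.T)            # (N, N) pairwise attention factors

def f_per_channel(H, a_p, negative_slope=0.2):
    """Per-channel linear attention, Eq. (9), with its own parameter vector a_p;
    H = X W^T holds the transformed node features W x_i."""
    F_out = H.shape[1]
    z = (H @ a_p[:F_out])[:, None] + (H @ a_p[F_out:])[None, :]
    return np.exp(np.where(z > 0, z, negative_slope * z))
```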

Edge features for directed graphs

In the real world, many graphs are directed, i.e. each edge has a direction associated with it. Oftentimes, edge direction contains important information about the graph. For example, in a citation network, machine learning papers sometimes cite mathematics papers or other theoretical papers; however, mathematics papers may seldom cite machine learning papers. In many previous works including GCNs and GAT, edge directions are not considered; in their experiments, directed graphs such as citation networks are treated as undirected graphs. In this paper, we argue that discarding edge directions loses important information. Instead, we view the direction of an edge as a kind of edge feature, and encode a directed edge channel e_ijp as

\[ \left[\, e_{ijp} \quad e_{jip} \quad e_{ijp} + e_{jip} \,\right]. \]

Therefore, each directed channel is augmented to three channels. Note that the three channels define three types of neighborhoods: forward, backward and undirected. As a result, EGAT will aggregate node information from these three different types of neighborhoods, which contain the direction information. Taking the citation network for instance, EGAT will apply the attention mechanism to the papers that a specific paper cites, the papers citing this paper, and the union of the former two. With this kind of edge features, different patterns in different types of neighborhoods can be effectively captured.
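A minimal sketch of this encoding (the channel ordering follows the bracketed expression above; the function name is ours):

```python
import numpy as np

def encode_directed_edges(E_dir):
    """Expand each directed edge channel e_ijp into the three channels
    [e_ijp, e_jip, e_ijp + e_jip], i.e. forward, backward and undirected
    neighborhoods.  E_dir has shape (N, N, P); the result is (N, N, 3 * P)."""
    forward = E_dir
    backward = np.transpose(E_dir, (1, 0, 2))   # swap i and j, keep the channel axis
    undirected = forward + backward
    return np.concatenate([forward, backward, undirected], axis=2)
```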

Algorithm 1: EGAT training algorithm for graph node classification

Input: X, E, N, and node labels Y
X^(0) ← X; E^(0) ← E
while training epoch is less than N do
    for l = 1 to L do
        Calculate α from X^(l-1) and E^(l-1) by Eq. (3)
        Calculate X^(l) from X^(l-1) and α by Eq. (1)
        E^(l) ← α
    end
    Back-propagate the loss gradient from X^(L) and Y
    Update weights
    if early stopping condition satisfied then
        Terminate
    end
end
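The sketch below restates Algorithm 1 in plain Python. It is illustrative only: the layer callables, the `optimizer_step` stand-in for the Adam update, and the assumption that the last layer outputs per-class scores are ours, whereas the actual implementation relies on TensorFlow's automatic differentiation.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # mean negative log-likelihood of the true classes
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def train_egat(X, E, Y, layers, optimizer_step, train_idx, val_idx,
               max_epochs=1000, patience=100):
    """Training loop in the spirit of Algorithm 1.  `layers` is a list of
    callables (H, F) -> (H', F'), e.g. EGAT layers with bound weights, whose
    last element outputs N x C class scores; `optimizer_step(loss)` stands in
    for the gradient back-propagation and weight update."""
    best_val, wait = np.inf, 0
    for epoch in range(max_epochs):
        H, F = X, E
        for layer in layers:                    # forward pass: alpha by Eq. (3), X^(l) by Eq. (1)
            H, F = layer(H, F)                  # adapted edge features feed the next layer
        probs = softmax(H)
        loss = cross_entropy(probs[train_idx], Y[train_idx])   # loss only on labeled nodes
        optimizer_step(loss)                    # back-propagate and update weights
        val_loss = cross_entropy(probs[val_idx], Y[val_idx])
        if val_loss < best_val:
            best_val, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:                # early stopping on the validation loss
                break
    return layers
```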

Node features for graphs without node attributes

Although the network architecture illustrated in Figure 1 requires a node feature matrix X of size N × F as input, it can also handle graphs without node attributes. For these kinds of graph data, graph statistics such as node degrees can be extracted and used as node features. An alternative choice is to assign each node a one-hot indicator vector as its node features, i.e. X = I_{N×N}. Either way, the graph topological structure will be embedded into the node representations as hidden layer outputs of the neural network. In our experiments, we use one-hot indicator vectors as node features for graphs without node attributes, but the other ways mentioned above can also be used.
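Either choice takes only a few lines; the helper name below is ours, and the degree variant is just one example of a graph statistic:

```python
import numpy as np

def default_node_features(E, one_hot=True):
    """Node features for a graph without node attributes: the one-hot identity
    matrix X = I (the option used in the experiments), or a simple graph
    statistic such as the node degree."""
    N = E.shape[0]
    if one_hot:
        return np.eye(N)
    degrees = (E != 0).any(axis=2).sum(axis=1)   # number of neighbors per node
    return degrees.reshape(N, 1).astype(float)
```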

Graph node classification

A multi-layer EGAT network can be used to solve graph node classification problems. For a C-class classification problem, we just need to let the weight matrix W of the last layer L have dimension F_{L-1} × C, replace the concatenation operation with an averaging operation, and apply a softmax activation. Then an appropriate classification loss function, such as cross entropy, can be used to train the EGAT parameters. In a typical graph node classification problem, only a part of the nodes are labeled, so the loss needs to be defined on the predicted values of these labeled nodes. The training process is illustrated in Algorithm 1.

Relation to GAT and GCN

One major contribution of EGAT is that it can handle multi-dimensional real-valued edge features, while in GAT and GCNs only a binary edge indicator is considered for each edge. Therefore, EGAT is able to exploit richer information about the input graph.

Another major difference of EGAT compared with GAT and GCNs is that the edge features are adapted by each layer. In GAT and GCNs, the original adjacency matrix is fed to every layer of the neural network. In EGAT, the original edge features are fed to the first layer only; then, the edge features adapted by one layer are used as the input edge features for the next layer.

The attention function in (5) is a generalization of the attention function in GAT, which can be expressed as

\[ \alpha(x_i, x_j) = \mathrm{softmax}_j\left\{ L\left( a^{T} [W x_i \,\Vert\, W x_j] \right) \right\} \tag{10} \]

\[ = \frac{\exp\left\{ L\left( a^{T} [W x_i \,\Vert\, W x_j] \right) \right\}}{\sum_{j \in N_i} \exp\left\{ L\left( a^{T} [W x_i \,\Vert\, W x_j] \right) \right\}}. \tag{11} \]

Compared with (11), our attention function is more flexible in two ways. Firstly, we multiply the edge features with the output of f, which utilizes the possibly continuous edge features rather than a binary indicator. Secondly, we combine the attention coefficients guided by different channels by concatenation. Concatenation has been used in DenseNet architectures (Huang et al. 2017) to combine image features at different scales, which proves to be more effective in many applications than the addition used by the skip connection in ResNet (He et al. 2016). It is natural to adopt concatenation to combine different channels of edge signals. If there is only one channel of binary edge features indicating whether two nodes are connected without direction, then our EGAT model reduces to GAT. In this sense, GAT may be regarded as a special case of EGAT.
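The reduction can be checked numerically: with a single symmetric binary edge channel, the normalization in Eqs. (3)-(4) coincides with the softmax in Eq. (11). The following small sketch (random stand-in values, not from the paper) illustrates this.

```python
import numpy as np

f = np.random.rand(4, 4)                        # stand-in for f(x_i, x_j) = exp(...)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)     # one binary, undirected edge channel

alpha_egat = (f * adj) / (f * adj).sum(axis=1, keepdims=True)   # Eqs. (3)-(4)
masked = np.where(adj > 0, f, 0.0)
alpha_gat = masked / masked.sum(axis=1, keepdims=True)          # Eq. (11)
assert np.allclose(alpha_egat, alpha_gat)       # EGAT attention reduces to GAT
```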

Experimental results

Implementation details

Following the experiment settings of (Kipf and Welling 2017)(Velickovic et al. 2018), we use two layers of EGAT in all of our experiments for fair comparison. We test EGAT on two citation networks (directed graphs) with node attributes, one citation network (directed graph) without node attributes, and a biological network (weighted graph), and compare EGAT with GAT. Throughout the experiments, we use the Adam optimizer (Kingma and Ba 2015) with learning rate 0.005. An early stopping strategy with a window size of 100 is adopted for the three citation networks; i.e., we stop training if the validation loss does not decrease for 100 consecutive epochs. For the biological network, we increase the stopping patience to 200. The output dimensions of the node features are 64 for all hidden layers. Furthermore, we apply dropout (Srivastava et al. 2014) with drop rate 0.6 to both the input features and the normalized attention coefficients. L2 regularization with weight decay 0.0005 is applied to the weights of the model (i.e., W and a). Moreover, the exponential linear unit (ELU) (Clevert, Unterthiner, and Hochreiter 2016) is employed as the nonlinear activation for hidden layers.

The algorithms are implemented in Python within the Tensorflow framework (Abadi et al. 2016). Because the edge and node features in our experimental datasets are highly sparse, we further utilize the sparse tensor functionality of Tensorflow to reduce the memory requirement and computational complexity. Thanks to the sparse implementation, all the datasets can be efficiently handled by an Nvidia Tesla K40 graphics card with 12 gigabytes of graphics memory.

Citation networks

Datasets To test the effectiveness of our proposed model, we apply it to the network node classification problem. Three datasets are tested: Citeseer (Sen et al. 2008), Pubmed (Namata et al. 2012) and Konect-cora (Subelj and Bajec 2013). Some basic statistics about these datasets are listed in Table 1. The graph of the Citeseer dataset, which has 3,327 nodes, is small compared with the Pubmed and Konect-cora datasets, which have 19,717 and 23,166 nodes, respectively.

Table 1: Summary of citation network datasets

                   Citeseer   Pubmed   Konect-cora
# Nodes            3327       19717    23166
# Edges            4732       44338    91500
# Node Features    1433       3703     0
# Classes          6          3        10

All three datasets are directed graphs, where edge directions represent the directions of citations. Two of the datasets, Citeseer and Pubmed, contain node attributes corresponding to bag-of-words text features. However, nodes in the Konect-cora dataset do not have any attributes, which makes it suitable for testing the effectiveness of our EGAT for embedding the topological structure of graphs.

The Citeseer and Pubmed datasets are also used in (Yang, Cohen, and Salakhudinov 2016)(Kipf and Welling 2017)(Velickovic et al. 2018). However, they all use a pre-processed version which discards the edge directions. Since our EGAT model requires the edge directions to construct edge features, we use the original versions from (Sen et al. 2008) and (Namata et al. 2012). For each of the three datasets, we split the nodes into 3 subsets with sizes 75%, 15% and 15% for training, validation and testing, respectively.

Performance For Citeseer and Pubmed, we use the node attributes as node features. The feature vector of every node is normalized so that it sums to one. For Konect-cora, we use the one-hot indicators described previously as node features. For all three datasets, we use the edge features described previously.
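For concreteness, the node feature normalization amounts to a simple row scaling (the guard against all-zero rows is our addition):

```python
import numpy as np

def normalize_node_features(X):
    """Scale each node's feature vector so that its entries sum to one, as done
    for the Citeseer and Pubmed bag-of-words features."""
    row_sums = X.sum(axis=1, keepdims=True)
    return X / np.maximum(row_sums, 1e-12)      # avoid division by zero for empty rows
```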

To separately test the effectiveness of the layer-wise edge feature adaptiveness and of the edge features themselves, we have implemented three versions of EGAT: undirected-edges EGAT (EGAT-UD), non-adaptive EGAT (EGAT-NA) and the full EGAT. EGAT-UD discards the edge directions and uses the symmetric adjacency matrix as the edge feature matrix, similarly to GAT. Since the citation datasets are unweighted, the first layer of EGAT-UD is reduced to GAT. However, for the following layers, EGAT-UD uses the adapted edge features, which are real-valued. Therefore, EGAT-UD can be considered an adaptive version of GAT for undirected binary-edge graphs. EGAT-NA is the same as EGAT except that the edge adaptiveness functionality is removed.

We run each algorithm 20 times, and report the mean and standard deviation of the accuracies in Table 2.


Figure 2: Training and validation losses of EGAT and GAT on the citation networks: (a) Citeseer, (b) Pubmed, (c) Konect-cora. Each panel plots the training and validation loss of both methods against the training epoch. In (b) and (c), EGAT converges quickly.

Table 2: Performances on citation networks.

(a) Citeseer

Method             Accuracy
GAT                76.8 ± 0.5%
EGAT-UD (Ours)     76.9 ± 0.6%
EGAT-NA (Ours)     78.7 ± 0.5%
EGAT (Ours)        78.7 ± 0.4%

(b) Pubmed

Method             Accuracy
GAT                84.1 ± 0.3%
EGAT-UD (Ours)     84.2 ± 0.3%
EGAT-NA (Ours)     87.0 ± 0.2%
EGAT (Ours)        87.0 ± 0.2%

(c) Konect-cora

Method             Accuracy
GAT                50.3 ± 0.7%
EGAT-UD (Ours)     54.4 ± 0.7%
EGAT-NA (Ours)     76.6 ± 0.8%
EGAT (Ours)        77.8 ± 0.4%

According to the results, our EGAT outperforms GAT across the datasets. The mean accuracies are improved by 1.9%, 2.9% and 27.5% for Citeseer, Pubmed and Konect-cora, respectively. The performance gains testify that edge directions are important for the task of citation network node classification, and that our EGAT is capable of effectively capturing the patterns related to edge directions. On the Konect-cora dataset, which does not contain any node attributes, the improvement is more significant than on the other two datasets. One possible reason is that edge information becomes crucial given no node attributes. Another possible reason is that the Konect-cora graph is denser, i.e. the average degree of nodes is higher, than the other two graphs, so more information can be extracted from the edges. Compared with GAT, EGAT-UD also shows consistent improvement on all three datasets. Similar trends can be seen between EGAT-NA and EGAT; again, we observe more significant improvement on Konect-cora than on the other two datasets.

To further assess the effectiveness of EGAT and compare with GAT, we record the training and validation losses from one run of each of the two algorithms on the three citation network datasets. The losses are plotted in Figure 2. From the plots we can clearly see that EGAT gets lower training losses and validation losses. On the Konect-cora and Pubmed datasets, EGAT converges very quickly, at around 350 and 500 epochs, respectively. On the Citeseer dataset, it takes longer to converge, possibly due to the small size of the dataset.

Biological network

Dataset One advantage of our EGAT model is to utilize real-valued edge weights as features rather than simple binary indicators. To test the effectiveness of edge weights, we conduct experiments on a biological network: Bio1kb-Hic (Belkin and Niyogi 2001), which is a chromatin interaction network collected by chromosome conformation capture (3C) based technologies at 1kb resolution. The nodes in the network are genetic regions, and the edges are strengths of interactions between two genetic regions. The dataset contains 7,023 nodes and 3,087,019 edges. The nodes are labeled with 18 categories, which indicate the genetic functions of the genetic regions. There are no attributes associated with the nodes. We split the nodes into 3 subsets with sizes 75%, 15% and 15% for training, validation and testing, respectively.

Note that the genetic interactions measured by 3C-based technologies are very noisy. Therefore, the Bio1kb-Hic dataset is highly dense and noisy.


To test the noise sensitivity of the algorithms, we generate a denoised version of the dataset by setting edge weights less than 5 to 0. In the denoised version, the number of edges is reduced from 3,087,019 to 239,543, which is much less dense.
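The denoising step is a simple threshold on the edge weight tensor (the helper name and vectorized form are ours):

```python
import numpy as np

def denoise_edge_weights(E, threshold=5.0):
    """Build the denoised Bio1kb-Hic variant: interaction strengths below the
    threshold are set to zero, removing most of the noisy edges."""
    return np.where(E < threshold, 0.0, E)
```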

Performance We use the one-hot indicators described previously as node features. We test four versions of algorithms: GAT as a baseline, unweighted-edges EGAT (EGAT-UW), non-adaptive EGAT (EGAT-NA) and the full EGAT. EGAT-NA is defined the same way as in the citation network experiments. EGAT-UW is similar to EGAT-UD except that it discards edge weights instead of discarding edge directions, since we are dealing with undirected weighted graphs in this experiment. Similar to EGAT-UD in the previous experiments, the first hidden layer of EGAT-UW is the same as GAT, but subsequent layers utilize attention-adapted real-valued edge features.

We run each algorithm 20 times on both the raw and denoised networks; the mean and standard deviation of the accuracies are reported in Table 3. Note that we increase the early stopping patience to 200 epochs because of the noisy data.

Table 3: Performance summary on the Bio1kb-Hic dataset.

Method             Raw            Denoised
GAT                40.0 ± 3.8%    90.4 ± 3.2%
EGAT-UW (Ours)     42.3 ± 5.6%    93.9 ± 2.9%
EGAT-NA (Ours)     91.3 ± 8.0%    97.3 ± 0.8%
EGAT (Ours)        96.4 ± 1.0%    98.2 ± 0.4%

The results show that our EGAT outperforms GAT significantly on both the raw and denoised networks, which confirms that edge weights are important for node classification and that they are effectively incorporated in EGAT. Moreover, EGAT is highly noise-resilient. The detailed results also clearly show that the edge adaptiveness contributes to the EGAT model. On the raw network, EGAT is much more stable than EGAT-NA in terms of significantly reduced variance. On the denoised network, EGAT-UW obtains an accuracy which is 3.5% higher than GAT, and EGAT obtains an accuracy 0.9% higher than EGAT-NA, both with significantly smaller variances than GAT and EGAT-UW.

Conclusions and future directions

In this paper, we address existing problems in the current state-of-the-art graph neural network models. Specifically, we generalize the current graph attention mechanism in order to incorporate multi-dimensional real-valued edge features. Then, based on the proposed new attention mechanism, we improve the current graph neural network architectures by adjusting edge features across neural network layers. Finally, we propose a method to design multi-dimensional edge features for directed edges so that our model is able to effectively handle directed graph data. Extensive experiments are conducted on several graph datasets, including two directed citation networks with node attributes, a directed citation network without any node attributes and a weighted undirected biological network. Experimental results show that our EGAT model outperforms GAT, the current state-of-the-art model, consistently and significantly on all four datasets.

While we test EGAT only on the graph node classification problem, we believe the model is general and suitable for other graph problems, such as whole-graph or subgraph embedding and classification. A future research line is to apply EGAT to other interesting problems.

References

[Abadi et al. 2016] Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, 265–283. Berkeley, CA, USA: USENIX Association.

[Ahmed et al. 2013] Ahmed, A.; Shervashidze, N.; Narayanamurthy, S.; Josifovski, V.; and Smola, A. J. 2013. Distributed Large-scale Natural Graph Factorization. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, 37–48. New York, NY, USA: ACM.

[Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat].

[Belkin and Niyogi 2001] Belkin, M., and Niyogi, P. 2001. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In Advances in Neural Information Processing Systems, 7.

[Bhagat, Cormode, and Muthukrishnan 2011] Bhagat, S.; Cormode, G.; and Muthukrishnan, S. 2011. Node Classification in Social Networks. arXiv:1101.3291 [physics], 115–148.

[Bruna et al. 2013] Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral Networks and Locally Connected Networks on Graphs. arXiv:1312.6203 [cs].

[Cao, Lu, and Xu 2015] Cao, S.; Lu, W.; and Xu, Q. 2015. GraRep: Learning Graph Representations with Global Structural Information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM '15, 891–900. New York, NY, USA: ACM.

[Cao, Lu, and Xu 2016] Cao, S.; Lu, W.; and Xu, Q. 2016. Deep Neural Networks for Learning Graph Representations. In AAAI Conference on Artificial Intelligence.

[Chen et al. 2018] Chen, H.; Perozzi, B.; Hu, Y.; and Skiena, S. 2018. HARP: Hierarchical Representation Learning for Networks. In AAAI Conference on Artificial Intelligence.

[Clevert, Unterthiner, and Hochreiter 2016] Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2016. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In International Conference on Learning Representations.

[Defferrard, Bresson, and Vandergheynst 2016] Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems, 3844–3852. Curran Associates, Inc.

[Elman 1990] Elman, J. L. 1990. Finding Structure in Time. Cognitive Science 14(2):179–211.

[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.

[Henaff, Bruna, and LeCun 2015] Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep Convolutional Networks on Graph-Structured Data. arXiv:1506.05163 [cs].

[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735.

[Huang et al. 2017] Huang, G.; Liu, Z.; Maaten, L. v. d.; and Weinberger, K. Q. 2017. Densely Connected Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2261–2269.

[Kim et al. 2018] Kim, S.; Hong, J.-H.; Kang, I.; and Kwak, N. 2018. Semantic Sentence Matching with Densely-connected Recurrent and Co-attentive Information. arXiv:1805.11360 [cs].

[Kingma and Ba 2015] Kingma, D. P., and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.

[Kipf and Welling 2017] Kipf, T. N., and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations.

[Lecun et al. 1998] Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

[Namata et al. 2012] Namata, G.; London, B.; Getoor, L.; and Huang, B. 2012. Query-driven Active Surveying for Collective Classification. In Workshop on Mining and Learning with Graphs.

[Ou et al. 2016] Ou, M.; Cui, P.; Pei, J.; Zhang, Z.; and Zhu, W. 2016. Asymmetric Transitivity Preserving Graph Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, 1105–1114. New York, NY, USA: ACM.

[Perozzi, Al-Rfou, and Skiena 2014] Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, 701–710. New York, NY, USA: ACM.

[Sen et al. 2008] Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Gallagher, B.; and Eliassi-Rad, T. 2008. Collective Classification in Network Data. AI Magazine 29(3):93–106.

[Shervashidze et al. 2011] Shervashidze, N.; Schweitzer, P.; Leeuwen, E. J. v.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12(Sep):2539–2561.

[Srivastava et al. 2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15(1):1929–1958.

[Subelj and Bajec 2013] Subelj, L., and Bajec, M. 2013. Model of Complex Networks based on Citation Dynamics. In Proceedings of the WWW Workshop on Large Scale Network Analysis, 527–530.

[Tang et al. 2015] Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. LINE: Large-scale Information Network Embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW '15, 1067–1077.

[Velickovic et al. 2018] Velickovic, P.; Cucurull, G.; Casanova, A.; and Romero, A. 2018. Graph Attention Networks. In International Conference on Learning Representations.

[Vishwanathan et al. 2010] Vishwanathan, S. V. N.; Schraudolph, N. N.; Kondor, R.; and Borgwardt, K. M. 2010. Graph Kernels. Journal of Machine Learning Research 11(Apr):1201–1242.

[Wang et al. 2017] Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2017. Non-local Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition.

[Wang, Cui, and Zhu 2016] Wang, D.; Cui, P.; and Zhu, W. 2016. Structural Deep Network Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, 1225–1234. New York, NY, USA: ACM.

[Yang, Cohen, and Salakhudinov 2016] Yang, Z.; Cohen, W.; and Salakhudinov, R. 2016. Revisiting Semi-Supervised Learning with Graph Embeddings. In International Conference on Machine Learning, 40–48.

[Zhou et al. 2017] Zhou, G.; Song, C.; Zhu, X.; Ma, X.; Yan, Y.; Dai, X.; Zhu, H.; Jin, J.; Li, H.; and Gai, K. 2017. Deep Interest Network for Click-Through Rate Prediction. arXiv:1706.06978 [cs, stat].