
Academic year 2012/2013

Warsaw University of Technology
Faculty of Electronics and Information Technology
Institute of Computer Science

MASTER'S THESIS

Aleksy Stanisław Barcz

Implementation aspects of graph neural networks

Thesis supervisor: mgr inż. Zbigniew Szymański

Grade: .....................................................

.................................................................

Signature of the Chairman of the Diploma Examination Committee

Field of study: Computer Science

Specialization: Information Systems Engineering

Date of birth: 1988.01.28

Start of studies: 2012.02.20

Curriculum vitae

I graduated from XXVIII LO im. J. Kochanowskiego in Warsaw, in a class with a mathematics and computer science profile. I completed my engineering studies in February 2012 in the field of Computer Science at the Faculty of Electronics and Information Technology of the Warsaw University of Technology. During my first- and second-cycle studies I took part in Athens programme student exchanges at Katholieke Universiteit Leuven (Fundamentals of artificial intelligence) and at Télécom ParisTech (Emergence in complex systems).

....................................................... Student's signature

DIPLOMA EXAMINATION

Passed the diploma examination on ..................................................................................2013

with the result ...................................................................................................................................

Overall result of studies: ................................................................................................................

Additional conclusions and remarks of the Committee: ..........................................................................................

.......................................................................................................................................................

.......................................................................................................................................................

SUMMARY

This thesis describes the process of implementing a Graph Neural Network, a classifier capable of classifying data represented as graphs. Parameters affecting the classifier efficiency and the learning process were identified and described, together with the implementation details that influence them. Important similarities to other connectionist models used for graph processing were highlighted.

Keywords: Graph neural networks, classification, graph processing, recursive neural networks

IMPLEMENTATION ASPECTS OF GRAPH NEURAL NETWORKS

The thesis is a report on an independent implementation of a Graph Neural Network classifier, which allows the classification of graph-structured data. The parameters relevant to the classifier were identified, affecting the course of the classifier's learning process and the quality of the obtained results. The implementation details essential to the classifier's operation were described. The classifier was presented in the context of similar solutions, in order to show the close relationships between existing neural-network-based models for processing graph-structured data.

Keywords: graph neural networks, classification, graph processing, recursive neural networks

Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2. Domains of application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3. Graph processing models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

4. History of connectionist models . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.1. Hopfield networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.2. RAAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.3. LRAAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.4. Folding architecture and BPTS . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.5. Generalised recursive neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.6. Recursive neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.7. Graph machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5. Graph neural network implementation . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1. Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2. Computation units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3. Encoding network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.4. General training algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.5. Unfolded network and backpropagation . . . . . . . . . . . . . . . . . . . . . . 25
5.6. Contraction map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.7. RPROP algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.8. Maximum number of iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.9. Detailed training algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.10. Graph-focused tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.1. Subgraph matching - data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.2. Impact of initial weight values on learning . . . . . . . . . . . . . . . . . . . . . 32
6.3. Impact of contraction constant on learning . . . . . . . . . . . . . . . . . . . . . 33
6.4. Cross-validation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

A. Using the software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

B. Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

List of Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

1. Introduction

The Graph Neural Network model is a connectionist classifier capable of classifying

graphs. Most other existing neural network-based graph classifiers, such as RAAM [1] or LRAAM [2] and all solutions based on them, are capable of processing only certain types of graphs, in most cases DAGs (directed acyclic graphs) or DPAGs (directed positional

acyclic graphs). Several solutions were invented to deal with cyclic graphs, such as introducing a delay in the LRAAM encoding tree [3] or techniques mapping cyclic directed graphs to

"recursive equivalent" trees [4]. The problem of nonpositional graphs was also addressed by

several authors, either by creating domain-specific encodings used to enforce a defined order

on graph nodes [5] or by introducing various modifications to the classifier [6]. However,

most of the solutions dealing with cyclic and nonpositional graphs either complicate the

classifier model, enlarge the input dataset or (in case of cycles) may result in information

loss. The Graph Neural Network model can directly process most types of graphs, including

cyclic, acyclic, directed, undirected, positional and nonpositional, which makes it a flexible

solution. There is a conceptual similarity between the GNN model and recursive neural networks [7]; however, the GNN model adopts a novel learning and backpropagation schema which simplifies the processing of different types of graphs. This thesis describes the steps of

implementation of a GNN classifier, including some details which were not described in the

original article [8]. The classifier was implemented in GNU Octave with two ideas in mind:

providing a simple interface (similar to that of the Neural Networks toolbox) and maximum flexibility, that is, the ability to process every kind of data that the theoretical model could deal with. The process of training a GNN and the classification results were presented in detail. Listings of the most important procedures were included in Appendix B.

2. Domains of application

This chapter presents domains where data is organized into a structured form, that is, in the form of sequences or graphs. The need to process such data differently arises from the structure of the data itself. To present this difference, we must first summarize what the tasks of classification and regression mean in the most common sense in the data processing domain.

A common statistical classifier (which later on is called a vectorial classifier) takes as input

samples from a given dataset, representing real world objects, and associates each sample

with a category. The samples are fixed-length vectors of numeric values. Each position in

such a vector represents a feature of the sample, which is quantified by a real or integer

value. The mapping from features to positions in the vector is fixed and must hold for each

sample in the dataset. The category is represented by a non-empty fixed-length vector of

integer values, where once again the position of each value is meaningful (we can say it’s

a positional representation). For the regression task, a vector of real (or integer) values is

associated with each sample instead of a category. The domain of vectorial classifiers is well

developed and includes, among other solutions, neural networks, support vector machines,

naïve Bayes classifiers and random forests.

In the case of graph processing, the nature of the data is different. Each sample is represented by a graph. A dataset may consist of a single or several graphs. Each graph consists of

nodes, connected with edges. Each node can be described by its label, a fixed-length vector

of reals. Each edge can also be labelled, with a fixed-length vector of reals whose size may differ from that of the node labels.

Edges can be directed or undirected. An example of a simple graph was presented in Fig. 2.1.

If such a graph was to be used as input for a common classifier, it would be necessary

to transform the structured representation into a plain vector. It could be accomplished in

several ways; one of the most obvious would be to perform a preorder walk on the tree and list

Figure 2.1. A simple binary tree: root A with children B and C; node B with children D and E


the node labels in the order they were visited. Such a walk would result in the representation

[A,B,D,E,C]. It can be seen that the explicit information about node adjacency was lost.

Instead, the information is provided implicitly, according to a coding which must be known

a priori to properly interpret such a vector representation. That means that a model learning

to classify such graph representations would have to learn the encoded relationship, instead

of using it from the beginning to learn other, unknown and interesting relationships affecting

the samples' category. The resulting learning task becomes even harder if such sequential representations contain long-distance relationships. Different encodings from structured data to vectors exist; however, they all share this flaw.
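To make the information loss concrete, the short GNU Octave sketch below (an illustrative example, not part of the thesis implementation) performs such a preorder walk on the tree of Fig. 2.1; the adjacency list is an assumption encoding that figure.

1;
% Preorder flattening of the binary tree from Fig. 2.1 (root A with children B
% and C; B with children D and E). Only the visiting order is kept, so the
% explicit adjacency information is lost in the resulting vector.
function order = preorder(node, children)
  order = node;                          % visit the node itself first
  for child = children{node}             % then recurse into its children, in order
    order = [order, preorder(child, children)];
  endfor
endfunction

labels   = {"A", "B", "C", "D", "E"};
children = {[2 3], [4 5], [], [], []};   % A -> B, C ; B -> D, E
walk = preorder(1, children);            % node indices in visiting order
printf("%s ", labels{walk});             % prints: A B D E C
printf("\n");                            % adjacency is now only implicit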

Moreover, the inadequacy of simple vector representation becomes even more apparent

when the graph structure becomes more complicated. First of all, if edge labels are present

in the vector representation, the representation becomes a mix of data belonging to two

different entities - nodes and edges. Once again the classifier doesn’t know which part of

the data corresponds to the first entity and which to the other. The same applies to directed

and undirected edges. If two nodes in a graph are connected by a directed edge, presumably one of the nodes in the relation has a larger impact on the other than vice versa. On the contrary, an undirected or bidirectional edge implies an equal impact of both nodes on

each other. What if a graph contains both types of edges? How should one type of edges be

distinguished from the other one? Secondly, let’s consider the case of cyclic dependencies.

Even if a meaningful representation of the graph is built ignoring cycles, e.g. by constructing

a minimum spanning tree, some explicit information about connections is lost. An example

of such data are chemical compounds containing groups of atoms forming cyclic bonds.

Another problem lies in the positional nature of vectorial data. If a representation is built by

simply storing node labels one after another, this representation becomes vulnerable to any

reordering of the children of a node. An additional effort must be made to assure a consistent

ordering of the potentially difficult to order data, while the ordering of the children of a node

may be irrelevant in the dataset considered.

It can be seen that in order to properly process structured data, a different approach

must be used. The data should be processed in a way that properly exploits the information

contained in its structure - by means of building a sufficient representation or by processing

the structured data directly.

Graph-oriented models based on neural networks were successfully applied in various

domains, including chemistry, pattern recognition and natural language processing. In the

domain of computer-aided drug design the most important problem to solve is to predict the

properties of a molecule prior to synthesizing it. All molecules with a negative prediction

can be discarded automatically, reducing the costs of the subsequent laboratory experiments

which can focus on the molecules with a positive prediction only [3]. This is the case of


QSAR (quantitative structure-activity relations) and QSPR (quantitative structure-property

relations). While traditional processing methods consist of extraction and selection of features from the molecules' descriptions, the molecules can be easily represented as undirected

graphs and processed with a graph-oriented model [9] [10].

The next domain of interest is document mining. As the amount of XML-formatted

information increases rapidly, the problem of determining if a document can be assigned to

a given category becomes crucial. As an XML document can be viewed as semi-structured

data, graph-processing models can be successfully applied to this task [11]. Another problem related to document processing is web page ranking, where documents and the

links between them can be described as structured data. A general ranking model can be

implemented as a graph-processing model, which allows exploiting page contents and link

analysis simultaneously [12] [8].

In the domain of image processing two pattern recognition tasks can be distinguished.

The first is the classification of images, either for industrial applications, control and monitoring,

or for querying an image database. The second is object localisation, which may be used

e.g. for face localisation. For both tasks an image can be represented as a RAG (Region

Adjacency Graph), where image segments are represented as nodes and their adjacency

is represented as edges which may contain information about e.g. the distances between

adjacent segments. For both tasks graph-processing methods can be used, yielding promising

results [13] [6] [14].

Another classic example where the structure of the data plays a crucial role in its understanding is natural language processing. In the unconstrained case, the input data may consist of arbitrarily complex sentences. As a sentence can be transformed into a graph reflecting its syntax, a graph-processing model can be trained to parse such sentences. One of the

first graph-processing solutions was already evaluated on such a task [1] and more recent

solutions are also present in the literature [15].

3. Graph processing models

A model considered fully capable of processing structured data should be able to:

1. build data representation

a) minimal

b) exploiting sufficiently the structure of the data

c) adequate for subsequent processing (classification, regression)

2. perform classification / regression on the structured data

a) taking into consideration the structure encoded in the representation

b) with a high generalization capacity

These two main tasks are often intertwined with each other, as a classification procedure

may affect the procedure of representation building and vice versa. It is also possible for

a model to focus only on representation building, while leaving the task of processing to

a common statistical classifier, such as a support vector machine. Two main families of models

capable of processing structured data are the symbolic and connectionist families. The

first one originates in the artificial intelligence domain and focuses on inferring relationships

by means of inductive logic programming. The connectionist models focus on modelling

relationships with the use of interconnected networks of simple units. The different models

originating from these two families are:

1. inductive logic programming

2. evolutionary algorithms

3. probabilistic models: Bayes networks and Markov random fields

4. graph kernels

5. neural network models

The main area of interest of this thesis is connectionist models based on neural networks. The connectionist models make the fewest assumptions about the domain of the dataset and thus provide potentially the most general method for processing structured data.

4. History of connectionist models

This chapter summarizes the history of connectionist models used for graph processing.

All these models originate from the feed-forward neural networks (FNNs). The history of

neural networks begins in 1943 with the McCulloch–Pitts (MCP) neuron model, followed by the Rosenblatt perceptron classifier in 1957. The feed-forward neural network model

was developed during the following three decades and a conclusive state was reached in all

major fields of related research by approximately 1988. The FNN model reached maturity

in its field of application: classification and regression performed on unstructured positional

samples of fixed size. In the ’80s a new branch of the neural networks family began to

develop - the recurrent neural networks (RNN). The RNN model is capable of processing

sequences of varying length (potentially infinite), which makes it suitable for dealing with

time-varying sequences or biological samples of various length [16]. However, a slightly

different model had to be invented to properly process graph data.

4.1. Hopfield networks

One of the earliest attempts to classify structured data with neural networks used Hopfield networks [3]. A common application of a Hopfield network is an auto-associative

memory, which learns to reproduce patterns provided as its inputs. (The task to be learned is

the mapping xi⇒ xi, where xi is a pattern of fixed size n.) Afterwards, when a new sample

is presented to the trained network, the network associates it with the most similar pattern it

had learned. Subsequently, it was discovered that by using a Hopfield network to reproduce

a predefined successor of a pattern instead of the pattern itself, the network can be used as

a hetero-associative memory, capable of reproducing sequences of patterns (xi ⇒ xi+1).

The next step towards graph processing was to use Hopfield networks to learn the task of

reproducing all the successors (or predecessors) of a node, that is to learn the mapping

xi ⇒ succ[xi], where xi is the ith node label and succ[xi] denotes a vector obtained by

stacking together the labels of all the successors of node xi, one after another. For such a task the maximum outdegree of a node (the maximum number of its successors) has to be known prior to the network training. NIL patterns are used as extra successor labels whenever

a considered node xi has an outdegree smaller than the maximum value chosen. The last

and somewhat different application of Hopfield networks was to use a Hopfield network once

again as an auto-associative memory, used for retrieving whole graphs. In such a case the graph adjacency matrices (N × N) are encoded into a network having N(N − 1)/2 neurons [3],

where N is the number of nodes in a graph. To obtain an adequate generalisation, graphs

isomorphic to the training set are generated and fed to the network [17].

4.2. RAAM

The Recursive Auto-Associative Memory (RAAM) was introduced by Pollack in

1990 [1]. The RAAM model is a generalisation of the Hopfield network model [3], providing means to meaningfully encode directed positional acyclic graphs (DPAGs). A distinctive feature of the RAAM model is that it can be used to encode graphs with labeled terminal

nodes only. (The terminal nodes are nodes with outdegree equal to zero, that is nodes having

no children. In the case of trees, leaves are terminal nodes.) That is, no node other than the

terminal nodes of a graph may be labelled. No edge labels are permitted. The most straightforward domain of application for the RAAM model is thus natural language processing,

where sentences can be decomposed to syntax graphs.

The RAAM model is capable of:

— building a compressed representation of structured data

— building a meaningful representation: similar samples are represented in a similar way

— constrained generalisation: representing data absent in the training set

The RAAM model is composed of two units: compressor and reconstructor. Together

they form a three-layer feed-forward neural network which works as an auto-associative

memory. The compressor is a fully connected two-layer neural network with n input lines

and m output neurons. The number of output neurons, m, determines the size of a single encoded node representation. The number of input lines, n, must be a multiple of m, such that n = k·m, where k is the maximum outdegree of a node in the considered graphs. For each terminal node, its representation consists of its original label. For each non-terminal node i, its representation xi is built by feeding the compressor with the encoded representations of the ith node's children.

To assure that the compressed representation is accurate and lossless, it is fed to the

reconstructor. The reconstructor is also a fully connected two-layer neural network, however

it has m input lines and n output neurons. It is fed with compressed representations of nodes

and is expected to produce the original data that was fed to the compressor. This procedure

is repeated for all non-terminal nodes of a graph, until all encoded representations can be

accurately decoded into original data. More precisely, the representation xi of the ith node of

the graph is given by Eq. 4.1, where f denotes the function implemented by the compressor

unit, li - the label of ith node, xch[i] - a vector obtained by stacking representations of all

children of ith node one after another, k - the maximum outdegree of a node in the graph.


Figure 4.1. A sample graph that can be processed using RAAM: terminal nodes A, B, C, D; non-terminal node 1 with children A and B, node 2 with children C and D, and root node 3 with children 1 and 2

Figure 4.2. Training set for the example graph: the pairs (A,B), (C,D) and (x1,x2), each compressed into x1, x2 and x3 and reconstructed as (A',B'), (C',D') and (x1',x2')

xi = li if node i is terminal, f(xch[i]) otherwise (4.1)

A sample graph that can be encoded using the RAAM model is presented in Fig. 4.1. The

non-terminal nodes are enumerated for convenience only (only terminal nodes are labelled).

To encode the sample graph, the representations of nodes 1 and 2 must be built first.

The representation of node 1 is built by feeding the pair of labels (A,B) to the compressor

which encodes them into the representation x1. The representation x1 is then fed to the

reconstructor, which produces a pair of labels (A′,B′). If the resulting labels A′ and B′ are

not similar enough to the original labels A and B, the error is backpropagated through the

compressor-reconstructor three-layer network. Similarly, the pair (C,D) is processed by the

same compressor-reconstructor pair and compressed into the representation x2. Then, the

pair (x1,x2) is once again fed to the compressor, which produces x3, the representation

of the root node. This is also the compressed representation of the whole graph, from

which the whole graph can be reconstructed by using the reconstructor unit. The training

set, consisting of three label pairs, is presented in Fig. 4.2. The light grey areas denote the

compressor network, while the dark grey areas denote the reconstructor. Such a training set

(or a larger one if the dataset consists of more than one graph) must be repeatedly processed

by the RAAM model in the training phase. When the model is trained, the compression of

the whole graph occurs as presented in Fig. 4.3. Reconstruction of the graph is presented

in Fig. 4.4. It is worth mentioning that a trained RAAM model can be used to process graphs

with different structures.
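The GNU Octave sketch below is an illustrative rendering of the recursion of Eq. 4.1 for the graph of Fig. 4.1. The compressor is a single weight layer with k·m inputs and m tanh outputs, as described above; the random weights, label values and node numbering (internal nodes 1, 2, 3 stored as indices 5, 6, 7) are assumptions standing in for a trained compressor, not the thesis code.

1;
% RAAM encoding of the graph of Fig. 4.1 using Eq. 4.1.
function x = raam_encode(i, children, labels, W, b)
  if isempty(children{i})
    x = labels{i};                               % terminal node: its label
  else
    stacked = [];
    for c = children{i}                          % stack the children representations
      stacked = [stacked; raam_encode(c, children, labels, W, b)];
    endfor
    x = tanh(W * stacked + b);                   % compressed representation, size m
  endif
endfunction

m = 4; k = 2;                                    % representation size, max outdegree
W = 0.1 * randn(m, k*m); b = zeros(m, 1);        % placeholder compressor weights
labels   = {randn(m,1), randn(m,1), randn(m,1), randn(m,1), [], [], []};  % A, B, C, D
children = {[], [], [], [], [1 2], [3 4], [5 6]};  % indices 5, 6, 7 = nodes 1, 2, 3
x3 = raam_encode(7, children, labels, W, b);     % compressed code of the whole graph
disp(x3');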


Figure 4.3. Graph compression using trained RAAM model

Figure 4.4. Graph reconstruction from x3 using trained RAAM model

A significant feature of the RAAM model is that a small reconstruction error in the case of

non-terminal nodes may render the reconstruction of terminal nodes impossible. Therefore in

the process of training a RAAM classifier it is necessary to set the acceptable reconstruction

error value much smaller for the non-terminal nodes than for the terminal ones.

A major drawback of the RAAM model is the moving target problem. That is, a part of

the learning set (the representations x1 and x2 from the example) changes during training.

In such a case the training phase may not converge to an acceptable state [3]. However,

a different training schema is possible, similar to the BPTS algorithm [3]. (An extensive

description of the BPTS algorithm, encoding networks, and the shared weights technique is

provided in the following chapters.) An encoding network is built out of identical instances

of the compressor and reconstructor units (Fig. 4.5), with structure reflecting the structure

of the processed graph (if the dataset consists of multiple graphs, such procedure is repeated

for every graph in the dataset). All instances of the compressor unit share their weights and

all instances of the reconstructor unit share their weights - which is called the shared weights

technique. The labels of terminal nodes are fed to the processing network and the resulting

error can be backpropagated from the last layer using e.g. the Backpropagation Through

Structure [18] algorithm (BPTS). It is worth mentioning that the authors of such a modified RAAM model propose using an additional layer of hidden neurons in the compressor and reconstructor units. In such a case the light grey and dark grey areas on the figures would

denote not only neuron connections between two layers, but also an additional hidden layer.

Such a modification allows one to partially separate the problem of the data model complexity (i.e.


Figure 4.5. RAAM encoding network for the sample graph

how complex should the RAAM model be to properly compress the data) from the size of

terminal node labels, which directly affects the number of input lines to the compressor unit

and thus the compressed representation size.

The most important parameter of the RAAM model is the size of the compressed repre-

sentation. On one hand, the size should be large enough to contain all the necessary com-

pressed information about the encoded graph. On the other hand, it should be small enough

for the compression mechanism to build a minimal representation, which stores only the

necessary information about the dataset. If the size is too large, the trained model would

store redundant information, memorizing the training set. This would result in a poor ability

to process unseen data. Experiments with natural language syntax processing [1] proved

that when the size of the compressed representation is adequate for the problem, the RAAM model shows some constrained generalisation properties. That is, unseen data with

structure similar to the training set was processed properly by the trained RAAM model.

A drawback of the standard RAAM model is the termination problem. The reconstructor

can't distinguish between terminal representations (node labels) and compressed representations, which should be further reconstructed. To solve this problem, an additional encoding

neuron can be introduced (increasing the representation size by one), which takes a different

value for terminal and non-terminal representations [19].


4.3. LRAAM

The most important constraint of the RAAM model is the fact that only terminal nodes

of the processed graphs (DPAGs) can contain labels. This problem was addressed by the

Labeling RAAM model [2] (LRAAM, 1994), which separated the concepts of node labels

and node representations. In the RAAM model the terminal nodes are represented by their

labels. The LRAAM model introduced the concept of pointers, which was used to describe

a node representation which has to be learnt, regardless of whether the node is terminal or

not. The pointers are built by compressor units (FNN with two or more layers) and they

are decoded into graph structure by reconstructor units. More precisely, the pointer to ith

node of the graph is calculated according to Eq. 4.2, where xi stands for pointer to the ith

node, f is the function implemented by the compressor unit, li is the ith node label, xch[i] is

a vector obtained by stacking pointers to all children of ith node one after another and k is

the maximum outdegree of a node in the considered graph.

xi = f (li, xch[i]) (4.2)

Whenever a node outdegree is smaller than k (especially in the case of terminal nodes),

the missing child pointers are substituted by the NIL pointer, a special value representing the

lack of node. The value of the node label is stacked together with all the child pointers values

to form an input vector which is fed to the compressor unit. The number of output neurons

of the compressor unit (the size of a pointer xi) is m. Let's denote the size of the label li by p. The compressor unit must have n = p + k·m input lines, which is also the number

of reconstructor unit output neurons. The possibility of describing each graph node with its

label provides a simple solution to the termination problem [2]. An additional value can be

appended to each label, stating if the node is a terminal node or not. By using this method

no change in the LRAAM model is needed.
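The following GNU Octave sketch illustrates how the compressor input of Eq. 4.2 can be assembled for one node: the node label is stacked with k child pointers, and missing children are replaced by a NIL pointer. The NIL value (a vector of -1s) and all sizes are assumptions chosen for illustration only.

1;
% Assembling the LRAAM compressor input vector for a single node (Eq. 4.2).
function v = lraam_input(label, child_ptrs, k, m)
  nil_ptr = -ones(m, 1);                 % stands for a missing child
  v = label;                             % the p label values come first
  for j = 1:k
    if j <= numel(child_ptrs)
      v = [v; child_ptrs{j}];            % pointer to an existing child
    else
      v = [v; nil_ptr];                  % pad up to the maximum outdegree k
    endif
  endfor                                 % the result has p + k*m entries
endfunction

p = 3; m = 4; k = 2;                     % label size, pointer size, max outdegree
v_terminal = lraam_input(randn(p,1), {}, k, m);            % terminal node: two NILs
v_internal = lraam_input(randn(p,1), {randn(m,1)}, k, m);  % one child, one NIL
printf("input sizes: %d and %d (expected %d)\n", numel(v_terminal), numel(v_internal), p + k*m);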

Just like the RAAM model, the LRAAM model experiences the problem of the moving

target. The same technique of shared weights can be applied [3], which results in building

a large encoding network composed of identical units. A sample graph and the encoding

network obtained by cloning the compressor and reconstructor units to reflect the sample

graph structure are presented in Fig. 4.6. As A and B are terminal nodes, their labels are fed

to the compressor unit together with NIL pointers representing the missing nodes. Then the compressed representation xA is built for node A and the compressed representation xB is built for node B. The representations xA and xB are then fed together with the label C

to the compressor, to build the representation xC of node C, which is also the compressed

representation of the whole graph.

An extension of the LRAAM model exists for cyclic graphs [3]. Whenever an edge

forming a cycle is found, it is converted to a single time unit delay (denoted by q−1 [7]).


Figure 4.6. LRAAM encoding network for the graph shown

A sample cyclic graph and the resulting LRAAM model with one time delay are presented

in Fig. 4.7. The graph presented is similar to the graph used in the previous example, with

the exception of the directed edge A⇒C. The additional edge forms a cycle so it must be

represented as a time delay. Such an approach makes it possible to deal with cyclic graphs; however, this is achieved at the expense of model simplicity. The shared weights technique

made it possible to treat the encoding tree structure as a single feed-forward neural network

with shared weights. However, after adding time delays the training of the network will have

to consist of multiple time steps, repeated until convergence of the pointer values is reached.

A distinctive feature of the LRAAM model is that a compressed representation is built for

a given dataset (consisting of DPAGs) and the correctness of the representation is verified by

mirroring the compression process and reconstructing the original data. When the obtained

representation is accurate enough, the output of the LRAAM model is "frozen" and fed to a

separate classifier which processes it and yields classification or regression results. The same

applies to any new, unseen data, which is fed to the trained LRAAM model and, if the model

was built correctly, is compressed into a meaningful representation. Such separation of

representation building and processing can be attractive for two reasons. First, the LRAAM

model parameters, such as the size of the representation, can be tuned in a straightforward

manner by observing the minimal value below which the original data cannot be

accurately reconstructed from the compressed vectors. Secondly, any vectorial classifier

used for unstructured data can be used to process such compressed representation. On the

other hand, it is often safe to presume that not all the data contained in node labels is crucial


Figure 4.7. LRAAM encoding network for the cyclic graph shown


for the classification/regression process for which the compressed representation is needed.

Such an approach lies beyond the standard LRAAM model and was introduced in the folding

architecture model [18].

4.4. Folding architecture and BPTS

The ideas of folding architecture and Backpropagation Through Structure (BPTS) were

first introduced in 1996 [20] [18]. The model of the folding architecture is similar to that of

LRAAM and is capable of processing rooted DPAGs. (That is DPAGs with a distinguished

root node. For each DPAG such a node can be selected.) The folding architecture model

is a feed-forward, fully connected multi-layer neural network, consisting of two subsequent

parts performing different tasks: the folding network and the transformation network. The

folding network is similar to the LRAAM compressor unit; its input layer consists of p + k·m input lines, p for the processed node label and k·m for the compressed representations of the node's children. The folding network can consist of any number of sigmoid neuron

layers and its last layer produces the compressed representation of a node, of size m. The

folding network is applied to every node in the graph, starting from the terminal nodes, so

as to provide the internal graph nodes with the compressed representations of the nodes below them. The transformation network is applied to the root node only. It can consist of any

number of sigmoid neuron layers and an output layer. It takes as input the compressed

representation of the root node and produces an output, which should match the expected

output for a graph. Therefore, the transformation network is used to perform classification

or regression tasks in terms of whole graphs. Let’s denote the folding network as f and the

transformation network as g. The function f can be described by Eq. 4.3, where xi stands for

the ith node representation, li is the ith node label and xch[i] is a vector obtained by stacking

representations of all children of ith node one after another. The g function can be described

by Eq. 4.4, where or is the output of the root node.

xi = f (li, xch[i]) (4.3)

or = g(xr) (4.4)

The original idea behind the folding architecture is that the compressed representation is

built only for classification purposes and is fed directly to the transformation network. The

output of the transformation network is then compared with the expected output and the error

can be backpropagated through the folding architecture network by using a gradient-descent

procedure, Backpropagation Through Structure. BPTS was invented as a generalisation of

the Backpropagation Through Time method (BPTT [21]), which in turn was invented for


Figure 4.8. Virtual unfolding, reflecting the sample graph structure

error backpropagation in recurrent neural networks. BPTS can be described in terms of the

unfolded network. The unfolded network is never built physically but can be imagined as a

graph built of folding network instances in a way which reflects the structure of the processed

graph, with the transformation network added on top of it (attached to the representation of

the root node). An unfolding network for a sample graph is presented in Fig. 4.8. The light

grey areas are instances of the folding network, while the dark grey area is the transformation

network, which for the root node C produces output oC.

To explain the idea of BPTS it is necessary to briefly summarize the idea of BPTT

(a detailed explanation can be found in RNN-concerned publications, e.g. [22]). Let’s

consider a fully-connected recurrent neural network, designed for classifying sequences of

samples of size m. The network consists of a single layer of n units with n× n recurrent

connections, producing an output y(t) at time t. Let xs(t) denote the m-tuple of input signals

corresponding to the sample fed at time t. Further, let x(t) be the merged input fed to the

network at time t, obtained by concatenating the vectors xs(t) and y(t). To distinguish

between the elements of vector x(t) corresponding to xs(t) and to y(t), let’s introduce two

subsets of indices: I and U (Eq. 4.5).

xj(t) = xsj(t) if j ∈ I; yj(t) if j ∈ U (4.5)

Let wkj denote the network weight on the connection to the kth neuron from input xj, netk(t)

denote the weighted sum of neuron inputs fed to the activation function of the kth neuron,

fk (Eq. 4.6) and J(t) denote the overall mean square error of the network at time t.

netk = ∑j∈(I∪U) wkj xj (4.6)

yk = fk(netk) (4.7)


J(t) = −(1/2) ∑k∈U [ek(t)]^2 (4.8)

ek(t) = dk(t) − yk(t) (4.9)

Let’s consider a recurrent network which was operating from a starting time t0 up to time

t. We may represent the computation process performed by the network by unrolling the

network in time, that is building a feed-forward neural network made of identical instances

of the considered recurrent neural network, one instance per time step τ, τ ∈ (t0, t]. To

compute the gradient of J(t) at time t it is necessary to compute the error values εk(τ) and

δk(τ) for k ∈U and τ ∈ (t0, t] by means of equations 4.10, 4.11 and 4.12.

εk(t) = ek(t) (4.10)

δk(τ) = f'k(netk(τ)) εk(τ) (4.11)

εk(τ−1) = ∑j∈U wjk δj(τ) (4.12)

Then, the gradient of J(t) is calculated with respect to each weight wij by means of Eq. 4.13.

∂J(t)/∂wij = ∑τ=t0+1..t δi(τ) xj(τ−1) (4.13)

At time t an external error e(t) is injected to the network, usually being the difference

between the trained network output at time t: y(t) and the expected output d(t) (Eq. 4.9).

The subsequent steps compute the error ε(τ) by backpropagating the original error through

the layers of the unrolled neural network.
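The GNU Octave sketch below follows Eqs. 4.6-4.13 for a single layer of n tanh units unrolled over T time steps. The sizes, the random weights and the random data are assumptions; the sketch only illustrates the unrolling and the backward error flow, under an indexing in which the stored merged input of step τ plays the role of x(τ−1).

1;
% Illustrative BPTT pass for a small fully-connected recurrent layer.
n = 3; m = 2; T = 5;                     % units, external input size, time steps
W  = 0.1 * randn(n, m + n);              % weights on the merged input [xs(t); y(t-1)]
xs = randn(m, T);                        % external input sequence
d  = randn(n, 1);                        % expected output at the final time step

% forward pass: store the merged inputs and net values of every unrolled layer
y = zeros(n, 1); X = zeros(m + n, T); NET = zeros(n, T);
for tau = 1:T
  X(:, tau)   = [xs(:, tau); y];         % merged input x(tau)
  NET(:, tau) = W * X(:, tau);           % Eq. 4.6
  y           = tanh(NET(:, tau));       % Eq. 4.7 with fk = tanh
endfor

% backward pass: inject the external error at time T, then go back through time
eps_k = d - y;                           % Eqs. 4.9-4.10
dJdW  = zeros(size(W));
for tau = T:-1:1
  delta = (1 - tanh(NET(:, tau)).^2) .* eps_k;   % Eq. 4.11 (f' of tanh)
  dJdW += delta * X(:, tau)';                    % Eq. 4.13, one summed term
  eps_k = W(:, m+1:end)' * delta;                % Eq. 4.12, recurrent weights only
endfor
disp(dJdW);                              % gradient of J(T) w.r.t. all weights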

The BPTS method implements the BPTT algorithm. Backpropagation starts at the last

layer of the virtual unfolding network, where the classification/regression error is calculated

(the last layer of the transformation network applied to the root node). The error is injected

to this layer and backpropagated using the BPTT algorithm down to the first layer of the

folding network applied to the root node. The error is then backpropagated to the last layers

of the folding network applied to the root's children, as if there were a physical connection.

Such backpropagation continues down to the first layers of folding network applied to the

terminal nodes.

The folding architecture model introduced important new ideas in the domain of connectionist graph processing models. First of all, the representation building model is simpler

than the LRAAM model and the folding architecture converges much faster than LRAAM


for the same datasets [20]. Secondly, three important concepts were adapted from the domain

of recurrent neural networks: the error injection, BPTS and the unfolding of the network as

a generalisation of unrolling a recurrent neural network.

4.5. Generalised recursive neuron

The generalised recursive neuron, introduced in 1997 [23] is a generalisation of the

recurrent neuron, which in turn is used in recurrent neural networks (RNNs). It was created

to provide an elementary component for the graph processing models which would be by

definition better suited to solve graph processing tasks than the standard neuron. It is beyond

the scope of this thesis to describe in detail the idea itself and its applications. Nevertheless,

this summary of the connectionist models would be incomplete without mentioning the

generalised recursive neuron, as it was used in some of the following models instead of

the common neural network neuron, yielding promising results [7].

A generalised recursive neuron has two kinds of inputs:

— plain neuron inputs, which are fed with elements of the currently processed node label

— recursive inputs, which are fed with the memorized output of the neuron for all children

nodes of the currently processed node

In such a way, the neuron output changes after each training algorithm iteration, according

to its output for all the children nodes.

4.6. Recursive neural networks

The folding architecture model had a large impact on a theory introduced in 1998, the

structural transduction formalism [7]. It is beyond the scope of this thesis to describe

the formalism itself; let's mention, however, those of its aspects that highly affected the subsequent

connectionist models described. The structural transduction, in general, is a relation which

maps a labelled DPAG (only node labels, no edge labels) into another DPAG. The type of

transductions that the authors focus on are IO-isomorph transductions, that is transductions

that don’t change the graph topology, only the node labels. (According to the authors

managing transductions which are not IO-isomorph is highly nontrivial and is an open

research problem.) Let’s describe an IO-isomorph transduction. Let Gs be the original

DPAG with labelled nodes. Let Go be a graph with the same topology as Gs but with node

labels replaced by expected node outputs for each node (it’s a node classification/regression

problem). A transduction Gs ⇒ Go can be described in terms of two functions, f and g,

where f is the state transition function, which builds the representation (the state) x of the

graph Gs for the model and g is the output function, which produces the expected output

graph Go according to the representation x and the original graph Gs. More precisely, for


Figure 4.9. A sample acyclic graph and the corresponding encoding network

each node i belonging to the original graph Gs its state xi and output oi are defined by

Eq. 4.14 and 4.15, where li is the ith node label and xch[i] is a vector obtained by stacking

representations of all children of ith node one after another. In the original equations the ith

node itself was also an argument of both functions, however, it was unnecessary from the

point of view of recursive neural networks and therefore was omitted.

xi = f (li, xch[i]) (4.14)

oi = g(li, xi) (4.15)

The transduction can be implemented e.g. by a hidden recursive model (HRM) or by a

recursive neural network (a generalisation of a recurrent neural network, which is able to

process not only sequences, but also DPAGs). In the case of a recursive neural network

the functions f and g are implemented by two feed-forward neural networks. Identical

instances of the f network are connected according to the Gs graph structure, creating the

encoding network. (The encoding network is the recursive neural network unfolded through

the structure of the given DPAG.) Calculation of the state x is performed by applying the

f network to the terminal nodes of Gs and then proceeding up to the root, according to

the encoding network topology. When the state calculation is finished, the g network is

applied to every node state xi, producing the requested output oi. A sample graph and the

corresponding encoding network are presented in Fig. 4.9.
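As a concrete illustration of Eqs. 4.14 and 4.15, the GNU Octave sketch below computes the states bottom-up with f and then produces an output for every node with g, for a three-node graph. The one-layer f and g, their random weights and all sizes are assumptions standing in for trained networks, not the thesis implementation.

1;
% IO-isomorph transduction on a toy DAG: node 1 is the root, nodes 2 and 3 terminal.
p = 2; m = 3; k = 2; q = 1;              % label, state, max outdegree, output sizes
Wf = 0.1 * randn(m, p + k*m);            % stands in for the f network
Wg = 0.1 * randn(q, p + m);              % stands in for the g network

labels   = {randn(p,1), randn(p,1), randn(p,1)};
children = {[2 3], [], []};              % adjacency of the toy graph
nil      = zeros(m, 1);                  % padding for missing children

x = cell(1, 3); o = cell(1, 3);
for i = 3:-1:1                           % children are processed before their parent
  xch = [];
  for j = 1:k                            % stack child states, pad with nil up to k
    if j <= numel(children{i})
      xch = [xch; x{children{i}(j)}];
    else
      xch = [xch; nil];
    endif
  endfor
  x{i} = tanh(Wf * [labels{i}; xch]);    % Eq. 4.14: xi = f(li, xch[i])
  o{i} = Wg * [labels{i}; x{i}];         % Eq. 4.15: oi = g(li, xi)
endfor
disp(o{1});                              % output produced at the root node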

As various kinds of neural networks can be used as the f function (first- or second-order recurrent networks, etc.), the recursive neural network model can become very complex and

computationally powerful. However, for the scope of this work, it is sufficient to mention

that the concepts of state transition function, output function and encoding network unfolded

through structure (originating from the folding architecture) were later reused in the Graph

Neural Network model. Another important feature of this model is a successful adaptation


Figure 4.10. A sample acyclic graph and the corresponding encoding network

of ideas originating from the recurrent neural network domain, just as in the case of the folding architecture. The authors of both models suggested that adaptation of other RNN ideas may also

be possible, which could lead to novel graph processing solutions [18] [7].

4.7. Graph machines

In 2005, another model for DPAG processing was proposed, the Graph Machines [10].

The model is similar to that of the folding architecture and recursive neural networks. A function

f is evaluated at each node of a graph except the root node. Then, the output function

g, which can be the same as f , is evaluated at the root node, and yields an output for

the whole graph. The function f can be described by Eq. 4.16, where xi stands for the

ith node representation, li is the ith node label and xpa[i] is a vector obtained by stacking

representations of all parents of ith node one after another. The g function can be described

by Eq. 4.17, where or is the output of the root node.

xi = f (li, xpa[i]) (4.16)

or = g(lr, xpa[r]) (4.17)

The instances of the f function are connected together to form an encoding network for

a given graph, which reflects the structure of the graph. Such an encoding network is built for

every graph in the training set. The resulting set of encoding networks is called a graph

machine. A single encoding network and the corresponding sample graph are shown in

Fig. 4.10, where the light grey areas denote instances of the f network and the dark grey area

denotes the g network.

The instances of the f function share weights among the instances belonging to a single encoding network, and among all the encoding networks. The instances of the g function share

weights among all the encoding networks. (This is called the shared weights technique.)

The f and g functions can be implemented by neural networks. Their training occurs on all


the graphs from the training set simultaneously. That is, ∂J/∂w is summed over all function

instances among all the graphs, where w is the set of all weights belonging either to f or

g function. A standard gradient optimization algorithm can be used for the training phase

(Levenberg-Marquardt, BFGS, etc.).

The authors of graph machines explicitly stated two fundamental rules which had previously been implicitly applied in various graph processing connectionist models:

— The structure of a model should reflect the structure of the processed graph.

— The representation of structured data should be learnt instead of being handcrafted. [3]

5. Graph neural network implementation

The Graph Neural Network model [8] (GNN) is a quite recent (2009) connectionist

model, based on recursive neural networks and capable of classifying almost all types of

graphs. The main difference between the GNN model and previous connectionist models is

the possibility of processing directly nonpositional and cyclic graphs, containing both node

labels and edge labels. Although some similar solutions were introduced in an earlier model,

the RNN-LE [6] in 2005, it was the GNN model that combined several techniques with a

novel learning schema to provide a direct and flexible method for graph processing.

5.1. Data

The GNN model is built once for a training set of graphs. In fact, a whole set of graphs

can be merged into one large disconnected graph, which can then be fed to the model. For

a given dataset each node n is described by a node label ln of fixed size |ln| ≥ 1. Each

directed edge u⇒ n (from node u to node n) is described by an edge label l(n,u) of fixed size

|l(n,u)| ≥ 0. To deal with both directed and undirected edges, the authors propose to include

a value dl in each edge label, denoting the direction of an edge. However, for a maximally

general model, in this implementation all the edges were considered as directed. Undirected

edges were encoded prior to processing as pairs of directed edges with the same labels.

A GNN model can deal with both graph-focused and node-focused tasks. In

graph-focused tasks, for each graph an output on of fixed size |on| ≥ 1 is sought, which can

denote e.g. the class of the graph. In the domain of chemistry, where each graph describes a

chemical compound, this could describe e.g. the reactivity of a compound. In node-focused

tasks, such output on is sought for every node in every graph. An example of such task can

be the face localisation problem, where for each region of an image the classifier should

determine whether it is a part of the face or not. In the rest of this thesis, the node-focused task is

described, unless stated otherwise.

For this implementation each graph was represented by three .csv files. Each ith row of

the nodes file contained the ith node label (comma-separated). Each row of the edges file

contained information about a directed edge u⇒ n: the u node index (row number in nodes

file), the n node index and the edge label (comma-separated). Each ith row of the outputs file

contained the ith node output (comma-separated). An example is provided in Appendix A.
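The GNU Octave fragment below sketches how such a graph could be loaded; the file names nodes.csv, edges.csv and outputs.csv are illustrative assumptions, and the exact format with a full example is given in Appendix A.

% Minimal sketch of loading one graph from the three .csv files described above.
nodes   = csvread("nodes.csv");          % row i holds the label of node i
edges   = csvread("edges.csv");          % row: [u, n, edge label...] for an edge u => n
outputs = csvread("outputs.csv");        % row i holds the expected output of node i

num_nodes   = rows(nodes);
sources     = edges(:, 1);               % u: source node indices
targets     = edges(:, 2);               % n: target node indices
edge_labels = edges(:, 3:end);           % the remaining columns form the edge labels
printf("%d nodes, %d directed edges\n", num_nodes, rows(edges));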


5.2. Computation units

The GNN model consists of two computation units, fw and gw, where the w subscript

denotes the fact that both units are functions parametrized by a vector of parameters w, which

is separate for the f and for the g function. The fw unit is used for building the representation (the state) xn of a single node n. The gw unit is used for producing the output on for a node n, based on its representation xn. For a graph-focused task, the representation of the root

node is fed to the gw function to produce an output og for the whole graph. It is important

to note that for a given classifier there is only one fw unit and one gw unit (like in the

recursive neural network model). All instances of the fw unit share their weights and all

instances of the gw unit share their weights.

Let’s denote by ne[n] the neighbors of node n, that is such nodes u that are connected

to node n by a directed edge u⇒ n. Let’s further denote by co[n] the set of directed edges

pointing from ne[n] towards node n (edges u⇒ n). The general forms of fw and gw functions

are defined by equations Eq. 5.1 and Eq. 5.2, where ln denotes the nth node label, lco[n]

denotes the set of edge labels from co[n], xne[n] denotes states of nodes from ne[n], and lne[n]

denotes their labels.

xn = fw(ln, lco[n], xne[n], lne[n]) (5.1)

on = gw(xn, ln) (5.2)

For this implementation, minimal forms of these definitions were chosen:

xn = fw(ln, lco[n], xne[n]) (5.3)

on = gw(xn) (5.4)

These forms were chosen to prove that the model is capable of building a sufficient

representation of each node. That is, the model should be able to encode all the necessary

information from a node label ln into the state xn. This approach later proved to be successful.

From two forms of the fw function mentioned in the original article [8], the nonpositional

form was chosen. The reason behind this choice was to provide the model with the most

general function possible, which could deal with both positional and nonpositional graphs.

The nonpositional form was also the one yielding better results in the experiments conducted

by the authors [8]. The final definitions of the fw and gw functions are shown below. All

instances of the hw unit share their weights.


xn = ∑u∈ne[n] hw(ln, l(n,u), xu) (5.5)

on = gw(xn) (5.6)

The units hw and gw were implemented as fully-connected three layer feed-forward

neural networks (input lines performing no computation, a single hidden layer and an output

layer). For both units the hidden layer consisted of tanh neurons. For the hw unit the output

layer consisted of tanh neurons. That’s because the output of the hw unit contributes to the

state value and therefore should consist of bounded values only. For the gw unit the output

layer could consist either of tanh or linear neurons, depending on the values of on that were

to be learned.

At this point it’s worth mentioning that the final value of xn is calculated as a simple sum

of hw outputs. This corresponds to a situation where all the hw output values are passed to

a neural network in which the set of weights corresponding to a single hw input are shared

amongst all the hw inputs. If we consider that a three-layer FNN used as the hw unit is already a universal approximator, the use of such an additional neural network which just sums all

hw values using the same shared set of weights is unnecessary. A simple sum should be

sufficient, and experimental results showed that this assumption holds.

The fw and gw units are presented in Fig. 5.1 and Fig. 5.2, where the comma-separated

list of inputs stands for a vector obtained by stacking all the listed values one after another.
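As a concrete illustration of Eqs. 5.5 and 5.6, the GNU Octave sketch below computes xn as the sum of hw outputs over the incoming edges of a node and then applies gw. The unit sizes and the random placeholder weights are assumptions standing in for trained units, not the thesis implementation.

1;
% One evaluation of fw (Eq. 5.5) and gw (Eq. 5.6) for a node with two neighbours.
function y = unit(W1, W2, v)             % hidden tanh layer followed by tanh output
  y = tanh(W2 * tanh(W1 * v));
endfunction

p = 2; q = 1; s = 3; hidden = 5;         % node label, edge label, state, hidden sizes
Wh1 = 0.1*randn(hidden, p + q + s); Wh2 = 0.1*randn(s, hidden);   % hw weights
Wg1 = 0.1*randn(hidden, s);         Wg2 = 0.1*randn(1, hidden);   % gw weights

ln = randn(p, 1);                        % label of node n
ne = {struct("l_nu", randn(q,1), "xu", randn(s,1)), ...
      struct("l_nu", randn(q,1), "xu", randn(s,1))};   % two incoming neighbours of n

xn = zeros(s, 1);
for u = 1:numel(ne)                      % Eq. 5.5: sum of hw over ne[n]
  xn += unit(Wh1, Wh2, [ln; ne{u}.l_nu; ne{u}.xu]);
endfor
on = unit(Wg1, Wg2, xn);                 % Eq. 5.6 (tanh output variant of gw)
printf("on = %f\n", on);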

Figure 5.1. The fw unit for a single node and one of the corresponding edges

Figure 5.2. The gw unit for a single node


The weights of both the hw and gw units were initialised according to the standard

neural network practice, to avoid saturation of any tanh activation function: netj = ∑i wji yi ∈ (−1, 1), where netj is the weighted input to the jth neuron, yi is the ith input value and wji is the weight corresponding to the ith input. The initial input weights of the gw unit were divided by an additional factor, i.e. the maximum node indegree of the processed graphs, to take into consideration the fact that the input of the gw unit consists of a sum of hw outputs. All the input

data (node and edge labels) was normalised appropriately before feeding to the model.
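The GNU Octave sketch below shows one way such an initialisation could look, assuming inputs normalised to [-1, 1]: each row of weights is scaled so that the weighted sum netj stays within (-1, 1), and the gw input weights are additionally divided by the maximum node indegree. The layer sizes and max_indegree below are assumed values, not those used in the thesis.

1;
% Weight initialisation keeping tanh units out of saturation.
function W = init_layer(n_out, n_in)
  W = (2 * rand(n_out, n_in) - 1) / n_in;   % |net_j| <= sum_i |w_ji| * 1 <= 1
endfunction

hidden = 5; h_inputs = 8; state = 3; max_indegree = 4;
Wh1 = init_layer(hidden, h_inputs);                 % hw hidden-layer weights
Wg1 = init_layer(hidden, state) / max_indegree;     % gw input weights, scaled down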

5.3. Encoding network

Graph processing by a GNN model consists of two steps: building the representation xn

for each node and producing an output on. As the representation of a single node depends

on other nodes' representations, an encoding network is built for every graph, reflecting the

structure of the graph. The encoding network consists of instances of the fw unit connected

according to the graph structure with a gw unit attached to every fw unit. A sample graph

and its encoding network are presented in Fig. 5.3. It can be seen that, as a cyclic dependence

exists in the sample graph, the calculation of the node states should be iterative, until at some

point convergence is reached.

[Figure: a three-node graph (nodes 1, 2, 3) and its encoding network of f units connected according to the graph structure, each with an attached g unit.]

Figure 5.3. A sample graph and the corresponding encoding network

5.4. General training algorithm

Let’s denote by x the global state of the graph, that is the set of all xn for every node

n in the graph. Let’s denote by l and o the sets of all labels and all outputs, respectively.

Let’s further denote by Fw and Gw (global transition function and global output function)

the stacked versions of fw and gw functions, respectively. Now, equations 5.3 and 5.4 can

be rewritten as Eq. 5.7 and Eq. 5.8.


x = Fw(l, x)   (5.7)

o = Gw(x)   (5.8)

The GNN training algorithm can be described as follows:

1. initialize hw and gw weights

2. until stop criterion is satisfied:

a) initialize x randomly

b) FORWARD: calculate x= Fw(l, x) until convergence

c) BACKWARD: calculate o= Gw(x) and backpropagate the error

d) update hw and gw weights

The stop criterion used in this implementation was a maximum number of iterations.
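The FORWARD step above amounts to a fixed-point iteration of the global transition function. The toy sketch below illustrates the stopping rule only; the real Fw is built from hw instances (Listing B.2), so here it is replaced by an arbitrary linear contraction just to keep the example self-contained.

% Toy illustration of the FORWARD step: iterate x = Fw(l, x) until the
% state stops changing. Fw is replaced by a made-up linear contraction.
N = 3; s = 2;                        % 3 nodes, state size 2
W = 0.4 * eye(N * s);                % toy contraction (norm < 1)
l = rand(N * s, 1);                  % stacked "labels"
Fw = @(x) W * x + l;

x = rand(N * s, 1);                  % random initial state
minStateDiff = 1e-8; maxForwardSteps = 200; nSteps = 0;
do
  lastX = x;
  x = Fw(lastX);
  nSteps = nSteps + 1;
until (max(abs(x - lastX)) <= minStateDiff) || (nSteps > maxForwardSteps)
printf('converged after %d steps\n', nSteps);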

5.5. Unfolded network and backpropagation

To solve the problem of cyclic dependencies, the GNN model adopts a novel learning algorithm for the encoding network. The encoding network is virtually unfolded through time until, at time tm, the state x converges to a fixed point x of the function Fw. Then the output o is calculated. Such an unfolded network for the sample graph is presented in Fig. 5.4. Each time step consists of evaluating the fw function at every node. Importantly, the connections between nodes are taken into consideration only between time steps. In this way, the problem of cycles ceases to exist and the processed graph may even be fully connected.

[Figure: the encoding network unfolded through time steps t0, t1, …, tm; f units are evaluated for every node at each step and the g units produce the outputs at tm.]

Figure 5.4. Unfolded encoding network for the sample graph


After the output o is calculated, the error en = (dn − on)² is injected into the corresponding gw unit for every node n, where dn denotes the expected node output. The error is backpropagated through the gw layer, yielding the value ∂ew/∂o · ∂Gw/∂x(x). That value is backpropagated through the unfolded network using the BPTT/BPTS algorithm. Additionally, at each time step the ∂ew/∂o · ∂Gw/∂x(x) error is injected into the fw layer, as presented in Fig. 5.5. In this way the error backpropagated through the fw layer at time ti comes from two sources. First, it is the output error of the network, ∂ew/∂o · ∂Gw/∂x(x). Secondly, it is the error backpropagated from the subsequent time layers of the fw unit, from all nodes n connected with the given node u by an edge u ⇒ n.

[Figure: the unfolded network of Fig. 5.4 with the output error injected into the f layer at every time step t0, t1, …, tm.]

Figure 5.5. Error backpropagation through the unfolded network

By injecting the same error ∂ew/∂o · ∂Gw/∂x(x) at each time step, an important assumption is made which leads to a simplification of the whole backpropagation process. If the state x converged to a fixed point x of the function Fw at time tm, then it can be safely assumed that using the value x at every previous time step ti, instead of x(ti), would yield the same result at time tm.

Storing the intermediate values of x(ti) and backpropagating the error directly using the BPTT/BPTS algorithm would be memory consuming. However, due to the assumption that x(ti) = x, a different backpropagation algorithm can be used [8], originating from the domain of recurrent neural networks: the Almeida-Pineda algorithm [21], [22].

Basically, the modified Almeida-Pineda algorithm consists of initialising an error accumulator z(0) = ∂ew/∂o · ∂Gw/∂x(x) and then accumulating the error by backpropagating z(j) through the fw layer until its value converges to z. At each step j the additional error ∂ew/∂o · ∂Gw/∂x(x) is injected, as was shown previously. If the state calculation converged, the error calculation is guaranteed to converge too. The number of iterations needed for the error accumulator z to converge can be different from the number of time steps needed for the state x to converge; in the conducted experiments it was usually much smaller.
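The core of the procedure can be sketched as follows; A and b are replaced by small random matrices so the example stands alone, whereas in the implementation they are computed by calculatea and calculateb (Listings B.6 and B.7).

% Sketch of the Almeida-Pineda style error accumulation: with A = dFw/dx
% at the fixed point and b = dew/do * dGw/dx, iterate z <- z*A + b until
% z stabilises, instead of storing every intermediate state x(ti).
N = 3; s = 2;
A = 0.1 * rand(N * s);               % toy Jacobian of a contraction
b = rand(1, N * s);                  % toy injected output error
z = b; minErrorAccDiff = 1e-8; maxBackwardSteps = 200; nSteps = 0;
do
  lastZ = z;
  z = z * A + b;                     % one accumulation step
  nSteps = nSteps + 1;
until (max(abs(z - lastZ)) <= minErrorAccDiff) || (nSteps > maxBackwardSteps)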


5.6. Contraction map

In the previous paragraphs it was stated that a fixed point x of the function Fw is sought. However, how can it be assumed that such a fixed point exists for the function Fw? How can it be assured that it will be reached by iterating x = Fw(x) from a random initial x?

All of the above can be assured by making Fw a contraction map, i.e. a function Fw for which d(Fw(x1), Fw(x2)) ≤ µ · d(x1, x2) for some constant µ < 1, where d(x, y) is a distance function. For this implementation the distance function d(x, y) = maxi(|xi − yi|) was chosen, as it is independent of the state size and therefore of the number of nodes in a given graph. The Banach Fixed Point Theorem states that a contraction map Fw(x) has the following properties:

— it has a single fixed point x

— it converges to x from every starting point x(t0)

— the convergence to x is exponentially fast [8].

How can it be assured that Fw, a function composed of neural network instances, is actually a contraction map? The authors propose to impose a penalty whenever the elements of the Jacobian ∂Fw/∂x suggest that Fw is no longer a contraction map.

Let A = ∂Fw/∂x(x, l) be a block matrix of size N × N with blocks of size s × s, where N is the number of nodes in the processed graph and |xn| = s is the state size for a single node. A single block An,u measures the influence of node u on node n if an edge from u to n exists, and is zeroed otherwise. Let I_u^j denote the influence of node u on the jth element of the state xn (Eq. 5.11). The penalty pw added to the network error ew is defined by Eq. 5.9.

pw = ∑u∈N ∑j=1..s L(I_u^j, µ)   (5.9)

L(y, µ) = y − µ if y > µ, 0 otherwise   (5.10)

I_u^j = ∑(n,u) ∑i=1..s |(An,u)i,j|   (5.11)

Based on these equations, the value of ∂pw/∂w is calculated, and the final error derivative for the hw network is calculated as ∂ew/∂w + ∂pw/∂w. It can be seen from Eq. 5.10 that the term ∂pw/∂w affects only those weights that cause an excessive impact (larger than µ), and the value of such a penalty is proportional to the value of I_u^j. The eagerness to impose the penalty and the penalty value are inversely proportional to the value of the contraction constant µ. The impact of the contraction constant on the training process is described in section 6.3.
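For illustration, the penalty of Eq. 5.9-5.11 can be computed from the block Jacobian as in the sketch below (a toy random matrix stands in for A; the actual computation, together with its derivative ∂pw/∂w, is in Listing B.8).

% Sketch of the penalty pw (Eq. 5.9-5.11): each I_u^j is a column sum over
% the block column of source node u; values above mu are penalised.
N = 3; s = 2; mu = 0.9;
A = rand(N * s);                     % toy block Jacobian dFw/dx
pw = 0;
for u = 1:N
  cols = (u - 1) * s + 1 : u * s;    % block column of node u
  I = sum(abs(A(:, cols)), 1);       % I_u^j for j = 1..s
  pw = pw + sum(max(I - mu, 0));     % L(y, mu) = y - mu if y > mu, else 0
end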


5.7. RPROP algorithm

The authors of the GNN algorithm suggest the RPROP algorithm [24] as the weight update algorithm, as an efficient gradient descent strategy. The basic idea of RPROP is to use only the sign of the original weight derivatives ∂ew/∂w. The actual weight updates are calculated by the RPROP algorithm according to the past behaviour of ∂ew/∂w, which results in fast descent along monotonic gradient slopes, small steps in the proximity of a minimum and reverting updates that caused jumping over a local minimum. In the case of the GNN, the RPROP algorithm should be used not only for its efficiency, but also as a way of dealing with the unpredictable behaviour of the ∂pw/∂w term. As experiments showed, the value of ∂pw/∂w can be larger than the original ∂ew/∂w by several orders of magnitude, which could severely disturb the learning process if RPROP were not used. The RPROP algorithm was implemented using the standard recommended values [24], with the exception of ∆max, which was set to 1 to avoid large weight changes.
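The core RPROP rule for a single weight is sketched below with the standard constants from [24] and ∆max = 1; the full per-weight bookkeeping, including reverting the previous update after a sign change, is in Listing B.9.

% Sketch of the RPROP step-size adaptation for a single weight (illustrative)
etaPlus = 1.2; etaMinus = 0.5; deltaMin = 1e-6; deltaMax = 1;
grad = 0.3; prevGrad = 0.1; delta = 0.1; w = 0.5;

if grad * prevGrad > 0
  delta = min(delta * etaPlus, deltaMax);    % same sign: larger step
elseif grad * prevGrad < 0
  delta = max(delta * etaMinus, deltaMin);   % sign change: smaller step
  grad = 0;                                  % avoid double punishment
end
w = w - sign(grad) * delta;                  % only the gradient sign is used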

5.8. Maximum number of iterations

As was shown in section 5.6, the number of Forward or Backward steps is finite if Fw is a contraction map. However, it can be seen that the penalty imposed on the weights is in fact imposed post factum: only after the norm of the Jacobian ∂Fw/∂x increases excessively is the penalty imposed. In fact:

1. it is not guaranteed that the Jacobian norm will be correct after the penalty is imposed

2. one Forward iteration takes place before the penalty is imposed

Even if the penalty is efficient enough (it is not necessarily so, as shown in the subsequent experiments), the problem of a single Forward iteration that may not converge still remains. During the experiments it was observed that, while the number of Forward iterations was usually between 5 and 50, depending on the dataset, from time to time it reached about 2000 iterations or even more (the calculations had to be aborted due to excessive running time). To make the calculation time predictable it was necessary to introduce a modification to the original GNN algorithm: a maximum number of Forward iterations. The value chosen for the subsequent experiments was 200, as it seemed large enough (in comparison to the usual number of steps) to assure that the state calculation will converge if Fw is still a contraction map. Furthermore, if the state calculation does not converge, it is not guaranteed that the error accumulation calculation will. Therefore, a similar restriction was introduced for the Backward procedure and the maximum number of Backward (error accumulation) iterations was also set to 200.


5.9. Detailed training algorithm

MAIN:
    w = initialize
    x = FORWARD(w)
    for numberOfIterations:
        [∂ehw/∂w; ∂egw/∂w] = BACKWARD(x, w)
        w = rprop-update(w, ∂ehw/∂w, ∂egw/∂w)
        x = FORWARD(w)
    end
    return w
end

FORWARD(w):
    x(0) = random
    t = 0
    repeat:
        x(t+1) = Fw(x(t), l)
        t = t + 1
    until (maxi(|xi(t+1) − xi(t)|) ≤ minStateDiff) or (t > maxForwardSteps)
    return x(t)
end

BACKWARD(x, w):
    o = Gw(x)
    A = ∂Fw/∂x(x, l)
    b = ∂ew/∂o · ∂Gw/∂x(x)
    z(0) = b
    t = 0
    repeat:
        z(t−1) = z(t) · A + b
        t = t − 1
    until (maxi(|zi(t−1) − zi(t)|) ≤ minErrorAccDiff) or (|t| > maxBackwardSteps)
    ∂egw/∂w = b
    ∂ehw/∂w = z(t) · ∂Fw/∂w(x, l) + ∂pw/∂w
    return [∂ehw/∂w; ∂egw/∂w]
end

Listing 5.1. The learning algorithm

5.10. Graph-focused tasks

The GNN model can be used for graph-focused tasks by modifying the standard learning algorithm. In a graph-focused task an output og is sought for every graph, defined as the output of a predefined root node. In such a task only the root output error can be measured, as the expected output is not defined for any other node. Thus, the only modification necessary to deal with such a task is to set the error of all non-root nodes to zero. Such a modification was implemented and proved to work well for a modified subgraph matching task (see chapter 6), where the task consisted of determining whether a given graph contains the expected subgraph S, instead of selecting the nodes belonging to S.

6. Experiments

Experiments were conducted to check whether the implemented GNN is able to cope with the tasks presented in the original article [8]. For all the experiments the state size was set to 5 and the number of hidden neurons in both the hw and gw networks was set to 5. After some successful trivial experiments, consisting of memorizing a single graph, the proper experiments were conducted. The task chosen for the experiments was the subgraph matching task. It was chosen because:

1. a similar experiment was conducted by Scarselli et al. [8]

2. the dataset is easy to generate, yet the problem is not trivial

3. to yield good results, the structure of the graph has to be exploited.

6.1. Subgraph matching - data

The datasets for the subgraph matching task were generated as follows. For a given number of graph nodes, graphs were generated by selecting node labels from [0..10] and connecting each node pair in a graph with an edge probability δ. Then, edges were inserted randomly until the graph became connected. Next, a smaller (connected) subgraph S was inserted into every graph in the dataset, and a brute force algorithm was used to locate all copies of the subgraph S in every graph. Thus, every graph in the dataset contained at least one copy of the subgraph S. Afterwards, a small Gaussian noise with zero mean and standard deviation 0.25 was added to all node labels. All graph edges were undirected and thus were transformed into pairs of directed edges prior to processing. No edge labels were used.
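A simplified Octave sketch of this generation procedure is given below. It covers only the random connected graph with labels and the label noise; the insertion of the subgraph S and the brute-force search for its copies (done by buildgraphs.py) are omitted.

% Sketch of the random graph generation (simplified, labels from [0..10],
% edge probability delta, extra random edges until the graph is connected)
n = 6; delta = 0.8;
labels = randi([0 10], n, 1);
A = triu(rand(n) < delta, 1);
A = A + A';                          % undirected adjacency matrix

isconnected = @(M) all(all((eye(rows(M)) + M)^(rows(M) - 1) > 0));
while !isconnected(A)
  i = randi(n); j = randi(n);        % insert a random edge
  if i != j
    A(i, j) = 1; A(j, i) = 1;
  end
end

noisyLabels = labels + 0.25 * randn(n, 1);   % zero-mean noise, std 0.25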

Two datasets were generated. One with graph size (number of nodes) equal to 6, the

subgraph size equal to 3 and δ = 0.8 (100 graphs, called later the 6-3 dataset). The second

dataset with graph size equal to 14, subgraph size equal to 7 and δ = 0.2 (100 graphs, called

later the 14-7 dataset). A larger δ was used for the first dataset, as graphs generated with

δ = 0.2 were mostly sequences. The first dataset was used to analyze the process of training,

while the second one was used for comparison of GNN with a standard FNN classifier.


[Figure: a six-node graph with node labels from [0..10]; the nodes of the planted subgraph are drawn in black.]

Figure 6.1. Sample graph from 6-3 dataset (subgraph in black), before adding noise

[Figure: a fourteen-node graph with node labels from [0..10]; the nodes of the planted subgraph are drawn in black.]

Figure 6.2. Sample graph from 14-7 dataset (subgraph in black), before adding noise


6.2. Impact of initial weight values on learning

To test the impact of initial weight values on the GNN training process, 9 different sets of weights were tested. For all tested networks, the contraction constant (µ from Eq. 5.9) was set to 0.9. The training was performed on 10 graphs belonging to the 6-3 dataset. Each GNN network was trained for 50 iterations. As the default error measure used in GNN training is the Mean Square Error, a similar performance measure, the RMSE, was used for evaluation. Results are presented in Fig. 6.3. Out of the 9 networks, only 4 performed well: gnn2, gnn3, gnn5 and gnn7. The gnn5 network yielded the smallest RMSE at the end of training and also exhibits a remarkably monotonic RMSE slope compared to gnn7. All the other networks did not improve the RMSE significantly, which may suggest that multiple initial sets of weights should be tried for a given dataset to build an efficient classifier.
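A straightforward way of exploiting this observation is to train several randomly initialised networks on the same merged training graph and keep the one with the lowest final RMSE. The sketch below relies on the interfaces shown in Listing A.1 and Listing B.1 (the first column of trainStats holds the RMSE per iteration); it is an illustration, not part of the implemented toolbox.

% Train several randomly initialised GNNs and keep the best one (sketch)
g6 = loadset('../data/g6s3n/g6s3', 10);
gm = mergegraphs(g6);

bestRmse = Inf;
for trial = 1:9
  gnn = initgnn(gm.maxIndegree, [5 5], [5 gm.nodeOutputSize], 'tansig');
  [trainedGnn trainStats] = traingnn(gnn, gm, 50);
  finalRmse = trainStats(end, 1);        % RMSE of the final network
  if finalRmse < bestRmse
    bestRmse = finalRmse;
    bestGnn = trainedGnn;
  end
end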

[Figure: nine panels (gnn1-gnn9), each showing RMSE (ca. 5.5-9.5) over 50 training iterations.]

Figure 6.3. RMSE for 9 different initial weight sets. µ = 0.9


6.3. Impact of contraction constant on learning

During the initial experiments, interesting results were obtained for different values of the contraction constant (µ from Eq. 5.9). It seems that for a given learning task there exists a minimum value of µ below which no learning occurs. Experiments were conducted for the 6-3 dataset using the best networks from Fig. 6.3, gnn5 and gnn7 (their initial weight values were reused). The results for gnn7 are presented in Fig. 6.4 and the results for gnn5 in Fig. 6.5. For both networks three different values of µ were tested: 1.2, 0.9 and 0.6. In both cases it can be observed that no training occurs for µ = 0.6. For these experiments 20 graphs from the 6-3 dataset were used.

[Figure: three panels labelled (1.2, 1e-08), (0.9, 1e-08) and (0.6, 1e-08), each showing RMSE (ca. 8-13) over 50 training iterations.]

Figure 6.4. RMSE for gnn7 with µ ∈ [1.2,0.9,0.6]

[Figure: three panels labelled (1.2, 1e-08), (0.9, 1e-08) and (0.6, 1e-08), each showing RMSE (ca. 8.5-13) over 50 training iterations.]

Figure 6.5. RMSE for gnn5 with µ ∈ [1.2,0.9,0.6]

A closer look at the learning process may shed some light on the reasons behind the lack of learning. In Fig. 6.6 the learning process of gnn5 with µ = 0.9 is presented, and in Fig. 6.7 the same network gnn5 was trained with µ = 0.6. The values shown are: nForward - the number of Forward (state building) iterations, nBackward - the number of Backward (error accumulation) iterations, penalty - set to 1 if any weight was penalized, de/dw influence - the percentage of combined weight updates that had the same sign as ∂e/∂w (before passing to the RPROP algorithm), and dp/dw influence - the percentage of combined weight updates that had the same sign as ∂p/∂w. Some interesting features of the GNN learning scheme can be observed. In the case of µ = 0.9 the number of Forward steps reached the maximum a couple of times, which presumably means that at those times Fw ceased being a contraction map. The penalty was imposed mostly for short periods of time and only at one moment caused the ∂e/∂w influence to drop below 50%. This strategy yielded good results: the imposed penalty reduced the number of Forward steps and the RMSE was successfully reduced.

[Figure: six panels over 50 training iterations: RMSE, penalty, nForward, de/dw influence [%], nBackward, dp/dw influence [%].]

Figure 6.6. gnn5 performance with µ = 0.9 for 20 graphs

A different situation is shown for µ = 0.6. Because of the low µ value, the penalty was imposed eagerly and was larger than in the previous case (the impact of the µ value was described in section 5.6). It was imposed even when the number of Forward steps was below the maximum, that is, when Fw was still a contraction map. Large values of the penalty caused a huge decrease of the ∂e/∂w term influence, which made any learning impossible.


[Figure: six panels over 50 training iterations: RMSE, penalty, nForward, de/dw influence [%], nBackward, dp/dw influence [%].]

Figure 6.7. gnn5 performance with µ = 0.6 for 20 graphs

Another interesting case is presented in Fig. 6.8: the learning process of gnn5 on 10 graphs with µ = 0.9. It can be observed that even though the number of Forward steps occasionally peaked at the maximum value, the Fw function remained a contraction map. A large enough µ prevented the penalty from being imposed, which enabled the GNN model to train both computation units without any disturbance. The result is a monotonically decreasing RMSE slope, which could already be observed in Fig. 6.3. It can be concluded that the most important aspect of building a GNN model is to provide an efficient way of making Fw a contraction map as fast as possible, so as to leave as much time as possible for undisturbed learning.


[Figure: six panels over 50 training iterations: RMSE, penalty, nForward, de/dw influence [%], nBackward, dp/dw influence [%].]

Figure 6.8. gnn5 performance with µ = 0.9 for 10 graphs


6.4. Cross-validation results

To compare the performance of the implemented GNN model with a standard FNN, the following subgraph matching experiment was conducted. 5-fold cross-validation was performed on all 100 graphs from the 14-7 dataset. A random GNN was generated and trained with a contraction constant µ = 0.9 for 50 iterations on each fold. To provide good FNN results, 10 three-layer FNNs with 20 hidden tanh neurons were evaluated and the one with the best mean accuracy was selected. The results are presented in Tables 6.1 and 6.2. The GNN classifier outperformed the FNN by more than 15 percentage points of accuracy. This is due to the fact that the FNN classifier could make predictions only by analyzing node labels, while the GNN classifier correctly exploited the graph topology.

These results can be better understood by analyzing the classified dataset. The 100 processed graphs consisted in total of 1400 nodes. Amongst these nodes, 1031 had node labels matching the subgraph node labels, but only 702 of them actually belonged to the subgraph. Thus, 329 nodes, 23.5% of all the nodes, would probably be classified as false positives by a classifier taking only node labels into consideration. This hypothesis corresponds quite well with the presented results.

           accuracy   precision   recall
FNN - tr   75%        68%         93%
FNN - tst  74%        68%         93%
GNN - tr   91%        87%         97%
GNN - tst  91%        86%         97%

Table 6.1. Mean values on training and test sets

           accuracy   precision   recall
FNN - tr   0.68%      0.82%       1.43%
FNN - tst  3.20%      2.89%       1.85%
GNN - tr   1.62%      1.71%       2.07%
GNN - tst  3.06%      3.70%       1.39%

Table 6.2. Standard deviations on training and test sets

7. Conclusions

The implementation of the GNN model yielded promising experimental results for structured data and proved that the model can properly exploit both node labels and the graph structure. Some changes to the original algorithm, such as the maximum number of Forward and Backward iterations, were introduced to assure a more predictable computation time. The conditions under which the model works efficiently were described, and an important parameter of the GNN model, determining the training efficiency, was identified: the contraction constant µ. The most important conclusions are listed below:

— training yields best results when Fw remains a contraction map

— if Fw definitely ceases to be a contraction map, there are no training effects

— remaining near the contraction state can still yield good results

— a fixed maximum number of Forward/Backward iterations can still yield good results

— imposing an unnecessary penalty should be avoided

— too large penalties (a too small µ) should be avoided

— the minimum value of µ should be tuned to the processed dataset

— the minimum value of µ can be tuned using a subset of the data.

A. Using the software

The GNN classifier was implemented in GNU Octave 3.6.2 and tested on an x86_64 PC. The most important functions are:

— loadgraph - loads a single graph from .csv files

— loadset - loads a set of graphs sharing the same filename prefix

— mergegraphs - merges a cell array of graphs into a single graph

— initgnn - initializes a new GNN

— traingnn - trains a GNN using a training graph

— classifygnn - classifies a given graph with a trained GNN

— evaluate - evaluates the classification results

— crossvalidate - performs cross-validation using an untrained GNN and a set of graphs

Help and usage information for each function can be displayed by typing help <function_name> in the Octave command line.

g6 = loadset('../data/g6s3n/g6s3', 10);
gm = mergegraphs(g6);
gnn = initgnn(gm.maxIndegree, [5 5], [5 gm.nodeOutputSize], 'tansig');
trainedGnn = traingnn(gnn, gm, 20);
outputs = classifygnn(trainedGnn, gm);
stats = evaluate(outputs, gm.expectedOutput);

Listing A.1. Sample usage session

All subgraph matching datasets were created using the buildgraphs.py script. Each graph

can be viewed as a pdf file by using the drawgraph.py script. Each graph is stored as three

.csv files, containing node labels, edge labels and expected outputs. A sample graph ’test’ is

presented below. Nodes yielding output 2 were marked as black.

[Figure: a four-node sample graph with node labels 5, 6, 7, 8; the black nodes are those with expected output 2.]

Figure A.1. Sample graph

test_nodes.csv   test_output.csv   test_edges.csv
5                1                 1,2,0
6                2                 2,3,0
7                2                 3,4,0
8                1                 4,1,0
                                   4,2,0
                                   4,3,0

Table A.1. Sample graph data

B. Listings

function [bestGnn trainStats] = traingnn(gnn, graph, nIterations, ...
    maxForwardSteps=200, maxBackwardSteps=200, initialState=0)
% Trains GNN using graph as training set
%
% usage: [bestGnn trainStats] = traingnn(gnn, graph, nIterations,
%        maxForwardSteps=200, maxBackwardSteps=200, initialState=0)
%
% return:
% - best gnn obtained during training and all errors
%
  if initialState != 0
    assert(size(initialState, 1) == graph.nNodes);
    assert(size(initialState, 2) == gnn.stateSize);
  end

  % constants for indexing training stats
  RMSE = 1;
  FORWARD_STEPS = 2;
  BACKWARD_STEPS = 3;
  PENALTY_ADDED = 4;
  RPROP_REVERTED_TRANS = 5;
  RPROP_REVERTED_OUT = 6;
  TRANS_STATS_START = 7;
  TRANS_STATS_END = TRANS_STATS_START + 3;
  trainStats = zeros(nIterations + 1, TRANS_STATS_END);

  % normalize edge and node labels
  % store normalization info inside result gnn
  [graph.nodeLabels gnn.nodeLabelMeans gnn.nodeLabelStds] = ...
    normalize(graph.nodeLabels);
  [graph.edgeLabels(:, 3:end) gnn.edgeLabelMeans gnn.edgeLabelStds] = ...
    normalize(graph.edgeLabels(:, 3:end));
  graph = addgraphinfo(graph);

  minError = Inf;
  bestGnn = gnn;
  rpropTransitionState = initrprop(gnn.transitionNet);
  rpropOutputState = initrprop(gnn.outputNet);
  for iteration = 1:nIterations
    [state nForwardSteps] = forward(gnn, graph, maxForwardSteps, ...
      initialState);
    trainStats(iteration, FORWARD_STEPS) = nForwardSteps;

    outputs = applynet(gnn.outputNet, state);
    if graph.nodeOrientedTask == false
      err = rmse(graph.expectedOutput(1, :), outputs(1, :));
    else
      err = rmse(graph.expectedOutput, outputs);
    end
    trainStats(iteration, RMSE) = err;
    if err < minError
      minError = err;
      bestGnn = gnn;
    end

    [deltas nBackwardSteps penaltyAdded] = backward(gnn, graph, ...
      state, maxBackwardSteps);
    trainStats(iteration, BACKWARD_STEPS) = nBackwardSteps;
    trainStats(iteration, PENALTY_ADDED) = penaltyAdded;
    trainStats(iteration, TRANS_STATS_START:TRANS_STATS_END) = ...
      round(transitionstats(deltas));

    outputDerivatives = deltas.output;
    [rpropOutputState outputWeightUpdates nOutputReverted] = ...
      rprop(rpropOutputState, outputDerivatives);
    trainStats(iteration, RPROP_REVERTED_OUT) = ...
      round(nOutputReverted * 100 / gnn.outputNet.nWeights);

    transitionDerivatives = adddeltas(deltas.transition, ...
      deltas.transitionPenalty);
    [rpropTransitionState transitionWeightUpdates ...
      nTransitionReverted] = ...
      rprop(rpropTransitionState, transitionDerivatives);
    trainStats(iteration, RPROP_REVERTED_TRANS) = ...
      round(nTransitionReverted * 100 / gnn.transitionNet.nWeights);

    gnn.outputNet = updateweights(gnn.outputNet, ...
      outputWeightUpdates, 1);
    gnn.transitionNet = updateweights(gnn.transitionNet, ...
      transitionWeightUpdates, 1);
  end

  % calculate RMSE of final GNN
  iteration = nIterations + 1;
  [state nForwardSteps] = forward(gnn, graph, maxForwardSteps, ...
    initialState);
  trainStats(iteration, FORWARD_STEPS) = nForwardSteps;

  outputs = applynet(gnn.outputNet, state);
  if graph.nodeOrientedTask == false
    err = rmse(graph.expectedOutput(1, :), outputs(1, :));
  else
    err = rmse(graph.expectedOutput, outputs);
  end
  trainStats(iteration, RMSE) = err;
  % printf('RMSE after %d iterations: %f\n', iteration - 1, err);
  if err < minError
    minError = err;
    bestGnn = gnn;
  end
end

Listing B.1. Main train function


function [state nSteps] = forward(gnn, graph, maxForwardSteps, state=0)
% Perform the 'forward' step of GNN training
% compute node states until stable state is reached
%
% usage: [state nSteps] = forward(gnn, graph, maxForwardSteps, state=0)

  if state == 0
    state = initstate(graph.nNodes, gnn.stateSize);
  end
  nSteps = 0;
  do
    if nSteps > maxForwardSteps
      % printf('Too many forward steps: %d, aborting\n', nSteps);
      return;
    end
    lastState = state;
    state = transition(gnn.transitionNet, lastState, graph);
    nSteps = nSteps + 1;
  until (stablestate(lastState, state, gnn.minStateDiff));
end

Listing B.2. Forward function

function newState = transition(fnn, state, graph)
% Calculate global transition function

  newState = zeros(size(state));
  for nodeIndex = 1:graph.nNodes
    % build transitionNet input for single node
    sourceNodeIndexes = graph.sourceNodes{nodeIndex};
    nodeLabel = graph.nodeLabels(nodeIndex, :);
    newNodeState = zeros(1, fnn.nOutputNeurons);
    for i = 1:size(sourceNodeIndexes, 2)
      sourceEdgeLabel = ...
        graph.edgeLabelsCell{sourceNodeIndexes(i), nodeIndex};
      sourceNodeState = state(sourceNodeIndexes(i), :);
      inputs = [nodeLabel, sourceEdgeLabel, sourceNodeState];
      stateContribution = applynet(fnn, inputs);
      newNodeState = newNodeState + stateContribution;
    end
    % if node has < maxIndegree input nodes, its contribution is 0
    newState(nodeIndex, :) = newNodeState;
  end
end

Listing B.3. Transition function

function stable = stablestate(lastState, state, minStateDiff)
% return true, if difference between two states is insignificant

  diff = max(max(abs(lastState - state)));
  stable = (diff < minStateDiff);
end

Listing B.4. State convergence check


function [weightDeltas nSteps penaltyAdded] = backward(gnn, graph, state, ...
    maxBackwardSteps)
% Perform the 'backward' step of GNN training
%
% usage: [weightDeltas nSteps penaltyAdded] = backward(gnn, graph, state,
%        maxBackwardSteps)
%
% state : stable state, calculated by forward
% return: deltas for both transition and output networks of gnn

  outputs = applynet(gnn.outputNet, state);
  outputErrors = 2 .* (graph.expectedOutput - outputs);
  if graph.nodeOrientedTask == false
    selectedNodeErrors = outputErrors(1, :);
    outputErrors = zeros(size(outputErrors));
    outputErrors(1, :) = selectedNodeErrors;
  end

  outputDeltas = outputdeltas(gnn.outputNet, graph, state, outputErrors);

  % sparse matrix calculations, cause size(A) = N x N x s^2
  % A can contain whole training set as a single graph
  A = calculatea(gnn.transitionNet, graph, state);
  b = sparse(calculateb(gnn.outputNet, graph, state, outputErrors));
  accumulator = b;  % accumulator contains dew/do * dG/dx
  nSteps = 0;
  do
    % if there are too many backward steps, we can get infinities
    if nSteps > maxBackwardSteps
      break;
    end
    lastAccumulator = accumulator;
    % for each source node, backpropagate error of its target nodes
    % A : influence of source nodes on target nodes
    % accumulator : errors de/dx from previous time step (t + 1)
    % b : de/do * dG/dx error, injected at each step
    accumulator = accumulator * A + b;
    nSteps = nSteps + 1;
  until (stablestate(lastAccumulator, accumulator, gnn.minErrorAccDiff));

  transitionErrors = reshape(accumulator, graph.nNodes, gnn.stateSize);
  transitionDeltas = transitiondeltas(gnn.transitionNet, graph, state, ...
    transitionErrors);

  % calculate penalty dp/dw, assuring that F is a contraction map
  [penaltyDerivative penaltyAdded] = penaltyderivative(gnn, graph, state, A);
  penaltyDeltas = reshapedeltas(gnn.transitionNet, penaltyDerivative);

  weightDeltas = struct( ...
    'output', outputDeltas, ...
    'transition', transitionDeltas, ...
    'transitionPenalty', penaltyDeltas);
end

Listing B.5. Backward function


function A = calculatea(transitionNet, graph, state)
% Calculate the A(x) = dF/dx block matrix
% (NxN blocks, each s x s) (F = transition function)
%
% usage: A = calculatea(transitionNet, graph, state)
%
% Each block A[n, u] = dxn/dxu:
% - describes the effect of node xu on node xn,
%   if an edge xu->xn exists
% - is null (zeroed) if there is no edge
%
% Each element of block A[n, u]: a[i, j]:
% - describes the effect of ith element of state xu
%   on jth element of state xn

  stateSize = transitionNet.nOutputNeurons;
  aSize = graph.nNodes * stateSize;
  A = sparse(aSize, aSize);  % zeroed
  for nodeIndex = 1:graph.nNodes
    % build transitionNet input for single node
    sourceNodeIndexes = graph.sourceNodes{nodeIndex};
    nodeLabel = graph.nodeLabels(nodeIndex, :);
    for i = 1:size(sourceNodeIndexes, 2)
      sourceEdgeLabel = ...
        graph.edgeLabelsCell{sourceNodeIndexes(i), nodeIndex};
      sourceNodeState = state(sourceNodeIndexes(i), :);
      inputs = [nodeLabel, sourceEdgeLabel, sourceNodeState];
      deltaZx = zeros(stateSize, stateSize);
      for j = 1:stateSize
        errors = zeros(1, stateSize);
        errors(j) = 1;
        inputDeltas = bp2(transitionNet, inputs, errors);
        % select only weights corresponding to x_iu
        stateWeightsStart = ...
          1 + graph.nodeLabelSize + graph.edgeLabelSize;
        stateInputDeltas = inputDeltas(1, stateWeightsStart:end);
        deltaZx(:, j) = stateInputDeltas';
      end
      startX = blockstart(nodeIndex, stateSize);
      endX = blockend(nodeIndex, stateSize);
      startY = blockstart(sourceNodeIndexes(i), stateSize);
      endY = blockend(sourceNodeIndexes(i), stateSize);
      A(startX:endX, startY:endY) = deltaZx;
    end
  end
end

Listing B.6. Matrix A calculation


function b = calculateb(outputNet, graph, state, errorDerivative)
% Calculate b matrix = dew/do * dGw/dx
%
% G : global output function, G(nodeStates) = outputs
% errorDerivative : each row contains an error of a node
% state : each row contains a node state

  b = zeros(outputNet.nInputLines, graph.nNodes);
  for nodeIndex = 1:graph.nNodes
    errors = errorDerivative(nodeIndex, :);
    nodeState = state(nodeIndex, :);
    inputs = nodeState;
    deltas = backpropagate(outputNet, inputs, errors);
    b(:, nodeIndex) = deltas.deltaInputs(1, :);
  end
  % stack output for each node on one another
  b = vec(b)';
end

Listing B.7. Matrix b calculation


function [penaltyDerivative penaltyAdded] = ...
    penaltyderivative(gnn, graph, state, A)
% Calculate penalty derivative contribution to de/dw, where:
% - e = RMSE + contractionMapPenalty (network error)
%
% usage: [penaltyDerivative penaltyAdded]
%        = penaltyderivative(gnn, graph, state, A)
%

  % sum up influence of each source node on all the other nodes
  % each s-long block is influence of single source node xu on all s outputs
  sourceInfluences = sum(abs(A), 1);
  sourceInfluences = (sourceInfluences - ...
    repmat(gnn.contractionConstant, size(sourceInfluences))) .* ...
    (sourceInfluences > gnn.contractionConstant);

  fnn = gnn.transitionNet;
  nWeights1 = fnn.nInputLines * fnn.nHiddenNeurons;
  nBias1 = fnn.nHiddenNeurons;
  nWeights2 = fnn.nOutputNeurons * fnn.nHiddenNeurons;
  nBias2 = fnn.nOutputNeurons;
  penaltyDerivative = zeros(1, nWeights1 + nBias1 + nWeights2 + nBias2);
  if sum(sourceInfluences) == 0
    penaltyAdded = false;
  else
    penaltyAdded = true;
    % matrix B contains influences from A, filtered:
    % only influences coming from a too influential source are retained
    B = sign(A) .* repmat(sourceInfluences, size(A, 1), 1);
    for sourceIndex = 1:graph.nNodes
      startX = blockstart(sourceIndex, gnn.stateSize);
      endX = blockend(sourceIndex, gnn.stateSize);
      sourceNodeInfluences = sourceInfluences(1, startX:endX);
      if (sum(sourceNodeInfluences, 2) != 0)
        % calculate impactDerivative[n, u] for u = sourceIndex
        for targetIndex = 1:graph.nNodes
          if !edgeexists(graph, sourceIndex, targetIndex)
            continue;
          end
          startY = blockstart(targetIndex, gnn.stateSize);
          endY = blockend(targetIndex, gnn.stateSize);
          Rnu = full(B(startY:endY, startX:endX));

          % calculate f2'(net2) and f1'(net1)
          nodeLabel = graph.nodeLabels(targetIndex, :);
          sourceEdgeLabel = graph.edgeLabelsCell{sourceIndex, targetIndex};
          sourceNodeState = state(sourceIndex, :);
          inputs = [nodeLabel, sourceEdgeLabel, sourceNodeState];

          % fnn feed
          net1 = fnn.weights1 * inputs' + fnn.bias1;
          hiddenOutputs = fnn.activation1(net1);
          net2 = fnn.weights2 * hiddenOutputs + fnn.bias2;
          sigma1 = fnn.activationderivative1(net1);
          sigma2 = fnn.activationderivative2(net2);

          % select only weights corresponding to xu at inputs
          stateWeightsStart = 1 + graph.nodeLabelSize + graph.edgeLabelSize;
          stateWeights1 = fnn.weights1(:, stateWeightsStart:end);

          % vec(Rnu)' * dvec(Anu)/dw = da1 + da2 + da3 + da4
          deltaSignal1 = vec(Rnu * stateWeights1' * ...
            diag(sigma1) * fnn.weights2')' * vecdiagmatrix(gnn.stateSize);
          fnn2nd = fnn;
          fnn2nd.activation1 = fnn.activationderivative1;
          fnn2nd.activation2 = fnn.activationderivative2;
          fnn2nd.activationderivative1 = fnn.activation2ndderivative1;
          fnn2nd.activationderivative2 = fnn.activation2ndderivative2;
          deltas1 = backpropagate(fnn2nd, inputs, deltaSignal1);
          da1 = vecdeltas(deltas1)';

          da2left = vec(diag(sigma2) * Rnu * stateWeights1' * diag(sigma1))';
          weights2dw = [zeros(nWeights2, nWeights1 + nBias1) ...
            eye(nWeights2) zeros(nWeights2, nBias2)];
          da2 = da2left * weights2dw;

          deltaSignal3 = vec(fnn.weights2' * diag(sigma2) * ...
            Rnu * stateWeights1')' * vecdiagmatrix(fnn.nHiddenNeurons);
          % 1-layer fnn backpropagate with f(net) = transition.f'(net)
          net1 = fnn.weights1 * inputs' + fnn.bias1;
          delta1 = deltaSignal3' .* fnn.activation2ndderivative1(net1);
          deltaWeights1 = delta1 * inputs;
          deltaBias1 = delta1;
          da3 = [vec(deltaWeights1); vec(deltaBias1); ...
            zeros(nWeights2 + nBias2, 1)]';

          da4left = vec(diag(sigma1) * fnn.weights2' * diag(sigma2) * Rnu)';
          stateSize = fnn.nOutputNeurons;
          nStateWeights = stateSize * fnn.nHiddenNeurons;
          labelsSize = graph.nodeLabelSize + graph.edgeLabelSize;
          stateWeights1Dw = zeros(nStateWeights, ...
            nWeights1 + nBias1 + nWeights2 + nBias2);
          % mark with ones weights corresponding to xu
          for h = 1:fnn.nHiddenNeurons
            startIndexX = 1 + (h - 1) * fnn.nInputLines + labelsSize;
            endIndexX = startIndexX + stateSize - 1;
            startIndexY = 1 + (h - 1) * stateSize;
            endIndexY = startIndexY + stateSize - 1;
            stateWeights1Dw(startIndexY:endIndexY, ...
              startIndexX:endIndexX) = eye(stateSize);
          end
          da4 = da4left * stateWeights1Dw;

          impactDerivative = da1 + da2 + da3 + da4;
          penaltyDerivative = penaltyDerivative + impactDerivative;
        end
      end
    end
    penaltyDerivative = penaltyDerivative * 2;
  end
end

Listing B.8. Penalty derivative calculation


function [rpropStruct nReverted] = ...
    rpropupdate(rpropStruct, errorDerivatives, deltaMin, deltaMax)
% Helper function for rprop(), operates on single matrix of weights
%
% usage: [rpropStruct nReverted] =
%        rpropupdate(rpropStruct, errorDerivatives, deltaMin, deltaMax)
%
% rpropStruct.weightUpdates - deltas that should be added to fnn weights

  assert(size(rpropStruct.errors, 1) == size(errorDerivatives, 1));
  assert(size(rpropStruct.errors, 2) == size(errorDerivatives, 2));
  increase = 1.2;
  decrease = 0.5;
  errorDirectionChange = rpropStruct.errors .* errorDerivatives;

  nReverted = 0;
  for i = 1:size(rpropStruct.deltas, 1)
    for j = 1:size(rpropStruct.deltas, 2)
      if errorDirectionChange(i, j) > 0
        % increase weight update
        rpropStruct.deltas(i, j) = min(rpropStruct.deltas(i, j) * ...
          increase, deltaMax);
        rpropStruct.weightUpdates(i, j) = ...
          sign(errorDerivatives(i, j)) * rpropStruct.deltas(i, j);
        rpropStruct.errors(i, j) = errorDerivatives(i, j);
      elseif errorDirectionChange(i, j) < 0
        rpropStruct.deltas(i, j) = ...
          max(rpropStruct.deltas(i, j) * decrease, deltaMin);
        % revert last weight update
        rpropStruct.weightUpdates(i, j) = ...
          -rpropStruct.weightUpdates(i, j);
        % avoid double punishment in next step
        rpropStruct.errors(i, j) = 0;
        nReverted = nReverted + 1;
      else
        rpropStruct.weightUpdates(i, j) = ...
          sign(errorDerivatives(i, j)) * rpropStruct.deltas(i, j);
        rpropStruct.errors(i, j) = errorDerivatives(i, j);
      end
    end
  end
end

Listing B.9. RPROP weights update

List of Figures

2.1 A simple binary tree

4.1 A sample graph that can be processed using RAAM
4.2 Training set for the example graph
4.3 Graph compression using trained RAAM model
4.4 Graph reconstruction from x3 using trained RAAM model
4.5 RAAM encoding network for the sample graph
4.6 LRAAM encoding network for the graph shown
4.7 LRAAM encoding network for the cyclic graph shown
4.8 Virtual unfolding, reflecting the sample graph structure
4.9 A sample acyclic graph and the corresponding encoding network
4.10 A sample acyclic graph and the corresponding encoding network

5.1 The fw unit for a single node and one of the corresponding edges
5.2 The gw unit for a single node
5.3 A sample graph and the corresponding encoding network
5.4 Unfolded encoding network for the sample graph
5.5 Error backpropagation through the unfolded network

6.1 Sample graph from 6-3 dataset (subgraph in black), before adding noise
6.2 Sample graph from 14-7 dataset (subgraph in black), before adding noise
6.3 RMSE for 9 different initial weight sets. µ = 0.9
6.4 RMSE for gnn7 with µ ∈ [1.2, 0.9, 0.6]
6.5 RMSE for gnn5 with µ ∈ [1.2, 0.9, 0.6]
6.6 gnn5 performance with µ = 0.9 for 20 graphs
6.7 gnn5 performance with µ = 0.6 for 20 graphs
6.8 gnn5 performance with µ = 0.9 for 10 graphs

A.1 Sample graph

Listings

5.1 The learning algorithm
A.1 Sample usage session
B.1 Main train function
B.2 Forward function
B.3 Transition function
B.4 State convergence check
B.5 Backward function
B.6 Matrix A calculation
B.7 Matrix b calculation
B.8 Penalty derivative calculation
B.9 RPROP weights update

Bibliography

[1] J. B. Pollack, "Recursive distributed representations," Artificial Intelligence, vol. 46, no. 1, pp. 77–105, 1990.

[2] A. Sperduti, "Labelling recursive auto-associative memory," Connection Science, vol. 6, no. 4, pp. 429–459, 1994.

[3] A. Goulon-Sigwalt-Abram, A. Duprat, and G. Dreyfus, "From hopfield nets to recursive networks to graph machines: numerical machine learning for structured data," Theoretical Computer Science, vol. 344, no. 2, pp. 298–334, 2005.

[4] M. Bianchini, M. Gori, L. Sarti, and F. Scarselli, "Backpropagation through cyclic structures," in AI*IA 2003: Advances in Artificial Intelligence, pp. 118–129, Springer, 2003.

[5] O. Ivanciuc, "Canonical numbering and constitutional symmetry," Handbook of Chemoinformatics: From Data to Knowledge in 4 Volumes, pp. 139–160, 2003.

[6] M. Bianchini, M. Maggini, L. Sarti, and F. Scarselli, "Recursive neural networks for processing graphs with labelled edges: Theory and applications," Neural Networks, vol. 18, no. 8, pp. 1040–1050, 2005.

[7] P. Frasconi, M. Gori, and A. Sperduti, "A general framework for adaptive processing of data structures," Neural Networks, IEEE Transactions on, vol. 9, no. 5, pp. 768–786, 1998.

[8] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," Neural Networks, IEEE Transactions on, vol. 20, no. 1, pp. 61–80, 2009.

[9] A. Goulon, T. Picot, A. Duprat, and G. Dreyfus, "Predicting activities without computing descriptors: graph machines for QSAR," SAR and QSAR in Environmental Research, vol. 18, no. 1-2, pp. 141–153, 2007.

[10] A. Goulon, A. Duprat, and G. Dreyfus, "Learning numbers from graphs," Applied Statistical Modelling and Data Analysis, Brest, France, pp. 17–20, 2005.

[11] S. Yong, M. Hagenbuchner, A. Tsoi, F. Scarselli, and M. Gori, "XML document mining using graph neural network," Center for Computer Science, http://inex.is.informatik.uni-duisburg.de/2006, p. 354, 2006.

[12] F. Scarselli, S. L. Yong, M. Gori, M. Hagenbuchner, A. C. Tsoi, and M. Maggini, "Graph neural networks for ranking web pages," in Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on, pp. 666–672, IEEE, 2005.

[13] G. Monfardini, V. Di Massa, F. Scarselli, and M. Gori, "Graph neural networks for object localization," Frontiers in Artificial Intelligence and Applications, vol. 141, p. 665, 2006.

[14] A. Quek, Z. Wang, J. Zhang, and D. Feng, "Structural image classification with graph neural networks," in Digital Image Computing Techniques and Applications (DICTA), 2011 International Conference on, pp. 416–421, IEEE, 2011.

[15] F. Costa, P. Frasconi, V. Lombardo, and G. Soda, "Towards incremental parsing of natural language using recursive neural networks," Applied Intelligence, vol. 19, no. 1-2, pp. 9–25, 2003.

[16] S. Saha and G. Raghava, "Prediction of continuous b-cell epitopes in an antigen using recurrent neural network," PROTEINS: Structure, Function, and Bioinformatics, vol. 65, no. 1, pp. 40–48, 2006.

[17] R. Kree and A. Zippelius, "Recognition of topological features of graphs and images in neural networks," Journal of Physics A: Mathematical and General, vol. 21, no. 16, p. L813, 1988.

[18] A. Küchler and C. Goller, "Inductive learning in symbolic domains using structure-driven recurrent neural networks," in KI-96: Advances in Artificial Intelligence, pp. 183–197, Springer, 1996.

[19] A. Stolcke and D. Wu, Tree matching with recursive distributed representations. Citeseer, 1992.

[20] C. Goller and A. Kuchler, "Learning task-dependent distributed representations by backpropagation through structure," in Neural Networks, 1996, IEEE International Conference on, vol. 1, pp. 347–352, IEEE, 1996.

[21] F. J. Pineda, "Generalization of back-propagation to recurrent neural networks," Physical Review Letters, vol. 59, no. 19, pp. 2229–2232, 1987.

[22] R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," Back-propagation: Theory, Architectures and Applications, pp. 433–486, 1995.

[23] A. Sperduti and A. Starita, "Supervised neural networks for the classification of structures," Neural Networks, IEEE Transactions on, vol. 8, no. 3, pp. 714–735, 1997.

[24] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in Neural Networks, 1993, IEEE International Conference on, pp. 586–591, IEEE, 1993.