Academic year 2012/2013
Politechnika Warszawska
Wydział Elektroniki i Technik Informacyjnych
Instytut Informatyki
PRACA DYPLOMOWA MAGISTERSKA
Aleksy Stanisław Barcz
Implementation aspects of graph neural networks
Supervisor: mgr inż. Zbigniew Szymański
Grade: .....................................................
.................................................................
Signature of the Chairman of the Diploma Examination Committee
Field of study: Informatyka (Computer Science)
Specialisation: Inżynieria Systemów Informatycznych (Information Systems Engineering)
Date of birth: 1988.01.28
Date of commencement of studies: 2012.02.20
Curriculum Vitae
I graduated from the XXVIII LO im. J. Kochanowskiego secondary school in Warsaw, in a class with a mathematics and computer science profile. I completed my engineering studies in February 2012 in the field of Computer Science at the Wydział Elektroniki i Technik Informacyjnych of Politechnika Warszawska. During my first- and second-cycle studies I took part in Athens programme student exchanges at Katholieke Universiteit Leuven (Fundamentals of artificial intelligence) and at Télécom ParisTech (Emergence in complex systems).
....................................................... Student's signature
DIPLOMA EXAMINATION
Passed the diploma examination on ..................................................................................2013
with the result ...................................................................................................................................
Overall result of studies: ................................................................................................................
Additional conclusions and remarks of the Committee: ..........................................................................
.......................................................................................................................................................
.......................................................................................................................................................
SUMMARY
This thesis describes the implementation of a Graph Neural Network, a classifier capable of classifying data represented as graphs. Parameters affecting the classifier's efficiency and the learning process were identified and described, as were the implementation details affecting the classifier's efficiency. Important similarities to other connectionist models used for graph processing were highlighted.
Keywords: Graph neural networks, classification, graph processing, recursive neural networks
ASPEKTY IMPLEMENTACYJNE GRAFOWYCH SIECI NEURONOWYCH
(Implementation aspects of graph neural networks)
This thesis reports on an independent implementation of the Graph Neural Network classifier, which allows the classification of graph-structured data. Parameters significant for the classifier, affecting the course of its learning process and the quality of the obtained results, were identified. Implementation details essential to the classifier's operation were described. The classifier was presented in the context of similar solutions in order to show the close relationships between existing neural-network-based models for processing graph-structured data.
Keywords: graph neural networks, classification, graph processing, recursive neural networks
Contents
1. Introduction
2. Domains of application
3. Graph processing models
4. History of connectionist models
   4.1. Hopfield networks
   4.2. RAAM
   4.3. LRAAM
   4.4. Folding architecture and BPTS
   4.5. Generalised recursive neuron
   4.6. Recursive neural networks
   4.7. Graph machines
5. Graph neural network implementation
   5.1. Data
   5.2. Computation units
   5.3. Encoding network
   5.4. General training algorithm
   5.5. Unfolded network and backpropagation
   5.6. Contraction map
   5.7. RPROP algorithm
   5.8. Maximum number of iterations
   5.9. Detailed training algorithm
   5.10. Graph-focused tasks
6. Experiments
   6.1. Subgraph matching - data
   6.2. Impact of initial weight values on learning
   6.3. Impact of contraction constant on learning
   6.4. Cross-validation results
7. Conclusions
A. Using the software
B. Listings
List of Figures
List of Listings
Bibliography
1. Introduction
The Graph Neural Network model is a connectionist classifier capable of classifying
graphs. Most other existing neural network-based graph classifiers, such as RAAM [1]
or LRAAM [2] and all solutions based on them, can process only certain types of
graphs, in most cases DAGs (directed acyclic graphs) or DPAGs (directed positional
acyclic graphs). Several solutions were invented to deal with cyclic graphs, such as
introducing a delay in the LRAAM encoding tree [3] or techniques mapping cyclic
directed graphs to "recursive equivalent" trees [4]. The problem of nonpositional graphs
was also addressed by several authors, either by creating domain-specific encodings used
to enforce a defined order on graph nodes [5] or by introducing various modifications
to the classifier [6]. However, most of the solutions dealing with cyclic and nonpositional
graphs either complicate the classifier model, enlarge the input dataset or (in the case
of cycles) may result in information loss. The Graph Neural Network model can directly
process most types of graphs, including cyclic, acyclic, directed, undirected, positional
and nonpositional graphs, which makes it a flexible solution. There is a conceptual
similarity between the GNN model and recursive neural networks [7]; however, the GNN
model adopts a novel learning and backpropagation scheme which simplifies the processing
of different types of graphs. This thesis describes the steps of implementing a GNN
classifier, including some details which were not described in the original article [8].
The classifier was implemented in GNU Octave with two ideas in mind: providing a simple
interface (similar to that of the Neural Networks toolbox) and maximum flexibility,
that is, the possibility of processing every kind of data that the theoretical model
can deal with. The process of training a GNN and the classification results are presented
in detail. Listings of the most important procedures are included in appendix B.
2. Domains of application
This chapter presents domains where data is organized into a structured form, that is,
the form of sequences or graphs. The necessity of processing such data differently arises
from the structure of the data itself. To present this difference, we must first summarize
the tasks of classification and regression as they are most commonly understood in the
data processing domain.
A common statistical classifier (which later on is called a vectorial classifier) takes as input
samples from a given dataset, representing real world objects, and associates each sample
with a category. The samples are fixed-length vectors of numeric values. Each position in
such a vector represents a feature of the sample, which is quantified by a real or integer
value. The mapping from features to positions in the vector is fixed and must hold for each
sample in the dataset. The category is represented by a non-empty fixed-length vector of
integer values, where once again the position of each value is meaningful (we can say it’s
a positional representation). For the regression task, a vector of real (or integer) values is
associated with each sample instead of a category. The domain of vectorial classifiers is well
developed and includes, among other solutions, neural networks, support vector machines,
naïve Bayes classifiers and random forests.
In the case of graph processing, the nature of the data is different. Each sample is repre-
sented by a graph. A dataset may consist of a single or several graphs. Each graph consists of
nodes connected by edges. Each node can be described by its label, a fixed-length vector
of reals. Each edge can also be labelled, with a fixed-length vector of reals, possibly of
a different size. Edges can be directed or undirected. An example of a simple graph is
presented in Fig. 2.1.
If such a graph was to be used as input for a common classifier, it would be necessary
to transform the structured representation into a plain vector. This could be accomplished
in several ways; one of the most obvious would be to perform a preorder walk on the tree and list
Figure 2.1. A simple binary tree (root A with children B and C; node B with children D and E)
the node labels in the order they were visited. Such a walk would result in the representation
[A,B,D,E,C]. It can be seen that the explicit information about node adjacency was lost.
Instead, the information is provided implicitly, according to a coding which must be known
a priori to properly interpret such a vector representation. This means that a model learning
to classify such graph representations would have to learn the encoded relationship, instead
of using it from the beginning to learn other, unknown and interesting relationships affecting
the samples' category. The resulting learning task becomes even harder if such sequential
representations contain long-distance relationships. Different encodings from structured data
to vectors exist; however, they all share this flaw.
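The information loss described above can be illustrated with a short sketch (Python here, for illustration only; the thesis software itself is written in GNU Octave). The tuple-based tree encoding and the `preorder` helper are assumptions made for this example:

```python
# Sketch: flattening the tree of Fig. 2.1 by a preorder walk.
# Node labels are single characters here for readability; in practice
# each label would be a fixed-length vector of reals.

def preorder(tree):
    """Return node labels in preorder: root first, then children left to right."""
    label, children = tree
    result = [label]
    for child in children:
        result.extend(preorder(child))
    return result

# The tree of Fig. 2.1: A has children B and C; B has children D and E.
tree = ("A", [("B", [("D", []), ("E", [])]), ("C", [])])

print(preorder(tree))  # ['A', 'B', 'D', 'E', 'C']

# The same flat list is produced by a structurally different tree,
# e.g. the chain A -> B -> D -> E -> C:
chain = ("A", [("B", [("D", [("E", [("C", [])])])])])
print(preorder(chain))  # ['A', 'B', 'D', 'E', 'C']
```

The two trees are distinguishable only through the adjacency information, which the flat representation discards; a classifier fed such vectors would have to recover that structure on its own.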
Moreover, the inadequacy of simple vector representation becomes even more apparent
when the graph structure becomes more complicated. First of all, if edge labels are present
in the vector representation, the representation becomes a mix of data belonging to two
different entities - nodes and edges. Once again the classifier doesn’t know which part of
the data corresponds to the first entity and which to the other. The same applies to directed
and undirected edges. If two nodes in a graph are connected by a directed edge, presumably
one of the nodes in the relation has a larger impact on the other than vice versa. On the
contrary, an undirected or bidirectional edge implies an equal impact of both nodes on
each other. What if a graph contains both types of edges? How should one type of edge be
distinguished from the other? Secondly, let's consider the case of cyclic dependencies.
Even if a meaningful representation of the graph is built ignoring cycles, e.g. by constructing
a minimum spanning tree, some explicit information about connections is lost. An example
of such data are chemical compounds containing groups of atoms forming cyclic bonds.
Another problem lies in the positional nature of vectorial data. If a representation is built by
simply storing node labels one after another, this representation becomes vulnerable to any
reordering of the children of a node. Additional effort must be made to ensure a consistent
ordering of potentially difficult-to-order data, even though the ordering of the children of
a node may be irrelevant in the dataset considered.
It can be seen that in order to properly process structured data, a different approach
must be used. The data should be processed in a way that properly exploits the information
contained in its structure - by building a sufficient representation or by processing
the structured data directly.
Graph-oriented models based on neural networks were successfully applied in various
domains, including chemistry, pattern recognition and natural language processing. In the
domain of computer-aided drug design the most important problem to solve is to predict the
properties of a molecule prior to synthesizing it. All molecules with a negative prediction
can be discarded automatically, reducing the costs of the subsequent laboratory experiments
which can focus on the molecules with a positive prediction only [3]. This is the case of
QSAR (quantitative structure-activity relations) and QSPR (quantitative structure-property
relations). While traditional processing methods consist of the extraction and selection of
features from molecule descriptions, molecules can be easily represented as undirected
graphs and processed with a graph-oriented model [9] [10].
The next domain of interest is document mining. As the amount of XML-formatted
information increases rapidly, the problem of determining if a document can be assigned to
a given category becomes crucial. As an XML document can be viewed as semi-structured
data, graph-processing models can be successfully applied to this task [11]. Another
problem focused on document processing is web page ranking, where documents and the
links between them can be described as structured data. A general ranking model can be
implemented as a graph-processing model, which allows exploiting page contents and link
analysis simultaneously [12] [8].
In the domain of image processing two pattern recognition tasks can be distinguished.
The first is the classification of images, either for industrial applications, control and
monitoring, or for querying an image database. The second is object localisation, which may be used
e.g. for face localisation. For both tasks an image can be represented as a RAG (Region
Adjacency Graph), where image segments are represented as nodes and their adjacency
is represented as edges which may contain information about e.g. the distances between
adjacent segments. For both tasks graph-processing methods can be used, yielding promising
results [13] [6] [14].
Another classic example where the structure of the data plays a crucial role in its
understanding is natural language processing. In the unconstrained case, the input data may
consist of arbitrarily complex sentences. As a sentence can be transformed into a graph
reflecting its syntax, a graph-processing model can be trained to parse such sentences. One
of the first graph-processing solutions was already evaluated on such a task [1], and more
recent solutions are also present in the literature [15].
3. Graph processing models
A model considered fully capable of processing structured data should be able to:
1. build data representation
a) minimal
b) exploiting sufficiently the structure of the data
c) adequate for subsequent processing (classification, regression)
2. perform classification / regression on the structured data
a) taking into consideration the structure encoded in the representation
b) with a high generalization capacity
These two main tasks are often intertwined, as a classification procedure
may affect the procedure of representation building and vice versa. It is also possible for
a model to focus only on representation building, while leaving the task of processing to
a common statistical classifier, such as a support vector machine. The two main families of
models capable of processing structured data are the symbolic and connectionist families. The
first one originates in the artificial intelligence domain and focuses on inferring relationships
by means of inductive logic programming. The connectionist models focus on modelling
relationships with the use of interconnected networks of simple units. The different models
originating from these two families are:
1. inductive logic programming
2. evolutionary algorithms
3. probabilistic models: Bayes networks and Markov random fields
4. graph kernels
5. neural network models
The main area of interest of this thesis is the connectionist models based on neural networks.
These models make the fewest assumptions about the domain of the dataset and
thus provide potentially the most general method for processing structured data.
4. History of connectionist models
This chapter summarizes the history of connectionist models used for graph processing.
All these models originate from feed-forward neural networks (FNNs). The history of
neural networks begins in 1943 with the McCulloch–Pitts (MCP) neuron model, followed
by the Rosenblatt perceptron classifier in 1957. The feed-forward neural network model
was developed during the following three decades and a conclusive state was reached in all
major fields of related research by approximately 1988. The FNN model reached maturity
in its field of application: classification and regression performed on unstructured positional
samples of fixed size. In the '80s a new branch of the neural networks family began to
develop - the recurrent neural networks (RNNs). The RNN model is capable of processing
sequences of varying (potentially infinite) length, which makes it suitable for dealing with
time-varying sequences or biological samples of various lengths [16]. However, a slightly
different model had to be invented to properly process graph data.
4.1. Hopfield networks
One of the earliest attempts to classify structured data with neural networks used
Hopfield networks [3]. A common application of a Hopfield network is an auto-associative
memory, which learns to reproduce patterns provided as its inputs. (The task to be learned is
the mapping xi ⇒ xi, where xi is a pattern of fixed size n.) Afterwards, when a new sample
is presented to the trained network, the network associates it with the most similar pattern it
had learned. Subsequently, it was discovered that by using a Hopfield network to reproduce
a predefined successor of a pattern instead of the pattern itself, the network can be used as
a hetero-associative memory, capable of reproducing sequences of patterns (xi ⇒ xi+1).
The next step towards graph processing was to use Hopfield networks to learn the task of
reproducing all the successors (or predecessors) of a node, that is, to learn the mapping
xi ⇒ succ[xi], where xi is the ith node label and succ[xi] denotes a vector obtained by
stacking together the labels of all the successors of node xi, one after another. For such a task
the maximum outdegree of a node (the maximum number of its successors) has to be known
prior to the network training. NIL patterns are used as extra successor labels whenever
a considered node xi has an outdegree smaller than the maximum value chosen. The last
and somewhat different application of Hopfield networks was to use a Hopfield network once
again as an auto-associative memory, used for retrieving whole graphs. In such a case the graph
adjacency matrices (N ×N) are encoded into a network having N(N − 1)/2 neurons [3],
where N is the number of nodes in a graph. To obtain adequate generalisation, graphs
isomorphic to those in the training set are generated and fed to the network [17].
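The assembly of training pairs for the successor-reproduction task xi ⇒ succ[xi] can be sketched as follows. This is a minimal illustration of the NIL padding only, not of the Hopfield network itself; the label size, maximum outdegree, zero-valued NIL pattern and the toy graph are assumptions of the example (Python used for illustration; the thesis software is written in GNU Octave):

```python
import numpy as np

LABEL_SIZE = 3
MAX_OUTDEGREE = 2           # must be known prior to network training
NIL = np.zeros(LABEL_SIZE)  # illustrative NIL pattern for missing successors

def successor_target(node, labels, children):
    """Stack the labels of all successors of `node`, padded with NIL
    patterns up to MAX_OUTDEGREE, into one fixed-length target vector."""
    succ = [labels[c] for c in children[node]]
    succ += [NIL] * (MAX_OUTDEGREE - len(succ))
    return np.concatenate(succ)

# Tiny graph: node 0 -> {1, 2}, node 1 -> {2}, node 2 -> {} (terminal).
labels = {0: np.array([1., 0., 0.]),
          1: np.array([0., 1., 0.]),
          2: np.array([0., 0., 1.])}
children = {0: [1, 2], 1: [2], 2: []}

# Training pairs (input pattern, target) for the hetero-associative task.
pairs = [(labels[i], successor_target(i, labels, children)) for i in children]
```

Every target has the same fixed length MAX_OUTDEGREE · LABEL_SIZE, which is what allows a network with a fixed number of units to learn the mapping.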
4.2. RAAM
The Recursive Auto-Associative Memory (RAAM) was introduced by Pollack in
1990 [1]. The RAAM model is a generalisation of the Hopfield network model [3], providing
means to meaningfully encode directed positional acyclic graphs (DPAGs). A distinctive
feature of the RAAM model is that it can be used to encode graphs with labelled terminal
nodes only. (The terminal nodes are nodes with outdegree equal to zero, that is, nodes having
no children. In the case of trees, leaves are terminal nodes.) That is, no node other than the
terminal nodes of a graph may be labelled. No edge labels are permitted. The most
straightforward domain of application for the RAAM model is thus natural language processing,
where sentences can be decomposed into syntax graphs.
The RAAM model is capable of:
— building a compressed representation of structured data
— building a meaningful representation: similar samples are represented in a similar way
— constrained generalisation: representing data absent from the training set
The RAAM model is composed of two units: compressor and reconstructor. Together
they form a three-layer feed-forward neural network which works as an auto-associative
memory. The compressor is a fully connected two-layer neural network with n input lines
and m output neurons. The number of output neurons, m, determines the size of a single
encoded node representation. The number of input lines, n, must be a multiple of m, such that
n = k · m, where k is the maximum outdegree of a node in the considered graphs. For each
terminal node, its representation consists of its original label. For each non-terminal node i,
its representation xi is built by feeding the compressor with the encoded representations of the
ith node's children.
To assure that the compressed representation is accurate and lossless, it is fed to the
reconstructor. The reconstructor is also a fully connected two-layer neural network, however
it has m input lines and n output neurons. It is fed with compressed representations of nodes
and is expected to produce the original data that was fed to the compressor. This procedure
is repeated for all non-terminal nodes of a graph, until all encoded representations can be
accurately decoded into the original data. More precisely, the representation xi of the ith node of
the graph is given by Eq. 4.1, where f denotes the function implemented by the compressor
unit, li the label of the ith node, xch[i] a vector obtained by stacking the representations of all
children of the ith node one after another, and k the maximum outdegree of a node in the graph.
Figure 4.1. A sample graph that can be processed using RAAM (terminal nodes A, B, C, D; non-terminal node 1 with children A and B, node 2 with children C and D, and root node 3 with children 1 and 2)
Figure 4.2. Training set for the example graph: three compressor-reconstructor pairs, (A,B) → x1 → (A′,B′), (C,D) → x2 → (C′,D′), and (x1,x2) → x3 → (x1′,x2′)
xi = li, if node i is terminal
xi = f(xch[i]), otherwise    (4.1)
A sample graph that can be encoded using the RAAM model is presented in Fig. 4.1. The
non-terminal nodes are enumerated for convenience only (only terminal nodes are labelled).
To encode the sample graph, the representations of nodes 1 and 2 must be built first.
The representation of node 1 is built by feeding the pair of labels (A,B) to the compressor
which encodes them into the representation x1. The representation x1 is then fed to the
reconstructor, which produces a pair of labels (A′,B′). If the resulting labels A′ and B′ are
not similar enough to the original labels A and B, the error is backpropagated through the
compressor-reconstructor three-layer network. Similarly, the pair (C,D) is processed by the
same compressor-reconstructor pair and compressed into the representation x2. Then, the
pair (x1,x2) is once again fed to the compressor, which produces x3, the representation
of the root node. This is also the compressed representation of the whole graph, from
which the whole graph can be reconstructed by using the reconstructor unit. The training
set, consisting of three label pairs, is presented in Fig. 4.2. The light grey areas denote the
compressor network, while the dark grey areas denote the reconstructor. Such a training set
(or a larger one if the dataset consists of more than one graph) must be repeatedly processed
by the RAAM model in the training phase. When the model is trained, the compression of
the whole graph occurs as presented in Fig. 4.3. Reconstruction of the graph is presented
in Fig. 4.4. It is worth mentioning that a trained RAAM model can be used to process graphs
with different structures.
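The recursive encoding of Eq. 4.1 can be sketched as follows (Python, for illustration; the thesis software is written in GNU Octave). The compressor is stood in for by a single tanh layer with random, untrained weights; in the actual RAAM model these weights are learned jointly with the reconstructor by backpropagation. All sizes and label values are assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4          # size of a compressed node representation
k = 2          # maximum outdegree of a node in the graph
n = k * m      # compressor input size

# Untrained compressor weights, for illustration only.
W = rng.normal(scale=0.1, size=(m, n))
b = np.zeros(m)

def compress(children_reprs):
    """Eq. 4.1: representation of a non-terminal node, built from its
    children's representations stacked one after another."""
    return np.tanh(W @ np.concatenate(children_reprs) + b)

# Terminal-node labels for the graph of Fig. 4.1 (arbitrary values).
A, B, C, D = (rng.normal(size=m) for _ in range(4))

x1 = compress([A, B])    # node 1
x2 = compress([C, D])    # node 2
x3 = compress([x1, x2])  # root: compressed representation of the whole graph
```

Note that x1 and x2 are themselves inputs when x3 is built, which is exactly the moving target problem mentioned above: as the compressor weights change during training, so do parts of its own training data.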
Figure 4.3. Graph compression using the trained RAAM model: (A,B) → x1, (C,D) → x2, (x1,x2) → x3
Figure 4.4. Graph reconstruction from x3 using the trained RAAM model: x3 → (x1′,x2′), x1′ → (A′,B′), x2′ → (C′,D′)
A significant feature of the RAAM model is that a small reconstruction error in the case of
non-terminal nodes may render the reconstruction of terminal nodes impossible. Therefore,
in the process of training a RAAM classifier, it is necessary to set the acceptable
reconstruction error value much smaller for the non-terminal nodes than for the terminal ones.
A major drawback of the RAAM model is the moving target problem. That is, a part of
the learning set (the representations x1 and x2 from the example) changes during training.
In such a case the training phase may not converge to an acceptable state [3]. However,
a different training schema is possible, similar to the BPTS algorithm [3]. (An extensive
description of the BPTS algorithm, encoding networks, and the shared weights technique is
provided in the following chapters.) An encoding network is built out of identical instances
of the compressor and reconstructor units (Fig. 4.5), with structure reflecting the structure
of the processed graph (if the dataset consists of multiple graphs, such procedure is repeated
for every graph in the dataset). All instances of the compressor unit share their weights and
all instances of the reconstructor unit share their weights - which is called the shared weights
technique. The labels of terminal nodes are fed to the processing network and the resulting
error can be backpropagated from the last layer using e.g. the Backpropagation Through
Structure [18] algorithm (BPTS). It is worth mentioning that the authors of such a modified
RAAM model propose using an additional layer of hidden neurons in the compressor and
reconstructor units. In such a case the light grey and dark grey areas in the figures would
denote not only neuron connections between two layers, but also an additional hidden layer.
Such a modification makes it possible to partially separate the problem of the data model complexity (i.e.
Figure 4.5. RAAM encoding network for the sample graph: compressor instances (shared weights) build x1 from (A,B), x2 from (C,D) and x3 from (x1,x2); reconstructor instances decode x3 back into (A′,B′,C′,D′)
how complex should the RAAM model be to properly compress the data) from the size of
terminal node labels which affects directly the number of input lines to the compressor unit
and thus the compressed representation size.
The most important parameter of the RAAM model is the size of the compressed
representation. On the one hand, the size should be large enough to contain all the necessary
compressed information about the encoded graph. On the other hand, it should be small enough
for the compression mechanism to build a minimal representation, which stores only the
necessary information about the dataset. If the size is too large, the trained model stores
redundant information, memorizing the training set. This results in a poor ability
to process unseen data. Experiments with natural language syntax processing [1] proved
that when the size of the compressed representation is adequate for the problem, the RAAM
model exhibits some constrained generalisation properties. That is, unseen data with
structure similar to the training set was processed properly by the trained RAAM model.
A drawback of the standard RAAM model is the termination problem. The reconstructor
can’t distinguish between terminal representations (node labels) and compressed representa-
tions, which should be further reconstructed. To solve this problem, an additional encoding
neuron can be introduced (increasing the representation size by one), which takes a different
value for terminal and non-terminal representations [19].
4.3. LRAAM
The most important constraint of the RAAM model is the fact that only terminal nodes
of the processed graphs (DPAGs) can contain labels. This problem was addressed by the
Labeling RAAM model [2] (LRAAM, 1994), which separated the concepts of node labels
and node representations. In the RAAM model the terminal nodes are represented by their
labels. The LRAAM model introduced the concept of pointers, which was used to describe
a node representation which has to be learnt, regardless of whether the node is terminal or
not. The pointers are built by compressor units (FNN with two or more layers) and they
are decoded into the graph structure by reconstructor units. More precisely, the pointer to the
ith node of the graph is calculated according to Eq. 4.2, where xi stands for the pointer to the
ith node, f is the function implemented by the compressor unit, li is the ith node label, xch[i]
is a vector obtained by stacking the pointers to all children of the ith node one after another,
and k is the maximum outdegree of a node in the considered graph.
xi = f (li, xch[i]) (4.2)
Whenever a node's outdegree is smaller than k (especially in the case of terminal nodes),
the missing child pointers are substituted by the NIL pointer, a special value representing the
lack of a node. The value of the node label is stacked together with all the child pointer values
to form an input vector which is fed to the compressor unit. The number of output neurons
of the compressor unit (the size of a pointer xi) is m. Let's denote the size of the label li
by p. The compressor unit must have n = p + k · m input lines, which is also the number
of reconstructor unit output neurons. The possibility of describing each graph node with its
label provides a simple solution to the termination problem [2]. An additional value can be
appended to each label, stating whether the node is a terminal node or not. By using this method
no change in the LRAAM model is needed.
Just like the RAAM model, the LRAAM model experiences the moving target problem.
The same technique of shared weights can be applied [3], which results in building
a large encoding network composed of identical units. A sample graph and the encoding
network obtained by cloning the compressor and reconstructor units to reflect the sample
graph structure are presented in Fig. 4.6. As A and B are terminal nodes, their labels are fed
to the compressor unit together with NIL pointers representing the missing nodes. Then
the compressed representation xA is built for node A and the compressed representation xB
is built for node B. The representations xA and xB are then fed together with the label C
to the compressor, to build the representation xC of node C, which is also the compressed
representation of the whole graph.
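The pointer computation of Eq. 4.2 can be sketched in the same style as before (Python, for illustration; the thesis software is written in GNU Octave). The single tanh layer with random, untrained weights stands in for the compressor, and all sizes are assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3              # label size
m = 4              # pointer size
k = 2              # maximum outdegree
n = p + k * m      # compressor input size (= reconstructor output size)

W = rng.normal(scale=0.1, size=(m, n))  # untrained, illustrative weights
NIL = np.zeros(m)                       # NIL pointer for missing children

def pointer(label, child_pointers):
    """Eq. 4.2: x_i = f(l_i, x_ch[i]), with NIL padding up to k children."""
    child_pointers = list(child_pointers) + [NIL] * (k - len(child_pointers))
    return np.tanh(W @ np.concatenate([label] + child_pointers))

# The graph of Fig. 4.6: terminal nodes A and B, root C with children A, B.
lA, lB, lC = (rng.normal(size=p) for _ in range(3))
xA = pointer(lA, [])        # terminal node: only NIL child pointers
xB = pointer(lB, [])
xC = pointer(lC, [xA, xB])  # pointer to C, i.e. to the whole graph
```

In contrast to RAAM, every node contributes a label here, so terminal and non-terminal nodes are processed by exactly the same unit.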
An extension of the LRAAM model exists for cyclic graphs [3]. Whenever an edge
forming a cycle is found, it is converted to a single time unit delay (denoted by q−1 [7]).
Figure 4.6. LRAAM encoding network for the sample graph: labels A and B, padded with NIL pointers, are compressed into xA and xB; together with label C they are compressed into xC; reconstructor instances decode the pointers back into A′, B′ and C′
A sample cyclic graph and the resulting LRAAM model with one time delay are presented
in Fig. 4.7. The graph presented is similar to the graph used in the previous example, with
the exception of the directed edge A⇒C. The additional edge forms a cycle so it must be
represented as a time delay. Such an approach makes it possible to deal with cyclic graphs,
however it is achieved at the expense of model simplicity. The shared weights technique
made it possible to treat the encoding tree structure as a single feed-forward neural network
with shared weights. However, after adding time delays the training of the network will have
to consist of multiple time steps, repeated until convergence of the pointer values is reached.
A distinctive feature of the LRAAM model is that a compressed representation is built for
a given dataset (consisting of DPAGs) and the correctness of the representation is verified by
mirroring the compression process and reconstructing the original data. When the obtained
representation is accurate enough, the output of the LRAAM model is "frozen" and fed to a
separate classifier which processes it and yields classification or regression results. The same
applies to any new, unseen data, which is fed to the trained LRAAM model and, if the model
was built correctly, is compressed into a meaningful representation. Such separation of
representation building and processing can be attractive for two reasons. First, the LRAAM
model parameters, such as the size of the representation, can be tuned in a straightforward
manner by observing the minimal value below which the original data cannot be
accurately reconstructed from the compressed vectors. Secondly, any vectorial classifier
used for unstructured data can be used to process such compressed representation. On the
other hand, it is often safe to presume that not all the data contained in node labels is crucial
Figure 4.7. LRAAM encoding network for the cyclic graph shown
for the classification/regression process for which the compressed representation is needed.
Such an approach lies beyond the standard LRAAM model and was introduced in the folding
architecture model [18].
4.4. Folding architecture and BPTS
The ideas of folding architecture and Backpropagation Through Structure (BPTS) were
first introduced in 1996 [20] [18]. The model of the folding architecture is similar to that of
LRAAM and is capable of processing rooted DPAGs (that is, DPAGs with a distinguished
root node; for each DPAG such a node can be selected). The folding architecture model
is a feed-forward, fully connected multi-layer neural network, consisting of two subsequent
parts performing different tasks: the folding network and the transformation network. The
folding network is similar to the LRAAM compressor unit: its input layer consists of
p + k·m input lines, p for the processed node label and k·m for the compressed representations
of the node's children. The folding network can consist of any number of sigmoid neuron
layers and its last layer produces the compressed representation of a node, of size m. The
folding network is applied to every node in the graph, starting from the terminal nodes, so
that each internal graph node is provided with the compressed representations of its child
nodes. The transformation network is applied to the root node only. It can consist of any
number of sigmoid neuron layers and an output layer. It takes as input the compressed
representation of the root node and produces an output, which should match the expected
output for a graph. Therefore, the transformation network is used to perform classification
or regression tasks in terms of whole graphs. Let’s denote the folding network as f and the
transformation network as g. The function f can be described by Eq. 4.3, where xi stands for
the ith node representation, li is the ith node label and xch[i] is a vector obtained by stacking
representations of all children of ith node one after another. The g function can be described
by Eq. 4.4, where or is the output of the root node.
xi = f (li, xch[i]) (4.3)
or = g(xr) (4.4)
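The bottom-up application of f with g applied only to the root can be sketched as follows. The single-layer stand-ins for the folding and transformation networks, the concrete sizes and the zero-vector NIL encoding are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
P, M, K = 2, 3, 2                          # label size p, state size m, max outdegree k
Wf = rng.normal(0, 0.1, (M, P + K * M))    # single-layer stand-in for the folding network f
Wg = rng.normal(0, 0.1, (1, M))            # single-layer stand-in for the transformation network g

def fold(label, children):
    """Apply f to a node: its label stacked with its children's representations."""
    kids = [fold(*c) for c in children]
    kids += [np.zeros(M)] * (K - len(kids))        # NIL for missing children
    return np.tanh(Wf @ np.concatenate([label] + kids))

def transform(root_label, root_children):
    """Apply g to the root representation only, yielding the graph output."""
    return np.tanh(Wg @ fold(root_label, root_children))

# Graph from Fig. 4.8: root C with terminal children A and B.
A = (np.array([1.0, 0.0]), [])
B = (np.array([0.0, 1.0]), [])
o_C = transform(np.array([1.0, 1.0]), [A, B])
```

The recursion mirrors Eq. 4.3 at every node, while Eq. 4.4 is evaluated once, at the root.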
The original idea behind the folding architecture is that the compressed representation is
built only for classification purpose and is fed directly to the transformation network. The
output of the transformation network is then compared with the expected output and the error
can be backpropagated through the folding architecture network by using a gradient-descent
procedure, Backpropagation Through Structure. BPTS was invented as a generalisation of
the Backpropagation Through Time method (BPTT [21]), which in turn was invented for
Figure 4.8. Virtual unfolding, reflecting the sample graph structure
error backpropagation in recurrent neural networks. BPTS can be described in terms of the
unfolded network. The unfolded network is never built physically but can be imagined as a
graph built of folding network instances in a way which reflects the structure of the processed
graph, with the transformation network added on top of it (attached to the representation of
the root node). The unfolded network for a sample graph is presented in Fig. 4.8. The light
grey areas are instances of the folding network, while the dark grey area is the transformation
network, which for the root node C produces output oC.
To explain the idea of BPTS it is necessary to briefly summarize the idea of BPTT
(a detailed explanation can be found in RNN-concerned publications, e.g. [22]). Let’s
consider a fully-connected recurrent neural network, designed for classifying sequences of
samples of size m. The network consists of a single layer of n units with n× n recurrent
connections, producing an output y(t) at time t. Let xs(t) denote the m-tuple of input signals
corresponding to the sample fed at time t. Further, let x(t) be the merged input fed to the
network at time t, obtained by concatenating the vectors xs(t) and y(t). To distinguish
between the elements of vector x(t) corresponding to xs(t) and to y(t), let’s introduce two
subsets of indices: I and U (Eq. 4.5).
xj(t) = { xsj(t) if j ∈ I; yj(t) if j ∈ U } (4.5)
Let wk j denote the network weight on connection to the kth neuron from input x j, netk(t)
denote the weighted sum of neuron inputs fed to the activation function of the kth neuron,
fk (Eq. 4.6) and J(t) denote the overall mean square error of the network at time t.
netk = ∑j∈(I∪U) wkj xj (4.6)

yk = fk(netk) (4.7)
4.4. Folding architecture and BPTS 16
J(t) = −(1/2) ∑k∈U [ek(t)]² (4.8)

ek(t) = dk(t) − yk(t) (4.9)
Let’s consider a recurrent network which was operating from a starting time t0 up to time
t. We may represent the computation process performed by the network by unrolling the
network in time, that is building a feed-forward neural network made of identical instances
of the considered recurrent neural network, one instance per time step τ, τ ∈ (t0, t]. To
compute the gradient of J(t) at time t it is necessary to compute the error values εk(τ) and
δk(τ) for k ∈U and τ ∈ (t0, t] by means of equations 4.10, 4.11 and 4.12.
εk(t) = ek(t) (4.10)
δk(τ) = f ′k(netk(τ))εk(τ) (4.11)
εk(τ−1) = ∑j∈U wjk δj(τ) (4.12)
Then, the gradient of J(t) is calculated with respect to each weight wij by means of
Eq. 4.13.

∂J(t)/∂wij = ∑τ=t0+1..t δi(τ) xj(τ−1) (4.13)
At time t an external error e(t) is injected into the network, usually being the difference
between the trained network output at time t: y(t) and the expected output d(t) (Eq. 4.9).
The subsequent steps compute the error ε(τ) by backpropagating the original error through
the layers of the unrolled neural network.
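Equations 4.10-4.13 can be illustrated with a minimal fully-connected recurrent network. The network size, the input sequence and the fact that the error is injected only at the final time step are illustrative assumptions:

```python
import numpy as np

# Tiny network matching Eqs. 4.5-4.13: two units, one input signal (example sizes).
rng = np.random.default_rng(1)
n_units, m_in = 2, 1
W = rng.normal(0, 0.5, (n_units, m_in + n_units))   # w_kj over input (I) and unit (U) indices
f = np.tanh
fprime = lambda net: 1 - np.tanh(net) ** 2

def bptt(xs_seq, d):
    """Forward pass over the sequence, then backpropagation per Eqs. 4.10-4.13."""
    y = np.zeros(n_units)
    xs, nets = [], []
    for s in xs_seq:                            # forward, tau = t0+1 .. t
        x = np.concatenate([s, y])              # Eq. 4.5: merged input
        xs.append(x)
        net = W @ x                             # Eq. 4.6
        nets.append(net)
        y = f(net)                              # Eq. 4.7
    eps = d - y                                 # Eqs. 4.9-4.10: error injected at time t
    grad = np.zeros_like(W)                     # accumulates Eq. 4.13
    for tau in reversed(range(len(xs_seq))):
        delta = fprime(nets[tau]) * eps         # Eq. 4.11
        grad += np.outer(delta, xs[tau])        # Eq. 4.13: xs[tau] is the input at step tau
        eps = W[:, m_in:].T @ delta             # Eq. 4.12: backpropagate through U indices
    return grad, y

grad, y_final = bptt([np.array([0.5]), np.array([-0.2])], d=np.array([1.0, -1.0]))
```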
The BPTS method applies the BPTT algorithm to the virtual unfolding. Backpropagation starts at the last
layer of the virtual unfolding network, where the classification/regression error is calculated
(the last layer of the transformation network applied to the root node). The error is injected
into this layer and backpropagated using the BPTT algorithm down to the first layer of the
folding network applied to the root node. The error is then backpropagated to the last layers
of the folding network applied to the root's children, as if there were a physical connection.
Such backpropagation continues down to the first layers of the folding network applied to the
terminal nodes.
The folding architecture model introduced important new ideas in the domain of connectionist
graph processing models. First of all, the representation building model is simpler
than the LRAAM model and the folding architecture converges much faster than LRAAM
for the same datasets [20]. Secondly, three important concepts were adapted from the domain
of recurrent neural networks: error injection, BPTS and the unfolding of the network as
a generalisation of unrolling a recurrent neural network.
4.5. Generalised recursive neuron
The generalised recursive neuron, introduced in 1997 [23], is a generalisation of the
recurrent neuron, which in turn is used in recurrent neural networks (RNNs). It was created
to provide an elementary component for the graph processing models which would be by
definition better suited to solve graph processing tasks than the standard neuron. It is beyond
the scope of this thesis to describe the idea itself and its applications in detail. Nevertheless,
this summary of the connectionist models would be incomplete without mentioning the
generalised recursive neuron, as it was used in some of the following models instead of
the common neural network neuron, yielding promising results [7].
A generalised recursive neuron has two kinds of inputs:
— plain neuron inputs, which are fed with elements of the currently processed node label
— recursive inputs, which are fed with the memorized output of the neuron for all children
nodes of the currently processed node
In this way, the neuron output changes after each training algorithm iteration, according
to its outputs for all the child nodes.
4.6. Recursive neural networks
The folding architecture model had a large impact on a theory introduced in 1998, the
structural transduction formalism [7]. It is beyond the scope of this thesis to describe
the formalism itself; let's mention, however, those of its aspects that highly affected the
subsequent connectionist models described. A structural transduction, in general, is a relation which
maps a labelled DPAG (only node labels, no edge labels) into another DPAG. The type of
transductions that the authors focus on are IO-isomorph transductions, that is transductions
that don't change the graph topology, only the node labels. (According to the authors,
managing transductions which are not IO-isomorph is highly nontrivial and remains an open
research problem.) Let's describe an IO-isomorph transduction. Let Gs be the original
DPAG with labelled nodes. Let Go be a graph with the same topology as Gs but with node
labels replaced by expected node outputs for each node (it’s a node classification/regression
problem). A transduction Gs ⇒ Go can be described in terms of two functions, f and g,
where f is the state transition function, which builds the representation (the state) x of the
graph Gs for the model and g is the output function, which produces the expected output
graph Go according to the representation x and the original graph Gs. More precisely, for
Figure 4.9. A sample acyclic graph and the corresponding encoding network
each node i belonging to the original graph Gs its state xi and output oi are defined by
Eq. 4.14 and 4.15, where li is the ith node label and xch[i] is a vector obtained by stacking
representations of all children of the ith node one after another. In the original equations the ith
node itself was also an argument of both functions; however, it is unnecessary from the
point of view of recursive neural networks and was therefore omitted.
xi = f (li, xch[i]) (4.14)
oi = g(li, xi) (4.15)
The transduction can be implemented e.g. by a hidden recursive model (HRM) or by a
recursive neural network (a generalisation of a recurrent neural network, which is able to
process not only sequences, but also DPAGs). In the case of a recursive neural network
the functions f and g are implemented by two feed-forward neural networks. Identical
instances of the f network are connected according to the Gs graph structure, creating the
encoding network. (The encoding network is the recursive neural network unfolded through
the structure of the given DPAG.) Calculation of the state x is performed by applying the
f network to the terminal nodes of Gs and then proceeding up to the root, according to
the encoding network topology. When the state calculation is finished, the g network is
applied to every node state xi, producing the requested output oi. A sample graph and the
corresponding encoding network are presented in Fig. 4.9.
As various kinds of neural networks can be used as the f function (first- and second-order
recurrent networks, etc.), the recursive neural network model can become quite complex and
computationally powerful. However, for the scope of this work, it is sufficient to mention
that the concepts of state transition function, output function and encoding network unfolded
through structure (originating from the folding architecture) were later reused in the Graph
Neural Network model. Another important feature of this model is a successful adaptation
Figure 4.10. A sample acyclic graph and the corresponding encoding network
of ideas originating from the recurrent neural network domain, just like in the case of the folding
architecture. The authors of both models suggested that the adaptation of other RNN ideas may also
be possible, which could lead to novel graph processing solutions [18] [7].
4.7. Graph machines
In 2005, another model for DPAG processing was proposed, the Graph Machines [10].
The model is similar to that of the folding architecture and of recursive neural networks. A function
f is evaluated at each node of a graph except the root node. Then, the output function
g, which can be the same as f , is evaluated at the root node, and yields an output for
the whole graph. The function f can be described by Eq. 4.16, where xi stands for the
ith node representation, li is the ith node label and xpa[i] is a vector obtained by stacking
representations of all parents of ith node one after another. The g function can be described
by Eq. 4.17, where or is the output of the root node.
xi = f (li, xpa[i]) (4.16)
or = g(lr, xpa[r]) (4.17)
The instances of the f function are connected together to form an encoding network for
a given graph, which reflects the structure of the graph. Such an encoding network is built for
every graph in the training set. The resulting set of encoding networks is called a graph
machine. A single encoding network and the corresponding sample graph are shown in
Fig. 4.10, where the light grey areas denote instances of the f network and the dark grey area
denotes the g network.
The instances of the f function share weights within a single encoding network and
across all the encoding networks. The instances of the g function share weights
across all the encoding networks. (This is called the shared weights technique.)
The f and g functions can be implemented by neural networks. Their training occurs on all
the graphs from the training set simultaneously. That is, ∂J/∂w is summed over all function
instances among all the graphs, where w is the set of all weights belonging either to the f or
the g function. A standard gradient optimization algorithm can be used for the training phase
(Levenberg-Marquardt, BFGS, etc.).
The authors of graph machines stated explicitly two fundamental rules which had previously
been applied implicitly in various graph processing connectionist models:
— The structure of a model should reflect the structure of the processed graph.
— The representation of structured data should be learnt instead of being handcrafted. [3]
5. Graph neural network implementation
The Graph Neural Network model [8] (GNN) is a quite recent (2009) connectionist
model, based on recursive neural networks and capable of classifying almost all types of
graphs. The main difference between the GNN model and previous connectionist models is
the possibility of processing directly nonpositional and cyclic graphs, containing both node
labels and edge labels. Although some similar solutions were introduced in an earlier model,
the RNN-LE [6] in 2005, it was the GNN model that combined several techniques with a
novel learning schema to provide a direct and flexible method for graph processing.
5.1. Data
The GNN model is built once for a training set of graphs. In fact, a whole set of graphs
can be merged into one large disconnected graph, which can then be fed to the model. For
a given dataset each node n is described by a node label ln of fixed size |ln| ≥ 1. Each
directed edge u⇒ n (from node u to node n) is described by an edge label l(n,u) of fixed size
|l(n,u)| ≥ 0. To deal with both directed and undirected edges, the authors propose to include
a value dl in each edge label, denoting the direction of an edge. However, for a maximally
general model, in this implementation all edges were considered directed. Undirected
edges were encoded prior to processing as pairs of directed edges with the same labels.
A GNN model can deal with both graph-focused and node-focused tasks. In
graph-focused tasks, for each graph an output on of fixed size |on| ≥ 1 is sought, which can
denote e.g. the class of the graph. In the domain of chemistry, where each graph describes a
chemical compound, this could describe e.g. the reactivity of a compound. In node-focused
tasks, such output on is sought for every node in every graph. An example of such task can
be the face localisation problem, where for each region of an image the classifier should
determine whether it is part of a face or not. In the rest of this thesis, the node-focused task is
described, unless stated otherwise.
For this implementation each graph was represented by three .csv files. Each ith row of
the nodes file contained the ith node label (comma-separated). Each row of the edges file
contained information about a directed edge u⇒ n: the u node index (row number in nodes
file), the n node index and the edge label (comma-separated). Each ith row of the outputs file
contained the ith node output (comma-separated). An example is provided in Appendix A.
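As an illustration, the three files for a hypothetical two-node graph with a single directed edge 1⇒2 could be parsed as sketched below. The sample values are invented, not taken from Appendix A:

```python
import csv
import io

# Hypothetical contents of the three .csv files described above.
nodes_csv = "0.1,0.2\n0.3,0.4\n"    # one node label per row
edges_csv = "1,2,0.5\n"             # u node index, n node index, edge label
outputs_csv = "1\n0\n"              # one expected node output per row

def parse(text):
    """Read comma-separated rows of numbers from a string (stands in for a file)."""
    return [[float(value) for value in row] for row in csv.reader(io.StringIO(text))]

nodes, edges, outputs = parse(nodes_csv), parse(edges_csv), parse(outputs_csv)
```

In the actual implementation the strings would of course be replaced by opened files.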
5.2. Computation units
The GNN model consists of two computation units, fw and gw, where the w subscript
denotes the fact that both units are functions parametrized by a vector of parameters w, which
is separate for the f and for the g function. The fw unit is used for building representation
(the state) xn of a single node n. The gw unit is used for producing output on for a node
n, based on its representation xn. For a graph-focused task, the representation of the root
node is fed to the gw function to produce an output og for the whole graph. It is important
to note that for a given classifier there is only one fw unit and one gw unit (as in the
recursive neural network model). All instances of the fw unit share their weights and all
instances of the gw unit share their weights.
Let's denote by ne[n] the neighbors of node n, that is, the nodes u that are connected
to node n by a directed edge u⇒ n. Let’s further denote by co[n] the set of directed edges
pointing from ne[n] towards node n (edges u⇒ n). The general forms of fw and gw functions
are defined by equations Eq. 5.1 and Eq. 5.2, where ln denotes the nth node label, lco[n]
denotes the set of edge labels from co[n], xne[n] denotes states of nodes from ne[n], and lne[n]
denotes their labels.
xn = fw(ln, lco[n], xne[n], lne[n]) (5.1)
on = gw(xn, ln) (5.2)
For this implementation, minimal forms of these definitions were chosen:
xn = fw(ln, lco[n], xne[n]) (5.3)
on = gw(xn) (5.4)
These forms were chosen to prove that the model is capable of building a sufficient
representation of each node. That is, the model should be able to encode all the necessary
information from a node label ln into the state xn. This approach later proved to be
successful.
Of the two forms of the fw function mentioned in the original article [8], the nonpositional
form was chosen. The reason behind this choice was to provide the model with the most
general function possible, which could deal with both positional and nonpositional graphs.
The nonpositional form was also the one yielding better results in the experiments conducted
by the authors [8]. The final definitions of the fw and gw functions are shown below. All
instances of the hw unit share their weights.
xn = ∑u∈ne[n] hw(ln, l(n,u), xu) (5.5)
on = gw(xn) (5.6)
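Eq. 5.5 can be sketched as follows. The single-hidden-layer stand-in for the hw unit and the chosen sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
P, Q, S, H = 2, 1, 3, 4                   # label, edge-label, state and hidden sizes (examples)
W1 = rng.normal(0, 0.3, (H, P + Q + S))   # hidden layer of the h_w stand-in
W2 = rng.normal(0, 0.3, (S, H))           # output layer of the h_w stand-in

def h_w(l_n, l_nu, x_u):
    """One h_w instance: a tanh MLP over the stacked (l_n, l_(n,u), x_u) vector."""
    return np.tanh(W2 @ np.tanh(W1 @ np.concatenate([l_n, l_nu, x_u])))

def f_w(l_n, in_edges):
    """Eq. 5.5: the state x_n is the sum of h_w over all incoming edges u => n."""
    return sum(h_w(l_n, l_nu, x_u) for l_nu, x_u in in_edges)

# A node with two incoming edges, each carrying an edge label and a neighbor state.
x_n = f_w(np.ones(P), [(np.zeros(Q), np.zeros(S)), (np.ones(Q), np.full(S, 0.1))])
```

All h_w instances share W1 and W2, which is exactly the weight sharing described above.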
The units hw and gw were implemented as fully-connected three-layer feed-forward
neural networks (input lines performing no computation, a single hidden layer and an output
layer). For both units the hidden layer consisted of tanh neurons. For the hw unit the output
layer consisted of tanh neurons. That’s because the output of the hw unit contributes to the
state value and therefore should consist of bounded values only. For the gw unit the output
layer could consist either of tanh or linear neurons, depending on the values of on that were
to be learned.
At this point it’s worth mentioning that the final value of xn is calculated as a simple sum
of hw outputs. This corresponds to a situation where all the hw output values are passed to
a neural network in which the set of weights corresponding to a single hw input is shared
amongst all the hw inputs. Considering that a three-layer FNN used as the hw unit is already
a universal approximator, such an additional neural network, which just sums all
hw values using the same shared set of weights, is unnecessary. A simple sum should be
sufficient, and experimental results showed that this assumption holds.
The fw and gw units are presented in Fig. 5.1 and Fig. 5.2, where the comma-separated
list of inputs stands for a vector obtained by stacking all the listed values one after another.
Figure 5.1. The fw unit for a single node and one of the corresponding edges
Figure 5.2. The gw unit for a single node
The weights of both the hw and gw units were initialised according to standard
neural network practice, to avoid saturation of any tanh activation function: netj = ∑i wji yi ∈
(−1, 1), where netj is the weighted input to the jth neuron, yi is the ith input value and wji is the
weight corresponding to the ith input. The initial input weights of the gw unit were divided
by an additional factor, i.e. the maximum node indegree of the processed graphs, to take into
consideration the fact that the input of the gw unit consists of a sum of hw outputs. All the input
data (node and edge labels) was normalised appropriately before being fed to the model.
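This initialisation rule can be sketched as follows. The uniform draw and the concrete layer sizes are illustrative assumptions:

```python
import numpy as np

def init_weights(n_out, n_in, extra_factor=1.0, seed=3):
    """Draw weights so that for inputs yi in (-1, 1) the weighted input stays in
    (-1, 1): each |w_ji| < 1 / (n_in * extra_factor), hence |net_j| < 1."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-1.0, 1.0, (n_out, n_in)) / (n_in * extra_factor)

W_h = init_weights(5, 8)                    # hypothetical h_w layer: 8 inputs, 5 neurons
W_g = init_weights(2, 5, extra_factor=4.0)  # g_w input layer, maximum indegree 4 assumed
```

The `extra_factor` argument plays the role of the maximum node indegree mentioned above.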
5.3. Encoding network
Graph processing by a GNN model consists of two steps: building representation xn
for each node and producing an output on. As the representation of a single node depends
on other nodes' representations, an encoding network is built for every graph, reflecting the
structure of the graph. The encoding network consists of instances of the fw unit connected
according to the graph structure with a gw unit attached to every fw unit. A sample graph
and its encoding network are presented in Fig. 5.3. It can be seen that, as a cyclic dependency
exists in the sample graph, the calculation of the node states must be iterative, continuing
until convergence is reached.
Figure 5.3. A sample graph and the corresponding encoding network
5.4. General training algorithm
Let’s denote by x the global state of the graph, that is the set of all xn for every node
n in the graph. Let’s denote by l and o the sets of all labels and all outputs, respectively.
Let’s further denote by Fw and Gw (global transition function and global output function)
the stacked versions of fw and gw functions, respectively. Now, equations 5.3 and 5.4 can
be rewritten as Eq. 5.7 and Eq. 5.8.
x= Fw(l, x) (5.7)
o= Gw(x) (5.8)
The GNN training algorithm can be described as follows:
1. initialize hw and gw weights
2. until stop criterion is satisfied:
a) initialize x randomly
b) FORWARD: calculate x= Fw(l, x) until convergence
c) BACKWARD: calculate o= Gw(x) and backpropagate the error
d) update hw and gw weights
The stop criterion used in this implementation was a maximum number of iterations.
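The loop above can be sketched as a skeleton. The `forward`, `backward` and `update` arguments are placeholders for the FORWARD, BACKWARD and weight-update procedures described later; the scalar problem used to exercise it is a toy, not the GNN itself:

```python
def train_gnn(forward, backward, update, w, max_iterations=100):
    """Skeleton of the training algorithm above."""
    for _ in range(max_iterations):     # stop criterion: maximum number of iterations
        x = forward(w)                  # FORWARD: iterate x = Fw(l, x) until convergence
        grad = backward(x, w)           # BACKWARD: compute o = Gw(x), backpropagate error
        w = update(w, grad)             # update hw and gw weights
    return w

# Toy instantiation: "training" a single scalar weight towards the value 2.0.
w = train_gnn(forward=lambda w: w,
              backward=lambda x, w: w - 2.0,
              update=lambda w, g: w - 0.5 * g,
              w=10.0)
```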
5.5. Unfolded network and backpropagation
To solve the problem of cyclic dependencies, the GNN model adopts a novel learning
algorithm for the encoding network. The encoding network is virtually unfolded through
time until at time tm the state x converges to a fixed point x of the function Fw. Then the
output o is calculated. The unfolded network for the sample graph is presented in Fig. 5.4.
Each time step consists of evaluating the fw function at every node. Importantly,
the connections between nodes are taken into consideration only between time steps. In
this way, the problem of cycles ceases to exist and the processed graph can even be fully
connected.
Figure 5.4. Unfolded encoding network for the sample graph
After the output o is calculated, the error en = (dn − on)² is injected into the corresponding
gw unit for every node n, where dn denotes the expected node output. The error is
backpropagated through the gw layer, yielding the value ∂ew/∂o · ∂Gw/∂x(x). That value is
backpropagated through the unfolded network using the BPTT/BPTS algorithm. Additionally,
at each time step the ∂ew/∂o · ∂Gw/∂x(x) error is injected into the fw layer, as presented in
Fig. 5.5. In this way the error backpropagated through the fw layer at time ti comes from two
sources. First, it is the output error of the network, ∂ew/∂o · ∂Gw/∂x(x). Secondly, it is the
error backpropagated from the subsequent time layers of the fw unit from all nodes connected
with the given node u by an edge u⇒n.
Figure 5.5. Error backpropagation through the unfolded network
By injecting the same error ∂ew/∂o · ∂Gw/∂x(x) at each time step, an important assumption is
made, which leads to a simplification of the whole backpropagation process. If the state x
converged to a fixed point x of the function Fw at time tm, then it can be safely assumed that
using the value x at every previous time step ti instead of x(ti) would yield the same result
at time tm.
Storing the intermediate values of x(ti) and backpropagating the error directly using
the BPTT/BPTS algorithm would be memory consuming. However, due to the assumption
that x(ti) = x, a different backpropagation algorithm can be used [8], originating from the
domain of recurrent neural networks: the Almeida-Pineda algorithm [21] [22].
Basically, the modified Almeida-Pineda algorithm consists of initializing an error
accumulator z(0) = ∂ew/∂o · ∂Gw/∂x(x) and then accumulating the error by backpropagating
z(j) through the fw layer until its value converges to z. At each step j the additional error
∂ew/∂o · ∂Gw/∂x(x) is injected, as was shown previously. If the state calculation converged, the
error calculation is guaranteed to converge too. The number of iterations needed for the
error accumulator z to converge can be different than the number of time steps needed for
the state x to converge. In the conducted experiments it was usually much smaller.
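The accumulation step z(j+1) = z(j)·A + b can be sketched as follows. The matrix A and the vector b are invented stand-ins for ∂Fw/∂x at the fixed point and for the injected error ∂ew/∂o · ∂Gw/∂x(x):

```python
import numpy as np

A = np.array([[0.2, 0.1, 0.0],
              [0.0, 0.3, 0.1],
              [0.1, 0.0, 0.2]])          # stand-in for dFw/dx at the fixed point
b = np.array([0.5, -0.3, 0.2])           # stand-in for dew/do * dGw/dx(x)

def accumulate_error(A, b, tol=1e-12, max_steps=1000):
    """Almeida-Pineda style accumulation: z <- z*A + b until convergence."""
    z = b.copy()                         # z(0) is the injected error itself
    for _ in range(max_steps):
        z_next = z @ A + b               # backpropagate through the fw layer, re-inject b
        if np.max(np.abs(z_next - z)) <= tol:
            return z_next
        z = z_next
    return z

z = accumulate_error(A, b)
```

At convergence z satisfies z = z·A + b, i.e. it is the fixed point of the accumulation.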
5.6. Contraction map
In the previous paragraphs it was stated that a fixed point x of the function Fw is sought.
However, how can it be assumed that such a fixed point exists for the function Fw? How can
it be assured that it will be reached by iterating x = Fw(x) from a random initial x?
Actually, all of the above can be assured by making Fw a contraction map. A contraction
map (a non-expansive map) is a function Fw for which d(Fw(x1), Fw(x2)) ≤ d(x1, x2),
where d(x, y) is a distance function. For this implementation the distance function d(x, y) =
maxi(|xi − yi|) was chosen, as it is independent of the state size and therefore of the number
of nodes in a given graph. The Banach Fixed Point Theorem states that a contraction map
Fw(x) has the following properties:
— it has a single fixed point x
— it converges to x from every starting point x(t0)
— the convergence to x is exponentially fast [8].
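These properties can be illustrated with a toy contraction. The linear map below is an illustrative stand-in for Fw; it contracts under the chosen distance because every row of |M| sums to less than 1:

```python
import numpy as np

M = np.array([[0.4, 0.1],
              [0.2, 0.3]])
c = np.array([1.0, -1.0])
Fw = lambda x: M @ x + c                 # stand-in for the global transition function

def fixed_point(Fw, x0, tol=1e-10, max_steps=1000):
    """Iterate x = Fw(x) until the max-distance between steps falls below tol."""
    x = x0
    for _ in range(max_steps):
        x_next = Fw(x)
        if np.max(np.abs(x_next - x)) <= tol:    # d(x, y) = max_i |x_i - y_i|
            return x_next
        x = x_next
    return x

xa = fixed_point(Fw, np.array([100.0, -50.0]))
xb = fixed_point(Fw, np.array([0.0, 0.0]))       # same fixed point from any start
```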
How can it be assured that Fw, a function composed of neural network instances, is actually a
contraction map? The authors propose to impose a penalty whenever the elements
of the Jacobian ∂Fw/∂x suggest that Fw isn't a contraction map anymore.
Let A = ∂Fw/∂x(x, l) be a block matrix of size N×N with blocks of size s×s, where N is
the number of nodes in the processed graph and |xn| = s is the state size for a single node.
A single block An,u measures the influence of node u on node n if an edge from u to n
exists, and is zeroed otherwise. Let's denote by Iuj the influence of node u on the jth element
of state xn (Eq. 5.11). The penalty pw added to the network error ew is defined by Eq. 5.9.
pw = ∑u∈N ∑j=1..s L(Iuj, µ) (5.9)

L(y, µ) = { y − µ if y > µ; 0 otherwise } (5.10)

Iuj = ∑(n,u) ∑i=1..s |(An,u)i,j| (5.11)
Based on these equations, the value of ∂pw/∂w is calculated and the final error derivative
for the hw network is calculated as ∂ew/∂w + ∂pw/∂w. It can be seen from Eq. 5.10 that the
term ∂pw/∂w affects only weights that cause an excessive impact (larger than µ) and that the
value of such a penalty is proportional to the value of Iuj. The eagerness to impose the penalty
and the penalty value are inversely proportional to the value of the contraction constant µ.
The impact of the contraction constant on the training process is described in section 6.3.
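Equations 5.9-5.11 can be sketched as follows. The block-dictionary representation of A and the sample values are illustrative assumptions:

```python
import numpy as np

def penalty(A_blocks, mu):
    """Eqs. 5.9-5.11 over a dict {(n, u): s x s block} of the Jacobian dFw/dx.
    For each node u and state element j, |A[i, j]| is summed over all blocks
    in which u appears as the source; the excess over mu is penalised."""
    influences = {}
    for (n, u), block in A_blocks.items():
        for j in range(block.shape[1]):                  # Eq. 5.11
            key = (u, j)
            influences[key] = influences.get(key, 0.0) + np.abs(block[:, j]).sum()
    return sum(max(I - mu, 0.0) for I in influences.values())   # Eqs. 5.9-5.10

# Two edges: node 1 influences node 0 strongly, node 0 influences node 1 weakly.
blocks = {(0, 1): np.array([[0.6, 0.1],
                            [0.5, 0.1]]),
          (1, 0): np.array([[0.1, 0.0],
                            [0.0, 0.1]])}
p_w = penalty(blocks, mu=0.9)
```

Only the first column of the strong block (influence 1.1 > µ = 0.9) contributes, giving a penalty of 0.2.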
5.7. RPROP algorithm
The authors of the GNN algorithm suggest the RPROP algorithm [24] as the weight
update algorithm, as an efficient gradient descent strategy. The basic idea of the RPROP
algorithm is to use only the sign of the original weight updates ∂ew/∂w. The actual weight
updates are calculated by the RPROP algorithm according to the past behaviour of ∂ew/∂w,
which includes fast descent of monotonous gradient slopes, small steps in the proximity of
a minimum and reverting updates that caused jumping over a local minimum. Actually,
in the case of GNN, the RPROP algorithm should be used not only for its efficiency, but
also as a way of dealing with the unpredictable behaviour of the ∂pw/∂w term. As experiments
showed, the value of ∂pw/∂w can be larger than the original ∂ew/∂w by several orders of magnitude,
which could severely disturb the learning algorithm if the RPROP algorithm weren't used.
The RPROP algorithm was implemented using the standard recommended values [24], with the
exception of ∆max, which was set to 1 to avoid large weight changes.
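The sign-based update can be sketched as follows. This is a simplified variant (the gradient is zeroed after a sign change, without the weight backtracking of the full algorithm in [24]):

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=1.0):
    """One RPROP iteration: only the sign of the gradient is used; the step grows
    on a stable gradient sign and shrinks after a sign change (step_max = 1
    reflects the restriction on Delta_max described above)."""
    change = grad * prev_grad
    step = np.where(change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(change < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(change < 0, 0.0, grad)    # suppress the update after a sign flip
    return w - np.sign(grad) * step, grad, step

# Minimising the toy error e(w) = w**2 (gradient 2w) from w = 2.
w, prev, step = np.array([2.0]), np.zeros(1), np.full(1, 0.1)
for _ in range(50):
    w, prev, step = rprop_step(w, 2 * w, prev, step)
```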
5.8. Maximum number of iterations
As was shown in section 5.6, the number of Forward or Backward steps is finite if Fw
is a contraction map. However, it can be seen that the penalty imposed on the weights is in
fact imposed post factum. Only after the norm of the Jacobian ∂Fw/∂x increases excessively is
the penalty imposed. In fact:
1. it is not guaranteed that the Jacobian norm will be correct after the penalty is imposed
2. one Forward iteration takes place before the penalty is imposed
Even if the penalty is efficient enough (it isn’t necessarily so, as shown in the subsequent
experiments), the problem of a single Forward iteration that may not converge still remains.
During experiments it was observed that while usually the number of Forward iterations
was between 5 and 50, depending on the dataset, from time to time it reached about 2000
iterations or even more (the calculations had to be aborted due to excessive time). To make
the calculation time predictable it was necessary to introduce a modification to the original
GNN algorithm - a maximum number of Forward iterations. The value chosen for the
subsequent experiments was 200, as it seemed to be a value large enough (in comparison
to the usual number of steps) to assure that the state calculation will converge if Fw is
still a contraction map. Furthermore, if the state calculation doesn’t converge, it is not
guaranteed that the error accumulation calculation will. Therefore, another similar restriction
was introduced for the Backward procedure and the maximum number of Backward (error
accumulation) iterations was set to 200.
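The capped fixed-point iteration can be sketched as follows (an illustrative Python fragment, not the thesis' Octave API; `transition` stands for one application of the global transition function F_w):

```python
import numpy as np

def forward_capped(transition, x0, min_state_diff=1e-8,
                   max_forward_steps=200):
    """Iterate x(t+1) = F_w(x(t)) until convergence, but never more than
    max_forward_steps times, so the running time stays bounded even when
    F_w is temporarily not a contraction map."""
    x = x0
    for step in range(1, max_forward_steps + 1):
        x_next = transition(x)
        if np.max(np.abs(x_next - x)) <= min_state_diff:
            return x_next, step          # converged
        x = x_next
    return x, max_forward_steps          # cap reached, possibly unconverged
```

The same cap is applied symmetrically to the Backward (error accumulation) loop, since its convergence also depends on F_w being a contraction map.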
5.9. Detailed training algorithm
MAIN:
    w = initialize
    x = FORWARD(w)
    for numberOfIterations:
        [∂eh_w/∂w; ∂eg_w/∂w] = BACKWARD(x, w)
        w = rprop-update(w, ∂eh_w/∂w, ∂eg_w/∂w)
        x = FORWARD(w)
    end
    return w
end

FORWARD(w):
    x(0) = random
    t = 0
    repeat:
        x(t+1) = F_w(x(t), l)
        t = t + 1
    until (max_i |x_i(t+1) − x_i(t)| ≤ minStateDiff) or (t > maxForwardSteps)
    return x(t)
end

BACKWARD(x, w):
    o = G_w(x)
    A = ∂F_w/∂x (x, l)
    b = ∂e_w/∂o · ∂G_w/∂x (x)
    z(0) = b
    t = 0
    repeat:
        z(t−1) = z(t) · A + b
        t = t − 1
    until (max_i |z_i(t−1) − z_i(t)| ≤ minErrorAccDiff) or (|t| > maxBackwardSteps)
    ∂eg_w/∂w = b
    ∂eh_w/∂w = z(t) · ∂F_w/∂w (x, l) + ∂p_w/∂w
    return [∂eh_w/∂w; ∂eg_w/∂w]
end

Listing 5.1. The learning algorithm
5.10. Graph-focused tasks
The GNN model can be used for graph-focused tasks by modifying the standard learning
algorithm. In a graph-focused task an output o_g is sought for every graph, defined as the
output of a predefined root node. In such a task only the root output error can be measured, as
for every other node the expected output is not defined. Thus, the only modification necessary
to deal with such a task is to set the error of all non-root nodes to zero. Such a modification was
implemented and proved to work well for a modified subgraph matching task (see chapter 6),
in which the task was to determine whether a given graph contains the expected subgraph S,
instead of selecting the nodes belonging to S.
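In terms of the error derivatives, the modification amounts to zeroing every row except the root's before error backpropagation. A minimal illustrative Python sketch (in the Octave sources the equivalent masking happens in `backward` when `graph.nodeOrientedTask == false`, with node 1 acting as the root):

```python
import numpy as np

def mask_non_root_errors(output_errors, root_index=0):
    """Graph-focused task: only the root node's output error is defined,
    so the error rows of all other nodes are set to zero."""
    masked = np.zeros_like(output_errors)
    masked[root_index, :] = output_errors[root_index, :]
    return masked
```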
6. Experiments
Experiments were conducted to check if the implemented GNN is able to cope with the
tasks presented in the original article [8]. For all the experiments the state size was set
to 5 and the number of hidden neurons in both the h_w and g_w networks was set to 5. After
some successful trivial experiments, consisting of memorizing a single graph, the proper
experiments were conducted. The task chosen for the experiments was the subgraph matching
task. It was chosen because:
1. a similar experiment was conducted by Scarselli et al. [8]
2. the dataset is easy to generate, yet the problem is not trivial
3. to yield good results, the structure of the graph has to be exploited.
6.1. Subgraph matching - data
The datasets for the subgraph matching task were generated as follows. For a given
number of graph nodes, graphs were generated by selecting node labels from [0..10] and
connecting each node pair in a graph with an edge with probability δ. Then, edges were
inserted randomly until the graph became connected. Next, a smaller (connected) subgraph S
was inserted into every graph in the dataset, and a brute force algorithm was used to locate all
copies of the subgraph S in every graph in the dataset. Thus, every graph in the dataset
contained at least one copy of the subgraph S. Afterwards, a small Gaussian noise with zero
mean and standard deviation of 0.25 was added to all node labels. All graph edges were
undirected and thus were transformed to pairs of directed edges prior to processing. No edge
labels were used.
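The generation procedure above can be sketched as follows (illustrative Python; the actual generator is the `buildgraphs.py` script from the appendix, and the helper names here are hypothetical):

```python
import random

def connected(n_nodes, edges):
    """Simple DFS connectivity check on an undirected edge set."""
    adj = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for k in adj[stack.pop()]:
            if k not in seen:
                seen.add(k)
                stack.append(k)
    return len(seen) == n_nodes

def random_connected_graph(n_nodes, delta, label_range=(0, 10)):
    """Draw node labels from [0..10], connect each node pair with
    probability delta, then insert random edges until connected,
    as in the subgraph matching datasets."""
    labels = [random.randint(*label_range) for _ in range(n_nodes)]
    edges = {(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
             if random.random() < delta}
    while not connected(n_nodes, edges):
        i, j = random.sample(range(n_nodes), 2)
        edges.add((min(i, j), max(i, j)))
    return labels, edges
```

Insertion of the subgraph S, the brute-force localization of its copies and the Gaussian label noise would follow as separate steps.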
Two datasets were generated: one with graph size (number of nodes) equal to 6, subgraph
size equal to 3 and δ = 0.8 (100 graphs, later called the 6-3 dataset), and a second one with
graph size equal to 14, subgraph size equal to 7 and δ = 0.2 (100 graphs, later called the
14-7 dataset). A larger δ was used for the first dataset, as graphs generated with δ = 0.2
were mostly sequences. The first dataset was used to analyze the training process, while the
second one was used for comparison of the GNN with a standard FNN classifier.
[Graph drawing omitted]
Figure 6.1. Sample graph from 6-3 dataset (subgraph in black), before adding noise
[Graph drawing omitted]
Figure 6.2. Sample graph from 14-7 dataset (subgraph in black), before adding noise
6.2. Impact of initial weight values on learning
To test the impact of initial weight values on the process of GNN training, 9 different
sets of weights were tested. For all tested networks, the contraction constant (µ from Eq. 5.9)
was set to 0.9. The training was performed on 10 graphs belonging to the 6-3 dataset. Each
GNN was trained for 50 iterations. As the default error measure used in GNN training is
the Mean Square Error, a similar performance measure, RMSE, was used for evaluation.
The results are presented in Fig. 6.3. Out of the 9 networks, only 4 performed well: gnn2,
gnn3, gnn5 and gnn7. The gnn5 network yielded the smallest RMSE at the end of training
and also shows a remarkably monotonic RMSE slope compared to gnn7. None of the other
networks improved significantly on the RMSE value, which may suggest that multiple initial
sets of weights should be tried for a given dataset to build an efficient classifier.
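The evaluation measure used throughout this chapter is simply the root of the mean squared output error (a minimal sketch; the thesis' Octave sources provide it as `rmse`):

```python
import numpy as np

def rmse(expected, outputs):
    """Root mean square error over all node outputs."""
    expected = np.asarray(expected, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return np.sqrt(np.mean((expected - outputs) ** 2))
```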
[Nine RMSE-vs-iteration plots omitted; panels: gnn1-gnn9, RMSE over 50 training iterations]
Figure 6.3. RMSE for 9 different initial weight sets. µ = 0.9
6.3. Impact of contraction constant on learning
During the initial experiments, interesting results were obtained for different values of
the contraction constant (µ from Eq. 5.9). It seems that for a given learning task there exists
a minimum value of µ below which no learning occurs. Experiments were conducted
for the 6-3 dataset using the best networks from Fig. 6.3, gnn5 and gnn7 (their initial
weight values were used). The results for gnn7 are presented in Fig. 6.4 and the results for
gnn5 in Fig. 6.5. For both networks three different values of µ were tested: 1.2,
0.9 and 0.6. In both cases it can be observed that no training occurs for µ = 0.6. For these
experiments 20 graphs from the 6-3 dataset were used.
[Three RMSE-vs-iteration plots omitted; panels labelled (1.2, 1e-08), (0.9, 1e-08), (0.6, 1e-08)]
Figure 6.4. RMSE for gnn7 with µ ∈ [1.2,0.9,0.6]
[Three RMSE-vs-iteration plots omitted; panels labelled (1.2, 1e-08), (0.9, 1e-08), (0.6, 1e-08)]
Figure 6.5. RMSE for gnn5 with µ ∈ [1.2,0.9,0.6]
A closer look at the learning process may shed some light on the reasons behind the
lack of learning. In Fig. 6.6 the learning process of gnn5 with µ = 0.9 is presented; in
Fig. 6.7 the same network gnn5 was trained with µ = 0.6. The quantities shown are:
nForward, the number of Forward (state building) iterations; nBackward, the number of
Backward (error accumulation) iterations; penalty, set to 1 if any weight was penalized;
de/dw influence, the percentage of combined weight updates that had the same sign as
∂e/∂w (before passing to the RPROP algorithm); and dp/dw influence, the percentage of
combined weight updates that had the same sign as ∂p/∂w. Some interesting features of the
GNN learning scheme can be observed. In the case of µ = 0.9 the number of Forward steps
reached the maximum a couple of times, which presumably means that at those times F_w
ceased being a contraction map. The penalty was imposed mostly for short periods of time
and only at one moment caused the ∂e/∂w influence to drop below 50%. This strategy
yielded good results: the imposed penalty reduced the number of Forward steps and the
RMSE was successfully reduced.
[Six plots omitted; panels: RMSE, penalty, nForward, de/dw influence [%], nBackward, dp/dw influence [%]]
Figure 6.6. gnn5 performance with µ = 0.9 for 20 graphs
A different situation is shown for µ = 0.6. Because of the low µ value, the penalty was
imposed eagerly and was larger than in the previous case (the impact of the µ value was
described in section 5.6). It was imposed even when the number of Forward steps was below
the maximum, that is, when F_w was still a contraction map. Large values of the penalty
caused a huge decrease of the ∂e/∂w term influence, which made any learning impossible.
[Six plots omitted; panels: RMSE, penalty, nForward, de/dw influence [%], nBackward, dp/dw influence [%]]
Figure 6.7. gnn5 performance with µ = 0.6 for 20 graphs
Another interesting case is presented in Fig. 6.8: the learning process of gnn5 on 10
graphs with µ = 0.9. It can be observed that even as the number of Forward steps occasionally
peaked at the maximum value, the F_w function remained a contraction map. A large enough
µ prevented the penalty from being imposed, which enabled the GNN model to train both
computation units without any disturbance. The result is a monotonically decreasing RMSE
slope, which could previously be observed in Fig. 6.3. It can be concluded that the most
important aspect of building a GNN model is to provide an efficient way to make F_w a
contraction map as fast as possible, so as to leave as much time as possible for undisturbed
learning.
[Six plots omitted; panels: RMSE, penalty, nForward, de/dw influence [%], nBackward, dp/dw influence [%]]
Figure 6.8. gnn5 performance with µ = 0.9 for 10 graphs
6.4. Cross-validation results
To compare the performance of the implemented GNN model with a standard FNN,
the following subgraph matching experiment was conducted. 5-fold cross-validation was
performed on all 100 graphs from the 14-7 dataset. A random GNN was generated and
trained with a contraction constant µ = 0.9 for 50 iterations on each fold. To provide
good FNN results, 10 three-layer FNNs with 20 hidden tanh neurons were evaluated and the
one with the best mean accuracy was selected. The results are presented in Tables 6.1 and 6.2.
The GNN classifier outperformed the FNN by more than 15%. This is due to the fact that
the FNN classifier could make predictions only by analyzing node labels, while the GNN
classifier correctly exploited the graph topology.
These results can be better understood by analyzing the classified dataset. The 100
processed graphs consisted in total of 1400 nodes. Amongst these nodes, 1031 had node
labels matching the subgraph node labels, but only 702 of these 1031 nodes actually
belonged to the subgraph. Thus, 329 nodes, 23.5% of all the nodes, would probably be
classified as false positives by a classifier taking into consideration only node labels. This
hypothesis corresponds quite well with the presented results.
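The hypothesis can be checked with a back-of-the-envelope estimate for a label-only classifier; the counts come from the dataset analysis above:

```python
# Of 1400 nodes, 1031 have labels matching the subgraph labels,
# but only 702 of them actually belong to an inserted copy of S.
total, label_match, true_positive = 1400, 1031, 702
false_positives = label_match - true_positive    # 329 label-only mistakes
fp_fraction = false_positives / total            # 0.235 of all nodes
label_only_accuracy = 1 - fp_fraction            # about 0.765
```

An accuracy ceiling of roughly 76.5% for a label-only classifier is indeed close to the FNN's 74-75% from Table 6.1.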
            accuracy   precision   recall
FNN - tr      75%        68%        93%
FNN - tst     74%        68%        93%
GNN - tr      91%        87%        97%
GNN - tst     91%        86%        97%

Table 6.1. Mean values on training and test sets
            accuracy   precision   recall
FNN - tr     0.68%      0.82%      1.43%
FNN - tst    3.20%      2.89%      1.85%
GNN - tr     1.62%      1.71%      2.07%
GNN - tst    3.06%      3.70%      1.39%

Table 6.2. Standard deviations on training and test sets
7. Conclusions
The implementation of the GNN model yielded promising experimental results for structured
data and proved that the model can properly exploit both node labels and the graph structure.
Some changes to the original algorithm, such as the maximum number of Forward and Backward
iterations, were introduced to assure a more predictable computation time. Conditions under
which the model works efficiently were described. An important parameter of the GNN
model, determining the training efficiency, was identified: the contraction constant µ. The
most important conclusions are listed below:
— training yields the best results when F_w remains a contraction map
— if F_w definitely ceases to be a contraction map, there are no training effects
— remaining near the contraction state can still yield good results
— a fixed maximum number of Forward/Backward iterations can still yield good results
— imposing an unnecessary penalty should be avoided
— too large penalties (a too small µ) should be avoided
— the minimum value of µ should be tuned to the processed dataset
— the minimum value of µ can be tuned using a subset of the data.
A. Using the software
The GNN classifier was implemented in GNU Octave 3.6.2 and tested on an x86_64 PC. The
most important functions are:
— loadgraph - loads a single graph from .csv files
— loadset - loads a set of graphs sharing the same filename prefix
— mergegraphs - merges a cell array of graphs into a single graph
— initgnn - initializes a new GNN
— traingnn - trains a GNN using a training graph
— classifygnn - classifies a given graph with a trained GNN
— evaluate - evaluates the classification results
— crossvalidate - performs cross-validation using an untrained GNN and a set of graphs
Help and usage information for each function can be displayed by typing help <function_name>
in the Octave command line.
g6 = loadset('../data/g6s3n/g6s3', 10);
gm = mergegraphs(g6);
gnn = initgnn(gm.maxIndegree, [5 5], [5 gm.nodeOutputSize], 'tansig');
trainedGnn = traingnn(gnn, gm, 20);
outputs = classifygnn(trainedGnn, gm);
stats = evaluate(outputs, gm.expectedOutput);

Listing A.1. Sample usage session
All subgraph matching datasets were created using the buildgraphs.py script. Each graph
can be viewed as a pdf file by using the drawgraph.py script. Each graph is stored as three
.csv files, containing node labels, edge labels and expected outputs. A sample graph ’test’ is
presented below. Nodes yielding output 2 were marked as black.
[Graph drawing omitted; nodes labelled 5, 6, 7, 8]
Figure A.1. Sample graph
test_nodes.csv   test_output.csv   test_edges.csv
5                1                 1,2,0
6                2                 2,3,0
7                2                 3,4,0
8                1                 4,1,0
                                   4,2,0
                                   4,3,0

Table A.1. Sample graph data
B. Listings
function [bestGnn trainStats] = traingnn(gnn, graph, nIterations, ...
    maxForwardSteps=200, maxBackwardSteps=200, initialState=0)
% Trains GNN using graph as training set
%
% usage: [bestGnn trainStats] = traingnn(gnn, graph, nIterations,
%   maxForwardSteps=200, maxBackwardSteps=200, initialState=0)
%
% return:
% - best gnn obtained during training and all errors
%
if initialState != 0
    assert(size(initialState, 1) == graph.nNodes);
    assert(size(initialState, 2) == gnn.stateSize);
end

% constants for indexing training stats
RMSE = 1;
FORWARD_STEPS = 2;
BACKWARD_STEPS = 3;
PENALTY_ADDED = 4;
RPROP_REVERTED_TRANS = 5;
RPROP_REVERTED_OUT = 6;
TRANS_STATS_START = 7;
TRANS_STATS_END = TRANS_STATS_START + 3;
trainStats = zeros(nIterations + 1, TRANS_STATS_END);

% normalize edge and node labels
% store normalization info inside result gnn
[graph.nodeLabels gnn.nodeLabelMeans gnn.nodeLabelStds] = ...
    normalize(graph.nodeLabels);
[graph.edgeLabels(:, 3:end) gnn.edgeLabelMeans gnn.edgeLabelStds] = ...
    normalize(graph.edgeLabels(:, 3:end));
graph = addgraphinfo(graph);

minError = Inf;
bestGnn = gnn;
rpropTransitionState = initrprop(gnn.transitionNet);
rpropOutputState = initrprop(gnn.outputNet);
for iteration = 1:nIterations
    [state nForwardSteps] = forward(gnn, graph, maxForwardSteps, ...
        initialState);
    trainStats(iteration, FORWARD_STEPS) = nForwardSteps;

    outputs = applynet(gnn.outputNet, state);
    if graph.nodeOrientedTask == false
        err = rmse(graph.expectedOutput(1, :), outputs(1, :));
    else
        err = rmse(graph.expectedOutput, outputs);
    end
    trainStats(iteration, RMSE) = err;
    if err < minError
        minError = err;
        bestGnn = gnn;
    end

    [deltas nBackwardSteps penaltyAdded] = backward(gnn, graph, ...
        state, maxBackwardSteps);
    trainStats(iteration, BACKWARD_STEPS) = nBackwardSteps;
    trainStats(iteration, PENALTY_ADDED) = penaltyAdded;
    trainStats(iteration, TRANS_STATS_START:TRANS_STATS_END) = ...
        round(transitionstats(deltas));

    outputDerivatives = deltas.output;
    [rpropOutputState outputWeightUpdates nOutputReverted] = ...
        rprop(rpropOutputState, outputDerivatives);
    trainStats(iteration, RPROP_REVERTED_OUT) = ...
        round(nOutputReverted * 100 / gnn.outputNet.nWeights);

    transitionDerivatives = adddeltas(deltas.transition, ...
        deltas.transitionPenalty);
    [rpropTransitionState transitionWeightUpdates ...
        nTransitionReverted] = ...
        rprop(rpropTransitionState, transitionDerivatives);
    trainStats(iteration, RPROP_REVERTED_TRANS) = ...
        round(nTransitionReverted * 100 / gnn.transitionNet.nWeights);

    gnn.outputNet = updateweights(gnn.outputNet, ...
        outputWeightUpdates, 1);
    gnn.transitionNet = updateweights(gnn.transitionNet, ...
        transitionWeightUpdates, 1);
end

% calculate RMSE of final GNN
iteration = nIterations + 1;
[state nForwardSteps] = forward(gnn, graph, maxForwardSteps, ...
    initialState);
trainStats(iteration, FORWARD_STEPS) = nForwardSteps;

outputs = applynet(gnn.outputNet, state);
if graph.nodeOrientedTask == false
    err = rmse(graph.expectedOutput(1, :), outputs(1, :));
else
    err = rmse(graph.expectedOutput, outputs);
end
trainStats(iteration, RMSE) = err;
%printf('RMSE after %d iterations: %f\n', iteration - 1, err);
if err < minError
    minError = err;
    bestGnn = gnn;
end
end

Listing B.1. Main train function
function [state nSteps] = forward(gnn, graph, maxForwardSteps, state=0)
% Perform the 'forward' step of GNN training
% compute node states until stable state is reached
%
% usage: [state nSteps] = forward(gnn, graph, maxForwardSteps, state=0)

if state == 0
    state = initstate(graph.nNodes, gnn.stateSize);
end
nSteps = 0;
do
    if nSteps > maxForwardSteps
        % printf('Too many forward steps: %d, aborting\n', nSteps);
        return;
    end
    lastState = state;
    state = transition(gnn.transitionNet, lastState, graph);
    nSteps = nSteps + 1;
until (stablestate(lastState, state, gnn.minStateDiff));
end

Listing B.2. Forward function
function newState = transition(fnn, state, graph)
% Calculate global transition function

newState = zeros(size(state));
for nodeIndex = 1:graph.nNodes
    % build transitionNet input for single node
    sourceNodeIndexes = graph.sourceNodes{nodeIndex};
    nodeLabel = graph.nodeLabels(nodeIndex, :);
    newNodeState = zeros(1, fnn.nOutputNeurons);
    for i = 1:size(sourceNodeIndexes, 2)
        sourceEdgeLabel = ...
            graph.edgeLabelsCell{sourceNodeIndexes(i), nodeIndex};
        sourceNodeState = state(sourceNodeIndexes(i), :);
        inputs = [nodeLabel, sourceEdgeLabel, sourceNodeState];
        stateContribution = applynet(fnn, inputs);
        newNodeState = newNodeState + stateContribution;
    end
    % if node has < maxIndegree input nodes, its contribution is 0
    newState(nodeIndex, :) = newNodeState;
end
end

Listing B.3. Transition function
function stable = stablestate(lastState, state, minStateDiff)
% return true, if difference between two states is insignificant

diff = max(max(abs(lastState - state)));
stable = (diff < minStateDiff);
end

Listing B.4. State convergence check
function [weightDeltas nSteps penaltyAdded] = backward(gnn, graph, state, ...
    maxBackwardSteps)
% Perform the 'backward' step of GNN training
%
% usage: [weightDeltas nSteps penaltyAdded] = backward(gnn, graph, state,
%   maxBackwardSteps)
%
% state: stable state, calculated by forward
% return: deltas for both transition and output networks of gnn

outputs = applynet(gnn.outputNet, state);
outputErrors = 2 .* (graph.expectedOutput - outputs);
if graph.nodeOrientedTask == false
    selectedNodeErrors = outputErrors(1, :);
    outputErrors = zeros(size(outputErrors));
    outputErrors(1, :) = selectedNodeErrors;
end

outputDeltas = outputdeltas(gnn.outputNet, graph, state, outputErrors);

% sparse matrix calculations, cause size(A) = N x N x s^2
% A can contain whole training set as a single graph
A = calculatea(gnn.transitionNet, graph, state);
b = sparse(calculateb(gnn.outputNet, graph, state, outputErrors));
accumulator = b;  % accumulator contains dew/do * dG/dx
nSteps = 0;
do
    % if there are too many backward steps, we can get infinities
    if nSteps > maxBackwardSteps
        break;
    end
    lastAccumulator = accumulator;
    % for each source node, backpropagate error of its target nodes
    % A: influence of source nodes on target nodes
    % accumulator: errors de/dx from previous timestep (t + 1)
    % b: de/do * dG/dx error, injected at each step
    accumulator = accumulator * A + b;
    nSteps = nSteps + 1;
until (stablestate(lastAccumulator, accumulator, gnn.minErrorAccDiff));

transitionErrors = reshape(accumulator, graph.nNodes, gnn.stateSize);
transitionDeltas = transitiondeltas(gnn.transitionNet, graph, state, ...
    transitionErrors);

% calculate penalty dp/dw, assuring that F is a contraction map
[penaltyDerivative penaltyAdded] = penaltyderivative(gnn, graph, state, A);
penaltyDeltas = reshapedeltas(gnn.transitionNet, penaltyDerivative);

weightDeltas = struct( ...
    'output', outputDeltas, ...
    'transition', transitionDeltas, ...
    'transitionPenalty', penaltyDeltas);
end

Listing B.5. Backward function
function A = calculatea(transitionNet, graph, state)
% Calculate the A(x) = dF/dx block matrix
% (NxN blocks, each s x s) (F = transition function)
%
% usage: A = calculatea(transitionNet, graph, state)
%
% Each block A[n, u] = dxn/dxu:
% - describes the effect of node xu on node xn,
%   if an edge xu->xn exists
% - is null (zeroed) if there is no edge
%
% Each element a[i, j] of block A[n, u]:
% - describes the effect of ith element of state xu
%   on jth element of state xn

stateSize = transitionNet.nOutputNeurons;
aSize = graph.nNodes * stateSize;
A = sparse(aSize, aSize);  % zeroed
for nodeIndex = 1:graph.nNodes
    % build transitionNet input for single node
    sourceNodeIndexes = graph.sourceNodes{nodeIndex};
    nodeLabel = graph.nodeLabels(nodeIndex, :);
    for i = 1:size(sourceNodeIndexes, 2)
        sourceEdgeLabel = ...
            graph.edgeLabelsCell{sourceNodeIndexes(i), nodeIndex};
        sourceNodeState = state(sourceNodeIndexes(i), :);
        inputs = [nodeLabel, sourceEdgeLabel, sourceNodeState];
        deltaZx = zeros(stateSize, stateSize);
        for j = 1:stateSize
            errors = zeros(1, stateSize);
            errors(j) = 1;
            inputDeltas = bp2(transitionNet, inputs, errors);
            % select only weights corresponding to x_iu
            stateWeightsStart = ...
                1 + graph.nodeLabelSize + graph.edgeLabelSize;
            stateInputDeltas = inputDeltas(1, stateWeightsStart:end);
            deltaZx(:, j) = stateInputDeltas';
        end
        startX = blockstart(nodeIndex, stateSize);
        endX = blockend(nodeIndex, stateSize);
        startY = blockstart(sourceNodeIndexes(i), stateSize);
        endY = blockend(sourceNodeIndexes(i), stateSize);
        A(startX:endX, startY:endY) = deltaZx;
    end
end
end

Listing B.6. Matrix A calculation
function b = calculateb(outputNet, graph, state, errorDerivative)
% Calculate b matrix = dew/do * dGw/dx
%
% G: global output function, G(nodeStates) = outputs
% errorDerivative: each row contains an error of a node
% state: each row contains a node state

b = zeros(outputNet.nInputLines, graph.nNodes);
for nodeIndex = 1:graph.nNodes
    errors = errorDerivative(nodeIndex, :);
    nodeState = state(nodeIndex, :);
    inputs = nodeState;
    deltas = backpropagate(outputNet, inputs, errors);
    b(:, nodeIndex) = deltas.deltaInputs(1, :);
end
% stack output for each node on one another
b = vec(b)';
end

Listing B.7. Matrix b calculation
function [penaltyDerivative penaltyAdded] = ...
    penaltyderivative(gnn, graph, state, A)
% Calculate penalty derivative contribution to de/dw, where:
% - e = RMSE + contractionMapPenalty (network error)
%
% usage: [penaltyDerivative penaltyAdded]
%   = penaltyderivative(gnn, graph, state, A)
%

% sum up influence of each source node on all the other nodes
% each s-long block is influence of single source node xu on all s outputs
sourceInfluences = sum(abs(A), 1);
sourceInfluences = (sourceInfluences - ...
    repmat(gnn.contractionConstant, size(sourceInfluences))) .* ...
    (sourceInfluences > gnn.contractionConstant);

fnn = gnn.transitionNet;
nWeights1 = fnn.nInputLines * fnn.nHiddenNeurons;
nBias1 = fnn.nHiddenNeurons;
nWeights2 = fnn.nOutputNeurons * fnn.nHiddenNeurons;
nBias2 = fnn.nOutputNeurons;
penaltyDerivative = zeros(1, nWeights1 + nBias1 + nWeights2 + nBias2);
if sum(sourceInfluences) == 0
    penaltyAdded = false;
else
    penaltyAdded = true;
    % matrix B contains influences from A, filtered:
    % only influences coming from a too influential source are retained
    B = sign(A) .* repmat(sourceInfluences, size(A, 1), 1);
    for sourceIndex = 1:graph.nNodes
        startX = blockstart(sourceIndex, gnn.stateSize);
        endX = blockend(sourceIndex, gnn.stateSize);
        sourceNodeInfluences = sourceInfluences(1, startX:endX);
        if (sum(sourceNodeInfluences, 2) != 0)
            % calculate impactDerivative[n, u] for u = sourceIndex
            for targetIndex = 1:graph.nNodes
                if !edgeexists(graph, sourceIndex, targetIndex)
                    continue;
                end
                startY = blockstart(targetIndex, gnn.stateSize);
                endY = blockend(targetIndex, gnn.stateSize);
                Rnu = full(B(startY:endY, startX:endX));

                % calculate f2'(net2) and f1'(net1)
                nodeLabel = graph.nodeLabels(targetIndex, :);
                sourceEdgeLabel = graph.edgeLabelsCell{sourceIndex, targetIndex};
                sourceNodeState = state(sourceIndex, :);
                inputs = [nodeLabel, sourceEdgeLabel, sourceNodeState];

                % fnn feed
                net1 = fnn.weights1 * inputs' + fnn.bias1;
                hiddenOutputs = fnn.activation1(net1);
                net2 = fnn.weights2 * hiddenOutputs + fnn.bias2;
                sigma1 = fnn.activationderivative1(net1);
                sigma2 = fnn.activationderivative2(net2);

                % select only weights corresponding to xu at inputs
                stateWeightsStart = 1 + graph.nodeLabelSize + graph.edgeLabelSize;
                stateWeights1 = fnn.weights1(:, stateWeightsStart:end);

                % vec(Rnu)' * dvec(Anu)/dw = da1 + da2 + da3 + da4
                deltaSignal1 = vec(Rnu * stateWeights1' * ...
                    diag(sigma1) * fnn.weights2')' * vecdiagmatrix(gnn.stateSize);
                fnn2nd = fnn;
                fnn2nd.activation1 = fnn.activationderivative1;
                fnn2nd.activation2 = fnn.activationderivative2;
                fnn2nd.activationderivative1 = fnn.activation2ndderivative1;
                fnn2nd.activationderivative2 = fnn.activation2ndderivative2;
                deltas1 = backpropagate(fnn2nd, inputs, deltaSignal1);
                da1 = vecdeltas(deltas1)';
d a 2 l e f t = vec ( diag ( s igma2 ) ∗ Rnu ∗ s t a t e W e i g h t s 1 ’ ∗ diag ( s igma1 ) ) ’ ;weights2dw = [ z e r o s ( nWeights2 , nWeights1 + nBias1 ) . . .
75 eye ( nWeights2 ) z e r o s ( nWeights2 , nBias2 ) ] ;da2 = d a 2 l e f t ∗ weights2dw ;
d e l t a S i g n a l 3 = vec ( fnn . we igh t s2 ’ ∗ diag ( s igma2 ) ∗ . . .Rnu ∗ s t a t e W e i g h t s 1 ’ ) ’ ∗ v e c d i a g m a t r i x ( fnn . nHiddenNeurons ) ;
80 % 1− l a y e r f n n b a c k p r o p a g a t e w i t h f ( n e t ) = t r a n s i t i o n . f ’ ( n e t )n e t 1 = fnn . w e i g h t s 1 ∗ i n p u t s ’ + fnn . b i a s 1 ;d e l t a 1 = d e l t a S i g n a l 3 ’ . ∗ fnn . a c t i v a t i o n 2 n d d e r i v a t i v e 1 ( n e t 1 ) ;d e l t a W e i g h t s 1 = d e l t a 1 ∗ i n p u t s ;d e l t a B i a s 1 = d e l t a 1 ;
85 da3 = [ vec ( d e l t a W e i g h t s 1 ) ; vec ( d e l t a B i a s 1 ) ; . . .z e r o s ( nWeights2 + nBias2 , 1 ) ] ’ ;
d a 4 l e f t = vec ( diag ( s igma1 ) ∗ fnn . we igh t s2 ’ ∗ diag ( s igma2 ) ∗ Rnu ) ’ ;s t a t e S i z e = fnn . nOutpu tNeurons ;
90 n S t a t e W e i g h t s = s t a t e S i z e ∗ fnn . nHiddenNeurons ;l a b e l s S i z e = graph . n o d e L a b e l S i z e + graph . e d g e L a b e l S i z e ;s t a t eWeigh t s1Dw = z e r o s ( n S t a t e W e i g h t s , . . .
nWeights1 + nBias1 + nWeights2 + nBias2 ) ;% mark w i t h ones w e i g h t s c o r r e s p o n d i n g t o xu
95 f o r h = 1 : fnn . nHiddenNeuronss t a r t I n d e x X = 1 + ( h − 1) ∗ fnn . n I n p u t L i n e s + l a b e l s S i z e ;endIndexX = s t a r t I n d e x X + s t a t e S i z e − 1 ;s t a r t I n d e x Y = 1 + ( h − 1) ∗ s t a t e S i z e ;endIndexY = s t a r t I n d e x Y + s t a t e S i z e − 1 ;
100 s t a t eWeigh t s1Dw ( s t a r t I n d e x Y : endIndexY , . . .s t a r t I n d e x X : endIndexX ) = eye ( s t a t e S i z e ) ;
endda4 = d a 4 l e f t ∗ s t a t eWeigh t s1Dw ;
105 i m p a c t D e r i v a t i v e = da1 + da2 + da3 + da4 ;p e n a l t y D e r i v a t i v e = p e n a l t y D e r i v a t i v e + i m p a c t D e r i v a t i v e ;
endend
end110 p e n a l t y D e r i v a t i v e = p e n a l t y D e r i v a t i v e ∗ 2 ;
endend
Listing B.8. Penalty derivative calculation
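The final multiplication by 2 in Listing B.8 is consistent with a quadratic penalty on the column sums of |A| that exceed the contraction constant. As a cross-check of that interpretation, the penalty value itself (not its derivative) can be sketched in NumPy; the function name `contraction_penalty` and the argument name `mu` are illustrative and do not appear in the thesis code:

```python
import numpy as np

def contraction_penalty(A, mu):
    """Sum of squared excesses of per-column influence over mu.

    Mirrors the filtering in Listing B.8: column sums of |A| measure how
    strongly each source state block influences the others, and only
    columns exceeding the contraction constant mu contribute.
    """
    influence = np.abs(A).sum(axis=0)          # influence of each source column
    excess = np.maximum(influence - mu, 0.0)   # keep only too-influential sources
    return float(np.sum(excess ** 2))          # derivative carries the factor of 2
```

The derivative of this expression with respect to an entry of A is 2 * sign(A) * excess for the retained columns, which is exactly the matrix B built at the top of the listing.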
function [rpropStruct nReverted] = rpropupdate(rpropStruct, errorDerivatives, deltaMin, deltaMax)
% Helper function for rprop(), operates on a single matrix of weights
%
% usage : [rpropStruct nReverted] =
%     rpropupdate(rpropStruct, errorDerivatives, deltaMin, deltaMax)
%
% rpropStruct.weightUpdates - deltas that should be added to fnn weights

assert(size(rpropStruct.errors, 1) == size(errorDerivatives, 1));
assert(size(rpropStruct.errors, 2) == size(errorDerivatives, 2));
increase = 1.2;
decrease = 0.5;
errorDirectionChange = rpropStruct.errors .* errorDerivatives;
nReverted = 0;
for i = 1:size(rpropStruct.deltas, 1)
    for j = 1:size(rpropStruct.deltas, 2)
        if errorDirectionChange(i, j) > 0
            % increase weight update
            rpropStruct.deltas(i, j) = min(rpropStruct.deltas(i, j) * ...
                increase, deltaMax);
            rpropStruct.weightUpdates(i, j) = ...
                sign(errorDerivatives(i, j)) * rpropStruct.deltas(i, j);
            rpropStruct.errors(i, j) = errorDerivatives(i, j);
        elseif errorDirectionChange(i, j) < 0
            rpropStruct.deltas(i, j) = ...
                max(rpropStruct.deltas(i, j) * decrease, deltaMin);
            % revert last weight update
            rpropStruct.weightUpdates(i, j) = ...
                -rpropStruct.weightUpdates(i, j);
            % avoid double punishment in next step
            rpropStruct.errors(i, j) = 0;
            nReverted = nReverted + 1;
        else
            rpropStruct.weightUpdates(i, j) = ...
                sign(errorDerivatives(i, j)) * rpropStruct.deltas(i, j);
            rpropStruct.errors(i, j) = errorDerivatives(i, j);
        end
    end
end
end
Listing B.9. RPROP weights update
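Listing B.9 applies the RPROP rule of Riedmiller and Braun element by element with explicit loops. The same update can be written vectorised; the following NumPy sketch mirrors the listing's sign convention (the step follows the sign of the stored derivative, and a sign flip reverts the previous update). The plain-array names `prev_grads`, `deltas` and `weight_updates` are illustrative substitutes for the `rpropStruct` fields:

```python
import numpy as np

def rprop_update(prev_grads, grads, deltas, weight_updates,
                 delta_min=1e-6, delta_max=50.0):
    """Vectorised RPROP step; returns (prev_grads, deltas, weight_updates, n_reverted)."""
    increase, decrease = 1.2, 0.5
    change = prev_grads * grads            # > 0: same sign, < 0: sign flipped
    same, flipped = change > 0, change < 0

    # grow the step where the derivative kept its sign, shrink where it flipped
    deltas = np.where(same, np.minimum(deltas * increase, delta_max), deltas)
    deltas = np.where(flipped, np.maximum(deltas * decrease, delta_min), deltas)

    # on a sign flip revert the previous update; otherwise step along
    # sign(grads) with the current step size, as in Listing B.9
    weight_updates = np.where(flipped, -weight_updates, np.sign(grads) * deltas)

    # store zero where the sign flipped, to avoid double punishment next step
    prev_grads = np.where(flipped, 0.0, grads)
    return prev_grads, deltas, weight_updates, int(flipped.sum())
```

As in the listing, the caller is responsible for adding `weight_updates` to the weights and for the overall sign convention of the stored derivatives.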
List of Figures
2.1 A simple binary tree . . . 2
4.1 A sample graph that can be processed using RAAM . . . 8
4.2 Training set for the example graph . . . 8
4.3 Graph compression using trained RAAM model . . . 9
4.4 Graph reconstruction from x3 using trained RAAM model . . . 9
4.5 RAAM encoding network for the sample graph . . . 10
4.6 LRAAM encoding network for the graph shown . . . 12
4.7 LRAAM encoding network for the cyclic graph shown . . . 13
4.8 Virtual unfolding, reflecting the sample graph structure . . . 15
4.9 A sample acyclic graph and the corresponding encoding network . . . 18
4.10 A sample acyclic graph and the corresponding encoding network . . . 19
5.1 The fw unit for a single node and one of the corresponding edges . . . 23
5.2 The gw unit for a single node . . . 23
5.3 A sample graph and the corresponding encoding network . . . 24
5.4 Unfolded encoding network for the sample graph . . . 25
5.5 Error backpropagation through the unfolded network . . . 26
6.1 Sample graph from 6-3 dataset (subgraph in black), before adding noise . . . 31
6.2 Sample graph from 14-7 dataset (subgraph in black), before adding noise . . . 31
6.3 RMSE for 9 different initial weight sets. µ = 0.9 . . . 32
6.4 RMSE for gnn7 with µ ∈ [1.2, 0.9, 0.6] . . . 33
6.5 RMSE for gnn5 with µ ∈ [1.2, 0.9, 0.6] . . . 33
6.6 gnn5 performance with µ = 0.9 for 20 graphs . . . 34
6.7 gnn5 performance with µ = 0.6 for 20 graphs . . . 35
6.8 gnn5 performance with µ = 0.9 for 10 graphs . . . 36
A.1 Sample graph . . . 39
Listings
5.1 The learning algorithm . . . 29
A.1 Sample usage session . . . 39
B.1 Main train function . . . 40
B.2 Forward function . . . 42
B.3 Transition function . . . 42
B.4 State convergence check . . . 42
B.5 Backward function . . . 43
B.6 Matrix A calculation . . . 44
B.7 Matrix b calculation . . . 45
B.8 Penalty derivative calculation . . . 46
B.9 RPROP weights update . . . 48