Embedding Graphs for Shortest-Path Distance Predictions
Zhuowei Zhao (ORCID: 0000-0002-6891-6432)
Submitted in total fulfilment of the requirements of the degree of
Master of Philosophy
School of Computing and Information Systems
THE UNIVERSITY OF MELBOURNE
February 2020
Copyright © 2020 Zhuowei Zhao
All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
Abstract
Graphs are an important data structure used in an abundance of real-world applications
including navigation systems, social networks, and web search engines, to name
a few. We study a classic graph problem – computing graph shortest-path
distances. This problem has many applications, such as finding nearest neighbors for
place of interest (POI) recommendation or social network friendship recommendation. To
compute a shortest-path distance, traditional approaches traverse the graph to find the
shortest path and return the path length. These approaches lack time efficiency over large
graphs. In the applications above, the distances may be needed first (e.g., to rank POIs),
while the actual shortest paths may be computed later (e.g., after a POI has been chosen).
Thus, an alternative approach precomputes and stores the distances, and answers dis-
tance queries with simple lookups. This approach, however, falls short in the space cost
– O(n2) in the worst case for n vertices, even with various optimizations.
To address these limitations, we take an embedding based approach to predict the
shortest-path distance between two vertices using their embeddings without comput-
ing their path online or storing their distance offline. Graph embedding is an emerging
technique for graph analysis that has yielded strong performance in applications such
as node classification, link prediction, graph reconstruction, and more. We propose a
representation learning approach to learn a k-dimensional (k ≪ n) embedding for every
vertex. This embedding preserves the distance information of the vertex to the other
vertices. We then train a multi-layer perceptron (MLP) to predict the distance between
two vertices given their embeddings. We thus achieve fast distance predictions with-
out a high space cost (i.e., only O(kn)). Experimental results on road network graphs,
social network graphs, and web document graphs confirm these advantages, while our
approach also produces distance predictions that are up to 97% more accurate than those
by the state-of-the-art approaches.
Our embeddings are not limited to distance predictions. We further study their
applicability on other graph problems such as link prediction and graph reconstruction.
Experimental results show that our embeddings are highly effective in these tasks.
Declaration
This is to certify that
1. the thesis comprises only my original work towards the MPhil,
2. due acknowledgement has been made in the text to all other material used,
3. the thesis is less than 50,000 words in length, exclusive of tables, figures, bibliogra-
phies and appendices.
Zhuowei Zhao, February 2020
This page is intentionally left blank.
Acknowledgements
First of all, I would like to express my deepest gratitude to my supervisors, Dr. Jianzhong
Qi and Prof. Rui Zhang for their continuous support during my MPhil study. They have
guided me with their rich knowledge. Their passion for research has deeply encouraged
me. Without their support, this thesis would not have been possible.
I am deeply grateful to Prof. Wei Wang (The University of New South Wales) who
provided invaluable discussions and insightful feedback to my research.
I also sincerely thank my Advisory Committee Chair, Dr. Sean Maynard. He has
closely followed my progress and given me generous support during my MPhil study.
Without his insightful feedback and constructive comments, my progress would not have
been as smooth.
Then, I would like to thank The University of Melbourne and School of Computing
and Information Systems for providing a supportive research environment and rich re-
sources for my MPhil study.
Last but not least, I would like to thank all my fellow research students, with whom
I have shared an office or worked on various occasions, for their support in research, life, and all
the pleasant memories, including Xinting Huang, Jiabo He, Yixin Su, Shiquan Yang, Yi-
meng Dai, Xiaojie Wang, Bastian Oetomo, Ang Li, Yunxiang Zhao, Guanli Liu, Yanchuan
Chang, Chuandong Yin, Chenxu Zhao, Weihao Chen, Zhen Wang, and Daocang Chen.
Preface
A paper based on the work presented in Chapter 4 has been accepted and will appear in The
23rd International Conference on Extending Database Technology (EDBT). I declare that I am
the primary author and have contributed > 50% in the paper.
1. * Jianzhong Qi, Wei Wang, Rui Zhang, and Zhuowei Zhao. A Learning Based Ap-
proach to Predict Shortest-Path Distances. Accepted to appear in International Con-
ference on Extending Database Technology (EDBT), 2020. (CORE Ranking 1: A)
* The authors are ordered alphabetically.
1http://portal.core.edu.au/conf-ranks/?search=EDBT&by=all&source=CORE2018&sort=atitle&page=1
To my parents and my wife, for their unconditional love.
Contents
1 Introduction
  1.1 Background
  1.2 Research Gap
  1.3 Contributions of the Thesis
  1.4 Outline of the Thesis

2 Related Work
  2.1 Exact Distance Computation
  2.2 Approximate Distance Computation
  2.3 Graph Embedding
  2.4 Other Graph Embedding Applications
  2.5 Summary

3 Adapted Two-Stage Models
  3.1 Problem Formulation
  3.2 A Two-Stage Solution Framework
  3.3 Adapted Representation Learning Models
    3.3.1 Node2vec
    3.3.2 Auto-encoder
    3.3.3 Geo-coordinates and Landmark Labels
  3.4 Summary

4 Proposed Single-Stage Model
  4.1 Vdist2vec
  4.2 Loss Function
    4.2.1 Reducing Mean Errors
    4.2.2 Reducing Maximum Errors
  4.3 Ensembling Model
  4.4 Handling Large Graphs
    4.4.1 Large Road Networks
  4.5 Handling Updates
  4.6 Cost Analysis
  4.7 Summary

5 Experiments
  5.1 Settings
  5.2 Results
    5.2.1 Overall Comparison
    5.2.2 Performance on Smaller Graphs
    5.2.3 Performance on Larger Graphs
    5.2.4 Applicability Test
    5.2.5 Impact of Updates
    5.2.6 Impact of Embedding Dimensionality
    5.2.7 Impact of MLP
    5.2.8 Impact of Number of Center Vertices
    5.2.9 Impact of Loss Function
  5.3 Other Embedding Applications
  5.4 Summary

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
List of Figures
1.1 Examples of graphs in real life
1.2 Vertex distance on a road network graph
1.3 Graph shortest-path distance problem

2.1 A graph example
2.2 Landmark distribution on Dongguan

3.1 Solution framework
3.2 Auto-encoder embedding model
3.3 A road network example

4.1 Vdist2vec model structure
4.2 Error distribution of our model (DG)
4.3 Distance value distribution (DG)
4.4 Ensembling model
4.5 Layer ensembling model
4.6 Distance prediction model for large road networks

5.1 Recall and nDCG in finding nearest neighbor
5.2 Impact of updates (DG)
5.3 Impact of updates (FBPOL)
5.4 Impact of k (DG)
5.5 Impact of k (FBPOL)
List of Tables
3.1 Node2vec Based Distance Prediction Errors
3.2 Node2vec Based Distance Prediction Errors on Different Networks
3.3 Auto-encoder Based Distance Prediction Errors on Different Networks
3.4 Geodnn Based Distance Prediction Errors on Different Networks

4.1 Comparing Huber Loss with Reverse Huber Loss on Road Networks
4.2 Comparing Huber Loss with Reverse Huber Loss on Social Networks
4.3 Using MnSE and MnCE as the Loss Function on MB dataset
4.4 Performance of Ensembling Models on MB

5.1 Datasets
5.2 Mean Absolute and Mean Relative Errors on Road Networks (Smaller Graphs)
5.3 Mean Absolute and Mean Relative Errors on Social Networks and Web Page Graph (Smaller Graphs)
5.4 Max Absolute and Max Relative Errors on Road Networks (Smaller Graphs)
5.5 Max Absolute and Max Relative Errors on Social Networks and Web Graph (Smaller Graphs)
5.6 Preprocessing and Query Times on Road Networks (Smaller Graphs)
5.7 Preprocessing and Query Times on Social Networks and Web Page Graph (Smaller Graphs)
5.8 Mean Absolute and Mean Relative Errors on Road Networks (Larger Graphs)
5.9 Mean Absolute and Mean Relative Errors on Social Networks (Larger Graphs)
5.10 Max Absolute and Max Relative Errors on Road Networks (Larger Graphs)
5.11 Max Absolute and Max Relative Errors on Social Networks (Larger Graphs)
5.12 Preprocessing and Query Times on Road Networks (Larger Graphs)
5.13 Preprocessing and Query Times on Social Networks (Larger Graphs)
5.14 Effectiveness of Embedding Learning (DG)
5.15 Impact of MLP Structure Landmark-dg + MLP (DG)
5.16 Impact of MLP Structure Vdist2vec (DG)
5.17 Impact of Number of Center Vertices (SH)
5.18 Impact of Loss Function (DG and MB)
5.19 Impact of Loss Function (SU and EPA)
5.20 Impact of Loss Function (FBTV and FBPOL)
5.21 Graph Reconstruction and Link Prediction Performance Measured in MAP
5.22 Graph Reconstruction and Link Prediction Processing Time
List of Abbreviations and Symbols
CH Contraction Hierarchies
DNN Deep Neural Network
MAP Mean Average Precision
MnCE Mean Cube Error
MLP Multilayer Perceptron
MnAE Mean Absolute Error
MnRE Mean Relative Error
MnSE Mean Square Error
MxAE Max Absolute Error
MxRE Max Relative Error
NN Nearest Neighbors
NDCG Normalized Discounted Cumulative Gain
POI Place of Interest
PCA Principal Component Analysis
PT Preprocessing Time
QT Query Time
RRVHL Reverse Huber Loss
v A vertex
u A vertex
l A landmark vertex
G A graph
E An edge set
V A vertex set
L A landmark set
La A label set
La(v) Label of vertex v
d(v, u) Distance between v and u
d̂(v, u) Estimated distance between v and u
dq(v, u, La) Distance between v and u computed by label set La
pv,u Shortest path from v to u
vi An embedding vector for vi
V An embedding matrix for V
L Training loss
Chapter 1
Introduction
1.1 Background
Graphs were first introduced by Leonhard Euler in 1735 [47] to solve a mathematical
problem known as the Königsberg bridge problem: traversing a number of islands
connected by bridges, crossing each bridge exactly once. Euler modeled the problem
by representing the islands as vertices and the bridges as edges of a graph.
Since then, graphs have become an important mathematical tool in many disciplines
including computer science, chemistry, linguistics, geography, and many more. In computer
science, graphs are an essential data structure and are commonly used to model transportation
networks, social networks, and web page link structures, to name a few.
Figure 1.1 gives an example, where Figure 1.1a is a social network graph, Figure 1.1b is a
web page graph, and Figure 1.1c is a transport network graph.
Figure (1.1) Examples of graphs in real life: (a) a social network graph; (b) a web page graph; (c) a transportation network graph
In graph theory, a basic problem is to compute the distance between two vertices,
which may be used to model the travel cost between two places of interest (POIs), the social
closeness of two individuals, the relevance of two web pages, etc. Figure 1.2 shows the
distance between two vertices on a road network graph. We can see that such a distance
is not necessarily the Euclidean distance between the two vertices. The vertex distances
are fundamental for recommending POIs to tourists, suggesting friends to social network
users, or ranking web pages for search engines. In these applications, there may be mil-
lions of vertices and users who issue distance queries. For example, the Florida road
network [1] has over a million vertices; Google Maps has over a billion active users [5];
there are more than a billion active websites [8]; and Facebook has over 2 billion social
network users [7]. Answering distance queries under such settings poses significant
challenges in both space and time costs.
Figure (1.2) Vertex distance on a road network graph
In this thesis, we revisit the problem of computing the distance between two vertices
in a graph. Here, the distance refers to the length of the graph shortest path between the
two vertices. We use distance for brevity when the context is clear. Figure 1.3a shows an
abstracted example of the problem, where v1, v2, ..., v5 are the vertices, and the numbers
on the edges are the edge weights. Consider vertices v1 and v5. Their distance is the
length of path v1 → v4 → v5, which is 4. Our aim is to answer queries on such distances
(approximately) with a high efficiency.
1.2 Research Gap
A traditional approach uses graph shortest path algorithms to compute the shortest path
between two vertices, along which the path length (i.e., the distance) is computed. Di-
jkstra’s single-source shortest-path (SSSP) algorithm [32] and the Floyd-Warshall all-pair
shortest-path (APSP) algorithm [36] are simple and effective algorithms for this purpose.
However, these methods may incur high computational costs when run online
over large graphs, i.e., O(m + n log n) and O(n3), where m and n are the
numbers of edges and vertices, respectively. More recent algorithms such as contraction
hierarchies (CH) [44] reduce the time cost via preprocessing the graphs to add shortcut
edges (i.e., shortest paths between some vertices). These algorithms focus on computing
the shortest paths rather than the distances.
In applications such as those mentioned above, the distances may be needed first
(a) A graph example: vertices v1 to v5, where v2 and v3 serve as landmarks l1 and l2; the numbers on the edges are the edge weights.

(b) Distance labeling: every vertex stores its distances to all vertices.

        v1  v2  v3  v4  v5
    v1   0   3   3   1   4
    v2   3   0   6   4   6
    v3   3   6   0   4   5
    v4   1   4   4   0   3
    v5   4   6   5   3   0

(c) Landmark labeling: every vertex stores only its distances to the landmarks v2 (l1) and v3 (l2).

        v2(l1)  v3(l2)
    v1     3       3
    v2     0       6
    v3     6       0
    v4     4       4
    v5     6       5

Figure (1.3) Graph shortest-path distance problem
while the actual shortest paths may be computed later. Meanwhile, the distances do not
update frequently or do not need to support real-time updates. For example, to rank
POIs for recommendation, we may just need the distances to the POIs, while the shortest
path can be computed after a POI has been chosen by the user. Also, the POI locations
do not change often. Similarly, to recommend friends for a user, we may need distances
in a social network graph that represent her social closeness to other users, but not the
shortest paths. Therefore, generating the recommendations can be done offline without
requiring real-time updates of the social network graph. Such applications are targeted
in this study.
Under such application contexts, studies (e.g., [14, 29, 58]) preprocess a graph and
build new data structures to enable fast distance queries without online shortest path
computations. Distance labeling is commonly used in these studies. The basic idea is to
precompute a vector of (distance) values for each vertex as its distance label. At query time,
only the distance labels of the two query vertices are examined to derive their distance,
which is simpler than shortest path computation. In an extreme case, the distance label
of every vertex consists of its distances to all other vertices (cf. Figure 1.3b). A distance
query is answered by a simple lookup in O(1) time, but this requires O(n2) space to store
all the distance labels. Various labeling approaches (e.g., 2-hop labeling [29] and highway
labeling [58]) are proposed to reduce the distance label size.
Hub labelling [29] is a representative labelling approach. It labels every vertex with its
distances to vertices on its shortest paths to all the other vertices. The vertices used for
labelling are called hubs. The hubs are chosen such that there is at least one hub on the
shortest path of every pair of vertices. Every vertex only stores its distances to the hubs
on its shortest paths to the other vertices. Since some of these shortest paths may share
the same hub, hub labelling can produce distance labels with smaller sizes. At query
time, the distance labels of the two query vertices are scanned to find their shared hub,
which must be a vertex on their shortest path. The distances to this hub are summed up
and returned as the query answer. This approach has been shown to be query efficient,
but its worst-case space cost is still O(n2) [57].
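The query step can be sketched as follows, assuming each label is stored as a map from hub vertices to distances. The labels below are hypothetical, chosen to match the graph of Figure 1.3, where v4 lies on the shortest path between v1 and v5:

```python
# A minimal sketch of hub-label querying (illustrative; not the thesis implementation).
# Each vertex stores a label: a dict mapping hub vertices to distances.

def hub_query(label_u, label_v):
    """Derive the distance from two hub labels: the minimum of
    d(u, h) + d(h, v) over all hubs h shared by both labels."""
    shared = set(label_u) & set(label_v)
    if not shared:
        return float("inf")  # no common hub: the distance cannot be derived
    return min(label_u[h] + label_v[h] for h in shared)

# Hypothetical labels for Figure 1.3 with v4 as a hub on the v1-v5 shortest
# path: d(v1, v4) = 1 and d(v5, v4) = 3.
label_v1 = {"v4": 1}
label_v5 = {"v4": 3}
print(hub_query(label_v1, label_v5))  # 4, the exact distance
```

Because the hub lies on the actual shortest path, the summed distances recover the exact answer; this is precisely the property that the O(n2) worst-case label size pays for.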
To avoid the O(n2) space cost, approximate techniques are proposed [25, 92], among
which landmark labeling [49, 77, 90] is a representative approach. The landmark labeling
approach chooses a subset of k (k ≪ n) vertices as the landmarks. Every vertex vi
stores its distances to these landmarks as its distance label, i.e., a k-dimensional vector
〈d(vi, l1), d(vi, l2), . . . , d(vi, lk)〉, where l1, l2, . . . , lk ∈ L represent the landmarks and d(·)
represents the distance. At query time, the distance labels of the two query vertices vi and
vj are scanned, where the distances to the same landmark are summed up. The smallest
distance sum, i.e., min{d(vi, l) + d(vj, l) | l ∈ L}, is returned as the query answer (for
undirected graphs). In Figure 1.3, v2 and v3 are chosen as the landmarks (denoted by l1 and l2,
respectively), and the distance labels are shown in Figure 1.3c. The distance between v1
and v5 is computed as min{d(v1, l1) + d(v5, l1), d(v1, l2) + d(v5, l2)} = min{3 + 6, 3 + 5} = 8,
which is twice as large as the actual distance between v1 and v5 (i.e., 4). As the example
shows, even though landmark labeling reduces the space cost to O(kn), it may not return
the exact distance between vi and vj when their shortest path does not pass any land-
mark. How the landmarks are chosen plays a critical role in the algorithm accuracy. Since
finding the k optimal landmarks is NP-hard [77], heuristics are proposed [38, 77, 89] such
as choosing the vertices that are on more shortest paths as the landmarks.
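Using the labels of Figure 1.3c, the landmark query can be sketched as follows (a minimal illustration for an undirected graph; each label is a k-dimensional list of distances to the landmarks):

```python
# Landmark-labeling query sketch using the Figure 1.3c labels (v2 = l1, v3 = l2).

def landmark_query(label_u, label_v):
    # Upper bound on d(u, v): the minimum over landmarks l of d(u, l) + d(v, l).
    return min(du + dv for du, dv in zip(label_u, label_v))

labels = {
    "v1": [3, 3], "v2": [0, 6], "v3": [6, 0], "v4": [4, 4], "v5": [6, 5],
}
print(landmark_query(labels["v1"], labels["v5"]))  # 8, while the actual distance is 4
```

The overestimate for the (v1, v5) pair arises exactly as described above: neither landmark lies on their shortest path.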
1.3 Contributions of the Thesis
To avoid the limitations in landmark choosing and to preserve more distance information
in the distance labels, in this study, we propose a representation learning based approach
to learn an embedding for every vertex as its distance label. Our idea is motivated by the
recent advances in learning graph embeddings. Studies [22, 23, 48] show that vertices
can be mapped into a latent space where their structural similarity (e.g., the number of
common neighboring vertices) can be computed. This motivates us to map the vertices
into a latent space to compute their spatial similarity, i.e., shortest-path distances.
Our learned embeddings do not rely on any particular landmarks, nor do they discriminate
against vertices whose shortest paths do not pass through any landmark. Thus, our embeddings may
yield more accurate distance predictions for such vertices, while we retain a low space
cost. These will be verified by an experimental study on real-world graphs (Chapter 5).
To learn the vertex embeddings, we first adopt existing representation learning mod-
els, including an auto-encoder model [17] and the node2vec model [48]. Given the em-
beddings of two vertices learned by these models, we train a multilayer perceptron (MLP)
to predict the distance between the two vertices. We observe that the vertex embeddings
learned by these models yield poor distance prediction accuracy. For the auto-
encoder, its learned embeddings tend to encode the average distances between the ver-
tices,1 which do not help predict the distance of two specific vertices. Node2vec encodes
the local neighborhood information rather than the global distances. Neither model receives
direct training signals from the distance prediction of two vertices when learning the embeddings
for the two vertices.
To overcome these limitations, we further propose a distance preserving vertex to vector
(vdist2vec) model for vertex embedding. Our vdist2vec model learns vertex embeddings
jointly with training an MLP to make distance predictions based on such embeddings.
This way, the vertex embeddings are guided by signals from distance predictions, which
1Auto-encoders tend to learn to reconstruct the average of all training instances when used without pre-training [52].
can better preserve the distance information.
Our vdist2vec model aims to learn an n × k matrix V, where each row is
the embedding of a vertex (recall that n is the number of vertices and k is the embedding
dimensionality). This matrix is randomly initialized. When training the vdist2vec model,
we use two n-dimensional one-hot vectors to represent two vertices vi and vj for which
the distance is to be predicted. These two vectors are multiplied by V separately, which
fetches the two k-dimensional vectors vi and vj (i.e., the embeddings) corresponding to
vi and vj in V. Vectors vi and vj are then concatenated into a 2k-dimensional vector and
fed into an MLP to predict the distance between vi and vj . The optimization goal here is
to minimize the difference between the predicted distance and the actual vertex distance.
The prediction errors are propagated back to update vi, vj, and the MLP.
Once our model is trained, when a distance query comes with two query vertices vi
and vj , we just need to fetch vi and vj from V and feed them into the trained MLP to
predict the distance between vi and vj .
In summary, our study makes the following contributions:
• We propose a learning based approach to predict vertex distances without the need
to choose a particular set of landmarks for distance labeling. Our approach has
an O(k) distance prediction time cost and an O(kn) space cost, where k is a small
constant denoting the vertex embedding dimensionality.
• We adopt existing representation learning techniques and study their limitations.
To address those limitations, we further propose to learn vertex embeddings while
jointly training an MLP to predict vertex distances based on such embeddings. Our
model is simple and efficient, since it is based on one-hot vectors and an MLP.
Our model is also highly accurate, since the embeddings are guided by distance
predictions directly.
• To further optimize the performance of our model, we propose a novel loss function
and an ensembling based network structure that guide the model learning to suit
the characteristics of the underlying data. We also discuss how to scale our model to
larger graphs, to handle graph updates, and to extend to other graph applications.
• We perform experiments on real road networks, social networks, and web page
graphs. The experimental results confirm the superiority of our proposed approaches.
Compared with state-of-the-art approximate distance prediction approaches, our
approach reduces both the mean and the maximum distance prediction errors, and
the advantage is up to 97%.
• To examine the general applicability of our model, we further perform link predic-
tion and graph reconstruction experiments on social networks. The results show
that our distance guided embeddings are also effective in these applications.
1.4 Outline of the Thesis
The rest of the thesis is organized as follows.
1. In Chapter 2, we review the related work on shortest-path distance computation
models and graph embedding models. We discuss both exact distance computa-
tion models and approximate distance computation models. We describe how each
model works and analyze their advantages and limitations. In addition, we discuss
applying graph embedding methods to the shortest-path distance problem as well as
other graph applications.
2. In Chapter 3, we formulate our problem and present a two-stage solution frame-
work. This framework allows us to adapt existing representation learning tech-
niques to learn vertex embeddings for vertex distance predictions.
3. In Chapter 4, we further propose a single-stage solution. We describe our proposed
model in detail, including the model structure, the loss function, the model optimizations,
and how to scale our model to large graphs and to handle graph updates.
We also discuss how to adapt our model to other graph applications such as link
prediction and graph reconstruction.
4. In Chapter 5, we present experimental results under various settings and examine
the impact of embedding dimensionality, MLP structure, graph updates, and loss
function. We also show the effectiveness of the proposed model in applications
such as POI recommendations, link prediction, and graph reconstruction.
5. In Chapter 6, we conclude the thesis with a discussion on the future work.
Chapter 2
Related Work
In this section, we discuss four lines of related studies: exact shortest-path distance com-
putation, approximate shortest-path distance computation, graph embedding, and graph
embedding applications.
2.1 Exact Distance Computation
To compute shortest-path distances, the first step is to compute the shortest paths. Two
classic shortest-path algorithms are Dijkstra’s algorithm [32] and the Floyd-Warshall algo-
rithm [36]. Dijkstra’s algorithm [32] is a single-source shortest path (SSSP) algorithm that
computes the shortest paths from a given source vertex to all the other vertices in a
graph. The Floyd-Warshall algorithm [36] is an all-pair shortest path (APSP) algorithm that
computes the shortest paths between all vertex pairs in a graph. These two algorithms
have O(m + n log n) [37] and O(n3) time costs, where m and n are the numbers of edges
and vertices, respectively. More recent algorithms such as contraction hierarchies [44] re-
duce the time costs via adding shortcut edges (i.e., shortest paths between some vertices).
Once a shortest path is computed, the corresponding distance can be derived by simply
summing up the edge weights on the path. For efficient distance query processing, these
algorithms may be run to precompute the distance between every pair of vertices. Then,
a distance query can be answered by a simple lookup in O(1) time. Such an approach,
however, has a high space cost, i.e., O(n2).
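As a concrete reference point, Dijkstra's algorithm can be sketched with a binary heap. The small weighted graph below is hypothetical, used only to exercise the sketch:

```python
import heapq

# A sketch of Dijkstra's SSSP algorithm over adjacency lists
# (illustrative; the graph below is hypothetical, not from the thesis).
def dijkstra(graph, source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("a", 1), ("c", 2)],
    "c": [("a", 4), ("b", 2), ("d", 1)],
    "d": [("c", 1)],
}
print(dijkstra(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': 4}
```

The distance of a single pair falls out as one entry of this result; answering many online queries this way is exactly the time cost the labeling approaches below aim to avoid.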
To reduce the space cost while retaining a high query time efficiency, a stream of
studies [14, 24, 29, 43, 58] precompute a distance label for every vertex vi. The distance
label of vi contains the distances to a subset of vertices Vi = {vi1, vi2, . . . , vik} ⊆ V in the
form of 〈(vi1, d(vi, vi1)), (vi2, d(vi, vi2)), . . . , (vik, d(vi, vik))〉. Here, V represents the full
vertex set of a given graph, and k may vary for different vertices. The distance of two
query vertices vi and vj is derived from their distance labels as:1

d(vi, vj) = min{d(vi, v) + d(vj, v) | v ∈ Vi ∩ Vj} (2.1)
For directed graphs, the labels further store the direction information. For example, in
Hub Labeling [10], each vertex v has two labels Lain(v) and Laout(v): Lain(v) stores the
distances from k other vertices to v, while Laout(v) stores the distances from v to k other
vertices. To obtain exact shortest-path distances, at least one vertex on the shortest path
of vi and vj must be in the distance labels of both vi and vj . Otherwise, only approximate
distances may be produced. A key challenge is then to compute a minimum set of vertices
that cover all the shortest paths, so as to minimize the distance label size. Finding the
minimum average label size for a graph is an NP-hard problem [29]. Different heuristics
have been proposed such as pruned landmark labeling [14], multi-hop distance labeling [24],
IS-Label [43], and highway labeling [58]. These techniques are verified empirically on real
graphs. However, their worst-case label size is still O(n2) for general graphs [57].
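The label-based query of Equation 2.1 can be sketched as follows, assuming each distance label is represented as a dict that maps a label vertex to its precomputed distance (the representation and function name are ours, for illustration):

```python
# Minimal sketch of answering a distance query from two distance labels.
import math

def label_query(label_i, label_j):
    # min over common label vertices of d(vi, v) + d(vj, v), per Equation 2.1
    best = math.inf
    for v, d_iv in label_i.items():
        d_jv = label_j.get(v)
        if d_jv is not None:
            best = min(best, d_iv + d_jv)
    return best

# two labels sharing the hub vertex 'h'
La_i = {'h': 3, 'a': 1}
La_j = {'h': 4, 'b': 2}
label_query(La_i, La_j)   # 7
```

If the labels share no vertex, the query returns infinity, which corresponds to the case where no vertex on the shortest path is covered by both labels.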
Pruned landmark labeling [14] uses a pruned breadth-first search (BFS) to build up
the labels. It is based on the naive landmark labeling. Naive landmark labeling runs BFS for
every vertex v and records the distances from all the other vertices to v as its label. It has
an O(n2) label size and an O(n2 + nm) time complexity. Pruned landmark labeling prevents
vertices that would yield larger distances from being added to the labels. It first computes,
for a vertex l1 chosen at random (or in some predefined order), its distances to all the other
vertices and adds l1 to the label of every vertex. Let La1 be the set of distance labels obtained after pro-
cessing l1, and dq(v, u, La1) be the distance between any two vertices v and u computed
from La1. The pruned landmark labeling algorithm then selects another vertex l2 and
runs BFS on it, while using La1 to filter some of the vertices from the search as follows.
When traversing from l2 to a vertex u with distance δ, the algorithm compares δ with
dq(l2, u, La1). If δ is larger, l2 has a shorter path to u via l1. Thus, u will be pruned from
1Assuming an undirected graph. Same for Equation 2.2.
the BFS. It will not be added to the label, and the search process will not continue from
it. The pruned landmark labeling algorithm repeats this process for all the vertices. Since
some of the vertices are pruned, pruned landmark labeling has a smaller label size and a
shorter computation time than the naive landmark labeling.
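The pruned BFS described above can be sketched for an unweighted, undirected graph given as adjacency lists. This is a simplified illustration (the vertex processing order is arbitrary here, and the helper names are ours, not from [14]):

```python
# Simplified sketch of pruned landmark labeling.
from collections import deque
import math

def label_distance(labels, u, v):
    # distance estimate from the labels built so far
    common = labels[u].keys() & labels[v].keys()
    return min((labels[u][w] + labels[v][w] for w in common), default=math.inf)

def pruned_landmark_labeling(adj):
    labels = {v: {} for v in adj}
    for root in adj:                       # each vertex takes a turn as a landmark
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            # prune u if the existing labels already certify a path that is
            # no longer than the current BFS distance
            if label_distance(labels, root, u) <= dist[u]:
                continue
            labels[u][root] = dist[u]      # add root to u's label
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
    return labels

adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
labels = pruned_landmark_labeling(adj)
```

On this small path graph, the resulting labels answer all pairwise queries exactly, e.g., `label_distance(labels, 'a', 'c')` returns 2.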
Multi-hop labeling [24] is based on 2-hop distance labeling [29]. The 2-hop distance
labeling technique computes a distance query from v to u by summing up the distance
from v to a hub vertex w (first hop) and the distance from w to u (second hop). The dis-
tances between all vertices and hub vertices (such as d(u,w) and d(w, v)) are precomputed
and stored as labels. For each vertex v, we denote its label set as La(v). For multi-hop
labeling, it gives each vertex a label and a parent vertex Pa(v). For query processing,
if La(v) ∩ La(u) = ∅, La(Pa(v)) and La(Pa(u)) are checked recursively, until a common
hub is found in the labels. This effectively builds a hierarchical structure on the vertices,
where the distance label of a vertex can be used by all its descendant vertices and does not
need to be stored in multiple copies. This way, the overall label set size is reduced. To
build this hierarchical structure, a tree decomposition [79] is computed which is guided
by the vertex degrees.
IS-labeling [43] again uses a hierarchical structure. In its labeling process, IS-labeling
removes sets of independent vertices recursively and stores the removed vertices in the
labels. A set of vertices are independent if there is no edge between any vertices in the set.
For example, as shown in Figure 2.1, {v1, v4, v5} is an independent set. When removing
an independent set, replacement edges are added to keep the remaining vertices connected
and their shortest-path distances unchanged. For example, if v1 in Figure 2.1 is removed, an
edge is added between v2 and v3 with length 6. In addition, v2 and v3 will be regarded as
parent vertices of v1, and v1 is stored in the labels of v2 and v3. The procedure continues
until there is only one vertex left. This builds up a hierarchical structure over the vertices.
In this structure, the ancestor vertices act similarly to hub vertices. When querying the
distance of two vertices, the query algorithm checks if they share a common ancestor.
The optimization problem for IS-labeling then becomes minimizing the steps needed for
constructing the hierarchy (to create fewer ancestors). As this is an NP-hard problem,
an approximate approach was proposed that removes the vertices with smaller degrees
first [43].

Figure (2.1) A graph example
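One removal step with replacement edges can be sketched as follows, assuming the graph is stored as a dict of dicts mapping each vertex to its weighted neighbors (the representation and the function name are ours, for illustration):

```python
# Hedged sketch of removing one vertex while preserving shortest-path
# distances among the remaining vertices, as in IS-labeling.
def remove_with_replacement(graph, v):
    neighbors = graph.pop(v)
    for u in neighbors:
        del graph[u][v]
    for u, du in neighbors.items():
        for w, dw in neighbors.items():
            if u == w:
                continue
            # a replacement edge is needed only if going through v was shorter
            if du + dw < graph[u].get(w, float('inf')):
                graph[u][w] = graph[w][u] = du + dw
    return neighbors        # v's former neighbors become its parent vertices

# mirroring the example above: removing v1 adds a length-6 edge (v2, v3)
g = {'v1': {'v2': 3, 'v3': 3}, 'v2': {'v1': 3}, 'v3': {'v1': 3}}
parents = remove_with_replacement(g, 'v1')
```

After the call, `g['v2']['v3']` is 6, and v1 is recorded under its parents v2 and v3, matching the example in the text.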
Highway labeling [58] chooses sets of connected vertices and their edges as “high-
ways”, and computes labels that store each vertex’s distances to those “highways”. A shortest-
path distance query can then be answered by summing up both query vertices’ distances
to their shared highway and the length between their highway entrance and exit. Dur-
ing construction, this technique identifies vertices that appear in few other vertices’
k-nearest-neighbor sets (where k is a system parameter). Such vertices are
used as the highway vertices. For example, if we set k to 5, for a shortest path pu,v from
vertex u to vertex v, a vertex c on pu,v will be considered as a highway vertex only
if it is included in neither the set of the 5 nearest vertices of u nor that of v.
These heuristic methods reduce the label size on various datasets. For example, high-
way labeling has a relatively small label size on road networks, while it may not be op-
timal for complex networks. The worst-case space costs are still O(n2) [57]. The hop-
doubling labeling [57] technique achieves an O(hn) space cost where h is a small constant.
It assumes scale-free graphs rather than general graphs.
2.2 Approximate Distance Computation
Approximate shortest-path distance algorithms trade distance accuracy for further re-
ducing the space cost [74, 88]. Most approximate algorithms take a landmark based
approach [77, 89, 90], where a subset of k (k ≪ n) vertices are chosen as the landmarks,
denoted by L (L ⊂ V ). For every vertex vi, its distances to the landmarks are precom-
puted, which form the distance label of vi. This reduces the space cost to O(kn). At
query time, the landmark l ∈ L that is the closest to the two query vertices vi and vj is
computed, and the sum of its distances to vi and vj is returned, i.e.,
d(vi, vj) ≈ min{ d(vi, l) + d(vj , l) | l ∈ L }    (2.2)
The accuracy of a landmark based algorithm depends on how close the landmarks are to
the shortest paths. It is shown [26, 77] that landmarks at graph center help obtain a higher
accuracy. Intuitively, landmarks at graph center may be passed by more shortest paths
than those at graph boundary. Since finding the k optimal landmarks is NP-hard [77],
heuristics such as degree [39], betweenness [38], and closeness [77] are used to mea-
sure the centrality of the vertices and to choose the landmarks. Here, the degree of a
vertex is the number of edges connected to it. A vertex with a larger degree has a higher
chance to be passed by more (shortest) paths. Degree centrality guided landmark selec-
tion could be effective in datasets that have a hierarchical structure, e.g., a web
graph where a home page links to all the other pages. Betweenness measures
the number of shortest paths passing through a vertex, while closeness measures the av-
erage distance of a vertex to all other vertices, i.e., if a vertex’s average distance to all the
other vertices is small, the vertex is likely to be located at the graph center. Betweenness
is shown to outperform closeness [89]. It is used as one of our baselines. The centrality
heuristic may be suboptimal for query vertices near each other – the shortest paths of such vertices may
not pass the graph center. Another problem of centrality based landmark selection is that
the selected landmarks are usually close to each other [89]. As a result, many vertices
would be far away from the selected landmarks while being close to each other. The pre-
dicted distances between these vertices may then be much larger than the exact distances. This
issue is also observed in our experiments. Figure 2.2 shows landmarks selected by two
strategies, betweenness centrality and degree centrality, on a Chinese city (Dongguan)
road network [4]. As shown in Figure 2.2, degree centrality is better than betweenness
centrality on the DG graph as it covers more areas in the graph. However, on a Facebook
politician network [3], betweenness centrality has much smaller errors than degree cen-
trality. Detailed experimental results on these methods are presented in Chapter 5. Takes
and Kosters [89] propose an adaptive landmark selection strategy that balances between
centrality and coverage. They introduce an indicator called success rate, which is the ratio
of the number of correctly predicted shortest paths by the selected landmarks to the total
number of shortest paths in the graph. They compute the accumulation of success rates
for landmarks sorted by degree centrality or betweenness centrality, and incrementally
choose the vertices that contribute the maximum success rate increments.
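A greedy selection in the spirit of the success-rate idea above can be sketched as follows. All helper names are ours, not from [89]; exact distances come from BFS on a small unweighted graph, and each step adds the candidate landmark that lets the most vertex pairs be estimated exactly:

```python
# Hedged sketch of coverage-guided incremental landmark selection.
from collections import deque
from itertools import combinations

def bfs_distances(adj, s):
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def greedy_landmarks(adj, k):
    dists = {v: bfs_distances(adj, v) for v in adj}
    pairs = list(combinations(adj, 2))

    def success(landmarks):
        # pairs whose landmark estimate equals the true distance
        return sum(1 for u, v in pairs
                   if any(dists[l][u] + dists[l][v] == dists[u][v]
                          for l in landmarks))

    chosen = []
    for _ in range(k):
        best = max((v for v in adj if v not in chosen),
                   key=lambda v: success(chosen + [v]))
        chosen.append(best)
    return chosen

adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
greedy_landmarks(adj, 1)   # the middle vertex 'b' covers all pairs exactly
```

This brute-force sketch recomputes the coverage for every candidate and is only meant to convey the selection criterion, not an efficient implementation.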
(a) Degree Centrality (b) Betweenness Centrality
Figure (2.2) Landmark distribution on Dongguan
Sankaranarayanan and Samet [84, 85] adopt the method of well-separated pair decompo-
sition (WSPD) [86] to cluster vertices. For each cluster, they select a representative vertex.
The distances between vertices in different clusters are approximated by the distance be-
tween their representative vertices. They state that the paths between vertices in two
different clusters could have a long shared part. For example, the path that we drive
from the CBD of city A to city B may be similar to the path that we drive from a suburb
of city A to city B, i.e., the same highway is used, while the paths to get to the highway
might be different. They use a point-region Quadtree [35, 70, 83] to store vertices based
on their geo-coordinates and check recursively if the nodes in the tree structure are well
separated. As vertices with larger Euclidean distances to each other are more likely to
be well separated, applying the Quadtree structure helps find a decomposition with
fewer sets. A random vertex in each set in the decomposition is chosen as the representative
vertex, and the distances from vertices in set A to vertices in set B are estimated by
the distance between their representative vertices. This method bounds the error such that
(1 − ε)d(u, v) ≤ d̃(u, v) ≤ (1 + ε)d(u, v), where d̃(u, v) is the estimated distance and ε is a
parameter defined based on the well-separated condition in the precomputing process. A
smaller ε means lower errors but more sets in the decomposition and a higher space cost.
This method is designed to
optimize the relative error but not the absolute error.
For query processing, Goldberg uses a landmark based upper bound to constrain the
search area of Dijkstra’s algorithm and reduce its running time [45]. The idea is that if a
graph traversal reaches a vertex with a distance larger than the landmark-based estima-
tion, then the traversal can be terminated from traversing beyond that vertex. Gubichev
et al. [49] improve the landmark based methods’ accuracy by removing cycles on the
path. They first concatenate the path from the source vertex to the landmark and the
path from the landmark to the target vertex. They then delete the vertices that have been
visited twice. However, this method needs to store not only the distance but also the
vertices on the paths to the landmarks. It increases the space complexity as well as the
query cost due to finding the cycles on the paths [88].
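The cycle-removal idea can be sketched as follows. The implementation details (a single left-to-right scan that cuts each detected cycle back to the first visit of the repeated vertex) are ours, not a verbatim rendering of the algorithm in [49]:

```python
# Sketch: shorten a concatenated source->landmark->target path by
# cutting out the segment between two visits of the same vertex.
def remove_cycles(path):
    out, pos = [], {}
    for v in path:
        if v in pos:
            # v was seen before: drop the cycle back to its first visit
            for w in out[pos[v] + 1:]:
                del pos[w]
            del out[pos[v] + 1:]
        else:
            pos[v] = len(out)
            out.append(v)
    return out

# s -> a -> b (to the landmark) concatenated with b -> a -> t
remove_cycles(['s', 'a', 'b', 'b', 'a', 't'])   # ['s', 'a', 't']
```

The shortened path yields a distance estimate no larger than the plain landmark sum, which is why the method improves accuracy at the cost of storing the paths.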
Theoretical results (e.g., [25, 74, 92]) are offered to bound the relationship between the
distance label size and the distance approximation accuracy. In particular, the (worst-
case) distance approximation accuracy is often referred to as the stretch. An algorithm is
said to have a stretch of (α, β) if its distance approximation di,j for any two vertices vi and
vj satisfies d(vi, vj) ≤ di,j ≤ α · d(vi, vj) + β, where α ≥ 1 and β ≥ 0 [88]. For general
graphs, we usually consider β to be 0 and focus on α, and α is effectively the maximum
relative error. On undirected graphs, Thorup and Zwick [92] show that any algorithm
with an approximation ratio (i.e., stretch) of α < 2c + 1 (c ∈ N+) must use Ω(n^{1+1/c})
space based on Erdős’ Girth Conjecture [33]. They construct a structure using O(cn^{1+1/c})
space and O(cmn^{1/c}) time to obtain an approximation ratio of α = 2c − 1 and an O(c)
query time. Their model constructs “balls” with fixed and limited diameters that cover
vertices. A shortest-path distance query will be answered by the ball with the smallest
diameter that covers both vertices. Given the same α = 2c − 1, Chechik [25] improves the
space cost to O(n^{1+1/c}) and the query time to O(1), with an increased preprocessing time
of O(n^2 + mn^{1/2}) (recall that m is the number of edges). These studies are mainly of theoretical
interest. No empirical results are presented in them. Das Sarma et al. [30] implement a
simplified version of Thorup and Zwick’s algorithm. They retain the O(cn^{1+1/c}) space
cost. Taking the smallest value of c = 1, this algorithm still has an O(n2) space cost.
This algorithm will not be discussed further as we aim for a space cost linear to n for
scalability considerations.
For a comprehensive review on exact and approximate distance algorithms, inter-
ested readers are referred to [88].
2.3 Graph Embedding
Another stream of studies embed a graph into a latent space to compute vertex distances
with a lower cost, e.g., via Euclidean distance or cosine similarity. Metric embedding [9,
18, 53] and matrix factorization [13, 15, 22, 54, 71, 82] (e.g., over the adjacency matrix of
an input graph) are used for this purpose in earlier studies. For example, Locally Linear
Embedding (LLE) [82] assumes that the vector of a vertex in the embedding space is the sum
of the vectors of its neighbours. It can be defined as Zi = ∑_j Yij Zj , where Y is the adjacency
matrix, and Z is the embedding matrix. Notice that Yij is 1 when vertex vi and vertex vj
are connected, while Yij is 0 when they are not. Therefore, only vectors of neighbours of
vi are added to compute Zi. For each vertex, LLE aims to reduce the difference between
its corresponding vector and the sum of the vectors of its neighboring vertices. For all
vertices, it minimizes ∑_i ||Zi − ∑_j Yij Zj ||². For another example, Laplacian Eigenmaps
(LE) [16] embeds vertices that share an edge with a large weight (i.e., closely related,
“small distance” in our setting) to be close in the latent space. It minimizes ∑_{i,j} ||Zi − Zj ||² Wij ,
which uses the weight Wij as a penalty in the embedding space. A pair of vertices with a
large weight is weighted more in the error, which forces them to be closer. LLE and LE
can be solved as an eigenvalue problem, and both of their computation complexities
are O(|E|k2) where k is the number of dimensions [46]. Ahmed et al. proposed Graph
Factorization (GF) that represents the edge Yij with the inner product of the corresponding
vertex vectors, 〈Zi, Zj〉 [13]. The loss function is defined as below, where λ is a regularization
parameter:

f(Y, Z, λ) = (1/2) ∑_{i,j} (Yij − 〈Zi, Zj〉)² + (λ/2) ∑_i ||Zi||²

It has an O(|V |3) time complexity as the inner products of all vertex pairs are computed.
GraRep [22] captures k-step relational information of vertices. It defines the first-step transi-
tion matrix as A = D−1Y , which indicates the probability of travelling from one vertex to
another, where Y is the adjacency matrix, and D is a diagonal matrix with Dii = ∑_j Yij .
The transition matrix after k transitions is computed as the k-th power of A. Their goal
is to build an embedding matrix that can quickly estimate the transition matrix. Ou et
al. introduced the High Order Proximity preserved Embedding (HOPE) which has a similar
structure, while using a similarity matrix rather than a transition matrix [71].
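The transition matrix construction above can be illustrated on a tiny adjacency matrix. Plain Python lists are used instead of a linear algebra library for self-containment, and the function names are ours:

```python
# Illustrative sketch of A = D^{-1} Y and its k-step power, as in GraRep.
def transition_matrix(Y):
    # divide each row of the adjacency matrix by the vertex's degree
    return [[y / sum(row) for y in row] for row in Y]

def matrix_power(A, k):
    n = len(A)
    R = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(k):
        R = [[sum(R[i][t] * A[t][j] for t in range(n)) for j in range(n)]
             for i in range(n)]
    return R

Y = [[0, 1], [1, 0]]        # two vertices joined by a single edge
A = transition_matrix(Y)    # one-step transition probabilities (rows sum to 1)
A2 = matrix_power(A, 2)     # after two steps the walk is back where it started
```

On this two-vertex graph, the two-step transition matrix is the identity, since every walk of length two returns to its starting vertex.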
More recent studies use deep learning to learn embeddings that preserve certain at-
tributes of a graph. The most relevant attributes studied are the first-order and second-
order proximities, i.e., edge weights and vertex neighborhood similarity. Random walk
and auto-encoder are used to learn the embeddings. The random walk technique [48, 75]
samples the graph as a discrete distribution and embeds the sample into vectors. For
example, node2vec [48] runs random walks on a graph to generate sequences of vertices,
which are treated as “sentences” to learn vertex embeddings using word2vec [67]. The
learned embeddings preserve the vertex neighborhood information (i.e., vertices on the
same sequence). Node2vec only needs to observe a small portion of the graph for each
sampling, with an O(|V |k) time complexity. This is beneficial when the entire graph is too
large to be processed together.
An auto-encoder is formed by an encoder and a decoder. The encoder maps the in-
put into a latent representation, while the decoder aims to map the latent representa-
tion back to the original input. By minimizing the difference between the encoder input
and the decoder output, the encoder is trained to generate latent representations that
preserve the input information. Following this idea, auto-encoders are used to learn
graph embeddings. For example, Deep Neural Networks for Learning Graph Representations
(DNGR) [23] uses random surfing which resembles random walks to generate a proba-
bility co-occurrence matrix. This matrix contains estimations of the transition probability
between the vertices. It is then transformed to a positive pointwise mutual informa-
tion (PPMI) [27] matrix and fed into an auto-encoder to learn the embeddings. Another
auto-encoder based model named Structural Deep Network Embedding (SDNE) [96] takes
a graph adjacency matrix as its input. It adapts the loss function to preserve both the
first-order and second-order proximities: (i) it adds a penalty when vertices nearby are
mapped far away in the embedding space (first-order proximity), and (ii) it penalizes
more on errors in the output corresponding to non-zero elements in the input (second-
order proximity). DNGR and SDNE use a |V |-dimensional vector as the input (e.g., a
row of the adjacency matrix) which could be costly in computation. To address this lim-
itation, Kipf and Welling [63] proposed a Graph Convolutional Network (GCN) that uses a
convolutional method to learn the embedding of a vertex in multiple aggregating itera-
tions. In each iteration, the embedding of each vertex is a combination of its neighbors’
embeddings. As it takes only local neighborhood information for training, it reduces the
time complexity to O(|E|k2) (k is the embedding dimensionality) compared to DNGR
(O(|V |2)) and SDNE (O(|V ||E|)).
These models are not designed to learn global vertex distances. Nevertheless, they
can be adapted to predict vertex distances (detailed in Section 3.2). We compare with
them in our experiments to highlight the advantage of our proposed model in preserving
vertex distance information.
For a comprehensive review on graph embedding techniques, interested readers are
referred to [21, 46, 50].
2.4 Other Graph Embedding Applications
Graph embeddings have been applied to a rich set of applications. We briefly discuss a
few of them, including graph compression (reconstruction), link prediction, and graph
visualization.
Graph compression is first introduced by Feder and Motwani [34] who propose to
build a graph representation with fewer edges than the original graph to accelerate graph
analysis algorithms. After that, studies [72, 93, 94] propose graph compression methods
based on grouping similar vertices. Graph embeddings as a graph representation may
be used in graph compression if we can reconstruct the graph based on the embeddings
with a high accuracy.
Link prediction aims to find missing links or predict future links in graphs based
on the observed graph structure. It has many applications. For example, in social net-
work graphs, link prediction can be used to find potential relationships for friend recom-
mendation and advertising. To achieve this goal, one way is to compute vertices’ simi-
larity and predict probable links among them [12, 62]. Other methods such as maximum
likelihood methods [20, 28] and probabilistic methods [42, 51, 99] solve the problem from
a statistical viewpoint. Graph embeddings map vertices into a latent space so that their
similarity can be computed. For example, we can use Euclidean distance of the vertex
embeddings to describe their similarity.
Visualizing a graph in a proper way helps viewers gain information about a graph
conveniently and quickly. It has many applications in different fields where graph data
are used [31, 40, 60, 91]. Embedding representations can be fed into a dimensionality
reduction model such as Principal Component Analysis (PCA) [73], and then be visual-
ized in a Euclidean space. The Euclidean distances between the vertices can demonstrate
their hidden relationship clearly.
2.5 Summary
In this chapter, we reviewed methods for shortest-path distance computation, including
landmark based methods, distance labeling methods and graph embedding methods.
By comparing these methods, we obtain a clearer view about their advantages and lim-
itations. Distance labeling methods may have a high accuracy, while their space cost is
O(n2) in the worst case. Landmark based methods have a linear space cost (O(kn)) but
potentially a lower accuracy. Graph embedding methods also have a linear space cost
(to embedding dimensionality), and they have advantages in query speed (i.e., parallel
vector processing). However, it is challenging to keep both the global and local distance
information of a graph in the embedding vectors. These motivate us to develop a learning
based approach that retains the advantages of graph embeddings in query efficiency
while overcoming the challenges in preserving the distance information.
Chapter 3
Adapted Two-Stage Models
This chapter presents our problem solutions by adapting existing representation learning
techniques. We start with basic concepts and a problem definition in Section 3.1. We then
present a two-stage framework for adapting representation learning techniques to solve
our problem in Section 3.2. We adapt existing representation learning models to make
distance predictions and show their limitations in Section 3.3.
3.1 Problem Formulation
We consider a graph G = 〈V,E〉, where V is a set of vertices and E is a set of edges. An
edge ei,j ∈ E represents a connection between two vertices vi and vj ∈ V . Each edge ei,j
is associated with a weight denoted by ei,j .w, which represents the cost (i.e., distance) to
travel across the edge. For simplicity, in what follows, our discussions assume undirected
edges, i.e., one can travel in both directions on ei,j with the same cost ei,j .w, although
our proposed techniques work for both directed and undirected edges.
Given two vertices vi and vj in G, a path pi,j between vi and vj consists of a sequence
of vertices vi → v1 → v2 → ... → vx → vj starting from vi and ending at vj , such that
there is an edge between any two adjacent vertices in the sequence. The length of pi,j ,
denoted by |pi,j |, is the sum of the weights of the edges between adjacent vertices in pi,j :
|pi,j |= ei,1.w + e1,2.w + ...+ ex,j .w (3.1)
Among all the paths between vi and vj , we are interested in the one with the smallest
length, i.e., the shortest path. Let such a path be p∗i,j .

Figure (3.1) Solution framework. (a) Vertex representation learning: a representation
learning network maps the vertices v1, . . . , v5 of graph G to k-dimensional vectors. (b)
Distance predictor training (distance prediction): a distance prediction network (MLP)
takes vi and vj as input and outputs di,j .

The length of this path is the (shortest-path) distance between vi and vj , denoted by
d(vi, vj).
d(vi, vj) = |p∗i,j | (3.2)
Given the concepts above, the shortest-path distance query is defined as follows.
Definition 1 (Shortest-path distance query) Given two query vertices vi and vj that belong
to a graph G, a shortest-path distance query returns the shortest-path distance between vi and
vj , i.e., d(vi, vj).
Our aim is to provide an approximate answer for a shortest-path distance query with
a high accuracy and efficiency.
3.2 A Two-Stage Solution Framework
We take a learning based approach to answer shortest-path distance queries. Given a
graph G, we first take a two-stage procedure that allows us to adapt existing representa-
tion learning techniques to answer shortest-path distance queries:
1. Representation learning. We preprocess G by mapping each vertex vi ∈ V to a k-
dimensional vector representation vi ∈ Rk (cf. Figure 3.1a).1 The goal of this stage
is to learn vertex representations that preserve the graph distances between the
vertices, i.e., vertices that have small distances inG should also have small distances
for their learned vector representations, and vice versa.
2. Distance predictor training. We train a multi-layer perceptron (MLP) using the learned
vectors vi and vj between every pair of vertices vi and vj in V as the input and
distance d(vi, vj) as the target output (cf. Figure 3.1b). We use the mean square error
as the default loss function Ld to optimize the MLP parameters:
Ld = EP [(d(vi, vj)− di,j)2] (3.3)
Here, di,j denotes the predicted distance between vi and vj , and P denotes a dis-
tribution over V × V . In the simplest case, P is just the full set of V × V , i.e., to
optimize for every pair of vertices in G.
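The distance predictor can be sketched as a one-hidden-layer MLP with ReLU units trained by stochastic gradient descent on the squared error of Equation 3.3. This is pure Python for illustration only; a real implementation would use a deep learning library, and the toy task (predicting |a − b| from a concatenated pair of 1-dimensional "embeddings") is our own stand-in for real vertex embeddings and distances:

```python
# Minimal MLP distance predictor trained with mean squared error.
import random
random.seed(0)

def mlp_init(n_in, n_hid):
    w1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
    w2 = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
    return w1, w2

def forward(w1, w2, x):
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]  # ReLU layer
    return h, sum(w * hi for w, hi in zip(w2, h))                       # linear output

def train_step(w1, w2, x, target, lr=0.01):
    h, y = forward(w1, w2, x)
    err = y - target                        # gradient of 0.5 * squared error
    for j, hj in enumerate(h):
        grad_h = err * w2[j]
        w2[j] -= lr * err * hj
        if hj > 0.0:                        # ReLU passes gradient only when active
            for i, xi in enumerate(x):
                w1[j][i] -= lr * grad_h * xi

def mse(w1, w2, data):
    return sum((forward(w1, w2, x)[1] - t) ** 2 for x, t in data) / len(data)

# train on all "vertex pairs" of the toy task
data = [([a / 10, b / 10], abs(a - b) / 10) for a in range(10) for b in range(10)]
w1, w2 = mlp_init(2, 8)
before = mse(w1, w2, data)
for _ in range(200):
    for x, t in data:
        train_step(w1, w2, x, t)
after = mse(w1, w2, data)    # training reduces the mean squared error
```

The concatenated pair of vertex vectors plays the role of the MLP input of Figure 3.1b, and the target is the known shortest-path distance.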
At query time, given two query vertices vi and vj , their learned representations vi
and vj are fetched and then fed into the MLP trained in the stage above. The input order
of the vectors reflects the travel direction in the graph. For example, if we travel from vj to
vi instead, vj will be put in front of vi when being fed into the MLP. For directed graphs,
this can distinguish the distance difference when the travel direction changes. This also
applies to our Vdist2vec model proposed in Chapter 4. The output of the MLP is returned
as the distance query answer (cf. Figure 3.1b).
Our two-stage model framework takes advantage of the recent advances in neural
networks and representation learning to avoid online graph traversals. Since neural net-
work inference (i.e., predictions) can be done efficiently, our solution can offer query
answers with a high efficiency.
Our solution offers approximate query answers, the accuracy of which is determined
by the quality of the learned vertex vectors. In what follows, we focus on the represen-
tation learning stage to obtain high-quality vectors that well preserve the vertex distance
information. The distance predictor training and distance prediction stages use standard
MLP training and inference procedures. They will not be detailed further.

1For directed graphs, we learn two embeddings for each vertex vi, one for vi as the source vertex and the
other for vi as the destination vertex, respectively.
3.3 Adapted Representation Learning Models
As discussed in Chapter 2, there is a recent advance in graph embedding techniques
using deep learning. We adapt two representative techniques to learn vertex vectors for
distance prediction: node2vec [48] and auto-encoders [17]. We also examine using geo-
coordinates and landmark-based distance labels for vertex representations, to test the
ability of the MLP to learn a distance prediction function based on them.
3.3.1 Node2vec
The idea of node2vec comes from word2vec [67] – a model that learns vector representa-
tions for words. In word2vec, a word w is mapped into a latent space where it is close to
its context words. Here, a context word of w is a word that appears within a predefined
distance δ from w in a given (large) text corpus. The predefined distance δ forms a context
window around w, e.g., δ = 5 means five words on each side of w in a sentence.
Under a graph setting, the given graphG can be seen as a large “corpus”, and the ver-
tices can be seen as the “words”. The “sentences” (and hence context windows) can be
generated by random walks [66] onG using (the inverse of) edge weights as the transition
probabilities. Then, the word2vec model applies directly. For model training, node2vec
uses the skip-gram [67] technique. This technique learns word vectors via optimizing a
neural network that predicts for every word w the probability of every other word to
appear in its context windows. The optimization goal is to maximize the probability of
observing all the context words of w. In node2vec, this optimization goal translates to
maximizing the log-probability of the neighborhood vertices N(vi) for a vertex vi condi-
tioned on its vector representation vi:

arg max ∑_{vi∈V} log Pr(N(vi) | vi)
= arg max ∑_{vi∈V} log ∏_{vj∈N(vi)} Pr(vj | vi)
= arg max ∑_{vi∈V} log ∏_{vj∈N(vi)} exp(vj⊤vi) / ∑_{v∈V} exp(v⊤vi)    (3.4)
This optimization function guides the model to produce similar vectors for vi and the
vertices in N(vi) (so as to maximize the dot product in the numerator). Here, N(vi)
contains the vertices passed by a random walk that goes through vi. The output of the
first hidden layer of the trained network is the vector for vi (given vi as the input).
Node2vec allows biasing the random walks towards either a breadth-first or a depth-
first walk. This can generate N(vi) that consists of vertices around some vertex or along
some path (or a combination of both). The vertex vectors learned from such N(vi) may
preserve the information on which vertices are nearby in G. We use these vectors for
distance prediction. A limitation of such vectors, however, is that they may not preserve
the distance for vertices far away. This is because vertices far away do not fall in the
same neighborhood as frequently as those nearby, and the vectors may not be optimized
to preserve their distances.
In a recent study [78], node2vec embeddings are used for shortest-path distance pre-
diction. Before the learned embeddings of the two query vertices are fed into the distance
prediction network, four different operations are explored to combine the two vertex em-
beddings, namely, subtraction, concatenation, average, and point-wise multiplication.
As shown in its experimental results, on four different data sets tested, the concatena-
tion based model yields the best distance prediction accuracy. Therefore, we also use
concatenation based models in our experiments. As that study did not specify the num-
ber of nodes in each layer of its neural network except the embedding layer, we set the
MLP structure the same as our other models, which are 100 nodes and 20 nodes in the
two hidden layers, respectively. As shown in Table 3.1, we obtain lower distance predic-
tion errors on the data sets used in [78] with our implementation. We use this as the
default setting for the node2vec model in our experiments.
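The four embedding-combination operators explored in [78] can be sketched for plain Python vectors (the function name is ours; concatenation performed best in that study):

```python
# The four operators for combining two vertex embeddings before the MLP.
def combine(vi, vj, op):
    if op == 'concatenation':
        return vi + vj
    if op == 'subtraction':
        return [a - b for a, b in zip(vi, vj)]
    if op == 'average':
        return [(a + b) / 2 for a, b in zip(vi, vj)]
    if op == 'multiplication':
        return [a * b for a, b in zip(vi, vj)]
    raise ValueError(op)

combine([1, 2], [3, 4], 'concatenation')   # [1, 2, 3, 4]
```

Note that concatenation is the only operator that preserves the input order of the two embeddings, which is what lets the predictor distinguish travel directions on directed graphs.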
Table (3.1) Node2vec Based Distance Prediction Errors

              Mean Absolute Error                Mean Relative Error
              Reported [78]  Our implementation  Reported [78]  Our implementation
FaceBook      0.258          0.237               0.099          0.097
BlogCatalog   0.275          0.255               0.119          0.109
We note that the existing study [78] focuses only on social networks. In our experi-
ments, we use node2vec to learn vertex vectors for distance predictions on road networks,
social networks, and web page graphs.
Table (3.2) Node2vec Based Distance Prediction Errors on Different Networks

        Mean Absolute Error  Mean Relative Error  Max Absolute Error  Max Relative Error
DG      2,329                0.199                45,564              1,806
MB      118                  0.161                2,953               187
SU      658                  0.175                18,917              500
FBPOL   0.411                0.094                9                   9
FBTV    0.697                0.143                16                  16
EPA     0.603                0.150                7                   7
Table 3.2 shows the node2vec approach’s performance on road networks (DG, MB, and
SU), social networks (FBPOL and FBTV), and a web page graph (EPA). As each vertex
embedding learned by node2vec is based on its neighbor vertices, it works well on high-
density graphs such as social networks and web page networks, as shown in the table. In
those graphs, the neighborhood information carries enough knowledge to support distance
prediction between vertices. In other words, distance prediction between two vertices is
based on the similarity of their neighbors.
3.3.2 Auto-encoder
Auto-encoders were originally proposed for dimensionality reduction. They are used
for representation learning in more recent studies (e.g., [17]). An auto-encoder neural
network consists of two components – an encoder and a decoder, each of which may
consist of multiple hidden layers. The encoder maps the input into a latent representation
(usually with capacity constraints, otherwise it may simply learn the identity function for
the input). The decoder aims to map the latent representation back to the original input.
The optimization goal is to minimize the difference between the encoder input and the
decoder output, so as to preserve more input information in the latent representation.
Let φ(·) and ψ(·) be the two mapping functions learned by the encoder and the decoder,
respectively. Then, the loss function La for the auto-encoder can be written as:
L_a = \sum_{i=1}^{n} \|x_i - \psi(\phi(x_i))\|_2^2    (3.5)
Here, xi represents an input instance (in a vectorized form), and ||·||2 is the L2 distance.
Once the auto-encoder is trained, the latent representation of xi is computed as φ(xi).
Figure (3.2) Auto-encoder embedding model: the all-pair shortest-path distance matrix is fed into an auto-encoder, whose latent representations feed the distance prediction network (MLP)
As shown in Figure 3.2, we adapt an auto-encoder to learn vertex representations. For
each vertex vi, we compute its distances to all vertices in V to form a distance vector. This
vector is used as the input vector xi to train an auto-encoder. Once the auto-encoder is
trained, the representation vi of vi is computed as:
vi = φ(xi) (3.6)
Vector vi is expected to preserve the vertex distances stored in xi. However, a limitation
of such an adapted model is that its input vector xi has an O(|V |) size. When there are
many vertices (e.g., millions), the auto-encoder may become too expensive to train. Also,
auto-encoders tend to learn to reconstruct the average of all training instances [52]. In
our case, vi tends to yield the average distances between the vertices, which do not help
the distance prediction.
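As a concrete sketch of this input construction, the following builds the distance vectors x_i on a toy unweighted graph via BFS (a weighted graph would use Dijkstra's algorithm instead):

```python
from collections import deque

def distance_vector(adj, source):
    """BFS shortest-path distances from `source` to every vertex of an
    unweighted graph (dict: vertex -> list of neighbors). Row i of the
    all-pair distance matrix is the auto-encoder input x_i; note its
    O(|V|) size, the limitation discussed above."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return [dist[v] for v in sorted(adj)]

# toy path graph: 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
X = [distance_vector(adj, v) for v in sorted(adj)]  # one input x_i per vertex
```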
Table (3.3) Auto-encoder Based Distance Prediction Errors on Different Networks

        Mean Absolute Error  Mean Relative Error  Max Absolute Error  Max Relative Error
DG      1,821                0.178                45,645              1,505
MB      300                  0.308                2,920               192
SU      217                  0.070                9,966               244
FBPOL   0.964                0.239                9                   4
FBTV    1.618                0.278                14                  5
EPA     0.597                0.148                5                   5
Table 3.3 shows the performance of the auto-encoder approach on road networks (DG, MB, and SU), social networks (FBPOL and FBTV), and the web page graph (EPA). The auto-encoder works well on graphs with larger diameters such as DG and SU. These graphs have a large variance in distances, which guides the auto-encoder to learn distinguishable embeddings for each vertex.
3.3.3 Geo-coordinates and Landmark Labels
Jindal et al. [59] use a deep neural network (DNN) based model to predict shortest-path dis-
tances given the geo-coordinates (longitudes and latitudes) of the vertices as input. This
can be seen as using vertex coordinates as the embeddings. The shortest-path distance of
two vertices can be proportional to their Euclidean distance in many cases. However, this
is not always true. For example, as shown in Figure 3.3, the Euclidean distance between
point A and point B is substantially smaller than their shortest-path distance. Such ob-
servations are not uncommon in real road networks crossing rivers or highways. Also,
not all graphs contain vertex coordinates, e.g., social network graphs do not, which limits
the applicability of this approach.
Figure (3.3) A road network example
Another simple way to obtain vertex representations is to use their distance labels
such as the landmark labels as described in Section 2.2. Then, we take advantage of the
capability of the MLP to learn a non-linear function to predict the shortest-path distance
based on such distance labels, rather than simply scanning the labels and summing up the
distances to the landmarks. We also use this approach as a baseline model in Chapter 5.
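For illustration, landmark labels can be computed as distance vectors to a few chosen landmarks. The sketch below uses a toy unweighted graph and arbitrary landmarks; actual landmark selection strategies are those of Section 2.2:

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def landmark_labels(adj, landmarks):
    """Each vertex's label is its vector of distances to the landmarks.
    These vectors serve as the vertex representations fed into the MLP."""
    tables = {l: bfs_distances(adj, l) for l in landmarks}
    return {v: [tables[l][v] for l in landmarks] for v in adj}

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # path graph 0 - 1 - 2 - 3
labels = landmark_labels(adj, landmarks=[0, 3])
```

The classic (non-learned) estimate would scan these labels and return the minimum of d(u, l) + d(l, v) over the landmarks l; the MLP instead learns a non-linear function of the labels.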
Table (3.4) Geodnn Based Distance Prediction Errors on Different Networks

        Mean Absolute Error  Mean Relative Error  Max Absolute Error  Max Relative Error
DG      1,566                0.092                41,376              262
MB      95                   0.097                3,563               92
SU      442                  0.108                14,916              162
FBPOL   N/A                  N/A                  N/A                 N/A
FBTV    N/A                  N/A                  N/A                 N/A
EPA     N/A                  N/A                  N/A                 N/A
Table 3.4 shows the performance of the geodnn approach on road networks (DG, MB, and SU).
As geodnn predicts the distance between two vertices based on their Euclidean distance, it works well on road networks with few detours in shortest-path traversal. For example, geodnn performs well on MB, which is a grid-shaped road network where vertices are located neatly on grid lines.
3.4 Summary
In this chapter, we presented a two-stage solution framework and adapted existing vertex representations for this framework, including node2vec, auto-encoder, geo-coordinates
(longitudes and latitudes), and landmark labels. We described each representation with
examples and analyzed their advantages and disadvantages. The two-stage solution
framework verifies the feasibility of a learning based solution for graph shortest-path dis-
tance predictions, while it also has limitations in that it separates representation learning
from distance predictions. This leads to sub-optimal vertex representations and distance
prediction accuracy. We propose a single-stage model to address this limitation in the
next chapter.
Chapter 4
Proposed Single-Stage Model
In Chapter 3, we described two-stage models where embedding learning and dis-
tance prediction are disconnected. To further improve the embedding quality, in this
chapter, we propose a single-stage model called vdist2vec that learns embeddings which
are guided directly by the distance prediction.
We detail the vdist2vec model structure in Section 4.1. We then discuss the choice of
loss functions for the model in Section 4.2. To further enhance the distance prediction
accuracy, we design a variant of the vdist2vec model using the ensembling technique in
Section 4.3. We scale vdist2vec to large graphs in Section 4.4. We cover update handling
in Section 4.5 and algorithm costs in Section 4.6.
4.1 Vdist2vec
Our vdist2vec model connects vertex representation learning with distance prediction to
form a single neural network. The model takes two vertices vi and vj as the input, learns
their representations, and predicts their distance as the targeted output. This structure
enables the distance signals from the output layer of the distance prediction network to be
propagated back to the representation learning network. Thus, the vertex representations
Part of the content of this chapter is published in
1. Jianzhong Qi, Wei Wang, Rui Zhang, and Zhuowei Zhao. A Learning Based Approach to Predict Shortest-Path Distances. International Conference on Extending Database Technology (EDBT), 2020. (CORE Ranking: A, accepted in December 2019.) The authors are ordered alphabetically.
can be learned to better preserve the distance information.
Our model structure is illustrated by Figure 4.1. In the model, the input vertices vi and
vj are each represented as a size-|V | one-hot vector. The one-hot vector of vi (vj), denoted
by hi, has a 1 in the i-th (j-th) dimension and 0’s in all other dimensions. The next layer
is an embedding layer, which is used for representation learning. This layer has k nodes,
and its weight matrix is a |V |×k (2|V |×k for directed graphs) matrix that will be used as
the vertex vectors for all vertices, denoted by V = [v_1^T, v_2^T, ..., v_{|V|}^T]^T. Multiplying h_i (h_j) by V yields v_i (v_j), i.e.,

v_i = h_i V    (4.1)
Vectors vi and vj are then fed into a distance prediction network to predict the distance
between vi and vj . Recall that the distance prediction network is an MLP where the
default loss function Ld is the mean square error on the actual vertex distances and the
predicted distances (cf. Equation 3.3).
Figure (4.1) Vdist2vec model structure: the 2|V|-dimensional one-hot layer (h_i, h_j) feeds a k-dimensional embedding layer that produces v_i and v_j; these form the MLP input layer, and the fully connected MLP hidden and output layers predict d_i,j
At training time, the vertex representation matrix V is randomly initialized. The
corresponding vertex pairs’ vectors will then be concatenated and fed into the network
in batches to train the MLP. The training loss Ld will be propagated back to optimize
the MLP and the vertex representations in V. The optimization goal is to minimize the
errors between the exact and the predicted distances. At query time,
the vertex vectors vi and vj of the query vertices vi and vj are fetched from V, and the
MLP trained as part of vdist2vec is used to make a distance prediction.
Our vdist2vec model can be adapted for other distance prediction problems on graphs. For example, we can replace the shortest-path distance in the vdist2vec output with the resistance distance [64] to train a model for predicting resistance distances. In addition, the predicted distance between two vertices is not limited to a scalar: a vector output can represent multiple distances, such as the top-k shortest-path distances. To adapt our vdist2vec model for top-k shortest-path distances, we can simply use a size-k vector as the model output while keeping the other parts of our model unchanged.
In real graph database systems such as Neo4j [97], the vertex embeddings learned by our vdist2vec model can be stored together with the vertices of the graph. This helps the system achieve faster shortest-path distance queries and easier vertex visualization.
4.2 Loss Function
Mean square error (MnSE) is one of the most commonly used loss functions in machine learning models, and it is also our default loss function Ld. This error is defined by Equation 4.2.
MnSE = \frac{1}{n} \sum_{i=1}^{n} (d(v_i, v_j) - d_{i,j})^2    (4.2)
Recall that d(vi, vj) and di,j are the ground truth and predicted distances between vi and vj, respectively.
Another commonly used loss function is mean absolute error (MnAE), which is defined
by Equation 4.3.
MnAE = \frac{1}{n} \sum_{i=1}^{n} |d(v_i, v_j) - d_{i,j}|    (4.3)
These two error measurements have a somewhat similar effect in model learning. However, MnSE is more sensitive to the variance of the errors than MnAE. For example, given a set of ground truth values D = {1, 1, 1, 1} and two sets of prediction values from two different models, D1 = {1, 2, 3, 4} and D2 = {1, 1, 4, 4}, the absolute prediction errors of the two models are E1 = {0, 1, 2, 3} and E2 = {0, 0, 3, 3}, respectively. Both models share the same MnAE (i.e., 1.5), while the second model has a larger MnSE (i.e., 4.5 vs. 3.5), as E2 has a larger variance.
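The worked example above can be verified directly:

```python
def mnae(truth, pred):
    """Mean absolute error (Equation 4.3)."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / float(len(truth))

def mnse(truth, pred):
    """Mean square error (Equation 4.2)."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred)) / float(len(truth))

D  = [1, 1, 1, 1]   # ground truth
D1 = [1, 2, 3, 4]   # model 1: absolute errors {0, 1, 2, 3}
D2 = [1, 1, 4, 4]   # model 2: absolute errors {0, 0, 3, 3}

assert mnae(D, D1) == mnae(D, D2) == 1.5   # same MnAE for both models
assert mnse(D, D1) == 3.5                  # but MnSE differs: the error set
assert mnse(D, D2) == 4.5                  # with larger variance scores worse
```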
Figure (4.2) Error distribution of our model (DG)
To examine the impact of loss functions, we analyse the error distribution of our model. Figure 4.2 shows the error distribution of the vdist2vec model on the DG graph dataset (detailed in Chapter 5) with MnSE as the loss function (the error distribution is similar when using MnAE). We see that a very small portion (e.g., less than 1%) of the vertex pairs have much larger prediction errors (see the spike to the right of the figure) than the other vertex pairs. Using MnSE as the loss function weights these larger errors more heavily than the smaller ones. This is good for controlling the maximum prediction errors but may hurt the mean prediction errors. Next, we optimize the loss function to reduce the mean prediction errors.
4.2.1 Reducing Mean Errors
As discussed above, applying MnSE as the loss function emphasizes the large errors, which come from only 1% of all vertex pairs. To reduce the mean errors, we should guide our model to focus more on the rest of the vertex pairs (which have smaller errors).
Our idea originates from a loss function named the Huber loss, which combines MnSE and MnAE [55, 41, 100]. Its basic idea is to use MnAE when the error is larger than a parameter δ, and to use MnSE otherwise. Equation 4.4 defines this loss function.
HL_\delta(a) = \frac{1}{2} a^2,                     for |a| \le \delta
HL_\delta(a) = \delta (|a| - \frac{1}{2} \delta),   otherwise    (4.4)
By applying the Huber loss, errors below δ are weighted as in MnSE, while errors above δ are weighted less (multiplied by δ, which is smaller than the error itself).
Inspired by the Huber loss, we propose a reverse Huber loss (REVHL) function, defined in Equation 4.5. As the equation shows, now the errors below δ are weighted more heavily, as they are multiplied by δ, which is larger than the error itself.
L_\delta(a) = \delta |a|,                        for |a| \le \delta
L_\delta(a) = \frac{1}{2} (a^2 + \delta^2),      otherwise    (4.5)
We also adapted the equation for the case where the error exceeds δ to make the
overall loss function continuously differentiable, enabling it to be used in model training.
To show that REVHL is continuously differentiable, first, both \delta|a| and \frac{1}{2}(a^2 + \delta^2) are continuous and continuously differentiable themselves. Further, when a = \delta,

L_\delta(\delta) = \delta^2,    \lim_{x \to 0^+} L_\delta(\delta + x) = \lim_{x \to 0^+} \frac{1}{2}((\delta + x)^2 + \delta^2) = \frac{1}{2}(\delta^2 + \delta^2) = \delta^2

L'_\delta(\delta) = \delta,    \lim_{x \to 0^+} L'_\delta(\delta + x) = \lim_{x \to 0^+} (\delta + x) = \delta

Thus, REVHL is also continuously differentiable at a = \delta.
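A direct implementation of Equation 4.5, with a numerical check of the continuity argument above (δ = 2 is an arbitrary illustrative choice):

```python
def revhl(a, delta):
    """Reverse Huber loss (Equation 4.5): linear with slope delta for
    errors at or below the threshold, quadratic above it."""
    a = abs(a)
    if a <= delta:
        return delta * a
    return 0.5 * (a * a + delta * delta)

delta = 2.0
# both branches meet at the threshold with value delta^2 ...
assert revhl(delta, delta) == delta ** 2
# ... and the loss is (numerically) continuous across the threshold
eps = 1e-9
assert abs(revhl(delta + eps, delta) - revhl(delta - eps, delta)) < 1e-6
```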
Our REVHL yields trained models with lower prediction errors as shown in Table 4.1
and Table 4.2, where MnRE denotes the mean relative error.
To apply REVHL in our model, we need to select δ. A suitable δ for our model should separate the few vertex pairs with much larger errors from the rest of the vertex pairs. During training, the errors become smaller, so a δ that is suitable for an earlier iteration
Table (4.1) Comparing Huber Loss with Reverse Huber Loss on Road Networks

              DG               MB               SU
         MnAE    MnRE     MnAE    MnRE     MnAE    MnRE
HL        224    0.018       9    0.014      57    0.025
REVHL      75    0.015       6    0.014      50    0.024
Table (4.2) Comparing Huber Loss with Reverse Huber Loss on Social Networks

             FBPOL             FBTV
         MnAE    MnRE     MnAE    MnRE
HL       0.179   0.047    0.160   0.029
REVHL    0.117   0.031    0.126   0.026
may be too large for the later iterations. We thus generate a dynamic δ in each iteration. We first sum up all absolute errors. We then add up the errors in ascending order until reaching 99% of the total; the error reached at this point is selected as δ. By doing so, we can find a suitable δ for REVHL in each iteration, which results in better performance. Compared with the model training process, computing δ is much faster, and hence it does not increase the training time significantly.
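The dynamic δ selection described above can be sketched as follows (here on a toy error list; in training, `abs_errors` would be the per-pair absolute errors of the current iteration):

```python
def dynamic_delta(abs_errors, fraction=0.99):
    """Select delta for the current iteration: accumulate the absolute
    errors in ascending order until `fraction` of their total is
    reached; the error at that point becomes delta."""
    errors = sorted(abs_errors)
    target = fraction * sum(errors)
    running = 0.0
    for e in errors:
        running += e
        if running >= target:
            return e
    return errors[-1]

# with fraction = 0.6 on {1, 2, 3, 4}, the cumulative sums 1, 3, 6
# first reach 60% of the total (6 out of 10) at the error value 3
assert dynamic_delta([1, 2, 3, 4], fraction=0.6) == 3
```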
By applying REVHL with dynamic δ values, the performance of our model improves by up to 48% in MnAE and 10% in MnRE, as shown in Chapter 5. We denote our model using REVHL as the loss function by vdist2vec-L.
4.2.2 Reducing Maximum Errors
In some applications, the maximum errors may be more important (e.g., to offer a per-
formance guarantee). To reduce the maximum errors, we can apply a loss function that
weights larger errors even more than MnSE. A potential solution is to use a higher-order polynomial in the loss function, such as the mean cube absolute error (MnCE) defined by Equation 4.6. In such a loss function, the error values are multiplied by themselves multiple times, such that the smaller errors shrink faster than the larger errors (assuming that the errors are within [0, 1]).
MnCE = \frac{1}{n} \sum_{i=1}^{n} |d(v_i, v_j) - d_{i,j}|^3    (4.6)
As shown in Table 4.3, compared with using MnSE, MnCE helps reduce the maximum absolute and relative errors, which are denoted by MxAE and MxRE, respectively. However, MnCE also raises the mean errors. A full comparison using more datasets will be presented in Chapter 5.
Table (4.3) Using MnSE and MnCE as the Loss Function on the MB Dataset

         MnAE   MnRE    MxAE   MxRE
MnSE       12   0.014    376     16
MnCE       12   0.017    265     15
4.3 Ensembling Model
Besides analysing the error distribution, we also analyse the shortest-path distance distribution. As shown in Figure 4.3, the slope of the curve is larger for the first 5% and the last 5% of the vertex pairs, while the rest of the curve has a smoother slope. This is observed on other graphs as well. Therefore, to handle distance predictions based on their value range, we adapt ensemble learning [69, 76, 80, 87] into our model to build a range-wise vdist2vec model. Ensemble methods [69] combine the results of multiple machine learning models to gain better performance than any single model on its own. In our case, we adapt the idea of a weighted average: we sum up the weighted result from each model as the final prediction. Figure 4.4 illustrates our ensembling model for road network graphs. We use four MLP models and multiply their outputs by different weights, namely 100, 900, 9000, and dmax − 10000, respectively, where dmax is the maximum distance between any two vertices in the graph. As the activation function of each MLP is the sigmoid function, the weighted output ranges of the MLPs become (0, 100), (0, 900), (0, 9000), and (0, dmax − 10000), respectively. During training, when predicting distances in (0, 100),
Figure (4.3) Distance value distribution (DG)
MLP1 will learn to contribute the main prediction result, while the other MLPs will learn
to predict 0. Similarly, when predicting distances in (100, 1000), MLP2 will become the
main contributor while MLP1 produces 100 and the other MLPs predict 0. This way,
each MLP focuses on a different prediction value range so that the variance is reduced.
For social network graphs, the structure of the ensembling model is the same but with different weights (e.g., 2, 4, 8, and 16, respectively) based on their distance ranges.
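The weighted combination of the four sigmoid-bounded outputs can be sketched as follows. The MLPs themselves are elided; `logits` stands in for their pre-activation outputs, and `d_max` is an illustrative value:

```python
import math

def ensemble_predict(logits, d_max=20000.0):
    """Figure 4.4: four MLP outputs pass through a sigmoid and are
    weighted by 100, 900, 9000, and d_max - 10000, so each MLP covers
    a distinct slice of the distance range; their sum is the prediction."""
    weights = [100.0, 900.0, 9000.0, d_max - 10000.0]
    sigmoids = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return sum(w * s for w, s in zip(weights, sigmoids))

# all MLPs saturated low -> prediction near 0
assert ensemble_predict([-50.0] * 4) < 1e-12
# all MLPs saturated high -> prediction near 100 + 900 + 9000 + 10000 = d_max
assert abs(ensemble_predict([50.0] * 4) - 20000.0) < 1e-6
```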
As the ensembling model has a more complicated structure than the original vdist2vec model, longer training and query times are expected. To balance time cost and accuracy, we can adjust how deeply we apply the ensembling layers. As shown in Figure 4.5, instead of applying four different MLPs, we use four different output layers of one MLP to build our ensembling model. Table 4.4 shows that the output layer based ensembling model has relatively balanced training and query times, while improving the accuracy by 58%. This structure is our default ensembling model setting in Chapter 5. We denote this model by vdist2vec-S.
Table (4.4) Performance of Ensembling Models on MB

                 MnAE   MnRE    MxAE   MxRE   PT     QT
vdist2vec         12    0.014   376    16     0.9h   0.644µs
layer-ensemble     5    0.006   317    16     1.1h   1.005µs
MLP-ensemble       4    0.005   312    16     1.3h   1.473µs
Figure (4.4) Ensembling model: vi and vj are fed into MLP1 to MLP4, whose outputs are weighted by 100, 900, 9000, and dmax − 10000 and summed to produce the output
4.4 Handling Large Graphs
Our vdist2vec model computes k-dimensional vector representations for the vertices,
which form a |V |×k matrix. This is cheaper than naively computing a |V |×|V | matrix
for all pairwise vertex distances. However, vdist2vec still needs to feed |V |2 pairs of
vertices into the neural network for training, which may take non-trivial space and time
when |V | gets large (e.g., at million scale). In this section, we discuss how to lower the
number of vertex pairs used in model training to scale vdist2vec to large graphs.
Our key idea works as follows. We sample a small subset Vc of vertices from V . We
call these vertices the center vertices and compute their embeddings using vdist2vec as
described in Section 4.1, i.e., to pretrain vdist2vec using the center vertices. We further
train vdist2vec using pairs of vertices where each pair consists of a center vertex and a
non-center vertex. This way, we obtain a vertex vector for every vertex. Intuitively, the
Figure (4.5) Layer ensembling model: vi and vj pass through shared, fully connected MLP hidden layers, followed by four output layers weighted by 100, 900, 9000, and dmax − 10000
learned vector vi of a non-center vertex vi shall resemble that of its nearest center vertex
(with some offset to reflect the distance to the nearest center vertex). Meanwhile, the MLP
is trained to predict the distance between a center vertex and any other vertex. Thus,
we can also use vi to predict the distance between vi and any vertex using the trained
MLP. The procedure above reduces the number of vertex pairs to be fed into vdist2vec
for training from |V|² to |Vc|² + |Vc||V \ Vc| = |Vc||V|, where |Vc| ≪ |V|. We will study the
impact of |Vc| on model training time and prediction accuracy in the experiments.
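The reduced training set can be generated as follows (a sketch with arbitrary toy centers; center selection itself is discussed next for road networks):

```python
def training_pairs(vertices, centers):
    """Vertex pairs used to train vdist2vec on a large graph:
    center-center pairs for pretraining, then (center, non-center)
    pairs for further training. The total is
    |Vc|^2 + |Vc|(|V| - |Vc|) = |Vc||V| instead of |V|^2."""
    center_set = set(centers)
    pretrain = [(c, d) for c in centers for d in centers]
    finetune = [(c, v) for c in centers for v in vertices
                if v not in center_set]
    return pretrain, finetune

vertices = list(range(100))          # |V| = 100
centers = [0, 10, 20, 30]            # |Vc| = 4 sampled center vertices
pretrain, finetune = training_pairs(vertices, centers)
assert len(pretrain) + len(finetune) == len(centers) * len(vertices)  # 400
```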
4.4.1 Large Road Networks
We further consider a special case, i.e., handling large road networks. In road network
graphs, vertices are associated with geo-coordinates (e.g., latitudes and longitudes), which
can help produce more accurate distance predictions. We cluster the graph vertices into
|Vc| clusters based on their geo-coordinates to obtain better center vertices. Any cluster-
ing algorithm that allows controlling the number of clusters can be used. We use k-means
in the experiments for simplicity. In each cluster, the vertex nearest to the cluster center
is chosen as a center vertex. We train vdist2vec over these center vertices first.
Then, instead of training vdist2vec with pairs of center vertices and non-center ver-
tices, we approximate the distance between non-center vertices as follows. First, for every
non-center vertex vi, we compute the distance to its cluster center (a center vertex vic), i.e.,
d(vi, vic) (also d(vic, vi) if the graph is directed). Then, given two query vertices vi and vj ,
their distance di,j is approximated by adding up their distances to their cluster center
vertices vic and vjc with the distance between vic and vjc:
di,j = λ1 · d(vi, vic) + dic,jc + λ2 · d(vjc, vj) (4.7)
Here, dic,jc represents the distance between vic and vjc predicted by the MLP that is
trained as part of vdist2vec.
In the equation above, there are two offset coefficients λ1 and λ2. Their role is to adjust
the contributions of d(vi, vic) and d(vjc, vj) to di,j . Intuitively, as shown in Figure 4.6a,
when vi, vic, vjc, and vj are at different relative positions, d(vi, vic) and d(vjc, vj) may have
different contributions to the overall distance. For example, if vi and vj are in between
vic and vjc, then d(vi, vic) and d(vjc, vj) should be deducted from (rather than added to)
dic,jc to obtain di,j .
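Equation 4.7 itself is a simple weighted combination. A sketch with illustrative λ values (in the model they are produced by the tanh-capped MLPs of Figure 4.6b, so λ1, λ2 ∈ (−1, 1)):

```python
def approx_distance(d_i_ic, d_ic_jc, d_jc_j, lam1, lam2):
    """Equation 4.7: approximate d(v_i, v_j) from the predicted
    center-to-center distance plus weighted offsets to the centers."""
    return lam1 * d_i_ic + d_ic_jc + lam2 * d_jc_j

# v_i and v_j outside the segment between their centers: offsets added
assert approx_distance(3.0, 10.0, 2.0, 1.0, 1.0) == 15.0
# v_i and v_j in between the centers: offsets deducted
assert approx_distance(3.0, 10.0, 2.0, -1.0, -1.0) == 5.0
```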
To learn λ1 and λ2, we build another neural network model with a structure shown in
Figure 4.6b, where dla(·) and dlo(·) return the difference in latitude and longitude between
two vertices, respectively. This model feeds the coordinate difference between vi and vic
and the coordinate difference between vj and vjc into two MLPs to predict λ1 and λ2,
respectively. The last layer of each MLP uses a tanh activation function, which maps
Figure (4.6) Distance prediction model for large road networks: (a) distance computation, combining d(vi, vic), dic,jc, and d(vjc, vj); (b) network structure, where the distances to the centers and the coordinate differences dla(·) and dlo(·) feed two MLPs that output λ1 and λ2
λ1 and λ2 into the range of (−1, 1). The outputs of these two MLPs are multiplied with d(vi, vic) and d(vjc, vj) (denoted by “⊙” in Figure 4.6b), and the products are added to dic,jc (denoted by “⊕” in Figure 4.6b) to produce di,j. Essentially, this model implements
Equation 4.7. To train this model, we use the same loss function as that used in vdist2vec
(i.e., Equation 3.3), except that now only a sampled subset of non-center vertices (e.g.,
|Vc||V | pairs) is needed for model training as the input space (i.e., coordinate difference)
becomes much smaller.
We note that, if more computing resources are available, we may use spectral cluster-
ing [95] instead of k-means. We may use the eigenvectors of the graph Laplacian matrix
to replace the geo-coordinates of the vertices. Then, the procedure above applies to other
graphs such as social networks.
4.5 Handling Updates
Following a majority of the existing studies (e.g., [26, 29, 77]), our vdist2vec model targets
static graphs where the vertices and edges do not change. By periodic rebuilds, vdist2vec
can also handle graphs with a low update frequency (e.g., road networks) or applications
that are less sensitive to real-time updates (e.g., mining similar users for friendship rec-
ommendation in social networks). As shown in Chapter 5, vdist2vec can be rebuilt in a
few hours for graphs with over a million vertices.
44
Proposed Single-Stage Model 4.6 Cost Analysis
In real-time applications, vdist2vec can still provide distance predictions upon vertex
or edge updates, although the prediction accuracy may drop. If a new vertex v is inserted,
distance queries on v can be approximated by replacing v with its nearest vertex vi ∈ V in the distance prediction and adding d(v, vi). If a vertex is deleted, it will not break the distance prediction process but may reduce the prediction accuracy. This also applies to edge updates, i.e., insertions, deletions, or weight changes. We study such
impacts via experiments in Chapter 5.
One potential approach to support real-time vertex and edge updates for learning
based distance prediction models is incrementally training our model for only the ver-
tices affected by an update rather than retraining the model on the whole graph. For
example, when a vertex is inserted, we may learn an embedding for the new vertex by
training our model on all the distances from the vertex to the other vertices while keep-
ing the other vertex embeddings unchanged. This way, we only need to handle O(n) dis-
tance pairs. For edge insertions, edge deletions, and vertex deletions, we can choose the
k nearest neighbours of the deleted vertices or vertices connected to the deleted/inserted
edges as the affected vertices. A hierarchical structure can be beneficial for graph up-
dating. When vertices are grouped into clusters, vertex/edge modifications within each cluster can be handled inside the cluster, which reduces the update costs. For
example, a vertex/edge deletion in a well-separated pair decomposition cluster will not
affect the other clusters. We only need to recompute the shortest-path distances between
the vertices inside the affected cluster and their corresponding representative vertices.
An in-depth study on the impact of updates is an open problem for future study.
4.6 Cost Analysis
To simplify the discussion, we consider an MLP network to have an O(1) space cost for
storing its parameters and an O(1) time cost for its inference process. Such costs depend
mainly on the network size rather than the input graph size. Also, the inference process
can take advantage of GPU parallelization and be done efficiently.
Then, our vdist2vec model can be trained in O(|V|²) time, i.e., training with every pair
of vertices. For large graphs, this time can be reduced to O(|Vc||V |) via sampling (recall
that |Vc| is the sample size). Once the model is trained, it takes O(k|V |) space to store
the vertex embeddings. Adding in the constant space cost to store the MLP parameters,
overall, our model has an O(k|V |) space cost. To make a distance prediction, our model
takes O(k) time to read and feed the embeddings of the two query vertices into the MLP.
Then, the MLP inference process is run to make a prediction in O(1) time. Thus, overall,
our model has an O(k) time (and space) cost for distance predictions.
4.7 Summary
In this chapter, we described our proposed one-stage model (vdist2vec) to overcome the
limitation of two-stage models and achieve higher accuracy in distance prediction. We
studied the choice of loss functions for our model based on error distribution analysis. We
proposed the reverse Huber loss to reduce the mean absolute error for our model. We fur-
ther adapted the ensemble learning technique to boost our model performance in terms
of prediction accuracy. To handle large graphs, we proposed a hierarchical model that
reduces the preprocessing time via sampling without sacrificing much distance
prediction accuracy. We concluded the chapter with a discussion on data update han-
dling and model costs.
Chapter 5
Experiments
We study the empirical performance of our vdist2vec model and its variants on real-
world graphs. We compare it with both landmark labeling approaches and embedding
based models.
5.1 Settings
The experiments are run on a Linux desktop computer with an Intel (R) Xeon (R) E5-2630
V3 CPU (2.40 GHz), a GeForce GTX TITAN X GPU, and 32 GB memory. All algorithms
and models are implemented with Python 2.7.12. The graphs are managed with the Net-
workX package [2]. The neural networks are implemented with Tensorflow 1.13.1.
Datasets. We use the following road networks, social networks, and a web document
network to test the model effectiveness and efficiency:
• DG, SH, and SU [4, 61]: These datasets contain the road networks of Dongguan
(China), Shanghai (China), and Surat (India), which are amongst the most popu-
lated cities in the world. The road networks are directed, but their adjacency ma-
trices are symmetric, i.e., the edges on both directions between the same pair of
vertices have the same weight. For simplicity, we represent these road networks
with undirected graphs (hence the number of edges is reduced by half from the
original road networks).
• MB: This dataset contains the road network of the CBD of Melbourne (Australia)
exported from OpenStreetMap [6]. This network is directed.
• FL and NY [1]: These datasets contain the road networks of Florida and New York
City (USA), respectively. These road networks are directed, but their adjacency
matrices are again symmetric. We also represent them with undirected graphs.
• FBTV and FBPOL [3]: These datasets contain the Facebook page networks of politi-
cians and TV shows, respectively. Here, every page is a vertex, and an edge between
two vertices represents a mutual (i.e., undirected) “like” relationship.
• POK [3]: This dataset contains the social network named Pokec, where every user is
a vertex, and an edge from a vertex to another represents a (directed) “following”
relationship.
• EPA [81]: This dataset contains the web graph of www.epa.gov, where every page is
a vertex, and an edge from a vertex to another represents a (directed) hyperlink.
The edges in the social networks FBPOL, FBTV, and POK and the web page graph
EPA are unweighted originally. Following previous studies [19, 77], we assign each edge
with a length of one distance unit (e.g., one hop). Note that edge weight modeling is
orthogonal to our study. Other edge weight models may apply straightforwardly.
These datasets and their numbers of vertices (|V |), numbers of edges (|E|), average
degree (dgr), and maximum shortest-path distances between any two vertices (i.e., di-
ameter dmax) are summarized in Table 5.1. Note that the shortest-path distances are later
normalized into the range of [0, 1] for model training and testing.
Table (5.1) Datasets

Type              Dataset   |V|      |E|      dgr     dmax
Road network      DG        8K       11K      2.76    96km
                  FL        1.07M    1.35M    2.36    12,000km
                  MB        3.6K     4.1K     1.14    6km
                  NY        264K     366K     2.80    1,600km
                  SH        74K      100K     2.70    127km
                  SU        2.5K     3.6K     2.88    50km
Social network    FBPOL     6K       41K      13.60   14
                  FBTV      4K       17K      8.80    20
                  POK       1.6M     30M      18.80   11
Web page graph    EPA       4.3K     9K       4.17    10
Approaches. We test seven approaches including our proposed model vdist2vec. We
detail these approaches below.
• vdist2vec: This is our proposed model as described in Section 4.1. By default, we
use 2%|V | as the number of nodes in the embedding layer. For the MLP distance
prediction component, we use two hidden layers with 100 and 20 nodes, respec-
tively. We use ReLU [68] as the activation function for the hidden layers and the
sigmoid function for the output layer. We set the batch size to be |V | (we find that
a large batch size helps the training efficiency while having little impact on the prediction accuracy). We initialize the MLP parameters using the truncated normal
distribution with 0 as the mean and 0.01 as the standard deviation. The training
data is randomly shuffled. We train our model in 20 epochs with early stopping
using AdamOptimizer and a learning rate of 0.01.
• vdist2vec-S: This is our proposed ensembling model as described in Section 4.3.
The ensembling structure settings are described in Section 4.3. The rest of the model shares the same settings as vdist2vec.
• landmark-bt [38, 89]: This is a landmark labeling approach based on betweenness
centrality. It selects as landmarks the top-k vertices through which the largest numbers of shortest paths pass.
• landmark-dg [89]: This is a landmark labeling approach based on degree. It selects
the top-k vertices with the largest degrees as the landmarks.
• geodnn [59]: This approach trains an MLP to predict the distance between two vertices
given their geo-coordinates. It only applies to road networks. By default, we use
the recommended settings from the model proposal [59].
• node2vec [48]: This approach uses node2vec to learn vertex embeddings and then
trains an MLP to predict vertex distances based on the learned embeddings, as
described in Section 3.3.1. By default, we use the node2vec model settings from a
previous study that uses node2vec for distance predictions in social networks [78].
The MLP structure and hyperparameters are the same as those in vdist2vec.
• auto-encoder: This approach uses an auto-encoder to learn vertex embeddings and
then trains an MLP to predict vertex distances based on the learned embeddings, as
described in Section 3.3.2. By default, we use one hidden layer in the encoder and
the decoder. Each hidden layer has 2%|V | nodes. The middle layer (i.e., the code
layer) between the encoder and the decoder has 1%|V | nodes. We train the auto-
encoder using RMSPropOptimizer in 100 epochs with early stopping and a learning
rate of 0.001. The MLP structure and its hyperparameters are the same as those in
vdist2vec.
For all approaches except node2vec, by default, we use 2% of the number of vertices as the embedding dimensionality (or number of landmarks), i.e., k = 2%|V|. For node2vec, we use k = 128, which is suggested to be optimal [78].
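As a concrete illustration of the MLP distance-prediction component shared by the learning based approaches above (two ReLU hidden layers of 100 and 20 nodes, a sigmoid output, and truncated-normal initialization), the following is a minimal numpy sketch of the forward pass. The class and helper names are illustrative, and training (Adam, 20 epochs) is omitted:

```python
import numpy as np

def truncated_normal(shape, mean=0.0, std=0.01, rng=None):
    """Sample N(mean, std) truncated to two standard deviations."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.normal(mean, std, size=shape)
    while True:
        bad = np.abs(x - mean) > 2 * std
        if not bad.any():
            return x
        x[bad] = rng.normal(mean, std, size=int(bad.sum()))

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DistanceMLP:
    """Forward pass of the distance-prediction MLP: two ReLU hidden
    layers (100 and 20 nodes by default) and a sigmoid output, so
    predictions fall in [0, 1] like the normalized distances."""
    def __init__(self, input_dim, hidden=(100, 20), rng=None):
        sizes = [input_dim, *hidden, 1]
        self.weights = [truncated_normal((m, n), rng=rng)
                        for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def predict(self, x):
        h = np.asarray(x, dtype=float)
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            h = relu(h @ w + b)
        return sigmoid(h @ self.weights[-1] + self.biases[-1])
```

The sigmoid output pairs with the [0, 1] distance normalization noted in Section 5.1.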
Evaluation metrics. We compute the embeddings (distance labels) and predict the
distance between every pair of vertices. We measure the mean absolute error (MnAE, in
meters), maximum absolute error (MxAE, in meters), mean relative error (MnRE), and maximum relative error (MxRE) for the predictions. We also measure the precomputation/training
time (PT) and average distance prediction (query) time (QT). Here,
relative error = |predicted distance − actual distance| / actual distance
The ground truth distances are precomputed using the contraction hierarchy algorithm
for road networks and Dijkstra’s algorithm for social networks1 and the web page graph.
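The four error metrics can be computed as in the following self-contained sketch (the function name is illustrative):

```python
import numpy as np

def error_metrics(pred, actual):
    """Mean/max absolute and relative errors, as used in the
    evaluation. Relative error = |pred - actual| / actual,
    assuming actual distances are positive."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    abs_err = np.abs(pred - actual)
    rel_err = abs_err / actual
    return {'MnAE': abs_err.mean(), 'MxAE': abs_err.max(),
            'MnRE': rel_err.mean(), 'MxRE': rel_err.max()}
```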
5.2 Results
We first report results on an overall performance comparison among the different ap-
proaches and then report results on the impact of parameters.
5.2.1 Overall Comparison
We summarize the model performance in Tables 5.2 to 5.7 on smaller graphs (DG, MB, SU, FBPOL, FBTV, and EPA) that can be processed by our vdist2vec model directly, and in Tables 5.8 to 5.13 on larger graphs (FL, NY, SH, and POK) that require sampling so as to be processed by vdist2vec.
5.2.2 Performance on Smaller Graphs
Tables 5.2 and 5.3 show mean prediction errors on the smaller graphs. Our model vdist2vec
outperforms the baseline models across all six datasets except for landmark-bt on FBTV.
On road network graphs (DG, MB, and SU), vdist2vec reduces the MnAE by at least
0.74% (135 vs. 136 for vdist2vec and landmark-dg on DG) and up to 96% (12 vs. 300 for
vdist2vec and auto-encoder on MB). The advantage in terms of MnRE is larger, i.e., at
least 72% (0.015 vs. 0.054 for vdist2vec and landmark-dg on DG) and up to 97% (0.014
vs. 0.488 for vdist2vec and landmark-bt on MB). On social networks and the web page
graph, the errors of all the approaches are smaller, because the edge weights and graph
diameters are smaller. Our model vdist2vec still achieves substantial reductions in both
MnAE and MnRE, except on FBTV (MnAE 0.137 vs. 0.103 for vdist2vec and landmark-bt).
Footnote 1: CH is observed to be slower on social networks [11].
In terms of MnRE, our model's errors are at least 7% (0.026 vs. 0.028 for vdist2vec and landmark-bt on FBTV) and up to 91% (0.026 vs. 0.278 for vdist2vec and auto-encoder on FBTV) smaller than those of the baselines. Our vdist2vec-S model further improves the performance of
vdist2vec by at least 18% (0.108 vs. 0.133 on FBPOL) and up to 58% (5 vs. 12 on MB) for
MnAE, and at least 19% (0.021 vs. 0.026 on FBTV) and up to 57% (0.006 vs. 0.014 on MB)
for MnRE.
Table (5.2) Mean Absolute and Mean Relative Errors on Road Networks (Smaller Graphs)

                         DG              MB              SU
                         MnAE   MnRE    MnAE   MnRE    MnAE   MnRE
baseline  landmark-bt    2,234  0.442   192    0.488   468    0.281
          landmark-dg    136    0.060   22     0.083   180    0.127
          geodnn         1,566  0.092   95     0.097   442    0.108
          node2vec       2,329  0.199   118    0.161   658    0.175
          auto-encoder   1,821  0.178   300    0.308   213    0.070
proposed  vdist2vec      135    0.015   12     0.014   83     0.027
          vdist2vec-S    71     0.011   5      0.006   49     0.014

Table (5.3) Mean Absolute and Mean Relative Errors on Social Networks and Web Page Graph (Smaller Graphs)

                         FBPOL           FBTV            EPA
                         MnAE   MnRE    MnAE   MnRE    MnAE   MnRE
baseline  landmark-bt    1.017  0.254   0.103  0.028   0.024  0.008
          landmark-dg    1.115  0.277   0.560  0.114   0.021  0.007
          geodnn         N/A    N/A     N/A    N/A     N/A    N/A
          node2vec       0.411  0.094   0.697  0.143   0.603  0.150
          auto-encoder   0.964  0.239   1.618  0.278   0.597  0.148
proposed  vdist2vec      0.133  0.034   0.137  0.026   0.023  0.006
          vdist2vec-S    0.108  0.027   0.101  0.025   0.020  0.005
The advantage of vdist2vec and vdist2vec-S comes from their capability to learn the
pairwise vertex distances and preserve such information in the learned embeddings.
In comparison, the landmark approaches (landmark-bt and landmark-dg) are impacted
heavily by the choice of the landmarks. They may not preserve the distance information
for all vertex pairs, as discussed in Section 2.2. Thus, they have larger distance prediction errors than vdist2vec on most datasets in Tables 5.2 and 5.3.

Table (5.4) Max Absolute and Max Relative Errors on Road Networks (Smaller Graphs)

                         DG               MB            SU
                         MxAE    MxRE    MxAE   MxRE   MxAE    MxRE
baseline  landmark-bt    56,058  4,713   4,724  375    54,689  1,549
          landmark-dg    34,053  1,425   3,103  216    8,409   1,242
          geodnn         41,376  262     3,563  92     14,916  162
          node2vec       45,564  1,806   2,953  187    18,917  500
          auto-encoder   45,645  1,505   2,920  192    9,966   244
proposed  vdist2vec      9,050   193     376    16     4,789   63
          vdist2vec-S    8,154   193     317    16     3,188   40

Table (5.5) Max Absolute and Max Relative Errors on Social Networks and Web Page Graph (Smaller Graphs)

                         FBPOL         FBTV          EPA
                         MxAE   MxRE   MxAE   MxRE   MxAE  MxRE
baseline  landmark-bt    10     10     14     14     7     7
          landmark-dg    10     10     18     18     7     7
          geodnn         N/A    N/A    N/A    N/A    N/A   N/A
          node2vec       9      9      16     16     7     7
          auto-encoder   9      4      14     5      5     5
proposed  vdist2vec      4      4      4      4      4     4
          vdist2vec-S    4      4      4      4      4     4

An exception is on
the FBTV dataset. This dataset has many shortest paths that pass through the graph
center, which suits the betweenness centrality strategy of landmark-bt. Thus, landmark-
bt obtains a slightly lower MnAE than that of vdist2vec on this dataset. However, our
vdist2vec-S still outperforms landmark-bt, with a slightly more complex structure. Note
that a smaller MnAE does not guarantee a smaller MnRE, e.g., landmark-bt has a smaller
MnAE but a larger MnRE than vdist2vec. This is because MnAE only reflects the average of the prediction errors but not the error distribution. Given the same MnAE, an
algorithm that generates the errors mostly on vertex pairs with small distances may have
larger relative errors (and hence larger MnRE) than an algorithm that generates the er-
rors mostly on vertex pairs with large distances. The landmark approaches are known to
suffer on vertices with small distances because the shortest paths between such vertices
may not go through any landmarks.
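A minimal sketch of landmark-based estimation makes this weakness concrete: the triangle-inequality estimate d(u, l) + d(l, v) is exact only when some landmark l lies on a shortest path between u and v, so nearby pairs whose paths avoid all landmarks are overestimated. The graph representation and function names below are illustrative:

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest-path distances; graph: {u: {v: w, ...}}."""
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def landmark_dg(graph, k):
    """Select the k highest-degree vertices as landmarks and
    precompute distances from each landmark to all vertices."""
    landmarks = sorted(graph, key=lambda u: len(graph[u]), reverse=True)[:k]
    return {l: dijkstra(graph, l) for l in landmarks}

def estimate(labels, u, v):
    """Upper-bound estimate via the triangle inequality over landmarks."""
    return min(labels[l].get(u, float('inf')) + labels[l].get(v, float('inf'))
               for l in labels)
```

On a path graph 0-1-2-3, a single degree-based landmark at vertex 1 sits on the 0-3 shortest path, so that pair is estimated exactly; pairs bypassing all landmarks would not be.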
Table (5.6) Preprocessing and Query Times on Road Networks (Smaller Graphs)

                         DG               MB               SU
                         PT     QT        PT     QT        PT     QT
baseline  landmark-bt    0.1h   5.832µs   62.7s  4.579µs   32.6s  4.463µs
          landmark-dg    0.1s   6.044µs   0.1s   4.392µs   0.1s   4.288µs
          geodnn         0.9h   0.366µs   0.2h   0.396µs   0.2h   0.375µs
          node2vec       2.2h   0.829µs   0.9h   0.820µs   0.5h   0.809µs
          auto-encoder   2.3h   1.032µs   1.0h   0.638µs   0.5h   0.589µs
proposed  vdist2vec      2.3h   1.039µs   0.9h   0.644µs   0.4h   0.589µs
          vdist2vec-S    2.8h   1.366µs   1.1h   1.005µs   0.6h   0.927µs

Table (5.7) Preprocessing and Query Times on Social Networks and Web Page Graph (Smaller Graphs)

                         FBPOL             FBTV              EPA
                         PT      QT        PT     QT        PT     QT
baseline  landmark-bt    207.5s  5.014µs   72.4s  4.501µs   75.2s  4.621µs
          landmark-dg    7.2s    5.031µs   2.4s   4.554µs   3.2s   4.650µs
          geodnn         N/A     N/A       N/A    N/A       N/A    N/A
          node2vec       1.4h    0.801µs   1.1h   0.805µs   1.1h   0.781µs
          auto-encoder   1.4h    0.877µs   1.1h   0.723µs   1.1h   0.803µs
proposed  vdist2vec      1.5h    0.877µs   1.1h   0.723µs   1.1h   0.772µs
          vdist2vec-S    1.6h    1.263µs   1.2h   1.137µs   1.2h   1.196µs
The geodnn approach only works on road networks as it makes predictions based on
the geo-coordinates of the vertices. Its performance relies on how far the shortest paths deviate from the straight lines between the vertices. It is the second-best baseline approach on MB, which is a small grid-shaped road network with few detours. It drops to third on DG and SU, which are larger road networks that cover rivers and have longer detours.
The node2vec approach focuses on embedding the neighborhood of the vertices. It
works better on graphs with small diameters where the vertices are all near each other.
For example, FBPOL has a small diameter of 14, for which node2vec is the second best
approach. When the graph diameter becomes larger (e.g., 96km for DG), node2vec incurs larger errors since the neighborhood becomes less relevant to the distance between
vertices far away.
The auto-encoder tends to generate embeddings that preserve the average distances
between the vertices. This leads to an unsatisfactory prediction performance in general,
as evidenced by the large mean errors reported. On the other hand, this property may help avoid large maximum errors, e.g., the auto-encoder is the best baseline on FBPOL in terms of the maximum errors (cf. Table 5.5). The auto-encoder, in contrast, performs better on graphs with larger diameters where the vertex distances may have a larger variance, e.g., DG and
SU. This is because a larger variance on the distances may offer a stronger signal for the
auto-encoder to learn different embeddings for different vertices, rather than the same
average distance.
Tables 5.4 and 5.5 show the MxAE and the MxRE of the models. Our vdist2vec and
vdist2vec-S models also outperform the baselines on these two measures (except on
FBPOL where auto-encoder is equally good in MxRE). The vdist2vec model reduces the
MxAE by up to 92% (376 vs. 4,724 for vdist2vec and landmark-bt on MB) and the MxRE
by up to 96% (63 vs. 1,549 for vdist2vec and landmark-bt on SU), while the performance
of vdist2vec-S is even stronger. This again verifies the capability of our models to learn
and preserve the vertex distance information. Note that, similar to what has been observed over MnAE and MnRE, a larger MxAE does not mean a larger MxRE either, e.g.,
on DG, geodnn and vdist2vec have similar MxRE but geodnn has a much larger MxAE.
This is because MxAE and MxRE are usually observed from different pairs of vertices –
MxAE tends to come from vertices far away, while MxRE tends to come from vertices
with a very small distance (e.g., 1). Comparing Tables 5.2 and 5.3 with Tables 5.4 and 5.5,
we find that, in general, the baseline methods do not yield low mean and maximum er-
rors at the same time. For example, landmark-bt is close to vdist2vec on FBTV in terms
of the mean errors, while its maximum errors are much larger than those of vdist2vec on
the same dataset. Similarly, auto-encoder is close to vdist2vec on FBPOL in terms of the
maximum errors, but its mean errors are more than 7 times larger than those of vdist2vec
on the same dataset. This further highlights the advantage of vdist2vec and vdist2vec-S,
which can achieve low mean and maximum errors at the same time.
Tables 5.6 and 5.7 show the preprocessing (model training) time PT and distance prediction (query) time QT. In terms of PT, the landmark approaches are much faster. Their
precomputation procedures are deterministic and much simpler than the training procedures of the learning based models, which involve multiple iterations of numeric optimization on the neural networks. For learning based models, the main parameter that affects the preprocessing time is the number of embedding dimensions. As geodnn has the lowest number of embedding dimensions (2), it has the smallest preprocessing cost. The
node2vec model has a fixed number of dimensions (128, as suggested to be the best for
the shortest-path distance problem [78]). Our models vdist2vec and vdist2vec-S have
more complex distance prediction networks and hence longer training times.
In terms of QT, the learning based approaches are of the same (or a smaller) order of magnitude as the landmark approaches. This is because distance prediction in the learning based approaches is a simple forward propagation procedure, which can be easily parallelized to take full advantage of the computation power of the GPU. The geodnn model is the fastest, because its input layer only has four dimensions (i.e., two two-dimensional geo-coordinates).
The other three learning based approaches node2vec, auto-encoder, and vdist2vec have
very similar MLP structures and input sizes which are larger than that of geodnn. Thus,
their QTs are similar and are larger than that of geodnn. Note that the QT of node2vec differs slightly from those of auto-encoder and vdist2vec. This is because node2vec has a constant embedding dimensionality k = 128, which is suggested to be optimal [78], while the embedding dimensionality of auto-encoder and vdist2vec varies with the number of vertices (i.e., k = 2%|V|). The QT of vdist2vec-S is longer than that of vdist2vec, again due to its slightly more complex structure.
5.2.3 Performance on Larger Graphs
Tables 5.8 to 5.13 show the model performance on the larger graphs, i.e., FL, NY, SH, and POK. These graphs cannot be processed in full by vdist2vec under our hardware constraints. Following the procedure described in Section 4.4, for each of the road networks FL, NY, and SH, vdist2vec clusters the vertices into 0.1%|V| clusters using the k-means algorithm and learns embeddings (and the MLP) for the cluster center vertices. The model then randomly samples 100,000 pairs of vertices and uses their geo-coordinates and distances to learn the offset coefficients λ1 and λ2. The learned embeddings of the center
vertices and the coefficients enable vdist2vec to predict the distance between any pair of vertices in the graph. For the social network POK, vdist2vec uses the 0.1%|V| vertices with the largest degrees as the center vertices. It learns vertex embeddings to predict distances between these center vertices and the rest of the vertices. These embeddings are then used to predict the distance between any two vertices.
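The clustering step that selects center vertices can be sketched with plain Lloyd's k-means over the vertex geo-coordinates, snapping each centroid to its nearest actual vertex. This is an illustrative sketch only; the embedding and offset learning of Section 4.4 are omitted, and the function name is hypothetical:

```python
import numpy as np

def choose_center_vertices(coords, n_clusters, n_iter=50, seed=0):
    """Cluster vertex coordinates with Lloyd's k-means and return, for
    each cluster, the index of the vertex nearest to its centroid.
    coords: (|V|, 2) array of vertex geo-coordinates."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    centroids = coords[rng.choice(len(coords), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # assign each vertex to its nearest centroid
        dists = np.linalg.norm(coords[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its cluster
        for c in range(n_clusters):
            members = coords[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # snap each centroid to the closest actual vertex (the center vertex)
    dists = np.linalg.norm(coords[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=0)
```

In the thesis setting, n_clusters = 0.1%|V|; the tiny example in the test below only shows the mechanics.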
Table (5.8) Mean Absolute and Mean Relative Errors on Road Networks (Larger Graphs)

                         FL                NY                SH
                         MnAE     MnRE    MnAE     MnRE    MnAE    MnRE
baseline  landmark-bt    OT       OT      24,851   0.167   6,144   0.554
          landmark-dg    67,104   0.134   20,483   0.164   4,407   0.403
          geodnn         363,661  0.317   207,694  0.862   14,842  0.990
          node2vec       OT       OT      217,400  0.703   19,465  1.276
          auto-encoder   OT       OT      OT       OT      OT      OT
proposed  vdist2vec      66,542   0.042   17,278   0.056   2,549   0.126
          vdist2vec-S    66,316   0.041   17,140   0.053   2,546   0.126

Table (5.9) Mean Absolute and Mean Relative Errors on Social Networks (Larger Graphs)

                         POK
                         MnAE   MnRE
baseline  landmark-bt    OT     OT
          landmark-dg    3.070  0.665
          geodnn         N/A    N/A
          node2vec       OT     OT
          auto-encoder   OT     OT
proposed  vdist2vec      0.940  0.203
          vdist2vec-S    0.895  0.198
Among the baseline models, the landmark approaches are run on the full graphs and may not complete in time. We terminate the algorithms after 48 hours and denote this as "OT" in the tables. For geodnn, we randomly sample 100,000 pairs of vertices and
use their geo-coordinates and distances to train the MLP for distance prediction. For
node2vec and auto-encoder, their vertex embeddings need to be learned for all vertices,
which may also run overtime. For the datasets where they can learn the embeddings in
time, we also randomly sample 100,000 pairs of vertices to train the MLP.
Table (5.10) Max Absolute and Max Relative Errors on Road Networks (Larger Graphs)

                         FL                  NY                 SH
                         MxAE       MxRE    MxAE       MxRE    MxAE     MxRE
baseline  landmark-bt    OT         OT      764,169    36      127,433  1,787
          landmark-dg    2,380,151  84      571,447    67      88,038   595
          geodnn         3,085,059  82      982,265    84      69,382   77
          node2vec       OT         OT      1,087,000  65      93,433   688
          auto-encoder   OT         OT      OT         OT      OT       OT
proposed  vdist2vec      604,489    11      460,065    7       27,753   18
          vdist2vec-S    605,872    11      458,052    7       26,626   17

Table (5.11) Max Absolute and Max Relative Errors on Social Networks (Larger Graphs)

                         POK
                         MxAE  MxRE
baseline  landmark-bt    OT    OT
          landmark-dg    6     6
          geodnn         N/A   N/A
          node2vec       OT    OT
          auto-encoder   OT    OT
proposed  vdist2vec      5     5
          vdist2vec-S    5     5
For testing, since it would take too long to test all pairs of vertices, following the strategy of previous studies [49, 77], we randomly sample 100,000 pairs of vertices (different from those sampled for training) and test all the approaches on them. We use k = 50% × 0.1%|V| for the SH dataset and k = 5% × 0.1%|V| for the other three datasets, as SH is considerably smaller than the other three datasets.
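The test-pair sampling described above can be sketched as follows (the function name and the directed-pair convention are illustrative):

```python
import random

def sample_vertex_pairs(n_vertices, n_pairs, exclude=frozenset(), seed=0):
    """Sample distinct vertex pairs uniformly at random, skipping
    self-pairs and any pairs already used for training."""
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        u, v = rng.randrange(n_vertices), rng.randrange(n_vertices)
        if u != v and (u, v) not in exclude:
            pairs.add((u, v))
    return list(pairs)
```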
As Tables 5.8 to 5.11 show, our vdist2vec and vdist2vec-S models also produce smaller
distance prediction errors than the baselines on the larger graphs, even when our embeddings and MLP are not trained on every pair of vertices. For vdist2vec, the reductions achieved in MnAE, MnRE, MxAE, MxRE are up to 92% (17,278 vs. 217,400 for
vdist2vec and node2vec on NY), 94% (0.056 vs. 0.862 for vdist2vec and geodnn on NY),
80% (604,489 vs. 3,085,059 for vdist2vec and geodnn on FL), and 99% (18 vs. 1,787 for
vdist2vec and landmark-bt on SH), respectively.

Table (5.12) Preprocessing and Query Times on Road Networks (Larger Graphs)

                         FL                NY                SH
                         PT     QT         PT     QT         PT     QT
baseline  landmark-bt    OT     OT         39.6h  11.712µs   14.7h  8.423µs
          landmark-dg    29.3s  54.492µs   9.4s   16.325µs   1.5s   13.584µs
          geodnn         0.1h   0.444µs    0.1h   0.458µs    0.1h   0.432µs
          node2vec       OT     OT         26.3h  0.751µs    2.8h   0.781µs
          auto-encoder   OT     OT         OT     OT         OT     OT
proposed  vdist2vec      3.0h   3.981µs    2.1h   1.215µs    0.1h   0.797µs
          vdist2vec-S    3.1h   3.981µs    2.1h   1.215µs    0.1h   0.797µs

Table (5.13) Preprocessing and Query Times on Social Networks (Larger Graphs)

                         POK
                         PT    QT
baseline  landmark-bt    OT    OT
          landmark-dg    0.4h  23.522µs
          geodnn         N/A   N/A
          node2vec       OT    OT
          auto-encoder   OT    OT
proposed  vdist2vec      3.1h  0.573µs
          vdist2vec-S    3.2h  0.759µs

The improvement of vdist2vec-S over
vdist2vec on the larger graphs is less significant, because the distance distribution on
the hierarchical model is less skewed. Further, the baseline approaches landmark-bt,
node2vec, and auto-encoder cannot handle all four datasets. They run overtime when
|V | gets too large (e.g., over a million). The geodnn model can process the larger road
networks, but it cannot handle social networks as pointed out earlier. The landmark-dg
approach is the only baseline that can handle all four large datasets, which is due to its simple procedure (i.e., simply computing high-degree vertices). It is also the most competitive baseline on most of these datasets, especially POK. However, it still suffers on the
road networks in terms of maximum errors as shown in Table 5.10, because there may be
nearby vertices whose shortest paths do not pass through any high-degree vertices, while they are far away from all the high-degree vertices.
Tables 5.12 and 5.13 show the time costs on the larger graphs. For preprocessing
(training), landmark-dg and geodnn are still faster than vdist2vec and vdist2vec-S as
their preprocessing (training) procedures are simpler. Meanwhile, landmark-bt, node2vec,
and auto-encoder now become much slower and are outperformed by vdist2vec and
vdist2vec-S, because they need to process all vertices, while vdist2vec and vdist2vec-S
only need to process the center vertices and sampled vertex pairs. For distance prediction, geodnn is again the fastest, since it has a small input size in its MLP. Our vdist2vec
and vdist2vec-S models have a larger input size (and an extra offset computation step on
road networks). Thus, they have slightly larger distance prediction times than those of
geodnn and node2vec.
5.2.4 Applicability Test
Figure (5.1) Recall and nDCG in finding nearest neighbor: (a) recall@t and (b) nDCG@t, for t = 5 to 25, comparing landmark-bt, landmark-dg, geodnn, node2vec, auto-encoder, and vdist2vec; charts omitted from this transcript.
To further verify the effectiveness of our proposed model and to test its applicability
in real applications (e.g., to find the nearest POI), we compute the top-t nearest neighbors (NNs) for every vertex using the distances estimated by every model. We report recall@t and nDCG@t [56]. Here, recall@t measures the probability of the actual NN of a
vertex to be found in the top-t NNs returned by a model; nDCG@t further measures the
actual ranks of the top-t NNs returned.
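The two measures can be sketched as follows. The graded-relevance convention in the nDCG sketch (relevance t − rank for items in the true top-t list) is one common choice and is used here for illustration only; the thesis follows [56]:

```python
import numpy as np

def recall_at_t(true_nn, predicted_lists, t):
    """Fraction of query vertices whose actual nearest neighbor
    appears among the model's top-t predicted neighbors."""
    hits = sum(true_nn[q] in preds[:t]
               for q, preds in enumerate(predicted_lists))
    return hits / len(predicted_lists)

def ndcg_at_t(true_rankings, predicted_lists, t):
    """nDCG@t with graded relevance t - rank for items in the true
    top-t list (0 otherwise); one common convention."""
    def dcg(items, rel):
        return sum(rel.get(v, 0) / np.log2(i + 2)
                   for i, v in enumerate(items[:t]))
    scores = []
    for q, preds in enumerate(predicted_lists):
        rel = {v: t - r for r, v in enumerate(true_rankings[q][:t])}
        ideal = dcg(true_rankings[q], rel)
        scores.append(dcg(preds, rel) / ideal if ideal > 0 else 0.0)
    return float(np.mean(scores))
```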
For the rest of the experiments, we only report the results of our vdist2vec model
but not vdist2vec-S to keep the figures concise, since vdist2vec-S and vdist2vec yield
similar results. As Figure 5.1 shows, vdist2vec outperforms all baseline models in both
measures. It has an over 50% probability of returning the actual NN in its top-5 NN
predictions, and this probability increases to 94% in the top-25 list. The best baseline in
this case, geodnn, is over 5% less accurate than vdist2vec, while all other models cannot
reach a 30% probability of returning the actual NN. The gap in nDCG is even larger,
which confirms the quality of the NNs returned by vdist2vec, i.e., they are the closest to
the actual NNs.
Experimental results on the other datasets show similar patterns. They are omitted
for succinctness.
5.2.5 Impact of Updates
Figure (5.2) Impact of updates (DG): MnAE under (a) vertex insertions (2%|V| to 10%|V|), (b) edge insertions, and (c) edge deletions (2%|E| to 10%|E|); charts omitted from this transcript.
Figure (5.3) Impact of updates (FBPOL): MnAE under (a) vertex insertions (2%|V| to 10%|V|), (b) edge insertions, and (c) edge deletions (2%|E| to 10%|E|); charts omitted from this transcript.

Next, we keep the models trained on G unchanged and study the impact of graph
updates, including vertex and edge insertions and edge deletions. Vertex deletions (e.g.,
POI closing) do not impact distance predictions as long as the relevant edges are kept,
and hence are omitted. For vertex insertions, we generate 2%|V | to 10%|V | new vertices
and place each on a randomly chosen edge ei,j (e.g., POI opening on some road). The
new vertex is connected with the two vertices of ei,j , with edge weights of γ1 · ei,j .w and
(1−γ1)ei,j .w, respectively (γ1 ∼ U(0, 1)). For edge insertions, we randomly choose 2%|E|
to 10%|E| pairs of vertices. For each chosen pair (vi, vj), we add an edge to connect them
with a weight of γ2 · d(vi, vj) (γ2 ∼ N (1, 1)). For edge deletions, we randomly delete
2%|E| to 10%|E| edges that do not create a disconnected graph.
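The vertex-insertion update described above can be sketched as follows for an undirected adjacency-dict graph (the representation and function name are illustrative):

```python
import random

def insert_vertex_on_edge(graph, next_id, rng=None):
    """Place a new vertex on a random edge (i, j) of weight w, splitting
    it into weights gamma*w and (1-gamma)*w with gamma ~ U(0, 1).
    graph: undirected adjacency dict {u: {v: w, ...}}."""
    if rng is None:
        rng = random.Random(0)
    edges = [(u, v) for u in graph for v in graph[u] if u < v]
    i, j = rng.choice(edges)
    w = graph[i].pop(j)       # remove the original edge
    graph[j].pop(i)
    gamma = rng.random()      # gamma_1 ~ U(0, 1)
    graph[next_id] = {i: gamma * w, j: (1.0 - gamma) * w}
    graph[i][next_id] = gamma * w
    graph[j][next_id] = (1.0 - gamma) * w
    return next_id
```

The split preserves the total length along i-new-j, so existing shortest-path distances through the edge are unchanged.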
The MnAE results are summarized in Figs. 5.2 and 5.3. As the figures show, our vdist2vec model is robust against updates. For road networks, the performance gaps among the different models are maintained as new vertices or edges are inserted (Figs. 5.2a and 5.2b). Edge deletions have a stronger impact, since they may introduce long detours
in the shortest paths. Even so, only geodnn produces a slightly smaller MnAE than ours
when more than 6% of the edges are deleted, while our vdist2vec model is much better when there are fewer than 6%|E| deletions (Figure 5.2c). For social networks, our
model outperforms other methods in vertex insertions as shown in Figure 5.3a. For edge
insertions and edge deletions, our model and node2vec have similar results that outperform the other baselines except for auto-encoder. As we described in Section 5.2.1, auto-encoder tends to generate embeddings that preserve the average distances between the vertices for social networks. This property helps it resist social network updates, as the average distances over all vertex pairs do not change much after updates.
Similar patterns are observed in the other measures. We omit them for succinctness.
5.2.6 Impact of Embedding Dimensionality
Now we study how the dimensionality of the vertex embedding (which is also the size
of the embedding layer of vdist2vec and the number of landmarks of the landmark approaches), i.e., k, impacts distance prediction time and errors. We vary k from 0.125%|V|
to 2%|V |, and Figs. 5.4 and 5.5 show the algorithm performance on the DG dataset and
FBPOL dataset respectively. In general, as k increases, the prediction errors drop, while
the preprocessing and prediction times increase. This is expected since a larger k means
more distance information may be preserved and more computation needed to encode
such information. Geodnn is not impacted by k, as its predictions are based on geo-
coordinates only.
Similar to previous experiments, vdist2vec yields the lowest prediction errors for almost all cases tested. There are three exceptions: (i) geodnn has the smallest MxRE for k ≤ 20 on DG (Figure 5.4d), since it is not impacted by k, while the other approaches suffer from a small number of landmarks or a low embedding dimensionality; (ii) auto-encoder has the smallest MxAE on FBPOL, since it fails to learn useful information for this social network and predicts the average distance for all pairs, which avoids large errors; (iii) node2vec has smaller MxAE and MxRE for k = 8 on FBPOL (Figs. 5.5c and 5.5d), as embeddings learned by node2vec with a low dimensionality are highly similar, falling into the same situation as auto-encoder. Learning based models take more time to preprocess (Figs. 5.4e and 5.5e), but the training time does not increase much as k does. Also, vdist2vec has a
very small prediction time (e.g., 1µs for k = 160, Figs. 5.4f and 5.5f). This enables using
large k values to produce more accurate distance predictions.
Figure (5.4) Impact of k (DG): (a) MnAE, (b) MnRE, (c) MxAE, (d) MxRE, (e) PT (h), and (f) QT (µs), for k = 10 to 160; charts omitted from this transcript.
In terms of precomputation (training) time, the impact of k is smaller, and the relative performance between the algorithms follows that shown in Tables 5.12 and 5.13. We omit the detailed figures for succinctness.
Figure (5.5) Impact of k (FBPOL): (a) MnAE, (b) MnRE, (c) MxAE, (d) MxRE, (e) PT (h), and (f) QT (µs), for k = 8 to 120; charts omitted from this transcript.
5.2.7 Impact of MLP
In this subsection, we study the impact of the MLP on distance prediction performance with two sets of experiments. Our first set of experiments combines the landmark approaches with the MLP, i.e., training an MLP to predict vertex distances based on distance vectors computed by the landmark approaches. We show that simply using an MLP to learn a distance prediction function for the landmark approaches does not obtain prediction errors as low as those achieved by our vdist2vec model. This further confirms the importance of our vertex embedding learning model. Table 5.14 summarizes
the experimental results on the DG dataset, where “-mlp” denotes a landmark approach
with an MLP for distance prediction. We see that, while landmark-bt-mlp is better than
landmark-bt in both the mean and maximum errors, landmark-dg-mlp is worse than
landmark-dg in the mean errors. Meanwhile, both landmark-bt-mlp and landmark-dg-
mlp have larger errors than our model vdist2vec, which highlights the advantages of our
vertex embeddings. In terms of the preprocessing time, adding an MLP to the landmark
approaches leads to a higher preprocessing time as expected. The distance prediction
times are similar to those listed in Table 5.6, and hence they are omitted for succinctness.
Table (5.14) Effectiveness of Embedding Learning (DG)

Method           MnAE   MnRE   MxAE    MxRE   PT
landmark-bt      2,234  0.442  56,058  4,713  0.1h
landmark-bt-mlp  682    0.085  36,239  658    2.1h
landmark-dg      136    0.060  34,053  1,425  0.1s
landmark-dg-mlp  632    0.083  28,003  1,421  2.1h
vdist2vec        135    0.015  9,050   193    2.1h
Our second set of experiments studies the impact of the MLP structure. We vary the
number of nodes in the two hidden layers (denoted by L1 and L2) of the MLP and summarize the model performance in Tables 5.15 and 5.16. For benchmarking purposes, we also show the performance of landmark-dg-mlp, as it has lower prediction errors than landmark-bt-mlp. We see that, as more nodes are used in the MLP, the prediction errors
rors drop, while the training and prediction times increase. These are natural, because a
larger MLP can better approximate the complex relationship between the vertex embed-
66
Experiments 5.2 Results
dings (distance vectors) and the vertex distances, which also takes more time to run. The
distance prediction errors do not reduce linearly with the number of nodes in the MLP.
For example, when the MLP grows from 100× 20 to 200× 40, the mean errors only drop
slightly. Thus, we have used a 100× 20 MLP network by default for efficiency.
Table (5.15) Impact of MLP Structure: landmark-dg-mlp (DG)

L1 nodes  L2 nodes  MnAE   MnRE   MxAE    MxRE   PT    QT
25        5         2,318  0.328  49,951  2,393  1.8h  0.852µs
50        10        977    0.150  29,950  1,267  1.9h  0.922µs
100       20        632    0.083  28,003  1,421  2.1h  1.122µs
200       40        530    0.049  24,006  851    2.2h  1.237µs

Table (5.16) Impact of MLP Structure: vdist2vec (DG)

L1 nodes  L2 nodes  MnAE   MnRE   MxAE    MxRE   PT    QT
25        5         375    0.048  21,120  541    1.9h  0.836µs
50        10        164    0.020  15,604  387    2.1h  0.909µs
100       20        135    0.015  9,050   193    2.3h  1.039µs
200       40        132    0.014  8,560   137    2.4h  1.221µs
Experimental results on the other datasets show similar patterns. They are omitted
for conciseness.
5.2.8 Impact of Number of Center Vertices
In this section, we study the impact of the number of center vertices used for handling
large graphs, i.e., |Vc|, which is varied from 18 to 600 as shown in Table 5.17 (for the
SH dataset, where 0.1%|V |= 75). As expected, more center vertices yield lower distance
prediction errors but also take more time to process. This can be thought of as having
more landmarks to preserve more distance information. We omit the prediction time as it
is not impacted by |Vc|. Experimental results on the other datasets show similar patterns.
They are omitted for succinctness.
Table 5.17: Impact of Number of Center Vertices (SH)

|Vc|    MnAE   MnRE    MxAE  MxRE     PT
18     8,102  0.401  52,893    27   5.7m
37     3,831  0.203  35,683    20   5.9m
75     2,549  0.126  27,753    18   6.2m
150    1,795  0.091  26,091    15   7.5m
300    1,290  0.068  24,463    11   9.1m
600      951  0.049  24,229     8  11.3m
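For road networks, the center vertices are obtained by clustering the vertex geo-coordinates with k-means. The sketch below is one plausible realization under assumptions: a basic Lloyd's-iteration k-means, with each cluster's center vertex taken as the vertex nearest its centroid; the exact selection procedure is the one defined in Chapter 4.

```python
import numpy as np

def select_center_vertices(coords, num_centers, iters=20, seed=0):
    """Pick |Vc| center vertices by k-means over vertex geo-coordinates.

    coords : (n, 2) array of (latitude, longitude) per vertex.
    Returns the indices of the vertices nearest to each cluster centroid.
    """
    rng = np.random.default_rng(seed)
    n = coords.shape[0]
    centroids = coords[rng.choice(n, num_centers, replace=False)]
    for _ in range(iters):                    # Lloyd's iterations
        d = np.linalg.norm(coords[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)             # nearest-centroid assignment
        for c in range(num_centers):          # recompute each centroid
            members = coords[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    d = np.linalg.norm(coords[:, None] - centroids[None], axis=2)
    return [int(d[:, c].argmin()) for c in range(num_centers)]

coords = np.random.default_rng(1).random((200, 2))  # 200 synthetic vertices
centers = select_center_vertices(coords, 5)
```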
5.2.9 Impact of Loss Function
In this section, we study the impact of the loss function. Tables 5.18 to 5.20 show the
performance of our vdist2vec model with different loss functions including mean square
error (MnSE), reverse Huber loss (REVHL), and mean cube error (MnCE) on graphs DG,
MB, SU, EPA, FBPOL, and FBTV. We measure the results in MnAE, MnRE, MxAE, and
MxRE. The processing and query times are similar across the loss functions and are
omitted here.
Table 5.18: Impact of Loss Function (DG and MB)

                 DG                              MB
LF      MnAE   MnRE   MxAE  MxRE      MnAE   MnRE  MxAE  MxRE
MnSE     135  0.015  9,050   193        12  0.014   376    16
REVHL     75  0.015  9,723   212         6  0.014   820    41
MnCE     199  0.022  8,422   189        12  0.017   205    15
Table 5.19: Impact of Loss Function (SU and EPA)

                 SU                              EPA
LF      MnAE   MnRE   MxAE  MxRE      MnAE   MnRE  MxAE  MxRE
MnSE      83  0.027  4,784    63     0.023  0.006     4     4
REVHL     50  0.024  4,750    92     0.019  0.006     4     4
MnCE     114  0.039  4,133    56     0.027  0.007     3     3
Table 5.20: Impact of Loss Function (FBTV and FBPOL)

                 FBTV                          FBPOL
LF       MnAE   MnRE  MxAE  MxRE     MnAE   MnRE  MxAE  MxRE
MnSE    0.137  0.026     4     4    0.133  0.034     4     4
REVHL   0.126  0.026     4     4    0.117  0.031     5     5
MnCE    0.179  0.033     2     2    0.153  0.037     2     2
As shown in Tables 5.18 to 5.20, compared with MnSE, applying REVHL reduces the
MnAE by at least 8% (0.126 vs. 0.137 on FBTV) and up to 50% (6 vs. 12 on MB), and
reduces the MnRE by up to 11% (0.024 vs. 0.027 on SU). However, as discussed in Sec-
tion 4.2, higher MxAE and MxRE are expected when using REVHL. For example, using
REVHL brings a 25% increase in the maximum errors on FBPOL. It also increases MxAE
by 118% (820 vs. 376) and MxRE by 156% (41 vs. 16) on MB. On graphs with larger di-
ameters, REVHL increases MxAE by only 7% on DG and even reduces MxAE by 0.7% on
SU, although it increases MxREs on both graphs, by 9.8% and 46%, respectively. Thus,
REVHL is recommended when optimizing the mean errors is prioritized.
Applying MnCE as the loss function, on the other hand, reduces MxAE by at least 7%
(8,422 vs. 9,050 on DG) and up to 50% (2 vs. 4 on FBTV and FBPOL), while suffering up
to a 47% increase in MnAE (199 vs. 135 on DG) and in MnRE (0.022 vs. 0.015 on DG).
Thus, MnCE is recommended when the maximum errors are the focus.
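For reference, one common formulation of the three per-pair loss contributions, applied to a single signed prediction error e, is sketched below; the berHu threshold c = 1 and the use of the absolute error in the cube loss are assumptions here, since Chapter 4 defines the exact variants used.

```python
def mean_square_error(e):
    """MnSE contribution of one signed error e = predicted - true."""
    return e * e

def reverse_huber(e, c=1.0):
    """REVHL (berHu): linear near zero, quadratic beyond the threshold c.

    The constant gradient of the linear region keeps pushing small errors
    toward zero, which helps the mean errors (MnAE, MnRE)."""
    a = abs(e)
    return a if a <= c else (a * a + c * c) / (2 * c)

def mean_cube_error(e):
    """MnCE: cubic growth punishes the largest errors hardest, which
    helps the maximum errors (MxAE, MxRE)."""
    return abs(e) ** 3

def batch_loss(errors, per_pair_loss):
    """A batch loss is the mean of the per-pair contributions."""
    return sum(per_pair_loss(e) for e in errors) / len(errors)
```

The shapes mirror the experimental results: the reverse Huber's linear region favors small mean errors, while the cubic loss concentrates the penalty on the largest errors.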
5.3 Other Embedding Applications
In this section, we test the general applicability of our vdist2vec model to other graph
problems including graph reconstruction and link prediction. We compare it with other
graph embedding methods including Graph Factorization (GF) [13], High-Order Proximity
preserved Embedding (HOPE) [71], Laplacian Eigenmaps (LE) [16], Locally Linear
Embedding (LLE) [82], node2vec [48], and Structural Deep Network Embedding
(SDNE) [96] on the social network graphs FBTV and FBPOL. We apply the settings
suggested by [46] and use an embedding dimensionality of 128 in these experiments.
We use the mean average precision (MAP) [46] as the evaluation metric in this set of
experiments, which is the mean of the per-vertex precision values. Precision here is the
fraction of a vertex's links that are correctly predicted.
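Under this per-vertex definition, the metric can be computed as in the following sketch; the dict-of-neighbor-sets representation and the handling of vertices without links are our assumptions, and the ranking-based MAP of [46] may differ in detail.

```python
def mean_average_precision(predicted, actual):
    """MAP as defined above: the mean, over vertices, of the fraction of a
    vertex's true links that are correctly predicted.

    predicted, actual : dicts mapping each vertex id to a set of neighbor ids.
    """
    scores = []
    for v, true_links in actual.items():
        if not true_links:
            continue                 # skip vertices with no links
        hits = len(predicted.get(v, set()) & true_links)
        scores.append(hits / len(true_links))
    return sum(scores) / len(scores)

# toy example: vertex 0's links fully recovered, vertex 1's half recovered
pred = {0: {1, 2}, 1: {0}}
true = {0: {1, 2}, 1: {0, 2}}
map_score = mean_average_precision(pred, true)   # (1.0 + 0.5) / 2 = 0.75
```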
For graph reconstruction, GF, HOPE, and node2vec apply the dot product on the em-
beddings of the two vertices for link reconstruction to yield a prediction. If the prediction
score is above a threshold, we consider that a link should be reconstructed between the
two vertices. LE and LLE apply Equation 5.1 below and compare the result with a
threshold to test if there should be a link between vertices vi and vj:

e^(-dist(vi, vj))    (5.1)

Here, dist() is a function that returns the L2 distance between two vectors. For SDNE, due
to its auto-encoder structure, it is applied on the graph adjacency matrix, and its decoder
outputs the adjacency matrix of the reconstructed graph. As our vdist2vec model can
predict the distance between two vertices, we simply consider two vertices to be linked
when their predicted distance is smaller than 2 (when running on social networks).
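The three link decision rules above can be summarized in a short sketch; the threshold values are placeholders, l2 plays the role of dist() in Equation 5.1, and the exponential-of-negative-distance form for LE/LLE is our reading of that equation.

```python
import math

def l2(u, v):
    """L2 distance between two embedding vectors (the dist() of Equation 5.1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def link_by_dot_product(z_i, z_j, threshold):
    """GF / HOPE / node2vec: reconstruct a link when the dot product is high."""
    return sum(a * b for a, b in zip(z_i, z_j)) >= threshold

def link_by_similarity(z_i, z_j, threshold):
    """LE / LLE: compare e^(-dist(vi, vj)) with a threshold (Equation 5.1)."""
    return math.exp(-l2(z_i, z_j)) >= threshold

def link_by_predicted_distance(predicted_distance):
    """vdist2vec on social networks: link if the predicted distance is below 2."""
    return predicted_distance < 2
```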
For the link prediction tests, we randomly remove 20% of the edges of a graph and
apply the embedding methods to the reduced graph. At testing time, we reconstruct the
links based on the learned embeddings and evaluate them against the original graph.
Table 5.21: Graph Reconstruction and Link Prediction Performance Measured in MAP

                       Graph reconstruction      Link prediction
                       FBTV      FBPOL           FBTV      FBPOL
baseline  GF           0.6695    0.4490          0.6467    0.5742
          HOPE         0.0093    0.0093          0.0064    0.0053
          LE           0.0077    0.0070          0.0043    0.0054
          LLE          0.0078    0.0069          0.0044    0.0049
          node2vec     0.5883    0.0061          0.2144    0.0069
          SDNE         0.0108    0.0066          0.0017    0.0028
proposed  vdist2vec    0.9919    0.9914          0.9755    0.9885
Table 5.21 shows that our model outperforms the baselines in graph reconstruction and
link prediction on FBTV and FBPOL by at least 32% (0.9919 vs. 0.6695 for graph
reconstruction on FBTV) and up to 99% (0.9914 vs. 0.0061 for graph reconstruction on
FBPOL). As our model learns the embeddings directly from the distances, the embeddings
contain both local and global structural information. However, learning from the
distances of all vertex pairs also leads to a longer training time, as shown in Table 5.22.
Table 5.22: Graph Reconstruction and Link Prediction Processing Time

                       Graph reconstruction      Link prediction
                       FBTV      FBPOL           FBTV      FBPOL
baseline  GF           0.4h      0.9h            0.2h      0.4h
          HOPE         5.5s      14.2s           5.7s      14.5s
          LE           3.0s      3.8s            2.9s      4.4s
          LLE          10.9s     9.6s            8.8s      18.3s
          node2vec     6.2s      14.7s           6.0s      15.1s
          SDNE         0.1h      0.3h            3.8m      0.2h
proposed  vdist2vec    1.2h      1.6h            1.2h      1.6h
5.4 Summary
In this chapter, we compared our proposed vdist2vec model and its variants with base-
lines in several aspects and applications. We first conducted an overall comparison on
road networks, social networks, and a web page graph, measured by mean absolute error,
mean relative error, max absolute error, max relative error, processing time, and query
time. Our models outperform the baselines by up to 97% in distance prediction accuracy
with a comparable query time. As a learning based method, our models have longer pre-
processing times. We argue that this cost is worthwhile, as our models have much smaller
distance prediction errors. We then analyzed the impact of embedding dimensionality
and graph updates on our vdist2vec model. Our vdist2vec model has the best perfor-
mance in most of these scenarios. We also studied the impact of the MLP structure and
confirmed that a model with a larger network size can yield more accurate distance
predictions, although the model training time also increases. To handle larger graphs,
we tested our model with different numbers of clusters and showed that more clusters
increase the accuracy (but also the model training time). Finally, we evaluated the
applicability of our learned embeddings on graph reconstruction and link prediction tasks.
Chapter 6
Conclusions and Future Work
6.1 Conclusions
We studied the graph shortest-path distance problem and proposed a representation
learning based approach to solve the problem. Our approach learns vertex embeddings
that preserve the distances between vertices. Storing the learned embeddings only takes
a space cost linear to the number of vertices. At query time, the embeddings of the two
query vertices are fed into a multi-layer perceptron to predict the distance between the
two vertices, which takes a constant time. Thus, our approach avoids the high costs in
space and time that may impede the applicability of existing approaches. Experimental
results show that our approach is highly efficient. Compared with state-of-the-art ap-
proaches on road network, social network, and web page graphs, our approach achieves
much smaller mean and maximum prediction errors, with an advantage up to 97%.
In Chapter 2, we reviewed different methods for shortest-path distance prediction.
Traditional methods have either large space cost or large time cost in query processing.
Recent studies tend to trade preprocessing time or accuracy for lower space and time
costs at query time. For example, distance labeling methods preprocess a graph into a hi-
erarchical structure that groups vertices based on heuristic rules. For each group, a set of
vertices are selected as the representation vertices (labels) which will be used in distance
queries between the vertices in the group to those in the other groups. The accuracy of
distance labeling methods relies on their heuristics to build the vertex hierarchy. In the
worst case, their label size is still O(n²) (i.e., when each vertex forms a group). Another
direction is landmark methods. A landmark method precomputes part of the shortest-path
distances among the vertices to reduce both the space cost to O(kn) and the query time
cost to O(k) with approximate results. The accuracy of landmark methods highly de-
pends on the selection of landmarks. A more recent method uses graph embedding,
which was previously used in other graph problems such as graph reconstruction, link
prediction, and graph visualization, since the learned embeddings contain the connec-
tion information between vertices. The distance between vertices can be considered as
containing both global connection information (multi-hop) and local connection informa-
tion (single-hop). Hence, graph embedding has the potential to be applied to shortest-path
distance prediction. Using the graph embedding idea, we only need an n × k matrix to
store the vertex embeddings, with a space cost of O(kn), while queries can be processed
efficiently (e.g., by examining the embeddings of the two query vertices in O(k) time).
In Chapter 3, we proposed a two-stage model framework that consists of a graph rep-
resentation learning network and a distance prediction network. We adapted node2vec,
auto-encoder, geo-coordinates, and landmark labels for the representation learning net-
work and trained the distance prediction network based on the learned vertex represen-
tations. Node2vec is based on discrete sampling. It only needs to observe a part of the
graph for vertex representation learning, which makes it scalable to large graphs. Since it
focuses more on local information (shared neighbours), in shortest-path distance predic-
tion tasks, it performs better on graphs with a small diameter, such as social networks,
while it suffers on graphs with a large diameter, such as road networks. Auto-encoder
works better on road networks, while it fails in learning useful information for social
networks. When learning embeddings from a graph with a low distance variance, the
auto-encoder tends to learn the same embedding for all vertices, predicting the average
distance for every vertex pair. Using geo-coordinates as the embeddings in road networks
cannot reflect the
detour distances (e.g., to cross a bridge). Its performance dropped dramatically for road
networks that cross rivers or mountains. Using landmark labels as the embeddings over-
comes such a limitation. However, it is challenging to select high-quality landmarks to
obtain accurate distance predictions. As these two-stage models separate the distance
prediction network from the graph representation learning network, the embeddings
learned may not be optimized for distance prediction.
In Chapter 4, we proposed a one-stage model called vdist2vec that learns the embed-
dings guided directly by the shortest-path distances. By connecting the graph represen-
tation learning network and the distance prediction network, our vdist2vec model can
learn better embeddings for shortest-path distance prediction. We then studied the loss
functions and their impact on the model. We proposed the reverse Huber loss function,
which leads our model to reduce the mean prediction errors. We also found that a
higher-order polynomial loss function can yield lower maximum prediction errors. We
further proposed an ensembling based model to achieve even higher distance predic-
tion accuracy, and a hierarchical model to scale to large graphs. In addition, we showed
that our learned embeddings can also be applied to other graph problems such as graph
reconstruction and link prediction.
In Chapter 5, we tested our vdist2vec model on road network graphs, social net-
work graphs, and web page graphs for shortest-path distance prediction. Compared with
the baseline methods, our vdist2vec model has up to 97% lower prediction errors. As a
learning based method, our model takes longer to train than the heuristic methods, which
is expected. For query processing, our model has a smaller query time, as it can be easily
parallelized to use the full computation power of GPUs. We then stud-
ied the impact of embedding dimensionality, MLP structure, and loss functions on our
model. Our vdist2vec model can yield a high prediction accuracy with a small embed-
ding dimensionality, which shows that our model is highly space efficient. When a larger
number of embedding dimensions is allowed, our model performance improves further,
although longer model training times and query times are expected. Similarly, a larger
MLP with more layers and nodes contributes to a higher distance prediction accuracy,
but this also requires a longer training time. In terms of the loss function, we tested mean
square error, mean cube error, and our proposed reverse Huber loss. Reverse Huber loss
is more effective to reduce the mean prediction errors, while mean cube error is more ef-
fective to reduce the maximum prediction errors. We also tested our hierarchical model
for larger graphs. This model can dramatically reduce the training time while not losing
too much distance prediction accuracy, especially when there are more center vertices.
We then adapted our model to other graph applications including link prediction and
graph reconstruction. The experiment results show that our learned embeddings are also
effective in these applications.
6.2 Future Work
As a distance guided machine learning model, our vdist2vec model may have a longer
preprocessing time than heuristic methods. For future work, we plan to investigate graph
clustering techniques to help further boost the efficiency and accuracy of distance pre-
diction over large graphs. For now, we have applied the k-means clustering algorithm
on road network graphs based on the geo-coordinates of the vertices. We may explore
hierarchical structures such as well-separated pair decomposition [86], contraction hi-
erarchy [44], and highway hierarchy [58] and combine them with our model. We may
apply our vdist2vec model on their representative vertices. For example, after we use
the highway hierarchy to generate highways, we can view each highway as a vertex and
learn an embedding for it. At query time, instead of storing all the distances from and to
the highways, we can use the embeddings of the vertices and the highways to predict the
distance. Further, we may adapt the hierarchy construction rules into our models so that
the models can automatically learn embeddings of the clusters (or representative ver-
tices). We can use geo-coordinates in our training process and design a gated mechanism
to decide whether two vertices are well-separated or not. If they are not well-separated,
the model will directly learn the embeddings based on their distance. If they are well-
separated, the model only learns the embeddings based on the distance between their
corresponding representative vertices.
Our learned embeddings do not focus on the structural information of a graph. To
overcome this limitation, one direction is to apply spectral clustering [95], which clusters
vertices based on the eigenvectors of the adjacency matrix of a graph. In spectral
clustering, vertices are represented by eigenvectors, which can serve as an initialisation
for our embed-
dings in graph representation learning. As spectral clustering aims to map similar ver-
tices to be close in the latent space, using the embeddings as an initialisation may help
our model to train more efficiently and keep the structure information. This also allows
us to handle complex graphs such as social networks and web page graphs which do
not have a coordinate associated to the vertices. However, spectral clustering could be
costly for large graphs as its computational complexity is high (O(n³) [98]). For better
scalability, we may explore approximate spectral clustering methods. Recent research on
approximate spectral clustering mainly has two directions. One direction is to compute
an approximate spectral embedding, such as power iteration clustering (PIC) [65]. Power
iteration is an algorithm that computes the largest eigenvector of a matrix approximately.
It runs recursively with a preset number of iterations to approach the largest eigenvec-
tor step by step. In PIC, instead of getting only the largest eigenvector (the result of the
last iteration), the intermediate values (results in intermediate iterations) are also used
as approximate eigenvectors. Compared with the original spectral clustering, PIC reduces
the computation time significantly with a fair performance [65]. Another direction is to
represent the graph with fewer vertices and run spectral clustering on the representative
vertices. For example, k-means-based approximate spectral clustering (KASP) [98] first ap-
plies k-means to partition the vertices into k clusters, and then runs spectral clustering
only on the center vertices of each cluster. As this method needs to run k-means
clustering first, it may not be feasible on complex networks. Besides spectral clustering,
the other graph embedding methods introduced in Section 2.3 can also be used for
initialisation.
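The power-iteration idea behind PIC can be illustrated with a minimal sketch: repeatedly multiplying a normalized affinity matrix into a vector approaches its dominant eigenvector, and the rescaled intermediate iterates can serve as approximate one-dimensional spectral embeddings. The row normalization and fixed iteration count below are simplifying assumptions relative to the full PIC algorithm of [65].

```python
import numpy as np

def power_iteration_embedding(W, num_iters=30, seed=0):
    """Approximate spectral embedding via power iteration (PIC-style).

    W : (n, n) symmetric non-negative affinity matrix.
    Returns the list of rescaled intermediate iterates; PIC uses these
    intermediate values, not just the final (dominant) eigenvector."""
    rng = np.random.default_rng(seed)
    P = W / W.sum(axis=1, keepdims=True)   # row-normalized affinities
    v = rng.random(W.shape[0])
    iterates = []
    for _ in range(num_iters):
        v = P @ v                          # one power-iteration step
        v = v / np.abs(v).sum()            # rescale to avoid over/underflow
        iterates.append(v.copy())
    return iterates

# toy affinity matrix with two tightly connected vertex pairs
W = np.array([[1.0, 1.0, 0.01, 0.01],
              [1.0, 1.0, 0.01, 0.01],
              [0.01, 0.01, 1.0, 1.0],
              [0.01, 0.01, 1.0, 1.0]])
emb = power_iteration_embedding(W)[5]      # an intermediate iterate
```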
Bibliography
[1] 9th DIMACS implementation challenge - shortest paths. http://users.diag.uniroma1.it/challenge9/download.shtml, 2006. [Online; accessed 28-August-2019].
[2] Networkx. https://networkx.github.io, 2014. [Online; accessed 28-August-
2019].
[3] Stanford large network dataset collection. http://snap.stanford.edu/data,
2015. [Online; accessed 28-August-2019].
[4] Urban road network data. https://figshare.com/articles/Urban_Road_Network_Data/2061897, 2016. [Online; accessed 28-August-2019].
[5] Google announces over 2 billion monthly active devices on android.
https://www.theverge.com/2017/5/17/15654454/android-reaches-2-billion-monthly-active-users, 2017. [Online; accessed 28-August-
2019].
[6] Planet OSM. https://planet.osm.org, 2017. [Online; accessed 28-August-
2019].
[7] Facebook announces over 2 billion monthly social network users. https://www.nbcnews.com/tech/tech-news/facebook-hits-2-27-billion-monthly-active-users-earnings-stabilize-n926391, 2018. [Online;
accessed 25-January-2020].
[8] Netcraft web server survey shows more than 1.2 billion active websites in 2020. https://news.netcraft.com/archives/category/web-server-survey/, 2020. [Online; accessed 25-January-2020].
[9] Ittai Abraham, Yair Bartal, Jon Kleinberg, T-H. Hubert Chan, Ofer Neiman, Kedar
Dhamdhere, Aleksandrs Slivkins, and Anupam Gupta. Metric embeddings with
relaxed guarantees. In Proceedings of the 46th Annual IEEE Symposium on Foundations
of Computer Science (FOCS), pages 83–100, 2005.
[10] Ittai Abraham, Daniel Delling, Andrew V. Goldberg, and Renato F. Werneck. A
hub-based labeling algorithm for shortest paths in road networks. In Proceedings of
the 10th International Symposium on Experimental Algorithms (SEA), pages 230–241,
2011.
[11] Ittai Abraham, Daniel Delling, Andrew V. Goldberg, and Renato F. Werneck. Hier-
archical hub labelings for shortest paths. In Proceedings of the 20th European Sympo-
sium on Algorithms (ESA), pages 24–35, 2012.
[12] Lada A. Adamic and Eytan Adar. Friends and neighbors on the web. Social Net-
works, 25(3):211–230, 2003.
[13] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and
Alexander J Smola. Distributed large-scale natural graph factorization. In Proceed-
ings of the 22nd International Conference on World Wide Web (WWW), pages 37–48,
2013.
[14] Takuya Akiba, Yoichi Iwata, and Yuichi Yoshida. Fast exact shortest-path distance
queries on large networks by pruned landmark labeling. In Proceedings of the 2013
ACM SIGMOD International Conference on Management of Data (SIGMOD), pages
349–360, 2013.
[15] Mukund Balasubramanian and Eric L. Schwartz. The Isomap algorithm and topo-
logical stability. Science, 295(5552):7–7, 2002.
[16] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques
for embedding and clustering. In Advances in Neural Information Processing Systems
(NIPS), pages 585–591, 2002.
[17] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A
review and new perspectives. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(8):1798–1828, 2013.
[18] Jean Bourgain. On lipschitz embedding of finite metric spaces in hilbert space.
Israel Journal of Mathematics, 52(1-2):46–52, 1985.
[19] Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical
Sociology, 25(2):163–177, 2001.
[20] Ronald L. Breiger, Harrison C. White, and Scott A. Boorman. Social structure from
multiple networks. The American Journal of Sociology, 81(4):730–780, 1976.
[21] Hongyun Cai, Vincent W. Zheng, and Kevin C. Chang. A comprehensive survey
of graph embedding: Problems, techniques, and applications. IEEE Transactions on
Knowledge and Data Engineering, 30(9):1616–1637, 2018.
[22] Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representa-
tions with global structural information. In Proceedings of the 24th ACM Interna-
tional on Conference on Information and Knowledge Management (CIKM), pages 891–
900, 2015.
[23] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Deep neural networks for learning
graph representations. In Proceedings of the 30th National Conference of the American
Association for Artificial Intelligence (AAAI), pages 1145–1152, 2016.
[24] Lijun Chang, Jeffrey X. Yu, Lu Qin, Hong Cheng, and Miao Qiao. The exact distance
to destination in undirected world. VLDB Journal, 21(6):869–888, 2012.
[25] Shiri Chechik. Approximate distance oracles with improved bounds. In Proceedings
of the 47th Annual ACM Symposium on Theory of Computing (STOC), pages 1–10, 2015.
[26] Wei Chen, Christian Sommer, Shang-Hua Teng, and Yajun Wang. A compact rout-
ing scheme and approximate distance oracle for power-law graphs. ACM Transac-
tions on Algorithms, 9(1):4, 2012.
[27] Kenneth W. Church and Patrick Hanks. Word association norms, mutual informa-
tion, and lexicography. Computational Linguistics, 16(1):22–29, 1990.
[28] Aaron Clauset, Cristopher Moore, and Mark E. J. Newman. Hierarchical structure
and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.
[29] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. Reachability and dis-
tance queries via 2-hop labels. SIAM Journal on Computing, 32(5):1338–1355, 2003.
[30] Atish Das Sarma, Sreenivas Gollapudi, Marc Najork, and Rina Panigrahy. A sketch-
based distance oracle for web-scale graphs. In Proceedings of the 3rd ACM Interna-
tional Conference on Web Search and Data Mining (WSDM), pages 401–410, 2010.
[31] Giuseppe Di Battista, Peter Eades, Roberto Tamassia, and Ioannis G. Tollis. Algo-
rithms for drawing graphs: An annotated bibliography. Computational Geometry, 4
(5):235–282, 1994.
[32] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische
Mathematik, 1(1):269–271, 1959.
[33] Paul Erdős. On some extremal problems in graph theory. Israel Journal of Mathemat-
ics, 3(2):113–116, 1965.
[34] Tomas Feder and Rajeev Motwani. Clique partitions, graph compression and
speeding-up algorithms. Journal of Computer and System Sciences, 51(2):261–272,
1995.
[35] Raphael A. Finkel and Jon L. Bentley. Quad trees a data structure for retrieval on
composite keys. Acta Informatica, 4(1):1–9, 1974.
[36] Robert W. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5(6):
345, 1962.
[37] Michael L. Fredman and Robert E. Tarjan. Fibonacci heaps and their uses in im-
proved network optimization algorithms. Journal of the ACM, 34(3):596–615, 1987.
[38] Linton C. Freeman. A set of measures of centrality based on betweenness. Sociom-
etry, pages 35–41, 1977.
[39] Linton C. Freeman. Centrality in social networks conceptual clarification. Social
Networks, 1(3):215–239, 1978.
[40] Linton C. Freeman. Visualizing social networks. Journal of Social Structure, 1(1):4,
2000.
[41] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical
learning, volume 1. Springer, 2001.
[42] Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic
relational models. In Proceedings of the 16th International Joint Conference on Artificial
Intelligence (IJCAI), volume 99, pages 1300–1309, 1999.
[43] Ada W. Fu, Huanhuan Wu, James Cheng, and Raymond C. Wong. IS-Label: An
independent-set based labeling scheme for point-to-point distance querying. Pro-
ceedings of the VLDB Endowment, 6(6):457–468, 2013.
[44] Robert Geisberger, Peter Sanders, Dominik Schultes, and Daniel Delling. Contrac-
tion hierarchies: Faster and simpler hierarchical routing in road networks. In Pro-
ceedings of the International Workshop on Experimental and Efficient Algorithms (WEA),
pages 319–333, 2008.
[45] Andrew V. Goldberg, Haim Kaplan, and Renato F. Werneck. Better landmarks
within reach. In Proceedings the 6th International Workshop on Experimental and Effi-
cient Algorithms (WEA), pages 38–51, 2007.
[46] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and
performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
[47] Irina Gribkovskaia, Øyvind Halskau Sr, and Gilbert Laporte. The bridges of
Königsberg—a historical perspective. Networks: An International Journal, 49(3):199–
203, 2007.
[48] Aditya Grover and Jure Leskovec. Node2vec: Scalable feature learning for net-
works. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining (SIGKDD), pages 855–864, 2016.
[49] Andrey Gubichev, Srikanta Bedathur, Stephan Seufert, and Gerhard Weikum. Fast
and accurate estimation of shortest paths in large graphs. In Proceedings of the
19th ACM International Conference on Information and Knowledge Management (CIKM),
pages 499–508, 2010.
[50] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on
graphs: Methods and applications. IEEE Data Engineering Bulletin, 40:52–74, 2017.
[51] David Heckerman, Chris Meek, and Daphne Koller. Probabilistic entity-
relationship models, prms, and plate models. Introduction to Statistical Relational
Learning, pages 201–238, 2007.
[52] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of
data with neural networks. Science, 313(5786):504–507, 2006.
[53] Gisli R. Hjaltason and Hanan Samet. Properties of embedding methods for simi-
larity searching in metric spaces. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 25(5):530–549, 2003.
[54] Thomas Hofmann and Joachim Buhmann. Multidimensional scaling and data clus-
tering. In Advances in Neural Information Processing Systems (NIPS), pages 459–466,
1995.
[55] Peter J. Huber. Robust estimation of a location parameter. In Breakthroughs in Statis-
tics, pages 492–518. 1992.
[56] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR
techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
[57] Minhao Jiang, Ada W. Fu, Raymond C. Wong, and Yanyan Xu. Hop doubling label
indexing for point-to-point distance querying on scale-free networks. Proceedings
of the VLDB Endowment, 7(12):1203–1214, 2014.
[58] Ruoming Jin, Ning Ruan, Yang Xiang, and Victor Lee. A highway-centric labeling
approach for answering distance queries on large sparse graphs. In Proceedings of
the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD),
pages 445–456, 2012.
[59] Ishan Jindal, Zhiwei Qin, Xuewen Chen, Matthew Nokleby, and Jieping Ye. A
unified neural network approach for estimating travel time and distance for a taxi
trip. arXiv preprint arXiv:1710.04350, 2017.
[60] Dieter Jungnickel. Graphs, networks and algorithms. Springer, 2005.
[61] Alireza Karduni, Amirhassan Kermanshah, and Sybil Derrible. A protocol to con-
vert spatial polyline data to network formats and applications to world urban road
networks. Scientific Data, 3(1):1–7, 2016.
[62] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18
(1):39–43, 1953.
[63] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph con-
volutional networks. arXiv preprint arXiv:1609.02907, 2016.
[64] Douglas J. Klein and Milan Randić. Resistance distance. Journal of Mathematical
Chemistry, 12(1):81–95, 1993.
[65] Frank Lin and William W. Cohen. Power iteration clustering. In Proceedings of the
27th International Conference on Machine Learning (ICML), pages 655–662, 2010.
[66] László Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdős Is
Eighty, 2(1):1–46, 1993.
[67] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient esti-
mation of word representations in vector space. In Proceedings of the International
Conference on Learning Representations Workshops (ICLR), 2013.
[68] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltz-
mann machines. In Proceedings of the 27th International Conference on Machine Learn-
ing (ICML), pages 807–814, 2010.
[69] David Opitz and Richard Maclin. Popular ensemble methods: An empirical study.
Journal of Artificial Intelligence Research, 11:169–198, 1999.
[70] Jack A. Orenstein. Multidimensional tries used for associative searching. Informa-
tion Processing Letters, 14(4):150–157, 1982.
[71] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric
transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages
1105–1114, 2016.
[72] Panos M. Pardalos and Jue Xue. The maximum clique problem. Journal of Global
Optimization, 4(3):301–328, 1994.
[73] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space.
The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):
559–572, 1901.
[74] David Peleg. Proximity-preserving labeling schemes. Journal of Graph Theory, 33(3):
167–176, 2000.
[75] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of so-
cial representations. In Proceedings of the 20th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (SIGKDD), pages 701–710, 2014.
[76] Robi Polikar. Ensemble based systems in decision making. IEEE Circuits and Sys-
tems Magazine, 6(3):21–45, 2006.
[77] Michalis Potamias, Francesco Bonchi, Carlos Castillo, and Aristides Gionis. Fast
shortest path distance estimation in large networks. In Proceedings of the 18th ACM
Conference on Information and Knowledge Management (CIKM), pages 867–876, 2009.
[78] Fatemeh S. Rizi, Joerg Schloetterer, and Michael Granitzer. Shortest path distance approximation using deep learning techniques. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 1007–1014, 2018.
[79] Neil Robertson and Paul D. Seymour. Graph minors. III. Planar tree-width. Journal
of Combinatorial Theory, Series B, 36(1):49–64, 1984.
[80] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39,
2010.
[81] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), 2015.
[82] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[83] Hanan Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys, 16(2):187–260, 1984.
[84] Jagan Sankaranarayanan and Hanan Samet. Distance oracles for spatial networks.
In Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE),
pages 652–663, 2009.
[85] Jagan Sankaranarayanan and Hanan Samet. Query processing using distance oracles for spatial networks. IEEE Transactions on Knowledge and Data Engineering, 22(8):1158–1175, 2010.
[86] Michiel Smid. The well-separated pair decomposition and its applications. In Handbook of Approximation Algorithms and Metaheuristics, Chapman & Hall/CRC, 2016.
[87] Peter Sollich and Anders Krogh. Learning with ensembles: How overfitting can be
useful. In Advances in Neural Information Processing Systems (NIPS), pages 190–196,
1996.
[88] Christian Sommer. Shortest-path queries in static networks. ACM Computing Surveys, 46(4):45:1–45:31, 2014.
[89] Frank W. Takes and Walter A. Kosters. Adaptive landmark selection strategies
for fast shortest path computation in large real-world graphs. In Proceedings of
the IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent
Agent Technologies (WI-IAT), pages 27–34, 2014.
[90] Liying Tang and Mark Crovella. Virtual landmarks for the internet. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement (IMC), pages 143–152, 2003.
[91] Athanasios Theocharidis, Stijn van Dongen, Anton J. Enright, and Tom C. Freeman. Network visualization and analysis of gene expression data using BioLayout Express3D. Nature Protocols, 4(10):1535, 2009.
[92] Mikkel Thorup and Uri Zwick. Approximate distance oracles. Journal of the ACM,
52(1):1–24, 2005.
[93] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. Efficient aggregation
for graph summarization. In Proceedings of the 2008 ACM SIGMOD International
Conference on Management of Data (SIGMOD), pages 567–580, 2008.
[94] Hannu Toivonen, Fang Zhou, Aleksi Hartikainen, and Atte Hinkka. Compression
of weighted graphs. In Proceedings of the 17th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (SIGKDD), pages 965–973, 2011.
[95] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17
(4):395–416, 2007.
[96] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (SIGKDD), pages 1225–1234, 2016.
[97] Jim Webber. A programmatic introduction to Neo4j. In Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity (SPLASH), pages 217–218, 2012.
[98] Donghui Yan, Ling Huang, and Michael I. Jordan. Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 907–916, 2009.
[99] Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, and Zhao Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems (NIPS), pages 1553–1560, 2007.
[100] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML), page 116, 2004.
Minerva Access is the Institutional Repository of The University of Melbourne

Author: Zhao, Zhuowei
Title: Embedding Graphs for Shortest-Path Distance Predictions
Date: 2020
Persistent Link: http://hdl.handle.net/11343/241911
File Description: Final thesis file