Embedding Graphs for Shortest-Path Distance Predictions
Zhuowei Zhao (ORCID: 0000-0002-6891-6432)
Submitted in total fulfilment of the requirements of the degree of
Master of Philosophy
School of Computing and Information Systems
THE UNIVERSITY OF MELBOURNE
February 2020
Copyright © 2020 Zhuowei Zhao
All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
Abstract
Graphs are an important data structure used in an abundance of real-world applications
including navigation systems, social networks, and web search engines, to name
a few. We study a classic graph problem – computing graph shortest-path
distances. This problem has many applications, such as finding nearest neighbors for
place of interest (POI) recommendation or social network friendship recommendation. To
compute a shortest-path distance, traditional approaches traverse the graph to find the
shortest path and return the path length. These approaches lack time efficiency over large
graphs. In the applications above, the distances may be needed first (e.g., to rank POIs),
while the actual shortest paths may be computed later (e.g., after a POI has been chosen).
Thus, an alternative approach precomputes and stores the distances, and answers dis-
tance queries with simple lookups. This approach, however, falls short in the space cost
– O(n2) in the worst case for n vertices, even with various optimizations.
To address these limitations, we take an embedding based approach to predict the
shortest-path distance between two vertices using their embeddings without comput-
ing their path online or storing their distance offline. Graph embedding is an emerging
technique for graph analysis that has yielded strong performance in applications such
as node classification, link prediction, graph reconstruction, and more. We propose a
representation learning approach to learn a k-dimensional (k ≪ n) embedding for every
vertex. This embedding preserves the distance information of the vertex to the other
vertices. We then train a multi-layer perceptron (MLP) to predict the distance between
two vertices given their embeddings. We thus achieve fast distance predictions with-
out a high space cost (i.e., only O(kn)). Experimental results on road network graphs,
social network graphs, and web document graphs confirm these advantages, while our
approach also produces distance predictions that are up to 97% more accurate than those
by the state-of-the-art approaches.
Our embeddings are not limited to distance predictions. We further study their
applicability on other graph problems such as link prediction and graph reconstruction.
Experimental results show that our embeddings are highly effective in these tasks.
Declaration
This is to certify that
1. the thesis comprises only my original work towards the MPhil,
2. due acknowledgement has been made in the text to all other material used,
3. the thesis is less than 50,000 words in length, exclusive of tables, figures, bibliogra-
phies and appendices.
Zhuowei Zhao, February 2020
This page is intentionally left blank.
Acknowledgements
First of all, I would like to express my deepest gratitude to my supervisors, Dr. Jianzhong
Qi and Prof. Rui Zhang for their continuous support during my MPhil study. They have
guided me with their rich knowledge. Their passion for research has deeply encouraged
me. Without their support, this thesis would not have been possible.
I am deeply grateful to Prof. Wei Wang (The University of New South Wales) who
provided invaluable discussions and insightful feedback to my research.
I also sincerely thank my Advisory Committee Chair, Dr. Sean Maynard. He has
closely followed my progress and given me generous support during my MPhil study.
Without his insightful feedback and constructive comments, my progress would not have
been as smooth.
Then, I would like to thank The University of Melbourne and School of Computing
and Information Systems for providing a supportive research environment and rich re-
sources for my MPhil study.
Last but not least, I would like to thank all my fellow research students, with whom
I have shared an office or worked on various occasions, for their support in research, life, and all
the pleasant memories, including Xinting Huang, Jiabo He, Yixin Su, Shiquan Yang, Yi-
meng Dai, Xiaojie Wang, Bastian Oetomo, Ang Li, Yunxiang Zhao, Guanli Liu, Yanchuan
Chang, Chuandong Yin, Chenxu Zhao, Weihao Chen, Zhen Wang, and Daocang Chen.
Preface
A paper based on the work presented in Chapter 4 has been accepted and will appear in The
23rd International Conference on Extending Database Technology (EDBT). I declare that I am
the primary author and have contributed > 50% in the paper.
1. * Jianzhong Qi, Wei Wang, Rui Zhang, and Zhuowei Zhao. A Learning Based Ap-
proach to Predict Shortest-Path Distances. Accepted to appear in International Con-
ference on Extending Database Technology (EDBT), 2020. (CORE Ranking 1: A)
* The authors are ordered alphabetically.
1http://portal.core.edu.au/conf-ranks/?search=EDBT&by=all&source=CORE2018&sort=atitle&page=1
To my parents and my wife, for their unconditional love.
Contents
1 Introduction
  1.1 Background
  1.2 Research Gap
  1.3 Contributions of the Thesis
  1.4 Outline of the Thesis

2 Related Work
  2.1 Exact Distance Computation
  2.2 Approximate Distance Computation
  2.3 Graph Embedding
  2.4 Other Graph Embedding Applications
  2.5 Summary

3 Adapted Two-Stage Models
  3.1 Problem Formulation
  3.2 A Two-Stage Solution Framework
  3.3 Adapted Representation Learning Models
    3.3.1 Node2vec
    3.3.2 Auto-encoder
    3.3.3 Geo-coordinates and Landmark Labels
  3.4 Summary

4 Proposed Single-Stage Model
  4.1 Vdist2vec
  4.2 Loss Function
    4.2.1 Reducing Mean Errors
    4.2.2 Reducing Maximum Errors
  4.3 Ensembling Model
  4.4 Handling Large Graphs
    4.4.1 Large Road Networks
  4.5 Handling Updates
  4.6 Cost Analysis
  4.7 Summary

5 Experiments
  5.1 Settings
  5.2 Results
    5.2.1 Overall Comparison
    5.2.2 Performance on Smaller Graphs
    5.2.3 Performance on Larger Graphs
    5.2.4 Applicability Test
    5.2.5 Impact of Updates
    5.2.6 Impact of Embedding Dimensionality
    5.2.7 Impact of MLP
    5.2.8 Impact of Number of Center Vertices
    5.2.9 Impact of Loss Function
  5.3 Other Embedding Applications
  5.4 Summary

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
List of Figures
1.1 Examples of graphs in real life
1.2 Vertex distance on a road network graph
1.3 Graph shortest-path distance problem

2.1 A graph example
2.2 Landmark distribution on Dongguan

3.1 Solution framework
3.2 Auto-encoder embedding model
3.3 A road network example

4.1 Vdist2vec model structure
4.2 Error distribution of our model (DG)
4.3 Distance value distribution (DG)
4.4 Ensembling model
4.5 Layer ensembling model
4.6 Distance prediction model for large road networks

5.1 Recall and nDCG in finding nearest neighbor
5.2 Impact of updates (DG)
5.3 Impact of updates (FBPOL)
5.4 Impact of k (DG)
5.5 Impact of k (FBPOL)
List of Tables
3.1 Node2vec Based Distance Prediction Errors
3.2 Node2vec Based Distance Prediction Errors on Different Networks
3.3 Auto-encoder Based Distance Prediction Errors on Different Networks
3.4 Geodnn Based Distance Prediction Errors on Different Networks

4.1 Comparing Huber Loss with Reverse Huber Loss on Road Networks
4.2 Comparing Huber Loss with Reverse Huber Loss on Social Networks
4.3 Using MnSE and MnCE as the Loss Function on MB dataset
4.4 Performance of Ensembling Models on MB

5.1 Datasets
5.2 Mean Absolute and Mean Relative Errors on Road Networks (Smaller Graphs)
5.3 Mean Absolute and Mean Relative Errors on Social Networks and Web Page Graph (Smaller Graphs)
5.4 Max Absolute and Max Relative Errors on Road Networks (Smaller Graphs)
5.5 Max Absolute and Max Relative Errors on Social Networks and Web Graph (Smaller Graphs)
5.6 Preprocessing and Query Times on Road Networks (Smaller Graphs)
5.7 Preprocessing and Query Times on Social Networks and Web Page Graph (Smaller Graphs)
5.8 Mean Absolute and Mean Relative Errors on Road Networks (Larger Graphs)
5.9 Mean Absolute and Mean Relative Errors on Social Networks (Larger Graphs)
5.10 Max Absolute and Max Relative Errors on Road Networks (Larger Graphs)
5.11 Max Absolute and Max Relative Errors on Social Networks (Larger Graphs)
5.12 Preprocessing and Query Times on Road Networks (Larger Graphs)
5.13 Preprocessing and Query Times on Social Networks (Larger Graphs)
5.14 Effectiveness of Embedding Learning (DG)
5.15 Impact of MLP Structure Landmark-dg + MLP (DG)
5.16 Impact of MLP Structure Vdist2vec (DG)
5.17 Impact of Number of Center Vertices (SH)
5.18 Impact of Loss Function (DG and MB)
5.19 Impact of Loss Function (SU and EPA)
5.20 Impact of Loss Function (FBTV and FBPOL)
5.21 Graph Reconstruction and Link Prediction Performance Measured in MAP
5.22 Graph Reconstruction and Link Prediction Processing Time
List of Abbreviations and Symbols
CH Contraction Hierarchies
DNN Deep Neural Network
MAP Mean Average Precision
MnCE Mean Cube Error
MLP Multilayer Perceptron
MnAE Mean Absolute Error
MnRE Mean Relative Error
MnSE Mean Square Error
MxAE Max Absolute Error
MxRE Max Relative Error
NN Nearest Neighbors
NDCG Normalized Discounted Cumulative Gain
POI Place of Interest
PCA Principal Component Analysis
PT Preprocessing Time
QT Query Time
RRVHL Reverse Huber Loss
v A vertex
u A vertex
l A landmark vertex
G A graph
E An edge set
V A vertex set
L A landmark set
La A label set
La(v) Label of vertex v
d(v, u) Distance between v and u
d̂(v, u) Estimated distance between v and u
dq(v, u, La) Distance between v and u computed by label set La
pv,u Shortest path from v to u
vi An embedding vector for vi
V An embedding matrix for V
L Training loss
Chapter 1
Introduction
1.1 Background
Graphs were first introduced by Leonhard Euler in 1735 [47] to solve a mathematical
problem known as the Königsberg bridge problem: traversing a number of islands
connected by bridges, crossing each bridge exactly once. Euler modeled the problem
by representing the islands as vertices and the bridges as edges of a graph.
Since then, graphs have become an important mathematical tool in many disciplines
including computer science, chemistry, linguistics, geography, and many more. In computer
science, graphs are an essential data structure and are commonly used to model transportation
networks, social networks, and web page link structures, to name a few.
Figure 1.1 gives an example, where Figure 1.1a is a social network graph, Figure 1.1b is a
web page graph, and Figure 1.1c is a transport network graph.
Figure (1.1) Examples of graphs in real life: (a) a social network graph; (b) a web page graph; (c) a transportation network graph
In graph theory, a basic problem is to compute the distance between two vertices,
which may be used to model the travel cost between two places of interest (POIs), the social
closeness of two individuals, the relevance of two web pages, etc. Figure 1.2 shows the
distance between two vertices on a road network graph. We can see that such a distance
is not necessarily the Euclidean distance between the two vertices. The vertex distances
are fundamental for recommending POIs to tourists, suggesting friends to social network
users, or ranking web pages for search engines. In these applications, there may be mil-
lions of vertices and users who issue distance queries. For example, the Florida road
network [1] has over a million vertices; Google Maps has over a billion active users [5];
there are more than a billion active websites [8]; and Facebook has over 2 billion social
network users [7]. Answering distance queries under such settings poses significant
challenges in both space and time costs.
Figure (1.2) Vertex distance on a road network graph
In this thesis, we revisit the problem of computing the distance between two vertices
in a graph. Here, the distance refers to the length of the graph shortest path between the
two vertices. We use distance for brevity when the context is clear. Figure 1.3a shows an
abstracted example of the problem, where v1, v2, ..., v5 are the vertices, and the numbers
on the edges are the edge weights. Consider vertices v1 and v5. Their distance is the
length of path v1 → v4 → v5, which is 4. Our aim is to answer queries on such distances
(approximately) with a high efficiency.
1.2 Research Gap
A traditional approach uses graph shortest path algorithms to compute the shortest path
between two vertices, along which the path length (i.e., the distance) is computed. Di-
jkstra’s single-source shortest-path (SSSP) algorithm [32] and the Floyd-Warshall all-pair
shortest-path (APSP) algorithm [36] are simple and effective algorithms for this purpose.
However, these methods may incur high computational costs when run online
over large graphs, i.e., O(m + n log n) and O(n3), where m and n are the
numbers of edges and vertices, respectively. More recent algorithms such as contraction
hierarchies (CH) [44] reduce the time cost via preprocessing the graphs to add shortcut
edges (i.e., shortest paths between some vertices). These algorithms focus on computing
the shortest paths rather than the distances.
In applications such as those mentioned above, the distances may be needed first
(a) A graph example: vertices v1 to v5, where v2 and v3 serve as landmarks l1 and l2; the numbers on the edges are the edge weights.

(b) Distance labeling: every vertex stores its distances to all vertices.

        v1  v2  v3  v4  v5
    v1   0   3   3   1   4
    v2   3   0   6   4   6
    v3   3   6   0   4   5
    v4   1   4   4   0   3
    v5   4   6   5   3   0

(c) Landmark labeling: every vertex stores only its distances to the landmarks v2 (l1) and v3 (l2).

        v2(l1)  v3(l2)
    v1     3       3
    v2     0       6
    v3     6       0
    v4     4       4
    v5     6       5

Figure (1.3) Graph shortest-path distance problem
while the actual shortest paths may be computed later. Meanwhile, the distances do not
update frequently or do not need to support real-time updates. For example, to rank
POIs for recommendation, we may just need the distances to the POIs, while the shortest
path can be computed after a POI has been chosen by the user. Also, the POI locations
do not change often. Similarly, to recommend friends for a user, we may need distances
in a social network graph that represent her social closeness to other users, but not the
shortest paths. Therefore, generating the recommendations can be done offline without
requiring real-time updates of the social network graph. Such applications are targeted
in this study.
Under such application contexts, studies (e.g., [14, 29, 58]) preprocess a graph and
build new data structures to enable fast distance queries without online shortest path
computations. Distance labeling is commonly used in these studies. The basic idea is to
precompute a vector of (distance) values for each vertex as its distance label. At query time,
only the distance labels of the two query vertices are examined to derive their distance,
which is simpler than shortest path computation. In an extreme case, the distance label
of every vertex consists of its distances to all other vertices (cf. Figure 1.3b). A distance
query is answered by a simple lookup in O(1) time, but this requires O(n2) space to store
all the distance labels. Various labeling approaches (e.g., 2-hop labeling [29] and highway
labeling [58]) are proposed to reduce the distance label size.
Hub labelling [29] is a representative labelling approach. It labels every vertex with its
distances to vertices on its shortest paths to all the other vertices. The vertices used for
labelling are called hubs. The hubs are chosen such that there is at least one hub on the
shortest path of every pair of vertices. Every vertex only stores its distances to the hubs
on its shortest paths to the other vertices. Since some of these shortest paths may share
the same hub, hub labelling can produce distance labels with smaller sizes. At query
time, the distance labels of the two query vertices are scanned to find their shared hub,
which must be a vertex on their shortest path. The distances to this hub are summed up
and returned as the query answer. This approach has been shown to be query efficient,
but its worst-case space cost is still O(n2) [57].
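The query step can be sketched as follows, assuming each label is stored as a map from hub vertices to distances. The labels below are hypothetical, chosen to match the graph of Figure 1.3, where v4 lies on the shortest path between v1 and v5:

```python
# A minimal sketch of hub-label querying (illustrative; not the thesis implementation).
# Each vertex stores a label: a dict mapping hub vertices to distances.

def hub_query(label_u, label_v):
    """Derive the distance from two hub labels: the minimum of
    d(u, h) + d(h, v) over all hubs h shared by both labels."""
    shared = set(label_u) & set(label_v)
    if not shared:
        return float("inf")  # no common hub: the distance cannot be derived
    return min(label_u[h] + label_v[h] for h in shared)

# Hypothetical labels for Figure 1.3 with v4 as a hub on the v1-v5 shortest
# path: d(v1, v4) = 1 and d(v5, v4) = 3.
label_v1 = {"v4": 1}
label_v5 = {"v4": 3}
print(hub_query(label_v1, label_v5))  # 4, the exact distance
```

Because the hub lies on the actual shortest path, the summed distances recover the exact answer; this is precisely the property that the O(n2) worst-case label size pays for.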
To avoid the O(n2) space cost, approximate techniques are proposed [25, 92], among
which landmark labeling [49, 77, 90] is a representative approach. The landmark labeling
approach chooses a subset of k (k ≪ n) vertices as the landmarks. Every vertex vi
stores its distances to these landmarks as its distance label, i.e., a k-dimensional vector
〈d(vi, l1), d(vi, l2), . . . , d(vi, lk)〉, where l1, l2, . . . , lk ∈ L represent the landmarks and d(·)
represents the distance. At query time, the distance labels of the two query vertices vi and
vj are scanned, where the distances to the same landmark are summed up. The smallest
distance sum, i.e., min{d(vi, l) + d(vj, l) | l ∈ L}, is returned as the query answer (for
undirected graphs). In Figure 1.3, v2 and v3 are chosen as the landmarks (denoted by l1 and l2,
respectively), and the distance labels are shown in Figure 1.3c. The distance between v1
and v5 is computed as min{d(v1, l1) + d(v5, l1), d(v1, l2) + d(v5, l2)} = min{3 + 6, 3 + 5} = 8,
which is twice as large as the actual distance between v1 and v5 (i.e., 4). As the example
shows, even though landmark labeling reduces the space cost to O(kn), it may not return
the exact distance between vi and vj when their shortest path does not pass any land-
mark. How the landmarks are chosen plays a critical role in the algorithm accuracy. Since
finding the k optimal landmarks is NP-hard [77], heuristics are proposed [38, 77, 89] such
as choosing the vertices that are on more shortest paths as the landmarks.
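Using the labels of Figure 1.3c, the landmark query can be sketched as follows (a minimal illustration for an undirected graph; each label is a k-dimensional list of distances to the landmarks):

```python
# Landmark-labeling query sketch using the Figure 1.3c labels (v2 = l1, v3 = l2).

def landmark_query(label_u, label_v):
    # Upper bound on d(u, v): the minimum over landmarks l of d(u, l) + d(v, l).
    return min(du + dv for du, dv in zip(label_u, label_v))

labels = {
    "v1": [3, 3], "v2": [0, 6], "v3": [6, 0], "v4": [4, 4], "v5": [6, 5],
}
print(landmark_query(labels["v1"], labels["v5"]))  # 8, while the actual distance is 4
```

The overestimate for the (v1, v5) pair arises exactly as described above: neither landmark lies on their shortest path.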
1.3 Contributions of the Thesis
To avoid the limitations in landmark choosing and to preserve more distance information
in the distance labels, in this study, we propose a representation learning based approach
to learn an embedding for every vertex as its distance label. Our idea is motivated by the
recent advances in learning graph embeddings. Studies [22, 23, 48] show that vertices
can be mapped into a latent space where their structural similarity (e.g., the number of
common neighboring vertices) can be computed. This motivates us to map the vertices
into a latent space to compute their spatial similarity, i.e., shortest-path distances.
Our learned embeddings do not rely on any particular landmarks, nor do they discriminate
against vertices whose shortest paths do not pass through any landmark. Thus, our embeddings may
yield more accurate distance predictions for such vertices, while we retain a low space
cost. These will be verified by an experimental study on real-world graphs (Chapter 5).
To learn the vertex embeddings, we first adopt existing representation learning mod-
els, including an auto-encoder model [17] and the node2vec model [48]. Given the em-
beddings of two vertices learned by these models, we train a multilayer perceptron (MLP)
to predict the distance between the two vertices. We observe that the vertex embeddings
learned by these models yield poor distance prediction accuracy. For the auto-
encoder, its learned embeddings tend to encode the average distances between the ver-
tices,1 which do not help predict the distance of two specific vertices. Node2vec encodes
the local neighborhood information rather than the global distances. Neither model receives
direct training signals from the distance prediction of two vertices when learning the embeddings
for the two vertices.
To overcome these limitations, we further propose a distance preserving vertex to vector
(vdist2vec) model for vertex embedding. Our vdist2vec model learns vertex embeddings
jointly with training an MLP to make distance predictions based on such embeddings.
This way, the vertex embeddings are guided by signals from distance predictions, which
1Auto-encoders tend to learn to reconstruct the average of all training instances when used without pre-training [52].
can better preserve the distance information.
Our vdist2vec model aims to learn an n × k matrix V, where each row is
the embedding of a vertex (recall that n is the number of vertices and k is the embedding
dimensionality). This matrix is randomly initialized. When training the vdist2vec model,
we use two n-dimensional one-hot vectors to represent two vertices vi and vj for which
the distance is to be predicted. These two vectors are multiplied by V separately, which
fetches the two k-dimensional vectors vi and vj (i.e., the embeddings) corresponding to
vi and vj in V. Vectors vi and vj are then concatenated into a 2k-dimensional vector and
fed into an MLP to predict the distance between vi and vj . The optimization goal here is
to minimize the difference between the predicted distance and the actual vertex distance.
The prediction errors are propagated back to update vi, vj, and the MLP.
Once our model is trained, when a distance query comes with two query vertices vi
and vj , we just need to fetch vi and vj from V and feed them into the trained MLP to
predict the distance between vi and vj .
In summary, our study makes the following contributions:
• We propose a learning based approach to predict vertex distances without the need
to choose a particular set of landmarks for distance labeling. Our approach has
an O(k) distance prediction time cost and an O(kn) space cost, where k is a small
constant denoting the vertex embedding dimensionality.
• We adopt existing representation learning techniques and study their limitations.
To address those limitations, we further propose to learn vertex embeddings while
jointly training an MLP to predict vertex distances based on such embeddings. Our
model is simple and efficient, since it is based on one-hot vectors and an MLP.
Our model is also highly accurate, since the embeddings are guided by distance
predictions directly.
• To further optimize the performance of our model, we propose a novel loss function
and an ensembling based network structure that guide the model learning to suit
the characteristics of the underlying data. We also discuss how to scale our model to
larger graphs, to handle graph updates, and to extend to other graph applications.
• We perform experiments on real road networks, social networks, and web page
graphs. The experimental results confirm the superiority of our proposed approaches.
Compared with state-of-the-art approximate distance prediction approaches, our
approach reduces both the mean and the maximum distance prediction errors, and
the advantage is up to 97%.
• To examine the general applicability of our model, we further perform link predic-
tion and graph reconstruction experiments on social networks. The results show
that our distance guided embeddings are also effective in these applications.
1.4 Outline of the Thesis
The rest of the thesis is organized as follows.
1. In Chapter 2, we review the related work on shortest-path distance computation
models and graph embedding models. We discuss both exact distance computa-
tion models and approximate distance computation models. We describe how each
model works and analyze their advantages and limitations. In addition, we discuss
applying graph embedding methods to the shortest-path distance problem as well as
other graph applications.
2. In Chapter 3, we formulate our problem and present a two-stage solution frame-
work. This framework allows us to adapt existing representation learning tech-
niques to learn vertex embeddings for vertex distance predictions.
3. In Chapter 4, we further propose a single-stage solution. We describe our proposed
model in detail, including the model structure, the loss function, the model optimizations,
and how to scale our model to large graphs and to handle graph updates.
We also discuss how to adapt our model to other graph applications such as link
prediction and graph reconstruction.
4. In Chapter 5, we present experimental results under various settings and examine
the impact of embedding dimensionality, MLP structure, graph updates, and loss
function. We also show the effectiveness of the proposed model in applications
such as POI recommendations, link prediction, and graph reconstruction.
5. In Chapter 6, we conclude the thesis with a discussion on the future work.
Chapter 2
Related Work
In this section, we discuss four lines of related studies: exact shortest-path distance com-
putation, approximate shortest-path distance computation, graph embedding, and graph
embedding applications.
2.1 Exact Distance Computation
To compute shortest-path distances, the first step is to compute the shortest paths. Two
classic shortest-path algorithms are Dijkstra’s algorithm [32] and the Floyd-Warshall algo-
rithm [36]. Dijkstra’s algorithm [32] is a single-source shortest path (SSSP) algorithm that
computes the shortest paths from a given source vertex to all the other vertices in a
graph. The Floyd-Warshall algorithm [36] is an all-pair shortest path (APSP) algorithm that
computes the shortest paths between all vertex pairs in a graph. These two algorithms
have O(m + n log n) [37] and O(n3) time costs, where m and n are the numbers of edges
and vertices, respectively. More recent algorithms such as contraction hierarchies [44] re-
duce the time costs via adding shortcut edges (i.e., shortest paths between some vertices).
Once a shortest path is computed, the corresponding distance can be derived by simply
summing up the edge weights on the path. For efficient distance query processing, these
algorithms may be run to precompute the distance between every pair of vertices. Then,
a distance query can be answered by a simple lookup in O(1) time. Such an approach,
however, has a high space cost, i.e., O(n2).
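As a concrete reference point, Dijkstra's algorithm can be sketched with a binary heap. The small weighted graph below is hypothetical, used only to exercise the sketch:

```python
import heapq

# A sketch of Dijkstra's SSSP algorithm over adjacency lists
# (illustrative; the graph below is hypothetical, not from the thesis).
def dijkstra(graph, source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("a", 1), ("c", 2)],
    "c": [("a", 4), ("b", 2), ("d", 1)],
    "d": [("c", 1)],
}
print(dijkstra(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': 4}
```

The distance of a single pair falls out as one entry of this result; answering many online queries this way is exactly the time cost the labeling approaches below aim to avoid.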
To reduce the space cost while retaining a high query time efficiency, a stream of
studies [14, 24, 29, 43, 58] precompute a distance label for every vertex vi. The distance
label of vi contains the distances to a subset of vertices Vi = {vi1, vi2, . . . , vik} ⊆ V in the
form of 〈(vi1, d(vi, vi1)), (vi2, d(vi, vi2)), . . . , (vik, d(vi, vik))〉. Here, V represents the full
vertex set of a given graph, and k may vary for different vertices. The distance of two
query vertices vi and vj is derived from their distance labels as:1

d(vi, vj) = min{d(vi, v) + d(vj, v) | v ∈ Vi ∩ Vj} (2.1)
For directed graphs, the labels further store the direction information. For example, in
Hub Labeling [10], each vertex v has two labels Lain(v) and Laout(v): Lain(v) stores the
distances from k other vertices to v, while Laout(v) stores the distances from v to k other
vertices. To obtain exact shortest-path distances, at least one vertex on the shortest path
of vi and vj must be in the distance labels of both vi and vj . Otherwise, only approximate
distances may be produced. A key challenge is then to compute a minimum set of vertices
that cover all the shortest paths, so as to minimize the distance label size. Finding the
minimum average label size for a graph is an NP-hard problem [29]. Different heuristics
have been proposed such as pruned landmark labeling [14], multi-hop distance labeling [24],
IS-Label [43], and highway labeling [58]. These techniques are verified empirically on real
graphs. However, their worst-case label size is still O(n2) for general graphs [57].
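The label-based query of Equation 2.1 can be sketched as follows, assuming each distance label is represented as a dict that maps a label vertex to its precomputed distance (the representation and function name are ours, for illustration):

```python
# Minimal sketch of answering a distance query from two distance labels.
import math

def label_query(label_i, label_j):
    # min over common label vertices of d(vi, v) + d(vj, v), per Equation 2.1
    best = math.inf
    for v, d_iv in label_i.items():
        d_jv = label_j.get(v)
        if d_jv is not None:
            best = min(best, d_iv + d_jv)
    return best

# two labels sharing the hub vertex 'h'
La_i = {'h': 3, 'a': 1}
La_j = {'h': 4, 'b': 2}
label_query(La_i, La_j)   # 7
```

If the labels share no vertex, the query returns infinity, which corresponds to the case where no vertex on the shortest path is covered by both labels.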
Pruned landmark labeling [14] uses a pruned breadth-first search (BFS) to build up
the labels. It is based on the naive landmark labeling. Naive landmark labeling runs BFS for
every vertex v and records the distances from all the other vertices to v as its label. It has
an O(n2) label size and an O(n2 + nm) time complexity. Pruned landmark labeling prevents
vertices that would yield larger distances from being added to the labels. It first computes,
for a vertex l1 chosen at random (or in some predefined order), its distances to all the other
vertices and adds l1 to the label of every vertex. Let La1 be the set of distance labels obtained after pro-
cessing l1, and dq(v, u, La1) be the distance between any two vertices v and u computed
from La1. The pruned landmark labeling algorithm then selects another vertex l2 and
runs BFS on it, while using La1 to filter some of the vertices from the search as follows.
When traversing from l2 to a vertex u with distance δ, the algorithm compares δ with
dq(l2, u, La1). If δ is larger, l2 has a shorter path to u via l1. Thus, u will be pruned from
1Assuming an undirected graph. Same for Equation 2.2.
the BFS. It will not be added to the label, and the search process will not continue from
it. The pruned landmark labeling algorithm repeats this process for all the vertices. Since
some of the vertices are pruned, pruned landmark labeling has a smaller label size and a
shorter computation time than the naive landmark labeling.
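The pruned BFS described above can be sketched for an unweighted, undirected graph given as adjacency lists. This is a simplified illustration (the vertex processing order is arbitrary here, and the helper names are ours, not from [14]):

```python
# Simplified sketch of pruned landmark labeling.
from collections import deque
import math

def label_distance(labels, u, v):
    # distance estimate from the labels built so far
    common = labels[u].keys() & labels[v].keys()
    return min((labels[u][w] + labels[v][w] for w in common), default=math.inf)

def pruned_landmark_labeling(adj):
    labels = {v: {} for v in adj}
    for root in adj:                       # each vertex takes a turn as a landmark
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            # prune u if the existing labels already certify a path that is
            # no longer than the current BFS distance
            if label_distance(labels, root, u) <= dist[u]:
                continue
            labels[u][root] = dist[u]      # add root to u's label
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
    return labels

adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
labels = pruned_landmark_labeling(adj)
```

On this small path graph, the resulting labels answer all pairwise queries exactly, e.g., `label_distance(labels, 'a', 'c')` returns 2.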
Multi-hop labeling [24] is based on 2-hop distance labeling [29]. The 2-hop distance
labeling technique computes a distance query from v to u by summing up the distance
from v to a hub vertex w (first hop) and the distance from w to u (second hop). The dis-
tances between all vertices and hub vertices (such as d(u,w) and d(w, v)) are precomputed
and stored as labels. For each vertex v, we denote its label set as La(v). For multi-hop
labeling, it gives each vertex a label and a parent vertex Pa(v). For query processing,
if La(v) ∩ La(u) = ∅, La(Pa(v)) and La(Pa(u)) are checked recursively, until a common
hub is found in the labels. This effectively builds a hierarchical structure on the vertices,
where the distance label of a vertex can be used by all its descendant vertices and does not
need to be stored in multiple copies. This way, the overall label set size is reduced. To
build this hierarchical structure, a tree decomposition [79] is computed which is guided
by the vertex degrees.
IS-labeling [43] again uses a hierarchical structure. In its labeling process, IS-labeling
removes sets of independent vertices recursively and stores the removed vertices in the
labels. A set of vertices are independent if there is no edge between any vertices in the set.
For example, as shown in Figure 2.1, {v1, v4, v5} is an independent set. When removing
an independent set, replacement edges are added to keep the remaining vertices connected
and their shortest-path distances unchanged. For example, if v1 in Figure 2.1 is removed, an
edge is added between v2 and v3 with length 6. In addition, v2 and v3 will be regarded as
parent vertices of v1, and v1 is stored in the labels of v2 and v3. The procedure continues
until there is only one vertex left. This builds up a hierarchical structure over the vertices.
In this structure, the ancestor vertices act similarly to hub vertices. When querying the
distance of two vertices, the query algorithm checks if they share a common ancestor.
The optimization problem for IS-labeling then becomes minimizing the steps needed for
constructing the hierarchy (to create fewer ancestors). As this is an NP-hard problem,
an approximate approach was proposed that removes the vertices with smaller degrees
first [43].

Figure (2.1) A graph example
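One removal step with replacement edges can be sketched as follows, assuming the graph is stored as a dict of dicts mapping each vertex to its weighted neighbors (the representation and the function name are ours, for illustration):

```python
# Hedged sketch of removing one vertex while preserving shortest-path
# distances among the remaining vertices, as in IS-labeling.
def remove_with_replacement(graph, v):
    neighbors = graph.pop(v)
    for u in neighbors:
        del graph[u][v]
    for u, du in neighbors.items():
        for w, dw in neighbors.items():
            if u == w:
                continue
            # a replacement edge is needed only if going through v was shorter
            if du + dw < graph[u].get(w, float('inf')):
                graph[u][w] = graph[w][u] = du + dw
    return neighbors        # v's former neighbors become its parent vertices

# mirroring the example above: removing v1 adds a length-6 edge (v2, v3)
g = {'v1': {'v2': 3, 'v3': 3}, 'v2': {'v1': 3}, 'v3': {'v1': 3}}
parents = remove_with_replacement(g, 'v1')
```

After the call, `g['v2']['v3']` is 6, and v1 is recorded under its parents v2 and v3, matching the example in the text.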
Highway labeling [58] chooses sets of connected vertices and their edges as “high-
ways”, and computes labels that store each vertex’s distances to those “highways”. A shortest-
path distance query can then be answered by summing up both query vertices’ distances
to their shared highway and the length between their highway entrance and exit. Dur-
ing construction, this technique identifies vertices that appear in few other vertices’
k-nearest-neighbor sets (where k is a system parameter). Such vertices are
used as the highway vertices. For example, if we set k to 5, for a shortest path pu,v from
vertex u to vertex v, a vertex c on pu,v will be considered as a highway vertex only
if it is included in neither the set of the 5 nearest vertices of u nor that of v.
These heuristic methods reduce the label size on various datasets. For example, high-
way labeling has a relatively small label size on road networks, while it may not be op-
timal for complex networks. The worst-case space costs are still O(n2) [57]. The hop-
doubling labeling [57] technique achieves an O(hn) space cost where h is a small constant.
It assumes scale-free graphs rather than general graphs.
2.2 Approximate Distance Computation
Approximate shortest-path distance algorithms trade distance accuracy for further re-
ducing the space cost [74, 88]. Most approximate algorithms take a landmark based
approach [77, 89, 90], where a subset of k (k ≪ n) vertices are chosen as the landmarks,
denoted by L (L ⊂ V ). For every vertex vi, its distances to the landmarks are precom-
puted, which form the distance label of vi. This reduces the space cost to O(kn). At
query time, the landmark l ∈ L that is the closest to the two query vertices vi and vj is
computed, and the sum of its distances to vi and vj is returned, i.e.,
d(vi, vj) ≈ min{ d(vi, l) + d(vj , l) | l ∈ L }    (2.2)
The accuracy of a landmark based algorithm depends on how close the landmarks are to
the shortest paths. It is shown [26, 77] that landmarks at graph center help obtain a higher
accuracy. Intuitively, landmarks at graph center may be passed by more shortest paths
than those at graph boundary. Since finding the k optimal landmarks is NP-hard [77],
heuristics such as degree [39], betweenness [38], and closeness [77] are used to mea-
sure the centrality of the vertices and to choose the landmarks. Here, the degree of a
vertex is the number of edges connected to it. A vertex with a larger degree has a higher
chance to be passed by more (shortest) paths. Degree centrality guided landmark selec-
tion could be effective in datasets that have a hierarchical structure, e.g., a web
graph where a home page links to all the other pages. Betweenness measures
the number of shortest paths passing through a vertex, while closeness measures the av-
erage distance of a vertex to all other vertices, i.e., if a vertex’s average distance to all the
other vertices is small, the vertex is likely to be located at the graph center. Betweenness
is shown to outperform closeness [89]. It is used as one of our baselines. The centrality
heuristic may be suboptimal for query vertices near each other – the shortest paths of such vertices may
not pass the graph center. Another problem of centrality based landmark selection is that
the selected landmarks are usually close to each other [89]. As a result, many vertices
would be far away from the selected landmarks while being close to each other. The pre-
dicted distances between these vertices may then be much larger than the exact distances. This
issue is also observed in our experiments. Figure 2.2 shows landmarks selected by two
strategies, betweenness centrality and degree centrality, on a Chinese city (Dongguan)
road network [4]. As shown in Figure 2.2, degree centrality is better than betweenness
centrality on the DG graph as it covers more areas in the graph. However, on a Facebook
politician network [3], betweenness centrality has much smaller errors than degree cen-
trality. Detailed experimental results on these methods are presented in Chapter 5. Takes
and Kosters [89] propose an adaptive landmark selection strategy that balances between
centrality and coverage. They introduce an indicator called success rate, which is the ratio
of the number of correctly predicted shortest paths by the selected landmarks to the total
number of shortest paths in the graph. They compute the accumulation of success rates
for landmarks sorted by degree centrality or betweenness centrality, and incrementally
choose the vertices that contribute the maximum success rate increments.
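A greedy selection in the spirit of the success-rate idea above can be sketched as follows. All helper names are ours, not from [89]; exact distances come from BFS on a small unweighted graph, and each step adds the candidate landmark that lets the most vertex pairs be estimated exactly:

```python
# Hedged sketch of coverage-guided incremental landmark selection.
from collections import deque
from itertools import combinations

def bfs_distances(adj, s):
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def greedy_landmarks(adj, k):
    dists = {v: bfs_distances(adj, v) for v in adj}
    pairs = list(combinations(adj, 2))

    def success(landmarks):
        # pairs whose landmark estimate equals the true distance
        return sum(1 for u, v in pairs
                   if any(dists[l][u] + dists[l][v] == dists[u][v]
                          for l in landmarks))

    chosen = []
    for _ in range(k):
        best = max((v for v in adj if v not in chosen),
                   key=lambda v: success(chosen + [v]))
        chosen.append(best)
    return chosen

adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
greedy_landmarks(adj, 1)   # the middle vertex 'b' covers all pairs exactly
```

This brute-force sketch recomputes the coverage for every candidate and is only meant to convey the selection criterion, not an efficient implementation.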
(a) Degree Centrality (b) Betweenness Centrality
Figure (2.2) Landmark distribution on Dongguan
Sankaranarayanan and Samet [84, 85] adopt the method of well-separated pair decompo-
sition (WSPD) [86] to cluster vertices. For each cluster, they select a representative vertex.
The distances between vertices in different clusters are approximated by the distance be-
tween their representative vertices. They state that the paths between vertices in two
different clusters could have a long shared part. For example, the path that we drive
from the CBD of city A to city B may be similar to the path that we drive from a suburb
of city A to city B, i.e., the same highway is used, while the paths to get to the highway
might be different. They use a point-region Quadtree [35, 70, 83] to store vertices based
on their geo-coordinates and check recursively if the nodes in the tree structure are well
separated. As vertices with larger Euclidean distances to each other are more likely to
be well separated, applying the Quadtree structure helps find a decomposition with
fewer sets. A random vertex in each set in the decomposition is chosen as the representative
vertex, and the distances from vertices in set A to vertices in set B are estimated by
the distance between their representative vertices. This method bounds the error such that
(1 − ε)d(u, v) ≤ d̃(u, v) ≤ (1 + ε)d(u, v), where d̃(u, v) is the estimated distance and ε is a
parameter defined based on the well-separated condition in the precomputing process. A
smaller ε means lower errors but more sets in the decomposition and a higher space cost.
This method is designed to
optimize the relative error but not the absolute error.
For query processing, Goldberg uses a landmark based upper bound to constrain the
search area of Dijkstra’s algorithm and reduce its running time [45]. The idea is that if a
graph traversal reaches a vertex with a distance larger than the landmark-based estima-
tion, then the traversal can be terminated from traversing beyond that vertex. Gubichev
et al. [49] improve the landmark based methods’ accuracy by removing cycles on the
path. They first concatenate the path from the source vertex to the landmark and the
path from the landmark to the target vertex. They then delete the vertices that have been
visited twice. However, this method needs to store not only the distance but also the
vertices on the paths to the landmarks. It increases the space complexity as well as the
query cost due to finding the cycles on the paths [88].
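The cycle-removal idea can be sketched as follows. The implementation details (a single left-to-right scan that cuts each detected cycle back to the first visit of the repeated vertex) are ours, not a verbatim rendering of the algorithm in [49]:

```python
# Sketch: shorten a concatenated source->landmark->target path by
# cutting out the segment between two visits of the same vertex.
def remove_cycles(path):
    out, pos = [], {}
    for v in path:
        if v in pos:
            # v was seen before: drop the cycle back to its first visit
            for w in out[pos[v] + 1:]:
                del pos[w]
            del out[pos[v] + 1:]
        else:
            pos[v] = len(out)
            out.append(v)
    return out

# s -> a -> b (to the landmark) concatenated with b -> a -> t
remove_cycles(['s', 'a', 'b', 'b', 'a', 't'])   # ['s', 'a', 't']
```

The shortened path yields a distance estimate no larger than the plain landmark sum, which is why the method improves accuracy at the cost of storing the paths.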
Theoretical results (e.g., [25, 74, 92]) are offered to bound the relationship between the
distance label size and the distance approximation accuracy. In particular, the (worst-
case) distance approximation accuracy is often referred to as the stretch. An algorithm is
said to have a stretch of (α, β) if its distance approximation di,j for any two vertices vi and
vj satisfies d(vi, vj) ≤ di,j ≤ α · d(vi, vj) + β, where α ≥ 1 and β ≥ 0 [88]. For general
graphs, we usually consider β to be 0 and focus on α, and α is effectively the maximum
relative error. On undirected graphs, Thorup and Zwick [92] show that any algorithm
with an approximation ratio (i.e., stretch) of α < 2c + 1 (c ∈ N+) must use Ω(n^{1+1/c})
space based on Erdős’ Girth Conjecture [33]. They construct a structure using O(cn^{1+1/c})
space and O(cmn^{1/c}) time to obtain an approximation ratio of α = 2c − 1 and an O(c)
query time. Their model constructs “balls” with fixed and limited diameters that cover
vertices. A shortest-path distance query will be answered by the ball with the smallest
diameter that covers both vertices. Given the same α = 2c − 1, Chechik [25] improves the
space cost to O(n^{1+1/c}) and the query time to O(1), with an increased preprocessing time
of O(n^2 + mn^{1/2}) (recall that m is the number of edges). These studies are mainly of theoretical
interest. No empirical results are presented in them. Das Sarma et al. [30] implement a
simplified version of Thorup and Zwick’s algorithm. They retain the O(cn^{1+1/c}) space
cost. Taking the smallest value of c = 1, this algorithm still has an O(n2) space cost.
This algorithm will not be discussed further as we aim for a space cost linear to n for
scalability considerations.
For a comprehensive review on exact and approximate distance algorithms, inter-
ested readers are referred to [88].
2.3 Graph Embedding
Another stream of studies embed a graph into a latent space to compute vertex distances
with a lower cost, e.g., via Euclidean distance or cosine similarity. Metric embedding [9,
18, 53] and matrix factorization [13, 15, 22, 54, 71, 82] (e.g., over the adjacency matrix of
an input graph) are used for this purpose in earlier studies. For example, Locally Linear
Embedding (LLE) [82] assumes that the vector of a vertex in the embedding space is the sum
of the vectors of its neighbours. It can be defined as Zi = ∑_j Yij Zj , where Y is the adjacency
matrix, and Z is the embedding matrix. Notice that Yij is 1 when vertex vi and vertex vj
are connected, while Yij is 0 when they are not. Therefore, only vectors of neighbours of
vi are added to compute Zi. For each vertex, LLE aims to reduce the difference between
its corresponding vector and the sum of the vectors of its neighboring vertices. For all
vertices, it minimizes ∑_i ||Zi − ∑_j Yij Zj ||². For another example, Laplacian Eigenmaps
(LE) [16] embeds vertices that share an edge with a large weight (i.e., closely related,
“small distance” in our setting) to be close in the latent space. It minimizes ∑_{i,j} ||Zi − Zj ||² Wij ,
which uses the weight Wij as a penalty in the embedding space. A pair of vertices with a
large weight is weighted more in the error, which forces them to be closer. LLE and LE
can be solved as an eigenvalue problem, and both of their computation complexities
are O(|E|k2) where k is the number of dimensions [46]. Ahmed et al. proposed Graph
Factorization (GF) that represents the edge Yij with the inner product of the corresponding
vertex vectors, 〈Zi, Zj〉 [13]. The loss function is defined as below, where λ is a regularization
parameter:

f(Y, Z, λ) = (1/2) ∑_{i,j} (Yij − 〈Zi, Zj〉)² + (λ/2) ∑_i ||Zi||²

It has an O(|V |3) time complexity as the inner products of all vertex pairs are computed.
GraRep [22] captures k-step relational information of vertices. It defines the first-step transi-
tion matrix as A = D−1Y , which indicates the probability of travelling from one vertex to
another, where Y is the adjacency matrix, and D is a diagonal matrix with Dii = ∑_j Yij .
The transition matrix after k transitions is computed as the k-th power of A. Their goal
is to build an embedding matrix that can quickly estimate the transition matrix. Ou et
al. introduced the High Order Proximity preserved Embedding (HOPE) which has a similar
structure, while using a similarity matrix rather than a transition matrix [71].
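The transition matrix construction above can be illustrated on a tiny adjacency matrix. Plain Python lists are used instead of a linear algebra library for self-containment, and the function names are ours:

```python
# Illustrative sketch of A = D^{-1} Y and its k-step power, as in GraRep.
def transition_matrix(Y):
    # divide each row of the adjacency matrix by the vertex's degree
    return [[y / sum(row) for y in row] for row in Y]

def matrix_power(A, k):
    n = len(A)
    R = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(k):
        R = [[sum(R[i][t] * A[t][j] for t in range(n)) for j in range(n)]
             for i in range(n)]
    return R

Y = [[0, 1], [1, 0]]        # two vertices joined by a single edge
A = transition_matrix(Y)    # one-step transition probabilities (rows sum to 1)
A2 = matrix_power(A, 2)     # after two steps the walk is back where it started
```

On this two-vertex graph, the two-step transition matrix is the identity, since every walk of length two returns to its starting vertex.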
More recent studies use deep learning to learn embeddings that preserve certain at-
tributes of a graph. The most relevant attributes studied are the first-order and second-
order proximities, i.e., edge weights and vertex neighborhood similarity. Random walk
and auto-encoder are used to learn the embeddings. The random walk technique [48, 75]
samples the graph as a discrete distribution and embeds the sample into vectors. For
example, node2vec [48] runs random walks on a graph to generate sequences of vertices,
which are treated as “sentences” to learn vertex embeddings using word2vec [67]. The
learned embeddings preserve the vertex neighborhood information (i.e., vertices on the
same sequence). Node2vec only needs to observe a small portion of the graph for each
sampling, with an O(|V |k) time complexity. This is beneficial when the entire graph is too
large to be processed together.
An auto-encoder is formed by an encoder and a decoder. The encoder maps the in-
put into a latent representation, while the decoder aims to map the latent representa-
tion back to the original input. By minimizing the difference between the encoder input
and the decoder output, the encoder is trained to generate latent representations that
preserve the input information. Following this idea, auto-encoders are used to learn
graph embeddings. For example, Deep Neural Networks for Learning Graph Representations
(DNGR) [23] uses random surfing which resembles random walks to generate a proba-
bility co-occurrence matrix. This matrix contains estimations of the transition probability
between the vertices. It is then transformed to a positive pointwise mutual informa-
tion (PPMI) [27] matrix and fed into an auto-encoder to learn the embeddings. Another
auto-encoder based model named Structural Deep Network Embedding (SDNE) [96] takes
a graph adjacency matrix as its input. It adapts the loss function to preserve both the
first-order and second-order proximities: (i) it adds a penalty when vertices nearby are
mapped far away in the embedding space (first-order proximity), and (ii) it penalizes
more on errors in the output corresponding to non-zero elements in the input (second-
order proximity). DNGR and SDNE use a |V |-dimensional vector as the input (e.g., a
row of the adjacency matrix) which could be costly in computation. To address this lim-
itation, Kipf and Welling [63] proposed a Graph Convolutional Network (GCN) that uses a
convolutional method to learn the embedding of a vertex in multiple aggregating itera-
tions. In each iteration, the embedding of each vertex is a combination of its neighbors’
embeddings. As it takes only local neighborhood information for training, it reduces the
time complexity to O(|E|k2) (k is the embedding dimensionality) compared to DNGR
(O(|V |2)) and SDNE (O(|V ||E|)).
These models are not designed to learn global vertex distances. Nevertheless, they
can be adapted to predict vertex distances (detailed in Section 3.2). We compare with
them in our experiments to highlight the advantage of our proposed model in preserving
vertex distance information.
For a comprehensive review on graph embedding techniques, interested readers are
referred to [21, 46, 50].
2.4 Other Graph Embedding Applications
Graph embeddings have been applied to a rich set of applications. We briefly discuss a
few of them, including graph compression (reconstruction), link prediction, and graph
visualization.
Graph compression is first introduced by Feder and Motwani [34] who propose to
build a graph representation with fewer edges than the original graph to accelerate graph
analysis algorithms. After that, studies [72, 93, 94] propose graph compression methods
based on grouping similar vertices. Graph embeddings as a graph representation may
be used in graph compression if we can reconstruct the graph based on the embeddings
with a high accuracy.
Link prediction aims to find missing links or predict future links in graphs based
on the observed graph structure. It has many applications. For example, in social net-
work graphs, link prediction can be used to find potential relationships for friend recom-
mendation and advertising. To achieve this goal, one way is to compute vertices’ simi-
larity and predict probable links among them [12, 62]. Other methods such as maximum
likelihood methods [20, 28] and probabilistic methods [42, 51, 99] solve the problem from
a statistical viewpoint. Graph embeddings map vertices into a latent space so that their
similarity can be computed. For example, we can use Euclidean distance of the vertex
embeddings to describe their similarity.
Visualizing a graph in a proper way helps viewers gain information about a graph
conveniently and quickly. It has many applications in different fields where graph data
are used [31, 40, 60, 91]. Embedding representations can be fed into a dimensionality
reduction model such as Principal Component Analysis (PCA) [73], and then be visual-
ized in a Euclidean space. The Euclidean distances between the vertices can demonstrate
their hidden relationship clearly.
2.5 Summary
In this chapter, we reviewed methods for shortest-path distance computation, including
landmark based methods, distance labeling methods and graph embedding methods.
By comparing these methods, we obtain a clearer view about their advantages and lim-
itations. Distance labeling methods may have a high accuracy, while their space cost is
O(n2) in the worst case. Landmark based methods have a linear space cost (O(kn)) but
potentially a lower accuracy. Graph embedding methods also have a linear space cost
(to embedding dimensionality), and they have advantages in query speed (i.e., parallel
vector processing). However, it is challenging to keep both the global and local distance
information of a graph in the embedding vectors. These motivate us to develop a learning
based approach that retains the advantages of graph embeddings in query efficiency
while overcoming the challenges in preserving the distance information.
Chapter 3
Adapted Two-Stage Models
This chapter presents our problem solutions by adapting existing representation learning
techniques. We start with basic concepts and a problem definition in Section 3.1. We then
present a two-stage framework for adapting representation learning techniques to solve
our problem in Section 3.2. We adapt existing representation learning models to make
distance predictions and show their limitations in Section 3.3.
3.1 Problem Formulation
We consider a graph G = 〈V,E〉, where V is a set of vertices and E is a set of edges. An
edge ei,j ∈ E represents a connection between two vertices vi and vj ∈ V . Each edge ei,j
is associated with a weight denoted by ei,j .w, which represents the cost (i.e., distance) to
travel across the edge. For simplicity, in what follows, our discussions assume undirected
edges, i.e., one can travel in both directions on ei,j with the same cost ei,j .w, although
our proposed techniques work for both directed and undirected edges.
Given two vertices vi and vj in G, a path pi,j between vi and vj consists of a sequence
of vertices vi → v1 → v2 → ... → vx → vj starting from vi and ending at vj , such that
there is an edge between any two adjacent vertices in the sequence. The length of pi,j ,
denoted by |pi,j |, is the sum of the weights of the edges between adjacent vertices in pi,j :
|pi,j |= ei,1.w + e1,2.w + ...+ ex,j .w (3.1)
Among all the paths between vi and vj , we are interested in the one with the smallest
length, i.e., the shortest path. Let such a path be p∗i,j .

Figure (3.1) Solution framework. (a) Vertex representation learning: a representation
learning network maps the vertices v1, . . . , v5 of graph G to k-dimensional vectors. (b)
Distance predictor training (distance prediction): a distance prediction network (MLP)
takes vi and vj as input and outputs di,j .

The length of this path is the (shortest-path) distance between vi and vj , denoted by
d(vi, vj).
d(vi, vj) = |p∗i,j | (3.2)
Given the concepts above, the shortest-path distance query is defined as follows.
Definition 1 (Shortest-path distance query) Given two query vertices vi and vj that belong
to a graph G, a shortest-path distance query returns the shortest-path distance between vi and
vj , i.e., d(vi, vj).
Our aim is to provide an approximate answer for a shortest-path distance query with
a high accuracy and efficiency.
3.2 A Two-Stage Solution Framework
We take a learning based approach to answer shortest-path distance queries. Given a
graph G, we first take a two-stage procedure that allows us to adapt existing representa-
tion learning techniques to answer shortest-path distance queries:
1. Representation learning. We preprocess G by mapping each vertex vi ∈ V to a k-
dimensional vector representation vi ∈ Rk (cf. Figure 3.1a).1 The goal of this stage
is to learn vertex representations that preserve the graph distances between the
vertices, i.e., vertices that have small distances inG should also have small distances
for their learned vector representations, and vice versa.
2. Distance predictor training. We train a multi-layer perceptron (MLP) using the learned
vectors vi and vj between every pair of vertices vi and vj in V as the input and
distance d(vi, vj) as the target output (cf. Figure 3.1b). We use the mean square error
as the default loss function Ld to optimize the MLP parameters:
Ld = EP [(d(vi, vj)− di,j)2] (3.3)
Here, di,j denotes the predicted distance between vi and vj , and P denotes a dis-
tribution over V × V . In the simplest case, P is just the full set of V × V , i.e., to
optimize for every pair of vertices in G.
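The distance predictor can be sketched as a one-hidden-layer MLP with ReLU units trained by stochastic gradient descent on the squared error of Equation 3.3. This is pure Python for illustration only; a real implementation would use a deep learning library, and the toy task (predicting |a − b| from a concatenated pair of 1-dimensional "embeddings") is our own stand-in for real vertex embeddings and distances:

```python
# Minimal MLP distance predictor trained with mean squared error.
import random
random.seed(0)

def mlp_init(n_in, n_hid):
    w1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
    w2 = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
    return w1, w2

def forward(w1, w2, x):
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]  # ReLU layer
    return h, sum(w * hi for w, hi in zip(w2, h))                       # linear output

def train_step(w1, w2, x, target, lr=0.01):
    h, y = forward(w1, w2, x)
    err = y - target                        # gradient of 0.5 * squared error
    for j, hj in enumerate(h):
        grad_h = err * w2[j]
        w2[j] -= lr * err * hj
        if hj > 0.0:                        # ReLU passes gradient only when active
            for i, xi in enumerate(x):
                w1[j][i] -= lr * grad_h * xi

def mse(w1, w2, data):
    return sum((forward(w1, w2, x)[1] - t) ** 2 for x, t in data) / len(data)

# train on all "vertex pairs" of the toy task
data = [([a / 10, b / 10], abs(a - b) / 10) for a in range(10) for b in range(10)]
w1, w2 = mlp_init(2, 8)
before = mse(w1, w2, data)
for _ in range(200):
    for x, t in data:
        train_step(w1, w2, x, t)
after = mse(w1, w2, data)    # training reduces the mean squared error
```

The concatenated pair of vertex vectors plays the role of the MLP input of Figure 3.1b, and the target is the known shortest-path distance.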
At query time, given two query vertices vi and vj , their learned representations vi
and vj are fetched and then fed into the MLP trained in the stage above. The input order
of the vectors reflects the travel direction in the graph. For example, if we travel from vj to
vi instead, vj will be put in front of vi when being fed into the MLP. For directed graphs,
this can distinguish the distance difference when the travel direction changes. This also
applies to our Vdist2vec model proposed in Chapter 4. The output of the MLP is returned
as the distance query answer (cf. Figure 3.1b).
Our two-stage model framework takes advantage of the recent advances in neural
networks and representation learning to avoid online graph traversals. Since neural net-
work inference (i.e., predictions) can be done efficiently, our solution can offer query
answers with a high efficiency.
Our solution offers approximate query answers, the accuracy of which is determined
by the quality of the learned vertex vectors. In what follows, we focus on the represen-
tation learning stage to obtain high-quality vectors that well preserve the vertex distance
information. The distance predictor training and distance prediction stages use standard
MLP training and inference procedures. They will not be detailed further.

1For directed graphs, we learn two embeddings for each vertex vi, one for vi as the source vertex and the
other for vi as the destination vertex, respectively.
3.3 Adapted Representation Learning Models
As discussed in Chapter 2, there is a recent advance in graph embedding techniques
using deep learning. We adapt two representative techniques to learn vertex vectors for
distance prediction: node2vec [48] and auto-encoders [17]. We also examine using geo-
coordinates and landmark-based distance labels for vertex representations, to test the
ability of the MLP to learn a distance prediction function based on them.
3.3.1 Node2vec
The idea of node2vec comes from word2vec [67] – a model that learns vector representa-
tions for words. In word2vec, a word w is mapped into a latent space where it is close to
its context words. Here, a context word of w is a word that appears within a predefined
distance δ from w in a given (large) text corpus. The predefined distance δ forms a context
window around w, e.g., δ = 5 means five words on each side of w in a sentence.
Under a graph setting, the given graphG can be seen as a large “corpus”, and the ver-
tices can be seen as the “words”. The “sentences” (and hence context windows) can be
generated by random walks [66] onG using (the inverse of) edge weights as the transition
probabilities. Then, the word2vec model applies directly. For model training, node2vec
uses the skip-gram [67] technique. This technique learns word vectors via optimizing a
neural network that predicts for every word w the probability of every other word to
appear in its context windows. The optimization goal is to maximize the probability of
observing all the context words of w. In node2vec, this optimization goal translates to
maximizing the log-probability of the neighborhood vertices N(vi) for a vertex vi condi-
tioned on its vector representation vi:

arg max ∑_{vi∈V} log Pr(N(vi) | vi)
= arg max ∑_{vi∈V} log ∏_{vj∈N(vi)} Pr(vj | vi)
= arg max ∑_{vi∈V} log ∏_{vj∈N(vi)} exp(vj⊤vi) / ∑_{v∈V} exp(v⊤vi)    (3.4)
This optimization function guides the model to produce similar vectors for vi and the
vertices in N(vi) (so as to maximize the dot product in the numerator). Here, N(vi)
contains the vertices passed by a random walk that goes through vi. The output of the
first hidden layer of the trained network is the vector for vi (given vi as the input).
Node2vec allows biasing the random walks towards either a breadth-first or a depth-
first walk. This can generate N(vi) that consists of vertices around some vertex or along
some path (or a combination of both). The vertex vectors learned from such N(vi) may
preserve the information on which vertices are nearby in G. We use these vectors for
distance prediction. A limitation of such vectors, however, is that they may not preserve
the distance for vertices far away. This is because vertices far away do not fall in the
same neighborhood as frequently as those nearby, and the vectors may not be optimized
to preserve their distances.
In a recent study [78], node2vec embeddings are used for shortest-path distance pre-
diction. Before the learned embeddings of the two query vertices are fed into the distance
prediction network, four different operations are explored to combine the two vertex em-
beddings, namely, subtraction, concatenation, average, and point-wise multiplication.
As shown in its experimental results, on four different data sets tested, the concatena-
tion based model yields the best distance prediction accuracy. Therefore, we also use
concatenation based models in our experiments. As that study did not specify the num-
ber of nodes in each layer of its neural network except the embedding layer, we set the
MLP structure the same as our other models, which are 100 nodes and 20 nodes in the
two hidden layers, respectively. As shown in Table 3.1, we obtain lower distance predic-
tion errors on the data sets used in [78] with our implementation. We use this as the
default setting for the node2vec model in our experiments.
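The four embedding-combination operators explored in [78] can be sketched for plain Python vectors (the function name is ours; concatenation performed best in that study):

```python
# The four operators for combining two vertex embeddings before the MLP.
def combine(vi, vj, op):
    if op == 'concatenation':
        return vi + vj
    if op == 'subtraction':
        return [a - b for a, b in zip(vi, vj)]
    if op == 'average':
        return [(a + b) / 2 for a, b in zip(vi, vj)]
    if op == 'multiplication':
        return [a * b for a, b in zip(vi, vj)]
    raise ValueError(op)

combine([1, 2], [3, 4], 'concatenation')   # [1, 2, 3, 4]
```

Note that concatenation is the only operator that preserves the input order of the two embeddings, which is what lets the predictor distinguish travel directions on directed graphs.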
Table (3.1) Node2vec Based Distance Prediction Errors

              Mean Absolute Error                Mean Relative Error
              Reported [78]  Our implementation  Reported [78]  Our implementation
FaceBook      0.258          0.237               0.099          0.097
BlogCatalog   0.275          0.255               0.119          0.109
We note that the existing study [78] focuses only on social networks. In our experi-
ments, we use node2vec to learn vertex vectors for distance predictions on road networks,
social networks, and web page graphs.
Table (3.2) Node2vec Based Distance Prediction Errors on Different Networks

        Mean Absolute Error  Mean Relative Error  Max Absolute Error  Max Relative Error
DG      2,329                0.199                45,564              1,806
MB      118                  0.161                2,953               187
SU      658                  0.175                18,917              500
FBPOL   0.411                0.094                9                   9
FBTV    0.697                0.143                16                  16
EPA     0.603                0.150                7                   7
Table 3.2 shows the node2vec approach’s performance on road networks (DG, MB, and
SU), social networks (FBPOL and FBTV), and a web page graph (EPA). As each vertex
embedding learned by node2vec is based on its neighbor vertices, it works well on high-
density graphs such as social networks and web page networks, as shown in the table. In
those graphs, the neighborhood information carries enough knowledge to support distance
prediction between vertices. In other words, distance prediction between two vertices is
based on the similarity of their neighbors.
3.3.2 Auto-encoder
Auto-encoders were originally proposed for dimensionality reduction. They are used
for representation learning in more recent studies (e.g., [17]). An auto-encoder neural
network consists of two components – an encoder and a decoder, each of which may
consist of multiple hidden layers. The encoder maps the input into a latent representation
(usually with capacity constraints, otherwise it may simply learn the identity function for
the input). The decoder aims to map the latent representation back to the original input.
The optimization goal is to minimize the difference between the encoder input and the
decoder output, so as to preserve more input information in the latent representation.
Let φ(·) and ψ(·) be the two mapping functions learned by the encoder and the decoder,
respectively. Then, the loss function La for the auto-encoder can be written as:
L_a = \sum_{i=1}^{n} \|x_i - \psi(\phi(x_i))\|_2^2    (3.5)
Here, xi represents an input instance (in a vectorized form), and ||·||2 is the L2 distance.
Once the auto-encoder is trained, the latent representation of xi is computed as φ(xi).
Figure (3.2) Auto-encoder embedding model: the all-pair shortest-path distance matrix is fed into an auto-encoder, whose latent representations feed the distance prediction network (MLP)
As shown in Figure 3.2, we adapt an auto-encoder to learn vertex representations. For
each vertex vi, we compute its distances to all vertices in V to form a distance vector. This
vector is used as the input vector xi to train an auto-encoder. Once the auto-encoder is
trained, the representation vi of vi is computed as:
vi = φ(xi) (3.6)
Vector vi is expected to preserve the vertex distances stored in xi. However, a limitation
of such an adapted model is that its input vector xi has an O(|V |) size. When there are
many vertices (e.g., millions), the auto-encoder may become too expensive to train. Also,
auto-encoders tend to learn to reconstruct the average of all training instances [52]. In
our case, vi tends to yield the average distances between the vertices, which do not help
the distance prediction.
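As a concrete sketch of this input construction, the following builds the distance vectors x_i on a toy unweighted graph via BFS (a weighted graph would use Dijkstra's algorithm instead):

```python
from collections import deque

def distance_vector(adj, source):
    """BFS shortest-path distances from `source` to every vertex of an
    unweighted graph (dict: vertex -> list of neighbors). Row i of the
    all-pair distance matrix is the auto-encoder input x_i; note its
    O(|V|) size, the limitation discussed above."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return [dist[v] for v in sorted(adj)]

# toy path graph: 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
X = [distance_vector(adj, v) for v in sorted(adj)]  # one input x_i per vertex
```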
Table (3.3) Auto-encoder Based Distance Prediction Errors on Different Networks

        Mean Absolute Error  Mean Relative Error  Max Absolute Error  Max Relative Error
DG      1,821                0.178                45,645              1,505
MB      300                  0.308                2,920               192
SU      217                  0.070                9,966               244
FBPOL   0.964                0.239                9                   4
FBTV    1.618                0.278                14                  5
EPA     0.597                0.148                5                   5
Table 3.3 shows the performance of the auto-encoder approach on road networks (DG, MB, and SU), social networks (FBPOL and FBTV), and the web page graph (EPA). The auto-encoder works well on graphs with larger diameters such as DG and SU. These graphs have a large variance in distances, which guides the auto-encoder to learn distinguishable embeddings for each vertex.
3.3.3 Geo-coordinates and Landmark Labels
Jindal et al. [59] use a deep neural network (DNN) based model to predict shortest-path dis-
tances given the geo-coordinates (longitudes and latitudes) of the vertices as input. This
can be seen as using vertex coordinates as the embeddings. The shortest-path distance of
two vertices can be proportional to their Euclidean distance in many cases. However, this
is not always true. For example, as shown in Figure 3.3, the Euclidean distance between
point A and point B is substantially smaller than their shortest-path distance. Such ob-
servations are not uncommon in real road networks crossing rivers or highways. Also,
not all graphs contain vertex coordinates, e.g., social network graphs do not, which limits
the applicability of this approach.
Figure (3.3) A road network example
Another simple way to obtain vertex representations is to use their distance labels
such as the landmark labels as described in Section 2.2. Then, we take advantage of the
capability of the MLP to learn a non-linear function to predict the shortest-path distance
based on such distance labels, rather than simply scanning the labels and summing up the
distances to the landmarks. We also use this approach as a baseline model in Chapter 5.
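For illustration, landmark labels can be computed as distance vectors to a few chosen landmarks. The sketch below uses a toy unweighted graph and arbitrary landmarks; actual landmark selection strategies are those of Section 2.2:

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def landmark_labels(adj, landmarks):
    """Each vertex's label is its vector of distances to the landmarks.
    These vectors serve as the vertex representations fed into the MLP."""
    tables = {l: bfs_distances(adj, l) for l in landmarks}
    return {v: [tables[l][v] for l in landmarks] for v in adj}

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # path graph 0 - 1 - 2 - 3
labels = landmark_labels(adj, landmarks=[0, 3])
```

The classic (non-learned) estimate would scan these labels and return the minimum of d(u, l) + d(l, v) over the landmarks l; the MLP instead learns a non-linear function of the labels.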
Table (3.4) Geodnn Based Distance Prediction Errors on Different Networks

        Mean Absolute Error  Mean Relative Error  Max Absolute Error  Max Relative Error
DG      1,566                0.092                41,376              262
MB      95                   0.097                3,563               92
SU      442                  0.108                14,916              162
FBPOL   N/A                  N/A                  N/A                 N/A
FBTV    N/A                  N/A                  N/A                 N/A
EPA     N/A                  N/A                  N/A                 N/A
Table 3.4 shows the performance of the geodnn approach on road networks (DG, MB, and SU).
As geodnn predicts the distance between two vertices based on their Euclidean distance, it works well on road networks with few detours in shortest-path traversal. For example, geodnn performs well on MB, which is a grid-shaped road network where vertices are located neatly on grid lines.
3.4 Summary
In this chapter, we presented a two-stage solution framework and adapted existing vertex representations for this framework, including node2vec, auto-encoder, geo-coordinates
(longitudes and latitudes), and landmark labels. We described each representation with
examples and analyzed their advantages and disadvantages. The two-stage solution
framework verifies the feasibility of a learning based solution for graph shortest-path dis-
tance predictions, while it also has limitations in that it separates representation learning
from distance predictions. This leads to sub-optimal vertex representations and distance
prediction accuracy. We propose a single-stage model to address this limitation in the
next chapter.
Chapter 4
Proposed Single-Stage Model
In Chapter 3, we described two-stage models where embedding learning and dis-
tance prediction are disconnected. To further improve the embedding quality, in this
chapter, we propose a single-stage model called vdist2vec that learns embeddings which
are guided directly by the distance prediction.
We detail the vdist2vec model structure in Section 4.1. We then discuss the choice of
loss functions for the model in Section 4.2. To further enhance the distance prediction
accuracy, we design a variant of the vdist2vec model using the ensembling technique in
Section 4.3. We scale vdist2vec to large graphs in Section 4.4. We cover update handling
in Section 4.5 and algorithm costs in Section 4.6.
4.1 Vdist2vec
Our vdist2vec model connects vertex representation learning with distance prediction to
form a single neural network. The model takes two vertices vi and vj as the input, learns
their representations, and predicts their distance as the targeted output. This structure
enables the distance signals from the output layer of the distance prediction network to be
propagated back to the representation learning network. Thus, the vertex representations
Part of the content of this chapter is published in
1. Jianzhong Qi, Wei Wang, Rui Zhang, and Zhuowei Zhao. A Learning Based Approach to Predict Shortest-Path Distances. International Conference on Extending Database Technology (EDBT), 2020. (CORE Ranking: A, accepted in December 2019.) The authors are ordered alphabetically.
can be learned to better preserve the distance information.
Our model structure is illustrated by Figure 4.1. In the model, the input vertices vi and
vj are each represented as a size-|V | one-hot vector. The one-hot vector of vi (vj), denoted
by hi, has a 1 in the i-th (j-th) dimension and 0’s in all other dimensions. The next layer
is an embedding layer, which is used for representation learning. This layer has k nodes,
and its weight matrix is a |V |×k (2|V |×k for directed graphs) matrix that will be used as
the vertex vectors for all vertices, denoted by V = [v_1^T, v_2^T, ..., v_{|V|}^T]^T. Multiplying h_i (h_j) by V yields v_i (v_j), i.e.,

v_i = h_i V    (4.1)
Vectors vi and vj are then fed into a distance prediction network to predict the distance
between vi and vj . Recall that the distance prediction network is an MLP where the
default loss function Ld is the mean square error on the actual vertex distances and the
predicted distances (cf. Equation 3.3).
Figure (4.1) Vdist2vec model structure: the 2|V|-dimensional one-hot layer (h_i, h_j) feeds a k-dimensional embedding layer that produces v_i and v_j; these form the MLP input layer, and the fully connected MLP hidden and output layers predict d_i,j
At training time, the vertex representation matrix V is randomly initialized. The
corresponding vertex pairs’ vectors will then be concatenated and fed into the network
in batches to train the MLP. The training loss Ld will be propagated back to optimize
the MLP and the vertex representations in V. The optimization goal is to minimize the
errors between the exact and the predicted distances. At query time,
the vertex vectors vi and vj of the query vertices vi and vj are fetched from V, and the
MLP trained as part of vdist2vec is used to make a distance prediction.
Our vdist2vec model can be adapted for other distance prediction problems on graphs. For example, we can replace the shortest-path distance in the vdist2vec output with the resistance distance [64] to train a model for predicting resistance distances. In addition, the predicted distance between two vertices is not limited to a scalar: a vector output can represent multiple distances, such as the top-k shortest-path distances. To adapt our vdist2vec model for top-k shortest-path distances, we can simply use a size-k vector as the model output while keeping the other parts of our model unchanged.
In real graph database systems such as Neo4j [97], the vertex embeddings learned by our vdist2vec model can be stored together with the vertices of the graph. This helps the system achieve faster shortest-path distance queries and easier vertex visualization.
4.2 Loss Function
Mean square error (MnSE) is one of the most commonly used loss functions in machine learning models, and it is also our default loss function Ld. This error is defined by Equation 4.2.
MnSE = \frac{1}{n} \sum_{i=1}^{n} (d(v_i, v_j) - d_{i,j})^2    (4.2)
Recall that d(vi, vj) and di,j are the ground truth and predicted distances between vi and vj, respectively.
Another commonly used loss function is mean absolute error (MnAE), which is defined
by Equation 4.3.
MnAE = \frac{1}{n} \sum_{i=1}^{n} |d(v_i, v_j) - d_{i,j}|    (4.3)
These two error measurements have a somewhat similar effect in model learning. However, MnSE is more sensitive to the variance of the errors than MnAE. For example, given a set of ground truth values D = {1, 1, 1, 1} and two sets of prediction values from two different models, D1 = {1, 2, 3, 4} and D2 = {1, 1, 4, 4}, the absolute prediction errors of the two models are E1 = {0, 1, 2, 3} and E2 = {0, 0, 3, 3}, respectively. Both models share the same MnAE (i.e., 1.5), while the second model has a larger MnSE (i.e., 4.5 vs. 3.5), as E2 has a larger variance.
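The worked example above can be verified directly:

```python
def mnae(truth, pred):
    """Mean absolute error (Equation 4.3)."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / float(len(truth))

def mnse(truth, pred):
    """Mean square error (Equation 4.2)."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred)) / float(len(truth))

D  = [1, 1, 1, 1]   # ground truth
D1 = [1, 2, 3, 4]   # model 1: absolute errors {0, 1, 2, 3}
D2 = [1, 1, 4, 4]   # model 2: absolute errors {0, 0, 3, 3}

assert mnae(D, D1) == mnae(D, D2) == 1.5   # same MnAE for both models
assert mnse(D, D1) == 3.5                  # but MnSE differs: the error set
assert mnse(D, D2) == 4.5                  # with larger variance scores worse
```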
Figure (4.2) Error distribution of our model (DG)
To examine the impact of loss functions, we analyse the error distribution of our model. Figure 4.2 shows the error distribution of the vdist2vec model on the DG graph dataset (detailed in Chapter 5) with MnSE as the loss function (the error distribution is similar when using MnAE). We see that a very small portion (e.g., less than 1%) of the vertex pairs have much larger prediction errors (see the spike to the right of the figure) than the other vertex pairs. Using MnSE as the loss function weights these larger errors more heavily than the smaller ones. This is good for controlling the maximum prediction errors but may hurt the mean prediction errors. Next, we optimize the loss function to reduce the mean prediction errors.
4.2.1 Reducing Mean Errors
As discussed above, applying MnSE as the loss function emphasizes the large errors, which come from only 1% of all vertex pairs. To reduce the mean errors, we should guide our model to focus more on the rest of the vertex pairs (which have smaller errors).
Our idea originates from a loss function named the Huber loss, which combines MnSE and MnAE [55, 41, 100]. Its basic idea is to use MnAE when the error is larger than a parameter δ, and to use MnSE otherwise. Equation 4.4 defines this loss function.
HL_\delta(a) = \frac{1}{2} a^2,                     for |a| \le \delta
HL_\delta(a) = \delta (|a| - \frac{1}{2} \delta),   otherwise    (4.4)
By applying the Huber loss, errors below δ are weighted as in MnSE, while errors above δ are weighted less (multiplied by δ, which is smaller than the error itself).
Inspired by the Huber loss, we propose a reverse Huber loss (REVHL) function, defined in Equation 4.5. As the equation shows, now the errors below δ are weighted more heavily, as they are multiplied by δ, which is larger than the error itself.
L_\delta(a) = \delta |a|,                        for |a| \le \delta
L_\delta(a) = \frac{1}{2} (a^2 + \delta^2),      otherwise    (4.5)
We also adapted the equation for the case where the error exceeds δ to make the
overall loss function continuously differentiable, enabling it to be used in model training.
To show that REVHL is continuously differentiable, first, both \delta|a| and \frac{1}{2}(a^2 + \delta^2) are continuous and continuously differentiable themselves. Further, when a = \delta,

L_\delta(\delta) = \delta^2,    \lim_{x \to 0^+} L_\delta(\delta + x) = \lim_{x \to 0^+} \frac{1}{2}((\delta + x)^2 + \delta^2) = \frac{1}{2}(\delta^2 + \delta^2) = \delta^2

L'_\delta(\delta) = \delta,    \lim_{x \to 0^+} L'_\delta(\delta + x) = \lim_{x \to 0^+} (\delta + x) = \delta

Thus, REVHL is also continuously differentiable at a = \delta.
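A direct implementation of Equation 4.5, with a numerical check of the continuity argument above (δ = 2 is an arbitrary illustrative choice):

```python
def revhl(a, delta):
    """Reverse Huber loss (Equation 4.5): linear with slope delta for
    errors at or below the threshold, quadratic above it."""
    a = abs(a)
    if a <= delta:
        return delta * a
    return 0.5 * (a * a + delta * delta)

delta = 2.0
# both branches meet at the threshold with value delta^2 ...
assert revhl(delta, delta) == delta ** 2
# ... and the loss is (numerically) continuous across the threshold
eps = 1e-9
assert abs(revhl(delta + eps, delta) - revhl(delta - eps, delta)) < 1e-6
```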
Our REVHL yields trained models with lower prediction errors as shown in Table 4.1
and Table 4.2, where MnRE denotes the mean relative error.
To apply REVHL in our model, we need to select δ. A suitable δ for our model should separate the few vertex pairs with much larger errors from the rest of the vertex pairs. During training, the errors become smaller, so a δ that is suitable for an earlier iteration
Table (4.1) Comparing Huber Loss with Reverse Huber Loss on Road Networks

              DG               MB               SU
         MnAE    MnRE     MnAE    MnRE     MnAE    MnRE
HL        224    0.018       9    0.014      57    0.025
REVHL      75    0.015       6    0.014      50    0.024
Table (4.2) Comparing Huber Loss with Reverse Huber Loss on Social Networks

             FBPOL             FBTV
         MnAE    MnRE     MnAE    MnRE
HL       0.179   0.047    0.160   0.029
REVHL    0.117   0.031    0.126   0.026
may be too large for the later iterations. We thus generate a dynamic δ in each iteration. We first sum up all absolute errors. We then add up the errors in ascending order until reaching 99% of the total; the error reached at this point is selected as δ. By doing so, we can find a suitable δ for REVHL in each iteration, which results in better performance. Compared with the model training process, computing δ is much faster, and hence it does not increase the training time significantly.
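The dynamic δ selection described above can be sketched as follows (here on a toy error list; in training, `abs_errors` would be the per-pair absolute errors of the current iteration):

```python
def dynamic_delta(abs_errors, fraction=0.99):
    """Select delta for the current iteration: accumulate the absolute
    errors in ascending order until `fraction` of their total is
    reached; the error at that point becomes delta."""
    errors = sorted(abs_errors)
    target = fraction * sum(errors)
    running = 0.0
    for e in errors:
        running += e
        if running >= target:
            return e
    return errors[-1]

# with fraction = 0.6 on {1, 2, 3, 4}, the cumulative sums 1, 3, 6
# first reach 60% of the total (6 out of 10) at the error value 3
assert dynamic_delta([1, 2, 3, 4], fraction=0.6) == 3
```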
By applying REVHL with dynamic δ values, the performance of our model improves by up to 48% in MnAE and 10% in MnRE, as shown in Chapter 5. We denote our model using REVHL as the loss function by vdist2vec-L.
4.2.2 Reducing Maximum Errors
In some applications, the maximum errors may be more important (e.g., to offer a per-
formance guarantee). To reduce the maximum errors, we can apply a loss function that
weights larger errors even more than MnSE. A potential solution is to use a higher-order polynomial in the loss function, such as the mean cube absolute error (MnCE) defined by Equation 4.6. In such a loss function, the error values are multiplied by themselves multiple times, such that the smaller errors shrink faster than the larger errors (assuming that the errors are within [0, 1]).
MnCE = \frac{1}{n} \sum_{i=1}^{n} |d(v_i, v_j) - d_{i,j}|^3    (4.6)
As shown in Table 4.3, compared with using MnSE, MnCE helps reduce the maximum absolute and relative errors, which are denoted by MxAE and MxRE, respectively. However, MnCE also raises the mean errors. A full comparison using more datasets will be presented in Chapter 5.
Table (4.3) Using MnSE and MnCE as the Loss Function on the MB Dataset

         MnAE   MnRE    MxAE   MxRE
MnSE       12   0.014    376     16
MnCE       12   0.017    265     15
4.3 Ensembling Model
Besides analysing the error distribution, we also analyse the shortest-path distance distribution. As shown in Figure 4.3, the slope of the curve is larger for the first 5% and the last 5% of the vertex pairs, while the rest of the curve has a smoother slope. This is observed on other graphs as well. Therefore, to handle distance predictions based on their value range, we adapt ensemble learning [69, 76, 80, 87] into our model to build a range-wise vdist2vec model. Ensemble methods [69] combine the results of multiple machine learning models to gain better performance than any single model on its own. In our case, we adapt the idea of a weighted average: we sum up the weighted result from each model as the final prediction. Figure 4.4 illustrates our ensembling model for road network graphs. We use four MLP models and multiply their outputs by different weights, namely 100, 900, 9000, and dmax − 10000, respectively, where dmax is the maximum distance between any two vertices in the graph. As the activation function of each MLP is the sigmoid function, the weighted output ranges of the MLPs become (0, 100), (0, 900), (0, 9000), and (0, dmax − 10000), respectively. During training, when predicting distances in (0, 100),
Figure (4.3) Distance value distribution (DG)
MLP1 will learn to contribute the main prediction result, while the other MLPs will learn
to predict 0. Similarly, when predicting distances in (100, 1000), MLP2 will become the
main contributor while MLP1 produces 100 and the other MLPs predict 0. This way,
each MLP focuses on a different prediction value range so that the variance is reduced.
For social network graphs, the structure of the ensembling model is the same but with different weights (e.g., 2, 4, 8, and 16, respectively) based on their distance ranges.
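The weighted combination of the four sigmoid-bounded outputs can be sketched as follows. The MLPs themselves are elided; `logits` stands in for their pre-activation outputs, and `d_max` is an illustrative value:

```python
import math

def ensemble_predict(logits, d_max=20000.0):
    """Figure 4.4: four MLP outputs pass through a sigmoid and are
    weighted by 100, 900, 9000, and d_max - 10000, so each MLP covers
    a distinct slice of the distance range; their sum is the prediction."""
    weights = [100.0, 900.0, 9000.0, d_max - 10000.0]
    sigmoids = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return sum(w * s for w, s in zip(weights, sigmoids))

# all MLPs saturated low -> prediction near 0
assert ensemble_predict([-50.0] * 4) < 1e-12
# all MLPs saturated high -> prediction near 100 + 900 + 9000 + 10000 = d_max
assert abs(ensemble_predict([50.0] * 4) - 20000.0) < 1e-6
```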
As the ensembling model has a more complicated structure than the original vdist2vec model, longer training and query times are expected. To balance time cost and accuracy, we can adjust how deeply we apply the ensembling layers. As shown in Figure 4.5, instead of applying four different MLPs, we use four different output layers of one MLP to build our ensembling model. Table 4.4 shows that the output layer based ensembling model has relatively balanced training and query times, while improving the accuracy by 58%. This structure is our default ensembling model setting in Chapter 5. We denote this model by vdist2vec-S.
Table (4.4) Performance of Ensembling Models on MB

                 MnAE   MnRE    MxAE   MxRE   PT     QT
vdist2vec         12    0.014   376    16     0.9h   0.644µs
layer-ensemble     5    0.006   317    16     1.1h   1.005µs
MLP-ensemble       4    0.005   312    16     1.3h   1.473µs
Figure (4.4) Ensembling model: vi and vj are fed into MLP1 to MLP4, whose outputs are weighted by 100, 900, 9000, and dmax − 10000 and summed to produce the output
4.4 Handling Large Graphs
Our vdist2vec model computes k-dimensional vector representations for the vertices,
which form a |V |×k matrix. This is cheaper than naively computing a |V |×|V | matrix
for all pairwise vertex distances. However, vdist2vec still needs to feed |V |2 pairs of
vertices into the neural network for training, which may take non-trivial space and time
when |V | gets large (e.g., at million scale). In this section, we discuss how to lower the
number of vertex pairs used in model training to scale vdist2vec to large graphs.
Our key idea works as follows. We sample a small subset Vc of vertices from V . We
call these vertices the center vertices and compute their embeddings using vdist2vec as
described in Section 4.1, i.e., to pretrain vdist2vec using the center vertices. We further
train vdist2vec using pairs of vertices where each pair consists of a center vertex and a
non-center vertex. This way, we obtain a vertex vector for every vertex. Intuitively, the
Figure (4.5) Layer ensembling model: vi and vj pass through shared, fully connected MLP hidden layers, followed by four output layers weighted by 100, 900, 9000, and dmax − 10000
learned vector vi of a non-center vertex vi shall resemble that of its nearest center vertex
(with some offset to reflect the distance to the nearest center vertex). Meanwhile, the MLP
is trained to predict the distance between a center vertex and any other vertex. Thus,
we can also use vi to predict the distance between vi and any vertex using the trained
MLP. The procedure above reduces the number of vertex pairs to be fed into vdist2vec
for training from |V|² to |Vc|² + |Vc||V \ Vc| = |Vc||V|, where |Vc| ≪ |V|. We will study the
impact of |Vc| on model training time and prediction accuracy in the experiments.
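The reduced training set can be generated as follows (a sketch with arbitrary toy centers; center selection itself is discussed next for road networks):

```python
def training_pairs(vertices, centers):
    """Vertex pairs used to train vdist2vec on a large graph:
    center-center pairs for pretraining, then (center, non-center)
    pairs for further training. The total is
    |Vc|^2 + |Vc|(|V| - |Vc|) = |Vc||V| instead of |V|^2."""
    center_set = set(centers)
    pretrain = [(c, d) for c in centers for d in centers]
    finetune = [(c, v) for c in centers for v in vertices
                if v not in center_set]
    return pretrain, finetune

vertices = list(range(100))          # |V| = 100
centers = [0, 10, 20, 30]            # |Vc| = 4 sampled center vertices
pretrain, finetune = training_pairs(vertices, centers)
assert len(pretrain) + len(finetune) == len(centers) * len(vertices)  # 400
```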
4.4.1 Large Road Networks
We further consider a special case, i.e., handling large road networks. In road network
graphs, vertices are associated with geo-coordinates (e.g., latitudes and longitudes), which
can help produce more accurate distance predictions. We cluster the graph vertices into
|Vc| clusters based on their geo-coordinates to obtain better center vertices. Any cluster-
ing algorithm that allows controlling the number of clusters can be used. We use k-means
in the experiments for simplicity. In each cluster, the vertex nearest to the cluster center
is chosen as a center vertex. We train vdist2vec over these center vertices first.
Then, instead of training vdist2vec with pairs of center vertices and non-center ver-
tices, we approximate the distance between non-center vertices as follows. First, for every
non-center vertex vi, we compute the distance to its cluster center (a center vertex vic), i.e.,
d(vi, vic) (also d(vic, vi) if the graph is directed). Then, given two query vertices vi and vj ,
their distance di,j is approximated by adding up their distances to their cluster center
vertices vic and vjc with the distance between vic and vjc:
di,j = λ1 · d(vi, vic) + dic,jc + λ2 · d(vjc, vj) (4.7)
Here, dic,jc represents the distance between vic and vjc predicted by the MLP that is
trained as part of vdist2vec.
In the equation above, there are two offset coefficients λ1 and λ2. Their role is to adjust
the contributions of d(vi, vic) and d(vjc, vj) to di,j . Intuitively, as shown in Figure 4.6a,
when vi, vic, vjc, and vj are at different relative positions, d(vi, vic) and d(vjc, vj) may have
different contributions to the overall distance. For example, if vi and vj are in between
vic and vjc, then d(vi, vic) and d(vjc, vj) should be deducted from (rather than added to)
dic,jc to obtain di,j .
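Equation 4.7 itself is a simple weighted combination. A sketch with illustrative λ values (in the model they are produced by the tanh-capped MLPs of Figure 4.6b, so λ1, λ2 ∈ (−1, 1)):

```python
def approx_distance(d_i_ic, d_ic_jc, d_jc_j, lam1, lam2):
    """Equation 4.7: approximate d(v_i, v_j) from the predicted
    center-to-center distance plus weighted offsets to the centers."""
    return lam1 * d_i_ic + d_ic_jc + lam2 * d_jc_j

# v_i and v_j outside the segment between their centers: offsets added
assert approx_distance(3.0, 10.0, 2.0, 1.0, 1.0) == 15.0
# v_i and v_j in between the centers: offsets deducted
assert approx_distance(3.0, 10.0, 2.0, -1.0, -1.0) == 5.0
```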
To learn λ1 and λ2, we build another neural network model with a structure shown in
Figure 4.6b, where dla(·) and dlo(·) return the difference in latitude and longitude between
two vertices, respectively. This model feeds the coordinate difference between vi and vic
and the coordinate difference between vj and vjc into two MLPs to predict λ1 and λ2,
respectively. The last layer of each MLP uses a tanh activation function, which maps
Figure (4.6) Distance prediction model for large road networks: (a) distance computation, combining d(vi, vic), dic,jc, and d(vjc, vj); (b) network structure, where the distances to the centers and the coordinate differences dla(·) and dlo(·) feed two MLPs that output λ1 and λ2
λ1 and λ2 into the range of (−1, 1). The outputs of these two MLPs are multiplied with d(vi, vic) and d(vjc, vj) (denoted by “⊙” in Figure 4.6b), and the products are added to dic,jc (denoted by “⊕” in Figure 4.6b) to produce di,j. Essentially, this model implements
Equation 4.7. To train this model, we use the same loss function as that used in vdist2vec
(i.e., Equation 3.3), except that now only a sampled subset of non-center vertices (e.g.,
|Vc||V | pairs) is needed for model training as the input space (i.e., coordinate difference)
becomes much smaller.
We note that, if more computing resources are available, we may use spectral cluster-
ing [95] instead of k-means. We may use the eigenvectors of the graph Laplacian matrix
to replace the geo-coordinates of the vertices. Then, the procedure above applies to other
graphs such as social networks.
4.5 Handling Updates
Following a majority of the existing studies (e.g., [26, 29, 77]), our vdist2vec model targets
static graphs where the vertices and edges do not change. By periodic rebuilds, vdist2vec
can also handle graphs with a low update frequency (e.g., road networks) or applications
that are less sensitive to real-time updates (e.g., mining similar users for friendship rec-
ommendation in social networks). As shown in Chapter 5, vdist2vec can be rebuilt in a
few hours for graphs with over a million vertices.
44
Proposed Single-Stage Model 4.6 Cost Analysis
In real-time applications, vdist2vec can still provide distance predictions upon vertex
or edge updates, although the prediction accuracy may drop. If a new vertex v is inserted,
distance queries on v can be approximated by replacing v with its nearest vertex vi ∈ V in the distance prediction and adding d(v, vi). If a vertex is deleted, it will not break the distance prediction process but may reduce the prediction accuracy. This also applies to edge updates, i.e., insertions, deletions, or weight changes. We study such
impacts via experiments in Chapter 5.
One potential approach to support real-time vertex and edge updates for learning
based distance prediction models is incrementally training our model for only the ver-
tices affected by an update rather than retraining the model on the whole graph. For
example, when a vertex is inserted, we may learn an embedding for the new vertex by
training our model on all the distances from the vertex to the other vertices while keep-
ing the other vertex embeddings unchanged. This way, we only need to handle O(n) dis-
tance pairs. For edge insertions, edge deletions, and vertex deletions, we can choose the
k nearest neighbours of the deleted vertices or vertices connected to the deleted/inserted
edges as the affected vertices. A hierarchical structure can be beneficial for graph up-
dating. When vertices are grouped into clusters, vertex/edge modifications within each cluster can be handled inside the cluster, which reduces the update costs. For
example, a vertex/edge deletion in a well-separated pair decomposition cluster will not
affect the other clusters. We only need to recompute the shortest-path distances between
the vertices inside the affected cluster and their corresponding representative vertices.
An in-depth study on the impact of updates is an open problem for future study.
4.6 Cost Analysis
To simplify the discussion, we consider an MLP network to have an O(1) space cost for
storing its parameters and an O(1) time cost for its inference process. Such costs depend
mainly on the network size rather than the input graph size. Also, the inference process
can take advantage of GPU parallelization and be done efficiently.
Then, our vdist2vec model can be trained in O(|V|²) time, i.e., training with every pair
of vertices. For large graphs, this time can be reduced to O(|Vc||V |) via sampling (recall
that |Vc| is the sample size). Once the model is trained, it takes O(k|V |) space to store
the vertex embeddings. Adding in the constant space cost to store the MLP parameters,
overall, our model has an O(k|V |) space cost. To make a distance prediction, our model
takes O(k) time to read and feed the embeddings of the two query vertices into the MLP.
Then, the MLP inference process is run to make a prediction in O(1) time. Thus, overall,
our model has an O(k) time (and space) cost for distance predictions.
4.7 Summary
In this chapter, we described our proposed one-stage model (vdist2vec) to overcome the
limitation of two-stage models and achieve higher accuracy in distance prediction. We
studied the choice of loss functions for our model based on error distribution analysis. We
proposed the reverse Huber loss to reduce the mean absolute error for our model. We fur-
ther adapted the ensemble learning technique to boost our model performance in terms
of prediction accuracy. To handle large graphs, we proposed a hierarchical model that
reduces the preprocessing time via sampling without sacrificing much distance
prediction accuracy. We concluded the chapter with a discussion on data update han-
dling and model costs.
Chapter 5
Experiments
We study the empirical performance of our vdist2vec model and its variants on real-
world graphs. We compare it with both landmark labeling approaches and embedding
based models.
5.1 Settings
The experiments are run on a Linux desktop computer with an Intel (R) Xeon (R) E5-2630
V3 CPU (2.40 GHz), a GeForce GTX TITAN X GPU, and 32 GB memory. All algorithms
and models are implemented with Python 2.7.12. The graphs are managed with the Net-
workX package [2]. The neural networks are implemented with Tensorflow 1.13.1.
Datasets. We use the following road networks, social networks, and a web document
network to test the model effectiveness and efficiency:
• DG, SH, and SU [4, 61]: These datasets contain the road networks of Dongguan
(China), Shanghai (China), and Surat (India), which are amongst the most popu-
lated cities in the world. The road networks are directed, but their adjacency ma-
trices are symmetric, i.e., the edges on both directions between the same pair of
vertices have the same weight. For simplicity, we represent these road networks
with undirected graphs (hence the number of edges is reduced by half from the
original road networks).
• MB: This dataset contains the road network of the CBD of Melbourne (Australia)
exported from OpenStreetMap [6]. This network is directed.
• FL and NY [1]: These datasets contain the road networks of Florida and New York
City (USA), respectively. These road networks are directed, but their adjacency
matrices are again symmetric. We also represent them with undirected graphs.
• FBTV and FBPOL [3]: These datasets contain the Facebook page networks of politi-
cians and TV shows, respectively. Here, every page is a vertex, and an edge between
two vertices represents a mutual (i.e., undirected) “like” relationship.
• POK [3]: This dataset contains the social network named Pokec, where every user is
a vertex, and an edge from a vertex to another represents a (directed) “following”
relationship.
• EPA [81]: This dataset contains the web graph of www.epa.gov, where every page is
a vertex, and an edge from a vertex to another represents a (directed) hyperlink.
The edges in the social networks FBPOL, FBTV, and POK and the web page graph
EPA are unweighted originally. Following previous studies [19, 77], we assign each edge
with a length of one distance unit (e.g., one hop). Note that edge weight modeling is
orthogonal to our study. Other edge weight models may apply straightforwardly.
These datasets and their numbers of vertices (|V |), numbers of edges (|E|), average
degree (dgr), and maximum shortest-path distances between any two vertices (i.e., di-
ameter dmax) are summarized in Table 5.1. Note that the shortest-path distances are later
normalized into the range of [0, 1] for model training and testing.
Table (5.1) Datasets

Type              Dataset   |V|      |E|      dgr     dmax
Road network      DG        8K       11K      2.76    96km
                  FL        1.07M    1.35M    2.36    12,000km
                  MB        3.6K     4.1K     1.14    6km
                  NY        264K     366K     2.80    1,600km
                  SH        74K      100K     2.70    127km
                  SU        2.5K     3.6K     2.88    50km
Social network    FBPOL     6K       41K      13.60   14
                  FBTV      4K       17K      8.80    20
                  POK       1.6M     30M      18.80   11
Web page graph    EPA       4.3K     9K       4.17    10
Approaches. We test seven approaches including our proposed model vdist2vec. We
detail these approaches below.
• vdist2vec: This is our proposed model as described in Section 4.1. By default, we
use 2%|V | as the number of nodes in the embedding layer. For the MLP distance
prediction component, we use two hidden layers with 100 and 20 nodes, respec-
tively. We use ReLU [68] as the activation function for the hidden layers and the
sigmoid function for the output layer. We set the batch size to be |V | (we find that
a large batch size helps the training efficiency while having little impact on the prediction accuracy). We initialize the MLP parameters using the truncated normal
distribution with 0 as the mean and 0.01 as the standard deviation. The training
data is randomly shuffled. We train our model in 20 epochs with early stopping
using AdamOptimizer and a learning rate of 0.01.
• vdist2vec-S: This is our proposed ensembling model as described in Section 4.3.
The ensembling structure settings are described in Section 4.3. The rest of the model shares the same settings as vdist2vec.
• landmark-bt [38, 89]: This is a landmark labeling approach based on betweenness
centrality. It selects as landmarks the top-k vertices through which the largest numbers of shortest paths pass.
• landmark-dg [89]: This is a landmark labeling approach based on degree. It selects
the top-k vertices with the largest degrees as the landmarks.
• geodnn [59]: This approach trains an MLP to predict the distance between two vertices
given their geo-coordinates. It only applies to road networks. By default, we use
the recommended settings from the model proposal [59].
• node2vec [48]: This approach uses node2vec to learn vertex embeddings and then
trains an MLP to predict vertex distances based on the learned embeddings, as
described in Section 3.3.1. By default, we use the node2vec model settings from a
previous study that uses node2vec for distance predictions in social networks [78].
The MLP structure and hyperparameters are the same as those in vdist2vec.
• auto-encoder: This approach uses an auto-encoder to learn vertex embeddings and
then trains an MLP to predict vertex distances based on the learned embeddings, as
described in Section 3.3.2. By default, we use one hidden layer in the encoder and
the decoder. Each hidden layer has 2%|V | nodes. The middle layer (i.e., the code
layer) between the encoder and the decoder has 1%|V | nodes. We train the auto-
encoder using RMSPropOptimizer in 100 epochs with early stopping and a learning
rate of 0.001. The MLP structure and its hyperparameters are the same as those in
vdist2vec.
For all approaches except node2vec, by default, we use 2% of the number of vertices as the embedding dimensionality (or number of landmarks), i.e., k = 2%|V|. For node2vec, we use k = 128, which is suggested to be optimal [78].
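As a concrete illustration of the MLP distance-prediction component shared by the learning based approaches above (two ReLU hidden layers of 100 and 20 nodes, a sigmoid output, and truncated-normal initialization), the following is a minimal numpy sketch of the forward pass. The class and helper names are illustrative, and training (Adam, 20 epochs) is omitted:

```python
import numpy as np

def truncated_normal(shape, mean=0.0, std=0.01, rng=None):
    """Sample N(mean, std) truncated to two standard deviations."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.normal(mean, std, size=shape)
    while True:
        bad = np.abs(x - mean) > 2 * std
        if not bad.any():
            return x
        x[bad] = rng.normal(mean, std, size=int(bad.sum()))

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DistanceMLP:
    """Forward pass of the distance-prediction MLP: two ReLU hidden
    layers (100 and 20 nodes by default) and a sigmoid output, so
    predictions fall in [0, 1] like the normalized distances."""
    def __init__(self, input_dim, hidden=(100, 20), rng=None):
        sizes = [input_dim, *hidden, 1]
        self.weights = [truncated_normal((m, n), rng=rng)
                        for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def predict(self, x):
        h = np.asarray(x, dtype=float)
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            h = relu(h @ w + b)
        return sigmoid(h @ self.weights[-1] + self.biases[-1])
```

The sigmoid output pairs with the [0, 1] distance normalization noted in Section 5.1.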
Evaluation metrics. We compute the embeddings (distance labels) and predict the
distance between every pair of vertices. We measure the mean absolute error (MnAE, in
meters), maximum absolute error (MxAE, in meters), mean relative error (MnRE), and maximum relative error (MxRE) for the predictions. We also measure the precomputation/training
time (PT) and average distance prediction (query) time (QT). Here,
relative error = |predicted distance − actual distance| / actual distance
The ground truth distances are precomputed using the contraction hierarchy algorithm
for road networks and Dijkstra’s algorithm for social networks1 and the web page graph.
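The four error metrics can be computed as in the following self-contained sketch (the function name is illustrative):

```python
import numpy as np

def error_metrics(pred, actual):
    """Mean/max absolute and relative errors, as used in the
    evaluation. Relative error = |pred - actual| / actual,
    assuming actual distances are positive."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    abs_err = np.abs(pred - actual)
    rel_err = abs_err / actual
    return {'MnAE': abs_err.mean(), 'MxAE': abs_err.max(),
            'MnRE': rel_err.mean(), 'MxRE': rel_err.max()}
```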
5.2 Results
We first report results on an overall performance comparison among the different ap-
proaches and then report results on the impact of parameters.
5.2.1 Overall Comparison
We summarize the model performance in Tables 5.2 to 5.7 on smaller graphs (DG, MB, SU, FBPOL, FBTV, and EPA) that can be processed by our vdist2vec model directly, and in Tables 5.8 to 5.13 on larger graphs (FL, NY, SH, and POK) that require sampling so as to be processed by vdist2vec.
5.2.2 Performance on Smaller Graphs
Tables 5.2 and 5.3 show mean prediction errors on the smaller graphs. Our model vdist2vec
outperforms the baseline models across all six datasets except for landmark-bt on FBTV.
On road network graphs (DG, MB, and SU), vdist2vec reduces the MnAE by at least
0.74% (135 vs. 136 for vdist2vec and landmark-dg on DG) and up to 96% (12 vs. 300 for
vdist2vec and auto-encoder on MB). The advantage in terms of MnRE is larger, i.e., at
least 72% (0.015 vs. 0.054 for vdist2vec and landmark-dg on DG) and up to 97% (0.014
vs. 0.488 for vdist2vec and landmark-bt on MB). On social networks and the web page
graph, the errors of all the approaches are smaller, because the edge weights and graph
diameters are smaller. Our model vdist2vec still achieves substantial reductions in both
MnAE and MnRE, except on FBTV (MnAE 0.137 vs. 0.103 for vdist2vec and landmark-bt).
Footnote 1: CH is observed to be slower on social networks [11].
In terms of MnRE, our model's errors are at least 7% (0.026 vs. 0.028 for vdist2vec and landmark-bt on FBTV) and up to 91% (0.026 vs. 0.278 for vdist2vec and auto-encoder on FBTV) smaller than those of the baselines. Our vdist2vec-S model further improves the performance of
vdist2vec by at least 18% (0.108 vs. 0.133 on FBPOL) and up to 58% (5 vs. 12 on MB) for
MnAE, and at least 19% (0.021 vs. 0.026 on FBTV) and up to 57% (0.006 vs. 0.014 on MB)
for MnRE.
Table (5.2) Mean Absolute and Mean Relative Errors on Road Networks (Smaller Graphs)

                         DG              MB              SU
                         MnAE   MnRE    MnAE   MnRE    MnAE   MnRE
baseline  landmark-bt    2,234  0.442   192    0.488   468    0.281
          landmark-dg    136    0.060   22     0.083   180    0.127
          geodnn         1,566  0.092   95     0.097   442    0.108
          node2vec       2,329  0.199   118    0.161   658    0.175
          auto-encoder   1,821  0.178   300    0.308   213    0.070
proposed  vdist2vec      135    0.015   12     0.014   83     0.027
          vdist2vec-S    71     0.011   5      0.006   49     0.014

Table (5.3) Mean Absolute and Mean Relative Errors on Social Networks and Web Page Graph (Smaller Graphs)

                         FBPOL           FBTV            EPA
                         MnAE   MnRE    MnAE   MnRE    MnAE   MnRE
baseline  landmark-bt    1.017  0.254   0.103  0.028   0.024  0.008
          landmark-dg    1.115  0.277   0.560  0.114   0.021  0.007
          geodnn         N/A    N/A     N/A    N/A     N/A    N/A
          node2vec       0.411  0.094   0.697  0.143   0.603  0.150
          auto-encoder   0.964  0.239   1.618  0.278   0.597  0.148
proposed  vdist2vec      0.133  0.034   0.137  0.026   0.023  0.006
          vdist2vec-S    0.108  0.027   0.101  0.025   0.020  0.005
The advantage of vdist2vec and vdist2vec-S comes from their capability to learn the
pairwise vertex distances and preserve such information in the learned embeddings.
In comparison, the landmark approaches (landmark-bt and landmark-dg) are impacted
heavily by the choice of the landmarks. They may not preserve the distance information
for all vertex pairs, as discussed in Section 2.2. Thus, they have larger distance prediction errors than vdist2vec on most datasets in Tables 5.2 and 5.3.

Table (5.4) Max Absolute and Max Relative Errors on Road Networks (Smaller Graphs)

                         DG               MB            SU
                         MxAE    MxRE    MxAE   MxRE   MxAE    MxRE
baseline  landmark-bt    56,058  4,713   4,724  375    54,689  1,549
          landmark-dg    34,053  1,425   3,103  216    8,409   1,242
          geodnn         41,376  262     3,563  92     14,916  162
          node2vec       45,564  1,806   2,953  187    18,917  500
          auto-encoder   45,645  1,505   2,920  192    9,966   244
proposed  vdist2vec      9,050   193     376    16     4,789   63
          vdist2vec-S    8,154   193     317    16     3,188   40

Table (5.5) Max Absolute and Max Relative Errors on Social Networks and Web Page Graph (Smaller Graphs)

                         FBPOL         FBTV          EPA
                         MxAE   MxRE   MxAE   MxRE   MxAE  MxRE
baseline  landmark-bt    10     10     14     14     7     7
          landmark-dg    10     10     18     18     7     7
          geodnn         N/A    N/A    N/A    N/A    N/A   N/A
          node2vec       9      9      16     16     7     7
          auto-encoder   9      4      14     5      5     5
proposed  vdist2vec      4      4      4      4      4     4
          vdist2vec-S    4      4      4      4      4     4

An exception is on
the FBTV dataset. This dataset has many shortest paths that pass through the graph
center, which suits the betweenness centrality strategy of landmark-bt. Thus, landmark-
bt obtains a slightly lower MnAE than that of vdist2vec on this dataset. However, our
vdist2vec-S still outperforms landmark-bt, with a slightly more complex structure. Note
that a smaller MnAE does not guarantee a smaller MnRE, e.g., landmark-bt has a smaller
MnAE but a larger MnRE than vdist2vec. This is because MnAE only reflects the average of the prediction errors but not the error distribution. Given the same MnAE, an
algorithm that generates the errors mostly on vertex pairs with small distances may have
larger relative errors (and hence larger MnRE) than an algorithm that generates the er-
rors mostly on vertex pairs with large distances. The landmark approaches are known to
suffer on vertices with small distances because the shortest paths between such vertices
may not go through any landmarks.
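A minimal sketch of landmark-based estimation makes this weakness concrete: the triangle-inequality estimate d(u, l) + d(l, v) is exact only when some landmark l lies on a shortest path between u and v, so nearby pairs whose paths avoid all landmarks are overestimated. The graph representation and function names below are illustrative:

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest-path distances; graph: {u: {v: w, ...}}."""
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def landmark_dg(graph, k):
    """Select the k highest-degree vertices as landmarks and
    precompute distances from each landmark to all vertices."""
    landmarks = sorted(graph, key=lambda u: len(graph[u]), reverse=True)[:k]
    return {l: dijkstra(graph, l) for l in landmarks}

def estimate(labels, u, v):
    """Upper-bound estimate via the triangle inequality over landmarks."""
    return min(labels[l].get(u, float('inf')) + labels[l].get(v, float('inf'))
               for l in labels)
```

On a path graph 0-1-2-3, a single degree-based landmark at vertex 1 sits on the 0-3 shortest path, so that pair is estimated exactly; pairs bypassing all landmarks would not be.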
Table (5.6) Preprocessing and Query Times on Road Networks (Smaller Graphs)

                         DG               MB               SU
                         PT     QT        PT     QT        PT     QT
baseline  landmark-bt    0.1h   5.832µs   62.7s  4.579µs   32.6s  4.463µs
          landmark-dg    0.1s   6.044µs   0.1s   4.392µs   0.1s   4.288µs
          geodnn         0.9h   0.366µs   0.2h   0.396µs   0.2h   0.375µs
          node2vec       2.2h   0.829µs   0.9h   0.820µs   0.5h   0.809µs
          auto-encoder   2.3h   1.032µs   1.0h   0.638µs   0.5h   0.589µs
proposed  vdist2vec      2.3h   1.039µs   0.9h   0.644µs   0.4h   0.589µs
          vdist2vec-S    2.8h   1.366µs   1.1h   1.005µs   0.6h   0.927µs

Table (5.7) Preprocessing and Query Times on Social Networks and Web Page Graph (Smaller Graphs)

                         FBPOL             FBTV              EPA
                         PT      QT        PT     QT        PT     QT
baseline  landmark-bt    207.5s  5.014µs   72.4s  4.501µs   75.2s  4.621µs
          landmark-dg    7.2s    5.031µs   2.4s   4.554µs   3.2s   4.650µs
          geodnn         N/A     N/A       N/A    N/A       N/A    N/A
          node2vec       1.4h    0.801µs   1.1h   0.805µs   1.1h   0.781µs
          auto-encoder   1.4h    0.877µs   1.1h   0.723µs   1.1h   0.803µs
proposed  vdist2vec      1.5h    0.877µs   1.1h   0.723µs   1.1h   0.772µs
          vdist2vec-S    1.6h    1.263µs   1.2h   1.137µs   1.2h   1.196µs
The geodnn approach only works on road networks as it makes predictions based on
the geo-coordinates of the vertices. Its performance relies on how far the shortest paths deviate from the straight lines between the vertices. It is the second-best baseline approach on MB, which is a small grid-shaped road network with few detours. It drops to third on DG and SU, which are larger road networks that cover rivers and have longer detours.
The node2vec approach focuses on embedding the neighborhood of the vertices. It
works better on graphs with small diameters where the vertices are all near each other.
For example, FBPOL has a small diameter of 14, for which node2vec is the second best
approach. When the graph diameter becomes larger (e.g., 96km for DG), node2vec incurs larger errors since the neighborhood becomes less relevant to the distance between
vertices far away.
The auto-encoder tends to generate embeddings that preserve the average distances
between the vertices. This leads to an unsatisfactory prediction performance in general,
as evidenced by the large mean errors reported. On the other hand, this property may help avoid large maximum errors, e.g., the auto-encoder is the best baseline on FBPOL in terms of the maximum errors (cf. Table 5.5). The auto-encoder, in contrast, performs better on graphs with larger diameters where the vertex distances may have a larger variance, e.g., DG and
SU. This is because a larger variance on the distances may offer a stronger signal for the
auto-encoder to learn different embeddings for different vertices, rather than the same
average distance.
Tables 5.4 and 5.5 show the MxAE and the MxRE of the models. Our vdist2vec and
vdist2vec-S models also outperform the baselines on these two measures (except on
FBPOL where auto-encoder is equally good in MxRE). The vdist2vec model reduces the
MxAE by up to 92% (376 vs. 4,724 for vdist2vec and landmark-bt on MB) and the MxRE
by up to 96% (63 vs. 1,549 for vdist2vec and landmark-bt on SU), while the performance
of vdist2vec-S is even stronger. This again verifies the capability of our models to learn
and preserve the vertex distance information. Note that, similar to what has been observed over MnAE and MnRE, a larger MxAE does not mean a larger MxRE either, e.g.,
on DG, geodnn and vdist2vec have similar MxRE but geodnn has a much larger MxAE.
This is because MxAE and MxRE are usually observed from different pairs of vertices –
MxAE tends to come from vertices far away, while MxRE tends to come from vertices
with a very small distance (e.g., 1). Comparing Tables 5.2 and 5.3 with Tables 5.4 and 5.5,
we find that, in general, the baseline methods do not yield low mean and maximum er-
rors at the same time. For example, landmark-bt is close to vdist2vec on FBTV in terms
of the mean errors, while its maximum errors are much larger than those of vdist2vec on
the same dataset. Similarly, auto-encoder is close to vdist2vec on FBPOL in terms of the
maximum errors, but its mean errors are more than 7 times larger than those of vdist2vec
on the same dataset. This further highlights the advantage of vdist2vec and vdist2vec-S,
which can achieve low mean and maximum errors at the same time.
Tables 5.6 and 5.7 show the preprocessing (model training) time PT and distance prediction (query) time QT. In terms of PT, the landmark approaches are much faster. Their
precomputation procedures are deterministic and much simpler than the training procedures of the learning based models, which involve multiple iterations of numeric optimization on the neural networks. For learning based models, the main parameter that affects the preprocessing time is the number of embedding dimensions. As geodnn has the lowest number of embedding dimensions (2), it has the smallest preprocessing cost. The
node2vec model has a fixed number of dimensions (128, as suggested to be the best for
the shortest-path distance problem [78]). Our models vdist2vec and vdist2vec-S have
more complex distance prediction networks and hence longer training times.
In terms of QT, the learning based approaches are of the same (or a smaller) order of magnitude as the landmark approaches. This is because distance prediction in the learning based approaches is a simple forward propagation procedure, which can be easily parallelized to take full advantage of the computation power of the GPU. The geodnn model is the fastest, because its input layer only has four dimensions (i.e., two two-dimensional geo-coordinates).
The other three learning based approaches node2vec, auto-encoder, and vdist2vec have
very similar MLP structures and input sizes which are larger than that of geodnn. Thus,
their QTs are similar and are larger than that of geodnn. Note that the QT of node2vec differs slightly from those of auto-encoder and vdist2vec. This is because node2vec has a constant embedding dimensionality k = 128, which is suggested to be optimal [78], while the embedding dimensionality of auto-encoder and vdist2vec varies with the number of vertices (i.e., k = 2%|V|). The QT of vdist2vec-S is longer than that of vdist2vec, again due to its slightly more complex structure.
5.2.3 Performance on Larger Graphs
Tables 5.8 to 5.13 show the model performance on the larger graphs, i.e., FL, NY, SH, and POK. These graphs cannot be processed in full by vdist2vec under our hardware constraints. Following the procedure described in Section 4.4, for each of the road networks FL, NY, and SH, vdist2vec clusters the vertices into 0.1%|V| clusters using the k-means algorithm and learns embeddings (and the MLP) for the cluster center vertices. The model then randomly samples 100,000 pairs of vertices and uses their geo-coordinates and distances to learn the offset coefficients λ1 and λ2. The learned embeddings of the center
vertices and the coefficients enable vdist2vec to predict the distance between any pair of vertices in the graph. For the social network POK, vdist2vec uses the 0.1%|V| vertices with the largest degrees as the center vertices. It learns vertex embeddings to predict distances between these center vertices and the rest of the vertices. These embeddings are then used to predict the distance between any two vertices.
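The clustering step that selects center vertices can be sketched with plain Lloyd's k-means over the vertex geo-coordinates, snapping each centroid to its nearest actual vertex. This is an illustrative sketch only; the embedding and offset learning of Section 4.4 are omitted, and the function name is hypothetical:

```python
import numpy as np

def choose_center_vertices(coords, n_clusters, n_iter=50, seed=0):
    """Cluster vertex coordinates with Lloyd's k-means and return, for
    each cluster, the index of the vertex nearest to its centroid.
    coords: (|V|, 2) array of vertex geo-coordinates."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    centroids = coords[rng.choice(len(coords), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # assign each vertex to its nearest centroid
        dists = np.linalg.norm(coords[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its cluster
        for c in range(n_clusters):
            members = coords[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # snap each centroid to the closest actual vertex (the center vertex)
    dists = np.linalg.norm(coords[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=0)
```

In the thesis setting, n_clusters = 0.1%|V|; the tiny example in the test below only shows the mechanics.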
Table (5.8) Mean Absolute and Mean Relative Errors on Road Networks (Larger Graphs)

                         FL                NY                SH
                         MnAE     MnRE    MnAE     MnRE    MnAE    MnRE
baseline  landmark-bt    OT       OT      24,851   0.167   6,144   0.554
          landmark-dg    67,104   0.134   20,483   0.164   4,407   0.403
          geodnn         363,661  0.317   207,694  0.862   14,842  0.990
          node2vec       OT       OT      217,400  0.703   19,465  1.276
          auto-encoder   OT       OT      OT       OT      OT      OT
proposed  vdist2vec      66,542   0.042   17,278   0.056   2,549   0.126
          vdist2vec-S    66,316   0.041   17,140   0.053   2,546   0.126

Table (5.9) Mean Absolute and Mean Relative Errors on Social Networks (Larger Graphs)

                         POK
                         MnAE   MnRE
baseline  landmark-bt    OT     OT
          landmark-dg    3.070  0.665
          geodnn         N/A    N/A
          node2vec       OT     OT
          auto-encoder   OT     OT
proposed  vdist2vec      0.940  0.203
          vdist2vec-S    0.895  0.198
Among the baseline models, the landmark approaches are run on the full graphs and may not complete in time. We terminate the algorithms after 48 hours and denote this as "OT" in the tables. For geodnn, we randomly sample 100,000 pairs of vertices and
use their geo-coordinates and distances to train the MLP for distance prediction. For
node2vec and auto-encoder, their vertex embeddings need to be learned for all vertices,
which may also run overtime. For the datasets where they can learn the embeddings in
time, we also randomly sample 100,000 pairs of vertices to train the MLP.
Table (5.10) Max Absolute and Max Relative Errors on Road Networks (Larger Graphs)

                         FL                  NY                 SH
                         MxAE       MxRE    MxAE       MxRE    MxAE     MxRE
baseline  landmark-bt    OT         OT      764,169    36      127,433  1,787
          landmark-dg    2,380,151  84      571,447    67      88,038   595
          geodnn         3,085,059  82      982,265    84      69,382   77
          node2vec       OT         OT      1,087,000  65      93,433   688
          auto-encoder   OT         OT      OT         OT      OT       OT
proposed  vdist2vec      604,489    11      460,065    7       27,753   18
          vdist2vec-S    605,872    11      458,052    7       26,626   17

Table (5.11) Max Absolute and Max Relative Errors on Social Networks (Larger Graphs)

                         POK
                         MxAE  MxRE
baseline  landmark-bt    OT    OT
          landmark-dg    6     6
          geodnn         N/A   N/A
          node2vec       OT    OT
          auto-encoder   OT    OT
proposed  vdist2vec      5     5
          vdist2vec-S    5     5
For testing, since it would take too long to test all pairs of vertices, following the strategy of previous studies [49, 77], we randomly sample 100,000 pairs of vertices (different from those sampled for training) and test all the approaches on them. We use k = 50% × 0.1%|V| for the SH dataset and k = 5% × 0.1%|V| for the other three datasets, as SH is considerably smaller than the other three datasets.
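The test-pair sampling described above can be sketched as follows (the function name and the directed-pair convention are illustrative):

```python
import random

def sample_vertex_pairs(n_vertices, n_pairs, exclude=frozenset(), seed=0):
    """Sample distinct vertex pairs uniformly at random, skipping
    self-pairs and any pairs already used for training."""
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        u, v = rng.randrange(n_vertices), rng.randrange(n_vertices)
        if u != v and (u, v) not in exclude:
            pairs.add((u, v))
    return list(pairs)
```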
As Tables 5.8 to 5.11 show, our vdist2vec and vdist2vec-S models also produce smaller
distance prediction errors than the baselines on the larger graphs, even when our embeddings and MLP are not trained on every pair of vertices. For vdist2vec, the reductions achieved in MnAE, MnRE, MxAE, MxRE are up to 92% (17,278 vs. 217,400 for
vdist2vec and node2vec on NY), 94% (0.056 vs. 0.862 for vdist2vec and geodnn on NY),
80% (604,489 vs. 3,085,059 for vdist2vec and geodnn on FL), and 99% (18 vs. 1,787 for
vdist2vec and landmark-bt on SH), respectively.

Table (5.12) Preprocessing and Query Times on Road Networks (Larger Graphs)

                         FL                NY                SH
                         PT     QT         PT     QT         PT     QT
baseline  landmark-bt    OT     OT         39.6h  11.712µs   14.7h  8.423µs
          landmark-dg    29.3s  54.492µs   9.4s   16.325µs   1.5s   13.584µs
          geodnn         0.1h   0.444µs    0.1h   0.458µs    0.1h   0.432µs
          node2vec       OT     OT         26.3h  0.751µs    2.8h   0.781µs
          auto-encoder   OT     OT         OT     OT         OT     OT
proposed  vdist2vec      3.0h   3.981µs    2.1h   1.215µs    0.1h   0.797µs
          vdist2vec-S    3.1h   3.981µs    2.1h   1.215µs    0.1h   0.797µs

Table (5.13) Preprocessing and Query Times on Social Networks (Larger Graphs)

                         POK
                         PT    QT
baseline  landmark-bt    OT    OT
          landmark-dg    0.4h  23.522µs
          geodnn         N/A   N/A
          node2vec       OT    OT
          auto-encoder   OT    OT
proposed  vdist2vec      3.1h  0.573µs
          vdist2vec-S    3.2h  0.759µs

The improvement of vdist2vec-S over
vdist2vec on the larger graphs is less significant, because the distance distribution on
the hierarchical model is less skewed. Further, the baseline approaches landmark-bt,
node2vec, and auto-encoder cannot handle all four datasets. They run overtime when
|V | gets too large (e.g., over a million). The geodnn model can process the larger road
networks, but it cannot handle social networks as pointed out earlier. The landmark-dg
approach is the only baseline that can handle all four large datasets, which is due to its simple procedure (i.e., simply computing high-degree vertices). It is also the most competitive baseline on most of these datasets, especially POK. However, it still suffers on the
road networks in terms of maximum errors as shown in Table 5.10, because there may be
nearby vertices whose shortest paths do not pass through any high-degree vertices, while they are far away from all the high-degree vertices.
Tables 5.12 and 5.13 show the time costs on the larger graphs. For preprocessing
(training), landmark-dg and geodnn are still faster than vdist2vec and vdist2vec-S as
their preprocessing (training) procedures are simpler. Meanwhile, landmark-bt, node2vec,
and auto-encoder now become much slower and are outperformed by vdist2vec and
vdist2vec-S, because they need to process all vertices, while vdist2vec and vdist2vec-S
only need to process the center vertices and sampled vertex pairs. For distance prediction, geodnn is again the fastest, since it has a small input size in its MLP. Our vdist2vec
and vdist2vec-S models have a larger input size (and an extra offset computation step on
road networks). Thus, they have slightly larger distance prediction times than those of
geodnn and node2vec.
5.2.4 Applicability Test
Figure (5.1) Recall and nDCG in finding nearest neighbor: (a) recall@t and (b) nDCG@t, for t = 5 to 25, comparing landmark-bt, landmark-dg, geodnn, node2vec, auto-encoder, and vdist2vec; charts omitted from this transcript.
To further verify the effectiveness of our proposed model and to test its applicability
in real applications (e.g., to find the nearest POI), we compute the top-t nearest neighbors (NNs) for every vertex using the distances estimated by every model. We report recall@t and nDCG@t [56]. Here, recall@t measures the probability of the actual NN of a
vertex to be found in the top-t NNs returned by a model; nDCG@t further measures the
actual ranks of the top-t NNs returned.
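The two measures can be sketched as follows. The graded-relevance convention in the nDCG sketch (relevance t − rank for items in the true top-t list) is one common choice and is used here for illustration only; the thesis follows [56]:

```python
import numpy as np

def recall_at_t(true_nn, predicted_lists, t):
    """Fraction of query vertices whose actual nearest neighbor
    appears among the model's top-t predicted neighbors."""
    hits = sum(true_nn[q] in preds[:t]
               for q, preds in enumerate(predicted_lists))
    return hits / len(predicted_lists)

def ndcg_at_t(true_rankings, predicted_lists, t):
    """nDCG@t with graded relevance t - rank for items in the true
    top-t list (0 otherwise); one common convention."""
    def dcg(items, rel):
        return sum(rel.get(v, 0) / np.log2(i + 2)
                   for i, v in enumerate(items[:t]))
    scores = []
    for q, preds in enumerate(predicted_lists):
        rel = {v: t - r for r, v in enumerate(true_rankings[q][:t])}
        ideal = dcg(true_rankings[q], rel)
        scores.append(dcg(preds, rel) / ideal if ideal > 0 else 0.0)
    return float(np.mean(scores))
```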
For the rest of the experiments, we only report the results of our vdist2vec model
but not vdist2vec-S to keep the figures concise, since vdist2vec-S and vdist2vec yield
similar results. As Figure 5.1 shows, vdist2vec outperforms all baseline models in both
measures. It has an over 50% probability of returning the actual NN in its top-5 NN
predictions, and this probability increases to 94% in the top-25 list. The best baseline in
this case, geodnn, is over 5% less accurate than vdist2vec, while all other models cannot
reach a 30% probability of returning the actual NN. The gap in nDCG is even larger,
which confirms the quality of the NNs returned by vdist2vec, i.e., they are the closest to
the actual NNs.
Experimental results on the other datasets show similar patterns. They are omitted
for succinctness.
5.2.5 Impact of Updates
Figure (5.2) Impact of updates (DG): MnAE under (a) vertex insertions (2%|V| to 10%|V|), (b) edge insertions, and (c) edge deletions (2%|E| to 10%|E|); charts omitted from this transcript.
Figure (5.3) Impact of updates (FBPOL): MnAE under (a) vertex insertions (2%|V| to 10%|V|), (b) edge insertions, and (c) edge deletions (2%|E| to 10%|E|); charts omitted from this transcript.

Next, we keep the models trained on G unchanged and study the impact of graph
updates, including vertex and edge insertions and edge deletions. Vertex deletions (e.g.,
POI closing) do not impact distance predictions as long as the relevant edges are kept,
and hence are omitted. For vertex insertions, we generate 2%|V | to 10%|V | new vertices
and place each on a randomly chosen edge ei,j (e.g., POI opening on some road). The
new vertex is connected with the two vertices of ei,j , with edge weights of γ1 · ei,j .w and
(1−γ1)ei,j .w, respectively (γ1 ∼ U(0, 1)). For edge insertions, we randomly choose 2%|E|
to 10%|E| pairs of vertices. For each chosen pair (vi, vj), we add an edge to connect them
with a weight of γ2 · d(vi, vj) (γ2 ∼ N (1, 1)). For edge deletions, we randomly delete
2%|E| to 10%|E| edges that do not create a disconnected graph.
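The vertex-insertion update described above can be sketched as follows for an undirected adjacency-dict graph (the representation and function name are illustrative):

```python
import random

def insert_vertex_on_edge(graph, next_id, rng=None):
    """Place a new vertex on a random edge (i, j) of weight w, splitting
    it into weights gamma*w and (1-gamma)*w with gamma ~ U(0, 1).
    graph: undirected adjacency dict {u: {v: w, ...}}."""
    if rng is None:
        rng = random.Random(0)
    edges = [(u, v) for u in graph for v in graph[u] if u < v]
    i, j = rng.choice(edges)
    w = graph[i].pop(j)       # remove the original edge
    graph[j].pop(i)
    gamma = rng.random()      # gamma_1 ~ U(0, 1)
    graph[next_id] = {i: gamma * w, j: (1.0 - gamma) * w}
    graph[i][next_id] = gamma * w
    graph[j][next_id] = (1.0 - gamma) * w
    return next_id
```

The split preserves the total length along i-new-j, so existing shortest-path distances through the edge are unchanged.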
The MnAE results are summarized in Figs. 5.2 and 5.3. As the figures show, our vdist2vec model is robust against updates. For road networks, the performance gaps among the different models are maintained as new vertices or edges are inserted (Figs. 5.2a and 5.2b). Edge deletions have a stronger impact, since they may introduce long detours
in the shortest paths. Even so, only geodnn produces a slightly smaller MnAE than ours
when more than 6% of the edges are deleted, while our vdist2vec model is much better when there are fewer than 6%|E| deletions (Figure 5.2c). For social networks, our
model outperforms other methods in vertex insertions as shown in Figure 5.3a. For edge
insertions and edge deletions, our model and node2vec have similar results that outperform the other baselines except for auto-encoder. As we described in Section 5.2.1, auto-encoder tends to generate embeddings that preserve the average distances between the vertices for social networks. This property helps it resist social network updates, as the average distances over all vertex pairs do not change much after updates.
Similar patterns are observed in the other measures. We omit them for succinctness.
5.2.6 Impact of Embedding Dimensionality
Now we study how the dimensionality of the vertex embedding (which is also the size
of the embedding layer of vdist2vec and the number of landmarks of the landmark approaches), i.e., k, impacts distance prediction time and errors. We vary k from 0.125%|V|
to 2%|V |, and Figs. 5.4 and 5.5 show the algorithm performance on the DG dataset and
FBPOL dataset respectively. In general, as k increases, the prediction errors drop, while
the preprocessing and prediction times increase. This is expected since a larger k means
more distance information may be preserved and more computation needed to encode
such information. Geodnn is not impacted by k, as its predictions are based on geo-
coordinates only.
Similar to previous experiments, vdist2vec yields the lowest prediction errors for almost all cases tested. There are three exceptions: (i) geodnn has the smallest MxRE for k ≤ 20 on DG (Figure 5.4d), since it is not impacted by k, while the other approaches suffer from a small number of landmarks or a low embedding dimensionality; (ii) auto-encoder has the smallest MxAE on FBPOL, since it fails to learn useful information for this social network and predicts the average distance for all pairs, which avoids large errors; (iii) node2vec has smaller MxAE and MxRE for k = 8 on FBPOL (Figs. 5.5c and 5.5d), as embeddings learned by node2vec with a low dimensionality are highly similar, falling into the same situation as auto-encoder. Learning based models take more time to preprocess (Figs. 5.4e and 5.5e), but the training time does not increase much as k does. Also, vdist2vec has a
very small prediction time (e.g., 1µs for k = 160, Figs. 5.4f and 5.5f). This enables using
large k values to produce more accurate distance predictions.
Figure (5.4) Impact of k (DG): (a) MnAE, (b) MnRE, (c) MxAE, (d) MxRE, (e) PT (h), and (f) QT (µs), for k = 10 to 160; charts omitted from this transcript.
In terms of precomputation (training) time, the impact of k is smaller, and the relative performance between the algorithms follows that shown in Tables 5.12 and 5.13. We omit the detailed figures for succinctness.
Figure (5.5) Impact of k (FBPOL): (a) MnAE, (b) MnRE, (c) MxAE, (d) MxRE, (e) PT (h), and (f) QT (µs), for k = 8 to 120; charts omitted from this transcript.
5.2.7 Impact of MLP
In this subsection, we study the impact of the MLP on distance prediction performance with two sets of experiments. Our first set of experiments combines the landmark approaches with the MLP, i.e., training an MLP to predict vertex distances based on distance vectors computed by the landmark approaches. We show that simply using an MLP to learn a distance prediction function for the landmark approaches does not obtain prediction errors as low as those achieved by our vdist2vec model. This further confirms the importance of our vertex embedding learning model. Table 5.14 summarizes
the experimental results on the DG dataset, where “-mlp” denotes a landmark approach
with an MLP for distance prediction. We see that, while landmark-bt-mlp is better than
landmark-bt in both the mean and maximum errors, landmark-dg-mlp is worse than
landmark-dg in the mean errors. Meanwhile, both landmark-bt-mlp and landmark-dg-
mlp have larger errors than our model vdist2vec, which highlights the advantages of our
vertex embeddings. In terms of the preprocessing time, adding an MLP to the landmark
approaches leads to a higher preprocessing time as expected. The distance prediction
times are similar to those listed in Table 5.6, and hence they are omitted for succinctness.
Table (5.14) Effectiveness of Embedding Learning (DG)

Method           MnAE   MnRE   MxAE    MxRE   PT
landmark-bt      2,234  0.442  56,058  4,713  0.1h
landmark-bt-mlp  682    0.085  36,239  658    2.1h
landmark-dg      136    0.060  34,053  1,425  0.1s
landmark-dg-mlp  632    0.083  28,003  1,421  2.1h
vdist2vec        135    0.015  9,050   193    2.1h
Our second set of experiments studies the impact of the MLP structure. We vary the
number of nodes in the two hidden layers (denoted by L1 and L2) of the MLP and summarize the model performance in Tables 5.15 and 5.16. For benchmarking purposes, we also show the performance of landmark-dg-mlp, as it has lower prediction errors than landmark-bt-mlp. We see that, as more nodes are used in the MLP, the prediction errors
rors drop, while the training and prediction times increase. These are natural, because a
larger MLP can better approximate the complex relationship between the vertex embed-
66
Experiments 5.2 Results
dings (distance vectors) and the vertex distances, which also takes more time to run. The
distance prediction errors do not reduce linearly with the number of nodes in the MLP.
For example, when the MLP grows from 100× 20 to 200× 40, the mean errors only drop
slightly. Thus, we have used a 100× 20 MLP network by default for efficiency.
Table (5.15) Impact of MLP Structure: landmark-dg-mlp (DG)

L1 nodes  L2 nodes  MnAE   MnRE   MxAE    MxRE   PT    QT
25        5         2,318  0.328  49,951  2,393  1.8h  0.852µs
50        10        977    0.150  29,950  1,267  1.9h  0.922µs
100       20        632    0.083  28,003  1,421  2.1h  1.122µs
200       40        530    0.049  24,006  851    2.2h  1.237µs

Table (5.16) Impact of MLP Structure: vdist2vec (DG)

L1 nodes  L2 nodes  MnAE   MnRE   MxAE    MxRE   PT    QT
25        5         375    0.048  21,120  541    1.9h  0.836µs
50        10        164    0.020  15,604  387    2.1h  0.909µs
100       20        135    0.015  9,050   193    2.3h  1.039µs
200       40        132    0.014  8,560   137    2.4h  1.221µs
Experimental results on the other datasets show similar patterns. They are omitted
for conciseness.
5.2.8 Impact of Number of Center Vertices
In this section, we study the impact of the number of center vertices used for handling
large graphs, i.e., |Vc|, which is varied from 18 to 600 as shown in Table 5.17 (for the
SH dataset, where 0.1%|V |= 75). As expected, more center vertices yield lower distance
prediction errors but also take more time to process. This can be thought of as having
more landmarks to preserve more distance information. We omit the prediction time as it
is not impacted by |Vc|. Experimental results on the other datasets show similar patterns.
They are omitted for succinctness.
Table 5.17: Impact of Number of Center Vertices (SH)

|Vc|    MnAE   MnRE    MxAE  MxRE     PT
18     8,102  0.401  52,893    27   5.7m
37     3,831  0.203  35,683    20   5.9m
75     2,549  0.126  27,753    18   6.2m
150    1,795  0.091  26,091    15   7.5m
300    1,290  0.068  24,463    11   9.1m
600      951  0.049  24,229     8  11.3m
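For road networks, the center vertices are obtained by clustering the vertex geo-coordinates with k-means. The sketch below is one plausible realization under assumptions: a basic Lloyd's-iteration k-means, with each cluster's center vertex taken as the vertex nearest its centroid; the exact selection procedure is the one defined in Chapter 4.

```python
import numpy as np

def select_center_vertices(coords, num_centers, iters=20, seed=0):
    """Pick |Vc| center vertices by k-means over vertex geo-coordinates.

    coords : (n, 2) array of (latitude, longitude) per vertex.
    Returns the indices of the vertices nearest to each cluster centroid.
    """
    rng = np.random.default_rng(seed)
    n = coords.shape[0]
    centroids = coords[rng.choice(n, num_centers, replace=False)]
    for _ in range(iters):                    # Lloyd's iterations
        d = np.linalg.norm(coords[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)             # nearest-centroid assignment
        for c in range(num_centers):          # recompute each centroid
            members = coords[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    d = np.linalg.norm(coords[:, None] - centroids[None], axis=2)
    return [int(d[:, c].argmin()) for c in range(num_centers)]

coords = np.random.default_rng(1).random((200, 2))  # 200 synthetic vertices
centers = select_center_vertices(coords, 5)
```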
5.2.9 Impact of Loss Function
In this section, we study the impact of the loss function. Tables 5.18 to 5.20 show the
performance of our vdist2vec model with different loss functions including mean square
error (MnSE), reverse Huber loss (REVHL), and mean cube error (MnCE) on graphs DG,
MB, SU, EPA, FBPOL, and FBTV. We measure the results in MnAE, MnRE, MxAE, and
MxRE. The processing and query times are similar across the loss functions and are
omitted here.
Table 5.18: Impact of Loss Function (DG and MB)

                 DG                              MB
LF      MnAE   MnRE   MxAE  MxRE      MnAE   MnRE  MxAE  MxRE
MnSE     135  0.015  9,050   193        12  0.014   376    16
REVHL     75  0.015  9,723   212         6  0.014   820    41
MnCE     199  0.022  8,422   189        12  0.017   205    15
Table 5.19: Impact of Loss Function (SU and EPA)

                 SU                              EPA
LF      MnAE   MnRE   MxAE  MxRE      MnAE   MnRE  MxAE  MxRE
MnSE      83  0.027  4,784    63     0.023  0.006     4     4
REVHL     50  0.024  4,750    92     0.019  0.006     4     4
MnCE     114  0.039  4,133    56     0.027  0.007     3     3
Table 5.20: Impact of Loss Function (FBTV and FBPOL)

                 FBTV                          FBPOL
LF       MnAE   MnRE  MxAE  MxRE     MnAE   MnRE  MxAE  MxRE
MnSE    0.137  0.026     4     4    0.133  0.034     4     4
REVHL   0.126  0.026     4     4    0.117  0.031     5     5
MnCE    0.179  0.033     2     2    0.153  0.037     2     2
As shown in Tables 5.18 to 5.20, compared with MnSE, applying REVHL reduces the
MnAE by at least 8% (0.126 vs. 0.137 on FBTV) and up to 50% (6 vs. 12 on MB), and
reduces the MnRE by up to 11% (0.024 vs. 0.027 on SU). However, as discussed in Sec-
tion 4.2, higher MxAE and MxRE are expected when using REVHL. For example, using
REVHL brings a 25% increase in the maximum errors on FBPOL. It also increases MxAE
by 118% (820 vs. 376) and MxRE by 156% (41 vs. 16) on MB. On graphs with larger di-
ameters, REVHL increases MxAE by only 7% on DG and even reduces MxAE by 0.7% on
SU, although it increases MxREs on both graphs, by 9.8% and 46%, respectively. Thus,
REVHL is recommended when optimizing the mean errors is prioritized.
Applying MnCE as the loss function, on the other hand, reduces MxAE by at least 7%
(8,422 vs. 9,050 on DG) and up to 50% (2 vs. 4 on FBTV and FBPOL), while suffering up
to a 47% increase in MnAE (199 vs. 135 on DG) and in MnRE (0.022 vs. 0.015 on DG).
Thus, MnCE is recommended when the maximum errors are the focus.
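For reference, one common formulation of the three per-pair loss contributions, applied to a single signed prediction error e, is sketched below; the berHu threshold c = 1 and the use of the absolute error in the cube loss are assumptions here, since Chapter 4 defines the exact variants used.

```python
def mean_square_error(e):
    """MnSE contribution of one signed error e = predicted - true."""
    return e * e

def reverse_huber(e, c=1.0):
    """REVHL (berHu): linear near zero, quadratic beyond the threshold c.

    The constant gradient of the linear region keeps pushing small errors
    toward zero, which helps the mean errors (MnAE, MnRE)."""
    a = abs(e)
    return a if a <= c else (a * a + c * c) / (2 * c)

def mean_cube_error(e):
    """MnCE: cubic growth punishes the largest errors hardest, which
    helps the maximum errors (MxAE, MxRE)."""
    return abs(e) ** 3

def batch_loss(errors, per_pair_loss):
    """A batch loss is the mean of the per-pair contributions."""
    return sum(per_pair_loss(e) for e in errors) / len(errors)
```

The shapes mirror the experimental results: the reverse Huber's linear region favors small mean errors, while the cubic loss concentrates the penalty on the largest errors.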
5.3 Other Embedding Applications
In this section, we test the general applicability of our vdist2vec model to other graph
problems including graph reconstruction and link prediction. We compare it with other
graph embedding methods including Graph Factorization (GF) [13], High-Order Proximity
preserved Embedding (HOPE) [71], Laplacian Eigenmaps (LE) [16], Locally Linear
Embedding (LLE) [82], node2vec [48], and Structural Deep Network Embedding
(SDNE) [96] on the social network graphs FBTV and FBPOL. We apply the settings
suggested by [46] and use an embedding dimensionality of 128 in these experiments.
We use the mean average precision (MAP) [46] as the evaluation metric in this set of
experiments, which is the mean of the per-vertex precision values. Precision here is the
fraction of a vertex's links that are correctly predicted.
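Under this per-vertex definition, the metric can be computed as in the following sketch; the dict-of-neighbor-sets representation and the handling of vertices without links are our assumptions, and the ranking-based MAP of [46] may differ in detail.

```python
def mean_average_precision(predicted, actual):
    """MAP as defined above: the mean, over vertices, of the fraction of a
    vertex's true links that are correctly predicted.

    predicted, actual : dicts mapping each vertex id to a set of neighbor ids.
    """
    scores = []
    for v, true_links in actual.items():
        if not true_links:
            continue                 # skip vertices with no links
        hits = len(predicted.get(v, set()) & true_links)
        scores.append(hits / len(true_links))
    return sum(scores) / len(scores)

# toy example: vertex 0's links fully recovered, vertex 1's half recovered
pred = {0: {1, 2}, 1: {0}}
true = {0: {1, 2}, 1: {0, 2}}
map_score = mean_average_precision(pred, true)   # (1.0 + 0.5) / 2 = 0.75
```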
For graph reconstruction, GF, HOPE, and node2vec apply the dot product on the em-
beddings of the two vertices for link reconstruction to yield a prediction. If the prediction
score is above a threshold, we consider that a link should be reconstructed between the
two vertices. LE and LLE apply Equation 5.1 below and compare the result with a
threshold to test if there should be a link between vertices vi and vj:

e^(-dist(vi, vj))    (5.1)

Here, dist() is a function that returns the L2 distance between two vectors. For SDNE, due
to its auto-encoder structure, it is applied on the graph adjacency matrix, and its decoder
outputs the adjacency matrix of the reconstructed graph. As our vdist2vec model can
predict the distance between two vertices, we simply consider two vertices to be linked
when their predicted distance is smaller than 2 (when running on social networks).
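The three link decision rules above can be summarized in a short sketch; the threshold values are placeholders, l2 plays the role of dist() in Equation 5.1, and the exponential-of-negative-distance form for LE/LLE is our reading of that equation.

```python
import math

def l2(u, v):
    """L2 distance between two embedding vectors (the dist() of Equation 5.1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def link_by_dot_product(z_i, z_j, threshold):
    """GF / HOPE / node2vec: reconstruct a link when the dot product is high."""
    return sum(a * b for a, b in zip(z_i, z_j)) >= threshold

def link_by_similarity(z_i, z_j, threshold):
    """LE / LLE: compare e^(-dist(vi, vj)) with a threshold (Equation 5.1)."""
    return math.exp(-l2(z_i, z_j)) >= threshold

def link_by_predicted_distance(predicted_distance):
    """vdist2vec on social networks: link if the predicted distance is below 2."""
    return predicted_distance < 2
```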
For the link prediction tests, we randomly remove 20% of the edges of a graph and
apply the embedding methods to the reduced graph. At testing time, we reconstruct the
links based on the learned embeddings and evaluate them against the original graph.
Table 5.21: Graph Reconstruction and Link Prediction Performance Measured in MAP

                       Graph reconstruction      Link prediction
                       FBTV      FBPOL           FBTV      FBPOL
baseline  GF           0.6695    0.4490          0.6467    0.5742
          HOPE         0.0093    0.0093          0.0064    0.0053
          LE           0.0077    0.0070          0.0043    0.0054
          LLE          0.0078    0.0069          0.0044    0.0049
          node2vec     0.5883    0.0061          0.2144    0.0069
          SDNE         0.0108    0.0066          0.0017    0.0028
proposed  vdist2vec    0.9919    0.9914          0.9755    0.9885
Table 5.21 shows that our model outperforms the baselines in graph reconstruction and
link prediction on FBTV and FBPOL by at least 32% (0.9919 vs. 0.6695 for graph
reconstruction on FBTV) and up to 99% (0.9914 vs. 0.0061 for graph reconstruction on
FBPOL). As our model learns the embeddings directly from the distances, the embeddings
contain both local and global structural information. However, learning from the
distances of all vertex pairs also leads to a longer training time, as shown in Table 5.22.
Table 5.22: Graph Reconstruction and Link Prediction Processing Time

                       Graph reconstruction      Link prediction
                       FBTV      FBPOL           FBTV      FBPOL
baseline  GF           0.4h      0.9h            0.2h      0.4h
          HOPE         5.5s      14.2s           5.7s      14.5s
          LE           3.0s      3.8s            2.9s      4.4s
          LLE          10.9s     9.6s            8.8s      18.3s
          node2vec     6.2s      14.7s           6.0s      15.1s
          SDNE         0.1h      0.3h            3.8m      0.2h
proposed  vdist2vec    1.2h      1.6h            1.2h      1.6h
5.4 Summary
In this chapter, we compared our proposed vdist2vec model and its variants with base-
lines in several aspects and applications. We first conducted an overall comparison on
road networks, social networks, and a web page graph, measured by mean absolute error,
mean relative error, max absolute error, max relative error, processing time, and query
time. Our models outperform the baselines by up to 97% in distance prediction accuracy
with a comparable query time. As a learning based method, our models have longer pre-
processing times. We argue that this cost is worthwhile, as our models have much smaller
distance prediction errors. We then analyzed the impact of embedding dimensionality
and graph updates on our vdist2vec model. Our vdist2vec model has the best perfor-
mance in most of these scenarios. We also studied the impact of the MLP structure and
confirmed that a model with a larger network size can yield more accurate distance
predictions, although the model training time also increases. To handle larger graphs,
we tested our model with different numbers of clusters and showed that more clusters
increase the accuracy (but also the model training time). Finally, we evaluated the
applicability of our learned embeddings on graph reconstruction and link prediction tasks.
Chapter 6
Conclusions and Future Work
6.1 Conclusions
We studied the graph shortest-path distance problem and proposed a representation
learning based approach to solve the problem. Our approach learns vertex embeddings
that preserve the distances between vertices. Storing the learned embeddings only takes
a space cost linear to the number of vertices. At query time, the embeddings of the two
query vertices are fed into a multi-layer perceptron to predict the distance between the
two vertices, which takes a constant time. Thus, our approach avoids the high costs in
space and time that may impede the applicability of existing approaches. Experimental
results show that our approach is highly efficient. Compared with state-of-the-art ap-
proaches on road network, social network, and web page graphs, our approach achieves
much smaller mean and maximum prediction errors, with an advantage up to 97%.
In Chapter 2, we reviewed different methods for shortest-path distance prediction.
Traditional methods have either large space cost or large time cost in query processing.
Recent studies tend to trade preprocessing time or accuracy for lower space and time
costs at query time. For example, distance labeling methods preprocess a graph into a hi-
erarchical structure that groups vertices based on heuristic rules. For each group, a set of
vertices are selected as the representation vertices (labels) which will be used in distance
queries between the vertices in the group to those in the other groups. The accuracy of
distance labeling methods relies on their heuristics to build the vertex hierarchy. In the
worst case, their label size is still O(n²) (i.e., when each vertex forms a group). Another
direction is landmark methods. A landmark method precomputes part of the shortest-path
distances among the vertices to reduce both the space cost to O(kn) and the query time
cost to O(k) with approximate results. The accuracy of landmark methods highly de-
pends on the selection of landmarks. A more recent method uses graph embedding,
which was previously used in other graph problems such as graph reconstruction, link
prediction, and graph visualization, since the learned embeddings contain the connec-
tion information between vertices. The distance between vertices can be considered as
containing both global connection information (multi-hop) and local connection informa-
tion (single-hop). Hence, graph embedding has the potential to be applied to shortest-path
distance prediction. Using the graph embedding idea, we only need an n × k matrix to
store the vertex embeddings, with a space cost of O(kn), while queries can be processed
efficiently (e.g., by examining the embeddings of the two query vertices in O(k) time).
In Chapter 3, we proposed a two-stage model framework that consists of a graph rep-
resentation learning network and a distance prediction network. We adapted node2vec,
auto-encoder, geo-coordinates, and landmark labels for the representation learning net-
work and trained the distance prediction network based on the learned vertex represen-
tations. Node2vec is based on discrete sampling. It only needs to observe a part of the
graph for vertex representation learning, which makes it scalable to large graphs. Since it
focuses more on local information (shared neighbours), in shortest-path distance predic-
tion tasks, it performs better on graphs with a small diameter, such as social networks,
while it suffers on graphs with a large diameter, such as road networks. Auto-encoder
works better on road networks, while it fails in learning useful information for social
networks. When learning embeddings from a graph with a low distance variance, the
auto-encoder tends to learn the same embedding for all vertices, predicting the average
distance for every vertex pair. Using geo-coordinates as the embeddings in road networks
cannot reflect the
detour distances (e.g., to cross a bridge). Its performance dropped dramatically for road
networks that cross rivers or mountains. Using landmark labels as the embeddings over-
comes such a limitation. However, it is challenging to select high-quality landmarks to
obtain accurate distance predictions. As these two-stage models separate the distance
prediction network from the graph representation learning network, the embeddings
learned may not be optimized for distance prediction.
In Chapter 4, we proposed a one-stage model called vdist2vec that learns the embed-
dings guided directly by the shortest-path distances. By connecting the graph represen-
tation learning network and the distance prediction network, our vdist2vec model can
learn better embeddings for shortest-path distance prediction. We then studied the loss
functions and their impact on the model. We proposed the reverse Huber loss function,
which leads our model to reduce the mean prediction errors. We also found that a
higher-order polynomial loss function can yield lower maximum prediction errors. We
further proposed an ensembling based model to achieve even higher distance predic-
tion accuracy, and a hierarchical model to scale to large graphs. In addition, we showed
that our learned embeddings can also be applied to other graph problems such as graph
reconstruction and link prediction.
In Chapter 5, we tested our vdist2vec model on road network graphs, social net-
work graphs, and web page graphs for shortest-path distance prediction. Compared with
the baseline methods, our vdist2vec model has up to 97% lower prediction errors. As a
learning based method, our model takes longer to train than the heuristic methods, which
is expected. For query processing, our model has a smaller query time, as it can be easily
parallelized to use the full computation power of GPUs. We then stud-
ied the impact of embedding dimensionality, MLP structure, and loss functions on our
model. Our vdist2vec model can yield a high prediction accuracy with a small embed-
ding dimensionality, which shows that our model is highly space efficient. When a larger
number of embedding dimensions is allowed, our model performance improves further,
although longer model training times and query times are expected. Similarly, a larger
MLP with more layers and nodes contributes to a higher distance prediction accuracy,
but this also requires a longer training time. In terms of the loss function, we tested mean
square error, mean cube error, and our proposed reverse Huber loss. Reverse Huber loss
is more effective to reduce the mean prediction errors, while mean cube error is more ef-
fective to reduce the maximum prediction errors. We also tested our hierarchical model
for larger graphs. This model can dramatically reduce the training time while not losing
too much distance prediction accuracy, especially when there are more center vertices.
We then adapted our model to other graph applications including link prediction and
graph reconstruction. The experiment results show that our learned embeddings are also
effective in these applications.
6.2 Future Work
As a distance guided machine learning model, our vdist2vec model may have a longer
preprocessing time than heuristic methods. For future work, we plan to investigate graph
clustering techniques to help further boost the efficiency and accuracy of distance pre-
diction over large graphs. For now, we have applied the k-means clustering algorithm
on road network graphs based on the geo-coordinates of the vertices. We may explore
hierarchical structures such as well-separated pair decomposition [86], contraction hi-
erarchy [44], and highway hierarchy [58] and combine them with our model. We may
apply our vdist2vec model on their representative vertices. For example, after we use
the highway hierarchy to generate highways, we can view each highway as a vertex and
learn an embedding for it. At query time, instead of storing all the distances from and to
the highways, we can use the embeddings of the vertices and the highways to predict the
distance. Further, we may adapt the hierarchy construction rules into our models so that
the models can automatically learn embeddings of the clusters (or representative ver-
tices). We can use geo-coordinates in our training process and design a gated mechanism
to decide whether two vertices are well-separated or not. If they are not well-separated,
the model will directly learn the embeddings based on their distance. If they are well-
separated, the model only learns the embeddings based on the distance between their
corresponding representative vertices.
Our learned embeddings do not focus on the structural information of a graph. To
overcome this limitation, one direction is to apply spectral clustering [95], which clusters
vertices based on the eigenvectors of the adjacency matrix of a graph. In spectral
clustering, vertices are represented by eigenvectors, which can serve as an initialisation
for our embed-
dings in graph representation learning. As spectral clustering aims to map similar ver-
tices to be close in the latent space, using the embeddings as an initialisation may help
our model to train more efficiently and keep the structure information. This also allows
us to handle complex graphs such as social networks and web page graphs which do
not have a coordinate associated to the vertices. However, spectral clustering could be
costly for large graphs as its computational complexity is high (O(n³) [98]). For better
scalability, we may explore approximate spectral clustering methods. Recent research on
approximate spectral clustering mainly has two directions. One direction is to compute
an approximate spectral embedding, such as power iteration clustering (PIC) [65]. Power
iteration is an algorithm that computes the largest eigenvector of a matrix approximately.
It runs recursively with a preset number of iterations to approach the largest eigenvec-
tor step by step. In PIC, instead of getting only the largest eigenvector (the result of the
last iteration), the intermediate values (results in intermediate iterations) are also used
as approximate eigenvectors. Compared with the original spectral clustering, PIC reduces
the computation time significantly with a fair performance [65]. Another direction is to
represent the graph with fewer vertices and run spectral clustering on the representative
vertices. For example, k-means-based approximate spectral clustering (KASP) [98] first ap-
plies k-means to partition the vertices into k clusters, and then runs spectral clustering
only on the center vertices of each cluster. As this method needs to run k-means
clustering first, it may not be feasible on complex networks. Besides spectral clustering,
the other graph embedding methods introduced in Section 2.3 can also be used for
initialisation.
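The power-iteration idea behind PIC can be illustrated with a minimal sketch: repeatedly multiplying a normalized affinity matrix into a vector approaches its dominant eigenvector, and the rescaled intermediate iterates can serve as approximate one-dimensional spectral embeddings. The row normalization and fixed iteration count below are simplifying assumptions relative to the full PIC algorithm of [65].

```python
import numpy as np

def power_iteration_embedding(W, num_iters=30, seed=0):
    """Approximate spectral embedding via power iteration (PIC-style).

    W : (n, n) symmetric non-negative affinity matrix.
    Returns the list of rescaled intermediate iterates; PIC uses these
    intermediate values, not just the final (dominant) eigenvector."""
    rng = np.random.default_rng(seed)
    P = W / W.sum(axis=1, keepdims=True)   # row-normalized affinities
    v = rng.random(W.shape[0])
    iterates = []
    for _ in range(num_iters):
        v = P @ v                          # one power-iteration step
        v = v / np.abs(v).sum()            # rescale to avoid over/underflow
        iterates.append(v.copy())
    return iterates

# toy affinity matrix with two tightly connected vertex pairs
W = np.array([[1.0, 1.0, 0.01, 0.01],
              [1.0, 1.0, 0.01, 0.01],
              [0.01, 0.01, 1.0, 1.0],
              [0.01, 0.01, 1.0, 1.0]])
emb = power_iteration_embedding(W)[5]      # an intermediate iterate
```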
Bibliography
[1] 9th DIMACS implementation challenge - shortest paths. http://users.diag.uniroma1.it/challenge9/download.shtml, 2006. [Online; accessed 28-August-2019].
[2] Networkx. https://networkx.github.io, 2014. [Online; accessed 28-August-
2019].
[3] Stanford large network dataset collection. http://snap.stanford.edu/data,
2015. [Online; accessed 28-August-2019].
[4] Urban road network data. https://figshare.com/articles/Urban_Road_Network_Data/2061897, 2016. [Online; accessed 28-August-2019].
[5] Google announces over 2 billion monthly active devices on android.
https://www.theverge.com/2017/5/17/15654454/android-reaches-2-billion-monthly-active-users, 2017. [Online; accessed 28-August-
2019].
[6] Planet OSM. https://planet.osm.org, 2017. [Online; accessed 28-August-
2019].
[7] Facebook announces over 2 billion monthly social network users. https://www.nbcnews.com/tech/tech-news/facebook-hits-2-27-billion-monthly-active-users-earnings-stabilize-n926391, 2018. [Online;
accessed 25-January-2020].
[8] Netcraft web server survey shows more than 1.2 billion active websites in 2020. https://news.netcraft.com/archives/category/web-server-survey/, 2020. [Online; accessed 25-January-2020].
[9] Ittai Abraham, Yair Bartal, Jon Kleinberg, T-H. Hubert Chan, Ofer Neiman, Kedar
Dhamdhere, Aleksandrs Slivkins, and Anupam Gupta. Metric embeddings with
relaxed guarantees. In Proceedings of the 46th Annual IEEE Symposium on Foundations
of Computer Science (FOCS), pages 83–100, 2005.
[10] Ittai Abraham, Daniel Delling, Andrew V. Goldberg, and Renato F. Werneck. A
hub-based labeling algorithm for shortest paths in road networks. In Proceedings of
the 10th International Symposium on Experimental Algorithms (SEA), pages 230–241,
2011.
[11] Ittai Abraham, Daniel Delling, Andrew V. Goldberg, and Renato F. Werneck. Hier-
archical hub labelings for shortest paths. In Proceedings of the 20th European Sympo-
sium on Algorithms (ESA), pages 24–35, 2012.
[12] Lada A. Adamic and Eytan Adar. Friends and neighbors on the web. Social Net-
works, 25(3):211–230, 2003.
[13] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and
Alexander J Smola. Distributed large-scale natural graph factorization. In Proceed-
ings of the 22nd International Conference on World Wide Web (WWW), pages 37–48,
2013.
[14] Takuya Akiba, Yoichi Iwata, and Yuichi Yoshida. Fast exact shortest-path distance
queries on large networks by pruned landmark labeling. In Proceedings of the 2013
ACM SIGMOD International Conference on Management of Data (SIGMOD), pages
349–360, 2013.
[15] Mukund Balasubramanian and Eric L. Schwartz. The Isomap algorithm and topo-
logical stability. Science, 295(5552):7–7, 2002.
[16] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques
for embedding and clustering. In Advances in Neural Information Processing Systems
(NIPS), pages 585–591, 2002.
[17] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A
review and new perspectives. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(8):1798–1828, 2013.
[18] Jean Bourgain. On lipschitz embedding of finite metric spaces in hilbert space.
Israel Journal of Mathematics, 52(1-2):46–52, 1985.
[19] Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical
Sociology, 25(2):163–177, 2001.
[20] Ronald L. Breiger, Harrison C. White, and Scott A. Boorman. Social structure from
multiple networks. The American Journal of Sociology, 81(4):730–780, 1976.
[21] Hongyun Cai, Vincent W. Zheng, and Kevin C. Chang. A comprehensive survey
of graph embedding: Problems, techniques, and applications. IEEE Transactions on
Knowledge and Data Engineering, 30(9):1616–1637, 2018.
[22] Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representa-
tions with global structural information. In Proceedings of the 24th ACM Interna-
tional on Conference on Information and Knowledge Management (CIKM), pages 891–
900, 2015.
[23] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Deep neural networks for learning
graph representations. In Proceedings of the 30th National Conference of the American
Association for Artificial Intelligence (AAAI), pages 1145–1152, 2016.
[24] Lijun Chang, Jeffrey X. Yu, Lu Qin, Hong Cheng, and Miao Qiao. The exact distance
to destination in undirected world. VLDB Journal, 21(6):869–888, 2012.
[25] Shiri Chechik. Approximate distance oracles with improved bounds. In Proceedings
of the 47th Annual ACM Symposium on Theory of Computing (STOC), pages 1–10, 2015.
[26] Wei Chen, Christian Sommer, Shang-Hua Teng, and Yajun Wang. A compact rout-
ing scheme and approximate distance oracle for power-law graphs. ACM Transac-
tions on Algorithms, 9(1):4, 2012.
[27] Kenneth W. Church and Patrick Hanks. Word association norms, mutual informa-
tion, and lexicography. Computational Linguistics, 16(1):22–29, 1990.
[28] Aaron Clauset, Cristopher Moore, and Mark E. J. Newman. Hierarchical structure
and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.
[29] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. Reachability and dis-
tance queries via 2-hop labels. SIAM Journal on Computing, 32(5):1338–1355, 2003.
[30] Atish Das Sarma, Sreenivas Gollapudi, Marc Najork, and Rina Panigrahy. A sketch-
based distance oracle for web-scale graphs. In Proceedings of the 3rd ACM Interna-
tional Conference on Web Search and Data Mining (WSDM), pages 401–410, 2010.
[31] Giuseppe Di Battista, Peter Eades, Roberto Tamassia, and Ioannis G. Tollis. Algo-
rithms for drawing graphs: An annotated bibliography. Computational Geometry, 4
(5):235–282, 1994.
[32] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische
Mathematik, 1(1):269–271, 1959.
[33] Paul Erdős. On some extremal problems in graph theory. Israel Journal of Mathemat-
ics, 3(2):113–116, 1965.
[34] Tomas Feder and Rajeev Motwani. Clique partitions, graph compression and
speeding-up algorithms. Journal of Computer and System Sciences, 51(2):261–272,
1995.
[35] Raphael A. Finkel and Jon L. Bentley. Quad trees a data structure for retrieval on
composite keys. Acta Informatica, 4(1):1–9, 1974.
[36] Robert W. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5(6):
345, 1962.
[37] Michael L. Fredman and Robert E. Tarjan. Fibonacci heaps and their uses in im-
proved network optimization algorithms. Journal of the ACM, 34(3):596–615, 1987.
[38] Linton C. Freeman. A set of measures of centrality based on betweenness. Sociom-
etry, pages 35–41, 1977.
[39] Linton C. Freeman. Centrality in social networks conceptual clarification. Social
Networks, 1(3):215–239, 1978.
[40] Linton C. Freeman. Visualizing social networks. Journal of Social Structure, 1(1):4,
2000.
[41] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical
learning, volume 1. Springer, 2001.
[42] Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic
relational models. In Proceedings of the 16th International Joint Conference on Artificial
Intelligence (IJCAI), volume 99, pages 1300–1309, 1999.
[43] Ada W. Fu, Huanhuan Wu, James Cheng, and Raymond C. Wong. IS-Label: An
independent-set based labeling scheme for point-to-point distance querying. Pro-
ceedings of the VLDB Endowment, 6(6):457–468, 2013.
[44] Robert Geisberger, Peter Sanders, Dominik Schultes, and Daniel Delling. Contrac-
tion hierarchies: Faster and simpler hierarchical routing in road networks. In Pro-
ceedings of the International Workshop on Experimental and Efficient Algorithms (WEA),
pages 319–333, 2008.
[45] Andrew V. Goldberg, Haim Kaplan, and Renato F. Werneck. Better landmarks
within reach. In Proceedings the 6th International Workshop on Experimental and Effi-
cient Algorithms (WEA), pages 38–51, 2007.
[46] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and
performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
[47] Irina Gribkovskaia, Øyvind Halskau Sr, and Gilbert Laporte. The bridges of
Königsberg—a historical perspective. Networks: An International Journal, 49(3):199–
203, 2007.
[48] Aditya Grover and Jure Leskovec. Node2vec: Scalable feature learning for net-
works. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining (SIGKDD), pages 855–864, 2016.
[49] Andrey Gubichev, Srikanta Bedathur, Stephan Seufert, and Gerhard Weikum. Fast
and accurate estimation of shortest paths in large graphs. In Proceedings of the
19th ACM International Conference on Information and Knowledge Management (CIKM),
pages 499–508, 2010.
[50] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on
graphs: Methods and applications. IEEE Data Engineering Bulletin, 40:52–74, 2017.
[51] David Heckerman, Chris Meek, and Daphne Koller. Probabilistic entity-
relationship models, prms, and plate models. Introduction to Statistical Relational
Learning, pages 201–238, 2007.
[52] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of
data with neural networks. Science, 313(5786):504–507, 2006.
[53] Gisli R. Hjaltason and Hanan Samet. Properties of embedding methods for simi-
larity searching in metric spaces. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 25(5):530–549, 2003.
[54] Thomas Hofmann and Joachim Buhmann. Multidimensional scaling and data clus-
tering. In Advances in Neural Information Processing Systems (NIPS), pages 459–466,
1995.
[55] Peter J. Huber. Robust estimation of a location parameter. In Breakthroughs in Statis-
tics, pages 492–518. 1992.
[56] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR
techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
[57] Minhao Jiang, Ada W. Fu, Raymond C. Wong, and Yanyan Xu. Hop doubling label
indexing for point-to-point distance querying on scale-free networks. Proceedings
of the VLDB Endowment, 7(12):1203–1214, 2014.
[58] Ruoming Jin, Ning Ruan, Yang Xiang, and Victor Lee. A highway-centric labeling
approach for answering distance queries on large sparse graphs. In Proceedings of
the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD),
pages 445–456, 2012.
[59] Ishan Jindal, Zhiwei Qin, Xuewen Chen, Matthew Nokleby, and Jieping Ye. A
unified neural network approach for estimating travel time and distance for a taxi
trip. arXiv preprint arXiv:1710.04350, 2017.
[60] Dieter Jungnickel. Graphs, networks and algorithms. Springer, 2005.
[61] Alireza Karduni, Amirhassan Kermanshah, and Sybil Derrible. A protocol to con-
vert spatial polyline data to network formats and applications to world urban road
networks. Scientific Data, 3(1):1–7, 2016.
[62] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18
(1):39–43, 1953.
[63] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph con-
volutional networks. arXiv preprint arXiv:1609.02907, 2016.
[64] Douglas J. Klein and Milan Randić. Resistance distance. Journal of Mathematical
Chemistry, 12(1):81–95, 1993.
[65] Frank Lin and William W. Cohen. Power iteration clustering. In Proceedings of the
27th International Conference on Machine Learning (ICML), pages 655–662, 2010.
[66] László Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdős Is
Eighty, 2(1):1–46, 1993.
[67] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient esti-
mation of word representations in vector space. In Proceedings of the International
Conference on Learning Representations Workshops (ICLR), 2013.
[68] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltz-
mann machines. In Proceedings of the 27th International Conference on Machine Learn-
ing (ICML), pages 807–814, 2010.
[69] David Opitz and Richard Maclin. Popular ensemble methods: An empirical study.
Journal of Artificial Intelligence Research, 11:169–198, 1999.
[70] Jack A. Orenstein. Multidimensional tries used for associative searching. Informa-
tion Processing Letters, 14(4):150–157, 1982.
[71] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric
transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages
1105–1114, 2016.
[72] Panos M. Pardalos and Jue Xue. The maximum clique problem. Journal of Global
Optimization, 4(3):301–328, 1994.
[73] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space.
The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):
559–572, 1901.
[74] David Peleg. Proximity-preserving labeling schemes. Journal of Graph Theory, 33(3):
167–176, 2000.
[75] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of so-
cial representations. In Proceedings of the 20th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (SIGKDD), pages 701–710, 2014.
[76] Robi Polikar. Ensemble based systems in decision making. IEEE Circuits and Sys-
tems Magazine, 6(3):21–45, 2006.
[77] Michalis Potamias, Francesco Bonchi, Carlos Castillo, and Aristides Gionis. Fast
shortest path distance estimation in large networks. In Proceedings of the 18th ACM
Conference on Information and Knowledge Management (CIKM), pages 867–876, 2009.
[78] Fatemeh S. Rizi, Joerg Schloetterer, and Michael Granitzer. Shortest path distance approximation using deep learning techniques. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 1007–1014, 2018.
[79] Neil Robertson and Paul D. Seymour. Graph minors. III. Planar tree-width. Journal
of Combinatorial Theory, Series B, 36(1):49–64, 1984.
[80] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39,
2010.
[81] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), 2015.
[82] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[83] Hanan Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys, 16(2):187–260, 1984.
[84] Jagan Sankaranarayanan and Hanan Samet. Distance oracles for spatial networks.
In Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE),
pages 652–663, 2009.
[85] Jagan Sankaranarayanan and Hanan Samet. Query processing using distance oracles for spatial networks. IEEE Transactions on Knowledge and Data Engineering, 22(8):1158–1175, 2010.
[86] Michiel Smid. The well-separated pair decomposition and its applications. In Handbook of Approximation Algorithms and Metaheuristics, Chapman & Hall/CRC, 2016.
[87] Peter Sollich and Anders Krogh. Learning with ensembles: How overfitting can be
useful. In Advances in Neural Information Processing Systems (NIPS), pages 190–196,
1996.
[88] Christian Sommer. Shortest-path queries in static networks. ACM Computing Surveys, 46(4):45:1–45:31, 2014.
[89] Frank W. Takes and Walter A. Kosters. Adaptive landmark selection strategies
for fast shortest path computation in large real-world graphs. In Proceedings of
the IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent
Agent Technologies (WI-IAT), pages 27–34, 2014.
[90] Liying Tang and Mark Crovella. Virtual landmarks for the internet. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement (IMC), pages 143–152, 2003.
[91] Athanasios Theocharidis, Stijn van Dongen, Anton J. Enright, and Tom C. Freeman. Network visualization and analysis of gene expression data using BioLayout Express3D. Nature Protocols, 4(10):1535, 2009.
[92] Mikkel Thorup and Uri Zwick. Approximate distance oracles. Journal of the ACM,
52(1):1–24, 2005.
[93] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. Efficient aggregation
for graph summarization. In Proceedings of the 2008 ACM SIGMOD International
Conference on Management of Data (SIGMOD), pages 567–580, 2008.
[94] Hannu Toivonen, Fang Zhou, Aleksi Hartikainen, and Atte Hinkka. Compression
of weighted graphs. In Proceedings of the 17th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (SIGKDD), pages 965–973, 2011.
[95] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17
(4):395–416, 2007.
[96] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (SIGKDD), pages 1225–1234, 2016.
[97] Jim Webber. A programmatic introduction to Neo4j. In Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity (SPLASH), pages 217–218, 2012.
[98] Donghui Yan, Ling Huang, and Michael I. Jordan. Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 907–916, 2009.
[99] Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, and Zhao Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems (NIPS), pages 1553–1560, 2007.
[100] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML), page 116, 2004.
Minerva Access is the Institutional Repository of The University of Melbourne

Author: Zhao, Zhuowei
Title: Embedding Graphs for Shortest-Path Distance Predictions
Date: 2020
Persistent Link: http://hdl.handle.net/11343/241911
File Description: Final thesis file