Shortest path computing in relational dbm ss

Shortest Path Computing in Relational DBMSsJun Gao, Jiashuai Zhou, Jeffrey Xu Yu, and Tengjiao Wang

Abstract—This paper takes the shortest path discovery to study efficient relational approaches to graph search queries. We first

abstract three enhanced relational operators, based on which we introduce an FEM framework to bridge the gap between relational

operations and graph operations. We show new features introduced by recent SQL standards, such as window function and merge

statement, can improve the performance of the FEM framework. Second, we propose an edge weight aware graph partitioning schema

and design a bi-directional restrictive BFS (breadth-first-search) over partitioned tables, which improves the scalability and

performance without extra indexing overheads. The final extensive experimental results illustrate our relational approach with

optimization strategies can achieve high scalability and performance.

Index Terms—Relational database, graph, shortest path

Ç

1 INTRODUCTION

WITH the rapid growth of graphs, graph search facesmore challenges. Nowadays, graph data emerge in

different domains, such as web graphs, social networks,knowledge graphs, etc. Graph search is highly needed inapplications over graphs. Specifically, graph search seeks asub-graph(s) meeting the specific purposes, such as theshortest path between two nodes [13], the minimal span-ning tree [23], the salesman traveling path [10], and the like.We also observe that these graphs are always exceedinglylarge and keep growing at a fast rate. When a graph is toobig to be fit into main memory, the existing in-memoryapproaches [13], [22], [23] to graph search must be re-exam-ined, since the I/O becomes the key factor in the graphoperations on the external disk memory.

Existing disk-based methods have limitations to supportgeneral graph search queries when graphs exceed the mainmemory limitation. One straightforward way is to buildnative graph databases from scratch. Neo4j is a representa-tive one [2], which can store large graphs into the database,and provide primitive operations, such as graph traversal orshortest path discovery to end users. However, the stability,maturity and performance of graph database systems shouldbe continuously improved, as graph database systems haveto implement complex components, including storage,index, query evaluation, and query optimization, etc. Somekinds of indices are devised to support specific searchqueries over large graphs. For example, a shortest path indexover planar graphs is designed on the external memory [9].These methods face difficulty in supporting general searchqueries. We notice that the MapReduce framework [17] and

its open source implementation Hadoop [1] can processlarge graphs stored in a distribute file system over a clusterof commodity servers. However, the low latency of users’queries is not the main concern in the design of Hadoop [1].

Relational Database (RDB) provides a promising infra-structure to support graph search. After more than 40 yearsof development, RDB is mature and stable enough, andplays a key role in information systems. We can see thatRDB and the graph data management have many over-lapped functionalities, including storage, data buffer, index,and optimizations, etc. In addition, RDB has already shownits flexibility in managing other complex data types, such asXML data [14], [19]. As for graph data, RDB can be used tosupport specific graph search queries, like reachabilityquery [28] and BFS [27]. The extension of RDB to graphsearch queries is especially useful in applications whenboth graph and relational operations are needed.

However, it still requires substantial efforts to supportgeneric graph search queries efficiently in the RDB context.First, graph search queries have various forms. It is not flexi-ble to implement each query separately. We had better finda mechanism to support the evaluation of generic graphsearch in the relational context. Second, the semantic mis-match between relational operations and graph operationshas a significant impact on the performance of graph searchon the RDB. Graph operations always follow a node-at-a-time fashion in order to lower the search space. During thesearch we need to do logic and arithmetic computation,make choices based on aggregated results, and record nec-essary information [10], [13], [23]. In contrast, relationaloperations take a set-at-a-time fashion, and only restrictedoperations, such as projection, selection, join, aggregation,etc, are allowed on the relational tables [15].

This paper focuses on the shortest path discovery fortwo reasons. First, it plays a key role in many applications.For example, it can reveal how the relationships betweentwo individuals are built in a social network [25]. It alsocan find the path with the minimal length in a transporta-tion network. Second, it is a representative graph searchquery, which has a similar evaluation pattern to othersearch queries, such as the minimal spanning tree con-struction [23].

� J. Gao, J. Zhou, T. Wang are with the Key Laboratory of High ConfidenceSoftware Technologies, Ministry of Education & School of EECS, PekingUniversity, 100871 Beijing, China.E-mail: {gaojun, zjiash, tjwang}@pku.edu.cn.

� J.X. Yu is with the Department of System Engineering, Chinese Universityof Hong Kong, Hong Kong. E-mail: [email protected].

Manuscript received 5 Aug. 2012; revised 24 Dec. 2012; accepted 1 Mar. 2013;date of publication 6 Mar. 2013; date of current version 18 Mar. 2014.Recommended for acceptance by B.C. Ooi.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TKDE.2013.43

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014 997

1041-4347 � 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

We have proposed a relational approach to shortest pathdiscovery in the previous version [18]. In this paper, wemake substantial extensions to further improve the scalabil-ity and performance of the relational approach. We intro-duce a weight aware edge table partitioning schema, anddesign a restrictive BFS strategy over partitioned tables. Thestrategy can improve both the scalability and performancesignificantly without needing extra index construction andspace overhead.

To sum up, we investigate the techniques for the shortestpath discovery used in the RDB context. We expect ourmethod can not only deliver benefits to the shortest pathcomputation, but also shed lights on the efficient SQLimplementation for other graph search queries. The contri-butions of this paper can be summarized as follows:

� We abstract three key operators, namely F , E andM-operator, and then provide a generic graphsearching framework FEM. We find new features,such as window function and merge statement intro-duced by recent SQL standards, can simplify theexpression and improve the performance. We alsoshow the basic method to discover the shortest pathusing the FEM framework (see Section 3).

� We propose a novel weight aware edge table parti-tioning schema and a bi-directional restrictive BFSover the partitioned tables. Specifically, we define anextended E-operator using basic F and E-operator,which expands paths by scanning partial partitionededge tables. We also show the issue of incompletesearch caused by the restricted BFS and then give itssolution. We will see that the restricted BFS over thepartitioned tables can yield correct results with ahigh performance, scalability and no extra space cost(see Section 4).

� We conduct extensive experiments over syntheticand real-life data to illustrate the effectiveness andefficiency of our FEM framework along with its opti-mizations. The results also show that our relationalapproach has overwhelming advantages to the exist-ing native graph database in handling shortest pathdiscovery (see Section 5).

The remainder of this paper is organized as follows: Sec-tion 2 presents the preliminary knowledge. In Section 3, weabstract three operators and propose an FEM framework.Section 4 focuses on the restrictive BFS over the partitionedtables. Section 5 reports experimental results on real andsynthetic data. Related work is reviewed in Section 6. Sec-tion 7 concludes the paper.

2 PRELIMINARY

In this section, we first give notations and relational schemaof graphs, and then show new SQL features used in ourpath finding method.

2.1 Graph Notations and Logic Relational Schema

This paper studies weighted (directed or undirected) graphs.Let G ¼ ðV ;EÞ be a graph, where V is a node set and E is anedge set. Each node v 2 V has a unique node identifer. Eachedge e 2 E is represented by e ¼ ðu; vÞ, u; v 2 V . wðeÞ is the

weight for the edge e. A path p ¼ u0e

> ux from u0 to ux is asequence of edges ðu0; u1Þ; ðu1; u2Þ; . . . ; ðux�1; uxÞ, whereei ¼ ðui; uiþ1Þ 2 E ð0 � i < xÞ. The length of a path p, lenðpÞ,is the sum of the weights of its constituent edges. The short-est distance from u to v is denoted by dðu; vÞ.

The logical relational schema for a graph is illustrated inFig. 1. Let G ¼ ðV ;EÞ be a graph. TN table is to representnodes V . We can record nid for the node’s identifer andother attributes in TN table. TE table is to store edges E. Foran edge ðu; vÞ 2 E, the identifiers of node u and v, as well asthe weight of the edge, are recorded by fid, tid and cost fieldrespectively. As this paper studies the shortest path discov-ery, only TE table is used in the following operations.

2.2 SQL Features

In this paper, in addition to utilizing standard features inSQL statements, we leverage two new SQL features, win-dow function and merge statement, which are now sup-ported by commercial database systems, like Oracle, DB2,and SQL server. These features are the “short-cut” for com-binations of some basic relational operators. More impor-tantly, compared with the general-purpose statementsimplemented by the traditional features, the statementswith these new features are more specific, which then pro-vide chances for them to gain a higher performance.

Listing 1. Syntax for New Features

The window function,1 introduced by SQL 2003, returnsaggregate results for each tuple in a set, in contrast to tradi-tional aggregation functions which obtain aggregate resultsfor a set. In addition, non-aggregate attributes are allowedby the window function to be along with the aggregate onesin the select clause, even if these attributes are not in thegroup-by clause. Window function also supports aggregatefunctions such as rank, row_num, which are related to thetuple position in a sorted tuple sequences specified by anorder-by clause. The syntax for window function is describedin Listing 1(1).

Fig. 1. Relational representation of graph.

1. http://en.wikipedia.org/wiki/SQL:2003.

998 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014

The merge statement2 adds new tuples and updatesthe existing ones in a target table from a source table(or SQL results, or a view). It is officially introduced bySQL 2008, but is supported earlier by different databasevendors due to the need in loading data into data ware-house. The main part of merge statement specifies actions(e.g., insert, delete and update) under different relation-ships between the source and target. The update action isinvoked when the tuples in the source match the tuplesin the target, and the insert action is performed whenthere are no matched tuples in the target for the tuples inthe source. A merge statement can be equivalently rewrit-ten by one update statement followed by an insert state-ment. The redundant table scan cost in two separatestatements can be avoided by one merge statement.The syntax for merge statement is referred in Listing 1(2).

3 RELATIONAL FEM FRAMEWORK FOR SHORTEST

PATH DISCOVERY

In this section, we introduce a generic graph search frame-work FEM, and leverage it to realize the classic Dijkstra’sshortest path discovery algorithm on RDB.

3.1 A Generic Framework for Graph Search

Graph search queries seek the sub-graph(s) meeting the spe-cific requirements. For example, reachability query answerswhether there exists a path between two given nodes [12].The shortest path query locates the shortest path betweentwo given nodes [3], [22]. The minimal spanning tree queryreturns a tree covering all nodes with the minimal sum ofedge weights [23]. The salesman path query gets a shortesttour that visits each city exactly once [10].

Many graph search algorithms show a common pattern.Due to a large search space, they always utilize greedyideas. We observe that most of these greedy algorithms canfit into a generic iterative processing structure. A visitednode set A1 is initialized first according to different pur-poses. Then an iterative searching starts. We illustrate theith iteration in Fig. 2. Let the visited nodes Ai record all nodesencountered in the graph search so far; the frontier nodes Fiare selected from the visited nodes with the certain criteria(application dependent, Fi � Ai); the expanded nodes Ei arethe next visited nodes from the frontier nodes (generallyneighboring subset of Fi); and the next visited ones Aiþ1 areobtained by merging the newly expanded nodes Ei and Ai.The iterations continue until the target sub-graph(s) can bediscovered. We refer to such a generic processing structureas an FEM framework.

Before we proceed to the details of implementing rela-tional operators for the FEM framework, we first exploit it

to briefly describe two representative graph search algo-rithms: Dijkstra’s shortest path algorithm and Prim’s minimalspanning tree algorithm. In Dijkstra’s algorithm for shortestpath discovery [13] using the FEM framework, each node isannotated with d2s to record the distance from the sourcenode s, and a flag f indicating whether the node is finalizedor not. We initially add the source node s into a visitednode set with s:d2s ¼ 0 and s:f ¼ false. We then start aniterative path expansion (PE) for the target node. In eachiteration, we select a non-finalized node u (u:f ¼ false) fromall visited nodes with the minimal d2s as a frontier node,finalize u, expand u (visiting its neighbors), and merge thenewly expanded nodes into the visited nodes.

Another case is to construct a minimal spanning tree T byPrim’s algorithm [23]. Each node u is represented byuðp2s; w; fÞ. Here p2s is the parent of u, w is the edge weightfrom p2s to u, if u is in T . f is a flag for whether u is in T . Thevisited nodes are initialized by any node u with u:f ¼ falseand u:w ¼ 0. In each iteration in building T , we select a nodeu with u:f ¼ false and the minimal edge weight, add u intoT by changing u:f ¼ true, make further expansion from u,and merge the expanded nodes into the visited ones. In themerge operation, the expanded nodes can be discardeddirectly if they have been included ðf ¼ trueÞ in T . The itera-tions repeat until all nodes have been in T .

Note that there are other operations besides the threebasic operations (select, expand, and merge) in the graphsearch. For example, the recovery of the shortest path or theminimal spanning tree, and the termination detection, arealso needed in the FEM framework. However, these opera-tions are generally auxiliary under the FEM framework andtheir computational costs are quite minimal compared withthe three main operations. The exploration on the shortestpath search in the remainder of the paper will illustrate thedetails of these additional operations, and also demonstratethe key functionality of the three operations.

3.2 FEM Operators for Shortest Path Discovery

To realize a graph search query using the FEM framework onRDB, we need three operators which can be expressed byrelational algebra: i) F -operator to select frontier nodes fromthe visited nodes; ii) E-operator to expand from the frontiernodes; and iii) M-operator to merge the newly expandednodes into the visited ones. As discussed above, attributeson nodes and operations may be various for different graphsearch queries. In the remainder of the paper, we will focuson efficiently realizing the shortest path discovery in theFEM framework. The new techniques we developed forshortest paths in the FEM framework can be in generalapplied or extended to deal with other graph search queries.

Now, we start with Dijkstra’s algorithm for the shortestpath discovery as an example to describe its F , E and M-operator. Each node u in the visit nodes Ai, the frontiernodes Fi and the expanded nodes Ei can be represented byðnid; d2s; p2s; ffÞ, where i is the number of iterations, nid isfor u0s identifier, d2s is for the distance from the sourcenode to u, p2s is for the predecessor node of u to the sourcenode, and ff is for the sign indicating whether u is finalizedor not. Take the shortest path discovery from s to t in thegraph in Fig. 1 as an example. We add s to A1 first. d, t, b2. http://en.wikipedia.org/wiki/Merge_(SQL).

Fig. 2. Conversion between nodes in the ith iteration.

GAO ET AL.: SHORTEST PATH COMPUTING IN RELATIONAL DBMSS 999

will be added into A2 next. Fig. 3 shows the visited nodes,frontier nodes, and expanded nodes in the second iterationof the shortest path discovery.

Below we formally describe these operators based on therelational algebra.

Definition 1 (F-operator). F ðAiÞ snid¼midAi returns fron-tier nodes Fi from visited nodes Ai in the ith expansion.

In Dijkstra’s algorithm, we select the node with the mini-mal d2s among all non-finalized nodes, and assign its iden-tifier to mid. Take Fig. 3 as an example, node b is selectedwith the minimal d2s ¼ 6, and we assign the identifer of b tomid. The computation of mid can be done by an auxiliaryoperation before F -operator. In order to enhance the utilityof our framework, we assume that there may exist multiplefrontier nodes after F -operator. For example, an optimiza-tion strategy used in the next section may produce multiplefrontier nodes with a revised predicate in F -operator. AfterF -operator, we use another auxiliary operation to adjustff ¼ 1 for the node identified by mid.

Definition 2 (E-operator). ðFiÞEðTE) returns the expandednodes Ei based on the frontier nodes Fiðr; d2s; p2s; ffÞ andTEðr; x; wÞ table in the ith expansion.

1. minCostðx; cÞ x Gminðd2sþwÞPðx;d2s;wÞðFiðr; d2s;p2s; ffÞ ffl TEðr; x; wÞÞ;

2. Ei Pðx;d2sþw;r;0Þsc¼d2sþwminCostðx; cÞ fflFiðr; d2s; p2s; ffÞ ffl TEðr; x; wÞ;

ðr; x; wÞ in TE table is for an edge from r to xwith its weightw.

In E-operator, minCostðx; cÞ preserves the expandednodes with the minimal distances. Since the expandednodes may be reached by different paths, and only theones with the minimal distance (d2sþ w) are needed, weuse an aggregate function to find them. Note thatminCostðx; cÞ contains the newly expanded nodes withtheir costs, but lacks their parents p2s, which arerequired in the recovery of the full path. However, wecannot simply put the non-aggregate attribute p2s intothe select clause in minCostðx; cÞ, due to the constraint onaggregation functions in relational algebra. Thus, anotherjoin operation is needed to find p2s from TE. Take Fig. 3as an example, we make an expansion from the frontiernodes fbg, and get the newly expanded nodes g and c.

Definition 3 (M-operator). ðEiÞMðAiÞ returns visited nodesAiþ1 based on expanded nodes Eiðx; d2se; p2se; f

fe Þ and visited

nodes Aiðx; d2sa; p2sa; ffa Þ as follows:

1. Ai Ai �Px;d2sa;p2sa;f

faðsd2se < d2sa

ðEiðx; d2se; p2se; ffe Þ ffl Aiðx; d2sa; p2sa; f

fa ÞÞ;

2. Ei Ei �Px;d2se;p2se;f

feðsd2se>d2sa

ðEiðx; d2se; p2se; ffe Þ ffl Aiðx; d2sa; p2sa; f

fa ÞÞ;

3. Aiþ1 Ai [ Ei;

In M-operator, we remove the visited nodes from Ai,whose distances are larger than those of the correspondingnewly expanded nodes in Ei. Then we add partial newlyexpanded nodes into the visited node set. The newly addednodes in Aiþ1 actually have two parts, the nodes new to Ai

and the nodes with smaller distances than their counter-parts in Ai.

3.3 SQL Implementation for Operators

Now, we discuss the SQL implementation for these oper-ators. We use a table TA to store all visited nodes. Theattributes nid, d2s, p2s, ff in TA have the same meaningsas those in Section 3.2. Since F , E and M-operator arebased on relational algebra, we can use SQL to expressthem. However, the direct translation will result in apoor performance, especially for E and M-operator. Forexample, E-operator is implemented by an aggregatefunction over the join results with group by clause. Inaddition, the location of the parent node p2s of eachexpanded node still needs another join operation. More-over, when there are multiple paths with the same mini-mal d2s to an expanded node, we have to keep only onepath due to the primary key constraint on nid in TAtable. As for M-operator, we have two different actionsaccording to whether the expanded nodes have been inthe visited nodes or not. In the SQL implementation, wemight use an update statement followed by an insertstatement with a not exists sub-query. Such a kind ofexpression is not only verbose but also inefficient.

Listing 2. SQL in Path Finding

Fortunately, we find new features including windowfunction and merge statement introduced by recent SQLstandards can simplify the expression as well as improvethe performance. As for our case, we can partition alloccurrences of expanded nodes by the identifer nid, sortthem with their d2s and select the tuple with the minimald2s by using an aggregate function row numberðÞ ¼ 1 ineach partition. We see that the window function canavoid another extra join operation to locate the parentnode p2s of a currently expanded node. It can also handle

Fig. 3. F , E and M-operator in the second iteration.


the case when the node are reached by multiple pathswith the same minimal distance. As for the M-operator,we can use one merge statement to combine two sepa-rated insert and update statements.

The SQL statement used in the path expansion isillustrated in Listing 2. The first statement adds thesource node s into TA nodes initially, where ff is set to0 (non-finalized). The E-operator can be expressed bythe third statement. It joins the frontier nodes specifiedby nid ¼ mid (F -operator) and TE table, where mid isdiscovered by the second statement. The tuples with theminimal d2sþ c among all occurrences for the samenode can be located by the function row number overthe sorted tuples. In the fourth SQL statement for M-operator, we use one statement to merge expandednodes into TA table. When the newly expanded nodesare new to existing visited nodes in TA, we directly addthem into TA. When the newly expanded nodes havesmaller distances from the source node, the nodes in theTA are replaced by the newly expanded ones.

3.4 Shortest Path Discovery in FEM Framework

Here, we present Dijkstra’s algorithm for shortest path dis-covery in Algorithm 1 as a case to show the functionality ofthe FEM framework. Besides the key SQLs shown above,we also need auxiliary SQLs in Listing 3. The method canrun on the client side, which connects to the underlyingRDB via JDBC or ODBC. In the running time, only few vari-ables are kept on the client side, and the RDB carries outtime-consuming tasks.

Listing 3. Auxiliary SQLs in Path Finding

We initialize TA table first with the source node. We thenstart iterations to find the shortest path from line 2 to line 9.

We issue an SQL on TA table to locate mid, the ID for thenext frontier node in line 3, and use mid to compose SQLsfor F , E and M-operators in path expansion in line 4. Afterthat, we extract the number of affected tuples from SQLcommunication area of database (SQLCA) in line 5, and ter-minate iterations when TA is not updated. Otherwise, wefinalize the node with its identifier mid. We then detectwhether the target node has been finalized or not. Once theiterations are terminated, we recover the full path from thesource node to the target node using SQLs along the p2slink iteratively.

There are at most n iterations for the path finding inAlgorithm 1 in the worst case, where n is the number ofnodes in the graph. In each iteration, we have four sepa-rate SQLs. Thus, we at most issue 4n SQLs in the short-est path discovery.

4 BI-DIRECTIONAL RESTRICTIVE BFS OVER

PARTITIONED TABLES

In this section, we first show the basic idea behind the opti-mizations in the relational context. Then, we discuss bi-directional search, propose a weight aware edge table parti-tioning schema, revise F , E, and M-operator, and introducenew termination actions in the restrictive BFS search.Finally, we present the complete algorithm on bi-directionalrestrictive BFS over partitioned tables.

4.1 Basic Idea of Optimizations in RelationalContext

An important aspect of relational approaches lies in its eval-uation fashion: in the RDB context, the set-at-a-time evalua-tion is more suitable than the node-at-a-time fashion for thesame search space, since the former can enable RDB to fullyadopt the intelligent scheduling to access disk and exploitthe data loaded in the buffer, and thus to make a betterevaluation plan [15], [20]. For example, let us consider theE-operator (the third SQL in Listing 2). As we expand fromone node in an E-operator in Algorithm 1 each time, wehave to issue n SQL statements for loading the edges of nnodes separately in the worst case. Such a node-at-a-timeoperation is very inefficient due to the redundant I/O costfor accessing edges of different nodes when they are storedin the same data block in RDB.

From another viewpoint, the performance is not satisfac-tory either if we solely focus on the set-at-a-time evaluationfashion. An extreme case is BFS strategy. BFS typically fol-lows the set-at-a-time evaluation fashion, since it views allnewly visited nodes as frontier nodes, and makes expan-sions from these nodes along all their adjacent edges in oneoperation. However, BFS may re-expand the nodes whichhave been expanded before, and incurs larger search space(or more visited nodes).

We have to make a balance between selecting all frontiernodes (BFS) and selecting one frontier node in one expan-sion (the classic Dijkstra’s algorithm) in the relational con-text. In this paper, we take a strategy of the restrictive BFS.That is, we allow multiple nodes be frontier nodes, andsearch from these frontier nodes along their partial incidentedges first, specifically, edges with smaller weights. In sucha way, the expansion can be done by a set-at-a-time fashion,


and the visited nodes have fewer chances to be re-expandedcompared with these in BFS. Most importantly, the enor-mous cost in scanning edge table can be reduced in therestrictive BFS, if we partition the edge table according toedge weights.

In addition, the strategies which lower the search spacewhile still taking the set-at-a-time evaluation fashion arehighly preferable in the relational context. One of such strate-gies is the bi-directional search [11], which makes bothexpansions from the source node and the target node, andlocates paths when the forward expansion meets the back-ward expansion. The bi-directional search not only lowersthe search space, but also can be incorporated into the restric-tive BFS easily. In the relational context, the bi-directionalsearch can reduce the total computational cost contributedby F ,E, andM-operator, as fewer nodes need to be visited.

4.2 Bi-Directional Search

The introduction of the bi-directional search into the rela-tional context makes changes to the basic path discoverymethod. First, as the forward expansion and the backwardexpansion are allowed in the bi-directional search, we usetwo tables TAf and TAb to record visited nodes in bothdirections respectively.

Second, the expansion direction needs to be selected ineach iteration in the bi-directional search. We choose thedirection with fewer frontier nodes in order to reduce thesearch space. The number of frontier nodes after each itera-tion, nf , can be computed by issuing an extra SQL on theforward visited node table TAf . Another way is to extractthe number of affected tuples from SQLCA after the SQLstatement for M-operator is evaluated, and use it as nf .

Third, the bi-directional search can prune the search spacewith discovered paths, especially when multiple frontiernodes are allowed. Specifically, we locate the maximal final-ized distance lfi in the latest ithði � 1Þ forward expansionfrom TAf . In the ith expansion, lfi is the distance to the sourcenode of the frontier node for the classic Dijkstra’s algorithm,or the minimal distance to the source node of all newlyexpanded nodes for the BFS search. Similarly, we computethe maximal finalized distance lbj in the latest jth backwardexpansion. We also compute the minimal distance minCostbetween the source node and the target node seen so farfrom a join result between TAf and TAb. We can incorporatethese collected data into theE-operator to avoid unnecessaryexpansions. Take the forward expansion as an example. Thenodes with their cost larger thanminCost� lbj do not need tobe visited. Specifically, the rule can be described as follows:

Theorem 1. Let lfi , lbj and minCost have the same meaningsabove, v be a frontier node in the forward expansion. Wedo not need to expand from v to its neighbor x, if v:d2s þwðv; xÞ þ lbj � minCost.

Proof. We prove it by contradiction. Suppose that there exists

a path p0 ¼ se

> t with lenðp0Þ < minCost and p0 has the

prefix sub-path from s to x via v. Since

v:d2sþ wðv; xÞ þ lbj � minCost, the distance from x to thetarget node t is less than lbj in p0 to achieve

lenðp0Þ < minCost. In such a case, x has already been

finalized in the backward expansion since its distance to t

is less than lbj, which indicates that the path p0 with

lenðp0Þ < minCost has been already discovered. This

contradicts that minCost is the minimal distancediscovered yet. Thus, x can be ignored safely in the bi-

directional searching.

The existing studies give the termination condition inbi-directional shortest path discovery [11]. We can stopiterations when minCost � lfi þ lbj. In such a case,minCost is the shortest distance between the sourcenode and the target node.

4.3 Weight Aware Edge Table Partitioning Schema

Table partitioning strategy is highly needed in handlinglarge graphs. Let n be the number of nodes in a graph. Thenumber of edges is n2 in the worst case. We can imaginehow large the edge table is for a real life graph. On such alarge table, even primitive graph operations, such as thelocation of adjacent edges for a specified node, are verycostly. The table partitioning strategy is then a key to thescalability for graph queries in RDB.

A straightforward schema is to partition the edge tablewith node IDs. Given pts partitioned tables, we randomlyput an edge e ¼ ðu; vÞ into a partitioned table via a hashfunction on u:id. However, such a schema brings no sub-stantial improvement in addressing our problem. For exam-ple, in E-operator, we still have to union all partitionedtables before joining with the frontier nodes, as the edgesare distributed randomly in partitioned tables.

In this paper, we propose a weight aware edge table par-titioning schema. Instead of using meaningless node IDs,we partition the edge table according to edge weights. Intui-tively, the cost-related graph operations, such as the mini-mal spanning tree, shortest path discovery, etc., preferaccessing the edges with smaller weights. Thus, we mayhave chances to implement these operations on a limitednumber of partitioned tables.

Definition 4 (Weight Aware Edge Table PartitioningSchema). Let pts be the number of partitioned tables,½wmin; wmax� be the range of edge weights. ½w0; . . . ; wpts� be apartitioning vector, where w0 ¼ wmin, wpts ¼ wmax þ 1, andwiþ1 > wi for 0 � i < pts. For any edge e in a graph, e is putinto the partitioned table TEi, where wi�1 � wðeÞ < wi. i isthe index of the partitioned table for e.

The partitioning vector is a flexible mechanism to supportboth equal-width and equal-depth schema. With the equal-width partitioning, the differences, wiþ1 � wi, are almostequal for 0 � i < pts in the partitioning vector. Whenthe equal-depth partitioning schema is used, the total num-ber of edges in each partitioned table is nearly even. Weshow an equal-width table partitioning schema in Fig. 4.Suppose the range of the edge weights is ½1; 40� and the num-ber of partitioned tables, pts, is 4. The partitioning vector willbe ½1; 11; 21; 31; 41�. The table TE1 preserves the edges withtheir weights from 1 to 10. And the edge from s to t with itsweight 22 will be stored into the partitioned table TE3.

4.4 Revised F , E, M-Operator in Restrictive BFS

Now, we start our restrictive BFS over the partitionedtables. Roughly speaking, all newly visited nodes are


viewed as the frontier nodes. For each frontier node, wemake expansions along its incident edges with smallerweights first, and delay expansions along the edges withlarger weights. The location of specific edges can beaccelerated by the weight aware edge table partitioningschema. In fact, the partitioning vector acts as an index,which can help to locate the edges with specifiedweights in the restrictive BFS.

We need to extend the visit node tables for recording suffi-cient information used in selecting specific partitionedtables, in order to support the bi-directional restricted BFSon partitioned tables. Take the forward expansion as anexample. Each node u in the visited table TAf takes the formof ðu:id; u:p2s; u:d2s;u:fwdÞ. Here, u:id is for the identifer ofnode u, u:p2s is for the identifier of the parent node of u inthe path, u:d2s is for the distance between the source nodeand u. u:fwd is set to 1 when u is the source node. After theith expansion, all newly visited nodes or nodes with reducedd2s are with fwd iþ 1. The attribute fwd can be used as acriteria in the following operators on partitioned tables.

The basic F , E, and M-operator have to be revised tosupport the restrictive BFS. We first give an extended E-operator by combining the basic F and E-operator:

Definition 5 (Extended E-operator). Let i be the number of theexpansions, l be maxð1; i� ptsþ 1), TE1; . . . ; TEpts be allpartitioned tables and TE be the union of all partitionedtables. The extended E-operator between TAf and TE can bedefined as [l�k�i ðsfwd¼kTAf ) EðTEi�kþ1Þ, where E is thebasic E-operator in Definition 2.

An extended E-operator unions the results of multiplebasic E-operators. We give an intuitive example in Fig. 5a.Suppose we want to find the shortest path from node b tonode t in Fig. 1a on the partitioned tables in Fig. 4. Initially,b is added into TAf with fwd 1. In the 1-st path expan-sion, we select b (fwd ¼ 1) as the frontier nodes and searchalong the edge ðb; cÞ in TE1, but delay the expansion alongthe edge ðb; gÞ in TE2. In the second expansion, we searchfrom c (fwd ¼ 2) and expand along the edges in TE1. Mean-while, the previous delayed expansion from b (fwd ¼ 1)along the edge ðb; gÞ in TE2 is conducted at this time.

The composition of E-operators in an extended E-operator is illustrated in Fig. 5b. Let pts be the number ofpartitioned tables, i be the current number of expansions. Anode u is a frontier node when u:fwdþ pts� 1 � i, whichindicates u may have to-be-searched edges. We divide allfrontier nodes into different groups according to their fwd,and make joins with the corresponding partitioned tables.In the first path expansion, we make one basic E-operatorwith TE1, which contains the edges with the smallestweights. In the second path expansion, we make two basicE-operators. Similarly, we make i basic E-operators in theith expansion, until i reaches pts. After that, each extendedE-operator includes pts basic E-operators between partialfrontier nodes and partial edges.

We can see the advantages of the extended E-operator.First, the extended E-operator expands from multiple nodesin one operation. Thus, it follows the set-at-a-time evalua-tion fashion. Second, the extended E-operator incurssmaller search space than BFS. Instead of searching alongall incident edges of frontier nodes, the extended E-operatorsearches along the edges with smaller weights in specificpartitioned tables earlier, which can gain fewer chances innode re-expansions. Third, the extended E-operator can beimplemented efficiently, as it requires join operationsbetween partial visited nodes and partial edges. Finally, theextended E-operator is buffer-friendly. The edges withsmaller weights have more chances in the buffer since theyare accessed more frequently.

The M-operator in the restrictive BFS is similar to thebasic one, except that we may adjust the sign of fwd onnodes. That is, for the newly visited nodes and the nodeswith reduced d2s in the latest ith expansion, their fwd willbe set to iþ 1. In this way, fwd can be used to guide theselection of partitioned tables in the following expansions.

4.5 Termination Detection and Verification inRestrictive BFS

Although the extended E-operator can reduce overheadsin the path search significantly, the delayed expansionsmay result in incomplete search in the shortest path dis-covery, if we still use the original termination conditionlfi þ lbj � minCost (see Section 4.2) in the restrictive BFS.In fact, the delayed expansions make an impact on theoriginal computation rule for the maximal finalized dis-tance(lfi and lbj). In addition, the candidate shortest dis-tance minCost might be not the shortest one, even whenlfi þ lbj � minCost is satisfied. In the following, we pointout the issues of completeness of search space, followedby their solutions.

First, we show the issues in the maximal finalized dis-tance computing in the restrictive BFS. The maximal final-ized distance is a key measure in the bi-directional search,which can be used in the bi-directional pruning rules andthe termination condition. In BFS, the maximal finalized dis-tance equals the minimal distance discovered in the latestexpansion. However, the rule is no longer valid in therestrictive BFS, as illustrated in Fig. 6. The minimal distanceis two in the first expansion. The minimal distance is 18 inthe second expansion. Notice that the searching along theedge ðc; hÞ is delayed. The minimal distance is 13 in thethird expansion. We can observe that the minimal distancediscovered cannot be used as the maximal distance finalizedin the ith expansion now, since the delayed expansion mayproduce a smaller one.

In order to address such an issue, we revise the computa-tion rule for the maximal finalized distance in one

Fig. 4. Equal-width weight aware partitioning schema.

Fig. 5. Extended E-operator over the partitioned tables.


expansion. Let lfi be the maximal distance finalized after theith expansion, minðd2sÞi be the minimal distance d2s in TAf

with their fwd ¼ i, ½w0; . . . ; wn� be the partitioning vector forthe edge table. lfi can be computed recursively:

lfi ¼minðminðd2sÞ1; w1Þ; if i ¼ 1

min�

lfi�1 þ wi � wi�1;minðd2sÞi�

; i � 2:

�

(1)

Theorem 2. The distance computed by Formula 1 is maximalfinalized.

Proof. We show that Formula 1 is valid by induction. Forthe base case, let u be the visited node with the mini-mal d2s in the first expansion. If u exists, u:d2s is themaximal finalized distance since the delayed expan-sions along edges with larger weights cannot loweru:d2s. If no node is visited, the delayed expansioncannot find a node with its d2s less than w1. Thus, theformula is correct for i ¼ 1.

We assume the induction hypothesis that lfi is correct.We make the ðiþ 1Þth restrictive BFS expansion. Themaximal finalized distance lfiþ1 comes from the ðiþ 1Þthexpansion or the delayed expansion. In the first case,minðd2sÞiþ1 is the minimal distance for all newly visitednodes in the ðiþ 1Þth expansion. In the second case, thedelayed expansion cannot visit nodes with distances lessthan lfi þ wiþ1 � wi, according to the rule of the extendedE-operator and the properties of edge-weight aware par-titioned tables. Thus, the formula is correct for i � 2. tuNext, we illustrate an example in Fig. 7 to show that

the candidate shortest distance minCost might be notshortest when the original termination conditionminCost � lfi þ lbj is met. Still take the partitioned tablesin Fig. 4 as an example. After we make two forwardexpansions and two backward expansions, we get themaximal finalized distance lf1 ¼ 6 and lf2 ¼ 14 in the for-ward expansion, according to Formula 1. Similarly, weget the maximal finalized distance lb1 ¼ 2 and lb2 ¼ 12 inthe backward expansion. We further compute the currentminimal distance minCost ¼ 26 from a path via d, e, andh. At that time, lf2 þ lb2 ¼ minCost, and then the iterationsare terminated. However, 26 is not the shortest distance,since there exists a direct edge from s to t with itsweight 22, whose expansion is delayed.

The above example shows that minCost is larger than

usual due to the delayed expansion. We cannot simply con-

sider all edges in each expansion to address the issue, since

it will retreat to the classic BFS method and totally lose the

advantages of the restrictive BFS and partitioning schema.

We address such a issue by launching a one-time verifica-

tion after lfi þ lbj � minCost is satisfied. The verification is

described as follows:

Definition 6 (Verification). Let TAf (TAb) be all visited nodesin the forward (backward) expansion. TE be the union of allpartitioned edge tables. The verification phase finds the mini-mal u:d2sþ wðu; vÞ þ v:d2t, where u is in TAf , v is in TAb,wðu; vÞ is the weight of an edge ðu; vÞ 2 TE.

Let minCost0 be the minimal distance discovered inthe path expansions when lfi þ lbj � minCost0 is satisfied,minCost1 be the minimal distance discovered in the veri-fication. The shortest distance dðs; tÞ can be computed byFormula 2.

dðs; tÞ ¼ minðminCost0;minCost1Þ: (2)

Theorem 3 The shortest distance can be computed by Formula 2correctly.

Proof. We first show that all nodes in the shortest path havebeen finalized in forward or backward expansions, whenlfi þ lbj � minCost0 is achieved in the restrictive BFS onthe partitioned tables. We prove it by contradiction. Let xbe a node in the shortest path p. Suppose that x is neitherfinalized in the forward expansion nor in the backwardexpansion. The distance from u to x is then more than lfi ,and the distance from x to v is more than lbj. Thus, thelength of p, lenðpÞ, is larger than lbi þ l

fj , and then

lenðpÞ > minCost0, which contradicts that p is shortest.Next, let p be the shortest path from the source node

s to the target node t. Since all nodes in p have beenfinalized in the forward or backward expansion, it iseasy to know that p can be divided into at most threesub-parts, a sub-path pf from s to u, an edge ðu; vÞ, anda sub-path pb between v and t, where all nodes in pf

are finalized in the forward expansion, and all nodesin pb are finalized in the backward expansion.minCost0 is the shortest distance if there is no such anedge e or e has already been searched in the iterativepath expansions. Otherwise, the shortest distance canbe discovered as minCost1 during the verification. tu

4.6 Complete Algorithm of Bi-Directional RestrictiveBFS on Partitioned Tables

In this part, we show the complete SQL statement and algo-rithm for the bi-directional restrictive BFS on partitionedtables in Algorithm 2.

We insert the source node and target node into an emptyTAf table and TAb table respectively in line 1. minCost isfor the minimal distance currently seen. The variable lfi forthe maximal finalized distance in the latest forward expan-sion, nf for the total number of frontier nodes, and i for thenumber of expansions in the forward expansion are initial-ized from line 3 to 5. The counterpart variables in the back-ward expansion are also initialized in the same line.

Fig. 7. Issue in candidate shortest distance computing.

Fig. 6. Issue in maximal finalized distance computing.


We make path expansions in the iterations from line 6 to14. We select an expansion direction according to the num-ber of frontier nodes, and show the details in the forwardexpansion from line 8 to 11. We expand the paths using thesecond SQL in Listing 4, which merges the nodes in a viewER into the visited node set. ER contains the results of anextended E-operator in the first SQL. Let i be the number ofthe forward expansions, pts be the number of the partitionedtables. The extended E-operator returns an union ofminði; ptsÞ basic E-operators. l in the first SQL ismaxð1; i� ptsþ 1Þ, which is for the minimal value of fwd ofall frontier nodes in the ith expansion. We also incorporatethe bi-directional pruning rule into the extendedE-operator.Let minCost be the minimal distance seen so far, lbj be themaximal distance finalized in the latest backward expansion.Only the nodes with their d2s less thanminCost� lbj are con-sidered in the forward expansion. We compute the maximaldistance finalized lfi in the ith forward expansion in line 10using Formula 1. We further refine the minimal distanceminCost seen so far in line 14 using the fourth statement.

When the condition lfi þ lbj � minCost in line 6 is satis-fied, we terminate iterations and make a verification stepusing the fifth statement. The shortest distance can becomputed from the minimal distance discovered in theiterative path search and that in the verification phase,as shown in Formula 2. We then locate one node xid inthe shortest path with the shortest distance minCost, and

recover the full shortest path along the p2s and p2t linksfrom xid respectively.

Analysis. Now we give the number of iterations used inthe restrictive BFS over partitioned tables in Theorem 4.

Theorem 4. Let s be the source node and t be the target node. p be

the shortest path between s and t. p can be discovered byP

e2edgesðpÞðpartidxðeÞÞ iterations, where edgesðpÞ is for the

edge set of p, partidxðeÞ is for the index of the partitioned table

for e.

Proof. Suppose we have made i forward expansions and jbackward expansions in Algorithm 2, where iþ j ¼P

e2edgesðpÞðpartidxðeÞÞ. p can be found since each edge e inp will be expanded after partidxðeÞ expansions. We ana-lyze whether the iterations can be terminated at thattime. We know minCost is lenðpÞ, or dðs; tÞ, as p has beenfound. According to the Formula 1, lfi is the smaller onebetween the minimal distance minðd2sÞi discovered inthe latest ith forward expansion and the minimal dis-tance for delayed expansions. Since p has been discov-ered, minðd2sÞi is no more than the minimal distance fordelayed expansions. Then, lfi is minðd2sÞi. Similarly, lbj isminðd2sÞj in the backward expansion. Thus, we havelfi þ lbj ¼ minCost, and the iterations can be terminated. tuIn general cases, the restrictive BFS over partitioned

tables requires much fewer iterations than the classic

Dijkstra’ search. The restrictive BFS allows more nodes to

be expanded in one operation, and then the maximal final-

ized distances lfi and lbj increase faster, which terminate the

path expansions earlier. The reduction of iterations is

more obvious on denser graphs. For example, the online

social networks always have six degrees of separation. In

such large dense graphs, a shortest path between any two

nodes tends to contain fewer edges, which results in fewer

iterations by Theorem 4.

5 EXPERIMENTAL RESULTS

In this section, we experimentally evaluate the effectivenessand efficiency of our relational approach on both real andsynthetic data sets extensively.

5.1 Experimental Setup

Implementation Details and Competitors. We compare11 approaches related to this paper. These approaches are

TABLE 1Methods to be Compared


summarized in Table 1. We implement all methods in Javawith JDK 1.6 and evaluate them on a server with 2 IntelXeon E5620 64 bit 2.4 GHz processors, 24 G of RAM, run-ning Windows server 2008 R2 Standard. As for Neo4j [2],we take its major stable version neo4j-community-1.8 anduse its built-in routing GraphAlgoFactory.dijkstra in the short-est path discovery.

Listing 4. SQL in Expansion

In order to show the applicability of our methods, we con-duct experiments over two relational database systems: onecommercial database system (denoted by DBMS-x) andanother open source database system PostgreSQL 9.0(denoted by PostgreSQL).

We build indices over the relational tables for the graphand intermediate results. Specifically, for the edge table TEor any partitioned edge table, we build clustered indices onfid in the tables. For the forward visited node table TAf andthe backward visited node table TAb, we build unique clus-tered indices on nid on both tables. The meanings of fidand nid are given in Sections 2.1 and 3.2 respectively.

Data Sets. We use four graph data sets in tests, includingtwo real graphs LiveJournal and Orkut, and two syntheticgraphs named Random and Power. LiveJournal is down-loaded from Stanford’s data collection,3 which is a friend-ship network of on line community LiveJournal. Orkut is alarge social network graph [21]. Random graphs are gener-ated as follows. Let n and m be the number of nodes andedges respectively, we randomly select the source and

target node for m edges among n nodes. Power graph set isgenerated using Barabasi Graph Generator v1.4.4 We assignthe weights of edges in all graphs in two ways. The first iswith Uniform weights that are drawn uniformly at randombetween 1 and 100. For the second one Zipf, we use Zipf dis-tribution for the same range and assign the weights ran-domly. In Zipf’s law with N elements, the frequency of thekth element is k�a=

PNn¼1 n

�a, where a is a positive constantto control the skew of data distributions. We vary a from 0.2to 1.0 in following experiments.

Some statistics of these graphs are summarized in Table 2.As for any synthetic graph, we suffix xNyd to indicate thatthe graph is with x nodes and the average degree y. Forexample, Random100mN3d represents a Random graph with100 million nodes and an average degree 3.

We study the issues related to the FEM framework andits optimizations. The experiments will be divided intothree sub-parts. We first verify the effectiveness of the FEMframework and set-at-a-time evaluation fashion. We thenstudy optimization strategies specific to shortest path dis-covery inside the FEM framework. Finally, we make exten-sible studies on our method with all optimization strategies.In the following experiments, we randomly generate100 shortest path queries, and report the average time cost.The parameters used in the tests are listed inside figures.

5.2 FEM Framework and Set-at-a-Time Fashion

In this part, we mainly answer the following questions:i) Can the FEM framework support Dijkstra’s implementa-tion in RDB? ii) What is the most expensive phase in pathdiscovery? iii) Do the new SQL features boost the perfor-mance remarkably? The methods studied include DJ, BDJand BSDJ in Table 1.

Fig. 8a reports the results of DJ, BDJ, and BSDJ on Powergraphs varying size from 100k to 500k. We omit the curvesfor DJ as DJ takes much more time cost than the other two.For example, DJ consumes about 15 minutes, while BDJ con-sumes 8.42 seconds, and BSDJ costs 2.13 seconds in a graphwith 100k nodes.

In order to find the main factors to the evaluationcost, we collect the total number of expansions in thepath finding and the time cost consumed by DJ, BDJ,and BSDJ in Table 3. It clearly shows that BSDJ takes thefewest expansions. The number of expansions in DJ isabout 75 times bigger than that in BDJ, and 300 timesbigger than that in BSDJ. Thus, the number of SQLsused by BSDJ are fewest among three. The results verifyour claim before. The set-at-a-time evaluation fashion,which enables the RDB optimizer to produce a betterevaluation plan, can beat the node-at-a-time evaluationfashion easily, when they have the same search space.

3. http://snap.stanford.edu/data/. 4. http://www.cs.ucr.edu/ddreier/barabasi.html.

TABLE 2Statistics of Graph Data Sets


Fig. 8b plots the time cost used in different phases in thepath finding in BSDJ, including the path expansion, the sta-tistics collection (denoted by SC), and the full path recovery(denoted by FPR). We can see that the path expansion withthree operators consumes most of the time. Next, we godeep into the path expansion. We directly translate F , E,and M-operator into separate SQLs, and collect their timecost. In Fig. 8c, we find E-operator takes about 70 percent ofthe time cost in the path expansion. It is mainly due to thehigh cost of join operation between the graph edge tableand the visited node table in E-operator.

Fig. 8d studies the query time cost affected by differ-ent SQL features. We implement path discovery with thenew features including window function and mergestatement (denoted by NSQL), and the traditionalfeatures including aggregate functions and insert/updatefor merge (denoted by TSQL). The results show thatthe NSQL method outperforms the TSQL method

significantly. We believe that the advance of RDB makesit more powerful in handling the shortest path discoveryand other complex graph operations.

Due to the fact that BSDJ outperforms DJ and BDJ thor-oughly, we omit the curves for the latter two in the follow-ing. We will use new features of SQL in path finding unlessexplicitly mentioned.

5.3 Optimizations in Shortest Path Discovery

In this part, we first study the effectiveness of different opti-mizations, including BBFS, BSDJ, BSEG and BRBFSw listedin Table 1. We then attempt to find the impacts of databasebuffer, equal-width and equal-depth partitioning schemas,and the number of partitioned tables, on the performance ofthe shortest path discovery.

Figs. 9a and 9b compare the time cost consumed byBSDJ, BSEG(x), BBFS and BRBFSw(x) on two sets oflarge graphs with uniform weights. Here, x in BSEG(x)is for the index threshold, and we set an approximateoptimal threshold in our tests (more details are specifiedin the previous work [18]). x in BRBFSw(x) is for thetotal number of partitioned tables. We can observe thatin bigger graphs, BRBFSw is fastest. The time cost inBRBFSw is nearly 1/4 of that in BBFS, 1/2 in BSDJ and1/4 in BSEG across Random40mN3d graph, and the pro-portion to BBFS is about 1/10 on Orkut graph.

We make a deep analysis on the detailed time cost, thenumber of expansions and the number of visited nodes indifferent strategies in Table 4. We can see that the efficiency

Fig. 8. Experimental results on FEM framework.

TABLE 3E(# Expansions), T(Time:s) on Power Graphs

Fig. 9. Experimental results of different optimization strategies.


of BRBFSw on large graphs comes from that BRBFSw canreduce the number of SQLs at the cost of slightly enlargedsearch space. For example, the visited nodes in BRBFSw(10)are a little more than those in BSDJ, but the number ofexpansions in BRBFSw(10) is just 1/4 of that in BSDJ. Wealso see that although BBFS takes fewest expansions, BBFSis slowest in the tests due to its much more visited nodes.BSEG attempts to lower the number of SQLs with pre-com-puted segments. It works best when a graph is small, but itsperformance degrades sharply in handling large graphsas it incurs too many pre-computed segments. For example,BSEG generates 18,515,589 additional segments onRandom40mN3d graph with lthd ¼ 10. By combining theseexperimental results, we know that we cannot consider theevaluation fashion (the number of expansion operations) orthe search space (the size of intermediate nodes) separately,but strike to achieve a balance between them.

Next, we study the impact of different buffer sizes on thequery performance. Figs. 9c and 9d compare the time costused by BSDJ, BSEG, BBFS and BRBFSw varying differentbuffer sizes. We can see a major advantage of BRBFSw com-pared with BBFS, BSDJ and BSEG when the buffer isreduced. It is due to that the expansion in BRBFSw can bemade on much smaller partitioned tables. In addition,BRBFSw is buffer-friendly. The partitioned tables withsmaller indices have more chances to be in the buffer asthey are accessed more frequently in extended E-operators.

Fig. 9e studies different effects of equal-width and equal-depth partitioning schema on the query performance acrossRandom graphs with Zipf weight distributions varying skewa from 0.2 to 1.0. The results are a little surprising. Theequal-depth schema does not have obvious advantagesover the equal-width one on graphs with random skewededge weights. We further analyze the size of partitionedtables, and find that the random skewed edge weights mayresult in non-skewed partitioned tables in the equal-widthschema, as each partitioned table contains edges within arange of weights. For example, the maximal size of parti-tioned tables is only about twice the average size witha ¼ 1:0 in the equal-width schema. These non-skewed parti-tioned tables will alleviate the impacts of skewed edgeweights on the equal-width schema in the restrictive BFS.

However, the above reasons cannot explain why theequal-width schema may obtain a better performance thanthe equal-depth one on skewed partitioned tables. We con-jecture that the partitioned tables with smaller indices willhave a bigger impact on the search space in the restrictiveBFS. To verify the claim, we artificially assigned smallweights(denoted by Head), random weights(denotedby Random), large weights(denoted by Tail) with high

frequencies to edges and report the results in Fig. 9f. Asexpected, the equal-depth schema works better in the caseof Head, while the equal-width schema wins in the case ofTail. Recall that BRBFSw expands along edges in partitionedtables with smaller indices first. In the case of Head, the par-titioned tables with smaller indices contain more edges inthe equal-width schema than those in the equal-depth one,and then BRBFSw visits more nodes in the following search,which hinders the performance on the equal-width schema.The results are opposite in the case of Tail.

Fig. 9g illustrates the query evaluation time cost varyingpts, the number of partitioned tables. We can see theincrease of pts first lowers the query cost and then incursmore query cost. In fact, the setting of pts has twofoldimpacts. On the one hand, a larger pts results in moresmaller partitioned tables, which enables the optimizer inthe RDB to produce a better buffer strategy. On the otherhand, a larger pts means we need more expansions to locatethe shortest path, which cuts down the query performance.In addition, as shown in Fig. 9g, a relatively larger pts (e.g.,pts ¼ 30) is appropriate for Random100m graph while asmaller pts (e.g., pts ¼ 20) is suitable on Random40m. Howto find an optimal pts over different graphs will be a futurework of this paper.

Fig. 9h studies the time cost in the weight aware edgetable partitioning varying pts over two large Randomgraphs. We need to create partitioned tablesTEið1 � i � ptsÞ, then put edges into TEi according totheir weights, and finally build indices on TEi. The over-all time cost is insensitive to different settings of pts.Another observation is that table partitioning shows amuch higher scalability than SegTable indexing [18]. Forexample, SegTable takes more than 5 hours on Ran-dom100mN3d graph in indexing.

5.4 Extensive Studies

In this part, we study the impacts of different databaseengines and index strategies on the query performance.Finally, we compare our RDB approach with the in-memoryones and graph database Neo4j.

Fig. 10a compares the query time consumed by BSDJ andBRBFSw on PostgreSQL. Since PostgreSQL supports the win-dow function but cannot provide the merge statement, weuse an update followed by an insert statement for the M-operator instead. The results on PostgreSQL are similar tothose on the commercial database system, which also showthat our method has good applicability.

In Fig. 10b, we conduct studies on different index strate-gies, including the clustered unique index (denoted byCluIndex), non-clustered unique index (denoted by Index),and no index (denoted by NoIndex) on fid in the edge tableTE, as well as on nid in two visited node tables TAf andTAb. We can see that CluIndex achieves the best perfor-mance. CluIndex offers RDB a chance to implement theextended E-operator and the discovery of candidate short-est distances by the merge-sort join, in which only the merg-ing cost is needed since the clustered indices have imposedorders on nodes.

Last, we compare our relational approach with the in-memory ones, including MDJ and MBDJ, as well as Neo4j

TABLE 4T(Time:s), E(# Expansions), and Vst (# Visited Nodes)

on Random Graphs


in Table 1. The buffer size of RDB is set to be the same as theJVM memory used by in-memory approaches. Neo4j isembedded in a java application with the same JVM mem-ory. In order to make comparisons fair, the time reportedfor in-memory approaches does not include the time inloading graph into memory, and the time for relationalapproaches and Neo4j graph database is collected after thedatabase buffer becomes hot. In both Figs. 10c and 10d, wefind that the performance and scalability of Neo4j are notgood on large graphs. In fact, we cannot run Neo4j on Ran-dom5mN3d and Livejournal with 10 G memory and we setthe JVM memory used in Neo4j to 20 G. We believe the fol-lowing strategies contribute to the improvement of ourmethod. First, BRBFSw employs a bi-directional searchstrategy, which can reduce the search space significantly.Second, BRBFSw takes the set-at-a-time fashion, which canfully exploit the data loaded into the memory and then cutdown the high cost in I/O operations. Third, BRBFSwadopts the table partitioning schema, which can greatlylower the join cost in path expansions.

In addition, we can see our BRBFSw algorithm is not asfast as MBDJ. It is not surprised as RDB is a general-purposeinfrastructure in different information systems, and is nottailored for the graph data management. Furthermore, RDBcannot exploit all memory in the buffer for the shortest pathdiscovery, since RDB keeps other information, such as sys-tem tables, in the buffer. We also notice that BRBFSw canoutperform MDJ on large graphs and show a better scalabil-ity. These results also reveal that a proper evaluation strat-egy in the RDB context can still gain a satisfactoryperformance. We stress here that the main advantage ofRDB lies in its scalability, stability and easy programmingin the graph management.

5.5 Summary

To sum up, from the experimental results, we can draw thefollowing conclusions: i) The FEM framework can be usedto support graph search queries such as the shortest pathdiscovery. The new features introduced by recent SQLstandards can improve the performance of the FEM

framework. ii) The bi-directional restrictive BFS on weightaware partitioned tables can improve both performanceand scalability significantly. The choice of the equal-widthand equal-depth partitioning schemas is related to the skewof partitioned tables and the size of partitioned tables withsmall indices. iii) Our relational approach is more efficientthan the existing graph database Neo4j in the shortest pathdiscovery on large graphs.

6 RELATED WORK

Graph Search. Graph search queries are basic and importantgraph operations. Due to the large search space, many pro-posed approaches take a greedy idea, including Dijkstra’salgorithm for the shortest path [13], Prim’s algorithm for theminimal spanning tree [23], the salesman path discovery[10], and the like. These approaches follow a similar select-expand-merge pattern in the graph search.

Shortest path discovery is a representative graph searchoperation. Dijkstra’s algorithm is a well-known online algo-rithm [13] to solve the single-source shortest path problem.The bi-directional search strategy [11] is an effective optimi-zation on Dijkstra’s algorithm. Different kinds of shortestpath indices, such as the landmark index [22], 2-HOPrelated index [12], etc, can be computed offline and boostthe performance in the running time. However, neitherthese online shortest path discovery nor index buildingmethods consider the case when the graph cannot be fullyloaded into main memory.

External Graph Operations. The external graph opera-tions have been also studied to handle large graphs. Anexternal shortest path index is proposed by [9] on planargraphs. MapReduce framework [4], [17] and its opensource implementation Hadoop [1] achieve a high scal-ability in processing large graphs. The limitation of cur-rent MapReduce framework lies in its weak support toonline query and expensive dynamic update of graphs.We also notice the studies on specific graph operationsin the external memory. For example, the approximateminimum-cut [6] can be computed on the sampledgraph. The cliques can be found on the partially loadedsub-graphs [16]. However, it is hard to extend thesemethods to support general graph search queries.

RDB based Graph Operations. RDB, as a scalable platform,can be extended to manage complex data, such as XML data[14], [19], [20], and to support sophisticated applications,such as data mining [5] and statistical data analysis [7]. Anoperation on these data or in applications is always trans-lated into a sequence of SQLs, due to the limited expressivepower of SQL. In the translation, an important optimizationis to produce fewer SQLs [20]. We also incorporate such afeature in designing an efficient relational approach forgraph search queries.

The studies on transitive closure and aggregation withrecursion in the relational context can be viewed as pioneerworks to support graph operations in RDB, as graph searchqueries, such as the shortest path discovery [24], [26], can beexpressed using transitive closure, aggregation and selec-tion. Different approaches are compared to find the full orpartial transitive closure [26]. The aggregation with recur-sion can be computed efficiently in an incremental way [24]

Fig. 10. Experimental results on extensive study.


with the changes of underlying graph data. Compared withthe previous works, this paper focuses on the shortest pathdiscovery in the relational context with new featuresof SQL, and designs various optimizations specific to theproblem, including the restrictive BFS and the table parti-tioning schema.

RDB based graph query processing is related to ourwork. The reachability query is evaluated by a stored proce-dure [28]. The SQL based graph data mining approacheshave been studied in [8], [27]. Different from the existingmethods, we attempt to design a generic graph searchframework in RDB, and improve the efficiency of the frame-work with new features of SQL. We also optimize the rela-tional shortest path discovery by considering the searchspace reduction, RDB-friendly evaluation fashion and tablepartitioning schema.

7 CONCLUSIONS

This paper proposes a relational approach to discover theshortest path on large graphs. We first abstract a relationalgeneric graph search framework FEM with three new oper-ators, and employ the new features of SQL such as windowfunction and merge statement to improve the performanceof the FEM framework. Second, we optimize the basicmethod via the bi-directional restrictive BFS over weightaware partitioned edge tables, which can improve both per-formance and scalability significantly without extra over-heads. The final extensive experimental results show thatour approach can achieve high efficiency and scalabilitycompared with the existing methods. Due to space limita-tions, the extension of the FEM framework to other complexgraph operations will be our future work.

ACKNOWLEDGMENTS

This work was supported by NSFC under Grant Nos.61073018 and 61272156, the research Grants Council of theHong Kong SAR under Nos. 418512 and 419109, and theNational High Technology Research and Development Pro-gram of China under No. 2012AA011002.

REFERENCES

[1] “Apache Hadoop,” http://hadoop.apache.org, 2014.[2] “Neo4j,” http://neo4j.org/, 2014.[3] A. Goldberg and C. Harrelson, “Computing the Shortest Path:

Search Meets Graph Theory,” Proc. 16th Ann. ACM-SIAM Symp.Discrete Algorithms (SODA ’05), pp. 156-165, 2005.

[4] B. Bahmani, K. Chakrabarti, and D. Xin, “Fast Personalized Pag-erank on Mapreduce,” Proc. ACM SIGMOD Int’l Conf. Managementof Data (SIGMOD ’11), pp. 973-984, 2011.

[5] B. Zou, X. Ma, B. Kemme, G. Newton, and D. Precup, “DataMining Using Relational Database Management Systems,” Proc.10th Pacific-Asia Conf. Advances in Knowledge Discovery and DataMining ( ’06), pp. 657-667, 2006.

[6] C. Aggarwal, Y. Xie, and P. Yu, “GConnect: A Connectivity Indexfor Massive Disk-Resident Graphs,” Proc. VLDB Endowment,vol. 2, no. 1, pp. 862-873, 2009.

[7] C. Mayfield, J. Neville, and S. Prabhakar, “ERACER: a DatabaseApproach for Statistical Inference and Data Cleaning,” Proc. ACMSIGMOD Int’l Conf. Management of Data (SIMGMOD ’10), pp. 75-86, 2010.

[8] C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi, “Scalable Mining ofLarge Disk-Based Graph Databases,” Proc. 10th ACM Int’l Conf.Knowledge Discovery and Data Mining (SIGKDD ’04), pp. 316-325,2004.

[9] D. Hutchinson, A. Maheshwari, and N. Zeh, “An External Mem-ory Data Structure for Shortest Path Queries,” Discrete AppliedMath., vol. 126, pp. 55-82, no. 1, 2003.

[10] D. Johnson and L. McGeoch, “The Traveling Salesman Problem: ACase Study in Local Optimization,” Local Search in CombinatorialOptimization, pp. 215-310, Princeton Univ. Press, 1997.

[11] D. Wagner and T. Willhalm, “Speed-Up Techniques for Shortest-Path Computations,” Proc. 24th Ann. Conf. Theoretical Aspects ofComputer Science (STACS ’07), pp. 23-36, 2007.

[12] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick, “Reachabilityand Distance Queries via 2-Hop Labels,” Proc. 13th Ann. ACM-SIAM Symp. Discrete Algorithms (SODA ’02), pp. 937-946, 2002.

[13] E. Dijkstra, “A Note on Two Problems in Connexion withGraphs,” Numerische Mathematik, vol. 1, pp. 269-271, 1959.

[14] F. Tian, B. Reinwald, H. Pirahesh, T. Mayr, and J. Myllymaki,“Implementing a Scalable XML Publish/Subscribe System Usinga Relational Database System,” Proc. ACM SIGMOD Int’l Conf.Management of Data (SIGMOD ’04), pp. 479-490, 2004.

[15] H. Garcia-Molina, J. Ullman, and J. Widom, Database Systems: TheComplete Book. Prentice Hall Press, 2008.

[16] J. Cheng, Y. Ke, A.W. Fu, J.X. Yu, and L. Zhu, “Finding MaximalCliques in Massive Networks by H*-graph,” Proc. ACM SIGMODInt’l Conf. Management of Data (SIGMOD ’10), pp. 447-458, 2010.

[17] J. Dean and S. Ghemawat, “Mapreduce: Simplified Data Process-ing on Large Clusters,” Proc. Sixth Symp. Operating System Designand Implementation (OSDI’04), pp. 137-150, 2004.

[18] J. Gao, R. Jin, J. Zhou, J. Xu, X. Jiang, and T. Wang, “RelationalApproach for Shortest Path Discovery over Large Graphs,” Proc.VLDB Endowment, vol. 5, no. 4, pp. 358-369, 2011.

[19] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. DeWitt, andJ. Naughton, “Relational Databases for Querying XML Docu-ments: Limitations and Opportunities,” Proc. 25th Int’l Conf. VeryLarge Data Bases (VLDB ’99), pp. 302-314, 1999.

[20] M. Benedikt, C. Chan, W. Fan, R. Rastogi, S. Zheng, and A. Zhou,“Dtd-Directed Publishing with Attribute Translation Grammars,”Proc. 28th Int’l Conf. Very Large Data Bases (VLDB ’02), 2002.

[21] A. Mislove, M. Marcon, K.P. Gummadi, P. Druschel, and B.Bhattacharjee, “Measurement and Analysis of Online SocialNetworks,” Proc. Seventh ACM SIGCOMM Conf. Internet Mea-surement (IMC ’07), Oct. 2007.

[22] M. Potamias, F. Bonchi, C. Castillo, and A. Gionis, “Fast ShortestPath Distance Estimation in Large Networks,” Proc. Int’l Conf.Information and Knowledge Management (CIKM’09), pp. 453-470,2009.

[23] R. Prim, “Shortest Connection Networks and Some General-izations,” Bell System Technical J., vol. 36, pp. 1389-1401, 1957.

[24] R. Ramakrishnan, S. Sudarshan, K.A. Ross, and D. Srivastava,“Efficient Incremental Evaluation of Queries with Aggregation,”Proc. Int’l Symp. Logic Programming (SLP ’94), pp. 204-218, 1994.

[25] R. Ronen and O. Shmueli, “SoQL: A Language for Querying andCreating Data in Social Networks,” Proc. IEEE 25th Int’l Conf. DataEng. (ICDE’09), pp. 1595-1602, 2009.

[26] S. Dar and R. Ramakrishnan, “A Performance Study of TransitiveClosure Algorithms,” Proc. ACM SIGMOD Int’l Conf. Managementof Data (SIGMOD’94), pp. 454-465, 1994.

[27] S. Srihari, S. Chandrashekar, and S. Parthasarathy, “A Frameworkfor SQL-Based Mining of Large Graphs on Relational Databases,”Proc. 14th Pacific-Asia Conf. Advances in Knowledge Discovery andData Mining—Volume Part II (PAKDD’10), pp. 160-167, 2010.

[28] S. Trißl and U. Leser, “Fast and Practical Indexing and Queryingof Very Large Graphs,” Proc. ACM SIGMOD Int’l Conf. Manage-ment of Data (SIGMOD’07), pp. 845-856, 2007.

Jun Gao received the BE and ME degrees fromShandong University, China, in 1997 and 2000,respectively, and the PhD degree from PekingUniversity, China, in 2003, all in computer sci-ence. Currently, he is a professor in the School ofElectronics Engineering and Computer Science,Peking University. His major research interestsinclude web data management and graph datamanagement.


Jiashuai Zhou received the BS degree from theDepartment of Computer Science, Peking Uni-versity, China, in 2011. He is currently workingtoward the graduate degree in the Department ofComputer Science, School of Electronics Engi-neering and Computer Science, Peking Univer-sity. His research interests include graph datamanagement and web data management.

Jeffrey Xu Yu received the BE, ME, and PhDdegrees in computer science from the Universityof Tsukuba, Japan, in 1985, 1987, and 1990,respectively. Currently, he is a professor in theDepartment of Systems Engineering and Engi-neering Management, The Chinese University ofHong Kong. His major research interests includedata mining, data stream, data warehouse, graphquery processing, and optimization.

Tengjiao Wang received the BE and MEdegrees from Shandong University, China, in1996 and 1999, respectively, and the PhD degreefrom Peking University, China, in 2002, all in com-puter science. Currently, he is a professor in theSchool of Electronics Engineering and ComputerScience, Peking University. His major researchinterests include data mining, web data manage-ment, and database system implementation.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.


Shortest path computing in relational dbm ss

Technology

Transcript of Shortest path computing in relational dbm ss