A Brief Study of Open Source Graph Databasesmodern NoSQL databases and graph databases (Neo4j,...
Transcript of A Brief Study of Open Source Graph Databasesmodern NoSQL databases and graph databases (Neo4j,...
A Brief Study of Open Source Graph DatabasesRob McColl David Ediger Jason Poovey Dan Campbell David Bader
Georgia Tech Research Institute, Georgia Institute of Technology
AbstractWith the proliferation of large irregular sparse relational datasets, new storage and analysis platforms
have arisen to fill gaps in performance and capability left by conventional approaches built on traditionaldatabase technologies and query languages. Many of these platforms apply graph structures and analysistechniques to enable users to ingest, update, query and compute on the topological structure of theserelationships represented as set(s) of edges between set(s) of vertices. To store and process Facebook-scaledatasets, they must be able to support data sources with billions of edges, update rates of millions ofupdates per second, and complex analysis kernels. These platforms must provide intuitive interfaces thatenable graph experts and novice programmers to write implementations of common graph algorithms.
In this paper, we explore a variety of graph analysis and storage platforms. We compare their capabil-ities, interfaces, and performance by implementing and computing a set of real-world graph algorithmson synthetic graphs with up to 256 million edges. In the spirit of full disclosure, several authors areaffiliated with the development of STINGER.
1 Background
In the context of this paper, the term graph database is used to refer to any storage system that cancontain, represent, and query a graph consisting of a set of vertices and a set of edges relating pairs ofvertices. This broad definition encompasses many technologies. Specifically, we examine schemas withintraditional disk-backed, ACID-compliant SQL databases (SQLite, MySQL, Oracle, and MS SQL Server),modern NoSQL databases and graph databases (Neo4j, OrientDB, InfoGrid, Titan, FlockDB, ArangoDB,InfiniteGraph, AllegroGraph, DEX, GraphBase, and HyperGraphDB), distributed graph processing toolkitsbased on MapReduce, HDFS, and custom BSP engines (Bagel, Hama, Giraph, PEGASUS, Faunus), andin-memory graph packages designed for massive shared-memory (NetworkX, Gephi, MTGL, Boost, uRika,and STINGER). For all of these, we build a matrix containing the package maintainer, license, platform,implementation language(s), features, cost, transactional capabilities, memory vs. disk storage, single-nodevs. distributed, text-based query language support, built-in algorithm support, and primary traversal andquery styles supported. An abridged version of this matrix is presented in Appendix A. For a selected groupof primarily open source graph databases, we implemented and tested a series of representative commongraph kernels on each technology in the set on the same set of synthetic graphs modeled on social networks.
Previous work has compared Neo4j and MySQL using simple queries and breadth-first search [7], and weare inspired by a combination objective/subjective approach to the report. Graph primitives for RDF querylanguages were extensively studied in [1] and data models for graph databases in [2], which are beyond thescope of this study.
Algorithms and Approach
Four common graph kernels were selected with requirements placed on how each must be implemented toemphasize common implementation styles of graph algorithms. The first is the Single Source Shortest Paths(SSSP) problem, which was specifically required to be implemented as a level-synchronous parallel breadth-first traversal of the graph. The second is an implementation of the Shiloach-Vishkin connected componentsalgorithm (SV) [6] in the edge-parallel label-pushing style. The third is PageRank (PR) [4] in the vertex-parallel Bulk Synchronous Parallel (BSP) power-iteration style. The last is a series of edge insertions anddeletions in parallel to represent random access and modification of the structure. When it was not possibleto meet these requirements due to restrictions of the framework, the algorithms were implemented in as closeof a manner as possible. When the framework included an implementation of any of these algorithms, thatimplementation was used if it met the requirements. The implementations were intentionally written in astraightforward manner with no manual tuning or optimization to emulate non-hero programmers.
Many real-world networks demonstrate “scale-free” characteristics [3, 8]. Four initial sparse static graphs
1
and sets of edge insertions and deletions were created using an R-MAT [5] generator with parameters A =0.55, B = 0.1, C = 0.1, and D = 0.25). These graphs contain 1K (tiny), 32K (small), 1M (medium), and16M (large) vertices with 8K, 256K, 8M, and 256M undirected edges, respectively. Graphs over 1B edgeswere considered, but it was determined that many packages could not handle that scale. Edge updates havea 6.25% probability of being deletions uniformly selected from existing edges where deletions are handled inthe default manner for the package. For the tiny and small graphs, 100K updates were performed, and 1Mupdates were performed otherwise.
The quantitative measurements taken were initial graph construction time, total memory in use by theprogram upon completion, update rate in edges per second, and time to completion for SSSP, SV, andPR. Tests for single-node platforms and databases were run on a system with four AMD Opteron 6282 SEprocessors and 256GB of DDR3 RAM. Distributed system tests were performed on a cluster of 21 systemswith minimum configuration of two Intel Xeon X5660 processors with 24GB DDR3 RAM connected by QDRInfiniband. HDFS was run in-memory. Results are presented in full in Appendix B. Packages for which noresult is provided either ran out of memory or simply did not complete in a comparable time. The code usedfor testing is available at github.com/robmccoll/graphdb-testing.
2 Experiences2.1 User Experience
A common goal of open source software is widespread adoption. Adoption promotes community involvement,and an active community is likely to contribute use cases, patches, and new features. When a user downloadsa package, the goal should be to build as quickly as possible with minimal input from the user. A numberof packages tested during the course of this study did not build on the first try. Often, this was the resultof missing libraries or test modules that do not pass, and was resolved with help from the forums. Buildproblems may reduce the likelihood that users continue with a project.
Among both commercial and open source software packages, we find a lack of consistency in approachesto documentation. It is often difficult to quickly get started. System requirements and library dependenciesshould be clear. Quickstart guides are appreciated. We find that most packages lack adequate usage ex-amples. Common examples illustrate how to create a graph and query the existence of vertices and edges.Examples are needed that show how to compute graph algorithms, such as PageRank or connected compo-nents, using the provided primitives. The ideal example demonstrates data ingest, a real analysis workflowthat reveals knowledge, and the interface that an application would use to conduct this analysis.
While not unique to graphs, there are a multitude input file formats that are supported by the softwarepackages in this study. Formats can be built on XML, JSON, CSV, or other proprietary binary and plain-text formats. Each has a trade-off between size, descriptiveness, and flexibility. The number of differentformats creates a challenge for data interchange. Formats that are edge-based, delimited, and self-describingcan easily be parsed in parallel and translated between each other.
2.2 Developer Experience
Object-Oriented Programming (OOP) was introduced to enforce modularity, increase reusability, and addabstraction to enable generality. These are important goals; however, OOP can also inadvertently obfuscatelogic and create bloated code. For example, considering a basic PageRank computation in Giraph, thefunction implementing the algorithm uses approximately 16% of the 275 lines of code. The remainder ofthe code registers various callbacks and sets up the working environment. Although the extra code providesflexibility, it can be argued that much is boilerplate. The framework should support the programmer spendingas little time as possible writing boilerplate and configuration so that the majority of code is a clear andconcise implementation of the algorithm.
A number of graph databases retain many of the properties of a relational database, representing thegraph edges as a key-value store without a schema. A variety of query languages are employed. Like theirrelational cousins, these databases are ACID-compliant and disk-backed. Some ship with graph algorithmsimplemented natively. While these query languages may be comfortable to users coming from a database
2
background, they are rarely sufficient to implement even the most basic of graph algorithms succinctly. Thebetter query languages are closer to full scripting environments, which may indicate that query languagesare not sufficient for graph analysis.
In-memory graph databases focus more on the algorithms and rarely provide a query language. Accessto the graph is done through algorithms or graph-centric APIs. This affords the user the ability to writenearly any algorithm, but presents a certain complexity and learning curve. The visitor pattern in MTGLis a strong example of a graph-centric API. The BFS visitor allows the user to register callbacks for newlyvisited vertices, newly discovered BFS tree edges, and edges to previously discovered vertices. Given thatbreadth-first traversals are frequently used as building blocks in common graph algorithms, providing an APIthat performs this traversal over the graph in an optimal way can abstract the specifics of the data structurewhile giving performance benefit to novice users. Similar primitives might include component labelings,matchings, independent sets, and others (many of which are provided by NetworkX). Additionally, listeningmechanisms for dynamic algorithms should be considered in the future.
2.3 Graph-specific Concerns
An important measurement of performance is the size of memory consumed by the application processing agraph of interest. An appropriate measurement of efficiency is the number of bytes of memory required tostore each edge of the graph. The metadata associated with vertices and edges can vary by software package.The largest test graph contained more than 128 million edges, or about one gigabyte in 4-byte tuple form.Graph databases were run on machines with 256GB of memory. However, a number of software packagescould not run on the largest test graph.
It is often possible to store large graphs on disk, but the working set size of the algorithm in the midstof the computation can overwhelm the system. For example, a breadth-first search will often contain aniteration in which the size of the frontier is as many as half of the vertices in the graph. We find that somegraph databases can store these graphs on disk, but cannot compute on them because the in-memory portionof the computational is too large.
We find differing semantics regarding fundamental operations on graphs. For example, when an edgeinserted already exists, there are several possibilities: 1) do nothing, 2) insert another copy of that edge, 3)increment the edge weight of that edge, and 4) replace the existing edge. Graph databases may support oneor more of these semantics. The same can be said for edge deletion and the existence of self-edges.
A common query access pattern among graph databases is the extraction of a subgraph, egonet, orneighborhood around a vertex. While answering some questions, a more flexible interface allows randomaccess to the vertices and edges, as well as graph traversal. Graph traversal is a fundamental component tomany graph algorithms. In some cases, random access to graph edges enables cleaner implementation of analgorithm with better workload balance or communication properties.
3 Performance
We compute four algorithms (single source shortest path, connected components, PageRank, and update)for each of four graphs (tiny, small, medium, and large) using each graph package. Developer-providedimplementations were used when available. Otherwise, implementations were created using the best algo-rithms, although no heroic programming or optimization were done. In general, we find up to five ordersof magnitude difference in execution time. The time to compute PageRank on a graph with 256K edges isshown in Figure 1. For additional results, please refer to Appendix B.
4 Position
We believe there is great interest in graph-based approaches for big data, and many open source efforts areunder way. The goal of each of these efforts is to extract knowledge from the data in a more flexible mannerthan a relational database. Many approaches to graph databases build upon and leverage the long historyof relational database research.
3
Graph Package
Tim
e (s
)
Page Rank - Small (|V|: 32K, |E|: 256K)
0.0729
0.741
5.3214 16.9
63.9 67.3 80 103
8761820
3740
stinger boost sqlite mtgl giraph networkx orientdb dex bagel titan neo4j pegasus0.01
0.1
1
10
100
1k
10k
Highcharts.com
Figure 1: The time taken to compute PageRank on the small graph (32K vertices and 256K undirectededges) for each graph analysis package.
Our position is that graph databases must become more “graph aware” in data structure and querylanguage. The ideal graph database must understand analytic queries that go beyond neighborhood selection.In relational databases, the index represents pre-determined knowledge of the structure of the computationwithout knowing the specific input parameters. A relational query selects a subset of the data and joins itby known fields. The index accelerates the query by effectively pre-computing a portion of the query.
In a graph database, the equivalent index is the portion of an analytic or graph algorithm that canbe pre-computed or kept updated regardless of the input parameters of the query. Examples include theconnected components of the graph, algorithms based on spanning trees, and vertex statistics like PageRankor clustering coefficients. A sample query might ask for shortest paths vertices, or alternatively the top kvertices in the graph according to PageRank. Rather than compute the answer on demand, maintainingthe analytic online results in lower latency responses. While indices over the properties of vertices may beconvenient for certain cases, these types of queries are served equally well by traditional SQL databases.
While many software applications are backed by databases, most end users are unaware of the SQLinterface between application and data. SQL is not itself an application. Likewise, we expect that NoSQL isnot itself an application. Software will be built atop NoSQL interfaces that will be hidden from the user. Thesuccessful NoSQL framework will be the one that enables algorithm and application developers to implementtheir ideas easily while maintaining high levels of performance.
Visualization frameworks are often incorporated into graph databases. The result of a query is returnedas a visualization of the extracted subgraph. State-of-the-art network visualization techniques rarely scale tomore than one thousand vertices. We believe that relying on visualization for query output limits the typesof queries that can be asked. Queries based on temporal and semantic changes can be difficult to capturein a force-based layout. Alternative strategies include visualizing statistics of the output of the algorithmsover time, rather than the topological structure of the graph.
With regard to the performance of disk-backed databases, transactional guarantees may unnecessarilyreduce performance. It could be argued that such guarantees are not necessary in the average case, especiallyatomicity and durability. For example, if the goal is to capture and analyze a stream of noisy data fromTwitter, it may be acceptable for an edge connecting two users to briefly appear in one direction and notthe other. Similarly, in the event of a power failure, the loss of a few seconds of data may not significantlychange which vertices have the highest centrality on the graph. Some of the databases presented here seemto have reached this realization and have made transactional guarantees optional.
4
References
[1] Renzo Angles and Claudio Gutierrez. Querying RDF data from a graph database perspective. In TheSemantic Web: Research and Applications, volume 3532 of Lecture Notes in Computer Science, pages346–360. Springer Berlin Heidelberg, 2005.
[2] Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM Comput. Surv., 40(1):1:1–1:39, February 2008.
[3] Albert-Laszlo Barabasi and Reka Albert. Emergence of Scaling in Random Networks. Science,286(5439):509–512, October 1999.
[4] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. ComputerNetworks and ISDN Systems, 30(1–7):107–117, 1998. Proceedings of the Seventh International WorldWide Web Conference.
[5] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proc. 4thSIAM Intl. Conf. on Data Mining (SDM), Orlando, FL, April 2004. SIAM.
[6] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. J. Algs., 3(1):57–67, 1982.
[7] Chad Vicknair, Michael Macias, Zhendong Zhao, Xiaofei Nan, Yixin Chen, and Dawn Wilkins. Acomparison of a graph database and a relational database: a data provenance perspective. In Proceedingsof the 48th Annual Southeast Regional Conference, pages 42:1–42:6, 2010.
[8] D.J. Watts and S.H. Strogatz. Collective dynamics of small world networks. Nature, 393:440–442, 1998.
5
A Graph Databases
Owner / Maintainer
License
Platform
Language
Distribution
Cost
Transactional
Memory-‐Based
Disk-‐Based
Single-‐Node
Distributed
Graph Algorithms
Text-‐BasedQuery Language
Embeddable
Software
Data Store
Type
MySQ
LOracle
GPL/Proprietaryx86
C/C++Bin
FreeX
-‐X
X-‐
-‐SQ
LX
XX
SQLDB
Oracle
Oracle
Proprietaryx86
C/C++Bin
$180-‐$950X
-‐X
XX
-‐SQ
LX
XX
SQLDB
SQL Server
Microsoft
Proprietaryx86-‐W
inC++
Bin$898-‐$8592
X-‐
XX
-‐-‐
SQL
-‐X
XSQ
LDBSQ
LiteRichard Hipp
Public Domain
x86C
Src/BinFree
XX
XX
-‐-‐
SQL
XX
XSQ
LDBAllegroGraph
Franz, Inc.Proprietary
x86Likely Java
BinFree-‐ish/$$$$
X-‐
XX
-‐X
SPARQL,RDFS++,Prolog
-‐X
XGDB
ArangoDBArangoDB
Apachex86
C/C++/JSSrc/Bin
Free-‐
-‐X
X-‐
-‐AQ
L-‐
XX
GDB/KV/DOC
DEXSparsity-‐Technologies
Proprietaryx86
C++Bin
Free Personal/Commercial $$
X-‐
XX
-‐X
TraversalX
-‐X
GDBFlockDB
Apachex86
Java,Scala,RubySrc
Free-‐
-‐X
XX
-‐-‐
-‐X
XGDB
GraphBaseFactN
exusProprietary
x86Java
BinFree,$15/m
o,$20,000?
-‐X
X-‐
-‐Bounds
XX
XGDB
HyperGraphDBKobrix Softw
areLGPL
x86Java
SrcFree
MVCC
XX
XX
-‐HGQ
uery,TraversalX
-‐-‐
HyperGDBInfiniteGraph
Objectivity
Proprietaryx86
Java/C++Bin
Free Trial/$5,000Both
-‐X
XX
-‐Grem
linX
XX
GDBInfoGrid
Johannes ErnstAGPL/Proprietary
x86Java
Src/BinFree + Support
-‐X
XX
X-‐
-‐X
X-‐
GDBNeo4j
Neo Technology
GPL/Proprietaryx86
JavaSrc/Bin
Free,$6,000-‐$24,000X
-‐X
X-‐
XCypher
XX
XGDB/N
oSQL
OrientDB
NuvolaBase Ltd
Apachex86
JavaSrc/Bin
Free + SupportBoth
XX
XX
-‐Extended SQ
L, Gremlin
XX
XGDB/N
oSQL
TitanAurelius
Apachex86
JavaSrc/Bin
Free + SupportBoth
-‐X
XX
-‐Grem
linX
X-‐
GDBBagel
UC Berkley
BSDx86
Java/Scala/SparkSrc
Free-‐
X-‐
XX
X-‐
X-‐
-‐BSP
BGLBoost / IU
Boostx86 / C++
C++Src/Bin
Free-‐
X-‐
X-‐
X-‐
X-‐
-‐Library
FaunusAurelius
Apachex86
JavaSrc
Free + SupportBoth
-‐X
XX
-‐Grem
linX
X-‐
HadoopGephi
Gephi ConsortiumGPL/CDDL
x86Java,O
penGLSrc/Bin
Free-‐
X-‐
X-‐
X-‐
XX
-‐Toolkit
GiraphApache
Apachex86
JavaSrc
Free-‐
X-‐
XX
X-‐
X-‐
-‐BSP
GraphStreamUniversity Le Havre
LGPL/CeCILL-‐Cx86
JavaSrc/Bin
Free-‐
X-‐
X-‐
X-‐
X-‐
-‐Library
Hama
ApacheApache
x86Java
SrcFree
-‐X
-‐X
XX
-‐X
-‐-‐
BSPMTGL
Sandia NL
BSDXM
TC++
SrcFree
-‐X
-‐X
-‐X
-‐X
-‐-‐
LibraryNetw
orkXLos Alam
os NL
BSDx86
PythonSrc/Bin
Free-‐
X-‐
X-‐
X-‐
X-‐
-‐Library
PEGASUS
CMU
Apachex86
JavaSrc/Bin
Free-‐
-‐X
XX
X-‐
-‐X
-‐Hadoop
STINGER
GT / GTRIBSD
x86*/XMT
CSrc
Free-‐
X-‐
X-‐
X-‐
XX
XLibrary
uRikaYarc Data / Cray
ProprietaryXM
TLikely C++
Bin$$$$
?X
-‐X
-‐-‐
SPARQL
-‐X
XAppliance
6
B Performance Results
B.1 Tiny Graph
Graph Package
Tim
e (s
)
Initial Graph Construction - Tiny (|V|: 1K, |E|: 8K)
0.00589 0.00795
0.03320.0542
0.22 0.238
0.619 0.7121.43
1.974.4
pegasus-NA
boost mtgl stinger networkx dex bagel sqlite giraph titan neo4j orientdb0.001
0.01
0.1
1
10
Highcharts.com
Figure 2: The time taken to construct the tiny graph (1K vertices and 8K undirected edges) for each graphanalysis package.
Graph Package
Tim
e (s
)
Single Source Shortest Path - Tiny (|V|: 1K, |E|: 8K)
0.0000338
0.0003460.00127
0.00293
0.0504 0.0794 0.119 0.147 0.195
1.555.43
208
stinger boost mtgl networkx dex neo4j titan orientdb sqlite giraph bagel pegasus
0.0001
0.01
1
100
0.000001
10k
Highcharts.com
Figure 3: The time taken to compute SSSP from vertex 0 in the tiny graph (1K vertices and 8K undirectededges) for each graph analysis package.
7
Graph Package
Tim
e (s
)
Connected Components - Tiny (|V|: 1K, |E|: 8K)
0.000649 0.000791
0.00331
0.04250.105
0.439 0.5641.83 2.04
9.55
164310
stinger boost networkx mtgl dex orientdb titan giraph neo4j bagel sqlite pegasus0.0001
0.001
0.01
0.1
1
10
100
1k
Highcharts.com
Figure 4: The time taken to label the connected components of the tiny graph (1K vertices and 8K undirectededges) for each graph analysis package.
Graph Package
Tim
e (s
)
Page Rank - Tiny (|V|: 1K, |E|: 8K)
0.0069 0.0115
0.118
1.02 1.04 1.735.32 7.19 8.33 11.2
12704030
stinger boost mtgl networkx orientdb dex sqlite titan giraph bagel neo4j pegasus0.001
0.01
0.1
1
10
100
1k
10k
Highcharts.com
Figure 5: The time taken to compute the PageRank of each vertex in the tiny graph (1K vertices and 8Kundirected edges) for each graph analysis package.
8
Graph Package
Edge
s pe
r Sec
ond
Update Rate - Tiny (|V|: 1K, |E|: 8K)
1760 2220 23004630
12600
2960053700 64800
273000
2100000 2460000 3090000
orientdb titan neo4j pegasus sqlite dex bagel giraph networkx boost stinger mtgl-insertonly
1k
10k
100k
1M
10M
Highcharts.com
Figure 6: The update rate in edges inserted/deleted per second when applying 100,000 edge updates to thetiny graph (1K vertices and 8K undirected edges) for each graph analysis package.
Graph Package
Mem
ory
Usag
e (K
B)
Memory Usage - Tiny (|V|: 1K, |E|: 8K)
8380 950015300 15600 17100
4460090000
192000
1820000 2460000
giraph-NA
pegasus-NA
dex sqlite stinger mtgl boost networkx titan bagel orientdb neo4j1k
10k
100k
1M
10M
Highcharts.com
Figure 7: The memory occupied by the program after performing all operations on the tiny graph (1Kvertices and 8K undirected edges) for each graph analysis package.
9
B.2 Small Graph
Graph Package
Tim
e (s
)Initial Graph Construction - Small (|V|: 32K, |E|: 256K)
0.05960.114
0.174
0.619 0.729
1.73 2.12
6.45 7.6411.9
59
pegasus-NA
stinger boost mtgl sqlite bagel giraph networkx titan dex neo4j orientdb0.01
0.1
1
10
100
Highcharts.com
Figure 8: The time taken to construct the small graph (32K vertices and 256K undirected edges) for eachgraph analysis package.
Graph Package
Tim
e (s
)
Single Source Shortest Path - Small (|V|: 32K, |E|: 256K)
0.0109 0.017
0.1010.174 0.195
0.715 0.9642.16 2.48 3.51
11.4
276
boost stinger mtgl networkx sqlite titan neo4j dex orientdb giraph bagel pegasus0.001
0.01
0.1
1
10
100
1k
Highcharts.com
Figure 9: The time taken to compute SSSP from vertex 0 in the small graph (32K vertices and 256Kundirected edges) for each graph analysis package.
10
Graph Package
Tim
e (s
)
Connected Components - Small (|V|: 32K, |E|: 256K)
0.0112
0.0507 0.06860.185
2.235.75 7.76 8.4 9.73
20.3
164421
stinger mtgl boost networkx giraph neo4j dex titan orientdb bagel sqlite pegasus0.001
0.01
0.1
1
10
100
1k
Highcharts.com
Figure 10: The time taken to label the connected components of the small graph (32K vertices and 256Kundirected edges) for each graph analysis package.
Graph Package
Tim
e (s
)
Page Rank - Small (|V|: 32K, |E|: 256K)
0.0729
0.741
5.3214 16.9
63.9 67.3 80 103
8761820
3740
stinger boost sqlite mtgl giraph networkx orientdb dex bagel titan neo4j pegasus0.01
0.1
1
10
100
1k
10k
Highcharts.com
Figure 11: The time taken to compute the PageRank of each vertex in the small graph (32K vertices and256K undirected edges) for each graph analysis package.
11
Graph Package
Edge
s pe
r Sec
ond
Update Rate - Small (|V|: 32K, |E|: 256K)
2520 27904220 4620
1260026500 34300
107000239000
1560000 19400004010000
titan orientdb neo4j pegasus sqlite dex bagel giraph networkx mtgl-insertonly
boost stinger1k
10k
100k
1M
10M
Highcharts.com
Figure 12: The update rate in edges inserted/deleted per second when applying 100,000 edge updates to thesmall graph (32K vertices and 256K undirected edges) for each graph analysis package.
Graph Package
Mem
ory
Usag
e (K
B)
Memory Usage - Small (|V|: 32K, |E|: 256K)
9500
44900 61800 63400 67200
243000398000
7400001120000
10400000
giraph-NA
pegasus-NA
sqlite dex stinger mtgl boost bagel titan networkx neo4j orientdb1k
10k
100k
1M
10M
100M
Highcharts.com
Figure 13: The memory occupied by the program after performing all operations on the small graph (32Kvertices and 256K undirected edges) for each graph analysis package.
12
B.3 Medium Graph
Graph Package
Tim
e (s
)Initial Graph Construction - Medium (|V|: 1M, |E|: 8M)
0.214
3.655.59 6.48
9.86
81.8188
2210
pegasus-NA
stinger bagel giraph boost mtgl networkx titan orientdb0.1
1
10
100
1k
10k
Highcharts.com
Figure 14: The time taken to construct the medium graph (1M vertices and 8M undirected edges) for eachgraph analysis package.
Graph Package
Tim
e (s
)
Single Source Shortest Path - Medium (|V|: 1M, |E|: 8M)
0.3040.456
6.24 6.89 7.8216.6
102166
407
stinger boost mtgl giraph networkx titan bagel orientdb pegasus0.1
1
10
100
1k
Highcharts.com
Figure 15: The time taken to compute SSSP from vertex 0 in the medium graph (1M vertices and 8Mundirected edges) for each graph analysis package.
13
Graph Package
Tim
e (s
)
Connected Components - Medium (|V|: 1M, |E|: 8M)
0.4170.649
3.115.47
8.37
292 308554 667
stinger mtgl boost giraph networkx titan bagel orientdb pegasus0.1
1
10
100
1k
Highcharts.com
Figure 16: The time taken to label the connected components of the medium graph (1M vertices and 8Mundirected edges) for each graph analysis package.
Graph Package
Tim
e (s
)
Page Rank - Medium (|V|: 1M, |E|: 8M)
6.76
45.8
181553
2880 3390 4130 4420
63900
stinger boost giraph mtgl networkx orientdb pegasus bagel titan1
10
100
1k
10k
100k
1M
Highcharts.com
Figure 17: The time taken to compute the PageRank of each vertex in the medium graph (1M vertices and8M undirected edges) for each graph analysis package.
14
Graph Package
Edge
s pe
r Sec
ond
Update Rate - Medium (|V|: 1M, |E|: 8M)
1310 1330
33700 34300
216000
696000 715000 890000
3670000
orientdb titan bagel pegasus networkx boost giraph mtgl-insertonly
stinger100
1k
10k
100k
1M
10M
Highcharts.com
Figure 18: The update rate in edges inserted/deleted per second when applying 100,000 edge updates to themedium graph (1M vertices and 8M undirected edges) for each graph analysis package.
Graph Package
Mem
ory
Usag
e (K
B)
Memory Usage - Medium (|V|: 1M, |E|: 8M)
1660000 1870000 1950000 2030000
8250000
2350000028100000
giraph-NA pegasus-NA
stinger boost bagel mtgl titan networkx orientdb1M
2M
4M
10M
20M
40M
Highcharts.com
Figure 19: The memory occupied by the program after performing all operations on the medium graph (1Mvertices and 8M undirected edges) for each graph analysis package.
15
B.4 Large Graph
Graph Package
Tim
e (s
)Initial Graph Construction - Large (|V|: 16M, |E|: 128M)
3.71
20.4
129
248
pegasus-NA stinger giraph boost mtgl2
4
10
20
40
100
200
400
Highcharts.com
Figure 20: The time taken to construct the large graph (16M vertices and 128M undirected edges) for eachgraph analysis package.
Graph Package
Tim
e (s
)
Single Source Shortest Path - Large (|V|: 16M, |E|: 128M)
7.3511.4
29.6
127
826
stinger boost giraph mtgl pegasus
10
100
1
1k
10k
Highcharts.com
Figure 21: The time taken to compute SSSP from vertex 0 in the large graph (16M vertices and 128Mundirected edges) for each graph analysis package.
16
Graph Package
Tim
e (s
)
Connected Components - Large (|V|: 16M, |E|: 128M)
5.52
16.2
47.678.2
2390
stinger mtgl giraph boost pegasus1
10
100
1k
10k
Highcharts.com
Figure 22: The time taken to label the connected components of the large graph (16M vertices and 128Mundirected edges) for each graph analysis package.
Graph Package
Tim
e (s
)
Page Rank - Large (|V|: 16M, |E|: 128M)
161
989
2950
4680
14200
stinger boost giraph pegasus mtgl100
200
400
1k
2k
4k
10k
20k
Highcharts.com
Figure 23: The time taken to compute the PageRank of each vertex in the large graph (16M vertices and128M undirected edges) for each graph analysis package.
17
Graph Package
Edge
s pe
r Sec
ond
Update Rate - Large (|V|: 16M, |E|: 128M)
11200
246000 280000
865000
2530000
pegasus mtgl-insertonly giraph boost stinger
10k
100k
1k
1M
10M
Highcharts.com
Figure 24: The update rate in edges inserted/deleted per second when applying 100,000 edge updates to thelarge graph (16M vertices and 128M undirected edges) for each graph analysis package.
Graph Package
Mem
ory
Usag
e (K
B)
Memory Usage - Large (|V|: 16M, |E|: 128M)
25900000
29000000
32600000
giraph-NA pegasus-NA stinger boost mtgl
26M
28M
30M
32M
24M
34M
Highcharts.com
Figure 25: The memory occupied by the program after performing all operations on the large graph (16Mvertices and 128M undirected edges) for each graph analysis package.
18